What is .. ?..........!: Screen Scraping

As soon as I saw this term in the training agenda sent by my educator, I felt that it had something to do with "discarding a part of the screen as useless", because I thought the word scrap meant waste material.
In a way I was right: scrap does mean waste. But when I started studying the concept, I found that the word is scraping, which comes from scrape, meaning "to remove from something".
Then I understood that this topic deals with "EXTRACTING INFORMATION FROM THE SCREEN".

I took the website http://money.livemint.com and tried to extract the EPS (earnings per share) field from one of its company pages.
I wanted to learn Python, so I thought of doing this task in Python itself.

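    import urllib
    import re

    def eps():
        # Python 2: urlopen fetches the page, read() returns its HTML as a string
        base_url = 'http://money.livemint.com/IID42/F132540/QuickQuote/Company.aspx'
        content = urllib.urlopen(base_url).read()
        # Step 1: find the chunk of HTML that starts at the "EPS (Rs.)" label
        me = re.search(r'EPS\s*\(Rs\.\)<.*?><.*?>\s*<.*>\s*\d*\.\d*\s*<.*>', content)
        if me is None:
            # Label not found (page changed or download failed)
            return 'no match available : '
        # Step 2: pull the first decimal number out of that chunk
        ma = re.search(r'\d+\.\d+', me.group())
        if ma:
            epse = ma.group()
        else:
            epse = 'no match available : '
        return epse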

This does the screen scraping.
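
To see what the two regular expressions are doing without depending on the live page, here is a small sketch in Python 3 that runs the same two-step search on a made-up HTML fragment. The markup in SAMPLE_HTML and the function name extract_eps are my assumptions for illustration only; the real livemint page may well be laid out differently.

```python
import re

# Hypothetical fragment for illustration; the real page's markup is not shown in this post.
SAMPLE_HTML = '<td>EPS (Rs.)</td><td> <span> 35.28 </span></td>'

# Same two patterns as in the function above.
EPS_BLOCK = re.compile(r'EPS\s*\(Rs\.\)<.*?><.*?>\s*<.*>\s*\d*\.\d*\s*<.*>')
NUMBER = re.compile(r'\d+\.\d+')

def extract_eps(html):
    """Return the first decimal number after the 'EPS (Rs.)' label, or None."""
    block = EPS_BLOCK.search(html)      # step 1: label plus the tags around the value
    if block is None:
        return None
    number = NUMBER.search(block.group())  # step 2: the number inside that chunk
    return number.group() if number else None

print(extract_eps(SAMPLE_HTML))                          # 35.28
print(extract_eps('<td>Book value</td><td>12.5</td>'))   # None
```

The first pattern only locates the neighbourhood of the label; the second one is what actually picks out the value. Because the first pattern is tied to the exact sequence of tags, it is likely to stop matching if the page layout changes.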

I wanted to give some color and fragrance to this code so that it makes sense (the deep-seated hidden motive is to write the code in such a way that the user of our website does not even suspect that the data has been scraped from somewhere).

    import urllib
    import re

    # Company name -> page code used in the livemint URL
    COMPANY_CODES = {
        "TCS": "F132540",
        "INFOSYS": "F100209",
        "WIPRO": "F107685",
        "HCL": "F132281",
    }

    def get_eps():
        baseone_url = 'http://money.livemint.com/IID42/'
        basethree_url = '/QuickQuote/Company.aspx'
        # raw_input returns the typed text as a string;
        # Python 2's input() would try to evaluate it as an expression
        symbol = raw_input('Enter the company name (one of TCS, INFOSYS, HCL, WIPRO): ')
        code = COMPANY_CODES.get(symbol)
        if code is None:
            print "Enter a valid company name"
            return None
        content = urllib.urlopen(baseone_url + code + basethree_url).read()

        # Locate the chunk of HTML around the "EPS (Rs.)" label, then pull the number out of it
        me = re.search(r'EPS\s*\(Rs\.\)<.*?><.*?>\s*<.*>\s*\d*\.\d*\s*<.*>', content)
        ma = re.search(r'\d+\.\d+', me.group()) if me else None
        if ma:
            epse = ma.group()
            print 'EPS is ' + epse
        else:
            epse = 'no match available : '
        return epse

This code takes a company name from the user, fetches the page that belongs to that company, and displays the scraped data.
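
The only thing that changes from one company to another is the code in the middle of the URL. Here is a small Python 3 sketch of that URL-building step on its own, together with a download helper. The codes and URL pieces are copied from the code above; the names COMPANY_CODES, company_url and fetch_page are hypothetical, and I am assuming the pages are UTF-8 text. I have not checked that these URLs still serve the same pages.

```python
import urllib.request

# Codes and URL pieces are copied from the post; the helper names are hypothetical.
COMPANY_CODES = {
    'TCS': 'F132540',
    'INFOSYS': 'F100209',
    'WIPRO': 'F107685',
    'HCL': 'F132281',
}
BASE_URL = 'http://money.livemint.com/IID42/'
PAGE_PATH = '/QuickQuote/Company.aspx'

def company_url(symbol):
    """Build the quote-page URL for a company name, or return None if it is unknown.

    Unlike the original code, this ignores case and surrounding spaces.
    """
    code = COMPANY_CODES.get(symbol.strip().upper())
    if code is None:
        return None
    return BASE_URL + code + PAGE_PATH

def fetch_page(url, timeout=10):
    """Download a page and return its HTML as text (needs network access; not called below)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced rather than raising.
        return response.read().decode('utf-8', errors='replace')

print(company_url('tcs'))
print(company_url('ACME'))
```

Keeping the URL building separate from the download means the lookup can be checked without touching the network, and adding a fifth company is one more dictionary entry.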

For your info: web scraping from a multitude of sites is known as WEB HARVESTING.

Thank you,

Meet you in the next post,

Harish Kayarohanam


Wednesday 4 January 2012
