Wednesday 4 January 2012

Screen Scraping

As soon as I saw this term in the training agenda sent by my educator, I felt that it had something to do with "discarding a part of the screen as useless", as I thought it came from the word scrap, which means waste material.
Yes, I was right in a way that scrap means waste. But when I started studying the concept, I found that the word is scraping, which comes from scrape, meaning "to remove from something".
Then I understood that this topic deals with "EXTRACTING INFORMATION FROM THE SCREEN".

I took the website http://money.livemint.com and tried to extract the EPS (earnings per share) field from a company page.
I wanted to learn Python, so I thought of doing this task in Python itself.


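import urllib
import re

def eps():
    # F132540 is the company code for TCS on this site.
    base_url = 'http://money.livemint.com/IID42/F132540/QuickQuote/Company.aspx'
    # Download the raw HTML of the company page (Python 2 urllib).
    content = urllib.urlopen(base_url).read()
    # Step 1: find the "EPS (Rs.)" label together with the table cell after it.
    me = re.search(r'EPS\s*\(Rs\.\)<.*?><.*?>\s*<.*>\s*\d*\.\d*\s*<.*>', content)
    if me is None:
        # The label was not found, so there is nothing to extract.
        return 'no match available'
    # Step 2: pull the decimal number out of the matched fragment.
    ma = re.search(r'\d+\.\d+', me.group())
    if ma:
        return ma.group()
    return 'no match available'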
   

This does the screen scraping: it downloads the page, finds the "EPS (Rs.)" label, and pulls out the number that follows it.
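To try the matching idea without touching the live site, here is a minimal sketch that runs the extraction against a made-up HTML fragment. The markup in SAMPLE_HTML, the value 55.95, and the names EPS_PATTERN and extract_eps are my own assumptions for illustration; the real page may be laid out differently. The sketch also uses one pattern with a capture group instead of the two searches above, which is just one possible way to do it.

```python
import re

# Hypothetical markup: the real page's HTML is not shown in this post,
# so this fragment and the value in it are invented for the example.
SAMPLE_HTML = """
<tr>
  <td><b>EPS (Rs.)</b></td>
  <td>55.95</td>
</tr>
"""

# Match the label, skip any tags and whitespace that follow it,
# then capture the first decimal number.
EPS_PATTERN = re.compile(r'EPS\s*\(Rs\.\)(?:\s*<[^>]*>)*\s*(\d+\.\d+)')


def extract_eps(html):
    """Return the EPS value as a string, or None if the label is absent."""
    match = EPS_PATTERN.search(html)
    return match.group(1) if match else None


eps_value = extract_eps(SAMPLE_HTML)
no_value = extract_eps('<p>no figures here</p>')
```

Returning None when nothing matches lets the caller decide what to show, instead of mixing an error message in with real values.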

I wanted to give this code some color and fragrance so that it makes sense to a user (the deep-seated hidden motive is to shape the code so that users of our website never even suspect that the data has been scraped from somewhere).

import urllib
import re

# Company codes used in the page URLs on this site.
COMPANY_CODES = {
    'TCS': 'F132540',
    'INFOSYS': 'F100209',
    'WIPRO': 'F107685',
    'HCL': 'F132281',
}


def get_eps():
    baseone_url = 'http://money.livemint.com/IID42/'
    basethree_url = '/QuickQuote/Company.aspx'
    # raw_input returns the typed text as a string; in Python 2, input()
    # would try to evaluate it as an expression and fail on a bare name.
    symbol = raw_input('Enter the company name (one of TCS, INFOSYS, HCL, WIPRO): ')
    code = COMPANY_CODES.get(symbol.strip().upper())
    if code is None:
        print "Enter a valid company name"
        return None

    # Download the page that belongs to the chosen company.
    content = urllib.urlopen(baseone_url + code + basethree_url).read()

    # Step 1: find the "EPS (Rs.)" label together with the table cell after it.
    me = re.search(r'EPS\s*\(Rs\.\)<.*?><.*?>\s*<.*>\s*\d*\.\d*\s*<.*>', content)
    # Step 2: pull the decimal number out of the matched fragment, if any.
    ma = re.search(r'\d+\.\d+', me.group()) if me else None
    if ma is None:
        print 'no match available'
        return None

    epse = ma.group()
    print 'EPS is ' + epse
    return epse
 
This code takes the company name from the user, fetches the page that belongs to that company, and displays the scraped data.

For your info: web scraping from a multitude of sites is known as WEB HARVESTING.
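As a rough sketch of that idea, the same extraction step can be looped over several sources, each with its own pattern. Everything here is hypothetical: the harvest and fake_fetch names, the example URLs, the markup, and the numbers are invented so that the code runs offline. A real version would pass in a function that actually downloads the page.

```python
import re


def harvest(sources, fetch):
    """Apply one regex per source and collect the first captured value.

    sources: dict mapping a label to a (url, pattern) pair
    fetch:   callable that takes a URL and returns the page text
    """
    results = {}
    for label, (url, pattern) in sources.items():
        try:
            page = fetch(url)
        except IOError:
            # Page could not be fetched: record the gap and move on.
            results[label] = None
            continue
        match = re.search(pattern, page)
        results[label] = match.group(1) if match else None
    return results


# Stand-in for real HTTP access; URLs and markup are invented.
FAKE_PAGES = {
    'http://site-a.example/tcs': '<td>EPS (Rs.)</td><td>55.95</td>',
    'http://site-b.example/tcs': '<span class="pe">P/E: 21.4</span>',
}


def fake_fetch(url):
    if url not in FAKE_PAGES:
        raise IOError('page not available: ' + url)
    return FAKE_PAGES[url]


SOURCES = {
    'eps': ('http://site-a.example/tcs', r'EPS \(Rs\.\)</td><td>(\d+\.\d+)'),
    'pe': ('http://site-b.example/tcs', r'P/E:\s*(\d+\.\d+)'),
    'missing': ('http://site-c.example/tcs', r'(\d+)'),
}

harvested = harvest(SOURCES, fake_fetch)
```

Passing fetch in as a parameter keeps the looping logic separate from the network code, so one unreachable site does not stop the rest of the harvest.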

Thank you,

See you in the next post,

Harish Kayarohanam
