Advanced Web Clients

Web browsers are basic Web clients. They are used primarily for searching and downloading documents from the Web. Advanced Web clients are those applications that do more than download single documents from the Internet. One example of an advanced Web client is a crawler (a.k.a. spider, robot). Crawlers are programs that explore and download pages from the Internet for a variety of reasons, such as indexing pages for a large search engine, browsing them offline, or archiving them locally.
The crawler we present below, crawl.py, takes a starting Web address (URL), downloads that page, and then downloads the pages reached by the links it finds, continuing with each succeeding page, but only for pages in the same domain as the starting page. Without such a limitation, you will run out of disk space! The source for crawl.py follows.

Example 19.1. An Advanced Web Client: a Web Crawler (crawl.py)

The crawler consists of two classes, one to manage the entire crawling process (Crawler), and one to retrieve and parse each downloaded Web page (Retriever).

  1  #!/usr/bin/env python
  2
  3  from sys import argv
  4  from os import makedirs, unlink
  5  from os.path import dirname, exists, isdir, splitext
  6  from string import replace, find, lower
  7  from htmllib import HTMLParser
  8  from urllib import urlretrieve
  9  from urlparse import urlparse, urljoin
 10  from formatter import DumbWriter, AbstractFormatter
 11  from cStringIO import StringIO
 12
 13  class Retriever:                     # download Web pages
 14
 15      def __init__(self, url):
 16          self.url = url
 17          self.file = self.filename(url)
 18
 19      def filename(self, url, deffile='index.htm'):
 20          parsedurl = urlparse(url, 'http:', 0)  # parse path
 21          path = parsedurl[1] + parsedurl[2]
 22          ext = splitext(path)
 23          if ext[1] == '':             # no file, use default
 24              if path[-1] == '/':
 25                  path = path + deffile
 26              else:
 27                  path = path + '/' + deffile
 28          dir = dirname(path)
 29          if not isdir(dir):           # create archive dir if nec.
 30              if exists(dir): unlink(dir)
 31              makedirs(dir)
 32          return path
 33
 34      def download(self):              # download Web page
 35          try:
 36              retval = urlretrieve(self.url, self.file)
 37          except IOError:
 38              retval = ('*** ERROR: invalid URL "%s"' % \
 39                  self.url,)
 40          return retval
 41
 42      def parseAndGetLinks(self):      # parse HTML, save links
 43          self.parser = HTMLParser(AbstractFormatter( \
 44              DumbWriter(StringIO())))
 45          self.parser.feed(open(self.file).read())
 46          self.parser.close()
 47          return self.parser.anchorlist
 48
 49  class Crawler:                       # manage entire crawling process
 50
 51      count = 0                        # static downloaded page counter
 52
 53      def __init__(self, url):
 54          self.q = [url]
 55          self.seen = []
 56          self.dom = urlparse(url)[1]
 57
 58      def getPage(self, url):
 59          r = Retriever(url)
 60          retval = r.download()
 61          if retval[0] == '*':         # error situation, do not parse
 62              print retval, ' skipping parse'
 63              return
 64          Crawler.count = Crawler.count + 1
 65          print '\n(', Crawler.count, ')'
 66          print 'URL:', url
 67          print 'FILE:', retval[0]
 68          self.seen.append(url)
 69
 70          links = r.parseAndGetLinks() # get and process links
 71          for eachLink in links:
 72              if eachLink[:4] != 'http' and \
 73                      find(eachLink, '://') == -1:
 74                  eachLink = urljoin(url, eachLink)
 75              print '* ', eachLink,
 76
 77              if find(lower(eachLink), 'mailto:') != -1:
 78                  print ' discarded, mailto link'
 79                  continue
 80
 81              if eachLink not in self.seen:
 82                  if find(eachLink, self.dom) == -1:
 83                      print ' discarded, not in domain'
 84                  else:
 85                      if eachLink not in self.q:
 86                          self.q.append(eachLink)
 87                          print ' new, added to Q'
 88                      else:
 89                          print ' discarded, already in Q'
 90              else:
 91                  print ' discarded, already processed'
 92
 93      def go(self):                    # process links in queue
 94          while self.q:
 95              url = self.q.pop()
 96              self.getPage(url)
 97
 98  def main():
 99      if len(argv) > 1:
100          url = argv[1]
101      else:
102          try:
103              url = raw_input('Enter starting URL: ')
104          except (KeyboardInterrupt, EOFError):
105              url = ''
106
107      if not url: return
108      robot = Crawler(url)
109      robot.go()
110
111  if __name__ == '__main__':
112      main()
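To get a feel for the Retriever class before the detailed walkthrough, the short sketch below reproduces the URL-to-file-name mapping performed by filename() (lines 19-32). It is not part of crawl.py: the directory-creation step is omitted, the helper name local_name() is our own, and the second URL (with a trailing slash) is made up to show the default file name behavior. It runs under Python 2, like the listing.

from urlparse import urlparse
from os.path import splitext

def local_name(url, deffile='index.htm'):
    # same mapping as Retriever.filename(), minus makedirs()
    parsedurl = urlparse(url, 'http:', 0)
    path = parsedurl[1] + parsedurl[2]        # drop the 'http://' prefix
    if splitext(path)[1] == '':               # no file name, use the default
        if path[-1] == '/':
            path = path + deffile
        else:
            path = path + '/' + deffile
    return path

print local_name('http://www.null.com/home/index.html')  # www.null.com/home/index.html
print local_name('http://www.null.com/home/')            # www.null.com/home/index.htm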
Line-by-line (class-by-class) explanation:

Lines 1-11

The top part of the script consists of the standard Python Unix start-up line and the imports of the various module attributes employed in this application.

Lines 13-47

The Retriever class has the responsibility of downloading pages from the Web and parsing the links located within each document, adding them to the "to-do" queue if necessary. A Retriever instance object is created for each page downloaded from the Net. Retriever consists of several methods to aid in its functionality: a constructor (__init__()), filename(), download(), and parseAndGetLinks().

The filename() method takes the given URL and comes up with a safe and sane corresponding file name to store locally. Basically, it removes the "http://" prefix from the URL and uses the remaining part as the file name, creating any directory paths necessary. URLs without trailing file names are given a default file name of "index.htm" (this name can be overridden in the call to filename()). The constructor instantiates a Retriever object and stores both the URL string and the corresponding file name returned by filename() as local attributes.

The download() method, as you may imagine, actually goes out to the Net to download the page with the given link. It calls urllib.urlretrieve() with the URL and saves the page to the file name returned by filename(). If the download succeeds, urlretrieve()'s return value (whose first element is that local file name) is passed back; otherwise a tuple containing an error string is returned. If the Crawler determines that no error occurred, it invokes the parseAndGetLinks() method to parse the newly downloaded page and determine the course of action for each link located on that page.

Lines 49-96

The Crawler class is the "star" of the show, managing the entire crawling process, so only one instance is created for each invocation of our script. The Crawler consists of three items stored by the constructor during the instantiation phase, the first of which is q, a queue of links to download. This list fluctuates during execution, shrinking as each page is processed and growing as new links are discovered within each downloaded page. The other two data values are seen, a list of all the links we "have seen" (downloaded) already, and dom, the domain name of the starting link, which is used to determine whether any succeeding links are part of the same domain.

Crawler also has a static data item named count. The purpose of this counter is simply to keep track of the number of pages we have downloaded from the Net; it is incremented for every page successfully downloaded.

In addition to its constructor, Crawler has a pair of methods, getPage() and go(). go() is simply the method used to start the Crawler and is called from the main body of code; it consists of a loop that continues to execute as long as there are links in the queue that still need to be downloaded. The workhorse of this class, though, is the getPage() method.
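Before looking at getPage() in detail, the short sketch below shows how dom is derived from the starting URL and how a candidate link is tested against it. It is separate from crawl.py but borrows the same urlparse() and find() calls from the listing, and reuses URLs that appear in the sample run shown further below.

from urlparse import urlparse
from string import find

start = 'http://www.null.com/home/index.html'
dom = urlparse(start)[1]                      # network location: 'www.null.com'

for link in ('http://www.null.com/home/order.html',  # same domain
             'http://bogus.com/index.html'):         # off-site link
    if find(link, dom) == -1:                 # substring test, as in getPage()
        print link, ' discarded, not in domain'
    else:
        print link, ' kept, same domain'

Note that this is a plain substring test rather than an exact host-name comparison, so any URL containing the domain string anywhere in it would also be kept; the listing accepts that simplification.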
getPage() instantiates a Retriever object with the first link and lets it go off to the races. If the page was downloaded successfully, the counter is incremented and the link is added to the "already seen" list. getPage() then examines every link found inside the downloaded page and determines whether any of them should be added to the queue; the main loop in go() continues to process links in this way until the queue is empty, at which time victory is declared. Links that are part of another domain, have already been downloaded, are already in the queue waiting to be processed, or are "mailto:" links are discarded and not added to the queue.

Lines 98-112

main() is executed if this script is invoked directly and is the starting point of execution. Other modules that import crawl.py will need to invoke main() to begin processing. main() needs a URL to begin processing: if one is given on the command line (for example, when the script is invoked directly), it is used as given; otherwise, the script enters interactive mode and prompts the user for a starting URL. With a starting link in hand, the Crawler is instantiated and away we go.

One sample invocation of crawl.py may look like this:

% crawl.py
Enter starting URL: http://www.null.com/home/index.html

( 1 )
URL: http://www.null.com/home/index.html
FILE: www.null.com/home/index.html
*  http://www.null.com/home/overview.html  new, added to Q
*  http://www.null.com/home/synopsis.html  new, added to Q
*  http://www.null.com/home/order.html  new, added to Q
*  mailto:postmaster@null.com  discarded, mailto link
*  http://www.null.com/home/overview.html  discarded, already in Q
*  http://www.null.com/home/synopsis.html  discarded, already in Q
*  http://www.null.com/home/order.html  discarded, already in Q
*  mailto:postmaster@null.com  discarded, mailto link
*  http://bogus.com/index.html  discarded, not in domain

( 2 )
URL: http://www.null.com/home/order.html
FILE: www.null.com/home/order.html
*  mailto:postmaster@null.com  discarded, mailto link
*  http://www.null.com/home/index.html  discarded, already processed
*  http://www.null.com/home/synopsis.html  discarded, already in Q
*  http://www.null.com/home/overview.html  discarded, already in Q

( 3 )
URL: http://www.null.com/home/synopsis.html
FILE: www.null.com/home/synopsis.html
*  http://www.null.com/home/index.html  discarded, already processed
*  http://www.null.com/home/order.html  discarded, already processed
*  http://www.null.com/home/overview.html  discarded, already in Q

( 4 )
URL: http://www.null.com/home/overview.html
FILE: www.null.com/home/overview.html
*  http://www.null.com/home/synopsis.html  discarded, already processed
*  http://www.null.com/home/index.html  discarded, already processed
*  http://www.null.com/home/synopsis.html  discarded, already processed
*  http://www.null.com/home/order.html  discarded, already processed

After execution, a www.null.com directory would be created in the local file system, with a home subdirectory. All the HTML files processed will be found there.
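As noted above, another module can import crawl.py and call main(), which prompts for a URL when none is supplied. You can also drive the Crawler class directly; the minimal sketch below (assuming crawl.py is on your module search path, and reusing the starting URL from the sample run) shows that approach:

import crawl                                    # the crawler module shown above

start = 'http://www.null.com/home/index.html'   # starting URL from the sample run
robot = crawl.Crawler(start)                    # one Crawler per crawl
robot.go()                                      # process links until the queue is empty
print 'pages downloaded:', crawl.Crawler.count  # static counter of downloaded pages

Driving Crawler directly this way skips the interactive prompt in main(), which is convenient when the starting URL comes from another part of your program.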