
Advanced Web Clients

Web browsers are basic Web clients. They are used primarily for searching and downloading documents from the Web. Advanced clients of the Web are those applications which do more than download single documents from the Internet.

One example of an advanced Web client is a crawler (a.k.a. spider, robot). These are programs which explore and download pages from the Internet for different reasons, some of which include:

  • Indexing or cataloging into a large search engine such as Google, AltaVista, or Yahoo!,

  • Offline browsing—downloading documents onto a local hard disk and rearranging hyperlinks to create almost a mirror image for local browsing,

  • Downloading and storing for historical or archival purposes, or

  • Web page caching to save superfluous downloading time on Web site revisits.

The crawler we present below, crawl.py, takes a starting Web address (URL), downloads that page, and then downloads every page linked from it (and from each succeeding page), but only those pages which are in the same domain as the starting page. Without such a limitation, you would run out of disk space! The source for crawl.py follows:

Example 19.1. An Advanced Web Client: a Web Crawler (crawl.py)

The crawler consists of two classes, one to manage the entire crawling process (Crawler), and one to retrieve and parse each downloaded Web page (Retriever).

  1  #!/usr/bin/env python
  2
  3  from sys import argv
  4  from os import makedirs, unlink
  5  from os.path import dirname, exists, isdir, splitext
  6  from string import replace, find, lower
  7  from htmllib import HTMLParser
  8  from urllib import urlretrieve
  9  from urlparse import urlparse, urljoin
 10  from formatter import DumbWriter, AbstractFormatter
 11  from cStringIO import StringIO
 12
 13  class Retriever:                # download Web pages
 14
 15      def __init__(self, url):
 16          self.url = url
 17          self.file = self.filename(url)
 18
 19      def filename(self, url, deffile='index.htm'):
 20          parsedurl = urlparse(url, 'http:', 0)  # parse path
 21          path = parsedurl[1] + parsedurl[2]
 22          ext = splitext(path)
 23          if ext[1] == '':        # no file, use default
 24              if path[-1] == '/':
 25                  path = path + deffile
 26              else:
 27                  path = path + '/' + deffile
 28          dir = dirname(path)
 29          if not isdir(dir):      # create archive dir if nec.
 30              if exists(dir): unlink(dir)
 31              makedirs(dir)
 32          return path
 33
 34      def download(self):         # download Web page
 35          try:
 36              retval = urlretrieve(self.url, self.file)
 37          except IOError:
 38              retval = ('*** ERROR: invalid URL "%s"' %\
 39                  self.url,)
 40          return retval
 41
 42      def parseAndGetLinks(self): # parse HTML, save links
 43          self.parser = HTMLParser(AbstractFormatter(\
 44              DumbWriter(StringIO())))
 45          self.parser.feed(open(self.file).read())
 46          self.parser.close()
 47          return self.parser.anchorlist
 48
 49  class Crawler:                  # manage entire crawling process
 50
 51      count = 0                   # static downloaded page counter
 52
 53      def __init__(self, url):
 54          self.q = [url]
 55          self.seen = []
 56          self.dom = urlparse(url)[1]
 57
 58      def getPage(self, url):
 59          r = Retriever(url)
 60          retval = r.download()
 61          if retval[0][0] == '*': # error situation, do not parse
 62              print retval, '... skipping parse'
 63              return
 64          Crawler.count = Crawler.count + 1
 65          print '\n(', Crawler.count, ')'
 66          print 'URL:', url
 67          print 'FILE:', retval[0]
 68          self.seen.append(url)
 69
 70          links = r.parseAndGetLinks()  # get and process links
 71          for eachLink in links:
 72              if eachLink[:4] != 'http' and \
 73                      find(eachLink, '://') == -1:
 74                  eachLink = urljoin(url, eachLink)
 75              print '* ', eachLink,
 76
 77              if find(lower(eachLink), 'mailto:') != -1:
 78                  print '... discarded, mailto link'
 79                  continue
 80
 81              if eachLink not in self.seen:
 82                  if find(eachLink, self.dom) == -1:
 83                      print '... discarded, not in domain'
 84                  else:
 85                      if eachLink not in self.q:
 86                          self.q.append(eachLink)
 87                          print '... new, added to Q'
 88                      else:
 89                          print '... discarded, already in Q'
 90              else:
 91                  print '... discarded, already processed'
 92
 93      def go(self):               # process links in queue
 94          while self.q:
 95              url = self.q.pop()
 96              self.getPage(url)
 97
 98  def main():
 99      if len(argv) > 1:
100          url = argv[1]
101      else:
102          try:
103              url = raw_input('Enter starting URL: ')
104          except (KeyboardInterrupt, EOFError):
105              url = ''
106
107      if not url: return
108      robot = Crawler(url)
109      robot.go()
110
111  if __name__ == '__main__':
112      main()

Line-by-line (Class-by-class) explanation:
Lines 1 – 11

The top part of the script consists of the standard Python Unix start-up line and the importation of various module attributes which are employed in this application.

Lines 13 – 47

The Retriever class has the responsibility of downloading pages from the Web and parsing out the links located within each document so that the Crawler can add them to the "to-do" queue if necessary. A Retriever instance object is created for each page which is downloaded from the net. Retriever consists of several methods to aid in its functionality: a constructor (__init__()), filename(), download(), and parseAndGetLinks().

The filename() method takes the given URL and comes up with a safe and sane corresponding file name to store locally. Basically, it removes the "http://" prefix from the URL and uses the remaining part as the file name, creating any directory paths necessary. URLs without trailing file names will be given a default file name of "index.htm." (This name can be overridden in the call to filename()).
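
For instance, here is roughly the computation filename() performs for a URL that ends in a directory rather than a file name, traced by hand in the interactive interpreter. This is only a sketch of the individual steps (using an address in the same www.null.com family as the sample run at the end of this section), not a call into crawl.py itself:

>>> from urlparse import urlparse
>>> from os.path import splitext
>>> parsedurl = urlparse('http://www.null.com/home/', 'http:', 0)
>>> path = parsedurl[1] + parsedurl[2]      # network location + path
>>> path
'www.null.com/home/'
>>> splitext(path)[1]                       # no file extension...
''
>>> path + 'index.htm'                      # ...so append the default name
'www.null.com/home/index.htm'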

The constructor instantiates a Retriever object and stores both the URL string and the corresponding file name returned by filename() as local attributes.

The download() method, as you may imagine, actually goes out to the net to download the page with the given link. It calls urllib.urlretrieve() with the URL, saving the page under the local file name returned by filename(). If the download succeeds, the return value of urlretrieve() is passed back; otherwise, a tuple containing an error string is returned. Parsing of the downloaded page is left to parseAndGetLinks(), which the Crawler invokes separately.
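
Here is a sketch of what urllib.urlretrieve() does on its own, outside of the Retriever class. The URL and local file name are the hypothetical ones from the sample run; note that urlretrieve() expects the target directory to exist already, which is why filename() takes care of makedirs() beforehand:

from urllib import urlretrieve

try:
    # on success, urlretrieve() returns a (filename, headers) tuple
    retval = urlretrieve('http://www.null.com/home/index.html',
                         'www.null.com/home/index.html')
    print 'saved as:', retval[0]
except IOError:
    # a bad host name or unreachable server raises IOError
    print '*** ERROR: download failed'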

If the Crawler determines that no error has occurred, it invokes the parseAndGetLinks() method to parse the newly downloaded page and determine the course of action for each link located on that page.
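
The parsing technique is worth seeing in isolation. htmllib.HTMLParser does the actual work, but it must be given a formatter to send rendered text to; since we only care about the anchors, an AbstractFormatter writing through a DumbWriter into a throwaway StringIO object does the job, and the HREF values accumulate in the parser's anchorlist attribute. A small standalone sketch (the HTML string here is made up):

from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter
from cStringIO import StringIO

html = '<HTML><BODY><A HREF="overview.html">Overview</A>' \
       '<A HREF="mailto:postmaster@null.com">Mail us</A></BODY></HTML>'

parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
parser.feed(html)               # parse the document text
parser.close()
print parser.anchorlist         # ['overview.html', 'mailto:postmaster@null.com']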

Lines 49 – 96

The Crawler class is the "star" of the show, managing the entire crawling process, so only one instance is created for each invocation of our script. The constructor stores three pieces of data during the instantiation phase, the first of which is q, a queue of links to download. This list will fluctuate during execution, shrinking as each page is processed and growing as new links are discovered within each downloaded page.

The other two data values for the Crawler are seen, a list of all the links that we have already "seen" (downloaded), and dom, the domain name of the starting link, which is used to determine whether succeeding links are part of the same domain.
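
Extracting the domain is a one-liner: element 1 of the tuple returned by urlparse() is the network location portion of the URL, as in this sketch:

>>> from urlparse import urlparse
>>> urlparse('http://www.null.com/home/index.html')[1]
'www.null.com'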

Crawler also has a static data item named count. The purpose of this counter is simply to keep track of the number of pages we have downloaded from the net. It is incremented for every page successfully downloaded.
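
If class ("static") data is new to you, the following throwaway example, which is not part of crawl.py, shows the key property: there is a single count that belongs to the class rather than one per instance:

>>> class Tally:            # hypothetical class for illustration only
...     count = 0           # class attribute, shared by all instances
...
>>> a, b = Tally(), Tally()
>>> Tally.count = Tally.count + 1
>>> a.count, b.count        # both instances see the class-level value
(1, 1)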

Crawler has a pair of other methods in addition to its constructor, getPage() and go(). go() is simply the method that is used to start the Crawler and is called from the main body of code. go() consists of a loop that will continue to execute as long as there are new links in the queue which need to be downloaded. The workhorse of this class, though, is the getPage() method.

getPage() instantiates a Retriever object with the first link and lets it go off to the races. If the page was downloaded successfully, the counter is incremented and the link added to the "already seen" list. getPage() then examines every link found inside the downloaded page and determines whether any of them should be added to the queue. The main loop in go() will continue to process links until the queue is empty, at which time victory is declared.
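
One detail worth highlighting from getPage() (lines 72–74) is how relative links are handled: any link without an "http" prefix or a "://" in it is converted to a full URL with urlparse.urljoin() before being considered, for example:

>>> from urlparse import urljoin
>>> urljoin('http://www.null.com/home/index.html', 'overview.html')
'http://www.null.com/home/overview.html'
>>> urljoin('http://www.null.com/home/index.html', '/order.html')
'http://www.null.com/order.html'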

Links which are part of another domain, which have already been downloaded, which are already in the queue waiting to be processed, or which are "mailto:" links are ignored and not added to the queue.
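
For reference, here is that filtering logic pulled out into a standalone helper. This function is not part of crawl.py, just a sketch that returns the same verdicts getPage() prints:

from string import find, lower

def linkVerdict(link, dom, seen, q):    # hypothetical helper, mirrors getPage()
    if find(lower(link), 'mailto:') != -1:
        return 'discarded, mailto link'
    if link in seen:
        return 'discarded, already processed'
    if find(link, dom) == -1:
        return 'discarded, not in domain'
    if link in q:
        return 'discarded, already in Q'
    return 'new, added to Q'

print linkVerdict('http://bogus.com/index.html', 'www.null.com', [], [])
# prints: discarded, not in domain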

Lines 98 – 112

main() is executed if this script is invoked directly and is the starting point of execution; other modules which import crawl.py will need to invoke main() to begin processing. main() needs a URL to start with: if one is given on the command line (for example, when this script is invoked directly), it is used as-is; otherwise, the script enters interactive mode and prompts the user for a starting URL. With a starting link in hand, the Crawler is instantiated and away we go.
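
A minimal sketch of driving the crawler from another module (or from the interactive interpreter) rather than from the command line; the starting URL is the hypothetical one used in the sample run below:

import crawl

# either let main() prompt for (or read) the starting URL...
crawl.main()

# ...or build the Crawler directly with a starting URL of our own
robot = crawl.Crawler('http://www.null.com/home/index.html')
robot.go()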

One sample invocation of crawl.py may look like:

% crawl.py
Enter starting URL: http://www.null.com/home/index.html

( 1 )
URL: http://www.null.com/home/index.html
FILE: www.null.com/home/index.html
* http://www.null.com/home/overview.html ... new, added to Q
* http://www.null.com/home/synopsis.html ... new, added to Q
* http://www.null.com/home/order.html ... new, added to Q
* mailto:postmaster@null.com ... discarded, mailto link
* http://www.null.com/home/overview.html ... discarded, already in Q
* http://www.null.com/home/synopsis.html ... discarded, already in Q
* http://www.null.com/home/order.html ... discarded, already in Q
* mailto:postmaster@null.com ... discarded, mailto link
* http://bogus.com/index.html ... discarded, not in domain

( 2 )
URL: http://www.null.com/home/order.html
FILE: www.null.com/home/order.html
* mailto:postmaster@null.com ... discarded, mailto link
* http://www.null.com/home/index.html ... discarded, already processed
* http://www.null.com/home/synopsis.html ... discarded, already in Q
* http://www.null.com/home/overview.html ... discarded, already in Q

( 3 )
URL: http://www.null.com/home/synopsis.html
FILE: www.null.com/home/synopsis.html
* http://www.null.com/home/index.html ... discarded, already processed
* http://www.null.com/home/order.html ... discarded, already processed
* http://www.null.com/home/overview.html ... discarded, already in Q

( 4 )
URL: http://www.null.com/home/overview.html
FILE: www.null.com/home/overview.html
* http://www.null.com/home/synopsis.html ... discarded, already processed
* http://www.null.com/home/index.html ... discarded, already processed
* http://www.null.com/home/synopsis.html ... discarded, already processed
* http://www.null.com/home/order.html ... discarded, already processed

After execution, a www.null.com directory is created in the local file system, with a home subdirectory beneath it. All of the HTML files processed can be found within home.

