
How to build a simple web crawler

Posted Feb 16 2010 12:34 PM

If you're creating a search engine, you'll need a way to collect documents. In this excerpt from Toby Segaran's Programming Collective Intelligence, the author shows you how to set up a simple web crawler using existing tools.


I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler. It will be seeded with a small set of pages to index and will then follow any links on those pages to find other pages, whose links it will also follow. This process is called crawling or spidering.

To do this, your code will have to download the pages, pass them to the indexer (which you'll build in the next section), and then parse the pages to find all the links to the pages that have to be crawled next. Fortunately, there are a couple of libraries that can help with this process.

For the examples in this chapter, I have set up a copy of several thousand files from Wikipedia, which will remain static at http://kiwitobes.com/wiki.

You're free to run the crawler on any set of pages you like, but you can use this site if you want to compare your results to those in this chapter.

Using urllib2

urllib2 is a library bundled with Python that makes it easy to download pages—all you have to do is supply the URL. You'll use it in this section to download the pages that will be indexed. To see it in action, start up your Python interpreter and try this:

>>> import urllib2
>>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
>>> contents=c.read()
>>> print contents[0:50]
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans

All you have to do to store a page's HTML code into a string is create a connection and read its contents.
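On Python 3, urllib2 no longer exists; its functionality moved to urllib.request. The following is a minimal sketch of the same download step under that assumption. The helper name fetch_page and the choice to decode as UTF-8 with replacement characters are illustrative, not part of the original excerpt.

```python
from urllib.request import urlopen


def fetch_page(url, encoding="utf-8"):
    """Download a URL and return its body as text.

    fetch_page is a hypothetical helper. Decoding with errors="replace"
    is an assumption so that unexpected bytes do not raise an exception.
    """
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")
```

Calling fetch_page('http://kiwitobes.com/wiki/Programming_language.html') and slicing the result with [0:50] would mirror the interpreter session above.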

Crawler Code

The crawler will use the Beautiful Soup API, an excellent library that builds a structured representation of web pages. It is very tolerant of web pages with broken HTML, which is useful when constructing a crawler because you never know what pages you might come across.

Using urllib2 and Beautiful Soup you can build a crawler that will take a list of URLs to index and crawl their links to find other pages to index. First, add these import statements to the top of searchengine.py:

import urllib2
from BeautifulSoup import *
from urlparse import urljoin

# Create a list of words to ignore
ignorewords=set(['the','of','to','and','a','in','is','it'])

Now you can fill in the code for the crawler function. It won't actually save anything it crawls yet, but it will print the URLs as it goes so you can see that it's working. You need to put this at the end of the file (so it's part of the crawler class):

def crawl(self,pages,depth=2):
  # Each pass of this loop crawls one level of links
  for i in range(depth):
    newpages=set()
    for page in pages:
      try:
        c=urllib2.urlopen(page)
      except Exception:
        print "Could not open %s" % page
        continue
      soup=BeautifulSoup(c.read())
      self.addtoindex(page,soup)

      # Collect every link on the page
      links=soup('a')
      for link in links:
        if ('href' in dict(link.attrs)):
          url=urljoin(page,link['href'])
          if url.find("'")!=-1: continue
          url=url.split('#')[0] # remove location portion
          if url[0:4]=='http' and not self.isindexed(url):
            newpages.add(url)
          linkText=self.gettextonly(link)
          self.addlinkref(page,url,linkText)

      # Commit once per page, after all of its links are recorded
      self.dbcommit()

    # The pages found at this level are crawled on the next pass
    pages=newpages

This function loops through the list of pages, calling addtoindex on each one (right now this does nothing except print the URL, but you'll fill it in the next section). It then uses Beautiful Soup to get all the links on that page and adds their URLs to a set called newpages. At the end of the loop, newpages becomes pages, and the process repeats.
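The link-handling step described above can be sketched on its own. This version assumes Python 3 and swaps in the standard library's html.parser for Beautiful Soup so that it runs without extra packages. LinkCollector and extract_links are hypothetical names, and the filtering rules simply mirror those in crawl.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(page_url, html):
    """Return the absolute http(s) URLs linked from html, fragments removed."""
    collector = LinkCollector()
    collector.feed(html)
    urls = set()
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links against the page
        if "'" in url:                 # skip URLs containing quotes, as crawl does
            continue
        url = url.split("#")[0]        # remove location portion
        if url[0:4] == "http":         # ignore mailto:, javascript:, and so on
            urls.add(url)
    return urls
```

Returning a set means a page that links to the same target several times contributes that URL only once, which matches the role of newpages in crawl.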

This function could also be defined recursively, so that each link triggers another call, but a breadth-first search makes the code easier to modify later, either to keep crawling continuously or to save a list of unindexed pages for later crawling. It also avoids the risk of overflowing the stack.
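To make the level-by-level behavior concrete, here is a sketch that runs the same kind of loop over a small in-memory link graph instead of the network. It assumes Python 3. crawl_levels, LINKS, and the page names are invented for illustration, and a seen set stands in for isindexed.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "A"],
    "D": ["E"],
    "E": [],
}


def crawl_levels(seeds, depth=2):
    """Visit pages breadth-first, one level per pass, for depth passes.

    Returns a list holding the pages visited on each pass.
    """
    pages = list(seeds)
    seen = set(pages)  # stands in for isindexed
    levels = []
    for _ in range(depth):
        newpages = []
        for page in pages:
            for url in LINKS.get(page, []):
                if url not in seen:
                    seen.add(url)
                    newpages.append(url)
        levels.append(pages)
        pages = newpages  # the next level becomes the current one
    return levels
```

Because the pending pages live in an ordinary list rather than on the call stack, that list could be written out and picked up again later, which is the flexibility the breadth-first approach buys.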

You can test this function in the Python interpreter (there's no need to let it finish, so press Ctrl-C when you get bored):

>>> import searchengine
>>> pagelist=['http://kiwitobes.com/wiki/Perl.html']
>>> crawler=searchengine.crawler('')
>>> crawler.crawl(pagelist)
Indexing http://kiwitobes.com/wiki/Perl.html
Could not open http://kiwitobes.com...ramming%29.html
Indexing http://kiwitobes.com...ry_Project.html
Indexing http://kiwitobes.com...face.html

You may notice that some pages are repeated. There is a placeholder in the code for another function, isindexed, which will determine if a page has been indexed recently before adding it to newpages. This will let you run this function on any list of URLs at any time without worrying about doing unnecessary work.
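The crawl function relies on several methods of the crawler class that this excerpt does not show. As a rough sketch only, assuming Python 3 and keeping everything in memory, placeholder versions might look like the following. The method names come from the crawl code above; their bodies, the indexed set, and the links list are assumptions for illustration, not the book's implementation.

```python
class crawler:
    """Placeholder crawler methods (illustrative, in-memory only)."""

    def __init__(self, dbname):
        self.dbname = dbname
        self.indexed = set()  # URLs already passed to addtoindex (assumption)
        self.links = []       # (from_url, to_url, link_text) tuples (assumption)

    def dbcommit(self):
        pass  # nothing to commit without a database

    def addtoindex(self, url, soup):
        # Print the URL and remember it so isindexed can report it later.
        if self.isindexed(url):
            return
        print("Indexing %s" % url)
        self.indexed.add(url)

    def gettextonly(self, soup):
        # Placeholder: returns the stripped string form of whatever it is
        # given. A fuller version would walk the parsed tree for text nodes.
        return str(soup).strip()

    def isindexed(self, url):
        return url in self.indexed

    def addlinkref(self, url_from, url_to, link_text):
        self.links.append((url_from, url_to, link_text))
```

With stubs like these, isindexed starts filtering out pages that have already been seen in the current session, so repeated URLs are no longer queued.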

Programming Collective Intelligence

Learn more about this topic from Programming Collective Intelligence.

This fascinating book demonstrates how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.



1 Reply

Posted Nov 28 2012 02:01 PM

Hi,
In the crawler code you used three methods, "addtoindex", "isindexed", and "gettextonly". When I tested your program, they didn't work. Please explain what they do and how I can define them.