If you're creating a search engine, you'll need a way to collect documents. In this excerpt from Toby Segaran's Programming Collective Intelligence, the author shows you how to set up a simple web crawler using existing tools.
I'll assume for now that you don't have a big collection of HTML documents sitting on your hard drive waiting to be indexed, so I'll show you how to build a simple crawler. It will be seeded with a small set of pages to index and will then follow any links on that page to find other pages, whose links it will also follow. This process is called crawling or spidering.
To do this, your code will have to download the pages, pass them to the indexer (which you'll build in the next section), and then parse the pages to find all the links to the pages that have to be crawled next. Fortunately, there are a couple of libraries that can help with this process.
For the examples in this chapter, I have set up a copy of several thousand files from Wikipedia, which will remain static at http://kiwitobes.com/wiki.
You're free to run the crawler on any set of pages you like, but you can use this site if you want to compare your results to those in this chapter.
urllib2 is a library bundled with Python that makes it easy to download pages—all you have to do is supply the URL. You'll use it in this section to download the pages that will be indexed. To see it in action, start up your Python interpreter and try this:
    >>> import urllib2
    >>> c=urllib2.urlopen('http://kiwitobes.com/wiki/Programming_language.html')
    >>> contents=c.read()
    >>> print contents[0:50]
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Trans'
All you have to do to store a page's HTML code into a string is create a connection and read its contents.
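urllib2 exists only in Python 2. On Python 3 the same functionality lives in urllib.request, so a rough equivalent of the session above might look like the following sketch. The helper name fetch_page and the UTF-8 fallback are assumptions made for illustration; they are not part of the chapter's code.

```python
from urllib.request import urlopen


def fetch_page(url, default_charset="utf-8"):
    """Download a URL and return its body as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3, not str
        # Prefer the charset named in the Content-Type header, if any.
        charset = response.headers.get_content_charset() or default_charset
    # Replace undecodable bytes instead of failing on a badly encoded page.
    return raw.decode(charset, errors="replace")


# Example use (needs network access):
#   contents = fetch_page('http://kiwitobes.com/wiki/Programming_language.html')
#   print(contents[0:50])
```

The main difference from the Python 2 session is that read() returns bytes, so the text has to be decoded before it can be treated as a string.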
The crawler will use the Beautiful Soup API, an excellent library that builds a structured representation of web pages. It is very tolerant of web pages with broken HTML, which is useful when constructing a crawler because you never know what pages you might come across.
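Beautiful Soup is a third-party package, so it has to be installed separately. To get a feel for what finding the links on a messy page involves, here is a small sketch that uses html.parser from the Python 3 standard library as a stand-in. It is not Beautiful Soup's API, and the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical stand-in, not Beautiful Soup)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_links(html):
    """Return the href values of all anchor tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# The <p>, <b>, and first <a> tags are never closed, yet both links are found.
messy = '<p>See <a href="Perl.html">Perl<p><b>and <a href="/wiki/Python.html">Python</a>'
print(extract_links(messy))
```

A tolerant parser matters here because a crawler cannot choose which pages it meets; one malformed page should not stop the whole run.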
Using urllib2 and Beautiful Soup, you can build a crawler that takes a list of URLs to index and crawls their links to find other pages to index. First, add these import statements to the top of searchengine.py:
    import urllib2
    from BeautifulSoup import *
    from urlparse import urljoin

    # Create a list of words to ignore
    ignorewords=set(['the','of','to','and','a','in','is','it'])
Now you can fill in the code for the crawler function. It won't actually save anything it crawls yet, but it will print the URLs as it goes so you can see that it's working. You need to put this at the end of the file (so it's part of the crawler class):
    def crawl(self,pages,depth=2):
        for i in range(depth):
            newpages=set()
            for page in pages:
                try:
                    c=urllib2.urlopen(page)
                except Exception:
                    # Skip pages that fail to download
                    print "Could not open %s" % page
                    continue
                soup=BeautifulSoup(c.read())
                self.addtoindex(page,soup)

                links=soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        url=urljoin(page,link['href'])
                        if url.find("'")!=-1: continue
                        url=url.split('#')[0]  # remove location portion
                        if url[0:4]=='http' and not self.isindexed(url):
                            newpages.add(url)
                        linkText=self.gettextonly(link)
                        self.addlinkref(page,url,linkText)

                self.dbcommit()

            # The links found at this depth become the next pages to crawl
            pages=newpages
This function loops through the list of pages, calling addtoindex on each one (right now this does nothing except print the URL, but you'll fill it in in the next section). It then uses Beautiful Soup to get all the links on that page and adds their URLs to a set called newpages. At the end of the loop, newpages becomes pages, and the process repeats.
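The URL handling inside the inner loop can be tried on its own. The sketch below pulls those steps into a standalone helper, written for Python 3, where urljoin lives in urllib.parse. The name normalize_link is hypothetical, and returning None for skipped links is an assumption made for this example.

```python
from urllib.parse import urljoin


def normalize_link(page_url, href):
    """Return an absolute URL without its fragment, or None if the link is skipped."""
    # Resolve relative links against the page they were found on.
    url = urljoin(page_url, href)
    # The crawler skips any URL containing a single quote.
    if "'" in url:
        return None
    # Drop the location portion, so page.html#top and page.html are the same page.
    url = url.split('#')[0]
    # Only http and https links are followed.
    if not url.startswith('http'):
        return None
    return url


base = 'http://kiwitobes.com/wiki/Perl.html'
print(normalize_link(base, 'Python.html#History'))
print(normalize_link(base, 'mailto:someone@example.com'))
```

Note that split('#') returns a list, so taking element [0] is what yields the URL string without its fragment.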
This function can be defined recursively so that each link calls the function again, but doing a breadth-first search allows for easier modification of the code later, either to keep crawling continuously or to save a list of unindexed pages for later crawling. It also avoids the risk of overflowing the stack.
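To see the level-by-level behavior without touching the network, here is a minimal sketch that runs the same loop structure over a made-up link graph. The function crawl_levels, the get_links callback, and the LINKS dictionary are all hypothetical. The seen set is an assumption that stands in for the isindexed check.

```python
def crawl_levels(seeds, get_links, depth=2):
    """Visit pages one level at a time and return them in the order visited."""
    visited = []
    seen = set(seeds)      # pages already queued or visited
    pages = list(seeds)
    for _ in range(depth):
        newpages = []
        for page in pages:
            visited.append(page)
            for url in get_links(page):
                if url not in seen:
                    seen.add(url)
                    newpages.append(url)
        # Links found at this level are crawled on the next pass.
        pages = newpages
    return visited


# Hypothetical link graph standing in for real pages.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "D"],
    "D": ["E"],
}

print(crawl_levels(["A"], lambda page: LINKS.get(page, []), depth=2))
```

Because the pending pages are held in an ordinary list rather than on the call stack, the list could just as easily be saved for later or refilled to keep crawling.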
You can test this function in the Python interpreter (there's no need to let it finish, so press Ctrl-C when you get bored):
    >>> import searchengine
    >>> pagelist=['http://kiwitobes.com/wiki/Perl.html']
    >>> crawler=searchengine.crawler('')
    >>> crawler.crawl(pagelist)
    Indexing http://kiwitobes.com/wiki/Perl.html
    Could not open http://kiwitobes.com...ramming%29.html
    Indexing http://kiwitobes.com...ry_Project.html
    Indexing http://kiwitobes.com...face.html
You may notice that some pages are repeated. There is a placeholder in the code for another function, isindexed, which will determine if a page has been indexed recently before adding it to newpages. This will let you run this function on any list of URLs at any time without worrying about doing unnecessary work.
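This excerpt does not show how isindexed is implemented. Purely as an illustration of a "recently indexed" check, the sketch below keeps timestamps in memory. The class IndexedTracker, the method mark_indexed, and the one-day default are all assumptions, and a real crawler would more likely keep this information in its index store.

```python
import time


class IndexedTracker:
    """Remember when each URL was last indexed (hypothetical, in memory only)."""

    def __init__(self, max_age_seconds=24 * 60 * 60):
        self.max_age_seconds = max_age_seconds
        self.last_indexed = {}  # url -> timestamp of the last indexing

    def mark_indexed(self, url, now=None):
        """Record that url was indexed at time now (defaults to the current time)."""
        self.last_indexed[url] = time.time() if now is None else now

    def isindexed(self, url, now=None):
        """Return True if url was indexed within the last max_age_seconds."""
        now = time.time() if now is None else now
        stamp = self.last_indexed.get(url)
        return stamp is not None and (now - stamp) <= self.max_age_seconds


tracker = IndexedTracker(max_age_seconds=60)
tracker.mark_indexed('http://kiwitobes.com/wiki/Perl.html', now=1000)
print(tracker.isindexed('http://kiwitobes.com/wiki/Perl.html', now=1030))
```

Passing now explicitly makes the check easy to exercise without waiting for real time to pass.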