
Use BeautifulSoup to parse data.gov

Posted Aug 25 2010 07:21 AM

Data.gov is a tremendous information resource that lists thousands of data sets collected and managed by various Federal agencies. I was interested in seeing which ones were being used the most. Unfortunately, this sort of information isn't readily available (or, at least, I couldn't figure out how to get it). However, because each data set is described by a nice little page generated from what looks like a Drupal site, it's relatively easy to parse out the information yourself. This Answer describes how.

First things first: check out a sample page, like this one for Worldwide M1+ Earthquakes, Past 7 Days. You'll see that the number of downloads is listed in the "Dataset Metrics" section. A quick "View Source" shows how this information is laid out:

<div class="categories">
  <div class="detail-header"><h2>Dataset Metrics</h2></div>
  <table border="0" cellpadding="0" cellspacing="0" class="details-table">
    <tbody>
      <tr>
        <td class="detailhead1 tablepad"
            title='Download represents the number of times users have clicked on <br/>
                   XML / CSV / XLS / KML/KMZ /Shapefile / Maps in the Download Information section.'
            id='tooltipTd'>Number of Downloads</td>
        <td class="tablepad data">
          1,818
        </td>
      </tr>
    </tbody>
  </table>
</div>


To pull this out of the HTML, I used the wonderful BeautifulSoup parser for Python. (If you don't have it installed already, you'll need to follow the installation instructions on the site for the next few steps to work.)

BeautifulSoup basically takes ugly, ambiguous HTML and creates a nice, clean data structure that you can navigate in code. To use it, you track down the various pieces you want to pull out of a web page (like I did above) and then walk the data structure to pull out just the piece you're looking for. In this case, we're looking for the data inside the <td class="tablepad data"> inside the <div class="categories"> that has an <h2> heading of "Dataset Metrics". (Whew!)
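To see that navigation on its own, here's a minimal sketch that parses a trimmed copy of the fragment shown above (saved in a string purely for illustration) and digs out the download count, using the same BeautifulSoup 3 / Python 2 style as the rest of this post:

from BeautifulSoup import BeautifulSoup

# The "Dataset Metrics" fragment from the View Source example above,
# trimmed to just the pieces we care about.
sample = """
<div class="categories">
  <div class="detail-header"><h2>Dataset Metrics</h2></div>
  <table class="details-table"><tbody><tr>
    <td class="detailhead1 tablepad" id="tooltipTd">Number of Downloads</td>
    <td class="tablepad data">1,818</td>
  </tr></tbody></table>
</div>
"""

soup = BeautifulSoup(sample)
div = soup.find("div", {"class": "categories"})    # the metrics block
cell = div.find("td", {"class": "tablepad data"})  # the cell holding the count
print cell.renderContents().strip()                # prints: 1,818

The function below applies the same navigation to a live page, with a check that the block really is the one headed "Dataset Metrics".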

Here's the code to do this:

def pull_downloads(url):
   ret_val = ""
   txt = urllib.urlopen(url).read()  # Pull in the url
   soup = BeautifulSoup(txt)
   # Each metrics block on the page is a <div class="categories">
   rows = soup.findAll("div", {"class": "categories"})
   for row in rows:
      cat = row.findAll('h2')
      # Only the block whose <h2> heading is "Dataset Metrics" has the count
      if len(cat) > 0 and "Dataset Metrics" in str(cat[0]):
         recs = row.findAll("td", {"class": "tablepad data"})
         for rec in recs:
            # Keep only the digits, dropping the thousands separators
            for c in rec.renderContents():
               if c.isdigit(): ret_val += c
   return ret_val
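
As a quick spot check, you can run the function against a single detail page (here, the first URL from the urls.txt list below) before turning it loose on the whole catalog:

import urllib
from BeautifulSoup import BeautifulSoup

# Assumes the pull_downloads function above is defined in the same file
# or pasted into an interactive session.
print pull_downloads("http://www.data.gov/details/1593")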


With that working, we need to process all the data sets in the Data.gov catalog. To do this, I just copied the first column of the catalog (the detail-page URL for each data set) into a file called urls.txt and then passed the URLs in using input redirection. So, here's the urls.txt file:

http://www.data.gov/details/1593
http://www.data.gov/details/1594
http://www.data.gov/details/1595
http://www.data.gov/details/2122
http://www.data.gov/details/2123
http://www.data.gov/details/2125
...


And here's the complete script that I wrapped around the BeautifulSoup parser to process the urls.txt file:

import urllib
from BeautifulSoup import BeautifulSoup
import sys


def pull_downloads(url):
   ret_val = ""
   txt = urllib.urlopen(url).read()  # Pull in the url
   soup = BeautifulSoup(txt)
   # Each metrics block on the page is a <div class="categories">
   rows = soup.findAll("div", {"class": "categories"})
   for row in rows:
      cat = row.findAll('h2')
      # Only the block whose <h2> heading is "Dataset Metrics" has the count
      if len(cat) > 0 and "Dataset Metrics" in str(cat[0]):
         recs = row.findAll("td", {"class": "tablepad data"})
         for rec in recs:
            # Keep only the digits, dropping the thousands separators
            for c in rec.renderContents():
               if c.isdigit(): ret_val += c
   return ret_val


# Read the list of detail-page URLs from stdin and print a tab-separated
# line of URL and download count for each one
urls = sys.stdin.readlines()
for url in urls:
   u = url.strip()
   n = pull_downloads(u)
   print "%s\t%s" % (u, n)


Finally, you can run all this with this command:

python pull_datadotgov.py < urls.txt > usage.txt

Let it run for a few hours and you'll have a complete record of all the downloads on the site.
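
Since the original question was which data sets are being used the most, here's a small sketch (assuming the tab-separated usage.txt produced above) that sorts the results by download count and prints the top ten:

# Sort the usage.txt report by download count, highest first, and show the
# top ten data sets. Lines with no count (blank second column) are skipped.
results = []
for line in open("usage.txt"):
   fields = line.rstrip("\n").split("\t")
   if len(fields) == 2 and fields[1].isdigit():
      results.append((int(fields[1]), fields[0]))

results.sort(reverse=True)
for count, url in results[:10]:
   print "%s\t%s" % (url, count)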
