First things first: check out a sample page, like this one for Worldwide M1+ Earthquakes, Past 7 Days. You'll see that the number of downloads is listed in the "Dataset Metrics" section. A quick "View Source" shows how this information is laid out:
<div class="categories">
<div class="detail-header"><h2>Dataset Metrics</h2></div>
<table border="0" cellpadding="0" cellspacing="0" class="details-table" >
<tbody> <tr>
<td class="detailhead1 tablepad"
title='Download represents the number of times users have clicked on <br/>
XML / CSV / XLS / KML/KMZ /Shapefile / Maps in the Download Information section.'
id='tooltipTd'>Number of Downloads</td>
<td class="tablepad data">
1,818 </td>
</tr>
</tbody></table></div>
To pull this out of the HTML, I used the wonderful BeautifulSoup parser for Python. (If you don't have it installed already, you'll need to follow the installation instructions on its site for the next few steps to work.)
BeautifulSoup takes ugly, ambiguous HTML and turns it into a clean data structure that you can navigate in code. To use it, you track down the pieces you want in the page source (like I did above) and then walk the data structure to pull out just the piece you're looking for. In this case, we're looking for the data inside the <td class="tablepad data"> inside the <div class="categories"> that has an <h2> heading of "Dataset Metrics". (Whew!)
Here's the code to do this:
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

def pull_downloads(url):
    ret_val = ""
    txt = urllib.urlopen(url).read()  # Pull in the page
    soup = BeautifulSoup(txt)
    # Find every <div class="categories"> block on the page
    rows = soup.findAll("div", {"class": "categories"})
    for row in rows:
        cat = row.findAll("h2")
        # Only look inside the block headed "Dataset Metrics"
        if len(cat) > 0 and "Dataset Metrics" in str(cat[0]):
            recs = row.findAll("td", {"class": "tablepad data"})
            for rec in recs:
                # Keep only the digits, so "1,818" becomes "1818"
                for c in rec.renderContents():
                    if c in "0123456789":
                        ret_val += c
    return ret_val
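BeautifulSoup is a third-party package, so as a rough cross-check here is a sketch of the same walk using only Python 3's standard-library html.parser. This is not the code used in this post: the class name MetricsParser, the helper downloads_from_html, and the trimmed SAMPLE markup (the snippet above with the title attribute dropped) are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Trimmed copy of the "Dataset Metrics" markup shown earlier (title attribute dropped).
SAMPLE = """<div class="categories">
<div class="detail-header"><h2>Dataset Metrics</h2></div>
<table border="0" cellpadding="0" cellspacing="0" class="details-table" >
<tbody> <tr>
<td class="detailhead1 tablepad" id='tooltipTd'>Number of Downloads</td>
<td class="tablepad data">
1,818 </td>
</tr>
</tbody></table></div>"""


class MetricsParser(HTMLParser):
    """Hypothetical helper: collect the text of <td class="tablepad data"> cells
    from every <div class="categories"> whose <h2> mentions "Dataset Metrics"."""

    def __init__(self):
        super().__init__()
        self.depth = 0           # <div> nesting depth inside the current categories block
        self.in_h2 = False
        self.in_cell = False
        self.heading_ok = False  # did this block contain the right <h2>?
        self.pending = []        # cell texts seen in the current block
        self.cells = []          # cell texts from blocks that matched

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "categories" in classes:
                self.depth, self.heading_ok, self.pending = 1, False, []
        elif self.depth > 0 and tag == "h2":
            self.in_h2 = True
        elif self.depth > 0 and tag == "td" and classes == ["tablepad", "data"]:
            self.in_cell = True
            self.pending.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "td":
            self.in_cell = False
        elif tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0 and self.heading_ok:
                self.cells.extend(self.pending)  # keep cells only for a matching block

    def handle_data(self, data):
        if self.in_h2 and "Dataset Metrics" in data:
            self.heading_ok = True
        elif self.in_cell:
            self.pending[-1] += data


def downloads_from_html(html):
    """Return the digits of the matching cells as a string, e.g. "1,818" -> "1818"."""
    parser = MetricsParser()
    parser.feed(html)
    parser.close()
    return "".join(ch for ch in "".join(parser.cells) if ch.isdigit())


print(downloads_from_html(SAMPLE))  # prints 1818
```

The parser buffers cells until the enclosing block closes, so cells from a categories block with a different heading are discarded rather than mixed into the count.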
Next, we need to process all the data sets in the Data.gov catalog. To do this, I just copied the first column into a file called urls.txt and then fed the URLs to the script using input redirection. So, here's the urls.txt file, one URL per line:
http://www.data.gov/details/1593
http://www.data.gov/details/1594
http://www.data.gov/details/1595
http://www.data.gov/details/2122
http://www.data.gov/details/2123
http://www.data.gov/details/2125
...
Here's the complete script, which wraps the BeautifulSoup parsing in a loop over the URLs from urls.txt:
import sys
import urllib
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

def pull_downloads(url):
    ret_val = ""
    txt = urllib.urlopen(url).read()  # Pull in the page
    soup = BeautifulSoup(txt)
    # Find every <div class="categories"> block on the page
    rows = soup.findAll("div", {"class": "categories"})
    for row in rows:
        cat = row.findAll("h2")
        # Only look inside the block headed "Dataset Metrics"
        if len(cat) > 0 and "Dataset Metrics" in str(cat[0]):
            recs = row.findAll("td", {"class": "tablepad data"})
            for rec in recs:
                # Keep only the digits, so "1,818" becomes "1818"
                for c in rec.renderContents():
                    if c in "0123456789":
                        ret_val += c
    return ret_val

# Read one URL per line from standard input
urls = sys.stdin.readlines()
for url in urls:
    u = url.strip()
    if not u:
        continue  # Skip blank lines
    n = pull_downloads(u)
    # Tab-separated output: URL, then download count
    print "%s\t%s" % (u, n)
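Because the loop above stops at the first URL that fails to load, a long run may benefit from a slightly more defensive driver. The following is a Python 3 sketch, not the script used here: process_urls and its fetch_count parameter are hypothetical names, the example URLs and counts are made up, and writing an empty count for a failed URL is an assumption about what you would want in the output.

```python
import sys


def process_urls(lines, fetch_count):
    """Yield "url<TAB>count" for each non-blank line.

    fetch_count is any callable that maps a URL to a download-count string,
    for example a Python 3 port of pull_downloads (hypothetical here).
    """
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the input
        try:
            count = fetch_count(url)
        except Exception as exc:  # broad on purpose: one bad page should not end the run
            print("failed: %s (%s)" % (url, exc), file=sys.stderr)
            count = ""
        yield "%s\t%s" % (url, count)


# Stand-in fetcher with made-up data, so the sketch runs without network access.
fake_counts = {"http://example.invalid/a": "12"}
demo_lines = ["http://example.invalid/a\n", "\n", "http://example.invalid/b\n"]
for row in process_urls(demo_lines, fake_counts.__getitem__):
    print(row)
```

Failures are reported on standard error, so they do not end up in the redirected output file.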
Finally, save the script as pull_datadotgov.py and run it with this command:
python pull_datadotgov.py < urls.txt > usage.txt
Let it run for a few hours and you'll have a record in usage.txt of the download count for every data set listed in urls.txt.
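Once usage.txt exists, its tab-separated lines are easy to load back for a quick look at the most-downloaded data sets. As a sketch (Python 3; load_usage is a hypothetical helper, and the sample rows below are made up for illustration, not real Data.gov counts):

```python
def load_usage(lines):
    """Parse "url<TAB>count" lines into (url, count) pairs, most-downloaded first.

    Rows with an empty or non-numeric count are skipped.
    """
    rows = []
    for line in lines:
        url, _, count = line.rstrip("\n").partition("\t")
        if count.isdigit():
            rows.append((url, int(count)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up rows in the same format the script prints.
sample = [
    "http://example.invalid/a\t12\n",
    "http://example.invalid/b\t\n",
    "http://example.invalid/c\t340\n",
]
for url, count in load_usage(sample):
    print(count, url)
```

For the real file, passing an open file object works the same way, since iterating over it yields one line at a time.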
