Jump to content

Load Recovery.gov Grant Data into a CouchDB Database

VOTE
+ 1
  • -
  • +
  tmo9d's Photo
Posted Sep 30 2009 08:54 AM

The Grant XML file from Recovery.gov contains information about approximately 61,000 federal grants that comprise the Recovery Act of 2009. While you could go through the trouble of loading this data into a Lucene index or even creating some sort of relational database to store the model, an easier approach would be to parse this 150MB XML file and store each grant as a document in a CouchDB database. You can then use the map and reduce functions in a Couch view to produce interesting reports.

I've created a stub project that uses Staxmate, Json-lib, and Apache HttpClient to load Recovery.gov grant data into a CouchDB database. The following steps describe the logic used to parse the Recovery.gov data and load each grant document into a CouchDB database. This sample project is available on GitHub here: http://github.com/to...n/sample-parse. To clone this project, execute: "git clone git://github.com/tobrien/sample-parse.git"

1. Download the Raw XML Data from Recovery.gov. If you've cloned the sample-parse from from GitHub, this data file will already be in ./data.

2. Install CouchDB On Your Local Machine - That particular answer is OSX-specific. If you are running Linux, you can just download the couched distribution and compile it from source. Be warned, Couch installation is often confusing due to the number of dependencies required. For more information on Couch installation, see Installation on the CouchDB Wiki.

2. Add the following dependencies to a Java application. This application is going to use staxmate, a stax parser, apache httpclient, and json-lib among other libraries

   <dependency>
    	<groupId>commons-collections</groupId>
    	<artifactId>commons-collections</artifactId>
    	<version>3.2.1</version>
    </dependency>
    <dependency>
    	<groupId>org.codehaus.staxmate</groupId>
    	<artifactId>staxmate</artifactId>
    	<version>2.0.0</version>
    	<scope>compile</scope>
    </dependency>
    <dependency>
    	<groupId>org.codehaus.staxmate</groupId>
    	<artifactId>stax2</artifactId>
    	<version>2.1</version>
    	<type>jar</type>
    	<scope>compile</scope>
    </dependency>
    <dependency>
    	<groupId>org.codehaus.woodstox</groupId>
    	<artifactId>wstx-asl</artifactId>
    	<version>4.0.0</version>
    	<type>jar</type>
    	<scope>compile</scope>
    </dependency>
    <dependency>
    	<groupId>net.sf.json-lib</groupId>
    	<artifactId>json-lib</artifactId>
    	<version>2.3</version>
    	<type>jar</type>
    	<scope>compile</scope>
    </dependency>
    <dependency>
    	<groupId>org.apache.httpcomponents</groupId>
    	<artifactId>httpclient</artifactId>
    	<version>4.0</version>
    	<type>jar</type>
    	<scope>compile</scope>
    </dependency>


3. Place the "grants.xml" from Recovery.gov into a directory named "data/". (Note, if you've clone the sample project, you already have this data available)

4. Open up the CouchDB admin interface (this is either via the CouchDBX application on OSX, or via http://localhost:5984/_utils if you have installed Couch on another platform). Click on the create database link on the front summary page as shown below:

Posted Image

5. Create a new database named "grants"

Posted Image

6. Write the following class to parse the Recovery.gov grants.xml file with Staxmate, turn each content node into a JSON document, and store bulk documents in a CouchDB database named "grants":

package com.discursive.sample.parse;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UnsupportedEncodingException;

import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;

import net.sf.json.JSONArray;
import net.sf.json.JSONObject;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.codehaus.staxmate.SMInputFactory;
import org.codehaus.staxmate.in.SMHierarchicCursor;
import org.codehaus.staxmate.in.SMInputCursor;

public class LoadingGrants {

	Logger logger = Logger.getLogger( LoadingGrants.class );
	
	private String couchDbUrl = "http://localhost:5984/grants/_bulk_docs";
	private int bulkSize = 1000;

	public static void main(String[] args) throws Exception {
		BasicConfigurator.configure();
		Logger.getRootLogger().setLevel(Level.DEBUG);
		new LoadingGrants().go();
	}
	
	public void go() throws Exception {
		Reader reader = new FileReader( "./data/grants.xml" );
		
		HttpClient client = new DefaultHttpClient();
		
		// Use Staxmate to parse grants.xml as a stream
		SMInputFactory inf = new SMInputFactory(XMLInputFactory.newInstance());
		File grantsFile = new File("./data/grants.xml");
		SMHierarchicCursor rootC = inf.rootElementCursor( reader );
		rootC.advance();
		// Advance through the "content" elements under the root element.
		QName content = new QName("http://www.w3.org/2005/Atom", "content");
		SMInputCursor contentC = rootC.descendantElementCursor(content);
		JSONArray bulkArray = new JSONArray();
		int counter = 1;
		while( contentC.getNext() != null ) {
			// Create a JSON Object that contains all of the elements under
			// feed/entry/content - this is possible because the grant data is 
			// a series of flat XML elements
			SMInputCursor childC = contentC.childElementCursor();
			JSONObject json = new JSONObject();
			while( childC.getNext() != null ) {
				json.put( childC.getLocalName(), childC.collectDescendantText() );
			}
			bulkArray.add( json );
			
			// Only Write Records to Couch in Bulk
			if( counter % bulkSize == 0 ) {
				postToCouch(couchDbUrl, client, bulkArray);
				bulkArray.clear();
			}
			counter++;
		}				
		// Post the last batch to Couch
		postToCouch(couchDbUrl, client, bulkArray);

		// Close the Stax stream
		rootC.getStreamReader().closeCompletely();
	}

	private void postToCouch(String couchDbUrl, HttpClient client,
			JSONArray jsonArray) throws UnsupportedEncodingException, IOException,
			ClientProtocolException {
		JSONObject request = new JSONObject();
		request.put( "all_or_nothing", true );
		request.put( "docs", jsonArray);
		
		StringEntity entity = new StringEntity(request.toString(), "UTF-8");
		HttpPost post = new HttpPost(couchDbUrl);
		post.setEntity(entity);
		HttpResponse response = client.execute(post);
		HttpEntity ent = response.getEntity();
		if (ent != null) {
			ent.consumeContent();
		}
	}
	
	
}


This class is configured to load JSON via the _bulk_docs interface in batches of 1000. While 1000 is a rather large batch size, it only took about 680 seconds on a relatively overloaded MacBook to load all 61,000 grants into Couch. If you are experiencing performance issues, reduce this bulk size to something more manageable like 100.

7. Once the data is finished loading, you should see that the grants database has approximately 61,000 documents in it.

Posted Image

8. Click on the database name in the CouchDB interface, and it will show you a list of documents, each document represents a grant in the Recovery Act.

Posted Image

9. Clicking on an individual document will load this document in the CouchDB interface.

Posted Image

From this point, you should investigate CouchDB views and create some interesting reports that summarize and group some of the information stored in your new CouchDB database of grants from the Recovery.gov site.

Cover of CouchDB: The Definitive Guide
Learn more about this topic from CouchDB: Rough Cuts Version.  This book introduces you to Apache CouchDB, a document-oriented database designed to work with applications that handle commonplace information such as contacts, invoices, and receipts. In CouchDB: The Definitive Guide, three of the core developers gently explain how to work with CouchDB, using clear and practical scenarios. Each chapter showcases key features, such as simple document CRUD (create, read, update, delete), advanced MapReduce, and deployment tuning for performance and reliability.
Learn More Read Now on Safari







0 Alternative Solutions | 0 Comments

filter by: