I've created a stub project that uses Staxmate, Json-lib, and Apache HttpClient to load Recovery.gov grant data into a CouchDB database. The following steps describe the logic used to parse the Recovery.gov data and load each grant document into a CouchDB database. This sample project is available on GitHub here: http://github.com/to...n/sample-parse. To clone this project, execute: "git clone git://github.com/tobrien/sample-parse.git"
1. Download the Raw XML Data from Recovery.gov. If you've cloned the sample-parse from from GitHub, this data file will already be in ./data.
2. Install CouchDB On Your Local Machine - That particular answer is OSX-specific. If you are running Linux, you can just download the couched distribution and compile it from source. Be warned, Couch installation is often confusing due to the number of dependencies required. For more information on Couch installation, see Installation on the CouchDB Wiki.
2. Add the following dependencies to a Java application. This application is going to use staxmate, a stax parser, apache httpclient, and json-lib among other libraries
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.1</version>
</dependency>
<dependency>
<groupId>org.codehaus.staxmate</groupId>
<artifactId>staxmate</artifactId>
<version>2.0.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.codehaus.staxmate</groupId>
<artifactId>stax2</artifactId>
<version>2.1</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.codehaus.woodstox</groupId>
<artifactId>wstx-asl</artifactId>
<version>4.0.0</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>net.sf.json-lib</groupId>
<artifactId>json-lib</artifactId>
<version>2.3</version>
<type>jar</type>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.0</version>
<type>jar</type>
<scope>compile</scope>
</dependency>3. Place the "grants.xml" from Recovery.gov into a directory named "data/". (Note, if you've clone the sample project, you already have this data available)
4. Open up the CouchDB admin interface (this is either via the CouchDBX application on OSX, or via http://localhost:5984/_utils if you have installed Couch on another platform). Click on the create database link on the front summary page as shown below:

5. Create a new database named "grants"

6. Write the following class to parse the Recovery.gov grants.xml file with Staxmate, turn each content node into a JSON document, and store bulk documents in a CouchDB database named "grants":
package com.discursive.sample.parse;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import net.sf.json.JSONArray;
import net.sf.json.JSONObject;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.codehaus.staxmate.SMInputFactory;
import org.codehaus.staxmate.in.SMHierarchicCursor;
import org.codehaus.staxmate.in.SMInputCursor;
public class LoadingGrants {
Logger logger = Logger.getLogger( LoadingGrants.class );
private String couchDbUrl = "http://localhost:5984/grants/_bulk_docs";
private int bulkSize = 1000;
public static void main(String[] args) throws Exception {
BasicConfigurator.configure();
Logger.getRootLogger().setLevel(Level.DEBUG);
new LoadingGrants().go();
}
public void go() throws Exception {
Reader reader = new FileReader( "./data/grants.xml" );
HttpClient client = new DefaultHttpClient();
// Use Staxmate to parse grants.xml as a stream
SMInputFactory inf = new SMInputFactory(XMLInputFactory.newInstance());
File grantsFile = new File("./data/grants.xml");
SMHierarchicCursor rootC = inf.rootElementCursor( reader );
rootC.advance();
// Advance through the "content" elements under the root element.
QName content = new QName("http://www.w3.org/2005/Atom", "content");
SMInputCursor contentC = rootC.descendantElementCursor(content);
JSONArray bulkArray = new JSONArray();
int counter = 1;
while( contentC.getNext() != null ) {
// Create a JSON Object that contains all of the elements under
// feed/entry/content - this is possible because the grant data is
// a series of flat XML elements
SMInputCursor childC = contentC.childElementCursor();
JSONObject json = new JSONObject();
while( childC.getNext() != null ) {
json.put( childC.getLocalName(), childC.collectDescendantText() );
}
bulkArray.add( json );
// Only Write Records to Couch in Bulk
if( counter % bulkSize == 0 ) {
postToCouch(couchDbUrl, client, bulkArray);
bulkArray.clear();
}
counter++;
}
// Post the last batch to Couch
postToCouch(couchDbUrl, client, bulkArray);
// Close the Stax stream
rootC.getStreamReader().closeCompletely();
}
private void postToCouch(String couchDbUrl, HttpClient client,
JSONArray jsonArray) throws UnsupportedEncodingException, IOException,
ClientProtocolException {
JSONObject request = new JSONObject();
request.put( "all_or_nothing", true );
request.put( "docs", jsonArray);
StringEntity entity = new StringEntity(request.toString(), "UTF-8");
HttpPost post = new HttpPost(couchDbUrl);
post.setEntity(entity);
HttpResponse response = client.execute(post);
HttpEntity ent = response.getEntity();
if (ent != null) {
ent.consumeContent();
}
}
}This class is configured to load JSON via the _bulk_docs interface in batches of 1000. While 1000 is a rather large batch size, it only took about 680 seconds on a relatively overloaded MacBook to load all 61,000 grants into Couch. If you are experiencing performance issues, reduce this bulk size to something more manageable like 100.
7. Once the data is finished loading, you should see that the grants database has approximately 61,000 documents in it.

8. Click on the database name in the CouchDB interface, and it will show you a list of documents, each document represents a grant in the Recovery Act.

9. Clicking on an individual document will load this document in the CouchDB interface.

From this point, you should investigate CouchDB views and create some interesting reports that summarize and group some of the information stored in your new CouchDB database of grants from the Recovery.gov site.
Help










