
How to Run Simple Frequency Analysis on the Recovery.gov Data

Posted by tmo9d on Sep 29 2009 09:03 AM

You've read all about the unprecedented transparency of the recent stimulus act, and now you want to run some simple frequency reports to see which states and cities are getting the most grants from the Federal government. Instead of clicking around in flashy government-created Flash widgets, you want to work with the raw data.

The sample project for this particular example is a Java project hosted at GitHub here: http://github.com/tobrien/sample-parse

To run simple frequency reports on this data:

1. Generate a Lucene index from the Recovery.gov data.

2. Assuming that you've created the Lucene index with the sample project from this answer, you can now write a class that prints the frequency of cities or states.

3. Here is a class that opens the index in the "index/" directory and prints the number of grants in each state:

package com.discursive.sample.parse;

import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.beanutils.BeanComparator;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class StateGrantFrequency {

	Logger logger = Logger.getLogger( StateGrantFrequency.class );
	
	public static void main(String args[]) throws Exception {
		BasicConfigurator.configure();
		Logger.getRootLogger().setLevel(Level.INFO);
		new StateGrantFrequency().go();
	}

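	@SuppressWarnings("unchecked")
	public void go() throws Exception {
		// Open the index in the "index/" directory in read-only mode
		Directory index = new SimpleFSDirectory( new File("index"));
		IndexReader reader = IndexReader.open( index, true );
		// Walk every term in the index, keeping only terms from the "state" field
		TermEnum terms = reader.terms( );
		List<Freq> termList = new ArrayList<Freq>( );
		while( terms.next( ) ) {
			if( terms.term( ).field( ).equals( "state" ) ) {
				// docFreq() is the number of documents (grants) containing this term
				termList.add( new Freq( terms.term().text(), terms.docFreq( ) ) );
			}
		}
		// Release the term enumeration and the reader once the counts are collected
		terms.close( );
		reader.close( );
		// Sort ascending by the "freq" bean property, so the largest counts print last
		Collections.sort( termList, new BeanComparator( "freq" ) );
		for( Freq freq : termList ) {
			System.out.println( freq.freq + " " + freq.term );
		}
	}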
	
	public class Freq {
		String term;
		int freq;
		public Freq(String term, int freq) {
			this.term = term;
			this.freq = freq;
		}
		public String getTerm() { return term; }
		public int getFreq() { return freq; }
	}

}


This class produces the following output with a grants.xml downloaded from Recovery.gov on September 29th, 2009:

3 Invalid code: 57
3 Marshall Islands
4 Federated States of Micronesia
5 Palau
20 Northern Mariana Islands
21 American Samoa
33 Virgin Islands of the U.S.
36 Guam
59 Invalid code: 00
222 Delaware
250 Hawaii
254 Wyoming
267 Nevada
278 District of Columbia
317 Vermont
366 Alaska
376 Idaho
380 South Dakota
422 North Dakota
449 New Hampshire
482 Rhode Island
489 New Mexico
507 Puerto Rico
516 Montana
531 Utah
543 Maine
545 West Virginia
573 Nebraska
677 Mississippi
757 Arizona
782 Arkansas
793 Kansas
821 Iowa
868 Connecticut
897 Louisiana
973 Kentucky
1003 South Carolina
1013 Oklahoma
1042 Maryland
1045 Virginia
1053 Oregon
1058 Colorado
1178 New Jersey
1219 Wisconsin
1250 Alabama
1343 Minnesota
1347 Missouri
1485 Tennessee
1590 Washington
1599 Indiana
1663 Georgia
1771 Michigan
1797 North Carolina
2085 Florida
2152 Illinois
2183 Massachusetts
2285 Pennsylvania
2555 Ohio
2787 Texas
3500 New York
5561 California


The list is sorted in ascending order, so the states with the most grants are at the bottom. While it might be surprising that Ohio is in the #4 spot, this list isn't tracking the total number of dollars spent, just the number of grants for each state.

4. Another interesting frequency report is the number of grants for each city. Here's the class. (Note that it is the same as the previous class with the field name changed; clearly this is a candidate for a refactor.)

package com.discursive.sample.parse;

import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.beanutils.BeanComparator;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class CityGrantFrequency {

	Logger logger = Logger.getLogger( CityGrantFrequency.class );
	
	public static void main(String args[]) throws Exception {
		BasicConfigurator.configure();
		Logger.getRootLogger().setLevel(Level.INFO);
		new CityGrantFrequency().go();
	}

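	@SuppressWarnings("unchecked")
	public void go() throws Exception {
		// Open the index in the "index/" directory in read-only mode
		Directory index = new SimpleFSDirectory( new File("index"));
		IndexReader reader = IndexReader.open( index, true );
		// Walk every term in the index, keeping only terms from the "city" field
		TermEnum terms = reader.terms( );
		List<Freq> termList = new ArrayList<Freq>( );
		while( terms.next( ) ) {
			if( terms.term( ).field( ).equals( "city" ) ) {
				// docFreq() is the number of documents (grants) containing this term
				termList.add( new Freq( terms.term().text(), terms.docFreq( ) ) );
			}
		}
		// Release the term enumeration and the reader once the counts are collected
		terms.close( );
		reader.close( );
		// Sort ascending by the "freq" bean property, so the largest counts print last
		Collections.sort( termList, new BeanComparator( "freq" ) );
		for( Freq freq : termList ) {
			System.out.println( freq.freq + " " + freq.term );
		}
	}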
	
	public class Freq {
		String term;
		int freq;
		public Freq(String term, int freq) {
			this.term = term;
			this.freq = freq;
		}
		public String getTerm() { return term; }
		public int getFreq() { return freq; }
	}

}
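
Since the state and city classes differ only in the field name, one way to approach the refactor is to pass the field name in as a parameter. The following is a minimal sketch of that idea, not code from the sample project: the class name FieldFrequencyReport, the extra "field" member on Freq, and the report() method are all hypothetical. To keep the sketch runnable without a Lucene index, the list passed to report() stands in for the terms that the TermEnum loop would yield, and a typed Comparator replaces the reflective BeanComparator.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical refactor sketch: one class serves any field name.
class FieldFrequencyReport {

    // One indexed term: the field it belongs to, its text, and its document frequency.
    static final class Freq {
        final String field;
        final String term;
        final int freq;

        Freq(String field, String term, int freq) {
            this.field = field;
            this.term = term;
            this.freq = freq;
        }
    }

    // Keep only the terms of the requested field, sorted by ascending frequency.
    // The input list is a stand-in for what a Lucene TermEnum loop would produce.
    static List<Freq> report(String field, List<Freq> allTerms) {
        List<Freq> result = new ArrayList<Freq>();
        for (Freq f : allTerms) {
            if (f.field.equals(field)) {
                result.add(f);
            }
        }
        Collections.sort(result, new Comparator<Freq>() {
            public int compare(Freq a, Freq b) {
                return Integer.compare(a.freq, b.freq);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        // The field name comes from the command line instead of being hard-coded.
        String field = args.length > 0 ? args[0] : "state";

        // Sample terms taken from the outputs in this post.
        List<Freq> allTerms = new ArrayList<Freq>();
        allTerms.add(new Freq("state", "Texas", 2787));
        allTerms.add(new Freq("city", "BOSTON", 515));
        allTerms.add(new Freq("state", "Ohio", 2555));

        for (Freq f : report(field, allTerms)) {
            System.out.println(f.freq + " " + f.term);
        }
    }
}
```

In a real refactor, the body of go() from the classes above would fill the list from the index, and the same class could then be run once with "state" and once with "city".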


Here is the tail end of the CityGrantFrequency output, showing the top cities:

250 OLYMPIA
267 SALEM
271 MADISON
289 BALTIMORE
289 PHILADELPHIA
303 TALLAHASSEE
304 LANSING
306 CHICAGO
309 Birmingham
314 LOS ANGELES
354 Chicago
365 ATLANTA
365 COLUMBUS
367 ALBANY
374 Boston
387 INDIANAPOLIS
388 NASHVILLE
395 NEW YORK
419 Indianapolis
426 SACRAMENTO
442 Columbus
457 SPRINGFIELD
463 New York
494 Los Angeles
515 BOSTON


Note that the city output is raw data, and the same city can appear more than once under different capitalization, for example "BOSTON" and "Boston". To remove these duplicates, you could convert all text to uppercase as it is added to the Lucene index.
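
If rebuilding the index is not convenient, an alternative (an assumption on my part, not something the sample project does) is to merge the counts after the fact by uppercasing each term before summing. The sketch below is plain Java with no Lucene dependency; the class name CaseFoldingMerge and the method mergeIgnoringCase are hypothetical, and the sample counts are copied from the tail of the city output above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing sketch: merge counts that differ only by case.
class CaseFoldingMerge {

    // Uppercase every key and add together the counts that collide.
    static Map<String, Integer> mergeIgnoringCase(Map<String, Integer> counts) {
        Map<String, Integer> merged = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Locale.ROOT keeps the uppercasing independent of the machine's locale.
            String key = e.getKey().toUpperCase(Locale.ROOT);
            merged.merge(key, e.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put("Boston", 374);
        counts.put("BOSTON", 515);
        counts.put("Chicago", 354);
        counts.put("CHICAGO", 306);

        // prints {BOSTON=889, CHICAGO=660}
        System.out.println(mergeIgnoringCase(counts));
    }
}
```

Only the tail of the output is shown above, so other spellings of these cities may exist further up the list and the real totals could be higher. Cities that share a name across states are also combined, as they already are in the raw report.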
