The sample project for this example is a Java project hosted on GitHub: http://github.com/tobrien/sample-parse
If you want to run some simple frequency reports on this data, you can:
1. Generate a Lucene Index from the Recovery.gov Data
2. Assuming that you've created the Lucene index with the sample project from this answer, you can now write a class that prints out the frequency of cities and states.
3. Here is a class which will open the index in the "index/" directory and print out the frequency of grants in each state:
package com.discursive.sample.parse;

import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.beanutils.BeanComparator;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class StateGrantFrequency {

    Logger logger = Logger.getLogger( StateGrantFrequency.class );

    public static void main(String args[]) throws Exception {
        BasicConfigurator.configure();
        Logger.getRootLogger().setLevel(Level.INFO);
        new StateGrantFrequency().go();
    }

    @SuppressWarnings("unchecked")
    public void go() throws Exception {
        // Open the index in the "index/" directory in read-only mode
        Directory index = new SimpleFSDirectory( new File("index"));
        IndexReader reader = IndexReader.open( index, true );

        // Walk every term in the index, keeping only terms from the "state" field
        TermEnum terms = reader.terms( );
        List<Freq> termList = new ArrayList<Freq>( );
        while( terms.next( ) ) {
            if( terms.term( ).field( ).equals( "state" ) ) {
                termList.add( new Freq( terms.term().text(), terms.docFreq( ) ) );
            }
        }
        terms.close();
        reader.close();

        // Sort ascending by document frequency, then print "count term"
        Collections.sort( termList, new BeanComparator( "freq" ) );
        for( Freq freq : termList ) {
            System.out.println( freq.freq + " " + freq.term );
        }
    }

    // Simple bean pairing a term with the number of documents that contain it
    public static class Freq {
        String term;
        int freq;

        public Freq(String term, int freq) {
            this.term = term;
            this.freq = freq;
        }

        public String getTerm() { return term; }
        public int getFreq() { return freq; }
    }
}

This class produces the following output with a grants.xml downloaded from Recovery.gov on September 29th, 2009:
3 Invalid code: 57
3 Marshall Islands
4 Federated States of Micronesia
5 Palau
20 Northern Mariana Islands
21 American Samoa
33 Virgin Islands of the U.S.
36 Guam
59 Invalid code: 00
222 Delaware
250 Hawaii
254 Wyoming
267 Nevada
278 District of Columbia
317 Vermont
366 Alaska
376 Idaho
380 South Dakota
422 North Dakota
449 New Hampshire
482 Rhode Island
489 New Mexico
507 Puerto Rico
516 Montana
531 Utah
543 Maine
545 West Virginia
573 Nebraska
677 Mississippi
757 Arizona
782 Arkansas
793 Kansas
821 Iowa
868 Connecticut
897 Louisiana
973 Kentucky
1003 South Carolina
1013 Oklahoma
1042 Maryland
1045 Virginia
1053 Oregon
1058 Colorado
1178 New Jersey
1219 Wisconsin
1250 Alabama
1343 Minnesota
1347 Missouri
1485 Tennessee
1590 Washington
1599 Indiana
1663 Georgia
1771 Michigan
1797 North Carolina
2085 Florida
2152 Illinois
2183 Massachusetts
2285 Pennsylvania
2555 Ohio
2787 Texas
3500 New York
5561 California
While it might be surprising that Ohio is in the #4 spot, this list isn't tracking the total number of dollars spent, just the number of grants for each state.
4. Another interesting frequency report is the number of grants for each city. Here's the class. (Note that it is the same as the previous class with the field name changed; clearly this is a candidate for a refactor.)
package com.discursive.sample.parse;

import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.beanutils.BeanComparator;
import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class CityGrantFrequency {

    Logger logger = Logger.getLogger( CityGrantFrequency.class );

    public static void main(String args[]) throws Exception {
        BasicConfigurator.configure();
        Logger.getRootLogger().setLevel(Level.INFO);
        new CityGrantFrequency().go();
    }

    @SuppressWarnings("unchecked")
    public void go() throws Exception {
        // Open the index in the "index/" directory in read-only mode
        Directory index = new SimpleFSDirectory( new File("index"));
        IndexReader reader = IndexReader.open( index, true );

        // Walk every term in the index, keeping only terms from the "city" field
        TermEnum terms = reader.terms( );
        List<Freq> termList = new ArrayList<Freq>( );
        while( terms.next( ) ) {
            if( terms.term( ).field( ).equals( "city" ) ) {
                termList.add( new Freq( terms.term().text(), terms.docFreq( ) ) );
            }
        }
        terms.close();
        reader.close();

        // Sort ascending by document frequency, then print "count term"
        Collections.sort( termList, new BeanComparator( "freq" ) );
        for( Freq freq : termList ) {
            System.out.println( freq.freq + " " + freq.term );
        }
    }

    // Simple bean pairing a term with the number of documents that contain it
    public static class Freq {
        String term;
        int freq;

        public Freq(String term, int freq) {
            this.term = term;
            this.freq = freq;
        }

        public String getTerm() { return term; }
        public int getFreq() { return freq; }
    }
}

Here is the tail end of the output showing the top cities:
250 OLYMPIA
267 SALEM
271 MADISON
289 BALTIMORE
289 PHILADELPHIA
303 TALLAHASSEE
304 LANSING
306 CHICAGO
309 Birmingham
314 LOS ANGELES
354 Chicago
365 ATLANTA
365 COLUMBUS
367 ALBANY
374 Boston
387 INDIANAPOLIS
388 NASHVILLE
395 NEW YORK
419 Indianapolis
426 SACRAMENTO
442 Columbus
457 SPRINGFIELD
463 New York
494 Los Angeles
515 BOSTON
Note that this is the raw data, and there appear to be duplicates such as "BOSTON" and "Boston". To remove duplicates caused by capitalization, you could convert all text to uppercase as it is added to the Lucene index.
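If you would rather not rebuild the index, the counts can also be merged after the fact. The following is only a rough sketch of that alternative, not part of the sample project: the class name CaseFoldedCounts and its merge method are made up for illustration, and it works on a plain map of term to count rather than on the Lucene index itself.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not in the sample project): folds terms that differ
// only by capitalization into a single uppercase key and sums their counts.
class CaseFoldedCounts {

    static Map<String, Integer> merge(Map<String, Integer> rawCounts) {
        // TreeMap keeps the merged keys in alphabetical order
        Map<String, Integer> merged = new TreeMap<String, Integer>();
        for (Map.Entry<String, Integer> entry : rawCounts.entrySet()) {
            // Locale.ROOT avoids locale-specific uppercasing surprises
            String key = entry.getKey().toUpperCase(Locale.ROOT);
            merged.merge(key, entry.getValue(), Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // A few raw counts copied from the city output above
        Map<String, Integer> raw = new LinkedHashMap<String, Integer>();
        raw.put("Boston", 374);
        raw.put("BOSTON", 515);
        raw.put("Chicago", 354);
        raw.put("CHICAGO", 306);
        raw.put("SACRAMENTO", 426);

        for (Map.Entry<String, Integer> entry : merge(raw).entrySet()) {
            System.out.println(entry.getValue() + " " + entry.getKey());
        }
    }
}
```

With these five sample counts, the two Boston entries combine to 889 and the two Chicago entries to 660. Note that this only fixes capitalization; misspellings or different cities that share a name would still need separate handling.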