When we first looked at the data for this project, we decided it would be important to geocode its 436,106 unique addresses; that is, to associate a latitude and longitude with each address so that fine-grained spatial effects would be easy to explore. This is an interesting challenge: how do you geocode nearly half a million addresses?
We started by looking at the well-known web services provided by Google and Yahoo!. These were unsuitable for two reasons: they impose strict daily limits on the number of requests, and there are cumbersome restrictions on the use of the resulting data. The request limit alone meant that it would take well over a month to geocode all the addresses, and the licensing would then have affected publication of the results! After further investigation we found a very useful open service, the USC WebGIS, provided by the GIS research laboratory at the University of Southern California (Goldberg and Wilson 2008). This service is free for noncommercial use and places no restrictions on the use of the resulting data. There was no daily usage cap when we began using the service, but there is an implicit cap imposed by its speed: we could only geocode about 80,000 addresses per day, so it took us around five days to do them all. The disadvantage of this free service is that the quality of the geocoding is not quite as good (it uses only publicly available address data), but the creators were very helpful and have published an excellent free introduction to the topic (Goldberg 2008).
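To give a flavor of the mechanics, here is a minimal sketch of how such a batch run might look in R. The endpoint URL, query parameter names, and file name are hypothetical stand-ins for illustration, not the actual USC WebGIS API.

    # Sketch of batch geocoding against a hypothetical CSV-in/CSV-out web
    # service; the real USC WebGIS URL and parameters differ.
    library(httr)

    geocode_batch <- function(addresses, api_key) {
      resp <- GET("https://webgis.example.edu/geocode",   # hypothetical URL
                  query = list(address = paste(addresses, collapse = "|"),
                               apikey  = api_key))
      stop_for_status(resp)
      read.csv(text = content(resp, as = "text", encoding = "UTF-8"),
               stringsAsFactors = FALSE)
    }

    # Submit in modest chunks and pause between requests; at roughly
    # 80,000 addresses per day, the full run takes several days.
    all_addresses <- readLines("addresses.txt")
    chunks  <- split(all_addresses, ceiling(seq_along(all_addresses) / 1000))
    results <- lapply(chunks, function(chunk) {
      Sys.sleep(1)                      # crude politeness/rate limiting
      geocode_batch(chunk, api_key = "YOUR_KEY")
    })
    geocoded <- do.call(rbind, results)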
In addition to latitude and longitude, the USC results include a categorical variable indicating the accuracy of each match: exact address, zip code, county, and so on.
It is generally worth spending a significant amount of time at every stage of an analysis to make sure that the data is accurate, and geocoding was no different. Errors in geocoding came from a number of sources: typographical errors in the addresses, new buildings that are not yet listed in public databases, and zip codes that have been reassigned over time. We further suspect that the USC software contained a bug during the period we used it, because large numbers of addresses were falsely assigned to the Los Angeles area and elsewhere around the state; we remapped these addresses using another free online service at http://gpsvisualizer.com. Our debugging process included using R to draw simple maps of latitude versus longitude for each county and most towns, to identify addresses that had been placed far outside the Bay Area.
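The maps themselves can be as simple as one scatterplot per county. A sketch, assuming the geocoded results live in a data frame `geocoded` (as in the sketch above) with `longitude`, `latitude`, and `county` columns; the column names are illustrative:

    # One longitude-versus-latitude scatterplot per county; points assigned
    # to Los Angeles or elsewhere far from the Bay Area stand out at once.
    for (cty in unique(geocoded$county)) {
      pts <- subset(geocoded, county == cty)
      plot(pts$longitude, pts$latitude,
           pch = ".", xlab = "Longitude", ylab = "Latitude", main = cty)
    }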
The addresses in San Jose posed an interesting geocoding challenge. Sales are listed for several "towns" that are not recognized by any mapping sites we could find, so we assume they are informal names for neighborhoods: North, South, East and West San Jose, Berryessa, Cambrian, and a few others.
Where possible we tried to correct errors. When that was not possible, we used R's missing value, NA, for the latitude and longitude to indicate that we do not know the exact location. This is a better approach than throwing out bad matches, because different purposes require different levels of accuracy: when we map the data at the level of county or city, an approximate location is perfectly adequate. Using missing values for latitude and longitude ensures that any location with a suspicious geocoding is dropped from analyses that use latitude and longitude, but included in all others.
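Concretely, the bookkeeping amounts to blanking out the coordinates wherever the match is too coarse to trust. A sketch, assuming the USC accuracy variable is a column named `quality` whose level names are illustrative:

    # Treat zip-code- and county-level matches as too coarse for
    # point-level work: keep the rows, but set their coordinates to NA.
    coarse <- geocoded$quality %in% c("ZipCode", "County")
    geocoded$latitude[coarse]  <- NA
    geocoded$longitude[coarse] <- NA

    # R's default NA handling then does the right thing: plot() and lm()
    # silently skip these rows, while county- or city-level tabulations
    # that never touch latitude/longitude still include them.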