
How to Gather Geographic Location Data

Posted by chco, Apr 20 2011 05:55 AM

Geographic information is such a wide field that it probably deserves its own guide, but in this excerpt from the O'Reilly publication Data Source Handbook, we're going to focus on the most useful and accessible data sources available. All of these take some kind of geographic location (a place name, an address, or latitude/longitude coordinates) and return additional information about that area.


SimpleGeo

This is a compendium of useful geographic data, with a simple REST interface to access it. You can use the Context API to get additional information about a location and Places to find points of interest nearby. There are no rate limits, but you do have to get an API key and use OAuth to authenticate your calls:

http://api.simplegeo.com/1.0/context/37.778381,-122.389388.json

{
   "query":{
      "latitude":37.778381,
      "longitude":-122.389388
   },
   "timestamp":1291766899.794,
   "weather": {
      "temperature": "65F",
      "conditions": "light haze"
   },
   "demographics": {
      "metro_score": 9
   },
   "features":[
      {
         "handle":"SG_4H2GqJDZrc0ZAjKGR8qM4D_37.778406_-122.389506",
         "license":"http://creativecommons.org/licenses/by-sa/2.0/",
         "attribution":"(c) OpenStreetMap (http://openstreetmap.org/) and contributors CC-BY-SA (http://creativecommons.org/licenses/by-sa/2.0/)",
         "classifiers":[
            {
               "type":"Entertainment",
               "category":"Arena",
               "subcategory":"Stadium"
            }
         ],
         "bounds":[
            -122.39115,
            37.777233,
            -122.387775,
            37.779731
         ],
         "abbr":null,
         "name":"AT&T Park",
         "href":"http://api.simplegeo.com/1.0/features/SG_4H2GqJDZrc0ZAjKGR8qM4D_37.778406_-122.389506.json"
      },
...
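
Once a Context response has been fetched, most of the work is pulling names and categories out of the features list. The following is a minimal sketch, assuming the response has the shape shown above. The helper name feature_summaries and the trimmed SAMPLE string are illustrative and not part of SimpleGeo's API, and the OAuth-signed request itself is left out:

```python
import json

# Trimmed, hand-written sample in the shape of the response above (an assumption).
SAMPLE = """
{"query": {"latitude": 37.778381, "longitude": -122.389388},
 "features": [
   {"name": "AT&T Park",
    "classifiers": [{"type": "Entertainment", "category": "Arena",
                     "subcategory": "Stadium"}],
    "bounds": [-122.39115, 37.777233, -122.387775, 37.779731]}
 ]}
"""

def feature_summaries(response_text):
    """Return (name, category) pairs for each feature in a Context response."""
    data = json.loads(response_text)
    summaries = []
    for feature in data.get("features", []):
        # Fall back to an empty classifier if the list is missing or empty.
        classifiers = feature.get("classifiers") or [{}]
        summaries.append((feature.get("name"), classifiers[0].get("category")))
    return summaries

print(feature_summaries(SAMPLE))
```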


Yahoo!

Yahoo! has been a surprising leader in online geo APIs, with Placefinder for converting addresses or place names into coordinates, GeoPlanet for getting category and neighborhood information about places, and Placemaker for analyzing text documents and extracting words or phrases that represent locations. You’ll need to sign up for an app ID, but after that it’s a simple REST/JSON interface.

You can also download a complete list of the locations that Yahoo! has in its database, holding their names and the WOEID identifier for each. This can be a useful resource for doing offline processing, though it is a bit hobbled by the lack of any coordinates for the locations. A Placefinder geocoding request looks like this:

curl "http://where.yahooapis.com/geocode?\
q=1600+Pennsylvania+Avenue,+Washington,+DC&appid=<App ID>&flags=J"

{"ResultSet":{"version":"1.0","Error":0,"ErrorMessage":"No error","Locale":"us_US",
"Quality":87,"Found":1,"Results":[{
  "quality":85,
  "latitude":"38.898717","longitude":"-77.035974",
  "offsetlat":"38.898590","offsetlon":"-77.035971",
  "radius":500,"name":"","line1":"1600 Pennsylvania Ave NW",
  "line2":"Washington, DC  20006","line3":"",
  "line4":"United States","house":"1600",
  "street":"Pennsylvania Ave NW",
...
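
The same request can be assembled from code instead of curl. This is a rough sketch, assuming the endpoint and parameters from the curl example above and the response shape shown in the output. The function names and the trimmed SAMPLE are hypothetical. Note that the sample returns coordinates as strings, so they are converted to floats here:

```python
import json
from urllib.parse import urlencode

def placefinder_url(address, app_id):
    """Build a geocoding URL with the same parameters as the curl example."""
    params = {"q": address, "appid": app_id, "flags": "J"}
    return "http://where.yahooapis.com/geocode?" + urlencode(params)

def first_coordinates(response_text):
    """Return (lat, lon) as floats for the first result, or None if there is none."""
    results = json.loads(response_text)["ResultSet"].get("Results") or []
    if not results:
        return None
    return float(results[0]["latitude"]), float(results[0]["longitude"])

# Trimmed sample in the shape of the output above (an assumption).
SAMPLE = ('{"ResultSet": {"Error": 0, "Found": 1, "Results": '
          '[{"latitude": "38.898717", "longitude": "-77.035974"}]}}')

print(placefinder_url("1600 Pennsylvania Avenue, Washington, DC", "YOUR_APP_ID"))
print(first_coordinates(SAMPLE))
```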


Google Geocoding API

You can only use this geocoding API if you’re going to display the results on a Google Map, which severely limits its usefulness. There’s also a default limit of 2,500 requests per day, though commercial customers get up to 100,000. It doesn’t require any key or authentication, and it also supports “reverse geocoding,” where you supply a latitude and longitude and get back nearby addresses:

curl "http://maps.googleapis.com/maps/api/geocode/json?\
address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=false"

{ "status": "OK",
  "results": [ {
    "types": [ "street_address" ],
    "formatted_address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
    "address_components": [ {
      "long_name": "1600",
      "short_name": "1600",
      "types": [ "street_number" ]
    }, {
      "long_name": "Amphitheatre Pkwy",
      "short_name": "Amphitheatre Pkwy",
      "types": [ "route" ]
    }, {
...
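
The address_components list is awkward to query directly, so it often helps to index it by type. A small sketch, assuming each result looks like the output above; components_by_type is a hypothetical helper and sample_result is a hand-trimmed copy of that output:

```python
def components_by_type(result):
    """Map each component type (for example "route") to its long_name."""
    lookup = {}
    for component in result.get("address_components", []):
        for kind in component.get("types", []):
            # Keep the first component seen for each type.
            lookup.setdefault(kind, component["long_name"])
    return lookup

# Hand-trimmed copy of the first result shown above.
sample_result = {
    "formatted_address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA",
    "address_components": [
        {"long_name": "1600", "short_name": "1600",
         "types": ["street_number"]},
        {"long_name": "Amphitheatre Pkwy", "short_name": "Amphitheatre Pkwy",
         "types": ["route"]},
    ],
}

parts = components_by_type(sample_result)
print(parts["street_number"], parts["route"])
```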


CityGrid

With listings for eighteen million US businesses, this local search engine offers an API to find companies that are near a particular location. You can pass in a general type of business or a particular name and either latitude/longitude coordinates or a place name. The service offers a REST/JSON interface that requires a sign up, and the terms of service and usage requirements restrict the service to user-facing applications. CityGrid does offer an unusual ad-driven revenue sharing option, though, if you meet the criteria:

curl "http://api.citygridmedia.com/content/places/v2/search/where?\
where=94117&what=bakery&format=json&publisher=<publisher>&api_key=<key>"

{"results":{"query_id":null,
...
"locations":[{"id":904051,"featured":false,"name":"Blue Front Cafe",
  "address":{"street":"1430 Haight St",
 "city":"San Francisco","state":"CA","postal_code":"94117"},
...


Geocoder.us

The Geocoder.us website offers a commercial API for converting US addresses into location coordinates. It costs $50 for 20,000 lookups, but thankfully, Geocoder has also open-sourced the code as a Perl CPAN module. It’s straightforward to install, but the tricky part is populating it with data, since it relies on TIGER/Line data from the US Census. You’ll need to hunt around on the Census website to locate the files you need, and even then they’re a multigigabyte download.

Geodict

An open source library similar to Yahoo!’s Placemaker API, my project takes in a text string and extracts country, city, and state names from it, along with their coordinates. It’s designed to run locally, and it only spots words that are highly likely to represent place names. For example, Yahoo! will flag the “New York” in “New York Times” as a location, whereas Geodict requires a state name to follow it or a location word like “in” or “at” to precede it:

./geodict.py -f json < testinput.txt

[{"found_tokens": [{
  "code": "ES", "matched_string": "Spain",
  "lon": -4.0, "end_index": 4, "lat": 40.0,
  "type": "COUNTRY", "start_index": 0}]},
...
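
The JSON output can be mapped back onto the original text using the start and end indexes. A minimal sketch, assuming the output format shown above. The input text is hypothetical, and treating end_index as inclusive is inferred only from the sample (0 to 4 for the five-letter “Spain”):

```python
import json

text = "Spain is warm"  # hypothetical input text
geodict_output = (
    '[{"found_tokens": [{"code": "ES", "matched_string": "Spain", '
    '"lon": -4.0, "end_index": 4, "lat": 40.0, '
    '"type": "COUNTRY", "start_index": 0}]}]'
)

def located_spans(text, output_json):
    """Return (matched text, type, lat, lon) for every token Geodict found."""
    spans = []
    for match in json.loads(output_json):
        for token in match["found_tokens"]:
            # end_index is assumed to be inclusive, hence the + 1.
            matched = text[token["start_index"]:token["end_index"] + 1]
            spans.append((matched, token["type"], token["lat"], token["lon"]))
    return spans

print(located_spans(text, geodict_output))
```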


GeoNames

GeoNames has a large number of APIs available for all sorts of geographic queries, based on its database of eight million place names. You can use a simple REST interface with no authentication required, or you can download the entire database under a Creative Commons license if you want to run your own analysis and processing on it. There’s some unusual data available, including weather, ocean names, and elevation:

curl "http://ws.geonames.org/findNearestAddressJSON?lat=37.451&lng=-122.18"

{"address":{"postalcode":"94025","adminCode2":"081","adminCode1":"CA",
  "street":"Roble Ave","countryCode":"US","lng":"-122.18032",
  "placename":"Menlo Park","adminName2":"San Mateo",
  "distance":"0.04","streetNumber":"671",
  "mtfcc":"S1400","lat":"37.45127","adminName1":"California"}}


US Census

If you’re interested in American locations, the Census site is a mother lode of freely downloadable information. The only problem is that it can be very hard to find what you’re looking for on the site. A good place to start for large data sets is the Summary File 100-Percent Data download interface.

You can select something like ZIP codes or counties in the National Level section (Figure 1-4).

Figure 1-4. US Census site



Next, select the statistics that you’re interested in (Figure 1-5).

Figure 1-5. US Census site



Then you’ll be able to download that data as a text file. The format is a bit odd, a table with columns separated by “|” (pipe) characters, but with a bit of find-and-replace magic, you can convert them into standard CSV files readable by any spreadsheet.
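
If find-and-replace feels fragile, the conversion can be scripted. This is a sketch using Python's standard csv module; the function name and the two-row sample are made up for illustration and are not real Census output:

```python
import csv
import io

def pipe_to_csv(pipe_file, csv_file):
    """Copy pipe-delimited rows from pipe_file to csv_file as standard CSV."""
    reader = csv.reader(pipe_file, delimiter="|")
    writer = csv.writer(csv_file)
    for row in reader:
        writer.writerow(row)

# Hypothetical sample rows, not real Census data.
source = io.StringIO("NAME|POPULATION\nExample city, XX|1000\n")
target = io.StringIO()
pipe_to_csv(source, target)
print(target.getvalue())
```

Unlike a plain find-and-replace, the csv writer quotes any field that already contains a comma, so a value such as a "city, state" name stays in one column. When working with real files, open both with newline="" as the csv module documentation recommends.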

To ensure privacy, the Census does rearrange its very detailed data in some cases, without invalidating its statistical significance. For example, if there’s only one Iranian doctor in Boulder, Colorado, that person may be included in a neighboring city’s data instead and swapped with an equivalent person’s information, so the overall statistics on income, etc. are unaffected.

There’s also a wealth of shape data available, giving detailed coordinates for the boundaries of US counties, states, cities, and congressional districts.

Zillow Neighborhoods

The only US boundary that the Census doesn’t offer is neighborhoods. Thankfully, the Zillow real estate site has made its neighborhood data available as Creative Commons–licensed downloads (Figure 1-6).

Figure 1-6. Zillow neighborhood data



Natural Earth

Natural Earth offers a very clean, accurate, and detailed public domain collection of country and province boundaries for the entire planet, available for download in bulk (Figure 1-7). You’ll need some geo knowledge to convert them into a usable form, but it’s not too much of a challenge. For example, you could load the shapefiles into PostGIS and then easily run reverse geocode queries to discover which country and state a point lies within.
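
To give a feel for what such a reverse geocode query does, here is a pure-Python stand-in, not PostGIS itself. It uses the standard ray-casting point-in-polygon test on a single boundary ring. The region name and rectangle are invented, and real boundaries (with holes, multiple parts, and many vertices) are better left to a spatial database:

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of (lon, lat) vertices?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

# Hypothetical rectangular region, not real Natural Earth data.
REGIONS = {
    "Exampleland": [(-10.0, 40.0), (0.0, 40.0), (0.0, 50.0), (-10.0, 50.0)],
}

def region_for(lon, lat):
    """Return the name of the first region containing the point, or None."""
    for name, ring in REGIONS.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None

print(region_for(-5.0, 45.0))
```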

Figure 1-7. Natural Earth data



US National Weather Service

There are other weather APIs available through Yahoo! and Weather Underground, but the NWS is the only organization to offer one without significant restrictions on commercial and mobile usage. It only covers the United States, unfortunately. The NWS offers a REST/XML interface, and it doesn’t require any authentication or registration, though it does ask that you cache results for any point for an hour, since that’s the update frequency of its data.

You can access either current conditions or forecasts for up to a week ahead, and you can search by city, zip code, or latitude/longitude coordinates. If you’re interested in bulk sets of longer-term historical data on weather, the University of Nebraska has a great guide available. Some of the information stretches back thousands of years, thanks to tree rings. A request for maximum and minimum temperatures at a single point looks like this:

curl "http://www.weather.gov/forecasts/xml/sample_products/browser_interface/\
ndfdXMLclient.php?lat=38.99&lon=-77.01&product=time-series&\
begin=2004-01-01T00:00:00&end=2013-04-20T00:00:00&maxt=maxt&mint=mint"

<?xml version="1.0"?>
<dwml version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation=
 "http://www.nws.noaa.gov/forecasts/xml/DWMLgen/schema/DWML.xsd">
  <head>
...
  </head>
  <data>
    <location>
      <location-key>point1</location-key>
      <point latitude="38.99" longitude="-77.01"/>
    </location>
...
    <parameters applicable-location="point1">
      <temperature type="maximum" units="Fahrenheit" time-layout="k-p24h-n8-1">
        <name>Daily Maximum Temperature</name>
        <value>38</value>
        <value>33</value>
        <value>41</value>
        <value>41</value>
        <value>35</value>
        <value>32</value>
        <value>30</value>
        <value>35</value>
      </temperature>
      <temperature type="minimum" units="Fahrenheit" time-layout="k-p24h-n7-2">
        <name>Daily Minimum Temperature</name>
        <value>22</value>
        <value>28</value>
        <value>34</value>
        <value>22</value>
        <value>24</value>
        <value>17</value>
        <value>20</value>
      </temperature>
    </parameters>
  </data>
</dwml>

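The temperature series can be pulled out of the XML with the standard library. A minimal sketch, assuming the DWML layout shown above; SAMPLE is a shortened hand-made copy of that response, and the helper name is hypothetical. Empty value elements are skipped, on the assumption that a real response may contain some:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response above (an assumption about its general shape).
SAMPLE = """<?xml version="1.0"?>
<dwml version="1.0">
  <data>
    <parameters applicable-location="point1">
      <temperature type="maximum" units="Fahrenheit" time-layout="k-p24h-n8-1">
        <name>Daily Maximum Temperature</name>
        <value>38</value>
        <value>33</value>
      </temperature>
      <temperature type="minimum" units="Fahrenheit" time-layout="k-p24h-n7-2">
        <name>Daily Minimum Temperature</name>
        <value>22</value>
        <value>28</value>
      </temperature>
    </parameters>
  </data>
</dwml>"""

def temperatures(dwml_text):
    """Return {"maximum": [...], "minimum": [...]} from a DWML document."""
    root = ET.fromstring(dwml_text)
    series = {}
    for element in root.iter("temperature"):
        # Skip <value> elements with no text rather than failing on int(None).
        values = [int(v.text) for v in element.findall("value") if v.text]
        series[element.get("type")] = values
    return series

print(temperatures(SAMPLE))
```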

OpenStreetMap

The volunteers at OpenStreetMap have created a somewhat chaotic but comprehensive set of geographic information, and you can download everything they’ve gathered as a single massive file. One unique strength is the coverage of areas in the developing world that are absent from commercial databases, and because the map is so easy to edit, even US locations are often more up to date than on more traditional maps. The downside of the system is that it’s designed for navigation, not analysis, so a lot of information about administrative boundaries and other nonphysical attributes is missing. The Nominatim project attempts to organize the data into a form you can use to look up street addresses, but the lack of good coverage of things like postal codes limits its usefulness. Reconstructing some structure from the soup of roads and points is also computationally very taxing, easily taking a couple of weeks of computation time on a high-end machine.

If you’re only working in a more limited geographic region, you may want to look at the Cloudmade extracts, which contain subsets for different areas and attributes.

MaxMind

This is one of the simplest but most useful data sets for geographic applications. It’s a CSV file containing information on 2.7 million cities and towns around the world. It has the latitude and longitude coordinates, region, country, and alternate names for all of them, and the population for many, thanks to Stefan Helder’s data. You can just load this file into your favorite database, index by the key you want to query on, and you’ve got a perfectly workable local service for working with addresses and other locations:

...
us,new woodstock,New Woodstock,NY,,42.8483333,-75.8547222
us,new york,New York,NY,8107916,40.7141667,-74.0063889
us,new york mills,New York Mills,NY,,43.1052778,-75.2916667
us,newark,Newark,NY,9365,43.0466667,-77.0955556
...
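
As a sketch of the "load it into a database and index it" step, the rows above can go into SQLite using Python's standard library. The table and column names are assumptions inferred from the sample rows (the real file may name them differently and may include a header row, which this sketch does not handle):

```python
import csv
import io
import sqlite3

# The sample rows shown above.
ROWS = """us,new woodstock,New Woodstock,NY,,42.8483333,-75.8547222
us,new york,New York,NY,8107916,40.7141667,-74.0063889
us,new york mills,New York Mills,NY,,43.1052778,-75.2916667
us,newark,Newark,NY,9365,43.0466667,-77.0955556
"""

def load_cities(connection, lines):
    """Create a cities table, fill it from CSV lines, and index it by name."""
    connection.execute(
        "CREATE TABLE cities (country TEXT, city TEXT, display_name TEXT,"
        " region TEXT, population INTEGER, lat REAL, lon REAL)")
    rows = []
    for country, city, display, region, population, lat, lon in csv.reader(lines):
        rows.append((country, city, display, region,
                     int(population) if population else None,  # often blank
                     float(lat), float(lon)))
    connection.executemany("INSERT INTO cities VALUES (?,?,?,?,?,?,?)", rows)
    connection.execute("CREATE INDEX cities_by_name ON cities (country, city)")

connection = sqlite3.connect(":memory:")
load_cities(connection, io.StringIO(ROWS))
match = connection.execute(
    "SELECT display_name, population, lat, lon FROM cities"
    " WHERE country = ? AND city = ?", ("us", "new york")).fetchone()
print(match)
```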


Data Source Handbook

Learn more about this topic from Data Source Handbook.

If you're a developer looking to supplement your own data tools and services, this concise ebook covers the most useful sources of public data available today. You’ll find useful information on APIs that offer broad coverage, tie their data to the outside world, and are either accessible online or feature downloadable bulk data. You'll also find code and helpful links.





2 Replies

Posted by grewing, Apr 26 2011 01:35 PM

chco,

this is one of the reasons I love oreilly - I came looking for a simple answer to a question and I get a boat load of things to go after. great stuff, thanks a lot.

Posted by danielmac, Jan 19 2014 02:10 PM

I will tell about the sale of 12 school properties to my friend, he asked me to find him properties for sale in our county because he wants to build a farm. I managed to find Kansas City commercial real estate with huntmidwest.com and a couple of other locations, now it`s up to him to choose the location.