
How to Map Your Professional Network with Google Earth

Posted by chco, Feb 26 2011 08:50 AM

This excerpt from Mining the Social Web applies k-means to the problem of clustering your professional contacts and plots them out in Google Earth. To create a useful visualization of this data, you'll need access to extended LinkedIn profile information, and a working knowledge of common clustering algorithms helps.
An interesting exercise in seeing k-means in action is to use it to visualize and cluster your professional LinkedIn network by putting it on a map (or the globe, if you’re a fan of Google Earth). In addition to the insight gained by visualizing how your contacts are spread out, you can analyze clusters by using your contacts, the distinct employers of your contacts, or the distinct metro areas in which your contacts reside as a basis. All three approaches might yield results that are useful for different purposes. Through the LinkedIn API, you can fetch location information that describes the major metropolitan area in which each of your contacts resides, such as “Greater Nashville Area.” With a bit of munging, that string is quite adequate for geocoding the locations back into coordinates that we can plot in a tool like Google Earth.

The primary steps needed to get the ball rolling are:

  • Parsing out the geographic location from each of your contacts’ public profiles. The Python code below demonstrates how to fetch this kind of information.

    Harvesting extended profile information for your LinkedIn contacts (linkedin__get_connections.py)
    # -*- coding: utf-8 -*-
    
    import os
    import sys
    import webbrowser
    import cPickle
    from linkedin import linkedin
    
    KEY = sys.argv[1]
    SECRET = sys.argv[2]
    
    # The page at RETURN_URL parses the oauth_verifier parameter out of
    # window.location.href and displays it for the user
    
    RETURN_URL = 'http://miningthesocialweb.appspot.com/static/linkedin_oauth_helper.html'
    
    
    def oauthDance(key, secret, return_url):
        api = linkedin.LinkedIn(key, secret, return_url)
    
        result = api.requestToken()
    
        if not result:
            print >> sys.stderr, api.requestTokenError()
            return None
    
        authorize_url = api.getAuthorizeURL()
    
        webbrowser.open(authorize_url)
    
        oauth_verifier = raw_input('PIN number, bro: ')
    
        result = api.accessToken(verifier=oauth_verifier)
        if not result:
            print >> sys.stderr, 'Error: %s\nAborting' % api.getRequestTokenError()
            return None
    
        return api
    
    
    # First, do the oauth_dance
    
    api = oauthDance(KEY, SECRET, RETURN_URL)
    
    # Now do something like get your connections:
    
    if api:
        connections = api.GetConnections()
    else:
        print >> sys.stderr, 'Failed to authenticate. You need to learn to dance'
        sys.exit(1)
    
    # Be careful - this type of API usage is "expensive".
    # See http://developer.linkedin.com/docs/DOC-1112
    
    print >> sys.stderr, 'Fetching extended connections...'
    
    extended_connections = [api.GetProfile(member_id=c.id, url=None, fields=[
        'first-name',
        'last-name',
        'current-status',
        'educations',
        'specialties',
        'interests',
        'honors',
        'positions',
        'industry',
        'summary',
        'location',
        ]) for c in connections]
    
    # Store the data
    
    if not os.path.isdir('out'):
        os.mkdir('out')
    
    f = open('out/linkedin_connections.pickle', 'wb')
    cPickle.dump(extended_connections, f)
    f.close()
    
    print >> sys.stderr, 'Data pickled to out/linkedin_connections.pickle'
    


  • Geocoding the locations back into coordinates. The approach we’ll take is to easy_install geopy and let it handle all the heavy lifting. There’s a nice getting-started guide available online; depending on your choice of geocoder, you may need to request an API key from a service provider such as Google or Yahoo!.

  • Feeding the geocoordinates into the KMeansClustering class of the cluster module to calculate clusters.

  • Constructing KML that can be fed into a visualization tool like Google Earth.


Lots of interesting nuances and variations become possible once you have the basic legwork in place (see the Python code below). The linkedin__kml_utility that’s referenced is pretty uninteresting and just does some XML munging; you can view the details on GitHub.
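Since the KML helper itself is not shown here, the following is a minimal sketch of what such a utility might look like. It is not the actual linkedin__kml_utility code: the function name create_kml_sketch is hypothetical, and the only assumption taken from the listing is that each item is a dict with 'label', 'coords' (a "lon,lat" string), and optionally 'contacts'. Unlike the Python 2 listings in this post, it is written to run on Python 3 as well.

```python
import xml.etree.ElementTree as ET

KML_NS = 'http://www.opengis.net/kml/2.2'


def create_kml_sketch(items):
    """Hypothetical stand-in for createKML: one Placemark per item."""
    kml = ET.Element('kml', xmlns=KML_NS)
    doc = ET.SubElement(kml, 'Document')
    for item in items:
        placemark = ET.SubElement(doc, 'Placemark')
        ET.SubElement(placemark, 'name').text = item['label']
        # Centroid items in the listing carry no 'contacts' key
        ET.SubElement(placemark, 'description').text = item.get('contacts', '')
        point = ET.SubElement(placemark, 'Point')
        # KML coordinates are written as "longitude,latitude"
        ET.SubElement(point, 'coordinates').text = item['coords']
    # tostring() returns bytes on Python 3, so decode to text
    return ET.tostring(kml).decode('utf-8')


if __name__ == '__main__':
    print(create_kml_sketch([
        {'label': 'Nashville', 'coords': '-86.78,36.16', 'contacts': 'Jane D.'},
        {'label': 'CENTROID', 'coords': '-90.0,35.0'},
    ]))
```

The real helper may add styling or folders; this sketch only covers the placemark structure that Google Earth needs to plot points.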

Note: You can point Google Maps to an addressable URL pointing to a KML file if you’d prefer not to download and use Google Earth.

Geocoding the locations of your LinkedIn contacts and exporting them to KML (linkedin__geocode.py)


# -*- coding: utf-8 -*-

import os
import sys
import cPickle
from urllib2 import HTTPError
from geopy import geocoders
from cluster import KMeansClustering, centroid

# A very uninteresting helper function to build up an XML tree

from linkedin__kml_utility import createKML

K = int(sys.argv[1])

# Use your own API key here if you use a geocoding service
# such as Google or Yahoo!

GEOCODING_API_KEY = sys.argv[2]

CONNECTIONS_DATA = sys.argv[3]

OUT = "clusters.kmeans.kml"

# Open up your saved connections with extended profile information

extended_connections = cPickle.load(open(CONNECTIONS_DATA, 'rb'))
locations = [ec.location for ec in extended_connections]
g = geocoders.Yahoo(GEOCODING_API_KEY)

# Some basic transforms may be necessary for geocoding services to function properly
# Here are a few examples that seem to cause problems for Yahoo. You'll probably need
# to add your own.

transforms = [('Greater ', ''), (' Area', ''), ('San Francisco Bay',
              'San Francisco')]

# Tally the frequency of each location

coords_freqs = {}
for location in locations:

    # Avoid unnecessary I/O

    if coords_freqs.has_key(location):
        coords_freqs[location][1] += 1
        continue
    transformed_location = location

    for transform in transforms:
        transformed_location = transformed_location.replace(*transform)

    # Geocode once, after all of the transforms have been applied. Retry on
    # transient HTTP errors and give up after three failures.

    num_errors = 0
    while True:
        try:

            # This call returns a generator

            results = g.geocode(transformed_location, exactly_one=False)
            break
        except HTTPError, e:
            num_errors += 1
            if num_errors >= 3:
                sys.exit()
            print >> sys.stderr, e
            print >> sys.stderr, 'Encountered an urllib2 error. Trying again...'

    for result in results:

        # Each result is of the form ("Description", (lat, lon)); keep the first

        coords_freqs[location] = [result[1], 1]
        break

# Here, you could optionally segment locations by continent or country
# so as to avoid potentially finding a mean in the middle of the ocean.
# The k-means algorithm will expect distinct points for each contact, so build out
# an expanded list to pass it

expanded_coords = []
for label in coords_freqs:
    ((lat, lon), f) = coords_freqs[label]
    expanded_coords.append((label, [(lon, lat)] * f))  # Flip lat/lon for Google Earth

# No need to clutter the map with unnecessary placemarks...

kml_items = [{'label': label, 'coords': '%s,%s' % coords[0]} for (label,
             coords) in expanded_coords]

# It could also be interesting to include names of your contacts on the map for display

for item in kml_items:
    item['contacts'] = '\n'.join(['%s %s.' % (ec.first_name, ec.last_name[0])
                                 for ec in extended_connections if ec.location
                                 == item['label']])

cl = KMeansClustering([coords for (label, coords_list) in expanded_coords
                      for coords in coords_list])

centroids = [{'label': 'CENTROID', 'coords': '%s,%s' % centroid(c)} for c in
             cl.getclusters(K)]

kml_items.extend(centroids)
kml = createKML(kml_items)

if not os.path.isdir('out'):
    os.mkdir('out')

f = open("out/" + OUT, 'w')
f.write(kml)
f.close()

print >> sys.stderr, 'KML written to out/' + OUT
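
The retry-on-error pattern used around the geocoding call can also be factored into a small helper. The sketch below is an illustration only: call_with_retries and its parameters are hypothetical names, and it takes any zero-argument callable, so it makes no assumptions about a particular geocoder's API. It runs on Python 2.6+ and Python 3.

```python
import sys


def call_with_retries(func, max_errors=3, exceptions=(Exception,)):
    """Call func() until it succeeds, re-raising after max_errors failures."""
    num_errors = 0  # Initialized once, outside the loop, so the count accumulates
    while True:
        try:
            return func()
        except exceptions as e:
            num_errors += 1
            if num_errors >= max_errors:
                raise
            sys.stderr.write('%s\nTrying again...\n' % e)
```

In the listing, a call such as call_with_retries(lambda: g.geocode(transformed_location, exactly_one=False), exceptions=(HTTPError,)) would take the place of the explicit loop.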


Warning: Location values returned as part of LinkedIn profile information are generally of the form “Greater Nashville Area,” and a certain amount of munging is necessary in order to extract the city name. The approach presented here is imperfect, and you may have to tweak it based upon what you see happening with your data to achieve total accuracy.
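
As a small illustration of the munging described in this warning, the sketch below applies the same three substitutions as the listing to a few sample strings. The function name normalize_location is hypothetical, and the transform list is only a starting point that you would extend for your own data.

```python
# The same substitutions used in the listing; add your own as needed
TRANSFORMS = [('Greater ', ''), (' Area', ''),
              ('San Francisco Bay', 'San Francisco')]


def normalize_location(location, transforms=TRANSFORMS):
    """Apply each (old, new) substitution in order and trim whitespace."""
    for old, new in transforms:
        location = location.replace(old, new)
    return location.strip()


if __name__ == '__main__':
    for raw in ['Greater Nashville Area', 'San Francisco Bay Area']:
        print('%s -> %s' % (raw, normalize_location(raw)))
```

Because the substitutions are plain substring replacements, a place name that legitimately contains "Area" or "Greater" would be altered too, which is one reason the results need checking by eye.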

Most of the work involved in getting to the point where the results can be visualized is data-processing boilerplate. The most interesting details are tucked away inside of KMeansClustering’s getclusters method call, toward the end of the listing. The approach demonstrated groups your contacts by location, clusters them, and then uses the results of the clustering algorithm to compute the centroids. The image below illustrates sample results from running the code above.

From top left to bottom: 1) clustering contacts by location so that you can easily see who lives/works in what city, 2) finding the centroids of three clusters computed by k-means, 3) don’t forget that clusters could span countries or even continents when trying to find an ideal meeting location!

Attached Image
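
To make the clustering step less opaque, here is a minimal k-means sketch in plain Python. It is not the implementation used by the cluster module; the name kmeans_sketch, the iteration cap, and the fixed random seed are assumptions made for illustration. It also treats (lon, lat) pairs as planar coordinates, which is only a rough approximation over large distances.

```python
import random


def kmeans_sketch(points, k, iterations=20, seed=0):
    """Return (centroids, clusters) for a list of (x, y) points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                        + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                n = float(len(cluster))
                new_centroids.append((sum(x for x, _ in cluster) / n,
                                      sum(y for _, y in cluster) / n))
            else:
                new_centroids.append(centroids[i])  # Keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # Converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters
```

Repeating a coordinate once per contact, as the expanded list in the listing does, pulls each mean toward the cities where more of your contacts live.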


Just visualizing your network can be pretty interesting, but computing the geographic centroids of your professional network can also open up some intriguing possibilities. For example, you might want to compute candidate locations for a series of regional workshops or conferences. Alternatively, if you’re in the consulting business and have a hectic travel schedule, you might want to plot out some good locations for renting a little home away from home. Or maybe you want to map out professionals in your network according to their job duties, or the socioeconomic bracket they’re likely to fit in based on their job titles and experience. Beyond the numerous options opened up by visualizing your professional network’s location data, geographic clustering lends itself to many other possibilities, such as supply chain management and Travelling Salesman types of problems.
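
A centroid will rarely land exactly on a city, so for something like picking a workshop venue you might snap each centroid to the nearest location where you actually have contacts. The sketch below does that with the haversine great-circle distance. The function names, the mean Earth radius of 6371 km, and the sample coordinates are illustrative assumptions; coordinates are given as (lon, lat) to match the order used for Google Earth in the listing.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius, an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = [radians(v) for v in (lon1, lat1, lon2, lat2)]
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_location(centroid, labeled_coords):
    """Return the label whose (lon, lat) lies closest to the centroid."""
    return min(labeled_coords,
               key=lambda item: haversine_km(centroid[0], centroid[1],
                                             item[1][0], item[1][1]))[0]


if __name__ == '__main__':
    places = [('Nashville', (-86.78, 36.16)),
              ('San Francisco', (-122.42, 37.77))]
    print(nearest_location((-87.0, 36.0), places))
```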

Mining the Social Web

Learn more about this topic from Mining the Social Web.

Popular social networks such as Facebook and Twitter generate a tremendous amount of valuable data on topics and use patterns. Who’s talking to whom? What are they talking about? How often are they talking? This concise and practical book shows you how to answer these questions and more by harvesting and analyzing data using social web APIs, Python tools, GitHub, HTML5, and JavaScript.
