Collaborative filtering techniques power the product, service, and media recommendations on most commercial websites you visit. When you purchase something at Amazon.com, the site flexes its considerable collective intelligence to point out that people who bought the product you are considering also bought certain other items, and to ask whether you'd like to check those out as well. More often than not, you will at least consider them, because they appeal to you in some way. In this excerpt from Toby Segaran's Programming Collective Intelligence, you will learn how to use collected user data to find similar users.

This example assumes you are familiar with the Python language.
If you'd like to work through the example in this section, create a
file called `recommendations.py` and insert the following code to
create the dataset:

```python
# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 3.5},
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0,
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0},
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0,
 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
 'Superman Returns': 4.0}}
```

After collecting data about the things people like, you need a way
to determine how similar people are in their tastes. You do this by
comparing each person with every other person and calculating a
*similarity score*. There are a few ways to do this,
and in this section I'll show you two systems for calculating similarity
scores: *Euclidean distance* and *Pearson
correlation*.

One very simple way to calculate a similarity score is to use a Euclidean distance score, which takes the items that people have ranked in common and uses them as axes for a chart. You can then plot the people on the chart and see how close together they are, as shown in Figure 2.1.

This figure shows the people charted in *preference
space*. Toby has been plotted at 4.5 on the Snakes axis and
at 1.0 on the Dupree axis. The closer two people are in the preference
space, the more similar their preferences are. Because the chart is
two-dimensional, you can only look at two rankings at a time, but the
principle is the same for bigger sets of rankings.

To calculate the distance between Toby and LaSalle in the chart,
take the difference along each axis, square the differences, add them
together, and take the square root of the sum. In Python, you can
square a number with the `pow(n,2)` function and take the square root
with the `sqrt` function:

```python
>>> from math import sqrt
>>> sqrt(pow(4.5-4,2)+pow(1-2,2))
1.1180339887498949
```

This formula calculates the distance, which will be smaller for people who are more similar. However, you need a function that gives higher values for people who are similar. This can be done by adding 1 to the function (so you don't get a division-by-zero error) and inverting it:

```python
>>> 1/(1+sqrt(pow(4.5-4,2)+pow(1-2,2)))
0.47213595499957939
```

This new function always returns a value between 0 and 1, where a value of 1 means that two people have identical preferences. You can put everything together to create a function for calculating similarity. Add the following code to `recommendations.py`:

```python
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]:
    if item in prefs[person2]: si[item]=1

  # if they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                      for item in si])

  return 1/(1+sqrt(sum_of_squares))
```

This function can be called with two names to get a similarity score. In your Python interpreter, run the following:

```python
>>> import recommendations
>>> reload(recommendations)
>>> recommendations.sim_distance(recommendations.critics,
... 'Lisa Rose','Gene Seymour')
0.29429805508554946
```

This gives you a similarity score between Lisa Rose and Gene Seymour. Try it with other names to see if you can find people who have more or less in common.
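If you want to scan every pairing at once, you can loop over all pairs of critics and print their scores. Here's a minimal, self-contained sketch in Python 3 (it inlines `sim_distance` and only a three-critic subset of the dataset so it runs on its own):

```python
from math import sqrt
from itertools import combinations

# A small subset of the critics dataset from the text
critics = {
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'You, Me and Dupree': 2.5,
                  'Superman Returns': 3.5},
    'Gene Seymour': {'Snakes on a Plane': 3.5, 'You, Me and Dupree': 3.5,
                     'Superman Returns': 5.0},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}

def sim_distance(prefs, person1, person2):
    # Items rated by both people
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum(pow(prefs[person1][it] - prefs[person2][it], 2)
                         for it in shared)
    return 1 / (1 + sqrt(sum_of_squares))

# Score every distinct pair of critics
for a, b in combinations(sorted(critics), 2):
    print('%s vs %s: %.4f' % (a, b, sim_distance(critics, a, b)))
```

Using `itertools.combinations` avoids scoring the same pair twice, since the metric is symmetric.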

A slightly more sophisticated way to determine the similarity between people's interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized—for example, if critics' movie rankings are routinely more harsh than average.

To visualize this method, you can plot the ratings of two of the
critics on a chart, as shown in Figure 2.2.
*Superman* was rated 3 by Mick LaSalle and 5 by
Gene Seymour, so it is placed at (3,5) on the chart.

You can also see a straight line on the chart. This is called
the *best-fit line* because it comes as close to
all the items on the chart as possible. If the two critics had
identical ratings for every movie, this line would be diagonal and
would touch every item in the chart, giving a perfect correlation
score of 1. In the case illustrated, the critics disagree on a few
movies, so the correlation score is about 0.4. Figure 2.3 shows an
example of a much higher correlation, one of about 0.75.

One interesting aspect of using the Pearson score, which you can
see in the figure, is that it corrects for *grade
inflation*. In this figure, Jack Matthews tends to give
higher scores than Lisa Rose, but the line still fits because they
have relatively similar preferences. If one critic is inclined to give
higher scores than the other, there can still be perfect correlation
if the difference between their scores is consistent. The Euclidean
distance score described earlier will say that two critics are
dissimilar because one is consistently harsher than the other, even if
their tastes are very similar. Depending on your particular
application, this behavior may or may not be what you want.
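To see the grade-inflation effect concretely, consider two invented critics whose ratings differ by a constant offset of 1 on every movie. The sketch below (Python 3, mirroring the `sim_distance` and `sim_pearson` logic from this section) shows the Euclidean score penalizing the offset while the Pearson score reports a perfect correlation:

```python
from math import sqrt

# Two hypothetical critics: identical tastes, constant offset of 1
prefs = {
    'Harsh Critic':   {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 4.0},
    'Lenient Critic': {'A': 2.0, 'B': 3.0, 'C': 4.0, 'D': 5.0},
}

def sim_distance(prefs, p1, p2):
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0
    ss = sum(pow(prefs[p1][it] - prefs[p2][it], 2) for it in shared)
    return 1 / (1 + sqrt(ss))

def sim_pearson(prefs, p1, p2):
    si = [it for it in prefs[p1] if it in prefs[p2]]
    n = len(si)
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][it] for it in si)
    sum2 = sum(prefs[p2][it] for it in si)
    sum1Sq = sum(pow(prefs[p1][it], 2) for it in si)
    sum2Sq = sum(pow(prefs[p2][it], 2) for it in si)
    pSum = sum(prefs[p1][it] * prefs[p2][it] for it in si)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    return num / den

print(sim_distance(prefs, 'Harsh Critic', 'Lenient Critic'))  # ~0.333
print(sim_pearson(prefs, 'Harsh Critic', 'Lenient Critic'))   # 1.0
```

The Euclidean score drops to about 0.33 purely because of the offset, while Pearson correctly reports a perfect correlation of 1.0.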

The code for the Pearson correlation score first finds the items rated by both critics. It then calculates the sums and the sum of the squares of the ratings for the two critics, and calculates the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient, computed in the last few lines of the code below. Unlike the distance metric, this formula is not very intuitive, but it does tell you how much the variables change together divided by the product of how much they vary individually.

To use this formula, create a new function with the same
signature as the `sim_distance` function in `recommendations.py`:

```python
# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # Find the number of elements
  n=len(si)

  # if they have no ratings in common, return 0
  if n==0: return 0

  # Add up all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sum up the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum up the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate Pearson score
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r
```

This function will return a value between −1 and 1. A value of 1 means that the two people have exactly the same ratings for every item. Unlike with the distance metric, you don't need to change this value to get it to the right scale. Now you can try computing a correlation score between two critics:

```python
>>> reload(recommendations)
>>> print recommendations.sim_pearson(recommendations.critics,
... 'Lisa Rose','Gene Seymour')
0.396059017191
```

I've introduced functions for two different metrics here, but there are actually many more ways to measure similarity between two sets of data. The best one to use will depend on your application, and it is worth trying Pearson, Euclidean distance, or others to see which you think gives better results.

The functions in the rest of this chapter have an optional
*similarity* parameter, which points to a function,
to make it easier to experiment: specify `sim_pearson` or
`sim_distance` to choose which similarity function to use. There are
many other functions, such as the *Jaccard coefficient* or
*Manhattan distance*, that you can use as your similarity function, as
long as they have the same signature and return a float where a higher
value means more similar.
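As one illustration, here are hypothetical `sim_jaccard` and `sim_manhattan` functions (a sketch, not from the book) with the same `(prefs,p1,p2)` signature; both return values in the 0-to-1 range where higher means more similar:

```python
def sim_jaccard(prefs, p1, p2):
    # Ratio of items both people rated to items either person rated;
    # ignores the rating values themselves.
    set1, set2 = set(prefs[p1]), set(prefs[p2])
    union = set1 | set2
    if not union:
        return 0
    return len(set1 & set2) / len(union)

def sim_manhattan(prefs, p1, p2):
    # Inverted sum of absolute rating differences on shared items
    shared = [it for it in prefs[p1] if it in prefs[p2]]
    if not shared:
        return 0
    total = sum(abs(prefs[p1][it] - prefs[p2][it]) for it in shared)
    return 1 / (1 + total)

# Tiny demo with made-up data
demo = {'A': {'X': 1.0, 'Y': 2.0}, 'B': {'Y': 2.0, 'Z': 3.0}}
print(sim_jaccard(demo, 'A', 'B'))    # 1 shared item out of 3 total
print(sim_manhattan(demo, 'A', 'B'))  # the one shared item is rated identically
```

Jaccard looks only at *which* items overlap, so it can be a reasonable choice when you have binary "liked it" data rather than numeric ratings.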

You can read about other metrics for comparing items at http://en.wikipedia....ics%29#Examples.

Now that you have functions for comparing two people, you can
create a function that scores everyone against a given person and
finds the closest matches. In this case, I'm interested in learning
which movie critics have tastes similar to mine so that I know whose
advice I should take when deciding on a movie. Add this function to
`recommendations.py` to get an ordered list of people with similar
tastes to the specified `person`:

```python
# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
  scores=[(similarity(prefs,person,other),other)
          for other in prefs if other!=person]

  # Sort the list so the highest scores appear at the top
  scores.sort( )
  scores.reverse( )
  return scores[0:n]
```

This function uses a Python *list
comprehension* to compare me to every other user in the
dictionary using one of the previously defined distance metrics. Then
it returns the first *n* items of the sorted
results.

Calling this function with my own name gives me a list of movie critics and their similarity scores:

```python
>>> reload(recommendations)
>>> recommendations.topMatches(recommendations.critics,'Toby',n=3)
[(0.99124070716192991, 'Lisa Rose'), (0.92447345164190486, 'Mick LaSalle'),
 (0.89340514744156474, 'Claudia Puig')]
```

From this I know that I should be reading reviews by Lisa Rose, as her tastes tend to be similar to mine. If you've seen any of these movies, you can try adding yourself to the dictionary with your preferences and see who your favorite critic should be.
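A self-contained sketch of that experiment (Python 3): it reproduces `sim_pearson` and `topMatches` with a trimmed three-critic dataset, then adds a hypothetical 'You' entry whose ratings happen to match Toby's from the text:

```python
from math import sqrt

def sim_pearson(prefs, p1, p2):
    # Items rated by both people
    si = [it for it in prefs[p1] if it in prefs[p2]]
    n = len(si)
    if n == 0:
        return 0
    sum1 = sum(prefs[p1][it] for it in si)
    sum2 = sum(prefs[p2][it] for it in si)
    sum1Sq = sum(prefs[p1][it] ** 2 for it in si)
    sum2Sq = sum(prefs[p2][it] ** 2 for it in si)
    pSum = sum(prefs[p1][it] * prefs[p2][it] for it in si)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - sum1 ** 2 / n) * (sum2Sq - sum2 ** 2 / n))
    return num / den if den else 0

def topMatches(prefs, person, n=5, similarity=sim_pearson):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)  # highest scores first
    return scores[:n]

critics = {
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5},
    'Gene Seymour': {'Snakes on a Plane': 3.5, 'Superman Returns': 5.0,
                     'You, Me and Dupree': 3.5},
    'Mick LaSalle': {'Snakes on a Plane': 4.0, 'Superman Returns': 3.0,
                     'You, Me and Dupree': 2.0},
}

# Hypothetical new user (ratings borrowed from Toby's in the text)
critics['You'] = {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0,
                  'You, Me and Dupree': 1.0}

for score, name in topMatches(critics, 'You', n=2):
    print('%-15s %.4f' % (name, score))
```

Swap in your own ratings for the 'You' entry to see which of the three critics you line up with best.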

## Learn more about this topic from **Programming Collective Intelligence**.

This fascinating book demonstrates how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.