
How to Find Similar Users with Python

Posted by adfm on Feb 08 2010, 06:00 PM

Collaborative filtering techniques power the product, service, and media recommendations on most of the commercial websites you visit. When you buy something at Amazon.com, it flexes its considerable collective intelligence to point out that people who purchased the product you are considering also bought certain other items, and would you like to check them out as well? More often than not you will at least consider those items, because they appeal to you in some way. In this excerpt from Toby Segaran's Programming Collective Intelligence, you will learn how to use collected user data to find similar users.


This example assumes you are familiar with the Python language. If you'd like to work through the example in this section, create a file called recommendations.py and insert the following code to create the dataset:

# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
  'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
  'The Night Listener': 3.0},
 'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
  'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
  'You, Me and Dupree': 3.5},
 'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
  'Superman Returns': 3.5, 'The Night Listener': 4.0},
 'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
  'The Night Listener': 4.5, 'Superman Returns': 4.0,
  'You, Me and Dupree': 2.5},
 'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
  'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
  'You, Me and Dupree': 2.0},
 'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
  'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
 'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}}
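
You can quickly check that the dictionary is wired up correctly from the interpreter. This is just a sanity check, not part of the book's listing; it assumes you start Python in the same directory as recommendations.py:

>>> from recommendations import critics
>>> critics['Lisa Rose']['Snakes on a Plane']
3.5
>>> critics['Toby']['Superman Returns']
4.0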

 

After collecting data about the things people like, you need a way to determine how similar people are in their tastes. You do this by comparing each person with every other person and calculating a similarity score. There are a few ways to do this, and in this section I'll show you two systems for calculating similarity scores: Euclidean distance and Pearson correlation.

Euclidean Distance Score

One very simple way to calculate a similarity score is to use a Euclidean distance score, which takes the items that people have ranked in common and uses them as axes for a chart. You can then plot the people on the chart and see how close together they are, as shown in Figure 2.1.

Figure 2.1. People in preference space


This figure shows the people charted in preference space. Toby has been plotted at 4.5 on the Snakes axis and at 1.0 on the Dupree axis. The closer two people are in the preference space, the more similar their preferences are. Because the chart is two-dimensional, you can only look at two rankings at a time, but the principle is the same for bigger sets of rankings.

To calculate the distance between Toby and LaSalle in the chart, take the difference along each axis, square the differences, add them together, and take the square root of the sum. In Python, you can use the pow(n,2) function to square a number and the sqrt function to take the square root:

>>> from math import sqrt
>>> sqrt(pow(4.5-4,2)+pow(1-2,2))
1.1180339887498949

This formula calculates the distance, which will be smaller for people who are more similar. However, you need a function that gives higher values for people who are similar. You can do this by adding 1 to the distance (so you never get a division-by-zero error) and inverting it:

>>> 1/(1+sqrt(pow(4.5-4,2)+pow(1-2,2)))
0.47213595499957939

This new function always returns a value between 0 and 1, where a value of 1 means that two people have identical preferences. You can put everything together to create a function for calculating similarity. Add the following code to recommendations.py:

from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs,person1,person2):
  # Get the list of shared_items
  si={}
  for item in prefs[person1]:
    if item in prefs[person2]:
      si[item]=1

  # If they have no ratings in common, return 0
  if len(si)==0: return 0

  # Add up the squares of all the differences
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2)
                      for item in si])

  return 1/(1+sqrt(sum_of_squares))

This function can be called with two names to get a similarity score. In your Python interpreter, run the following:

>>> import recommendations
>>> reload(recommendations)
>>> recommendations.sim_distance(recommendations.critics,
... 'Lisa Rose','Gene Seymour')
0.148148148148

This gives you a similarity score between Lisa Rose and Gene Seymour. Try it with other names to see if you can find people who have more or less in common.
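
If you'd rather not type the names in by hand, a short loop can score every pair at once. This isn't part of the book's code, just a convenience for exploring the dataset:

from itertools import combinations
from recommendations import critics, sim_distance

# Print the distance-based similarity for every pair of critics
for a, b in combinations(critics, 2):
    print a, '/', b, ':', sim_distance(critics, a, b)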

Pearson Correlation Score

A slightly more sophisticated way to determine the similarity between people's interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated than the Euclidean distance score, but it tends to give better results in situations where the data isn't well normalized—for example, if critics' movie rankings are routinely more harsh than average.

To visualize this method, you can plot the ratings of two of the critics on a chart, as shown in Figure 2.2. Superman was rated 3 by Mick LaSalle and 5 by Gene Seymour, so it is placed at (3,5) on the chart.

Figure 2.2. Comparing two movie critics on a scatter plot


You can also see a straight line on the chart. This is called the best-fit line because it comes as close to all the items on the chart as possible. If the two critics had identical ratings for every movie, this line would be diagonal and would touch every item in the chart, giving a perfect correlation score of 1. In the case illustrated, the critics disagree on a few movies, so the correlation score is about 0.4. Figure 2.3 shows an example of a much higher correlation, one of about 0.75.

Figure 2.3. Two critics with a high correlation score


One interesting aspect of using the Pearson score, which you can see in the figure, is that it corrects for grade inflation. In this figure, Jack Matthews tends to give higher scores than Lisa Rose, but the line still fits because they have relatively similar preferences. If one critic is inclined to give higher scores than the other, there can still be perfect correlation if the difference between their scores is consistent. The Euclidean distance score described earlier will say that two critics are dissimilar because one is consistently harsher than the other, even if their tastes are very similar. Depending on your particular application, this behavior may or may not be what you want.

The code for the Pearson correlation score first finds the items rated by both critics. It then calculates the sums and the sum of the squares of the ratings for the two critics, and calculates the sum of the products of their ratings. Finally, it uses these results to calculate the Pearson correlation coefficient, in the last few lines of the code below. Unlike the distance metric, this formula is not very intuitive, but it does tell you how much the variables change together divided by the product of how much they vary individually.

To use this formula, create a new function with the same signature as the sim_distance function in recommendations.py:

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]:
    if item in prefs[p2]: si[item]=1

  # Find the number of elements
  n=len(si)

  # If they have no ratings in common, return 0
  if n==0: return 0

  # Add up all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sum up the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

  # Sum up the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate Pearson score
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0
  r=num/den

  return r

This function will return a value between −1 and 1. A value of 1 means that the two people's ratings fall exactly on a straight line: their preferences agree perfectly, even if one of them consistently rates higher than the other. Unlike with the distance metric, you don't need to change this value to get it to the right scale. Now you can try getting the correlation score for Figure 2.3:

>>> reload(recommendations)
>>> print recommendations.sim_pearson(recommendations.critics,
... 'Lisa Rose','Gene Seymour')
0.396059017191
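
You can also see the grade-inflation correction described above directly. The two critics below are made up for illustration (they are not in the critics dictionary): one always rates exactly one point lower than the other, so Pearson reports a perfect correlation while the Euclidean score penalizes the constant offset. Run this inside recommendations.py, or import sim_pearson and sim_distance first:

# Two invented critics; 'Harsh' always rates 1 point lower than 'Generous'
inflated={'Generous': {'A': 4.0, 'B': 5.0, 'C': 3.0},
          'Harsh':    {'A': 3.0, 'B': 4.0, 'C': 2.0}}

print sim_pearson(inflated,'Generous','Harsh')   # 1.0
print sim_distance(inflated,'Generous','Harsh')  # roughly 0.37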

Which Similarity Metric Should You Use?

I've introduced functions for two different metrics here, but there are actually many more ways to measure similarity between two sets of data. The best one to use will depend on your application, and it is worth trying Pearson, Euclidean distance, or others to see which you think gives better results.

The functions in the rest of this chapter have an optional similarity parameter that points to a function, to make experimenting easier: specify sim_pearson or sim_distance to choose which similarity function to use. There are many other measures, such as the Jaccard coefficient or Manhattan distance, that you can use as your similarity function, as long as they have the same signature and return a float where a higher value means more similar.

You can read about other metrics for comparing items at http://en.wikipedia....ics%29#Examples.
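
As an example of the pattern, here is a Manhattan-distance-based similarity written to the same signature. It is not from the book, just a sketch of what a drop-in replacement looks like:

# Returns a Manhattan-distance-based similarity score for person1 and person2.
# Same (prefs,person1,person2) signature as sim_distance and sim_pearson;
# higher values mean more similar.
def sim_manhattan(prefs,person1,person2):
  # Items rated by both people
  shared=[item for item in prefs[person1] if item in prefs[person2]]
  if len(shared)==0: return 0

  # Sum of absolute rating differences, inverted so 1.0 means identical ratings
  total=sum([abs(prefs[person1][item]-prefs[person2][item]) for item in shared])
  return 1/(1+total)

Once it's defined in recommendations.py, you can pass it anywhere a similarity function is expected, for example to the topMatches function introduced in the next section.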

Ranking the Critics

Now that you have functions for comparing two people, you can create a function that scores everyone against a given person and finds the closest matches. In this case, I'm interested in learning which movie critics have tastes similar to mine so that I know whose advice I should take when deciding on a movie. Add this function to recommendations.py to get an ordered list of people with tastes similar to those of the specified person:

# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs,person,n=5,similarity=sim_pearson):
  scores=[(similarity(prefs,person,other),other)
          for other in prefs if other!=person]

  # Sort the list so the highest scores appear at the top
  scores.sort()
  scores.reverse()
  return scores[0:n]

This function uses a Python list comprehension to compare me to every other user in the dictionary using one of the previously defined similarity functions. Then it returns the first n items of the sorted results.

Calling this function with my own name gives me a list of movie critics and their similarity scores:

>>> reload(recommendations)
>>> recommendations.topMatches(recommendations.critics,'Toby',n=3)
[(0.99124070716192991, 'Lisa Rose'), (0.92447345164190486, 'Mick LaSalle'),
(0.89340514744156474, 'Claudia Puig')]

From this I know that I should be reading reviews by Lisa Rose, as her tastes tend to be similar to mine. If you've seen any of these movies, you can try adding yourself to the dictionary with your preferences and see who your favorite critic should be.
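
For example, if you have seen three of these movies, you could add yourself with ratings like the ones below (the name 'Me' and the scores are placeholders, not from the book) and ask for your own matches:

>>> recommendations.critics['Me']={'Snakes on a Plane': 3.0,
... 'Superman Returns': 4.5, 'You, Me and Dupree': 2.0}
>>> recommendations.topMatches(recommendations.critics,'Me',n=3)

The result is a list of the three critics whose ratings correlate best with yours.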

Programming Collective Intelligence

Learn more about this topic from Programming Collective Intelligence.

This fascinating book demonstrates how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.



1 Reply

Posted Aug 18 2010 08:37 PM
hi -

Well, I tried everything above and it produced the illustrated results.

Then I modified all the ratings in the prefs dictionary to 1, and there is one aspect of the results I don't understand:

1. The similarity between every pair of critics always comes out as 0.