This excerpt from Toby Segaran's Programming Collective Intelligence introduces K-Means Clustering, an algorithm that determines the size of clusters based on the structure of the data.
Hierarchical clustering gives a nice tree as a result, but it has a couple of disadvantages. The tree view doesn't really break the data into distinct groups without additional work, and the algorithm is extremely computationally intensive. Because the relationship between every pair of items must be calculated and then recalculated when items are merged, the algorithm will run slowly on very large datasets.
An alternative method of clustering is K-means clustering. This type of algorithm is quite different from hierarchical clustering because it is told in advance how many distinct clusters to generate. The algorithm will determine the size of the clusters based on the structure of the data.
K-means clustering begins with k randomly placed centroids (points in space that represent the center of the cluster), and assigns every item to the nearest one. After the assignment, the centroids are moved to the average location of all the nodes assigned to them, and the assignments are redone. This process repeats until the assignments stop changing. Figure 3.5 shows this process in action for five items and two clusters.
In the first frame, the two centroids (shown as dark circles) are placed randomly. Frame 2 shows that each of the items is assigned to the nearest centroid—in this case, A and B are assigned to the top centroid and C, D, and E are assigned to the bottom centroid. In the third frame, each centroid has been moved to the average location of the items that were assigned to it. When the assignments are calculated again, it turns out that C is now closer to the top centroid, while D and E remain closest to the bottom one. Thus, the final result is reached with A, B, and C in one cluster, and D and E in the other.
The function for doing K-means clustering takes the same data rows
as input as does the hierarchical clustering algorithm, along with the
number of clusters (k) that the
caller would like returned. Add this code to clusters.py:
import random def kcluster(rows,distance=pearson,k=4): # Determine the minimum and maximum values for each point ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows])) for i in range(len(rows[0]))] # Create k randomly placed centroids clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)] lastmatches=None for t in range(100): print 'Iteration %d' % t bestmatches=[[] for i in range(k)] # Find which centroid is the closest for each row for j in range(len(rows)): row=rows[j] bestmatch=0 for i in range(k): d=distance(clusters[i],row) if d<distance(clusters[bestmatch],row): bestmatch=i bestmatches[bestmatch].append(j) # If the results are the same as last time, this is complete if bestmatches==lastmatches: break lastmatches=bestmatches # Move the centroids to the average of their members for i in range(k): avgs=[0.0]*len(rows[0]) if len(bestmatches[i])>0: for rowid in bestmatches[i]: for m in range(len(rows[rowid])): avgs[m]+=rows[rowid][m] for j in range(len(avgs)): avgs[j]/=len(bestmatches[i]) clusters[i]=avgs return bestmatches
This code randomly creates a set of clusters within the ranges of
each of the variables. With every iteration, the rows are each assigned
to one of the centroids, and the centroid data is updated to the average
of all its assignees. When the assignments are the same as they were the
previous time, the process ends and the k lists, each representing a cluster, are
returned. The number of iterations it takes to produce the final result
is quite small compared to hierarchical clustering.
Because this function uses random centroids to start with, the order of the results returned will almost always be different. It's also possible for the contents of the clusters to be different depending on the initial locations of the centroids.
You can try this function on the blog dataset. It should run quite a bit faster than the hierarchical clustering:
>><strong class="userinput"><code>reload(clusters)</code></strong> >> <strong class="userinput"><code>kclust=clusters.kcluster(data,k=10)</code></strong> Iteration 0 ... >> [blognames[r] for r in kclust[0]] ['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"] >> [blognames[r] for r in kclust[1]] etc..
kclust now contains a list of
IDs for each cluster. Try the clustering with different values of
k and see how it affects the
results.
This fascinating book demonstrates how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.




Help









