Jump to content

Introducing K-Means Clustering

0
  adfm's Photo
Posted Feb 11 2010 02:41 PM

This excerpt from Toby Segaran's Programming Collective Intelligence introduces K-Means Clustering, an algorithm that determines the size of clusters based on the structure of the data.


Hierarchical clustering gives a nice tree as a result, but it has a couple of disadvantages. The tree view doesn't really break the data into distinct groups without additional work, and the algorithm is extremely computationally intensive. Because the relationship between every pair of items must be calculated and then recalculated when items are merged, the algorithm will run slowly on very large datasets.

An alternative method of clustering is K-means clustering. This type of algorithm is quite different from hierarchical clustering because it is told in advance how many distinct clusters to generate. The algorithm will determine the size of the clusters based on the structure of the data.

K-means clustering begins with k randomly placed centroids (points in space that represent the center of the cluster), and assigns every item to the nearest one. After the assignment, the centroids are moved to the average location of all the nodes assigned to them, and the assignments are redone. This process repeats until the assignments stop changing. Figure 3.5 shows this process in action for five items and two clusters.

Figure 3.5. K-means clustering with two clusters

Attached Image

In the first frame, the two centroids (shown as dark circles) are placed randomly. Frame 2 shows that each of the items is assigned to the nearest centroid—in this case, A and B are assigned to the top centroid and C, D, and E are assigned to the bottom centroid. In the third frame, each centroid has been moved to the average location of the items that were assigned to it. When the assignments are calculated again, it turns out that C is now closer to the top centroid, while D and E remain closest to the bottom one. Thus, the final result is reached with A, B, and C in one cluster, and D and E in the other.

The function for doing K-means clustering takes the same data rows as input as does the hierarchical clustering algorithm, along with the number of clusters (k) that the caller would like returned. Add this code to clusters.py:

import random



def kcluster(rows,distance=pearson,k=4):

 # Determine the minimum and maximum values for each point

 ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))

 for i in range(len(rows[0]))]



 # Create k randomly placed centroids

 clusters=[[random.random( )*(ranges[i][1]-ranges[i][0])+ranges[i][0]

 for i in range(len(rows[0]))] for j in range(k)]



 lastmatches=None

 for t in range(100):

	print 'Iteration %d' % t

	bestmatches=[[] for i in range(k)]



	# Find which centroid is the closest for each row

	for j in range(len(rows)):

 	row=rows[j]

 	bestmatch=0

 	for i in range(k):

 	d=distance(clusters[i],row)

 	if d<distance(clusters[bestmatch],row): bestmatch=i

 	bestmatches[bestmatch].append(j)



	# If the results are the same as last time, this is complete

	if bestmatches==lastmatches: break

	lastmatches=bestmatches



	# Move the centroids to the average of their members

	for i in range(k):

 	avgs=[0.0]*len(rows[0])

 	if len(bestmatches[i])>0:

 	for rowid in bestmatches[i]:

 	for m in range(len(rows[rowid])):

 	avgs[m]+=rows[rowid][m]

 	for j in range(len(avgs)):

 	avgs[j]/=len(bestmatches[i])

 	clusters[i]=avgs



 return bestmatches

This code randomly creates a set of clusters within the ranges of each of the variables. With every iteration, the rows are each assigned to one of the centroids, and the centroid data is updated to the average of all its assignees. When the assignments are the same as they were the previous time, the process ends and the k lists, each representing a cluster, are returned. The number of iterations it takes to produce the final result is quite small compared to hierarchical clustering.

Because this function uses random centroids to start with, the order of the results returned will almost always be different. It's also possible for the contents of the clusters to be different depending on the initial locations of the centroids.

You can try this function on the blog dataset. It should run quite a bit faster than the hierarchical clustering:

>><strong class="userinput"><code>reload(clusters)</code></strong>

>> <strong class="userinput"><code>kclust=clusters.kcluster(data,k=10)</code></strong>

Iteration 0

...

>> [blognames[r] for r in kclust[0]]

['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman',

'ProBlogger Blog Tips', "Seth's Blog"]

>> [blognames[r] for r in kclust[1]]

etc..

kclust now contains a list of IDs for each cluster. Try the clustering with different values of k and see how it affects the results.

Cover of Programming Collective Intelligence
Learn more about this topic from Programming Collective Intelligence. 

This fascinating book demonstrates how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.

Learn More Read Now on Safari


0 Replies