941,498 Members | Top Members by Rank

Ad:
  • Python Discussion Thread
  • Unsolved
  • Views: 27451
  • Python RSS
Aug 30th, 2005
1

K-means clustering

Expand Post »
Here's something of mine that might actually be useful: a Python implementation of the K-means clustering algorithm. I wrote something similar last year in Java for a school project, and decided to rewrite it in Python this summer for practice.

The purpose of the algorithm is to discover internal structure in some set of data points - you supply the points and the number of clusters you expect to get, and the algorithm returns the same points, organized into clusters by proximity. Once you have the clusters, you can get their sample means, their variances, do a bunch of statistics, etc. This approach has become very popular among the bioinformatics crowd, and especially among analysts of gene expression (microarray) data.

The central idea behind K-means is the manipulation of things called "centroids." A centroid is an imaginary point specific to a cluster of points. It is an average point - that is, if you took all the points in the cluster, and averaged their coordinates, you'd have the centroid.

K-means starts by creating singleton clusters around k randomly sampled points from your input list. Then, it assigns each point in that list to the cluster with the closest centroid. This shift in the contents of the cluster causes a shift in the position of the centroid. You keep re-assigning points and shifting centroids again and again, until the largest centroid shift distance is smaller than the input cutoff.

But that's the abridged version - see if you can figure out what it's doing.
Python Syntax (Toggle Plain Text)
  1. # clustering.py contains classes and functions that cluster data points
  2. import sys, math, random
  3. # -- The Point class represents points in n-dimensional space
  4. class Point:
  5. # Instance variables
  6. # self.coords is a list of coordinates for this Point
  7. # self.n is the number of dimensions this Point lives in (ie, its space)
  8. # self.reference is an object bound to this Point
  9. # Initialize new Points
  10. def __init__(self, coords, reference=None):
  11. self.coords = coords
  12. self.n = len(coords)
  13. self.reference = reference
  14. # Return a string representation of this Point
  15. def __repr__(self):
  16. return str(self.coords)
  17. # -- The Cluster class represents clusters of points in n-dimensional space
  18. class Cluster:
  19. # Instance variables
  20. # self.points is a list of Points associated with this Cluster
  21. # self.n is the number of dimensions this Cluster's Points live in
  22. # self.centroid is the sample mean Point of this Cluster
  23. def __init__(self, points):
  24. # We forbid empty Clusters (they don't make mathematical sense!)
  25. if len(points) == 0: raise Exception("ILLEGAL: EMPTY CLUSTER")
  26. self.points = points
  27. self.n = points[0].n
  28. # We also forbid Clusters containing Points in different spaces
  29. # Ie, no Clusters with 2D Points and 3D Points
  30. for p in points:
  31. if p.n != self.n: raise Exception("ILLEGAL: MULTISPACE CLUSTER")
  32. # Figure out what the centroid of this Cluster should be
  33. self.centroid = self.calculateCentroid()
  34. # Return a string representation of this Cluster
  35. def __repr__(self):
  36. return str(self.points)
  37. # Update function for the K-means algorithm
  38. # Assigns a new list of Points to this Cluster, returns centroid difference
  39. def update(self, points):
  40. old_centroid = self.centroid
  41. self.points = points
  42. self.centroid = self.calculateCentroid()
  43. return getDistance(old_centroid, self.centroid)
  44. # Calculates the centroid Point - the centroid is the sample mean Point
  45. # (in plain English, the average of all the Points in the Cluster)
  46. def calculateCentroid(self):
  47. centroid_coords = []
  48. # For each coordinate:
  49. for i in range(self.n):
  50. # Take the average across all Points
  51. centroid_coords.append(0.0)
  52. for p in self.points:
  53. centroid_coords[i] = centroid_coords[i]+p.coords[i]
  54. centroid_coords[i] = centroid_coords[i]/len(self.points)
  55. # Return a Point object using the average coordinates
  56. return Point(centroid_coords)
  57. # -- Return Clusters of Points formed by K-means clustering
  58. def kmeans(points, k, cutoff):
  59. # Randomly sample k Points from the points list, build Clusters around them
  60. initial = random.sample(points, k)
  61. clusters = []
  62. for p in initial: clusters.append(Cluster([p]))
  63. # Enter the program loop
  64. while True:
  65. # Make a list for each Cluster
  66. lists = []
  67. for c in clusters: lists.append([])
  68. # For each Point:
  69. for p in points:
  70. # Figure out which Cluster's centroid is the nearest
  71. smallest_distance = getDistance(p, clusters[0].centroid)
  72. index = 0
  73. for i in range(len(clusters[1:])):
  74. distance = getDistance(p, clusters[i+1].centroid)
  75. if distance < smallest_distance:
  76. smallest_distance = distance
  77. index = i+1
  78. # Add this Point to that Cluster's corresponding list
  79. lists[index].append(p)
  80. # Update each Cluster with the corresponding list
  81. # Record the biggest centroid shift for any Cluster
  82. biggest_shift = 0.0
  83. for i in range(len(clusters)):
  84. shift = clusters[i].update(lists[i])
  85. biggest_shift = max(biggest_shift, shift)
  86. # If the biggest centroid shift is less than the cutoff, stop
  87. if biggest_shift < cutoff: break
  88. # Return the list of Clusters
  89. return clusters
  90. # -- Get the Euclidean distance between two Points
  91. def getDistance(a, b):
  92. # Forbid measurements between Points in different spaces
  93. if a.n != b.n: raise Exception("ILLEGAL: NON-COMPARABLE POINTS")
  94. # Euclidean distance between a and b is sqrt(sum((a[i]-b[i])^2) for all i)
  95. ret = 0.0
  96. for i in range(a.n):
  97. ret = ret+pow((a.coords[i]-b.coords[i]), 2)
  98. return math.sqrt(ret)
  99. # -- Create a random Point in n-dimensional space
  100. def makeRandomPoint(n, lower, upper):
  101. coords = []
  102. for i in range(n): coords.append(random.uniform(lower, upper))
  103. return Point(coords)
  104. # -- Main function
  105. def main(args):
  106. num_points, n, k, cutoff, lower, upper = 10, 2, 3, 0.5, -200, 200
  107. # Create num_points random Points in n-dimensional space
  108. points = []
  109. for i in range(num_points): points.append(makeRandomPoint(n, lower, upper))
  110. # Cluster the points using the K-means algorithm
  111. clusters = kmeans(points, k, cutoff)
  112. # Print the results
  113. print "\nPOINTS:"
  114. for p in points: print "P:", p
  115. print "\nCLUSTERS:"
  116. for c in clusters: print "C:", c
  117. # -- The following code executes upon command-line invocation
  118. if __name__ == "__main__": main(sys.argv)
Similar Threads
Reputation Points: 41
Solved Threads: 31
Junior Poster
G-Do is offline Offline
146 posts
since Jun 2005
Jan 27th, 2010
0
Re: K-means clustering
Nice man. A barebone k-means imple. Learn a lot from you.

Some suggestion:

Make a second iteration, so that we can try different k values and take the has the smallest within-cluster sum of squares.

Or, output the within-cluster sum of squares in the end. So the user may run your code several times or put it to parallel computering with different k, then just reads your output and chooses the best.
Reputation Points: 10
Solved Threads: 2
Newbie Poster
dgg32 is offline Offline
10 posts
since Apr 2008
Oct 10th, 2010
0
Re: K-means clustering
Very good program but I think your code may have a bug, that is one of the clusters may be empty in the end, and there will be an exception in line 54, the exception will be like: "ZeroDivisionError: float division" because self.points is "[]".

I created an image to illustrate the problem:http://systemsbiozju.org/people/zzm/k-means-weak.png

But I do not know how to avoid it, maybe the number of clusters of k-means algorithm can be changed, I mean we may delete empty clusters in this case, how do you think ?
Reputation Points: 10
Solved Threads: 0
Newbie Poster
nicegiving is offline Offline
1 posts
since Oct 2010
Oct 11th, 2010
0
Re: K-means clustering
You might want to post this in a more permanent place as well like pypi.python.org or Activestate. as they are also places that people look for code.
Last edited by woooee; Oct 11th, 2010 at 5:36 pm.
Reputation Points: 738
Solved Threads: 689
Nearly a Posting Maven
woooee is offline Offline
2,286 posts
since Dec 2006

This thread is more than three months old

No one has posted to this discussion for at least three months. Please let old threads die and do not reply to them unless you feel you have something new and valuable to contribute that absolutely must be added to make the discussion complete. Otherwise, please start a new thread in this forum instead.
Message:
Previous Thread in Python Forum Timeline: Multiprocess help!
Next Thread in Python Forum Timeline: wx.StaticText & wx.TextCtrl





About Us | Contact Us | Advertise | Acceptable Use Policy
Forum Index | Build Custom RSS Feed


Follow us on Twitter


© 2011 DaniWeb® LLC