A Short Survey of Clustering Algorithms

Posted by pat on January 15, 2010

K-means is one of the most used clustering algorithms. But you have to know k to run the clustering. Here is an article that describes an algorithm for choosing K. http://arxiv.org/abs/0912.3983

K-Tree clustering is a close aproximation of k-means but is more efficient for large document collections and produces hierarchy. http://arxiv.org/abs/1001.0830 http://arxiv.org/abs/1001.0827

Streaming k-means for non-batch processing of incoming data. Could be applied to clustering of email as it is received or web pages as they are browsed. http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf

K-means++ a better, faster k-means. http://lingpipe-blog.com/2009/03/23/arthur-vassilvitskii-2007-kmeans-the-advantages-of-careful-seeding/

Vector Space Model: Clustering with K-Means

Posted by pat on March 30, 2009

Another useful tool used in machine learning is clustering. It is useful when you have no human help to bootstrap the learning process (unsupervized learning). The type of clustering I’ll talk about is used on documents to guess at which should be in the same group or category. There are many different clustering algorithms, some of which I’ll describe later, but they all have in common the attempt to find groupings in a corpus of objects described by vectors. For documents we use the vector space model I described earlier.

k-means clusteringIn the illustration each document is plotted as a square.  For k-means clustering you chose k possible centroids to start with—k = 3 for this illustration. In standard k-means the points are chosen randomly in the vector space defined by the documents.  You find the nearest documents, minimizing their distance from the centroid candidate.  Then after you have k clusters you recalculate the centroid from the data and use it to iterate.  After some number of iterations if you have well-behaved data you will end up with a reasonable guess at k categories of documents.  In the illustration the ending clusters are color coded and the centroids plotted as circles.

One problem with this method of finding groups is that you still have no idea what to call the groups.  A human might see Basketball, Baseball, and Soccer but these category names are difficult to extract from the documents themselves. Also it is difficult to determine ahead of time what k should be. These are all interesting research areas.

The Vector Space Model: Cosine Similarity

Posted by pat on February 20, 2009

Lots of machine learning is based on a few nifty tools. One of the most flexible is the vector space model for representing information. Humans can scan a couple of documents and, even without understanding the subject matter, can tell if the two are similar. We see similar words, we understand synonyms, or recognize the same subject names. Machines can imitate this by breaking documents down into their important terms or phrases, and then comparing the terms of one document with those of another.

First let’s take a document and break it down into features, in the simple case features can be just important words—we’ll throw out trivial word like ‘a’, ‘and’, and ‘the’ that add little to meaning.  Then we count the times a particular word appears in the document.  What we end up with is a term vector or vector of terms and frequencies.

Document Vectors

Even though the vectors will have as many terms as they do important words they can be visualized like regular two dimensional vectors. All the same rules apply in higher dimensionality as do in the simple two dimensional world.  And that means that if you plot two vectors you can tell how close they are by calculating the distance between the two end points.  To make the measurement even easier we only look at the angle between them.

Comparing documents to a standardThis trick is not so different from what the human reader does when comparing a long document and a short one to see if they are on similar subjects.  The human will take into account the fact that one document is an email and the other is a book and still recognize that they are both talking about the same topic.

Here we can tell that d1 is more like q than d2 by noting the angles between the vectors. A shorthand for comparing the angles is to compare their cosines.

Cosine similarity

This idea is quite powerful and can be built up and amplified in many ways to help with machine learning tasks.  To start with we can use it to take a set of documents and rank them by similarity to some example.  This is called measuring their cosine similarity.

The above illustrations are taken from Wikipedia which has a good description of the Vector Space Model.  I repeat some of it here because I want to apply it to some more specific cases in future posts.