Scalable Algorithms for Big Data

Posted by pat on February 28, 2012

Big Data is the buzzword in vogue now and far be it from me to be out of vogue so let’s talk Big Data. When I was at Trailfire back in 2005 we had a web app built on PHP and MySQL. It performed pretty well in a master-slave cluster. However once we added some analytics tools that tracked how people used the site we brought MySQL to its knees in a few weeks. Moral: Almost any data is Big Data. I’d have to be convinced in tackling a new project that the data wasn’t going to immediately scale beyond old style tools. Be ready to look beyond relational DBs and non-parallel algorithms.

In our FinderBots implementation we’ve found MongoDB to be fast and scalable. Some problems may loom for the future when we try to use it’s automagic resharding feature (rumored to have problems). But for now it lets us work without solving every problem at once.

The next question is what framework to use in building scalable algorithms. Here we’ve chosen Hadoop and Mahout. These give us the promise of highly scalable file storage and mapreduce for parallel job execution. A close inspection of Mahout along with some experimentation yields a good selection of algorithmic tools:

  • Text to TFIDF weighted Vectors: Mahout has several tools to take text, calculate n-grams, remove stop words, and create TFIDF weights. It also comes with filters from Lucene to stem text and perform other scrubbing. Nice and all in hadoop mapreduce form.
  • RowSimilarityJob: This built-in job finds the most similar documents for each row in the corpus matrix. Here the matrix has rows corresponding to docs and columns for each dimension. The implementation of this makes use of the sparse nature of the matrix for optimizations and has gone through two versions. I have not benchmarked it yet but it is supposed to be pretty efficient and in hadoop mr form.
  • Canopy Clustering: This is a first phase cluster approximation that can be used to estimate the number and seed centroids for kmeans. It is very fast and since kmeans needs a k as input it’s a good tool. We plan to gather data, compute the document similarity, use the distance of docs to similar docs to guess at thresholds for canopy clustering, then we run clustering…
  • Fuzzy K-Means Clustering: fkmeans is build into mahout with plugable distance measures. It allows us to create overlapping clusters to reflect multi-topic documents. All in hadoop mr form.
  • CosineDistanceMeasure: If the length of all document vectors are normalized (length of 1) then the dot product of two vectors gives the cosine of the angle between them. The angle is a good measure of how close two docs are and ignores the size of the docs (magnitude of their vectors) This is calculated pretty fast and is a good measure which ignores the size of documents on the theory that a topic is a topic no matter how much you talk about it.

Progress

As of 2/27/2012 FinderBots.com has a hand generated example DB behind it. This is running on Ruby on Rails using MongoDB. We have also integrated Solr for indexing and search. You have to get a special account to see the live data though there are some screenshots on the site. We also have a small 6 core 1Tb hadoop cluster that is being used for the Mahout/Hadoop. We are currently experimenting with the reuters news data and recent dumps of wikipedia. There are still a few weeks of tuning the analysis then we will tie the data to the UI.

Issues

One question we haven’t tackled yet is how to do vector based queries. In the screenshot above we want to allow people to click a tag (actually a term from the doc vector) then take the vector that it belongs to, re-weight the clicked term higher, and use the resulting vector as a query for similar docs. This is like allowing the user to query, “this page is close but I need more of this (clicked) topic”. It’s a way to explore the information space via point and click. Currently the only obvious way to do similarity calculation is offline (see RowSimlarityJob above) and so has to be done in the background. There seems to be some support for an approximation of vector queries in Solr perhaps via boosting—stay tuned.

A Short Survey of Clustering Algorithms

Posted by pat on January 15, 2010

K-means is one of the most used clustering algorithms. But you have to know k to run the clustering. Here is an article that describes an algorithm for choosing K. http://arxiv.org/abs/0912.3983

K-Tree clustering is a close aproximation of k-means but is more efficient for large document collections and produces hierarchy. http://arxiv.org/abs/1001.0830 http://arxiv.org/abs/1001.0827

Streaming k-means for non-batch processing of incoming data. Could be applied to clustering of email as it is received or web pages as they are browsed. http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf

K-means++ a better, faster k-means. http://lingpipe-blog.com/2009/03/23/arthur-vassilvitskii-2007-kmeans-the-advantages-of-careful-seeding/

The Vector Space Model: Cosine Similarity

Posted by pat on February 20, 2009

Lots of machine learning is based on a few nifty tools. One of the most flexible is the vector space model for representing information. Humans can scan a couple of documents and, even without understanding the subject matter, can tell if the two are similar. We see similar words, we understand synonyms, or recognize the same subject names. Machines can imitate this by breaking documents down into their important terms or phrases, and then comparing the terms of one document with those of another.

First let’s take a document and break it down into features, in the simple case features can be just important words—we’ll throw out trivial word like ‘a’, ‘and’, and ‘the’ that add little to meaning.  Then we count the times a particular word appears in the document.  What we end up with is a term vector or vector of terms and frequencies.

Document Vectors

Even though the vectors will have as many terms as they do important words they can be visualized like regular two dimensional vectors. All the same rules apply in higher dimensionality as do in the simple two dimensional world.  And that means that if you plot two vectors you can tell how close they are by calculating the distance between the two end points.  To make the measurement even easier we only look at the angle between them.

Comparing documents to a standardThis trick is not so different from what the human reader does when comparing a long document and a short one to see if they are on similar subjects.  The human will take into account the fact that one document is an email and the other is a book and still recognize that they are both talking about the same topic.

Here we can tell that d1 is more like q than d2 by noting the angles between the vectors. A shorthand for comparing the angles is to compare their cosines.

Cosine similarity

This idea is quite powerful and can be built up and amplified in many ways to help with machine learning tasks.  To start with we can use it to take a set of documents and rank them by similarity to some example.  This is called measuring their cosine similarity.

The above illustrations are taken from Wikipedia which has a good description of the Vector Space Model.  I repeat some of it here because I want to apply it to some more specific cases in future posts.