Big Data is the buzzword in vogue now and far be it from me to be out of vogue so let’s talk Big Data. When I was at Trailfire back in 2005 we had a web app built on PHP and MySQL. It performed pretty well in a master-slave cluster. However once we added some analytics tools that tracked how people used the site we brought MySQL to its knees in a few weeks. Moral: Almost any data is Big Data. I’d have to be convinced in tackling a new project that the data wasn’t going to immediately scale beyond old style tools. Be ready to look beyond relational DBs and non-parallel algorithms.
In our FinderBots implementation we’ve found MongoDB to be fast and scalable. Some problems may loom for the future when we try to use it’s automagic resharding feature (rumored to have problems). But for now it lets us work without solving every problem at once.
The next question is what framework to use in building scalable algorithms. Here we’ve chosen Hadoop and Mahout. These give us the promise of highly scalable file storage and mapreduce for parallel job execution. A close inspection of Mahout along with some experimentation yields a good selection of algorithmic tools:
- Text to TFIDF weighted Vectors: Mahout has several tools to take text, calculate n-grams, remove stop words, and create TFIDF weights. It also comes with filters from Lucene to stem text and perform other scrubbing. Nice and all in hadoop mapreduce form.
- RowSimilarityJob: This built-in job finds the most similar documents for each row in the corpus matrix. Here the matrix has rows corresponding to docs and columns for each dimension. The implementation of this makes use of the sparse nature of the matrix for optimizations and has gone through two versions. I have not benchmarked it yet but it is supposed to be pretty efficient and in hadoop mr form.
- Canopy Clustering: This is a first phase cluster approximation that can be used to estimate the number and seed centroids for kmeans. It is very fast and since kmeans needs a k as input it’s a good tool. We plan to gather data, compute the document similarity, use the distance of docs to similar docs to guess at thresholds for canopy clustering, then we run clustering…
- Fuzzy K-Means Clustering: fkmeans is build into mahout with plugable distance measures. It allows us to create overlapping clusters to reflect multi-topic documents. All in hadoop mr form.
- CosineDistanceMeasure: If the length of all document vectors are normalized (length of 1) then the dot product of two vectors gives the cosine of the angle between them. The angle is a good measure of how close two docs are and ignores the size of the docs (magnitude of their vectors) This is calculated pretty fast and is a good measure which ignores the size of documents on the theory that a topic is a topic no matter how much you talk about it.
Progress
As of 2/27/2012 FinderBots.com has a hand generated example DB behind it. This is running on Ruby on Rails using MongoDB. We have also integrated Solr for indexing and search. You have to get a special account to see the live data though there are some screenshots on the site. We also have a small 6 core 1Tb hadoop cluster that is being used for the Mahout/Hadoop. We are currently experimenting with the reuters news data and recent dumps of wikipedia. There are still a few weeks of tuning the analysis then we will tie the data to the UI.

Issues
One question we haven’t tackled yet is how to do vector based queries. In the screenshot above we want to allow people to click a tag (actually a term from the doc vector) then take the vector that it belongs to, re-weight the clicked term higher, and use the resulting vector as a query for similar docs. This is like allowing the user to query, “this page is close but I need more of this (clicked) topic”. It’s a way to explore the information space via point and click. Currently the only obvious way to do similarity calculation is offline (see RowSimlarityJob above) and so has to be done in the background. There seems to be some support for an approximation of vector queries in Solr perhaps via boosting—stay tuned.

This trick is not so different from what the human reader does when comparing a long document and a short one to see if they are on similar subjects. The human will take into account the fact that one document is an email and the other is a book and still recognize that they are both talking about the same topic.