Scalable Algorithms for Big Data

Posted by pat on February 28, 2012

Big Data is the buzzword in vogue now and far be it from me to be out of vogue so let’s talk Big Data. When I was at Trailfire back in 2005 we had a web app built on PHP and MySQL. It performed pretty well in a master-slave cluster. However once we added some analytics tools that tracked how people used the site we brought MySQL to its knees in a few weeks. Moral: Almost any data is Big Data. I’d have to be convinced in tackling a new project that the data wasn’t going to immediately scale beyond old style tools. Be ready to look beyond relational DBs and non-parallel algorithms.

In our FinderBots implementation we’ve found MongoDB to be fast and scalable. Some problems may loom for the future when we try to use it’s automagic resharding feature (rumored to have problems). But for now it lets us work without solving every problem at once.

The next question is what framework to use in building scalable algorithms. Here we’ve chosen Hadoop and Mahout. These give us the promise of highly scalable file storage and mapreduce for parallel job execution. A close inspection of Mahout along with some experimentation yields a good selection of algorithmic tools:

  • Text to TFIDF weighted Vectors: Mahout has several tools to take text, calculate n-grams, remove stop words, and create TFIDF weights. It also comes with filters from Lucene to stem text and perform other scrubbing. Nice and all in hadoop mapreduce form.
  • RowSimilarityJob: This built-in job finds the most similar documents for each row in the corpus matrix. Here the matrix has rows corresponding to docs and columns for each dimension. The implementation of this makes use of the sparse nature of the matrix for optimizations and has gone through two versions. I have not benchmarked it yet but it is supposed to be pretty efficient and in hadoop mr form.
  • Canopy Clustering: This is a first phase cluster approximation that can be used to estimate the number and seed centroids for kmeans. It is very fast and since kmeans needs a k as input it’s a good tool. We plan to gather data, compute the document similarity, use the distance of docs to similar docs to guess at thresholds for canopy clustering, then we run clustering…
  • Fuzzy K-Means Clustering: fkmeans is build into mahout with plugable distance measures. It allows us to create overlapping clusters to reflect multi-topic documents. All in hadoop mr form.
  • CosineDistanceMeasure: If the length of all document vectors are normalized (length of 1) then the dot product of two vectors gives the cosine of the angle between them. The angle is a good measure of how close two docs are and ignores the size of the docs (magnitude of their vectors) This is calculated pretty fast and is a good measure which ignores the size of documents on the theory that a topic is a topic no matter how much you talk about it.

Progress

As of 2/27/2012 FinderBots.com has a hand generated example DB behind it. This is running on Ruby on Rails using MongoDB. We have also integrated Solr for indexing and search. You have to get a special account to see the live data though there are some screenshots on the site. We also have a small 6 core 1Tb hadoop cluster that is being used for the Mahout/Hadoop. We are currently experimenting with the reuters news data and recent dumps of wikipedia. There are still a few weeks of tuning the analysis then we will tie the data to the UI.

Issues

One question we haven’t tackled yet is how to do vector based queries. In the screenshot above we want to allow people to click a tag (actually a term from the doc vector) then take the vector that it belongs to, re-weight the clicked term higher, and use the resulting vector as a query for similar docs. This is like allowing the user to query, “this page is close but I need more of this (clicked) topic”. It’s a way to explore the information space via point and click. Currently the only obvious way to do similarity calculation is offline (see RowSimlarityJob above) and so has to be done in the background. There seems to be some support for an approximation of vector queries in Solr perhaps via boosting—stay tuned.

Data Sources

Posted by pat on December 29, 2011

In 2008 I was at a startup that needed to create a statistical named entity recognition system for email. There are lots of interesting and unique problems in doing this, not the least of which is finding a suitable source of training data. Statistical NERs need to be fed tagged data in order to create a model for recognition and things are only complicated by the fact that email has a different tone (in linguistic terms ‘voice’) than many of the large tagged datasets available. We used the Enron email data, which was made public as part of legal action against them. The task of tagging a significant part of the corpus is a subject for another post (spoiler, it involves Amazon’s Mechanical Turk) but it reminds me of how important and useful good data sources are to machine learning.

Here are some useful data sources (to be updated regularly):

  • Enron: most corporate email for a few years prior to the legal action.
  • Internet Archive: TheWayBackMachine has historical crawl data from early web times. Also a 2 billion page crawl from 2007.
  • Google Book N-grams: word sequences from digitized books.
  • Wiki10+ Dataset: 2009 dataset of tags+url+page data. Sometimes a tag is more descriptive than the contents of a page. For instance searching for things tagged “comedy” is far easier than looking for some combination of words on the page.
  • Wikipedia: Wikipedia on AWS as an EBS volume. This is a blog post which also includes some ideas about using the data. It is only 500GB or so.
  • Wikipedia Links and Page Titles: Useful in everything from pagerank to determining relatedness.
  • Wikipedia Dumps: XML dump files for portions of Wikipedia.
  • Reuters-21578: a snapshot of reuters news stories often used for text analysis.
  • WordNet:  word relationships.
  • Bag Of Words: Collections of documents scrubbed for cluster analysis.

I’ve only listed a few of the best, comment if I’ve missed important examples.

Common Crawl Envy

Posted by pat on December 20, 2011

We haven’t even begun to set the ‘bots loose on Wikipedia and I have data envy. It’s not just the data that is interesting at Common Crawl but their business model. They have a giant crawl of the web that you can access for free—sort of. They are hosting the data on Amazon S3 and as long as you have your analysis code running on EC2 the access is free. If you export some of the data or access from outside Amazon’s Cloud you have to pay data transmission charges.

Here is a post describing how to access and analyze the data using their open source hadoop jobs.

Common Crawl is a non-profit so they don’t directly benefit from the data being accessed but Amazon does. If you rely on the data it would only be economical to also reside in the Amazon Cloud. Hmm, if Amazon doesn’t help fund Common Crawl, maybe they should.