Fun with Google Alerts

Posted by pat on March 16, 2012

Before I pick on Google I should say what an amazing job they do in many areas, but not everything they do is amazing in a good way. If you stick with only keyword analysis it shows its limitations in funny ways. I use Google Alerts to follow topics of long term interest and one of them is “Machine Learning”. You can image the smiles when I found these in my inbox

I didn’t mind finding Emma in my alerts but stopped collecting them when I got the foot-in-mouth guy. Aside from a little comedy relief there isn’t much purpose to about 20% of the alerts. The problem is fairly obvious, the words “machine” and “learning” aren’t enough to give good results.

There is a lot of information being thrown out by GA here. If you look at the full text of the alert pages you would be able to extract a semantic signature (term vector, concept map, etc) describing Emma that is quite distinct from one describing text clustering or dirichlet analysis even though both would contain the words “machine” and “learning”. Given a thumbs up/down or Facebook “like” you could, over time, train a classifier to keep real machine learning articles and filter out Emma and foot-in-mouth man, something like the way spam is filtered.

You could skip a classifier and sort results to return based on similarity to the centroid of the “Liked” docs. Or get fancier and assume several sub-topics by clustering the liked docs then return results similar to the cluster centroids. It would be fun to create the app by sucking up all the alerts and doing the analysis—maybe some day.

I’d miss the smiles though.

Scalable Algorithms for Big Data

Posted by pat on February 28, 2012

Big Data is the buzzword in vogue now and far be it from me to be out of vogue so let’s talk Big Data. When I was at Trailfire back in 2005 we had a web app built on PHP and MySQL. It performed pretty well in a master-slave cluster. However once we added some analytics tools that tracked how people used the site we brought MySQL to its knees in a few weeks. Moral: Almost any data is Big Data. I’d have to be convinced in tackling a new project that the data wasn’t going to immediately scale beyond old style tools. Be ready to look beyond relational DBs and non-parallel algorithms.

In our FinderBots implementation we’ve found MongoDB to be fast and scalable. Some problems may loom for the future when we try to use it’s automagic resharding feature (rumored to have problems). But for now it lets us work without solving every problem at once.

The next question is what framework to use in building scalable algorithms. Here we’ve chosen Hadoop and Mahout. These give us the promise of highly scalable file storage and mapreduce for parallel job execution. A close inspection of Mahout along with some experimentation yields a good selection of algorithmic tools:

  • Text to TFIDF weighted Vectors: Mahout has several tools to take text, calculate n-grams, remove stop words, and create TFIDF weights. It also comes with filters from Lucene to stem text and perform other scrubbing. Nice and all in hadoop mapreduce form.
  • RowSimilarityJob: This built-in job finds the most similar documents for each row in the corpus matrix. Here the matrix has rows corresponding to docs and columns for each dimension. The implementation of this makes use of the sparse nature of the matrix for optimizations and has gone through two versions. I have not benchmarked it yet but it is supposed to be pretty efficient and in hadoop mr form.
  • Canopy Clustering: This is a first phase cluster approximation that can be used to estimate the number and seed centroids for kmeans. It is very fast and since kmeans needs a k as input it’s a good tool. We plan to gather data, compute the document similarity, use the distance of docs to similar docs to guess at thresholds for canopy clustering, then we run clustering…
  • Fuzzy K-Means Clustering: fkmeans is build into mahout with plugable distance measures. It allows us to create overlapping clusters to reflect multi-topic documents. All in hadoop mr form.
  • CosineDistanceMeasure: If the length of all document vectors are normalized (length of 1) then the dot product of two vectors gives the cosine of the angle between them. The angle is a good measure of how close two docs are and ignores the size of the docs (magnitude of their vectors) This is calculated pretty fast and is a good measure which ignores the size of documents on the theory that a topic is a topic no matter how much you talk about it.

Progress

As of 2/27/2012 FinderBots.com has a hand generated example DB behind it. This is running on Ruby on Rails using MongoDB. We have also integrated Solr for indexing and search. You have to get a special account to see the live data though there are some screenshots on the site. We also have a small 6 core 1Tb hadoop cluster that is being used for the Mahout/Hadoop. We are currently experimenting with the reuters news data and recent dumps of wikipedia. There are still a few weeks of tuning the analysis then we will tie the data to the UI.

Issues

One question we haven’t tackled yet is how to do vector based queries. In the screenshot above we want to allow people to click a tag (actually a term from the doc vector) then take the vector that it belongs to, re-weight the clicked term higher, and use the resulting vector as a query for similar docs. This is like allowing the user to query, “this page is close but I need more of this (clicked) topic”. It’s a way to explore the information space via point and click. Currently the only obvious way to do similarity calculation is offline (see RowSimlarityJob above) and so has to be done in the background. There seems to be some support for an approximation of vector queries in Solr perhaps via boosting—stay tuned.

Data Sources

Posted by pat on December 29, 2011

In 2008 I was at a startup that needed to create a statistical named entity recognition system for email. There are lots of interesting and unique problems in doing this, not the least of which is finding a suitable source of training data. Statistical NERs need to be fed tagged data in order to create a model for recognition and things are only complicated by the fact that email has a different tone (in linguistic terms ‘voice’) than many of the large tagged datasets available. We used the Enron email data, which was made public as part of legal action against them. The task of tagging a significant part of the corpus is a subject for another post (spoiler, it involves Amazon’s Mechanical Turk) but it reminds me of how important and useful good data sources are to machine learning.

Here are some useful data sources (to be updated regularly):

  • Enron: most corporate email for a few years prior to the legal action.
  • Internet Archive: TheWayBackMachine has historical crawl data from early web times. Also a 2 billion page crawl from 2007.
  • Google Book N-grams: word sequences from digitized books.
  • Wiki10+ Dataset: 2009 dataset of tags+url+page data. Sometimes a tag is more descriptive than the contents of a page. For instance searching for things tagged “comedy” is far easier than looking for some combination of words on the page.
  • Wikipedia: Wikipedia on AWS as an EBS volume. This is a blog post which also includes some ideas about using the data. It is only 500GB or so.
  • Wikipedia Links and Page Titles: Useful in everything from pagerank to determining relatedness.
  • Wikipedia Dumps: XML dump files for portions of Wikipedia.
  • Reuters-21578: a snapshot of reuters news stories often used for text analysis.
  • WordNet:  word relationships.
  • Bag Of Words: Collections of documents scrubbed for cluster analysis.

I’ve only listed a few of the best, comment if I’ve missed important examples.

Common Crawl Envy

Posted by pat on December 20, 2011

We haven’t even begun to set the ‘bots loose on Wikipedia and I have data envy. It’s not just the data that is interesting at Common Crawl but their business model. They have a giant crawl of the web that you can access for free—sort of. They are hosting the data on Amazon S3 and as long as you have your analysis code running on EC2 the access is free. If you export some of the data or access from outside Amazon’s Cloud you have to pay data transmission charges.

Here is a post describing how to access and analyze the data using their open source hadoop jobs.

Common Crawl is a non-profit so they don’t directly benefit from the data being accessed but Amazon does. If you rely on the data it would only be economical to also reside in the Amazon Cloud. Hmm, if Amazon doesn’t help fund Common Crawl, maybe they should.

Machine Learning with Mahout

Posted by pat on January 02, 2011

This holiday break I started playing around with Mahout, getting it started and running some of the sample data through it. It’s a new part of the Lucene/Hadoop project in Apache which contains a math lib and code generator which builds mapreduce jobs that run in Hadoop and uses the HDFS to store data. It includes a nice vector and matrix library that provides a flexible set of operations and collection types implemented on the parallel mapreduce architecture of Hadoop. Included are several higher level frameworks of general purpose usefulness like:

  • a clustering engine using k-means clustering
  • a Breiman decision forest engine
  • a document classification engine using Bayes and Naive Bayes
  • a collaborative filtering engine
  • cosine similarity via vector dot products

In a previous job we implemented term vector based clustering and similarity in an innovative application for browsing an information space, which was calculated from a number of web pages.  We built a prototype on Hadoop and Hbase and were in the process of moving the calculations to mapreduce when the funding ran out.  It sure would have been easier with Mahout.

Mahout will be a key component of the architecture for the Applications we are discussing.

Vector Space Model: Clustering with K-Means

Posted by pat on March 30, 2009

Another useful tool used in machine learning is clustering. It is useful when you have no human help to bootstrap the learning process (unsupervized learning). The type of clustering I’ll talk about is used on documents to guess at which should be in the same group or category. There are many different clustering algorithms, some of which I’ll describe later, but they all have in common the attempt to find groupings in a corpus of objects described by vectors. For documents we use the vector space model I described earlier.

k-means clusteringIn the illustration each document is plotted as a square.  For k-means clustering you chose k possible centroids to start with—k = 3 for this illustration. In standard k-means the points are chosen randomly in the vector space defined by the documents.  You find the nearest documents, minimizing their distance from the centroid candidate.  Then after you have k clusters you recalculate the centroid from the data and use it to iterate.  After some number of iterations if you have well-behaved data you will end up with a reasonable guess at k categories of documents.  In the illustration the ending clusters are color coded and the centroids plotted as circles.

One problem with this method of finding groups is that you still have no idea what to call the groups.  A human might see Basketball, Baseball, and Soccer but these category names are difficult to extract from the documents themselves. Also it is difficult to determine ahead of time what k should be. These are all interesting research areas.

The Vector Space Model: Cosine Similarity

Posted by pat on February 20, 2009

Lots of machine learning is based on a few nifty tools. One of the most flexible is the vector space model for representing information. Humans can scan a couple of documents and, even without understanding the subject matter, can tell if the two are similar. We see similar words, we understand synonyms, or recognize the same subject names. Machines can imitate this by breaking documents down into their important terms or phrases, and then comparing the terms of one document with those of another.

First let’s take a document and break it down into features, in the simple case features can be just important words—we’ll throw out trivial word like ‘a’, ‘and’, and ‘the’ that add little to meaning.  Then we count the times a particular word appears in the document.  What we end up with is a term vector or vector of terms and frequencies.

Document Vectors

Even though the vectors will have as many terms as they do important words they can be visualized like regular two dimensional vectors. All the same rules apply in higher dimensionality as do in the simple two dimensional world.  And that means that if you plot two vectors you can tell how close they are by calculating the distance between the two end points.  To make the measurement even easier we only look at the angle between them.

Comparing documents to a standardThis trick is not so different from what the human reader does when comparing a long document and a short one to see if they are on similar subjects.  The human will take into account the fact that one document is an email and the other is a book and still recognize that they are both talking about the same topic.

Here we can tell that d1 is more like q than d2 by noting the angles between the vectors. A shorthand for comparing the angles is to compare their cosines.

Cosine similarity

This idea is quite powerful and can be built up and amplified in many ways to help with machine learning tasks.  To start with we can use it to take a set of documents and rank them by similarity to some example.  This is called measuring their cosine similarity.

The above illustrations are taken from Wikipedia which has a good description of the Vector Space Model.  I repeat some of it here because I want to apply it to some more specific cases in future posts.