Data Sources

Posted by pat on December 29, 2011

In 2008 I was at a startup that needed to create a statistical named entity recognition system for email. There are lots of interesting and unique problems in doing this, not the least of which is finding a suitable source of training data. Statistical NERs need to be fed tagged data in order to create a model for recognition and things are only complicated by the fact that email has a different tone (in linguistic terms ‘voice’) than many of the large tagged datasets available. We used the Enron email data, which was made public as part of legal action against them. The task of tagging a significant part of the corpus is a subject for another post (spoiler, it involves Amazon’s Mechanical Turk) but it reminds me of how important and useful good data sources are to machine learning.

Here are some useful data sources (to be updated regularly):

  • Enron: most corporate email for a few years prior to the legal action.
  • Internet Archive: TheWayBackMachine has historical crawl data from early web times. Also a 2 billion page crawl from 2007.
  • Google Book N-grams: word sequences from digitized books.
  • Wiki10+ Dataset: 2009 dataset of tags+url+page data. Sometimes a tag is more descriptive than the contents of a page. For instance searching for things tagged “comedy” is far easier than looking for some combination of words on the page.
  • Wikipedia: Wikipedia on AWS as an EBS volume. This is a blog post which also includes some ideas about using the data. It is only 500GB or so.
  • Wikipedia Links and Page Titles: Useful in everything from pagerank to determining relatedness.
  • WordNet:  word relationships.
  • Bag Of Words: Collections of documents scrubbed for cluster analysis.

I’ve only listed a few of the best, comment if I’ve missed important examples.

Common Crawl Envy

Posted by pat on December 20, 2011

We haven’t even begun to set the ‘bots loose on Wikipedia and I have data envy. It’s not just the data that is interesting at Common Crawl but their business model. They have a giant crawl of the web that you can access for free—sort of. They are hosting the data on Amazon S3 and as long as you have your analysis code running on EC2 the access is free. If you export some of the data or access from outside Amazon’s Cloud you have to pay data transmission charges.

Here is a post describing how to access and analyze the data using their open source hadoop jobs.

Common Crawl is a non-profit so they don’t directly benefit from the data being accessed but Amazon does. If you rely on the data it would only be economical to also reside in the Amazon Cloud. Hmm, if Amazon doesn’t help fund Common Crawl, maybe they should.

FinderBots

Posted by pat on December 19, 2011

New site—woohoo! We’ve been working to put up a scaffolding for some of our ideas. Checkout FinderBots.com. It’s a Rails 3.1 based web app running on MongoDB using the mongoid ORM. Behind the scenes we have Mahout and Hadoop running on HDFS. The idea is to write important results into Mongo and display them through the web app.

We are starting with a dump of, wait for it… Wikipedia so limited crawling at first. We’ll cluster the data into hierarchical topics using a modified k-means clustering algorithm from Mahout. Then we’ll pack ‘em into Mongo, and display them at FinderBots.

For the modified k-means we’ll create a hierarchy by clustering everything, then pages in a cluster will be reclustered, rinse repeat… We should get a hierarchical categorization of the data. It won’t be perfect because k-means will force any given page into one and only one cluster. Not optimal but there are ways around this that we can apply later.

For now it’s just nice to have a scaffolding up even if things aren’t wired together yet.

Web Application Architecture

Posted by pat on November 02, 2011

For web apps among the applications we are investigating we’ll pick as turnkey as possible an architecture and try to explain the reasoning.

Components:

  • Web App: We’ll use Ruby on Rails 3, just because it’s fun. I also believe that it will scale pretty well for the needs it faces. We will only ask RoR to present pages and read from a DB of pre-calculated values. No heavy lifting here so fun counts.
  • DB: NoSQL naturally. In past projects we have used HBase and MongoDB. HBase is a clone of BigTable the Google key value store and when we used it several years ago it was pretty green. It does not seem to be the NoSQL that has gotten the most cred or community support (correct me if I’m wrong). MongoDB is the new kid on the block and is also a little green but I like it because it represents the next gen in easy to set up and use NoSQL DBs. The other possibilities are numerous but the only one I’ll look at for now is Cassandra. It seems to be getting a lot of design wins and is pretty mature. I think a review of these will be another post.
  • Number Cruncher Services: Here we’ll use java and Hadoop with it’s kick ass framework Mahout. This will give us all sorts of benefits including: scalability, great high level libraries, a distributed scalable file system, a mapreduce framework, and all the stuff that comes with java. See a quick overview here.
  • Data Gathering and Crawling: Here there are several different data gathering needs. In some cases there will be occasional downloads, as with Wikipedia and other times there will be a need to crawl part of the web. For this later we’ll probably use Nutch from Apache but I’m not 100% on that yet.
  • Tools: IntelliJ Idea and Eclipse for IDE’s. I name both because, in my small experience with Idea it is really nice but lacks some important plugins that may make it worth using Eclipse for some things. For source control Git. I use it for other projects and have a paid Github account already so it’s an easy decision.
  • Resources: Currently a Macbook Pro and two Ubuntu 11.10 servers that are fairly old but all 64 bit with fair amounts of ram. This will not speed things up very much but will at least prove out the parallel architecture. These are in my home and one doubles as a media center. All double as space heaters.

Architecture:

The RoR app will serve up the pages and data from the DB. The services will crawl and process data using mapreduce and stuff the data in the DB. As best practices indicate for web-scale services we will try to pre-calculate everything and where we can’t we’ll create a service that will do things fairly quickly in java. That way the RoR app is pretty simple.

 

Philosophy:

I should have put this first but it is kind of boring so here it is where you can skip it. We’ll try to stick to best practices and design for scalability, redundancy, and reliability but we have a very small team (1 unless you want to join—hint hint) so we’ll feel free to take shortcuts where there is a payoff. Also in the best iterative practices we may knowingly hack something first to see if it’s worth spending more time to do right.

Some Applications

Posted by pat on October 30, 2011

Wouldn’t it be nice if you picked up your tablet in the morning and had a recommended reading list waiting, filled with things you might find interesting—a list tuned to your recent reading? A list that reflected your separate interests. Even better if the list could be generated automatically without making you do anything out of your routine?

Wouldn’t it be nice if the web were organized by subject and you could navigate it by narrowing or opening your focus? How about being able to veer in a new direction by looking at related subjects? Wouldn’t it be great if any page you visited could work as a jumping off point into this web guide? Wouldn’t if be nice if every page visited was a magnet for similar pages so you could see other things written on the same subject? For that matter wouldn’t it be great if every page came with a link to its closest match on wikipedia or other more appropriate authoritative source?

Why aren’t links highlighted in some way to flag ones that are likely to interest you or match recent searches? Then maybe when you get to a page from Google you’d see the link to your answer instead of having to go back and refine your search.

Don’t get me wrong, search is great but can you imagine going to the library and having only a search prompt to do your research? Search fits a lot of our information discovery needs but it shouldn’t be our only tool. In the next few posts I’ll flesh out the applications on the above wishlist.

Features

Posted by pat on October 30, 2011

The bases of ML techniques are algorithms and features. Algorithms like clustering, similarity, categorization, need to consume descriptions of data expressed in features. The best choice of features to measure can make all the difference.

Features:

  • Words: If you think about it words are extremely simple features to use but not always the most precise. You only have to look at how many synonyms a word has to see how fuzzy its meaning is. Does “bad” mean bad or good? Tough to say without more context. There are a lot of words that have almost no meaning except to structure a sentence—words like “the”, “and”, “so”, & etc. We have ways of weeding these words out but the impreciseness of what is left is still an issue.
  • Concepts: These are more precise than words but also much more difficult to extract. Two completely different words can belong to the same concept as with synonyms but you can extend the idea of a concept to include acronyms, slang, and jargon that can’t be found in a dictionary. An Ice Cream Sandwich can be either a cold treat or a particular release of a mobile OS and it’s an interesting thought experiment to decide how to tell the difference using ML.
  • Frequency: Some algorithms use frequency attached to features, others do not. It may be more important to note if a feature exists than it is to count how many exist. For instance a word with a capitol letter is a good candidate for a proper noun even if there are several caps in the word.
  • Proximity: How close is one word to another. This can be a shortcut to finding relationships without any deeper understanding of gramar. Proximity can also be the most important part of location. For instance the location of a baseball is usually not meaningful but its proximity to the batter’s box is.
  • Structure: For instance if a word is identified as the recipient of an email it is much more likely to be a proper noun.
  • Others: Location may be important, time may be, color, etc.

We often have little choice in the features we have, they are determined by the data we get and we can’t go back and gather some potentially useful extra information. Getting the best results may have to do with finding hidden features that are not at first obvious. Sometimes we can find a proxy for a feature we would wish for as in the case of word proximity. Consider the sentence, “I hate Pat”. “Hate” is known to have emotional content. If “Pat” has been recognized by an NLP system as a named entity then the proximity of a word with emotional content might imply a relationship. Proximity becomes a proxy for what we really want to know, which is determined by grammar. Extracting the grammar would be much much harder than counting how many words separate “hate” from “Pat”.

The point of this post is to introduce the idea of features. Be ready to think creatively when you’re trying to find features. The features you have may even determine your algorithm, but that sounds like another post.

Manifesto 1

Posted by pat on October 02, 2011

This blog is dedicated to applications of machine learning and natural language processing. Most of the posts will be thoughts on using machine learning for practical applications and they will usually be oriented around text analysis. There may be some posts about research but it will usually be towards an application of the idea or technique.

I hope it will be a discussion. I’m really looking for people to collaborate on dreaming up applications and maybe even putting some of them together.

Machine Learning with Mahout

Posted by pat on January 02, 2011

This holiday break I started playing around with Mahout, getting it started and running some of the sample data through it. It’s a new part of the Lucene/Hadoop project in Apache which contains a math lib and code generator which builds mapreduce jobs that run in Hadoop and uses the HDFS to store data. It includes a nice vector and matrix library that provides a flexible set of operations and collection types implemented on the parallel mapreduce architecture of Hadoop. Included are several higher level frameworks of general purpose usefulness like:

  • a clustering engine using k-means clustering
  • a Breiman decision forest engine
  • a document classification engine using Bayes and Naive Bayes
  • a collaborative filtering engine
  • cosine similarity via vector dot products

In a previous job we implemented term vector based clustering and similarity in an innovative application for browsing an information space, which was calculated from a number of web pages.  We built a prototype on Hadoop and Hbase and were in the process of moving the calculations to mapreduce when the funding ran out.  It sure would have been easier with Mahout.

Mahout will be a key component of the architecture for the Applications we are discussing.

A Short Survey of Clustering Algorithms

Posted by pat on January 15, 2010

K-means is one of the most used clustering algorithms. But you have to know k to run the clustering. Here is an article that describes an algorithm for choosing K. http://arxiv.org/abs/0912.3983

K-Tree clustering is a close aproximation of k-means but is more efficient for large document collections and produces hierarchy. http://arxiv.org/abs/1001.0830 http://arxiv.org/abs/1001.0827

Streaming k-means for non-batch processing of incoming data. Could be applied to clustering of email as it is received or web pages as they are browsed. http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf

K-means++ a better, faster k-means. http://lingpipe-blog.com/2009/03/23/arthur-vassilvitskii-2007-kmeans-the-advantages-of-careful-seeding/

Vector Space Model: Clustering with K-Means

Posted by pat on March 30, 2009

Another useful tool used in machine learning is clustering. It is useful when you have no human help to bootstrap the learning process (unsupervized learning). The type of clustering I’ll talk about is used on documents to guess at which should be in the same group or category. There are many different clustering algorithms, some of which I’ll describe later, but they all have in common the attempt to find groupings in a corpus of objects described by vectors. For documents we use the vector space model I described earlier.

k-means clusteringIn the illustration each document is plotted as a square.  For k-means clustering you chose k possible centroids to start with—k = 3 for this illustration. In standard k-means the points are chosen randomly in the vector space defined by the documents.  You find the nearest documents, minimizing their distance from the centroid candidate.  Then after you have k clusters you recalculate the centroid from the data and use it to iterate.  After some number of iterations if you have well-behaved data you will end up with a reasonable guess at k categories of documents.  In the illustration the ending clusters are color coded and the centroids plotted as circles.

One problem with this method of finding groups is that you still have no idea what to call the groups.  A human might see Basketball, Baseball, and Soccer but these category names are difficult to extract from the documents themselves. Also it is difficult to determine ahead of time what k should be. These are all interesting research areas.