Posted by pat
on March 16, 2012
Before I pick on Google I should say what an amazing job they do in many areas, but not everything they do is amazing in a good way. If you stick with only keyword analysis it shows its limitations in funny ways. I use Google Alerts to follow topics of long term interest and one of them is “Machine Learning”. You can image the smiles when I found these in my inbox


I didn’t mind finding Emma in my alerts but stopped collecting them when I got the foot-in-mouth guy. Aside from a little comedy relief there isn’t much purpose to about 20% of the alerts. The problem is fairly obvious, the words “machine” and “learning” aren’t enough to give good results.
There is a lot of information being thrown out by GA here. If you look at the full text of the alert pages you would be able to extract a semantic signature (term vector, concept map, etc) describing Emma that is quite distinct from one describing text clustering or dirichlet analysis even though both would contain the words “machine” and “learning”. Given a thumbs up/down or Facebook “like” you could, over time, train a classifier to keep real machine learning articles and filter out Emma and foot-in-mouth man, something like the way spam is filtered.
You could skip a classifier and sort results to return based on similarity to the centroid of the “Liked” docs. Or get fancier and assume several sub-topics by clustering the liked docs then return results similar to the cluster centroids. It would be fun to create the app by sucking up all the alerts and doing the analysis—maybe some day.
I’d miss the smiles though.
Posted by pat
on February 28, 2012
Big Data is the buzzword in vogue now and far be it from me to be out of vogue so let’s talk Big Data. When I was at Trailfire back in 2005 we had a web app built on PHP and MySQL. It performed pretty well in a master-slave cluster. However once we added some analytics tools that tracked how people used the site we brought MySQL to its knees in a few weeks. Moral: Almost any data is Big Data. I’d have to be convinced in tackling a new project that the data wasn’t going to immediately scale beyond old style tools. Be ready to look beyond relational DBs and non-parallel algorithms.
In our FinderBots implementation we’ve found MongoDB to be fast and scalable. Some problems may loom for the future when we try to use it’s automagic resharding feature (rumored to have problems). But for now it lets us work without solving every problem at once.
The next question is what framework to use in building scalable algorithms. Here we’ve chosen Hadoop and Mahout. These give us the promise of highly scalable file storage and mapreduce for parallel job execution. A close inspection of Mahout along with some experimentation yields a good selection of algorithmic tools:
- Text to TFIDF weighted Vectors: Mahout has several tools to take text, calculate n-grams, remove stop words, and create TFIDF weights. It also comes with filters from Lucene to stem text and perform other scrubbing. Nice and all in hadoop mapreduce form.
- RowSimilarityJob: This built-in job finds the most similar documents for each row in the corpus matrix. Here the matrix has rows corresponding to docs and columns for each dimension. The implementation of this makes use of the sparse nature of the matrix for optimizations and has gone through two versions. I have not benchmarked it yet but it is supposed to be pretty efficient and in hadoop mr form.
- Canopy Clustering: This is a first phase cluster approximation that can be used to estimate the number and seed centroids for kmeans. It is very fast and since kmeans needs a k as input it’s a good tool. We plan to gather data, compute the document similarity, use the distance of docs to similar docs to guess at thresholds for canopy clustering, then we run clustering…
- Fuzzy K-Means Clustering: fkmeans is build into mahout with plugable distance measures. It allows us to create overlapping clusters to reflect multi-topic documents. All in hadoop mr form.
- CosineDistanceMeasure: If the length of all document vectors are normalized (length of 1) then the dot product of two vectors gives the cosine of the angle between them. The angle is a good measure of how close two docs are and ignores the size of the docs (magnitude of their vectors) This is calculated pretty fast and is a good measure which ignores the size of documents on the theory that a topic is a topic no matter how much you talk about it.
Progress
As of 2/27/2012 FinderBots.com has a hand generated example DB behind it. This is running on Ruby on Rails using MongoDB. We have also integrated Solr for indexing and search. You have to get a special account to see the live data though there are some screenshots on the site. We also have a small 6 core 1Tb hadoop cluster that is being used for the Mahout/Hadoop. We are currently experimenting with the reuters news data and recent dumps of wikipedia. There are still a few weeks of tuning the analysis then we will tie the data to the UI.

Issues
One question we haven’t tackled yet is how to do vector based queries. In the screenshot above we want to allow people to click a tag (actually a term from the doc vector) then take the vector that it belongs to, re-weight the clicked term higher, and use the resulting vector as a query for similar docs. This is like allowing the user to query, “this page is close but I need more of this (clicked) topic”. It’s a way to explore the information space via point and click. Currently the only obvious way to do similarity calculation is offline (see RowSimlarityJob above) and so has to be done in the background. There seems to be some support for an approximation of vector queries in Solr perhaps via boosting—stay tuned.
Posted by pat
on December 29, 2011
In 2008 I was at a startup that needed to create a statistical named entity recognition system for email. There are lots of interesting and unique problems in doing this, not the least of which is finding a suitable source of training data. Statistical NERs need to be fed tagged data in order to create a model for recognition and things are only complicated by the fact that email has a different tone (in linguistic terms ‘voice’) than many of the large tagged datasets available. We used the Enron email data, which was made public as part of legal action against them. The task of tagging a significant part of the corpus is a subject for another post (spoiler, it involves Amazon’s Mechanical Turk) but it reminds me of how important and useful good data sources are to machine learning.
Here are some useful data sources (to be updated regularly):
- Enron: most corporate email for a few years prior to the legal action.
- Internet Archive: TheWayBackMachine has historical crawl data from early web times. Also a 2 billion page crawl from 2007.
- Google Book N-grams: word sequences from digitized books.
- Wiki10+ Dataset: 2009 dataset of tags+url+page data. Sometimes a tag is more descriptive than the contents of a page. For instance searching for things tagged “comedy” is far easier than looking for some combination of words on the page.
- Wikipedia: Wikipedia on AWS as an EBS volume. This is a blog post which also includes some ideas about using the data. It is only 500GB or so.
- Wikipedia Links and Page Titles: Useful in everything from pagerank to determining relatedness.
- Wikipedia Dumps: XML dump files for portions of Wikipedia.
- Reuters-21578: a snapshot of reuters news stories often used for text analysis.
- WordNet: word relationships.
- Bag Of Words: Collections of documents scrubbed for cluster analysis.
I’ve only listed a few of the best, comment if I’ve missed important examples.
Posted by pat
on December 20, 2011
We haven’t even begun to set the ‘bots loose on Wikipedia and I have data envy. It’s not just the data that is interesting at Common Crawl but their business model. They have a giant crawl of the web that you can access for free—sort of. They are hosting the data on Amazon S3 and as long as you have your analysis code running on EC2 the access is free. If you export some of the data or access from outside Amazon’s Cloud you have to pay data transmission charges.
Here is a post describing how to access and analyze the data using their open source hadoop jobs.
Common Crawl is a non-profit so they don’t directly benefit from the data being accessed but Amazon does. If you rely on the data it would only be economical to also reside in the Amazon Cloud. Hmm, if Amazon doesn’t help fund Common Crawl, maybe they should.
Posted by pat
on December 19, 2011
New site—woohoo!
We’ve been working to put up a scaffolding for some of our ideas. Checkout FinderBots.com. It’s a Rails 3.1 based web app running on MongoDB using the mongoid ORM. Behind the scenes we have Mahout and Hadoop running on HDFS. The idea is to write important results into Mongo and display them through the web app.
We are starting with a dump of, wait for it… Wikipedia so limited crawling at first. We’ll cluster the data into hierarchical topics using a modified k-means clustering algorithm from Mahout. Then we’ll pack ‘em into Mongo, and display them at FinderBots.
For the modified k-means we’ll create a hierarchy by clustering everything, then pages in a cluster will be reclustered, rinse repeat… We should get a hierarchical categorization of the data. It won’t be perfect because k-means will force any given page into one and only one cluster. Not optimal but there are ways around this that we can apply later.
For now it’s just nice to have a scaffolding up even if things aren’t wired together yet.
Posted by pat
on November 02, 2011
For web apps among the applications we are investigating we’ll pick as turnkey as possible an architecture and try to explain the reasoning.
Components:
- Web App: We’ll use Ruby on Rails 3, just because it’s fun. I also believe that it will scale pretty well for the needs it faces. We will only ask RoR to present pages and read from a DB of pre-calculated values. No heavy lifting here so fun counts.
- DB: NoSQL naturally. In past projects we have used HBase and MongoDB. HBase is a clone of BigTable the Google key value store and when we used it several years ago it was pretty green. It does not seem to be the NoSQL that has gotten the most cred or community support (correct me if I’m wrong). MongoDB is the new kid on the block and is also a little green but I like it because it represents the next gen in easy to set up and use NoSQL DBs. The other possibilities are numerous but the only one I’ll look at for now is Cassandra. It seems to be getting a lot of design wins and is pretty mature. I think a review of these will be another post.
- Number Cruncher Services: Here we’ll use java and Hadoop with it’s kick ass framework Mahout. This will give us all sorts of benefits including: scalability, great high level libraries, a distributed scalable file system, a mapreduce framework, and all the stuff that comes with java. See a quick overview here.
- Data Gathering and Crawling: Here there are several different data gathering needs. In some cases there will be occasional downloads, as with Wikipedia and other times there will be a need to crawl part of the web. For this later we’ll probably use Nutch from Apache but I’m not 100% on that yet.
- Tools: IntelliJ Idea and Eclipse for IDE’s. I name both because, in my small experience with Idea it is really nice but lacks some important plugins that may make it worth using Eclipse for some things. For source control Git. I use it for other projects and have a paid Github account already so it’s an easy decision.
- Resources: Currently a Macbook Pro and two Ubuntu 11.10 servers that are fairly old but all 64 bit with fair amounts of ram. This will not speed things up very much but will at least prove out the parallel architecture. These are in my home and one doubles as a media center. All double as space heaters.
Architecture:
The RoR app will serve up the pages and data from the DB. The services will crawl and process data using mapreduce and stuff the data in the DB. As best practices indicate for web-scale services we will try to pre-calculate everything and where we can’t we’ll create a service that will do things fairly quickly in java. That way the RoR app is pretty simple.

Philosophy:
I should have put this first but it is kind of boring so here it is where you can skip it. We’ll try to stick to best practices and design for scalability, redundancy, and reliability but we have a very small team (1 unless you want to join—hint hint) so we’ll feel free to take shortcuts where there is a payoff. Also in the best iterative practices we may knowingly hack something first to see if it’s worth spending more time to do right.
Posted by pat
on October 30, 2011
Wouldn’t it be nice if you picked up your tablet in the morning and had a recommended reading list waiting, filled with things you might find interesting—a list tuned to your recent reading? A list that reflected your separate interests. Even better if the list could be generated automatically without making you do anything out of your routine?
Wouldn’t it be nice if the web were organized by subject and you could navigate it by narrowing or opening your focus? How about being able to veer in a new direction by looking at related subjects? Wouldn’t it be great if any page you visited could work as a jumping off point into this web guide? Wouldn’t if be nice if every page visited was a magnet for similar pages so you could see other things written on the same subject? For that matter wouldn’t it be great if every page came with a link to its closest match on wikipedia or other more appropriate authoritative source?
Why aren’t links highlighted in some way to flag ones that are likely to interest you or match recent searches? Then maybe when you get to a page from Google you’d see the link to your answer instead of having to go back and refine your search.
Don’t get me wrong, search is great but can you imagine going to the library and having only a search prompt to do your research? Search fits a lot of our information discovery needs but it shouldn’t be our only tool. In the next few posts I’ll flesh out the applications on the above wishlist.
Posted by pat
on October 30, 2011
The bases of ML techniques are algorithms and features. Algorithms like clustering, similarity, categorization, need to consume descriptions of data expressed in features. The best choice of features to measure can make all the difference.
Features:
- Words: If you think about it words are extremely simple features to use but not always the most precise. You only have to look at how many synonyms a word has to see how fuzzy its meaning is. Does “bad” mean bad or good? Tough to say without more context. There are a lot of words that have almost no meaning except to structure a sentence—words like “the”, “and”, “so”, & etc. We have ways of weeding these words out but the impreciseness of what is left is still an issue.
- Concepts: These are more precise than words but also much more difficult to extract. Two completely different words can belong to the same concept as with synonyms but you can extend the idea of a concept to include acronyms, slang, and jargon that can’t be found in a dictionary. An Ice Cream Sandwich can be either a cold treat or a particular release of a mobile OS and it’s an interesting thought experiment to decide how to tell the difference using ML.
- Frequency: Some algorithms use frequency attached to features, others do not. It may be more important to note if a feature exists than it is to count how many exist. For instance a word with a capitol letter is a good candidate for a proper noun even if there are several caps in the word.
- Proximity: How close is one word to another. This can be a shortcut to finding relationships without any deeper understanding of gramar. Proximity can also be the most important part of location. For instance the location of a baseball is usually not meaningful but its proximity to the batter’s box is.
- Structure: For instance if a word is identified as the recipient of an email it is much more likely to be a proper noun.
- Others: Location may be important, time may be, color, etc.
We often have little choice in the features we have, they are determined by the data we get and we can’t go back and gather some potentially useful extra information. Getting the best results may have to do with finding hidden features that are not at first obvious. Sometimes we can find a proxy for a feature we would wish for as in the case of word proximity. Consider the sentence, “I hate Pat”. “Hate” is known to have emotional content. If “Pat” has been recognized by an NLP system as a named entity then the proximity of a word with emotional content might imply a relationship. Proximity becomes a proxy for what we really want to know, which is determined by grammar. Extracting the grammar would be much much harder than counting how many words separate “hate” from “Pat”.
The point of this post is to introduce the idea of features. Be ready to think creatively when you’re trying to find features. The features you have may even determine your algorithm, but that sounds like another post.
Posted by pat
on October 02, 2011
This blog is dedicated to applications of machine learning and natural language processing. Most of the posts will be thoughts on using machine learning for practical applications and they will usually be oriented around text analysis. There may be some posts about research but it will usually be towards an application of the idea or technique.
I hope it will be a discussion. I’m really looking for people to collaborate on dreaming up applications and maybe even putting some of them together.
Posted by pat
on January 02, 2011
This holiday break I started playing around with Mahout, getting it started and running some of the sample data through it. It’s a new part of the Lucene/Hadoop project in Apache which contains a math lib and code generator which builds mapreduce jobs that run in Hadoop and uses the HDFS to store data. It includes a nice vector and matrix library that provides a flexible set of operations and collection types implemented on the parallel mapreduce architecture of Hadoop. Included are several higher level frameworks of general purpose usefulness like:
- a clustering engine using k-means clustering
- a Breiman decision forest engine
- a document classification engine using Bayes and Naive Bayes
- a collaborative filtering engine
- cosine similarity via vector dot products
In a previous job we implemented term vector based clustering and similarity in an innovative application for browsing an information space, which was calculated from a number of web pages. We built a prototype on Hadoop and Hbase and were in the process of moving the calculations to mapreduce when the funding ran out. It sure would have been easier with Mahout.
Mahout will be a key component of the architecture for the Applications we are discussing.