Posted by pat
on November 02, 2011
For the web apps among the applications we are investigating, we’ll pick an architecture that is as turnkey as possible and try to explain the reasoning.
Components:
- Web App: We’ll use Ruby on Rails 3, just because it’s fun. I also believe that it will scale pretty well for the needs it faces. We will only ask RoR to present pages and read from a DB of pre-calculated values. No heavy lifting here so fun counts.
- DB: NoSQL, naturally. In past projects we have used HBase and MongoDB. HBase is a clone of BigTable, Google’s key-value store, and when we used it several years ago it was pretty green. It does not seem to be the NoSQL DB that has gotten the most cred or community support (correct me if I’m wrong). MongoDB is the new kid on the block and is also a little green, but I like it because it represents the next generation of NoSQL DBs that are easy to set up and use. The other possibilities are numerous, but the only one I’ll look at for now is Cassandra. It seems to be getting a lot of design wins and is pretty mature. I think a review of these will be another post.
- Number Cruncher Services: Here we’ll use Java and Hadoop with its kick-ass framework Mahout. This will give us all sorts of benefits including: scalability, great high-level libraries, a distributed scalable file system, a mapreduce framework, and all the stuff that comes with Java. See a quick overview here.
- Data Gathering and Crawling: Here there are several different data gathering needs. In some cases there will be occasional downloads, as with Wikipedia, and other times there will be a need to crawl part of the web. For the latter we’ll probably use Nutch from Apache but I’m not 100% on that yet.
- Tools: IntelliJ IDEA and Eclipse for IDEs. I name both because, in my small experience with IDEA, it is really nice but lacks some important plugins, which may make it worth using Eclipse for some things. For source control, Git. I use it for other projects and have a paid GitHub account already so it’s an easy decision.
- Resources: Currently a MacBook Pro and two Ubuntu 11.10 servers that are fairly old but all 64-bit with fair amounts of RAM. This will not speed things up very much but will at least prove out the parallel architecture. These are in my home and one doubles as a media center. All double as space heaters.
Architecture:
The RoR app will serve up the pages and data from the DB. The services will crawl and process data using mapreduce and stuff the results in the DB. As best practices indicate for web-scale services, we will try to pre-calculate everything, and where we can’t, we’ll create a service in Java that does the work fairly quickly. That way the RoR app stays pretty simple.

Philosophy:
I should have put this first but it is kind of boring so here it is where you can skip it. We’ll try to stick to best practices and design for scalability, redundancy, and reliability but we have a very small team (1 unless you want to join—hint hint) so we’ll feel free to take shortcuts where there is a payoff. Also in the best iterative practices we may knowingly hack something first to see if it’s worth spending more time to do right.
Posted by pat
on October 30, 2011
The bases of ML techniques are algorithms and features. Algorithms for clustering, similarity, and categorization need to consume descriptions of data expressed as features. The choice of which features to measure can make all the difference.
Features:
- Words: If you think about it, words are extremely simple features to use but not always the most precise. You only have to look at how many synonyms a word has to see how fuzzy its meaning is. Does “bad” mean bad or good? Tough to say without more context. There are a lot of words that have almost no meaning except to structure a sentence: words like “the”, “and”, “so”, etc. We have ways of weeding these words out but the imprecision of what is left is still an issue.
- Concepts: These are more precise than words but also much more difficult to extract. Two completely different words can belong to the same concept as with synonyms but you can extend the idea of a concept to include acronyms, slang, and jargon that can’t be found in a dictionary. An Ice Cream Sandwich can be either a cold treat or a particular release of a mobile OS and it’s an interesting thought experiment to decide how to tell the difference using ML.
- Frequency: Some algorithms use frequency attached to features, others do not. It may be more important to note whether a feature exists than to count how many times it occurs. For instance a word with a capital letter is a good candidate for a proper noun even if there are several caps in the word.
- Proximity: How close is one word to another? This can be a shortcut to finding relationships without any deeper understanding of grammar. Proximity can also be the most important part of location. For instance the location of a baseball is usually not meaningful but its proximity to the batter’s box is.
- Structure: For instance if a word is identified as the recipient of an email it is much more likely to be a proper noun.
- Others: Location may be important, time may be, color, etc.
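As a rough illustration of the word and frequency features above, here is a minimal Python sketch. The stop-word list, the function name `word_features`, and the `binary` flag are all made up for this example; real stop-word lists are much longer and real tokenizers are smarter than splitting on whitespace.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "and", "so", "a", "is", "of"}

def word_features(text, binary=False):
    """Map each non-stop word in text to a weight.

    With binary=False the weight is how often the word occurs (frequency).
    With binary=True the weight is 1 for every word present (existence only).
    """
    # Naive tokenization: split on whitespace, trim punctuation, lowercase.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    if binary:
        return {w: 1 for w in counts}
    return dict(counts)

if __name__ == "__main__":
    sentence = "The cat and the dog chased the cat."
    print(word_features(sentence))               # frequency features
    print(word_features(sentence, binary=True))  # presence features
```

Whether to keep counts or just presence depends on the algorithm consuming the features, so the sketch leaves that as a switch.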
We often have little choice in the features we have. They are determined by the data we get, and we can’t go back and gather some potentially useful extra information. Getting the best results may have to do with finding hidden features that are not at first obvious. Sometimes we can find a proxy for a feature we would wish for, as in the case of word proximity. Consider the sentence, “I hate Pat”. “Hate” is known to have emotional content. If “Pat” has been recognized by an NLP system as a named entity, then the proximity of a word with emotional content might imply a relationship. Proximity becomes a proxy for what we really want to know, which is determined by grammar. Extracting the grammar would be much, much harder than counting how many words separate “hate” from “Pat”.
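To make the “I hate Pat” example concrete, here is a small Python sketch of proximity as a proxy. Everything in it is an assumption for illustration: the function name `emotional_links`, the hand-written set of emotion words, the `window` parameter, and the idea that named entities arrive as a ready-made set from some NLP system. Distance is measured as the difference in token positions, so adjacent words are at distance 1.

```python
def emotional_links(tokens, entities, emotion_words, window=3):
    """Pair each emotion word with every named entity within `window` tokens.

    Returns a list of (emotion_word, entity, distance) tuples. No grammar is
    examined; closeness alone stands in for the relationship.
    """
    links = []
    for i, token in enumerate(tokens):
        if token not in emotion_words:
            continue
        for j, other in enumerate(tokens):
            distance = abs(i - j)
            if other in entities and 0 < distance <= window:
                links.append((token, other, distance))
    return links

if __name__ == "__main__":
    tokens = ["I", "hate", "Pat"]
    # Hypothetical inputs: a named-entity set and an emotion lexicon.
    print(emotional_links(tokens, entities={"Pat"}, emotion_words={"hate"}))
```

A sentence like “I do not hate Pat” would produce the same link, which shows the limits of the proxy compared with actually extracting the grammar.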
The point of this post is to introduce the idea of features. Be ready to think creatively when you’re trying to find features. The features you have may even determine your algorithm, but that sounds like another post.
Posted by pat
on January 02, 2011
This holiday break I started playing around with Mahout, getting it started and running some of the sample data through it. It’s a new part of the Lucene/Hadoop project in Apache. It contains a math lib and a code generator that builds mapreduce jobs, which run in Hadoop and use HDFS to store data. It includes a nice vector and matrix library that provides a flexible set of operations and collection types implemented on the parallel mapreduce architecture of Hadoop. Included are several higher-level frameworks of general usefulness like:
- a clustering engine using k-means clustering
- a Breiman decision forest engine
- a document classification engine using Bayes and Naive Bayes
- a collaborative filtering engine
- cosine similarity via vector dot products
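For readers who have not seen the last item before, here is a minimal Python sketch of cosine similarity computed from a dot product. This is not Mahout’s API; it is plain Python on sparse term vectors stored as dicts, and the function name `cosine_similarity` is only for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors given as term -> weight dicts."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    doc_a = {"baseball": 2, "bat": 1}
    doc_b = {"baseball": 1, "glove": 1}
    print(cosine_similarity(doc_a, doc_b))
```

Documents with no terms in common score 0, and documents whose term weights are proportional score 1, regardless of document length.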
In a previous job we implemented term-vector-based clustering and similarity in an innovative application for browsing an information space, which was calculated from a number of web pages. We built a prototype on Hadoop and HBase and were in the process of moving the calculations to mapreduce when the funding ran out. It sure would have been easier with Mahout.
Mahout will be a key component of the architecture for the applications we are discussing.
Posted by pat
on March 30, 2009
Another useful tool in machine learning is clustering. It is useful when you have no human help to bootstrap the learning process (unsupervised learning). The type of clustering I’ll talk about is used on documents to guess at which should be in the same group or category. There are many different clustering algorithms, some of which I’ll describe later, but they all have in common the attempt to find groupings in a corpus of objects described by vectors. For documents we use the vector space model I described earlier.
In the illustration each document is plotted as a square. For k-means clustering you choose k possible centroids to start with (k = 3 for this illustration). In standard k-means the points are chosen randomly in the vector space defined by the documents. You then assign each document to its nearest centroid candidate, minimizing its distance from that candidate. After you have k clusters, you recalculate each centroid from the documents assigned to it and iterate. After some number of iterations, if you have well-behaved data, you will end up with a reasonable guess at k categories of documents. In the illustration the ending clusters are color coded and the centroids plotted as circles.
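The loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the names `k_means`, `iterations`, and `seed` are made up here, documents are assumed to already be points (tuples of numbers), and the starting centroids are k randomly picked documents rather than arbitrary random points in the space, which is a common variation.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))

def k_means(points, k, iterations=20, seed=0):
    """Return (centroids, clusters) after at most `iterations` rounds."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen documents as centroid candidates.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters

if __name__ == "__main__":
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, clusters = k_means(docs, k=2)
    for centroid, cluster in zip(centroids, clusters):
        print(centroid, cluster)
```

Different seeds can give different groupings, since k-means only finds a local optimum; running it several times and keeping the tightest result is a common workaround.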
One problem with this method of finding groups is that you still have no idea what to call the groups. A human might see Basketball, Baseball, and Soccer, but these category names are difficult to extract from the documents themselves. It is also difficult to determine ahead of time what k should be. Both are interesting research areas.