Training Software
Lately I’ve been thinking about my job, the fate of AI, and a recent article in Wired. I am the CTO at a startup company that is using text and data mining techniques and natural language processing to help people manage the information they touch. We are trying to automatically make connections between things that you have to manage yourself with today’s software.
AI rode a wave of hype in the ’80s, when researchers tried to model human thought directly in their algorithms. For instance, natural language processing tried to encode human grammar in the NLP algorithm. This technique had limited success, and the failure of this type of approach led to the long night of AI. More recently, new algorithms have emerged that must be trained on a large corpus of documents. Such an algorithm has two stages: the training phase, where you feed in tagged documents (called a gold standard), and the extraction phase, where the algorithm uses the model generated in the training phase to tag untagged documents. So if you have lots of email tagged wherever a “person” is mentioned, the algorithm will be able to pick out people from untagged email. It has learned language without having any model of grammar. In fact, in the above example the language was not important; it would work in any textual language regardless of the grammar.
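To make the two stages concrete, here is a minimal sketch in Python. It is a toy naive Bayes token tagger, not the algorithm any particular product uses. The function names (`features`, `train`, `extract`), the `PER`/`O` tags, and the three-sentence gold standard are all invented for illustration; a real gold standard would be far larger.

```python
import math
from collections import Counter, defaultdict


def features(tokens, i):
    """Surface clues for token i. Nothing here encodes a grammar rule."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return [
        "word=" + word.lower(),
        "prev=" + prev_word,
        "next=" + next_word,
        "cap=" + str(word[:1].isupper()),
    ]


def train(gold_standard):
    """Training phase: count how often each feature appears with each tag."""
    tag_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, tags in gold_standard:
        for i, tag in enumerate(tags):
            tag_counts[tag] += 1
            for f in features(tokens, i):
                feature_counts[tag][f] += 1
                vocabulary.add(f)
    return tag_counts, feature_counts, vocabulary


def extract(model, tokens):
    """Extraction phase: give each untagged token its most probable tag."""
    tag_counts, feature_counts, vocabulary = model
    total = sum(tag_counts.values())
    predicted = []
    for i in range(len(tokens)):
        scores = {}
        for tag, count in tag_counts.items():
            # Add-one smoothing so unseen features do not zero out a tag.
            denom = sum(feature_counts[tag].values()) + len(vocabulary)
            score = math.log(count / total)
            for f in features(tokens, i):
                score += math.log((feature_counts[tag][f] + 1) / denom)
            scores[tag] = score
        predicted.append(max(scores, key=scores.get))
    return predicted


# Hypothetical gold standard: email sentences with person tokens tagged PER.
GOLD_STANDARD = [
    (text.split(), tags.split())
    for text, tags in [
        ("Please forward this to Alice Smith today", "O O O O PER PER O"),
        ("I spoke with Bob Jones yesterday", "O O O PER PER O"),
        ("The report from Carol Lee is late", "O O O PER PER O O"),
    ]
]

model = train(GOLD_STANDARD)
untagged = "Send the notes to Dave Brown tomorrow".split()
print(list(zip(untagged, extract(model, untagged))))
```

The tagger has never seen "Dave" or "Brown", yet it tags them as a person from context and capitalization alone. Swapping in a gold standard in another language would need no code changes, although a feature like capitalization only helps in scripts that have it.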
AI research has led us to a new flowering of the technology through machine learning. We still make no claim that high-dimensional vector spaces are the way humans learn, but they work on the digital processors we have today, and they can learn things that are useful about the human experience. This brings me to the article by Chris Anderson of Wired. Chris falls prey to the mainstream journalist’s cliché of exaggerating for attention, but his article, “The End of Theory,” does make you think. We are entering a time when the cool new algorithm is being replaced in importance by the awesome new data set. It also occurred to me that there is one entity in the universe with just about all the data: Google.
Which brings me to my job: train some algorithms on a new data set before Google thinks of it, and apply them to a unique customer problem. Not exactly what I had in mind when I started my career, but it does add a detective angle that you have to like.
Powerset: NLP-based information extraction and navigation on the web
Powerset released their NLP-based search engine/information extraction and gisting engine/information navigation service yesterday. I know, it’s a mouthful, but that is the point of this post. They are trying to do a lot of things and succeeding better at some than others. Here is a partial list:
- Query flexibility: Powerset does NLP on your queries so you can ask questions like, “Who are the actors in Pulp Fiction?” This would be a negative feature if it were required, but you can also type, “actors pulp fiction.” I haven’t been able to tell if they are doing any synonym checking on the queries.
- Search history: Powerset mines your history for past searches and displays close matches as you type into the search box. This is virtually useless. What are the chances that I want to search for the exact same thing again? Google makes suggestions based on what everyone searches for, and I have come to rely on it as a sort of query tuning. Wouldn’t it be nice to include truly similar queries, taking into account semantics and synonyms? The Google experience could be improved, but Powerset chose to step backwards.
- Auto-tag Cloud: Here Powerset made some interesting improvements to the user experience. First, they build the tag cloud from terms found in the document rather than relying on spotty user-generated tags. They also separate the nouns and verbs referenced in the document into separate clouds. This has some utility, but they currently show too many words and use them only as a way to navigate the information in the document, as opposed to information on the web in general.
- Gisting: This is where Powerset fails to live up to their hype. The idea of gisting long documents to produce something that is easily skimmable is a powerful one, but they make so many mistakes that the implementation is distracting and of marginal use. Hopefully they will improve this with better-tuned NLP n-gram extraction.
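As a rough illustration of what I mean by similar queries under "Search history" above, here is a minimal Python sketch that matches past queries on meaning rather than exact text. Everything in it is an assumption of mine: the tiny `SYNONYMS` and `STOPWORDS` tables, the `suggest` function, and the 0.3 threshold are hypothetical, and I have no idea how Powerset or Google actually do this.

```python
import re

# Hypothetical hand-built tables; a real service would mine these from data.
SYNONYMS = {
    "actors": "actor",
    "cast": "actor",
    "stars": "actor",
    "movie": "film",
    "movies": "film",
}
STOPWORDS = {"who", "what", "are", "is", "the", "in", "of", "a"}


def normalize(query):
    """Reduce a query to a set of canonical content words."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return {SYNONYMS.get(w, w) for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (equal)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest(query, past_queries, limit=3, threshold=0.3):
    """Return past queries that mean roughly the same thing, best first."""
    target = normalize(query)
    scored = [(jaccard(target, normalize(past)), past) for past in past_queries]
    kept = [pair for pair in scored if pair[0] >= threshold]
    # The sort is stable, so equally good matches keep their original order.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [past for _, past in kept[:limit]]


history = [
    "Who are the actors in Pulp Fiction?",
    "cast of pulp fiction",
    "weather in Istanbul",
    "pulp fiction soundtrack",
]
print(suggest("pulp fiction stars", history))
```

Here "stars", "cast", and "actors" collapse to the same term, so both the question form and the keyword form come back as matches even though neither shares the word "stars" with the new query.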
Currently Powerset only extracts information from Wikipedia. At some level I wonder why we need that but if you look at the techniques they are using and *imagine* it working across the entire web it would be nice. What disappoints me is that it does no better at finding things and cross-referencing stuff. I could find very few examples of cross-document references in the preview.
These days we hear of many applications of NLP in creating semantic data from unstructured text. This has some great applications, but when it comes to finding stuff on the web I don’t need a service that reads single articles for me; I’d rather have a service that finds related information. That is what I spend most of my time doing while researching things on the web. When planning a trip to Turkey I need information on tickets, hotels, weather, history, and news, and not just history but the history of the Byzantine Empire, the Ottomans, Greece, Rome, etc. Why doesn’t a service mine the web for these connections, ones based on related concepts? A service like this would draw perhaps more from categorization technology than raw NLP.
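Here is a small Python sketch of the kind of concept-based connection I have in mind. It assumes some categorizer has already assigned categories to each page; the page titles, the categories, and the `related_pages` function are all made up for the example, and no existing service is being described.

```python
from collections import Counter, defaultdict

# Hypothetical output of a categorizer: page title -> assigned categories.
PAGE_CATEGORIES = {
    "Istanbul travel guide": {"Turkey", "Travel"},
    "History of the Byzantine Empire": {"Turkey", "History", "Byzantine Empire"},
    "Rise of the Ottomans": {"Turkey", "History", "Ottoman Empire"},
    "Ankara weather forecast": {"Turkey", "Weather"},
    "Roman roads in Gaul": {"History", "Rome"},
    "Pulp Fiction cast list": {"Film"},
}


def related_pages(page, page_categories):
    """Rank other pages by how many categories they share with `page`."""
    # Invert the mapping: category -> pages filed under it.
    index = defaultdict(set)
    for title, categories in page_categories.items():
        for category in categories:
            index[category].add(title)

    shared = Counter()
    for category in page_categories[page]:
        for other in index[category]:
            if other != page:
                shared[other] += 1

    # Most shared categories first; ties broken alphabetically by title.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


for title, overlap in related_pages("History of the Byzantine Empire", PAGE_CATEGORIES):
    print(overlap, title)
```

The point of the sketch is that the links come from shared concepts, not from parsing any one article closely. The hard part, which the sketch skips entirely, is assigning good categories to pages in the first place.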