Mahout Version 0.2 is Out
I've been watching "Mahout":http://lucene.apache.org/mahout/ for a little while. I'm very impressed with all the activity they have going on. Mahout is a machine learning framework for the "Hadoop":http://hadoop.apache.org/ framework. They seem to be doing the thing that Tegu only is aware of, at this point. "Alex Handy":http://www.sdtimes.com/author/ahandy.aspx seems to be enthused as well, with his "review of Mahout":http://www.sdtimes.com/blog/post/2009/11/19/We-are-the-big-data-problem.aspx
I think this is the one to watch. I'll be posting my experiments with the framework as they mature and are more presentable.
More Useful SlenderT
I think my SlenderT library is starting to get useful:
- I've run quite a few benchmarks on it
- I've worked on the load to make that pretty fast
- I've implemented the query
- I've played with a fairly large database, finding that it delivers a very expressive tool
- I've written some documentation that can be found at "GitHub":http://github.com/davidrichards/slender_t
Some quick examples from the documentation:
>> db = SlenderT.load('spec/fixtures/business_triples.csv')
>> db.find('BSC', 'name', nil)
=> [["BSC", "name", "Bear Stearns"]]
That tells us that BSC means Bear Stearns. This tells us who we know Bear Stearns contributed to recently, and how much:
>> val = db.query(['?contribution', 'contributor', 'BSC'],
?> ['?contribution', 'recipient', '?recipient'],
?> ['?contribution', 'amount', '?dollars'])
=> [{"?contribution"=>"contrib285", "?dollars"=>30700.0, "?recipient"=>"Orrin Hatch"}, {"?contribution"=>"contrib284",
"?dollars"=>168335.0, "?recipient"=>"Hillary Rodham Clinton"}, {"?contribution"=>"contrib287", "?dollars"=>5600.0,
"?recipient"=>"Christopher Shays"}, {"?contribution"=>"contrib288", "?dollars"=>205100.0, "?recipient"=>"Christopher Dodd"},
{"?contribution"=>"contrib290", "?dollars"=>17300.0, "?recipient"=>"Frank Lautenberg"}, {"?contribution"=>"contrib286",
"?dollars"=>5000.0, "?recipient"=>"Barney Frank"}, {"?contribution"=>"contrib289", "?dollars"=>13000.0, "?recipient"=>"Michael
Dean Crapo"}, {"?contribution"=>"contrib294", "?dollars"=>4600.0, "?recipient"=>"Pete Sessions"},
{"?contribution"=>"contrib295", "?dollars"=>5000.0, "?recipient"=>"Paul E. Kanjorski"}, {"?contribution"=>"contrib292",
"?dollars"=>6600.0, "?recipient"=>"Nita Lowey"}, {"?contribution"=>"contrib293", "?dollars"=>5000.0, "?recipient"=>"Deborah
Pryce"}, {"?contribution"=>"contrib291", "?dollars"=>102260.0, "?recipient"=>"Joe Lieberman"}]
>> val.size
=> 12
It'll get some more lovin', but it's plenty good for this week's deliverables.
Hadoop Online Processing
"Andrew Shafer":http://stochasticresonance.wordpress.com/ told me about "HOP":http://radar.oreilly.com/2009/10/pipelining-and-real-time-analytics-with-mapreduce-online.html last night. It's a really powerful concept:
- Take Hadoop's MapReduce interface
- Allow jobs to be run online, rather than batched
- Co-schedule tasks
- Keep tasks running all the time
This speeds up the delivery of the analytics: no need to read and write to HDFS for intermediate steps. This provides real-time analytics. This cleans up the feedback loop between data and analysis by quite a bit.
I can think of a lot of data rich applications that could use this. Wall Street and search tools could really have a hay day with this stuff.
SlenderT: A Simple Triples Store
I've translated a little code from "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 by Toby Segaran et al tonight. There's a neat little ruby gem available for your pleasure now, "slendert":http://github.com/davidrichards/slendert:
sudo gem install slender_t
It's a triples store. It'll get some love when I work on my recommendation engine again in the morning. I decided to go a little more robust than the "Composite pattern, as I mentioned in yesterday's post":http://blog.tegugears.com/2009/10/18/a-little-education-goes-a-long-way and store the rules in a triples store. I couldn't justify installing "Sesame":http://www.openrdf.org/ or "Redland":http://librdf.org/ for that, so there's this.
I have a few things to look at tomorrow:
- Possibly implementing it with an "RBTreeMap":http://github.com/kanwei/algorithms/blob/master/lib/containers/rbtreemap.rb from the "Algorithms":http://github.com/kanwei/algorithms gem to make it a little nicer when adding triplets. Currently, I have a lazy/nasty piece of work in the add method:
index[a][b] = index[a][b] | [c]
- I could do with a better query language. There are quite a few ideas that I want to take from this project and bring back into some other projects I should finish: marginal, fathom, and overalls.
Anyway, if you'd like to play around a little with semantic technologies, that may give you a start. Also, "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 and "Semantic Web for the Working Ontologist":http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564/ref=sr11?ie=UTF8&s=books&qid=1256026799&sr=1-1 are practical references worth picking up.
Hive
I've watched "Hadoop":http://hadoop.apache.org/ for a while. I even took a trip to San Francisco once to watch "RapLeaf":http://www.rapleaf.com/ demonstrate some of its features. It's an open source version of the "MapReduce pattern":http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAkQFjAA&url=http%3A%2F%2Flabs.google.com%2Fpapers%2Fmapreduce-osdi04.pdf&ei=_LfbSvqlLoGsswOj9emxCQ&usg=AFQjCNGR2uEfpCUiHvw6I876kTPeiv-mUA&sig2=7v3SP3QH7MZIr7KLJhwZ8g written about by Google in 2005. The basic idea is that you can have large-scale parallel processing produced by non-specialist programmers by providing a simple MapReduce framework. What you do is:
Define the core work done on each data record in a map function
Define a combination function that can gather processing done on various servers
Partition your data so that it can be run by many machines at once
Run your algorithms in the MapReduce pattern
The simplicity of this framework makes it a general solution. It takes away many of the complexities, and has become quite popular in recent years. There are competing solutions, such as "Message Queue":http://en.wikipedia.org/wiki/Messagequeue systems and the "Linda":http://en.wikipedia.org/wiki/Linda(coordination_language) framework. These other approaches are also very useful and popular (check out these "virtual machines for Linda":http://www.lindaspaces.com/about/index.html), but they don't seem to have as much support. Maybe I'm wrong about that, I just report what I hear and see. I have messed with these other solutions, and I had working prototypes in each for my Tegu framework. However, there were more moving parts and I wasn't convinced I had things in order. I'll need to get back to a working version of Tegu someday, since I've been blogging under its name for about a year and haven't given the world much to play with.
One thing that the MapReduce community has going for it is the community itself. The "Apache Foundation":http://www.apache.org/ has sponsored the "Hadoop project":http://hadoop.apache.org/, an open source Java implementation of the pattern.
They have also sponsored the "Hive project":http://wiki.apache.org/hadoop/Hive, a system that organizes all the data, map, and reduce elements that can go into a production system. With Hive, you have a query language that allows you to define tables, partitions, and buckets of data. You can then filter, sort, select, and manage this large-scale data in a fairly useful way. It has a command line interface, as well as a simple web-based interface.
I'd like to work out some examples using Hive and display them here. I'd also like to put together a "Puppet":http://reductivelabs.com/products/puppet script to install Hadoop and Hive on your systems. This goes into a larger theme that I've been thinking about for some time, data analysis with open source software. Anyway, this is a simple introduction for today.
Reinforcement Learning in the Wild
I've been studying reinforcement learning a bit lately. As I do, I start to see the feedback loops more clearly all the time.
If you look at the deleted scenes on "A Beautiful Mind":http://www.imdb.com/title/tt0268978/, they go more deeply into why John Nash had a tissy fit over losing the game Go. The whole story was that it was mathematically inconsistent. Meaning, the rules he had derived from studying the game couldn't be consistently applied to produce a consistent result, winning. The issue, I suspect, is that he didn't have a complete policy, therefore it seemed inconsistent. John Nash responded to this event by inventing a game more challenging than Go that was complete in his eyes.
I've noticed how my mind often works this same way. I used to play Spider Solitaire when I had a free moment at my computer. It's just a game of matching cards with some unknown cards sitting in the deck. If you play it enough, you begin to learn policies for which cards to work on first, changing your likelihood of matching all the cards and winning. Checkers works the same way. With enough repetition, I begin to see three and four moves ahead, detecting basic patterns and strategies. The game gets boring after a while, and I begin to think about how the computer is organizing its valuation policies. They say that a good Chess player looks ahead 7 moves all the time. The patterns of the strategies begin to have concrete meaning in their decision making process. They also say that the best chess players play against the best chess games to learn new strategies. The best chess games apply reinforcement learning to tease out the more effective patterns of all the patterns available. Games have often been the beginning of reinforcement learning research.
"War Games":http://www.imdb.com/title/tt0086567/ also uses games to learn basic policies. In this movie, the super computer, Joshua, had to decide to shut down a nuclear attack because it could see mutual extinction as the ultimate result of the launch. In the beginning of the movie, you get a list of common games that have been played with reinforcement learning: Chess, Backgammon, and Tic Tac Toe. The way that they taught Joshua that some games can't be won was to play Tic Tac Toe until it concluded that there was no winner available when playing a competent opponent.
The idea of generalizing learning from one algorithm and reusing it is very interesting to me. This is what they asked Joshua to do in War Games. I found "Transfer in reinforcement learning Domains":http://www.amazon.com/Transfer-Reinforcement-Learning-Computational-Intelligence/dp/3642018815/ref=sr125?ie=UTF8&s=books&qid=1255899101&sr=8-25 by Matthew E. Taylor. The book discusses ways to bring a priori knowledge from one reinforcement learning program to another. I'd like to buckle down and go through that book. Sounds like I'll be visiting the university library again sometime soon.
"House":http://www.imdb.com/title/tt0412142/ tends to do a good job with a lot of machine learning concepts. The whole dialog is basically embedded in a big Bayesian network, where Dr. Gregory House's staff is expected to be able to apply prior probabilities to a complex set of tests and symptoms, deciding whether a test is conclusive or not. They also indirectly apply some reinforcement learning principles, such as in the most recent episode where a Vietnam veteran has had 36 years of pain that was overcome when House tricks the brain into accepting new evidence, changing its policy system and no longer believing it has a clenched fist. The assertion was that the brain is just a reinforcement learning algorithm that can be reprogrammed with a little evidence. The first episode of this season was two hours of House learning through trial and error the best way to interact with his new environment, the psych ward, with stunning results.
The "Bourne":http://www.imdb.com/title/tt0258463/ series is based on the idea that the government could reprogram a brain, giving it simulated experience sufficient to make the brain have no distraction and concentrate effectively on being the ultimate assassination machine. This is a common idea, that a learning agent could be optimized for a specific environment or task, rather than just using general agents that are created to react to all environments and all tasks. It makes Bourne an awkward character in social situations, but a really exciting super hero in others.
The "James Bond":http://www.imdb.com/title/tt0830515/ series seems to be introducing a little reinforcement learning between movies. We get to start to see how Bond reacts a little less explosively because of experiences he had in previous movies. The latest movie is a good example, where the final scene surprises his boss, M, because he left the suspect alive. He left the person alive because he is supposed to be reacting to the environment, that killing everything that stands in his way isn't always the most effective way to get things done. Interestingly, Daniel Craig has signed up for another three movies, with the proviso that he gets input on character development. He wants to ensure that James Bond continues to be an effective learning agent, not just a brute killing machine. These changes are most likely Hollywoods ability to gather data on the preferences and buying probabilities of its audience and adjust to that. Audiences want believable albeit fantastic characters.
Whatever Hollywood's take on these ideas, I'm excited to continue to learn the real thing and apply it in various decision support tools.
Data Frame and Meta Data
There's a code smell in "dataframe":http://github.com/davidrichards/dataframe. It is beginning to be a little difficult to add some features that deal with cached values. Data frame is supposed to take a series of named fields and let you do interesting things with it:
There's also a bug there. Can you see it? We lost the category for wind. The last line should have been [0,1] like it was before. Now, I think I could fix this pretty easily, but I am beginning to see that I need to have a better way to work with data. As it sits:
- justenumerablestats does a lot of the fancy work under the covers
- The meta data in a column is not explicitly defined
- The meta data does not follow a copied array the way that I copied it
- data_frame does not have explicit meta data
- The meta data from both justenumerablestats and data_frame cannot be extracted out of their objects
There are a few things that I want to fix to then:
First, I want to be able to extract the meta data out of arrays and data frames to be used on other data sets. I would like to preprocess a data sample, generate its labels, categories, max, min, standard deviation, etc., and then have that available as the context for processing a whole data stream. By making this change, I don't have to limit myself to data that can fit in memory.
Second, justenumerablestats is a dumb name for a gem. I started it as a quick fix for a simple program. Then I kept adding methods to it every time I had a new use. Pretty soon, it became a very interesting collection of methods, but that really only work on Array data. I can't use this on some custom class that includes the Enumerable module, I have too many dependencies on arrays. Hashes can't use it. So, this is really array_stats. So, I should make that gem instead and start depending on that.
So, this is where I want to take dataframe, justenumerablestats, and arraystats. I'll see if I can get back to it this weekend. I'm excited to work on it, but I have some code to turn in for some clients before tomorrow evening, and I better not delay anymore.
Preprocessing 1
"Red Davis":http://redwriteshere.com/ posed an interesting question to me today. How do you normalize a streaming data set? Meaning, if I have data with an unknown maximum or minimum value, and I have too much data to search through it to find the exact answers, how do I work with that?
The answer we came up with was a bit of backtracing;
- Either provide a sample data set or an estimated maximum and minimum value
- Calculate the maximum and minimums based on a normal distribution
- Start normalizing the data stream
- Recalculate the max and min if the data exceeds the bounds
At that point, we can use a series of heuristics or learning algorithms to decide how to setup the normalized data.
The thing about a normalized data set is that it is internally consistent. Meaning, we are adjusting for any skew that a parameter might have on a neural network, say, because its values were recorded between 1 and a million, versus values that were recorded between 1 and 10. To a learning algorithm, the different states are just different states, and normalized data achieves that limitation.
I put together a little piece of code that might be useful for this kind of problem, found here:
Shogun Shotgun
So, I have enjoyed what I have learned lately about "Shogun":http://www.shogun-toolbox.org/. It is a unified interface for several SVM libraries. Today, I can run some sophisticated analysis from R, Octave, or Python, but not Ruby. I was looking through the "Python Interface":http://svn.tuebingen.mpg.de/shogun/releases/shogun0.8.0/src/python/PythonInterface.cpp (also "here":http://svn.tuebingen.mpg.de/shogun/releases/shogun0.8.0/src/python/PythonInterface.h), and it doesn't look too difficult.
I think, taking an honest look at my C/C++ skills (or lack thereof), I realize I should probably pair on this with someone. My programming classes used C, and I never took C++ very seriously, so I may have a bit of a learning curve ahead of me. I know I'm going to have to pay attention to the "Extending Ruby with C":http://www.rubycentral.com/pickaxe/ext_ruby.html chapter in my Pickaxe book.
If someone else would be interested in this kind of project, and wouldn't mind pairing with a novice, I'd love the opportunity. Meanwhile, I'm doing the novice work of finding and using recipes to get a rhythm for extending Ruby.
A Good Twitter Spam Filter
So, "Red Davis":http://redwriteshere.com/ and I are building a spam filter for Twitter. I don't tweet much, maybe I should be, but we thought it would be interesting to create a decent filtered client. Here are a few thoughts we've had so far on what it should look like:
- It should be able to train a general baseline filter for all spammers
- It should be able to report a specific user as a spammer
- It should take into account the types of content and determine if that content is spam
This problem isn't that much different than email spam filters. But, we thought it'd be fun to play around and see what we came up with. At least, we'll be able to play with several technologies to compare them.
If you have any ideas about this, please comment below.