Mahout Version 0.2 is Out
I've been watching "Mahout":http://lucene.apache.org/mahout/ for a little while, and I'm very impressed with all the activity they have going on. Mahout is a machine learning library built on top of "Hadoop":http://hadoop.apache.org/. They seem to be doing the things that Tegu, at this point, is only aware of. "Alex Handy":http://www.sdtimes.com/author/ahandy.aspx seems to be enthused as well, judging by his "review of Mahout":http://www.sdtimes.com/blog/post/2009/11/19/We-are-the-big-data-problem.aspx.
I think this is the one to watch. I'll be posting my experiments with the framework as they mature and are more presentable.
Hadoop Online Processing
"Andrew Shafer":http://stochasticresonance.wordpress.com/ told me about "HOP":http://radar.oreilly.com/2009/10/pipelining-and-real-time-analytics-with-mapreduce-online.html last night. It's a really powerful concept:
- Take Hadoop's MapReduce interface
- Allow jobs to be run online, rather than batched
- Co-schedule tasks
- Keep tasks running all the time
This speeds up the delivery of analytics, because intermediate results no longer have to be written to and read back from HDFS. The result is something close to real-time analytics, and a much tighter feedback loop between data and analysis.
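To make the pipelining idea concrete, here is a minimal single-process sketch in Python. It is not HOP's API and involves no Hadoop or HDFS at all; the names @map_stream@, @online_reduce@, and @snapshot_every@ are made up for illustration. It only shows the shape of the idea: the reducer consumes map output as it arrives and can publish running answers before the input is finished.

```python
from collections import Counter


def map_stream(records):
    """Map step as a generator: yields each (key, 1) pair as soon as it is produced."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1


def online_reduce(pairs, snapshot_every=3):
    """Reduce step that consumes pairs as they arrive and yields running snapshots.

    snapshot_every is a hypothetical knob: how many pairs to absorb between snapshots.
    """
    totals = Counter()
    seen = 0
    for key, value in pairs:
        totals[key] += value
        seen += 1
        if seen % snapshot_every == 0:
            yield dict(totals)  # early answer, based on the data seen so far
    yield dict(totals)  # final answer once the stream ends


# The reducer starts working before the mapper has finished its input.
snapshots = list(online_reduce(map_stream(["a b a", "b a c"])))
```

Nothing is written to disk between the two steps; the generator hands each pair straight to the reducer, which is the part of the batch model that HOP is described as removing.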
I can think of a lot of data-rich applications that could use this. Wall Street and search tools could really have a field day with this stuff.
Hive
I've watched "Hadoop":http://hadoop.apache.org/ for a while. I even took a trip to San Francisco once to watch "RapLeaf":http://www.rapleaf.com/ demonstrate some of its features. Hadoop is an open source implementation of the "MapReduce pattern":http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAkQFjAA&url=http%3A%2F%2Flabs.google.com%2Fpapers%2Fmapreduce-osdi04.pdf&ei=_LfbSvqlLoGsswOj9emxCQ&usg=AFQjCNGR2uEfpCUiHvw6I876kTPeiv-mUA&sig2=7v3SP3QH7MZIr7KLJhwZ8g that Google wrote about in 2004. The basic idea is that a simple MapReduce framework lets non-specialist programmers write large-scale parallel processing jobs. What you do is:
- Define the core work done on each data record in a map function
- Define a combination function that can gather processing done on various servers
- Partition your data so that it can be run by many machines at once
- Run your algorithms in the MapReduce pattern
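The steps above can be sketched as a toy word count in plain Python. This is an in-memory illustration of the pattern, not Hadoop code: the function names are my own, and the "machines" are just slices of a list.

```python
from collections import defaultdict
from itertools import chain


def map_record(line):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Combination (reduce) step: merge all values gathered for one key."""
    return word, sum(counts)


def partition(records, n_workers):
    """Split the input into n_workers chunks, round-robin."""
    return [records[i::n_workers] for i in range(n_workers)]


def run_mapreduce(records, n_workers=2):
    # Each chunk stands in for the share of data one machine would process.
    chunks = partition(records, n_workers)
    mapped = [
        list(chain.from_iterable(map_record(r) for r in chunk))
        for chunk in chunks
    ]

    # Shuffle: group values by key across every worker's output.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)

    return dict(reduce_counts(k, v) for k, v in grouped.items())


result = run_mapreduce(["the quick fox", "the lazy dog", "the fox"])
```

Here @result@ maps each word to its total count. In a real framework the chunks would live on different servers and the shuffle would move data over the network, but the programmer still only writes the map and reduce functions.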
The simplicity of this framework makes it a general solution. It takes away many of the complexities of parallel programming, and it has become quite popular in recent years. There are competing solutions, such as "Message Queue":http://en.wikipedia.org/wiki/Messagequeue systems and the "Linda":http://en.wikipedia.org/wiki/Linda(coordination_language) framework. These other approaches are also very useful and popular (check out these "virtual machines for Linda":http://www.lindaspaces.com/about/index.html), but they don't seem to have as much support. Maybe I'm wrong about that; I just report what I hear and see. I have messed with these other solutions, and I had working prototypes in each for my Tegu framework. However, there were more moving parts, and I wasn't convinced I had things in order. I'll need to get back to a working version of Tegu someday, since I've been blogging under its name for about a year and haven't given the world much to play with.
One thing that MapReduce has going for it is its community. The "Apache Foundation":http://www.apache.org/ has sponsored the "Hadoop project":http://hadoop.apache.org/, an open source Java implementation of the pattern.
They have also sponsored the "Hive project":http://wiki.apache.org/hadoop/Hive, a system that organizes all the data, map, and reduce elements that can go into a production system. With Hive, you have a query language that allows you to define tables, partitions, and buckets of data. You can then filter, sort, select, and manage this large-scale data in a fairly useful way. It has a command line interface, as well as a simple web-based interface.
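To get a feel for what partitions and buckets buy you, here is a rough Python sketch of the layout idea. It is not Hive and not HiveQL: the helper names, the sample rows, and the choice of @crc32@ as the bucketing hash are all assumptions for illustration. The point is that a query naming one partition only has to read that partition's data.

```python
import zlib
from collections import defaultdict


def bucket_for(value, n_buckets):
    """Assign a value to a bucket with a stable hash (crc32 is an assumption here)."""
    return zlib.crc32(str(value).encode("utf-8")) % n_buckets


def load_table(rows, partition_key, bucket_key, n_buckets):
    """Lay rows out as table[partition_value][bucket_index] -> list of rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        part = row[partition_key]
        table[part][bucket_for(row[bucket_key], n_buckets)].append(row)
    return table


def select(table, partition_value, predicate=lambda row: True):
    """Read only the requested partition, then filter its rows."""
    buckets = table.get(partition_value, {})
    return [row for rows in buckets.values() for row in rows if predicate(row)]


# Hypothetical sample data, made up for this sketch.
rows = [
    {"day": "2009-11-18", "user": "ann", "hits": 3},
    {"day": "2009-11-18", "user": "bob", "hits": 7},
    {"day": "2009-11-19", "user": "ann", "hits": 5},
]
table = load_table(rows, partition_key="day", bucket_key="user", n_buckets=4)
busy = select(table, "2009-11-18", lambda row: row["hits"] > 4)
```

The @select@ call never touches the rows for 2009-11-19, which is the kind of pruning that makes partitioned tables practical at large scale.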
I'd like to work out some examples using Hive and display them here. I'd also like to put together a "Puppet":http://reductivelabs.com/products/puppet script to install Hadoop and Hive on your systems. This ties into a larger theme that I've been thinking about for some time: data analysis with open source software. Anyway, this is a simple introduction for today.