Mahout Version 0.2 is Out
I've been watching "Mahout":http://lucene.apache.org/mahout/ for a little while, and I'm very impressed with all the activity they have going on. Mahout is a machine learning framework that runs on "Hadoop":http://hadoop.apache.org/. They seem to be doing the thing that Tegu is only aware of at this point. "Alex Handy":http://www.sdtimes.com/author/ahandy.aspx seems to be enthused as well, with his "review of Mahout":http://www.sdtimes.com/blog/post/2009/11/19/We-are-the-big-data-problem.aspx.
I think this is the one to watch. I'll be posting my experiments with the framework as they mature and are more presentable.
Dreyfus Model 1
I borrowed the idea of the Dreyfus Model from Andy Hunt’s Pragmatic Thinking and Learning. Since I keep asking people where they’re at on a particular skill on that model, I thought I’d outline it in a place where I can reference it. It’s probably good to know in general, because knowing what stage I am at on each skill I need tends to reduce frustration and focus my learning. The model as I remember it is basically:
- Novice: looking for recipes
- Advanced Beginner: able to work with the reference material nearby, not able to see the big picture
- Competent: Mostly able to work without having to re-architect very often
- Proficient: Able to work at a steady pace, able to self-correct when working
- Expert: People are seeking you out in this field
So, for what it’s worth, there it is.
Ruby and RDF
I joined the "public Ruby RDF mailing group":http://www.w3.org/2001/12/rubyrdf/intro.html a while back. It's a pretty quiet group. The resources listed on the home page are a little out of date. I asked for a status check, and there seem to be a few live projects these days. In no particular order, they are:
- "Redleaf":http://deveiate.org/projects/Redleaf is supposed to be making some advances. It provides hand-written bindings to the Redland libraries, with the promise of being a little more idiomatic.
- "ruby-rdfa":http://code.google.com/p/ruby-rdfa/, a Ruby-only RDFa parser.
- "RubyRDF":http://esw.w3.org/topic/RubyRdf is a collection of parser, storage, manipulation, and query tools.
- "Reddy":http://github.com/tommorris/reddy is meant to be a "one true way" approach to RDF with Ruby, bringing Redland, Jena, and other libraries together with one Ruby interface.
I don't think "ruby-sesame":http://github.com/pjlegato/ruby-sesame has been worked on for quite some time, but it binds to the Sesame RESTful API.
You'd probably be building your own productivity layers on top of any of these solutions. I would like to work more with all of these, pick one, and get busy on any productivity classes that I need. None of these are really hot, killer apps, however. That makes the decision a bit harder. I'm pretty open and loose about this for a little while. If anyone's had any other/conflicting experience, I'm open to suggestions.
Remote Pair Programming
I'm working more and more in pairs these days. I really enjoy what I learn from others. We tend to build more momentum on projects at the same time. I've heard from a few of the cool cats online that screen + vim + Skype is the way to go. I haven't used screen before, and, like with most things, I have to find the time to dig into it a little. In this case, there wasn't much that I needed to dig into to get screen to work for me. Here's the gist:
- pick a server to work on
- set up ~/.screenrc
- start a session
- have the other person join the session
- turn on Skype and have fun
h2. Details
In this day of virtual servers, there seems to be one available all the time. For me, I needed to set up a few users and gather some public keys to make things work out. Not too different than the regular fare.
This is what I've collected from around various blog posts for my ~/.screenrc:
hardstatus on
hardstatus alwayslastline
hardstatus string "%{rk}%H %{gk}%c %{yk}%M%d %{wk}%?%-Lw%?%{bw}%n*%f%t%?(%u)%?%{wk}%?%+Lw%?"
multiuser on
acladd nate
acladd matt
acladd aldo
The hardstatus configuration gives me a line at the bottom of the screen to share the context. The multiuser setting allows others to attach to the session. The acladd lines say who is allowed to attach.
To start a session, I've been using the following, with a name for the session after the -S:
screen -S
To have the remote person join the session:
screen -x
At some point, we also turn on Skype. At this point, we're looking at the same files, the same command line, and we can talk about what we're doing. I think I kept resisting screen because I thought there would be a little more to it. There probably is, but this seems to work for me. If I find any hiccups with this, I'll post them here.
More Useful SlenderT
I think my SlenderT library is starting to get useful:
- I've run quite a few benchmarks on it
- I've worked on the load method to make loading pretty fast
- I've implemented the query
- I've played with a fairly large database, finding that it delivers a very expressive tool
- I've written some documentation that can be found at "GitHub":http://github.com/davidrichards/slender_t
Some quick examples from the documentation:
>> db = SlenderT.load('spec/fixtures/business_triples.csv')
>> db.find('BSC', 'name', nil)
=> [["BSC", "name", "Bear Stearns"]]
That tells us that BSC means Bear Stearns. The next query tells us who Bear Stearns is known to have contributed to recently, and how much:
>> val = db.query(['?contribution', 'contributor', 'BSC'],
?> ['?contribution', 'recipient', '?recipient'],
?> ['?contribution', 'amount', '?dollars'])
=> [{"?contribution"=>"contrib285", "?dollars"=>30700.0, "?recipient"=>"Orrin Hatch"},
{"?contribution"=>"contrib284", "?dollars"=>168335.0, "?recipient"=>"Hillary Rodham Clinton"},
{"?contribution"=>"contrib287", "?dollars"=>5600.0, "?recipient"=>"Christopher Shays"},
{"?contribution"=>"contrib288", "?dollars"=>205100.0, "?recipient"=>"Christopher Dodd"},
{"?contribution"=>"contrib290", "?dollars"=>17300.0, "?recipient"=>"Frank Lautenberg"},
{"?contribution"=>"contrib286", "?dollars"=>5000.0, "?recipient"=>"Barney Frank"},
{"?contribution"=>"contrib289", "?dollars"=>13000.0, "?recipient"=>"Michael Dean Crapo"},
{"?contribution"=>"contrib294", "?dollars"=>4600.0, "?recipient"=>"Pete Sessions"},
{"?contribution"=>"contrib295", "?dollars"=>5000.0, "?recipient"=>"Paul E. Kanjorski"},
{"?contribution"=>"contrib292", "?dollars"=>6600.0, "?recipient"=>"Nita Lowey"},
{"?contribution"=>"contrib293", "?dollars"=>5000.0, "?recipient"=>"Deborah Pryce"},
{"?contribution"=>"contrib291", "?dollars"=>102260.0, "?recipient"=>"Joe Lieberman"}]
>> val.size
=> 12
It'll get some more lovin', but it's plenty good for this week's deliverables.
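For anyone curious how a query with '?variables' can work, here is a minimal sketch. It is not SlenderT's actual code: the helper names (variable?, match, query) are made up for illustration, it runs over a plain array of triples rather than an indexed store, and the ACME rows are invented sample data.

```ruby
# Hypothetical sketch of pattern matching with '?variables'.
# These helpers are illustrative only; they are not SlenderT's API.

# A term is a variable when it is a string starting with '?'.
def variable?(term)
  term.is_a?(String) && term.start_with?('?')
end

# Try to extend the bindings found so far so that the clause matches the triple.
# Returns the extended bindings, or nil when the triple does not fit.
def match(clause, triple, bound)
  result = bound.dup
  clause.zip(triple).each do |term, value|
    if variable?(term)
      # A variable bound by an earlier clause must keep the same value.
      return nil if result.key?(term) && result[term] != value
      result[term] = value
    elsif term != value
      return nil
    end
  end
  result
end

# Apply each clause in turn, keeping only bindings that satisfy all of them.
def query(triples, *clauses)
  clauses.reduce([{}]) do |bindings, clause|
    bindings.flat_map do |bound|
      triples.map { |triple| match(clause, triple, bound) }.compact
    end
  end
end

# Sample data: the first contribution mirrors the output above,
# the ACME rows are invented for contrast.
triples = [
  ['contrib285', 'contributor', 'BSC'],
  ['contrib285', 'recipient', 'Orrin Hatch'],
  ['contrib285', 'amount', 30700.0],
  ['contrib999', 'contributor', 'ACME'],
  ['contrib999', 'recipient', 'Nobody In Particular'],
  ['contrib999', 'amount', 1.0]
]

rows = query(triples,
             ['?contribution', 'contributor', 'BSC'],
             ['?contribution', 'recipient', '?recipient'],
             ['?contribution', 'amount', '?dollars'])
p rows
```

Each clause narrows the set of bindings produced by the clauses before it, which is why sharing '?contribution' across the three clauses ties the recipient and the amount to the same contribution.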
Hadoop Online Processing
"Andrew Shafer":http://stochasticresonance.wordpress.com/ told me about "HOP":http://radar.oreilly.com/2009/10/pipelining-and-real-time-analytics-with-mapreduce-online.html last night. It's a really powerful concept:
- Take Hadoop's MapReduce interface
- Allow jobs to be run online, rather than batched
- Co-schedule tasks
- Keep tasks running all the time
This speeds up the delivery of the analytics, since there is no need to read and write to HDFS for intermediate steps. That makes real-time analytics possible and tightens the feedback loop between data and analysis by quite a bit.
I can think of a lot of data-rich applications that could use this. Wall Street and search tools could really have a field day with this stuff.
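To make the pipelining idea concrete, here is a small single-process sketch in Ruby. It is not HOP's API: map_stage and reduce_stage are made-up names, and the records are invented. The point is only that the map stage hands each pair straight to the reduce stage, which publishes a running snapshot instead of waiting for the whole batch.

```ruby
# Hypothetical sketch of pipelined map and reduce stages (not HOP's API).

# Map stage: emit [word, 1] pairs one at a time as records arrive.
def map_stage(records)
  Enumerator.new do |out|
    records.each do |record|
      record.split.each { |word| out << [word, 1] }
    end
  end
end

# Reduce stage: consume pairs as they are produced and emit a snapshot
# of the running totals after every pair, with no intermediate storage.
def reduce_stage(pairs)
  Enumerator.new do |out|
    totals = Hash.new(0)
    pairs.each do |key, value|
      totals[key] += value
      out << totals.dup
    end
  end
end

# Invented sample records.
records = ['buy AAPL', 'sell AAPL', 'buy GOOG']
reduce_stage(map_stage(records)).each { |snapshot| p snapshot }
```

Because both stages are enumerators, nothing is materialized between them; each snapshot is available as soon as the pair that caused it has been mapped.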
Gem Bundler
Ryan Shaw brought this to my attention. "Gem Bundler":http://litanyagainstfear.com/blog/2009/10/14/gem-bundler-is-the-future/ handles gem dependencies in an intelligent way. This tutorial shows you how to use Gem Bundler.
SlenderT: A Simple Triples Store
I've translated a little code from "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 by Toby Segaran et al. tonight. There's a neat little Ruby gem available for your pleasure now, "slender_t":http://github.com/davidrichards/slender_t:
sudo gem install slender_t
It's a triples store. It'll get some love when I work on my recommendation engine again in the morning. I decided to go a little more robust than the "Composite pattern, as I mentioned in yesterday's post":http://blog.tegugears.com/2009/10/18/a-little-education-goes-a-long-way and store the rules in a triples store. I couldn't justify installing "Sesame":http://www.openrdf.org/ or "Redland":http://librdf.org/ for that, so there's this.
I have a few things to look at tomorrow:
- Possibly implementing it with an "RBTreeMap":http://github.com/kanwei/algorithms/blob/master/lib/containers/rbtreemap.rb from the "Algorithms":http://github.com/kanwei/algorithms gem to make it a little nicer when adding triples. Currently, I have a lazy/nasty piece of work in the add method:
index[a][b] = index[a][b] | [c]
- I could do with a better query language. There are quite a few ideas that I want to take from this project and bring back into some other projects I should finish: marginal, fathom, and overalls.
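As a point of comparison for the add method mentioned above, here is a minimal sketch of a nested index that uses a Set at the leaves, so duplicates are dropped without rebuilding an array on every add. This is an assumption about one possible shape, not SlenderT's actual implementation; the class name TinyTripleStore and the 'industry' triple are made up.

```ruby
require 'set'

# Hypothetical minimal triples store; not SlenderT's actual implementation.
class TinyTripleStore
  def initialize
    # index[subject][predicate] is a Set of objects, created on first use.
    @index = Hash.new do |by_subject, subj|
      by_subject[subj] = Hash.new { |by_pred, pred| by_pred[pred] = Set.new }
    end
  end

  # Adding to a Set ignores duplicates, replacing the array-union idiom.
  def add(subject, predicate, object)
    @index[subject][predicate] << object
    self
  end

  # nil acts as a wildcard for any of the three positions.
  def find(subject = nil, predicate = nil, object = nil)
    results = []
    @index.each do |subj, by_pred|
      next unless subject.nil? || subject == subj
      by_pred.each do |pred, objects|
        next unless predicate.nil? || predicate == pred
        objects.each do |obj|
          results << [subj, pred, obj] if object.nil? || object == obj
        end
      end
    end
    results
  end
end

store = TinyTripleStore.new
store.add('BSC', 'name', 'Bear Stearns')
store.add('BSC', 'name', 'Bear Stearns') # duplicate, silently ignored
store.add('BSC', 'industry', 'Finance')  # invented triple for illustration
p store.find('BSC', 'name', nil)
```

A single subject-first index like this makes lookups by subject cheap but scans everything for other patterns; a fuller store would keep additional indexes ordered by predicate and by object.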
Anyway, if you'd like to play around a little with semantic technologies, that may give you a start. Also, "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 and "Semantic Web for the Working Ontologist":http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564/ref=sr11?ie=UTF8&s=books&qid=1256026799&sr=1-1 are practical references worth picking up.
Hive
I've watched "Hadoop":http://hadoop.apache.org/ for a while. I even took a trip to San Francisco once to watch "RapLeaf":http://www.rapleaf.com/ demonstrate some of its features. Hadoop is an open source implementation of the "MapReduce pattern":http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAkQFjAA&url=http%3A%2F%2Flabs.google.com%2Fpapers%2Fmapreduce-osdi04.pdf&ei=_LfbSvqlLoGsswOj9emxCQ&usg=AFQjCNGR2uEfpCUiHvw6I876kTPeiv-mUA&sig2=7v3SP3QH7MZIr7KLJhwZ8g written about by Google in 2004. The basic idea is that a simple MapReduce framework lets non-specialist programmers produce large-scale parallel processing. What you do is:
- Define the core work done on each data record in a map function
- Define a combination function that can gather processing done on various servers
- Partition your data so that it can be run by many machines at once
- Run your algorithms in the MapReduce pattern
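The steps above can be sketched in a single Ruby process. This is only an illustration of the pattern, not Hadoop's API: map_record, reduce_counts, and map_reduce are made-up names, the word-count task and sample lines are invented, and the "partitions" are just array slices standing in for separate machines.

```ruby
# A single-process sketch of the MapReduce steps (hypothetical names, not Hadoop's API).

# Map: the core work done on one record, emitting [key, value] pairs.
def map_record(line)
  line.downcase.scan(/\w+/).map { |word| [word, 1] }
end

# Reduce: combine all the values gathered for one key.
def reduce_counts(key, values)
  [key, values.sum]
end

def map_reduce(records, partitions: 2)
  # Partition: each slice stands in for the share of data given to one machine.
  slice_size = [(records.size / partitions.to_f).ceil, 1].max
  mapped = records.each_slice(slice_size).flat_map do |slice|
    slice.flat_map { |record| map_record(record) }
  end

  # Shuffle: group the emitted pairs by key, then reduce each group.
  grouped = mapped.group_by(&:first)
  grouped.map { |key, pairs| reduce_counts(key, pairs.map(&:last)) }.to_h
end

# Invented sample input.
counts = map_reduce(['the quick fox', 'the lazy dog', 'the end'])
p counts
```

The map and reduce functions never see the partitioning, which is what lets a real framework spread the same two functions across many machines.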
The simplicity of this framework makes it a general solution. It takes away many of the complexities, and has become quite popular in recent years. There are competing solutions, such as "Message Queue":http://en.wikipedia.org/wiki/Message_queue systems and the "Linda":http://en.wikipedia.org/wiki/Linda_(coordination_language) framework. These other approaches are also very useful and popular (check out these "virtual machines for Linda":http://www.lindaspaces.com/about/index.html), but they don't seem to have as much support. Maybe I'm wrong about that; I just report what I hear and see. I have messed with these other solutions, and I had working prototypes in each for my Tegu framework. However, there were more moving parts and I wasn't convinced I had things in order. I'll need to get back to a working version of Tegu someday, since I've been blogging under its name for about a year and haven't given the world much to play with.
One thing that the MapReduce community has going for it is the community itself. The "Apache Foundation":http://www.apache.org/ has sponsored the "Hadoop project":http://hadoop.apache.org/, an open source Java implementation of the pattern.
They have also sponsored the "Hive project":http://wiki.apache.org/hadoop/Hive, a system that organizes all the data, map, and reduce elements that can go into a production system. With Hive, you have a query language that allows you to define tables, partitions, and buckets of data. You can then filter, sort, select, and manage this large-scale data in a fairly useful way. It has a command line interface, as well as a simple web-based interface.
I'd like to work out some examples using Hive and display them here. I'd also like to put together a "Puppet":http://reductivelabs.com/products/puppet script to install Hadoop and Hive on your systems. This goes into a larger theme that I've been thinking about for some time: data analysis with open source software. Anyway, this is a simple introduction for today.
Gemcutter Gets My Vote
"Gemcutter":http://gemcutter.org/ is an example of a tool built to do just one thing. And they got it right.
I just had about 5 minutes to get started on migrating my gems to "gemcutter":http://gemcutter.org/. I trusted Ryan Bates' "Railscast":http://railscasts.com/episodes/183-gemcutter-jeweler to give me the lowdown and any pragmatic tricks I may want to use. There was nothing to it. Basically:
- sign up
- install gemcutter
- gem tumble
- gem push
I especially like that they added things like gem tumble to my command line. Gem tumble toggles whether gemcutter is in my gem sources list. It's a feature that reduces barriers in a meaningful and simple way.
So, I put "dataframe":http://github.com/davidrichards/dataframe on gemcutter. One down, 63 to go.