Mahout Version 0.2 is Out
I've been watching "Mahout":http://lucene.apache.org/mahout/ for a little while. I'm very impressed with all the activity they have going on. Mahout is a machine learning framework for the "Hadoop":http://hadoop.apache.org/ framework. They seem to be doing the thing that Tegu only is aware of, at this point. "Alex Handy":http://www.sdtimes.com/author/ahandy.aspx seems to be enthused as well, with his "review of Mahout":http://www.sdtimes.com/blog/post/2009/11/19/We-are-the-big-data-problem.aspx
I think this is the one to watch. I'll be posting my experiments with the framework as they mature and are more presentable.
Hadoop Online Processing
"Andrew Shafer":http://stochasticresonance.wordpress.com/ told me about "HOP":http://radar.oreilly.com/2009/10/pipelining-and-real-time-analytics-with-mapreduce-online.html last night. It's a really powerful concept:
- Take Hadoop's MapReduce interface
- Allow jobs to be run online, rather than batched
- Co-schedule tasks
- Keep tasks running all the time
This speeds up the delivery of the analytics: no need to read and write to HDFS for intermediate steps. This provides real-time analytics. This cleans up the feedback loop between data and analysis by quite a bit.
I can think of a lot of data rich applications that could use this. Wall Street and search tools could really have a hay day with this stuff.
Hive
I've watched "Hadoop":http://hadoop.apache.org/ for a while. I even took a trip to San Francisco once to watch "RapLeaf":http://www.rapleaf.com/ demonstrate some of its features. It's an open source version of the "MapReduce pattern":http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAkQFjAA&url=http%3A%2F%2Flabs.google.com%2Fpapers%2Fmapreduce-osdi04.pdf&ei=_LfbSvqlLoGsswOj9emxCQ&usg=AFQjCNGR2uEfpCUiHvw6I876kTPeiv-mUA&sig2=7v3SP3QH7MZIr7KLJhwZ8g written about by Google in 2005. The basic idea is that you can have large-scale parallel processing produced by non-specialist programmers by providing a simple MapReduce framework. What you do is:
Define the core work done on each data record in a map function
Define a combination function that can gather processing done on various servers
Partition your data so that it can be run by many machines at once
Run your algorithms in the MapReduce pattern
The simplicity of this framework makes it a general solution. It takes away many of the complexities, and has become quite popular in recent years. There are competing solutions, such as "Message Queue":http://en.wikipedia.org/wiki/Messagequeue systems and the "Linda":http://en.wikipedia.org/wiki/Linda(coordination_language) framework. These other approaches are also very useful and popular (check out these "virtual machines for Linda":http://www.lindaspaces.com/about/index.html), but they don't seem to have as much support. Maybe I'm wrong about that, I just report what I hear and see. I have messed with these other solutions, and I had working prototypes in each for my Tegu framework. However, there were more moving parts and I wasn't convinced I had things in order. I'll need to get back to a working version of Tegu someday, since I've been blogging under its name for about a year and haven't given the world much to play with.
One thing that the MapReduce community has going for it is the community itself. The "Apache Foundation":http://www.apache.org/ has sponsored the "Hadoop project":http://hadoop.apache.org/, an open source Java implementation of the pattern.
They have also sponsored the "Hive project":http://wiki.apache.org/hadoop/Hive, a system that organizes all the data, map, and reduce elements that can go into a production system. With Hive, you have a query language that allows you to define tables, partitions, and buckets of data. You can then filter, sort, select, and manage this large-scale data in a fairly useful way. It has a command line interface, as well as a simple web-based interface.
I'd like to work out some examples using Hive and display them here. I'd also like to put together a "Puppet":http://reductivelabs.com/products/puppet script to install Hadoop and Hive on your systems. This goes into a larger theme that I've been thinking about for some time, data analysis with open source software. Anyway, this is a simple introduction for today.
Adventures in Screen Scraping 2
These last few days I've been scraping a site. A client uses a certain supplier who is still in the stone ages technologically. Getting reliable data for his online store has been a mountain to climb. I wrote a screen scraper with "Ruby's Mechanize":http://mechanize.rubyforge.org/mechanize/, and really enjoyed that so much could be done with so little.
I think all screen scraping projects introduce some of the same issues:
- Dealing with a lot of data
- Managing memory
- Caching page views
- Dealing with threads or events
- Verifying a batch of data
My path to managing these issues was a little meandering, but interesting. First, I got the wow factor from getting data from a site. Mechanize really came through for me. Next, I implemented a 6 or 7 step approach to logging into the site, browsing to the right product page, scraping the data, and transforming things into what I needed them to be. That was good, but error prone, I thought. I wrote my own DSL for making that easier. I really enjoyed how cleanly I could handle caching and stacking concepts neatly together. I could see I would be able to reuse my code with little trouble. After a while, I realized that I wasn't going to need to reuse my code, that the problem really is just plain-ol' validation, the ability to write a method and have a spec ensure it's doing what I thought it should. It turns out it was harder to understand and test code written in my DSL.
I also had some sort of resistance to revisiting my scraping code. By making my code context-free, I solved some interesting problems, but I alienated myself from the code. I was afraid that I'd have unintended consequences for changing this code that is supposed to have no side effects. It all boiled down to not trusting my DSL yet. I've been doing a lot of test-driven development lately, and I had written 40 or 50 tests on the DSL as I wrote it. What I think I really did was over-engineer a problem that was technically correct and pristine, but practically useless. So, I started writing tests for my implementation of the DSL and as I refactored, I ended up stripping all of my DSL out of the project and using plain Ruby instead. It was cleaner, a little faster, and fixed a few issues that I hadn't found earlier in my implementation.
That left me with just one more problem on this project, caching. A lot of the process is repetitive, and some of the network activity can be cached and optimized. My little "repositories gem":http://github.com/davidrichards/repositories/tree/master came in handy. I suspected some intermediate data I was creating would be useful for some analysis down the road, so I refactored my simple eviction policies to take callbacks on eviction. Those changes are released to GitHub this morning. That way I could append the old data to a file in a reloadable format.
All in all, it was an enjoyable adventure with screen scraping. Although the problem was ultimately solvable, RESTful services sure would have been a lot easier to manage. It makes me appreciate the power of frameworks and communities.
8 Reasons Why Puppet Rocks 2
I've been taking the time lately to figure out "Puppet":http://reductivelabs.com/products/puppet/. It's a great server configuration tool. Here are the reasons I think so:
- I can automate: whatever I did once I can do again
- I can stage: a sys-admin version of automated testing
- I can refactor: whatever I did I can improve
- I can share: puppet recipes tend to be portable across platforms
- I can borrow: friends' puppet recipes tend to work for me
- I can deliver: my time isn't spent waiting for downloads and application builds
- I can monitor: simple reporting gives me an idea about how things are working
- I can replace myself: standardized scripts and a support community means someone else can be recruited to do these tasks, building on what I've done already
This doesn't mean that Puppet is 100% complete. I'm excited to play with the new RESTful interface "Reductive Labs":http://reductivelabs.com/ is putting out. There are bugs being worked out all the time. There are ideas that still need some standardization. But, I'm excited that I've got Puppet in my toolbelt now.
Erlang as a System Monitor
I have this wild and crazy idea that I just may do if someone else wants to pair program with me. Maybe you can tell from this blog that I can get quite scatter-brained, working on a lot of projects, improving each incrementally. This isn't a project I'd start without some real commitment to following through and finishing what I started.
The idea is to grease the skids a little in my brain regarding Erlang and put together a file system management tool that would really inform the user about the kind of content he has on his system in a pragmatic way. I thought I'd have these general elements in place:
- A full-text search engine that is fast and concurrent
- A dictionary of meta information on each file
- A classification system that adds semantic meaning to files, like project name, language, development project, installed library, active project, active when, downloaded?, original URI, etc.
It's kind of a combination of several machine learning skills with Erlang. Why Erlang? Because I want to see how portable I can make it, and how integratable it can be to other systems. I also want to demonstrate how quickly and non-intrusively I can make a heavy-lifting app work with Erlang. It would be fun, useful, and maybe not too much work.
I think we can use Joe Armstrong's full-text search engine as a starting point. Also, the command-line command, lsof, is a very powerful way to see what's happening on a system, possibly files being downloaded, etc.
Anyway, this is an open invitation, if it sounds good to anyone.
Sirb is TeguGears
I have this code that's coming together for Ruby 1.9. I've really been inspired by the way Fiber works in 1.9, and I think that I can make it sing. So, I'm branching Sirb and making it TeguGears. I'll probably release my work in time for "Mountain West Ruby Conference":http://mtnwestrubyconf.org/2009/ next weekend, but I am too excited tonight and want to blog about what I'm thinking.
So, the concept is basic:
- take some decent base libraries (like Vector, Array, RNum, and NArray)
- make sure that some good descriptive statistics are available to them
- build these tools up with Ruby and support it with R when necessary or pragmatic
- make sure these tools have concurrent and distributed forms that work well.
It's this last two ideas that are expanding for me tonight. I've been studying Erlang in my spare time, and I'm really impressed with their simple API:
- For concurrent programming: ** spawn ** send ** receive
- For error control: ** link ** exit values ** spawn_link ** keep alive ** system processes
- For distributed programming: ** node ** nodes ** is_alive ** monitor_node ** disconnect_node
Not a bad list. I love how short it is. I can almost repeat the whole list from memory in the car. My task is to combine various tools to provide this kind of interface inside of TeguGears. I'll gut out some of how Tegu was working on these issues (my old code is over-architected and therefore never complete). What I'm thinking today:
- Move my Ruby implementations into Fiber-enriched, composable structures: Filter and Operator
- Add my ThreadPool code from another project and make it a module that can make any class' algorithms concurrent
- Change the ThreadPool interface to handle a simpler API: spawn, send, receive, link, spawnlink, keepalive, and system_process
- Add a distributed component with a simple API: node, nodes, isalive, monitornode, disconnect_node
I'm not sure how I'll implement the distributed portion of TeguGears. My choices seem to be Rinda and Starling. I could also look again at Active MQ, Rosetta Queue (when they release), or possibly another message queuing system (whose name I forget at this hour).
I think I've created enough design for the project to branch Sirb and begin work. This would be, I think, a strictly Ruby 1.9 gem. I don't think I like my composition solution in Sirb (too monkey-patched for general use, I think). I should probably write more concrete examples for what these concepts can do in the real world. My huddle gem will be a good showcase piece. It relies on the tools in Sirb to implement clustering algorithms.
Anyway, those are my thoughts for tonight. They were encouraged by a "Dave Thomas article":http://pragdave.blogs.pragprog.com/pragdave/2008/01/pipelines-using.html that showed me I can have much cleaner thinking on the matter. I think I'm going to turn in and get an earlier start on my day tomorrow.
What Would a Good Graph Database Look Like? 1
I've been working a lot with graphs, using various graph libraries for Ruby. I like these, and I can get a lot done with them. However, none of the Ruby graph libraries have optimization features baked in, and none of them provide transparent persistence. So, I've been Googling around, to see what's out there. There are some pretty neat solutions that may be the way to go. As I look at the adopt/build decision for my uses, I'll use the following criteria to guide my search. Some of these ideas came directly from Renzo Anglez and Claudio Gutierrez' presentation, "Querying from a Graph Database Perspective: the case of RDF":http://www.google.com/url?sa=t&source=web&ct=res&cd=1&url=http%3A%2F%2Fwww.ciw.cl%2Fmaterial%2Firw-2005%2F2005-irw-gutierrez.pdf&ei=mx-fSdiZFor2sAOn2P3MCQ&usg=AFQjCNFKiV0woNWvkOG91dtF7gOh2AlzNg&sig2=LP9byU_vM92oU5lz05vm5A
So, a good graph database should probably offer:
- path queries (Are nodes a and b related? What is the shortest route between a and b? What is the influence of node a?)
- distance queries (What is the distance between nodes a and b?)
- pattern matching (Where do these patterns appear? How much of a pattern is matched?)
- adjacency queries (All nodes that are separated by no more than x degrees from a vertex. All the edges, given the same parameters.)
- degree queries (All nodes with at least/most x in nodes or out nodes or nodes in general)
- concurrency (Shards well in a cluster)
- indexing isomorphic subgraphs (fast lookup for some of the above-mentioned queries)
- adjacency matrices for the whole graph or subgraphs (transparent use of the graph as a graph, or as a matrix)
- a wide range of graph types (the "graph database":http://amalfi.dis.unina.it/graph/ is a dataset of graphs, with 84 distinct types of graphs available)
So, this is really two problems:
- finding/creating a language that expresses these queries
- implementing these ideas efficiently in the graph
I like the Neo4j and Neo4j.rb, and I'd like to run a couple of use cases through that first. There are some interesting resources "here":http://github.com/andreasronge/neo4j/tree/master. I really liked the "slides found here":http://jaikoo.com/assets/presentations/neo4j.pdf
If that doesn't work out, I want to take another look at some RDF solutions. At the end of the day, I may be interested in building an Erlang-based solution, if everything else fails.
My San Francisco Trip 1
I had a great time in San Francisco this week. I met with other Hadoop enthusiasts at RapLeaf's offices and with Joel Dudley in Palo Alto.
The RapLeaf session was very informative. We covered three topics:
- Collector by Bryan Duxbury
- Katta by Stefan Groshupf
- Tuning and Debugging Hadoop by Arun Murthy
Collector is an interesting utility. Given the need to append a large file, how would you do it? You already have a cluster that you're using, right? You can write some scripts, some shell wizardry, or you can use a compact and simplified tool. Collector is the compact and simplified approach. It is meant to be used like a Sink Pattern. Collector offers durability, simplicity, and scalability. RapLeaf is extracting this code from their core systems, and then will release it as open-sourced software.
Katta is basically Lucene + Hadoop + Zookeeper. What excites me about this technology is that it's been hardened with experience. Real applications have needed to evolve into this product. Basically, given the ability to build a Lucene index out of raw data, this adds the ability to manage that index, and scale it as large as it needs to go. In essence, it builds a series of indexes and merges the indexes as needed. It is possibly a better migration path from single Lucene or Solr indexes.
The Tuning and Debugging conversation was hard for me to follow. The people around me also had this experience. Arun covered too much information too quickly. The good news is, we have the slides, and someone that wants to use these will be able to follow along and figure out a better solution more effectively than I did. Meanwhile, I've filled my personal wiki with ideas from this talk. I think this deserves to become a reinforcement learning algorithm that ships with Hadoop, something I could accomplish inside of Tegu and then release as production-ready code. We'll see, it's on my list.
During and after the meeting, everyone kept referring to Thrift. It appears that Tegu is listening to a very valid concern out there, because the intent of Thrift is very similar to the intent of Tegu. From the Thrift web page, thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml. According to the people in the meeting, it's still very alpha, but very promising. Thrift could also work well with Tegu, building some of the services as Thrift services. We'll see.
I had a little bit of a conversation with Bryan after the meeting. Bryan had been quite an avid Rubyist before he bumped up against some of the performance issues that Hadoop solved better than his own version of Map Reduce. I don't remember what it's called, and didn't find it in a Google search. But he's contributed to some Ruby gems that RapLeaf supports, found here. I appreciate how he thinks about software development: you build your productivity layer, and you use the tools you have. That's how he talks about his transition from Ruby to Java.
Tegu has been that productivity layer for me, for someone wanting to use machine learning in their own work. It's written in Ruby, because the tools I wanted weren't in Ruby. In other words, I wanted to bring things to the same table, and I didn't want to rebuild libraries. So, I think Bryan's and my philosophies being pragmatic with which tools to use are similar.
The meeting with Joel was very exciting. He's calm, brilliant, pragmatic, and full of ideas. We're collaborating on a project, that I'll elaborate on at another time.