Hive

Posted by David Richards Mon, 19 Oct 2009 01:21:00 GMT

I've watched "Hadoop":http://hadoop.apache.org/ for a while. I even took a trip to San Francisco once to watch "RapLeaf":http://www.rapleaf.com/ demonstrate some of its features. Hadoop is an open source version of the "MapReduce pattern":http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAkQFjAA&url=http%3A%2F%2Flabs.google.com%2Fpapers%2Fmapreduce-osdi04.pdf&ei=_LfbSvqlLoGsswOj9emxCQ&usg=AFQjCNGR2uEfpCUiHvw6I876kTPeiv-mUA&sig2=7v3SP3QH7MZIr7KLJhwZ8g written about by Google in 2004. The basic idea is that a simple MapReduce framework lets non-specialist programmers write large-scale parallel processing. What you do is:

  • Define the core work done on each data record in a map function
  • Define a combination function that can gather processing done on various servers
  • Partition your data so that it can be run by many machines at once
  • Run your algorithms in the MapReduce pattern
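The steps above can be sketched in plain Ruby. This is a toy, single-process word count, not Hadoop's API: the method names (map_record, reduce_counts, run_map_reduce) and the partition count are assumptions made up for illustration, and the "machines" are just array slices.

```ruby
# A toy, single-process illustration of the MapReduce steps.
# All method names here are hypothetical; none of this is Hadoop's API.

# Map: the core work done on each record. Emits [key, value] pairs.
def map_record(line)
  line.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Reduce: combine all the values gathered for one key.
def reduce_counts(_word, counts)
  counts.inject(0) { |sum, n| sum + n }
end

def run_map_reduce(records, partitions = 2)
  # Partition: split the records so each slice could go to its own machine.
  slice_size = [(records.size / partitions.to_f).ceil, 1].max
  chunks = records.each_slice(slice_size).to_a

  # Map: each partition is processed independently of the others.
  mapped = chunks.map { |chunk| chunk.flat_map { |record| map_record(record) } }

  # Shuffle: gather the values emitted for each key across partitions.
  grouped = Hash.new { |hash, key| hash[key] = [] }
  mapped.each { |pairs| pairs.each { |key, value| grouped[key] << value } }

  # Reduce: one combined result per key.
  result = {}
  grouped.each { |key, values| result[key] = reduce_counts(key, values) }
  result
end

counts = run_map_reduce(["the quick fox", "the lazy dog", "The end"])
```

A real framework adds what this sketch leaves out: moving data between servers, retrying failed tasks, and sorting the keys during the shuffle.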

The simplicity of this framework makes it a general solution. It takes away many of the complexities, and has become quite popular in recent years. There are competing solutions, such as "Message Queue":http://en.wikipedia.org/wiki/Messagequeue systems and the "Linda":http://en.wikipedia.org/wiki/Linda(coordination_language) framework. These other approaches are also very useful and popular (check out these "virtual machines for Linda":http://www.lindaspaces.com/about/index.html), but they don't seem to have as much support. Maybe I'm wrong about that; I just report what I hear and see. I have messed with these other solutions, and I had working prototypes in each for my Tegu framework. However, there were more moving parts and I wasn't convinced I had things in order. I'll need to get back to a working version of Tegu someday, since I've been blogging under its name for about a year and haven't given the world much to play with.

One thing that the MapReduce community has going for it is the community itself. The "Apache Foundation":http://www.apache.org/ has sponsored the "Hadoop project":http://hadoop.apache.org/, an open source Java implementation of the pattern.

They have also sponsored the "Hive project":http://wiki.apache.org/hadoop/Hive, a system that organizes all the data, map, and reduce elements that can go into a production system. With Hive, you have a query language that allows you to define tables, partitions, and buckets of data. You can then filter, sort, select, and manage this large-scale data in a fairly useful way. It has a command line interface, as well as a simple web-based interface.

I'd like to work out some examples using Hive and display them here. I'd also like to put together a "Puppet":http://reductivelabs.com/products/puppet script to install Hadoop and Hive on your systems. This fits into a larger theme that I've been thinking about for some time: data analysis with open source software. Anyway, this is a simple introduction for today.

Reinforcement Learning in the Wild

Posted by David Richards Sun, 18 Oct 2009 20:40:00 GMT

I've been studying reinforcement learning a bit lately. As I do, I start to see the feedback loops more clearly all the time.

If you look at the deleted scenes on "A Beautiful Mind":http://www.imdb.com/title/tt0268978/, they go more deeply into why John Nash had a hissy fit over losing the game Go. The whole story was that it was mathematically inconsistent. Meaning, the rules he had derived from studying the game couldn't be consistently applied to produce a consistent result: winning. The issue, I suspect, is that he didn't have a complete policy, so the game seemed inconsistent. John Nash responded to this event by inventing a game more challenging than Go that was complete in his eyes.

Gemcutter Gets My Vote

Posted by David Richards Sun, 18 Oct 2009 20:30:00 GMT

"Gemcutter":http://gemcutter.org/ is an example of a tool built to do just one thing. And they got it right.

I just had about 5 minutes to get started on migrating my gems to "gemcutter":http://gemcutter.org/. I trusted Ryan Bates' "Rails Cast":http://railscasts.com/episodes/183-gemcutter-jeweler to give me the lowdown and any pragmatic tricks I may want to use. There was nothing to it. Basically:

  • sign up
  • install gemcutter
  • gem tumble
  • gem push

I especially like that they added things like gem tumble to my command line. Gem tumble toggles whether gemcutter is in my gem sources list. It's a feature that reduces barriers in a meaningful and simple way.

So, I put "dataframe":http://github.com/davidrichards/dataframe on gemcutter. One down, 63 to go.

A Little Education Goes a Long Way

Posted by David Richards Sun, 18 Oct 2009 20:17:00 GMT

Maybe this is a hammer and nail thing: I have a hammer in my hand, I see the world as a nail. Maybe this is a synchronicity thing: the happy occurrence of things happening at an opportune time, without any rational causal connection. Whatever it is, I've enjoyed using the Composite pattern lately.

I was given a task at work, to extract some XML and record the pertinent data in a database. The tricky part came from figuring out how to map the two together. There were a lot of conditions that couldn't be applied in all circumstances. So, I pulled out my trusty "Design Patterns in Ruby":http://www.amazon.com/Design-Patterns-Ruby-Russ-Olsen/dp/0321490452/ref=sr11?ie=UTF8&s=books&qid=1255897292&sr=8-1 and remembered the Composite pattern. Basically, I just encapsulate an operation in a class, then combine this operation where it makes sense. In this way, I can have And, Or, and Not operations tied to basic rules, so that I can combine ideas, keep their meanings clear, and be very flexible for future needs.
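A minimal sketch of that idea, assuming each extracted XML record has already been turned into a Hash. The class names (Rule, AndRule, OrRule, NotRule) and the sample fields are hypothetical, made up for illustration rather than taken from the work project or the book.

```ruby
# Composite pattern sketch: leaf rules and combinators share one interface,
# satisfied_by?(record), so any combination can be used like a single rule.
# All class names and record fields here are hypothetical.

# Leaf: one basic rule wrapped around a block.
class Rule
  def initialize(description, &test)
    @description = description
    @test = test
  end

  def satisfied_by?(record)
    @test.call(record)
  end

  def to_s
    @description
  end
end

# Composite: true only when every child rule is satisfied.
class AndRule
  def initialize(*rules)
    @rules = rules
  end

  def satisfied_by?(record)
    @rules.all? { |rule| rule.satisfied_by?(record) }
  end

  def to_s
    "(" + @rules.map(&:to_s).join(" and ") + ")"
  end
end

# Composite: true when at least one child rule is satisfied.
class OrRule
  def initialize(*rules)
    @rules = rules
  end

  def satisfied_by?(record)
    @rules.any? { |rule| rule.satisfied_by?(record) }
  end

  def to_s
    "(" + @rules.map(&:to_s).join(" or ") + ")"
  end
end

# Composite around a single child: inverts its answer.
class NotRule
  def initialize(rule)
    @rule = rule
  end

  def satisfied_by?(record)
    !@rule.satisfied_by?(record)
  end

  def to_s
    "not #{@rule}"
  end
end

active        = Rule.new("is active")          { |r| r["status"] == "active" }
backordered   = Rule.new("is backordered")     { |r| r["status"] == "backordered" }
missing_price = Rule.new("is missing a price") { |r| r["price"].to_s.strip.empty? }

importable = AndRule.new(OrRule.new(active, backordered), NotRule.new(missing_price))
```

Because a combined rule answers the same satisfied_by? call as a basic one, importable can itself be nested inside another AndRule or OrRule when a new condition turns up, and to_s keeps the meaning of the combination readable.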

Once I had figured that out, I could see how I would want to do this same thing in my array_stats gem. I am extending the just_enumerable_stats gem in a new, cleaner gem. One main feature I am adding is the ability to cache the meta data tied to an array in a meaningful way. So, the pattern for caching could get complicated, but a few building blocks make it a lot simpler to code up.

Next, I needed to figure out a recommendation system for cars being serviced. The manufacturers recommend various services:

  • Oil changes
  • Belt replacements
  • Warranty replacements
  • Factory recalls

The timing and application of all these things can get rather complex, except that I have this handy tool in my pocket, the Composite pattern.

I've used this pattern before, but I often get busy doing other things, and I don't have to go back to old tricks for a while. It's good to review and learn.

Books I'm Picking Up

Posted by David Richards Sun, 18 Oct 2009 20:04:00 GMT

I've just bought three books, and all of them look very interesting. Maybe you'd be interested in them as well.

h2. "The Art and Science of CSS":http://www.sitepoint.com/books/cssdesign1/?SID=d9a8bf46b62f9bfb898f4bad122498d0 by Cameron Adams et al

This seems very practical and to the point. It covers the basics, but in a not-so-basic way:

  • Headings
  • Images
  • Backgrounds
  • Navigation
  • Forms
  • Rounded Corners
  • Tables

I bought the book to work on some of my forms, and found that there are many other useful things in the book.

h2. "The Laws of Simplicity":http://www.amazon.com/Laws-Simplicity-Design-Technology-Business/dp/0262134721/ref=sr11?ie=UTF8&s=books&qid=1255896482&sr=8-1 by John Maeda

This was mentioned in a screencast I reviewed today, and I thought that it had a lot of promise. The Ten Laws:

  • Reduce
  • Organize
  • Time
  • Learn
  • Differences
  • Context
  • Emotion
  • Trust
  • Failure
  • The One

Basically, Simplicity = Sanity

h2. "Thinking in Systems: A Primer":http://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557/ref=sr11?ie=UTF8&s=books&qid=1255896645&sr=1-1 by Donella H. Meadows

This book, from 1972, has influenced a lot of my life. Many of the books I've read and the approaches I take to machine learning are embedded in these philosophies. In fact, the "Systems Science":http://www.pdx.edu/sysc/ program I attended in Portland is based on these foundations.

The basics:

  • System Structure and Behavior
  • Systems and Us
  • Creating Change-In Systems and in Our Philosophy

I'm excited to get all of these. I bought the CSS book as a PDF; the others are coming via Amazon.

Front End Frustrations (and Some Solutions) 1

Posted by David Richards Sun, 18 Oct 2009 19:39:00 GMT

I am a novice when it comes to designing user interfaces. For me, that means creating Rails views. This blog really isn't about Rails or Design, but there is a part of TeguLabs that depends on it, and I keep delaying the delivery of relatively simple solutions because of these issues. For instance, I've let the front page of this site remain "under construction" for a very long time. I should fix that. Meanwhile, I have some very important deliverables that I need to address right away.

In a nutshell, these are the technologies I'm using to deliver a user interface:

  • I'm using "Blueprint":http://www.blueprintcss.org/ from inside "Compass":http://compass-style.org/ so that I have a CSS framework that uses plugins instead of creating everything from scratch.
  • "Haml and Sass":http://haml.hamptoncatlin.com/ capture the semantics of my HTML and CSS without asking me to repeat myself all the time.
  • I use "jQuery":http://jquery.com/ to create unobtrusive, standardized JavaScript with a lot of pre-built plugins available to me.
  • "Formtastic":http://github.com/justinfrench/formtastic serves as my form builder, reducing a lot of my work to semantic markup instead of getting into the nitty-gritty of each field every time I define one.

Data Frame and Meta Data

Posted by David Richards Fri, 16 Oct 2009 02:50:00 GMT

There's a code smell in "dataframe":http://github.com/davidrichards/dataframe. It is beginning to be a little difficult to add some features that deal with cached values. Data frame is supposed to take a series of named fields and let you do interesting things with it:

There's also a bug there. Can you see it? We lost the category for wind. The last line should have been [0,1] like it was before. Now, I think I could fix this pretty easily, but I am beginning to see that I need to have a better way to work with data. As it sits:

  • just_enumerable_stats does a lot of the fancy work under the covers
  • The meta data in a column is not explicitly defined
  • The meta data does not follow a copied array the way that I copied it
  • data_frame does not have explicit meta data
  • The meta data from both just_enumerable_stats and data_frame cannot be extracted out of their objects

There are a few things that I want to fix, then:

First, I want to be able to extract the meta data out of arrays and data frames to be used on other data sets. I would like to preprocess a data sample, generate its labels, categories, max, min, standard deviation, etc., and then have that available as the context for processing a whole data stream. By making this change, I don't have to limit myself to data that can fit in memory.
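A rough sketch of what extracted meta data could look like, assuming a numeric column. ColumnMeta and its fields are hypothetical names for illustration; they are not part of dataframe or any existing gem, and the standard deviation here is the population form.

```ruby
# Hypothetical sketch: meta data that lives apart from the array it describes,
# so it can be computed once from a sample and reused on other data.
ColumnMeta = Struct.new(:min, :max, :mean, :standard_deviation, :categories) do
  # Build the meta data from a sample small enough to fit in memory.
  def self.from_sample(values)
    raise ArgumentError, "sample must not be empty" if values.empty?

    n = values.size.to_f
    mean = values.inject(0.0) { |sum, v| sum + v } / n
    variance = values.inject(0.0) { |sum, v| sum + (v - mean)**2 } / n
    new(values.min, values.max, mean, Math.sqrt(variance), values.uniq.sort)
  end

  # Scale a value using the stored bounds; the sample itself is not needed.
  def normalize(value)
    return 0.0 if max == min
    (value - min) / (max - min).to_f
  end
end

meta = ColumnMeta.from_sample([2, 4, 4, 4, 5, 5, 7, 9])
scaled = [3, 6, 9].map { |v| meta.normalize(v) }
```

The point of the design is that meta holds no reference to the sample, so the same object can serve as the context for records read one at a time from a file or a socket.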

Second, just_enumerable_stats is a dumb name for a gem. I started it as a quick fix for a simple program. Then I kept adding methods to it every time I had a new use. Pretty soon, it became a very interesting collection of methods, but one that really only works on Array data. I can't use this on some custom class that includes the Enumerable module, because I have too many dependencies on arrays. Hashes can't use it. So, this is really array_stats, and I should make that gem instead and start depending on that.

So, this is where I want to take dataframe, just_enumerable_stats, and array_stats. I'll see if I can get back to it this weekend. I'm excited to work on it, but I have some code to turn in for some clients before tomorrow evening, and I'd better not delay anymore.

Preprocessing 1

Posted by David Richards Fri, 16 Oct 2009 01:41:00 GMT

"Red Davis":http://redwriteshere.com/ posed an interesting question to me today. How do you normalize a streaming data set? Meaning, if I have data with an unknown maximum or minimum value, and I have too much data to search through it to find the exact answers, how do I work with that?

The answer we came up with was a bit of backtracking:

  • Either provide a sample data set or an estimated maximum and minimum value
  • Calculate the maximum and minimum based on a normal distribution
  • Start normalizing the data stream
  • Recalculate the max and min if the data exceeds the bounds

At that point, we can use a series of heuristics or learning algorithms to decide how to set up the normalized data.
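Here is one way the four steps in the list could look in Ruby. It is a sketch under assumptions: StreamNormalizer is a hypothetical class, and the default of three standard deviations on each side of the sample mean is one arbitrary reading of "based on a normal distribution".

```ruby
# Hypothetical sketch of min-max normalization over a stream whose true
# bounds are unknown. Not taken from any existing gem.
class StreamNormalizer
  attr_reader :min, :max, :rescales

  # Start from an estimated minimum and maximum.
  def initialize(min, max)
    @min = min.to_f
    @max = max.to_f
    @rescales = 0
  end

  # Or start from a sample: assume roughly normal data and place the bounds
  # z standard deviations either side of the sample mean (z is an assumption).
  def self.from_sample(sample, z = 3.0)
    raise ArgumentError, "sample must not be empty" if sample.empty?

    n = sample.size.to_f
    mean = sample.inject(0.0) { |sum, v| sum + v } / n
    sd = Math.sqrt(sample.inject(0.0) { |sum, v| sum + (v - mean)**2 } / n)
    new(mean - z * sd, mean + z * sd)
  end

  # Normalize one value, widening the bounds first if the value exceeds them.
  def normalize(value)
    if value < @min || value > @max
      @min = [@min, value].min
      @max = [@max, value].max
      @rescales += 1
    end
    return 0.0 if @max == @min
    (value - @min) / (@max - @min)
  end
end

normalizer = StreamNormalizer.new(0, 10)
first  = normalizer.normalize(5)   # within the estimated bounds
spike  = normalizer.normalize(20)  # exceeds the bounds, so they widen
second = normalizer.normalize(5)   # same input, now on the wider scale
```

Values emitted before a rescale sit on the old scale, which is where the backtracking comes in: the rescales counter gives the caller a cheap way to notice that earlier output needs to be revisited.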

The thing about a normalized data set is that it is internally consistent. Meaning, we are adjusting for any skew that a parameter might have on a neural network, say, because its values were recorded between 1 and a million, versus values that were recorded between 1 and 10. To a learning algorithm, the different states are just different states, and normalizing the data keeps it that way.

I put together a little piece of code that might be useful for this kind of problem, found here:

Shogun Shotgun

Posted by David Richards Mon, 12 Oct 2009 23:59:00 GMT

So, I have enjoyed what I have learned lately about "Shogun":http://www.shogun-toolbox.org/. It is a unified interface for several SVM libraries. Today, I can run some sophisticated analysis from R, Octave, or Python, but not Ruby. I was looking through the "Python Interface":http://svn.tuebingen.mpg.de/shogun/releases/shogun0.8.0/src/python/PythonInterface.cpp (also "here":http://svn.tuebingen.mpg.de/shogun/releases/shogun0.8.0/src/python/PythonInterface.h), and it doesn't look too difficult.

I think, taking an honest look at my C/C++ skills (or lack thereof), I realize I should probably pair on this with someone. My programming classes used C, and I never took C++ very seriously, so I may have a bit of a learning curve ahead of me. I know I'm going to have to pay attention to the "Extending Ruby with C":http://www.rubycentral.com/pickaxe/ext_ruby.html chapter in my Pickaxe book.

If someone else would be interested in this kind of project, and wouldn't mind pairing with a novice, I'd love the opportunity. Meanwhile, I'm doing the novice work of finding and using recipes to get a rhythm for extending Ruby.

Behind the Times

Posted by David Richards Mon, 12 Oct 2009 23:41:00 GMT

I don't read "GitHub's blog":http://github.com/blog. I guess I should. I've left them a few mean notes asking what's going on with the interface and why I can't publish gems anymore. They ignored my blasts, and once again, Google is my friend. GitHub doesn't build gems anymore. I found "this article, published a few days ago":http://github.com/blog/515-gem-building-is-defunct telling me why and what to do.

What this means is I'm going to have to go through all my gems and get them set up afresh. I have 64 gems in my local gems repository, so this could take me a while. While I'm at it, I wanted to re-organize all my gems so that they'd work with "rip":http://github.com/rtomayko/rip, as well as stop requiring rubygems in my libraries ("see the gist":http://gist.github.com/54177). So, I'm taking a simple task and turning it into three or four tasks.

I'm not sure if I'll get that done tomorrow or sometime later, since I have quite a bit of deliverable code to release right now. I just want to make a public acknowledgment that some of my work isn't very available right now.
