A Step Out of My Closet
I went out for breakfast today, 10:00 on a Wednesday, at a local diner. I hadn't realized how secluded I'd become. There were families there all dressed up in ties and dresses, probably for a wedding or a funeral. There were retired couples: spouses, brothers, friends. There was a family there on vacation waking up late, a landscape company owner preparing bids, some construction workers who must have started their day with the sun, and children playing in the parking lot, finding joy in every mundane thing. All this rush of humanity really got me thinking about how different our lives can be. I go to work in a technology center: 20- and 30-somethings, mostly male, mostly white, mostly educated with the same degrees, mostly interested in building their small companies. I usually leave the office late, often past midnight. When I'm done with my day job, I have Tegu to capture my interest. I see the same people nearly every day. I know the people at the local Burger King and Wendy's because they feed me most of my meals. Like I said, I hadn't realized how secluded I'd become.
I also noticed how incredibly obvious some of our preference patterns really are. Looking at those people, I could make probable guesses about their education decisions, buying habits, priorities, etc. I heard on TV recently that our brains are relevance machines--always looking for patterns. Probably that's true. It's more interesting if we can capture that insight in our businesses and research.
I have a friend in the UK who's got me interested in the GitHub recommendation system contest. I wrote a generic gem for that last night ("advocate":http://github.com/davidrichards/advocate/tree/master) and we're collaborating on our actual implementation this week. It makes me think about how well or poorly we are able to cater to micro-market segments. Recommendation systems are an attempt at mass-customization, a near-pipe dream talked about in business schools ten years ago. They're not that tricky, and a mediocre one actually adds some interesting value. A really good one, of course, can be fascinating.
The generic strategy is to create a covariance matrix for a series of models and blend the results. You can then use different parsimony tactics to reduce the complexity of your models, making the system computable. You can also approach some or all of the problem with a belief maintenance system, such as a Bayesian network. In that case, you are moving from a naive covariance measure into a more robust belief measure. The robustness in some ways is there to reduce computational complexity. Rather than maintaining very large matrices with everything-to-everything links computed whenever new data is presented, a well-managed system can localize computation and reduce the interactions between nodes aggressively.
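A minimal sketch of the covariance-and-blend idea, in plain Ruby. The data, the helper names, the second (popularity) model, and the blend weight are all assumptions made up for illustration; this is not how advocate is implemented.

```ruby
# Hypothetical data: rows are users, columns are repositories,
# 1 means the user watches the repository.
WATCHES = [
  [1, 1, 0, 0],
  [1, 1, 1, 0],
  [0, 0, 1, 1],
  [1, 0, 0, 0]
]

def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample covariance between two equally sized columns.
def covariance(xs, ys)
  mx, my = mean(xs), mean(ys)
  xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } / (xs.size - 1)
end

# The everything-to-everything covariance matrix over the columns.
def covariance_matrix(rows)
  cols = rows.transpose
  cols.map { |a| cols.map { |b| covariance(a, b) } }
end

# Model 1: score each item by its covariance with what the user already watches.
def covariance_scores(user_row, matrix)
  matrix.map { |row| row.zip(user_row).sum { |c, w| c * w } }
end

# Model 2: a naive popularity score (share of users watching each item).
def popularity_scores(rows)
  rows.transpose.map { |col| mean(col) }
end

# Blend the two models with a fixed weight (an assumed, untuned value).
def blend(scores_a, scores_b, weight_a = 0.7)
  scores_a.zip(scores_b).map { |a, b| weight_a * a + (1 - weight_a) * b }
end

matrix  = covariance_matrix(WATCHES)
user    = WATCHES[3] # watches only the first repository
blended = blend(covariance_scores(user, matrix), popularity_scores(WATCHES))
unseen  = blended.each_index.reject { |i| user[i] == 1 }
puts "recommend item #{unseen.max_by { |i| blended[i] }}"
```

The full matrix is recomputed from scratch here, which is exactly the cost that a localized belief network would try to avoid.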
Whatever the strategy, we are participating in an inside-out analysis. Instead of describing people from the outside in large swaths of demographic data, we are capturing actual behavior and modeling that. We are attempting to get into people's individual actions and model patterns from that. I found some very interesting thoughts on the subject "from reputable people here":http://blog.kiwitobes.com/?p=58.
Anyway, I thought I'd at least leave a note about why there is now an "advocate":http://github.com/davidrichards/advocate/tree/master gem.
Intelligent Collaboration
The ability to solve complex problems seems to be reserved for group work. I am adding features to Tegu and Agency to have them revolve more and more around dialog.
The problem I'm solving, after all, is to find a way to satiate my curiosities on all sorts of problems, with all sorts of co-conspirators.
To get there, I think about the nature of things, including the nature of intelligence. About intelligence, the American Psychological Association says:
“Individuals differ from one another in their ability to understand complex ideas, to adapt effectively to the environment, to learn from experience, to engage in various forms of reasoning, to overcome obstacles...”
Those differences are leverage, rather than obstacles. It is an opportunity to find the different perspectives and make sure they are mixing with my thoughts. Getting people together with different roles, levels of experience, and goals builds a group dynamic. Using different algorithms for the same purposes is also important. The Encyclopedia Britannica continues:
[Intelligence is the] ability to adapt effectively to the environment, either by making a change in oneself or by changing the environment or finding a new [approach] ...
Adapting to the environment seems to be a collective job, not a singular problem. I'm thinking of the problem of technical financial analysis, competing against the millions of perspectives in the market. I'm thinking of textual analysis, having a dynamic use of language, tied to participant reputations. Every non-trivial problem that Tegu is preparing for doesn't seem solvable if I just bring the best algorithms together on a fast platform. Collaboration should drive the whole application, I think.
I'm drawing tonight from "this great resource":http://www.vetta.org/definitions-of-intelligence/, a collection of definitions for intelligence. Interesting to me, many of the thoughts coming from the AI researchers revolve around issues of complexity. They think of intelligence as the ability to adapt in a complex environment. Rising to the challenge, I think about value as:
- Quick
- Easy to understand
- Adaptable
- Composable
- Context revealing
Concretely, this supports building features like:
- Documenting and configuring projects with IRC
- Making it easier to publish code, data, and documentation
- Building interfaces that keep me focused on the interplay of a system and its environment
- Adding tools to change solutions quickly and with minimal effort
At the end of the day, collective intelligence isn't just what's happening out there, it can also be what's happening here among the problem solvers.
Ruby Metaprogramming
"The purpose of computing is insight, not numbers." Richard Hamming
I used to explain to my friends in Portland that I was using Ruby for machine learning for its expressiveness and dynamic features, rather than just speed. A lot of them thought the problem with understanding systems was pure compute power. I think it's mostly a human problem: representing things in a way that we can understand what's going on. By using AMQP, I can read messages off a queue with Scala, Erlang, or OCaml fairly easily and get blazingly-fast throughput on the workhorse part of a problem. Also, with Scala, I can write expressive interfaces to many Java libraries, including the Weka and JDM libraries. But, the bigger problem of understanding the context of a system and the meaning of analysis is still the more important problem. Ruby is a good candidate for that.
Agency is a good example of what I'm talking about. Agency is one of the modeling paradigms I am introducing to Tegu. It breaks the tasks of observing, computing, storing, and reporting into very explicit parts of the model. This way, one clustering algorithm can be swapped out for another, say. Formally, the components are described as:
- Sensors: any class that can report happenings in an environment onto the input queue
- Percepts: predefined data descriptions providing a single unit of observation (a word, a message, a stock price, that sort of thing)
- Agent Functions: the core processing of data
- Queues: communication lines with specific purposes setup ahead of time
- Repositories: knowledge bases and processing caches for storing and staging all the data
- Actuators: any class that can apply learning to the external world
I actually have a "walking skeleton":http://alistair.cockburn.us/Walking+skeleton of the whole project, with many interesting parts fleshed out fairly well. I learned a lot about metaprogramming with Ruby while implementing the Percept class, so that is what I'll share today.
A Percept has many important roles in Agency:
- It defines data formally
- It provides validation for a piece of data
- It provides templates and search tools for the queues being used
- It names each piece of data in the percept, making transformations easier
- It standardizes the data flow, making resource planning a little more comprehensible
The way I implemented Percept was to create a class method to define a subclass. I want to interact with my data directly, without taking a break to code up some classes. That means that I want to pass in a few blocks, names, and configuration parameters and have a lot of useful code available to me. Percept.define does this for this class. The basic usage of Percept.define is:
Percept.define(:stock_tick, :ticker_symbol, :volume, :price, :percent_change, :price_change) do |p|
  p.ticker_symbol.is_a?(String) and
    [p.volume, p.price, p.percent_change, p.price_change].all? {|e| e.is_a?(Numeric)}
end
This creates a StockTick class, which is a subclass of Percept. It has a signature of [:ticker_symbol, :volume, :price, :percent_change, :price_change] and a template of [nil, nil, nil, nil, nil]. A Rinda queue would take that template, adjust it to reflect a particular queue, and use that to find data on the queue that might be a stock tick. The block is the validation that the queue can use to ensure it actually is a StockTick, and not just any percept with 5 elements in it. Each element is named, which is useful in the validation block, as well as when working with the percept directly. That is how p.volume and p.price are made available.
I pulled together quite a few ideas to come up with how I wanted to handle things. For instance, Jay Fields has a "good reference to many ways of defining a method here":http://blog.jayfields.com/2007/10/ruby-defining-class-methods.html, including "why's approach to easily defining singleton methods":http://whytheluckystiff.net/articles/seeingMetaclassesClearly.html. I also found "these thoughts of Jay's useful":http://blog.jayfields.com/2008/02/ruby-dynamically-define-method.html. Coderrr had some interesting "thoughts":http://coderrr.wordpress.com/2008/10/29/using-define_method-with-blocks-in-ruby-18/ that I think I implemented too.
But, I got started with the Percept class by defining the validation block. That looks like aspect-oriented programming, so I Googled around for people applying this in Ruby. Martin Traverso and Brian McCallister did some interesting work which "can be found here":http://rubydbc.rubyforge.org/svn/trunk/lib/dbc.rb "and here":http://split-s.blogspot.com/2006/02/design-by-contract-for-ruby.html. This is a bit of a tangent, but I want to show you how Martin and Brian keep their namespaces clean by "replacing methods instead of just aliasing them":http://split-s.blogspot.com/2006/01/replacing-methods.html. In a nutshell:
class A
  def go(a, b, c)
    puts 'go!!!', a, b, c
  end
end

class A
  # Capture the original method as an UnboundMethod before replacing it
  orig_go = instance_method(:go)
  define_method(:go) { |*args|
    orig_go.bind(self).call(*args)
    puts 'home'
  }
end
What we end up with is only one method, go. The only code in the universe that can access the old go method is the new go method. By capturing the method with instance_method, we can apply the bindings of the new go method so that the context of calling go is known to the old method as well. Does that make sense? I didn't end up using that trick in the final iteration, but I used it many times, and think it should be in any Ruby programmer's toolbelt.
After all that research, a lot of tests, and some experimenting with how I wanted my Percept.define method to behave, I came up with this:
def define(name, *args, &block)
  Agency.class_eval percept_definition_string(name)
  klass = full_class_for(name).constantize
  if block
    klass.class_eval {
      define_method(:validate_signature) {|percept| block.call(percept) }
    }
  end
  klass.init(*args)
  klass
end
Breaking it out:
- Agency.class_eval allows me to define something in the Agency namespace
- The percept_definition_string is just a string that defines a subclass ("class #{class_for(name)} < Percept; end")
- klass is set up as a working, defined class, StockTick in our case
- The block is the validation block, and I use the block version of class_eval instead of the string version. I don't know if it's possible to use a block inside of class_eval with a string, but I didn't figure out how to do that. It's fairly straightforward this way.
- The class_eval defines a method, validate_signature, inside the StockTick class. It overrides the one that was there, and just passes the percept to the validation block.
- klass.init just calls StockTick.init and sets up the signature and the template for StockTick.
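The steps above can be sketched in a self-contained form. This is not the Agency code: it swaps the string class_eval and constantize for Class.new and const_set so it runs without ActiveSupport, and the names camelize, valid?, and @data are assumptions for illustration.

```ruby
# A stand-in for the real Percept; everything below is illustrative.
class Percept
  class << self
    attr_reader :signature, :template

    # Build a named subclass, e.g. :stock_tick -> StockTick.
    def define(name, *fields, &validation)
      klass = Class.new(self)
      klass.init(*fields)
      if validation
        # The block form of define_method keeps the closure over `validation`.
        klass.send(:define_method, :validate_signature) { |p| validation.call(p) }
      end
      Object.const_set(camelize(name), klass)
    end

    def init(*fields)
      @signature = fields
      @template  = Array.new(fields.size) # [nil, nil, ...] for tuple matching
      # One named reader per field, backed by the data array.
      fields.each_with_index { |field, i| define_method(field) { @data[i] } }
    end

    private

    def camelize(name)
      name.to_s.split('_').map(&:capitalize).join
    end
  end

  def initialize(*data)
    @data = data
  end

  # Default validation accepts anything; define replaces it when given a block.
  def validate_signature(_percept)
    true
  end

  def valid?
    @data.size == self.class.signature.size && !!validate_signature(self)
  end
end

Percept.define(:stock_tick, :ticker_symbol, :volume, :price) do |p|
  p.ticker_symbol.is_a?(String) && [p.volume, p.price].all? { |e| e.is_a?(Numeric) }
end

tick = StockTick.new('AAPL', 1_000, 91.2)
tick.price                         # 91.2
tick.valid?                        # true
StockTick.new(:oops, 1, 2).valid?  # false
```

Using Class.new avoids building source code as a string, at the cost of having to name the class explicitly with const_set.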
So, not too bad. There were a lot of other things I figured out about Ruby metaprogramming in the links above that I didn't end up using in this class. There is one other trick I added to the class. I often use method_missing to redirect a method call to another internal object. In this case, if I ask for percept.name and name is part of the signature, I will get name back from the array inside of the Percept class that holds the data for the object. I've read several times that people don't like method_missing for various reasons. I think it boils down to testing and debugging a method_missing call that's gone awry. I haven't had those issues, so I haven't taken the kinds of steps they've suggested. I don't have links for that, but if method_missing gives you some heartache, Google around a bit and you'll get some cleaner approaches to your metaprogramming.
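A minimal sketch of that method_missing redirection, on a hypothetical NamedTuple class rather than the real Percept. Pairing it with respond_to_missing? is one common way to soften the debugging complaints, since respond_to? then reports the dynamic readers accurately.

```ruby
# Illustrative only: a tiny record whose fields are looked up by name.
class NamedTuple
  def initialize(signature, data)
    @signature = signature
    @data      = data
  end

  # Redirect unknown methods to the data array when the name is in the signature.
  def method_missing(name, *args, &block)
    index = @signature.index(name)
    # Anything else falls through to the normal NoMethodError.
    return super if index.nil? || !args.empty?
    @data[index]
  end

  # Keep respond_to? and method(:name) honest about the dynamic readers.
  def respond_to_missing?(name, include_private = false)
    @signature.include?(name) || super
  end
end

tick = NamedTuple.new([:ticker_symbol, :price], ['AAPL', 91.2])
tick.price                 # 91.2
tick.respond_to?(:price)   # true
tick.respond_to?(:volume)  # false
```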
Even though I'm working with running code, and I'm pretty pleased with Agency, I'm going to hold off for a bit before I release it. I think the stuff you can get on GitHub right now is a bastard son of an old thought, the grandson of a very different approach to the problem. With Tegu and Agency, I've re-written the code many times, changing it as I solve different problems and see new approaches to the issues I'm dealing with. I'll be sure to post to the blog when I've released the newer/cleaner approaches to agent-based modeling.
cirb
I had some thoughts as I drove home from work the other day. Why not allow an agent in an IRC channel to edit some code? I could see this being a better place to work than IRB for configuring any Tegu work: all the note taking is primary, and the resultant configuration is secondary. So, this is where this project is going:
- Have an agent listen to everything in an IRC channel
- Have Agency bring those messages into an InputQueue
- Bring various agent functions to the table to:
** log the conversation
** listen for changes to configuration files, running instances, or source code (Cirb)
** perform searches on the ontology
** suggest links to bolster the conversation
** expand ontology references and wiki references into followable links
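The flow in this list can be sketched without a real IRC connection. Everything here is hypothetical: the Message struct, the "cirb:" prefix for code edits, and the two agent functions are stand-ins, and a plain Ruby Queue plays the part of the InputQueue.

```ruby
# Hypothetical sketch: lines are pushed onto a queue instead of read from IRC.
Message = Struct.new(:nick, :text)

input_queue = Queue.new
log         = []
commands    = []

# Each agent function is a callable that sees every message.
agent_functions = [
  # log the conversation
  ->(m) { log << "<#{m.nick}> #{m.text}" },
  # collect code edits, marked by an assumed "cirb:" prefix
  ->(m) { commands << m.text.sub(/\Acirb:\s*/, '') if m.text.start_with?('cirb:') }
]

# A sensor would push messages here as they arrive from the channel.
input_queue << Message.new('david', 'we should cap the cache at 10 items')
input_queue << Message.new('david', 'cirb: config[:cache_size] = 10')

until input_queue.empty?
  message = input_queue.pop
  agent_functions.each { |f| f.call(message) }
end
```

The notes stay in the log while the configuration change is pulled out separately, which matches the idea that note taking is primary and configuration is secondary.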
So, it's been a fun weekend for those kinds of things to be explored. I've also been having a lot of fun building "Walking Skeletons":http://alistair.cockburn.us/Walking+skeleton instead of just isolated features. I have done a lot more "BDD":http://en.wikipedia.org/wiki/BehaviorDrivenDevelopment than usual, and I think I've been more productive generally. I want to bring another feature or two into Cirb's skeleton, then I'll go ahead and release that code as it is.
Repositories
I extracted out a couple of repositories from some unpublished gems. I should take the time and bring some of my bigger repositories to this project. This helps me reuse some code, and I think I've done a good job testing these. "Check out the code here.":http://github.com/davidrichards/repositories/tree/master
And, a quick demo:
require 'repositories/array_cache'
@a = (1..15).inject(ArrayCache.new) {|a, e| a << e }
# [6,7,8,9,10,11,12,13,14,15]
@a = (1..15).inject(ArrayCache.new(:n => 2)) {|a, e| a << e }
# [14,15]
require 'repositories/hash_cache'
@h = HashCache.new
(1..15).each { |i| @h[i] = i }
@h.hash #=> {11=>11, 6=>6, 12=>12, 7=>7, 13=>13, 8=>8, 14=>14, 9=>9, 15=>15, 10=>10}
@h = HashCache.new(:n => 2)
(1..15).each { |i| @h[i] = i }
@h.hash #=> {14=>14, 15=>15}
Not too much to look at, but if you're using computable dictionaries or percept histories, this will save you an hour or two to reuse them. The next ones to bring in:
- TenaciousG
- FarGratr
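The bounded behaviour in the ArrayCache demo above can be sketched as follows. This is a guess at the behaviour shown in the demo output (keep the most recent n items, with n defaulting to 10), not the gem's implementation, and BoundedArray is a made-up name.

```ruby
# Not the gem's code: only the most recent n items are kept.
class BoundedArray
  include Enumerable

  def initialize(opts = {})
    @n     = opts.fetch(:n, 10)
    @items = []
  end

  # Append an item and evict the oldest once the cap is exceeded.
  def <<(item)
    @items << item
    @items.shift while @items.size > @n
    self
  end

  def each(&block)
    @items.each(&block)
  end
end

cache = (1..15).inject(BoundedArray.new(:n => 2)) { |a, e| a << e }
cache.to_a  # [14, 15]
```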
New Statisticus 1
So, I spoke at URUG last night, and that was good motivation to clean up some code. Statisticus is doing what I think it should, so here's a quick rundown.
h2. Basic Interface to R
To do something that R already does:
stats_class :choose
Choose.call(49,6) # => 13983816.0
This example relies on R's choose function, which takes two parameters, n and x. Here n is the number of items to choose from and x is the number of items chosen, without regard to order; the return value is the number of distinct combinations. So, for a lottery that draws 6 balls from 49 possible numbers, there is a 1 in 13,983,816 chance of picking the winning combination.
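The lottery figure can be checked without R. This is a plain-Ruby helper written for illustration, not part of Statisticus; it uses exact integer arithmetic, so it returns an Integer rather than the Float that comes back from R.

```ruby
# n choose k, computed with exact integer arithmetic.
# After step i the accumulator holds C(n - k + i, i), so each division is exact.
def choose(n, k)
  (1..k).inject(1) { |acc, i| acc * (n - k + i) / i }
end

choose(49, 6)  # 13983816
```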
The stats_class method is just sugar for:
class Choose
  include Statisticus
end
Meaning, stats_class :choose is the same as above.
To do something which I defined locally, I write an R lib (my_obj.r):
my_obj <- function(n,x) choose(n,x)
This is a trivial example. It defines a function called my_obj in the R runtime. Now, with the same syntax, I can write something like:
stats_class :my_obj
MyObj.call(49,6) # => 13983816.0
The example is trivial, we're just passing things to the R function choose. But the power is pretty interesting. If you have R files written per function, then you can just dump all that code in a subdirectory or in ~/.statisticus somewhere, and Statisticus will slurp it up and use it automatically.
Now, to do something more interesting with an R library, create a file, some_code.r, and put in it:
anything_else <- function(x) x
Now, you'll need a smarter class to handle the underlying code:
class SomeCode
  def process(x)
    r.anything_else(x)
  end
end
SomeCode.call(123)
# => 123
This introduces an interesting concept. The process method is the default method (coming from TeguGears) for running a chunk of code. The r method is available inside of there to refer to the R runtime. Any code can be sent to or taken from the R runtime. Many steps can be incorporated, which may make sense sometimes. This is a useful approach if you have code that has many functions in it, and you want to access each one with its own Ruby class.
There are a few more tools that may be generally useful:
- stats: a command-line program for starting Irb with Statisticus running in it
- calc/Calc.call('...'): sugar for sending something directly to the R runtime.
Statisticus is built on top of TeguGears, which means you have or can expect:
- memoized method calls
- composable method calls
- a thread pool for running concurrent code
- each method implementing an observable pattern, so that method results can be broadcast across threads, processes, and machines
- a messaging back bone for distributed code
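The first two items in this list, memoization and composition, can be sketched with a small mixin. The Gear module, the >> operator, and the example classes are hypothetical; TeguGears' real interface may differ, and only the process/call convention comes from the text above.

```ruby
# Hypothetical module; not the TeguGears API.
module Gear
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Class-level entry point, memoized on the argument list.
    def call(*args)
      @memo ||= {}
      @memo.fetch(args) { @memo[args] = new.process(*args) }
    end

    # Compose two gears: (a >> b).call(x) == b.call(a.call(x)).
    def >>(other)
      first = self
      ->(*args) { other.call(first.call(*args)) }
    end
  end
end

class Square
  include Gear

  def process(x)
    x * x
  end
end

class Increment
  include Gear

  def process(x)
    x + 1
  end
end

pipeline = Square >> Increment
pipeline.call(4)  # 17
```

The cache here lives on the class and never expires, which is fine for pure functions but would need an eviction policy for anything long-running.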
You should be able to install Statisticus with:
sudo gem install davidrichards-statisticus
Dependencies are R, RSRuby, and TeguGears.
Anyway, let me know of any questions or problems you may have. I'd be happy to work out more examples. Many examples are being included in Panorama, which should be more informative than this introduction.
Recent Activities: Panorama, Statisticus, and TeguGears
It's been a while since I've posted anything on here. I've been busy. I've:
- re-built Statisticus from the ground up, creating a cleaner approach to integrating Ruby with R
- re-built TeguGears, getting the composition and memoization on their way
- worked out a system for demonstrating machine learning with Cucumber called Panorama
h2. Statisticus
The R language is magical, powerful, and worth committing to. I have never found an analytical task that doesn't have at least a rudimentary implementation in R. You don't need to dig deep in it because they have a wonderful habit of giving you very good documentation with working examples. For example, in the R runtime, type help(solve) and example(solve). You can see the function notes and working examples of this linear equation solver. You may also want to browse the "R manual":http://cran.r-project.org/doc/manuals/R-intro.html, just enough to get the flavor of R.
RSRuby is a project that's made R more accessible to Ruby. It's a bit clunky at times, which is why I work on Statisticus. As I think of any ways to make things smoother, I'll implement those. If you have any suggestions, please let me know. Today, I've been grokking dataframes, thinking through a smoother way to use them.
h2. TeguGears
TeguGears is a gem for managing the low-level workflow issues of analytical work in Ruby. It's meant to offer:
- Memoization and other optimizations
- Composable functions for re-using code more easily
- Concurrently running code on the same machine
- Distributed processes across many machines
It looks like I'll use a thread pool for my concurrency. In 1.9, that should help optimize a single server. I'm leaning a lot towards Nanite these days, instead of Vertebra. They're both excellent projects born in the offices of Engine Yard. Nanite is supposed to run a little faster and be a little easier to manage and Vertebra takes various security issues seriously (hence the speed and complexity differences). I realized that security isn't my concern: if I can't afford to have a percept snipped by someone, then I need to be on a private network instead of an open network.
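A minimal sketch of the thread pool idea, using only core Ruby. The class and method names are illustrative rather than TeguGears' API. On MRI, threads share a global interpreter lock, so a pool like this mostly helps with IO-bound work.

```ruby
# A small fixed-size pool; names are illustrative.
class ThreadPool
  def initialize(size)
    @jobs    = Queue.new
    @workers = Array.new(size) do
      Thread.new do
        # Each worker pulls jobs until it sees the :stop marker.
        while (job = @jobs.pop) != :stop
          job.call
        end
      end
    end
  end

  def schedule(&block)
    @jobs << block
  end

  # Ask every worker to stop after the queued jobs, then wait for them.
  def shutdown
    @workers.size.times { @jobs << :stop }
    @workers.each(&:join)
  end
end

results = Queue.new
pool    = ThreadPool.new(4)
10.times { |i| pool.schedule { results << i * i } }
pool.shutdown

# Results arrive in whatever order the workers finished, so sort them.
squares = Array.new(results.size) { results.pop }.sort
```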
But, all in all, this project has made a lot of progress. I'm excited every time I use it, even for simple programs.
h2. Panorama
Cucumber has won my heart. I've been using it and Story Runner for about a year and a half, but not religiously. For a problem as wide as machine learning, there are several important problems that need to be addressed:
- Making sure the concepts actually work
- Making sure the concepts work on various images I'm working on (an AMI for running Ruby-based machine learning in EC2)
- Communicating the breadth of ideas that have already been tackled in practical ways
- Training people on the basics of these concepts
- Getting feedback for the kinds of solutions needed
So, Panorama is a collection of Cucumber features that demonstrate machine learning. They're a poor-man's taxonomy, a precursor to our full-blown Tegu Ontology. The top-level organization of Panorama is:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Analytic Tools
- Repositories
Probably, the first time you will have access to Panorama is on the new AMI. The AMI will be ready soon enough.
Outside of those things, I've been looking at SUMO, a very interesting and open-source ontology of over 20,000 terms tied to WordNet. WordNet is the de facto standard for linguistic definition of English terms out of Princeton and a very fascinating source of information. I've been working with RDF through Redland with their Ruby bindings. I'm still pretty new at that; I've mostly done things with graph libraries: RGL, GRATR, and TenaciousG. It would be very fun to combine all of these into some working inference engines for various purposes.
Anyway, I wanted to make semi-announcements of some of the changes.
Sirb is TeguGears
I have this code that's coming together for Ruby 1.9. I've really been inspired by the way Fiber works in 1.9, and I think that I can make it sing. So, I'm branching Sirb and making it TeguGears. I'll probably release my work in time for "Mountain West Ruby Conference":http://mtnwestrubyconf.org/2009/ next weekend, but I am too excited tonight and want to blog about what I'm thinking.
So, the concept is basic:
- take some decent base libraries (like Vector, Array, RNum, and NArray)
- make sure that some good descriptive statistics are available to them
- build these tools up with Ruby and support it with R when necessary or pragmatic
- make sure these tools have concurrent and distributed forms that work well.
It's these last two ideas that are expanding for me tonight. I've been studying Erlang in my spare time, and I'm really impressed with their simple API:
- For concurrent programming:
** spawn
** send
** receive
- For error control:
** link
** exit values
** spawn_link
** keep alive
** system processes
- For distributed programming:
** node
** nodes
** is_alive
** monitor_node
** disconnect_node
Not a bad list. I love how short it is. I can almost repeat the whole list from memory in the car. My task is to combine various tools to provide this kind of interface inside of TeguGears. I'll gut out some of how Tegu was working on these issues (my old code is over-architected and therefore never complete). What I'm thinking today:
- Move my Ruby implementations into Fiber-enriched, composable structures: Filter and Operator
- Add my ThreadPool code from another project and make it a module that can make any class' algorithms concurrent
- Change the ThreadPool interface to handle a simpler API: spawn, send, receive, link, spawn_link, keep_alive, and system_process
- Add a distributed component with a simple API: node, nodes, is_alive, monitor_node, disconnect_node
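The spawn, send, and receive part of that API can be sketched with a thread and a mailbox. The Actor class is hypothetical and covers only the concurrent primitives, not linking or distribution. The method is called send_message because every Ruby object already has a send method.

```ruby
# A process-like actor: a thread with a mailbox. Hypothetical names throughout.
class Actor
  def self.spawn(&behavior)
    new(&behavior)
  end

  def initialize(&behavior)
    @mailbox = Queue.new
    # Run the block with the actor as self, so it can call receive directly.
    @thread  = Thread.new { instance_exec(&behavior) }
  end

  # Asynchronous send: drop the message in the mailbox and return.
  def send_message(message)
    @mailbox << message
    self
  end

  # Blocking receive, called from inside the actor's own block.
  def receive
    @mailbox.pop
  end

  # Wait for the actor to finish and return its block's value.
  def value
    @thread.value
  end
end

adder = Actor.spawn do
  total = 0
  while (message = receive) != :done
    total += message
  end
  total
end

[1, 2, 3].each { |n| adder.send_message(n) }
adder.send_message(:done)
adder.value  # 6
```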
I'm not sure how I'll implement the distributed portion of TeguGears. My choices seem to be Rinda and Starling. I could also look again at Active MQ, Rosetta Queue (when they release), or possibly another message queuing system (whose name I forget at this hour).
I think I've created enough design for the project to branch Sirb and begin work. This would be, I think, a strictly Ruby 1.9 gem. I don't think I like my composition solution in Sirb (too monkey-patched for general use, I think). I should probably write more concrete examples for what these concepts can do in the real world. My huddle gem will be a good showcase piece. It relies on the tools in Sirb to implement clustering algorithms.
Anyway, those are my thoughts for tonight. They were encouraged by a "Dave Thomas article":http://pragdave.blogs.pragprog.com/pragdave/2008/01/pipelines-using.html that showed me I can have much cleaner thinking on the matter. I think I'm going to turn in and get an earlier start on my day tomorrow.
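For the Fiber-based pipelines mentioned above, a minimal sketch in the spirit of that approach. The helper names are made up for illustration; Fiber itself is core Ruby since 1.9. Each stage pulls from the one before it, and nil marks the end of the stream, so nil or false values cannot flow through this version.

```ruby
# Generator: yields 1..limit, then nil to signal the end of the stream.
def numbers(limit)
  Fiber.new do
    1.upto(limit) { |n| Fiber.yield n }
    nil
  end
end

# Filter stage: passes along only the values the predicate accepts.
def select_from(source, &predicate)
  Fiber.new do
    while (value = source.resume)
      Fiber.yield value if predicate.call(value)
    end
    nil
  end
end

# Operator stage: transforms each value as it passes through.
def map_from(source, &transform)
  Fiber.new do
    while (value = source.resume)
      Fiber.yield transform.call(value)
    end
    nil
  end
end

# Consumer: pulls values until the stream ends.
def drain(source)
  out = []
  while (value = source.resume)
    out << value
  end
  out
end

evens   = select_from(numbers(10)) { |n| n.even? }
squares = map_from(evens) { |n| n * n }
drain(squares)  # [4, 16, 36, 64, 100]
```

Nothing is computed until the consumer asks for a value, so the stages compose without building intermediate arrays.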
Getting Started
This isn't an article on getting started with Tegu. Not yet. This is an article about getting started with blogging. See, I have about 15,000 lines of code that I've written. Some of it needs a lot of refactoring and tests. Some of it is really quite ready. Some of it is being spun off into other gems that will work as plugins for Tegu. That's a lot of work. It's going to be released. Soon. I've promised. And still promise. I'm taking advantage of nobody needing an immediate bug fix to run more test code, to think through some edge cases, to make sure that at least I'm happy with a few major issues. This isn't very Agile. Andy Hunt tells me (through his new book, one I'll review soon) that this means I'm not using the right side of my brain. I'd argue that I'm using even less than that sometimes.
Meanwhile, there's this blog. This blog is where I'll demonstrate how to run some basic machine learning or system-centric ideas with Tegu. Kind of a live teaser for what I'm doing in the labs right now. Part of my prep work is to take the machine learning course at Stanford, and concretize it for Tegu users. So, you should expect a series of articles on that. There will be articles on installing and using Tegu. There will be articles on basic systems ideas, such as dynamics and nonmonotonic reasoning. I'm flying to San Francisco in a couple of weeks to listen to the Rapleaf people talk about Hadoop, an important clustering resource for Tegu. So there should be quite a bit of concrete information found here. There's also the Google Group, which will be ever-more interesting as people get their hands on this technology and take it for a spin. Finally, I'll post the Lighthouse and Github links here when they're appropriate and useful for issue tracking and the git source respectively.
If you're just wanting to get your feet wet, you can check out a video taken of me at the Lone Star Ruby Conference. It's pretty rough. Hal Fulton's comment to a friend of mine after my talk was, "So what does it do?" Good question, Hal. I was too nervous, too involved in the project to get that answered. I'll get it right.
So, welcome to my blog, at least. There are a lot of exciting and worthwhile things on their way.