Ruby and RDF
I joined the "public Ruby RDF mailing group":http://www.w3.org/2001/12/rubyrdf/intro.html a while back. It's a pretty quiet group. The resources listed on the home page are a little out of date. I asked for a status check, and there seem to be a few live projects these days. In no particular order, they are:
- "Redleaf":http://deveiate.org/projects/Redleaf is supposed to be making some advances. It is a set of hand-written bindings to the Redland libraries, with the promise of being a little more idiomatic.
- "ruby-rdfa":http://code.google.com/p/ruby-rdfa/, a Ruby-only RDFa parser.
- "RubyRDF":http://esw.w3.org/topic/RubyRdf is a collection of parser, storage, manipulation, and query tools.
- "Reddy":http://github.com/tommorris/reddy is meant to be a "one true way" approach to RDF with Ruby, bringing Redland, Jena, and other libraries together with one Ruby interface.
I don't think "ruby-sesame":http://github.com/pjlegato/ruby-sesame has been worked on for quite some time, but it binds to the Sesame RESTful API.
You'd probably be building your own productivity layers on top of any of these solutions. I would like to work more with all of these, pick one, and get busy on any productivity classes that I need. None of these are really hot, killer apps, however. That makes the decision a bit harder. I'm pretty open and loose about this for a little while. If anyone's had any other/conflicting experience, I'm open to suggestions.
More Useful SlenderT
I think my SlenderT library is starting to get useful:
- I've run quite a few benchmarks on it
- I've worked on the load to make that pretty fast
- I've implemented the query
- I've played with a fairly large database, finding that it delivers a very expressive tool
- I've written some documentation that can be found at "GitHub":http://github.com/davidrichards/slender_t
Some quick examples from the documentation:
>> db = SlenderT.load('spec/fixtures/business_triples.csv')
>> db.find('BSC', 'name', nil)
=> [["BSC", "name", "Bear Stearns"]]
That tells us that BSC means Bear Stearns. This tells us who we know Bear Stearns contributed to recently, and how much:
>> val = db.query(['?contribution', 'contributor', 'BSC'],
?> ['?contribution', 'recipient', '?recipient'],
?> ['?contribution', 'amount', '?dollars'])
=> [{"?contribution"=>"contrib285", "?dollars"=>30700.0, "?recipient"=>"Orrin Hatch"}, {"?contribution"=>"contrib284",
"?dollars"=>168335.0, "?recipient"=>"Hillary Rodham Clinton"}, {"?contribution"=>"contrib287", "?dollars"=>5600.0,
"?recipient"=>"Christopher Shays"}, {"?contribution"=>"contrib288", "?dollars"=>205100.0, "?recipient"=>"Christopher Dodd"},
{"?contribution"=>"contrib290", "?dollars"=>17300.0, "?recipient"=>"Frank Lautenberg"}, {"?contribution"=>"contrib286",
"?dollars"=>5000.0, "?recipient"=>"Barney Frank"}, {"?contribution"=>"contrib289", "?dollars"=>13000.0, "?recipient"=>"Michael
Dean Crapo"}, {"?contribution"=>"contrib294", "?dollars"=>4600.0, "?recipient"=>"Pete Sessions"},
{"?contribution"=>"contrib295", "?dollars"=>5000.0, "?recipient"=>"Paul E. Kanjorski"}, {"?contribution"=>"contrib292",
"?dollars"=>6600.0, "?recipient"=>"Nita Lowey"}, {"?contribution"=>"contrib293", "?dollars"=>5000.0, "?recipient"=>"Deborah
Pryce"}, {"?contribution"=>"contrib291", "?dollars"=>102260.0, "?recipient"=>"Joe Lieberman"}]
>> val.size
=> 12
It'll get some more lovin', but it's plenty good for this week's deliverables.
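For anyone curious how a query like the one above can work, here is a minimal sketch. It is not SlenderT's actual implementation: the @TinyStore@ class, its internals, and the sample triples are all made up for illustration. It assumes @nil@ is a wildcard in @find@ and that strings starting with @?@ are variables in @query@, which matches how the examples above read.

```ruby
# Hypothetical illustration; not SlenderT's actual implementation.
class TinyStore
  def initialize(triples = [])
    @triples = triples
  end

  # nil acts as a wildcard, as in db.find('BSC', 'name', nil).
  def find(subject, predicate, object)
    pattern = [subject, predicate, object]
    @triples.select do |triple|
      pattern.zip(triple).all? { |want, have| want.nil? || want == have }
    end
  end

  # Each clause is [s, p, o]; strings starting with "?" are variables.
  # Returns one hash of variable bindings per match across all clauses.
  # Limitation: a variable repeated inside a single clause is not checked.
  def query(*clauses)
    clauses.inject([{}]) do |bindings, clause|
      bindings.flat_map do |binding|
        # Bound variables become their values; unbound ones become wildcards.
        concrete = clause.map { |term| variable?(term) ? binding[term] : term }
        find(*concrete).map do |triple|
          extra = clause.zip(triple).select { |term, _| variable?(term) }
          binding.merge(extra.to_h)
        end
      end
    end
  end

  private

  def variable?(term)
    term.is_a?(String) && term.start_with?('?')
  end
end

# Made-up sample data, not the fixture file used above.
store = TinyStore.new([
  ['BSC', 'name', 'Bear Stearns'],
  ['contrib1', 'contributor', 'BSC'],
  ['contrib1', 'recipient', 'Jane Doe'],
  ['contrib1', 'amount', 100.0],
  ['contrib2', 'contributor', 'XYZ'],
  ['contrib2', 'recipient', 'John Roe'],
  ['contrib2', 'amount', 50.0]
])

p store.query(['?c', 'contributor', 'BSC'],
              ['?c', 'recipient', '?r'],
              ['?c', 'amount', '?d'])
```

Each clause narrows the list of candidate bindings, so the order of the clauses affects how much work is done but not the result. A linear scan in @find@ is fine for a toy; a real store would use indexes.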
Formtastic Rails Cast
Ryan Bates is putting a two-part screencast together on "Formtastic":http://github.com/justinfrench/formtastic. You can see the "first one here":http://railscasts.com/episodes/184-formtastic-part-1. This rounds out some of my UI thoughts from "Sunday":http://blog.tegugears.com/2009/10/18/front-end-frustrations-and-some-solutions.
Gem Bundler
Ryan Shaw brought this to my attention. "Gem Bundler":http://litanyagainstfear.com/blog/2009/10/14/gem-bundler-is-the-future/ handles gem dependencies in an intelligent way. This tutorial shows you how to use Gem Bundler.
SlenderT: A Simple Triples Store
I've translated a little code from "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 by Toby Segaran et al. tonight. There's a neat little Ruby gem available for your pleasure now, "slendert":http://github.com/davidrichards/slendert:
sudo gem install slender_t
It's a triples store. It'll get some love when I work on my recommendation engine again in the morning. I decided to go a little more robust than the "Composite pattern, as I mentioned in yesterday's post":http://blog.tegugears.com/2009/10/18/a-little-education-goes-a-long-way and store the rules in a triples store. I couldn't justify installing "Sesame":http://www.openrdf.org/ or "Redland":http://librdf.org/ for that, so there's this.
I have a few things to look at tomorrow:
- Possibly implementing it with an "RBTreeMap":http://github.com/kanwei/algorithms/blob/master/lib/containers/rbtreemap.rb from the "Algorithms":http://github.com/kanwei/algorithms gem to make it a little nicer when adding triples. Currently, I have a lazy/nasty piece of work in the add method:
index[a][b] = index[a][b] | [c]
- I could do with a better query language. There are quite a few ideas that I want to take from this project and bring back into some other projects I should finish: marginal, fathom, and overalls.
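One alternative to the array-union line above, sketched with only the standard library: keep a @Set@ at each leaf so that adding a duplicate is a no-op and no array gets rebuilt. This is an assumption-heavy sketch, not SlenderT's code and not the RBTreeMap approach. The @IndexedTriples@ class, the three index names, and the @objects@ helper are all hypothetical.

```ruby
require 'set'

# Hypothetical sketch; the names here are not SlenderT's.
class IndexedTriples
  def initialize
    # Three permuted indexes, so a lookup can start from any term.
    @spo = new_index
    @pos = new_index
    @osp = new_index
  end

  def add(subject, predicate, object)
    # Set#add ignores duplicates, replacing index[a][b] = index[a][b] | [c].
    @spo[subject][predicate].add(object)
    @pos[predicate][object].add(subject)
    @osp[object][subject].add(predicate)
    self
  end

  # All objects stored for a subject/predicate pair.
  def objects(subject, predicate)
    # Check keys first so a lookup never creates empty index entries.
    return [] unless @spo.key?(subject) && @spo[subject].key?(predicate)
    @spo[subject][predicate].to_a
  end

  private

  # A two-level hash that creates inner hashes and leaf sets on first write.
  def new_index
    Hash.new { |outer, a| outer[a] = Hash.new { |inner, b| inner[b] = Set.new } }
  end
end

triples = IndexedTriples.new
triples.add('BSC', 'name', 'Bear Stearns').add('BSC', 'name', 'Bear Stearns')
p triples.objects('BSC', 'name')
```

The default-proc hashes remove the need to check whether @index[a][b]@ exists before writing. A sorted tree map would additionally keep keys ordered, which this sketch does not attempt.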
Anyway, if you'd like to play around a little with semantic technologies, that may give you a start. Also, "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 and "Semantic Web for the Working Ontologist":http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564/ref=sr11?ie=UTF8&s=books&qid=1256026799&sr=1-1 are practical references worth picking up.
Data Frame and Meta Data
There's a code smell in "dataframe":http://github.com/davidrichards/dataframe. It is beginning to be a little difficult to add some features that deal with cached values. Data frame is supposed to take a series of named fields and let you do interesting things with it:
There's also a bug there. Can you see it? We lost the category for wind. The last line should have been [0,1] like it was before. Now, I think I could fix this pretty easily, but I am beginning to see that I need to have a better way to work with data. As it sits:
- justenumerablestats does a lot of the fancy work under the covers
- The meta data in a column is not explicitly defined
- The meta data does not follow a copied array the way that I copied it
- data_frame does not have explicit meta data
- The meta data from both justenumerablestats and data_frame cannot be extracted out of their objects
There are a few things that I want to fix, then:
First, I want to be able to extract the meta data out of arrays and data frames to be used on other data sets. I would like to preprocess a data sample, generate its labels, categories, max, min, standard deviation, etc., and then have that available as the context for processing a whole data stream. By making this change, I don't have to limit myself to data that can fit in memory.
Second, justenumerablestats is a dumb name for a gem. I started it as a quick fix for a simple program. Then I kept adding methods to it every time I had a new use. Pretty soon, it became a very interesting collection of methods, but one that really only works on Array data. I can't use this on some custom class that includes the Enumerable module, because I have too many dependencies on arrays. Hashes can't use it. So, this is really array_stats. So, I should make that gem instead and start depending on that.
So, this is where I want to take dataframe, justenumerablestats, and array_stats. I'll see if I can get back to it this weekend. I'm excited to work on it, but I have some code to turn in for some clients before tomorrow evening, and I'd better not delay any more.
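To make the first goal concrete, here is a rough sketch of pulling meta data out of a sample and reusing it on values that arrive later. None of this is the current dataframe API. @ColumnMeta@, @extract_meta@, and @standardize@ are hypothetical names, and the sketch assumes a numeric column with at least two values in the sample.

```ruby
# Hypothetical names; not the API of any existing gem mentioned here.
ColumnMeta = Struct.new(:min, :max, :mean, :std_dev, :categories)

# Summarize a numeric sample (needs at least two values).
def extract_meta(sample)
  n = sample.size.to_f
  mean = sample.sum / n
  variance = sample.sum { |x| (x - mean)**2 } / (n - 1) # sample variance
  ColumnMeta.new(sample.min, sample.max, mean, Math.sqrt(variance), sample.uniq.sort)
end

# Use meta data learned from the sample on a value that arrives later.
def standardize(value, meta)
  (value - meta.mean) / meta.std_dev
end

meta = extract_meta([1.0, 2.0, 3.0])
p meta.mean                   # 2.0
p standardize(4.0, meta)      # 2.0
```

Because the meta data lives in its own object, it can be computed once from a sample and then carried along while the rest of the data is read one value at a time.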
Preprocessing 1
"Red Davis":http://redwriteshere.com/ posed an interesting question to me today. How do you normalize a streaming data set? Meaning, if I have data with an unknown maximum or minimum value, and I have too much data to search through it to find the exact answers, how do I work with that?
The answer we came up with was a bit of backtracking:
- Either provide a sample data set or an estimated maximum and minimum value
- Calculate the maximum and minimum based on a normal distribution
- Start normalizing the data stream
- Recalculate the max and min if the data exceeds the bounds
At that point, we can use a series of heuristics or learning algorithms to decide how to set up the normalized data.
The thing about a normalized data set is that it is internally consistent. Meaning, we are adjusting for any skew that a parameter might have on a neural network, say, because its values were recorded between 1 and a million, versus values that were recorded between 1 and 10. To a learning algorithm, the different states are just different states, and normalized data keeps it that way.
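The steps listed above can be sketched roughly as follows. This is a fresh illustration, separate from the code referred to below, and @StreamingNormalizer@ is a hypothetical name. It makes two simplifying assumptions: the starting bounds come straight from the sample's extremes (or from estimates) rather than from a fitted normal distribution, and normalizing means min-max scaling to the range 0 to 1.

```ruby
# Hypothetical sketch of the steps above.
class StreamingNormalizer
  attr_reader :min, :max, :rescales

  # Seed the bounds from a sample, or from estimated min and max values.
  def initialize(sample: nil, min: nil, max: nil)
    min ||= sample&.min
    max ||= sample&.max
    raise ArgumentError, 'need a sample or min and max estimates' if min.nil? || max.nil?
    raise ArgumentError, 'max must exceed min' unless max > min
    @min = min.to_f
    @max = max.to_f
    @rescales = 0 # how many times the bounds had to be widened
  end

  # Returns a value between 0 and 1. If the value falls outside the current
  # bounds, the bounds are widened first and the change is counted.
  def normalize(value)
    if value < @min || value > @max
      @min = [@min, value].min.to_f
      @max = [@max, value].max.to_f
      @rescales += 1
    end
    (value - @min) / (@max - @min)
  end
end

normalizer = StreamingNormalizer.new(min: 0, max: 10)
p normalizer.normalize(5)   # 0.5
p normalizer.normalize(20)  # 1.0, after widening the bounds to 0..20
p normalizer.normalize(5)   # 0.25 on the new scale
```

Note the catch this exposes: values normalized before a rescale are on the old scale. The @rescales@ counter is there so a caller can decide whether to reprocess earlier output, which is the backtracking part.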
I put together a little piece of code that might be useful for this kind of problem, found here:
Shogun Shotgun
So, I have enjoyed what I have learned lately about "Shogun":http://www.shogun-toolbox.org/. It is a unified interface for several SVM libraries. Today, I can run some sophisticated analysis from R, Octave, or Python, but not Ruby. I was looking through the "Python Interface":http://svn.tuebingen.mpg.de/shogun/releases/shogun0.8.0/src/python/PythonInterface.cpp (also "here":http://svn.tuebingen.mpg.de/shogun/releases/shogun0.8.0/src/python/PythonInterface.h), and it doesn't look too difficult.
Taking an honest look at my C/C++ skills (or lack thereof), I realize I should probably pair on this with someone. My programming classes used C, and I never took C++ very seriously, so I may have a bit of a learning curve ahead of me. I know I'm going to have to pay attention to the "Extending Ruby with C":http://www.rubycentral.com/pickaxe/ext_ruby.html chapter in my Pickaxe book.
If someone else would be interested in this kind of project, and wouldn't mind pairing with a novice, I'd love the opportunity. Meanwhile, I'm doing the novice work of finding and using recipes to get a rhythm for extending Ruby.
Behind the Times
I don't read "GitHub's blog":http://github.com/blog. I guess I should. I've left them a few mean notices asking what's going on with the interface and why I can't publish gems anymore. They ignored my blasts, and once again, Google is my friend. GitHub doesn't build gems anymore. I found "this article, published a few days ago":http://github.com/blog/515-gem-building-is-defunct telling me why and what to do.
What this means is I'm going to have to go through all my gems and get them set up afresh. I have 64 gems in my local gems repository, so this could take me a while. While I'm at it, I wanted to re-organize all my gems so that they'd work with "rip":http://github.com/rtomayko/rip, as well as stop requiring rubygems in my libraries ("see the gist":http://gist.github.com/54177). So, I'm taking a simple task and turning it into three or four tasks.
I'm not sure if I'll get that done tomorrow or sometime later, since I have quite a bit of deliverable code to release right now. I just want to make a public acknowledgment that some of my work isn't very available at the moment.
All That Jazz
I'm behind on my blog these days. I've written a lot of interesting code lately. At least I'm interested in it. Actually, I've been singing a lot lately. I've been using my own gems and they've been making work more accessible and more convenient. I was wanting to write up a decent explanation of each, but it's probably better to just get this out:
- "repositories":http://github.com/davidrichards/repositories/tree/master, a collection of in-memory caching tools.
- "kmeans":http://github.com/davidrichards/kmeans/tree/master, a clustering algorithm.
- "just enumerable stats":http://github.com/davidrichards/justenumerablestats/tree/master, some pretty useful statistics for Arrays.
- "dataframe":http://github.com/davidrichards/dataframe/tree/master, a useful way to work with named data tables, with justenumerablestats available as well.
- "etl":http://github.com/davidrichards/etl/tree/master, extract transform and load with a few conveniences, "mentioned here":http://blog.teguhub.com/2009/08/15/etl-in-the-wild.
I think each one has a decent README file. If not, it will.
On the way:
- marginal: in-memory probability distribution tables.
- fathom: Bayesian belief maintenance a la Judea Pearl.
The big ones are coming along as well. As I play with little things, extracting stuff out here and there, I'm also making progress on agency, tegu labs, tegu gears, tegu, and the tegu ontology. Labs and ontology both got a good boost this weekend.