Ruby and RDF
I joined the "public Ruby RDF mailing group":http://www.w3.org/2001/12/rubyrdf/intro.html a while back. It's a pretty quiet group. The resources listed on the home page are a little out of date. I asked for a status check, and there seem to be a few live projects these days. In no particular order, they are:
- "Redleaf":http://deveiate.org/projects/Redleaf is supposed to be making some advances. It is a set of hand-written bindings to the Redland libraries that promises to be a little more idiomatic.
- "ruby-rdfa":http://code.google.com/p/ruby-rdfa/, a Ruby-only RDFa parser.
- "RubyRDF":http://esw.w3.org/topic/RubyRdf is a collection of parser, storage, manipulation, and query tools.
- "Reddy":http://github.com/tommorris/reddy is meant to be a "one true way" approach to RDF with Ruby, bringing Redland, Jena, and other libraries together with one Ruby interface.
I don't think "ruby-sesame":http://github.com/pjlegato/ruby-sesame has been worked on for quite some time, but it binds to the Sesame RESTful API.
You'd probably be building your own productivity layers on top of any of these solutions. I would like to work more with all of these, pick one, and get busy on any productivity classes that I need. None of these are really hot, killer apps, however. That makes the decision a bit harder. I'm pretty open and loose about this for a little while. If anyone's had any other/conflicting experience, I'm open to suggestions.
Remote Pair Programming
I'm working more and more in pairs these days. I really enjoy what I learn from others. We tend to build more momentum on projects at the same time. I've heard from a few of the cool cats online that screen + vim + Skype is the way to go. I haven't used screen before, and, like with most things, I have to find the time to dig into it a little. In this case, there wasn't much that I needed to dig into to get screen to work for me. Here's the gist:
- pick a server to work on
- set up ~/.screenrc
- start a session
- have the other person join the session
- turn on Skype and have fun
h2. Details
In this day of virtual servers, there seems to be one available all the time. For me, I needed to set up a few users and gather some public keys to make things work out. Not too different from the regular fare.
This is what I've collected from various blog posts for my ~/.screenrc:
hardstatus on
hardstatus alwayslastline
hardstatus string "%{rk}%H %{gk}%c %{yk}%M%d %{wk}%?%-Lw%?%{bw}%n*%f%t%?(%u)%?%{wk}%?%+Lw%?"
multiuser on
acladd nate
acladd matt
acladd aldo
The hardstatus configuration gives me a line at the bottom of the screen to share the context. The multiuser setting allows others to attach to the session. The acladd lines say which users are allowed to attach.
To start a session, I've been using:
screen -S
To have the remote person join the session:
screen -x
At some point, we also turn on Skype. At this point, we're looking at the same files, the same command line, and we can talk about what we're doing. I think I kept resisting screen because I thought there would be a little more to it. There probably is, but this seems to work for me. If I find any hiccups with this, I'll post them here.
More Useful SlenderT
I think my SlenderT library is starting to get useful:
- I've run quite a few benchmarks on it
- I've worked on the load to make that pretty fast
- I've implemented the query
- I've played with a fairly large database, finding that it delivers a very expressive tool
- I've written some documentation that can be found at "GitHub":http://github.com/davidrichards/slender_t
Some quick examples from the documentation:
>> db = SlenderT.load('spec/fixtures/business_triples.csv')
>> db.find('BSC', 'name', nil)
=> [["BSC", "name", "Bear Stearns"]]
That tells us that BSC means Bear Stearns. The next query tells us who Bear Stearns contributed to recently, and how much:
>> val = db.query(['?contribution', 'contributor', 'BSC'],
?> ['?contribution', 'recipient', '?recipient'],
?> ['?contribution', 'amount', '?dollars'])
=> [{"?contribution"=>"contrib285", "?dollars"=>30700.0, "?recipient"=>"Orrin Hatch"}, {"?contribution"=>"contrib284",
"?dollars"=>168335.0, "?recipient"=>"Hillary Rodham Clinton"}, {"?contribution"=>"contrib287", "?dollars"=>5600.0,
"?recipient"=>"Christopher Shays"}, {"?contribution"=>"contrib288", "?dollars"=>205100.0, "?recipient"=>"Christopher Dodd"},
{"?contribution"=>"contrib290", "?dollars"=>17300.0, "?recipient"=>"Frank Lautenberg"}, {"?contribution"=>"contrib286",
"?dollars"=>5000.0, "?recipient"=>"Barney Frank"}, {"?contribution"=>"contrib289", "?dollars"=>13000.0, "?recipient"=>"Michael
Dean Crapo"}, {"?contribution"=>"contrib294", "?dollars"=>4600.0, "?recipient"=>"Pete Sessions"},
{"?contribution"=>"contrib295", "?dollars"=>5000.0, "?recipient"=>"Paul E. Kanjorski"}, {"?contribution"=>"contrib292",
"?dollars"=>6600.0, "?recipient"=>"Nita Lowey"}, {"?contribution"=>"contrib293", "?dollars"=>5000.0, "?recipient"=>"Deborah
Pryce"}, {"?contribution"=>"contrib291", "?dollars"=>102260.0, "?recipient"=>"Joe Lieberman"}]
>> val.size
=> 12
It'll get some more lovin', but it's plenty good for this week's deliverables.
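For anyone curious how a @find@ with @nil@ wildcards and a multi-clause @query@ can work, here is a minimal sketch over a plain array of triples. It is not SlenderT's code: the names @find_triples@ and @query_triples@ are hypothetical, and the handful of triples is copied from the output above.

```ruby
# A plain array of [subject, predicate, object] triples stands in for the store.
TRIPLES = [
  ['BSC', 'name', 'Bear Stearns'],
  ['contrib285', 'contributor', 'BSC'],
  ['contrib285', 'recipient', 'Orrin Hatch'],
  ['contrib285', 'amount', 30700.0],
  ['contrib286', 'contributor', 'BSC'],
  ['contrib286', 'recipient', 'Barney Frank'],
  ['contrib286', 'amount', 5000.0]
].freeze

# Hypothetical helper: nil in any position acts as a wildcard.
def find_triples(triples, subject, predicate, object)
  pattern = [subject, predicate, object]
  triples.select do |triple|
    pattern.zip(triple).all? { |want, have| want.nil? || want == have }
  end
end

# Strings that start with "?" are treated as variables.
def variable?(term)
  term.is_a?(String) && term.start_with?('?')
end

# Hypothetical helper: join each clause against the bindings found so far.
def query_triples(triples, *clauses)
  clauses.reduce([{}]) do |bindings, clause|
    bindings.flat_map do |bound|
      # Already-bound variables become constants; unbound ones become nil wildcards.
      pattern = clause.map { |term| variable?(term) ? bound[term] : term }
      find_triples(triples, *pattern).map do |triple|
        extended = bound.dup
        clause.zip(triple).each { |term, value| extended[term] = value if variable?(term) }
        extended
      end
    end
  end
end

rows = query_triples(TRIPLES,
                     ['?contribution', 'contributor', 'BSC'],
                     ['?contribution', 'recipient', '?recipient'],
                     ['?contribution', 'amount', '?dollars'])
rows.each { |row| puts row.inspect }
```

This scans every triple for every clause, so it only makes sense for small data sets. A real store would look patterns up in an index instead.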
From Good to Awesome
I just had an incredible experience. I'm working at "Instructure":http://www.instructure.com/, a startup currently working out of the Novell campus. I went to the cafeteria for lunch, and our investor was there. He has been teaching a class on entrepreneurship for the startups here. So, I sat down to lunch with him, and he was joined by a handful of business owners from the class. We got to brag Instructure up a little bit and explain the strategic positioning of Instructure to the group and then hear about the other companies. One company has been making sales, but not a lot of progress. As I stood up to go back to work, the investor was asking someone from that company, "how can we get you from good to awesome?"
Another way I've heard it is, don't suck less. Meaning, it's not good enough to just suck less than the competition, but rather be a standout in the community.
One way we keep from merely sucking less is to be active in the community. Otherwise, it's too easy to just code up more humdrum work, or have a humdrum expectation for my business life. But if we are involved in our communities, our excellence is asked of us. I've put a lot of thought into that for myself. I think it's kind of like early-morning jogging for athletes: it's how we expose ourselves to the new opportunities and challenging situations that seem to make the difference. For me lately, this has meant:
- joining networking groups. I finally joined the "BYU Management Society":https://marriottschool.byu.edu/mgtsoc/index.cfm. I should have done this 7 years ago when I finished my MBA. Today, I want the feedback this group can give me in the networking luncheons they have. I also want to keep my business skills brushed up. Programming in near-seclusion in jeans and sneakers all day doesn't ask me to polish my interpersonal skills, where I could fit in with executives and managers as anything but the hired help.
- participating in local users groups. For me, that's "URUG":http://groups.google.com/group/urug. At URUG, I can share my ideas, and get better ideas from the professionals around me.
- lending a hand. I've been watching "Mahout":http://lucene.apache.org/mahout/, a Hadoop-style machine learning framework. I'm seeing where I can contribute, and I'm learning about higher standards for my work as I go.
- speaking. I've spoken at "MWRC":http://mwrc2009.confreaks.com/, "LSRC":http://lonestarrubyconf.com/index.html, and "UTOSC":http://utosc.com/pages/home/. The more I speak, the more I learn, the higher my standards go for my work.
Anyway, it's a great feeling when I see the progress available through striving a little in this community of ours.
Formtastic Rails Cast
Ryan Bates is putting a two-part screencast together on "Formtastic":http://github.com/justinfrench/formtastic. You can see the "first one here":http://railscasts.com/episodes/184-formtastic-part-1. This rounds out some of my UI thoughts from "Sunday":http://blog.tegugears.com/2009/10/18/front-end-frustrations-and-some-solutions.
Hadoop Online Processing
"Andrew Shafer":http://stochasticresonance.wordpress.com/ told me about "HOP":http://radar.oreilly.com/2009/10/pipelining-and-real-time-analytics-with-mapreduce-online.html last night. It's a really powerful concept:
- Take Hadoop's MapReduce interface
- Allow jobs to be run online, rather than batched
- Co-schedule tasks
- Keep tasks running all the time
This speeds up the delivery of the analytics: no need to read and write to HDFS for intermediate steps. This provides real-time analytics. This cleans up the feedback loop between data and analysis by quite a bit.
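To make the batch versus online distinction concrete, here is a toy word count in plain Ruby. It has nothing to do with Hadoop's or HOP's actual APIs; the method names and the sample lines are made up for illustration. The batch version builds the whole intermediate list before reducing, while the pipelined version feeds each mapped pair straight into the reducer and exposes running totals as it goes.

```ruby
# Toy input standing in for a stream of incoming records.
LINES = ['a b a', 'b c', 'a'].freeze

# Map step: emit a [word, 1] pair for every word in a line.
def map_line(line)
  line.split.map { |word| [word, 1] }
end

# Batch style: materialize every intermediate pair, then reduce once at the end.
def batch_count(lines)
  pairs = lines.flat_map { |line| map_line(line) } # full intermediate result
  pairs.each_with_object(Hash.new(0)) { |(word, n), totals| totals[word] += n }
end

# Pipelined style: push pairs into the reducer as soon as they are mapped and
# yield a snapshot of the running totals after every input line.
def pipelined_count(lines)
  return enum_for(:pipelined_count, lines) unless block_given?

  totals = Hash.new(0)
  lines.each do |line|
    map_line(line).each { |word, n| totals[word] += n }
    yield totals.dup
  end
  totals
end

snapshots = pipelined_count(LINES).to_a
snapshots.each { |snapshot| puts snapshot.inspect }
puts batch_count(LINES).inspect
```

The last snapshot matches the batch answer, but the earlier snapshots are available before all the input has arrived, which is the appeal for real-time analytics.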
I can think of a lot of data-rich applications that could use this. Wall Street and search tools could really have a field day with this stuff.
Gem Bundler
Ryan Shaw brought this to my attention. "Gem Bundler":http://litanyagainstfear.com/blog/2009/10/14/gem-bundler-is-the-future/ handles gem dependencies in an intelligent way. This tutorial shows you how to use Gem Bundler.
SlenderT: A Simple Triples Store
I've translated a little code from "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 by Toby Segaran et al. tonight. There's a neat little Ruby gem available for your pleasure now, "slender_t":http://github.com/davidrichards/slender_t:
sudo gem install slender_t
It's a triples store. It'll get some love when I work on my recommendation engine again in the morning. I decided to go a little more robust than the "Composite pattern, as I mentioned in yesterday's post":http://blog.tegugears.com/2009/10/18/a-little-education-goes-a-long-way and store the rules in a triples store. I couldn't justify installing "Sesame":http://www.openrdf.org/ or "Redland":http://librdf.org/ for that, so there's this.
I have a few things to look at tomorrow:
- Possibly implementing it with an "RBTreeMap":http://github.com/kanwei/algorithms/blob/master/lib/containers/rbtreemap.rb from the "Algorithms":http://github.com/kanwei/algorithms gem to make it a little nicer when adding triples. Currently, I have a lazy/nasty piece of work in the add method:
index[a][b] = index[a][b] | [c]
- I could do with a better query language. There are quite a few ideas that I want to take from this project and bring back into some other projects I should finish: marginal, fathom, and overalls.
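One lighter-weight alternative to the array union in that add method is a nested hash whose leaves default to a @Set@, so duplicates are dropped without rebuilding an array on every insert. This is only a sketch under assumptions: the class name @TinyTripleIndex@ is hypothetical, and the three permuted indexes are my guess at a reasonable layout, not a description of SlenderT's internals.

```ruby
require 'set'

# Hypothetical class, not SlenderT's implementation.
class TinyTripleIndex
  def initialize
    # Three permutations of the same triples so that different query shapes
    # can start from whichever index matches their known terms.
    @spo = new_index
    @pos = new_index
    @osp = new_index
  end

  # Adding the same triple twice is harmless because each leaf is a Set.
  def add(subject, predicate, object)
    @spo[subject][predicate] << object
    @pos[predicate][object] << subject
    @osp[object][subject] << predicate
    self
  end

  # Look up objects without creating empty entries for missing keys.
  def objects(subject, predicate)
    return [] unless @spo.key?(subject) && @spo[subject].key?(predicate)

    @spo[subject][predicate].to_a
  end

  private

  # index[a][b] springs into existence as an empty Set on first access.
  def new_index
    Hash.new { |outer, a| outer[a] = Hash.new { |inner, b| inner[b] = Set.new } }
  end
end

store = TinyTripleIndex.new
store.add('BSC', 'name', 'Bear Stearns').add('BSC', 'name', 'Bear Stearns')
puts store.objects('BSC', 'name').inspect
```

A sorted structure such as the RBTreeMap mentioned above would additionally keep keys ordered, which this sketch does not attempt.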
Anyway, if you'd like to play around a little with semantic technologies, that may give you a start. Also, "Programming the Semantic Web":http://www.amazon.com/Programming-Semantic-Web-Toby-Segaran/dp/0596153813/ref=sr11?ie=UTF8&s=books&qid=1256026764&sr=8-1 and "Semantic Web for the Working Ontologist":http://www.amazon.com/Semantic-Web-Working-Ontologist-Effective/dp/0123735564/ref=sr11?ie=UTF8&s=books&qid=1256026799&sr=1-1 are practical references worth picking up.
Data Analysis Book Proposal
I'd like some feedback on a book proposal I'm working on. I understand that not many people read this blog, but maybe I can direct a few friends to read this outline and drop me some comments. This is very much a work in progress, but if I don't make progress, then why start, right?
I'd like to write books as a regular part of my work life. In the same way that blogging clarifies my thinking, so might writing books. One of my favorite authors, "Gerald Weinberg":http://en.wikipedia.org/wiki/GeraldWeinberg, would write a book about anything he wanted to learn thoroughly. He wrote some very influential books in my life because of that, including "Introduction to General Systems Thinking":http://www.amazon.com/Introduction-General-Systems-Thinking-Anniversary/dp/0932633498/ref=sr1_1?ie=UTF8&s=books&qid=1255901792&sr=8-1, which sent me off to Portland to learn about systems.
I don't want to send anyone to Portland, but I would like to talk about open source data analysis:
h3. Pragmatic Data Analysis with Open Source Software
I would like to publish with the "Pragmatic Bookshelf":http://www.pragprog.com/, which has some very useful "writing guidelines":http://www.pragprog.com/write-for-us. Because of these guidelines, the book needs to be:
- approachable
- friendly
- tutorial-focused
This really uncovers why I want to publish with Pragmatic:
- Who wouldn't want an approachable book on data analysis? If you have to learn the subject, you might as well have a followable guide to walk you through the process.
- The style forces me to learn things well enough that I could teach them to people who work for me. I'd like to be able to duplicate what I'm learning in other people. As a saying often attributed to Einstein goes, you don't understand a subject unless you can explain it to a six-year-old.
h3. Similar Work
There are some really interesting books in this space already. By this space, I mean people who want to use open source software to find answers from their data. Readers of these books are often software developers, technical managers, or IT professionals. Some books in the space include:
- "Programming Collective Intelligence":http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=sr11?ie=UTF8&s=books&qid=1255902337&sr=8-1 by Toby Segaran, an introduction to recommendation engines, clustering, searching and classifying with Python.
- "Collective Intelligence in Action":http://www.amazon.com/Collective-Intelligence-Action-Satnam-Alag/dp/1933988312/ref=pdsimb_1 by Satnam Alag, a Java response to Programming Collective Intelligence
- "Algorithms of the Intelligent Web":http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665/ref=pdbxgybimgb by H. Marmanis. This looks like a very similar approach to my book.
- "Data Mining":http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0120884070/ref=sr11?ie=UTF8&s=books&qid=1255902665&sr=1-1 by Ian Witten et al. This is a very good introduction to many basic algorithms, along with an introduction to Weka, one of the tools I'd speak about in my book.
- "Beautiful Data":http://www.amazon.com/Beautiful-Data-Stories-Elegant-Solutions/dp/0596157118/ref=pdsimb_4 by Jeff Hammerbacher.
The real difference I see between what I want to do and what others have done is that I want to walk readers through a real project. A lot of the ideas in these other books are interesting, but actually working on a project at scale makes a big difference. Readers of this book may have been reading some of these other books, but now need to scale an application up for production.
h3. Chapters
So, I'd like to take a 200-300 page approach to the subject. I'd like to work through a general problem that will introduce some tools and solutions that will be interesting. The chapters that I thought I could do are:
h4. Overview
Introduce the project, the communities, the reasons we might have to work in this space.
h4. Collecting the data
I'm not sure which approaches I'd use for collecting the data. I have written some solutions, but there are probably some better general solutions out there. Maybe "80 Legs":http://80legs.com/ is really the best general approach.
h4. Processing w/ Hive and Hadoop
Starting with some Puppet scripts, we install Hadoop and Hive onto our system. We then start organizing the data and run some map and reduce scripts that have already been written. Possibly I'll use this for the Collecting the Data chapter.
h4. Basic Classification
From raw data, I want to start to add some classifications. This will be Neural Networks and Support Vector Machines.
h4. Clustering and Decision Trees
Breaking the data into broad categories can also be very useful.
h4. Belief Networks
Dealing with Bayes' theorem, belief networks, causality, and prediction.
h4. Parameter Selection
Given such a large data set, we could get into trouble if we are not careful. This chapter takes the data set and selects which parameters to use when learning from the data.
h4. Interpreting the Model
A model is only as good as its interpretation. This chapter reviews basic statistics, gotchas, etc. that someone might use to evaluate a model.
h4. Decision Support
The bridge from insightful learning to practical workbenches can be a little daunting. Here is a way to bring this system to the executive or manager that needs the data. This includes data visualization, model interpretation, and knowledge queries.
h4. Reinforcement Learning
The leap from learned data to optimal policy decisions can be a little daunting. Here are tips, tricks, and tools for making that connection.
R/Parallel and REvolution R
I am interested in hearing from anyone who may be using R/Parallel in their work or research. I would love to find some working examples and a community of people that I could learn from. The basic idea of this project is to take your "R":http://www.r-project.org/ code and run it in a parallel environment.
I have also joined the "REvolution R":http://revolution-computing.com/products/revolution-r.php mailing list, hoping to see what's out there.
I don't have many people that read this blog, but if you happen to know something about these things and come across this article, would you mind posting a few notes before you move on? Thanks.