Causality
I've been reading Causality by Judea Pearl, one of my Christmas presents this year. It's an amazing read. I started with the epilogue, to get the grand view of the book from a researcher's perspective. The epilogue's contents can actually be found here.
The epilogue is the transcript from a lecture that Pearl gave to UCLA researchers around the time of the first edition of the book. In it, Pearl outlines the history and issues of causality.
One of the major ideas that Pearl covers is the sticky issue of:
correlation does not imply causation
This issue stems from using algebra instead of causal graphs to describe the world. Causal graphs add a few critical elements to the algebra:
- Preservation of the dynamics of a situation, the ins and the outs to a system
- Using equations as representations of actual mechanisms observable in our experiments
- Interventions set up to explicitly adjust the mechanisms, to uncover actual causality if it exists
Pearl then supports these ideas and works through an example of using them.
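A minimal Ruby sketch of why intervening differs from observing. The model is a toy of my own making, not an example from Pearl's book: a hidden coin flip `z` drives both `x` and `y`, and `y` never depends on `x`. The names `flip`, `draw_sample`, and `p_y_given_x`, the 10% noise level, and the sample size are all assumptions chosen for illustration.

```ruby
# Toy structural model with a hidden common cause (hypothetical):
#   z = fair coin flip                 (the confounder)
#   x = z, flipped 10% of the time
#   y = z, flipped 10% of the time     (y does NOT depend on x)

def flip(bit, rng, noise)
  rng.rand < noise ? 1 - bit : bit
end

# Draw one sample. Passing do_x replaces the mechanism for x,
# which is the intervention: x no longer listens to z.
def draw_sample(rng, do_x: nil, noise: 0.1)
  z = rng.rand < 0.5 ? 1 : 0
  x = do_x.nil? ? flip(z, rng, noise) : do_x
  y = flip(z, rng, noise)
  { z: z, x: x, y: y }
end

# Estimate P(y = 1) among the rows where x has the given value.
def p_y_given_x(rows, x_value)
  matching = rows.select { |r| r[:x] == x_value }
  matching.count { |r| r[:y] == 1 }.fdiv(matching.size)
end

rng = Random.new(42)
n = 20_000

observed = Array.new(n) { draw_sample(rng) }
forced_1 = Array.new(n) { draw_sample(rng, do_x: 1) }
forced_0 = Array.new(n) { draw_sample(rng, do_x: 0) }

observational_gap  = p_y_given_x(observed, 1) - p_y_given_x(observed, 0)
interventional_gap = p_y_given_x(forced_1, 1) - p_y_given_x(forced_0, 0)

puts format("observational gap:  %.3f", observational_gap)
puts format("interventional gap: %.3f", interventional_gap)
```

In the observed data `x` and `y` move together strongly, yet forcing `x` to a value leaves `y` essentially unchanged. The correlation is real; the causation is not.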
I'm taking my time with this book, working through ideas with Ruby code and finding out if I understand the concepts deeply enough to be a practitioner. I have a few un-optimized ideas I might share in an upcoming gem. We'll see.
Books I'm Picking Up
I've just bought three books, and all of them look very interesting. Maybe you'd be interested in them as well.
h2. "The Art and Science of CSS":http://www.sitepoint.com/books/cssdesign1/?SID=d9a8bf46b62f9bfb898f4bad122498d0 by Cameron Adams et al
This seems very practical and to the point. It covers the basics, but in a not-so-basic way:
- Headings
- Images
- Backgrounds
- Navigation
- Forms
- Rounded Corners
- Tables
I bought the book to work on some of my forms, and found that there are many other useful things in the book.
h2. "The Laws of Simplicity":http://www.amazon.com/Laws-Simplicity-Design-Technology-Business/dp/0262134721/ref=sr11?ie=UTF8&s=books&qid=1255896482&sr=8-1 by John Maeda
This was mentioned in a screencast I reviewed today, and I thought that it had a lot of promise. The Ten Laws:
- Reduce
- Organize
- Time
- Learn
- Differences
- Context
- Emotion
- Trust
- Failure
- The One
Basically, Simplicity = Sanity
h2. "Thinking in Systems: A Primer":http://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557/ref=sr11?ie=UTF8&s=books&qid=1255896645&sr=1-1 by Donella H. Meadows
This book, published in 2008, grows out of the systems thinking Meadows is known for from The Limits to Growth (1972), a line of thought that has influenced a lot of my life. Many of the books I've read and the approaches I take to machine learning are rooted in these philosophies. In fact, the "Systems Science":http://www.pdx.edu/sysc/ program I attended in Portland is based on these foundations.
The basics:
- System Structure and Behavior
- Systems and Us
- Creating Change in Systems and in Our Philosophy
I'm excited to get all of these. I bought the CSS book as a PDF; the others are coming via Amazon.
The Role of "a priori"
I've been reading "Learning from Data by Cherkassky and Mulier":http://www.amazon.com/Learning-Data-Concepts-Theory-Methods/dp/0471681822/ref=sr11?ie=UTF8&s=books&qid=1250695854&sr=8-1 lately. It's an amazing book. The book reconciles differences across a great deal of academic literature:
- Identifying synonyms between the schools of thought
- Identifying real differences in opinions, goals, and abilities of various schools of thought
- Presenting a taxonomy of ideas for data analysis
- Developing/expanding a few branches quite well on that tree
I'll probably give the book a decent/full review when I've finished it, and I'll probably put snippets from the book here in the meantime. Today's topic: a priori knowledge.
It really hit me last night how useful a priori knowledge really is. Most of the literature dismisses it, talking about a priori knowledge as something we haven't figured out how to mimic with our algorithms yet. For a researcher pushing the limits of our abilities, that is a very useful perspective. For a practitioner, learning to use things, a different perspective is useful.
A priori knowledge is the stuff that comes from the environment, usually the researcher's mind. It is often subjective and rarely tracked very carefully. A priori knowledge is critical for statistical model learning and predictive learning. In data mining applications, a priori knowledge may be used to justify a model after it has been discovered. Using this knowledge, we can simplify and focus our models, making them more useful for the things we are interested in, more interpretable when we have finished them, and more computable. We use a priori knowledge to remove loops where loops make an algorithm intractable. We use a priori knowledge to choose which learning algorithms to use and to decide which input variables are uninformative.
Consider the formal experimental procedure:
- State the problem
- Formulate the hypothesis
- Design the experiment/generate the data
- Collect the data and perform preprocessing
- Estimate the model
- Interpret the model/draw conclusions
At every stage of this process, there is a role for a priori knowledge. I like the formal approach, by the way, even if it's just a reminder.
The trick, then, is to make the a priori knowledge explicit. In formal literature, these things are listed as part of the writeup. In our regular research, there should be dialog around a priori knowledge. There can be tests of whether that knowledge is actually consistent with the data.
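One way to make a priori knowledge explicit is to write each assumption down next to a check that can be run against the data. The sketch below is only an illustration of that idea: the `Assumption` struct, the sample rows, and the three assumptions are all hypothetical, and none of this is Tegu's API.

```ruby
# Each assumption pairs a plain-language statement with a check
# that can be run against the collected data.
Assumption = Struct.new(:statement, :check) do
  def consistent_with?(rows)
    check.call(rows)
  end
end

# Hypothetical data, for illustration only.
rows = [
  { age: 34, income: 52_000 },
  { age: 51, income: 71_000 },
  { age: 27, income: 38_000 }
]

# Hypothetical assumptions a researcher might otherwise keep in their head.
assumptions = [
  Assumption.new("Ages are adult ages (18 to 100)",
                 ->(data) { data.all? { |r| r[:age].between?(18, 100) } }),
  Assumption.new("Income is never negative",
                 ->(data) { data.all? { |r| r[:income] >= 0 } }),
  Assumption.new("Income is recorded in thousands",
                 ->(data) { data.all? { |r| r[:income] < 1_000 } })
]

# Pair every statement with whether the data agrees with it.
report = assumptions.map { |a| [a.statement, a.consistent_with?(rows)] }

report.each do |statement, ok|
  puts "#{ok ? 'holds' : 'FAILS'}: #{statement}"
end
```

The third assumption fails on this data, which is the point: a belief about units that would normally go unstated gets caught before the model is estimated.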
As I read about the issues with model building, I keep thinking how convenient it is to make Tegu a social tool, rather than just a number cruncher. By adding dialog to part of the problem formulation and solution, we are handling a priori knowledge explicitly.
Numerical Recipes
It's not a minor thing, to represent the world I live in mathematically. Modeling the systems in my world has been an interest since Junior High School, but only in the last few years have these things begun to be practical for me. A big step for me has been working through Numerical Recipes by William H. Press et al. It's an amazing book, very well written, very practical, and considerately integrative. I can bump around the book and quickly ground myself on the various ideas that are being shared. It is a broad book, covering everything from linear algebra and data modeling to all sorts of statistics, Fast Fourier Transforms, and many other interesting subjects. Reading through these recipes, I am starting to see how so many other libraries contribute to my explorations of these things. Particularly, many of the methods in the GNU Scientific Library are making sense now.
Which reminds me how frustrated I can get with Ruby bindings for the GSL. There are two bindings out there: ruby-gsl and rb-gsl. Ruby-gsl relies partially on rb-gsl, which has some funky limitations that take some special patching. In the past, I've played with a few tricks to get it working. I have it working right on some servers that I use, but I've abandoned the effort on others. This is the reason I'm convinced a Ruby Machine Learning image is necessary to build a community around these kinds of efforts in Ruby. The dependencies are too intricate and tricky and the power is too great to just leave an abstract concept lying around the archives of RAA, Rubygems, Github and SourceForge: powerful tools growing rusty in their obscurity. The sad thing is that there are brilliant tools available today.
I've hesitated because of the cost and time necessary to make this available. I'm making some pretty major changes in my personal life right now, and I'm looking at a lot of nearly-complete projects that need to be released in the near future:
- A workable TeguGears (I've got some interesting implementations of good algorithms on top of TeguGears, so that's looking promising)
- The ontology with a truth maintenance system to query it (so much is left undone on this project, but I still stay up nights working through Pearl's ideas on the subject)
- FarGRATR (I took TenaciousG and redid it with Redis as the adjacency list, a distributable and fast database supporting a very large graph library, overkill for the limitations that GRATR gives us)
- Many other working algorithms, awaiting small things like examples, integration with other gems, and who knows what else holding me back
- Tying all of these tools into some business objects, integrating this power into ERP systems
- A really cool teguhub.com app that's waiting for the ontology before I release it. It doesn't make sense to try and support another app before it's particularly useful. So, there will be an API and an online search of known resources for these things in Ruby.
Bottom line: there are some very fun things going on despite my silence and lack of recent releases.
Statistical Methods for Research Workers
I've been reading one of R. A. Fisher's books: "Statistical Methods for Research Workers":http://www.amazon.com/Statistical-Methods-Experimental-Scientific-Inference/dp/0198522290/ref=pdbbssr_1?ie=UTF8&s=books&qid=1235111308&sr=8-1. The book was an early work, originally a type of recipe book for a knowledge worker to set up experiments. It was written before Pearson and Neyman developed the idea of the null hypothesis.
What's more fascinating is the time Fisher is willing to take to explain things. I have 4 or 5 books on my shelf that explain statistics, geared for a practitioner rather than a theorist. I don't have any interest in expanding statistical theory, but there is a lot that lies between where I am and where I am going in using the stuff. These books could almost be carbon copies of each other. They read like manuals, explaining the terms and concepts, walking the self-same path of discovery. There is no joy of discovery in those books like there is in Fisher's book. For instance, Fisher takes the time to explain what it looks like to see parts of the normal distribution without appealing to abstractions.
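To make "parts of the normal distribution" concrete, here is a small Ruby sketch (my own, not taken from Fisher's text) that computes how much of a normal distribution lies within a given number of standard deviations of the mean. It relies on the standard identity P(|Z| < k) = erf(k / sqrt(2)) and Ruby's built-in `Math.erf`; the method name `normal_coverage` is an assumption for illustration.

```ruby
# Fraction of a normal distribution lying within k standard
# deviations of the mean: P(|Z| < k) = erf(k / sqrt(2)).
def normal_coverage(k)
  Math.erf(k / Math.sqrt(2))
end

[1, 2, 3].each do |k|
  puts format("within %d sd: %.4f", k, normal_coverage(k))
end
```

This is the familiar 68/95/99.7 pattern, and `normal_coverage(1.96)` comes out very close to 0.95.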
They call him the Father of Modern Statistics for a reason, you know. It's because he was there, explaining and discovering at a time when organizing a test was whatever you happened to think of at the time. You are the scientist, so your process is assumed to be scientific.
Reading Fisher reminds me of other great minds, and the simplicity they give to their subjects:
- Polya describes the patterns of general problem solving in "How To Solve It":http://www.amazon.com/How-Solve-Mathematical-Princeton-Science/dp/069111966X/ref=sr11?ie=UTF8&s=books&qid=1235111748&sr=1-1 after a career of teaching mathematics at Stanford and involving himself in many mathematical breakthroughs
- Kline describes Calculus like it's the backyard playground in "Calculus":http://www.amazon.com/Calculus-Intuitive-Physical-Approach-Second/dp/0486404536/ref=sr11?ie=UTF8&s=books&qid=1235111820&sr=1-1. No wonder he was the dean in mathematics at New York University
The world needs more writers like Kline, Polya, and Fisher, when they're at the peak of their experiences. I'm glad I have this resource to keep my mind active and curious in topics that would otherwise bury me quickly.
The Lady Tasting Tea
I was recently evangelizing Tegu at a local Ruby Users Group. While there, a friend suggested that I read The Lady Tasting Tea by David Salsburg. I took that advice, read the book, and I am very glad I did.
Statistics tends to be dry, aloof, and a little intimidating. Salsburg gives us a face of this world unlike any I've ever seen. It is intriguing, personal, and inviting. Salsburg binds his story with three major elements: the people, the ideas, and the impact of statistics. By bringing these together in a simple way, he makes me feel like a welcome guest at the statistics party.
Salsburg's main point is that statistics has created a revolution in almost all corners of science, and that something new is likely preparing to create another such revolution. Understanding how, over the last 100 years, most of what we take for granted about scientific thought came about shapes the kind of analysis I do and gives me additional creativity in the approaches I might employ.
The story starts with Karl Pearson, who gave us a foundation of probability distributions, chi-square calculations, mean, symmetry, kurtosis, and standard deviation. We move forward with the genius Fisher, who teaches us about the design of our experiments, p-values, and a rigor towards our final results. Learning about the conflicts of ideas in the early days between Pearson and Fisher opens our minds about the kinds of thinking that were employed to develop our understanding of statistics. By being directly engaged in the thought process, it becomes like our own. We move forward with Gosset and his "Student's" t-test, and the strange circumstances that gave it that name. Pearson's son works with Neyman and delivers the null hypothesis test. Kolmogorov brings statistics down to an axiomatic basis and gives us a whole new field of thought, stochastic processes. The story goes on.
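A few of the quantities named above are simple enough to write out. This is a plain-Ruby sketch under my own assumptions, not code from the book or from Tegu: skewness (the "symmetry" measure) and kurtosis use population moments that divide by n, the t statistic is the one-sample form with the n - 1 standard deviation, and the data set and the reference mean of 4 are made up.

```ruby
def mean(xs)
  xs.sum.fdiv(xs.size)
end

# k-th central moment: the average of (x - mean)^k.
def central_moment(xs, k)
  m = mean(xs)
  xs.sum { |x| (x - m)**k }.fdiv(xs.size)
end

# Skewness: third central moment scaled by variance^(3/2).
def skewness(xs)
  central_moment(xs, 3) / central_moment(xs, 2)**1.5
end

# Kurtosis: fourth central moment scaled by variance^2.
def kurtosis(xs)
  central_moment(xs, 4) / central_moment(xs, 2)**2
end

# Sample standard deviation (divides by n - 1).
def sample_sd(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))
end

# One-sample t statistic against a hypothesized mean mu0.
def t_statistic(xs, mu0)
  (mean(xs) - mu0) / (sample_sd(xs) / Math.sqrt(xs.size))
end

# Hypothetical measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

puts format("mean:     %.3f", mean(data))
puts format("skewness: %.3f", skewness(data))
puts format("kurtosis: %.3f", kurtosis(data))
puts format("t vs 4:   %.3f", t_statistic(data, 4))
```

Turning the t statistic into a p-value needs the t distribution's cumulative function, which Ruby's standard library does not provide; that is the kind of thing the GSL bindings are for.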
It's a fast read, an engaging exercise. I certainly recommend it to anyone who feels even a little intimidated by statistics. It has given me a rather wide base of analytics that I want to run through Tegu. These aren't new ideas to me, but it is very helpful to have a checklist of ideas that work well with each other, with various data types, and with the libraries that I'm using so far.
Competing on Analytics
Last year, the Harvard Business School Press published a book called Competing on Analytics by Davenport and Harris. I had seen the book but didn't pay too much attention to it, since I'm already sold on the idea of data-driven decision making. Last week, I attended a Management Society meeting from my business school. It dawned on me how poorly I was building a case for analytics and Tegu for these kinds of people, managers and executives. I've been too steeped in getting the architecture and implementation of the Tegu idea right. I went home that night and worked until 4 in the morning, outlining slides and information that a manager should have in order to be competitive in this field.
This book adds a lot of credibility to the argument. It also creates an executive-level overview of what is possible with analytics and the necessary elements for creating this capability in an organization.
Davenport and Harris lead with the case of Netflix. Reed Hastings started Netflix after being charged a $40 late fee by Blockbuster for his rental of Apollo 13. This turned out to be a David and Goliath story that is well-grounded in analytics. The role of analytics in Netflix, and indeed in any successful company, is in building a distinctive capability. Not just any capability, but a strategic competency.
This book follows the typical style for Harvard Business Press: something at a high level that a busy executive can digest and realize the key points. Given that you accept that analytics is the key in this information age, this book offers a roadmap for you. If you're still on the fence, read on.
Understand the Quality of Business Intelligence and Analytics
A great quote:
There is considerable evidence that decisions based on analytics are more likely to be correct than those based on intuition.
As questions get deeper, more sophisticated levels of data access or analytics are useful. The spectrum could be shown as:
- Standard reports
- Ad hoc reports
- Query/drill down
- Alerts
- Statistical analysis
- Forecasting/extrapolation
- Predictive modeling
- Optimization
Tegu, incidentally, is geared for items five through eight.
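As a taste of the "forecasting/extrapolation" rung, here is a minimal Ruby sketch that fits a straight line by ordinary least squares and extends it one period ahead. It is an illustration under my own assumptions, not Tegu code or anything from the book: the method names `fit_line` and `forecast` and the monthly sales figures are hypothetical.

```ruby
# Ordinary least squares for y = intercept + slope * x.
def fit_line(xs, ys)
  n = xs.size
  x_bar = xs.sum.fdiv(n)
  y_bar = ys.sum.fdiv(n)
  sxy = xs.zip(ys).sum { |x, y| (x - x_bar) * (y - y_bar) }
  sxx = xs.sum { |x| (x - x_bar)**2 }
  slope = sxy / sxx
  { slope: slope, intercept: y_bar - slope * x_bar }
end

# Extend the fitted line to a new x value.
def forecast(model, x)
  model[:intercept] + model[:slope] * x
end

# Hypothetical monthly sales, in units.
months = [1, 2, 3, 4, 5]
sales  = [110, 118, 131, 139, 152]

model = fit_line(months, sales)
puts format("slope per month:      %.2f", model[:slope])
puts format("forecast for month 6: %.1f", forecast(model, 6))
```

A straight line is the crudest possible forecast. Predictive modeling and optimization build on the same step of estimating parameters from data and then asking the fitted model a question.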
Another good quote from the book is:
Analytic competition will be something of an arms race, requiring continual development of new measures, new algorithms, and new decision-making approaches.
That quote seems especially poignant. First, we're talking about an archetype of a system: escalation. The problem with escalation is that diminishing returns tend to sour the punch, so to speak. Another poignant issue brought on by this quote is the suggestion that this is the environment in which we compete, so getting good at this stuff is imperative now. I like that implication.
Another quote:
In order for quantitative decisions to be implemented effectively, analysis will have to be a broad capability of employees, rather than the province of a few "rocket scientists" with quantitative expertise.
Use a Framework to Develop Analytical Capability
In this context, we're talking about a framework of ideas, not just a software framework.
To build analytical competition, there are four pillars required for the foundation:
- Distinctive Capability
- Enterprise-wide Analytics
- Senior Management Commitment
- Large-scale Ambition
Given this foundation, the ability to actually compete based on analytics will evolve in stages:
- Analytically impaired
- Localized analytics
- Analytical aspirations
- Analytical companies
- Analytical competitors
Hopefully, if you are a manager or executive and you are reading this, you have at least analytical aspirations.
The rest of this book is full of ideas about implementing analytics in your organization, from the kinds of analysis you're going to need to an approach to developing these resources in the context of your organization.
I recommend this book, if for no other reason than to have a very well-developed framework to keep you on track as you progress. Of course, I'm of the mindset that results speak louder than words, and a good mini-research-project of an afternoon or weekend will take you further down the road than reading any book or article. So please stay tuned for concrete assistance in your analytical ambitions.