A Bayesian Mind
Since I received Judea Pearl's Causality for Christmas, I've been thinking a lot more about Bayesian inference. I thought I'd jot down a few notes to see if these things are clearly outlined in my head.
Bayesian inference is based on conditional probability: the probability of some hypothesis given some pool of evidence. The traditional approach defines this in terms of joint probabilities, the probability that two events occur together. Although this is accurate, Pearl explains that what we are actually doing is contextualizing our beliefs: what is our belief that this bird will fly, given that it's roasted?
That seems to be an answerable question to ask. We can further break these questions down into two parts:
- What are the prior odds of our hypothesis? Or, what are the odds that the event would ever even be relevant? What are the odds that birds fly?
- What is the likelihood of what we're seeing, given our hypothesis? Or, how likely is it that a bird that flies turns up roasted, like this one?
Let's say that 9 out of 10 birds ever observed were observed to fly, but roasted birds have not been observed to fly in the flapping-wings-and-taking-off sense. We combine odds with likelihood: 90% times 0% is 0%, or a pretty reasonable belief that roasted birds don't fly of their own accord. This leaves us with an overall belief that our dinner of roasted turkey might stay on our plate throughout the meal.
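To check the arithmetic, here is a minimal sketch of Bayes' rule in odds form (posterior odds = likelihood ratio times prior odds). The function name is my own, and the 5% chance that a non-flying bird ends up roasted is a made-up number purely for illustration; only the 9-in-10 prior and the "no flying bird is roasted" figure come from the notes above.

```python
def posterior_from_odds(prior_prob, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule in odds form: posterior odds = likelihood ratio * prior odds."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    likelihood_ratio = p_evidence_given_h / p_evidence_given_not_h
    posterior_odds = likelihood_ratio * prior_odds
    # Convert odds back into a probability.
    return posterior_odds / (1.0 + posterior_odds)


# Prior: 9 out of 10 birds fly (from the text).
# P(roasted | flies) = 0.0: no flying bird has been observed roasted (from the text).
# P(roasted | does not fly) = 0.05: an ASSUMED value, for illustration only.
belief_roasted_bird_flies = posterior_from_odds(0.9, 0.0, 0.05)
print(belief_roasted_bird_flies)  # 0.0
```

Whatever value we assume for the non-flying birds, a likelihood of zero wipes out the prior, which is the point of the turkey example.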
Pearl also takes the time to explain what he means by variables. A variable, in this case, is a set of Mutually Exclusive, Collectively Exhaustive (MECE) values, a package that is quite convenient to work with. The term MECE actually comes from a book I read years ago about the life and times of a strategic consultant at the Boston Consulting Group. All of their work was required to be MECE. That is because they were reducing the problems of large organizations into models that could safely use Bayesian inference to guide their decisions.
Once we have variables, our goal is to find out which ones are independent. Every time two variables are found independent, an angel gets its wings. OK, maybe I've got It's a Wonderful Life on the brain. Every time two variables are found independent, we are freed from including them in the same equations. Our math gets a lot easier. So, we take our time to show independence so that our models become more nimble and informative and useful and even elegant. The math goes something like this:
P(x|y,z) = P(x|z) whenever P(y,z) > 0
That means, when the probability of x given y and z is equal to the probability of x given z, then y no longer contributes to our understanding of x, once we know z. Put another way, if we focus our attentions on z, we don't have to calculate the value of y. Or, we have a nifty little test: we pull variables out and see if our overall beliefs actually change with those links missing. If not, we're golden, we can leave things out and move on.
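The "pull a variable out and see if our beliefs change" test can be sketched directly from the equation. This is only an illustration under my own assumptions: three yes/no variables stored as a dictionary keyed by (x, y, z), with made-up numbers, and function names that are mine rather than Pearl's.

```python
from itertools import product


def bern(p, value):
    """Probability that a yes/no variable with P(1) = p takes `value`."""
    return p if value == 1 else 1.0 - p


def is_independent_given_z(joint, tol=1e-9):
    """Check P(x | y, z) == P(x | z) for every (x, y, z) with P(y, z) > 0."""
    for x, y, z in product((0, 1), repeat=3):
        p_yz = sum(p for (_, yy, zz), p in joint.items() if (yy, zz) == (y, z))
        if p_yz == 0:
            continue  # the equation only applies when P(y, z) > 0
        p_z = sum(p for (_, _, zz), p in joint.items() if zz == z)
        p_xz = sum(p for (xx, _, zz), p in joint.items() if (xx, zz) == (x, z))
        if abs(joint[(x, y, z)] / p_yz - p_xz / p_z) > tol:
            return False
    return True


# Made-up numbers: x and y each depend on z, but not on each other.
p_z1 = 0.7
p_y1_given_z = {0: 0.2, 1: 0.6}
p_x1_given_z = {0: 0.1, 1: 0.8}
screened_off = {
    (x, y, z): bern(p_z1, z) * bern(p_y1_given_z[z], y) * bern(p_x1_given_z[z], x)
    for x, y, z in product((0, 1), repeat=3)
}

# Counterexample: x simply copies y, so y always matters.
copies_y = {
    (x, y, z): (0.25 if x == y else 0.0) for x, y, z in product((0, 1), repeat=3)
}

print(is_independent_given_z(screened_off))  # True
print(is_independent_given_z(copies_y))      # False
```

Note that the check loops over every value of z, which is what the equation asks for.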
How about an example? Let's say I wanted to climb King's Peak this summer. What are the chances I'll climb King's Peak this summer, given the status of my training and the status of my knees? Let's say I went to the gym every day for 9 months. Who cares, can I walk a mile without my knees swelling up? Nope. OK, then I don't need to classify the quality of my training until I figure out what my knees need to handle the pounding of a good, long hike. Unless I can handle the knee issue, I'm not going to make that peak anytime soon. Given what I know about my knees, making the peak is conditionally independent of my training. Now, if the knees are working fine, then I'd better find out if my heart and lungs have been prepared for a two-day hike to the highest peak in Utah. That's conditional independence, as I understand it, though strictly the equation above asks for the independence to hold for every value of z, not just the bad-knees one.
Marginal independence, or unconditional independence, means that two variables are independent before we condition on anything else at all. Meaning, my choice of laptop brands and the price of tea in China are unconditionally independent. I really don't want to link the two computations together under any circumstances.
Let's look at why. If everything is linked to everything, then anything I know has to be taken into account. If I know three yes-or-no things, then I have 2 to the 3rd power, or 8 probabilities to store. But, if I know 10 things, I have 1024 probabilities to store. What if I know one thousand things, or one million things? Good luck, the universe isn't big enough to hold all the probabilities that are possible. And we don't have time enough to figure it all out anyway. So, independence is key. The liberation front is where it's at.
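The blow-up is easy to see with a few lines of arithmetic. This sketch assumes every variable is yes-or-no, and compares the full joint table against the extreme case where every variable is independent of every other; the function names are mine.

```python
def full_joint_entries(n_vars, n_values=2):
    """Entries in one table covering every combination of values."""
    return n_values ** n_vars


def fully_independent_entries(n_vars, n_values=2):
    """Entries if each variable gets its own small table (all independent)."""
    return n_vars * n_values


for n in (3, 10):
    print(n, full_joint_entries(n), fully_independent_entries(n))
# 3 8 6
# 10 1024 20

# For a thousand variables the joint table size is a 302-digit number.
print(len(str(full_joint_entries(1000))))  # 302
```

Real models sit somewhere between the two extremes, which is why each independence we can show is worth the trouble.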
So far, so good. I skipped a few of the details: properties of conditional independence; the actual equations for odds, likelihood and belief; equations for probability manipulation such as the product rule and the chain rule. Overall, it seems that we're getting somewhere. Probably to the point of appreciating Bayesian networks.
Bayesian networks are used to keep our thinking clear. Pearl outlines them a little more carefully. The benefits of Bayesian networks are:
- to provide convenient means of expressing substantive assumptions
- to facilitate economical representation of joint probability functions
- to facilitate efficient inferences from observations
Basically, if we can test ideas carefully, or if we can prove independence, we can store these ideas fairly quickly. We learn in a stepwise fashion, and we continue to benefit from this learning over time. We learn to simplify our models, then we don't go back and work with more complicated models. We test out an idea, and if it works, we have the architecture in place to take advantage of that idea. If not, we merely remove the link we created and keep going. Bayesian networks are simply the agile methods of cognitive systems.
Pearl points out that a Bayesian network is meant to emphasize the subjective nature of the inputs. This little piece of honesty could stop many snide remarks from statistical snobs before they start. We are working with a priori knowledge here, and we like it that way. We are purposefully mixing in our personal expertise, subjective as it may be, into our models. The modeler matters. This may be a limitation to the Bayesian perspective, according to some, but I personally like doing the kind of analysis that asks me what I think before getting started. I am the god of that model, and everyone better know it. If I am right, if my analysis turns out consistent, I can come up with better models than my cohorts. I can actually get better answers to life's questions by staying informed and suggesting ways of organizing the inference that can be tested and used. If not, well, let the results of the analysis show that things are inconsistent, that I had a faulty view of the world that is less useful. At least I had a shot at greatness.
But, back to the story. There is a whole set of tools for finding simpler models. One of these is called d-separation. There are others. We create the simplest models possible. The whole point is to start making inferences. We want to:
- make a coherent interpretation of the model--actually make sense when we describe what we are modeling
- remain consistent with our prior observations and the information on hand
Pearl came up with a process for working the Bayesian network in a linear way. He breaks the network up into cut-sets, or compound variables, that can be calculated as a whole and then used with other compound variables in the network. There are other methods of simplification as well. If these methods do not simplify a problem sufficiently, then stochastic simulation can be used. I'm trying to remember the details of stochastic simulation from another book of Pearl's. If I remember right, it samples from the network to derive clusters that should occur in the network.
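To jog my own memory about simulation, here is a rough sketch of the general idea: draw many samples from a small network and count. To be clear about assumptions, this is plain forward sampling with rejection, which I'm swapping in because it is the simplest scheme I know; I am not claiming it is the scheme Pearl describes. The three-variable network (mopped floor, wet floor, slip) and all of its numbers are made up.

```python
import random


def sample_once(rng):
    """Draw one (mopped, wet, slip) sample, parents before children."""
    mopped = rng.random() < 0.2                    # P(mopped), assumed
    wet = rng.random() < (0.9 if mopped else 0.1)  # P(wet | mopped), assumed
    slip = rng.random() < (0.3 if wet else 0.02)   # P(slip | wet), assumed
    return mopped, wet, slip


def estimate_p_mopped_given_slip(n_samples=200_000, seed=0):
    """Estimate P(mopped | slip) by keeping only the samples where slip happened."""
    rng = random.Random(seed)
    kept = 0
    hits = 0
    for _ in range(n_samples):
        mopped, _wet, slip = sample_once(rng)
        if slip:
            kept += 1
            hits += mopped  # True counts as 1
    return hits / kept


estimate = estimate_p_mopped_given_slip()
exact = 0.0544 / 0.0928  # worked out by hand from the assumed numbers
print(round(estimate, 2), round(exact, 2))
```

The estimate should land close to the hand-computed value of roughly 0.59, and it gets closer as the number of samples grows.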
So, we get to the point where we have Bayesian networks. These are associative networks: networks that reflect independence and dependence. Now, we want to move on to causal Bayesian networks. These are assertions of actual causality. The rest of the book explores how to do this and use this. The question about why we want these, however, can be my last point.
Once we have a causal network, we can then respond to changes and interventions in a consistent way. When something unexpected happens, we can start to think about the implications of that event. We can start to see that a mopped floor represents a slipping hazard and an escalated marital dispute threatens a night on the couch. We can also start to see that putting a hazard sign out is a legally responsible thing to do and going out for a walk before exploding into the next argument might make more sense. We can actually find support for the decisions we want to be making with these causal graphs.
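The difference between seeing and doing can be sketched with the mopped floor. Everything here is an assumption of mine for illustration: the graph (mopping makes the floor wet and also prompts someone to put out a sign; the sign makes a wet floor less slippery in practice), the numbers, and the function names. An intervention is modeled by cutting the link from mopping to the sign and forcing the sign's value.

```python
from itertools import product

P_MOPPED = 0.2
P_SIGN_GIVEN_MOPPED = {True: 0.8, False: 0.05}
P_WET_GIVEN_MOPPED = {True: 0.9, False: 0.1}


def p_slip(wet, sign):
    """P(slip | wet, sign): the sign only helps when the floor is wet."""
    if not wet:
        return 0.02
    return 0.1 if sign else 0.3


def bern(p, value):
    return p if value else 1.0 - p


def joint(mopped, sign, wet, slip, do_sign=None):
    """Joint probability; with do_sign set, the sign ignores its usual cause."""
    if do_sign is None:
        p_sign = bern(P_SIGN_GIVEN_MOPPED[mopped], sign)
    else:
        p_sign = 1.0 if sign == do_sign else 0.0
    return (bern(P_MOPPED, mopped) * p_sign
            * bern(P_WET_GIVEN_MOPPED[mopped], wet)
            * bern(p_slip(wet, sign), slip))


def prob_slip(sign, do_sign=None):
    """P(slip | sign) when observing, or P(slip | do(sign)) when intervening."""
    numerator = denominator = 0.0
    for mopped, wet, slip in product((True, False), repeat=3):
        p = joint(mopped, sign, wet, slip, do_sign)
        denominator += p
        if slip:
            numerator += p
    return numerator / denominator


seeing_sign = prob_slip(sign=True)
seeing_no_sign = prob_slip(sign=False)
putting_out_sign = prob_slip(sign=True, do_sign=True)
leaving_sign_off = prob_slip(sign=False, do_sign=False)
print(seeing_sign > seeing_no_sign)          # True: a sign hints at a mopped floor
print(putting_out_sign < leaving_sign_off)   # True: putting one out reduces slips
```

With these made-up numbers, merely seeing a sign makes a slip look more likely, while deliberately putting one out makes it less likely, which is the kind of distinction a causal network is meant to keep straight.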
A lot of these ideas are really basic. I've been around them for quite a while. I don't think I've simplified them to the point that I can explain them to a 6th grader, let alone a 6-year-old. Einstein would say that I don't understand them thoroughly enough yet. At least it may be a step towards having a mind for Bayesian statistics.