## Wednesday, August 29, 2012

### Beliefs

I tend to think that the AI problem is split into two basic problems:

1. What is the space of propositions which we have beliefs about?
2. What are the rules by which beliefs interact?

Bayesianism says that once we answer (1), we simply define a prior (and possibly a utility function) over the space, and answer (2) by a Bayesian update on any evidence we observe, followed by a marginalization to find the prediction (or a maximization to find the highest-utility action, if we specified a utility function).
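
To make that recipe concrete, here is a minimal sketch on a toy problem (all the numbers, and the rain/sprinkler setup, are made up for illustration): build a joint from a prior and a likelihood, condition on evidence, and marginalize out the nuisance variable.

```python
# Toy Bayesian update: two binary causes (rain, sprinkler) and one
# observed effect (wet grass). All probabilities are invented.
from itertools import product

p_rain = {0: 0.8, 1: 0.2}
p_sprinkler = {0: 0.6, 1: 0.4}
p_wet = {(0, 0): 0.05, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.98}  # P(wet|r,s)

# Joint prior over all three variables.
joint = {}
for r, s in product([0, 1], repeat=2):
    joint[(r, s, 1)] = p_rain[r] * p_sprinkler[s] * p_wet[(r, s)]
    joint[(r, s, 0)] = p_rain[r] * p_sprinkler[s] * (1 - p_wet[(r, s)])

# Bayesian update: keep only the states consistent with wet = 1 ...
evidence = {k: v for k, v in joint.items() if k[2] == 1}
z = sum(evidence.values())

# ... then marginalize out the sprinkler to get P(rain | wet = 1).
p_rain_given_wet = {
    r: sum(v for (r2, s, w), v in evidence.items() if r2 == r) / z
    for r in [0, 1]
}
print(p_rain_given_wet)
```

Nothing here scales, of course; the point is only that "prior, update, marginalize" is a complete (if brute-force) answer once the space of propositions is fixed.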

This is impractical for large problems, which brings in an interesting question without even giving up Bayesianism: what approximations work well in practice? I personally prefer methods which deal with beliefs in a local way. Factor graphs are my favorite formalism for thinking about this.

The basic algorithm for factor graphs is the summary-product algorithm, which is a generalisation of belief propagation and several other important algorithms. Generalising several algorithms like this is fairly exciting, and gives summary-product an important status among the possible mechanisms for belief interaction. If I had to state the idea of summary-product in one sentence, it would be:

If you want to perform a global operation on a multivariate function which can be represented as a product of several factors, you can often treat the distributive law as approximately true in order to transform it to a local message-passing algorithm which often runs in linear time.
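
The distributive-law trick is easiest to see on a tiny chain, where it is exact rather than approximate. Below is a sketch (factor tables are made up): the brute-force marginal sums the full product over all variable settings, while message passing pushes each sum inside the product, one factor at a time.

```python
# Distributive law on a chain x1 - x2 - x3 with two binary factors.
import numpy as np

f = np.array([[1.0, 2.0], [0.5, 1.5]])   # factor f(x1, x2)
g = np.array([[0.3, 0.7], [0.9, 0.1]])   # factor g(x2, x3)

# Brute force: sum f(x1,x2)*g(x2,x3) over x1 and x2, for each x3.
brute = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            brute[x3] += f[x1, x2] * g[x2, x3]

# Message passing: sum out x1 first (a "message" into x2), then x2.
m12 = f.sum(axis=0)   # message: sum_{x1} f(x1, x2)
local = m12 @ g       # sum_{x2} m12(x2) * g(x2, x3)

assert np.allclose(brute, local)
```

On a chain of n variables the brute-force sum is exponential in n, while the message-passing version is linear; on loopy graphs the same local updates are only an approximation.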

Of course, this is hopelessly vague; you have to look at the math. In any case, this is typically the "default" approximation for factor-graph problems. Unfortunately, the procedure doesn't come with many theoretical guarantees: we know that it works well in practice, but it does not always converge to the right value, or converge at all. Not very much is known about when or why.

In cases where this does not work well, we can turn to variational methods.

The core variational technique, often referred to as "variational Bayes" (but more specifically known as "mean-field"), seeks to minimise the KL-divergence in the form KL(Q||P), where P is the exact distribution and Q is our approximation. Minimising KL-divergence typically involves taking expectations, and in this case it means taking the expected log-probability. (I won't try to go through the derivation here, but it is quite interesting.) This has a surprising effect: by finding the expected log-probability rather than the expected probability, we replace multiplication with addition in some of the mathematics. This is the kind of crazy simplification that you would not expect to work. Yet, it does work, quite well: each coordinate update is guaranteed not to increase the KL divergence, so the algorithm converges (though in general only to a local minimum, not the global one). The mean-field approach can be understood as a generalisation of the popular EM algorithm.
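
Here is a minimal mean-field sketch (all numbers made up): we approximate a two-variable joint P(x, y) by a factorised Q(x)Q(y), and each coordinate update exponentiates an *expected log-probability* — which is exactly where multiplication turns into addition.

```python
# Mean-field coordinate updates for a tiny 2x2 joint distribution.
import numpy as np

P = np.array([[0.30, 0.10],
              [0.05, 0.55]])            # joint P(x, y); rows = x, cols = y
logP = np.log(P)

qx = np.array([0.5, 0.5])               # factorised approximation Q(x)Q(y)
qy = np.array([0.5, 0.5])

def kl(qx, qy):
    """KL(Q || P) for the factorised Q."""
    Q = np.outer(qx, qy)
    return float((Q * (np.log(Q) - logP)).sum())

kl0 = kl(qx, qy)                        # KL at the uniform initialisation

for _ in range(50):
    # q(x) ∝ exp(E_{q(y)}[log P(x, y)]) -- the expectation is over *logs*.
    qx = np.exp(logP @ qy); qx /= qx.sum()
    qy = np.exp(qx @ logP); qy /= qy.sum()

print(kl(qx, qy))
```

Each update provably does not increase KL(Q||P), so the loop settles down; with a multi-modal P it can settle on a local minimum rather than the global one.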

A second variational method is known as expectation-propagation. Here, we minimise KL(P||Q) rather than KL(Q||P). This actually makes more sense from a mathematical perspective: KL(A||B) measures the amount of unnecessary prediction error from using probability distribution B, if observations are really being drawn from distribution A. Expectation-propagation seems to have it right, minimising the amount of prediction error due to approximation. It is unclear why we might want to take the mean-field approach instead, except that it may be computationally convenient.

Again, minimising KL-divergence often leads us to take expectations. In this case, rather than finding ourselves in the strange land of log-probabilities, we end up taking expectations over probabilities in a more normal way. I won't try to give all the details, but it turns out that in specific cases, this approximation ends up being exactly the same as belief propagation! This happy coincidence again gives credibility to the generality of the summary-product algorithm. At the same time, it gives us a different and useful generalisation of belief propagation.
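
A small numerical check of the "expectations over probabilities" point (with a made-up P, verified by crude grid search rather than any real EP machinery): for a factorised approximation, the minimiser of KL(P||Q) is simply the product of P's true marginals — minimising this direction of the KL divergence amounts to moment matching.

```python
# Minimising KL(P || Qx * Qy) over factorised Q recovers P's marginals.
import numpy as np

P = np.array([[0.30, 0.10],
              [0.05, 0.55]])           # joint P(x, y); marginals 0.40, 0.35

def kl_pq(qx0, qy0):
    """KL(P || Q) where Q(x)Q(y) is parameterised by Q(x=0), Q(y=0)."""
    Q = np.outer([qx0, 1 - qx0], [qy0, 1 - qy0])
    return float((P * (np.log(P) - np.log(Q))).sum())

# Brute-force search over a fine grid of factorised approximations.
grid = np.linspace(0.01, 0.99, 99)
best = min((kl_pq(a, b), a, b) for a in grid for b in grid)
print(best)   # minimiser should sit at P(x=0)=0.40, P(y=0)=0.35
```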

This also gives us a nice theoretical tool for analysing the accuracy and convergence of belief propagation. Like summary-product, expectation propagation can fail to converge. However, expectation propagation comes with a fairly nice fixed-point analysis which helps us to understand this process (although it doesn't give all the answers).

There is a much wider variety of variational methods, but I don't know much about them yet. :)

Other, less Bayesian techniques present themselves. In particular, while variational methods can be used to learn the factors themselves, a simpler approximation is to use gradient descent. As in neural network back-propagation, we take the derivative of an objective function with respect to the parameters in order to get a direction in which we should move our beliefs. This does not directly represent any uncertainty about the parameters.
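
A bare-bones sketch of this point-estimate alternative (data and step size made up): fit a single Bernoulli parameter, via its log-odds, by following the log-likelihood gradient. Note that nothing here represents uncertainty about the parameter — we carry one number, not a distribution.

```python
# Gradient ascent on the log-likelihood of a Bernoulli parameter.
import math

data = [1, 1, 0, 1, 1, 0, 1, 1]    # 6 ones out of 8 observations
w = 0.0                            # log-odds parameter, a point estimate

for _ in range(500):
    p = 1 / (1 + math.exp(-w))               # current probability estimate
    grad = sum(x - p for x in data)          # d(log-likelihood)/dw
    w += 0.1 * grad / len(data)              # small step up the gradient

p = 1 / (1 + math.exp(-w))
print(round(p, 3))   # approaches the maximum-likelihood value 6/8 = 0.75
```

Back-propagation is the same move applied through a composition of many such parameterised functions, with the gradient computed by the chain rule.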

The main problem with this approach is that the derivatives will tend to get very small if we have deep networks with large numbers of hidden variables. Overcoming this problem is the topic of research in deep belief networks. (Interestingly, this formalism bridges the gap between neural network research and Bayes-style graphical model research. It is not the only approach to do so. I have some hope that this distinction will slowly disappear, although there will always be room for some differentiation...) Studying what sort of local belief interactions work best here will be an important part of the puzzle, to find good general rules for the interaction of beliefs.

Another source of inspiration is the PAQ compressor by Matt Mahoney. It is currently the best compressor there is. Information theory tells us that there is a close relationship between probability and compression; so, if a specific sort of computation tends to be the best for compression, it is likely to be the best for approximate probabilistic reasoning, as well. (In fact, PAQ is founded on this principle.) PAQ is by now a fairly complicated mechanism with many special optimisations which I don't claim to understand, but I do understand some of the principles behind it:
1. How should we combine several conditional probability distributions which give different beliefs for some proposition? This is known as the "combination problem" in various circles (it shows up in PLN, NARS, and BLP). It's like asking how to make use of a mangled Bayesian network which has been put together from several conflicting models. PAQ says to combine the different predictions logarithmically (a bit like mean-field), with different weights on each, trained via gradient descent to find the combination of weights which minimises prediction error.
2. In addition, we can use indirect context models to learn when specific models are appropriate.
3. Favor recent history when counting probabilities. This is an idea which also appears in reinforcement learning; the term which justifies the practice is "non-stationary models" (the idea that the probability distribution may change over time). I should mention that I think reinforcement learning also has other lessons about belief structure and useful propagation rules.
Many explanations are in Matt's book in progress (which I have not yet read).
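
A hypothetical sketch of the logistic mixing in point (1) — this is my own toy reconstruction, not Matt's actual code, and every constant in it is an assumption: each model's prediction is mapped into the log-odds ("stretched") domain, combined with weights, squashed back to a probability, and the weights are nudged by gradient descent on the log loss after each bit.

```python
# Toy PAQ-style logistic mixing of three bit predictors.
import math
import random

def stretch(p):
    return math.log(p / (1 - p))       # probability -> log-odds

def squash(x):
    return 1 / (1 + math.exp(-x))      # log-odds -> probability

weights = [0.0, 0.0, 0.0]
LR = 0.05                              # learning rate (arbitrary choice)

def mix_and_learn(preds, bit):
    """preds: each model's P(next bit = 1); bit: the bit that then arrives."""
    xs = [stretch(p) for p in preds]
    p = squash(sum(w * x for w, x in zip(weights, xs)))
    # Gradient step on the log (coding) loss: models that were confidently
    # right gain weight, models that were confidently wrong lose it.
    for i, x in enumerate(xs):
        weights[i] += LR * (bit - p) * x
    return p

random.seed(0)
for _ in range(2000):
    bit = int(random.random() < 0.5)
    informed = 0.9 if bit else 0.1     # stand-in for a model tracking the data
    mix_and_learn([informed, 0.6, 0.4], bit)   # models 1-2 are uninformative

print([round(w, 2) for w in weights])
```

After training, the weight on the informative model dominates the two uninformative ones, which is the whole point: the mixture learns which models to trust rather than combining them by any fixed logical rule.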

There is much more to say, but I suppose this post has gotten long enough. Thoughts?

1. "PAQ says to combine the different predictions logarithmically" reminds me of how Watson uses logistic regression over scorers (IIRC more than a hundred independent modules pursuing various non-stochastic "answer fishing"). The "unreasonable" efficacy of linear models. Watson uses several sets of weights, but for fixed contexts (e.g. kinds of questions, or whether it's seeking the final answer or subgoal...)

Sorry for an off-topic comment ;-)

2. Have you actually tried to apply these ideas to an interesting system of ideas (or propositions, or whatever)? I already know that you are concentrating on the more abstract mathematical applications (of some simple systems) rather than an actual test of the ideas. So even though you can read, for example, that Matt's PAQ is only an approximation that works with some cases, you have never tried to apply these methods to some more interesting cases.

If you did, you would find that the fundamental concept is deeply flawed. In order to create an AGI type of analysis (and action) you have to use the probabilities in potentially illogical combinations. Like neural networks, it will work well as long as you cherry pick the examples, and it won't work at all when you try to take it to the next level.

3. Jim,

Thanks for commenting!

I do think that testing "at the next level" is necessary to see if the techniques actually generalize, especially in cases like PAQ where the techniques have little to no theoretical backing.

That being said, we can make some plausible speculations and give reasons why a specific technique will or will not work-- as you have done. But, your comment,

"you have to use the probabilities in potentially illogical combinations"

seems to suggest doing something like what PAQ does: combine probabilities according to what works (as trained by gradient descent), rather than by sound logic. As such, I'm not sure how to evaluate the proposal. What do you suggest, beyond PAQ?