Monday, December 3, 2012

My Unhealthy Obsession with Message-Passing

This post is a follow-up to Beliefs, in which I outlined some of what the machine learning community knows about how beliefs should interact with one another.

One assumption which I didn't spend much time on in that post was the assumption that a "belief" should be managed by local message-passing as much as possible. This assumption has been strong in the probabilistic graphical models community from the beginning, and I have taken it as a given. In order to get efficient algorithms, it seems like a good idea to do everything through local processes. Also, message-passing simply fits my intuitions about how humans reason. Even Markov networks (as opposed to Bayesian networks) feel insufficiently local, because the probabilistic semantics is distributed: you need to know all the factors before you can determine probabilities. For me, the more local, the better. The popular Monte Carlo methods strike me as "not local enough": although we keep making changes locally, we make them to a complete global state, and keep a running tally over time, which seems infeasible for extremely large networks representing the quantity of knowledge a human has. I'll question this assumption now.

In the previous post, I looked at some of the local message-passing algorithms I'm interested in. There is more interesting material which I've now understood and would like to summarize, but for the moment I'll cover just one more thing to serve my point.

I mentioned that we can make a distinction between techniques such as Mean Field, which seek to minimize KL(Q||P), and those such as Belief Propagation, which use KL(P||Q). Both of these measure the divergence between the true distribution P and our approximation Q, but KL(P||Q) intuitively makes more sense (it directly measures the error due to approximation), whereas KL(Q||P) has some computational advantages.
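
For reference, the two divergences are (standard definitions, with P the true distribution and Q the approximation):

```latex
\mathrm{KL}(Q \,\|\, P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)},
\qquad
\mathrm{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}.
```

The computational advantage of KL(Q||P) is that the expectation is taken under Q, which we have chosen to be tractable; KL(P||Q) requires expectations under P, which is exactly the thing we cannot compute.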

It turns out that both belief propagation and mean field can be fit into the framework of minimizing KL(Q||P). Even though this isn't my error function of choice, it's good to be able to explain both approximations within one framework; it allows us to definitively explain how to pass messages between the two, in the case that we want to use both in one network.

The basic idea is that Mean Field (MF) can be justified in a straightforward way by minimizing KL(Q||P) when we assume Q contains a separate factor for every variable. Belief Prop (BP), on the other hand, can be justified by minimization of the Bethe energy. The Bethe energy can be derived from KL(Q||P) when Q is of a peculiar form containing a factor for every relationship between variables, but also a negative factor for every variable. The negative factors divide out the effect of the overabundant positive factors, and are constrained to be the marginals of the positive factors. This gets a little confusing (I plan a full post for later), but it provides greater coupling between variables, which we expect to give us a better approximation. (It does in most cases.) Unfortunately, it ruins the convergence guarantee which MF had.
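
To make this concrete, here are the two forms of Q (standard presentations, for the pairwise case; d_i denotes the number of relationships variable i participates in):

```latex
Q_{\mathrm{MF}}(x) = \prod_i q_i(x_i),
\qquad
Q_{\mathrm{Bethe}}(x) = \frac{\prod_{(i,j)} q_{ij}(x_i, x_j)}{\prod_i q_i(x_i)^{\,d_i - 1}},
\quad \text{subject to } \sum_{x_j} q_{ij}(x_i, x_j) = q_i(x_i).
```

The denominator holds the "negative" factors: each variable appears in d_i of the pairwise factors, so dividing by q_i raised to the power (d_i - 1) cancels the overcounting.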

The framework I mentioned allows us to deal with a Q function which mixes these two forms. This is good, because MF and BP have complementary strengths. MF often has a simple closed form in cases where BP does not, or where BP is too expensive. MF is guaranteed to converge, while BP is not. However, BP typically gives better results when it does converge: it is exact on tree-shaped belief networks, while MF is not, and it handles 0s in distributions, which MF cannot.
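
To illustrate the MF side, here is a minimal sketch of the mean-field updates for a pairwise binary network (a hypothetical toy model, not any particular system: unary log-potentials theta, symmetric pairwise couplings W with zero diagonal, and Q a product of independent Bernoulli factors). Each coordinate update can only decrease KL(Q||P), which is where the convergence guarantee comes from:

```python
import numpy as np

# A minimal mean-field sketch for a pairwise binary network (hypothetical
# toy example). The model is P(x) proportional to
#   exp( sum_i theta[i]*x[i] + sum_{i<j} W[i, j]*x[i]*x[j] ),
# with x[i] in {0, 1}, W symmetric with zero diagonal. Q is a product of
# independent Bernoulli factors q[i] = Q(x[i] = 1).

def mean_field(theta, W, iters=100):
    n = len(theta)
    q = np.full(n, 0.5)                      # start from uniform factors
    for _ in range(iters):
        for i in range(n):
            # expected input from neighbors under the current Q
            s = theta[i] + W[i] @ q
            q[i] = 1.0 / (1.0 + np.exp(-s))  # closed-form coordinate update
    return q                                 # approximate marginals Q(x[i]=1)

# tiny usage example: three variables coupled in a chain
theta = np.array([0.5, 0.0, -0.5])
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(mean_field(theta, W))
```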

Because BP is not guaranteed to converge, it makes some sense to argue that it is not suitable for AGI-scale systems of beliefs. There will almost certainly be some portion of the network which does not converge, even if non-convergent behavior is rare overall, and the problem may be likely to spread. Therefore, it could make sense to have small, "safe" BP networks acting as pieces in a larger MF network, to make global convergence easier. If a portion of the network was causing trouble, more of it could be converted to MF.

If I'm not mistaken, this same idea could also accommodate Kikuchi free energy, which gives "generalized belief propagation" (GBP). This goes further than BP, including even more interactions in the approximation. Hence we have a flexible approximation framework, which can handle approximations based on a wide variety of forms for Q. The structure of Q can be adapted to the situation, addressing our needs as well as any local convergence problems.

This might sound like a rather nice situation. But... is it?

We're putting an awful lot of energy into the approximation, here. Why are the belief networks so difficult to control? The structure-learning system has to find these networks in the first place. If we can't even do inference for a hypothesis, how can we choose it? Shouldn't the system choose hypotheses for which inference is manageable?

This is exactly the approach taken by Sum-Product Networks. Graphical models normally give us exponential-time inference in the worst case, but sum-product networks are linear-time instead. We could get a similar advantage by using standard types of networks, like BNs or MNs, and requiring that the network take the form of a tree (for which exact inference is inexpensive). We get guaranteed linear-time inference, no approximations, but we have not severely restricted the kinds of probability distributions which can be expressed: we can add more hidden variables to express more complicated distributions. (If we need to add exponentially many nodes to do what we want, though, then obviously we're getting back into trouble.) SPNs do the same thing, but are an interesting alternative to tree networks, in that they allow us to adjust the probability formula at a lower level; we are literally composing an expression of sums and products to express the distribution (an "arithmetic circuit"). This is more annoying for a human to look at, but at the same time, gives us more freedom (allows us to compactly represent a larger number of functions) without compromising the linear-time inference.
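
Here is a minimal sketch of what linear-time SPN evaluation looks like (hypothetical node classes, not any particular library's API). Setting all indicators of an unobserved variable to 1 sums it out, so the same bottom-up pass computes both joints and marginals:

```python
# A minimal sketch of bottom-up SPN evaluation (hypothetical node classes).
# The network below is a tree, so plain recursion visits each node once; a
# DAG with shared children would memoize node values to keep the pass
# linear in the number of edges.

class Leaf:
    """Indicator for one value of one variable."""
    def __init__(self, var, value):
        self.var, self.value = var, value

    def eval(self, evidence):
        # 1 if the evidence matches, or if the variable is unobserved
        # (which is what sums an unobserved variable out), else 0.
        return 1.0 if evidence.get(self.var, self.value) == self.value else 0.0

class Sum:
    """Weighted mixture of children (weights sum to 1)."""
    def __init__(self, children, weights):
        self.children, self.weights = children, weights

    def eval(self, evidence):
        return sum(w * c.eval(evidence)
                   for w, c in zip(self.weights, self.children))

class Product:
    """Product of children over disjoint sets of variables."""
    def __init__(self, children):
        self.children = children

    def eval(self, evidence):
        result = 1.0
        for c in self.children:
            result *= c.eval(evidence)
        return result

# toy distribution over binary variables A and B:
# a two-component mixture of fully factored distributions
def bernoulli(var, p):
    return Sum([Leaf(var, 0), Leaf(var, 1)], [1.0 - p, p])

spn = Sum([Product([bernoulli('A', 0.1), bernoulli('B', 0.2)]),
           Product([bernoulli('A', 0.8), bernoulli('B', 0.7)])],
          [0.6, 0.4])

print(spn.eval({'A': 1, 'B': 0}))  # joint P(A=1, B=0)
print(spn.eval({'A': 1}))          # marginal P(A=1); B summed out
```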

SPNs are very similar to artificial neural networks, particularly feedforward networks, and give us linear-time inference for the same reason: we're directly learning a mathematical expression to give the desired result, not trying to represent some inference problem to be solved. One way of explaining it is: for any approximation method, we end up performing some sequence of arithmetic operations to arrive at an answer. SPNs jump straight to learning a good sequence of arithmetic operations, as our model, rather than learning a potentially difficult model and then trying to find a nice approximation.

I said I would deal with my assumption of local approximation. Going back to the P vs Q idea, a "belief" is a local function; a factor of Q. The question is how to coordinate a number of local functions, in order to get a good approximation of a probability function P. SPNs don't need any of this. Local beliefs give us a very messy belief approximation process. SPNs give us a global model with exact inference.

Why do I call an SPN "global"? It might be local enough to satisfy some. The linear-time inference algorithm iterates over the network via the links. That's local, isn't it? Well, sort of. You can't learn SPNs locally and then compose them together, though. The parts of a sum-product network do not have inherent meaning outside of the network. In a Bayesian network, a conditional probability is a conditional probability; we could evaluate its accuracy apart from everything else, if we could access data for those variables.

SPNs aren't a full AGI system in themselves, but they suggest an image of an AGI system which learns very fast belief-formation functions, which are simply evaluated to give probabilities at a particular moment. This is a much more efficient picture than my previous one: a large network of approximate beliefs doing local message-passing to minimize the KL divergence between Q and the desired P. Local approximation by message-passing is a faster approximation than Monte Carlo methods, but if we can create networks which we can evaluate precisely and quickly, why worry about approximation at all?

On the other hand, I think we really want some FOPI-style behavior (First Order Probabilistic Inference, one approach to integrating logic and probability). The structure of beliefs should include first-order logic as a special case, which means we cannot avoid including problems which are potentially intractable. (That is, real-life inference problems are sometimes intractable.)

Indeed, it is difficult to see how an SPN-like framework can give us persistent beliefs with logical structure. An SPN is composed of 'sum' nodes and 'product' nodes. We can think of the 'sum' nodes as representing 'or', and the 'product' nodes as representing 'and'. So, we can think of 'product' nodes as representing an object: a combination of attributes. 'Sum' nodes represent alternative ways of achieving attributes. However, this is still a very low-level representation! There is no way to represent multiple instances of one object, no way to represent classes of objects with a type/subtype relation... we can't reason about familial relations, for example, trying to decide whether two people are cousins given some other information about their families.

Side note: one possible objection is that belief networks are a representation for the low-level dynamics of an artificial mind, and do not need to represent the high-level stuff like logic. This should emerge as a learned skill of a large network which has access to mental workspaces in which it can implement more complex, longer-term mental processes. This approach may be workable. However, my intuition is that there is something to gain from explicitly representing logically structured knowledge.

Back to the first hand, we can argue that FOPI itself requires a global model in order to start. This is an interesting and important point: local inference rules such as belief propagation and FOPI require a fixed network structure to begin with. (More technically, we need to know what the Markov blankets of our variables are. This is similar to learning a background model of causality: we can't just operate on correlations between variables. Local probability tables are not enough. We also need to know the structure, which tells us which probability tables are relevant to a given situation.)
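
To be concrete about the property we need: the Markov blanket of a variable X_i is a set MB(X_i) such that

```latex
P(X_i \mid X_{\setminus i}) = P(X_i \mid \mathrm{MB}(X_i)),
```

where X_{\setminus i} denotes all the other variables. In a Bayesian network, this is the variable's parents, children, and its children's other parents; the point is that we can't read it off without already knowing the structure.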

It would be nice if we could also reason about structure locally, learning which variables are related in an incremental, bottom-up manner. Perhaps this is possible. However, it seems especially difficult with FOPI-like inference: logical models do not behave in a very local way. It seems very "human" to have a large set of potentially-inconsistent logical statements which we are constantly trying to integrate in an approximate way; this again sounds like a Q vs P iterative reasoning scheme. But, again, won't this be terribly messy and inefficient? Won't we end up approximating things by some simple sequence of operations? Wouldn't it be nicer to just learn a good sequence of operations, in the first place, which just gives us the approximate answer?

I don't intend to answer these questions here. I think there's an interesting tension between these "local" messy message-passing approximation algorithms, based on semantically meaningful local beliefs, and "global" methods which can potentially give us fast, reliable answers based on a fixed calculation.