## Friday, July 18, 2014

### Losing Faith in Factor Graphs

In my post Beliefs, I split AGI into two broad problems:

1. What is the space of possible beliefs?
2. How do beliefs interact?

The idea behind #1 was: can we formulate a knowledge representation which could in principle express any concept a human can conceive of?

#2 represents the more practical concern: how do we implement inference over these beliefs?

(This was needlessly narrow. I could at least have included something like: 3. How do beliefs conspire to produce actions? That's more like what my posts on procedural logic discuss. 1 and 2 alone don't exactly get you AGI. Nonetheless, these are central questions for me.)

Factor graphs are a fairly general representation for networks of probabilistic beliefs, subsuming Bayesian networks, Markov networks, and most other graphical models. Like those two, factor graphs are only as powerful as propositional logic, but can be a useful tool for representing more powerful belief structures as well. In other words, the factor graph "toolbox" includes some useful solutions to #2 which we may be able to apply to whatever solution for #1 we come up with. When I started graduate school at USC 3 years ago, I was basically on board with this direction. That's the direction taken by Sigma, the cognitive architecture effort I joined.

I've had several shifts of opinion since that time.

### I. Learning is Inference

My first major shift (as I recall) was to give up on the idea of a uniform inference technique for both learning models and applying them ("induction" vs "deduction"). In principle, Bayes' Law tells us that learning is just probabilistic inference. In practice, unless you're using Monte Carlo methods, implementations tend to have quite different algorithms for the two cases. There was a time when I hoped that parameter learning could be implemented via belief propagation in factor graphs, properly understood: we just need to find the appropriate way to structure the factor graph such that parameters are explicitly represented as variables we reason about.

It's technically possible, but not really such a good idea. One reason is that we don't usually care about the space of possible parameter settings, so long as we find one combination which predicts data well. There is no need to keep the spread of possibilities except so far as it facilitates making good predictions. On the other hand, we do usually care about maintaining a spread of possible values for the latent variables within the model itself, precisely because this does tend to help. (The distinction can be ambiguous! In principle there might not be a clean line between latent variables, model parameters, and even the so-called hyperparameters. In practice, the distinction is quite clear, though, and that's the point here.)

Instead, I ended up working on a perfectly normal gradient-descent learning algorithm for Sigma. Either you're already familiar with this term, or you've got a lot to learn about machine learning; so, I won't try to go into details. The main point is that this is a totally non-probabilistic method, which only believes a single parameter setting at any given time, but attempts to adjust it in response to data.
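As a rough illustration of the contrast between keeping a spread over parameters and believing a single setting, here is a minimal sketch using a coin-flip model. The model, the function names `grid_posterior` and `gradient_point_estimate`, and the step size are all assumptions chosen for illustration; this is not Sigma's learning algorithm.

```python
import math

def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias on a uniform grid, starting from a uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalized posterior = constant prior * likelihood of the observed flips.
    weights = [t ** heads * (1 - t) ** tails for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]

def gradient_point_estimate(heads, tails, steps=500, lr=1.0):
    """A single parameter setting found by gradient ascent on the mean log-likelihood."""
    logit = 0.0  # the bias is sigmoid(logit), which keeps it inside (0, 1)
    n = heads + tails
    for _ in range(steps):
        theta = 1.0 / (1.0 + math.exp(-logit))
        # Derivative of [heads*log(theta) + tails*log(1 - theta)] / n w.r.t. logit.
        grad = (heads - n * theta) / n
        logit += lr * grad
    return 1.0 / (1.0 + math.exp(-logit))

thetas, posterior = grid_posterior(heads=7, tails=3)
point = gradient_point_estimate(heads=7, tails=3)
mode = thetas[posterior.index(max(posterior))]
print(f"posterior mode: {mode:.2f}, gradient point estimate: {point:.3f}")
```

The first function carries a whole distribution over the parameter; the second carries one number and nudges it. Both end up favoring the same bias here, which is the sense in which the spread is often unnecessary for prediction.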

The next big shift in my thinking has been less easily summarized, and has been taking place over a longer period of time. Last year, I wrote a post attempting to think about it from a perspective of "local" vs "global" methods. At that time, my skepticism about whether factor graphs provide a good foundation to start with was already strong, but I don't think I articulated it very well.

### II. Inference is Approximate

Inference in factor graphs is exponential time in the general case, which means that (unless the factor graph has a simple form) we need to use faster approximate inference.
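To make the cost concrete, here is a minimal sketch of exact inference by brute-force enumeration over binary variables. The function `brute_force_marginal`, the `agree` factor, and the ten-variable chain are hypothetical names and values chosen for illustration. A chain is actually one of the simple forms where faster exact methods exist; the enumeration below is the general-case fallback, and its loop runs once per joint assignment.

```python
from itertools import product

def brute_force_marginal(num_vars, factors, query_var):
    """Exact marginal of one binary variable by summing over every joint assignment.

    `factors` is a list of (variable_indices, function) pairs; the unnormalized
    joint is the product of all factor values.
    """
    totals = [0.0, 0.0]
    evaluated = 0
    for assignment in product([0, 1], repeat=num_vars):
        weight = 1.0
        for scope, fn in factors:
            weight *= fn(*(assignment[i] for i in scope))
        totals[assignment[query_var]] += weight
        evaluated += 1
    z = sum(totals)  # normalizing constant
    return [t / z for t in totals], evaluated

def agree(a, b):
    """Soft constraint: neighboring variables prefer to take the same value."""
    return 2.0 if a == b else 1.0

# Ten variables in a chain, each linked to its neighbor by the same factor.
chain = [((i, i + 1), agree) for i in range(9)]
marginal, evaluated = brute_force_marginal(10, chain, query_var=0)
print(marginal, evaluated)
```

Adding one more variable doubles the number of assignments visited, which is the exponential blow-up in question.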

This makes perfect sense if we assume that "inference" is a catch-all term referring to the way beliefs move around in the space of possible beliefs. It should be impossible to do exact inference on the whole belief space.

If we concede that "inference" is distinct from "learning", though, we have a different situation. What's the point in learning a model that is so difficult to apply? If we are going to go ahead and approximate it with a simpler function, doesn't it make sense to learn the simpler function in the first place?

This is part of the philosophy behind sum-product networks (SPNs). Like neural networks, SPN inference is always linear time: you basically just have to propagate function values. Unlike neural networks, everything is totally probabilistic. This may be important for several reasons; it means we can always do reasoning on partial data (filling in the missing parts using the probabilistic model), and reason "in any direction" thanks to Bayes' Law (where neural networks tend to define a one-direction input-output relationship, making reasoning in the reverse direction difficult).

Why are SPNs so efficient?

### III. Models are Factored Representations

The fundamental idea behind factor graphs is that we can build up complicated probability distributions by multiplying together simple pieces. Multiplication acts like a conjunction operation, putting probabilistic knowledge together. Suppose that we know a joint distribution connecting X with Y: P(X,Y). Suppose that we also know a conditional distribution, defining a probability on Z for any given value of X: P(Z|X). If we assume that Z and Y are independent given X, we can obtain the joint distribution across all three variables by multiplying the two probability functions together: P(X,Y,Z) = P(X,Y)P(Z|X).
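The multiplication P(X,Y,Z) = P(X,Y)P(Z|X) can be checked on a small example. This is a sketch with made-up probability tables for three binary variables; the numbers and the names `p_xy` and `p_z_given_x` are assumptions for illustration only.

```python
from itertools import product

# Hypothetical joint distribution P(X, Y), keyed by (x, y).
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Hypothetical conditional distribution P(Z | X), keyed by x and then z.
p_z_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}

# Multiply the two pieces pointwise to get P(X, Y, Z).
joint = {
    (x, y, z): p_xy[(x, y)] * p_z_given_x[x][z]
    for x, y, z in product([0, 1], repeat=3)
}

# Summing Z back out should recover the original P(X, Y).
recovered_xy = {
    (x, y): sum(joint[(x, y, z)] for z in (0, 1))
    for x, y in product([0, 1], repeat=2)
}

print(sum(joint.values()))
print(recovered_xy)
```

The product is a valid distribution because each row of P(Z|X) sums to one, so multiplying by it spreads each P(x,y) entry across the values of Z without changing the total.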

A different way of looking at this is that it allows us to create probabilistic constraints connecting variables. A factor graph is essentially a network of these soft constraints. We would expect this to be a powerful representation, because we already know that constraints are a powerful representation for non-probabilistic systems. We would also expect inference to be very difficult, though, because solving systems of constraints is hard.

SPNs allow multiplication of distributions, but only when it does not introduce dependencies between distributions which must be "solved" in any way. To oversimplify just a smidge, we can multiply just in the case of total independence: P(X)P(Y). We are not allowed to use P(X|Y)P(Y), because if we start allowing those kinds of cross-distribution dependencies, things get tangled and inference becomes hard.

To supplement this reduced ability to multiply, we add in the ability to add. We compose complicated distributions as a series of sums and products of simpler distributions. (Hence the name.) Despite the simplifying requirements on the products, this turns out to be a rather powerful representation.

Just as we can think of a product as conjunctive, imposing a series of constraints on a system, we can think of a sum as being disjunctive, building up a probability distribution by enumeration of possibilities. This should bring to mind mixture models and clustering.
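Here is a minimal sketch of a sum-product network as a two-component mixture over two binary variables. The helper names (`leaf`, `product_node`, `sum_node`) and all the weights are hypothetical, chosen for illustration rather than taken from any SPN library. Each product multiplies leaves over different variables (total independence inside a component), and the sum enumerates the components.

```python
def leaf(var, p_true):
    """Bernoulli leaf over one variable; an unobserved variable is summed out."""
    def evaluate(evidence):
        value = evidence.get(var)
        if value is None:
            return 1.0  # p_true + (1 - p_true): the variable is marginalized
        return p_true if value else 1.0 - p_true
    return evaluate

def product_node(*children):
    """Product of children that cover disjoint sets of variables."""
    def evaluate(evidence):
        result = 1.0
        for child in children:
            result *= child(evidence)
        return result
    return evaluate

def sum_node(weighted_children):
    """Weighted sum (mixture) of children covering the same variables."""
    def evaluate(evidence):
        return sum(w * child(evidence) for w, child in weighted_children)
    return evaluate

spn = sum_node([
    (0.6, product_node(leaf("X", 0.9), leaf("Y", 0.8))),
    (0.4, product_node(leaf("X", 0.1), leaf("Y", 0.3))),
])

p_x = spn({"X": True})               # Y is missing, so it is summed out
p_xy = spn({"X": True, "Y": True})
print(p_x, p_xy, p_xy / p_x)         # the last value is P(Y=True | X=True)
```

Every query is a single bottom-up pass that touches each node once, including queries with missing variables. X and Y are independent within each component but dependent in the mixture as a whole, which is where the representational power comes from.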

Reasoning based on enumeration is faster because, in a sense, it's just the already-solved version of the constraint problem: you have to explicitly list the set of possibilities, as opposed to starting with all possibilities and listing constraints to narrow them down.

Yet, it's also more powerful in some cases. It turns out that SPNs are much better at representing probabilistic parsing than factor graphs are. Parsing is a technique which is essential in natural language processing, but it's also been used for other purposes; image parsing has been a thing for a long time, and I think it's a good way of trying to get a handle on a more intricately structured model of images and other data. A parse is elegantly represented via a sum of possibilities. It can be represented with constraints, and this approach has been successfully used. Those applications require special-purpose optimizations to avoid the exponential time inference associated with factor graphs and constraint networks, though.

The realization that factor graphs aren't very good at representing this is what really broke my resolve as far as factor graphs go. This indicated to me that factor graphs really were missing a critical representational capability; the ability to enumerate possibilities.

Grammar-like theories of general learning have been an old obsession of mine, which I had naively assumed could be handled well within the factor-graph world.

My new view suggested that inference and learning should both be a combination of sum-reasoning and product-reasoning. Sum-learning includes methods like clustering, boosting, and bagging: we learn enumerative models. Reasoning with these is quite fast. Product-learning splits reality into parts which can be modeled separately and then combined. These two learning steps interact. Through this process, we create inherently fast, grammar-like models of the world.

### IV. Everything is Probabilistic

Around the same time I was coming to these conclusions, the wider research community was starting to get excited about the new distributed representations. My shift in thinking was taking me toward grammars, so I was quite excited about Richard Socher's recursive neural network (RNN) representation. This work used one fairly simple algorithm for both language parsing and image parsing, producing state-of-the-art results along with a learned representation that could be quite useful for other tasks. RNNs have continued to produce impressive results moving forward; in fact, I would go so far as to say that they are producing results which look like precisely what we want to see out of a nascent AGI technology. These methods produce cross-domain structured generalizations powerful enough to classify previously-unseen objects based on knowledge obtained by reading, which (as I said in my previous post) seems quite encouraging. Many other intriguing results have been published as well.

Unfortunately, it's not clear how to fit RNNs in with more directly probabilistic models. The vector representations at the heart of RNNs could be re-conceived as restricted Boltzmann machines (RBMs) or another similar probabilistic model, giving a distributed representation with a fully probabilistic semantics (and taking advantage of the progress in deep belief networks). However, this contradicts the conclusion of section II: an RBM is a complex probabilistic model which must be approximated. Didn't I just say that we should do away with overly-complex models if we know we'll just be approximating them?

Carrying forward the momentum from the previous section, it might be tempting to abandon probabilistic methods entirely, in favor of the new neural approaches. SPNs restrict the form of probabilistic models to ensure fast inference. But why accept these restrictions? Neural networks are always fast, and they don't put up barriers against certain sorts of complexity in models.

The objective functions for the neural models are still (often) probabilistic. A neural network can be trained to output a probability. We do not need everything inside the network to be a probability. We do lose something in this approach: it's harder to reverse a function (reasoning backwards via Bayes' Law) and perform other probabilistic manipulations. However, there may be solutions to these problems (such as training a network to invert another network).
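As a small sketch of a probabilistic objective wrapped around a non-probabilistic interior, here is a single-unit "network" trained with log loss. The toy data, the function names, and the learning rate are assumptions for illustration. The internal score `w * x + b` is an unbounded real number, not a probability; only the final output is squashed into (0, 1).

```python
import math

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

def log_loss(data, w, b):
    """Mean negative log-likelihood of the labels under the model's outputs."""
    total = 0.0
    for x, label in data:
        p = sigmoid(w * x + b)
        total -= math.log(p) if label else math.log(1.0 - p)
    return total / len(data)

def train(data, steps=500, lr=0.5):
    """Plain gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, label in data:
            error = sigmoid(w * x + b) - label  # gradient of the loss w.r.t. the score
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return w, b

# Toy data: larger x makes label 1 more likely, with some noise.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]
w, b = train(data)
print(log_loss(data, 0.0, 0.0), log_loss(data, w, b), sigmoid(w * 2.0 + b))
```

The model answers "how probable is label 1 given x", but nothing in it says how probable x is given the label; that reverse direction is what is lost relative to a fully probabilistic model.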

### V. Further Thought Needed

These neural models are impressive, and it seems as if a great deal could be achieved by extending what's been done and putting those pieces together into one multi-domain knowledge system. However, this could never yield AGI as it stands: as this paper notes (Section 6), vector representations do not perform any logical deduction to answer questions; rather, answers are baked-in during learning. These systems often can correctly answer totally new questions which they have not been trained on, but that is because the memorization of the other answers forced the vectors into the right "shape" to make the correct answers evident. While this technique is powerful, it can't capture aspects of intelligence which require thinking.

Similarly, it's not possible for all probabilistic models to be tractable. SPNs may be a powerful tool for creating fast probabilistic models, but the restrictions prevent us from modeling everything. Intelligence requires some inference to be difficult! So, we need a notion of difficult inference!

It seems like there is a place for both shallow and deep methods; unstructured and highly structured models. Fitting all these pieces together is the challenge.