## Friday, July 18, 2014

### Losing Faith in Factor Graphs

In my post Beliefs, I split AGI into two broad problems:

1. What is the space of possible beliefs?
2. How do beliefs interact?
The idea behind #1 was: can we formulate a knowledge representation which could in principle express any concept a human can conceive of?

#2 represents the more practical concern: how do we implement inference over these beliefs?

(This was needlessly narrow. I could at least have included something like: 3. How do beliefs conspire to produce actions? That's more like what my posts on procedural logic discuss. 1 and 2 alone don't exactly get you AGI. Nonetheless, these are central questions for me.)

Factor graphs are a fairly general representation for networks of probabilistic beliefs, subsuming bayesian networks, markov networks, and most other graphical models. Like those two, factor graphs are only as powerful as propositional logic, but can be a useful tool for representing more powerful belief structures as well. In other words, the factor graph "toolbox" includes some useful solutions to #2 which we may be able to apply to whatever solution for #1 we come up with. When I started graduate school at USC 3 years ago, I was basically on board with this direction. That's the direction taken by Sigma, the cognitive architecture effort I joined.

I've had several shifts of opinion since that time.

### I. Learning is Inference

My first major shift (as I recall) was to give up on the idea of a uniform inference technique for both learning models and applying them ("induction" vs "deduction"). In principle, Bayes' Law tells us that learning is just probabilistic inference. In practice, unless you're using Monte Carlo methods, practical implementations tend have much different algorithms for the two cases. There was a time when I hoped that parameter learning could be implemented via belief propagation in factor graphs, properly understood: we just need to find the appropriate way to structure the factor graph such that parameters are explicitly represented as variables we reason about.

It's technically possible, but not really such a good idea. One reason is that we don't usually care about the space of possible parameter settings, so long as we find one combination which predicts data well. There is no need to keep the spread of possibilities except so far as it facilitates making good predictions. On the other hand, we do usually care about maintaining a spread of possible values for the latent variables within the model itself, precisely because this does tend to help. (The distinction could be is ambiguous! In principle there might not be a clean line between latent variables, model parameters, and even the so-called hyperparameters. In practice, the distinction is quite clear, though, and that's the point here.)

Instead, I ended up working on a perfectly normal gradient-descent learning algorithm for Sigma. Either you're already familiar with this term, or you've got a lot to learn about machine learning; so, I wont try to go into details. The main point is that this is a totally non-probabilistic method, which only believes a single parameter setting at any given time, but attempts to adjust these in response to data.

The next big shift in my thinking has been less easily summarized, and has been taking place over a longer period of time. Last year, I wrote a post attempting to think about it from a perspective of "local" vs "global" methods. At that time, my skepticism about whether factor graphs provide a good foundation to start with was already strong, but I don't think I articulated it very well.

### II. Inference is Approximate

Inference in factor graphs is exponential time in the general case, which means that (unless the factor graph has a simple form) we need to use faster approximate inference.

This makes perfect sense if we assume that "inference" is a catch-all term referring to the way beliefs move around in the space of possible beliefs. It should be impossible to do exact inference on the whole belief space.

If we concede that "inference" is distinct from "learning", though, we have a different situation. What's the point in learning a model that is so difficult to apply? If we are going to go ahead and approximate it with a simpler function, doesn't it make sense to learn the simpler function in the first place?

This is part of the philosophy behind sum-product networks (SPNs). Like neural networks, SPN inference is always linear time: you basically just have to propagate function values. Unlike neural networks, everything is totally probabilistic. This may be important for several reasons; it means we can always do reasoning on partial data (filling in the missing parts using the probabilistic model), and reason "in any direction" thanks to Bayes' Law (where neural networks tend to define a one-direction input-output relationship, making reasoning in the reverse direction difficult).

Why are SPNs so efficient?

### III. Models are Factored Representations

The fundamental idea behind factor graphs is that we can build up complicated probability distributions by multiplying together simple pieces. Multiplication acts like a conjunction operation, putting probabilistic knowledge together. Suppose that we know a joint distribution connecting X with YP(X, Y). Suppose that we also know a conditional distribution, defining a probability on Z for any given probability on X: P(Z|X). If we assume that Z and Y are independent given X, we can obtain the joint distribution across all three variables by multiplying the two probability functions together: P(X,Y,Z) = P(X,Y)P(Z|X).

A different way of looking at this is that it allows us to create probabilistic constraints connecting variables. A factor graph is essentially a network of these soft constraints. We would expect this to be a powerful representation, because we already know that constraints are a powerful representation for non-probabilistic systems. We would also expect inference to be very difficult, though, because solving systems of constraints is hard.

SPNs allow multiplication of distributions, but only when it does not introduce dependencies between distributions which must be "solved" in any way. To oversimplify just a smidge, we can multiply just in the case of total independence: P(X)P(Y). We are not allowed to use P(X|Y)P(Y), because if we start allowing those kinds of cross-distribution dependencies, things get tangled and inference becomes hard.

To supplement this reduced ability to multiply, we add in the ability to add. We compose complicated distributions as a series of sums and products of simpler distributions. (Hence the name.) Despite the simplifying requirements on the products, this turns out to be a rather powerful representation.

Just as we can think of a product as conjunctive, imposing a series of constraints on a system, we can think of a sum as being disjunctive, building up a probability distribution by enumeration of possibilities. This should bring to mind mixture models and clustering.

Reasoning based on enumeration is faster because, in a sense, it's just the already-solved version of the constraint problem: you have to explicitly list the set of possibilities, as opposed to starting with all possibilities and listing constraints to narrow them down.

Yet, it's also more powerful in some cases. It turns out that SPNs are much better at representing probabilistic parsing than factor graphs are. Parsing is a technique which is essential in natural language processing, but it's also been used for other purposes; image parsing has been a thing for a long time, and I think it's a good way of trying to get a handle on a more intricately structured model of images and other data. A parse is elegantly represented via a sum of possibilities. It can be represented with constraints, and this approach has been successfully used. Those applications require special-purpose optimizations to avoid the exponential time inference associated with factor graphs and constraint networks, though.

The realization that factor graphs aren't very good at representing this is what really broke my resolve as far as factor graphs go. This indicated to me that factor graphs really were missing a critical representational capability; the ability to enumerate possibilities.

Grammar-like theories of general learning have been an old obsession of mine, which I had naively assumed could be handled well within the factor-graph world.

My new view suggested that inference and learning should both be a combination of sum-reasoning and product-reasoning. Sum-learning includes methods like clustering, boosting, and bagging: we learn enumerative models. Reasoning with these is quite fast. Product-learning splits reality into parts which can be modeled separately and then combined. These two learning steps interact. Through this process, we create inherently fast, grammar-like models of the world.

### IV. Everything is Probabilistic

Around the same time I was coming to these conclusions, the wider research community was starting to get excited about the new distributed representations. My shift in thinking was taking me toward grammars, so I was quite excited about Richard Socher's RNN representation. This demonstrated using one fairly simple algorithm for both language parsing and image parsing, producing state-of-the-art results along with a learned representation that could be quite useful for other tasks. RNNs have continued to produce impressive results moving forward; in fact, I would go so far as to say that they are producing results which look like precisely what we want to see out of a nascent AGI technology. These methods produce cross-domain structured generalizations powerful enough to classify previously-unseen objects based on knowledge obtained by reading, which (as I said in my previous post) seems quite encouraging. Many other intriguing results have been published as well.

Unfortunately, it's not clear how to fit RNNs in with more directly probabilistic models. The vector representations at the heart of RNNs could be re-conceived as restricted Boltzmann machines (RBMs) or another similar probabilistic model, giving a distributed representation with a fully probabilistic semantics (and taking advantage of the progress in deep belief networks). However, this contradicts the conclusion of section II: an RBM is a complex probabilistic model which must be approximated. Didn't I just say that we should do away with overly-complex models if we know we'll just be approximating them?

Carrying forward the momentum from the previous section, it might be tempting to abandon probabilistic methods entirely, in favor of the new neural approaches. SPNs restrict the form of probabilistic models to insure fast inference. But why accept these restrictions? Neural networks are always fast, and they don't put up barriers against certain sorts of complexity in models.

The objective functions for the neural models are still (often) probabilistic. A neural network can be trained to output a probability. We do not need everything inside the network to be a probability. We do lose something in this approach: it's harder to reverse a function (reasoning backwards via Bayes' Law) and perform other probabilistic manipulations. However, there may be solutions to these problems (such as training a network to invert another network).

### V. Further Thought Needed

These neural models are impressive, and it seems as if a great deal could be achieved by extending what's been done and putting those pieces together into one multi-domain knowledge system. However, this could never yield AGI as it stands: as this paper notes (Section 6), vector representations do not perform any logical deduction to answer questions; rather, answers are baked-in during learning. These systems often can correctly answer totally new questions which have not been trained on, but that is because the memorization of the other answers forced the vectors into the right "shape" to make the correct answers evident. While this technique is powerful, it can't capture aspects of intelligence which require thinking.

Similarly, it's not possible for all probabilistic models to be tractable. SPNs may be a powerful tool for creating fast probabilistic models, but the restrictions prevent us from modeling everything. Intelligence requires some inference to be difficult! So, we need a notion of difficult inference!

It seems like there is a place for both shallow and deep methods; unstructured and highly structured models. Fitting all these pieces together is the challenge.

## Tuesday, July 15, 2014

### More Distributed Vectors

Since my last post on distributed vector representations, interest in this area has continued to spread across the research community. This exposition on Colah's blog is quite good, although it unfortunately perpetuates the confusing view that distributed representations are necessarily "deep learning". (In fact, as far as I can see, the trend is in the opposite direction: you can do better by using simpler networks so that you can train faster and scale up to larger datasets. This reflects a very general trend to which deep networks seem to be a rare exception.)

The story Colah tells is quite exciting. Vector representations (AKA "word embeddings") are able to perform well on a variety of tasks which they have not been trained on at all; they seem to "magically" encode general knowledge about language after being trained on just one language task. The form of this knowledge makes it easier to translate between languages, too, because the relationship structure between concepts is similar in the two languages. This even extends to image classification: there have been several successes with so-called zero-shot learning, where a system is able to correctly classify images even when it's never seen examples of those classes before, thanks to the general world knowledge provided by distributed word representations.

For example, it's possible to recognize a cat having only seen dogs, but having read about both dogs and cats.

(This seems rather encouraging!)

Colah mentions that while encoding has been very successful, there is a corresponding decoding problem which seems to be much harder. One paper is mentioned as a hopeful direction for solving this. Colah is talking about trying to decode representations coming out of RNNs, a representation I'm quite fond of because it gives (in some sense) a semantic parse of a sentence. However, another option which I'd like to see tried would be to decode representations based on the Paragraph Vector algorithm. This looks easier, and besides, Paragraph Vector actually got better results for sentiment analysis (one of the key domains where RNNs provide a natural solution). Again, we can point to the general trend of AI toward simpler models. RNNs are a way of combining semantic vectors with probabilistic context-free grammers; Paragraph Vector combines semantic vectors with a markov model. Markov models are simpler and less powerful; therefore, by the contrarian logic of the field, we expect them to do better. And, they do.

All of this makes distributed vectors sound quite promising as a general-purpose representation of concepts within an AI system. In addition to aiding bilingual translation and image-word correspondence problems, RNNs have also been applied to predict links in common-sense knowledge bases, showing that the same model can also be applied to understand information presented in a more logical form (and perhaps to form a bridge between logical representations and natural language). I imagine that each additional task which the vectors are used on can add more implicit knowledge to the vector structure, further increasing the number of "zero-shot" generalizations it may get correct in future tasks. This makes me envision a highly general system which accumulates knowledge in vectors over the course of its existence, achieving lifelong learning as opposed to being re-trained on each task. Vector representations by themselves are obviously not sufficient for AGI (for example, there's no model of problem solving), but they could be a very useful tool within an AGI system.

I mentioned in a previous post that the idea of distributed vectors isn't really that new. One older type of word embedding is latent semantic analysis (LSA). Some researchers who are from the LSA tradition (Marco Baroni, Georgiana Dinu, & German Krusewski) have gotten annoyed at the recent hype, and decided that LSA-style embedding was not getting a fair trial; the new word embeddings have not been systematically compared to the older methods. This paper is the result. The verdict: the new stuff really is better!

When it's coming from people who openly admit that they hoped to prove the opposite, it sounds rather convincing. However, I do have some reservations. The title of their paper is: Don't count, predict! The authors refer to the older LSA-like methods as "count vectors", and newer methods as "predict vectors". The LSA-like methods rely on some transformation of raw co-occurrence counts, whereas the new neural methods train to predict something, such as predicting the current word given several previous words. (For example, LSA uses the singular value decomposition to make a low-rank approximation of tf-idf counts. Paragraph Vector trains to find a latent variable representing the paragraph topic, and from this and several previous words, predict each next word in the paragraph.)

As I said: they are assuming the count vs predict distinction is what explains the different performance of the two options. The main experiment of the paper, an extensive comparison of word2vec (representing the prediction-based techniques) vs DISSECT (representing the counting-based techniques) strongly supports this position. Two other techniques are also tried, though: SENNA and DM. Both of these did worse than word2vec, but the relative comparison of the two is much muddier than the comparison between DISSECT and word2vec. This weakens the conclusion somewhat. Are we expected to believe that new counting-based models will continue to be worse than new prediction-driven models?

If we believed that counting models had reached a state of near-perfection, with only small improvements left to be found, then the conclusion would make sense. Prediction-based vector representations appear to still be in their early days, with large unexplored areas. Presumably, they still have significant room for improvement. If the authors believe that this is not the case for counting-based vector representations, the conclusion makes sense.

However, work I'm involved with may undermine this conclusion.

I've been doing distributed vector stuff with Volkan Ustun and others here at ICT. Our forthcoming paper has some suggestive evidence, showing performance very close to Google's word2vec when trained with similar vector size and amount of data. We are not doing any neural network training. Instead, we are creating representations by summing together initial random representations for each word. This seems to fall firmly into count-based methods.

The incredible thing about Volkan's technique is that it's essentially the first thing we thought of trying; what's going on is much simpler than what happens in word2vec. Yet, we seem to be getting similar results. (We should run a more direct comparison to be sure of what's going on.) If this is the case, it directly contradicts the conclusion of Don't Count, Predict!.

In any case, distributed vectors continue to offer a surprising amount of generality, and have some promise as a cross-task, cross-language, cross-modality unified representation.