Wednesday, August 29, 2012

Beliefs


I tend to think that the AI problem is split into two basic problems:

  1. What is the space of propositions which we have beliefs about?
  2. What are the rules by which beliefs interact?
Bayesianism says that once we answer (1), we simply define a prior (and possibly a utility function) over the space, and answer (2) by a Bayesian update on any evidence we observe, and then a marginalization to find the prediction (or maximize to find the highest-utility action, if we specified a utility function).
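To make this recipe concrete, here is a minimal sketch over a toy space of two binary propositions; the variables and all the numbers are invented purely for illustration.

```python
import numpy as np

# Toy proposition space: a joint prior over two binary variables
# ("rain", "sprinkler"). All numbers are invented for illustration.
prior = np.array([[0.40, 0.10],   # rain=0: sprinkler=0, sprinkler=1
                  [0.45, 0.05]])  # rain=1: sprinkler=0, sprinkler=1

# Likelihood of the observed evidence "grass is wet" under each joint state.
likelihood = np.array([[0.05, 0.80],
                       [0.90, 0.99]])

# Bayesian update: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

# Marginalization: sum out "sprinkler" to get the prediction about "rain".
p_rain = posterior.sum(axis=1)
print("P(rain | grass wet) =", p_rain[1])
```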

This is impractical for large problems, which brings in an interesting question without even giving up Bayesianism: what approximations work well in practice? I personally prefer methods which deal with beliefs in a local way. Factor graphs are my favorite formalism for thinking about this.

The basic algorithm for factor graphs is the summary-product algorithm, which is a generalisation of belief propagation and several other important algorithms. Generalising several algorithms like this is fairly exciting, and gives summary-product an important status among the possible mechanisms for belief interaction. If I had to state the idea of summary-product in one sentence, it would be:

If you want to perform a global operation on a multivariate function which can be represented as a product of several factors, you can often treat the distributive law as approximately true in order to transform it to a local message-passing algorithm which often runs in linear time.

Of course, this is hopelessly vague; you have to look at the math. In any case, this is typically the "default" approximation for factor-graph problems. Unfortunately, the procedure doesn't come with many theoretical guarantees: we know that it works well in practice, but it does not always converge to the right value, or converge at all. Not very much is known about when or why.
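To illustrate the message-passing flavour, here is a minimal sum-product sketch for a three-variable chain with two pairwise factors (the tables are invented); on a tree like this the algorithm is exact, and the approximation only enters when the graph has cycles.

```python
import numpy as np

# Chain factor graph: p(x1, x2, x3) is proportional to f12(x1, x2) * f23(x2, x3),
# all variables binary. The factor tables are invented for illustration.
f12 = np.array([[1.0, 0.5],
                [0.5, 2.0]])
f23 = np.array([[2.0, 1.0],
                [1.0, 3.0]])

# Sum-product messages toward x2 (on a tree, one pass each way is exact).
m_x1_to_f12 = np.ones(2)            # leaf variable sends a uniform message
m_f12_to_x2 = f12.T @ m_x1_to_f12   # sum over x1 of f12(x1, x2)
m_x3_to_f23 = np.ones(2)
m_f23_to_x2 = f23 @ m_x3_to_f23     # sum over x3 of f23(x2, x3)

# The marginal of x2 is the normalized product of incoming messages.
belief_x2 = m_f12_to_x2 * m_f23_to_x2
belief_x2 /= belief_x2.sum()
print("P(x2) =", belief_x2)
```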

In cases where this does not work well, we can turn to variational methods.

The core variational technique, often referred to as "variational Bayes" (but more specifically known as "mean-field"), seeks to minimise the KL-divergence in the form KL(Q||P), where P is the exact distribution and Q is our approximation. Minimising KL-divergence typically involves taking expectations, and in this case it means taking the expected log-probability. (I won't try to go through the derivation here, but it is quite interesting.) This has a surprising effect: by finding the expected log-probability rather than the expected probability, we replace multiplication with addition in some of the mathematics. This is the kind of crazy simplification that you would not expect to work. Yet, it does work, quite well: the coordinate-ascent updates are guaranteed to converge, although in general only to a local minimum of the KL divergence, not the global one. (At least, if I've understood correctly.) The mean-field approach can be understood as a generalisation of the popular EM algorithm.
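Here is a minimal sketch of the coordinate-ascent mean-field updates for a single pairwise factor over two binary variables (the factor table is invented); note how each update exponentiates an expected log-factor, which is the expected-log-probability move described above.

```python
import numpy as np

# Mean-field for a single pairwise factor p(x1, x2) proportional to f(x1, x2),
# with the approximation q(x1, x2) = q1(x1) * q2(x2). The factor table is
# invented; it strongly prefers the two "agreeing" states.
log_f = np.log(np.array([[9.0, 1.0],
                         [1.0, 9.0]]))

q1 = np.array([0.5, 0.5])
q2 = np.array([0.7, 0.3])   # arbitrary (slightly asymmetric) initialization

for _ in range(50):
    # Coordinate ascent: each q_i is proportional to the exponentiated
    # *expected* log-factor, with the expectation taken under the other q.
    q1 = np.exp(log_f @ q2)
    q1 /= q1.sum()
    q2 = np.exp(log_f.T @ q1)
    q2 /= q2.sum()

print("q1 =", q1)   # ends up near [0.75, 0.25], though the true marginal is uniform
print("q2 =", q2)
```

In this example the exact marginals are uniform, but the factorized approximation locks onto one of the two modes (which mode depends on the initialization); this is the characteristic mode-seeking behaviour of minimising KL(Q||P).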

A second variational method is known as expectation-propagation. Here, we minimise KL(P||Q) rather than KL(Q||P). This actually makes more sense from a mathematical perspective: KL(A||B) measures the amount of unnecessary prediction error from using probability distribution B, if observations are really being drawn from distribution A. Expectation-propagation seems to have it right, minimising the amount of prediction error due to approximation. It is unclear why we might want to take the mean-field approach instead, except that it may be computationally convenient.
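For reference, in the discrete case

KL(A||B) = sum_x A(x) log( A(x) / B(x) ),

which is the expected extra log-loss (equivalently, extra code length) paid for predicting with B when the data actually come from A.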

Again, minimising KL-divergence often leads us to take expectations. In this case, rather than finding ourselves in the strange land of log-probabilities, we end up taking expectations over probabilities in a more normal way. I won't try to give all the details, but it turns out that in specific cases, this approximation ends up being exactly the same as belief propagation! This happy coincidence again gives credibility to the generality of the summary-product algorithm. At the same time, it gives us a different and useful generalisation of belief propagation.

This also gives us a nice theoretical tool for analysing the accuracy and convergence of belief propagation. Like summary-product, expectation-propagation can fail to converge. However, expectation-propagation comes with a fairly nice fixed-point analysis which helps us to understand this process (although it doesn't give all the answers).

There is a wider variety of variational methods, but I don't know much about them-- yet. :)

Other, less Bayesian techniques present themselves. In particular, while variational methods can be used to learn the factors themselves, a simpler approximation is to use gradient descent. As in neural network back-propagation, we take the derivative of an objective function with respect to the parameters in order to get a direction in which we should move our beliefs. This does not directly represent any uncertainty about the parameters.
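As a minimal illustration of this style of learning (not tied to any particular factor-graph package; the data and model are invented), here is gradient descent on the negative log-likelihood of a single logistic parameter:

```python
import numpy as np

# Learning one parameter by gradient descent on the negative log-likelihood,
# for a logistic model p(y=1 | x) = sigmoid(w * x). Data is invented.
xs = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])
ys = np.array([0, 1, 0, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 0.0
learning_rate = 0.1
for _ in range(200):
    p = sigmoid(w * xs)
    # Gradient of the negative log-likelihood with respect to w.
    grad = np.sum((p - ys) * xs)
    w -= learning_rate * grad

print("learned w =", w)
```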

The main problem with this approach is that the derivatives will tend to get very small if we have deep networks with large numbers of hidden variables. Overcoming this problem is the topic of research in deep belief networks. (Interestingly, this formalism bridges the gap between neural network research and Bayes-style graphical model research. It is not the only approach to do so. I have some hope that this distinction will slowly disappear, although there will always be room for some differentiation...) Studying what sort of local belief interactions work best here will be an important part of the puzzle, to find good general rules for the interaction of beliefs.

Another source of inspiration is the PAQ compressor by Matt Mahoney. In terms of compression ratio, it is currently the best compressor there is. Information theory tells us that there is a close relationship between probability and compression; so, if a specific sort of computation tends to be the best for compression, it is likely to be the best for approximate probabilistic reasoning, as well. (In fact, PAQ is founded on this principle.) PAQ is by now a fairly complicated mechanism with many special optimisations which I don't claim to understand, but I do understand some of the principles behind it:
  1. How should we combine several conditional probability distributions which give different beliefs for some proposition? This is known as the "combination problem" in various circles (it shows up in PLN, NARS, and BLP). It's like asking how to make use of a mangled Bayesian network which has been put together from several conflicting models. PAQ says to combine the different predictions logarithmically (a bit like mean-field), with different weights on each, trained via gradient-descent to find the combination of weights which minimises prediction error.
  2. In addition, we can use indirect context models to learn when specific models are appropriate.
  3. Favor recent history when counting probabilities. This is an idea which also appears in reinforcement learning; the term which justifies the practice is "non-stationary models" (the idea that the probability distribution may change over time). I should mention that I think reinforcement learning also has other lessons about belief structure and useful propagation rules.
Many explanations are in Matt's book in progress (which I have not yet read).
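As a rough sketch of the mixing idea in point 1 (this illustrates the principle only, not PAQ's actual code), each model emits a probability for the next bit, the mixer combines them in the log-odds domain, and the weights are nudged by gradient descent toward whichever models were predicting well:

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Online logistic mixing of several bit predictors (a rough sketch of the
# idea, not PAQ's actual implementation). Each predictor gives P(next bit = 1);
# the mixer combines them in the log-odds domain with learned weights.
weights = np.zeros(3)
learning_rate = 0.05

def mix_and_update(predictions, actual_bit):
    global weights
    stretched = logit(np.clip(predictions, 1e-6, 1 - 1e-6))
    p_mixed = sigmoid(weights @ stretched)
    # Gradient step on the log loss: weights move toward predictors that
    # were right and away from those that were wrong.
    weights += learning_rate * (actual_bit - p_mixed) * stretched
    return p_mixed

# Example: three (invented) model outputs for one bit, then the observed bit.
print(mix_and_update(np.array([0.9, 0.6, 0.2]), actual_bit=1))
```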

There is much more to say, but I suppose this post has gotten long enough. Thoughts?

Tuesday, August 28, 2012

Bayesianism?

Normally, I'm a hard-core Bayesian. This means that I believe all uncertainty is essentially probabilistic uncertainty, and also, goal-oriented behavior is essentially utility-maximization. These are beliefs which make predictions: I predict that AI techniques which attempt to approximate the rules of probability theory and utility maximization will tend to work the best.

In my previous post, I gave a reason to doubt this: Bayesian systems rely on a pre-specified model space, and this model space will be inherently incomplete, for several different reasons. This does not have profoundly bad consequences for probability theory (a Bayesian system will still do its best to make reasonable predictions, in some sense); however, it may have worse consequences for utility theory (it isn't clear to me that the system will do its best to achieve its given goal, in any strong sense).

This, too, is a testable belief. I've been discussing some experiments with Lukasz Stafiniak which will help here (but we have set no definite deadline to get around to this, as we both have other things to do). (I should mention that his motivation for being interested in these experiments is not the same as mine.) It could also be addressed on purely theoretical grounds, if it could be proven that (specific sorts of?) Bayesian systems can or cannot learn specific behaviors (behaviors which other sorts of systems are capable of learning).

What is the competitor to Bayesianism? Model-free learning attempts to directly learn the policy for the environment based on feedback, without trying to make predictions about the world or directly represent world knowledge.
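Concretely, the canonical model-free recipe is something like tabular Q-learning, sketched here with the environment left abstract: action values are adjusted directly from observed rewards, and no transition model is ever estimated.

```python
import numpy as np

# Tabular Q-learning: a generic model-free method. The agent never builds a
# model of the environment's dynamics; it adjusts action values directly from
# observed rewards. State and action counts here are placeholders.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def choose_action(state):
    # Epsilon-greedy: mostly exploit current value estimates, sometimes explore.
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state):
    # Move Q(s, a) toward the observed reward plus the bootstrapped value of
    # the next state; no world model is involved anywhere.
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```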

In this view, trial-and-error learning becomes more fundamental than Bayesian learning. This makes some philosophical sense. After all, if we reason in an approximately Bayesian way, it is because we have evolved to do so through a process of mutation and natural selection.

The model-free approach has been popular in the past, and there is still research being done in the area, but model-based methods have become the technique of choice for complex problems. To take a somewhat Bayesian line of argument, this is natural, because refusing to state your assumptions doesn't actually exempt you from having assumptions, and explicitly modeling the world allows the data to be used more efficiently: we separately optimize the world model based on the data, and then optimize the policy based on the world model.

Monday, August 20, 2012

Truth and AI

I've written extensively in the past about the relationship between foundations of mathematics and AGI; both on this and my previous blog, and in numerous posts to the AGI mailing list. I claimed that numerous problems in the foundations of mathematics needed to be solved before we could create true AGI.

The basic argument was this:

An AGI should, with enough processing power and training, be able to learn any concept which humans can learn. However, in order to learn a concept, it needs to be able to represent that concept in the first place. So, if we find that an AGI system's internal representation can't handle some concept, even in principle, then we should extend it.

I called this the "expressive completeness" requirement. Logicians have some bad news for such a requirement: Tarski showed that any sufficiently powerful logic system is incapable of expressing its own semantics. This means there is at least one concept that can't be expressed, in any knowledge representation: the meaning of that representation.

This is related to Goedel's second incompleteness theorem, which says that we can never prove our own logic sound; any logic which can say of itself "all the results I derive are true" must be wrong about that!

Intuitively, this seems to indicate that whatever logic humans use, we won't be able to figure it out. A logic system can only understand weaker logic systems. This would suggest that we are doomed to forever ponder weak theories of mind, which are unable to account for all of human reasoning.

As a result, my hope for some time was to "get around" these theorems, to solve the expressive completeness problem. (This is not quite as hopeless as it sounds: the specific statements of the theorems do contain loopholes. The problem is to decide which assumptions are not needed.)

However, two years ago, I decided that the problem didn't really need to be solved. Here is the message I sent to the AGI list:
In the past I've done some grumbling about "expressive completeness" of an AGI knowledge representation, specifically related to Tarski's undefinability theorem, which shows that there is always something missing no matter how expressively powerful one's language is. Perhaps you remember, or perhaps you weren't around for it, but basically, I argued that for any given AGI system the Tarski proof could show where it had a representational hole: a concept that it could not even express in its representation.

Today I retract the worry and give a broad, somewhat tentative "thumbs up" to opencog, NARS, Genifer, LIDA, and any similar systems (at least insofar as logical completeness is concerned).

I still think the theory of logical completeness is important, and can bear important fruits, but at the moment it looks like its main result is just to say what many of you knew all along-- a "full steam ahead" on existing systems. I recognize that it's a hard sell to claim that we should do all that work to get the already-obvious answer.

Beyond that point, AGI researchers won't care all that much and I'm more doing some (albeit strange) philosophical logic.

The sketch of the result goes like this.

Jumps up the Tarski hierarchy of languages are fairly easy to justify, due to the various benefits that more logical power offers. These include speed of reasoning and more concise notation of certain concepts. Most AGI systems will be able to see these benefits and, if not explicitly endorsing the move *in their original logic*, will move toward stronger logics implicitly at their procedural level.

Worst-case, the system could replace itself "from the outside" by taking action in the external environment to modify itself or create a successor...

(Ideally, of course, it would be nice to have a "stable" system which explicitly accepted improvements in its initial logic.)

In conclusion, the best style of AGI I can recommend to achieve completeness is what I think of as "Ben's zen AGI approach" (apologies to both Zen and Ben Goertzel for any misrepresentation): give up your attachment to individual algorithmic approaches, and just take the best algorithm for the job. Put these good algorithms together in a blackboard-like environment where each can do its job well where it applies, and after a while, general intelligence will emerge from their co-authorship.
Recently, I have been thinking about these issues again. I think it is time for a bit more on this topic.

As I mentioned, we have some very strong limitative results in logic. To give a version of these results for AGI, we can talk about learning. Wei Dai gives a form of this result, calling it the unformalizability of induction.

A Bayesian learning system has a space of possible models of the world, each with a specific weight, the prior probability. The system can converge to the correct model given enough evidence: as observations come in, the weights of different theories get adjusted, so that the theory which is predicting observations best gets the highest score. Goodness of fit is not the whole story, though: there will always be very complex models that fit the data perfectly, so simpler models are given higher prior weight. The models that end up dominating are those with a good balance of simplicity and predictive accuracy, which gives the best chance of correctly predicting the future.
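A toy version of this dynamic, with two invented coin models and a prior favouring the simpler one:

```python
import numpy as np

# Two candidate models of a coin: a "simple" one (fair) and a more "complex"
# one (biased). The prior favors the simpler model; the posterior weights
# shift as evidence accumulates. All numbers are invented for illustration.
models = {"fair": 0.5, "biased": 0.9}
log_post = {"fair": np.log(0.8), "biased": np.log(0.2)}   # log prior weights

data = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]   # observed flips (1 = heads)

for flip in data:
    for name, p_heads in models.items():
        # Bayesian update in log space: add the log-likelihood of each flip.
        log_post[name] += np.log(p_heads if flip == 1 else 1 - p_heads)

# Normalize to get posterior weights over the two models.
z = np.logaddexp(log_post["fair"], log_post["biased"])
for name in models:
    print(name, np.exp(log_post[name] - z))
```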

Unfortunately, the space of models is necessarily incomplete. There exists a model which is intuitively fairly simple, but which necessarily does not exist in our Bayesian learner: the possibility that an "evil demon" (to borrow Descartes' idea) is fooling the agent. The demon performs the computations of the Bayesian update to find the agent's probability distribution over the next observation, and then chooses for the agent to see whichever observation is least probable according to that distribution.

The agent cannot converge to this model: by construction, the demon always shows the agent whatever the agent considers least probable, so the agent's predictions can never become accurate. But if a perfectly-predicting model were in the space, the Bayesian agent's predictions would eventually come to be dominated by it (or by models predicting equally well); therefore the demon model must not exist in the space of models being considered.

Shane Legg used this idea to show that there is no "elegant universal theory of induction".

Of course, the practical concern is not really over evil demons; that's a rather unconcerning hypothesis. The demon is just a way of showing that the model space is always incomplete. In reality, more practical theories escape the model space.
  • AI based on Hidden Markov Models cannot learn hierarchical patterns such as nested parentheses, and will never learn to count.
  • AI based on context-free grammars cannot learn context-dependent patterns, and will never learn the general rule of subject-verb agreement.
  • AI based on bounded computational resources cannot learn the true model of the world if it requires more computing power than is available (but it generally does!).
We know that Bayesian reasoning is the best option when the true model is within the space of models. But what happens when it is not?

Luckily, we can still say that Bayesian updates will cause beliefs to converge to the models which have the least KL-divergence with reality. For example, in the evil demon case, beliefs will go towards maximum entropy; the agent will treat reality as random. Given the structure of the situation, this is the best strategy to minimize incorrect predictions.
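A little more formally (assuming the data really are drawn from some fixed distribution, and under the usual regularity conditions), the beliefs concentrate on the model q in the model space which minimises KL(p_true || q), where p_true is the distribution actually generating the data.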

However, there is still more to worry about. In my email 2 years ago, I claimed that an AI system could see the benefits of more inclusive model spaces, and modify itself to include what it might be missing. It would be good if something like this could be made more formal and proven.

For example, is it possible for an agent based on hidden Markov models to learn to count if we allow it a 'cognitive workspace' which it is expected to learn to use? We can imagine an RL agent which works by doing HMM learning and then HMM planning to maximize reward. We augment it by giving it a memory containing a stack of symbols, and actions to push symbols to memory and pop symbols out of memory. The symbol popped out of memory is presented as an observation to the system. Can the agent learn to use memory to its benefit? Can it now learn pushdown automata, rather than only learning models with finite state-space?
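Here is a sketch of that augmentation (the wrapper and its interface are invented for illustration); the point is just that push and pop actions are added to the action space and the popped symbol is folded into the observation.

```python
# A sketch of the "cognitive workspace" augmentation described above. The
# wrapper and its interface are invented for illustration; the base
# environment is whatever the HMM-based agent was already facing.
class StackMemoryWrapper:
    def __init__(self, base_env, symbols=("A", "B")):
        self.base_env = base_env
        self.symbols = symbols
        self.stack = []

    def step(self, base_action, memory_action):
        # memory_action is ("push", symbol), ("pop",), or ("noop",).
        popped = None
        if memory_action[0] == "push":
            self.stack.append(memory_action[1])
        elif memory_action[0] == "pop" and self.stack:
            popped = self.stack.pop()
        base_obs, reward = self.base_env.step(base_action)
        # The popped symbol is presented to the agent as part of its observation.
        return (base_obs, popped), reward
```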

I think the answer is, no. Because the agent does explicit HMM planning, it will not perform an action unless the HMM predicts a benefit in the future. This makes it very difficult for the agent to learn when to push symbols, because it would need to be able to look ahead to the time of their being popped; but the whole point is that the HMM cannot do that.

This suggests that my conclusion in the old email was wrong: reasonable systems may reject obvious-seeming self-improvements, due to lack of imagination.

Taking this as an example, it seems like it may be a good idea to look for learning methods which answer "yes" to these kinds of questions.