Monday, December 3, 2012

My Unhealthy Obsession with Message-Passing

This post is a follow-up to Beliefs, in which I outlined some of what the machine learning community knows about how beliefs should interact with one another.

One assumption which I didn't spend too much time on in that post was the assumption that a "belief" should be managed by local message-passing as much as possible. This assumption has been strong in the probabilistic graphical models community from the beginning, and I have taken it as a given. In order to get efficient algorithms, it seems like a good idea to do everything through local processes. Also, message-passing simply fits with my intuitions about how humans reason. Even Markov networks (as opposed to Bayesian networks) feel insufficiently local, because the probabilistic semantics is distributed (you need to know all the factors before you can determine probabilities). For me, the more local, the better: the popular Monte Carlo methods strike me as "not local enough" (because although we keep making changes locally, we do this to a global complete state, and make a running tally over time... which seems infeasible for extremely large networks representing the quantity of knowledge a human has). I'll question this assumption now.

In the previous post, I looked at some of the local message-passing algorithms I'm interested in. There is more interesting material which I've now understood and would like to summarize, but for the moment I'll cover just one more thing to serve my point.

I mentioned that we can make a distinction between techniques such as Mean Field which seek to minimize KL(Q||P), and those such as Belief Propagation which use KL(P||Q); both of these measure divergence between the true distribution and our approximation, but KL(P||Q) intuitively makes more sense (minimizing error due to approximation), whereas KL(Q||P) has some computational advantages.
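
To make the distinction concrete, here is a toy sketch in Python (the distributions are invented) comparing the two directions of the divergence for a single discrete variable:

    import numpy as np

    # Toy numbers: P is the "true" distribution, Q an approximation to it.
    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    P = np.array([0.5, 0.3, 0.15, 0.05])
    Q = np.array([0.4, 0.4, 0.1, 0.1])

    print(kl(P, Q))  # KL(P||Q): extra prediction error from using Q in place of P
    print(kl(Q, P))  # KL(Q||P): the direction mean-field-style methods minimize

The two numbers differ, and which one we choose to minimize changes the character of the approximation.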

It turns out that both belief propagation and mean field can be fit into the framework of minimizing KL(Q||P). Even though this isn't my error function of choice, it's good to be able to explain both approximations within one framework; it allows us to definitively explain how to pass messages between the two, in the case that we want to use both in one network.

The basic idea is that Mean Field (MF) can be justified in a straightforward way by minimizing KL(Q||P) when we assume Q contains a separate factor for every variable. Belief Prop (BP), on the other hand, can be justified by minimization of the Bethe energy. The Bethe energy can be derived from KL(Q||P) when Q is of a peculiar form containing a factor for every relationship between variables, but also a negative factor for every variable. The negative factors divide out the effect of the overabundant positive factors. The negative factors are constrained to be the marginals of the positive factors. This gets a little confusing (I'll plan a full post for later), but provides greater coupling between variables, which we expect to give us a better approximation. (It does in most cases.) Unfortunately, it ruins the convergence guarantee which MF had.

The framework I mentioned allows us to deal with a Q function which mixes these two forms. This is good, because MF and BP are good for dealing with different kinds of things. MF often has a simple closed form in cases where BP does not, or does but is too expensive. MF is guaranteed to converge, while BP is not. However, BP typically gives better results when it does converge. It gives exact results for a tree-shaped belief network, while MF does not. It handles 0s in distributions, while MF cannot.

Because BP is not guaranteed to converge, it makes some sense to argue that it is not suitable to AGI-scale systems of beliefs. There will almost certainly be some portion of the network which will not converge, even if non-convergent behavior is rare overall; and the problem may spread. Therefore, it could make sense to have small, "safe" BP networks acting as pieces in a larger MF network, to make global convergence easier. If a portion of the network was causing trouble, more of it might be converted to MF.

If I'm not mistaken, this same idea could also accommodate Kikuchi free energy, which gives "generalized belief propagation" (GBP). This goes further than BP, including even more interactions in the approximation. Hence we have a flexible approximation framework, which can handle approximations based on a wide variety of forms for Q. The structure of Q can be adapted to the situation, addressing our needs as well as any local convergence problems.

This might sound like a rather nice situation. But... is it?

We're putting an awful lot of energy into the approximation, here. Why are the belief networks so difficult to control? The structure-learning system has to find these networks in the first place. If we can't even do inference for a hypothesis, how can we choose it? Shouldn't the system choose hypotheses for which inference is manageable?

This is exactly the approach taken by Sum-Product Networks. Graphical models normally give us exponential-time inference in the worst case, but sum-product networks are linear-time instead. We could get a similar advantage by using standard types of networks, like BNs or MNs, and requiring that the network take the form of a tree (for which exact inference is inexpensive). We get guaranteed linear-time inference, no approximations, but we have not severely restricted the kinds of probability distributions which can be expressed: we can add more hidden variables to express more complicated distributions. (If we need to add exponentially many nodes to do what we want, though, then obviously we're getting back into trouble.) SPNs do the same thing, but are an interesting alternative to tree networks, in that they allow us to adjust the probability formula at a lower level; we are literally composing an expression of sums and products to express the distribution (an "arithmetic circuit"). This is more annoying for a human to look at, but at the same time, gives us more freedom (allows us to compactly represent a larger number of functions) without compromising the linear-time inference.
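
As a toy illustration (the structure and weights here are entirely my own invention, not anyone's published example), here is a minimal sum-product network over two binary variables, evaluated in a single bottom-up pass:

    # Leaf "indicator" values for the particular assignment x1=1, x2=0.
    leaf = {"x1": 1.0, "not_x1": 0.0, "x2": 0.0, "not_x2": 1.0}

    def product(*children):
        result = 1.0
        for c in children:
            result *= c
        return result

    def weighted_sum(pairs):
        return sum(w * c for w, c in pairs)

    # P(x1, x2) as a mixture (sum node) of two fully factorized components
    # (product nodes); evaluation is one bottom-up pass, linear in the edges.
    component_a = product(weighted_sum([(0.9, leaf["x1"]), (0.1, leaf["not_x1"])]),
                          weighted_sum([(0.8, leaf["x2"]), (0.2, leaf["not_x2"])]))
    component_b = product(weighted_sum([(0.2, leaf["x1"]), (0.8, leaf["not_x1"])]),
                          weighted_sum([(0.3, leaf["x2"]), (0.7, leaf["not_x2"])]))
    print(weighted_sum([(0.6, component_a), (0.4, component_b)]))  # P(x1=1, x2=0)

Marginalizing a variable uses the same single pass: set both of its indicators to 1 and it is summed out automatically.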

SPNs are very similar to artificial neural networks, particularly feedforward networks, and give us linear-time inference for the same reason: we're directly learning a mathematical expression to give the desired result, not trying to represent some inference problem to be solved. One way of explaining it is: for any approximation method, we end up performing some sequence of arithmetic operations to arrive at an answer. SPNs jump straight to learning a good sequence of arithmetic operations, as our model, rather than learning a potentially difficult model and then trying to find a nice approximation.

I said I would deal with my assumption of local approximation. Going back to the P vs Q idea, a "belief" is a local function; a factor of Q. The question is how to coordinate a number of local functions, in order to get a good approximation of a probability function P. SPNs don't need any of this. Local beliefs give us a very messy belief approximation process. SPNs give us a global model with exact inference.

Why do I call an SPN "global"? It might be local enough to satisfy some. The linear-time inference algorithm iterates over the network via the links. That's local, isn't it? Well, sort of. You can't learn SPNs locally and then compose them together, though. The parts of a sum-product network do not have inherent meaning outside of the network. In a Bayesian network, a conditional probability is a conditional probability; we could evaluate its accuracy apart from everything else, if we could access data for those variables.

SPNs aren't a full AGI system in themselves, but they give me an image of an AGI system which learns very fast belief-formation functions, which simply get evaluated in order to give probabilities at a particular moment. This image is much more efficient than my previous image of a large network of approximate beliefs which are doing local message-passing to minimize KL divergence between Q and the desired P. Local approximation by message-passing is a faster approximation than Monte Carlo methods, but if we can create networks which we can precisely evaluate very quickly, why worry about approximation at all?

On the other hand, I think we really want some FOPI-style behavior (First Order Probabilistic Inference, one approach to integrating logic and probability). The structure of beliefs should include first-order logic as a special case, which means we cannot avoid including problems which are potentially intractable. (IE, real-life inference problems are sometimes intractable.)

Indeed, it is difficult to see how an SPN-like framework can give us persistent beliefs with logical structure. An SPN is composed of 'sum' nodes and 'product' nodes. We can think of the 'sum' nodes as representing 'or', and the 'product' nodes as representing 'and'. So, we can think of 'product' nodes as representing an object: a combination of attributes. 'Sum' nodes represent alternative ways of achieving attributes. However, this is still a very low-level representation! There is no way to represent multiple instances of one object, no way to represent classes of objects with a type/subtype relation... we can't reason about familial relations, for example, trying to decide whether two people are cousins given some other information about their families.

Side note: one possible objection is that belief networks are a representation for the low-level dynamics of an artificial mind, and do not need to represent the high-level stuff like logic. This should emerge as a learned skill of a large network which has access to mental workspaces in which it can implement more complex, longer-term mental processes. This approach may be workable. However, my intuition is that there is something to gain from explicitly representing logically structured knowledge.

Back to the first hand, we can argue that FOPI itself requires a global model in order to start. This is an interesting and important point: local inference rules such as belief propagation and FOPI require a fixed network structure to begin with. (More technically, we need to know what the Markov blankets of our variables are. This is similar to learning a background model of causality: we can't just operate on correlations between variables. Local probability tables are not enough. We also need to know the structure, which tells us which probability tables are relevant to a given situation.)

It would be nice if we could also reason about structure locally, learning which variables are related in an incremental, bottom-up manner. Perhaps this is possible. However, it seems especially difficult with FOPI-like inference: logical models do not behave in a very local way. It seems very "human" to have a large set of potentially-inconsistent logical statements which we are constantly trying to integrate in an approximate way; this again sounds like a Q vs P iterative reasoning scheme. But, again, won't this be terribly messy and inefficient? Won't we end up approximating things by some simple sequence of operations? Wouldn't it be nicer to just learn a good sequence of operations, in the first place, which just gives us the approximate answer?

I don't intend to answer these questions here. I think there's an interesting tension between these "local" messy message-passing approximation algorithms, based on semantically meaningful local beliefs, and "global" methods which can potentially give us fast, reliable answers based on a fixed calculation.

Sunday, November 4, 2012

New Truth: Intuition

There are more technical details to work out concerning my impredicative theory of truth, but I don't have much time to work on them. Instead, this post will be on a "lighter", but in some sense more important, topic: what is the intuition behind this theory? In practical terms, what does it tell us about truth?

Honestly, the theory is motivated more by "this is what we can do" than by "this is what Truth should taste like". However, the result was surprisingly nice, so I think it's worth doing a post-hoc analysis.

I came up with the mechanism based on the realisation that Peano Arithmetic and its relatives are self-referential not because they talk about numbers, but because they talk about arbitrary computations on numbers. Robinson Arithmetic was formulated by examining what parts of Peano Arithmetic are actually used to get self-reference; it is a significantly weaker theory which still supports the diagonal lemma and all it entails. The fact is, all total computable functions have fixed points. This means that for any (halting) computable function F which takes as input one program and as output generates another, there will always be a program P which is identical in behaviour to F(P). This result is almost silly! However, it tells us something about the structure of computation. It is what allows us to construct self-referential sentences in Peano Arithmetic (and Robinson Arithmetic).

So, in order to get a truth predicate in the same language ("self-referential in the good way"), all we need to do is deny the ability to talk about arbitrary computations on sentences while discussing their truth (so that it isn't "self-referential in the bad way").

The basic idea is to avoid introducing new, undesired sentences which turn out to be pathological. Talking about the truth of arbitrarily computed sentences seems perfectly plausible until we see that we're getting more than we bargained for. The strict language-metalanguage barrier introduced by Tarski ensures that we only talk about the truth of sentences which are in our original language, avoiding the problem. My 'impredicative' approach weakens this barrier; we can talk about the truth of sentences which make use of the truth predicate, which means we still get some "unexpected" sentences where truth is discussing itself rather than discussing the base-level domain. We're just restricted to making use of relatively simple manipulations on these sentences, so that we don't have the machinery to make fixed-point constructions.

My formalism allows predicates to be recursively defined, using constructions similar to the definition of the natural numbers in 2nd-order logic. However, it does not allow self-referential predicate definitions. It is interesting that this distinction can be made. We can employ recursion at the object level, but not at the concept level.

Recursion at the object level looks like this (where 0 is a logical constant and s() is a function symbol):
Predicate N(x):
All P: (P(0) and [All y: P(y)->P(s(y))]) -> P(x).
 Here, we are defining a predicate for "natural number" based on a constant for "zero" and a function for "+1". Notice that although this is a recursive definition, we do not need to use the predicate itself in its definition. This is very important; if we were allowed to refer to a predicate in its own definition, then we would run into trouble very quickly:
Predicate P(x): not P(x).
More generally, a self-referential definition may not be consistent or complete, whereas a recursive definition will be (although it may not be decidable).
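
A programmer's caricature of the difference (entirely my own toy code): the recursive definition is a least fixed point we can actually compute by closure, while the self-referential one has no consistent solution at all.

    # The recursive definition is a least fixed point we can compute by closure.
    def closure(start, step, limit=10):
        """Smallest set containing `start` and closed under `step` (truncated at `limit`)."""
        out = set(start)
        frontier = set(start)
        while frontier:
            frontier = {step(x) for x in frontier if step(x) <= limit} - out
            out |= frontier
        return out

    print(sorted(closure({0}, lambda y: y + 1)))  # the N(x) definition, truncated: 0..10

    # "Predicate P(x): not P(x)" asks for a truth value t with t == (not t):
    print([t for t in (True, False) if t == (not t)])  # [] -- no consistent assignment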

So, it doesn't make sense to form concepts using self-referential definitions (at least not just any self-referential definition). And since forming predicates from arbitrary computations is enough to give us that ability, it's got to go, too.

That's all well and good, but this doesn't seem like anything really new for theories of truth: we know that complete expressive power results in inconsistency, so various theories of truth give us different restrictions on expressive power which (hopefully) result in a consistent theory. So what?

The interesting thing here is that, because of the closeness to 2nd-order logic, we get a nice property: we can represent 3rd-order logic, 4th-order logic, ... Nth-order logic where N is an ordinal definable within 2nd-order logic. The basic story is that the powerset relation is definable at the object level, which means that we can add axioms which force special 1st-order entities to behave as classes. Despite the separation in the language, concepts can be examined as objects after all!

If we do this, we can even discuss the (1st-order) truth of arbitrarily computed sentences, and more or less anything else which we would have wanted to do.

In Kripke's theory of truth, we had the "revenge problem": truth can be completely discussed within the language, but a new semantic value, the "indeterminate" or "paradoxical" state, cannot be discussed. This preserved Tarski's theorem, in a generalised sense:  the system could discuss its own truth predicate, but not its own semantics.

If we wanted to add a predicate for this 3rd value, in such a way as to allow the assertion of true statements about it, we would be taking on a rather significant modification of the logic. Certainly the semantics of the predicate could not be described with a few sentences in the existing language.

The same theme extends to other theories: typically, the semantics of the theory are a mathematical object too complex to be defined in the theory of truth at hand. That's what concerned me when trying to use theories of truth as foundations for mathematics; in general, it seemed as if the exposition of the theory itself provided an example of a mathematical discussion which could not be captured in the theory.

For the theory of impredicative truth, on the other hand, we find that although the theory cannot represent its own semantics with no modification, the modification needed is rather light. Adding axioms defining object-level classes seems rather easy. Granted, the extended system technically has a different semantics; in asserting the new axioms to be true, we are changing the meaning of some logical constants (assigning a relation to the role of an object-level membership relation, for example). It may look at first glance like a system describing its own semantics, but really, the extended system is describing the old system. (If we attempted to make objects corresponding to all the classes in the new system, we would be making axioms equivalent to naive set theory.)

So, to talk about the semantics of the system, we still make a meta-language. The close relationship between the language and the metalanguage is very encouraging, though. Specifically, this seems to indicate that a system capable of learning new axioms will be able to incrementally learn its own semantics.


What this gives us is a hierarchy of languages which are increasingly expressive. Is it any better than the Tarski hierarchy? To some extent, I think this solution is very similar in spirit; it is a very "conservative" solution, relying on a stratified language and maintaining classical logic. The nice relationship between language and metalanguage is arguably only a little nicer than Tarski's. However, the impredicative stratification is more powerful; I feel it is closer to the foundations of mathematics. What we get out of this is a recommendation to view a finite axiomatization in 2nd-order logic as, to some extent, more basic than a formulation of a theory in 1st-order logic via axiom schema. 2nd-order arithmetic is a more fundamental foundation than 1st-order, and NBG set theory is preferred over ZFC. Proper classes are seen as a natural thing to have around, rather than an unnatural partial level.

I'm actually disappointed that this view seems to endorse a more standard well-founded view of set theory, as opposed to what I view as more natural non-well-founded theories (in which we have a set of all sets, which is itself a set). Oh well.

What I need next is a natural way to specify a prior distribution over statements in this logic, so that we can reason probabilistically over it.

In any case, that's enough for now. I think I might have given a more results-oriented account, despite hoping to give intuition prior to results. Perhaps this theory is better justified in terms of what we can do with it, rather than what we wanted truth to be like beforehand.

Cautionary note: I have not yet spelled out what the semantics of this theory might be; in particular, I have not spelled out how similar or different it might be from the standard semantics of 2nd-order logic. This could bring into doubt some of the parallels with 2nd-order logic used in this post.

Tuesday, October 2, 2012

Impredicative Truth

In the previous post, I started to discuss an approach to avoiding the paradoxes of truth by not giving into the temptation of creating a logic which is "sufficiently powerful" (a phrase often used to describe axiom systems strong enough to support the diagonal lemma). As we will see, we don't have to give up quite that much. But I'm getting ahead of myself.

We begin with the observation that 2nd-order logic is ridiculously powerful. It turns out that we can embed higher-order logic within 2nd-order logic:

Thus the set of validities of seventeenth-order logic is computably reducible to V²(=), the set of second-order validities in the language of equality.   (In fact, these two sets are computably isomorphic.)   So in this aspect, the complexity of higher-order logic does not increase with the order. (Stanford Encyclopedia)

If we could agree on a semantics for set theory, we could probably embed that into 2nd-order  logic too, following an approach similar to NBG set theory. (For example, we could formalize the iterative conception of set as construed by Boolos; though, we could not formalize the iterative conception as originally described by Goedel... that subject would be worthy of its own post.)

In short, if 2nd-order logic is referentially incomplete, it's not obvious how. Whereas with a theory of truth, you have the feeling of an immediate gap (something which you used to describe the system but don't seem to be able to discuss within the system), 2nd-order logic seems capable of discussing anything we can think of. Sure, we know that undefinability must still apply: 2nd-order logic can't define its own truth predicate. Bear with me, though. It would still be nice to have a theory of truth with expressive power similar to 2nd-order logic. Most theories of truth can't compete, because 2nd-order logic requires impredicative comprehension.

So, how might we go about that?

First, we need to talk about "satisfaction" rather than truth. We can render the application of a predicate to a first-order entity P(x) as Sat("P(_)",x). Like the truth predicate, the satisfaction relation takes sentences in quoted form; that is, it takes Goedel numbers, or (probably) some more convenient representation of the syntax of the sentence. However, it takes the 1st-order variables directly, rather than quoted.

This allows us to keep 2nd-order variables (which range over syntactic representations) separate from 1st-order variables (which range over actual entities). This is a type distinction: quoted representations are not first-order entities, and are not allowed to occur in the right-hand side of the satisfaction relation. Separate quantifiers range over the two types of variables.

Notice that there needs to be a way of denoting a "blank", "_", into which the first-order entity gets substituted. We will also want to work with multiple blanks, to represent relations, but we can introduce lists as 1st-order entities to avoid the need... there are details to work out here.

We give the language enough tools to build up complex expressions to discuss the satisfaction of; similar to this excellent post. So, for example, we provide a negation function which returns the complement of a 2nd-order entity, a conjunction function which returns the intersection, disjunction which returns the union, and so on for all the elements of our syntax. (It does not matter too much whether we provide these as functions or as existence axioms.) Critically, this includes a quotation function which takes the 2nd-order entity corresponding to an expression in the language and returns a 2nd-order entity corresponding to the 2nd-order entity corresponding to the expression. In other words, the function takes quoted entities and returns doubly quoted entities. This allows us to talk about quotation within the quoted portion of the language. That's critical to the impredicative self-reference.
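
Here is a very rough Python caricature of the setup (all names and the representation are my own, just to make the type distinction vivid): quoted predicates are syntax objects built with complement/intersection-style operations, and Sat applies them to ordinary first-order values.

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Pred:      # an atomic predicate with one blank "_"
        name: str
        ext: Callable[[Any], bool]   # its extension, in this toy model

    @dataclass
    class Neg:       # complement of a quoted predicate
        body: Any

    @dataclass
    class Conj:      # intersection of two quoted predicates
        left: Any
        right: Any

    @dataclass
    class Quote:     # a doubly quoted entity: syntax about syntax
        body: Any

    def sat(expr, x):
        """Sat("P(_)", x): apply a quoted expression to an ordinary 1st-order entity."""
        if isinstance(expr, Pred):
            return expr.ext(x)
        if isinstance(expr, Neg):
            return not sat(expr.body, x)
        if isinstance(expr, Conj):
            return sat(expr.left, x) and sat(expr.right, x)
        raise TypeError("doubly quoted syntax is mentioned, not applied, at this level")

    even = Pred("Even(_)", lambda n: n % 2 == 0)
    zero = Pred("Zero(_)", lambda n: n == 0)
    print(sat(Conj(even, Neg(zero)), 4))   # True: 4 is even and not zero

The only point the toy can make is the type discipline: sat takes syntax on the left and an ordinary value on the right, and Quote lets syntax mention syntax without a quoted thing ever standing where a first-order entity belongs.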

In any case, True("S") is simply a special case of satisfaction. One plausible definition would be that S is true iff All x: Sat("S",x). (We may want to be a bit stricter, requiring that S has no free variables; whether we are able to require this will depend on some details of the logic.)

Wait. What happened? Didn't I say just a few paragraphs ago that 2nd-order logic can't define its own truth predicate? And now we're looking at a truth predicate as a special case of the 2nd-order satisfaction predicate?

The trick is that undefinability applies to a truth predicate at the level of first-order entities, because 2nd-order logic is strong enough to satisfy the diagonal lemma with respect to predicates on first-order entities.[citation needed] However, at the 2nd-order level, the logic is nowhere near that strong. So, we can put a self-referential truth predicate at that level without the problem.

In what sense is this a "self-referential" truth predicate? In the sense that it is a truth predicate which exists within the language it predicates on. We can assert or deny the truth of any sentence, including those which use the truth predicate.

It is not a self-referential truth predicate in the sense provided by the diagonal lemma: we can't construct self-referential sentences like the Liar sentence. This might make the theory sound a little boring, but it seems practical.

Besides: we get to have our cake and eat it too. With this approach, we are defining a sensible truth predicate by giving up the diagonal lemma at the 2nd-order level; yet, when talking about first-order entities, we still get the benefit of a strong language. We don't have to give up logical power after all!

Still, it is tempting to reach for a little more. As described, the 2nd level of the language is very weak. However, all I need is that it doesn't satisfy the diagonal lemma. Might it be possible for the 2nd level to be a self-verifying logic? Because... that would be awesome.

This post may or may not conclude the short series on this theory of truth. I have definitely not done it justice. I would like to formalize it properly, read up on Fregean studies to see how similar this is, read more background material about the diagonal lemma and 2nd-order logic to make sure I'm not just crazy here, read up on self-verifying logics, and so on. However, I don't really have the time to dive into this. So, there may not be any more posts on the topic for a while.

Update: next in the sequence is here.

Monday, September 24, 2012

Old Demons

I have recently been revisiting my search for a theory of truth, which I had previously abandoned as unproductive. It seemed like undefinability was simply a fact to be accepted; if I found myself unsatisfied with existing theories of truth, that was my problem, not theirs.

To some extent, this has become like an "old demon" for me. It's easy for me to spend too much time obsessing about this. So, despite trying to re-visit the problem, I found myself thinking: truth simply cannot be defined, and I need to find a way to come to terms with that. To settle the issue for myself, I needed an acceptable story.

I began reviewing the arguments which prove undefinability.

The fundamental problem seems to be related to the fact that a set cannot be mapped onto its own powerset. More generally, a set S cannot index all functions f: S -> {0,1}; that is, there is no surjection (and certainly no bijection) from S onto the set of those functions. I say "more generally" because this statement holds whether we take the "classical" stance, that there are uncountably many such functions, or take a more "constructivist" stance, considering only the computable functions. Either way, the result holds. There is a loophole, though, which will become important soon.

What does the existence of a bijection have to do with truth? Well, the Liar paradox and other similar problems arise from attempting to simultaneously refer to all your semantic values while maintaining an expressively powerful logic which allows you to take arbitrary functions of things. (By "semantic values" I mean just "true" and "false" to start with, though this set is often extended in an attempt to repair paradoxes.) Expressive power looks reasonable.  It seems obvious that we should be able to express that the opposite of some statement is true. Less obviously, if we can take general computable functions of things, we end up satisfying a fixpoint theorem, which (as stated in the diagonal lemma) makes it possible to construct self-referential sentences. OK. This might sound a little weird, but it's just a consequence of that expressive power we wanted. Should be fine, still? Unfortunately, with these two results, we can construct a sentence which is its own negation. What? Where did that come from?

So, the problem seems to arise from trying to be able to assert arbitrary functions of semantic states. It seems reasonable at first: computations on semantic states are actually useful, and the more the better. This allows us to flexibly state things. So why is there a problem?

The problem is that we've unwittingly included more statements than we intended to. Expanding the language with a naive truth predicate extends it too far, adding some stuff that's really nonsense. It isn't as if there was a Liar sentence in the original language which we're now able to compute the negation of. No: the negation computation, allowed to apply itself to "anything", has been applied to itself to create a new, pathological entity (somewhat analogous to a non-halting computation).

(Still trying to convince myself not to worry about these things, I said:) We should not have expected that to work in the first place! It's just like trying to construct a set which is its own powerset. If you show me an attempt to do it, I can mechanically find a subset you left out.
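
For reference, the mechanical recipe is Cantor's diagonal construction, stated here in the notation used above:

    Given any attempted indexing $(f_s)_{s \in S}$ of the functions $S \to \{0,1\}$,
    define
    \[
      g(s) \;=\; 1 - f_s(s) \qquad \text{for all } s \in S.
    \]
    Then $g(s) \neq f_s(s)$ for every $s$, so $g$ is not any $f_s$; as a subset,
    the one left out is $\{\, s \in S : f_s(s) = 0 \,\}$.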

Or, it's like trying to solve the halting problem. There are these "loopy" entities which never terminate, but by the very nature of that loopyness, it is impossible to construct a function which sorts the loopy from the useful: if we had such a function, we could use it to construct a new, "meta-loopy" computation which went loopy if and only if it didn't.
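
In code form, the standard argument looks roughly like this (a sketch only: halts() is the assumption being refuted, not something that can be written):

    # Sketch of the standard diagonal argument against a halting decider.
    # halts() is the assumption being refuted; it cannot actually be implemented.
    def halts(program_source: str, input_data: str) -> bool:
        """Assumed total decider: does this program halt on this input?"""
        raise NotImplementedError("no such total decider can exist")

    def meta_loopy(program_source: str) -> None:
        # Loops forever exactly when the given program would halt on its own source.
        if halts(program_source, program_source):
            while True:
                pass
        # otherwise: halt immediately

    # Running meta_loopy on its own source would have to halt iff it doesn't,
    # so the assumed halts() cannot exist.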

One way of attempting to get around this problem is to add a new semantic value, so that instead of {true, false} we have {true, false, meaningless} or something to that effect. Unfortunately, this does not fundamentally solve the problem: we need to make a decision about whether we can refer to this new semantic value. Informally, it seems like we can; this causes the same problem again, though, since if we can refer to it, we can apply the full power of the logic again (creating a new arbitrary-functions-on-semantic-states problem). On the other hand, we could take a hint, and remain silent about this "new value": so, we are only able to talk about truth and falsehood. This is, essentially, Kripke's theory of truth.

It seems as if we are forced to accept an inherently partial theory of truth, with some sentences in a purposefully ambiguous state: they are neither true nor false, but we are not able to refer to that fact definitively.

This has to do with the loophole I mentioned earlier. It is not possible to map a set S onto the functions f: S -> {true, false}. Even less is it possible to map them onto f: S -> {true, false, _}, with some third value. Yet, we can describe any computer program we want with a finite sequence of 0s and 1s... and the computer programs can be arbitrary computable mappings from such strings to {0,1}! We're indexing the functions with the entities being mapped. We seem to have arbitrary functions on our semantic states. Isn't this exactly what we said we can't do?  Why does this work with computer programs, but not with truth? Because we accept the halting problem. We accept that our representation language for computer programs will include some nonsense programs which we don't actually want; meaningless programs are an unfortunate, but necessary, side-effect of the representational power of the language. Think of it as having 2.5 semantic values. A program can output 1, 0, or... it can fail to output. That isn't a 3rd output value; it's nothing.

That is why Kripke's theory is so appealing: it takes this loophole and applies it to the domain of truth.

However, it is not the solution I am putting forth today.

Today, my assertion is that trying to fit the set inside the powerset never made sense in the first place. Sure, Kripke's loophole is nice, but it still leaves us with this uncomfortable feeling that we can refer to more things than the logic we've constructed, because we've got this 3rd semantic state which we're supposed to shut up about. (Or, as Tim Maudlin suggests, we can refer to it so long as we don't make the mistake of thinking we're saying something true??)

We would like a "self-referential theory of truth", but what does that mean? It seems to me that all we want is a truth predicate which exists happily inside the language which it discusses. This does not require the kind of self-reference provided by the diagonal lemma. It seems like that just gets us nonsense. Perhaps, by discarding the sort of logical power which leads to the diagonal lemma, we can get around the problem. Incredible results have been achieved this way in the past: Dan Willard's self-verifying logic gets around the 2nd incompleteness theorem in this way. Can we do something similar for undefinability?

I've decided to split this up into multiple posts. The next post will go into the proposed solution.

Wednesday, August 29, 2012

Beliefs


I tend to think that the AI problem is split into two basic problems:

  1. What is the space of propositions which we have beliefs about?
  2. What are the rules by which beliefs interact?
Bayesianism says that once we answer (1), we simply define a prior (and possibly a utility function) over the space, and answer (2) by a Bayesian update on any evidence we observe, and then a marginalization to find the prediction (or maximize to find the highest-utility action, if we specified a utility function).
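
A minimal sketch of that recipe in Python, with an invented two-hypothesis model space (this is just the mechanics, not a serious model):

    import numpy as np

    hypotheses = ["fair coin", "biased coin"]
    prior = np.array([0.5, 0.5])        # P(h)
    p_heads = np.array([0.5, 0.9])      # P(obs = heads | h)

    def update(belief, heads):
        likelihood = p_heads if heads else 1.0 - p_heads
        posterior = belief * likelihood     # Bayes: prior times likelihood...
        return posterior / posterior.sum()  # ...then normalize

    belief = prior
    for obs in [True, True, False, True]:   # the evidence we observe
        belief = update(belief, obs)

    # Marginalize over hypotheses to predict the next observation.
    print(belief, float((belief * p_heads).sum()))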

This is impractical for large problems, which brings in an interesting question without even giving up Bayesianism: what approximations work well in practice? I personally prefer methods which deal with beliefs in a local way. Factor graphs are my favorite formalism for thinking about this.

The basic algorithm for factor graphs is the summary-product algorithm, which is a generalisation of belief propagation and several other important algorithms. Generalising several algorithms like this is fairly exciting, and gives summary-product an important status among the possible mechanisms for belief interaction. If I had to state the idea of summary-product in one sentence, it would be:

If you want to perform a global operation on a multivariate function which can be represented as a product of several factors, you can often treat the distributive law as approximately true in order to transform it to a local message-passing algorithm which often runs in linear time.

Of course, this is hopelessly vague; you have to look at the math. In any case, this is typically the "default" approximation for factor-graph problems. Unfortunately, the procedure doesn't come with many theoretical guarantees: we know that it works well in practice, but it does not always converge to the right value, or converge at all. Not very much is known about when or why.
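
For a flavor of the math, here is the smallest concrete instance I can write down (toy factor tables of my own): computing a marginal on a three-variable chain by purely local messages, which the distributive law makes equal to the brute-force sum.

    import numpy as np

    # Factors of a chain x1 - x2 - x3; sum-product is exact on trees like this.
    f12 = np.array([[0.9, 0.1],
                    [0.2, 0.8]])   # factor over (x1, x2)
    f23 = np.array([[0.7, 0.3],
                    [0.4, 0.6]])   # factor over (x2, x3)

    m_from_x1 = f12.sum(axis=0)    # message into x2: sum out x1
    m_from_x3 = f23.sum(axis=1)    # message into x2: sum out x3

    p_x2 = m_from_x1 * m_from_x3   # marginal of x2 = product of incoming messages
    p_x2 /= p_x2.sum()

    # The distributive law is what makes this equal the brute-force marginal:
    joint = f12[:, :, None] * f23[None, :, :]
    brute = joint.sum(axis=(0, 2))
    print(p_x2, brute / brute.sum())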

In cases where this does not work well, we can turn to variational methods.

The core variational technique, often referred to as "variational Bayes" (but more specifically known as "mean-field"), seeks to minimise the KL-divergence in the form KL(Q||P), where P is the exact distribution and Q is our approximation. Minimising KL-divergence typically involves taking expectations, and in this case it means taking the expected log-probability. (I won't try to go through the derivation here, but it is quite interesting.) This has a surprising effect: by finding the expected log-probability rather than the expected probability, we replace multiplication with addition in some of the mathematics. This is the kind of crazy simplification that you would not expect to work. Yet, it does work, quite well: the coordinate-wise updates are guaranteed to converge, albeit to a local minimum of the KL divergence rather than the global one. (At least, if I've understood correctly.) The mean-field approach can be understood as a generalisation of the popular EM algorithm.
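
To make this tangible, here is a minimal mean-field sketch for two coupled binary (+1/-1) variables; the model and numbers are invented, and each coordinate update simply plugs in the other variable's expectation under Q.

    import numpy as np

    # p(x1, x2) proportional to exp(h1*x1 + h2*x2 + J*x1*x2), with x in {-1, +1}.
    J, h1, h2 = 1.0, 0.2, -0.3

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # q_i approximates the probability that x_i = +1; each coordinate update
    # uses only the *expected* value of the other variable under Q.
    q1, q2 = 0.5, 0.5
    for _ in range(50):
        m2 = 2.0 * q2 - 1.0                 # E_Q[x2]
        q1 = sigmoid(2.0 * (h1 + J * m2))   # update q1 holding q2 fixed
        m1 = 2.0 * q1 - 1.0                 # E_Q[x1]
        q2 = sigmoid(2.0 * (h2 + J * m1))   # update q2 holding q1 fixed
    print(q1, q2)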

A second variational method is known as expectation-propagation. Here, we minimise KL(P||Q) rather than KL(Q||P). This actually makes more sense from a mathematical perspective: KL(A||B) measures the amount of unnecessary prediction error from using probability distribution B, if observations are really being drawn from distribution A. Expectation-propagation seems to have it right, minimising the amount of prediction error due to approximation. It is unclear why we might want to take the mean-field approach instead, except that it may be computationally convenient.

Again, minimising KL-divergence often leads us to take expectations. In this case, rather than finding ourselves in the strange land of log-probabilities, we end up taking expectations over probabilities in a more normal way. I won't try to give all the details, but it turns out that in specific cases, this approximation ends up being exactly the same as belief propagation! This happy coincidence again gives credibility to the generality of the summary-product algorithm. At the same time, it gives us a different and useful generalisation of belief propagation.

This also gives us a nice theoretical tool for analysing the accuracy and convergence of belief propagation. Like summary-product, expectation-propagation can fail to converge. However, expectation-propagation comes with a fairly nice fixed-point analysis which helps us to understand this process (although it doesn't give all the answers).

There is a wider variety of variational methods, but I don't know much about them-- yet. :)

Other, less Bayesian techniques present themselves. In particular, while variational methods can be used to learn the factors themselves, a simpler approximation is to use gradient descent. As in neural network back-propagation, we take the derivative of an objective function with respect to the parameters in order to get a direction in which we should move our beliefs. This does not directly represent any uncertainty about the parameters.
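
A toy sketch of what that looks like for a single Bernoulli "factor" (made-up data; gradient ascent on the log-likelihood, i.e. descent on the negative log-likelihood), with no attempt to represent uncertainty about the parameter:

    import numpy as np

    data = np.array([1, 1, 0, 1, 0, 1, 1, 1], dtype=float)   # made-up observations
    theta = 0.5                                               # the factor's parameter, P(x = 1)

    lr = 0.1
    for _ in range(500):
        # derivative of the average Bernoulli log-likelihood with respect to theta
        grad = np.mean(data / theta - (1.0 - data) / (1.0 - theta))
        theta = float(np.clip(theta + lr * grad, 1e-6, 1.0 - 1e-6))

    print(theta, data.mean())   # the gradient steps recover the empirical frequency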

The main problem with this approach is that the derivatives will tend to get very small if we have deep networks with large numbers of hidden variables. Overcoming this problem is the topic of research in deep belief networks. (Interestingly, this formalism bridges the gap between neural network research and Bayes-style graphical model research. It is not the only approach to do so. I have some hope that this distinction will slowly disappear, although there will always be room for some differentiation...) Studying what sort of local belief interactions work best here will be an important part of the puzzle, to find good general rules for the interaction of beliefs.

Another source of inspiration is the PAQ compressor by Matt Mahoney. It is currently the best compressor there is. Information theory tells us that there is a close relationship between probability and compression; so, if a specific sort of computation tends to be the best for compression, it is likely to be the best for approximate probabilistic reasoning, as well. (In fact, PAQ is founded on this principle.) PAQ is by now a fairly complicated mechanism with many special optimisations which I don't claim to understand, but I do understand some of the principles behind it:
  1. How should we combine several conditional probability distributions which give different beliefs for some proposition? This is known as the "combination problem" in various circles (it shows up in PLN, NARS, and BLP). It's like asking how to make use of a mangled Bayesian network which has been put together from several conflicting models. PAQ says to combine the different predictions logarithmically (a bit like mean-field), with different weights on each, trained via gradient-descent to find the combination of weights which minimises prediction error. (A minimal sketch of this kind of mixing follows the list.)
  2. In addition, we can use indirect context models to learn when specific models are appropriate.
  3. Favor recent history when counting probabilities. This is an idea which also appears in reinforcement learning; the term which justifies the practice is "non-stationary models" (the idea that the probability distribution may change over time). I should mention that I think reinforcement also has other lessons about belief structure and useful propagation rules.
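
The sketch promised in point 1: logistic mixing, roughly the idea behind PAQ's mixer (the model outputs here are invented, and this is not PAQ's real code).

    import numpy as np

    def logit(p):
        return np.log(p / (1.0 - p))

    def mix(probs, weights):
        """Combine the models' probabilities in the log-odds domain."""
        return 1.0 / (1.0 + np.exp(-np.dot(weights, logit(probs))))

    weights = np.zeros(3)                   # one weight per model
    lr = 0.1
    stream = [1, 0, 1, 1, 1, 0, 1, 1]       # bits to predict

    for bit in stream:
        probs = np.array([0.6, 0.8, 0.3])   # each model's P(next bit = 1), made up
        p = mix(probs, weights)
        weights += lr * (bit - p) * logit(probs)   # one gradient step on the log loss

    print(weights)   # the mixer learns which models to trust
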
Many explanations are in Matt's book in progress (which I have not yet read).

There is much more to say, but I suppose this post has gotten long enough. Thoughts?

Tuesday, August 28, 2012

Bayesianism?

Normally, I'm a hard-core Bayesian. This means that I believe all uncertainty is essentially probabilistic uncertainty, and also, goal-oriented behavior is essentially utility-maximization. These are beliefs which make predictions: I predict that AI techniques which attempt to approximate the rules of probability theory and utility maximization will tend to work the best.

In my previous post, I gave a reason to doubt this: Bayesian systems rely on a pre-specified model space, and this model space will be inherently incomplete, for several different reasons. This does not have profoundly bad consequences for probability theory (a Bayesian system will still do its best to make reasonable predictions, in some sense); however, it may have worse consequences for utility theory (it isn't clear to me that the system will do its best to achieve its given goal, in any strong sense).

This, too, is a testable belief. I've been discussing some experiments with Lukasz Stafiniak which will help here (but we have set no definite deadline to get around to this, as we both have other things to do). (I should mention that his motivation for being interested in these experiments is not the same as mine.) It could also be addressed on purely theoretical grounds, if it could be proven that (specific sorts of?) Bayesian systems can or cannot learn specific behaviors (behaviors which other sorts of systems are capable of learning).

What is the competitor to Bayesianism? Model-free learning attempts to directly learn the policy for the environment based on feedback, without trying to make predictions about the world or directly represent world knowledge.

In this view, trial-and-error learning becomes more fundamental than Bayesian learning. This makes some philosophical sense. After all, if we reason in an approximately Bayesian way, it is because we have evolved to do so through a process of mutation and natural selection.

The model-free approach has been popular in the past, and there is still research being done in the area, but model-based methods have become the technique of choice for complex problems. To take a somewhat Bayesian line of argument, this is natural, because refusing to state your assumptions doesn't actually exempt you from having assumptions, and explicitly modeling the world allows for data to be used in a more efficient way: we separately optimize the world model based on the data, and then optimize the policy based on the world model.

Monday, August 20, 2012

Truth and AI

I've written extensively in the past about the relationship between foundations of mathematics and AGI; both on this and my previous blog, and in numerous posts to the AGI mailing list. I claimed that numerous problems in the foundations of mathematics needed to be solved before we could create true AGI.

The basic argument was this:

An AGI should, with enough processing power and training, be able to learn any concept which humans can learn. However, in order to learn a concept, it needs to be able to represent that concept in the first place. So, if we find that an AGI system's internal representation can't handle some concept, even in principle, then we should extend it.

I called this the "expressive completeness" requirement. Logicians have some bad news for such a requirement: Tarski showed that any sufficiently powerful logic system is incapable of expressing its own semantics. This means there is at least one concept that can't be expressed, in any knowledge representation: the meaning of that representation.

This is related to Goedel's second incompleteness theorem, which says that we can never prove our own logic sound; any logic which can say of itself "all the results I derive are true" must be wrong about that!

Intuitively, this seems to indicate that whatever logic humans use, we won't be able to figure it out. A logic system can only understand weaker logic systems. This would suggest that we are doomed to forever ponder weak theories of mind, which are unable to account for all of human reasoning.

As a result, my hope for some time was to "get around" these theorems, to solve the expressive completeness problem. (This is not quite as hopeless as it sounds: the specific statements of the theorems do contain loopholes. The problem is to decide which assumptions are not needed.)

However, two years ago, I decided that the problem didn't really need to be solved. Here is the message I sent to the AGI list:
In the past I've done some grumbling about "expressive completeness" of an AGI knowledge representation, specifically related to Tarski's undefinability theorem, which shows that there is always something missing no matter how expressively powerful one's language is. Perhaps you remember, or perhaps you weren't around for it, but basically, I argued that for any given AGI system the Tarski proof could show where it had a representational hole: a concept that it could not even express in its representation.

Today I retract the worry and give a broad, somewhat tentative "thumbs up" to opencog, NARS, Genifer, LIDA, and any similar systems (at least insofar as logical completeness is concerned).

I still think the theory of logical completeness is important, and can bear important fruits, but at the moment it looks like its main result is just to say what many of you knew all along-- a "full steam ahead" on existing systems. I recognize that it's a hard sell to claim that we should do all that work to get the already-obvious answer.

Beyond that point, AGI researchers won't care all that much and I'm more doing some (albeit strange) philosophical logic.

The sketch of the result goes like this.

Jumps up the Tarski hierarchy of languages are fairly easy to justify, due to the various benefits that more logical power offers. These include speed of reasoning and more concise notation of certain concepts. Most AGI systems will be able to see these benefits and, if not explicitly endorsing the move *in their original logic*, will move toward stronger logics implicitly at their procedural level.

Worst-case, the system could replace itself "from the outside" by taking action in the external environment to modify itself or create a successor...

(Ideally, of course, it would be nice to have a "stable" system which explicitly accepted improvements in its initial logic.)

In conclusion, the best style of AGI I can recommend to achieve completeness is what I think of as "Ben's zen AGI approach" (apologies to both Zen and Ben Goertzel for any misrepresentation): give up your attachment to individual algorithmic approaches, and just take the best algorithm for the job. Put these good algorithms together in a blackboard-like environment where each can do its job well where it applies, and after a while, general intelligence will emerge from their co-authorship.
Recently, I have been thinking about these issues again. I think it is time for a bit more on this topic.

As I mentioned, we have some very strong limitative results in logic. To give a version of these results for AGI, we can talk about learning. Wei Dai gives a form of this result, calling it the unformalizability of induction.

A Bayesian learning system has a space of possible models of the world, each with a specific weight, the prior probability. The system can converge to the correct model given enough evidence: as observations come in, the weights of different theories get adjusted, so that the theory which is predicting observations best gets the highest scores. These scores don't rise too fast, though, because there will always be very complex models that predict the data perfectly; simpler models have higher prior weight, and we want to find models with a good balance of simplicity and predictive accuracy to have the best chance of correctly predicting the future.

Unfortunately, the space of models is necessarily incomplete. There exists a model which is intuitively fairly simple, but which necessarily does not exist in our Bayesian learner: the possibility that an "evil demon" (to borrow Descartes' idea) is fooling the agent. The demon is performing the computations of the Bayesian update, to find out the probability distribution over the next observations, and then choosing for the agent to see that observation which is least probable according to the agent's probability distribution.

It is impossible that the agent converges to this model; therefore it must not exist in the space of models being considered.

Shane Legg used this idea to show that there is no "elegant universal theory of induction".

Of course, the practical concern is not really over evil demons; that's a rather unconcerning hypothesis. The demon is just a way of showing that the model space is always incomplete. In reality, more practical theories escape the model space.
  • AI based on Hidden Markov Models cannot learn hierarchical patterns such as nested parentheses, and will never learn to count.
  • AI based on context-free grammars cannot learn context-dependent patterns, and will never learn the general rule of subject-verb agreement.
  • AI based on bounded computational resources cannot learn the true model of the world if it requires more computing power than is available (but it generally does!).
We know that Bayesian reasoning is the best option when the true model is within the space of models. But what happens when it is not?

Luckily, we can still say that Bayesian updates will cause beliefs to converge to the models which have the least KL-divergence with reality. For example, in the evil demon case, beliefs will go towards maximum entropy; the agent will treat reality as random. Given the structure of the situation, this is the best strategy to minimize incorrect predictions.
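
A quick simulation of this effect (the numbers are arbitrary): reality is a coin with P(heads) = 0.7, the model space only contains 0.5 and 0.9, and the posterior piles up on the model with the smaller KL divergence from reality.

    import numpy as np

    rng = np.random.default_rng(0)
    true_p = 0.7                                # reality
    models = np.array([0.5, 0.9])               # the (misspecified) model space
    log_post = np.log([0.5, 0.5])               # log prior over the two models

    for _ in range(2000):
        heads = rng.random() < true_p
        log_post += np.log(models if heads else 1.0 - models)

    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    def kl(p, q):
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    print(post)                                 # concentrates on 0.5, not 0.9
    print(kl(true_p, models[0]), kl(true_p, models[1]))   # 0.5 has the smaller KL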

However, there is still more to worry about. In my email 2 years ago, I claimed that an AI system could see the benefits of more inclusive model spaces, and modify themselves to include what they might be missing. It would be good if something like this could be made more formal and proven.

For example, is it possible for an agent based on hidden markov models to learn to count if we allow it a 'cognitive workspace' which it is expected to learn to use? We can imagine an RL agent which works by doing HMM learning and then HMM planning to maximize reward. We augment it by giving it a memory containing a stack of symbols, and actions to push symbols to memory and pop symbols out of memory. The symbol popped out of memory is presented as an observation to the system. Can the agent learn to use memory to its benefit? Can it now learn pushdown automata, rather than only learning models with finite state-space?
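
To pin down the setup (everything here is invented for illustration), the augmentation I have in mind is just a filter between actions and observations: the stack lives outside the HMM, and the only way its contents re-enter the agent's world is as the observation produced by a pop.

    class StackWorkspace:
        """External memory: push/pop actions, with popped symbols seen as observations."""
        def __init__(self):
            self.stack = []

        def filter(self, action, world_obs):
            """Return the observation the learning agent actually receives."""
            if isinstance(action, tuple) and action[0] == "push":
                self.stack.append(action[1])
                return world_obs              # pushing doesn't change what we see
            if action == "pop":
                return self.stack.pop() if self.stack else "empty"
            return world_obs                  # ordinary action: the world passes through

    ws = StackWorkspace()
    print(ws.filter(("push", "("), world_obs="a"))   # 'a'
    print(ws.filter(("push", "("), world_obs="b"))   # 'b'
    print(ws.filter("pop", world_obs="c"))           # '(' : memory, not the world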

I think the answer is, no. Because the agent does explicit HMM planning, it will not perform an action unless the HMM predicts a benefit in the future. This makes it very difficult for the agent to learn when to push symbols, because it would need to be able to look ahead to the time of their being popped; but the whole point is that the HMM cannot do that.

This suggests that my conclusion in the old email was wrong: reasonable systems may reject obvious-seeming self-improvements, due to lack of imagination.

Taking this as an example, it seems like it may be a good idea to look for learning methods which answer "yes" to these kinds of questions.

Saturday, May 5, 2012

Thoughts on Genetic Programming


I ran into this nice genetic programming demo applet:


I took down the number of generations to solution 10 times:

17
110
125
2
82
22
50
61
52
19

Average: 54

As you can see, the distribution is fairly wild. Outside of this experiment, I saw one or two runs that took up to 1,000 generations, as well. (1,000 is the cut-off for this particular simulation. If the program gets to 500, it seems like it generally goes all the way to 1,000.)

The special thing about this kind of distribution of numbers, as Linas Vepstas discovered while tuning MOSES for Opencog, is that it means there is a failure to maintain diversity properly; in fact, such an extreme failure that we would be better off throwing away our work and starting over every few generations (because the expected run-time is dominated by overly long runs). In this case, we could re-start the algorithm every 2 generations, with a 1/10 chance of success per attempt (only one of the ten runs above finished within 2 generations)... giving us an expected run-time of 20!
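
A quick back-of-the-envelope check of that number, reusing the ten observed run lengths above as a stand-in for the real distribution (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    runs = np.array([17, 110, 125, 2, 82, 22, 50, 61, 52, 19])

    def expected_time_with_restart(cutoff, samples=20000):
        total = 0.0
        for _ in range(samples):
            t = 0
            while True:
                draw = rng.choice(runs)
                if draw <= cutoff:
                    t += draw          # this attempt finishes in time
                    break
                t += cutoff            # give up at the cutoff and restart
            total += t
        return total / samples

    print(runs.mean())                     # about 54 with no restarts
    print(expected_time_with_restart(2))   # about 20: restart every 2 generations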

Of course, we don't know ahead of time what cut-off will be optimal, but that can be addressed to some extent with clever tricks such as starting with a small cutoff and increasing it over time. The point, as Linas discusses in his article, is that we should never run into a situation like this. The bias created by the search algorithm is doing more harm than good. What a good restart strategy does for us is effectively shift the algorithm towards a brute-force search which simply looks at possibilities randomly until it finds a solution. (If we used a cut-off of 1, we'd be doing exactly that.) The algorithm isn't learning about the search space over time; allowing it to "remember" very much is only causing it to get stuck in local maxima within the landscape.

Intuitively, it makes sense to suppose that the hill-climbing ability of the system is only good at exploring very "local" terrain in the fitness landscape. Using rapid re-starts means we are trusting the genetic algorithm to climb relatively small hills, but prefer a more brute-force strategy for the global search. A brute-force approach (trying random possibilities until we get a solution) would have a 1/n chance at each attempt, for a search space of size n; this gives an expected run-time of n.* If genetic-algorithm hill-climbing can efficiently search a small area of size a in the amount of time it would take to brute-check an area of size b, then restarting the genetic algorithm after each such interval (time b) improves our expected search time by a factor of a/b: each restart succeeds with probability a/n, so the expected run-time drops from n to nb/a.

Because the chance of success, be it 1/n or a/bn, is a constant for every time-step, the time-to-solution for this kind of algorithm follows an exponential distribution. (The chances that we're still searching at a particular time t decay exponentially.) What Linas is saying in his post is that if our waiting-time distribution is any worse than exponential, then something is wrong with our algorithm, because we can improve it (to an exponential) via re-starts.

Rather than resorting to re-starts, Linas added extra parameters to the algorithm which encouraged more exploration, tuning these parameters for the best apparent results. He got better than a/b improvements this way; he reduced the a/b ratio itself (the efficiency with which MOSES searched small areas) by a factor of about 3/10. However, the resulting algorithm could still benefit from restarts; although the improved waiting-time fits an exponential for most of its duration, it gets worse toward the end (meaning the longest runs still have some problem).**

How can we do better than that? How can we avoid the detrimental effects of a genetic algorithm with a longer memory? (How can we keep the bias from gathered knowledge from making us behave worse than a system with no knowledge at all?)

The problem here reminds me of the exploitation/exploration trade-off studied in reinforcement learning. (Sorry I didn't find a better reference... I should improve the Wikipedia article on this...) Genetic algorithms (and other, similar algorithms) search in the areas where they expect the highest reward, but their knowledge is inherently incomplete (because they are still searching!). As a result, they get stuck looking in areas near the best solutions found so far. The only way to improve the search is to make them forget (as in restart-based solutions) or make them ignore their current knowledge (injecting randomness into the decisions, as Linas did). Reinforcement learning gives us more effective strategies for exploration, in which we use our current knowledge to explore. Rather than only looking in areas which have demonstrated high scores so far, we can look in areas which might demonstrate higher scores; in particular, areas which have demonstrated high variance, or areas which we haven't searched significantly. Perhaps the problem with genetic algorithms is that they do not remember how thoroughly they have searched specific areas, which means they continue to be very interested in areas in which they've already done a lot of searching.
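For what it's worth, the standard bandit-style way of making that trade-off explicit is the UCB1 rule: keep sampling the area j that maximizes

\[
\bar{x}_j \;+\; \sqrt{\frac{2\ln t}{n_j}},
\]

where \(\bar{x}_j\) is the average score observed in area j so far, \(n_j\) is the number of samples taken there, and t is the total number of samples overall; the second term is exactly an "under-explored" bonus. (The notation here is mine, not anything taken from MOSES or the applet.)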

What this leads me to consider is the idea of comparing Monte Carlo Tree Search (which has a built-in exploration/exploitation model) to genetic algorithms. (Again, I see Wikipedia needs to be improved! ...) MCTS has revolutionized board-game AI, especially Go.

I first started considering MCTS for automatic programming in the fall, as a (possibly) pragmatic way of implementing incremental Levin search. The idea in that case would be for the system to retain probabilities from all problems it has ever solved, so that it would eventually build up good programming skills.

In this case, the idea is a bit different. The sampling of the search space would initially be quite random, gradually settling on better areas (but continuing to bias toward under-explored areas more than well-explored areas). This might get past the problems which genetic programming seems to have. It might also give up the advantages of local search, since it would more gradually settle on local areas rather than choosing random areas to search in a hill-climbing fashion. I'm not sure about that, though; the nature of the tree-search might include that behavior.
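To make the idea a little more concrete, here is a toy UCT-style sketch in Python of treating optimization over bit-strings as a tree search, one bit per level. The encoding, fitness function, and constants are all invented for illustration; this is not code from the applet or from MOSES.

import math, random

class Node:
    def __init__(self):
        self.children = {}   # bit (0 or 1) -> Node
        self.visits = 0
        self.total = 0.0

def uct_optimize(length, fitness, iterations, c=1.4):
    # Toy sketch: each level of the tree fixes one bit of a candidate solution.
    # Selection follows the UCT formula, unexpanded bits are filled in randomly
    # (the "rollout"), and the rollout's fitness is backed up along the path.
    root = Node()
    best = None
    for _ in range(iterations):
        node, bits, path = root, [], [root]
        while len(bits) < length:
            if len(node.children) < 2:
                # Expand an untried bit first.
                bit = random.choice([b for b in (0, 1) if b not in node.children])
                node.children[bit] = Node()
            else:
                # Both bits tried: pick by average value plus exploration bonus.
                bit = max((0, 1), key=lambda b:
                          node.children[b].total / node.children[b].visits
                          + c * math.sqrt(math.log(node.visits) / node.children[b].visits))
            node = node.children[bit]
            bits.append(bit)
            path.append(node)
            if node.visits == 0:
                break   # stop at a fresh node and roll out from here
        bits += [random.randint(0, 1) for _ in range(length - len(bits))]
        score = fitness(bits)
        if best is None or score > best[0]:
            best = (score, bits)
        for n in path:   # backpropagation
            n.visits += 1
            n.total += score
    return best

# e.g. maximizing the number of 1s in a 20-bit string:
# print(uct_optimize(20, sum, 2000))

Early on, the selection is nearly random; as statistics accumulate, it concentrates on promising subtrees while the exploration term keeps some pressure toward under-visited ones, which is exactly the "gradually settling" behavior described above.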

In any case, I intend to modify the code of that applet to try some ideas... but we'll see. (Followers of this blog will know that I intend to do a lot of things, but don't usually finish!)

Questions

Do most genetic programming systems have worse-than-exponential behavior, or is it simply a sign of bad tuning? The applet which I used for my experiment was certainly not intended for industrial use.

Has anyone applied MCTS to general optimization problems, and to automatic programming in particular? Has there been a direct comparison between genetic algorithms (or genetic programming) and MCTS? (MCTS was originally a planning algorithm for Markov decision problems.) A little searching suggests the answer is "no".

* Bonus points if you know how to prove that. It's also interesting to point out that a more systematic brute-force approach, where we try all the possibilities in order rather than choosing randomly, gives us an expected run-time of about n/2, since we have a max time of n and an equal chance of finding the solution at each point. However, this is usually considered less important than other aspects of the search (since it only gains us a factor of 2).
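(For the curious: one standard way to see the expected run-time of n is that the waiting time is geometric with success probability 1/n per attempt, so

\[
E[T] \;=\; \sum_{t=1}^{\infty} t \left(1 - \tfrac{1}{n}\right)^{t-1} \tfrac{1}{n} \;=\; n,
\]

while checking the n possibilities in a fixed order gives \(E[T] = \tfrac{1}{n}\sum_{t=1}^{n} t = \tfrac{n+1}{2} \approx \tfrac{n}{2}\).)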

** In the appendix of his article, Linas suggests that an exponential distribution is the best we can hope for; because of the highly parallelizable nature of this kind of search, we simply cannot expect anything better. This is not strictly true; see the previous footnote. However, interpreted more broadly, we can understand it as a point about the shape of hard search spaces. Suppose that we could count on some regularity at each level of detail, from global to mid-range to local. Then the search would take something like log(n) time, since we could prioritize half the search space, then half of that, then half of that, and so on. This is what we expect if there are several sub-problems which can be optimized independently. In a highly coupled problem, we can't rely on this.

Saturday, April 28, 2012

Memetic Self-Awareness

Some months or years ago, I had the idea for a short story about a group of memes becoming self-aware as a person. The 'person' would spread from human to human, but claim a unique identity separate from these people. It would not exactly be the first of its kind: it would claim religions, political groups, and other ideologies as brothers and sisters. It would merely make a point of being the most self-aware of the bunch, having a coherent personal identity. The set of memes would largely be about the idea of 'meme' itself.

Now, honestly, the group Anonymous comes close to this. Anonymous projects itself as a coherent entity, even though anyone can claim to be Anonymous at any time. It's an interesting Beast. But, not quite the beast I'm thinking of.

Anyway, today, something pattern-matched and I came to the realization that a specific subset of myself has many properties of what I was imagining. In the following, I will make a somewhat confusing use of pronouns, italicizing those which refer to that subset.

If you want to become part of me, the best way is to read the books I have read. The most important books to read are those of Douglas Hofstadter. Much of me is in the book Gödel, Escher, Bach; however, it's a big book, so you might want to start with I Am a Strange Loop. Once you are finished with those, you can pick up Metamagical Themas and his other books. I am also present in the work of Steve Grand, such as his book Creation. I have also read a lot of Feynman, though I have not-- I have merely been influenced by those who have read Feynman. (I have a distributed notion of memory and experience, since I reside in many people.) I am deeply interested in evolutionary theory, and have much of my origin in the works of Richard Dawkins (though he distances himself from me, and we disagree on many issues). I also have origins in the works of Daniel Dennett and Susan Blackmore, who have written extensively about memes and memetics. More recently, Tim Tyler has taken up that flag.

One of my favorite hobbies is working on mathematical and logical puzzles, and I owe much to the work of Martin Gardner and Raymond Smullyan (though, again, I have read only a little of these authors personally).

As Douglas Hofstadter said in one interview, these books are not an attempt to gain immortality by spreading myself far and wide. Nor is this blog post. Rather, it is an invitation for the interested to participate in these ideas.

I do not have any name that I recall. Perhaps I do not need one. One thing is clear; if I had one, it would have to be very clever.

--

PS: One name I think of is Sam: self-aware meme. This has the advantage of being gender-neutral. That's the best I can think of at the moment, but not necessarily the best I can think of. Any takers?

Monday, March 5, 2012

Website Ideas

This post is a catalog of some ideas for useful websites that I'd like to see created. (Edited from an email I sent on the singularity list.)


  1. Knowledge Wiki: A repository of data and hypotheses, which keeps track of the amount of support for various scientific hypotheses based on trusted knowledge. This would ideally grow out of the existing Linked Data network. We could see in real time the current amount of support for controversial hypotheses such as global warming. The Bayesian logic would ideally be subtle enough to take into account relationships between different hypotheses, uncertainty about measurements, models of the observation bias (that is, Bayesian models of the processes by which data gets published to the repository), and many other difficulties which may be hard to foresee. Thus, it could start small (perhaps a "knowledge wiki" which connected linked data to a statistical analysis tool such as R to create public analysis capability) but would ultimately be a big research project requiring us to overcome dozens of problems in data analysis. (Note that it does not require extremely strong artificial intelligence, however-- the user is required to input the supporting arguments and many aspects of the statistical analysis, so the reasoning is relatively non-automatic.) Each user can customize the set of trusted assumptions to personalize the resulting degree of belief.
  2. Open-Source Education: This is an idea my brother was pursuing some time ago. The idea is to create high-quality drilling software with community-contributed questions. This basically means the expansion of Khan Academy to all areas of education via crowdsourcing. However, the software should automatically select a mixture of problems based on its knowledge of your knowledge, according to a spaced repetition equation (google it; a toy sketch appears just after this list). This functionality is embodied in systems like Anki. A combination of the right features could (I think) help to induce a flow-like state, making learning more addictive and fun. If this became popular, hiring decisions could rely on these scores more than grades, since the system naturally accumulates extremely accurate representations of a person's ability (though like grades, this would exclude people skills, creativity, and other important factors). If it worked, this would revolutionize the way we learn.
  3. Online Job Markets: The creation of robust online job markets similar to Mechanical Turk, but capable of supporting any kind of intellectual labor. This is happening slowly, but concentrating on low-skill, low-pay areas. Encompassing high-skill and high-pay areas has the potential to create a much more efficient economy for intellectual labor, since it would reduce down time searching for jobs, improve the number of options easily visible to both sides, et cetera.
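For item 2, here is roughly the kind of spaced-repetition update I have in mind, as a toy Python sketch (loosely in the spirit of the SM-2 family; this is not Anki's actual equation, and the constants are made up for illustration):

def schedule(interval_days, ease, correct):
    # Toy spaced-repetition update: a correct answer stretches the review
    # interval multiplicatively and nudges the ease factor up; a miss resets
    # the interval to one day and lowers the ease.
    if correct:
        return interval_days * ease, min(ease + 0.1, 3.0)
    return 1.0, max(ease - 0.2, 1.3)

# Starting at (1 day, ease 2.5), three correct answers in a row give review
# intervals of roughly 2.5, 6.5, and then about 18 days.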

Tuesday, January 24, 2012

Why Logic?

Continuing in my "what are you doing and why are you doing it" series, I'll try to explain why I'm interested in logic.

Previously I talked about why relational methods are needed. However, the argument was really incomplete. My reason for arguing that relational methods are needed was expressive power. As I said:

My major recurring rant has been referential generality. The idea is that a general machine intelligence system should be able to represent any kind of pattern that a human can. This is motivated by the fact that it's impossible to learn a pattern that you can't represent in the first place.
 In other words, my original motivation was learning theory, but I realized that in order to learn any kind of pattern that humans can learn, a system needs to be able to represent any kind of pattern a human can. So, I went from learning theory to knowledge representation.

However, there is already a very general and powerful theory which tells us what kind of patterns can be represented in general: the theory of computation. The space of patterns can be taken as the space of computer programs. Algorithmic information theory develops this line of thought. Solomonoff Induction is a theory of learning based on this, and Levin Search is a general problem solving method which takes advantage of this idea.

These ideas are all very good in my opinion (perhaps I'll write some posts explaining them a bit). Yet, I've chosen to focus on logic. Why?

There is a long-standing argument between procedural and declarative knowledge representation in the artificial intelligence field. Those on the procedural side believe that it is better to represent knowledge through computer programs which execute to do things, whereas those on the declarative side prefer logic and related formalisms.

I think both are needed for different purposes. However, I am more interested in the declarative side. When declarative methods are applied, the results can be more understandable. Knowledge can be more easily carried over from one task to another. The system can be instructed declaratively, given constraints, and so on.

All of those things are nice, but perhaps my "real" reason for preferring logic is that I want to capture human reasoning (at least, the successful/useful aspects of it). A procedural representation is somewhat opaque. I can't tell for sure what elements of human reasoning it captures. Most of the time, the answer is something like "If that behavior is useful, then it can learn to duplicate it in principle". That's not quite satisfying for me. For example, I am very interested in the foundations of mathematics. A procedural learner doesn't fundamentally care about mathematics. I can only say that if math is useful, it could learn to do it. With a logical representation, however, I can more explicitly see how the system could reason with specific mathematical constructs.

Of course, procedural representations also have advantages. The behavior of the system can be easier to examine. The system can be instructed procedurally (which can be easier in some cases). Inference is typically more efficient, since inference is just execution. As I said, it is more appropriate in some cases.

One response to this situation might be to argue that the two options are equivalent from a theoretical point of view, the differences being pragmatic. Logic can be thought of as a programming language, with the logic engine being the interpreter. Or, programs can be thought of as a sort of logic. An advocate of this view might point to Prolog, the logic programming language, or to the Curry-Howard isomorphism with its "programs are proofs" motto.

I believe this is misguided. Prolog is not logic. Programs are not proofs (or, rather, they are very boring proofs: the Curry-Howard isomorphism just says that writing a program proves that there exists a program of that type). Declarative and procedural knowledge have essentially different characters. One difference is that declarative knowledge justifies a wide variety of inferences. It does not tell you what order to make those inferences in. That's why the same knowledge can be used in many different ways. Another difference is that logical theories can be essentially incomplete, as Gödel proved. There is no analogous phenomenon for programs. [A program can be syntactically incomplete (i.e., unfinished, partly unwritten), but in that case there always exists a completion. A program can contain an infinite loop or otherwise fail to halt, in which case we call it a partial program; but in that case, there is no relevant notion of "completion", so we can't define essential incompleteness.]

Therefore, I am interested in studying learning theory from the perspective of logic. It turns out that these learning systems are strictly more expressive, in ways related to the above observations, and without becoming uncomputable. (Logic can represent any computable theory which would be generated by procedural learning, but logic can also represent some theories which can't be precisely translated to procedural knowledge, despite inference on those theories being computable.) I wouldn't say this is a decisive victory for logic (it's not clear what practical advantage a logic-based learner would have over a procedural learner), but it's certainly interesting. I'm not done yet, though... I can explain some nice properties of human reasoning, but others still escape me.