Wednesday, November 16, 2011

Why Relational Methods?

It's been a little while since I've written anything here. While that's not unusual, it's also not ideal. I've actually been meaning to write more, not less. Attending my first "real" conference, AGI-11, has shown me that I have a public image whether I mean to or not. When I'm not writing stuff here, I'm writing stuff on AI mailing lists (or, more recently, Facebook groups). People I respect know my name.

If that's the case, I might as well manage my image... at least a little.

As such, it would be good for me to explain a bit about what I am doing and why I am doing it.

My major recurring rant has been referential completeness. The idea is that an intelligent machine should be able to represent any kind of pattern that a human can. This is motivated by the fact that it's impossible to learn a pattern that you can't represent in the first place.

This has led me to grapple with some difficult issues in logic and the foundations of mathematics, since Tarski's undefinability theorem shows that there is no referentially complete system (in a specific sense). Unsurprisingly, I have not come up with a general response to this problem (though I think Hartry Field has some interesting ideas). However, I do have one message which I'd like to get across to as many practitioners as possible:

Relational methods are needed for general intelligence.

(I'm sure I've written about this before, but I will aim to give a decent unified post on the subject here.)

Now, I'm specifically talking about probabilistic relational methods. I will not, however, be giving an argument for probability theory. (Perhaps I'll discuss that in another post.) Rather, I'll be concentrating on the question of whether relational methods are needed.

Many people are beginning to understand this point whether I preach it or not. However, a large amount of research still goes on at the level of propositional methods. This isn't a bad thing (better propositional methods are very useful), but people often don't consider the issue, and end up assuming that propositional methodology can lead to human-level intelligence.

So: what is the difference?

Propositional methods are so called by analogy to propositional logic. They deal with a fixed, finite set of variables, looking for predictive relationships such as correlations (or even causation) among them. Variables are "basic": they have no further breakdown. It is not possible to learn a correlational structure over some variables and then generalize it to other variables, because there is no information to suggest which variables might be analogous.

Relational methods use more information about the structure of a problem. The details vary from formalism to formalism, but variables result from instantiation. This might be the instantiation of a class of objects (in object-oriented thinking), or an instantiation of a predicate or relation (in first-order language). In either case, we see that variables are now "about something": we have entities which we are describing.
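
To make the contrast concrete, here is a minimal Python sketch (the entity and predicate names are made up, in the spirit of the standard smokers-and-friends example): the propositional model starts from a fixed roster of opaque variables, while the relational model generates its variables by instantiating predicates over entities, so a rule stated once covers every instantiation.

    # Propositional view: a fixed, flat list of variables with no internal
    # structure. A learner can only look for correlations among these names.
    propositional_variables = ["alice_smokes", "bob_smokes", "alice_bob_friends"]

    # Relational view: variables are produced by instantiating predicates
    # over entities.
    entities = ["alice", "bob", "carol"]

    def smokes(x):
        # Unary predicate instantiated per entity: smokes(alice), smokes(bob), ...
        return ("smokes", x)

    def friends(x, y):
        # Binary relation instantiated per ordered pair of entities.
        return ("friends", x, y)

    # One relational rule, stated once: if friends(X, Y) and smokes(X), then
    # smokes(Y) is more probable.  Grounding it yields a statement per pair.
    groundings = [(friends(x, y), smokes(x), smokes(y))
                  for x in entities for y in entities if x != y]
    print(len(groundings), "ground instances of a single relational rule")

The propositional learner would have to discover each of those ground regularities separately, since nothing in its representation says that alice and bob are analogous.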

This distinction can be surprisingly difficult to get across at times. Advocates of propositional learning methods will typically say that the difference is "We are learning the relations, rather than giving them ahead of time". This confusion results from using the word "relation" to describe multiple things. Logically, we can split a domain into very distinct levels:

  • Entities
  • Entity Properties
  • Probabilistic Generalizations about Properties

Entity properties include relationships of one entity to another. This is very different from a probabilistic relationship, no matter how complicated. Trying to form relations from correlations is a level-confusion.
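
As a toy illustration of the three levels (all names hypothetical): a probabilistic generalization quantifies over entity properties, rather than being one of those properties itself.

    # Level 1: entities -- the things being described.
    entities = ["block_a", "block_b", "block_c"]

    # Level 2: entity properties, including relationships between entities.
    # These are facts about particular entities, not statistical statements.
    properties = {
        ("red", "block_a"): True,
        ("on", "block_a", "block_b"): True,
    }

    # Level 3: probabilistic generalizations *about* properties.  Each one
    # quantifies over entities rather than describing any particular entity.
    generalizations = [
        ("for all X, Y: on(X, Y) -> smaller(X, Y)", 0.9),
    ]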

This mistake is rarely made with crisp propositional logic (Boolean logic). The limitation there is very clear. Chess is an often-cited example. Stated in propositional logic, the rules of chess would run thousands of pages; however, they can be stated in first-order logic in just a few pages. Why? Because in first-order logic, we can generalize the rules based on the structure of the game. In propositional logic, we must re-state rules for each board location and each round of the game.
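
Here is a rough sketch of why the page counts differ, using just the rook's move and ignoring blocking pieces: the first-order statement quantifies over squares, while the propositional version has to be spelled out once per pair of squares.

    # First-order style: one rule, quantified over board coordinates.
    def rook_can_move(from_sq, to_sq):
        # A rook moves along its rank or its file (blocking pieces ignored here).
        (f1, r1), (f2, r2) = from_sq, to_sq
        return (f1 == f2) != (r1 == r2)  # same file xor same rank

    # Propositional style: the same content enumerated square by square,
    # one ground proposition per (from, to) pair.
    squares = [(f, r) for f in range(8) for r in range(8)]
    ground_rules = {(a, b): rook_can_move(a, b)
                    for a in squares for b in squares if a != b}
    print(len(ground_rules), "ground propositions for one first-order rule")

And that count is for one piece on one turn; restating the rules for each round of the game multiplies it again.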

For some reason, many people lose sight of the simple limitations of propositional methods when they begin to do more complicated things such as probabilistic reasoning.

Now, we can see one reason for this quite easily. I said that the rules of chess must be restated for each round of the game; i.e., they do not generalize over time. However, it is quite standard to generalize over time with propositional methods in machine intelligence. Only a fool would not do so! Indeed, if I point out any one dimension over which an algorithm fails to generalize, it's generally fairly easy to add that dimension to the system's generalization capability. As such, it's easy to miss the general pattern: a propositional method needs to be fed, up front, the dimensions along which it's useful to generalize. It can't see the possible analogies by itself.

(One good way to identify the representational power of a system is to ask what kinds of analogies it is capable of making.)

Because the key idea of relational methods is structured regularity, it also seems to be easy for people to mistake various hierarchical methods for relational power. Deep belief networks, hierarchical temporal memory, and other similar methods identify more and more complicated sorts of patterns as we move up the hierarchy. (This is similar to the way cortical columns have been observed to identify increasingly abstract patterns as we move further from V1.) However, in the analogy to logic, this is like building up larger and larger propositional sentences. At no point does the system become capable of structured generalizations.

An astute reader may point out that a universal quantifier can be viewed as just a large conjunction, and an existential quantifier as a large disjunction. So long as the number of entities is finite, first-order logic really can be viewed as just propositional logic. This is true. However, learning in this form will take exponentially more data than learning via generalizations. Obviously, this approach forces us to separately re-learn every case of the general rule. We could only see it as creating a "generalization" after the fact; it is unable to use this generalization to speed learning.
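
A back-of-the-envelope sketch of the learning cost (the observation count is invented purely for illustration): grounding a relational rule out into propositions makes every ground instance its own parameter, each needing its own evidence, whereas a relational learner pools all observations of the pattern into one shared parameter. For a binary relation over n entities the blow-up is already quadratic, and it grows with the arity of the relation.

    n_entities = 100
    n_pairs = n_entities * (n_entities - 1)
    observations_per_parameter = 30   # made-up figure, purely illustrative

    # Propositionalized: each ground instance of the rule is a separate
    # parameter, so each must be learned from its own observations.
    propositional_data_needed = n_pairs * observations_per_parameter

    # Relational: one shared parameter; every observed pair contributes to it.
    relational_data_needed = observations_per_parameter

    print(propositional_data_needed, "observations vs", relational_data_needed)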

Along similar lines, the advocate of propositional methods may argue that humans must be essentially propositional, since we have a finite number of inputs and outputs. (This is related to arguing that the brain is obviously a finite-state machine.) Again, this kind of model does not seem to be a good account of the sorts of patterns people can actually learn to spot (and how quickly they can learn to spot them). Unfortunately, this would be a very long argument to spell out in detail. One example is the way people can perform on IQ tests. To my knowledge, the only way this performance can be replicated by computers is via analogical reasoning, which is a deeply relational approach.

9 comments:

  1. Going back to cognitive aspects. What are the good architectures? Is it best when the state of the cognitive system is just a "flat" set of statements, each statement associated with a vector of parameters, and the transition to the next state is computed by a fixed "inference engine"? The whole cognitive process would be encoded in the state.

    What about "higher-level" architectures? For example, sophisticated MDP-like architectures (think: Soar, Toss) where there is a separate notion of state (Soar: o-supported part of working memory; Toss: the model), state transitions (Soar: operators; Toss: rewrite rules), and a logic over the state (Soar: rules and i-supported part of working memory; Toss: FOL formulas, in the future perhaps defined relations). These are "two modes" systems: there's the state, and the logic-over-state based transitions.

    What about "three modes" ("non-Markov") systems: the (cognitive) state, the transitions, and the memory? For example ACT-R: the state is the buffers, the transitions are the rules, and there's the activation-based memory.

    Then: where should uncertainty and relevance reside? In a "one mode" system, we have probabilistic logics. In "two mode" systems, we can have transition uncertainty (like in MDPs), and model uncertainty (like in POMDPs); relevance is represented by action values. In "three mode" systems, relevance is "activation" over memory.

  2. Lukasz,

    I have to say, this doesn't seem like a clean 3-way distinction to me. I'm especially fuzzy on your "one mode" systems. Can't we model *any* system by specifying a state and a transition function in that way?

    Is the intention that each of these modes has another meta-level which does inference to support the previous level? If so, then I suppose I'd say that a system should be designed to support the eventual creation of arbitrarily many such levels as they become useful... with the initial number being determined by what happens to be practical and efficient.

  3. I didn't have a formal intention, but your observation nailed it pretty well. Why do you say that the system should be open-ended in creating the levels? I think it would just be a distraction, especially since, once the system is able to create new cognitive mechanisms to support the new levels, I believe the effort would be better spent on redesigning the whole architecture (in particular, creating a thicker graph rather than deepening the chain of levels). But limiting ourselves for a while to the chain of levels: does your intuition support one, two, or three as most effective?

    (And *sub-level* is probably a better term than *meta-level*.)

  4. This comment has been removed by the author.

  5. I think that "sub-level" and "super-level" nomenclature is a better fit than "meta-level" and "object-level". "Sub-level" shares with "meta-level" the fact that it enriches the base level, but it does not have the logical expressiveness of the base level (and its expressiveness seems small by comparison).

  6. (ACT-R is a two-level system actually, Soar with the episodic+semantic memory system is three-level.)

  7. Well, perhaps arbitrarily-many-levels is nothing more than a way of looking at specific sorts of 2-level systems. What I had in mind was something similar to Solomonoff's Aleph system, where the main system process, as one of its normal tasks, devotes some time to self-optimization. So if the main process is induction, some time is devoted to learning how to induct better; if it is optimization, some time is devoted to optimizing the optimization process; etc. This may become multi-level as the system optimizes its own ability to optimize its own ability. (In other words, it may develop special subroutines for this, since it may be significantly different from the other common tasks.)

  8. The above is a difference of purpose, not of logic-mechanism -- not the kind of difference I'm discussing. Perhaps my distinction of "levels / layers / scopes / mechanisms" is vacuous. To recall, the levels I distinguished in the examples could be called "abstract state", "symbolic", and "sub-symbolic".

  9. I suppose I think that part of the benefit of "wrapping around oneself" in this way is that it eliminates some of the need for different levels. I've always liked that architectural idea: it can be found in the earliest writings on my previous blog too, and in my personal notebooks from high school. On the other hand, I've also long entertained the idea that it might be more practical to switch to progressively simpler methods as functions (such as optimization or induction) are called with smaller amounts of time-- the most important change being to switch off wrap-around when there isn't time for it, but other changes as smaller amounts of time are allotted may also make sense.

    Anyway, these ideas only fit within some sorts of architecture (in which "function calls" make sense at the cognitive level and are given specific amounts of time to work in). I think that's fairly far from the sorts of design decisions you are talking about...
