Wednesday, November 16, 2011

Why Relational Methods?

It's been a little while since I've written anything here. While that's not unusual, it's also not ideal. I've actually been meaning to write more, not less. Attending my first "real" conference, AGI-11, has shown me that I have a public image whether I mean to or not. When I'm not writing stuff here, I'm writing stuff on AI mailing lists (or, more recently, facebook groups). People who I respect know my name.

If that's the case, I might as well manage my image... at least a little.

As such, it would be good for me to explain a bit about what I am doing and why I am doing it.

My major recurring rant has been referential completeness. The idea is that an intelligent machine should be able to represent any kind of pattern that a human can. This is motivated by the fact that it's impossible to learn a pattern that you can't represent in the first place.

This has led me to grapple with some difficult issues in logic and the foundations of mathematics, since Tarski's undefinability theorem proves that there is no referentially complete system (in a specific sense). Unsurprisingly, I have not come up with a general response to this problem (though I think Hartry Field has some interesting ideas). However, I do have one message which I'd like to get across to as many practitioners as possible:

Relational methods are needed for general intelligence.

(I'm sure I've written about this before, but I will aim to give a decent unified post on the subject here.)

Now, I'm specifically talking about probabilistic relational methods. I will not, however, be giving an argument for probability theory. (Perhaps I'll discuss that in another post.) Rather, I'll be concentrating on the question of whether relational methods are needed.

Many people are beginning to understand this point whether I preach it or not. However, a large amount of research still goes on at the level of propositional methods. This isn't a bad thing (better propositional methods are very useful), but people often don't consider the issue and end up assuming that propositional methodology can lead to human-level intelligence.

So: what is the difference?

Propositional methods are so-called by analogy to propositional logic. Propositional methods deal with a fixed, finite number of variables, looking for predictive relationships like correlations (or even causation) between them. Variables are "basic"; they have no further breakdown. It is not possible to learn a correlational structure for some variables and then generalize that to other variables, because there is no information to suggest which variables might be analogous.

Relational methods use more information about the structure of a problem. The details vary from formalism to formalism, but variables result from instantiation. This might be the instantiation of a class of objects (in object-oriented thinking), or an instantiation of a predicate or relation (in a first-order language). In either case, we see that variables are now "about something": we have entities which we are describing.

This distinction can be surprisingly difficult to get across at times. Advocates of propositional learning methods will typically say that the difference is "We are learning the relations, rather than giving them ahead of time". This confusion results from using the word "relation" to describe multiple things. Logically, we can split a domain into very distinct levels:

  • Entities
  • Entity Properties
  • Probabilistic Generalizations about Properties
Entity properties include relationships of one entity to another. This is very different from a probabilistic relationship, no matter how complicated. Trying to form relations from correlations is a level-confusion.

This mistake is rarely made with crisp propositional logic (Boolean logic). The limitation there is very clear. Chess is an often-cited example. Stated in propositional logic, the rules of chess would run thousands of pages; however, they can be stated in first-order logic in just a few pages. Why? Because in first-order logic, we can generalize the rules based on the structure of the game. In propositional logic, we must re-state rules for each board location and each round of the game.
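
To make the contrast concrete, here is a toy sketch (my own illustration, in Haskell, ignoring blocking pieces and every other rule): the first-order style states one rule that quantifies over squares, while the propositional rendering has nothing left to quantify over and must enumerate a separate ground fact for every pair of squares.

    -- One rule, quantified over coordinates, covers every square at once.
    type Square = (Int, Int)   -- (file, rank), both 1..8

    rookMoveOK :: Square -> Square -> Bool
    rookMoveOK (f1, r1) (f2, r2) =
        (f1, r1) /= (f2, r2) && (f1 == f2 || r1 == r2)

    -- The "propositional" rendering has no variables left to generalize over:
    -- it is just the exhaustive list of ground facts the rule stands for.
    rookMoveFacts :: [(Square, Square)]
    rookMoveFacts =
        [ (from, to) | from <- squares, to <- squares, rookMoveOK from to ]
      where
        squares = [ (f, r) | f <- [1 .. 8], r <- [1 .. 8] ]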

For some reason, many people lose sight of the simple limitations of propositional methods when they begin to do more complicated things such as probabilistic reasoning.

Now, we can see one reason quite easily. I said that the rules of chess must be re-stated for each round of the game; ie, they do not generalize over time. However, it is quite standard to generalize over time with propositional methods in machine intelligence. Only a fool would not do so! Indeed, if I point out any one dimension which an algorithm fails to generalize over, it's generally fairly easy to add that dimension to the system's generalization capability. As such, it's easy to miss the general pattern: a propositional method needs to be fed, up front, the dimensions along which it's useful to generalize. It can't see the possible analogies by itself.

(One good way to identify the representational power of a system is to ask what kinds of analogies it is capable of making.)

Because the key idea of relational methods is structured regularity, it also seems easy for people to mistake various hierarchical methods for relational power. Deep belief networks, hierarchical temporal memory, and other similar methods identify more and more complicated sorts of patterns as we move up the hierarchy. (This is similar to the way cortical columns have been observed to identify increasingly abstract patterns as we move further from V1.) However, in the analogy to logic, this is like building up larger and larger propositional sentences. At no point does the system become capable of structured generalizations.

An astute reader may point out that a universal quantifier can be viewed as just a large conjunction, and an existential quantifier as a large disjunction. So long as the number of entities is finite, first-order logic really can be viewed as just propositional logic. This is true. However, learning in this form will take exponentially more data than learning via generalizations. Obviously, this approach forces us to separately re-learn every case of the general rule. We could only see it as creating a "generalization" after the fact; it is unable to use this generalization to speed learning.
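
In Haskell-ish terms, the "grounding" view is trivial to write down (a sketch, names mine): over a finite domain the quantifiers really are just big conjunctions and disjunctions, but the grounded form has forgotten that all the conjuncts are instances of one rule, which is exactly what lets a learner generalize.

    -- Over a finite domain, "for all x, P(x)" is a big conjunction of ground
    -- instances, and "there exists x, P(x)" a big disjunction.
    forAll, exists :: (a -> Bool) -> [a] -> Bool
    forAll p domain = and (map p domain)
    exists p domain = or  (map p domain)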

Along similar lines, the advocate of propositional methods may argue that humans must be essentially propositional, as we have a finite number of inputs and outputs. (This is related to arguing that the brain is obviously a finite-state machine.) Again, this kind of model does not seem to be a good account of the sorts of patterns people can actually learn to spot (and how quickly they can learn to spot them). Unfortunately, this would be a very long argument to try and spell out in detail. One example is the way people can perform on IQ tests. To my knowledge, the only way this performance can be replicated by computers is via analogical reasoning, which is a deeply relational approach.

Sunday, September 4, 2011

Useful Meta-Analysis

Continuing with procedural logic. I've created a sequences page now, with a summary of the procedural logic sequence.

I recently finished reading the Sutton & Barto book on reinforcement learning. In the last part of that book, they discuss a general framework which is supposed to capture many of the different ideas which have been presented. Specifically, it gives a view in which dynamic programming, Q-learning, SARSA, classical planning, Monte Carlo methods, and a few other things are all treated as value-propagation methods. The schemes differ both in the propagation equations and in the scheduling of the propagation, but the common structure is there.
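
A rough sketch of that common shape (my own gloss, not the book's notation): each method repeatedly applies a one-step backup, and the methods differ in how the target is formed and in which backups get scheduled when.

    -- A generic one-step value backup: nudge the current estimate toward a
    -- target. TD(0), Q-learning, SARSA, and DP sweeps all share this shape;
    -- they differ in how the target is computed (sampled vs. expected,
    -- on-policy vs. off-policy) and in the scheduling of the backups.
    backup :: Double  -- step size
           -> Double  -- backup target, e.g. reward + discount * next value
           -> Double  -- current estimate
           -> Double
    backup alpha target v = v + alpha * (target - v)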

This generally fits with the Dyna-style systems I've been thinking about, though the details are fuzzy.

Combined with the amount of thinking I've been doing about systems that use belief propagation as their major reasoning mechanism (which I should write more about...), this has me fairly deep into propagation-based reasoning systems.

We've also been talking about dynamic programming in my advanced algorithms course, as it happens. We went over the classic example based on matrix chain multiplication (which I'd seen before, but hadn't thought about recently). This example is particularly interesting, because the dynamic programming is being used for inference guidance, not for inference: we use dynamic programming to find the fastest value-propagation ordering, to great advantage. This leads to the idea of using dynamic programming to determine the schedule for dynamic programming. This connects back to my original theme of procedural logic: logic which determines its own inference procedures.
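
For reference, here is that dynamic program (the standard textbook recurrence, in Haskell; the names are mine, and it assumes at least two dimensions are given): the table entry (i, j) holds the cheapest cost of multiplying matrices i through j, defined in terms of smaller subchains.

    import Data.Array

    -- dims lists the dimensions, so matrix i is (dims !! (i-1)) x (dims !! i).
    matrixChainCost :: [Int] -> Int
    matrixChainCost dims = table ! (1, n)
      where
        n     = length dims - 1
        d     = listArray (0, n) dims
        table = array ((1, 1), (n, n))
                  [ ((i, j), cost i j) | i <- [1 .. n], j <- [i .. n] ]
        cost i j
          | i == j    = 0
          | otherwise = minimum [ table ! (i, k) + table ! (k + 1, j)
                                    + d ! (i - 1) * d ! k * d ! j
                                | k <- [i .. j - 1] ]

    -- e.g. matrixChainCost [10, 30, 5, 60] == 4500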

The issue with this self-scheduling idea is that we need some simpler inference guidance to get off the ground. If we wait for informed priority estimates before we do any inference, but rely on inference to get priority estimates, we'll go nowhere. (I referred to this as the reflection problem in the previous post in this sequence.) Furthermore, we have to balance between meta-inference and inference. It's tempting to give meta-inference a higher priority in general, because we want to do it first; but in a general setting, there will be too many possible meta-inferences, so we'd never get anything done if we gave them priority.

Since many of the propagations are just propagating real numbers (particularly in belief propagation and in reinforcement learning for discrete domains), a simpler idea is to guide inference by giving priority to the propagations which represent larger changes. This is referred to as prioritized sweeping in Sutton & Barto, but surprisingly it is not dealt with very much there. I also haven't seen much about the effect of such a scheme on the efficiency of belief propagation. It's probably out there, though.
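
As a sketch of what I mean (a toy version of my own, not the book's pseudocode): keep a map of pending proposed values alongside the current values, and always apply the pending update that would move its value the most.

    import qualified Data.Map.Strict as M
    import Data.List (maximumBy)
    import Data.Ord (comparing)

    type Node = String

    -- Apply the pending update with the largest absolute change; return the
    -- node touched, the new value map, and the remaining pending updates.
    applyLargest :: M.Map Node Double -> M.Map Node Double
                 -> Maybe (Node, M.Map Node Double, M.Map Node Double)
    applyLargest values pending
        | M.null pending = Nothing
        | otherwise      = Just ( node
                                , M.insert node newVal values
                                , M.delete node pending )
      where
        (node, newVal) = maximumBy (comparing change) (M.toList pending)
        change (n, v)  = abs (v - M.findWithDefault 0 n values)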

It would be great to test the effects of different (combinations of) prioritization methods to see what actually worked. Hopefully I get to that point.

Today I had a discussion with Laurent Orseau about these ideas. He has been thinking about very similar things-- a system based on a priority queue and a hash table, like the Dyna implementation I've been slowly working on. (Actually, I'm not even worrying about a hash table at the moment, but building something that looks more like a Rete network's alpha memory; when I need to optimize it, I'll go for term indexing, thanks to some advice from Russel Wallace.) He has some ideas about inference control as well (perhaps more unified than mine), which fit in with the theme of using dynamic programming to determine the optimal inference. Perhaps we will make some progress.

Monday, August 1, 2011

Corrected- IO Monad

This is a correction to the addendum of the previous post, where I claimed that Clean's way of handling IO was flat-out better than Haskell's. A quote from William Tyson (to a mailing list):


At first, Haskell did try world passing. Monads hide the world passing. Compare the following echo procedures, which read a line and print it character by character.

== Clean Echo ==

echo :: *World -> *World
echo world
| c == '\n' = world2
            = echo world2
where
    (c, world1) =: getChar world
    world2      =: putChar c world1

== Haskell Echo ==

import Control.Monad (unless)

echo :: IO ()
echo = do
       c <- getChar
       putChar c
       unless (c == '\n') echo

==

In Clean, each IO operation gives you a new world which you need to explicitly pass around.  Uniqueness typing ensures that you don't use the same world twice.  In Haskell, the world is implicit.

In other words, the monad is making the syntax nicer, but basically the same thing is going on.

Sunday, July 10, 2011

Procedural Logic Answers

After some discussion with Russel Wallace, I've come to a few conclusions about my procedural logic project.


  1. FRP provides an in-principle answer: there is no essential problem with describing behavior-in-time in a logically pure way. You just assert some things about how a system behaves through time. There are certainly questions about the details, but the problem of describing actions with logic is not as nebulous as I was making it. It's computer science, not artificial intelligence.
  2. The problem which is nebulous is the "reflection problem" that I mentioned in the previous post. This is the problem of making a system which dynamically reasons about and improves its own implementation. It's possible to reduce many aspects of this to computer science as well, but ultimately it is an "AI kind of thing" to worry about.
  3. The main technique for making #2 more tractable is to "stage" the reflection, separating an action and reflection on that action. For example, the kind of dynamic priority assessment I was imagining (allowing reasoning about the priority of a task between the time when the task is added to the queue and its execution) is just not a great idea. It's better to reason about the priority of rules.
When it's possible to reduce something to a computer science problem rather than an AI problem, the results become more generally applicable. I mean two things by this. First, it's a computer science problem when it doesn't depend on any standpoint concerning what intelligence is, the best approaches to achieving AI, et cetera. It cuts out the philosophical stuff and just focuses on a practical problem, with no need to justify itself via claims about building a thinking machine within 20 years. Second, it generally falls within the polynomial-time asymptotic complexity class, rather than exponential time. Exponential-time problems are roughly where the need for the methods we refer to as "AI" starts, because we need to rely on heuristics for approximate problem-solving. Obviously there is an advantage if we can manage to work on poly-time problems rather than exp-time problems.

Whereas FRP-like solutions to the procedural logic question involve programs which only refer to their external behavior, and so can be implemented in whatever way is convenient, reflective procedural logic would refer to its implementation details. It's possible that it could still maintain a certain amount of purity, but it's difficult to say.

I'm working on a bare-bones implementation of Dyna, with the goal of seeing how much reflection I can conveniently add. I also have a few other goals for the project (but these are not high priorities):
  • Integrate functional and logical programming, as with Curry and Mercury
  • Integrate the pattern calculus, a generalization of lambda calculus for pattern matching
  • Use a type system modeled after the one Qi uses
  • Try FRP
  • Try uniqueness types
We'll see how far I get. :)

Monday, July 4, 2011

Procedural Logic Once Again

Follow up to Procedural Logic.

This is one of those things I've been meaning to write about for a while now.

Sometime early this year, I had the thought that some form of "agenda" or priority queue is lurking inside a very large number of AI algorithms. What happens if we push the idea of making priority queues the heart of AI systems? Can I make a generally useful AI toolkit by creating an optimized priority queue with some generally useful built-in features? (This would be similar to the way SOAR is "just" a highly optimized rule engine with useful extra features.) This was to be my answer to the "Procedural Logic" question: a principled way to handle thinking-through-time (and the acting/thinking trade-off) by making use of scheduling at the basic level.

If I had written this post back then, it would have included a big argument justifying this idea. The main thrust of the argument would be that priorities can be thought of as a procedural form of expected utilities; IE, they should be an immediate estimate of the value of performing that action. The special features of the system would center around improving that estimate, through methods like reinforcement learning. Via an analogy to the Copycat system, I was referring to each task as a "codelet".
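
A sketch of the loop I had in mind (the names are made up, and it leaves out the interesting part, namely where the priority estimates come from and how they improve):

    import Data.List (insertBy)
    import Data.Ord (comparing, Down (..))

    -- A codelet has a priority (an estimate of the expected utility of
    -- running it) and an action that transforms the state and may spawn
    -- further codelets.
    data Codelet s = Codelet
        { priority :: Double
        , run      :: s -> (s, [Codelet s])
        }

    push :: Codelet s -> [Codelet s] -> [Codelet s]
    push = insertBy (comparing (Down . priority))   -- highest priority first

    agenda :: s -> [Codelet s] -> s
    agenda state []               = state
    agenda state (codelet : rest) =
        let (state', spawned) = run codelet state
        in  agenda state' (foldr push rest spawned)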

When I told this idea to Russel Wallace, he immediately spotted the problem (though it took me a while to appreciate the insight). Such a system will run into the "problem of reflection" (representing and reasoning about one's own representations and reasoning) head-on, with no direct solution. We can't easily use the task system to think about what priorities should be assigned to tasks without risking adding infinitely many tasks to the queue (if we try to have an "estimate the priority of this" task for each task).

We can get around that if we are careful, but we have to make compromises. Even if we settle on a good trade-off, the credit assignment problem is far from easy; we can't just apply standard RL. To get a good idea of what is going on, it seems as if we have to keep a record of dependencies (what tasks pushed what other tasks into the queue) within the system. This is something SOAR does in order to implement its chunking mechanism, so it's possible. Chunking-like behavior was another feature I was considering: if several codelets often run in sequence, compile their actions into one codelet so that it runs faster. This would actually be very similar to JIT compilation of bytecode, so my hope was to use algorithms which had been developed for that.

A system more heavily built upon the concept of activity tracing is Funk2. The goal of Funk2 is exactly the one I mentioned: solving the credit assignment problem by tracing the cause-and-effect relationships within the code.

So, perhaps there is some hope for the idea. At every corner, though, just when Solution X looks like it will make the idea work, I start wanting to make "Smart X"; ie, X which learns over time thanks to making more general use of the system, and in particular, a version of X which makes use of the task queue. This is, of course, Bad Thinking. There really does need to be a bottom somewhere, even if as a programmer I'm used to the power of recursive decomposition of problems. This takes some of the magic out of the idea, forcing concrete design decisions which may not be as general-purpose as my original vision.

So, all told, I reached a point where the idea was set aside.

Then I found Dyna. The basic idea of Dyna sounded at first outlandish: an AI programming language which manages to make no assumptions about the style of AI being implemented (ie, it "takes no sides" concerning the different debates going on within AI) yet makes it possible to implement many standard AI algorithms in just a few lines. Furthermore, the concept of the language is very simple! It's not obvious at all that this should be possible; usually, you've got to make assumptions to gain anything. Yet, there it is: simple, (mostly) intuitive descriptions of textbook algorithms, often in less than 10 short lines of code. And, as if it was designed just to make me happy, it's got a priority queue at the core of its implementation.

The basic idea behind Dyna, in case you don't want to just dive into the links, is to take rule-based programming / logic programming and completely scrap the idea that the rules have to be passing anything like truth values. A "rule" is just an expression which determines how some sort of value should be propagated. The value can be anything you like: it can be a truth value or a probability, but it can also be a neural-network activation, an instance count, or even a complicated data structure. In this way, it is very similar to probabilistic programming languages like BLP, but drops the "probabilistic" part. (Dyna also relies fundamentally on a concept of aggregation which is very similar to the concept of "combination" in BLP, is also similar to "summary" in the summary-product algorithm, and yet also captures the idea of an "aggregate CPD" in statistical-relational modeling... but I'll let you read the papers.)
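
Stripped of everything else, the aggregation part looks roughly like this (my own gloss in Haskell, not Dyna syntax): rule firings produce contributions to items, and an aggregation operator folds all the contributions to an item into that item's value.

    import qualified Data.Map as M

    type Item = String

    -- Fold rule contributions into item values with an aggregation operator
    -- (min, max, (+), (||), ... whatever the values call for).
    aggregate :: (v -> v -> v) -> [(Item, v)] -> M.Map Item v
    aggregate op contributions = M.fromListWith op contributions

    -- e.g. a shortest-path flavour:
    --   aggregate min [("pathTo b", 5.0), ("pathTo b", 3.0), ("pathTo c", 7.0)]
    --     == M.fromList [("pathTo b", 3.0), ("pathTo c", 7.0)]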

Dyna, however, is far from a "procedural logic": the philosophy of Dyna is to strictly separate the abstract program specification from the implementation level. The Dyna programmer specifies the logical (or "equational") relationships between different values, but it's up to the Dyna compiler to decide how to implement these effectively.

As with any "Solution X", the obvious thing to want next is "Smart X"; and, indeed, the papers refer repeatedly to the idea that Dyna will eventually use machine learning to figure out what implementation strategies work best for which kinds of programs. My reaction to this is that it seems a tad hopeful to think that ML can figure out what the human programmer has difficulty with. It seems safer to just say that we should buckle down and look for good rules for the compiler to use when deciding how to implement particular pieces of code, perhaps tuning the occasional threshold via experimentation...

...but what I really want to say is that Dyna should not have that strict divide between specification and implementation at all; instead, it should be built from the beginning with the reflection capabilities necessary for it to control its own execution strategies via run-time priority manipulation, and so on. This is only bolstered by the fact that Dyna was originally designed with dynamic programming in mind, and RL is a dynamic programming problem!

But we know where all of this leads. Right into the reflection problem. [Update: Ah, actually, they do plan on allowing this sort of thing! They call it "voodoo computation".]

So, all told, I've made very little progress on the whole "procedural logic" idea. However, I do think Dyna is very cool; it brings together a lot of ideas of textbook AI in a small and elegant space, perhaps revealing that there was more to them than we thought when we were in undergrad AI class. :)

PS:

One more semi-relevant idea. Clean is a programming language which is very similar to Haskell, slightly older, slightly less elegant, but with a much better way of thinking about IO (it seems). It allows side effects without compromising referential transparency by introducing a uniqueness type which allows only one reference to values which may change due to side-effects (so that we can't be surprised when the value changes). This seems, well, cleaner than the IO monad (though I'm no expert on IO in Haskell); a good way of interfacing the logical with the procedural, but no solution to generating nanosecond-level goal-oriented behavior.

(Dyna is actually a "pure" language, no side-effects allowed... though values can change over time in several ways!)

Saturday, June 11, 2011

I recently encountered this article (shared by Gwern). The following is a riff on that article, part of which also appeared as a Buzz comment.

Education is very important.

Culture is very important.

The culture of not admitting mistakes described in the article is quite different from the culture here, but at the same time it's an uncomfortably familiar facet of human nature. Look how bad it can be!

In America, the failing is more often simply keeping silent rather than admitting something which might make you sound stupid. That is also a mistake, _especially_ when it's a matter of being afraid to ask stupid questions. We've all been there, right? No one wants to sound uninformed. The problem is that the best way to _get_ informed is to ask these questions.

While we may understand this on a conceptual level, it is hard to get that "small amount of courage" needed to ask the critical questions immediately when we have the chance. Perhaps it is because it's an immediate cost for a delayed reward, and we are not very good at thinking of the delayed reward. Here is a trick which may help. So you are worried about making a good impression, right? By keeping silent when you don't understand, you don't make any impression on that issue. By asking questions, you give the impression that you are really interested. If you ask enough questions to really understand, you make the impression of being able to pick things up quickly. By practicing the art of asking good questions, you stand a better chance of making an impression as a good critical thinker.

What we need is a culture which encourages admitting mistakes and inadequacies as soon as possible. Specifically, we need to encourage the idea that the best way to *be* that competent, informed person is to seek feedback and correct yourself whenever you fall short. This also gets into the space of what we do when we argue. We bring in all the support we can to make our side right. Instead, we should be looking at the evidence to choose the right side. That is discussed in this article, also shared recently by Gwern. (It's also discussed extensively on Less Wrong.) How do we engineer a culture in which real discussion, from which both sides can learn, is encouraged? Here are a few tips (this time shared by Luke Palmer). These come from a system of government called Sociocracy.

I'm not buying into the whole of Sociocracy (and neither is Luke), but these tips for how to run small groups seem immediately useful. The basic idea is that our win/lose argument mindset is partly due to the pervasive culture of democracy. What do we do to resolve a dispute in a small group? We vote, and the winning side is considered correct. Sociocracy suggests an alternative: ask for everyone's objections, and don't make a decision until all the objections have been dealt with. This builds a better decision. The phrase "science is not a democracy" comes to mind. Science instead relies on the merit of arguments to convince individuals of the correctness of views. This is the idea behind sociocracy. (I do feel voting is still a good way of making decisions when unified decisions are needed and consensus can't be had or would be too expensive, though.)

So how do we build a more effective culture? By seeking out good ideas like this, and sharing them. (Ultimately, Luke's idea of experimenting with new cultures to see what works is also good.)

Sunday, May 8, 2011

Current Probability Ideas

Follow-up to this post (and the series preceding it). I think that while my distinction between subjective and objective probability makes sense, it's not the right direction for answering the questions I'm asking-- it overcomplicates the issues. That change of view is based on a few papers, which I still need to read in more detail. (IE, this post does not summarise insights from those papers; it summarises my reaction to discovering their existence.)

Aside: my previous post, this one, and perhaps the next one or two do not represent me "thinking aloud" as my posts usually do... instead, I'm summarising thoughts that I had over the semester, but felt too busy to blog about.


So. How well can we do with just a single layer of probability, over a standard logic? How can we achieve something similar to Solomonoff induction?

One idea is to imagine that the universe we are observing was constructed by randomly writing down sentences in logic to determine each fact, only taking care not to ever write something which contradicts what we've already written. For example, we can use a length-based prior to choose a sentence at random from first-order logic. This prior obviously contains the Solomonoff prior ("dominates" it, as they say), but it's not clear to me whether the relationship goes the other way ("is dominated by") so that they are roughly equivalent. I'm thinking the prior I'm specifying is "even more uncomputable," since we have to check for consistency of theories.

This prior lends itself to an interesting approximation scheme, though (roughly sketched in code after the list):

  • Each sentence starts with at least the probability mass the length-based prior assigns it (log-probability proportional to the negative of its length)
  • To this, we can add some probability mass for each consistent sequence of sentences we find which implies the sentence. (To test for "consistent" we just try to find a contradiction for some reasonable amount of time, perhaps trying harder later if it's important.)
  • Normalise this probability by dividing by (1 - probability we encounter an inconsistent sequence drawing randomly), which is estimated based on the cumulative probability of the inconsistent sequences we've identified so far.
  • This gives a rough lower bound for the probability. To get an upper bound, we apply the same technique to the negation of the sentence.
  • If A implies B, then B is at least as probable as A (ie, it can inherit A's lower bound). Likewise, A is at most as probable as B (inheriting B's upper bound). In other words, we can use existing ideas about propagating probability intervals to work with these estimates if we like.
  • To update on evidence, we use compatibility with that evidence as an extra criterion for consistency of sequences of statements. Intuitively, this will throw out a bunch of possibilities (and increase the amount of normalisation by doing that, which increases the probability of the remaining possibilities).
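
A crude sketch of that bookkeeping (the names are mine, and it covers only the bounds arithmetic, not the consistency search):

    data Bounds = Bounds { lower :: Double, upper :: Double } deriving Show

    -- A sentence of length n starts with at least the length-prior mass.
    lengthFloor :: Int -> Bounds
    lengthFloor n = Bounds (2 ** negate (fromIntegral n)) 1

    -- If a implies b, then b inherits a's lower bound and a inherits b's
    -- upper bound.
    propagateImplication :: Bounds -> Bounds -> (Bounds, Bounds)
    propagateImplication a b =
        ( a { upper = min (upper a) (upper b) }
        , b { lower = max (lower a) (lower b) } )

    -- Renormalise a lower bound by the cumulative mass of the inconsistent
    -- sequences identified so far.
    renormalise :: Double -> Bounds -> Bounds
    renormalise inconsistentMass (Bounds lo hi) =
        Bounds (min hi (lo / (1 - inconsistentMass))) hi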

Unfortunately, these bounds would tend to be very wide, I think. The system would have to do a lot of thinking to narrow them down. More generous approximations might find one model and stick to it until it's disproved or a better model is found, say. This would at least give the system something to work with.

Now that we have a little bit of foundation in place, I think we can more usefully consider what to do if we want higher-order probabilities. Just as we can randomly choose sentences of first-order logic, we can randomly choose sentences of any system of probabilistic models we like (such as BLP, BLog, etc). This allows us to use the inference algorithms, et cetera from our favourite formalism. As long as first-order logic is a special case of the formalism, it will be just as universal.

For example, we could perfectly well use a prior based on random theories stated with the (objective-)probability-theoretic quantification which I was discussing in the previous posts. If we do this, we basically solve the puzzles I was running up against there: the inference from objective probabilities over classes to subjective probabilities over individuals is justified by the regularity assumption in the prior, IE, by the idea that the objective probability might be an "essential feature" of the class... something that was written down in the book of rules for the universe before the individual cases were decided.

Saturday, May 7, 2011

Impredicativity

Lukasz Stafiniak sent me this about impredicativity. I had seen the concern before, but never seen why it was worth any worry. Impredicative comprehension principles are a big convenience for normal mathematics, and don't lead to any paradoxes that we know of.

However, looking at it again, I realised that the basic complaint could be read as "impredicative comprehension is not valid in Kripke's theory of truth." This is more concerning to me. In fact, the concern can be stated just in terms of the Tarski hierarchy: 2nd-order logic does not exist at any stage in the Tarski hierarchy. I was foolishly thinking that there was a direct correspondence between stages in the hierarchy and orders of logic. There is, if we use predicative comprehension! But, impredicative comprehension is "too powerful" to exist at any stage of any recursive hierarchy.

This causes trouble for me, because impredicative comprehension is necessary to my understanding of the natural numbers (and other important things).

Fortunately, the article suggests a solution: impredicative quantifications should be understood at the level of provability. IE, we aren't stating that all instances are true; we are stating something stronger, that they are all provable. (But we use infinitary logic, so that this is not as much of a restriction.) This is a very intuitionistic suggestion.

Unfortunately, we can see that this really just introduces another metalanguage (the language of provability). Hence, comprehension interpreted this way does not include all possible predicates after all (even if we extend the provability interpretation to the point where it covers all cases of 2nd-order comprehension): it skips the predicates which can be defined in terms of provability.

Building a self-referential theory of logical implication which follows the rules we want is hard. The obvious way of doing it just leads directly to Curry's paradox. I still have not finished reading Hartry Field's "Saving Truth from Paradox", but it includes an account of implication which may be helpful. These posts also contain some notes which may be helpful.

I should have known for about a year now that the immediate challenge is to build up to 2nd-order logic, not to build up to set theory. However, this made it hit home: I can't rely on 2nd-order logic to be there for me when I need it. I don't yet know what 2nd-order logic is, or where it comes from.

Still, the fact remains that 2nd-order logic is a fairly natural, very useful, and almost certainly consistent set of axioms. It seems probable that aliens would stumble upon it, etc. It does not just poof out of existence. Even just as a syntactic entity rather than one with a semantics, it's still important!

Friday, February 4, 2011

Probabilities over Subjective Worlds

This is the promised follow-up to this post.

What sense can we make of the standard treatment of subjective higher-order probabilities? Beyond the formalism, how are we to visualise this stuff?


I wanted a possible-worlds type semantics, but I decided to construct my own rather than looking at existing ones.

In normal possible-world semantics, we've got a bunch of worlds which hold complete valuations-- assignments of every statement to true or false. Then, we often talk about an "accessibility" relation between worlds: world A is only accessible from world B under certain conditions. The accessibility relation controls our notion of "possible" and "necessary": from our perspective at a world W, something is considered "possible" if it is true in some world accessible from W, and "necessary" if it is true in all accessible worlds.

Examples of useful accessibility relations include ones based on knowledge (ie, the accessible worlds are the ones consistent with what I know about the world I'm in) and based on time (ie, the accessible worlds are potential futures).

One way of getting probabilities out of this is to make the accessibility relation continuous-valued; rather than a world being accessible or not, a world has a degree of accessibility.

So, we start with just non-probabilistic facts in each world; then 1st-order probabilistic facts get values based on the immediately accessible worlds. Then 2nd-order facts, 3rd-order, et cetera. (This can go up to infinite orders if the language is expressive enough to mention them.)
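
The first step of that propagation might look roughly like this (a minimal sketch; the types and names are mine):

    type World     = Int
    type Statement = String

    -- The probability of a statement at a world is the accessibility-weighted
    -- fraction of worlds in which the statement holds.
    probAt :: (World -> World -> Double)    -- accessibility weight
           -> (World -> Statement -> Bool)  -- non-probabilistic valuation
           -> [World]                       -- all the worlds
           -> World -> Statement -> Double
    probAt access holds worlds w s
        | total == 0 = 0
        | otherwise  = sum [ access w v | v <- worlds, holds v s ] / total
      where
        total = sum [ access w v | v <- worlds ]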

To get coherent probabilities, we need to have some restrictions on the weights of the accessibility relations; I won't go into detail here.

Unfortunately, this really only gets us a 1st-order distribution: since all our facts are completely determined in each world, all our 1st-order probabilities will be completely determined in the first propagation step, so all higher-order probabilities will be 1 or 0.

To get an interesting higher-order distribution, we can give up the idea that possible worlds always hold complete valuations. Instead, a possible world contains just a partial picture: an assignment of a few facts to true or false. This allows the probabilities to be only partially determined, allowing nontrivial probabilities at each stage.

This goes along with an idea called "situation theory" which I know only a little about. As I understand it, the idea is that when talking about possibility, we talk about "situations" rather than worlds: partial assignments to true/false rather than complete ones.

It's probably better to think of these as "possible states of knowledge" rather than possible worlds or even possible situations, given the probabilistic interpretation. We move between worlds when we gain knowledge. Some of the worlds *will* be fully specified, and one of these will correspond to the "actual world"; however, we will never get there in terms of our state of knowledge.

Now, since this construction gives us a self-referential probability predicate, it is interesting to think of it as a sort of generalised theory of self-referential truth. P(S)=1 plays the role of True(S), but does so imperfectly: it is possible for the probability of a statement to be 1 but for the statement to later turn out to be false. For example, if we have a uniform distribution over the possible probabilities of some statement A, then P(P(A)=r)=0 for all real numbers r. This means P(P(A) not equal r)=1 for all r. Yet, we may eventually transition into a world in which P(A) takes on some specific value r-- so we can't think of all the statements "P(A) not equal r" as being true. This makes our theory of (pseudo-)truth one in which the inference A => True(A) is justified, but not the inference True(A) => A.

The propagation method gives us a "least-fixed-point" type probability distribution. However, there might also be other useful fixed points. For example, we may want some self-referential sentences to come out with probability values.

In any fixed point, there will be probabilistic statements which do not get any truth value. First off, any statement which is assigned a probabilistic belief strictly between 0 and 1 can't also have a truth value. There will also be statements, though, which can't be consistently assigned any level of belief.

Overall, this probably doesn't offer an especially appealing theory of truth. It would be interesting to know more about precisely how much it can give us, though.

Friday, January 28, 2011

Some Consequences of Incompleteness

I wrote an essay about some issues in logic a little while ago.

Friday, January 21, 2011

Subjective Probability

The system I described in the previous post doesn't directly lend itself to "subjective" higher-order probabilities. Specifically, we can't just have a degree of belief in an arbitrary statement. We need to choose a variable randomly in order to have a probability-- a completely specific statement (with no variables) can't have a probability other than 1 or 0. Probabilities are reduced to a sort of counting -- specifically, what's known as a "measure" -- rather than being a way of representing uncertainty.
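
In code, "probability as measure" amounts to nothing more than this (a trivial sketch, assuming equal weights and a non-empty domain):

    -- The probability of a property is just the fraction of the (equally
    -- weighted) domain that satisfies it.
    measureProb :: (a -> Bool) -> [a] -> Double
    measureProb p xs =
        fromIntegral (length (filter p xs)) / fromIntegral (length xs)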

There are some ways we can try to interpret subjective probabilities as measures. We can introduce "possible worlds" (or "possible situations"), which we treat as being chosen randomly-- so if we're uncertain about whether it will rain or snow tomorrow, it's because in some possible worlds it rains, and in some it snows. The probability of each is obtained by counting the possibilities. However, prompted by Lukasz's comment on my previous post, I took a much closer look at how higher-order subjective probability measures should be represented. I found myself re-inventing the standard higher-order probability theory which I complained about in the first place. This can be represented within my framework, but it isn't clear that it should be. Each theory can stand on its own, and is more convenient for different purposes.

One problem with my proposed system is that it does not directly endorse the idea of generalizing from examples. Having knowledge of more examples will give us more knowledge of the probability involved, but only because the probability is defined by literally counting examples. If we have 100 buttons (each measured equally) and we know only that each button is either red or blue, seeing 98 of those buttons will tell us the probability to within a range of .02, but that tells us nothing about the last two buttons! 98 blue buttons are no indication that the last two are blue, unless we add a possible-world framework so that we can measure probabilities of the last two colors as a function of randomly chosen worlds.

Frequentism is a bit better here: probabilities are not just arbitrary measures, but rather are limiting frequencies. What this means is that the probability of an event is defined as the ratio one would get after an infinite number of experiments. It seems more justifiable to use a limiting frequency as a generalization; if we have a limiting frequency, then we know that if we run experiments long enough, we'll get close to that ratio. This in-the-long-run statement is potentially quite weak with respect to the immediate future, but at least it tells us something about the future!

There are some problems with limiting frequencies, however. One problem  is that there is no mathematical guarantee that limiting frequencies exist! Mathematically speaking, the limiting frequency will typically depend on such things as the order in which we perform the experiments; some orderings will have different limits, and some will have none at all (ie, the ratio varies up and down infinitely). We have to assume that the actual order in which we perform experiments will have a limit. Another problem is how we might get knowledge of limiting frequencies. A limiting frequency is not something we can directly use as a degree of belief concerning another limiting frequency-- limiting frequencies require something called a reference class (meaning that a probability for a specific event is only defined when we think of that event in the context of a specific sequence of random experiments). Furthermore, a ratio from a finite series of experiments does not necessarily tell us anything about the ratio after an infinite number of experiments; we need additional assumptions to try and make this connection.
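
To make the "no limit at all" case concrete, here is a 0/1 sequence whose running frequency never settles down (alternating blocks of 0s and 1s whose lengths double each time):

    oscillating :: [Int]
    oscillating = concat [ replicate (2 ^ k) (k `mod` 2) | k <- [0 ..] ]

    runningFreq :: Int -> Double
    runningFreq n =
        fromIntegral (sum (take n oscillating)) / fromIntegral n
    -- runningFreq keeps swinging between roughly 1/3 and 2/3 as n grows, so
    -- this sequence has no limiting frequency.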

This brings in the idea that we have to start with some probabilities in order to get more (effectively, we need a Bayesian prior). Taken to its extreme, we get the theory of subjective Bayesian probabilities; all probabilities are interpreted as personal degrees of belief. New information provides evidence for hypotheses by Bayes' Law, so that we can generalize from examples by performing bayesian updates on our probabilistic models of the world.

Really, all three of these options are valid applications of probability theory. One does not have to be a purist-- the different types of probability can be mixed. In particular, it seems useful to take Bayesian-style degrees of belief as a way of estimating the two other kinds of probabilities.

However, that sort of inclusiveness does not settle the issue of how probabilities should be used for particular applications. For my curiosities about higher-order probabilities, the system I presented in the previous post offers a somewhat nice, tight connection between higher-order probabilities and first-order logic (probabilities being a generalization of universal quantifiers). This may be appealing for certain probabilistic logic issues. On the other hand, the Bayesian theory of higher order probabilities has a nicer way of allowing a degree of belief for any statement... (though, higher-order belief distributions require us to just use expected probabilities when making bets, as I complained in the previous post).

I'll end this here... next time, I hope to talk a bit about the structure of possible-world semantics for purely subjective probabilities, and what it provides in the way of a theory of truth and foundation for mathematics.