Sunday, December 19, 2010

Foundations of Higher-Order Probability

I think the foundations of higher-order probability theory may be in a bit of a mess.

Whereas first-order probabilities express uncertainty about the world, a higher-order probability expresses uncertainty about what value a first-order probability has. Or, if you ask a Frequentist instead of a Bayesian: first-order probabilities give limiting frequencies of results from experiments, whereas higher-order probabilities give limiting frequencies of getting certain limiting frequencies.

Uncertainty about uncertainty might seem to defeat the whole point of having a definite degree of belief... why can't it just be collapsed into a straight first-order degree of belief?

This paper tries to base higher-order probability on the idea that there is some "ideal expert" who knows the "true probability" we are uncertain about. However, the paper leaves it up to the agent to decide what constitutes a "fully informed ideal expert." If the ideal expert knows which events actually happen, then the first-order probabilities will all be 0 or 1, so the higher-order probabilities take on the role that first-order probabilities usually play. At the opposite extreme, the "ideal expert" is just the agent itself, so that the higher-order probabilities are all 0 or 1 and the first-order probabilities are known. (All this is laid out explicitly by the author.) This seems to put higher-order probability on very shaky footing, at least to my ear.

Consider the following situation concerning an event A:

P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2

In English, we do not know if the probability of A is 1/3 or 2/3-- we assign 50% probability to each.

Now, two plausible principles of higher-order reasoning are as follows.

  • Expectation principle: We can find our first-order belief in event A by weighted averaging over the possibilities; i.e., P(A)=E(P(A)), where E is the expectation operator (which takes the average value, weighted by probability). This is necessary to convert higher-order distributions into numbers we can use to place bets, etc.
  • Lifting principle: If we know X, we know P(X)=1. This paper argues for a related rule: P(X)=p if and only if P(P(X)=p)=1. Either rule is sufficient for the argument that follows; specifically, I only need that P(X)=p implies P(P(X)=p)=1. (This is a special case of my "lifting," instantiating X to P(X)=p, and a weakening of their coherence-based principle, since it takes just one direction of the "if and only if.") [It is interesting to note the similarity of these rules to proposed rules for the truth predicate; in fact, that is what started my investigation. However, I won't give significant space to that now.]
Granting both of these rules, it is not hard to see that any non-trivial higher-order distribution will be contradictory. Returning to the example at hand, we apply the two rules in succession:

P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
->
P(A) = (1/3)(1/2) + (2/3)(1/2) = 1/2 via expectation,
->
P(P(A)=1/2)=1 via lifting.

But this contradicts the original distribution; we have

P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
P(P(A)=1/2)=1,

which sums to 3/2, violating the laws of probability.
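
To see the clash mechanically, here is a minimal Python sketch of the same arithmetic, using exact fractions; nothing in it goes beyond what the example already states:

    from fractions import Fraction as F

    # The original higher-order distribution over the value of P(A).
    higher_order = {F(1, 3): F(1, 2), F(2, 3): F(1, 2)}

    # Expectation principle: flatten to a first-order value, P(A) = E(P(A)).
    p_a = sum(v * w for v, w in higher_order.items())
    print(p_a)  # 1/2

    # Lifting principle: P(A) = 1/2 is now known, so P(P(A) = 1/2) = 1.
    higher_order[p_a] = F(1)

    # Total probability mass the distribution now assigns:
    print(sum(higher_order.values()))  # 3/2 -- more than 1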

For higher-order probability to be non-trivial, then, it seems we must reject one or both of my given principles. If this is granted, I think the best view is that the expectation principle should be abandoned; after all, the paper I cited gives a pretty good argument for a form of the lifting principle. One way of looking at the expectation principle is that it flattens the higher-order distribution into a first-order one, trivialising it.

However, I want to preserve both principles. Both seem totally reasonable to me, and I've found a way to keep a form of each without trouble. I will argue that the problem here is one of equivocation: the notation being used is not specific enough. (The argument I use comes from my brother.)

For concreteness, suppose the event 'A' I mentioned is flipping a coin and getting heads. One scenario which fits the probabilities I gave is as follows. We have two unfair coins in a bag; one gives heads 2/3 of the time, and the other gives heads 1/3 of the time. If we want a fair chance of heads, we can just draw at random and flip; half the time we will get the coin biased towards heads, and half the time we will get the one biased against. Since the bias is equal in each direction, we will get heads 50% of the time. This illustrates the expectation principle in action, yet it does not trivialise the higher-order characterisation of the situation: if we draw one coin out of the bag and keep it out, we can try to determine whether it is the 1/3 coin or the 2/3 coin through experimentation. This involves a Bayesian update on the higher-order distribution I gave.
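
As a rough sketch of that update (my own illustration; the 2-heads-in-3-flips observation is just an arbitrary example):

    from fractions import Fraction as F

    # Prior over which coin we drew: the 1/3-heads coin or the 2/3-heads coin.
    prior = {F(1, 3): F(1, 2), F(2, 3): F(1, 2)}

    def update(prior, heads, flips):
        # Likelihood of the observations under each hypothesis about the coin.
        like = {p: p**heads * (1 - p)**(flips - heads) for p in prior}
        evidence = sum(prior[p] * like[p] for p in prior)
        return {p: prior[p] * like[p] / evidence for p in prior}

    # Seeing 2 heads in 3 flips shifts belief towards the 2/3 coin.
    print(update(prior, 2, 3))  # {1/3: 1/3, 2/3: 2/3}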

The intuition, then, is that the "flattened" probability (1/2) and the "non-flat" probabilities (1/3 and 2/3) are the limiting frequencies of very different experiments. There is no contradiction in the two distributions because they are talking about different things. 1/2 is the limiting frequency of drawing, flipping, and re-drawing; 1/3 or 2/3 is what we get if we draw a coin and don't re-draw as we continue to flip (and we will find that we get each of those 1/2 of the time).
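
A small Monte Carlo sketch makes the two experiments concrete (the sample size is arbitrary):

    import random

    COINS = [1/3, 2/3]  # heads-probabilities of the two coins in the bag
    N = 100000

    def flip(p):
        return random.random() < p

    # Experiment 1: draw, flip once, put the coin back, and repeat.
    heads = sum(flip(random.choice(COINS)) for _ in range(N))
    print(heads / N)  # ~0.5: the "flattened" first-order probability

    # Experiment 2: draw once, keep the coin, and flip repeatedly.
    coin = random.choice(COINS)
    heads = sum(flip(coin) for _ in range(N))
    print(heads / N)  # ~0.33 or ~0.67, depending on which coin we drew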

So, how do we formalise that? One way, which I particularly like, is to let the probability operator bind variables.

I'll use the notation P[x](A)=e to represent P binding variable x (where A is some statement with x as a free variable). Intuitively, the meaning is that if we choose x randomly, the statement A has probability e of being true. In addition, we can talk about choosing multiple variables simultaneously; this will be given notation like P[x,y,z](A)=e.

In this notation, the two principles become the following:

  • Expectation principle: If we have a higher-order distribution represented by a collection of statements P[x](P[y](A)=a)=b, P[x](P[y](A)=c)=d, et cetera, then we can form the first-order distribution P[x,y](A)=E[x](P[y](A)), where E[x](P[y](A)) is the expected value of P[y](A), choosing x randomly.
  • Lifting principle: If we know some statement A (possibly with free variables), we can conclude P[x](A)=1, for any variable. (Together with the expectation principle, this implies the related if-and-only-if statement from the paper I cited.)
With these rules, the argument from earlier has a different conclusion:


P[x](P[y](A)=1/3)=1/2
P[x](P[y](A)=2/3)=1/2
->
P[x,y](A)=1/2 via expectation,
->
P[z](P[x,y](A)=1/2)=1 via lifting.

This makes my earlier point about equivocation clear: the concluding probability is not about P[y](A) at all, but rather, is about P[x,y](A).
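
To make the binding reading concrete, here is a toy model of the coin scenario (my own encoding, not a full formalisation): x ranges over the coins, y over a single flip.

    from fractions import Fraction as F

    # x ranges over the coins in the bag; A is "the flip lands heads".
    coin_dist = {F(1, 3): F(1, 2), F(2, 3): F(1, 2)}  # distribution over x

    # P[y](A) for each fixed x is just that coin's bias.
    p_y = {x: x for x in coin_dist}

    # P[x,y](A): choose x and y together (draw, flip, and re-draw).
    p_xy = sum(coin_dist[x] * p_y[x] for x in coin_dist)
    print(p_xy)  # 1/2

    # P[x](P[y](A)=1/3): choose x randomly and ask about P[y](A).
    print(sum(w for x, w in coin_dist.items() if p_y[x] == F(1, 3)))  # 1/2

Both statements hold at once: P[x,y](A)=1/2 and P[x](P[y](A)=1/3)=1/2 describe different random choices, so there is no conflict.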

There are many details to fill out here; we'd like convenient notation for probability densities, a proper analysis of what subjective probabilities are like in this system, and several other things. However, I feel these are better left to a paper rather than a blog post. :)

PS-- if anyone has references on similar systems, that'd be great!
PPS-- This paper talks about similar issues. :D

Sunday, December 5, 2010

Three Dreams

(1) I wish I had time to build self-education software. This is a wish carried over from my brother, who came up with the idea: a system which hands you exactly the practice problem you need at the moment, to challenge you and keep your interest, but not give you something you can't do. Ideally, the problem sets would be created and shared in a wiki-like online community (which is the main difference from current high-end education software!). The broadness enabled by this could revolutionise education, job training, and certification. (Since the program estimates your skill level to determine which problem to give you, sitting potential employees down for a session on the machine would give valuable information about their abilities.)

(2) I would also like to have the time to build powerful "open work" software in the spirit of mechanical turk, love machine, rentacoder, and others. The hope would be to emphasise love machine's social aspect while keeping mechanical turk's "all are welcome" aspect. Ideally, the environment would also feel like stack overflow-- you can build reputation by answering questions for free, so that you are more likely to get paid in the future. There are a lot of issues to be dealt with here! The nature of contracts, the details of any "reputation" system... it's complicated. However, it would be great for the efficiency of the intellectual labour market!

(3) I want to build a Bayesian reasoning system capable of acting as mankind's store of knowledge, so that all arguments for and against certain points can be kept track of and weighed properly against the evidence. This is the least realistic of the three dreams, but if it could replace/augment Wikipedia, it would help to end debates like the one over global warming (or, rather, force the debate to proceed entirely from empirical evidence and in a logically correct manner). There are lots of big problems with this one, including the problem of verifying sources of empirical data.