Whereas first-order probabilities express uncertainty about the world, a higher-order probability expresses uncertainty about what value a first-order probability has. Or, if you ask a Frequentist instead of a Bayesian: first-order probabilities give limiting frequencies of results from experiments, whereas higher-order probabilities give limiting frequencies of getting certain limiting frequencies.
Uncertainty about uncertainty could be argued to defeat the whole point of having a plausible degree of belief... why can't it be collapsed into a straight degree of belief?
This paper tries to base higher-order probability on the idea that there is some "ideal expert" who knows the "true probability" that we are uncertain about. However, the paper leaves it up to the agent to decide what constitutes a "fully informed ideal expert." If the ideal expert knows which events actually happen, then the first-order probabilities will all be 0 or 1, so the higher-order probabilities take on the role that first-order probabilities usually play. At the opposite extreme, the "ideal expert" is just the agent itself, so that the higher-order probabilities are all 1 or 0 and the first-order probabilities are known. (All of this is laid out explicitly by the author.) This seems to put higher-order probability on very shaky footing, at least to my ear.
Consider the following situation concerning an event A:
P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
In English, we do not know if the probability of A is 1/3 or 2/3-- we assign 50% probability to each.
Now, two plausible principles of higher-order reasoning are as follows.
- Expectation principle: We can find our first-order belief in event A by a weighted average over the possibilities; i.e., P(A)=E(P(A)), where E is the expectation operator (which takes the average value, weighted by probability). This is necessary to convert higher-order distributions into numbers we can use to place bets, etc.
- Lifting principle: If we know X, we know P(X)=1. This paper argues for a related rule: P(X)=p if and only if P(P(X)=p)=1. Either rule is sufficient for the argument that follows; specifically, I only need that P(X)=p implies P(P(X)=p)=1. (This is a special case of my "lifting" rule, instantiating X to P(X)=p, and a weakening of their coherence-based principle, since it takes just one direction of the "if and only if".) [It is interesting to note the similarity of these rules to proposed rules for the truth predicate, and in fact that is what started my investigation of this; however, I won't give significant space to that now.]
P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
->
P(A)=E(P(A))=(1/2)(1/3)+(1/2)(2/3)=1/2 via expectation,
->
P(P(A)=1/2)=1 via lifting.
But this contradicts the original distribution; we have
P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
P(P(A)=1/2)=1,
and since the values 1/3, 2/3, and 1/2 are mutually exclusive possibilities for P(A), these probabilities should sum to at most 1; instead they sum to 2, violating the laws of probability.
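To make the clash explicit, here's a minimal Python sketch; the representation as a dictionary of candidate values is just my own illustration, not anything from the paper I cited.

```python
from fractions import Fraction as F

# Higher-order distribution: P(P(A) = v) = w for each candidate value v.
higher_order = {F(1, 3): F(1, 2), F(2, 3): F(1, 2)}

# Expectation principle: P(A) = E(P(A)), the probability-weighted average.
flattened = sum(v * w for v, w in higher_order.items())
print("P(A) via expectation:", flattened)        # 1/2

# Lifting principle: having concluded P(A) = 1/2, assert P(P(A) = 1/2) = 1.
combined = dict(higher_order)
combined[flattened] = F(1)

# The values 1/3, 2/3, and 1/2 are mutually exclusive possibilities for P(A),
# so their probabilities should sum to at most 1 -- here they sum to 2.
print("total mass:", sum(combined.values()))     # 2
```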
For higher-order probability to be non-trivial, then, it seems we must reject one or both of my given principles. If this is granted, I think the best view is that the expectation principle should be abandoned; after all, that paper I cited gives a pretty good argument for a form of lifting principle. One way of looking at the expectation principle is that it flattens the higher-order distribution to a first-order one, trivialising it.
However, I want to preserve both principles. Both seem totally reasonable to me, and I've found a way to keep a form of each without any problem. I will argue that the problem here is one of equivocation; the notation being used is not specific enough. (The argument I use comes from my brother.)
For concreteness, suppose the event 'A' I mentioned is flipping a coin and getting heads. One scenario which fits the probabilities I gave is as follows. We have two unfair coins in a bag; one gives heads 2/3 of the time, and the other gives heads 1/3 of the time. If we want a fair chance of heads, we can just draw at random and flip; half the time we will get the coin biased towards heads, and half the time we will get the one biased against. Since the bias is equal in each direction, we will get heads 50% of the time. This illustrates the expectation principle in action, yet it does not trivialise the higher-order characterisation of the situation: if we draw one coin out of the bag and keep it out, we can try to determine whether it is the 1/3 coin or the 2/3 coin through experimentation. This involves a Bayesian update on the higher-order distribution I gave.
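Here's a minimal sketch of that update in Python, under my two-coin reading of the scenario; the particular flip sequence is made up purely for illustration.

```python
from fractions import Fraction as F

# Prior over which coin we drew: P(P(heads) = 1/3) = P(P(heads) = 2/3) = 1/2.
prior = {F(1, 3): F(1, 2), F(2, 3): F(1, 2)}

def update(dist, flip):
    """One Bayesian update of the higher-order distribution on an observed flip."""
    unnormalised = {p: w * (p if flip == "H" else 1 - p) for p, w in dist.items()}
    total = sum(unnormalised.values())
    return {p: w / total for p, w in unnormalised.items()}

posterior = prior
for flip in "HHTH":   # a made-up sequence of flips with the kept coin
    posterior = update(posterior, flip)

print(posterior)      # the mass shifts toward the 2/3 coin
```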
The intuition, then, is that the "flattened" probability (1/2) and the "non-flat" probabilities (1/3 and 2/3) are the limiting frequencies of very different experiments. There is no contradiction in the two distributions because they are talking about different things. 1/2 is the limiting frequency of drawing, flipping, and re-drawing; 1/3 or 2/3 is what we get if we draw a coin and don't re-draw as we continue to flip (and we will find that we get each of those 1/2 of the time).
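A quick simulation makes the difference between the two experiments concrete. This is only a sketch of the two-coin story above, with an arbitrary sample size:

```python
import random

BIASES = [1/3, 2/3]    # the two coins in the bag
N = 100_000

# Experiment 1: draw a coin at random, flip it once, put it back, repeat.
heads = sum(random.random() < random.choice(BIASES) for _ in range(N))
print("draw, flip, re-draw:", heads / N)        # close to 1/2

# Experiment 2: draw one coin, keep it out, and flip it many times.
kept = random.choice(BIASES)
heads = sum(random.random() < kept for _ in range(N))
print("draw once, keep flipping:", heads / N)   # close to 1/3 or to 2/3
```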
So, how do we formalise that? One way, which I particularly like, is to let the probability operator bind variables.
I'll use the notation P[x](A)=e to represent P binding variable x (where A is some statement with x as a free variable). Intuitively, the meaning is that if we choose x randomly, the statement A has probability e of being true. In addition, we can talk about choosing multiple variables simultaneously; this will be given notation like P[x,y,z](A)=e.
In this notation, the two principles become the following:
- Expectation principle: If we have a higher-order distribution represented by a collection of statements P[x](P[y](A)=a)=b, P[x](P[y](A)=c)=d, et cetera, then we can form the first-order distribution P[x,y](A)=E[x](P[y](A)), where E[x](P[y](A)) is the expected value of P[y](A), choosing x randomly.
- Lifting principle: If we know some statement A (possibly with free variables), we can conclude P[x](A)=1 for any variable x. (Together with the expectation principle, this implies the related if-and-only-if statement from the paper I cited.)
P[x](P[y](A)=1/3)=1/2
P[x](P[y](A)=2/3)=1/2
->
P[x,y](A)=1/2 via expectation,
->
P[z](P[x,y](A)=1/2)=1 via lifting.
This makes my earlier point about equivocation clear: the concluding probability is not about P[y](A) at all, but rather, is about P[x,y](A).
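To see the two operators side by side, here's a small sketch that models x (which coin was drawn) and y (the flip outcome given that coin) explicitly; the names and representation are just my own choices for illustration.

```python
from fractions import Fraction as F

# x: which coin was drawn (each with probability 1/2);
# y: the outcome of the flip given that coin. A is the event "heads".
p_x = {"1/3 coin": F(1, 2), "2/3 coin": F(1, 2)}
p_heads_given_x = {"1/3 coin": F(1, 3), "2/3 coin": F(2, 3)}

# P[y](A) with x held fixed: a different number for each value of x.
for x in p_x:
    print("P[y](A) given", x, ":", p_heads_given_x[x])   # 1/3 and 2/3

# P[x,y](A): choose x and y together (the draw-and-re-draw experiment).
p_xy = sum(p_x[x] * p_heads_given_x[x] for x in p_x)
print("P[x,y](A):", p_xy)                                 # 1/2

# The lifted statement P[z](P[x,y](A)=1/2)=1 is about p_xy, not about the
# values of P[y](A), so it does not clash with the higher-order distribution
# P[x](P[y](A)=1/3) = P[x](P[y](A)=2/3) = 1/2.
```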
There are many details to fill out here; we'd like convenient notation for probability densities, a proper analysis of what subjective probabilities are like in this system, and several other things. However, I feel these are better left to a paper rather than a blog post. :)
PS-- if anyone has references on similar systems, that'd be great!
PPS-- This paper talks about similar issues. :D