Whereas first-order probabilities express uncertainty about the world, a higher-order probability expresses uncertainty about what value a first-order probability has. Or, if you ask a Frequentist instead of a Bayesian: first-order probabilities give limiting frequencies of results from experiments, whereas higher-order probabilities give limiting frequencies of getting certain limiting frequencies.
Uncertainty about uncertainty could be argued to defeat the whole point of having a definite degree of belief... if we are unsure what our probability is, why can't the whole thing be collapsed into a single straight degree of belief?
This paper tries to base higher-order probability on the idea that there is some "ideal expert" who knows the "true probability" we are uncertain about. However, the paper leaves it up to the agent to decide what constitutes a "fully informed ideal expert." If the ideal expert knows which events actually happen, then the first-order probabilities are all 0 or 1, and the higher-order probabilities take on the role that first-order probabilities usually play. At the opposite extreme, the "ideal expert" is just the agent itself, so that the higher-order probabilities are all 1 or 0 and the first-order probabilities are known. (All of this is laid out explicitly by the author.) This seems to put higher-order probability on very shaky footing, at least to my ear.
Consider the following situation concerning an event A:
P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
In English, we do not know if the probability of A is 1/3 or 2/3-- we assign 50% probability to each.
Now, two plausible principles of higher-order reasoning are as follows.
- Expectation principle: We can find our first-order belief in event A by averaging over the possibilities, weighted by their probabilities; i.e., P(A)=E(P(A)), where E is the expectation operator. This is necessary to convert higher-order distributions into numbers we can use to place bets, etc.
- Lifting principle: If we know X, we know P(X)=1. This paper argues for a related rule: P(X)=p if and only if P(P(X)=p)=1. Either rule is sufficient for the argument that follows; specifically, I only need that P(X)=p implies P(P(X)=p)=1. (This is a special case of my "lifting", by instantiating X to P(X)=p, and a weakening of their coherence-based principle, since it takes just one direction of the "if and only if".) [It is interesting to note the similarity of these rules to proposed rules for the truth predicate, and in fact that is what started my investigation of this; however, I won't give significant space to that now.]
P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
->
P(A)=1/2 via expectation,
->
P(P(A)=1/2)=1 via lifting.
But this contradicts the original distribution; we have
P(P(A)=1/3)=1/2
P(P(A)=2/3)=1/2
P(P(A)=1/2)=1,
which sums to 2. Since the three events are mutually exclusive, their probabilities can sum to at most 1, so this violates the laws of probability.
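To make the violation concrete, here is a minimal check in Python (nothing assumed beyond the three assignments just derived):

```python
# The three statements "P(A)=1/3", "P(A)=2/3", "P(A)=1/2" are mutually
# exclusive events about the value of P(A).
higher_order = {
    "P(A)=1/3": 1/2,  # given
    "P(A)=2/3": 1/2,  # given
    "P(A)=1/2": 1,    # forced by expectation followed by lifting
}

# Probabilities of mutually exclusive events must sum to at most 1.
print(sum(higher_order.values()))  # 2.0 -- incoherent
```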
For higher-order probability to be non-trivial, then, it seems we must reject one or both of my given principles. If this is granted, I think the best view is that the expectation principle should be abandoned; after all, that paper I cited gives a pretty good argument for a form of lifting principle. One way of looking at the expectation principle is that it flattens the higher-order distribution to a first-order one, trivialising it.
However, I want to preserve both principles. Both seem totally reasonable to me, and I've found a solution for keeping a form of each without any problem. I will argue that the problem here is one of equivocation; the notation being used is not specific enough. (The argument I use comes from my brother.)
For concreteness, suppose the event 'A' I mentioned is flipping a coin and getting heads. One scenario which fits the probabilities I gave is as follows. We have two unfair coins in a bag; one gives heads 2/3 of the time, and the other gives heads 1/3 of the time. If we want a fair chance of heads, we can just draw at random and flip; half the time we will get the coin biased towards heads, and half the time the one biased against. Since the bias is equal in each direction, we will get heads 50% of the time. This illustrates the expectation principle in action, yet it does not trivialise the higher-order characterisation of the situation: if we draw one coin out of the bag and keep it out, we can try to determine whether it is the 1/3 coin or the 2/3 coin through experimentation. This involves a Bayesian update on the higher-order distribution I gave.
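As a sketch of that update (my own illustration; the coin biases are the ones from the story above):

```python
import random

def bayes_update(prior_13, flips):
    """Posterior probability that the drawn coin is the 1/3-heads coin,
    given a sequence of observed flips ('H' or 'T')."""
    p = prior_13
    for f in flips:
        like_13 = 1/3 if f == 'H' else 2/3   # likelihood under the 1/3 coin
        like_23 = 2/3 if f == 'H' else 1/3   # likelihood under the 2/3 coin
        p = p * like_13 / (p * like_13 + (1 - p) * like_23)
    return p

# Secretly draw the 2/3 coin and flip it 100 times; the posterior over the
# higher-order distribution then typically concentrates near 0 (the 2/3 coin).
flips = ['H' if random.random() < 2/3 else 'T' for _ in range(100)]
print(bayes_update(0.5, flips))
```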
The intuition, then, is that the "flattened" probability (1/2) and the "non-flat" probabilities (1/3 and 2/3) are the limiting frequencies of very different experiments. There is no contradiction in the two distributions because they are talking about different things. 1/2 is the limiting frequency of drawing, flipping, and re-drawing; 1/3 or 2/3 is what we get if we draw a coin and don't re-draw as we continue to flip (and we will find that we get each of those 1/2 of the time).
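A quick simulation (again just a sketch) shows the two limiting frequencies coming apart:

```python
import random

COINS = [1/3, 2/3]  # heads probabilities of the two coins in the bag
N = 100_000

# Experiment 1: draw, flip, and re-draw every time.
heads = sum(random.random() < random.choice(COINS) for _ in range(N))
print(heads / N)  # ~0.5, the "flattened" probability

# Experiment 2: draw once, then keep flipping the same coin.
coin = random.choice(COINS)
heads = sum(random.random() < coin for _ in range(N))
print(heads / N)  # ~1/3 or ~2/3, each with probability 1/2 over draws
```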
So, how do we formalise that? One way, which I particularly like, is to let the probability operator bind variables.
I'll use the notation P[x](A)=e to represent P binding variable x (where A is some statement with x as a free variable). Intuitively, the meaning is that if we choose x randomly, the statement A has probability e of being true. In addition, we can talk about choosing multiple variables simultaneously; this will be given notation like P[x,y,z](A)=e.
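To give a rough feel for the intended reading, here is one toy interpretation (entirely my own sketch, with uniform choice over finite domains standing in for "choose x randomly"):

```python
from itertools import product

def P(domains, statement):
    """A toy reading of P[x1,...,xn](statement): the fraction of joint
    choices of the bound variables (uniform over finite domains) that
    make the statement true."""
    choices = list(product(*domains))
    return sum(bool(statement(*c)) for c in choices) / len(choices)

# Example: x ranges over a die roll; A(x) = "x is even".
die = range(1, 7)
print(P([die], lambda x: x % 2 == 0))  # 0.5

# Two bound variables chosen simultaneously: P[x,y](x + y is even).
print(P([die, die], lambda x, y: (x + y) % 2 == 0))  # 0.5
```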
In this notation, the two principles become the following:
- Expectation principle: If we have a higher-order distribution represented by a collection of statements P[x](P[y](A)=a)=b, P[x](P[y](A)=c)=d, et cetera, then we can form the first-order distribution P[x,y](A)=E[x](P[y](A)), where E[x](P[y](A)) is the expected value of P[y](A), choosing x randomly.
- Lifting principle: If we know some statement A (possibly with free variables), we can conclude P[x](A)=1, for any variable. (Together with the expectation principle, this implies the related if-and-only-if statement from the paper I cited.)
P[x](P[y](A)=1/3)=1/2
P[x](P[y](A)=2/3)=1/2
->
P[x,y](A)=1/2 via expectation,
->
P[z](P[x,y](A)=1/2)=1 via lifting.
This makes my earlier point about equivocation clear: the concluding probability is not about P[y](A) at all, but rather about P[x,y](A).
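We can replay the coin scenario in this reading (continuing my toy sketch from above): x is the coin drawn, y is the flip, and the two distributions describe different quantities, so they coexist without incoherence.

```python
COINS = [1/3, 2/3]   # x: the coin drawn from the bag (uniformly)

def P_y_heads(x):
    # For a fixed coin x, P[y](A) is just that coin's heads probability,
    # with y standing for the randomness of the flip.
    return x

# Higher-order statements: over a random x, P[y](A) is 1/3 or 2/3, each half the time.
print(sum(P_y_heads(x) == 1/3 for x in COINS) / len(COINS))  # 0.5
print(sum(P_y_heads(x) == 2/3 for x in COINS) / len(COINS))  # 0.5

# The flattened statement concerns P[x,y](A) = E[x](P[y](A)), a different quantity:
p_xy = sum(P_y_heads(x) for x in COINS) / len(COINS)
print(p_xy)  # 0.5 (up to float rounding) -- lifting then gives
             # P[z](P[x,y](A)=1/2)=1, with no clash against the lines above.
```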
There are many details to fill out here; we'd like convenient notation for probability densities, a proper analysis of what subjective probabilities are like in this system, and several other things. However, I feel these are better left to a paper rather than a blog post. :)
PS-- if anyone has references on similar systems, that'd be great!
PPS-- This paper talks about similar issues. :D
Let me ask a trivial question: what is the semantics in terms of a probability measure (over a fixed space of "possible worlds", say)? I think that when someone says P(X>v)=p, he means, "oh, I have that probability space in my backpack, and I've measured it over there". Is it about "distributions in the backpack", or is it about classes of distributions that satisfy the constraints? Going the logic way, one could say, "this sentence is satisfiable, look at this possible-world distribution", or "this sentence is valid, every so-and-so-chosen distribution must satisfy that-and-that". (Perhaps you've implicitly answered this on the mailing list.) Higher-order probabilities are then branching worlds, i.e. worlds with pointers to distributions over worlds of a lower order? This is the stack, right? In what sense do you talk about flattening it? Sorry for all the trivial questions. ;-)
Lukasz,
Actually, I think the appropriate model takes on an informative structure... it would, if I am not mistaken, consist of a regular first-order model (assigning all the non-probabilistic statements to true/false) and a single first-order probability distribution, representing the distribution from which entities are drawn randomly (the distro of the bound variables). All the higher-order distributions are to be *determined* by this first-order distribution.
(An expanded syntax might include a way of indicating various distributions for the bound values to be drawn from, in which case the model would include many.)
All the higher-order probabilities arise from the different ways of "cutting" this first-order distribution... i.e., the ways of measuring related subsets via assertions with multiple free variables. (An assertion with 1 free variable defines a subset with a probability; one with 2 free variables defines a randomly chosen subset with a random probability (in 2 different ways); and one with 3 free variables defines a randomly chosen one of *those*... all made more complicated by the possibility of choosing simultaneously rather than in order, in which case 2 free vars represent a relation...)
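As a small illustration of that "cutting" (my own sketch, reusing the coin example from the post): a single joint distribution over (x, y) determines, for each random x, a probability over y -- a random probability.

```python
# One first-order distribution over pairs (x, y): x is the coin drawn,
# y is the flip outcome. Everything higher-order is read off from it.
joint = {}
for bias in (1/3, 2/3):
    joint[(bias, 'H')] = 0.5 * bias        # P(draw this coin and flip heads)
    joint[(bias, 'T')] = 0.5 * (1 - bias)  # P(draw this coin and flip tails)

# Cutting along x: the assertion A(x, y) = "y is heads", with x held fixed,
# defines a probability over y that is itself a random quantity in x.
for bias in (1/3, 2/3):
    p_x = joint[(bias, 'H')] + joint[(bias, 'T')]   # P(x = bias) = 0.5
    p_A_given_x = joint[(bias, 'H')] / p_x          # the random "P[y](A)"
    print(f"x={bias:.3f}: P(x)={p_x}, P[y](A)={p_A_given_x:.3f}")
```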
In other words, I am trying to make higher-order probabilities better-founded by *removing* their role as "pure uncertainty" about a first-order distro, and preferring a frequency-based definition. (As you said, Bayesianism is just frequentism over possible worlds.)
On the other hand, like you say, we can make a notion of validity in which we consider all models (defined as above) which are consistent with the statements we have; then certain 2nd-order probabilities could be seen to partially constrain a set of first-order distributions. However, the notation still requires a variable to draw from. To be fully "realist" about 2nd-order uncertainty might require what modal folks call 2-dimensional semantics-- every predicate would be indexed by 2 possible worlds, so that first-order subjective probabilities choose one of these randomly, and 2nd-order probabilities take the other... (Obviously, N-dimensional semantics would be required for "realist" Nth-order probability, which means we would always have some bound, which doesn't make sense.)
So anyway, I have to think more on how to account for "fully subjective" higher-order probabilities-- but hopefully I've answered your question for the system as it is currently spelled out.
OK, clearly I was overthinking (or underthinking, rather) that bit about subjective higher-order probabilities... it's possible to simply quantify over possible first-order distributions, rather than trying to quantify over possible worlds in which distributions might hold.