I've been talking recently about robust statistics, and the consequences of replacing means with medians. However, I've only looked at this in a fairly limited way, asking about one particular distribution (the bell curve). Mean values are everywhere in statistics; perhaps to a greater degree than you realize, because we often refer to the mean value as the "expected value". It's a simple alias for the same thing, but that may be easy to forget when we are taking expectations everywhere.
In some sense, the "expectation" seems to be a more basic concept than the "mean". We could think of the mean as simply one way of formalizing the intuitive notion of expected value. What happens if we choose a different formalization? What if we choose the median?
The post on altering the bell curve is (more or less) an exploration of what happens to some of classical statistics if we do this. What happens to Bayesian theory?
The foundations of Bayesian statistics are really not touched at all by this. A Bayesian does not rely as heavily on "statistics" in the way a frequentist statistician does. A statistic is a number derived from a dataset which gives some sort of partial summary. We can look at mean, variance, and higher moments; correlations; and so on. We distinguish between the sample statistic (the number derived from the data at hand) and the population statistic (the "true" statistic which we could compute if we had all the examples, ever, of the phenomenon we are looking at). We want to estimate the population statistics, so we talk about estimators; these are numbers derived from the data which are supposed to be similar to the true values. Unbiased estimators are an important concept: ways of estimating population statistics whose expected values are exactly the population statistics.
These concepts are not exactly discarded by Bayesians, since they may be useful approximations. However, to a Bayesian, a distribution is a more central object. A statistic may be a misleading partial summary. The mean (/mode/median) is sort of meaningless when a distribution is multimodal. Correlation does not imply... much of anything (because it assumes a linear model!). Bayesian statistics still has distribution parameters, which are directly related to population statistics, but frequentist "estimators" are not fundamental because they only provide point estimates. Fundamentally, it makes more sense to keep a distribution over the possibilities, assigning some probability to each option.
However, there is one area of Bayesian thought where expected value makes a great deal of difference: Bayesian utility theory. The basic law of utility theory is that we choose actions so as to maximize expected value. Changing the definition of "expected" would change everything! The current idea is that in order to judge between different actions (or plans, policies, designs, et cetera) we look at the average utility achieved with each option, according to our probability distribution over the possible results. What if we computed the median utility rather than the average? Let's call this "robust utility theory".
From the usual perspective, robust utility would perform worse: to the extent that we take different actions, we would get a lower average utility. This begs the question of whether we care about average utility or median utility, though. If we are happy to maximize median utility, then we can similarly say that the average-utility maximizers are performing poorly by our standards.
At first, it might not be obvious that the median is well-defined for this purpose. The median value coming from a probability distribution is defined to be the median in the limit of infinite independent samples from that distribution, though. Each case will contribute instances in proportion to its probability. What we end up doing is lining up all the possible consequences of our choice in order of utility, with a "width" determined by the probability of each, and taking the utility value of whatever consequence ends up in the middle. So long as we are willing to break ties somehow (as is usually needed with the median), it is actually well-defined more often than the mean! We avoid problems with infinite expected value. (Suppose I charge you to play a game where I start with a $1 pot, and start flipping a coin. I triple the pot every time I get heads. Tails ends the game, and I give you the pot. Money is all you care about. How much should you be willing to pay to play?)
Since the median is more robust than the mean, we also avoid problems dealing with small-probability but disproportionately high-utility events. The typical example is Pascal's Mugging. Pascal walks up to you and says that if you don't give him your wallet, God will torture you forever in hell. Before you object, he says: "Wait, wait. I know what you are thinking. My story doesn't sound very plausible. But I've just invented probability theory, and let me tell you something! You have to evaluate the expected value of an action by considering the average payoff. You multiply the probability of each case by its utility. If I'm right, then you could have an infinitely large negative payoff by ignoring me. That means that no matter how small the probability of my story, so long as it is above zero, you should give me your wallet just in case!"
A Robust Utility Theorist avoids this conclusion, because small-probability events have a correspondingly small effect on the end result, no matter how high a utility we assign.
Now, a lot of nice results (such as the representation theorem) have been derived for average utilities over the years. Naturally, taking a median utility might do all kinds of violence to these basic ideas in utility theory. I'm not sure how it would all play out. It's interesting to think about, though.
Monday, February 11, 2013
Sunday, February 10, 2013
Philosopher's Carnival #148
Hi all! This month, I have the honor of hosting the monthly blog review, Philosopher's Carnival. This reviews some of the best philosophy postings of the previous month.
Two Metaphysical Pictures, by Richard Yetter Chappell of Philosophy et cetera, outlines a broad classification of metaphysical theories into two different types. (For the record, I prefer the second type! However, as one comment rightly points out, we should avoid lumping views together to create false dichotomies in this way.)
Special relativity and the A-theory, at Alexander Pruss's Blog, discusses the relationship between the philosophical view of time and what we know from physics. (This is related to the two views of the previous post, at least to the extent that you buy the false dichotomy.)
A Question About Religious Experience and Safety Accounts of Knowledge, by ex-apologist, discusses the relationship between the safety condition for knowledge and Christian epistemology.
Metaphysical Skepticism a la Kriegel, by Eric Schwitzgebel of The Splintered Mind, reviews a paper by Uriah Kriegel which suggests that although there may be meaningful metaphysical questions about which there are true and false answers, we cannot arrive at knowledge of those answers by any means. In particular, there is no means by which we can come to know if sets of objects have a literal existence or are merely mental constructs.
A pair of posts by Jeffrey Ketland of M-Phi discuss the Quine-Putnam Indispensability Argument: The Quine-Putnam Indispensability Argument and Other Formulations of the Quine-Putnam Indispensability Argument. The indispensability argument also deals with the question of whether sets (and other mathematical constructions) have a literal existence. The idea is that our best scientific theories make use of sets of objects (and other mathematical constructs), so we must either accept their existence or reject our best science.
Grim Reapers vs. Uncaused Beginnings, by Joshua Rasmussen of Prosblogion, gives a discussion of some "grim reaper" arguments. The Grim Reaper argument is an argument which is supposed to show the implausibility of an infinite past. Joshua shows that a very similar argument would conclude that a finite past with an uncaused beginning is equally implausible.
A Modification to Lewis's Theory of Counterfactuals, by Tristan Haze of Sprachlogik, questions the role that "similarity" of possible worlds should play in our evaluation of counterfactuals.
Computational Metaphysics, by Tomkow, provides a metaphysical companion to computational physics. The idea is illustrated by giving a computational-metaphysics account of counterfactuals, including Lewis's "similarity".
Substitution and Models, Part 1: Bolzano, Quine, Tarski and Boolos, by Jason of Metaphysical Values, reviews the debate between substitution-based understanding of quantifiers and model-theoretic accounts (in preparation for a series of posts about the issue).
That's it for this month! Tune in for the next carnival at http://blog.kennypearce.net/, March 10. If you spy any interesting philosophy articles, submit them for the carnival!
Two Metaphysical Pictures, by Richard Yetter Chappell of Philosophy et cetera, outlines a broad classification of metaphysical theories into two different types. (For the record, I prefer the second type! However, as one comment rightly points out, we should avoid lumping views together to create false dichotomies in this way.)
Special relativity and the A-theory, at Alexander Pruss's Blog, discusses the relationship between the philosophical view of time and what we know from physics. (This is related to the two views of the previous post, at least to the extent that you buy the false dichotomy.)
A Question About Religious Experience and Safety Accounts of Knowledge, by ex-apologist, discusses the relationship between the safety condition for knowledge and Christian epistemology.
Metaphysical Skepticism a la Kriegel, by Eric Schwitzgebel of The Splintered Mind, reviews a paper by Uriah Kriegel which suggests that although there may be meaningful metaphysical questions about which there are true and false answers, we cannot arrive at knowledge of those answers by any means. In particular, there is no means by which we can come to know if sets of objects have a literal existence or are merely mental constructs.
A pair of posts by Jeffrey Ketland of M-Phi discuss the Quine-Putnam Indispensability Argument: The Quine-Putnam Indispensability Argument and Other Formulations of the Quine-Putnam Indispensability Argument. The indispensability argument also deals with the question of whether sets (and other mathematical constructions) have a literal existence. The idea is that our best scientific theories make use of sets of objects (and other mathematical constructs), so we must either accept their existence or reject our best science.
Grim Reapers vs. Uncaused Beginnings, by Joshua Rasmussen of Prosblogion, gives a discussion of some "grim reaper" arguments. The Grim Reaper argument is an argument which is supposed to show the implausibility of an infinite past. Joshua shows that a very similar argument would conclude that a finite past with an uncaused beginning is equally implausible.
A Modification to Lewis's Theory of Counterfactuals, by Tristan Haze of Sprachlogik, questions the role that "similarity" of possible worlds should play in our evaluation of counterfactuals.
Computational Metaphysics, by Tomkow, provides a metaphysical companion to computational physics. The idea is illustrated by giving a computational-metaphysics account of counterfactuals, including Lewis's "similarity".
Substitution and Models, Part 1: Bolzano, Quine, Tarski and Boolos, by Jason of Metaphysical Values, reviews the debate between substitution-based understanding of quantifiers and model-theoretic accounts (in preparation for a series of posts about the issue).
That's it for this month! Tune in for the next carnival at http://blog.kennypearce.net/, March 10. If you spy any interesting philosophy articles, submit them for the carnival!
Monday, February 4, 2013
Weird Curves
Last time, I mentioned a question:
Taleb and others have mentioned that the bell curve (or Gaussian) does not deal with outliers well; it gives them a very small probability, and the parameter estimates end up being highly dependent on them.
Yet, one of the justifications of the Gaussian is that it's the max-entropy curve for a given mean and standard deviation.
Entropy is supposed to be a measure of the uncertainty associated with a distribution; so, shouldn't we expect that the max-entropy distribution would give as high a probability to outliers as possible?
There are several answers.
First: a basic problem is that phrase "a given mean and standard deviation". In particular, to choose a standard deviation is to choose an acceptable range for outliers (in some sense). If we have uncertainty about the standard deviation, it turns out the resulting curve has a polynomial decay rather than an exponential one! (This means distant outliers are far more probable.) Essentially, estimating deviations from data (using maximum-likelihood) makes us extremely overconfident that the data will fall in the range we've experienced before. A little Bayesian uncertainty (which still estimates from data, but admits a range of possibilities) turns out to be much less problematic in this respect.
This is definitely helpful, but doesn't solve the riddle: it still feels strange, that the max-entropy distribution would have such a sharp (super-eponential!) decay rate. Why is that?
My derivation will be somewhat heuristic, but I felt that I understood "why" much better by working it out this way than by following other proofs I found (which tend to start with the normal and show that no other distribution has greater entropy, rather than starting with desired features and deriving the normal). [Also, sorry for the poor math notation...]
First, let's pretend we have a discrete distribution over n points, xn. The result will apply no matter how many points we have, which means it applies in the limit of a continuous distribution. Continuous entropy is not the limit of discrete entropy, so I won't actually be maximising discrete entropy here; I'll maximise the discrete version of the continuous entropy formula:
f(x) maximising sum_i: f(xi)log(f(xi))
Next, we constrain the distribution to sum to a constant, have a constant mean, and have constant variance (which also makes the standard deviation constant):
sumi[f(xi)] = C1
sumi[xif(xi)] = C2
sumi[xi2f(xi)] = C3
To solve the constrained optimisation problem, we make lagrange multipliers for the constraints:
Lagrangian:
f(xi)log(f(xi))
- lambda1*(sumi[f(xi)]-C1)
- lambda2*(sumi[xif(xi)]-C2)
- lambda3*(sumi[xi2f(xi)]-C3)
Partial derivatives in the f(xi):
1 - log(f(xi))
- lambda1
- lambda2*xi
- lambda3*xi2
Setting this equal to zero and solving for f(xi):
f(xi) = 21 - lambda1 - lambda2*xi - lambda3*xi2
That's exactly the form of the Gaussian: a constant to the power of a 2nd-degree polynomial!
So, we can see where everything comes from: the exponential comes from our definition of entropy, and the function within the exponent comes from the Lagrange multipliers. The Gaussian is quadratic precisely because we chose a quadratic loss function! We can get basically any form we want by choosing a different loss function. If we use the kurtosis rather than the variance, we will get a fourth degree polynomial rather than a second degree one. If we choose an exponential function, we can get a doubly exponential probability distribution. And so on. There should be some limitations, but more or less, we can get any probability distribution we want, and claim that it is justified as the maximum-entropy distribution (fixing some measure of spread). We can even get rid of the exponential by putting a logarithm around our loss function.
Last time, I mentioned robust statistics, which attempts to make statistical techniques less sensitive to outliers. Rather than using the mean, robust statistics recommends using the median: whereas a sufficiently large outlier can shift the mean by an arbitrary amount, a single outlier has the same limited effect on the median no matter how extreme its value.
I also mentioned that it seems more intuitive to use the absolute deviation, rather than the squared deviation.
If we fix the absolute deviation and ask for the maximum-entropy function, we get something like e-|x| as our distribution. This is an ugly little function, but the maximum-likelihood estimate of the center of the distribution is precisely the median! e-|x| justifies the strategy of robust statistics, reducing sensitivity to outliers by making extreme outliers more probable. (The reason is: the max-likelihood estimate will be the point which minimizes the sum of the loss functions centred at each data point. The derivative at x is equal to the number of data points below x minus the number above x. Therefore the derivative is only zero when these two are equal. This is the minimum loss.)
What matters, of course, is not what nice theoretical properties a distribution may have, but how well it matches the true situation. Still, I find it very interesting that we can construct a distribution which justifies taking the median rather than the mean... and I think it's important to show how arbitrary the status of the bell curve as the maximum entropy distribution is.
Just to be clear: the Bayesian solution is not usually to think too much about what distributions might have the best properties. This is important, but when in doubt, we can simply take a mixture distribution over as large a class of hypotheses as we practically can. Bayesian updates give us nice convergence to the best option, while also avoiding overconfidence (for example, the asymptotic probability of outliers will be on the order of the most outlier-favouring distribution present).
Still, a complex machine learning algorithm may still need a simple one as a sub-algorithm to perform simple tasks; a genetic programming approximation of Bayes may need simpler statistical tools to make an estimation of distribution algorithm work. More generally, when humans build models, they tend to compose known distributions such as Gaussians to make them work. In such cases, it's interesting to ask whether classical or robust statistics is more appropriate.
Taleb and others have mentioned that the bell curve (or Gaussian) does not deal with outliers well; it gives them a very small probability, and the parameter estimates end up being highly dependent on them.
Yet, one of the justifications of the Gaussian is that it's the max-entropy curve for a given mean and standard deviation.
Entropy is supposed to be a measure of the uncertainty associated with a distribution; so, shouldn't we expect that the max-entropy distribution would give as high a probability to outliers as possible?
There are several answers.
First: a basic problem is that phrase "a given mean and standard deviation". In particular, to choose a standard deviation is to choose an acceptable range for outliers (in some sense). If we have uncertainty about the standard deviation, it turns out the resulting curve has a polynomial decay rather than an exponential one! (This means distant outliers are far more probable.) Essentially, estimating deviations from data (using maximum-likelihood) makes us extremely overconfident that the data will fall in the range we've experienced before. A little Bayesian uncertainty (which still estimates from data, but admits a range of possibilities) turns out to be much less problematic in this respect.
This is definitely helpful, but doesn't solve the riddle: it still feels strange, that the max-entropy distribution would have such a sharp (super-eponential!) decay rate. Why is that?
My derivation will be somewhat heuristic, but I felt that I understood "why" much better by working it out this way than by following other proofs I found (which tend to start with the normal and show that no other distribution has greater entropy, rather than starting with desired features and deriving the normal). [Also, sorry for the poor math notation...]
First, let's pretend we have a discrete distribution over n points, xn. The result will apply no matter how many points we have, which means it applies in the limit of a continuous distribution. Continuous entropy is not the limit of discrete entropy, so I won't actually be maximising discrete entropy here; I'll maximise the discrete version of the continuous entropy formula:
f(x) maximising sum_i: f(xi)log(f(xi))
Next, we constrain the distribution to sum to a constant, have a constant mean, and have constant variance (which also makes the standard deviation constant):
sumi[f(xi)] = C1
sumi[xif(xi)] = C2
sumi[xi2f(xi)] = C3
To solve the constrained optimisation problem, we make lagrange multipliers for the constraints:
Lagrangian:
f(xi)log(f(xi))
- lambda1*(sumi[f(xi)]-C1)
- lambda2*(sumi[xif(xi)]-C2)
- lambda3*(sumi[xi2f(xi)]-C3)
Partial derivatives in the f(xi):
1 - log(f(xi))
- lambda1
- lambda2*xi
- lambda3*xi2
Setting this equal to zero and solving for f(xi):
f(xi) = 21 - lambda1 - lambda2*xi - lambda3*xi2
That's exactly the form of the Gaussian: a constant to the power of a 2nd-degree polynomial!
So, we can see where everything comes from: the exponential comes from our definition of entropy, and the function within the exponent comes from the Lagrange multipliers. The Gaussian is quadratic precisely because we chose a quadratic loss function! We can get basically any form we want by choosing a different loss function. If we use the kurtosis rather than the variance, we will get a fourth degree polynomial rather than a second degree one. If we choose an exponential function, we can get a doubly exponential probability distribution. And so on. There should be some limitations, but more or less, we can get any probability distribution we want, and claim that it is justified as the maximum-entropy distribution (fixing some measure of spread). We can even get rid of the exponential by putting a logarithm around our loss function.
Last time, I mentioned robust statistics, which attempts to make statistical techniques less sensitive to outliers. Rather than using the mean, robust statistics recommends using the median: whereas a sufficiently large outlier can shift the mean by an arbitrary amount, a single outlier has the same limited effect on the median no matter how extreme its value.
I also mentioned that it seems more intuitive to use the absolute deviation, rather than the squared deviation.
If we fix the absolute deviation and ask for the maximum-entropy function, we get something like e-|x| as our distribution. This is an ugly little function, but the maximum-likelihood estimate of the center of the distribution is precisely the median! e-|x| justifies the strategy of robust statistics, reducing sensitivity to outliers by making extreme outliers more probable. (The reason is: the max-likelihood estimate will be the point which minimizes the sum of the loss functions centred at each data point. The derivative at x is equal to the number of data points below x minus the number above x. Therefore the derivative is only zero when these two are equal. This is the minimum loss.)
What matters, of course, is not what nice theoretical properties a distribution may have, but how well it matches the true situation. Still, I find it very interesting that we can construct a distribution which justifies taking the median rather than the mean... and I think it's important to show how arbitrary the status of the bell curve as the maximum entropy distribution is.
Just to be clear: the Bayesian solution is not usually to think too much about what distributions might have the best properties. This is important, but when in doubt, we can simply take a mixture distribution over as large a class of hypotheses as we practically can. Bayesian updates give us nice convergence to the best option, while also avoiding overconfidence (for example, the asymptotic probability of outliers will be on the order of the most outlier-favouring distribution present).
Still, a complex machine learning algorithm may still need a simple one as a sub-algorithm to perform simple tasks; a genetic programming approximation of Bayes may need simpler statistical tools to make an estimation of distribution algorithm work. More generally, when humans build models, they tend to compose known distributions such as Gaussians to make them work. In such cases, it's interesting to ask whether classical or robust statistics is more appropriate.
Tuesday, January 29, 2013
Why Continuous Entropy Matters
Previously, I discussed definitions of entropy for continuous distributions. Entropy, when used as a lower bound on description length, doesn't obviously apply to real-numbered random variables. Yet, there is a very common definition of entropy for continuous variables which almost everyone uses without question. This definition looks right at the syntactic level (we simply replace summation with integration), but doesn't make a lot of conceptual sense, and also loses many properties of discrete entropy. I took a look at one alternative definition that might make more sense.
Why does it matter?
The main issue here is the legitimacy of the maximum-entropy justification of specific continuous distributions, chief among them the bell curve; using slightly more general language, the Gaussian distribution (which takes the bell curve to the multivariate domain, allowing it to account for correlations rather than just means and variances).
The Gaussian distribution is used in many cases as the default: the first distribution you'd try. In some cases this is out of statistical ignorance (the Gaussian is the distribution people remember), but in many cases this is because of a long tradition in statistics.
One justification for this is the "physical" justification: many physical processes will give distributions which are approximately (though not exactly) Gaussian. The Gaussian is a good approximation of naturally occurring statistics when the number is the result of many individual positive and negative fluctuations: if we start with a base number such as 30, and there are many processes which randomly add or subtract something from our number, then the distribution of our end result will be approximately Gaussian. This could be thought of as a frequentist justification of the Gaussian.
A more Bayesian justification of the Gaussian is via the maximum-entropy principle. Entropy represents the uncertainty of a distribution. Bayesians need prior probability distributions in order to begin reasoning. It makes intuitive sense to choose a prior which has maximal uncertainty associated with it: we are representing our ignorance. It turns out that with a known mean and variance, the Gaussian is the maximum-entropy distribution... according to the usual definition of entropy.
This seems quite mysterious to me. Intuitively, a maximum-entropy distribution should be as spread out as possible (within the confines of the fixed variance, which is a measure of spread). A Gaussian, however, decays super-exponentially! It falls off very quickly, on the order of e-x2. This makes outliers very, very unlikely. Intuitively, it seems as if a polynomial decay rate would be much better, leaving a little bit of probability mass for far-flung values.
This issue is not merely academic. Nassim Nicolas Taleb has repeatedly shown that the standard statistical models used in socioeconomic settings are disastrously wrong when it comes to outliers. A small number of outliers are improbable enough to completely invalidate the models from a scientific perspective. These outliers are not just annoying, but also highly consequential, often leading to the failure of investments and massive losses of capital. To quote him: "Nobody has managed to explain why it is not charlatanism, downright scientifically fraudulent to use these techniques." Taleb advocates the use of probability distributions with a wider spread, such as a power law. Basically, polynomial distributions rather than exponential distributions.
Advocates of the maximum entropy principle should (arguably) be puzzled by this failure. The Gaussian distribution was supposed to have maximum spread, in the sense that's important (entropy). Yet, Taleb showed that the spread is far too conservative! What went wrong?
I can see two possible problems with the Gaussian distribution: entropy, and variance.
As I've already explored, the definition of entropy being used is highly questionable. It's not obvious that entropy should be applied to the continuous domain at all, and even if it is, there doesn't seem to be very much justification for the formula which us currently employed.
There is another assumption behind the maximum-entropy justification of the Gaussian, however: fixed variance. A Gaussian is the maximum-entropy distribution given a variance. Entropy measures "spread", but variance also measures "spread" in a different sense. The interaction between these two formulas for spread gives rise to the Gaussian distribution.
The formula for variance is based on the squared deviation. Deviation is the distance from the mean. There are many different measures of the collective deviation. Squared deviation is not robust. Why not use the absolute deviation? This comes up now and again, but as far as I've observed, rarely addressed inside statistics (rarely taught in classrooms, rarely mentioned in textbooks, et cetera). It's a fascinating topic, and Taleb's work suggests it's a critical one.
So, it seems as if there is something important to learn by digging further into the foundations of our measures of spread, including both entropy and variance.
Naturally, my personal interest here is for applications in AI. It seems necessary to have good built-in methods for dealing with continuous variables, and uncertainty about continuous variables. So, if the standard methods are flawed, that could be relevant for AI. Interestingly, many machine learning methods do prefer alternatives to squared error. SVMs are a primary example.
Why does it matter?
The main issue here is the legitimacy of the maximum-entropy justification of specific continuous distributions, chief among them the bell curve; using slightly more general language, the Gaussian distribution (which takes the bell curve to the multivariate domain, allowing it to account for correlations rather than just means and variances).
The Gaussian distribution is used in many cases as the default: the first distribution you'd try. In some cases this is out of statistical ignorance (the Gaussian is the distribution people remember), but in many cases this is because of a long tradition in statistics.
One justification for this is the "physical" justification: many physical processes will give distributions which are approximately (though not exactly) Gaussian. The Gaussian is a good approximation of naturally occurring statistics when the number is the result of many individual positive and negative fluctuations: if we start with a base number such as 30, and there are many processes which randomly add or subtract something from our number, then the distribution of our end result will be approximately Gaussian. This could be thought of as a frequentist justification of the Gaussian.
A more Bayesian justification of the Gaussian is via the maximum-entropy principle. Entropy represents the uncertainty of a distribution. Bayesians need prior probability distributions in order to begin reasoning. It makes intuitive sense to choose a prior which has maximal uncertainty associated with it: we are representing our ignorance. It turns out that with a known mean and variance, the Gaussian is the maximum-entropy distribution... according to the usual definition of entropy.
This seems quite mysterious to me. Intuitively, a maximum-entropy distribution should be as spread out as possible (within the confines of the fixed variance, which is a measure of spread). A Gaussian, however, decays super-exponentially! It falls off very quickly, on the order of e-x2. This makes outliers very, very unlikely. Intuitively, it seems as if a polynomial decay rate would be much better, leaving a little bit of probability mass for far-flung values.
This issue is not merely academic. Nassim Nicolas Taleb has repeatedly shown that the standard statistical models used in socioeconomic settings are disastrously wrong when it comes to outliers. A small number of outliers are improbable enough to completely invalidate the models from a scientific perspective. These outliers are not just annoying, but also highly consequential, often leading to the failure of investments and massive losses of capital. To quote him: "Nobody has managed to explain why it is not charlatanism, downright scientifically fraudulent to use these techniques." Taleb advocates the use of probability distributions with a wider spread, such as a power law. Basically, polynomial distributions rather than exponential distributions.
Advocates of the maximum entropy principle should (arguably) be puzzled by this failure. The Gaussian distribution was supposed to have maximum spread, in the sense that's important (entropy). Yet, Taleb showed that the spread is far too conservative! What went wrong?
I can see two possible problems with the Gaussian distribution: entropy, and variance.
As I've already explored, the definition of entropy being used is highly questionable. It's not obvious that entropy should be applied to the continuous domain at all, and even if it is, there doesn't seem to be very much justification for the formula which us currently employed.
There is another assumption behind the maximum-entropy justification of the Gaussian, however: fixed variance. A Gaussian is the maximum-entropy distribution given a variance. Entropy measures "spread", but variance also measures "spread" in a different sense. The interaction between these two formulas for spread gives rise to the Gaussian distribution.
The formula for variance is based on the squared deviation. Deviation is the distance from the mean. There are many different measures of the collective deviation. Squared deviation is not robust. Why not use the absolute deviation? This comes up now and again, but as far as I've observed, rarely addressed inside statistics (rarely taught in classrooms, rarely mentioned in textbooks, et cetera). It's a fascinating topic, and Taleb's work suggests it's a critical one.
So, it seems as if there is something important to learn by digging further into the foundations of our measures of spread, including both entropy and variance.
Naturally, my personal interest here is for applications in AI. It seems necessary to have good built-in methods for dealing with continuous variables, and uncertainty about continuous variables. So, if the standard methods are flawed, that could be relevant for AI. Interestingly, many machine learning methods do prefer alternatives to squared error. SVMs are a primary example.
Monday, January 21, 2013
One More Problem
Last time, I listed some problems which my current foundation idea doesn't solve.
Here's one more, and I'd refer to it as the problem.
First, some set-up.
Eliezer has been writing some good stuff about 2nd-order logic. He's touching on a problem which I've written about before, namely: if any proof-system can be interpreted via theories in 1st-order logic, why is it that we can discuss ideas (such as number theory) which seem not to be fully characterized by it? We think we mean something more when we talk about higher-order concepts, but how can we, if everything could be reduced to first-order talk? Let's call this the problem of meaning.
My original hope was (somehow) to solve the problem using a sufficiently powerful theory of truth. Tarski's undefinability theorem gives us a way to, given some logic, 'mechanically' construct one thing which the logic cannot express: the truth predicate in that logic. No logic can express its own semantics. Therefore (I must have been thinking), we can use 'truth' as a standard way of extending a logic, and hope that extending it in this way will automatically make it more powerful in every other way that is interesting to us. (There is fairly good support for this idea.) I hoped to solve the problem of meaning by solving the problem of truth.
Along the way, I learned about the arithmetic hierarchy and other hierarchies. These provide a characterisation of how difficult logical/mathematical problems are. Only the first couple of levels of the arithmetic hierarchy can be handled on computers. The rest are increasingly uncomputable, and the other hierarchies just get worse. (The arithmetic hierarchy deals just with problems in number theory, and the next-up, the analytic hierarchy, deals with real analysis.) Given that we cannot compute facts about the higher-up stuff, the question is, in what sense can we refer to it? I referred to this as a grounding problem: in what sense does our logic refer to that stuff, if it can't access the truths of those hierarchy levels? This is a big aspect of the problem of meaning.
A somewhat different question (and more relevant to this post) is: How can we define rational belief about these things?
Anyway, I eventually got a bit fed up trying to climb levels, and attempted to solve the problem more directly by just coming up with a complete theory of truth. I eventually learned that the sorts of approaches I was considering, which attempted to build up truth incrementally, would typically only get me predicative mathematics at best, which is not very much. Finally, recently, I specified my impredicative theory of truth, realising that impredicativity in itself suggests a design for putting a classical truth predicate within its own language, getting around Tarski's theorem (in a sense). This gives something much like 2nd-order logic.
So, having declared tentative success on Truth, let's return to the question of Meaning.
Interestingly, 2nd-order logic seems to contain every meaningful mathematical statement we've come up with so far... it can say anything which we would want to invent higher-order logic for. is logically complex enough that it doesn't fit in any hierarchies of difficulty, "past, present, or future". So, it makes a surprisingly good candidate for a logic of thought. (Higher-order logics may still be important, but are in some sense "merely convenient".)
There are basically two reasonable semantics for my logic. One: only the syntactically constructible predicates exist. This makes sense if we want the theory of truth to just be talking about the truth of stuff we can say. Two: all mathematically possible predicates exist. This turns my logic into a syntactic variant of 2nd-order logic.
Really, I'd like to find a way to keep the semantics usefully ambiguous. Many of the theorems for real numbers have analogous results for the computable real numbers. We might hope for some similar match between results for constructible predicates and the mathematically possible predicates. This would tell us that, to some extent, it doesn't matter which we pick.
My idea is that the predicate ambiguity represents extensible meaning. When we define the natural numbers, we say "The least set containing zero and closed under succession". The semantic question is, how do we interpret "set"? What sets exist? Answer #1 says that only sets we can define exist. Answer #2 says that we refer to a broader range of sets, which is intuitively maximal (contains all possible sets). What I'm saying is that we want to refer to "all sets we, or anyone else, might conceive of"; in other words, we don't just want to include stuff we know how to define, but also things we might learn about later, or anything anyone might define.
Constructivists might object that this does not justify semantics #2 at all: even "all the predicates that might be defined" should be a countable set. I would counter that it is still uncountable in some sense: we solve Skolem's paradox by pointing out that "uncountable" is relative to a logical background. It simply means we can't put the predicates in one-to-one correspondence with the natural numbers using what we currently know how to say. Therefore constructivists should not be "fundamentally" unhappy with uncountability. Most constructivists I know will now argue that the powerset is "actually" countable, despite being apparently uncountable in this relative sense, but at this point they are actually relying on classical mathematics!
Yet, they should point out that I'm being disingenuous. I'm stating that the semantics of my 2nd-order quantifiers should be, in some sense, ambiguous, to allow for the broadest possible interpretation. It's true that, for any particular logic, the predicates definable in that logic will be uncountable within the logic. However, I'm claiming something more: that the quantifiers are ranging over predicates in some imaginary maximal logic (which very clearly does not exist). In doing so, I'm effectively claiming "true uncountability" instead of relative uncountability: the maximal notion of predicates would be uncountable in any logic. A constructivist might be OK so long as I'm honestly being ambiguous, but with the intent of being maximal? There is, arguably, no way of making sense of that (without some specific background theory defining "maximal"...).
Now would be a good time to point out that I'm only doing armchair philosophy here. There are many details of the discussion which would benefit greatly from a formal treatment. For example, I'm not explicitly defining what it means for my logic to claim that its own predicates are uncountable (and this is something which you should be suspicious about, since the theory of truth I offered doesn't directly allow us to ask such a question). I'd like to explore all this further, but it's just a hobby at this point, so I'm prioritising the fun part at the expense of hard work.
So, putting aside the justification, let's just admit that I want to talk about classical mathematics and go for the full 2nd-order semantics. The central question again:
How can we define rational belief about these things?
"Rational belief" could be interpreted in several ways; for the moment, let's formalise rational degrees of belief as probabilities. If you're a Bayesian, then, the question is how to define a prior probability distribution over 2nd-order logic.
I know how I want probabilities on 1st-order logic to work. How do we define a prior over logical beliefs? It makes sense to suppose that statements in 1st-order logic are being drawn at random to construct a theory. We update on evidence by requiring this theory to be consistent with everything we know. When we want to know the probability of a statement being true, we ask the probability that it occurs in such a theory.
It's possible to approximate this probability distribution arbitrarily well, because it is possible to detect inconsistency in 1st-order theories. The proof system corresponds exactly to the semantics, so all we need to do is look for proofs of inconsistencies and throw out inconsistent ones.
This approach doesn't obviously generalise. We can define randomly generated consistent theories for other systems, but we can't approximate these distributions by looking for proofs. Suppose we're dealing with first-order arithmetic. We want to know the probability of the Goldbach Conjecture. We consider randomly constructed theories, throwing out those which we can prove inconsistent. Unfortunately, there will be a finite probability that the statement "there exists an even number which is not the sum of two primes" gets generated. The probability of this statement will go down as we rule out theories which include this statement but imply falsifiable things; however, it may converge to a finite number greater than zero. The axioms of first-order arithmetic are incomplete, so there may be no proof of the conjecture, in which case randomly generated theories would have a positive probability of hypothesising a counterexample. This is referred to as an omega inconsistency.
Even for just the stuff in the arithmetic hierarchy, it is not possible to specify any convergent computation which will arrive at the truth in every case. (And again, 2nd-order logic goes well beyond the arithmetic hierarchy, and the analytic hierarchy as well.)
Yet, we can do better than just the 1st-order probability. For example, we can converge to the truth of the Goldbach conjecture just by checking examples. We could put the probability of the conjecture being false at 1/N, where N is the number of positive examples we have found so far. If we find a counterexample, of course, then we permanently put the probability of it being false to 1. This is very rough, but would be guaranteed to eventually get arbitrarily close to the correct answer, so long as we eventually check any particular number.
More realistically, we make a general rule: for an arbitrary universal statement about natural numbers, we would like to converge to probability 1 as we check examples and find no counterexample. It would make sense to converge slowly: we would expect the probability of finding a counterexample as we check more numbers to go down roughly like the probability of a Turing machine halting as we wait longer. (It's improbable that the first counterexample is extremely large, but not very improbable.)
A central problem for me is: how can we make that sort of behaviour come naturally out of a general definition of rational belief over an incomplete logic?
Just for the language of first-order arithmetic (and therefore the arithmetic hierarchy), it seems possible to define appropriate belief in terms of something like the case-checking I've described. This cannot arrive at the correct belief for everything, because nested quantifiers make it possible to quantify over undecidable properties (so that we cannot check cases); however, it seems reasonable to say that it's the best we can do. In cases where we cannot converge to the truth, we should simply ensure that we don't converge toward 1 or 0. (Note: I'm hiding a lot of work here. The case-checking is inadequate as stated; we would like to guarantee nice logical properties for the probability distribution, like if A->B, P(A)<=P(B). However, this seems "not so hard".)
However, impredicative 2nd-order quantification seems impossible to check "by cases" almost by definition: even assuming only definable predicates, we still have the circular truth conditions introduced by the impredicativity. This seems to make a solution unlikely.
Maybe there is a better way of thinking about it, though. Depending on how we take my idea that the space of predicates is supposed to be usefully ambiguous, we might prefer the Henkin semantics, which actually gets us back to the first-order case: we don't assume that only the definable sets exist, but those are the only ones we know to exist. The impredicative definitions are grounded in some (unknown) model, rather than deriving their truth values circularly from facts about themselves. This makes the semantics equivalent to first-order logic with extra axioms; we can handle it in the same way we handled probabilities over 1st-order beliefs.
This is too weak, of course. We wanted to be able to do reasonable probabilistic inference about the Goldbach conjecture. However, it may be a good place to start. How can we get the case-checking behaviour for arithmetic, from something similar to a Henkin semantics?
One approach would be to assert the existence of enumerated classes: classes generated by given constants and functions. For example, even before asserting any axioms about the natural numbers, we would have the class NAT: 0, s(NAT). (Since we haven't stated anything, it could be that S(0)=0, et cetera. Only the existence of the class is guaranteed.) These are unnecessary in crisp logic, having crisp 2nd-order equivalents. However, they define special probabilistic behaviour: enumerated checking.
These aren't needed in the syntax of the language; just the semantics. When we give the inductive definition for natural numbers (n such that: for all X, if X(0) and for all i, X(i)->X(s(i)), then X(n)), then NAT will simply be one instantiation of the 2nd-order quantification in the semantics. We will need to account for this in the probabilistic calculation: we can use randomly generated theories, as for first-order logic, but the probability of statements concerning NAT and other enumerated classes would follow special rules. Since the probability of the Goldbach conjecture stated with respect to NAT would become high, and it being true of NAT implies that it is true of the natural numbers as normally defined (because NAT contains zero and successors), the probability with respect to the natural numbers would have to be high as well. (We would use case-checking to find probabilities of omega-inconsistency, and then use that to reduce the significance of possibilities which appear omega-inconsistent.)
It's not exactly pretty, but at least it gets the job done. Making a special kind of class for specific probabilistic behaviour feels a bit awkward, raising the question of whether there are any other kinds of classes we're forgetting. It would be better if the result emerged more naturally from concerns of what probabilistic behaviours can be obtained; perhaps out of a universal distribution or something...
Here's one more, and I'd refer to it as the problem.
First, some set-up.
Eliezer has been writing some good stuff about 2nd-order logic. He's touching on a problem which I've written about before, namely: if any proof-system can be interpreted via theories in 1st-order logic, why is it that we can discuss ideas (such as number theory) which seem not to be fully characterized by it? We think we mean something more when we talk about higher-order concepts, but how can we, if everything could be reduced to first-order talk? Let's call this the problem of meaning.
My original hope was (somehow) to solve the problem using a sufficiently powerful theory of truth. Tarski's undefinability theorem gives us a way to, given some logic, 'mechanically' construct one thing which the logic cannot express: the truth predicate in that logic. No logic can express its own semantics. Therefore (I must have been thinking), we can use 'truth' as a standard way of extending a logic, and hope that extending it in this way will automatically make it more powerful in every other way that is interesting to us. (There is fairly good support for this idea.) I hoped to solve the problem of meaning by solving the problem of truth.
Along the way, I learned about the arithmetic hierarchy and other hierarchies. These provide a characterisation of how difficult logical/mathematical problems are. Only the first couple of levels of the arithmetic hierarchy can be handled on computers. The rest are increasingly uncomputable, and the other hierarchies just get worse. (The arithmetic hierarchy deals just with problems in number theory, and the next-up, the analytic hierarchy, deals with real analysis.) Given that we cannot compute facts about the higher-up stuff, the question is, in what sense can we refer to it? I referred to this as a grounding problem: in what sense does our logic refer to that stuff, if it can't access the truths of those hierarchy levels? This is a big aspect of the problem of meaning.
A somewhat different question (and more relevant to this post) is: How can we define rational belief about these things?
Anyway, I eventually got a bit fed up trying to climb levels, and attempted to solve the problem more directly by just coming up with a complete theory of truth. I eventually learned that the sorts of approaches I was considering, which attempted to build up truth incrementally, would typically only get me predicative mathematics at best, which is not very much. Finally, recently, I specified my impredicative theory of truth, realising that impredicativity in itself suggests a design for putting a classical truth predicate within its own language, getting around Tarski's theorem (in a sense). This gives something much like 2nd-order logic.
So, having declared tentative success on Truth, let's return to the question of Meaning.
Interestingly, 2nd-order logic seems to contain every meaningful mathematical statement we've come up with so far... it can say anything which we would want to invent higher-order logic for. is logically complex enough that it doesn't fit in any hierarchies of difficulty, "past, present, or future". So, it makes a surprisingly good candidate for a logic of thought. (Higher-order logics may still be important, but are in some sense "merely convenient".)
There are basically two reasonable semantics for my logic. One: only the syntactically constructible predicates exist. This makes sense if we want the theory of truth to just be talking about the truth of stuff we can say. Two: all mathematically possible predicates exist. This turns my logic into a syntactic variant of 2nd-order logic.
Really, I'd like to find a way to keep the semantics usefully ambiguous. Many of the theorems for real numbers have analogous results for the computable real numbers. We might hope for some similar match between results for constructible predicates and the mathematically possible predicates. This would tell us that, to some extent, it doesn't matter which we pick.
My idea is that the predicate ambiguity represents extensible meaning. When we define the natural numbers, we say "The least set containing zero and closed under succession". The semantic question is, how do we interpret "set"? What sets exist? Answer #1 says that only sets we can define exist. Answer #2 says that we refer to a broader range of sets, which is intuitively maximal (contains all possible sets). What I'm saying is that we want to refer to "all sets we, or anyone else, might conceive of"; in other words, we don't just want to include stuff we know how to define, but also things we might learn about later, or anything anyone might define.
Constructivists might object that this does not justify semantics #2 at all: even "all the predicates that might be defined" should be a countable set. I would counter that it is still uncountable in some sense: we solve Skolem's paradox by pointing out that "uncountable" is relative to a logical background. It simply means we can't put the predicates in one-to-one correspondence with the natural numbers using what we currently know how to say. Therefore constructivists should not be "fundamentally" unhappy with uncountability. Most constructivists I know will now argue that the powerset is "actually" countable, despite being apparently uncountable in this relative sense, but at this point they are actually relying on classical mathematics!
Yet, they should point out that I'm being disingenuous. I'm stating that the semantics of my 2nd-order quantifiers should be, in some sense, ambiguous, to allow for the broadest possible interpretation. It's true that, for any particular logic, the predicates definable in that logic will be uncountable within the logic. However, I'm claiming something more: that the quantifiers are ranging over predicates in some imaginary maximal logic (which very clearly does not exist). In doing so, I'm effectively claiming "true uncountability" instead of relative uncountability: the maximal notion of predicates would be uncountable in any logic. A constructivist might be OK so long as I'm honestly being ambiguous, but with the intent of being maximal? There is, arguably, no way of making sense of that (without some specific background theory defining "maximal"...).
Now would be a good time to point out that I'm only doing armchair philosophy here. There are many details of the discussion which would benefit greatly from a formal treatment. For example, I'm not explicitly defining what it means for my logic to claim that its own predicates are uncountable (and this is something which you should be suspicious about, since the theory of truth I offered doesn't directly allow us to ask such a question). I'd like to explore all this further, but it's just a hobby at this point, so I'm prioritising the fun part at the expense of hard work.
So, putting aside the justification, let's just admit that I want to talk about classical mathematics and go for the full 2nd-order semantics. The central question again:
How can we define rational belief about these things?
"Rational belief" could be interpreted in several ways; for the moment, let's formalise rational degrees of belief as probabilities. If you're a Bayesian, then, the question is how to define a prior probability distribution over 2nd-order logic.
I know how I want probabilities on 1st-order logic to work. How do we define a prior over logical beliefs? It makes sense to suppose that statements in 1st-order logic are being drawn at random to construct a theory. We update on evidence by requiring this theory to be consistent with everything we know. When we want to know the probability of a statement being true, we ask the probability that it occurs in such a theory.
It's possible to approximate this probability distribution arbitrarily well, because it is possible to detect inconsistency in 1st-order theories. The proof system corresponds exactly to the semantics, so all we need to do is look for proofs of inconsistencies and throw out inconsistent ones.
This approach doesn't obviously generalise. We can define randomly generated consistent theories for other systems, but we can't approximate these distributions by looking for proofs. Suppose we're dealing with first-order arithmetic. We want to know the probability of the Goldbach Conjecture. We consider randomly constructed theories, throwing out those which we can prove inconsistent. Unfortunately, there will be a finite probability that the statement "there exists an even number which is not the sum of two primes" gets generated. The probability of this statement will go down as we rule out theories which include this statement but imply falsifiable things; however, it may converge to a finite number greater than zero. The axioms of first-order arithmetic are incomplete, so there may be no proof of the conjecture, in which case randomly generated theories would have a positive probability of hypothesising a counterexample. This is referred to as an omega inconsistency.
Even for just the stuff in the arithmetic hierarchy, it is not possible to specify any convergent computation which will arrive at the truth in every case. (And again, 2nd-order logic goes well beyond the arithmetic hierarchy, and the analytic hierarchy as well.)
Yet, we can do better than just the 1st-order probability. For example, we can converge to the truth of the Goldbach conjecture just by checking examples. We could put the probability of the conjecture being false at 1/N, where N is the number of positive examples we have found so far. If we find a counterexample, of course, then we permanently put the probability of it being false to 1. This is very rough, but would be guaranteed to eventually get arbitrarily close to the correct answer, so long as we eventually check any particular number.
More realistically, we make a general rule: for an arbitrary universal statement about natural numbers, we would like to converge to probability 1 as we check examples and find no counterexample. It would make sense to converge slowly: we would expect the probability of finding a counterexample as we check more numbers to go down roughly like the probability of a Turing machine halting as we wait longer. (It's improbable that the first counterexample is extremely large, but not very improbable.)
A central problem for me is: how can we make that sort of behaviour come naturally out of a general definition of rational belief over an incomplete logic?
Just for the language of first-order arithmetic (and therefore the arithmetic hierarchy), it seems possible to define appropriate belief in terms of something like the case-checking I've described. This cannot arrive at the correct belief for everything, because nested quantifiers make it possible to quantify over undecidable properties (so that we cannot check cases); however, it seems reasonable to say that it's the best we can do. In cases where we cannot converge to the truth, we should simply ensure that we don't converge toward 1 or 0. (Note: I'm hiding a lot of work here. The case-checking is inadequate as stated; we would like to guarantee nice logical properties for the probability distribution, like if A->B, P(A)<=P(B). However, this seems "not so hard".)
However, impredicative 2nd-order quantification seems impossible to check "by cases" almost by definition: even assuming only definable predicates, we still have the circular truth conditions introduced by the impredicativity. This seems to make a solution unlikely.
Maybe there is a better way of thinking about it, though. Depending on how we take my idea that the space of predicates is supposed to be usefully ambiguous, we might prefer the Henkin semantics, which actually gets us back to the first-order case: we don't assume that only the definable sets exist, but those are the only ones we know to exist. The impredicative definitions are grounded in some (unknown) model, rather than deriving their truth values circularly from facts about themselves. This makes the semantics equivalent to first-order logic with extra axioms; we can handle it in the same way we handled probabilities over 1st-order beliefs.
This is too weak, of course. We wanted to be able to do reasonable probabilistic inference about the Goldbach conjecture. However, it may be a good place to start. How can we get the case-checking behaviour for arithmetic, from something similar to a Henkin semantics?
One approach would be to assert the existence of enumerated classes: classes generated by given constants and functions. For example, even before asserting any axioms about the natural numbers, we would have the class NAT: 0, s(NAT). (Since we haven't stated anything, it could be that S(0)=0, et cetera. Only the existence of the class is guaranteed.) These are unnecessary in crisp logic, having crisp 2nd-order equivalents. However, they define special probabilistic behaviour: enumerated checking.
These aren't needed in the syntax of the language; just the semantics. When we give the inductive definition for natural numbers (n such that: for all X, if X(0) and for all i, X(i)->X(s(i)), then X(n)), then NAT will simply be one instantiation of the 2nd-order quantification in the semantics. We will need to account for this in the probabilistic calculation: we can use randomly generated theories, as for first-order logic, but the probability of statements concerning NAT and other enumerated classes would follow special rules. Since the probability of the Goldbach conjecture stated with respect to NAT would become high, and it being true of NAT implies that it is true of the natural numbers as normally defined (because NAT contains zero and successors), the probability with respect to the natural numbers would have to be high as well. (We would use case-checking to find probabilities of omega-inconsistency, and then use that to reduce the significance of possibilities which appear omega-inconsistent.)
It's not exactly pretty, but at least it gets the job done. Making a special kind of class for specific probabilistic behaviour feels a bit awkward, raising the question of whether there are any other kinds of classes we're forgetting. It would be better if the result emerged more naturally from concerns of what probabilistic behaviours can be obtained; perhaps out of a universal distribution or something...
Thursday, January 17, 2013
Defining Real Entropy
The wikipedia article for entropy(information theory) indicates the usual formula for entropy of a probability density function, and then calmly proceeds with a proof that it is not the correct way to generalise entropy to the continuous domain. (Yay for wikipedia!) This is the definition of entropy advocated by Jaynes, which is linked as an alternative. It's called Limiting Density of Discrete Points. (Not a catchy name.)
My intuition:
Entropy is about expected code length (which is expected information gain). One way to try and define entropy for distributions over real numbers might be to try and deal with them as a discrete object. To do this, we choose a specific encoding E of the reals as a stream of bits, and examine the efficiency with which we can store points randomly drawn from our probability distribution P. Unfortunately, no matter how clever our encoding, a single real number almost always carries infinite information. So, to get a finite number, we subtract the information in each bit from the worst-case information for that bit (a coin flip). In other words, since we can't give the (infinite) the amount of information directly, we look at how much less information there is than in the case of random noise.
"Random noise" means coin flips in our discrete representation E for continuous numbers... but that means that E induces a probability distribution of its own. The measure of entropy in P relative to the encoding E ends up looking a lot like the KL-divergence between P and E, viewing E as a probability distribution.
One way of understanding this is to say that the continuous version of entropy doesn't work out quite right, but the continuous version of KL-divergence (which is closely related to entropy) works quite well; so the best we can do in the continuous domain is to measure KL-divergence from some reference distribution (in this case, E) as a substitute for entropy.
Normally, entropy tells us what we can expect to achieve with the best encoding. (It places a lower bound on code length: intuitively we can't make the code shorter than the actual amount of information, except by luck.) The reason this doesn't work for real numbers is that they require an infinite amount of information in any case. Yet, it is still possible to generate optimal codes (for example, iteratively divide the probability density into upper and lower halves), and ask how "spread out" they are. The problem is just that we need to define some sort of "frame" in order to get a numerical value (defining the extent to which we care about differing numbers), and it seems like most options end up looking similar to the approach Jaynes came up with (or, being a special case of that approach).
Is this really the best way to translate entropy to the continuous domain?
My intuition:
Entropy is about expected code length (which is expected information gain). One way to try and define entropy for distributions over real numbers might be to try and deal with them as a discrete object. To do this, we choose a specific encoding E of the reals as a stream of bits, and examine the efficiency with which we can store points randomly drawn from our probability distribution P. Unfortunately, no matter how clever our encoding, a single real number almost always carries infinite information. So, to get a finite number, we subtract the information in each bit from the worst-case information for that bit (a coin flip). In other words, since we can't give the (infinite) the amount of information directly, we look at how much less information there is than in the case of random noise.
"Random noise" means coin flips in our discrete representation E for continuous numbers... but that means that E induces a probability distribution of its own. The measure of entropy in P relative to the encoding E ends up looking a lot like the KL-divergence between P and E, viewing E as a probability distribution.
One way of understanding this is to say that the continuous version of entropy doesn't work out quite right, but the continuous version of KL-divergence (which is closely related to entropy) works quite well; so the best we can do in the continuous domain is to measure KL-divergence from some reference distribution (in this case, E) as a substitute for entropy.
Normally, entropy tells us what we can expect to achieve with the best encoding. (It places a lower bound on code length: intuitively we can't make the code shorter than the actual amount of information, except by luck.) The reason this doesn't work for real numbers is that they require an infinite amount of information in any case. Yet, it is still possible to generate optimal codes (for example, iteratively divide the probability density into upper and lower halves), and ask how "spread out" they are. The problem is just that we need to define some sort of "frame" in order to get a numerical value (defining the extent to which we care about differing numbers), and it seems like most options end up looking similar to the approach Jaynes came up with (or, being a special case of that approach).
Is this really the best way to translate entropy to the continuous domain?
Thursday, January 10, 2013
Problems I Didn't Solve
There are several foundational problems which my current foundational theory doesn't solve.
First, it does not determine a specific set theory. In my previous post on the topic, I mentioned:
Basically, in this view, set theory is a way of transposing 2nd-order entities into the 1st-order level to increase the power of the system. We can't just inject "all of them", because there will always be more 2nd-order entities than 1st-order entities, so if we force all 2nd-order entities to be 1st-order entities, we can prove a contradiction. There are many different ways to do a partial job.
My 2nd-order entities are properties, but they are related to the notion of a 'class'. A class is much like a property except that (as with sets) we regard two classes as equal if they have exactly the same members. Set theories developed in 2nd-order logic use 2nd-order entities as classes, and define sets from these. (The difference between a set and a class is that a class is a 2nd-order entity, and therefore can have members, but cannot be a member in anything. A set is a 1st-order entity, and can therefore be a member, as well.)
It would be possible to develop a finitely axiomatized version of ZFC, like NBG set theory. We could also do many other things, though. For example, we could make a set theory like NFU, which has a universal set. This simply means that we reject the 'iterative' idea of converting 2nd-order entities into 1st-order ones. In the 'iterative' conception, we imagine transporting entities to the 1st-order level in stages. Since increasing the number of 1st-order entities can increase the membership of some of our 2nd-order entities, we think of it as changing the meaning of those 2nd-order properties; so, even though we transpose all the existing 2nd-order entities to the 1st-order level at any particular stage, there are yet more entities to consider at the next stage.
NFU rejects the intuition that each stage is conceptually distinct. So, for example, the universal set is the result of transposing the trivial property which is always true. This holds all entities at every stage; so rather than constructing a series of larger and larger distinct sets, each holding all the entities present at some particular stage (but no other entities), we allow it to exist as a single set holding everything.
The stages themselves are still important to NFU: we still need a restriction on which properties may form sets, since we can't make the claim that all properties correspond to sets. How NFU implements this restriction is that properties only become sets if the sets mentioned in the definition can be consistently assigned to stages in a way that ensures membership tests are well-defined. So, for example, Russel's paradoxical set (the set of all sets which don't contain themselves) will never exist, because at no stage can we already test membership if a set to itself. (The variable x in "x such that x ∉ x" can't be assigned a number to ensure that all membership tests a ∈ b have #a ∈ #b.)
Equivalently, the expression which gives us a predicate should be well-typed in simple type theory. (For a completely formal account, read the book.)
The point is, my foundational logic doesn't solve, or even greatly constrain, the foundations of set theory.
(One thing I intentionally left ambiguous was the semantics of the logic. How many properties are there? Are we quantifying over just property definitions, giving a more constructivist foundation, or are we quantifying over a classical space of properties, which is a superset of those we can define? I was hoping to address this in an eventual post.)
There are also other problems which I have not solved in this way. I've claimed that this logic gives us, in an important sense, a self-referential truth predicate. It gives us a language which contains a truth predicate that is correct for all sentences in that language, and also (at a different level) satisfies the Diagonal Lemma, despite Tarski's theorem. However, this doesn't exactly solve all paradoxes of the truth predicate. Kripke noted that a paradox can sometimes depend on accidental self-reference created by the world, not the logic:
Suppose that a spiteful Professor X writes on the chalkboard, "Whatever Professor Y is writing on his chalkboard is false." Meanwhile, the benevolent Professor Y is writing "Whatever Professor X is writing on his chalkboard is true." How should we analyse the situation?
In order to reason about statements written down in the external world, we must have a theory associating 1st and 2nd order entities, just as in the set-theory problem. Complex external objects must "indicate true propositions"!
So, the analysis is not completely determined by the theory of truth. Let's be conservative, though, and suppose we've done just one round of introspection: starting with my impredicative theory of truth, which I'll call L, we generate L', the theory of truth-in-L. This contains a first-order entity for every statement in L, and a truth predicate for these. Let's call this a meta-theory of truth.
We quickly see that we cannot unpack the statements made on the chalkboards to correspond to any particular statements in L. Any (very reasonable) theory which we try to make which relates marks on chalkboards to propositions will fail to relate these particular marks to anything. Before we determine what proposition Y is asserting, we would have to determine what proposition X is asserting, and vice versa.
This is basically because my theory of truth does not solve any problems associated with self-referential truth in the 2nd sense: the truth of self-referential statements. It only provides self-referential truth in the first sense: the ability to discuss truth-in-L in L (so long as we keep the discussion 2nd-order).
Different meta-theories of truth will give different results. Some meta-theories of truth may give us accounts of how statements which are self-referential can get associated with propositions, just as some set theories would allow for sets which contained themselves (such as the universal set).
A related deficiency of the theory is that we cannot form propositions via arbitrary computations. This would be a nice guarantee of the expressiveness of the language. At one point, I thought of this as a major goal: provide a language satisfying the diagonal lemma and containing its own truth predicate, so that the truth of arbitrary computations on atomic propositions could be asserted. Of course, it is not possible to include all and only the meaningful computations; however, it should be able to ensure that if we can prove a computation always halts, then we can use it; this would allow more computational power to be conveniently added by adding axioms. (IXI seems to be like this.) Then we can say that the limitation is not about what the system can't express, but only about what the system doesn't know (not having enough axioms).
It isn't terribly clear which of these problems needs to be solved.
First, it does not determine a specific set theory. In my previous post on the topic, I mentioned:
I'm actually disappointed that this view seems to endorse a more standard well-founded view of set theory, as opposed to what I view as more natural non-well founded theories (in which we have a set of all sets, which is itself a set). Oh well.Really, though, this isn't quite the case. I was imagining a process in which we strengthen the language iteratively, adding increasingly powerful truth predicates, as in the last paragraph of this old essay. However, we're actually free to postulate any set theory we like.
Basically, in this view, set theory is a way of transposing 2nd-order entities into the 1st-order level to increase the power of the system. We can't just inject "all of them", because there will always be more 2nd-order entities than 1st-order entities, so if we force all 2nd-order entities to be 1st-order entities, we can prove a contradiction. There are many different ways to do a partial job.
My 2nd-order entities are properties, but they are related to the notion of a 'class'. A class is much like a property except that (as with sets) we regard two classes as equal if they have exactly the same members. Set theories developed in 2nd-order logic use 2nd-order entities as classes, and define sets from these. (The difference between a set and a class is that a class is a 2nd-order entity, and therefore can have members, but cannot be a member in anything. A set is a 1st-order entity, and can therefore be a member, as well.)
It would be possible to develop a finitely axiomatized version of ZFC, like NBG set theory. We could also do many other things, though. For example, we could make a set theory like NFU, which has a universal set. This simply means that we reject the 'iterative' idea of converting 2nd-order entities into 1st-order ones. In the 'iterative' conception, we imagine transporting entities to the 1st-order level in stages. Since increasing the number of 1st-order entities can increase the membership of some of our 2nd-order entities, we think of it as changing the meaning of those 2nd-order properties; so, even though we transpose all the existing 2nd-order entities to the 1st-order level at any particular stage, there are yet more entities to consider at the next stage.
NFU rejects the intuition that each stage is conceptually distinct. So, for example, the universal set is the result of transposing the trivial property which is always true. This holds all entities at every stage; so rather than constructing a series of larger and larger distinct sets, each holding all the entities present at some particular stage (but no other entities), we allow it to exist as a single set holding everything.
The stages themselves are still important to NFU: we still need a restriction on which properties may form sets, since we can't make the claim that all properties correspond to sets. How NFU implements this restriction is that properties only become sets if the sets mentioned in the definition can be consistently assigned to stages in a way that ensures membership tests are well-defined. So, for example, Russel's paradoxical set (the set of all sets which don't contain themselves) will never exist, because at no stage can we already test membership if a set to itself. (The variable x in "x such that x ∉ x" can't be assigned a number to ensure that all membership tests a ∈ b have #a ∈ #b.)
Equivalently, the expression which gives us a predicate should be well-typed in simple type theory. (For a completely formal account, read the book.)
The point is, my foundational logic doesn't solve, or even greatly constrain, the foundations of set theory.
(One thing I intentionally left ambiguous was the semantics of the logic. How many properties are there? Are we quantifying over just property definitions, giving a more constructivist foundation, or are we quantifying over a classical space of properties, which is a superset of those we can define? I was hoping to address this in an eventual post.)
There are also other problems which I have not solved in this way. I've claimed that this logic gives us, in an important sense, a self-referential truth predicate. It gives us a language which contains a truth predicate that is correct for all sentences in that language, and also (at a different level) satisfies the Diagonal Lemma, despite Tarski's theorem. However, this doesn't exactly solve all paradoxes of the truth predicate. Kripke noted that a paradox can sometimes depend on accidental self-reference created by the world, not the logic:
Suppose that a spiteful Professor X writes on the chalkboard, "Whatever Professor Y is writing on his chalkboard is false." Meanwhile, the benevolent Professor Y is writing "Whatever Professor X is writing on his chalkboard is true." How should we analyse the situation?
In order to reason about statements written down in the external world, we must have a theory associating 1st and 2nd order entities, just as in the set-theory problem. Complex external objects must "indicate true propositions"!
So, the analysis is not completely determined by the theory of truth. Let's be conservative, though, and suppose we've done just one round of introspection: starting with my impredicative theory of truth, which I'll call L, we generate L', the theory of truth-in-L. This contains a first-order entity for every statement in L, and a truth predicate for these. Let's call this a meta-theory of truth.
We quickly see that we cannot unpack the statements made on the chalkboards to correspond to any particular statements in L. Any (very reasonable) theory which we try to make which relates marks on chalkboards to propositions will fail to relate these particular marks to anything. Before we determine what proposition Y is asserting, we would have to determine what proposition X is asserting, and vice versa.
This is basically because my theory of truth does not solve any problems associated with self-referential truth in the 2nd sense: the truth of self-referential statements. It only provides self-referential truth in the first sense: the ability to discuss truth-in-L in L (so long as we keep the discussion 2nd-order).
Different meta-theories of truth will give different results. Some meta-theories of truth may give us accounts of how statements which are self-referential can get associated with propositions, just as some set theories would allow for sets which contained themselves (such as the universal set).
A related deficiency of the theory is that we cannot form propositions via arbitrary computations. This would be a nice guarantee of the expressiveness of the language. At one point, I thought of this as a major goal: provide a language satisfying the diagonal lemma and containing its own truth predicate, so that the truth of arbitrary computations on atomic propositions could be asserted. Of course, it is not possible to include all and only the meaningful computations; however, it should be able to ensure that if we can prove a computation always halts, then we can use it; this would allow more computational power to be conveniently added by adding axioms. (IXI seems to be like this.) Then we can say that the limitation is not about what the system can't express, but only about what the system doesn't know (not having enough axioms).
It isn't terribly clear which of these problems needs to be solved.
Subscribe to:
Posts (Atom)
