Tuesday, January 29, 2013

Why Continuous Entropy Matters

Previously, I discussed definitions of entropy for continuous distributions. Entropy, when used as a lower bound on description length, doesn't obviously apply to real-valued random variables. Yet there is a very common definition of entropy for continuous variables which almost everyone uses without question. This definition looks right at the syntactic level (we simply replace summation with integration), but it doesn't make a lot of conceptual sense, and it loses many properties of discrete entropy. I took a look at one alternative definition that might make more sense.

Why does it matter?

The main issue here is the legitimacy of the maximum-entropy justification of specific continuous distributions, chief among them the bell curve or, in slightly more general language, the Gaussian distribution (which extends the bell curve to the multivariate domain, allowing it to account for correlations rather than just means and variances).

The Gaussian distribution is used in many cases as the default: the first distribution you'd try. In some cases this is out of statistical ignorance (the Gaussian is the distribution people remember), but in many cases this is because of a long tradition in statistics.

One justification for this is the "physical" justification: many physical processes will give distributions which are approximately (though not exactly) Gaussian. The Gaussian is a good approximation of naturally occurring statistics when the number is the result of many individual positive and negative fluctuations: if we start with a base number such as 30, and there are many processes which randomly add or subtract something from our number, then the distribution of our end result will be approximately Gaussian. This could be thought of as a frequentist justification of the Gaussian.
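The "many small fluctuations" picture can be checked with a quick simulation. This is only a minimal sketch: the fluctuation distribution (uniform on [-1, 1]), the step and sample counts, and the function name `noisy_total` are all assumptions made for illustration. It starts from the base number 30 and compares the fraction of results within one and two standard deviations against the roughly 68% and 95% an exact Gaussian would give.

```python
import random
import statistics

def noisy_total(rng, base=30.0, steps=200):
    """Start at `base` and add many small independent fluctuations."""
    total = base
    for _ in range(steps):
        # Assumed fluctuation: uniform on [-1, 1] (any zero-mean,
        # finite-variance choice should behave similarly).
        total += rng.uniform(-1.0, 1.0)
    return total

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [noisy_total(rng) for _ in range(4000)]

mean = statistics.fmean(samples)
stdev = statistics.pstdev(samples)

# For an exact Gaussian, roughly 68% of the mass lies within one standard
# deviation of the mean and roughly 95% within two.
within_one = sum(abs(x - mean) <= stdev for x in samples) / len(samples)
within_two = sum(abs(x - mean) <= 2 * stdev for x in samples) / len(samples)

print(f"mean={mean:.2f}  stdev={stdev:.2f}")
print(f"within 1 sd: {within_one:.3f}   within 2 sd: {within_two:.3f}")
```

Note that this only illustrates the approximation for additive, independent fluctuations with finite variance; it says nothing about processes that violate those assumptions.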

A more Bayesian justification of the Gaussian is via the maximum-entropy principle. Entropy represents the uncertainty of a distribution. Bayesians need prior probability distributions in order to begin reasoning. It makes intuitive sense to choose a prior which has maximal uncertainty associated with it: we are representing our ignorance. It turns out that with a known mean and variance, the Gaussian is the maximum-entropy distribution... according to the usual definition of entropy.
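To make the claim concrete, here is a small sketch comparing the usual (differential) entropy of three distributions that all share the same variance. The choice of comparison distributions (Laplace and uniform) and the function names are my own assumptions for illustration; the closed-form entropies are the standard ones, measured in nats.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    """Differential entropy of a Laplace distribution with the same variance."""
    b = sigma / math.sqrt(2)  # Laplace scale: variance is 2 * b^2
    return 1 + math.log(2 * b)

def uniform_entropy(sigma):
    """Differential entropy of a uniform distribution with the same variance."""
    width = sigma * math.sqrt(12)  # uniform of width w has variance w^2 / 12
    return math.log(width)

sigma = 1.0
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "laplace": laplace_entropy(sigma),
    "uniform": uniform_entropy(sigma),
}
for name, h in entropies.items():
    print(f"{name:>8}: {h:.4f} nats")
```

Under this definition the Gaussian comes out on top, even though the Laplace distribution has visibly heavier tails.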

This seems quite mysterious to me. Intuitively, a maximum-entropy distribution should be as spread out as possible (within the confines of the fixed variance, which is a measure of spread). A Gaussian, however, decays super-exponentially! It falls off very quickly, on the order of e^(-x^2). This makes outliers very, very unlikely. Intuitively, it seems as if a polynomial decay rate would be much better, leaving a little bit of probability mass for far-flung values.
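The difference between the two decay rates is easy to see numerically. The sketch below compares the probability of landing more than k standard deviations from the mean under a Gaussian and under a polynomially decaying alternative. The alternative is an assumption picked purely for illustration: a Student t distribution with 3 degrees of freedom, rescaled to unit variance, whose tail falls off like k^-3 and has a simple closed form.

```python
import math

def gaussian_tail(k):
    """P(|X| > k) for a Gaussian with mean 0 and variance 1."""
    return math.erfc(k / math.sqrt(2))

def student_t3_tail(k):
    """P(|X| > k) for a Student t with 3 degrees of freedom,
    rescaled so that its variance is also 1 (illustrative choice)."""
    return 1 - (2 / math.pi) * (k / (1 + k * k) + math.atan(k))

for k in (1, 2, 4, 6, 8):
    g = gaussian_tail(k)
    t = student_t3_tail(k)
    print(f"k={k}: gaussian={g:.3e}  t3={t:.3e}  ratio={t / g:.3e}")
```

Both distributions have the same variance, yet far from the mean the polynomial tail keeps many orders of magnitude more probability mass than the Gaussian does.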

This issue is not merely academic. Nassim Nicholas Taleb has repeatedly shown that the standard statistical models used in socioeconomic settings are disastrously wrong when it comes to outliers. A small number of outliers are improbable enough to completely invalidate the models from a scientific perspective. These outliers are not just annoying, but also highly consequential, often leading to the failure of investments and massive losses of capital. To quote him: "Nobody has managed to explain why it is not charlatanism, downright scientifically fraudulent to use these techniques." Taleb advocates the use of probability distributions with a wider spread, such as a power law. Basically, this means distributions with polynomial decay rather than exponential decay.

Advocates of the maximum entropy principle should (arguably) be puzzled by this failure. The Gaussian distribution was supposed to have maximum spread, in the sense that's important (entropy). Yet, Taleb showed that the spread is far too conservative! What went wrong?

I can see two possible problems with the Gaussian distribution: entropy, and variance.

As I've already explored, the definition of entropy being used is highly questionable. It's not obvious that entropy should be applied to the continuous domain at all, and even if it is, there doesn't seem to be very much justification for the formula which is currently employed.

There is another assumption behind the maximum-entropy justification of the Gaussian, however: fixed variance. A Gaussian is the maximum-entropy distribution given a variance. Entropy measures "spread", but variance also measures "spread" in a different sense. The interaction between these two formulas for spread gives rise to the Gaussian distribution.

The formula for variance is based on the squared deviation. Deviation is the distance from the mean. There are many different measures of the collective deviation. Squared deviation is not robust. Why not use the absolute deviation? This question comes up now and again but, as far as I've observed, it is rarely addressed inside statistics (rarely taught in classrooms, rarely mentioned in textbooks, et cetera). It's a fascinating topic, and Taleb's work suggests it's a critical one.
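A tiny example shows how much more weight squaring gives to a single extreme value. The data set and the helper name `spread_report` are hypothetical, chosen only for illustration; the sketch reports both spread measures and the share of each total that comes from the single largest deviation.

```python
import math
import statistics

def spread_report(data):
    """Compare squared-deviation and absolute-deviation measures of spread."""
    mean = statistics.fmean(data)
    abs_devs = [abs(x - mean) for x in data]
    sq_devs = [d * d for d in abs_devs]
    return {
        "mean_abs_dev": sum(abs_devs) / len(data),
        "std_dev": math.sqrt(sum(sq_devs) / len(data)),
        # Fraction of each total contributed by the single largest deviation.
        "largest_share_abs": max(abs_devs) / sum(abs_devs),
        "largest_share_sq": max(sq_devs) / sum(sq_devs),
    }

clean = [8, 9, 10, 11, 12]       # made-up, well-behaved data
with_outlier = clean + [100]     # the same data plus one extreme value

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    report = spread_report(data)
    print(label, {key: round(value, 3) for key, value in report.items()})
```

The absolute deviation is still pulled around by the outlier, since it is measured from the mean, but the outlier dominates the squared total far more heavily than the absolute one.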

So, it seems as if there is something important to learn by digging further into the foundations of our measures of spread, including both entropy and variance.

Naturally, my personal interest here is for applications in AI. It seems necessary to have good built-in methods for dealing with continuous variables, and uncertainty about continuous variables. So, if the standard methods are flawed, that could be relevant for AI. Interestingly, many machine learning methods do prefer alternatives to squared error. SVMs are a primary example.

1 comment:

  1. I found it helpful to read your article backwards, after reading it through once. I understood before, to have your theory & hypothesis presented firs...but it's not like this is a research paper or anything. ;) The bottom line is that these ideas are communicable & that you get value out of them in ways that help you fulfill practical solving power in life...to which I say, kudos to you! Hug! :D