Taleb and others have mentioned that the bell curve (or Gaussian) does not deal with outliers well; it gives them a very small probability, and the parameter estimates end up being highly dependent on them.

Yet, one of the justifications of the Gaussian is that it's the max-entropy curve for a given mean and standard deviation.

Entropy is supposed to be a measure of the uncertainty associated with a distribution; so, shouldn't we expect that the max-entropy distribution would give as high a probability to outliers as possible?

There are several answers.

First: a basic problem is that phrase "a given mean and standard deviation". In particular, to choose a standard deviation is to choose an acceptable range for outliers (in some sense). If we have uncertainty about the standard deviation, it turns out the resulting curve has a polynomial decay rather than an exponential one! (This means distant outliers are far more probable.) Essentially, estimating deviations from data (using maximum-likelihood) makes us extremely overconfident that the data will fall in the range we've experienced before. A little Bayesian uncertainty (which still estimates from data, but admits a range of possibilities) turns out to be much less problematic in this respect.

This is definitely helpful, but doesn't solve the riddle: it still feels strange, that the max-entropy distribution would have such a sharp (super-eponential!) decay rate. Why is that?

My derivation will be somewhat heuristic, but I felt that I understood "why" much better by working it out this way than by following other proofs I found (which tend to start with the normal and show that no other distribution has greater entropy, rather than starting with desired features and deriving the normal). [Also, sorry for the poor math notation...]

First, let's pretend we have a discrete distribution over n points, x

_{n}. The result will apply no matter how many points we have, which means it applies in the limit of a continuous distribution. Continuous entropy is not the limit of discrete entropy, so I won't actually be maximising discrete entropy here; I'll maximise the discrete version of the continuous entropy formula:

f(x) maximising sum_i: f(x

_{i})log(f(x

_{i}))

Next, we constrain the distribution to sum to a constant, have a constant mean, and have constant variance (which also makes the standard deviation constant):

sum

_{i}[f(x

_{i})] = C

_{1}

sum

_{i}[x

_{i}f(x

_{i})] = C

_{2}

sum

_{i}[x

_{i}

^{2}f(x

_{i})] = C

_{3}

To solve the constrained optimisation problem, we make lagrange multipliers for the constraints:

Lagrangian:

f(x

_{i})log(f(x

_{i}))

- lambda

_{1}*(sum

_{i}[f(x

_{i})]-C

_{1})

- lambda

_{2}*(sum

_{i}[x

_{i}f(x

_{i})]-C

_{2})

- lambda

_{3}*(sum

_{i}[x

_{i}

^{2}f(x

_{i})]-C

_{3})

Partial derivatives in the f(x

_{i}):

1 - log(f(x

_{i}))

- lambda

_{1}

- lambda

_{2}*x

_{i}

- lambda

_{3}*x

_{i}

^{2}

Setting this equal to zero and solving for f(x

_{i}):

f(x

_{i}) = 2

^{1 - lambda1 - lambda2*xi - lambda3*xi2}

That's exactly the form of the Gaussian: a constant to the power of a 2nd-degree polynomial!

So, we can see where everything comes from: the exponential comes from our definition of entropy, and the function within the exponent comes from the Lagrange multipliers. The Gaussian is quadratic precisely because we chose a quadratic loss function! We can get basically any form we want by choosing a different loss function. If we use the kurtosis rather than the variance, we will get a fourth degree polynomial rather than a second degree one. If we choose an exponential function, we can get a doubly exponential probability distribution. And so on. There should be some limitations, but more or less, we can get any probability distribution we want, and claim that it is justified as the maximum-entropy distribution (fixing some measure of spread). We can even get rid of the exponential by putting a logarithm around our loss function.

Last time, I mentioned robust statistics, which attempts to make statistical techniques less sensitive to outliers. Rather than using the mean, robust statistics recommends using the median: whereas a sufficiently large outlier can shift the mean by an arbitrary amount, a single outlier has the same limited effect on the median no matter how extreme its value.

I also mentioned that it seems more intuitive to use the absolute deviation, rather than the squared deviation.

If we fix the absolute deviation and ask for the maximum-entropy function, we get something like e

^{-|x|}as our distribution. This is an ugly little function, but the maximum-likelihood estimate of the center of the distribution is precisely the median! e

^{-|x|}justifies the strategy of robust statistics, reducing sensitivity to outliers by making extreme outliers more probable.

*(The reason is: the max-likelihood estimate will be the point which minimizes the sum of the loss functions centred at each data point. The derivative at x is equal to the number of data points below x minus the number above x. Therefore the derivative is only zero when these two are equal. This is the minimum loss.)*

What matters, of course, is not what nice theoretical properties a distribution may have, but how well it matches the true situation. Still, I find it very interesting that we can construct a distribution which justifies taking the median rather than the mean... and I think it's important to show how arbitrary the status of the bell curve as the maximum entropy distribution is.

Just to be clear: the Bayesian solution is not usually to think

*too*much about what distributions might have the best properties. This is important, but when in doubt, we can simply take a mixture distribution over as large a class of hypotheses as we practically can. Bayesian updates give us nice convergence to the best option, while also avoiding overconfidence (for example, the asymptotic probability of outliers will be on the order of the most outlier-favouring distribution present).

Still, a complex machine learning algorithm may still need a simple one as a sub-algorithm to perform simple tasks; a genetic programming approximation of Bayes may need simpler statistical tools to make an estimation of distribution algorithm work. More generally, when humans build models, they tend to compose known distributions such as Gaussians to make them work. In such cases, it's interesting to ask whether classical or robust statistics is more appropriate.

## No comments:

## Post a Comment