Sunday, March 29, 2015

Good Bias, Bad Bias

I had a conceptual disagreement with a couple of friends, and I'm trying to spell out what I meant here in order to continue the discussion.

The statistical notion of bias is defined in terms of estimators. Suppose there's a hidden value, Theta, and you observe data X whose probability distribution is dependent on Theta, with known P(X|Theta). An estimator is a function of the data which gives you a hopefully-plausible value of Theta.

An unbiased estimator is an estimator which has the property that, given a particular value of Theta, the expected value of the estimator (expectation in P(X|Theta)) is exactly Theta. In other words: our estimate may be higher or lower than Theta due to the stochastic relationship between X and Theta, but it hits Theta on average. (In order for averaging to make sense, we're assuming Theta is a real number, here.)

The Bayesian view is that we have a prior on Theta, which injects useful bias in our judgments. A Bayesian making statistical estimators wants to minimize loss. Loss can mean different things in different situations; for example, if we're estimating whether a car is going to hit us, the damage done by wrongly thinking we are safe is much larger than the damage done by wrongly thinking we're not. However, if we don't have any specific idea about real-world consequences, it may be reasonable to assume a squared-error loss so that we are trying to get our estimated Theta to match the average value of Theta.

Even so, the Bayesian choice of estimator will not be unbiased, because Bayesians will want to minimize the expected loss accounting for the prior, which means looking at the expectation in P(X|Theta)*P(Theta). In fact, we can just look at P(Theta|X). If we're minimizing squared error, then our estimator would be the average Theta in P(Theta|X), which is proportional to P(X|Theta)P(Theta).

Essentially, we want to weight our average by the prior over Theta because we decrease our overall expected loss by accepting a lot of statistical bias for values of Theta which are less probable according to our prior.

So, a certain amount of statistical bias is perfectly rational.
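To make this concrete, here is a minimal simulation sketch. The specific model is my own assumption for illustration: Theta is drawn from a normal prior centered at 0, and X is Theta plus normal noise. Under that assumption the posterior mean is just X shrunk toward 0 by a constant factor. The function names are hypothetical.

```python
import random


def average_losses(n=100_000, prior_sd=1.0, noise_sd=1.0, seed=0):
    """Average squared error of X itself vs. the posterior mean, over the prior."""
    rng = random.Random(seed)
    # Assumed model: Theta ~ Normal(0, prior_sd), X ~ Normal(Theta, noise_sd).
    # In that model the posterior mean is E[Theta | X] = shrink * X.
    shrink = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    loss_unbiased = 0.0
    loss_bayes = 0.0
    for _ in range(n):
        theta = rng.gauss(0.0, prior_sd)  # Theta drawn from the prior
        x = rng.gauss(theta, noise_sd)    # X drawn from P(X|Theta)
        loss_unbiased += (x - theta) ** 2          # unbiased estimator: X
        loss_bayes += (shrink * x - theta) ** 2    # Bayesian estimator
    return loss_unbiased / n, loss_bayes / n


def bias_at(theta, n=100_000, prior_sd=1.0, noise_sd=1.0, seed=1):
    """Statistical bias of the posterior-mean estimator at one fixed Theta."""
    rng = random.Random(seed)
    shrink = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    total = sum(shrink * rng.gauss(theta, noise_sd) for _ in range(n))
    return total / n - theta


if __name__ == "__main__":
    print("average squared loss (unbiased, Bayes):", average_losses())
    print("bias of the Bayes estimator at Theta=2:", bias_at(2.0))
```

With equal prior and noise spread, the Bayesian estimator is strongly biased at improbable values of Theta (at Theta=2 it undershoots by about 1 on average), yet its squared loss averaged over the prior comes out at roughly half that of the unbiased estimator.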

Bad bias, to a Bayesian, refers to situations when we can predictably improve our estimates in a systematic way.

One of the limitations of the paper reviewed last time was that it didn't address good vs bad bias. Bias, in that paper, was more or less indistinguishable from bias in the statistical sense. Detangling things we can improve from things which we want would require a deeper analysis of the mathematical model, and of the data.

Saturday, March 28, 2015

A Paper on Bias

I've been reading some of the cognitive bias literature recently.

First, I dove into Toward a Synthesis of Cognitive Biases, by Martin Hilbert: a work which claims to explain how eight different biases observed in the literature are an inevitable result of noise in the information-processing channels in the brain.

The paper starts out with what it calls the conservatism bias. (The author complains that the literature is inconsistent about naming biases, both giving one bias multiple names and using one name for multiple biases. Conservatism is what is used for this paper, but this may not be standard terminology. What's important is the mathematical idea.)

The idea behind conservatism is that when shown evidence, people tend to update their probabilities more conservatively than would be predicted by probability theory. It's as if they didn't observe all the evidence, or aren't taking the evidence fully into account. A well-known study showed that subjects were overly conservative in assigning probabilities to gender based on height; an earlier study had found that the problem is more extreme when subjects are asked to aggregate information, guessing the gender of a random selection of same-sex individuals from height. Many studies were done to confirm this bias. A large body of evidence accumulated which indicated that subjects irrationally avoided extreme probabilities, preferring to report middling values.

The author construed conservatism very broadly. Another example given was: if you quickly flash a set of points on a screen and ask subjects to estimate their number, then subjects will tend to over-estimate the number of a small set of points, and under-estimate the number of a large set of points.

The hypothesis put forward in Toward a Synthesis is that conservatism is a result of random error in the information-processing channels which take in evidence. If all red blocks are heavy and all blue blocks are light, but you occasionally mix up red and blue, you will conclude that most red blocks are heavy and most blue blocks are light. If you are trying to integrate some quantity of information, but some of it is mis-remembered, small probabilities will become larger and large will become smaller.
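The block example can be worked out directly. This is a small sketch under my own simplifying assumptions (colors are swapped in memory at a fixed rate, independently of weight); it is not the paper's exact model, and the function name is hypothetical.

```python
def perceived_heavy_rate(p_heavy_red, p_heavy_blue, mixup_rate, frac_red=0.5):
    """P(heavy | remembered as red) when colors are swapped with mixup_rate.

    Assumption: a block's color is misremembered with probability mixup_rate,
    independently of its weight.
    """
    truly_red = frac_red * (1 - mixup_rate)    # red blocks remembered as red
    truly_blue = (1 - frac_red) * mixup_rate   # blue blocks remembered as red
    heavy = truly_red * p_heavy_red + truly_blue * p_heavy_blue
    return heavy / (truly_red + truly_blue)


if __name__ == "__main__":
    # All red blocks heavy, all blue blocks light, 10% color mix-ups.
    print(perceived_heavy_rate(1.0, 0.0, mixup_rate=0.1))
    # The same calculation from the blue side (swap the roles of the colors).
    print(perceived_heavy_rate(0.0, 1.0, mixup_rate=0.1))
```

With a 10% mix-up rate and equal numbers of each color, the certain "100% of red blocks are heavy" becomes an apparent 90%, and the certain "0% of blue blocks are heavy" becomes an apparent 10%: both extremes move toward the middle.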

One thing that bothered me about this paper was that it did not directly contrast processing-error conservatism with the rational conservatism which can result from quantifying uncertainty. My estimate of the number of points on a screen should tend toward the mean if I only saw them briefly; this bias will increase my overall accuracy rate. It seems that previous studies established that people were over-conservative compared to the rational amount, but I didn't take the time to dig up those analyses.

All eight biases explained in Toward a Synthesis were effectively consequences of conservatism in different ways.

  • Illusory correlation: Two rare events X and Y which are independent appear correlated as a result of their probabilities being inflated by conservatism bias. I found this to be the most interesting application. The standard example of illusory correlation is stereotyping of minority groups. The race is X, and some rare trait is Y. What was found was that stereotyping could be induced in subjects by showing them artificial data in which the traits were entirely independent of the races. Y could be either a positive or a negative trait; illusory correlation occurs either way. The effect that conservatism has on the judgements will depend on how you ask the subject about the data, which is interesting, but illusory correlation emerges regardless. Essentially, because all the frequencies are smaller within the minority group, the conservatism bias operates more strongly; the trait Y is inflated so much that it's seen as being about 50-50 in that group, whereas the judgement about its frequency in the majority group is much more realistic.
  • Self-Other Placement: People with low skill tend to overestimate their abilities, and people with high skill tend to underestimate theirs; this is known as the Dunning-Kruger effect. This is a straightforward case of conservatism. Self-other placement refers to the further effect that people tend to be even more conservative about estimating other people's abilities, which paradoxically means that people of high ability tend to over-estimate the probability that they are better than a specific other person, despite the Dunning-Kruger effect; and similarly, people of low ability tend to over-estimate the probability that they are worse as compared with specific individuals, despite over-estimating their ability overall. The article explains this as a result of having less information about others, and hence, being more conservative. (I'm not sure how this fits with the previously-mentioned result that people get more conservative as they have more evidence.)
  • Sub-Additivity: This bias is a class of inconsistent probability judgements. The estimated probability of an event will be higher if we ask for the probability of a set of sub-events, rather than merely asking for the overall probability. From Wikipedia: For instance, subjects in one experiment judged the probability of death from cancer in the United States was 18%, the probability from heart attack was 22%, and the probability of death from "other natural causes" was 33%. Other participants judged the probability of death from a natural cause was 58%. Natural causes are made up of precisely cancer, heart attack, and "other natural causes"; however, the sum of the latter three probabilities was 73%, and not 58%. According to Tversky and Koehler (1994) this kind of result is observed consistently. The bias is explained with conservatism again. The smaller probabilities are inflated more by the conservatism bias than the larger probability is, which makes their sum much more inflated than the original event.
  • Hard-Easy Bias: People tend to overestimate the difficulty of easy tasks, and underestimate the difficulty of hard ones. This is straightforward conservatism, although the paper framed it in a somewhat more complex model (it was the 8th bias covered in the paper, but I'm putting it out of order in this blog post).
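The sub-additivity explanation in the list above can be checked with a few lines of arithmetic. This is only a sketch: the noise model (a yes/no judgement that gets flipped at a fixed rate) and the example probabilities are my own illustrative assumptions, not numbers from the paper or from the experiment quoted above.

```python
def noisy(p, error_rate):
    """Apparent probability of an event when yes/no outcomes are flipped
    with probability error_rate (an assumed, simplified noise model)."""
    return p * (1 - error_rate) + (1 - p) * error_rate


def subadditivity_gap(parts, error_rate):
    """Sum of the noisy sub-event probabilities minus the noisy probability
    of the whole event (the whole is the sum of disjoint parts)."""
    whole = sum(parts)
    return sum(noisy(p, error_rate) for p in parts) - noisy(whole, error_rate)


if __name__ == "__main__":
    parts = [0.05, 0.10, 0.15]  # hypothetical true probabilities, whole = 0.30
    print([noisy(p, 0.1) for p in parts])
    print(noisy(sum(parts), 0.1))
    print(subadditivity_gap(parts, 0.1))
```

In this toy model each small probability is pushed up by more than the whole is, so the parts sum to noticeably more than the whole. With k disjoint parts the gap works out to error_rate * (k - 1), regardless of the particular probabilities chosen.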

That's 5 biases down, and 3 to go. The article has explained conservatism as a mistake made by a noisy information-processor, and explains 4 other biases as consequences of conservatism. So far so good.

Here's where things start to get... weird.

Simultaneous Overestimation and Underestimation

Bias 5 is termed exaggerated expectation in the paper. This is a relatively short section which reviews a bias dual to conservatism. Conservatism looks at the statistical relationship from the evidence, to the estimate formed in the brain. If there is noise in the information channel connecting the two, then conservatism is a statistical near-certainty.

Similarly, we can turn the relationship around. The conservatism bias was based on looking at P(estimate|evidence). We can turn it around with Bayes' Law, to examine P(evidence|estimate). If there is noise in one direction, there is noise in the other direction. This has a surprising implication: the evidence will be conservative with respect to the estimate, by essentially the same argument which says that the estimate will tend to be conservative with respect to the evidence. This implies that (under statistical assumptions spelled out in the paper), our estimates will tend to be more extreme than the data. This is the exaggerated expectation effect.

If you're like me, at this point you're saying what???

The whole idea of conservatism was that the estimates tend to be less extreme than the data! Now "by the same argument" we are concluding the opposite?

The section refers to a paper about this, so before moving further I took a look at that reference. The paper is Simultaneous Over- and Under- Confidence: the Role of Error in Judgement Process by Erev et al. It's a very good paper, and I recommend taking a look at it.

Simultaneous Over- and Under- Confidence reviews two separate strains of literature in psychology. A large body of studies in the 1960s found systematic and reliable underestimation of probabilities. This revision-of-opinion literature concluded that it was difficult to take the full evidence into account to change your beliefs. Later, many studies on calibration found systematic overestimation of probabilities: when subjects are asked to give probabilities for their beliefs, the probabilities are typically higher than their frequency of being correct.

What is going on? How can both of these be true?

One possible answer is that the experimental conditions are different. Revision-of-opinion tests give a subject evidence, and then test how well the subject has integrated the evidence to form a belief. Calibration tests are more like trivia sessions; the subject is asked an array of questions, and assigns a probability to each answer they give. Perhaps humans are stubborn but boastful: slow to revise their beliefs, but quick to over-estimate the accuracy of those beliefs. Perhaps this is true. It's difficult to test this against the data, though, because we can't always distinguish between calibration tests and revision-of-opinion tests. All question-answering involves drawing on world knowledge combined with specific knowledge given in the question to arrive at an answer. In any case, a much more fundamental answer is available.

The Erev paper points out that revision-of-opinion experiments used different data analysis. Erev re-analysed the data for studies on both sides, and found that the statistical techniques used by revision-of-opinion researchers found underconfidence, while the techniques of calibration researchers found overconfidence, in the same data-set!

Both techniques compared the objective probability, OP, with the subject's reported probability, SP. OP is the empirical frequency, while SP is whatever the subject writes down to represent their degree of belief. However, revision-of-opinion studies started with a desired OP for each situation and calculated the average SP for a given OP. Calibration literature instead starts with the numbers written down by the subjects, and then asks how often they were correct; so, they're computing the average OP for a given SP.

When we look at data and try to find functions from X to Y like that, we're creating statistical estimators. A very general principle is that estimators tend to be regressive: my Y estimate will tend to be closer to the Y average than the actual Y. Now, in the first case, scientists were using X=OP and Y=SP; lo and behold, they found it to be regressive. In later decades, they took X=SP and Y=OP, and found that to be regressive! From a statistical perspective, this is plain and ordinary business as usual. The problem is that one case was termed under-confidence and the other over-confidence, and they appeared from those names to be contrary to one another.
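Here is a minimal simulation sketch of regression in both directions. The setup is my own assumption, not Erev's exact model: OP and SP are each treated as a noisy reading of a shared latent quantity, on an arbitrary unbounded scale. The function names are hypothetical.

```python
import random


def regression_slope(xs, ys):
    """Least-squares slope for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


def simulate(n=50_000, latent_sd=1.0, noise_sd=0.5, seed=0):
    """Assumed model: OP and SP are independent noisy readings of one latent value."""
    rng = random.Random(seed)
    op, sp = [], []
    for _ in range(n):
        latent = rng.gauss(0.0, latent_sd)
        op.append(latent + rng.gauss(0.0, noise_sd))  # "objective" side
        sp.append(latent + rng.gauss(0.0, noise_sd))  # "subjective" side
    return op, sp


if __name__ == "__main__":
    op, sp = simulate()
    # Revision-of-opinion style analysis: average SP for a given OP.
    print("slope of SP on OP:", regression_slope(op, sp))
    # Calibration style analysis: average OP for a given SP.
    print("slope of OP on SP:", regression_slope(sp, op))
```

Both slopes come out below 1 (around 0.8 with these settings). Read in the first direction, SP looks too timid relative to OP; read in the second, OP falls short of SP. It is the same data set both times.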

This is exactly what the Toward a Synthesis paper was trying to get across with the reversed channel, P(estimate|evidence) vs P(evidence|estimate).

Does this mean that the two biases are mere statistical artifacts, and humans are actually fairly good information systems whose beliefs are neither under- nor over- confident? No, not really. The statistical phenomena are real: humans are both under- and over-confident in these situations. What Toward a Synthesis and Simultaneous Over- and Under- Confidence are trying to say is that these are not mutually inconsistent, and can be accounted for by noise in the information-processing system of the brain.

Both papers propose a model which accounts for overconfidence as the result of noise during the creation of an estimate, although they are put in different terms. The next section of Toward a Synthesis is about overconfidence bias specifically (which it sees as a special case of exaggerated expectations, as I understand them; the 7th bias to be examined in the paper, for those keeping count). The model shows that even with accurate memories (and therefore the theoretical ability to reconstruct accurate frequencies), an overconfidence bias should be observed (under statistical conditions outlined in the paper). Similarly, Simultaneous Over- and Under- Confidence constructs a model in which people have perfectly accurate probabilities in their heads, and the noise occurs when they put pen to paper: their explicit reflection on their belief adds noise which results in an observed overconfidence.

Both models also imply underconfidence. This means that in situations where you expect perfectly rational agents to reach 80% confidence in a belief, you'd expect rational agents with noisy reporting of the sort postulated to give estimates averaging lower (say, 75%). This is the apparent underconfidence. On the other hand, if you are ignorant of the empirical frequency and one of these agents tells you that it is 80%, then it is you who is best advised to revise the number down to 75%.

This is made worse by the fact that human memories and judgement are actually fallible, not perfect, and subject to the same effects. Information is subject to bias-inducing-noise at each step of the way, from first observation, through interpretation and storage in the brain, modification by various reasoning processes, and final transmission to other humans. In fact, most information we consume is subject to distortion before we even touch it (as I discussed in my previous post). I was a bit disappointed when the Toward a Synthesis paper dismissed the relevance of this, stating flatly "false input does not make us irrational".

Overall, I find Toward a Synthesis of Cognitive Biases a frustrating read and recommend the shorter, clearer Simultaneous Over- and Under- Confidence as a way to get most of the good ideas with fewer of the questionable ones. However, that's for people who already read this blog post and so have the general idea that these effects can actually explain a lot of biases. By itself, Simultaneous Over- and Under- Confidence is one step away from dismissing these effects as mere statistical artifacts. I was left with the impression that Erev doesn't even fully dismiss the model where our internal probabilities are perfectly calibrated and it's only the error in conscious reporting that's causing over- and under- estimation to be observed.

Both papers come off as quite critical of the state of the research, and I walk away from these with a bitter taste in my mouth: is this the best we've got? The extent of the statistical confusion observed by Erev is saddening, and although it was cited in Toward a Synthesis, I didn't get the feeling that it was sharply understood (another reason I recommend the Erev paper instead). Toward a Synthesis also discusses a lot of confusion about the names and definitions of biases as used by different researchers, which is not quite as problematic, but also causes trouble.

A lot of analysis is still needed to clear up the issues raised by these two papers. One problem which strikes me is the use of averaging to aggregate data, which has to do with the statistical phenomenon of simultaneous over- and under- confidence. Averaging isn't really the right thing to do to a set of probabilities to see whether it has a tendency to be over or under a mark. What we really want to know, I take it, is whether there is some adjustment which we can do after-the-fact to systematically improve estimates. Averaging tells us whether we can improve a square-loss comparison, but that's not the notion of error we are interested in; it seems better to use a proper scoring rule.
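For readers unfamiliar with the term, here is a small sketch of what makes a scoring rule "proper", using the logarithmic score as the example (my choice for illustration; the papers don't use it). The function names are hypothetical.

```python
import math


def expected_log_score(true_p, reported_q):
    """Expected log score for reporting q when the event has probability p."""
    return true_p * math.log(reported_q) + (1 - true_p) * math.log(1 - reported_q)


def best_report(true_p, steps=1000):
    """Grid-search the report q in (0, 1) that maximizes the expected score."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda q: expected_log_score(true_p, q))


if __name__ == "__main__":
    for p in (0.5, 0.75, 0.8):
        print(p, best_report(p))
```

Under a proper scoring rule, the report that maximizes your expected score is your actual probability, so any systematic after-the-fact adjustment which raises the score points at a real, correctable miscalibration.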

Finally, to keep the reader from thinking that this is the only theory trying to account for a broad range of biases: go read this paper too! It's good, I promise.

Monday, March 16, 2015

The Ordinary Web of Lies

One of the basic lessons in empiricism is that you need to consider how the data came to you in order to use it as evidence for or against a hypothesis. Perhaps you have a set of one thousand survey responses, answering questions about income, education level, and age. You want to draw conclusions about the correlations of these variables in the United States. Before you do so, you need to ask how the data was collected. Did you get these from telephone surveys? Did you walk around your neighborhood and knock on people's doors? Perhaps you posted the survey on Amazon's Mechanical Turk? These different possibilities give you samples from very different populations.

When we obtain data in a way that does not evenly sample from the population we are trying to study, this is called selection bias. If not accounted for, selection effects can cause you to draw just about any conclusion, regardless of the truth.
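As a toy illustration of how a selection effect can manufacture a conclusion, here is a sketch with made-up data: two traits which are independent in the full population, and a filter which only lets through people whose traits add up to more than a threshold. The traits, the filter, and the function names are all hypothetical.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def sample_population(n=20_000, seed=0):
    """Two independent standard-normal traits per person (made-up data)."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


def select(people, threshold=1.0):
    """Hypothetical filter: we only ever see people whose traits sum high."""
    return [(a, b) for a, b in people if a + b > threshold]


if __name__ == "__main__":
    people = sample_population()
    seen = select(people)
    print("correlation in the population:", correlation(*zip(*people)))
    print("correlation in what we see:   ", correlation(*zip(*seen)))
```

The traits are uncorrelated in the population, but strongly negatively correlated among the people who pass the filter. A different filter would produce a different apparent relationship from the same underlying population.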

In modern society, we consume a very large amount of information. Practically all of that information is highly filtered. Most of this filtering is designed to nudge your beliefs in specific directions. Even when the original authors engage in intellectual honesty, we usually see something as a result of a large, complex filter imposed by society (for example, social media). Even when scientists are perfectly unbiased, journalists can choose to cite only the studies which support their perspective.

I have cultivated what I think is a healthy fear of selection effects. I would like to convey to the reader a visceral sense of danger, because it's so easy to be trapped in a web of false beliefs based on selection effects.

A Case Study

Consider this article, Miracles of the Koran: Chemical Elements Indicated in the Koran. A Muslim roommate showed this to me when I voiced skepticism about the miraculous nature of the Koran. He suggested that there could be no ordinary explanation of such coincidences. (Similar patterns have been found in the Bible, a phenomenon which has been named the Bible Code.) I decided to attempt an honest analysis of the data to see what it led to.

Take a look at these coincidences. On their own, they are startling, right? When I first looked at these, I had the feeling that they were rather surprising and difficult to explain. I felt confused.

Then I started to visualize the person who had written this website. I supposed that they were (from their own perspective) making a perfectly honest attempt to record patterns in the Koran. They simply checked each possibility they thought of, and recorded what patterns they found.

There are 110 elements on the periodic table. The article discusses the placement (within a particular Sura, the Iron Sura) of Arabic letters which correspond (roughly) to the element abbreviations used on the Periodic Table. For example, the first coincidence noted is that the first occurrence of the Arabic equivalent of "Rn" is 86 letters from the beginning of the verse, and the atomic number of the element Rn is 86. The article notes similar coincidences with atomic weight (as opposed to atomic number), the number of letters from the end of the verse (rather than the beginning), the number of words (rather than number of letters), and several other variations.

Notice that simply looking at the number of characters from the beginning and the end, we double the chances of corresponding to the atomic number. Similarly, looking for atomic weights as well as atomic numbers doubles the chances. Each extra degree of freedom we allow multiplies the chances in this way.
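The multiplication of chances can be sketched in a few lines. The per-check hit probability below is a made-up placeholder (I did not record my real assumptions, as I explain further on), and the checks are assumed independent; only the 110 elements and the kinds of variation come from the article. The function names are hypothetical.

```python
def chance_of_some_match(p_single, n_variations):
    """Probability that at least one of n independent checks hits by chance."""
    return 1 - (1 - p_single) ** n_variations


def expected_matching_elements(n_elements, p_single, n_variations):
    """Expected number of elements with at least one chance coincidence."""
    return n_elements * chance_of_some_match(p_single, n_variations)


if __name__ == "__main__":
    p = 0.01  # hypothetical chance that any single check matches
    # from beginning/end x atomic number/weight x letters/words = 8 variations
    for variations in (1, 2, 4, 8):
        print(variations,
              chance_of_some_match(p, variations),
              expected_matching_elements(110, p, variations))
```

With a 1% chance per check, a single way of counting gives about one coincidence across 110 elements, while eight ways of counting give roughly 8.5. Nobody has to cheat for an impressive-looking list to appear.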

I couldn't easily account for all the possible variations the article's author might have looked for. However, I could restrict myself to one class of patterns and see how much the data looked like chance.

Even restricting myself to one particular class of patterns, I did not know enough of the statistics of the Arabic language to come up with a real Bayesian analysis of the data. I made some very, very rough assumptions which I didn't write down and no longer recall. I estimated the number of elements which would follow the pattern by chance, and my estimate came very close to the number which the article actually listed.

I have to admit, whatever my analysis was, it was probably quite biased as well. It's likely that I added assumptions in a way which was likely to get me the answer I wanted, although I felt I was not doing that. Even supposing that I didn't, I did stop doing math once the numbers looked like chance, satisfied with the answer. This in itself creates a bias. I could certainly have examined some of my assumptions more closely to make a better estimate, but the numbers said what I wanted, so I stopped questioning.

Nonetheless, I do think that the startling coincidences are entirely explained by the strong selection effect produced by someone combing the Koran for patterns. Innocently reporting patterns which fit your theory, with no intention to mislead, can produce startling arguments which appear at first glance to very strongly support your point. The most effective, convincing versions of these startling arguments will get shared widely on the internet and other media (so long as there is social incentive to spread the argument).

If you're not accounting for selection bias, then trying to respond to arguments with rational consideration makes you easy to manipulate. Your brain can be reprogrammed simply by showing it the most convincing arguments in one direction and not the other.

Everything is Selection Bias

Selection processes filter everything we see. We see successful products and not unsuccessful ones. We hear about famous people, which greatly biases our perception of how to get rich. We filter our friends quite a bit, perhaps in ways we don't even realize, and then often we trick ourselves into wrong conclusions about typical people based on the people we've chosen as friends.

No matter what data you're looking at, it was sampled from some distribution. It's somewhat arbitrary to think that selecting from university students is biased, but that selecting evenly from Americans is not. Indeed, university professors have far more incentive to understand the psychology of the student population! What matters is being aware of the selection process which got you the data, and accounting for that when trying to draw conclusions.

Even biological evolution can be seen as a selection effect. Selective pressure takes a tiny minority of the genes, and puts those genes into the whole population. This is a kind of self-fulfilling selection effect, weirder than simple selection bias. It's as if the rock stars in one generation become the common folk of the next.

The intuition I'm trying to get across is: selection effects are something between a physical force and an agent. Like an agent, selection effects optimize for particular outcomes. Like a physical force, selection effects operate automatically, everywhere, without requiring a guiding hand to steer them. This makes them a dangerous creature.

Social Constructs

Social reality is a labyrinth of mirrors reflecting each other. All the light ultimately comes from outside the maze, but the mirrors can distort it any way they like. The ordinary web of lies is my personal term for this. Many people will think of religion, but it goes far beyond this. When society decides a particular group is the enemy, they become the enemy. When society deems words or concepts uncouth, they are uncouth. I call these lies, but it's not what we ordinarily mean by dishonest. It's terrifyingly easy to distort reality. Even one person, alone, will tend to pick and choose observations in a self-serving way. When we get together in groups, we have to play the game: selecting facts to use as social affirmations or condemnations, selecting arguments to create consensus... it's all quite normal.

This all has to do with the concept of hyperstition (see Lemurian Time War) and hyperreality. Hyperstition refers to superstition which makes itself real. Hyperreality refers to our inability to distinguish certain fictions from reality, and the way in which our fictional, constructed world tends to take primacy over the physical world. Umberto Eco illustrates this nicely in his book Foucault's Pendulum, which warns of the deadly danger in these effects.

The webcomic The Accidental Space Spy explores alien cultures as a way of illustrating evolutionary psychology. One of the races, the Twolesy, has evolved strong belief in magic wizards. These wizards command the towns. Whoever doubts the power of a wizard is killed. Being that it has been this way for many generations, the Twolesy readily hallucinate magic. Whatever the wizards claim they can do, the Twolesy hallucinate happening. Whatever other Twolesy claim is happening, they hallucinate as well. Twolesy who do not hallucinate will not be able to play along with the social system very effectively, and are likely to be killed.

Similarly with humans. Our social system relies on certain niceties. Practically anything, no matter how not about signaling it is, becomes a subject for signaling. Those who are better at filtering information to their advantage have been chosen by natural selection for generations. We need not consciously know what we're doing -- it seems to work best when we fool ourselves as well as everyone else. And yes, this goes so far as to allow us to believe in magic. There are mentalists who know how to fool our perceptions and consciously develop strategies to do so, but equally well, there are Wiccans and the like who have similar success by embedding themselves in the ordinary web of lies.

Something which surprised me a bit is that when you try to start describing rationality techniques, people will often object to the very idea of truth-oriented dialog. Truth-seeking is not the first thing on people's minds in everyday conversation, and when you raise it to their awareness, it's not obvious that it should be. Other things are more important.

Imagine a friend has experienced a major loss. Which is better: frank discussion of the mistakes they made, or telling them that it's not really their fault and anyway everything will work out for the best in the end? In American culture at least, it can be rude to let on that you think it might be their fault. You can't honestly speculate about that, because they're likely to get their feelings hurt. Only if you're reasonably sure you have a point, and if your relationship is close enough that they will not take offense, could you say something like that. Making your friend feel better is often more important. By convincing them that you don't think it's their fault, you strengthen the friendship by signalling to them that you trust them. (In Persian culture, I'm given to understand, it's the opposite way: everyone should criticize each other all the time, because you want to make your friends think that you know better than them.)

When the stakes are high, other things easily become more important than the truth.

Notice the consequences, though: the mistakes with high consequences are exactly the ones you want to thoroughly debug. What's important is not whether it's your fault or no; what matters is whether there are different actions you should take to forestall disaster, next time a similar situation arises.

What, then? Bad poetry to finish it off?

Beware, beware, the web of lies;
the filters twist the truth, and eyes
are fool'd too well; designed to see
whate'er the social construct be!

We the master, we the tool,
that spin the thread and carve the spool
weave the web and watch us die!

Friday, December 12, 2014


Sometimes, people make a fuss about the difference between knowledge and understanding.

Recently, an explanation of this difference occurred to me which I had not considered before.

The Slate Star Codex article Right is the New Left explains fashion with cellular automata. It's a model of society which has about ten moving parts, yet has behaviors which resemble those of a whole society.

This made me think that understanding is essentially explaining something with a model small enough to fit in working memory.

Consider the extremely detailed weather model which meteorologists use to produce forecasts.

Now, consider the highly simplified explanation based on air masses, warm and cold fronts and so on which is commonly illustrated with weather maps.

The first gives us more accurate predictions, but the second one gives us more understanding. If a scientist were able to use the detailed mathematical weather model but did not think in terms of storm fronts and so on, they could not answer questions such as "why is it raining?" In the detailed physical model, "why" is almost meaningless: the causes of any particular event are huge in number.

This notion of understanding has several implications.

What constitutes understanding will depend on the mind doing the understanding, whereas knowledge is more objective in nature. I can achieve understanding of a system by putting it in terms I am familiar with. Suppose I am trying to understand an esoteric branch of chemistry known as semi-equilibrium Z-theory. I might learn all the statements belonging to SEZ theory by heart, and gain the ability to solve SEZ equations and get the correct answer, and still feel that I have little understanding. Yet, if I can relate SEZ theory to more familiar subjects, I will feel I've "put it in terms I can understand".

Assume I was an apple farmer before learning chemistry.

Let's say an experienced SEZ-theoretician gives me an analogy in which a SEZ-frubian (a central object of SEZ theory) is a rotten apple, and a SEZ-nite (another important concept in SEZ theory) is a worm slowly eating the apple. If the analogy works well enough, I feel I've gained an understanding: now when I'm solving the equations, I imagine that they are telling me things about this worm munching away happily at the core of the apple. I've now got a model with a few moving parts which allows me to make heuristic predictions much more effectively.

However, someone with no experience of apples and worms will not be helped very much by this analogy. It's placed SEZ-theory into my mental landscape, but the same explanation may not be useful to others.

Even a superhuman intelligence would have use for understanding: the actual universe is far too complex for a mind-within-universe to fully model. However, the understanding it achieves would be far beyond us. The "small" heuristic models would be too large to fit into our working memory (the mythical seven-plus-or-minus-two). Its weather maps would likely look more like our "detailed" physical simulations of the weather.

Wednesday, December 10, 2014

Epistemic Trust

My attraction to LessWrong is partially predicated on a feeling that a better culture can be created, raising the sanity waterline to improve society overall.

Recently, though, I've somewhat given up on that.

First, I was overestimating the degree to which LessWrong had created such a culture already. I'll explain why.

Talking to core LessWrong people is different. It feels highly reflective, with each person cognizant of the flaws in their own reasoning. Much less time is wasted trying to point out such flaws because they are often spotted by the person, and if not, they are admitted quickly if someone else points them out.

I call this intellectual honesty: being up-front not just about what you believe, but also why you believe it, what your motivations are in saying it, and the degree to which you have evidence for it. Feynman discussed the necessary attitude, although he didn't give it a name that I'm aware of.

There are many forces working against intellectual honesty in everyday life, but the most important one is face culture: status in the group is being signaled by agreeing and disagreeing, arguing for or against people. This need not be competitive in nature; in fact, the most common type of face-culture I experience is cooperative: people in the group are trying to be friendly and nice by finding plausible ways in which what the other person said might be true.

For example, suppose that a group of engineers are meeting to discuss a technical problem. A new hire is in the group; I will call this person X. X is eager to find acceptance in the group. The other members of the group are also eager to make X feel accepted. During the discussion, X is looking for opportunities to interject with something relevant, useful, and smart-sounding. At some point, X is reminded of something associated (as opposed to relevant): a story about a similar project at a previous work-place. X tells the story. Although the story appears on the surface to be analogous, after the telling of the story it becomes clear to everyone at the table that there is a critical dis-analogy and X has failed to make a relevant point.

Intellectual-Honesty Culture: If everyone at the table is intellectually honest, someone points out the disanalogy and everyone moves on quickly. It's very likely that X is the one to point out the disanalogy, perhaps even before finishing the story.

Face Culture: In face culture, people will focus more on trying to make X feel included. Although everyone knows the story was not relevant, it's worthwhile to comment on the story in order to make it seem relevant. If someone at the table happens to point out that it is not relevant, someone (often X) will try to "repair the damage" by amending the point being made or pointing out that it was at least associated.

The critical point here is that when the two cultures mix, a face-culture person will see intellectual honesty as an attack.

It is worth emphasizing that face culture is not dishonest, not in the normal sense. Face culture is nice; face culture is friendly; face culture is welcoming. (Although, it can be vicious at times.) Face culture is filled with white lies, especially lies by omission (such as acting as if a comment were relevant and made a good point), but if you try to call out any of these lies you will utterly fail. They are not lies in the common conception of lie. They are not dishonest in the common conception of honesty.

Attempting to call out someone for following face culture rather than being intellectually honest is, as far as I know, doomed to failure. Any such call-out will be perceived as a threat, and will ramp up the defensive face-culture behavior.

Ok, so, there are these two cultures and LessWrong succeeds at intellectual honesty. I said at the beginning that I've (partially) given up on improving culture via LessWrong, though. Why?

Well, I talked with someone who worked at a Christian school (as I understand it, a very fundamentalist one). They described what sounded like the same thing: the community was very high in intellectual honesty.

Why would this be?

If LessWrong is succeeding as a result of being devoted to rationality and reflective thinking, shouldn't we expect the exact opposite in highly religious organizations?

I think what's happening here is that LessWrong is intellectually honest not because we explicitly think about rationality and reflective thinking quite a bit, not because LessWrong is in possession of improved ideas of what rationality is about, but instead, because there is a high degree of intellectual trust.

Intellectual trust occurs when the group has common goals, mutual respect, and a largely-shared ideological framework.

When people have intellectual trust, they do not need to worry as much about why the other person is saying what they are saying. You know they are on your side, so you are free to worry about the topic at hand. You are free to point out flaws in your own reasoning because you are relatively secure in your social status and share the common goal of arriving at the correct conclusion. Likewise, you are free to find flaws in their reasoning without worrying that they will hate you for it.

This sort of intellectual trust cannot be created by simply "raising the rationality waterline".

I'm now much more interested in communities of rationalists.

Tuesday, December 9, 2014

Scramble Graphs: A Failed Idea

I spent some time this semester inventing, testing, and discarding a new probabilistic model. My adviser suggested that it might be worthwhile to write things up anyway, since the idea is intriguing.

I wrote two posts in the spring about distributed vector representations of words, something our research group has been working on. One way of thinking about our method is as a random projection. A random projection is a technique to deal with high-dimensional data via low-dimensional summaries. This is very broadly useful. Turning a high-dimensional item into a lower-dimensional one is referred to as dimensionality reduction, and there are many different techniques optimized for different applications.

Random Projections

Random projections take advantage of the Johnson-Lindenstrauss lemma, which shows that the vast majority of dimension reductions are "good" in the sense of preserving approximate distances between points. If this property is useful to our application, then this is a very easy dimension-reduction technique to apply.

This is mathematically related to compressed sensing, a powerful signal processing technique which has emerged fairly recently.

The fundamental story as I see it is:

An n-dimensional vector space has n orthogonal basis vectors (n cardinal directions), which are all at right angles to each other. However, if we relax to approximate orthogonality (almost 90 degree separation), the number of vectors we can pack in increases rapidly, especially for high n. In fact, it increases so rapidly that even choosing vectors randomly we are very likely to be able to fit exponentially many nearly-orthogonal vectors in. Let's say we can fit m vectors, where m is exponential in n. (Look at theorem 6 here for the more precise formula.)

These pseudo-basis vectors define our random projection. We map the orthogonal basis vectors of a large space (the space we really want to work with) down to this small space of size n, using one pseudo-orthogonal vector to represent each truly-orthogonal vector in the higher space.

For example, if we want to represent a probability distribution on m items, we would normally need that many numbers. This can be thought of as a point in the m-dimensional space. However, we can approximately represent it with just n numbers by taking the projection. We will create some error, but we can approximately recover probabilities. This will work especially well if there are a few large probabilities and many small ones (and we don't care too much about getting the small ones right).
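To make the projection concrete, here is a minimal sketch in plain Python. It is not the method our group used; the sizes (m = 1000 items, n = 400 dimensions), the three-item distribution, and the helper names are all made up for illustration. Each item gets a random unit vector as its code, the distribution is summarized as the probability-weighted sum of the codes, and each probability is estimated by taking the dot product of the summary with that item's code.

```python
import math
import random

def random_unit_vector(n, rng):
    """Draw a random direction in n dimensions (Gaussian sample, normalized)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(probs, codes):
    """Summarize an m-item distribution as one n-dimensional vector."""
    n = len(codes[0])
    summary = [0.0] * n
    for p, code in zip(probs, codes):
        if p:  # skip zero entries; the distribution is sparse
            for i in range(n):
                summary[i] += p * code[i]
    return summary

def recover(summary, codes):
    """Estimate each probability by projecting the summary back onto its code."""
    return [dot(summary, code) for code in codes]

rng = random.Random(0)   # fixed seed so the run is repeatable
m, n = 1000, 400         # illustrative sizes only
codes = [random_unit_vector(n, rng) for _ in range(m)]

# A sparse distribution: a few large probabilities, everything else zero.
probs = [0.0] * m
probs[3], probs[42], probs[777] = 0.6, 0.3, 0.1

summary = project(probs, codes)
estimate = recover(summary, codes)
top_two = sorted(range(m), key=lambda i: estimate[i], reverse=True)[:2]
print(top_two, round(estimate[3], 2), round(estimate[42], 2))
```

The estimates are noisy because the codes are only approximately orthogonal, so the small probabilities get lost in the noise while the large ones stand out, which is the trade-off described above.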

Scramble Graphs

The idea I had was to use this kind of representation within a probabilistic network. Let's say we are trying to represent a dynamic Bayesian network with variables that have thousands of possible values (for example, variables might be English words). The size of probabilistic relationships between these variables gets very large. English has about 10^6 words. A table giving the probabilistic relationship between two words would need 10^12 entries, and between three words, 10^18. Fortunately, these tables will be very sparse. 95% of the time we're using just the most common 5% of words, so we can restrict the vocabulary size to 10^5 or even 10^4 without doing too much damage. Furthermore, it's unlikely we're getting anywhere near 10^12 words in training data, and impossible that we get 10^18. There are fewer than 10^9 web pages, so if each page averaged 1,000 English words, we could possibly get near 10^12. (I don't know how many words the internet actually totals to.) Even then, because some pairs of words are much more common than others, our table of probabilities would be very sparse. It's much more likely that our data has just millions of words, though, which means we have at most millions of nonzero co-occurrence counts (and again, far fewer in practice due to the predominance of a few common co-occurrences).

This sparsity makes it possible to store the large tables needed for probabilistic language models. But, what if there's a different way? What if we want to work with distributed representations of words directly?

My idea was to apply the random projection to the probability tables inside the graph, and then train the reduced representation directly (so that we never try to store the large table). This yields a kind of tensor network. The "probability tables" are now represented abstractly, by a matrix (in the 2-variable case) or tensor (for more variables) which may have negative values and isn't required to sum to 1.

Think about the tensor network and the underlying network. In the tensor network, every variable from the underlying network has been replaced by a vector, and every probability table has been replaced by a tensor. The tensor network is an approximate representation of the underlying network; the relationship is defined by the fixed projection for each variable (to the smaller vector representations).

Because the tensor network is defined by a random transform from the underlying network, I called these "scramble graphs".

In order to train the tensor network directly without worrying about the underlying network, we want to find values for the tensors which cause the underlying network to at least approximately sum to 1 and have positive values, in the normal probabilistic way.

(That's a fairly difficult constraint to maintain, but I hoped that an approximate solution would perform well enough. In any case, I didn't really end up getting that far.)

The interesting thing about this idea is that for the hidden variables (in the dynamic Bayesian network we're hypothetically trying to represent), we do not actually care what the underlying variables mean. We might think of them as some sort of topic model or what-have-you, but all we are really representing is the vectors.

The scramble graph looks an awful lot like a neural network with no nonlinearities. Why would we want a neural network with no nonlinearities? Aren't nonlinearities important?

Yes: the nonlinear transformations in neural networks serve an important role; without them, the "capacity" of the network (the ability to represent patterns) is greatly reduced. A multi-layer neural network with no nonlinearities is no more powerful than a single-layer network. The same statement applies to these tensor graphs: no added representation power is derived from stacking layers.
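The collapse of stacked linear layers is easy to check numerically. The sketch below uses two small, arbitrary weight matrices (made up for illustration) and shows that applying them one after the other gives the same output as applying their product once.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def apply_layer(weights, x):
    """A linear layer with no nonlinearity: just a matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Arbitrary illustrative weights: layer 1 maps 2 -> 3, layer 2 maps 3 -> 2.
w1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
w2 = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
x = [0.5, -2.0]

two_layer = apply_layer(w2, apply_layer(w1, x))

# The same map as a single layer whose weights are the product w2 * w1.
collapsed = matmul(w2, w1)
one_layer = apply_layer(collapsed, x)

print(two_layer, one_layer)
```

Whatever the intermediate width, the composite is still a single linear map, so the extra layer buys no additional representational power.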

However: nonlinearity is also what makes it impossible to perform arbitrary probabilistic reasoning with neural networks. We can train them to output probabilities, but it is not easy to reverse the probability (via Bayes' Law), condition on partial information, and so on. Probabilistic models are always linear, and we need this to be able to easily do multi-directional reasoning (using a model for something it wasn't trained to do).

So, I thought scramble graphs might be an interesting compromise between neural networks and fully probabilistic models.

Unfortunately, it didn't work.

It seems like the "capacity" of the tensors is just too small. Vectors have exponential capacity in the sense I outlined at the beginning; they can usefully approximate an exponentially larger space. In my experiments, this property seems to go away when we jump to matrix-representations and tensor-representations. I tried to train a rank-3 matrix on artificial data (for which 100% accuracy was possible), using a distribution with the sparse property mentioned (so roughly 90% of the cases fell within 10% of the possibilities), but accuracy remained below 50%. Very approximately, the capacity seemed to be linear (rather than exponential) in the representation size: the fraction correct after training appeared to scale proportionately with the size of the tensor.

I don't know what the mathematics behind this phenomenon says. Perhaps I made some mistake in my training, or perhaps the sizes I used were too small to start seeing asymptotic effects (since the exponential capacity of vectors is asymptotic). I'm starting to think, though, that the math would confirm what I'm seeing: the exponential-capacity phenomenon is destroyed as soon as we move from vectors to matrices.

Saturday, December 6, 2014

Likelihood Ratios from Statistically Significant Studies

In the previous post, I reacted to an old Black Belt Bayesian post about p-values.

Since then, there's been some more discussion of this article in the LA LessWrong group. Scott Garrabrant pointed out that the likelihood ratios coming from p-values are far less than he naively intuited. I think I was making the same mistake before reading BBB, and I think it's an important and common mistake.

How much should we shift our belief when we see a p-value around 0.05 (so, just barely passing the standard for statistical significance)?

The p-value is defined as the probability that a statistic would be as great or greater than observed, assuming the null hypothesis were true.

The very common mistake is to confuse P(observation | hypothesis) with P(hypothesis | observation), naively thinking that the p-value can be used as the probability of the null hypothesis. This is bad, don't do it. (David Manheim, also from the Los Angeles LessWrong group, pointed us to this article.)

But if that's not the correct conclusion to draw, what is?

The Bayesian answer is the Bayes factor, which measures the strength of evidence for one hypothesis H1 vs another H2 as P(obs | H1) / P(obs | H2). If we combine this with a prior probability for each hypothesis, P(H1), P(H2), we can compute our posterior P(H1 | obs). For example, if our prior belief is 50-50 between the two and the likelihood ratio is 1/2, then our posterior should be 1/3 for H1 and 2/3 for H2. (H2 has become comparatively twice as probable.) However, the Bayes factor has the advantage of objectively measuring the influence of evidence on our beliefs, independent of our prior.
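The 50-50 example can be worked through in odds form. This is a small sketch which assumes H1 and H2 are the only two hypotheses under consideration, so their probabilities sum to 1; the function name is just for illustration.

```python
def posterior_h1(prior_h1, bayes_factor):
    """Posterior P(H1 | obs), given prior P(H1) and the Bayes factor
    P(obs | H1) / P(obs | H2). Assumes H1 and H2 are exhaustive."""
    prior_odds = prior_h1 / (1.0 - prior_h1)       # odds of H1 against H2
    posterior_odds = prior_odds * bayes_factor     # Bayes' rule in odds form
    return posterior_odds / (1.0 + posterior_odds)

p_h1 = posterior_h1(0.5, 0.5)   # 50-50 prior, likelihood ratio 1/2
print(p_h1, 1.0 - p_h1)         # roughly 1/3 for H1 and 2/3 for H2
```

The prior only enters through the prior odds, which is why the Bayes factor can be reported on its own and combined with whatever prior the reader holds.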

The less common mistake which both Scott and I were making was to think as if a p-value were a Bayes factor, so that a statistically significant study will shift belief against the null hypothesis by a ratio of about 1:20.

The formula mentioned by Black Belt Bayesian shows this is wrong. For a p-value of 0.05, the Bayes factor can be lower-bounded at 0.4, which means the odds of the null hypothesis only shift by 2:5. This is much less than the 1:20 shift I was intuitively making. (Of course, if the p-value is lower, this will be better!)
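The formula is not reproduced here, so as an assumption I take it to be the commonly cited lower bound -e * p * ln(p) on the Bayes factor in favor of the null (valid for p < 1/e). That bound does give roughly 0.4 at p = 0.05, matching the number above. A quick sketch:

```python
import math

def min_bayes_factor(p_value):
    """Lower bound -e * p * ln(p) on the Bayes factor for the null
    hypothesis against the alternative. Only valid for 0 < p < 1/e.
    (Assumed to be the formula referred to in the text.)"""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return -math.e * p_value * math.log(p_value)

bound = min_bayes_factor(0.05)
print(round(bound, 3))   # about 0.407, i.e. roughly 2:5

# With a 50-50 prior, the null keeps at least this much posterior probability.
posterior_null = bound / (1.0 + bound)
print(round(posterior_null, 3))
```

So under this bound, a barely-significant result leaves a Bayesian who started at 50-50 with at least about 29% belief in the null, rather than the roughly 5% the naive 1:20 reading would suggest.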

Also notice, this is a minimum: the actual likelihood ratio could be much higher! A higher ratio would be worse news for a scientist's attempt to reject the null hypothesis. It's even possible that the Bayesian should be increasing belief in the null hypothesis, if the alternative hypothesis explains the data less well. This might happen if our alternative hypothesis spreads probability mass very thinly across possibilities. The Bayes factor is a relative comparison of hypotheses (measuring how well one hypothesis explains the data compared to another), whereas null hypothesis testing via p-values attempts an absolute measure (rejecting the null hypothesis in absolute terms).