Sometimes, people make a fuss about the difference between knowledge and understanding.
Recently, an explanation of this difference occurred to me which I had not considered before.
The Slate Star Codex article Right is the New Left explains fashion with cellular automata. It's a model of society which has about ten moving parts, yet has behaviors which resemble those of a whole society.
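For concreteness, here is a toy with about that many moving parts, written in Python. The rules (a status ladder where each class imitates the class above it with a one-step lag, and the top class abandons a style the moment it is copied) are my own simplification in the same spirit, not the article's exact cellular automaton.

```python
# A minimal fashion-cycle toy in the spirit of the cellular-automata model
# described in "Right is the New Left". The rules here are a guess at a
# simple version, not the article's actual construction.

N_CLASSES = 5        # class 0 is highest status, class 4 is lowest
STEPS = 12

styles = [0] * N_CLASSES   # everyone starts with style 0

for t in range(STEPS):
    new_styles = styles[:]
    # Each class below the top imitates the class just above it (with a one-step lag).
    for c in range(1, N_CLASSES):
        new_styles[c] = styles[c - 1]
    # The top class abandons a style as soon as the class below has caught up.
    if styles[1] == styles[0]:
        new_styles[0] = styles[0] + 1      # invent the next style
    styles = new_styles
    print("step %2d: styles by class = %s" % (t, styles))
```

Running it shows styles trickling down the ladder while the top keeps moving on: a society-like cycle from a handful of rules.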
This made me think that understanding is essentially explaining something with a model small enough to fit in working memory.
Consider the extremely detailed weather model which meteorologists use to produce forecasts.
Now, consider the highly simplified explanation based on air masses, warm and cold fronts and so on which is commonly illustrated with weather maps.
The first gives us more accurate predictions, but the second gives us more understanding. If a scientist were able to use the detailed mathematical weather model but did not think in terms of storm fronts and so on, he or she could not answer questions such as "why" it is raining. In the detailed physical model, "why" is almost meaningless: the causes of any particular event are far too numerous to enumerate.
This notion of understanding has several implications.
What constitutes understanding will depend on the mind doing the understanding, whereas knowledge is more objective in nature. I can achieve understanding of a system by putting it in terms I am familiar with. Suppose I am trying to understand an esoteric branch of chemistry known as semi-equilibrium Z-theory. I might learn all the statements belonging to SEZ theory by heart, and gain the ability to solve SEZ equations and get the correct answer, and still feel that I have little understanding. Yet, if I can relate SEZ theory to more familiar subjects, I will feel I've "put it in terms I can understand".
Assume I was an apple farmer before learning chemistry.
Let's say an experienced SEZ-theoretician gives me an analogy in which a SEZ-frubian (a central object of SEZ theory) is a rotten apple, and a SEZ-nite (another important concept in SEZ theory) is a worm slowly eating the apple. If the analogy works well enough, I feel I've gained an understanding: now when I'm solving the equations, I imagine that they are telling me things about this worm munching away happily at the core of the apple. I've now got a model with a few moving parts which allows me to make heuristic predictions much more effectively.
However, someone with no experience of apples and worms will not be helped very much by this analogy. It's placed SEZ-theory into my mental landscape, but the same explanation may not be useful to others.
Even a superhuman intelligence would have use for understanding: the actual universe is far too complex for a mind-within-universe to fully model. However, the understanding it achieves would be far beyond us. The "small" heuristic models would be too large to fit into our working memory (the mythical seven-plus-or-minus-two). Its weather maps would likely look more like our "detailed" physical simulations of the weather.
Wednesday, December 10, 2014
Epistemic Trust
My attraction to LessWrong is partially predicated on a feeling that a better culture can be created, raising the sanity waterline to improve society overall.
Recently, though, I've somewhat given up on that.
First, I was overestimating the degree to which LessWrong had created such a culture already. I'll explain why.
Talking to core LessWrong people is different. It feels highly reflective, with each person cognizant not only of the subject at hand but also the potential flaws in their reasoning process. Much less time is wasted trying to sort out such flaws because an objection is not met with defensiveness. If you're already looking for the holes in your own arguments, you'll try to understand a counter-argument rather than trying to protect yourself from it. The attitude is contagious.
I call this intellectual honesty: being up-front not just about what you believe, but also why you believe it, what your motivations are in saying it, and the degree to which you have evidence for it. Feynman discussed the necessary attitude, although he didn't give it a name that I'm aware of.
There are many forces working against intellectual honesty in everyday life, but the most important one is face culture: status in the group is being signaled by agreeing and disagreeing, arguing for or against people. This can be nasty, but it isn't always; in fact, the most common type of face-culture I experience is cooperative: people in the group are trying to be friendly by accepting and encouraging each other.
For example, suppose that a group of engineers are meeting to discuss a technical problem. We will focus on one of them; I will call this person X. X is eager to find acceptance in the group. The other members of the group are also eager to make X feel accepted. During the discussion, X is looking for opportunities to interject with something relevant and useful. At some point, X is reminded of something: a problem which was encountered in a similar project at a previous work-place. X interjects that there might be a problem, and proceeds to tell the story. As X recalls the details, there's a critical difference between the old situation and the new one which makes it unlikely for the problem to arise in the current situation.
Intellectual-Honesty Culture: If everyone at the table is intellectually honest, someone points out the disanalogy. X likely concedes the point and the discussion moves on. (Often X will be the first to notice the disanalogy, and will point it out him/her-self.) If X thinks the objection is mistaken, a discussion ensues in which both participants try to understand each other's point.
Face Culture: In face culture, people will focus more on trying to make X feel included. Although the story's conclusion is unlikely to apply in the current situation, it's worthwhile to comment on the story in an agreeable way. Because agreement is a social currency, it is somewhat noncommittal; perhaps the best move is to agree that this problem can arise but then do little about it. Bold disagreement with the point is seen as (and often would be) an attempt to take X down a peg.
The critical point here is that when the two cultures mix, a face-culture person will see intellectual honesty as an attack.
It is worth emphasizing that face culture is not dishonest, not in the normal sense. Face culture is nice; face culture is friendly; face culture is welcoming. (Although, it can be vicious when it gets competitive.) Face culture is filled with white lies, especially lies by omission (such as acting as if a comment were relevant and made a good point), but if you try to call out any of these lies you will utterly fail. They are not lies in the common conception of lie. They are not dishonest in the common conception of honesty.
Moreover, face-culture is important. If X is an old hand whose status in the group is secure, face-culture would be babying. If X is a newcomer, however, face-culture niceties can establish a welcoming environment. I don't mean to suggest that there is an absolute opposition between being nice and being truthful -- often the two don't even come into conflict. There is a very real trade-off, though. At times you simply must choose one or the other.
Attempting to call out someone for following face culture rather than being intellectually honest is, as far as I know, doomed to failure. Any such call-out will be perceived as a threat, and will ramp up the defensive face-culture behavior.
Ok, so, there are these two cultures and LessWrong succeeds at intellectual honesty. I said at the beginning that I've (partially) given up on improving broader culture via LessWrong, though. Why?
Well, I talked with someone who worked at a Christian school (as I understand it, a very fundamentalist one). They described what sounded like the same thing I experience with LessWrong: the community was very high in intellectual honesty.
Why would this be?
If LessWrong's high intellectual honesty is a result of being devoted to rationality and reflective thinking, shouldn't we expect the exact opposite in highly religious organizations?
I think what's happening here is that LessWrong is intellectually honest not because we explicitly think about rationality quite a bit, nor because LessWrong is in possession of improved ideas about what rationality is, but because there is a high degree of intellectual trust.
Intellectual trust occurs when the group has common goals, mutual respect, and a largely-shared ideological framework.
When people have intellectual trust, they do not need to worry as much about why the other person is saying what they are saying. You know they are on your side, so you are free to worry about the topic at hand. You are free to point out flaws in your own reasoning because you are relatively secure in your social status and share the common goal of arriving at the correct conclusion. Likewise, you are free to find flaws in their reasoning without worrying that they will hate you for it.
This sort of intellectual trust cannot be created by simply "raising the rationality waterline".
Tuesday, December 9, 2014
Scramble Graphs: A Failed Idea
I spent some time this semester inventing, testing, and discarding a new probabilistic model. My adviser suggested that it might be worthwhile to write things up anyway, since the idea is intriguing.
I wrote two posts in the spring about distributed vector representations of words, something our research group has been working on. One way of thinking about our method is as a random projection. A random projection is a technique to deal with high-dimensional data via low-dimensional summaries. This is very broadly useful. Turning a high-dimensional item into a lower-dimensional one is referred to as dimensionality reduction, and there are many different techniques optimized for different applications.

Random Projections
Random projections take advantage of the Johnson-Lindenstrauss lemma, which shows that a randomly chosen projection is, with high probability, "good" in the sense of approximately preserving distances between points. If this property is what our application needs, then this is a very easy dimension-reduction technique to apply.
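As a quick numerical illustration (sizes chosen arbitrarily), here is a sketch: project some high-dimensional points through a random Gaussian matrix and compare pairwise distances before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, n_points = 10000, 300, 20          # original dim, reduced dim, number of points
points = rng.normal(size=(n_points, m))  # some high-dimensional data

# A random projection: Gaussian entries, scaled so distances are preserved in expectation.
projection = rng.normal(size=(n, m)) / np.sqrt(n)
projected = points @ projection.T        # shape (n_points, n)

def pairwise_distances(x):
    diffs = x[:, None, :] - x[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

original = pairwise_distances(points)
reduced = pairwise_distances(projected)

# Ratios should cluster tightly around 1.0 for most pairs.
mask = ~np.eye(n_points, dtype=bool)
ratios = reduced[mask] / original[mask]
print("distance ratio: min %.3f, mean %.3f, max %.3f"
      % (ratios.min(), ratios.mean(), ratios.max()))
```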
This is mathematically related to compressed sensing, a powerful signal processing technique which has emerged fairly recently.
The fundamental story as I see it is:
An n-dimensional vector space has n orthogonal basis vectors (n cardinal directions), which are all at right angles to each other. However, if we relax to approximate orthogonality (almost 90-degree separation), the number of vectors we can pack in increases rapidly, especially for high n. In fact, it increases so rapidly that even choosing vectors randomly, we are very likely to be able to fit in exponentially many nearly-orthogonal vectors. Let's say we can fit m vectors, where m is on the order of e^n. (Look at theorem 6 here for the more precise formula.)
These pseudo-basis vectors define our random projection. We map the orthogonal basis vectors of a large space (the space we really want to work with) down to this small space of dimension n, using one pseudo-orthogonal vector to represent each truly-orthogonal vector in the higher space.
For example, if we want to represent a probability distribution on m items, we would normally need that many numbers. This can be thought of as a point in the m-dimensional space. However, we can approximately represent it with just n numbers by taking the projection. We will create some error, but we can approximately recover probabilities. This will work especially well if there are a few large probabilities and many small ones (and we don't care too much about getting the small ones right).
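Here is a sketch of that story (the sizes and the particular sparse distribution are arbitrary choices for illustration): give each of the m outcomes a random direction in the n-dimensional space, store a distribution as the weighted sum of those directions, and read probabilities back off with dot products.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 10000, 500                         # many outcomes, small representation

# One random (approximately orthogonal) direction per outcome; rows have norm ~1.
codes = rng.normal(size=(m, n)) / np.sqrt(n)

# A sparse-ish distribution: a few large probabilities, many tiny ones.
probs = np.full(m, 0.1 / (m - 5))
probs[:5] = [0.4, 0.2, 0.15, 0.1, 0.05]

# The compressed representation: a single length-n vector.
summary = probs @ codes                   # weighted sum of the codes

# Approximate recovery: dot the summary with each outcome's code.
recovered = codes @ summary
for i in range(5):
    print("true %.3f  recovered %.3f" % (probs[i], recovered[i]))
```

The large probabilities come back with small error; the tiny ones get lost in the noise, which is exactly the regime described above.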
Scramble Graphs
The idea I had was to use this kind of representation within a probabilistic network. Let's say we are trying to represent a dynamic Bayesian network with variables that have thousands of possible values (for example, variables might be English words). The tables describing probabilistic relationships between these variables get very large. English has about 10^6 words. A table giving the probabilistic relationship between two words would need 10^12 entries, and between three words, 10^18. Fortunately, these tables will be very sparse. 95% of the time we're using just the most common 5% of words, so we can restrict the vocabulary size to 10^5 or even 10^4 without doing too much damage. Furthermore, it's unlikely we're getting anywhere near 10^12 words in training data, and impossible that we get 10^18. There are fewer than 10^9 web pages, so if each page averaged 1,000 English words, we could possibly get near 10^12. (I don't know how many words the internet actually totals to.) Even then, because some pairs of words are much more common than others, our table of probabilities would be very sparse. It's much more likely that our data has just millions of words, though, which means we have at most millions of nonzero co-occurrence counts (and again, far fewer in practice due to the predominance of a few common co-occurrences).
This sparsity makes it possible to store the large tables needed for probabilistic language models. But, what if there's a different way? What if we want to work with distributed representations of words directly?
My idea was to apply the random projection to the probability tables inside the graph, and then train the reduced representation directly (so that we never try to store the large table). This yields a kind of tensor network. The "probability tables" are now represented abstractly, by a matrix (in the 2-variable case) or tensor (for more variables) which may have negative values and isn't required to sum to 1.
Think about the tensor network and the underlying network. In the tensor network, every variable from the underlying network has been replaced by a vector, and every probability table has been replaced by a tensor. The tensor network is an approximate representation of the underlying network; the relationship is defined by the fixed projection for each variable (to the smaller vector representations).
Because the tensor network is defined by a random transform from the underlying network, I called these "scramble graphs".
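Here is a sketch of the two-variable case (the sizes and the synthetic table are invented for illustration, and this is the flavor of the representation rather than exactly what was trained): if the columns of A_x and A_y hold the random codes for the two variables' outcomes, a conditional table P(y|x) compresses into the small matrix M = A_y P A_x^T, and applying the table to a compressed distribution is just a matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(2)

m, n = 2000, 400                         # outcomes per variable, compressed dimension

# Random codes: column j of A_x is the length-n code for outcome j of variable x.
A_x = rng.normal(size=(n, m)) / np.sqrt(n)
A_y = rng.normal(size=(n, m)) / np.sqrt(n)

# A synthetic, fairly peaked conditional table P[y, x] = P(y | x); columns sum to 1.
P = np.full((m, m), 0.1 / m)             # a little probability spread everywhere
cols = np.arange(m)
P[rng.integers(0, m, size=m), cols] += 0.7
P[rng.integers(0, m, size=m), cols] += 0.2

# The compressed "table": an n x n matrix standing in for the m x m one.
M = A_y @ P @ A_x.T

# A peaked distribution over x, and its compressed form.
p_x = np.zeros(m)
p_x[[3, 17, 42]] = [0.6, 0.3, 0.1]
u = A_x @ p_x

# Exact marginal over y vs. the one computed entirely in the compressed space.
q_exact = P @ p_x
q_approx = A_y.T @ (M @ u)               # decode the compressed result

for y in np.argsort(q_exact)[-5:][::-1]:
    print("y=%4d   exact %.3f   approx %.3f" % (y, q_exact[y], q_approx[y]))
```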
In order to train the tensor network directly, without worrying about the underlying network, we want to find values for the tensors which cause the implied underlying tables to (at least approximately) have nonnegative entries and sum to 1, in the normal probabilistic way.
(That's a fairly difficult constraint to maintain, but I hoped that an approximate solution would perform well enough. In any case, I didn't really end up getting that far.)
The interesting thing about this idea is that for the hidden variables (in the dynamic Bayes network we're hypothetically trying to represent), we do not actually care what the underlying variables mean. We might think of them as some sort of topic model or what-have-you, but all we are really representing is the vectors.
The scramble graph looks an awful lot like a neural network with no nonlinearities. Why would we want a neural network with no nonlinearities? Aren't nonlinearities important?
Yes: the nonlinear transformations in neural networks serve an important role; without them, the "capacity" of the network (the ability to represent patterns) is greatly reduced. A multi-layer neural network with no nonlinearities is no more powerful than a single-layer network. The same statement applies to these tensor graphs: no added representation power is derived from stacking layers.
However: nonlinearity is also what makes it impossible to perform arbitrary probabilistic reasoning with neural networks. We can train them to output probabilities, but it is not easy to reverse the probability (via Bayes' Law), condition on partial information, and so on. Probabilistic models are linear in the relevant sense: marginalizing and conditioning are just sums and products of table entries, and we need this linearity to be able to easily do multi-directional reasoning (using a model for something it wasn't trained to do).
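To illustrate what multi-directional reasoning looks like with an explicit table (the numbers here are arbitrary): the same conditional table supports forward prediction and Bayes-reversal, both as sums and products of its entries, with nothing extra learned.

```python
import numpy as np

# A tiny conditional table: P_y_given_x[y, x], columns sum to 1.
P_y_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.1, 0.3, 0.6]])
p_x = np.array([0.5, 0.3, 0.2])          # prior over x

# Forward: P(y) = sum_x P(y|x) P(x)  -- a matrix-vector product.
p_y = P_y_given_x @ p_x

# Reverse (Bayes): P(x | y=0) is proportional to P(y=0|x) P(x).
unnormalized = P_y_given_x[0, :] * p_x
p_x_given_y0 = unnormalized / unnormalized.sum()

print("P(y):       ", np.round(p_y, 3))
print("P(x | y=0): ", np.round(p_x_given_y0, 3))
```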
So, I thought scramble graphs might be an interesting compromise between neural networks and fully probabilistic models.
Unfortunately, it didn't work.
It seems like the "capacity" of the tensors is just too small. Vectors have exponential capacity in the sense I outlined at the beginning; they can usefully approximate an exponentially larger space. In my experiments, this property seems to go away when we jump to matrix representations and tensor representations. I tried to train a 3-index tensor on artificial data (for which 100% accuracy was possible), using a distribution with the sparse property mentioned (so roughly 90% of the cases fell within 10% of the possibilities), but accuracy remained below 50%. Very approximately, the capacity seemed to be linear (rather than exponential) in the representation size: the fraction correct after training appeared to scale proportionally with the size of the tensor.
I don't know what the mathematics behind this phenomenon says. Perhaps I made some mistake in my training, or perhaps the sizes I used were too small to start seeing asymptotic effects (since the exponential capacity of vectors is asymptotic). I'm starting to think, though, that the math would confirm what I'm seeing: the exponential-capacity phenomenon is destroyed as soon as we move from vectors to matrices.
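For anyone who wants to poke at the capacity question, here is a much simpler probe (not the experiment described above, and all sizes are arbitrary): store hard x-to-y associations in an n x n matrix of superimposed outer products of random codes, and see how many survive nearest-code retrieval as the number of stored pairs grows.

```python
import numpy as np

rng = np.random.default_rng(3)

m, n = 2000, 64            # vocabulary size, code dimension

# Random codes for the x-side and y-side outcomes.
A = rng.normal(size=(m, n)) / np.sqrt(n)
B = rng.normal(size=(m, n)) / np.sqrt(n)

def recall_accuracy(n_pairs):
    """Store n_pairs random (x -> y) associations in an n x n matrix,
    then check how many are retrieved correctly by nearest-code lookup."""
    xs = rng.choice(m, size=n_pairs, replace=False)
    ys = rng.choice(m, size=n_pairs)
    M = np.zeros((n, n))
    for x, y in zip(xs, ys):
        M += np.outer(B[y], A[x])          # superimpose the associations
    # Retrieval: feed in x's code, decode against all y codes.
    predicted = np.argmax(B @ (M @ A[xs].T), axis=0)
    return np.mean(predicted == ys)

for k in [16, 32, 64, 128, 256]:
    print("pairs stored: %4d   recall: %.2f" % (k, recall_accuracy(k)))
```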
Saturday, December 6, 2014
Likelihood Ratios from Statistically Significant Studies
In the previous post, I reacted to an old Black Belt Bayesian post about p-values.
Since then, there's been some more discussion of this article in the LA LessWrong group. Scott Garrabrant pointed out that the likelihood ratios coming from p-values are far less than he naively intuited. I think I was making the same mistake before reading BBB, and I think it's an important and common mistake.
How much should we shift our belief when we see a p-value around 0.05 (so, just barely passing the standard for statistical significance)?
The p-value is defined as the probability that a statistic would be as great or greater than observed, assuming the null hypothesis were true.
The very common mistake is to confuse P(observation | hypothesis) with P(hypothesis | observation), naively thinking that the p-value can be used as the probability of the null hypothesis. This is bad, don't do it. (David Manheim, also from the Los Angeles LessWrong group, pointed us to this article.)
But if that's not the correct conclusion to draw, what is?
The Bayesian answer is the Bayes Factor, which measures the strength of evidence for one hypothesis H1 vs another H2 as P(obs | H1) / P(obs | H2). If we combine this with a prior probability for each hypothesis, P(H1), P(H2), we can compute our posterior P(H1 | obs). For example, if our prior belief is 50-50 between the two and the likelihood ratio is 1/2, then our posterior should be 1/3 for H1 and 2/3 for H2. (H2 has become comparatively twice as probable.) However, the Bayes factor has the advantage of objectively measuring the influence of evidence on our beliefs, independent of our prior.
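A few lines make the bookkeeping concrete (a minimal sketch; the function name is mine, and the numbers mirror the 50-50 example above):

```python
def posterior_probs(prior_h1, bayes_factor):
    """Update a two-hypothesis prior using the Bayes factor P(obs|H1)/P(obs|H2)."""
    prior_odds = prior_h1 / (1.0 - prior_h1)        # odds of H1 vs H2
    posterior_odds = prior_odds * bayes_factor
    p_h1 = posterior_odds / (1.0 + posterior_odds)
    return p_h1, 1.0 - p_h1

print(posterior_probs(0.5, 0.5))   # -> (0.333..., 0.666...): H2 becomes twice as probable
```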
The less common mistake which both Scott and I were making was to think as if a p-value were a Bayes factor, so that a statistically significant study will shift belief against the null hypothesis by a ratio of about 1:20.
The formula mentioned by Black Belt Bayesian shows this is wrong. For a p-value of 0.05, the Bayes factor (null vs. alternative) is lower-bounded at about 0.4, which means the odds of the null hypothesis shift against it by at most 2:5. This is a much weaker shift than the 1:20 I was intuitively applying. (Of course, if the p-value is lower, this will be better!)
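The bound here looks like the standard minimum-Bayes-factor calibration, -e * p * ln(p), which gives about 0.41 at p = 0.05 and so matches the figure above; treating that as an assumption, a quick sketch:

```python
import math

def min_bayes_factor(p):
    """Lower bound on the Bayes factor (null vs. alternative) for a given p-value,
    using the -e * p * ln(p) calibration (valid for p < 1/e)."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

for p in [0.05, 0.01, 0.001]:
    bf = min_bayes_factor(p)
    # With 50-50 prior odds, the posterior probability of the null is at least bf / (1 + bf).
    print("p = %.3f   min Bayes factor = %.3f   P(null | obs) >= %.3f"
          % (p, bf, bf / (1 + bf)))
```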
Also notice, this is a minimum: the actual likelihood ratio could be much higher! A higher ratio would be worse news for a scientist's attempt to reject the null hypothesis. It's even possible that the Bayesian should be increasing belief in the null hypothesis, if the alternative hypothesis explains the data less well. This might happen if our alternative hypothesis spreads probability mass very thinly across possibilities. The Bayes factor is a relative comparison (how well one hypothesis predicts the data compared to another), whereas null hypothesis testing via p-values attempts an absolute measure (rejecting the null hypothesis in absolute terms).
Thursday, November 27, 2014
P-values and Chaos Worlds
In First Aid for P-Values, Black Belt Bayesian discusses how a Bayesian can interpret the p-value to get some information. He references an article which argues that this can shift the frame of the discussion in a useful way, improving the nature of the statistical arguments without significantly changing the methodology. It emphasizes the role of evidence in shifting beliefs progressively, as opposed to proof/disproof.
While this does seem like a useful tool, it still leaves us with the problems of null hypothesis testing. One problem is that the null hypothesis is sometimes not very plausible. Arguing from a point of total randomness is an odd thing to do. What would we expect to see if the world was a chaotic place with no patterns? Hm, reality doesn't match that? Ok, well, our hypothesis is better than maximum entropy. Good!
Scott Alexander makes this error in a post which he explicitly predicted he'd regret writing. (Epistemic Warning: This is, perhaps, among the smaller problems with the post. A larger problem is that it makes readers think in simplistic tribes. Another possible problem is that it risks the same error it calls out. There's a reason he said he'd regret it.) He's discussing how strongly our friends and acquaintances are filtered in terms of beliefs:
"And I don’t have a single one of those people in my social circle. It’s not because I’m deliberately avoiding them; I’m pretty live-and-let-live politically, I wouldn’t ostracize someone just for some weird beliefs. And yet, even though I probably know about a hundred fifty people, I am pretty confident that not one of them is creationist. Odds of this happening by chance? 1/2^150 = 1/10^45 = approximately the chance of picking a particular atom if you are randomly selecting among all the atoms on Earth."

He goes on to use this number a couple more times as an indication of the strength of filtering:

"I inhabit the same geographical area as scores and scores of conservatives. But without meaning to, I have created an outrageously strong bubble, a 10^45 bubble. Conservatives are all around me, yet I am about as likely to have a serious encounter with one as I am a Tibetan lama."

And:

"A disproportionate number of my friends are Jewish, because I meet them at psychiatry conferences or something – we self-segregate not based on explicit religion but on implicit tribal characteristics. So in the same way, political tribes self-segregate to an impressive extent – a 1/10^45 extent, I will never tire of hammering in – based on their implicit tribal characteristics."

The problem is that this is a world-of-chaos-and-fire hypothesis he's comparing to. The number makes the strength of the filter incredible-sounding, almost physically implausible. But, that's just what you get when you use a bad model! Note that the "strength" would keep getting more extreme as we examine more data (just as a p-value gets extreme with more data, unless the null hypothesis is actually true).
It's not like there is a baseline world where everything is completely random, with an extra physical force on top of this which puts things into nonrandom configurations. (Except, perhaps, in the sense that everything is heading toward thermodynamic equilibrium.) We do not choose our associates randomly. It would be much more meaningful to compare possibly-realistic models and the levels of friend filtering which they imply.
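As a toy contrast (all numbers invented purely for illustration), compare the chance of seeing zero creationists among 150 acquaintances under the chaos-world null versus under a model with even a little structure, such as drawing one's acquaintances from a local pool where the trait is rarer:

```python
# Probability of seeing zero creationists among 150 acquaintances under two models.

n_friends = 150

# "Chaos world" null: each acquaintance is an independent 50/50 draw.
p_null = 0.5 ** n_friends

# A minimally structured alternative (illustrative numbers only):
# acquaintances are drawn from a local pool where the trait's rate is 2%.
local_rate = 0.02
p_structured = (1 - local_rate) ** n_friends

print("chaos-world model: %.1e" % p_null)        # ~1/10^45
print("structured model:  %.3f" % p_structured)  # ~0.05
```

The astronomical number is a property of the bad null model, not a measurement of the filter.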
I'm not trying to call out Slate Star Codex here. That particular post happened to be an epistemic landmine, yes, but this mistake is easy to make and fairly common. What's interesting to me is the difference between what arguments feel meaningful vs actually are meaningful.
Sunday, November 9, 2014
A List Of Nuances
Abram Demski and Grognor
(This article is also cross-posted to LessWrong.)
Much of rationality is pattern-matching. An article on LessWrong might point out a thing to look for. Noticing this thing changes your reasoning in some way. This essay is a list of things to look for. These things are all associated, but the reader should take care not to lump them together. Each dichotomy is distinct, and although the brain will tend to abstract them into some sort of yin/yang correlated mush, in reality they have a more complicated structure; some things may be similar, but if possible, try to focus on the complex interrelationships.
- Map vs. Territory
- Eliezer’s sequences use this as a jump-off point for discussion of rationality.
- Many thinking mistakes are map vs. territory confusions.
- A map and territory mistake is a mix-up of seeming vs being.
- Humans need frequent reminders that we are not omniscient.
- Cached Thoughts vs. Thinking
- This document is a list of cached thoughts.
- Clusters vs. Properties
- These words could be used in different ways, but the distinction I want to point at is that of labels we put on things vs actual differences in things.
- The mind projection fallacy is the fallacy of thinking a mental category (a “cluster”) is an actual property things have.
- If we see something as good for one reason, we are likely to attribute other good properties to it, as if it had inherent goodness. This is called the halo effect. (If we see something as bad and infer other bad properties as a result, it is referred to as the reverse-halo effect.)
- Syntax vs. Semantics
- The syntax is the physical instantiation of the map. The semantics is the way we are meant to read the map; that is, the intended relationship to the territory.
- Semantics vs. Pragmatics
- The semantics is the literal contents of a message, whereas the pragmatics is the intended result of conveying the message.
- An example of a message with no semantics and only pragmatics is a command, such as “Stop!”.
- Almost no messages lack pragmatics, and for good reason. However, if you seek truth in a discussion, it is important to foster a willingness to say things with less pragmatic baggage.
- Usually when we say things, we do so with some “point” which is beyond the semantics of our statement. The point is usually to build up or knock down some larger item of discussion. This is not inherently a bad thing, but has a failure mode where arguments are battles and statements are weapons, and the cleverer arguer wins.
- Object-level vs. Meta-level
- The difference between making a map and writing a book about map-making.
- A good meta-level theory helps get things right at the object level, but it is usually impossible to get things right at the meta level before you’ve made significant progress at the object level.
- Seeming vs. Being
- We can only deal with how things seem, not how they are. Yet, we must strive to deal with things as they are, not as they seem.
- This is yet another reminder that we are not omniscient.
- If we optimize too hard for things which seem good rather than things which are good, we will get things which seem very good but which may only be somewhat good, or even bad.
- The dangerous cases are the cases where you do not notice there is a distinction.
- This is why humans need constant reminders that we are not omniscient.
- We must take care to notice the difference between how things seem to seem, and how they actually seem.
- Signal vs. Noise
- Not all information is equal. It is often the case that we desire certain sorts of information and desire to ignore other sorts.
- In a technical setting, this has to do with the error rate present in a communication channel; imperfections in the channel will corrupt some bits, making a need for redundancy in the message being sent.
- In a social setting, this is often used to refer to the amount of good information vs irrelevant information in a discussion. For example, letting a mediocre writer add material to a group blog might increase the absolute amount of good information, yet worsen the signal-to-noise ratio.
- Attention is a scarce resource; yes, everyone has something to teach you, but some people are much more efficient sources of wisdom than others.
- Selection Effects
- In many situations, if we can present evidence to a Bayesian agent without the agent knowing that we are being selective, we can convince the agent of anything we like. For example, if I want to convince you that smoking causes obesity, I could find many people who became obese after they started smoking.
- The solution to this is for the Bayesian agent to model where the information is coming from. If you know I am selecting people based on this criterion, then you will not take it as evidence of anything, because the evidence has been cherry-picked. (A small sketch of this contrast appears after this list.)
- Most of the information you receive is intensely filtered. Nothing comes to your attention with a good conscience.
- The silent evidence problem.
- Selection bias need not be the result of purposeful interference, as in cherry-picking. Often, an unrelated process may hide some of the evidence needed. For example, we hear far more about successful people than unsuccessful ones. It is tempting to look at successful people and attempt to draw conclusions about what it takes to be successful. This approach suffers from the silent evidence problem: we also need to look at the unsuccessful people and examine what is different about the two groups.
- What You Mean vs. What You Think You Mean
- Very often, people will say something and then that thing will be refuted. The common response to this is to claim you meant something slightly different, which is more easily defended.
- We often do this without noticing, making it dangerous for thinking. It is an automatic response generated by our brains, not a conscious decision to defend ourselves from being discredited. You do this far more often than you notice. The brain fills in a false memory of what you meant without asking for permission.
- What You Mean vs. What the Others Think You Mean
- What You Optimize vs. What You Think You Optimize
- Evolution optimizes for reproduction but in doing so creates animals with a variety of goals which are correlated with reproduction.
- The people who value practice for its own sake do better than the people who only value being good at what they’re practicing.
- “Consequentialism is true, but virtue ethics is what works.”
- Stated Preferences vs. Revealed Preferences
- Revealed preferences are the preferences we can infer from your actions. These are usually different from your stated preferences.
- Food isn’t about nutrition.
- Clothes aren’t about comfort.
- Bedrooms aren’t about sleep.
- Marriage isn’t about love.
- Talk isn’t about information.
- Laughter isn’t about humour.
- Charity isn’t about helping.
- Church isn’t about God.
- Art isn’t about insight.
- Medicine isn’t about health.
- Consulting isn’t about advice.
- School isn’t about learning.
- Research isn’t about progress.
- Politics isn’t about policy.
- Going meta isn’t about the object level.
- Language isn’t about communication.
- The rationality movement isn’t about epistemology.
- Everything is actually about signalling.
- Never attribute to malice that which can be adequately explained by stupidity. The difference between stated preferences and revealed preferences does not indicate dishonest intent. We should expect the two to differ in the absence of a mechanism to align them.
- People, ideas, and organizations respond to incentives.
- Evolution selects humans who have reproductively selfish behavioral tendencies, but prosocial and idealistic stated preferences.
- Social forces select ideas for virality and comprehensibility as opposed to truth or even usefulness.
- Organizations are by default bad at being strategic about their own survival, but the ones that survive are the ones you see.
- What You Achieve vs. What You Think You Achieve
- Most of the consequences of our actions are totally unknown to us.
- It is impossible to optimize without proper feedback.
- What You Optimize vs. What You Actually Achieve
- Consequentialism is more about expected consequences than actual consequences.
- What You Seem Like vs. What You Are
- You can try to imagine yourself from the outside, but no one has the full picture.
- What Other People Seem Like vs. What They Are
- When people assume that they understand others, they are wrong.
- What People Look Like vs. What They Think They Look Like
- People underestimate the gap between stated preferences and revealed preferences.
- What Your Brain Does vs. What You Think It Does
- The brain’s machinations are fundamentally social; it automatically does things like signal, save face, etc., which distort the truth.
- Knowing that you are running on corrupted hardware should cause skepticism about the outputs of your thought-processes. Yet, too much skepticism will cause you to stumble, particularly when fast thinking is needed.
- Producing a correct result plus justification is harder than producing only the correct result.
- Justifications are important, but the correct result is more important.
- Much of our apparent self-reflection is confabulation, generating plausible explanations after the brain spits out an answer.
- Example: doing quick mental math. If you are good at this, attempting to explicitly justify every step as you go would likely slow you down.
- Example: impressions formed over a long period of time. Wrong or right, it is unlikely that you can explicitly give all your reasons for the impression. Requiring your own beliefs to be justifiable would preempt impressions that require lots of experience and/or many non-obvious chains of subconscious inference.
- Impressions are not beliefs and they are always useful data.
- Clever Argument vs. Truth-seeking; The Bottom Line
- People believe what they want to believe.
- Believing X for some reason unrelated to X being true is referred to as motivated cognition.
- Giving a smart person more information and more methods of argument may actually make their beliefs less accurate, because you are giving them more tools to construct clever arguments for what they want to believe.
- Your actual reason for believing X determines how well your belief correlates with the truth.
- If you believe X because you want to, any arguments you make for X no matter how strong they sound are devoid of informational context about X and should properly be ignored by a truth-seeker.
- Lumpers vs. Splitters
- A lumper is a thinker who attempts to fit things into overarching patterns. A splitter is a thinker who makes as many distinctions as possible, recognizing the importance of being specific and getting the details right.
- Specifically, some people want big Wikipedia and TVTropes articles that discuss many things, and others want smaller articles that discuss fewer things.
- This list of nuances is a lumper attempting to think more like a splitter.
- Fox vs. Hedgehog
- “A fox knows many things, but a hedgehog knows One Big Thing.” Closely related to a splitter, a fox is a thinker whose strength is in a broad array of knowledge. A hedgehog is a thinker who, in contrast, has one big idea and applies it everywhere.
- The fox mindset is better for making accurate judgements, according to Tetlock.
- Traps vs. Gardens
- Conversations tend to slide toward contentious and useless topics.
- Societies tend to decay.
- Thermodynamic equilibrium is entropic.
- Without proper institutions being already in place, it takes large amounts of constant effort and vigilance to stay out of traps.
- From the outside of a broken Molochian system, it is easy to see how to fix it. But it cannot be fixed from the inside.
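Returning to the Selection Effects item above, here is the promised sketch of the cherry-picking point, with a toy coin example (the hypotheses, rates, and reporting rule are all invented for illustration). A naive agent treats each reported observation as a random sample; a savvier agent models the fact that only one kind of outcome ever gets reported.

```python
# Cherry-picked evidence: a reporter flips a coin 100 times but only shows
# you the flips that landed heads.  Hypotheses: the coin is fair (P(H)=0.5)
# or heads-biased (P(H)=0.9).  Illustrative numbers only.

import random

random.seed(0)
p_true = 0.5                                    # the coin really is fair
flips = [random.random() < p_true for _ in range(100)]
reported = [f for f in flips if f]              # only heads get reported

prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.9}

def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}

# Naive agent: treats each reported head as an ordinary random flip.
naive = dict(prior)
for _ in reported:
    naive = {h: naive[h] * p_heads[h] for h in naive}
naive = normalize(naive)

# Savvy agent: knows only heads are ever shown, so a reported head has
# probability 1 under both hypotheses and carries no evidence.
savvy = dict(prior)
for _ in reported:
    savvy = {h: savvy[h] * 1.0 for h in savvy}
savvy = normalize(savvy)

print("naive agent:", {h: round(p, 4) for h, p in naive.items()})
print("savvy agent:", {h: round(p, 4) for h, p in savvy.items()})
```

The naive agent ends up nearly certain the fair coin is biased; the agent who models the selection process learns nothing from the filtered reports, which is the correct response.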