Sometimes, people make a fuss about the difference between knowledge and understanding.
Recently, an explanation of this difference occurred to me which I had not considered before.
The Slate Star Codex article "Right is the New Left" explains fashion with cellular automata. It's a model of society which has about ten moving parts, yet has behaviors which resemble those of a whole society.
This made me think that understanding is essentially explaining something with a model small enough to fit in working memory.
Consider the extremely detailed weather model which meteorologists use to produce forecasts.
Now, consider the highly simplified explanation based on air masses, warm and cold fronts and so on which is commonly illustrated with weather maps.
The first gives us more accurate predictions, but the second one gives us more understanding. If a scientist were able to use the detailed mathematical weather model but did not think in terms of storm fronts and so on, they could not answer questions such as "why" it is raining. In the detailed physical model, "why" is almost meaningless: the causes of any particular event are huge in number.
This notion of understanding has several implications.
What constitutes understanding will depend on the mind doing the understanding, whereas knowledge is more objective in nature. I can achieve understanding of a system by putting it in terms I am familiar with. Suppose I am trying to understand an esoteric branch of chemistry known as semi-equilibrium Z-theory. I might learn all the statements belonging to SEZ theory by heart, and gain the ability to solve SEZ equations and get the correct answer, and still feel that I have little understanding. Yet, if I can relate SEZ theory to more familiar subjects, I will feel I've "put it in terms I can understand".
Assume I was an apple farmer before learning chemistry.
Let's say an experienced SEZ-theoretician gives me an analogy in which a SEZ-frubian (a central object of SEZ theory) is a rotten apple, and a SEZ-nite (another important concept in SEZ theory) is a worm slowly eating the apple. If the analogy works well enough, I feel I've gained an understanding: now when I'm solving the equations, I imagine that they are telling me things about this worm munching away happily at the core of the apple. I've now got a model with a few moving parts which allows me to make heuristic predictions much more effectively.
However, someone with no experience of apples and worms will not be helped very much by this analogy. It's placed SEZ-theory into my mental landscape, but the same explanation may not be useful to others.
Even a superhuman intelligence would have use for understanding: the actual universe is far too complex for a mind-within-universe to fully model. However, the understanding it achieves would be far beyond us. The "small" heuristic models would be too large to fit into our working memory (the mythical seven-plus-or-minus-two). Its weather maps would likely look more like our "detailed" physical simulations of the weather.
Friday, December 12, 2014
Wednesday, December 10, 2014
Epistemic Trust
My attraction to LessWrong is partially predicated on a feeling that a better culture can be created, raising the sanity waterline to improve society overall.
Recently, though, I've somewhat given up on that.
First, I was overestimating the degree to which LessWrong had created such a culture already. I'll explain why.
Talking to core LessWrong people is different. It feels highly reflective, with each person cognizant not only of the subject at hand but also the potential flaws in their reasoning process. Much less time is wasted trying to sort out such flaws because an objection is not met with defensiveness. If you're already looking for the holes in your own arguments, you'll try to understand a counter-argument rather than trying to protect yourself from it. The attitude is contagious.
I call this intellectual honesty: being up-front not just about what you believe, but also why you believe it, what your motivations are in saying it, and the degree to which you have evidence for it. Feynman discussed the necessary attitude, although he didn't give it a name that I'm aware of.
There are many forces working against intellectual honesty in everyday life, but the most important one is face culture: status in the group is being signaled by agreeing and disagreeing, arguing for or against people. This can be nasty, but it isn't always; in fact, the most common type of face-culture I experience is cooperative: people in the group are trying to be friendly by accepting and encouraging each other.
For example, suppose that a group of engineers are meeting to discuss a technical problem. We will focus on one of them; I will call this person X. X is eager to find acceptance in the group. The other members of the group are also eager to make X feel accepted. During the discussion, X is looking for opportunities to interject with something relevant and useful. At some point, X is reminded of something: a problem which was encountered in a similar project at a previous work-place. X interjects that there might be a problem, and proceeds to tell the story. As X recalls the details, there's a critical difference between the old situation and the new one which makes it unlikely for the problem to arise in the current situation.
Intellectual-Honesty Culture: If everyone at the table is intellectually honest, someone points out the disanalogy. X likely concedes the point and the discussion moves on. (Often X will be the first to notice the disanalogy, and will point it out him/her-self.) If X thinks the objection is mistaken, a discussion in which both participants try to understand what each other is saying ensues.
Face Culture: In face culture, people will focus more on trying to make X feel included. Although the story's conclusion is unlikely to apply in the current situation, it's worthwhile to comment on the story in an agreeable way. Because agreement is a social currency, it is somewhat noncommittal; perhaps the best move is to agree that this problem can arise but then do little about it. Bold disagreement with the point is seen as (and often would be) an attempt to take X down a peg.
The critical point here is that when the two cultures mix, a face-culture person will see intellectual honesty as an attack.
It is worth emphasizing that face culture is not dishonest, not in the normal sense. Face culture is nice; face culture is friendly; face culture is welcoming. (Although, it can be vicious when it gets competitive.) Face culture is filled with white lies, especially lies by omission (such as acting as if a comment were relevant and made a good point), but if you try to call out any of these lies you will utterly fail. They are not lies in the common conception of lie. They are not dishonest in the common conception of honesty.
Moreover, face-culture is important. If X is an old hand whose status in the group is secure, face-culture would be babying. If X is a newcomer, however, face-culture niceties can establish a welcoming environment. I don't mean to suggest that there is an absolute opposition between being nice and being truthful -- often the two don't even come into conflict. There is a very real trade-off, though. At times you simply must choose one or the other.
Attempting to call out someone for following face culture rather than being intellectually honest is, as far as I know, doomed to failure. Any such call-out will be perceived as a threat, and will ramp up the defensive face-culture behavior.
Ok, so, there are these two cultures and LessWrong succeeds at intellectual honesty. I said at the beginning that I've (partially) given up on improving broader culture via LessWrong, though. Why?
Well, I talked with someone who worked at a Christian school (as I understand it, a very fundamentalist one). They described what sounded like the same thing I experience with LessWrong: the community was very high in intellectual honesty.
Why would this be?
If LessWrong's high intellectual honesty is a result of being devoted to rationality and reflective thinking, shouldn't we expect the exact opposite in highly religious organizations?
I think what's happening here is that LessWrong is intellectually honest not because we explicitly think about rationality quite a bit, not because LessWrong is in possession of improved ideas of what rationality is about, but instead, because there is a high degree of intellectual trust.
Intellectual trust occurs when the group has common goals, mutual respect, and a largely-shared ideological framework.
When people have intellectual trust, they do not need to worry as much about why the other person is saying what they are saying. You know they are on your side, so you are free to worry about the topic at hand. You are free to point out flaws in your own reasoning because you are relatively secure in your social status and share the common goal of arriving at the correct conclusion. Likewise, you are free to find flaws in their reasoning without worrying that they will hate you for it.
This sort of intellectual trust cannot be created by simply "raising the rationality waterline".
Tuesday, December 9, 2014
Scramble Graphs: A Failed Idea
I spent some time this semester inventing, testing, and discarding a new probabilistic model. My adviser suggested that it might be worthwhile to write things up anyway, since the idea is intriguing.
Random Projections
I wrote two posts in the spring about distributed vector representations of words, something our research group has been working on. One way of thinking about our method is as a random projection. A random projection is a technique to deal with high-dimensional data via low-dimensional summaries. This is very broadly useful. Turning a high-dimensional item into a lower-dimensional one is referred to as dimensionality reduction, and there are many different techniques optimized for different applications.
Random projections take advantage of the Johnson-Lindenstrauss lemma, which shows that the vast majority of dimension reductions are "good" in the sense of preserving approximate distances between points. If this property is useful to our application, then this is a very easy dimension-reduction technique to apply.
This is mathematically related to compressed sensing, a powerful signal processing technique which has emerged fairly recently.
The fundamental story as I see it is:
An n-dimensional vector space has n orthogonal basis vectors (n cardinal directions), which are all at right angles from each other. However, if we relax to approximate orthogonality (almost 90 degree separation), the number of vectors we can pack in increases rapidly; especially for high n. In fact, it increases so rapidly that even choosing vectors randomly we are very likely to be able to fit exponentially many nearly-orthogonal vectors in. Let's say we can fit m vectors, where m is on the order of e^n. (Look at theorem 6 here for the more precise formula.)
These pseudo-basis vectors define our random projection. We map the orthogonal basis vectors of a large space (the space we really want to work with) down to this small space of size n, using one pseudo-orthogonal vector to represent each truly-orthogonal vector in the higher space.
For example, if we want to represent a probability distribution on m items, we would normally need that many numbers. This can be thought of as a point in the m-dimensional space. However, we can approximately represent it with just n numbers by taking the projection. We will create some error, but we can approximately recover probabilities. This will work especially well if there are a few large probabilities and many small ones (and we don't care too much about getting the small ones right).
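As a rough illustration of the story above, here is a minimal sketch in plain Python. The sizes (m = 1000 items, n = 256 numbers), the choice of normalized Gaussian vectors as the pseudo-basis, and all of the function names are assumptions made for this example; it is not necessarily the construction used in the work mentioned above.

```python
import math
import random

def random_unit_vectors(m, n, rng):
    """Draw m random directions in n dimensions (Gaussian entries, normalized)."""
    vectors = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(probs, vectors):
    """Summarize a sparse distribution {item: probability} as n numbers."""
    n = len(vectors[0])
    summary = [0.0] * n
    for item, p in probs.items():
        for k, x in enumerate(vectors[item]):
            summary[k] += p * x
    return summary

def recover(summary, vectors, item):
    """Approximate an item's probability: dot product with its pseudo-basis vector."""
    return dot(summary, vectors[item])

rng = random.Random(0)               # fixed seed so runs are repeatable
m, n = 1000, 256                     # hypothetical sizes
vectors = random_unit_vectors(m, n, rng)

probs = {3: 0.6, 17: 0.3, 42: 0.1}   # a few large probabilities, the rest zero
summary = project(probs, vectors)

estimates = [recover(summary, vectors, i) for i in range(m)]
best = max(range(m), key=lambda i: estimates[i])
worst_overlap = max(abs(dot(vectors[3], vectors[j])) for j in range(m) if j != 3)

print("most probable item:", best)
print("estimate for item 3:", round(estimates[3], 3))
print("largest |cosine| between item 3 and any other item:", round(worst_overlap, 3))
```

The recovered values are only approximate because the pseudo-basis vectors overlap slightly; the last printed number gives a feel for how far from exactly orthogonal they are at this size.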
Scramble Graphs
The idea I had was to use this kind of representation within a probabilistic network. Let's say we are trying to represent a dynamic Bayesian network with variables that have thousands of possible values (for example, variables might be English words). The size of probabilistic relationships between these variables gets very large. English has about 10^6 words. A table giving the probabilistic relationship between two words would need 10^12 entries, and between three words, 10^18. Fortunately, these tables will be very sparse. 95% of the time we're using just the most common 5% of words, so we can restrict the vocabulary size to 10^5 or even 10^4 without doing too much damage. Furthermore, it's unlikely we're getting anywhere near 10^12 words in training data, and impossible that we get 10^18. There are fewer than 10^9 web pages, so if each page averaged 1,000 English words, we could possibly get near 10^12. (I don't know how many words the internet actually totals to.) Even then, because some pairs of words are much more common than others, our table of probabilities would be very sparse. It's much more likely that our data has just millions of words, though, which means we have at most millions of nonzero co-occurrence counts (and again, far fewer in practice due to the predominance of a few common co-occurrences).
This sparsity makes it possible to store the large tables needed for probabilistic language models. But, what if there's a different way? What if we want to work with distributed representations of words directly?
My idea was to apply the random projection to the probability tables inside the graph, and then train the reduced representation directly (so that we never try to store the large table). This yields a kind of tensor network. The "probability tables" are now represented abstractly, by a matrix (in the 2-variable case) or tensor (for more variables) which may have negative values and isn't required to sum to 1.
Think about the tensor network and the underlying network. In the tensor network, every variable from the underlying network has been replaced by a vector, and every probability table has been replaced by a tensor. The tensor network is an approximate representation of the underlying network; the relationship is defined by the fixed projection for each variable (to the smaller vector representations).
Because the tensor network is defined by a random transform from the underlying network, I called these "scramble graphs".
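To make the relationship between the two networks a little more concrete, here is one possible reading of it as a small Python sketch, for the 2-variable case. Everything specific here is an assumption for illustration: the sizes (m = 200 values, n = 64 numbers per vector), the use of normalized Gaussian vectors as the fixed projection, and the helper names project_table and read_entry. It only shows what projecting a known, very sparse table looks like; it does not train the small matrix directly.

```python
import math
import random

def random_unit_vectors(m, n, rng):
    """One fixed random vector per value of a variable."""
    vectors = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_table(table, row_vecs, col_vecs):
    """Compress a sparse m-by-m table {(i, j): p} into an n-by-n matrix."""
    n = len(row_vecs[0])
    small = [[0.0] * n for _ in range(n)]
    for (i, j), p in table.items():
        u, w = row_vecs[i], col_vecs[j]
        for a in range(n):
            scaled = p * u[a]
            row = small[a]
            for b in range(n):
                row[b] += scaled * w[b]   # accumulate p * outer(u, w)
    return small

def read_entry(small, row_vecs, col_vecs, i, j):
    """Approximate table[i, j] from the small matrix: u_i^T * small * w_j."""
    u, w = row_vecs[i], col_vecs[j]
    return sum(u[a] * dot(small[a], w) for a in range(len(u)))

rng = random.Random(1)
m, n = 200, 64                              # hypothetical sizes
row_vecs = random_unit_vectors(m, n, rng)   # projection for the first variable
col_vecs = random_unit_vectors(m, n, rng)   # projection for the second variable

table = {(1, 2): 0.7, (3, 4): 0.3}          # a very sparse "probability table"
small = project_table(table, row_vecs, col_vecs)

print(round(read_entry(small, row_vecs, col_vecs, 1, 2), 3))  # close to 0.7
print(round(read_entry(small, row_vecs, col_vecs, 5, 6), 3))  # close to 0
```

Note that the entries of the small matrix can be negative and do not sum to 1, even though the table they stand for is an ordinary probability table.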
In order to train the tensor network directly without worrying about the underlying network, we want to find values for the tensors which cause the underlying network to at least approximately sum to 1 and have positive values, in the normal probabilistic way.
(That's a fairly difficult constraint to maintain, but I hoped that an approximate solution would perform well enough. In any case, I didn't really end up getting that far.)
The interesting thing about this idea is that for the hidden variables (in the dynamic Bayes network we're hypothetically trying to represent), we do not actually care what the underlying variables mean. We might think of them as some sort of topic model or what-have-you, but all we are really representing is the vectors.
The scramble graph looks an awful lot like a neural network with no nonlinearities. Why would we want a neural network with no nonlinearities? Aren't nonlinearities important?
Yes: the nonlinear transformations in neural networks serve an important role; without them, the "capacity" of the network (the ability to represent patterns) is greatly reduced. A multi-layer neural network with no nonlinearities is no more powerful than a single-layer network. The same statement applies to these tensor graphs: no added representation power is derived from stacking layers.
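The claim about stacked linear layers is easy to check by hand. The following small sketch (the weight matrices are made-up numbers, and matmul and apply_layer are names chosen for the example) shows that two linear layers applied in sequence give exactly the same output as one layer whose weights are the product of the two.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_layer(weights, x):
    """A layer with no nonlinearity is just a matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

w1 = [[1, 2], [0, 1], [3, -1]]    # layer 1: 2 inputs -> 3 hidden units
w2 = [[2, 0, 1], [-1, 1, 0]]      # layer 2: 3 hidden units -> 2 outputs
combined = matmul(w2, w1)         # a single 2 -> 2 layer

x = [5, -2]
two_layer = apply_layer(w2, apply_layer(w1, x))
one_layer = apply_layer(combined, x)

print(two_layer)   # [19, -3]
print(one_layer)   # [19, -3]
```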
However: nonlinearity is also what makes it impossible to perform arbitrary probabilistic reasoning with neural networks. We can train them to output probabilities, but it is not easy to reverse the probability (via Bayes' Law), condition on partial information, and so on. Probabilistic models are always linear, and we need this to be able to easily do multi-directional reasoning (using a model for something it wasn't trained to do).
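Here is a toy sketch of what multi-directional reasoning means for an explicit probability table. The two variables, their values, and the numbers are invented for the example. The same joint table answers a question in either direction, using nothing more than selecting entries and renormalizing.

```python
from fractions import Fraction as F

# Joint distribution over (weather, ground); the numbers are made up.
joint = {
    ("rain", "wet"): F(3, 10),
    ("rain", "dry"): F(1, 10),
    ("clear", "wet"): F(2, 10),
    ("clear", "dry"): F(4, 10),
}

def condition(joint, position, value):
    """Keep the entries that match the observation, then renormalize."""
    kept = {key: p for key, p in joint.items() if key[position] == value}
    total = sum(kept.values())
    return {key[1 - position]: p / total for key, p in kept.items()}

forward = condition(joint, 0, "rain")   # P(ground | weather = rain)
backward = condition(joint, 1, "wet")   # P(weather | ground = wet)

print(forward)    # wet: 3/4, dry: 1/4
print(backward)   # rain: 3/5, clear: 2/5
```

A network trained only to map weather to ground would have no comparably simple way to answer the second question.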
So, I thought scramble graphs might be an interesting compromise between neural networks and fully probabilistic models.
Unfortunately, it didn't work.
It seems like the "capacity" of the tensors is just too small. Vectors have exponential capacity in the sense I outlined at the beginning; they can usefully approximate an exponentially larger space. In my experiments, this property seems to go away when we jump to matrix-representations and tensor-representations. I tried to train a rank-3 matrix on artificial data (for which 100% accuracy was possible), using a distribution with the sparse property mentioned (so roughly 90% of the cases fell within 10% of the possibilities), but accuracy remained below 50%. Very approximately, the capacity seemed to be linear (rather than exponential) in the representation size: the fraction correct after training appeared to scale proportionately with the size of the tensor.
I don't know what the mathematics behind this phenomenon says. Perhaps I made some mistake in my training, or perhaps the sizes I used were too small to start seeing asymptotic effects (since the exponential capacity of vectors is asymptotic). I'm starting to think, though, that the math would confirm what I'm seeing: the exponential-capacity phenomenon is destroyed as soon as we move from vectors to matrices.
Saturday, December 6, 2014
Likelihood Ratios from Statistically Significant Studies
In the previous post, I reacted to an old Black Belt Bayesian post about p-values.
Since then, there's been some more discussion of this article in the LA LessWrong group. Scott Garrabrant pointed out that the likelihood ratios coming from p-values are far less than he naively intuited. I think I was making the same mistake before reading BBB, and I think it's an important and common mistake.
How much should we shift our belief when we see a p-value around 0.05 (so, just barely passing the standard for statistical significance)?
The p-value is defined as the probability that a statistic would be as great or greater than observed, assuming the null hypothesis were true.
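As a concrete instance of this definition, consider a hypothetical coin-flipping experiment (my example, not one from the discussion): the null hypothesis is a fair coin, and the statistic is the number of heads. A minimal sketch of the one-sided p-value:

```python
from math import comb

def one_sided_p_value(heads, flips):
    """P(at least `heads` heads in `flips` flips), assuming a fair coin."""
    tail = sum(comb(flips, k) for k in range(heads, flips + 1))
    return tail / 2 ** flips

p = one_sided_p_value(60, 100)
print(round(p, 4))   # roughly 0.028, so "significant" at the 0.05 level
```

Note that this number is computed entirely under the null hypothesis; no alternative hypothesis appears anywhere in the calculation.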
The very common mistake is to confuse P(observation | hypothesis) with P(hypothesis | observation), naively thinking that the p-value can be used as the probability of the null hypothesis. This is bad, don't do it. (David Manheim, also from the Los Angeles LessWrong group, pointed us to this article.)
But if that's not the correct conclusion to draw, what is?
The Bayesian answer is the Bayes Factor, which measures the strength of evidence for one hypothesis H1 vs another H2 as P(obs | H1) / P(obs | H2). If we combine this with a prior probability for each hypothesis, P(H1), P(H2), we can compute our posterior P(H1 | obs). For example, if our prior belief is 50-50 between the two and the likelihood ratio is 1/2, then our posterior should be 1/3 for H1 and 2/3 for H2. (H2 has become comparatively twice as probable.) However, the Bayes factor has the advantage of objectively measuring the influence of evidence on our beliefs, independent of our prior.
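The update can be written out in a few lines. This sketch assumes exactly two hypotheses which together cover all the possibilities; the function name posterior_h1 is just a label for the example.

```python
from fractions import Fraction

def posterior_h1(prior_h1, bayes_factor):
    """Posterior P(H1 | obs), where bayes_factor = P(obs | H1) / P(obs | H2)."""
    prior_odds = prior_h1 / (1 - prior_h1)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

print(posterior_h1(Fraction(1, 2), Fraction(1, 2)))    # 1/3
print(posterior_h1(Fraction(9, 10), Fraction(1, 2)))   # 9/11
```

The second call uses the same Bayes factor with a different prior: the evidence is equally strong in both cases, but the posterior depends on where we started.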
The less common mistake which both Scott and I were making was to think as if a p-value were a Bayes factor, so that a statistically significant study will shift belief against the null hypothesis by a ratio of about 1:20.
The formula mentioned by Black Belt Bayesian shows this is wrong. For a p-value of 0.05, the Bayes factor can be lower-bounded at 0.4, which means the odds of the null hypothesis only shift by 2:5. This is much less than the 1:20 shift I was intuitively making. (Of course, if the p-value is lower, this will be better!)
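The formula itself is not reproduced in this post. The sketch below assumes it is the commonly cited calibration -e * p * ln(p) (due to Sellke, Bayarri, and Berger, valid for p < 1/e), which does reproduce the 0.4 figure; if the original post used a different bound, the numbers would differ.

```python
import math

def min_bayes_factor(p):
    """Lower bound on P(obs | null) / P(obs | alternative), assuming -e * p * ln(p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("this bound is only stated for 0 < p < 1/e")
    return -math.e * p * math.log(p)

bound = min_bayes_factor(0.05)
posterior_null = bound / (1 + bound)   # starting from a 50-50 prior

print(round(bound, 3))            # 0.407, i.e. roughly 2:5
print(round(posterior_null, 2))   # 0.29, nowhere near 0.05
print(round(min_bayes_factor(0.01), 3))
```

So even in the best case for the alternative hypothesis, a 50-50 prior on the null only drops to about 29% after a result with p = 0.05.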
Also notice, this is a minimum: the actual likelihood ratio could be much higher! A higher ratio would be worse news for a scientist's attempt to reject the null hypothesis. It's even possible that the Bayesian should be increasing belief in the null hypothesis, if the alternative hypothesis explains the data less well. This might happen if our alternative hypothesis spreads probability mass very thinly across possibilities. The Bayes factor is a relative comparison of hypotheses (measuring how well one hypothesis explains the data compared to another), whereas null hypothesis testing via p-values attempts an absolute measure (rejecting the null hypothesis in absolute terms).
