In Search of Logic

When Does One Program Embed Another?

2016-04-02T14:46:00.000-07:00

When does one program simulate, or embed, another? This is a question that has vaguely bothered me for some time. Intuitively, it seems fairly clear when one program is running inside another. However, it gets quite tricky to formalize. I'm thinking about this now because it's closely related to the type of "correspondence" needed for the correspondence theory of truth.

(This post also came out of discussions with @moralofstory and @alleleofgene.)

Easy Version: Subroutines

The simple case is when one program calls another. For this to be meaningful, we need a syntactic notion of procedure call. Many computing formalisms provide this. In Lambda Calculus it's easy; however, Lambda Calculus is Turing complete, but not universal. (A universal Turing machine is needed for the invariance theorem of algorithmic information theory; Turing-complete formalisms like lambda calculus are insufficient for this, because they can introduce a multiplicative cost in description length.) For a universal machine, it's convenient to suppose that there's a procedure call much like that in computer chip architectures.

In any case, this is more or less useless to us. The interesting case is when a program is embedded in a larger program with no "markings" telling us where to look for it.

An Orthodox Approach: Reductions

For the study of computational complexity, a notion of reduction is often used. One problem class P is polynomial-time reducible to another Q if we can solve P in polynomial time when we have access to an oracle for Q problems. An "oracle" is essentially the same concept as a subroutine, but where we don't count computation time spent inside the subroutine. This allows us to examine how much computation we save when we're provided this "free" subroutine use.

This has been an extremely fruitful concept, especially for the study of NP-completeness / NP-hardness. However, it seems of little use to us here.

- Only input/output mappings are considered. The use of oracles allows us to quantify how useful a particular subroutine would be for implementing a specific input/output mapping. What I intuitively want to discuss is whether a particular program embeds another, not whether an input-output mapping (which can be implemented in many different ways) can be reduced to another. For example, it's possible that a program takes no input and produces no output, but simulates another program inside it. I want to define what this means rigorously.
- Polynomial-time reducibility is far too permissive, since it means all poly-time algorithms are considered equivalent (they can be reduced to each other). However, refining things further (trying things like quadratic-time reducibility) becomes highly formalism-dependent. (Different Turing machine formalisms can easily have poly-time differences.)

Bit-To-Bit Embedding

Here's a simplistic proposal, to get us off the ground. Consider the execution history of two Turing machines, A and B. Imagine these as 2D spaces, with time-step t and tape location l. The intuition is that B embeds A if there is a computable function embed(t,l) which takes a t,l for A and produces one for B, and the bits are always exactly the same in these two time+locations.
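To make the proposal concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not a formalization: the "machines" are toy tape-update rules rather than real Turing machines, the histories are finite, and the names `run`, `step_a`, `step_b`, and `is_bit_to_bit_embedding` are invented for this example.

```python
from typing import Callable, List, Tuple

History = List[List[int]]  # history[t][l] = bit at time-step t, tape location l


def run(step: Callable[[List[int]], List[int]], tape: List[int], steps: int) -> History:
    """Record the tape at times 0..steps."""
    history = [list(tape)]
    for _ in range(steps):
        tape = step(tape)
        history.append(list(tape))
    return history


def step_a(tape: List[int]) -> List[int]:
    # Toy stand-in for A: each cell becomes the XOR of itself and its
    # left neighbour (cyclic).
    return [tape[i] ^ tape[i - 1] for i in range(len(tape))]


def step_b(tape: List[int]) -> List[int]:
    # Toy stand-in for B: runs A's rule on the even cells and flips every odd cell.
    out = [1 - bit for bit in tape]
    out[0::2] = step_a(tape[0::2])
    return out


def embed(t: int, l: int) -> Tuple[int, int]:
    # Candidate embedding: A's cell l lives in B's cell 2*l at the same time-step.
    return (t, 2 * l)


def is_bit_to_bit_embedding(hist_a: History, hist_b: History,
                            embed: Callable[[int, int], Tuple[int, int]]) -> bool:
    """Check that every bit of A's history equals the bit at its image in B's history."""
    for t, row in enumerate(hist_a):
        for l, bit in enumerate(row):
            tb, lb = embed(t, l)
            if hist_b[tb][lb] != bit:
                return False
    return True


tape_a = [1, 0, 0, 1]
tape_b = [1, 1, 0, 0, 0, 1, 1, 0]  # even cells hold tape_a
hist_a = run(step_a, tape_a, 5)
hist_b = run(step_b, tape_b, 5)
print(is_bit_to_bit_embedding(hist_a, hist_b, embed))
```

The check only compares the two recorded histories, which is exactly the weakness discussed next.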

The problem is, this can be a coincidence. embed(t,l) might be computing A completely, and finding a 0 bit or a 1 bit accordingly. This means there will always be an embedding, making the definition trivial. This is similar to the problem which is solved in "reductions" by restricting the reduction to be polynomial time. We could restrict the computational complexity of embed to try and make sure it's not cheating us by computing A. However, I don't think that works in our favor. I think we need to solve it by requiring causal structure.

My intuition is that causality is necessary to solve the problem, not just for this "internal simple embedding" thing, but more generally.

The causal structure is defined by the interventions (see Pearl's book, Causality). If we change a bit during the execution of a program, this has well-defined consequences for the remainder of the program execution. (It may even have no consequences whatsoever, if the bit is never used again.)

We can use all computable embed(t,l) as before, but now we don't just require that the bits at t,l in A are the same as the bits at embed(t,l) in B; we also require the interventions to be the same. That is, when we change bit t,l in A and embed(t,l) in B, then the other bits still correspond. (We need to do multi-bit interventions, not just single-bit; but I think infinite-bit interventions are never needed, due to the nature of computation.)
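A rough Python sketch of the intervention test follows. The assumptions are loud ones: the machines are toy tape-update rules, histories are finite, and only single-bit flips are tried (the text above asks for multi-bit interventions as well). All function names are hypothetical. The sketch contrasts an "honest" embedding with the cheating one described earlier, where embed computes A and simply points at a 0 cell or a 1 cell of a constant tape.

```python
def run(step, tape, steps, interventions=None):
    """Record the tape at times 0..steps. `interventions` maps (t, l) -> bit and
    overwrites that cell at time t before the machine continues."""
    interventions = interventions or {}
    tape = list(tape)
    history = []
    for t in range(steps + 1):
        for (ti, l), bit in interventions.items():
            if ti == t:
                tape[l] = bit
        history.append(list(tape))
        tape = step(tape)
    return history


def step_a(tape):
    # Toy A: each cell becomes the XOR of itself and its left neighbour (cyclic).
    return [tape[i] ^ tape[i - 1] for i in range(len(tape))]


def step_b(tape):
    # Toy B: runs A's rule on the even cells and flips every odd cell.
    out = [1 - bit for bit in tape]
    out[0::2] = step_a(tape[0::2])
    return out


def step_const(tape):
    # A machine that never changes its tape.
    return list(tape)


def corresponds(hist_a, hist_b, embed):
    for t, row in enumerate(hist_a):
        for l, bit in enumerate(row):
            tb, lb = embed(t, l)
            if hist_b[tb][lb] != bit:
                return False
    return True


def survives_single_bit_interventions(step_a, tape_a, step_b, tape_b, embed, steps):
    """Flip each bit of A in turn, make the same change at its image in B,
    and check that the two histories still correspond."""
    base_a = run(step_a, tape_a, steps)
    if not corresponds(base_a, run(step_b, tape_b, steps), embed):
        return False
    for t in range(steps + 1):
        for l in range(len(tape_a)):
            new_bit = 1 - base_a[t][l]
            hist_a = run(step_a, tape_a, steps, {(t, l): new_bit})
            hist_b = run(step_b, tape_b, steps, {embed(t, l): new_bit})
            if not corresponds(hist_a, hist_b, embed):
                return False
    return True


STEPS = 4
TAPE_A = [1, 0, 0, 1]
TAPE_B = [1, 1, 0, 0, 0, 1, 1, 0]  # even cells hold TAPE_A
BASE_A = run(step_a, TAPE_A, STEPS)


def honest_embed(t, l):
    return (t, 2 * l)


def cheating_embed(t, l):
    # Computes A itself, then points at cell 0 or cell 1 of the constant tape [0, 1].
    return (t, BASE_A[t][l])


honest = survives_single_bit_interventions(step_a, TAPE_A, step_b, TAPE_B, honest_embed, STEPS)
cheat = survives_single_bit_interventions(step_a, TAPE_A, step_const, [0, 1], cheating_embed, STEPS)
print(honest, cheat)
```

The cheating embedding matches every bit of the undisturbed run, but the constant tape has none of A's causal structure, so the correspondence breaks as soon as a bit is flipped.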

Bit-To-Pattern Embedding

The embeddings recognized by the above proposal are far too limited. Most embeddings will not literally translate the program history bit for bit. For example, suppose we have a program which simulates Earth-like physics with enough complexity that we can implement a transistor-based computer as a structure within the simulation. B could be a physical implementation of A based on this simulation. There will not necessarily be specific t,l correspondences which give a bit-to-bit embedding. Instead, bits in A will map onto electrical charges in predictable locations within B's physics. Recognizing one such electrical charge in the execution history of B might require accessing a large number of bits from B's low-level physics simulation.

This suggests that we need embed(t,l) to output a potentially complicated pattern-matcher for B's history, rather than a simple t,l location.

A difficulty here is how to do the causal interventions on the pattern-matched structure. We need to "flip bits" in B when the bit is represented by a complicated pattern.

We can make useful extensions to simple embedding by restricting the "pattern matcher" in easily reversible ways. embed(t,l) can give a list of t,l locations in B along with a dictionary/function which classifies these as coding 1 or 0, or invalid. (This can depend on t,l! It doesn't need to be fixed.) An intervention changing t,l in A would be translated as any of the changes to the set embed(t,l) which change its classification. I'd say all the (valid) variations should work in order for the embedding to be valid. (There might be somewhat less strict ways to do it, though.)
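Here is one way this restricted pattern-matcher could look in Python, as a sketch only. The assumptions: B stores each bit of a toy machine A as three copies, embed returns three locations plus a majority-vote classifier, every pattern is treated as valid (a stricter classifier could return an "invalid" marker), and only single-bit interventions on A are tried. The names `step_b_robust` and `step_b_fragile` are made up for the comparison.

```python
from itertools import product


def run(step, tape, steps, interventions=None):
    """Record the tape at times 0..steps; `interventions` maps (t, l) -> bit."""
    interventions = interventions or {}
    tape = list(tape)
    history = []
    for t in range(steps + 1):
        for (ti, l), bit in interventions.items():
            if ti == t:
                tape[l] = bit
        history.append(list(tape))
        tape = step(tape)
    return history


def step_a(tape):
    # Toy A: each cell becomes the XOR of itself and its left neighbour (cyclic).
    return [tape[i] ^ tape[i - 1] for i in range(len(tape))]


def majority(bits):
    return 1 if 2 * sum(bits) > len(bits) else 0


def step_b_robust(tape):
    # Decodes each triple by majority vote, applies A's rule, re-encodes as triples.
    decoded = [majority(tape[i:i + 3]) for i in range(0, len(tape), 3)]
    return [bit for bit in step_a(decoded) for _ in range(3)]


def step_b_fragile(tape):
    # Same layout, but only ever reads the first cell of each triple.
    return [bit for bit in step_a(tape[0::3]) for _ in range(3)]


def embed(t, l):
    """Return (locations in B, classifier) for A's bit at (t, l)."""
    return [(t, 3 * l + k) for k in range(3)], majority


def corresponds(hist_a, hist_b):
    for t, row in enumerate(hist_a):
        for l, bit in enumerate(row):
            locations, classify = embed(t, l)
            if classify([hist_b[tb][lb] for tb, lb in locations]) != bit:
                return False
    return True


def survives_interventions(step_b, tape_a, steps):
    """For each flipped bit of A, try every pattern in B that classifies as the new bit."""
    tape_b = [bit for bit in tape_a for _ in range(3)]
    base_a = run(step_a, tape_a, steps)
    if not corresponds(base_a, run(step_b, tape_b, steps)):
        return False
    for t in range(steps + 1):
        for l in range(len(tape_a)):
            new_bit = 1 - base_a[t][l]
            hist_a = run(step_a, tape_a, steps, {(t, l): new_bit})
            locations, classify = embed(t, l)
            for pattern in product([0, 1], repeat=len(locations)):
                if classify(pattern) != new_bit:
                    continue
                hist_b = run(step_b, tape_b, steps, dict(zip(locations, pattern)))
                if not corresponds(hist_a, hist_b):
                    return False
    return True


robust_ok = survives_interventions(step_b_robust, [1, 0, 0, 1], 3)
fragile_ok = survives_interventions(step_b_fragile, [1, 0, 0, 1], 3)
print(robust_ok, fragile_ok)
```

The fragile variant fails because some patterns that classify as the new bit leave the first cell of the triple unchanged, so B carries on as if nothing happened. This is the sense in which requiring all valid variations to work is a strict condition.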

This approach is the most satisfying for me at the moment. It seems to capture almost all of the cases I want. I'm not totally confident that it rules out all the non-examples I'd want to rule out, though. We can make a "causality soup" program, which computes every Boolean expression in order, caching values of sub-expressions so that there's a causal chain from the simplest expressions to the most complicated. This program embeds every other program, by the definition here. I'm not sure this should be allowed: it feels like almost the same error as the claim that the digits of pi are Turing-complete because (if pi is normal, as it appears to be) you can find any computable bit sequence in them. While the set of Boolean expressions gives a lot of structure, it doesn't seem like as much structure as the set of all programs.

Another case which seems problematic: suppose B embeds a version of A, but wired to self-destruct if causal interventions are detected. This can be implemented by looking for a property which the real execution history always has (such as a balance of 1s and 0s that never goes beyond some point), and stopping work whenever the property is violated. Although this is intuitively still an embedding, it lacks some of the causal structure of A, and therefore would not be counted.

Pattern-To-Pattern Embedding

Bit-to-pattern embeddings may still be too inflexible to capture everything. What if we want some complex structures in A to map to simple structures in B, so long as the causal structure is preserved? An important example of this is a bit which is modified by the computation at time t, left untouched for a while, and then used again at time t+n. In terms of bit-to-pattern embeddings, each individual time t, t+1, t+2, ... t+n has to have a distinct element in B to map to. This seems wrong: it's requiring too much causal structure in B. We want to treat the bit as "one item" while it is untouched by the computation.

Rather than looking for an embed function, I believe we now need an embedding relation. I'm not sure exactly how this goes. One idea:

- A "pattern frame" is an ordered set of t,l locations.
- A "pattern" is an ordered set of bits (which can be fit in a frame of equal size).
- An "association" is a frame for A, a frame for B, a single pattern for A (size matching the frame), and a set of patterns for B (size matching the frame).
- An "embedding" is a program which enumerates associations.

A proper embedding between A and B is one which:
- Covers A completely, in the actual run and in all interventions. ("Covers" means that the associations contain a pattern frame with matching pattern for every t,l location in A.)
- For all interventions on A, for all derived interventions on B, the execution of B continues to match with A according to the embedding.
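The definitions above could be written down as data structures roughly as follows. This is a tentative Python sketch, with several assumptions of its own: the embedding is a finite list rather than an enumerating program, only the actual run is checked (no interventions), and `holds` encodes one possible reading of "continues to match", namely that whenever an association's A-pattern is present, one of its B-patterns is present too.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Loc = Tuple[int, int]  # (time-step t, tape location l)
History = List[List[int]]


@dataclass(frozen=True)
class Association:
    frame_a: Tuple[Loc, ...]               # pattern frame in A
    frame_b: Tuple[Loc, ...]               # pattern frame in B
    pattern_a: Tuple[int, ...]             # one pattern for A's frame
    patterns_b: FrozenSet[Tuple[int, ...]]  # acceptable patterns for B's frame


def matches_a(assoc: Association, hist_a: History) -> bool:
    return tuple(hist_a[t][l] for t, l in assoc.frame_a) == assoc.pattern_a


def matches_b(assoc: Association, hist_b: History) -> bool:
    return tuple(hist_b[t][l] for t, l in assoc.frame_b) in assoc.patterns_b


def covers(associations: List[Association], hist_a: History) -> bool:
    """Every (t, l) of A lies in some frame whose pattern matches A's history."""
    covered = set()
    for assoc in associations:
        if matches_a(assoc, hist_a):
            covered.update(assoc.frame_a)
    return all((t, l) in covered
               for t, row in enumerate(hist_a) for l in range(len(row)))


def holds(associations: List[Association], hist_a: History, hist_b: History) -> bool:
    """Whenever an association's A-pattern is present, one of its B-patterns is too."""
    return all(matches_b(assoc, hist_b)
               for assoc in associations if matches_a(assoc, hist_a))


# The "untouched bit" example: A keeps a 1 in place for three time-steps,
# and B represents that whole stretch by a single location.
hist_a = [[1], [1], [1]]
hist_b = [[1]]
associations = [
    Association(frame_a=((0, 0), (1, 0), (2, 0)), frame_b=((0, 0),),
                pattern_a=(1, 1, 1), patterns_b=frozenset({(1,)})),
]
print(covers(associations, hist_a), holds(associations, hist_a, hist_b))
```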

This concept is asymmetric, capturing the idea that B embeds all the causal structure of A, but possibly has more causal structure besides. We could make symmetric variants, which might also be useful.

In any case, this doesn't seem to work as desired. Suppose B is the physics simulation mentioned before, but without any computer in it. B embeds A anyway, by the following argument. Let the pattern frames be the whole execution histories. Map the case where A has no interventions to the case where B has no interventions. Map the cases with interventions to entirely altered versions of B, containing appropriate A-computers with the desired interventions. This meets all the requirements, but intuitively isn't a real embedding of A in B.

Pattern-to-pattern embeddings seem necessary for this to apply to theories of truth, as I'm hoping for, because a belief will necessarily be represented by a complex physical sign. For example, a neural structure implementing a concept might have a causal structure which at a high level resembles something in the physical world; but, certainly, the internal causal structure of the individual neurons is not meant to be included in this mapping.

In any case, more work is needed.

The Correspondence Theory

2016-02-28T18:59:00.001-08:00

In my post on intuitionistic intuitions, I discussed my apparent gradual slide into constructivism, including some reasons to be skeptical about classical notions of truth. After a conversation with Sam Eisenstat, I've become less skeptical once again - mainly out of renewed hope that the classical account of truth, the correspondence theory, can be made mathematically rigorous.

Long-time readers might notice that this is a much different notion of truth than I normally talk about on this blog. I usually talk about mathematical truth, which deals with how to add truth predicates to a logical language, and has a lot to do with self-reference. Here, I'm talking about empirical truth, which has to do with descriptions of the world derived from observation. The two are not totally unrelated, but how to relate them is a question I won't deal with today.

Pragmatism

At bare minimum, I think, I accept a pragmatist notion of truth: beliefs are useful, some more than others. A belief is judged in terms of what it allows us to predict and control. The pragmatist denies that beliefs are about the world. Thinking in terms of the world is merely convenient.

How do we judge usefulness, then? Usefulness can't just be about what we think is useful, or else we'll include any old mistaken belief. It seems as if we need to refer to usefulness in the actual world. Pragmatism employs a clever trick to get around this. Truth refers to the model that we would converge to upon further investigation:

The real, then, is that which, sooner or later, information and reasoning would finally result in, and which is therefore independent of the vagaries of me and you. Thus, the very origin of the conception of reality shows that this conception essentially involves the notion of a COMMUNITY, without definite limits, and capable of an indefinite increase of knowledge. (Peirce 1868, CP 5.311).

This is saying that truth is more than just social consensus; it's a kind of idealized social consensus which would eventually be reached. The truth isn't what we currently think we believe, but we are still the ultimate judge.

If we make a mathematical model out of this, we can get some quite interesting results. Machine learning is full of results like this: we connect possible beliefs with a loss function which tells us when we make an error and how much it costs us to make different kinds of errors, and then prove that a particular algorithm has bounded regret. Regret is the loss relative to some better model; bounded regret means the total loss will not be too much worse than that of the best model in some class of models.
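As one concrete instance of such a result, here is a small Python sketch of the exponential-weights ("Hedge") algorithm, which is a standard example and not something specific to this post. The assumptions: losses lie in [0, 1], the "models" are three experts with randomly generated losses, and the learning rate is tuned to the known horizon T. Under those assumptions the standard analysis bounds the regret by sqrt(T ln(N) / 2).

```python
import math
import random


def hedge(loss_rows, eta):
    """Return (learner's cumulative expected loss, best single expert's cumulative loss)."""
    n = len(loss_rows[0])
    cumulative = [0.0] * n
    learner_loss = 0.0
    for losses in loss_rows:
        # Weight each expert by exp(-eta * its loss so far), then normalise.
        weights = [math.exp(-eta * c) for c in cumulative]
        total = sum(weights)
        learner_loss += sum(w / total * loss for w, loss in zip(weights, losses))
        cumulative = [c + loss for c, loss in zip(cumulative, losses)]
    return learner_loss, min(cumulative)


random.seed(0)
T, N = 1000, 3
loss_rows = [[random.random() for _ in range(N)] for _ in range(T)]
eta = math.sqrt(8 * math.log(N) / T)

learner, best = hedge(loss_rows, eta)
regret = learner - best
bound = math.sqrt(T * math.log(N) / 2)
print(f"regret = {regret:.2f}, bound = {bound:.2f}")
```

The bound grows like the square root of T, so the average per-round regret goes to zero: in the long run the learner does about as well as the best model in the class, without knowing in advance which one that is.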

The ideal model is one with minimum loss; this is the model which we would assent to after the fact, which we'd want to tell to our previous self if we could. Since we can't have this perfect belief, the principle of bounded regret is a way to keep the damage to an acceptably low level. This might not be exactly realistic in life (the harms of bad beliefs might not be bounded), but at least it's a useful principle to apply when thinking about thinking.

Bayesianism

The way I see it, Bayesian philosophy is essentially pragmatist. The main shift from bounded-regret type thinking to Bayesian thinking is that Bayesians are more picky about which loss function is employed: it should be a proper scoring rule, ideally the logarithmic scoring rule (which has strong ties to information theory).
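A quick numerical check of what "proper" means, as a Python sketch. The setup is an assumption chosen for illustration: a single binary event with true probability p = 0.7, a forecaster who reports some q, and a grid search over q. Under the logarithmic rule the expected loss is smallest when the report equals the true probability; under an absolute-error rule (used here as an example of an improper rule) the forecaster does best by exaggerating.

```python
import math


def expected_log_loss(p, q):
    # Expected logarithmic loss of reporting q when the event has probability p.
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))


def expected_abs_loss(p, q):
    # Expected absolute error |outcome - q| of reporting q; not a proper scoring rule.
    return p * (1 - q) + (1 - p) * q


p = 0.7
grid = [i / 100 for i in range(1, 100)]
best_log = min(grid, key=lambda q: expected_log_loss(p, q))
best_abs = min(grid, key=lambda q: expected_abs_loss(p, q))
print(best_log, best_abs)
```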

Bayesianism has a stronger separation between knowledge and goals than pragmatism. Pragmatism says that the aim of knowledge is to predict and manipulate the world. Bayesianism says "Wait a minute... predict... and manipulate. Those two sound distinct. Let's solve those problems separately." Knowledge is about prediction only, and is scored exclusively on predictive accuracy. Bayesian decision theory distinguishes between the probability function, P(x), and the utility function, U(x), even though in the end everything gets mixed together in the expected value.

Perhaps the reason this separation is so useful is that the same knowledge can be useful toward many different goals. Even though it's easy to find knowledge which doesn't fit this pattern (which is more useful for some goals than others), the abstraction is useful enough to persist because before you've solved a problem, you can't predict which pieces of knowledge will or won't be useful -- so you need a usefulness-agnostic notion of knowledge to some extent.
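The separation can be illustrated with a few lines of Python. Everything concrete here is invented for the example: the weather probabilities, the two agents, and their utility tables. The utility is also written as a function of an action and an outcome, U(a, x), which is a small extension of the U(x) notation above. The point is only that one probability function P serves two unrelated goals.

```python
def expected_utility(P, U, action):
    # Sum over outcomes x of P(x) * U(action, x).
    return sum(P[x] * U[(action, x)] for x in P)


def best_action(P, U):
    actions = sorted({a for a, _ in U})
    return max(actions, key=lambda a: expected_utility(P, U, a))


# Shared knowledge: one probability function over outcomes.
P = {"rain": 0.3, "dry": 0.7}

# Two hypothetical agents with different goals.
U_commuter = {("umbrella", "rain"): 0, ("umbrella", "dry"): -1,
              ("none", "rain"): -10, ("none", "dry"): 0}
U_gardener = {("water", "rain"): -2, ("water", "dry"): 5,
              ("wait", "rain"): 5, ("wait", "dry"): -5}

print(best_action(P, U_commuter), best_action(P, U_gardener))
```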

I think many Bayesians would also go further, saying truth has to do with a map-territory distinction and so on. However, this concept doesn't connect very strongly with the core of Bayesian techniques. A pragmatic notion of truth ("all models are wrong, but some are useful") seems to be closer to both theory and practice.

Still, this is an extremely weak notion of "truth". There doesn't need to be any notion of an external world. As in the pragmatist view quoted earlier, "the real" is a sort of convergence of belief. Knowledge is about making predictions, so all belief is fundamentally about sense-data; the view can be very solipsistic.

External Things

If all we have access to is our direct sense-data, what does it mean to believe in external things? One simplistic definition is to say that our model contains additional variables beyond the sense-data. In statistics, these are called "hidden variables" or "latent variables": stuff we can't directly observe, but which we put in our model anyway. Why would we ever do this? Well, it turns out to be really useful for modeling purposes. Even if only sense-data is regarded as "real", almost any approach will define probabilities over a larger set of variables.

This kind of "belief in external objects" is practically inevitable. Take any kind of black-box probability distribution over sense-data. If you open it up, there must be some mechanics inside; unless it's just a big look-up table, we can interpret it as a model with additional variables.

The pragmatist says that these extra beliefs inside the black box are true to the extent that they are useful (as judged in hindsight). The realist (meaning, the person who believes in external things) responds that this notion of truth is insufficient.

Imagine that one of the galaxies in the sky is fake: a perfect illusion of a galaxy, hiding a large alien construct. Further, let's suppose that we can never get to that galaxy with the resources available to us. Whatever is really there behind the galaxy-illusion has the mass of a galaxy, and everything fits correctly within the pattern of surrounding galaxies. We have no reason to believe that the galaxy is fake, so we will say that there's a galaxy there. This belief will never change, no matter how much we investigate.

For the pragmatist, this is fine. The belief has no consequence for prediction or action. It's "true" in every respect that might ever be important. It still seems to be a false belief, though. What notion of truth might we be employing, which marks this belief false?

I think part of the problem is that from our perspective, now, we don't know which things we will have an opportunity to observe or not. We want to have correct beliefs for anything we might observe. Because we can't really quantify that, we want to have correct beliefs for everything.

This leads to a model in which "external things" are anything which could hypothetically be sense-data. Imagine we have a camera which we can aim at a variety of things. We're trying to predict what we see in the camera based on where we swing it. We could try to model the flow of images recorded by the camera directly. However, this is not likely to work well. Instead, we should build up a 3D map of the environment. This map can predict observations at a much larger variety of angles than we have ever observed. Some of these will be angles that we could never observe -- perhaps places too high for us to lift the camera to, or views inside small places we cannot fit the camera. We won't have enough data to construct a 3D model for all of those non-possible angles, either; but, it makes sense to talk (speculatively) about what would be there if we could see it.

This is much more than the "black box" model mentioned earlier, where we look inside and see that there are some extra variables doing something. Here, the model itself explicitly presents us with spurious "predictions" about things which will never be sense-data, as a natural result of attempting to model the situation.

I think "external things" are like that. We use models which provide explicit predictions for things, outside of any particular idea of how we might ever measure those things. Like the earlier argument about Bayesians separating P(x) from U(x), this is an argument from convenience: we don't know ahead of time which things will be observable or how, so it turns out to be very useful to construct models which are agnostic to this.

Correspondence

The correspondence theory of truth is the oldest, and still the most widely accepted. We're now in a position to outline and defend a correspondence theory, but first, I'd like to expand a bit more on the concerns I'm trying to address (which I described somewhat in intuitionistic intuitions).

The Problem to Be Solved

According to the correspondence theory, truth is like the relationship between a good map and the territory it describes: if you understand the scale of the map, the geographic area it is representing, and the meaning of the various symbols employed by the map, you can understand where things are and navigate the landscape. If you are checking the map for truth, you can travel to areas depicted on the map and check whether the map accurately records what's there.

The correspondence theory of truth says that beliefs are like maps, and reality is the territory being described. We believe in statements which describe things in the external world. These beliefs are true when the descriptions are accurate, and false when inaccurate. Seems simple, right?

I've been struggling with this for several reasons:

1. The Bayesian understanding of knowledge has no obvious need for map-territory correspondence. If "truth" lacks information-theoretic relevance, why would we talk about such a thing?
2. Given perfect knowledge of our human beliefs, and perfect knowledge of the external world, it's not clear that a single correct correspondence can be found to decide whether beliefs are true or false. A map which must be graciously interpreted is no map at all; take any random set of scribbles and you can find some interpretation in the landscape.
3. Even if we can account for that somehow, it's not clear what "territory" our map should be corresponding to. Although the universe contains many "internal" views (people observing the universe from the inside), it's not clear that there is any objective "external" view to compare things to. To understand such a view, we would have to imagine an entity sitting outside of the universe and observing it.

Point #1 was largely what I've been addressing in the essay up to this point. I propose that "truth" is a notion of hypothetical predictive accuracy, over a space of things which are in-principle observable, even if we know we cannot directly observe those things. We use truth as a convenient "higher standard" which implies good predictive accuracy. This ends up being useful because in practice we don't know beforehand what variables will be observable or closely connected with observation. The hypothesis of an external world has been validated again and again as a predictor of our actual observations.

In order to address point #2, we need a mathematically objective way of determining the correspondence by which to judge truth. In our story about maps and territories, a human played an important role. The human interprets the map: we understand the correspondence, because we can check whether the map is true. It's not possible for the correspondence to be contained in the map; no matter how much is written on the map to indicate scale, meaning of symbols, and so on, a human needs to understand the symbolic language in which such an explanation is written. This metaphor breaks down when we attempt to apply it to a human's beliefs. It seems that the human, or at least the human society at large, needs to contain everything necessary to interpret beliefs.

Point #3 will similarly be addressed by a formal theory.

Now for some formalism. I'll keep things fairly light.

Solution Sketch

Suppose we observe a sequence of bits. We call these the observable variables. We want to predict that sequence of bits as accurately as possible, by choosing from a set of possible models which make different predictions. However, these models may also predict hidden variables which we never observe, but which we hypothesize.

Definition. A model declares a (possibly infinite) collection of boolean-valued variables, which must include the observation bits. It also provides a set of functions which determine some variables from other variables, possibly probabilistically. These functions must compose into a complete model, IE, giving probabilities to all of the variables in the model. A model with only the observed variables is called a fully observable model; otherwise, a model is a hidden-variable model.

Note that because the global probability distribution is made of local functions which get put together, we've got more than just a probability distribution; we also have a causal structure. I'll explain why we need this later.

(I said I'd keep this light! More technically, this means that the sigma-algebra of the probability distribution can contain events which are distinct despite allowing the same set of possible observation-bit values; additionally, a directed acyclic graph is imposed on the variables, which determines their causal structure.)

Any hidden-variable model can be transformed into a fully observable model which makes the exact same predictions, by summing out the hidden variables and computing the probability of the next observation purely from the observations so far. Why, then, might an agent prefer the hidden-variable version? My answer is that adding hidden variables can be computationally more convenient for a number of reasons. Although we can always specify a probability distribution only on the history, there will usually be intermediate variables which could be useful to many predictions. These can be stored in hidden variables.
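A toy Python sketch of the two forms side by side. The model is an assumption made up for the example: one hidden variable with two values ("low" and "high" bias), a uniform prior, and observation bits that are independent given the hidden value. `predict_from_history` is the fully observable form, recomputing everything from the raw history; `BeliefState` keeps the posterior over the hidden variable as stored intermediate state.

```python
PRIOR = {"low": 0.5, "high": 0.5}   # P(hidden value)
P_ONE = {"low": 0.2, "high": 0.8}   # P(observed bit = 1 | hidden value)


def likelihood(obs, h):
    # P(obs | h) for a sequence of independent bits.
    result = 1.0
    for o in obs:
        result *= P_ONE[h] if o else 1 - P_ONE[h]
    return result


def predict_from_history(history):
    """Fully observable form: P(next bit = 1 | history), hidden variable summed out."""
    numerator = sum(PRIOR[h] * likelihood(history + [1], h) for h in PRIOR)
    denominator = sum(PRIOR[h] * likelihood(history, h) for h in PRIOR)
    return numerator / denominator


class BeliefState:
    """Hidden-variable form: store the posterior and update it incrementally."""

    def __init__(self):
        self.posterior = dict(PRIOR)

    def predict(self):
        return sum(p * P_ONE[h] for h, p in self.posterior.items())

    def update(self, o):
        unnormalised = {h: p * (P_ONE[h] if o else 1 - P_ONE[h])
                        for h, p in self.posterior.items()}
        total = sum(unnormalised.values())
        self.posterior = {h: v / total for h, v in unnormalised.items()}


observations = [1, 1, 0, 1, 1]
belief = BeliefState()
pairs = []
for i, o in enumerate(observations):
    pairs.append((belief.predict(), predict_from_history(observations[:i])))
    belief.update(o)
print(pairs)
```

Both forms give the same predictions. The difference is cost: the history-based version reprocesses the whole history for every prediction, while the belief-state version does a constant amount of work per observation.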

Consider again the example of a camera moving around a 3D environment. It's possible to try and predict the observations for a new angle as a pure function of the history. We would try to grab memories from related angles, and then construct a prediction on the fly from those; sometimes simple interpolation would give a good enough prediction, but in general, we need to build up a 3D model. It's better to keep the 3D models in memory and keep building them up as we receive more observations, rather than trying to create them from sensory memories on the fly.

Is this an adequate reason for believing in an external world? I'm implying that our ontology is a rather fragile result of our limited computational resources. If we did not have limited computing power, then we would never need to postulate hidden variables. Maybe this is philosophically inadequate. I think this might be close to our real reason for believing in an external world, though.

Now, how do we judge which model is true?

In order to say that a model is true, we need to compare it to something. Suppose some model R represents the real functional structure of the universe. Unlike other models, we require that R is fully deterministic, giving us a fixed value for every variable; I'll call this the true state, S. (It seems to me like a good idea to assume that reality is deterministic, but this isn't a necessary part of the model; the interested might modify this part of the definition.)

Loose Definition. A model M is meaningful to the extent that it "cuts reality at the joints", introducing hidden variables whose values correspond to clusters in the state-space of R. The beliefs about the hidden variables in M are true to the extent that they pick out the actual state S.

Observations:

- We're addressing issue #3 by comparing a model to another model. This might at first seem like cheating; it looks like we're merely judging one opinion about the universe via some other opinion. However, the idea isn't that R is supplied by some external judge. Rather, R is representing the actual state of affairs. We don't ever have full access to R; we will never know what it is. All we can ever do is judge someone else's M by our probability distribution over possible R. That's what it looks like to reason about truth under uncertainty. We're describing external reality in terms of a hypothetical true model R; this doesn't mean reality "is" R, but it does assume reality is describable.
- This is very mathematically imprecise. I don't know yet how to turn "cut reality at the joints" into math here. However, it seems like a tractable problem which might be addressed by tools from information theory like KL-divergence, mutual information, and so on.
- Because R determines a single state S, the concept of "state-space" needs to rely on the causal structure of R. Correspondence between M and R needs to look something like "if I change variable x in R, and change variable y in M, we see the same cascade of consequences on other corresponding states in both models."
- The notion of correspondence probably needs to be a generalization of program equivalence. This suggests truth is uncomputable even if we have access to R. (This is only a concern for infinite models, however.)
- The correspondence between M and R is an interpretation of our mental states as descriptions of states of reality.
- In order to be valid, the correspondence needs to match the observable variables in M with the observable variables in R. Other variables are matched based on the similarity of the causal structure, and the constraint imposed by exactly matching the observable variables.

To show that my loose definition has some hope of being formalized, here is a too-tight version:

Strict Definition. A model M is meaningful if there is a unique injection of the variables of M into those of R which preserves the causal structure: if x=f(y,z) in M, and the injection takes x,y,z to a,b,c, then we need a=g(b,c,...) in R. (The "..." means there can be more variables which a depends on.) Additionally, observable variables in M must be mapped onto the same observable variables in R. Furthermore, the truth of a belief is defined by the log-loss with respect to the actual state S.
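For small finite models, the structural part of this definition can be checked by brute force. The Python sketch below makes several assumptions: each model is reduced to a map from a variable to the set of variables it depends on, observables keep their names, hidden variables of M may only map to hidden variables of R, and the log-loss part of the definition is left out. The variable names and both example graphs are hypothetical.

```python
from itertools import permutations


def structure_preserving_injections(parents_m, parents_r, observables):
    """List every injection of M's variables into R's that fixes the observables
    and maps each dependency in M onto a dependency in R."""
    hidden_m = sorted(v for v in parents_m if v not in observables)
    hidden_r = sorted(v for v in parents_r if v not in observables)
    found = []
    for image in permutations(hidden_r, len(hidden_m)):
        inj = {o: o for o in observables}
        inj.update(zip(hidden_m, image))
        # If x = f(y, z) in M, then inj[x] must depend on inj[y], inj[z] (and maybe more).
        if all({inj[p] for p in parents_m[x]} <= parents_r[inj[x]] for x in parents_m):
            found.append(inj)
    return found


def is_meaningful(parents_m, parents_r, observables):
    return len(structure_preserving_injections(parents_m, parents_r, observables)) == 1


OBS = {"obs1", "obs2"}
M = {"obs1": {"x"}, "obs2": {"y"}, "x": {"y"}, "y": set()}

# R has an extra variable d that M has no concept of; a depends on it as well.
R = {"obs1": {"a"}, "obs2": {"b"}, "a": {"b", "d"}, "b": set(), "d": set()}

# A variant where obs2 also depends on d, which makes y's referent ambiguous.
R_ambiguous = {"obs1": {"a"}, "obs2": {"b", "d"}, "a": {"b", "d"}, "b": set(), "d": set()}

print(structure_preserving_injections(M, R, OBS))
print(is_meaningful(M, R, OBS), is_meaningful(M, R_ambiguous, OBS))
```

In the first case the only valid injection sends x to a and y to b. In the variant, y could stand for either b or d, so the model fails the uniqueness requirement.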

This definition is too strict mainly because it doesn't let variables in our model M stand for clusters in R. This means references to aggregate objects such as tables and chairs aren't meaningful in this definition. However, it does capture the important observation that we can have a meaningful and true model which is incomplete: because the correspondence is an injection, there may be some variables in R which we have no concept of in M. This partly explains why our beliefs need to be probabilistic; if the variable "a" is really set by the deterministic function a=g(b,c,d,e) but we only model "b" and "c", our function x=f(y,z) can have apparently stochastic behavior.
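The last point can be seen in a tiny Python example. The function g below is an arbitrary choice made for illustration, and the unmodeled variables d and e are assumed to be uniformly distributed and independent of b and c.

```python
from fractions import Fraction
from itertools import product


def g(b, c, d, e):
    # A deterministic rule in R: a is 1 if both b and c are 1, or both d and e are 1.
    return (b & c) | (d & e)


def apparent_conditional(b, c):
    """P(a = 1 | b, c) as seen by a model that has no variables for d and e."""
    outcomes = [g(b, c, d, e) for d, e in product([0, 1], repeat=2)]
    return Fraction(sum(outcomes), len(outcomes))


table = {(b, c): apparent_conditional(b, c) for b, c in product([0, 1], repeat=2)}
for (b, c), prob in table.items():
    print(f"P(a=1 | b={b}, c={c}) = {prob}")
```

A model containing only b and c sees a as certain when both are 1 and as a one-in-four chance otherwise, even though nothing in R is random.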

Let's take stock of how well we've addressed the three concerns I listed before.

#1: Bayesianism doesn't need map-territory correspondence.

I think I've given a view that a pragmatist can accept, but which goes beyond pragmatism. Bayesian models will tend to talk about hidden variables anyway. This can be pragmatically justified. A pragmatist might say that the correspondence theory of truth is a feature of the world we find ourselves in; it could have been that we can model experience quite adequately without hidden variables, in which case the pragmatist would have rejected such a theory as confused. Since our sensory experience in fact requires us to postulate hidden variables to explain it very well, the correspondence theory of truth is natural to us.

#2: It's not clear that there will be a unique correspondence. A map which must be interpreted generously is not a good map.

I think I've only moderately succeeded in addressing this point. My strict definition requires that there is a unique injection. Realistically, this seems far too large a requirement to impose upon the model. Perhaps a more natural version of this would allow parts of a model to be meaningless, due to having non-unique interpretations. I think we need to go even further, though, allowing degrees of meaning based on how vague a concept is. Human concepts will be quite vague. So, I'm not confident that a more worked-out version of this theory would adequately address point #2.

#3: It's not clear that there IS an objective external view of reality to judge truth by.

Again, I think I've only addressed this point moderately well. I use the assumption that reality is describable to justify hypothesizing an R by which we judge the truth of M. Is there a unique R we can use? Most likely not. Can the choice of R change the judgement of truth in M? I don't have the formal tools to address such a question. In order for this to be a very good theory of truth, it seems like we need to be able to claim that some choices of R are right and some are wrong, and all the right ones are equivalent for the purpose of evaluating M.

Levels and Levels

2015-12-19T11:44:00.000-08:00

A system of levels related to my idea of epistemic/intellectual trust:

1. Becoming defensive if your idea is attacked. A few questions might be fine, but very many feels like an interrogation. Objections to ideas are taken personally, especially if they occur repeatedly. This is roughly the level most people are at, especially about identity issues like religion. Intellectuals can hurt people without realizing it when they try to engage people on such issues. The defensiveness is often a rational response to an environment in which criticism very often is an attack, and arguments like this are used as dominance moves.
2. Competitive intellectualism. Like at level 1, debates are battles and arguments are soldiers. However, at level 2 this becomes a friendly competition rather than a problem. Intelligent objections to your ideas are expected and welcome; you may even take trollish positions in order to fish for them. Still, you're trying to win. Pseudo-legalistic concepts like burden of proof may be employed. Contrarianism is encouraged; the more outrageous the belief you can successfully defend, the better. At this level of discourse, scientific thought may be conflated with skepticism. The endpoint of this style of intellectualism can be universal skepticism as a result.
3. Intellectual honesty. Sorting out possibilities. Exploring all sides of an issue. This can temporarily look a lot like level 2, because taking a devil's-advocate position can be very useful. However, you never want to convey stronger evidence than exists. The very idea of arguing one side and only one side, as in level 2, is crazy -- it would defeat the point. The goal is to understand the other person's thinking, get your own thoughts across, and then try to take both forward by thinking about the issue together. You don't "win" and "lose"; all participants in the discussion are trying to come up with arguments that constrain the set of possibilities, while listing more options within that and evaluating the quality of different options. If a participant in the discussion appears to be giving a one-sided argument for an extended period, it's because they think they have a point which hasn't been understood and they're still trying to convey it properly.

This is more nuanced than the two-level view I articulated previously, but it's still bound to be very simplistic compared to the reality. Discussions will mix these levels, and there are things happening in discussions which aren't best understood in terms of these levels (such as storytelling, jokes...). People will tend to be at different levels for different sets of beliefs, and with different people, and so on. Politics will almost always be at level 1 or 2, while it hardly even makes sense to talk about mathematics at anything but level 3. Higher levels are in some sense better than lower levels, but this should not be taken too far. Each level is an appropriate response to a different situation, and problems occur if you're not appropriately adapting your level of response to the situation. Admitting the weakness of your argument is a kind of countersignaling which can help shift from level 2 to level 3, but which can be ineffective or backfire if the conversation is stuck at level 2 or 1.

Here's an almost unrelated system of levels:

1. Relying on personal experience and to a lesser extent anecdotal evidence, as opposed to statistics and controlled studies. (This is usually looked down upon by those with a scientific mindset, but again I'll be arguing that these levels shouldn't be taken as a scale from worse to better.) This is a human bias, since a personal example or story from a friend (or friend of friend) will tend to stick out in memory more vividly than numbers will. Practitioners of this approach to evidence can often be heard saying things like "you can prove anything with statistics" (which is, of course, largely true!).
2. Relying on science, but only at the level it's conveyed in popular media. This is often really, really misleading. What the science says is often misunderstood, misconstrued, or ignored.
3. Single study syndrome. Beware the man of one study. The habit/tactic of taking the conclusion of one scientific study as the truth. While looking at the actual studies is better than listening to the popular media, this replicates the same mistake that those who write the popular media articles are usually making. It ignores the fact that studies are often not replicated, and can show conflicting results. Another, perhaps even more important reason why single study syndrome is dangerous is that you can fish for a study to back up almost any view you like. You can do this without even realizing it; if you google terms related to what you are thinking, it will often result in information confirming those things. To overcome this, you've typically got to search for both sides of the argument. But what do you do when you find confirming evidence on both sides?
4. Surveying science. Looking for many studies and meta-studies. This is, in some sense, the end of the line; unless you're going to break out the old lab coat and start doing science yourself, the best you can do is become acquainted with the literature and make an overall judgement from the disparate opinions there. Unfortunately, this can still be very misleading. A meta-analysis is not just a matter of finding all the relevant studies and adding up what's on one side vs the other, although this much effort is already quite a lot. Often the experiments in different studies are testing for different things. Determining which statistics are comparable will tend to be difficult, and usually you'll end up making somewhat crude comparisons. Even when studies are easily comparable, due to publication bias, a simple tally can look like overwhelming evidence where in fact there is only chance (HT @grognor for reference). And when an effect is real, it can be due to operational definitions whose relation to real life is difficult to pin down; for example, do cognitive biases which are known to exist in a laboratory setting carry over to real-world decision-making?
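The publication-bias point can be made concrete with a toy simulation. This is only a sketch under loudly hypothetical assumptions: the true effect is exactly zero, each study reports a standard-normal z-score, and the function name, the 1.645 cutoff, and the publication rule are all invented for illustration.

```python
import random

def simulate_published_tally(n_studies=1000, threshold=1.645,
                             null_publish_prob=0.0, seed=0):
    """Tally published results when the true effect is exactly zero.

    Hypothetical assumptions: each study yields a standard-normal z-score;
    a study with z > threshold counts as a "positive" finding and is always
    published; any other study is published with probability
    null_publish_prob.
    """
    rng = random.Random(seed)
    published_positive = 0
    published_other = 0
    for _ in range(n_studies):
        z = rng.gauss(0.0, 1.0)  # pure chance: there is no real effect
        if z > threshold:
            published_positive += 1
        elif rng.random() < null_publish_prob:
            published_other += 1
    return published_positive, published_other

positive, other = simulate_published_tally()
print(f"published: {positive} positive vs {other} other (1000 studies run)")
```

With the default rule (nothing but positive findings gets published), a naive tally of the published record is unanimous even though every result is noise. Raising `null_publish_prob` shows how quickly the picture changes once null results are visible.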

Due to the troublesome nature of scientific evidence, the dialog at level 4 can sound an awful lot like level 1 at times. However, keep in mind that level 4 takes a whole lot more effort than level 1. We can put arbitrary amounts of effort into fact-checking any individual belief. While it's easy to criticize the lower levels and say that everyone should be at level 4 all the time (and they should learn to do meta-studies right, darnit!), it's practically impossible to put that amount of effort in all the time. When you do, you are often confronted with a disconcerting labyrinth of arguments and refutations on both sides, so that any conclusion you may come to is tempered by the knowledge that many people have been very wrong about this same thing (for surprising reasons).

While you'll almost certainly find more errors in your thinking if you go down that rabbit hole, at some point you've got to stop. For some kinds of beliefs, the calculated point of stopping is quite early; hence, we're justified in staying at level 1 for many (most?) things. It may be easy to underestimate the amount of investigation we need to do, since long-term consequences of wrong beliefs are unpredictable (it's easier to think about only short-term needs) and it's much easier to see the evidence currently in favor of our position than the possible refutations which we've yet to encounter. Nonetheless, only so much effort is justified.

Intuitionistic Intuitions

2015-11-15T14:39:00.000-08:00

I've written quite a few blog posts about the nature of truth over the years. There's been a decent gap, though. This is partly because my standards have increased and I don't wish to continue the ramblings of past-me, and partly because I've moved on to other things. However, over this time I've noticed a rather large shift taking place in my beliefs about these things; specifically my reaction to intuitionist/constructivist arguments.

My feeling in former days was that classical logic is clear and obvious, while intuitionistic logic is obscure. I had little sympathy for the arguments in favor of intuitionism which I encountered. I recall my feeling: "Everything, all of these arguments given, can be understood in terms of classical logic -- or else not at all." My understanding of the meaning of intuitionistic logic was limited to the provability interpretation, which translates intuitionistic statements into classical statements. I could see the theoretical elegance and appeal of the principle of harmony and constructivism as long as the domain was pure mathematics, but as soon as we use logic to talk about the real world, the arguments seemed to fall apart; and surely the point (even when dealing with pure math) is to eventually make useful talk about the world? I wanted to say: all these principles are wonderful, but on top of all of this, wouldn't you like to add the Law of Excluded Middle? Surely it can be said that any meaningful statement is either true, or false?

My thinking, as I say, has shifted. However, I find myself in the puzzling position of not being able to point to a specific belief which has changed. Rather, the same old arguments for intuitionistic logic merely seem much more clear and understandable from my new perspective. The purpose of this post, then, is to attempt to articulate my new view on the meaning of intuitionistic logic.

The slow shift in my underlying beliefs was punctuated by at least two distinct realizations, so I'll attempt to articulate those.

Language Is Incomplete

In certain cases it's quite difficult to distinguish what "level" you're speaking about with natural language. Perhaps the largest example of this is that there isn't the same kind of use/mention distinction which is firmly made in formal logic. It's hard to know exactly when we're just arguing semantics (arguing about the meaning of words) vs arguing real issues. If I say "liberals don't necessarily advocate individual freedom" am I making a claim about the definition of the word liberal, or an empirical claim about the habits of actual liberals? It's unclear out of context, and can even be unclear in context.

My first realization was that the ambiguity of language allows for two possible views about what kind of statements are usually being made:

Words have meanings which can be fuzzy at times, but this doesn't matter too much. In the context of a conversation, we attempt to agree on a useful definition of the word for the discussion we're having; if the definition is unclear, we probably need to sort that out before proceeding. Hence, the normal, expected case is that words have concrete meanings referring to actual things.
Words are social constructions whose meanings are partial at the best of times. Even in pure mathematics, we see this: systems of axioms are typically incomplete, leaving wiggle room for further axioms to be added, potentially ad infinitum. If we don't pin down the topic of discourse precisely in math, how can we think that's the case in typical real-world cases? Therefore, the normal, expected case is that we're dealing with only incompletely-specified notions. Because our statements must be understood in this context, they have to be interpreted as mostly talking about these constructions rather than talking about the real world as such.

This is undoubtedly a false dichotomy, but helped me see why one might begin to advocate intuitionistic logic. I might think that there is always a fact of the matter about purely physical items such as atoms and gluons, but when we discuss tables and chairs, such entities are sufficiently ill-defined that we're not justified in acting as if there is always a physical yes-or-no sitting behind our statements. Instead, when I say "the chair is next to the table" the claim is better understood as indicating that understood conditions for warranted assertibility have been met. Likewise, if I say "the chair is not next to the table" it indicates that conditions warranting denial have been met. There need not be a sufficiently precise notion available so that we would say the chair "is either next to the table or not" -- there well may be cases when we would not assent to either judgement.

After thinking of it this way, I was seeing it as a matter of convention -- a tricky semantic issue somewhat related to use/mention confusion.

Anti-Realism Is Just Rejection of Map/Territory Distinctions

Anti-realism is a position which some (most?) intuitionists take. Now, on the one hand, this sort of made sense to me: my confusion about intuitionism was largely along the lines "but things are really true or false!", so it made a certain kind of sense for the intuitionist reply to be "No, there is no real!". The intuitionists seemed to retreat entirely into language. Truth is merely proof; and proof in turn is assertibility under agreed-upon conventions. (These views are not necessarily what intuitionists would say exactly, but it's the impression I somehow got of them. I don't have sources for those things.)

If you're retreating this far, how do you know anything? Isn't the point to operate in the real world, somewhere down the line?

At some point, I read this Facebook post by Brienne, which got me thinking:

One of the benefits of studying constructivism is that no matter how hopelessly confused you feel, when you take a break to wonder about a classical thing, the answer is SO OBVIOUS. It's like you want to transfer just the pink glass marbles from this cup of water to that cup of water using chopsticks, and then someone asks whether pink marbles are even possible to distinguish from blue marbles in the first place, and it occurs to you to just dump out all the water and sort through them with your fingers, so you immediately hand them a pink marble and a blue marble. Or maybe it's more like catching Vaseline-coated eels with your bare hands, vs. catching regular eels with your bare hands. Because catching eels with your bare hands is difficult simpliciter. Yes, make them electric, and that's exactly what it's like to study intuitionism. Intuitionism is like catching vaseline-coated electric eels with your bare hands.
Posted by Brienne Yudkowsky on Friday, September 25, 2015

I believe she simply meant that constructivism is hard and classical logic is easy by comparison. (For the level of detail in this blog post, constructivism and intuitionism are the same.) However, the image with the marbles stuck with me. Some time later, I had the sudden thought that a marble is a constructive proof of a marble. The intuitionists are not "retreating entirely into language" as I previously thought. Rather, almost the opposite: they are rejecting a strict brain/body separation, with logic happening only in the brain. Logic becomes more physical.

Rationalism generally makes a big deal of the map/territory distinction. The idea is that just as a map describes a territory, our beliefs describe the world. Just as a map must be constructed by looking at the territory if it's to be accurate, our beliefs must be constructed by looking at the world. The correspondence theory of truth holds that a statement or belief is true or false based on a correspondence to the world, much as a map is a projected model of the territory it depicts, and is judged correct or incorrect with respect to this projection. This is the meaning of the map.

In classical logic, this translates to model theory. Logical sentences correspond to a model via an interpretation; this determines their truth values as either true or false. How can we understand intuitionistic logic in these terms? The standard answer is Kripke semantics, but while that's a fine formal tool, I never found it helped me understand the meaning of intuitionistic statements. Kripke semantics is a many-world interpretation; the anti-realist position seemed closer to no-world-at-all. I now see I was mistaken.

Anti-realism is not rejection of the territory. Anti-realism is rejection of the map-territory correspondence. In the case of a literal map, such a correspondence makes sense because there is a map-reader who interprets the map. In the case of our beliefs, however, we are the only interpreter. A map cannot also contain its own map-territory correspondence; that is fundamentally outside of the map. A map will often include a legend, which helps us interpret the symbols on the map; but the legend itself cannot be explained with a legend, and so on ad infinitum. The chain must bottom out somehow, with some kind of semantics other than the map-territory kind.

The anti-realist provides this with constructive semantics. This is not based on correspondence. The meaning of a sentence rests instead in what we can do with it. In computer programming terms, meaning is more like a pointer: we uncover the reference by a physical operation of finding the location in memory which we've been pointing to, and accessing its contents. If we claim that 3*7=21, we can check the truth of this statement with a concrete operation. "3" is not understood as a reference to a mysterious abstract entity known as a "number"; 3 is the number. (Or if the classicist insists that 3 is only a numeral, then we accept this, but insist that it is clear numerals exist but unclear that numbers exist.)

A proof worked out on a sheet of paper is a proof; it does not merely mean the proof. It is not a set of squiggles with a correspondence to a proof.

How does epistemology work in this kind of context? How do we come to know things? Well...

Bayesianism Needs No Map

The epistemology of machine learning has been pragmatist all along: all models are wrong; some are useful. A map-territory model of knowledge plays a major role in the way we think of modeling, but in practice? There is no measurement of map-territory correspondence. What matters is goodness-of-fit and generalization error. In other words, a model is judged by the predictions it makes, not by the mechanisms which lead to those predictions. We tend to expect models which make better predictions to have internal models close to what's going on in the external world, but there is no precise notion of what this means, and none is required. The theorems of statistical learning theory and Bayesian epistemology (of which I am aware) do not make use of a map-territory concept, and the concept is not missed.
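A small sketch of what "judged by the predictions it makes" looks like in practice. Everything here is a hypothetical setup chosen for illustration: the data-generating rule (y = 2x plus Gaussian noise), the two candidate models, and all function names. The only score computed is held-out prediction error; nothing in the code inspects whether a model's internals resemble the process that generated the data.

```python
import random

def make_data(n, rng):
    """Hypothetical data source: y = 2x + noise, with x uniform on [0, 1]."""
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5))
            for x in (rng.uniform(0.0, 1.0) for _ in range(n))]

def fit_constant(train):
    """Model A: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Model B: ordinary least-squares line through the training data."""
    mean_x = sum(x for x, _ in train) / len(train)
    mean_y = sum(y for _, y in train) / len(train)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in train)
    var = sum((x - mean_x) ** 2 for x, _ in train)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mean_squared_error(model, data):
    """Score a model purely by how well it predicts unseen targets."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train, held_out = make_data(200, rng), make_data(200, rng)
constant_error = mean_squared_error(fit_constant(train), held_out)
line_error = mean_squared_error(fit_line(train), held_out)
print(f"held-out error: constant={constant_error:.3f}, line={line_error:.3f}")
```

The comparison between the two models is made entirely on `held_out`, which neither model saw during fitting; that number stands in for generalization error.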

It's interesting that formal Bayesian epistemology relies so little on a map/territory distinction. The modern rationalist movement tends to advocate both rather strongly. While Bayesianism is compatible with map-territory thinking, it doesn't need it or really encourage it. This realization was rather surprising to me.

Associated vs Relevant

2015-06-14T18:00:00.001-07:00

Also cross-posted to LessWrong.

The List of Nuances (which is actually more of a list of fine distinctions - a fine distinction which only occurred to its authors after the writing of it) has one glaring omission, which is the distinction between associated and relevant. A List of Nuances is largely a set of reminders that we aren't omniscient, but it also serves the purpose of listing actual subtleties and calling for readers to note the subtleties rather than allowing themselves to fall into associationism, applying broad cognitive clusters where fine distinctions are available. The distinction between associated and relevant is critical to this activity.

An association can be anything related to a subject. To be relevant is a higher standard: it means that there is an articulated argument connecting to a question on the table, such that the new statement may well push the question one way or the other (perhaps after checking other relevant facts). This is close to the concept of value of information.

Whether something is relevant or merely associated can become confused when epistemic defensiveness comes into play. From A List of Nuances:

10. What You Mean vs. What You Think You Mean

Very often, people will say something and then that thing will be refuted. The common response to this is to claim you meant something slightly different, which is more easily defended.

We often do this without noticing, making it dangerous for thinking. It is an automatic response generated by our brains, not a conscious decision to defend ourselves from being discredited. You do this far more often than you notice. The brain fills in a false memory of what you meant without asking for permission.

As mentioned in Epistemic Trust, a common reason for this is when someone says something associated to the topic at hand, which turns out not to be relevant.

There is no shame in saying associated things. In a free-ranging discussion, the conversation often moves forward from topic to topic by free-association. All of the harm here comes from claiming that something is relevant when it is merely associated. Because this is often a result of knee-jerk self-defense, it is critical to repeat: there is no shame in saying something merely associated with the topic at hand!

It is quite important, however, to spot the difference. Association-based thinking is one of the signs of a death spiral, as a large associated memeplex reinforces itself to the point where it seems like a single, simple idea. A way to detect this trap is to try to write down the idea in list form and evaluate the different parts. If you can't explicitly articulate the unseen connection you feel between all the ideas in the memeplex, it may not exist.

Harnessing associations is a powerful way to create a good story (although, see item #3 here for a counterpoint). Repeating themes can create a powerful feeling of relevance, which may be good for convincing people of a memeplex. Furthermore, association is a wonderful exploratory tool. However, it can turn into an enemy of articulated argument; for this reason, it is important to tread carefully (especially in one's own mind).

Epistemic Trust: Clarification

2015-06-10T17:50:00.001-07:00

Cross-posted to LessWrong Discussion.

A while ago, I wrote about epistemic trust. The thrust of my argument was that rational argument is often more a function of the group dynamic, as opposed to how rational the individuals in the group are. I assigned meaning to several terms, in order to explain this:

Intellectual honesty: being up-front not just about what you believe, but also why you believe it, what your motivations are in saying it, and the degree to which you have evidence for it.

Intellectual-Honesty Culture: The norm of intellectual honesty. Calling out mistakes and immediately admitting them; feeling comfortable with giving and receiving criticism.

Face Culture: Norms associated with status which work contrary to intellectual honesty. Agreement as social currency; disagreement as attack. A need to save face when one's statements turn out to be incorrect or irrelevant; the need to make everyone feel included by praising contributions and excusing mistakes.

Intellectual trust: the expectation that others in the discussion have common intellectual goals; that criticism is an attempt to help, rather than an attack. The kind of trust required to take other people's comments at face value rather than being overly concerned with ulterior motives, especially ideological motives. I hypothesized that this is caused largely by ideological common ground, and that this is the main way of achieving intellectual-honesty culture.

There are several important points which I did not successfully make last time.

Sometimes it's necessary to play at face culture. The skills which go along with face-culture are important. It is generally a good idea to try to make everyone feel included and to praise contributions even if they turn out to be incorrect. It's important to make sure that you do not offend people with criticism. Many people feel that they are under attack when engaged in critical discussion. Wanting to change this is not an excuse for ignoring it.

Face culture is not the error. Being unable to play the right culture at the right time is the error. In my personal experience, I've seen that some people are unable to give up face-culture habits in more academic settings where intellectual honesty is the norm. This causes great strife and heated arguments! There is no gain in playing for face when you're in the midst of an honesty culture, unless you can do it very well and subtly. You gain a lot more face by admitting your mistakes. On the other hand, there's no honor in playing for honesty when face-culture is dominant. This also tends to cause more trouble than it's worth.

It's a cultural thing, but it's not just a cultural thing. Some people have personalities much better suited to one culture or the other, while other people are able to switch freely between them. I expect that groups move further toward intellectual honesty as a result of establishing intellectual trust, but that is not the only factor. Try to estimate the preferences of the individuals you're dealing with (while keeping in mind that people may surprise you later on).

Simultaneous Overconfidence and Underconfidence

2015-06-03T14:02:00.000-07:00

Follow-up to this and this. Prep for this meetup. Cross-posted to LessWrong.

Eliezer talked about cognitive bias, statistical bias, and inductive bias in a series of posts only the first of which made it directly into the LessWrong sequences as currently organized (unless I've missed them!). Inductive bias helps us leap to the right conclusion from the evidence, if it captures good prior assumptions. Statistical bias can be good or bad, depending in part on the bias-variance trade-off. Cognitive bias refers only to obstacles which prevent us from thinking well.

Unfortunately, as we shall see, psychologists can be quite inconsistent about how cognitive bias is defined. This created a paradox in the history of cognitive bias research. One well-researched and highly experimentally validated effect was conservatism, the tendency to give estimates too middling, or probabilities too near 50%. This relates especially to integration of information: when given evidence relating to a situation, people tend not to take it fully into account, as if they are stuck with their prior. Another highly-validated effect was overconfidence, relating especially to calibration: when people give high subjective probabilities like 99%, they are typically wrong much more often than those probabilities imply.

In real-life situations, these two contradict: there is no clean distinction between information integration tasks and calibration tasks. A person's subjective probability is always, in some sense, the integration of the information they've been exposed to. In practice, then, when should we expect other people to be under- or overconfident?

Simultaneous Overconfidence and Underconfidence

The conflict was resolved in an excellent paper by Ido Erev et al. which showed that it's the result of how psychologists did their statistics. Essentially, one group of psychologists defined bias one way, and the other defined it another way. The results are not really contradictory; they are measuring different things. In fact, you can find underconfidence or overconfidence in the same data sets by applying the different statistical techniques; it has little or nothing to do with the differences between information integration tasks and probability calibration tasks. Here's my rough drawing of the phenomenon (apologies for my hand-drawn illustrations):

Overconfidence here refers to probabilities which are more extreme than they should be, here illustrated as being further from 50%. (This baseline makes sense when choosing from two options, but won't always be the right baseline to think about.) Underconfident subjective probabilities are associated with more extreme objective probabilities, which is why the slope tilts up in the figure. The overconfident line similarly tilts down, indicating that the subjective probabilities are associated with less-extreme objective probabilities. Unfortunately, if you don't know how the lines are computed, this means less than you might think. Ido Erev et al. show that these two regression lines can be derived from just one data-set. I found the paper easy and fun to read, but I'll explain the phenomenon in a different way here by relating it to the concept of statistical bias and tails coming apart.

The Tails Come Apart

Everyone who has read Why the Tails Come Apart will likely recognize this image:

The idea is that even if X and Y are highly correlated, the most extreme X values and the most extreme Y values will differ. I've labelled the difference the "curse" after the optimizer's curse: if you optimize a criterion which is merely correlated with the thing you actually want, you can expect to be disappointed.

Applying the idea to calibration, we can say that the most extreme subjective beliefs are almost certainly not the most extreme on the objective scale. That is: a person's most confident beliefs are almost certainly overconfident. A belief is not likely to have worked its way up to the highest peak of confidence by merit alone. It's far more likely that some merit but also some error in reasoning combined to yield high confidence.
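A quick simulation of the tails coming apart. This is a sketch under assumed conditions: X and Y are standard normal with a hypothetical correlation `rho`, and the function name and default sizes are invented for illustration. Each trial draws a cloud of points, picks the point with the largest X, and records how far its Y falls short of the largest Y in the cloud (the "curse").

```python
import math
import random

def tails_come_apart(rho=0.9, n_points=200, n_trials=200, seed=0):
    """Return (fraction of trials where the top-X point is also the top-Y
    point, average shortfall of the top-X point's Y below the maximum Y)."""
    rng = random.Random(seed)
    same_count = 0
    total_gap = 0.0
    for _ in range(n_trials):
        xs, ys = [], []
        for _ in range(n_points):
            x = rng.gauss(0.0, 1.0)
            # Y shares a fraction rho of X and adds independent noise,
            # so that corr(X, Y) = rho and Var(Y) = 1.
            y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(y)
        best_x_index = max(range(n_points), key=lambda i: xs[i])
        y_at_best_x = ys[best_x_index]
        best_y = max(ys)
        if y_at_best_x == best_y:
            same_count += 1
        total_gap += best_y - y_at_best_x  # never negative by construction
    return same_count / n_trials, total_gap / n_trials

same_fraction, mean_gap = tails_come_apart()
print(f"top-X is also top-Y in {same_fraction:.0%} of trials; "
      f"average curse = {mean_gap:.2f}")
```

Reading X as subjective confidence and Y as objective support, the gap is the amount by which the single most confident belief tends to be overconfident.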

In what follows, I'll describe a "soft version" which shows the tails coming apart gradually, rather than only talking about the most extreme points.

Statistical Bias

Statistical bias is defined through the notion of an estimator. We have some quantity we want to know, X, and we use an estimator to guess what it might be. The estimator will be some calculation which gives us our estimate, which I will write as X^. An estimator is derived from noisy information, such as a sample drawn at random from a larger population. The difference between the estimator and the true value, X^-X, would ideally be zero; however, this is unrealistic. We expect estimators to have error, but systematic error is referred to as bias.

Given a particular value for X, the bias is defined as the expected value of X^-X, written E_X(X^-X). An unbiased estimator is an estimator such that E_X(X^-X)=0 for any value of X we choose.

Due to the bias-variance trade-off, unbiased estimators are not the best way to minimize error in general. However, statisticians still love unbiased estimators. It's a nice property to have, and in situations where it works, it has a more objective feel than estimators which use bias to further reduce error.
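To make the definition and the trade-off concrete, here is a Monte Carlo sketch. The setup is hypothetical: a fixed true value X = 1, four noisy observations of it per trial, and two illustrative estimators (the plain sample mean, and the sample mean shrunk halfway toward zero). The bias is estimated as the average of X^-X over many trials with X held fixed, exactly as in the definition above.

```python
import random

def bias_and_mse(estimator, true_x=1.0, noise_sd=2.0, sample_size=4,
                 n_trials=20000, seed=0):
    """Estimate E_X(X^ - X) and the mean squared error for a fixed true X."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_trials):
        sample = [true_x + rng.gauss(0.0, noise_sd) for _ in range(sample_size)]
        errors.append(estimator(sample) - true_x)
    bias = sum(errors) / n_trials
    mse = sum(e * e for e in errors) / n_trials
    return bias, mse

def sample_mean(sample):
    """Unbiased: its expected value equals the true X for every X."""
    return sum(sample) / len(sample)

def shrunk_mean(sample):
    """Biased on purpose: pulls the estimate halfway toward zero."""
    return 0.5 * sample_mean(sample)

unbiased_bias, unbiased_mse = bias_and_mse(sample_mean)
shrunk_bias, shrunk_mse = bias_and_mse(shrunk_mean)
print(f"sample mean: bias={unbiased_bias:+.3f}, mse={unbiased_mse:.3f}")
print(f"shrunk mean: bias={shrunk_bias:+.3f}, mse={shrunk_mse:.3f}")
```

For this particular X the shrunk estimator trades a systematic error for a large cut in variance and ends up with the lower mean squared error. That advantage is not universal: pass a large `true_x` and the same shrinkage becomes badly wrong, which is part of why unbiasedness keeps its "objective" appeal.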

Notice that the definition of bias takes X as fixed; that is, it fixes the quantity which we don't know. Given a fixed X, the unbiased estimator's average value will equal X. This is a picture of bias which can only be evaluated "from the outside"; that is, from a perspective in which we can fix the unknown X.

A more inside-view approach to statistical estimation is to consider a fixed body of evidence, and make the estimator equal the average unknown. This is exactly inverse to unbiased estimation:

In the image, we want to estimate unknown Y from observed X. The two variables are correlated, just like in the earlier "tails come apart" scenario. The average-Y estimator tilts down because good estimates tend to be conservative: because I only have partial information about Y, I want to take into account what I see from X but also pull toward the average value of Y to be safe. On the other hand, unbiased estimators tend to be overconfident: the effect of X is exaggerated. For a fixed Y, the average Y^ is supposed to equal Y. However, for fixed Y, the X we will get will lean toward the mean X (just as for a fixed X, we observed that the average Y leans toward the mean Y). Therefore, in order for Y^ to be high enough, it needs to pull up sharply: middling values of X need to give more extreme Y^ estimates.
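The two lines can be recovered numerically. This sketch assumes X and Y are standard normal with a hypothetical correlation `rho`, and restricts both estimators to straight lines through the origin; the function names are illustrative. The average-Y estimator is the least-squares regression of Y on X. The unbiased estimator must satisfy "for fixed Y, the average Y^ equals Y"; since the average X for fixed Y is the regression of X on Y, the unbiased line is that regression inverted.

```python
import math
import random

def least_squares_slope(inputs, outputs):
    """Slope of the least-squares line predicting outputs from inputs."""
    mean_in = sum(inputs) / len(inputs)
    mean_out = sum(outputs) / len(outputs)
    cov = sum((a - mean_in) * (b - mean_out) for a, b in zip(inputs, outputs))
    var = sum((a - mean_in) ** 2 for a in inputs)
    return cov / var

def estimator_slopes(rho=0.6, n=20000, seed=0):
    """Return (slope of the average-Y estimator, slope of the unbiased one)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)                       # the unknown
        x = rho * y + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xs.append(x)                                  # the observation
        ys.append(y)
    average_y_slope = least_squares_slope(xs, ys)     # average Y for fixed X
    x_given_y_slope = least_squares_slope(ys, xs)     # average X for fixed Y
    unbiased_slope = 1.0 / x_given_y_slope            # makes avg Y^ equal Y
    return average_y_slope, unbiased_slope

average_y_slope, unbiased_slope = estimator_slopes()
print(f"average-Y estimator slope: {average_y_slope:.2f} (conservative, < 1)")
print(f"unbiased estimator slope:  {unbiased_slope:.2f} (exaggerated, > 1)")
```

Under these assumptions the first slope should land near `rho` and the second near `1/rho`, so the same cloud of points yields one line that tilts down and one that pulls up sharply.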

If we superimpose this on top of the tails-come-apart image, we see that this is something like a generalization:

Wrapping It All Up

The punchline is that these two different regression lines are exactly what yields simultaneous underconfidence and overconfidence. The studies in conservatism were taking the objective probability as the independent variable, and graphing people's subjective probabilities as a function of that. The natural next step is to take the average subjective probability per fixed objective probability. This will tend to show underconfidence due to the statistics of the situation.

The studies on calibration, on the other hand, took the subjective probabilities as the independent variable, graphing average correct as a function of that. This will tend to show overconfidence, even with the same data as shows underconfidence in the other analysis.

From an individual's standpoint, the overconfidence is the real phenomenon. Errors in judgement tend to make us overconfident rather than underconfident because errors make the tails come apart: if you select our most confident beliefs, it's a good bet that they have only mediocre support from evidence, even if, generally speaking, our level of belief is highly correlated with how well-supported a claim is. Because the tails come apart gradually, we can expect that the higher our confidence, the larger the gap between that confidence and the level of factual support for that belief.

This is not a fixed fact of human cognition pre-ordained by statistics, however. It's merely what happens due to random error. Not all studies show systematic overconfidence, and in a given study, not all subjects will display overconfidence. Random errors in judgement will tend to create overconfidence as a result of the statistical phenomena described above, but systematic correction is still an option.

Good Bias, Bad Bias

2015-03-29T15:48:00.000-07:00

I had a conceptual disagreement with a couple of friends, and I'm trying to spell out what I meant here in order to continue the discussion.

Bias, in the statistical sense, is defined in terms of estimators. Suppose there's a hidden value, Theta, and you observe data X whose probability distribution is dependent on Theta, with known P(X|Theta). An estimator is a function of the data which gives you a hopefully-plausible value of Theta.

An unbiased estimator is an estimator which has the property that, given a particular value of Theta, the expected value of the estimator (expectation in P(X|Theta)) is exactly Theta. In other words: our estimate may be higher or lower than Theta due to the stochastic relationship between X and Theta, but it hits Theta on average. (In order for averaging to make sense, we're assuming Theta is a real number, here.)

The Bayesian view is that we have a prior on Theta, which injects useful bias in our judgments. A Bayesian making statistical estimators wants to minimize loss. Loss can mean different things in different situations; for example, if we're estimating whether a car is going to hit us, the damage done by wrongly thinking we are safe is much larger than the damage done by wrongly thinking we're not. However, if we don't have any specific idea about real-world consequences, it may be reasonable to assume a squared-error loss so that we are trying to get our estimated Theta to match the average value of Theta.

Even so, the Bayesian choice of estimator will not be unbiased, because Bayesians will want to minimize the expected loss accounting for the prior, which means looking at the expectation in P(X|Theta)*P(Theta). In fact, we can just look at P(Theta|X). If we're minimizing squared error, then our estimator would be the average Theta in P(Theta|X), which is proportional to P(X|Theta)P(Theta).

Essentially, we want to weight our average by the prior over Theta because we decrease our overall expected loss by accepting a lot of statistical bias for values of Theta which are less probable according to our prior.

So, a certain amount of statistical bias is perfectly rational.
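A small worked case shows the trade. The sketch assumes a specific model chosen only for illustration: Theta has a standard normal prior and X is Theta plus standard normal noise, so the posterior mean is X/2. All names and numbers here are hypothetical.

```python
import random

PRIOR_SD = 1.0  # assumed prior: Theta ~ Normal(0, PRIOR_SD)
NOISE_SD = 1.0  # assumed likelihood: X | Theta ~ Normal(Theta, NOISE_SD)


def identity(x):
    """Unbiased estimator: report the observation itself."""
    return x


def posterior_mean(x):
    """Squared-error Bayes estimator for the normal-normal model above."""
    weight = PRIOR_SD ** 2 / (PRIOR_SD ** 2 + NOISE_SD ** 2)
    return weight * x


def bias_at(theta, estimator, trials=40000, seed=2):
    """Average of (estimate - Theta) with Theta held fixed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += estimator(rng.gauss(theta, NOISE_SD)) - theta
    return total / trials


def mse_over_prior(estimator, trials=40000, seed=3):
    """Average squared error when Theta itself is drawn from the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = rng.gauss(0.0, PRIOR_SD)
        x = rng.gauss(theta, NOISE_SD)
        total += (estimator(x) - theta) ** 2
    return total / trials


print("bias at Theta=2:", bias_at(2.0, identity), bias_at(2.0, posterior_mean))
print("overall MSE:", mse_over_prior(identity), mse_over_prior(posterior_mean))
```

In this model the posterior mean is heavily biased at an improbable value like Theta=2, yet its squared error averaged over the prior is about half that of the unbiased estimator.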

Bad bias, to a Bayesian, refers to situations when we can predictably improve our estimates in a systematic way.

One of the limitations of the paper reviewed last time was that it didn't address good vs bad bias. Bias, in that paper, was more or less indistinguishable from bias in the statistical sense. Detangling things we can improve from things which we want would require a deeper analysis of the mathematical model, and of the data.

A Paper on Bias

2015-03-28T01:56:00.001-07:00

I've been reading some of the cognitive bias literature recently.

First, I dove into Toward a Synthesis of Cognitive Biases, by Martin Hilbert: a work which claims to explain how eight different biases observed in the literature are an inevitable result of noise in the information-processing channels in the brain.

The paper starts out with what it calls the conservatism bias. (The author complains that the literature is inconsistent about naming biases, both giving one bias multiple names and using one name for multiple biases. Conservatism is what is used for this paper, but this may not be standard terminology. What's important is the mathematical idea.)

The idea behind conservatism is that when shown evidence, people tend to update their probabilities more conservatively than would be predicted by probability theory. It's as if they didn't observe all the evidence, or aren't taking the evidence fully into account. A well-known study showed that subjects were overly conservative in assigning probabilities to gender based on height; an earlier study had found that the problem is more extreme when subjects are asked to aggregate information, guessing the gender of a random selection of same-sex individuals from height. Many studies were done to confirm this bias. A large body of evidence accumulated which indicated that subjects irrationally avoided extreme probabilities, preferring to report middling values.

The author construed conservatism very broadly. Another example given was: if you quickly flash a set of points on a screen and ask subjects to estimate their number, then subjects will tend to over-estimate the number of a small set of points, and under-estimate the number of a large set of points.

The hypothesis put forward in Toward a Synthesis is that conservatism is a result of random error in the information-processing channels which take in evidence. If all red blocks are heavy and all blue blocks are light, but you occasionally mix up red and blue, you will conclude that most red blocks are heavy and most blue blocks are light. If you are trying to integrate some quantity of information, but some of it is mis-remembered, small probabilities will become larger and large will become smaller.
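The arithmetic behind the block example can be sketched in a couple of lines. This assumes the simplest possible noise model, a symmetric mix-up that flips each observation with a fixed probability; the paper's actual channel model may be more elaborate, and the 10% flip rate is invented.

```python
def through_noisy_channel(p, flip_rate):
    """Apparent frequency of an event whose true frequency is p, when each
    observation is recorded as its opposite with probability flip_rate."""
    return p * (1 - flip_rate) + (1 - p) * flip_rate


# "All red blocks are heavy" becomes "most red blocks are heavy".
print(through_noisy_channel(1.0, 0.1))

# Small probabilities get larger, large ones get smaller, 0.5 stays put.
for p in (0.05, 0.5, 0.95):
    print(p, through_noisy_channel(p, 0.1))
```

Every value is pulled toward 0.5, and the pull is strongest at the extremes.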

One thing that bothered me about this paper was that it did not directly contrast processing-error conservatism with the rational conservatism which can result from quantifying uncertainty. My estimate of the number of points on a screen should tend toward the mean if I only saw them briefly; this bias will increase my overall accuracy rate. It seems that previous studies established that people were over-conservative compared to the rational amount, but I didn't take the time to dig up those analyses.

All eight biases explained in Toward a Synthesis were effectively consequences of conservatism in different ways.

Illusory correlation: Two rare events X and Y which are independent appear correlated as a result of their probabilities being inflated by conservatism bias. I found this to be the most interesting application. The standard example of illusory correlation is stereotyping of minority groups. The race is X, and some rare trait is Y. What was found was that stereotyping could be induced in subjects by showing them artificial data in which the traits were entirely independent of the races. Y could be either a positive or a negative trait; illusory correlation occurs either way. The effect that conservatism has on the judgements will depend on how you ask the subject about the data, which is interesting, but illusory correlation emerges regardless. Essentially, because all the frequencies are smaller within the minority group, the conservatism bias operates more strongly; the trait Y is inflated so much that it's seen as being about 50-50 in that group, whereas the judgement about its frequency in the majority group is much more realistic.
Self-Other Placement: People with low skill tend to overestimate their abilities, and people with high skill tend to underestimate theirs; this is known as the Dunning-Kruger effect. This is a straightforward case of conservatism. Self-other placement refers to the further effect that people tend to be even more conservative about estimating other people's abilities, which paradoxically means that people of high ability tend to over-estimate the probability that they are better than a specific other person, despite the Dunning-Kruger effect; and similarly, people of low ability tend to over-estimate the probability that they are worse as compared with specific individuals, despite over-estimating their ability overall. The article explains this as a result of having less information about others, and hence, being more conservative. (I'm not sure how this fits with the previously-mentioned result that people get more conservative as they have more evidence.)
Sub-Additivity: This bias is a class of inconsistent probability judgements. The estimated probability of an event will be higher if we ask for the probability of a set of sub-events, rather than merely asking for the overall probability. From Wikipedia: For instance, subjects in one experiment judged the probability of death from cancer in the United States was 18%, the probability from heart attack was 22%, and the probability of death from "other natural causes" was 33%. Other participants judged the probability of death from a natural cause was 58%. Natural causes are made up of precisely cancer, heart attack, and "other natural causes," however, the sum of the latter three probabilities was 73%, and not 58%. According to Tversky and Koehler (1994) this kind of result is observed consistently. The bias is explained with conservatism again. The smaller probabilities are inflated more by the conservatism bias than the larger probability is, which makes their sum much more inflated than the original event.
Hard-Easy Bias: People tend to overestimate the difficulty of easy tasks, and underestimate the difficulty of hard ones. This is straightforward conservatism, although the paper framed it in a somewhat more complex model (it was the 8th bias covered in the paper, but I'm putting it out of order in this blog post).
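The sub-additivity explanation above is easy to check with toy numbers. The sketch assumes conservatism acts like a symmetric mix-up with a fixed flip rate (an illustrative model, not necessarily the paper's), and the three sub-event probabilities are invented rather than taken from any study.

```python
def conservative(p, flip_rate=0.1):
    """Judged probability under an assumed symmetric mix-up model."""
    return p * (1 - flip_rate) + (1 - p) * flip_rate


# Hypothetical true probabilities of three mutually exclusive sub-events.
sub_events = [0.10, 0.15, 0.20]
whole_event = sum(sub_events)

judged_parts = sum(conservative(p) for p in sub_events)
judged_whole = conservative(whole_event)

print("true total:", whole_event)
print("sum of judged sub-events:", judged_parts)
print("judged whole event:", judged_whole)
```

Each small sub-event receives its own upward nudge, so the judged parts add up to far more than the judged whole, even though the whole is only slightly distorted.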

That's 5 biases down, and 3 to go. The article has explained conservatism as a mistake made by a noisy information-processor, and explains 4 other biases as consequences of conservatism. So far so good.

Here's where things start to get... weird.

Simultaneous Overestimation and Underestimation

Bias 5 is termed exaggerated expectation in the paper. This is a relatively short section which reviews a bias dual to conservatism. Conservatism looks at the statistical relationship from the evidence, to the estimate formed in the brain. If there is noise in the information channel connecting the two, then conservatism is a statistical near-certainty.

Similarly, we can turn the relationship around. The conservatism bias was based on looking at P(estimate|evidence). We can turn it around with Bayes' Law, to examine P(evidence|estimate). If there is noise in one direction, there is noise in the other direction. This has a surprising implication: the evidence will be conservative with respect to the estimate, by essentially the same argument which says that the estimate will tend to be conservative with respect to the evidence. This implies that (under statistical assumptions spelled out in the paper), our estimates will tend to be more extreme than the data. This is the exaggerated expectation effect.

If you're like me, at this point you're saying what???

The whole idea of conservatism was that the estimates tend to be less extreme than the data! Now "by the same argument" we are concluding the opposite?

The section refers to a paper about this, so before moving further I took a look at that reference. The paper is Simultaneous Over- and Under- Confidence: the Role of Error in Judgement Process by Erev et al. It's a very good paper, and I recommend taking a look at it.

Simultaneous Over- and Under- Confidence reviews two separate strains of literature in psychology. A large body of studies in the 1960s found systematic and reliable underestimation of probabilities. This revision-of-opinion literature concluded that it was difficult to take the full evidence into account to change your beliefs. Later, many studies on calibration found systematic overestimation of probabilities: when subjects are asked to give probabilities for their beliefs, the probabilities are typically higher than their frequency of being correct.

What is going on? How can both of these be true?

One possible answer is that the experimental conditions are different. Revision-of-opinion tests give a subject evidence, and then test how well the subject has integrated the evidence to form a belief. Calibration tests are more like trivia sessions; the subject is asked an array of questions, and assigns a probability to each answer they give. Perhaps humans are stubborn but boastful: slow to revise their beliefs, but quick to over-estimate the accuracy of those beliefs. Perhaps this is true. It's difficult to test this against the data, though, because we can't always distinguish between calibration tests and revision-of-opinion tests. All question-answering involves drawing on world knowledge combined with specific knowledge given in the question to arrive at an answer. In any case, a much more fundamental answer is available.

The Erev paper points out that revision-of-opinion experiments used different data analysis. Erev re-analysed the data for studies on both sides, and found that the statistical techniques used by revision-of-opinion researchers found underconfidence, while the techniques of calibration researchers found overconfidence, in the same data-set!

Both techniques compared the objective probability, OP, with the subject's reported probability, SP. OP is the empirical frequency, while SP is whatever the subject writes down to represent their degree of belief. However, revision-of-opinion studies started with a desired OP for each situation and calculated the average SP for a given OP. Calibration literature instead starts with the numbers written down by the subjects, and then asks how often they were correct; so, they're computing the average OP for a given SP.

When we look at data and try to find functions from X to Y like that, we're creating statistical estimators. A very general principle is that estimators tend to be regressive: my Y estimate will tend to be closer to the Y average than the actual Y. Now, in the first case, scientists were using X=OP and Y=SP; lo and behold, they found it to be regressive. In later decades, they took X=SP and Y=OP, and found that to be regressive! From a statistical perspective, this is plain and ordinary business as usual. The problem is that one case was termed under-confidence and the other over-confidence, and they appeared from those names to be contrary to one another.
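Both analyses can be run on one simulated data set. The sketch below rests on assumptions that are purely illustrative: OP is drawn uniformly from nine levels, and SP is OP with Gaussian noise added on the log-odds scale (so SP stays between 0 and 1). The noise level and the bin edges are made up.

```python
import math
import random

OP_LEVELS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def logit(p):
    return math.log(p / (1 - p))


def logistic(z):
    return 1 / (1 + math.exp(-z))


def simulate(n=200000, noise_sd=1.0, seed=4):
    """Return (OP, SP) pairs; SP is OP plus noise in log-odds."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        op = rng.choice(OP_LEVELS)
        sp = logistic(logit(op) + rng.gauss(0, noise_sd))
        pairs.append((op, sp))
    return pairs


def mean_sp_given_op(pairs, op):
    """Revision-of-opinion analysis: fix OP, average the reported SP."""
    values = [sp for o, sp in pairs if o == op]
    return sum(values) / len(values)


def mean_op_given_sp(pairs, low, high):
    """Calibration analysis: fix a band of SP, average the OP behind it."""
    values = [o for o, sp in pairs if low <= sp < high]
    return sum(values) / len(values)


pairs = simulate()
print("average SP when OP is 0.8:", mean_sp_given_op(pairs, 0.8))
print("average OP when SP is near 0.8:", mean_op_given_sp(pairs, 0.75, 0.85))
```

Under these assumptions both averages come out below 0.8. The first reads as underconfidence and the second as overconfidence, from the same pairs.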

This is exactly what the Toward a Synthesis paper was trying to get across with the reversed channel, P(estimate|evidence) vs P(evidence|estimate).

Does this mean that the two biases are mere statistical artifacts, and humans are actually fairly good information systems whose beliefs are neither under- nor over- confident? No, not really. The statistical phenomena are real: humans are both under- and over-confident in these situations. What Toward a Synthesis and Simultaneous Over- and Under- Confidence are trying to say is that these are not mutually inconsistent, and can be accounted for by noise in the information-processing system of the brain.

Both papers propose a model which accounts for overconfidence as the result of noise during the creation of an estimate, although they are put in different terms. The next section of Toward a Synthesis is about overconfidence bias specifically (which it sees as a special case of exaggerated expectations, as I understand them; the 7th bias to be examined in the paper, for those keeping count). The model shows that even with accurate memories (and therefore the theoretical ability to reconstruct accurate frequencies), an overconfidence bias should be observed (under statistical conditions outlined in the paper). Similarly, Simultaneous Over- and Under- Confidence constructs a model in which people have perfectly accurate probabilities in their heads, and the noise occurs when they put pen to paper: their explicit reflection on their belief adds noise which results in an observed overconfidence.

Both models also imply underconfidence. This means that in situations where you expect perfectly rational agents to reach 80% confidence in a belief, you'd expect rational agents with noisy reporting of the sort postulated to give estimates averaging lower (say, 75%). This is the apparent underconfidence. On the other hand, if you are ignorant of the empirical frequency and one of these agents tells you that it is 80%, then it is you who is best advised to revise the number down to 75%.

This is made worse by the fact that human memories and judgement are actually fallible, not perfect, and subject to the same effects. Information is subject to bias-inducing-noise at each step of the way, from first observation, through interpretation and storage in the brain, modification by various reasoning processes, and final transmission to other humans. In fact, most information we consume is subject to distortion before we even touch it (as I discussed in my previous post). I was a bit disappointed when the Toward a Synthesis paper dismissed the relevance of this, stating flatly "false input does not make us irrational".

Overall, I find Toward a Synthesis of Cognitive Biases a frustrating read and recommend the shorter, clearer Simultaneous Over- and Under- Confidence as a way to get most of the good ideas with fewer of the questionable ones. However, that's for people who already read this blog post and so have the general idea that these effects can actually explain a lot of biases. By itself, Simultaneous Over- and Under- Confidence is one step away from dismissing these effects as mere statistical artifacts. I was left with the impression that Erev doesn't even fully dismiss the model where our internal probabilities are perfectly calibrated and it's only the error in conscious reporting that's causing over- and under- estimation to be observed.

Both papers come off as quite critical of the state of the research, and I walk away from these with a bitter taste in my mouth: is this the best we've got? The extent of the statistical confusion observed by Erev is saddening, and although it was cited in Toward a Synthesis, I didn't get the feeling that it was sharply understood (another reason I recommend the Erev paper instead). Toward a Synthesis also discusses a lot of confusion about the names and definitions of biases as used by different researchers, which is not quite as problematic, but also causes trouble.

A lot of analysis is still needed to clear up the issues raised by these two papers. One problem which strikes me is the use of averaging to aggregate data, which has to do with the statistical phenomenon of simultaneous over- and under- confidence. Averaging isn't really the right thing to do to a set of probabilities to see whether it has a tendency to be over or under a mark. What we really want to know, I take it, is whether there is some adjustment which we can do after-the-fact to systematically improve estimates. Averaging tells us whether we can improve a square-loss comparison, but that's not the notion of error we are interested in; it seems better to use a proper scoring rule.
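As one possible reading of that suggestion, the sketch below scores a set of forecasts with the logarithmic scoring rule and searches for an after-the-fact adjustment that improves the score. The forecast data are invented, and the adjustment family (multiplying the log-odds by a constant k) is just one hypothetical choice.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def logistic(z):
    return 1 / (1 + math.exp(-z))


def mean_log_score(forecasts):
    """Average log score; forecasts are (reported probability, outcome)
    pairs with outcome 1 or 0. Higher (closer to zero) is better."""
    total = sum(math.log(p if happened else 1 - p)
                for p, happened in forecasts)
    return total / len(forecasts)


def rescale(forecasts, k):
    """Hypothetical adjustment: multiply each forecast's log-odds by k."""
    return [(logistic(k * logit(p)), happened) for p, happened in forecasts]


# Made-up data: 90% claims that come true 70% of the time, and
# 10% claims that come true 30% of the time.
forecasts = ([(0.9, 1)] * 7 + [(0.9, 0)] * 3 +
             [(0.1, 0)] * 7 + [(0.1, 1)] * 3)

candidates = [i / 10 for i in range(1, 16)]
best_k = max(candidates, key=lambda k: mean_log_score(rescale(forecasts, k)))

print("score as reported:", mean_log_score(forecasts))
print("best k:", best_k)
print("score after adjustment:", mean_log_score(rescale(forecasts, best_k)))
```

A best k below 1 means the reported probabilities were too extreme: pulling them toward 0.5 systematically improves the score.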

Finally, to keep the reader from thinking that this is the only theory trying to account for a broad range of biases: go read this paper too! It's good, I promise.

The Ordinary Web of Lies

2015-03-16T18:41:00.003-07:00

One of the basic lessons in empiricism is that you need to consider how the data came to you in order to use it as evidence for or against a hypothesis. Perhaps you have a set of one thousand survey responses, answering questions about income, education level, and age. You want to draw conclusions about the correlations of these variables in the United States. Before you do so, you need to ask how the data was collected. Did you get these from telephone surveys? Did you walk around your neighborhood and knock on people's doors? Perhaps you posted the survey on Amazon's Mechanical Turk? These different possibilities give you samples from very different populations.

When we obtain data in a way that does not evenly sample from the population we are trying to study, this is called selection bias. If not accounted for, selection effects can cause you to draw just about any conclusion, regardless of the truth.

In modern society, we consume a very large amount of information. Practically all of that information is highly filtered. Most of this filtering is designed to nudge your beliefs in specific directions. Even when the original authors engage in intellectual honesty, we usually see something as a result of a large, complex filter imposed by society (for example, social media). Even when scientists are perfectly unbiased, journalists can choose to cite only the studies which support their perspective.
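The journalist example can be sketched as a tiny simulation. Everything in it is hypothetical: the true effect is set to zero, each honest study reports that effect plus unit Gaussian noise, and the journalist cites only studies whose estimate clears an arbitrary threshold.

```python
import random


def run_studies(n_studies=10000, true_effect=0.0, noise_sd=1.0, seed=5):
    """Each study is an honest, unbiased, noisy estimate of the effect."""
    rng = random.Random(seed)
    return [rng.gauss(true_effect, noise_sd) for _ in range(n_studies)]


def cited_by_journalist(results, threshold=1.0):
    """The filter: only studies that support the preferred story."""
    return [r for r in results if r > threshold]


def mean(values):
    return sum(values) / len(values)


results = run_studies()
cited = cited_by_journalist(results)

print("studies run:", len(results), "studies cited:", len(cited))
print("average effect over all studies:", mean(results))
print("average effect over cited studies:", mean(cited))
```

No individual study is biased, but a reader who only sees the cited ones will conclude there is a large effect where none exists.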

I have cultivated what I think is a healthy fear of selection effects. I would like to convey to the reader a visceral sense of danger, because it's so easy to be trapped in a web of false beliefs based on selection effects.

A Case Study

Consider this article, Miracles of the Koran: Chemical Elements Indicated in the Koran. A Muslim roommate showed this to me when I voiced skepticism about the miraculous nature of the Koran. He suggested that there could be no ordinary explanation of such coincidences. (Similar patterns have been found in the Bible, a phenomenon which has been named the Bible Code.) I decided to attempt an honest analysis of the data to see what it led to.

Take a look at these coincidences. On their own, they are startling, right? When I first looked at these, I had the feeling that they were rather surprising and difficult to explain. I felt confused.

Then I started to visualize the person who had written this website. I supposed that they were (from their own perspective) making a perfectly honest attempt to record patterns in the Koran. They simply checked each possibility they thought of, and recorded what patterns they found.

There are 110 elements on the periodic table. The article discusses the placement (within a particular Sura, the Iron Sura) of Arabic letters which correspond (roughly) to the element abbreviations used on the Periodic Table. For example, the first coincidence noted is that the first occurrence of the Arabic equivalent of "Rn" is 86 letters from the beginning of the verse, and the atomic number of the element Rn is 86. The article notes similar coincidences with atomic weight (as opposed to atomic number), the number of letters from the end of the verse (rather than the beginning), the number of words (rather than number of letters), and several other variations.

Notice that simply looking at the number of characters from the beginning and the end, we double the chances of corresponding to the atomic number. Similarly, looking for atomic weights as well as atomic numbers doubles the chances. Each extra degree of freedom we allow multiplies the chances in this way.
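The multiplication can be made explicit with a rough calculation. The numbers below are placeholders, not the assumptions from the analysis described in this post: the sketch supposes each pattern variant has the same small chance of matching a given element and that variants match independently.

```python
def chance_of_some_match(p_single, variants):
    """Chance that at least one of `variants` independent checks matches,
    when each check matches with probability p_single."""
    return 1 - (1 - p_single) ** variants


def expected_matches(n_elements, p_single, variants):
    """Expected number of elements showing at least one coincidence."""
    return n_elements * chance_of_some_match(p_single, variants)


P_SINGLE = 0.01   # hypothetical chance that one specific pattern fits
N_ELEMENTS = 110  # element count used in the text

# 1 variant: letters from the start vs. atomic number only.
# 2 variants: add letters from the end.
# 4 variants: add atomic weight for both.
# 8 variants: add word counts for all of the above.
for variants in (1, 2, 4, 8):
    print(variants, round(expected_matches(N_ELEMENTS, P_SINGLE, variants), 2))
```

For small probabilities, each doubling of the variants roughly doubles the expected number of coincidences, so a handful of striking matches is what chance alone would produce.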

I couldn't easily account for all the possible variations the article's author might have looked for. However, I could restrict myself to one class of patterns and see how much the data looked like chance.

Even restricting myself to one particular class of patterns, I did not know enough of the statistics of the Arabic language to come up with a real Bayesian analysis of the data. I made some very, very rough assumptions which I didn't write down and no longer recall. I estimated the number of elements which would follow the pattern by chance, and my estimate came very close to the number which the article actually listed.

I have to admit, whatever my analysis was, it was probably quite biased as well. It's likely that I added assumptions in a way which was likely to get me the answer I wanted, although I felt I was not doing that. Even supposing that I didn't, I did stop doing math once the numbers looked like chance, satisfied with the answer. This in itself creates a bias. I could certainly have examined some of my assumptions more closely to make a better estimate, but the numbers said what I wanted, so I stopped questioning.

Nonetheless, I do think that the startling coincidences are entirely explained by the strong selection effect produced by someone combing the Koran for patterns. Innocently reporting patterns which fit your theory, with no intention to mislead, can produce startling arguments which appear at first glance to very strongly support your point. The most effective, convincing versions of these startling arguments will get shared widely on the internet and other media (so long as there is social incentive to spread the argument).

If you're not accounting for selection bias, then trying to respond to arguments with rational consideration makes you easy to manipulate. Your brain can be reprogrammed simply by showing it the most convincing arguments in one direction and not the other.

Everything is Selection Bias

Selection processes filter everything we see. We see successful products and not unsuccessful ones. We hear about famous people, which greatly biases our perception of how to get rich. We filter our friends quite a bit, perhaps in ways we don't even realize, and then often we trick ourselves into wrong conclusions about typical people based on the people we've chosen as friends.

No matter what data you're looking at, it was sampled from some distribution. It's somewhat arbitrary to think that selecting from university students is biased, but that selecting evenly from Americans is not. Indeed, university professors have far more incentive to understand the psychology of the student population! What matters is being aware of the selection process which got you the data, and accounting for that when trying to draw conclusions.

Even biological evolution can be seen as a selection effect. Selective pressure takes a tiny minority of the genes, and puts those genes into the whole population. This is a kind of self-fulfilling selection effect, weirder than simple selection bias. It's as if the rock stars in one generation become the common folk of the next.

The intuition I'm trying to get across is: selection effects are something between a physical force and an agent. Like an agent, selection effects optimize for particular outcomes. Like a physical force, selection effects operate automatically, everywhere, without requiring a guiding hand to steer them. This makes them a dangerous creature.

Social Constructs

Social reality is a labyrinth of mirrors reflecting each other. All the light ultimately comes from outside the maze, but the mirrors can distort it any way they like. The ordinary web of lies is my personal term for this. Many people will think of religion, but it goes far beyond this. When society decides a particular group is the enemy, they become the enemy. When society deems words or concepts uncouth, they are uncouth. I call these lies, but it's not what we ordinarily mean by dishonest. It's terrifyingly easy to distort reality. Even one person, alone, will tend to pick and choose observations in a self-serving way. When we get together in groups, we have to play the game: selecting facts to use as social affirmations or condemnations, selecting arguments to create consensus... it's all quite normal.

This all has to do with the concept of hyperstition (see Lemurian Time War) and hyperreality. Hyperstition refers to superstition which makes itself real. Hyperreality refers to our inability to distinguish certain fictions from reality, and the way in which our fictional, constructed world tends to take primacy over the physical world. Umberto Eco illustrates this nicely in his book Foucault's Pendulum, which warns of the deadly danger in these effects.

The webcomic The Accidental Space Spy explores alien cultures as a way of illustrating evolutionary psychology. One of the races, the Twolesy, has evolved strong belief in magic wizards. These wizards command the towns. Whoever doubts the power of a wizard is killed. Being that it has been this way for many generations, the Twolesy readily hallucinate magic. Whatever the wizards claim they can do, the Twolesy hallucinate happening. Whatever other Twolesy claim is happening, they hallucinate as well. Twolesy who do not hallucinate will not be able to play along with the social system very effectively, and are likely to be killed.

Similarly with humans. Our social system relies on certain niceties. Practically anything, no matter how not about signaling it is, becomes a subject for signaling. Those who are better at filtering information to their advantage have been chosen by natural selection for generations. We need not consciously know what we're doing -- it seems to work best when we fool ourselves as well as everyone else. And yes, this goes so far as to allow us to believe in magic. There are mentalists who know how to fool our perceptions and consciously develop strategies to do so, but equally well, there are Wiccans and the like who have similar success by embedding themselves in the ordinary web of lies.

Something which surprised me a bit is that when you try to start describing rationality techniques, people will often object to the very idea of truth-oriented dialog. Truth-seeking is not the first thing on people's minds in everyday conversation, and when you raise it to their awareness, it's not obvious that it should be. Other things are more important.

Imagine a friend has experienced a major loss. Which is better: frank discussion of the mistakes they made, or telling them that it's not really their fault and anyway everything will work out for the best in the end? In American culture at least, it can be rude to let on that you think it might be their fault. You can't honestly speculate about that, because they're likely to get their feelings hurt. Only if you're reasonably sure you have a point, and if your relationship is close enough that they will not take offense, could you say something like that. Making your friend feel better is often more important. By convincing them that you don't think it's their fault, you strengthen the friendship by signalling to them that you trust them. (In Persian culture, I'm given to understand, it's the opposite way: everyone should criticize each other all the time, because you want to make your friends think that you know better than them.)

When the stakes are high, other things easily become more important than the truth.

Notice the consequences, though: the mistakes with high consequences are exactly the ones you want to thoroughly debug. What's important is not whether it's your fault or not; what matters is whether there are different actions you should take to forestall disaster, next time a similar situation arises.

What, then? Bad poetry to finish it off?

Beware, beware, the web of lies;

the filters twist the truth, and eyes

are fool'd too well; designed to see

whate'er the social construct be!

We the master, we the tool,

that spin the thread and carve the spool

we the spider, we the fly,

weave the web and watch us die!

Understanding

2014-12-12T21:50:00.001-08:00

Sometimes, people make a fuss about the difference between knowledge and understanding.

Recently, an explanation of this difference occurred to me which I had not considered before.

The Slate Star Codex article Right is the New Left explains fashion with cellular automata. It's a model of society which has about ten moving parts, yet has behaviors which resemble those of a whole society.

This made me think that understanding is essentially explaining something with a model small enough to fit in working memory.

Consider the extremely detailed weather model which meteorologists use to produce forecasts.

Now, consider the highly simplified explanation based on air masses, warm and cold fronts and so on which is commonly illustrated with weather maps.

The first gives us more accurate predictions, but the second one gives us more understanding. If a scientist was able to use the detailed mathematical weather model but did not think in terms of storm fronts and so on, he/she could not answer questions such as "why" it is raining. In the detailed physical model, "why" is almost meaningless: the causes of any particular event are huge in number.

This notion of understanding has several implications.

What constitutes understanding will depend on the mind doing the understanding, whereas knowledge is more objective in nature. I can achieve understanding of a system by putting it in terms I am familiar with. Suppose I am trying to understand an esoteric branch of chemistry known as semi-equilibrium Z-theory. I might learn all the statements belonging to SEZ theory by heart, and gain the ability to solve SEZ equations and get the correct answer, and still feel that I have little understanding. Yet, if I can relate SEZ theory to more familiar subjects, I will feel I've "put it in terms I can understand".

Assume I was an apple farmer before learning chemistry.

Let's say an experienced SEZ-theoretician gives me an analogy in which a SEZ-frubian (a central object of SEZ theory) is a rotten apple, and a SEZ-nite (another important concept in SEZ theory) is a worm slowly eating the apple. If the analogy works well enough, I feel I've gained an understanding: now when I'm solving the equations, I imagine that they are telling me things about this worm munching away happily at the core of the apple. I've now got a model with a few moving parts which allows me to make heuristic predictions much more effectively.

However, someone with no experience of apples and worms will not be helped very much by this analogy. It's placed SEZ-theory into my mental landscape, but the same explanation may not be useful to others.

Even a superhuman intelligence would have use for understanding: the actual universe is far too complex for a mind-within-universe to fully model. However, the understanding it achieves would be far beyond us. The "small" heuristic models would be too large to fit into our working memory (the mythical seven-plus-or-minus-two). Its weather maps would likely look more like our "detailed" physical simulations of the weather.

Epistemic Trust

2014-12-10T14:56:00.002-08:00

My attraction to LessWrong is partially predicated on a feeling that a better culture can be created, raising the sanity waterline to improve society overall.

Recently, though, I've somewhat given up on that.

First, I was overestimating the degree to which LessWrong had created such a culture already. I'll explain why.

Talking to core LessWrong people is different. It feels highly reflective, with each person cognizant not only of the subject at hand but also the potential flaws in their reasoning process. Much less time is wasted trying to sort out such flaws because an objection is not met with defensiveness. If you're already looking for the holes in your own arguments, you'll try to understand a counter-argument rather than trying to protect yourself from it. The attitude is contagious.

I call this intellectual honesty: being up-front not just about what you believe, but also why you believe it, what your motivations are in saying it, and the degree to which you have evidence for it. Feynman discussed the necessary attitude, although he didn't give it a name that I'm aware of.

There are many forces working against intellectual honesty in everyday life, but the most important one is face culture: status in the group is being signaled by agreeing and disagreeing, arguing for or against people. This can be nasty, but it isn't always; in fact, the most common type of face-culture I experience is cooperative: people in the group are trying to be friendly by accepting and encouraging each other.

For example, suppose that a group of engineers are meeting to discuss a technical problem. We will focus on one of them; I will call this person X. X is eager to find acceptance in the group. The other members of the group are also eager to make X feel accepted. During the discussion, X is looking for opportunities to interject with something relevant and useful. At some point, X is reminded of something: a problem which was encountered in a similar project at a previous work-place. X interjects that there might be a problem, and proceeds to tell the story. As X recalls the details, there's a critical difference between the old situation and the new one which makes it unlikely for the problem to arise in the current situation.

Intellectual-Honesty Culture: If everyone at the table is intellectually honest, someone points out the disanalogy. X likely concedes the point and the discussion moves on. (Often X will be the first to notice the disanalogy, and will point it out him/her-self.) If X thinks the objection is mistaken, a discussion in which both participants try to understand what each other is saying ensues.

Face Culture: In face culture, people will focus more on trying to make X feel included. Although the story's conclusion is unlikely to apply in the current situation, it's worthwhile to comment on the story in an agreeable way. Because agreement is a social currency, it is somewhat noncommittal; perhaps the best move is to agree that this problem can arise but then do little about it. Bold disagreement with the point is seen as (and often would be) an attempt to take X down a peg.

The critical point here is that when the two cultures mix, a face-culture person will see intellectual honesty as an attack.

It is worth emphasizing that face culture is not dishonest, not in the normal sense. Face culture is nice; face culture is friendly; face culture is welcoming. (Although, it can be vicious when it gets competitive.) Face culture is filled with white lies, especially lies by omission (such as acting as if a comment were relevant and made a good point), but if you try to call out any of these lies you will utterly fail. They are not lies in the common conception of lie. They are not dishonest in the common conception of honesty.

Moreover, face-culture is important. If X is an old hand whose status in the group is secure, face-culture would be babying. If X is a newcomer, however, face-culture niceties can establish a welcoming environment. I don't mean to suggest that there is an absolute opposition between being nice and being truthful -- often the two don't even come into conflict. There is a very real trade-off, though. At times you simply must choose one or the other.

Attempting to call out someone for following face culture rather than being intellectually honest is, as far as I know, doomed to failure. Any such call-out will be perceived as a threat, and will ramp up the defensive face-culture behavior.

Ok, so, there are these two cultures and LessWrong succeeds at intellectual honesty. I said at the beginning that I've (partially) given up on improving broader culture via LessWrong, though. Why?

Well, I talked with someone who worked at a Christian school (as I understand it, a very fundamentalist one). They described what sounded like the same thing I experience with LessWrong: the community was very high in intellectual honesty.

Why would this be?

If LessWrong's high intellectual honesty is a result of being devoted to rationality and reflective thinking, shouldn't we expect the exact opposite in highly religious organizations?

I think what's happening here is that LessWrong is intellectually honest not because we explicitly think about rationality quite a bit, not because LessWrong is in possession of improved ideas of what rationality is about, but instead, because there is a high degree of intellectual trust.

Intellectual trust occurs when the group has common goals, mutual respect, and a largely-shared ideological framework.

When people have intellectual trust, they do not need to worry as much about why the other person is saying what they are saying. You know they are on your side, so you are free to worry about the topic at hand. You are free to point out flaws in your own reasoning because you are relatively secure in your social status and share the common goal of arriving at the correct conclusion. Likewise, you are free to find flaws in their reasoning without worrying that they will hate you for it.

This sort of intellectual trust cannot be created by simply "raising the rationality waterline".

Scramble Graphs: A Failed Idea

2014-12-09T14:20:00.000-08:00

I spent some time this semester inventing, testing, and discarding a new probabilistic model. My adviser suggested that it might be worthwhile to write things up anyway, since the idea is intriguing.

I wrote two posts in the spring about distributed vector representations of words, something our research group has been working on. One way of thinking about our method is as a random projection. A random projection is a technique to deal with high-dimensional data via low-dimensional summaries. This is very broadly useful. Turning a high-dimensional item into a lower-dimensional one is referred to as dimensionality reduction, and there are many different techniques optimized for different applications.

Random Projections

Random projections take advantage of the Johnson-Lindenstrauss lemma, which shows that the vast majority of dimension reductions are "good" in the sense of preserving approximate distances between points. If this property is useful to our application, then this is a very easy dimension-reduction technique to apply.

This is mathematically related to compressed sensing, a powerful signal processing technique which has emerged fairly recently.

The fundamental story as I see it is:

An n-dimensional vector space has n orthogonal basis vectors (n cardinal directions), which are all at right angles to each other. However, if we relax to approximate orthogonality (almost 90 degree separation), the number of vectors we can pack in increases rapidly, especially for high n. In fact, it increases so rapidly that even choosing vectors randomly we are very likely to be able to fit exponentially many nearly-orthogonal vectors in. Let's say we can fit m vectors, where m is on the order of eⁿ. (Look at theorem 6 here for the more precise formula.)
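The near-orthogonality claim is easy to check numerically. The following is a minimal sketch, not the code used in the research described here: it draws 50 random unit vectors in spaces of increasing dimension and records the largest absolute cosine between any pair. The helper names (`random_unit_vector`, `max_abs_cosine`) and the sizes are illustrative assumptions.

```python
import math
import random

def random_unit_vector(n, rng):
    """Draw a random direction in n dimensions (normalized Gaussian sample)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs; 0 would mean perfectly orthogonal."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst

rng = random.Random(0)  # fixed seed so the run is repeatable
worst_by_dim = {}
for n in (10, 100, 1000):
    vecs = [random_unit_vector(n, rng) for _ in range(50)]
    worst_by_dim[n] = max_abs_cosine(vecs)
    print(n, round(worst_by_dim[n], 3))
```

In 10 dimensions, 50 vectors cannot avoid colliding badly; in 1000 dimensions the same 50 vectors are all close to perpendicular, even though they were chosen blindly.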

These pseudo-basis vectors define our random projection. We map the orthogonal basis vectors of a large space (the space we really want to work with) down to this small space of size n, using one pseudo-orthogonal vector to represent each truly-orthogonal vector in the higher space.

For example, if we want to represent a probability distribution on m items, we would normally need that many numbers. This can be thought of as a point in the m-dimensional space. However, we can approximately represent it with just n numbers by taking the projection. We will create some error, but we can approximately recover probabilities. This will work especially well if there are a few large probabilities and many small ones (and we don't care too much about getting the small ones right).
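As a rough illustration of the projection and approximate recovery just described, here is a small sketch under assumed sizes (m = 1000 items, n = 400 dimensions, chosen only so that it runs quickly; in the intended use m would be far larger than n). Each item gets a random unit "code" vector, the distribution is summarized as the probability-weighted sum of codes, and each probability is estimated by a dot product with the item's code. All names here are hypothetical.

```python
import math
import random

def random_unit_vector(n, rng):
    """Draw a random direction in n dimensions (normalized Gaussian sample)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

rng = random.Random(1)
m, n = 1000, 400  # illustrative sizes, not taken from the post

# A distribution with a few large probabilities and many small ones.
probs = [0.5, 0.3, 0.1] + [0.1 / (m - 3)] * (m - 3)

# One pseudo-orthogonal code vector per item.
codes = [random_unit_vector(n, rng) for _ in range(m)]

# Project: the n-number summary is the probability-weighted sum of codes.
summary = [0.0] * n
for p, code in zip(probs, codes):
    for k in range(n):
        summary[k] += p * code[k]

# Recover: dot each code with the summary. Cross-talk between
# nearly-orthogonal codes shows up as noise on every estimate.
recovered = [sum(c * s for c, s in zip(code, summary)) for code in codes]
best = max(range(m), key=lambda i: recovered[i])

print("most probable item:", best)
print("estimates for the three large items:", [round(r, 2) for r in recovered[:3]])
```

The large probabilities come back roughly right, while the small ones are swamped by noise of similar or greater size, which matches the caveat above about not caring too much about the small ones.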

Scramble Graphs

The idea I had was to use this kind of representation within a probabilistic network. Let's say we are trying to represent a dynamic Bayesian network with variables that have thousands of possible values (for example, variables might be English words). The size of probabilistic relationships between these variables gets very large. English has about 10⁶ words. A table giving the probabilistic relationship between two words would need 10¹² entries, and between three words, 10¹⁸. Fortunately, these tables will be very sparse. 95% of the time we're using just the most common 5% of words, so we can restrict the vocabulary size to 10⁵ or even 10⁴ without doing too much damage. Furthermore, it's unlikely we're getting anywhere near 10¹² words in training data, and impossible that we get 10¹⁸. There are fewer than 10⁹ web pages, so if each page averaged 1,000 English words, we could possibly get near 10¹². (I don't know how many words the internet actually totals to.) Even then, because some pairs of words are much more common than others, our table of probabilities would be very sparse. It's much more likely that our data has just millions of words, though, which means we have at most millions of nonzero co-occurrence counts (and again, far fewer in practice due to the predominance of a few common co-occurrences).

This sparsity makes it possible to store the large tables needed for probabilistic language models. But, what if there's a different way? What if we want to work with distributed representations of words directly?

My idea was to apply the random projection to the probability tables inside the graph, and then train the reduced representation directly (so that we never try to store the large table). This yields a kind of tensor network. The "probability tables" are now represented abstractly, by a matrix (in the 2-variable case) or tensor (for more variables) which may have negative values and isn't required to sum to 1.

Think about the tensor network and the underlying network. In the tensor network, every variable from the underlying network has been replaced by a vector, and every probability table has been replaced by a tensor. The tensor network is an approximate representation of the underlying network; the relationship is defined by the fixed projection for each variable (to the smaller vector representations).

Because the tensor network is defined by a random transform from the underlying network, I called these "scramble graphs".

In order to train the tensor network directly without worrying about the underlying network, we want to find values for the tensors which cause the underlying network to at least approximately sum to 1 and have positive values, in the normal probabilistic way.

(That's a fairly difficult constraint to maintain, but I hoped that an approximate solution would perform well enough. In any case, I didn't really end up getting that far.)

The interesting thing about this idea is that for the hidden variables (in the dynamic Bayesian network we're hypothetically trying to represent), we do not actually care what the underlying variables mean. We might think of them as some sort of topic model or what-have-you, but all we are really representing is the vectors.

The scramble graph looks an awful lot like a neural network with no nonlinearities. Why would we want a neural network with no nonlinearities? Aren't nonlinearities important?

Yes: the nonlinear transformations in neural networks serve an important role; without them, the "capacity" of the network (the ability to represent patterns) is greatly reduced. A multi-layer neural network with no nonlinearities is no more powerful than a single-layer network. The same statement applies to these tensor graphs: no added representation power is derived from stacking layers.
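The collapse of stacked linear layers can be seen directly: applying two weight matrices in sequence gives the same result as applying their product once. A minimal sketch with made-up 2-to-3-to-2 weights (the matrices and helper names are purely illustrative):

```python
def matmul(a, b):
    """Multiply matrix a (r x k) by matrix b (k x c), both as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Apply matrix m to vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Two "layers" with no nonlinearity between them (arbitrary example weights).
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # maps 2 inputs to 3 hidden units
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.5]]     # maps 3 hidden units to 2 outputs
x = [0.5, -2.0]

two_layer = matvec(W2, matvec(W1, x))

# The same map as a single layer: just multiply the weight matrices.
W = matmul(W2, W1)
one_layer = matvec(W, x)

print(two_layer, one_layer)
```

Whatever the hidden layer's width, the composite is still one matrix, so depth buys nothing without a nonlinearity in between.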

However: nonlinearity is also what makes it impossible to perform arbitrary probabilistic reasoning with neural networks. We can train them to output probabilities, but it is not easy to reverse the probability (via Bayes' Law), condition on partial information, and so on. Probabilistic models are always linear, and we need this to be able to easily do multi-directional reasoning (using a model for something it wasn't trained to do).

So, I thought scramble graphs might be an interesting compromise between neural networks and fully probabilistic models.

Unfortunately, it didn't work.

It seems like the "capacity" of the tensors is just too small. Vectors have exponential capacity in the sense I outlined at the beginning; they can usefully approximate an exponentially larger space. In my experiments, this property seems to go away when we jump to matrix-representations and tensor-representations. I tried to train a rank-3 matrix on artificial data (for which 100% accuracy was possible), using a distribution with the sparse property mentioned (so roughly 90% of the cases fell within 10% of the possibilities), but accuracy remained below 50%. Very approximately, the capacity seemed to be linear (rather than exponential) in the representation size: the fraction correct after training appeared to scale proportionately with the size of the tensor.

I don't know what the mathematics behind this phenomenon says. Perhaps I made some mistake in my training, or perhaps the sizes I used were too small to start seeing asymptotic effects (since the exponential capacity of vectors is asymptotic). I'm starting to think, though, that the math would confirm what I'm seeing: the exponential-capacity phenomenon is destroyed as soon as we move from vectors to matrices.

Likelihood Ratios from Statistically Significant Studies

2014-12-06T12:21:00.000-08:00

In the previous post, I reacted to an old Black Belt Bayesian post about p-values.

Since then, there's been some more discussion of this article in the LA LessWrong group. Scott Garrabrant pointed out that the likelihood ratios coming from p-values are far less than he naively intuited. I think I was making the same mistake before reading BBB, and I think it's an important and common mistake.

How much should we shift our belief when we see a p-value around 0.05 (so, just barely passing the standard for statistical significance)?

The p-value is defined as the probability that a statistic would be as great or greater than observed, assuming the null hypothesis were true.

The very common mistake is to confuse P(observation | hypothesis) with P(hypothesis | observation), naively thinking that the p-value can be used as the probability of the null hypothesis. This is bad, don't do it. (David Manheim, also from the Los Angeles LessWrong group, pointed us to this article.)

But if that's not the correct conclusion to draw, what is?

The Bayesian answer is the Bayes Factor, which measures the strength of evidence for one hypothesis H1 vs another H2 as P(obs | H1) / P(obs | H2). If we combine this with a prior probability for each hypothesis, P(H1), P(H2), we can compute our posterior P(H1 | obs). For example, if our prior belief is 50-50 between the two and the likelihood ratio is 1/2, then our posterior should be 1/3 for H1 and 2/3 for H2. (H2 has become comparatively twice as probable.) However, the Bayes factor has the advantage of objectively measuring the influence of evidence on our beliefs, independent of our prior.
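The update in that example is just multiplication in odds form: posterior odds equal prior odds times the Bayes factor. A minimal sketch (the function name is an illustrative choice):

```python
def posterior_h1(prior_h1, bayes_factor):
    """Posterior P(H1 | obs) given the prior P(H1) and the Bayes factor
    P(obs | H1) / P(obs | H2), assuming H1 and H2 are the only options."""
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

# The example from the text: 50-50 prior, likelihood ratio 1/2.
p_h1 = posterior_h1(0.5, 0.5)
print(p_h1, 1.0 - p_h1)  # 1/3 for H1, 2/3 for H2
```

The same Bayes factor applied to a different prior gives a different posterior, which is why the factor itself is the prior-independent quantity worth reporting.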

The less common mistake which both Scott and I were making was to think as if a p-value were a Bayes factor, so that a statistically significant study will shift belief against the null hypothesis by a ratio of about 1:20.

The formula mentioned by Black Belt Bayesian shows this is wrong. For a p-value of 0.05, the Bayes factor can be lower-bounded at 0.4, which means the odds of the null hypothesis only shift by 2:5. This is much less than the 1:20 shift I was intuitively making. (Of course, if the p-value is lower, this will be better!)
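The formula itself is not restated here. Assuming it is the commonly cited calibration −e·p·ln(p) (a lower bound on the Bayes factor in favor of the null, valid for p < 1/e), the 0.4 figure can be reproduced as follows. Treat the identification of the formula as an assumption; the function names are illustrative.

```python
import math

def min_bayes_factor(p_value):
    """Assumed bound: -e * p * ln(p), the smallest Bayes factor (null vs.
    alternative) consistent with the given p-value. Only valid for p < 1/e."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return -math.e * p_value * math.log(p_value)

def posterior_null(prior_null, bayes_factor):
    """Posterior probability of the null, updating in odds form."""
    odds = prior_null / (1.0 - prior_null) * bayes_factor
    return odds / (1.0 + odds)

bf = min_bayes_factor(0.05)
print(round(bf, 3))                       # roughly 0.4, i.e. odds shift of about 2:5
print(round(posterior_null(0.5, bf), 2))  # best case for the alternative, from a 50-50 prior
```

Under this bound, a just-significant result moves a 50-50 prior on the null only to somewhere around 0.29 at best, nowhere near the 0.05 that the naive reading suggests.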

Also notice, this is a minimum: the actual likelihood ratio could be much higher! A higher ratio would be worse news for a scientist's attempt to reject the null hypothesis. It's even possible that the Bayesian should be increasing belief in the null hypothesis, if the alternative hypothesis explains the data less well. This might happen if our alternative hypothesis spreads probability mass very thinly across possibilities. The Bayes Factor is a relative comparison of hypotheses (measuring how well one hypothesis explains the data compared to another) whereas null hypothesis testing via p-values attempts an absolute measure (rejecting the null hypothesis in absolute terms).

P-values and Chaos Worlds

2014-11-27T14:11:00.000-08:00

In First Aid for P-Values, Black Belt Bayesian discusses how a Bayesian can interpret the p-value to get some information. He references an article which argues that this can shift the frame of the discussion in a useful way, improving the nature of the statistical arguments without significantly changing the methodology. It emphasizes the role of evidence in shifting beliefs progressively, as opposed to proof/disproof.

While this does seem like a useful tool, it still leaves us with the problems of null hypothesis testing. One problem is that the null hypothesis is sometimes not very plausible. Arguing from a point of total randomness is an odd thing to do. What would we expect to see if the world was a chaotic place with no patterns? Hm, reality doesn't match that? Ok, well, our hypothesis is better than maximum entropy. Good!

Scott Alexander makes this error in a post which he explicitly predicted he'd regret writing. (Epistemic Warning: This is, perhaps, among the smaller problems with the post. A larger problem is that it makes readers think in simplistic tribes. Another possible problem is that it risks the same error it calls out. There's a reason he said he'd regret it.) He's discussing how strongly our friends and acquaintances are filtered in terms of beliefs:

And I don’t have a single one of those people in my social circle. It’s not because I’m deliberately avoiding them; I’m pretty live-and-let-live politically, I wouldn’t ostracize someone just for some weird beliefs. And yet, even though I probably know about a hundred fifty people, I am pretty confident that not one of them is creationist. Odds of this happening by chance? 1/2^150 = 1/10^45 = approximately the chance of picking a particular atom if you are randomly selecting among all the atoms on Earth.

He goes on to use this number a couple more times as an indication of the strength of filtering:

I inhabit the same geographical area as scores and scores of conservatives. But without meaning to, I have created an outrageously strong bubble, a 10^45 bubble. Conservatives are all around me, yet I am about as likely to have a serious encounter with one as I am a Tibetan lama.

And:

A disproportionate number of my friends are Jewish, because I meet them at psychiatry conferences or something – we self-segregate not based on explicit religion but on implicit tribal characteristics. So in the same way, political tribes self-segregate to an impressive extent – a 1/10^45 extent, I will never tire of hammering in – based on their implicit tribal characteristics.

The problem is that this is a world-of-chaos-and-fire hypothesis he's comparing to. The number makes the strength of the filter incredible-sounding, almost physically implausible. But, that's just what you get when you use a bad model! Note that the "strength" would keep getting more extreme as we examine more data (just as a p-value gets extreme with more data, unless the null hypothesis is actually true).
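To see how the "strength" runs away with sample size, here is a small sketch of the arithmetic behind the quoted figure. It assumes the same null model the quote uses (each acquaintance independently has a 1/2 chance of holding the belief) and reports the chance of zero believers among n people on a log scale. The function name is illustrative.

```python
import math

def log10_chance_of_none(n, base_rate=0.5):
    """log10 of the probability that none of n people hold the belief,
    if each independently holds it with probability base_rate."""
    return n * math.log10(1.0 - base_rate)

for n in (10, 50, 150, 1000):
    print(n, round(log10_chance_of_none(n), 1))
```

For n = 150 this gives about -45, matching the quoted 1/10^45. The exponent is simply proportional to n, so the number measures how many people were counted as much as it measures how strong the filter is.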

It's not like there is a baseline world where everything is completely random, and an extra physical force on top of this which puts things into nonrandom configurations. (Except, perhaps, in the sense that everything is heading toward thermodynamic equilibrium.) We do not form associations with people randomly. It would be much more meaningful to compare possibly-realistic models and the level of friend filtering which they imply.

I'm not trying to call out Slate Star Codex here. That particular post happened to be an epistemic landmine, yes, but this mistake is easy to make and fairly common. What's interesting to me is the difference between what arguments feel meaningful vs actually are meaningful.

A List Of Nuances

2014-11-09T21:08:00.001-08:00

Abram Demski and Grognor

(This article is also cross-posted to LessWrong.)

Much of rationality is pattern-matching. An article on LessWrong might point out a thing to look for. Noticing this thing changes your reasoning in some way. This essay is a list of things to look for. These things are all associated, but the reader should take care not to lump them together. Each dichotomy is distinct, and although the brain will tend to abstract them into some sort of yin/yang correlated mush, in reality they have a more complicated structure; some things may be similar, but if possible, try to focus on the complex interrelationships.

Map vs. Territory

Eliezer’s sequences use this as a jump-off point for discussion of rationality.
Many thinking mistakes are map vs. territory confusions.

A map and territory mistake is a mix-up of seeming vs being.
Humans need frequent reminders that we are not omniscient.

Cached Thoughts vs. Thinking

This document is a list of cached thoughts.

Clusters vs. Properties

These words could be used in different ways, but the distinction I want to point at is that of labels we put on things vs actual differences in things.
The mind projection fallacy is the fallacy of thinking a mental category (a “cluster”) is an actual property things have.

If we see something as good for one reason, we are likely to attribute other good properties to it, as if it had inherent goodness. This is called the halo effect. (If we see something as bad and infer other bad properties as a result, it is referred to as the reverse-halo effect.)

Categories are inference applicability heuristics; ruling X an instance of Y without expecting novel inferences is cargo cult classification.

Syntax vs. Semantics

The syntax is the physical instantiation of the map. The semantics is the way we are meant to read the map; that is, the intended relationship to the territory.

Semantics vs. Pragmatics

The semantics is the literal contents of a message, whereas the pragmatics is the intended result of conveying the message.

An example of a message with no semantics and only pragmatics is a command, such as “Stop!”.
Almost no messages lack pragmatics, and for good reason. However, if you seek truth in a discussion, it is important to foster a willingness to say things with less pragmatic baggage.
Usually when we say things, we do so with some “point” which is beyond the semantics of our statement. The point is usually to build up or knock down some larger item of discussion. This is not inherently a bad thing, but has a failure mode where arguments are battles and statements are weapons, and the cleverer arguer wins.

The meaning of a thing is the way you should be influenced by it.

Object-level vs. Meta-level

The difference between making a map and writing a book about map-making.
A good meta-level theory helps get things right at the object level, but it is usually impossible to get things right at the meta level before you’ve made significant progress at the object level.

Seeming vs. Being

We can only deal with how things seem, not how they are. Yet, we must strive to deal with things as they are, not as they seem.

This is yet another reminder that we are not omniscient.

If we optimize too hard for things which seem good rather than things which are good, we will get things which seem very good but which may only be somewhat good, or even bad.
The dangerous cases are the cases where you do not notice there is a distinction.

This is why humans need constant reminders that we are not omniscient.

We must take care to notice the difference between how things seem to seem, and how they actually seem.

Signal vs. Noise

Not all information is equal. It is often the case that we desire certain sorts of information and desire to ignore other sorts.
In a technical setting, this has to do with the error rate present in a communication channel; imperfections in the channel will corrupt some bits, making a need for redundancy in the message being sent.
In a social setting, this is often used to refer to the amount of good information vs irrelevant information in a discussion. For example, letting a mediocre writer add material to a group blog might increase the absolute amount of good information, yet worsen the signal-to-noise ratio.
Attention is a scarce resource; yes, everyone has something to teach you, but many people are much more efficient sources of wisdom than others.

Selection Effects

Filtered evidence.

In many situations, if we can present evidence to a Bayesian agent without the agent knowing that we are being selective, we can convince the agent of anything we like. For example, if I want to convince you that smoking causes obesity, I could find many people who became obese after they started smoking.
The solution to this is for the Bayesian agent to model where the information is coming from. If you know I am selecting people based on this criterion, then you will not take it as evidence of anything, because the evidence has been cherry-picked.
Most of the information you receive is intensely filtered. Nothing comes to your attention with a good conscience.

The silent evidence problem.

Selection bias need not be the result of purposeful interference as in cherry-picking. Often, an unrelated process may hide some of the evidence needed. For example, we hear far more about successful people than unsuccessful. It is tempting to look at successful people and attempt to draw conclusions about what it takes to be successful. This approach suffers from the silent evidence problem: we also need to look at the unsuccessful people and examine what is different about the two groups.

Observer selection effects.

What You Mean vs. What You Think You Mean

Very often, people will say something and then that thing will be refuted. The common response to this is to claim you meant something slightly different, which is more easily defended.

We often do this without noticing, making it dangerous for thinking. It is an automatic response generated by our brains, not a conscious decision to defend ourselves from being discredited. You do this far more often than you notice. The brain fills in a false memory of what you meant without asking for permission.

What You Mean vs. What the Others Think You Mean

What You Optimize vs. What You Think You Optimize

Evolution optimizes for reproduction but in doing so creates animals with a variety of goals which are correlated with reproduction.
Extrinsic motivation is weaker than intrinsic motivation.
The people who value practice for its own sake do better than the people who only value being good at what they’re practicing.
“Consequentialism is true, but virtue ethics is what works.”

Stated Preferences vs. Revealed Preferences

Revealed preferences are the preferences we can infer from your actions. These are usually different from your stated preferences.

X is not about Y:

Food isn’t about nutrition.
Clothes aren’t about comfort.
Bedrooms aren’t about sleep.
Marriage isn’t about love.
Talk isn’t about information.
Laughter isn’t about humour.
Charity isn’t about helping.
Church isn’t about God.
Art isn’t about insight.
Medicine isn’t about health.
Consulting isn’t about advice.
School isn’t about learning.
Research isn’t about progress.
Politics isn’t about policy.
Going meta isn’t about the object level.
Language isn’t about communication.
The rationality movement isn’t about epistemology.

Everything is actually about signalling.

Humans Are Not Automatically Strategic

Never attribute to malice that which can be adequately explained by stupidity. The difference between stated preferences and revealed preferences does not indicate dishonest intent. We should expect the two to differ in the absence of a mechanism to align them.

Hidden Motives vs. Innocent Failure

People, ideas, and organizations respond to incentives.

Evolution selects humans who have reproductively selfish behavioral tendencies, but prosocial and idealistic stated preferences.

Near vs. Far

Social forces select ideas for virality and comprehensibility as opposed to truth or even usefulness.

Motte-and-bailey fallacy

Organizations are by default bad at being strategic about their own survival, but the ones that survive are the ones you see.

What You Achieve vs. What You Think You Achieve

Most of the consequences of our actions are totally unknown to us.
It is impossible to optimize without proper feedback.

What You Optimize vs. What You Actually Achieve

Consequentialism is more about expected consequences than actual consequences.

What You Seem Like vs. What You Are

You can try to imagine yourself from the outside, but no one has the full picture.

What Other People Seem Like vs. What They Are

When people assume that they understand others, they are wrong.

What People Look Like vs. What They Think They Look Like

People underestimate the gap between stated preferences and revealed preferences.

What Your Brain Does vs. What You Think It Does

You are running on corrupted hardware.

The brain’s machinations are fundamentally social; it automatically does things like signal, save face, etc., which distort the truth.

The reverse of stupidity is not intelligence.

Knowing that you are running on corrupted hardware should cause skepticism about the outputs of your thought-processes. Yet, too much skepticism will cause you to stumble, particularly when fast thinking is needed.

Producing a correct result plus justification is harder than producing only the correct result.
Justifications are important, but the correct result is more important.
Much of our apparent self-reflection is confabulation, generating plausible explanations after the brain spits out an answer.
Example: doing quick mental math. If you are good at this, attempting to explicitly justify every step as you go would likely slow you down.
Example: impressions formed over a long period of time. Wrong or right, it is unlikely that you can explicitly give all your reasons for the impression. Requiring your own beliefs to be justifiable would preempt impressions that require lots of experience and/or many non-obvious chains of subconscious inference.
Impressions are not beliefs and they are always useful data.

Clever Argument vs. Truth-seeking; The Bottom Line

People believe what they want to believe.

Believing X for some reason unrelated to X being true is referred to as motivated cognition.
Giving a smart person more information and more methods of argument may actually make their beliefs less accurate, because you are giving them more tools to construct clever arguments for what they want to believe.

Your actual reason for believing X determines how well your belief correlates with the truth.

If you believe X because you want to, any arguments you make for X, no matter how strong they sound, are devoid of informational content about X and should properly be ignored by a truth-seeker.

If you believe true things when doing so improves your life, that is no credit to you at all. Everyone does that.

Lumpers vs. Splitters

A lumper is a thinker who attempts to fit things into overarching patterns. A splitter is a thinker who makes as many distinctions as possible, recognizing the importance of being specific and getting the details right.
Specifically, some people want big Wikipedia and TVTropes articles that discuss many things, and others want smaller articles that discuss fewer things.
This list of nuances is a lumper attempting to think more like a splitter.

Fox vs. Hedgehog

“A fox knows many things, but a hedgehog knows One Big Thing.” Closely related to a splitter, a fox is a thinker whose strength is in a broad array of knowledge. A hedgehog is a thinker who, in contrast, has one big idea and applies it everywhere.
The fox mindset is better for making accurate judgements, according to Tetlock.

Traps vs. Gardens

Well-kept gardens die by pacifism.

Conversations tend to slide toward contentious and useless topics.
Societies tend to decay.
Systems in general work poorly or not at all.
Thermodynamic equilibrium is entropic.
Without proper institutions being already in place, it takes large amounts of constant effort and vigilance to stay out of traps.

From the outside of a broken Molochian system it is easy to see how to fix it. But it cannot be fixed from the inside.

Losing Faith in Factor Graphs

2014-07-18T18:16:00.000-07:00

In my post Beliefs, I split AGI into two broad problems:

What is the space of possible beliefs?
How do beliefs interact?

The idea behind #1 was: can we formulate a knowledge representation which could in principle express any concept a human can conceive of?

#2 represents the more practical concern: how do we implement inference over these beliefs?

(This was needlessly narrow. I could at least have included something like: 3. How do beliefs conspire to produce actions? That's more like what my posts on procedural logic discuss. 1 and 2 alone don't exactly get you AGI. Nonetheless, these are central questions for me.)

Factor graphs are a fairly general representation for networks of probabilistic beliefs, subsuming Bayesian networks, Markov networks, and most other graphical models. Like those two, factor graphs are only as powerful as propositional logic, but can be a useful tool for representing more powerful belief structures as well. In other words, the factor graph "toolbox" includes some useful solutions to #2 which we may be able to apply to whatever solution for #1 we come up with. When I started graduate school at USC 3 years ago, I was basically on board with this direction. That's the direction taken by Sigma, the cognitive architecture effort I joined.

I've had several shifts of opinion since that time.

I. Learning is Inference

My first major shift (as I recall) was to give up on the idea of a uniform inference technique for both learning models and applying them ("induction" vs "deduction"). In principle, Bayes' Law tells us that learning is just probabilistic inference. In practice, unless you're using Monte Carlo methods, implementations tend to have very different algorithms for the two cases. There was a time when I hoped that parameter learning could be implemented via belief propagation in factor graphs, properly understood: we just need to find the appropriate way to structure the factor graph such that parameters are explicitly represented as variables we reason about.

It's technically possible, but not really such a good idea. One reason is that we don't usually care about the space of possible parameter settings, so long as we find one combination which predicts data well. There is no need to keep the spread of possibilities except so far as it facilitates making good predictions. On the other hand, we do usually care about maintaining a spread of possible values for the latent variables within the model itself, precisely because this does tend to help. (The distinction can be ambiguous! In principle there might not be a clean line between latent variables, model parameters, and even the so-called hyperparameters. In practice, the distinction is quite clear, though, and that's the point here.)

Instead, I ended up working on a perfectly normal gradient-descent learning algorithm for Sigma. Either you're already familiar with this term, or you've got a lot to learn about machine learning; so, I won't try to go into details. The main point is that this is a totally non-probabilistic method, which only believes a single parameter setting at any given time, but attempts to adjust these in response to data.
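To make "only believes a single parameter setting at any given time" concrete, here is a minimal toy sketch. It is not Sigma's learning algorithm; the coin-flip model, the function name fit_bernoulli, the learning rate, and the data are all assumptions chosen for illustration. The learner holds one number, theta, and nudges it along the gradient of the average log-likelihood; it never represents a distribution over possible values of theta.

```python
def fit_bernoulli(data, lr=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood of a coin's bias.

    Only a single point estimate `theta` is kept at any time; there is
    no posterior over parameter values.
    """
    theta = 0.5                      # initial guess
    n = len(data)
    heads = sum(data)
    for _ in range(steps):
        # d/dtheta of (1/n) * [heads*log(theta) + (n-heads)*log(1-theta)]
        grad = (heads / theta - (n - heads) / (1.0 - theta)) / n
        theta += lr * grad
        # keep theta strictly inside (0, 1) so the logs stay defined
        theta = min(max(theta, 1e-6), 1.0 - 1e-6)
    return theta


data = [1, 1, 1, 0, 1, 0, 1, 1]      # hypothetical observations: 6 heads out of 8
theta_hat = fit_bernoulli(data)
print(theta_hat)
```

A fully Bayesian treatment would instead carry a whole distribution over theta (for example a Beta posterior); the point estimate throws that spread away, which is exactly the trade described above.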

The next big shift in my thinking has been less easily summarized, and has been taking place over a longer period of time. Last year, I wrote a post attempting to think about it from a perspective of "local" vs "global" methods. At that time, my skepticism about whether factor graphs provide a good foundation to start with was already strong, but I don't think I articulated it very well.

II. Inference is Approximate

Inference in factor graphs is exponential time in the general case, which means that (unless the factor graph has a simple form) we need to use faster approximate inference.
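The exponential cost is easy to see in a brute-force sketch. The code below is an illustration under assumed names (brute_force_marginal, agree, prefer_one are all hypothetical): it computes an exact marginal by summing the product of all factors over every one of the 2**n joint assignments of n binary variables. The example graph is a simple chain, which belief propagation could handle efficiently; it is used here only because the answer is easy to check by hand.

```python
from itertools import product


def brute_force_marginal(n_vars, factors, query_var):
    """Exact marginal of one binary variable by enumerating all 2**n_vars assignments.

    factors: list of (variable_indices, function) pairs; each function maps
    the values of its variables to a non-negative weight.
    """
    weights = [0.0, 0.0]
    for assignment in product([0, 1], repeat=n_vars):   # 2**n_vars iterations
        w = 1.0
        for var_idx, fn in factors:
            w *= fn(*(assignment[i] for i in var_idx))
        weights[assignment[query_var]] += w
    z = sum(weights)                                     # normalizing constant
    return [w / z for w in weights]


def agree(a, b):
    """Soft constraint: neighbouring variables prefer to be equal."""
    return 2.0 if a == b else 1.0


def prefer_one(a):
    """Soft evidence: the first variable prefers the value 1."""
    return 3.0 if a == 1 else 1.0


n = 4
factors = [((0,), prefer_one)] + [((i, i + 1), agree) for i in range(n - 1)]
marginal_last = brute_force_marginal(n, factors, n - 1)
print(marginal_last)
```

Each extra variable doubles the number of loop iterations, which is why anything beyond small or specially structured graphs needs approximate methods.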

This makes perfect sense if we assume that "inference" is a catch-all term referring to the way beliefs move around in the space of possible beliefs. It should be impossible to do exact inference on the whole belief space.

If we concede that "inference" is distinct from "learning", though, we have a different situation. What's the point in learning a model that is so difficult to apply? If we are going to go ahead and approximate it with a simpler function, doesn't it make sense to learn the simpler function in the first place?

This is part of the philosophy behind sum-product networks (SPNs). Like neural networks, SPN inference is always linear time: you basically just have to propagate function values. Unlike neural networks, everything is totally probabilistic. This may be important for several reasons; it means we can always do reasoning on partial data (filling in the missing parts using the probabilistic model), and reason "in any direction" thanks to Bayes' Law (where neural networks tend to define a one-direction input-output relationship, making reasoning in the reverse direction difficult).

Why are SPNs so efficient?

III. Models are Factored Representations

The fundamental idea behind factor graphs is that we can build up complicated probability distributions by multiplying together simple pieces. Multiplication acts like a conjunction operation, putting probabilistic knowledge together. Suppose that we know a joint distribution connecting X with Y: P(X, Y). Suppose that we also know a conditional distribution, defining a probability on Z for any given value of X: P(Z|X). If we assume that Z and Y are independent given X, we can obtain the joint distribution across all three variables by multiplying the two probability functions together: P(X,Y,Z) = P(X,Y)P(Z|X).
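A small numerical check of this product rule, as a sketch: the variables are assumed to be binary and every number in the two tables is made up. The code multiplies P(X,Y) by P(Z|X) using NumPy broadcasting, then confirms that the result is a proper distribution, that summing out Z gives back P(X,Y), and that Y and Z really are independent given X.

```python
import numpy as np

# Hypothetical tables for binary X, Y, Z (all values invented for illustration).
# p_xy[x, y] = P(X=x, Y=y); entries sum to 1.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
# p_z_given_x[x, z] = P(Z=z | X=x); each row sums to 1.
p_z_given_x = np.array([[0.90, 0.10],
                        [0.25, 0.75]])

# P(X, Y, Z) = P(X, Y) * P(Z | X), stored as joint[x, y, z].
joint = p_xy[:, :, None] * p_z_given_x[:, None, :]

total = joint.sum()                  # should be 1
recovered_xy = joint.sum(axis=2)     # summing out Z should recover P(X, Y)

# Conditional independence check: P(Y, Z | X) == P(Y | X) * P(Z | X).
p_x = p_xy.sum(axis=1)
p_yz_given_x = joint / p_x[:, None, None]
p_y_given_x = p_xy / p_x[:, None]
factored = p_y_given_x[:, :, None] * p_z_given_x[:, None, :]

print(total, np.allclose(recovered_xy, p_xy), np.allclose(p_yz_given_x, factored))
```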

A different way of looking at this is that it allows us to create probabilistic constraints connecting variables. A factor graph is essentially a network of these soft constraints. We would expect this to be a powerful representation, because we already know that constraints are a powerful representation for non-probabilistic systems. We would also expect inference to be very difficult, though, because solving systems of constraints is hard.

SPNs allow multiplication of distributions, but only when it does not introduce dependencies between distributions which must be "solved" in any way. To oversimplify just a smidge, we can multiply just in the case of total independence: P(X)P(Y). We are not allowed to use P(X|Y)P(Y), because if we start allowing those kinds of cross-distribution dependencies, things get tangled and inference becomes hard.

To supplement this reduced ability to multiply, we add in the ability to add. We compose complicated distributions as a series of sums and products of simpler distributions. (Hence the name.) Despite the simplifying requirements on the products, this turns out to be a rather powerful representation.

Just as we can think of a product as conjunctive, imposing a series of constraints on a system, we can think of a sum as being disjunctive, building up a probability distribution by enumeration of possibilities. This should bring to mind mixture models and clustering.

Reasoning based on enumeration is faster because, in a sense, it's just the already-solved version of the constraint problem: you have to explicitly list the set of possibilities, as opposed to starting with all possibilities and listing constraints to narrow them down.
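A toy sum-product network makes the contrast visible. This is a minimal sketch, not a full SPN implementation: the class names Leaf, Product, and Sum, the two-variable structure, and all weights are assumptions. The network is a sum (a mixture, enumerating two possibilities) over products of independent leaf distributions. Any query is answered in one bottom-up pass; an unobserved variable is marginalized simply by letting its leaves evaluate to 1.

```python
class Leaf:
    """Distribution over a single binary variable."""
    def __init__(self, var, probs):
        self.var, self.probs = var, probs

    def value(self, evidence):
        x = evidence.get(self.var)
        # Unobserved variable: summing the leaf over its values gives 1.
        return 1.0 if x is None else self.probs[x]


class Product:
    """Product of children over disjoint sets of variables (independence)."""
    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        result = 1.0
        for child in self.children:
            result *= child.value(evidence)
        return result


class Sum:
    """Weighted sum (mixture) of children over the same variables."""
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children

    def value(self, evidence):
        return sum(w * child.value(evidence) for w, child in self.weighted_children)


# Hypothetical two-component mixture over binary X and Y.
spn = Sum([
    (0.6, Product([Leaf("X", [0.9, 0.1]), Leaf("Y", [0.8, 0.2])])),
    (0.4, Product([Leaf("X", [0.2, 0.8]), Leaf("Y", [0.3, 0.7])])),
])

total = spn.value({})                    # no evidence: total probability mass
p_joint = spn.value({"X": 1, "Y": 1})    # P(X=1, Y=1)
p_x1 = spn.value({"X": 1})               # P(X=1), with Y summed out for free
p_y1_given_x1 = p_joint / p_x1           # P(Y=1 | X=1) via Bayes' Law
print(total, p_joint, p_x1, p_y1_given_x1)
```

In this tree-shaped example every node is visited once per query. A network that shares sub-structure would cache node values to keep the cost linear in its size.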

Yet, it's also more powerful in some cases. It turns out that SPNs are much better at representing probabilistic parsing than factor graphs are. Parsing is a technique which is essential in natural language processing, but it's also been used for other purposes; image parsing has been a thing for a long time, and I think it's a good way of trying to get a handle on a more intricately structured model of images and other data. A parse is elegantly represented via a sum of possibilities. It can be represented with constraints, and this approach has been successfully used. Those applications require special-purpose optimizations to avoid the exponential time inference associated with factor graphs and constraint networks, though.

The realization that factor graphs aren't very good at representing this is what really broke my resolve as far as factor graphs go. This indicated to me that factor graphs really were missing a critical representational capability: the ability to enumerate possibilities.

Grammar-like theories of general learning have been an old obsession of mine, which I had naively assumed could be handled well within the factor-graph world.

My new view suggested that inference and learning should both be a combination of sum-reasoning and product-reasoning. Sum-learning includes methods like clustering, boosting, and bagging: we learn enumerative models. Reasoning with these is quite fast. Product-learning splits reality into parts which can be modeled separately and then combined. These two learning steps interact. Through this process, we create inherently fast, grammar-like models of the world.

IV. Everything is Probabilistic

Around the same time I was coming to these conclusions, the wider research community was starting to get excited about the new distributed representations. My shift in thinking was taking me toward grammars, so I was quite excited about Richard Socher's RNN representation. This demonstrated using one fairly simple algorithm for both language parsing and image parsing, producing state-of-the-art results along with a learned representation that could be quite useful for other tasks. RNNs have continued to produce impressive results moving forward; in fact, I would go so far as to say that they are producing results which look like precisely what we want to see out of a nascent AGI technology. These methods produce cross-domain structured generalizations powerful enough to classify previously-unseen objects based on knowledge obtained by reading, which (as I said in my previous post) seems quite encouraging. Many other intriguing results have been published as well.

Unfortunately, it's not clear how to fit RNNs in with more directly probabilistic models. The vector representations at the heart of RNNs could be re-conceived as restricted Boltzmann machines (RBMs) or another similar probabilistic model, giving a distributed representation with a fully probabilistic semantics (and taking advantage of the progress in deep belief networks). However, this contradicts the conclusion of section II: an RBM is a complex probabilistic model which must be approximated. Didn't I just say that we should do away with overly-complex models if we know we'll just be approximating them?

Carrying forward the momentum from the previous section, it might be tempting to abandon probabilistic methods entirely, in favor of the new neural approaches. SPNs restrict the form of probabilistic models to ensure fast inference. But why accept these restrictions? Neural networks are always fast, and they don't put up barriers against certain sorts of complexity in models.

The objective functions for the neural models are still (often) probabilistic. A neural network can be trained to output a probability. We do not need everything inside the network to be a probability. We do lose something in this approach: it's harder to reverse a function (reasoning backwards via Bayes' Law) and perform other probabilistic manipulations. However, there may be solutions to these problems (such as training a network to invert another network).

V. Further Thought Needed

These neural models are impressive, and it seems as if a great deal could be achieved by extending what's been done and putting those pieces together into one multi-domain knowledge system. However, this could never yield AGI as it stands: as this paper notes (Section 6), vector representations do not perform any logical deduction to answer questions; rather, answers are baked-in during learning. These systems often can correctly answer totally new questions which they have not been trained on, but that is because the memorization of the other answers forced the vectors into the right "shape" to make the correct answers evident. While this technique is powerful, it can't capture aspects of intelligence which require thinking.

Similarly, it's not possible for all probabilistic models to be tractable. SPNs may be a powerful tool for creating fast probabilistic models, but the restrictions prevent us from modeling everything. Intelligence requires some inference to be difficult! So, we need a notion of difficult inference!

It seems like there is a place for both shallow and deep methods; unstructured and highly structured models. Fitting all these pieces together is the challenge.

More Distributed Vectors

2014-07-15T00:41:00.000-07:00

Since my last post on distributed vector representations, interest in this area has continued to spread across the research community. This exposition on Colah's blog is quite good, although it unfortunately perpetuates the confusing view that distributed representations are necessarily "deep learning". (In fact, as far as I can see, the trend is in the opposite direction: you can do better by using simpler networks so that you can train faster and scale up to larger datasets. This reflects a very general trend to which deep networks seem to be a rare exception.)

The story Colah tells is quite exciting. Vector representations (AKA "word embeddings") are able to perform well on a variety of tasks which they have not been trained on at all; they seem to "magically" encode general knowledge about language after being trained on just one language task. The form of this knowledge makes it easier to translate between languages, too, because the relationship structure between concepts is similar in the two languages. This even extends to image classification: there have been several successes with so-called zero-shot learning, where a system is able to correctly classify images even when it's never seen examples of those classes before, thanks to the general world knowledge provided by distributed word representations.

For example, it's possible to recognize a cat having only seen dogs, but having read about both dogs and cats.

(This seems rather encouraging!)

Colah mentions that while encoding has been very successful, there is a corresponding decoding problem which seems to be much harder. One paper is mentioned as a hopeful direction for solving this. Colah is talking about trying to decode representations coming out of RNNs, a representation I'm quite fond of because it gives (in some sense) a semantic parse of a sentence. However, another option which I'd like to see tried would be to decode representations based on the Paragraph Vector algorithm. This looks easier, and besides, Paragraph Vector actually got better results for sentiment analysis (one of the key domains where RNNs provide a natural solution). Again, we can point to the general trend of AI toward simpler models. RNNs are a way of combining semantic vectors with probabilistic context-free grammars; Paragraph Vector combines semantic vectors with a Markov model. Markov models are simpler and less powerful; therefore, by the contrarian logic of the field, we expect them to do better. And, they do.

All of this makes distributed vectors sound quite promising as a general-purpose representation of concepts within an AI system. In addition to aiding bilingual translation and image-word correspondence problems, RNNs have also been applied to predict links in common-sense knowledge bases, showing that the same model can also be applied to understand information presented in a more logical form (and perhaps to form a bridge between logical representations and natural language). I imagine that each additional task which the vectors are used on can add more implicit knowledge to the vector structure, further increasing the number of "zero-shot" generalizations it may get correct in future tasks. This makes me envision a highly general system which accumulates knowledge in vectors over the course of its existence, achieving lifelong learning as opposed to being re-trained on each task. Vector representations by themselves are obviously not sufficient for AGI (for example, there's no model of problem solving), but they could be a very useful tool within an AGI system.

I mentioned in a previous post that the idea of distributed vectors isn't really that new. One older type of word embedding is latent semantic analysis (LSA). Some researchers who are from the LSA tradition (Marco Baroni, Georgiana Dinu, & Germán Kruszewski) have gotten annoyed at the recent hype, and decided that LSA-style embedding was not getting a fair trial; the new word embeddings have not been systematically compared to the older methods. This paper is the result. The verdict: the new stuff really is better!

When it's coming from people who openly admit that they hoped to prove the opposite, it sounds rather convincing. However, I do have some reservations. The title of their paper is: Don't count, predict! The authors refer to the older LSA-like methods as "count vectors", and newer methods as "predict vectors". The LSA-like methods rely on some transformation of raw co-occurrence counts, whereas the new neural methods train to predict something, such as predicting the current word given several previous words. (For example, LSA uses the singular value decomposition to make a low-rank approximation of tf-idf counts. Paragraph Vector trains to find a latent variable representing the paragraph topic, and from this and several previous words, predict each next word in the paragraph.)
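For readers who have not seen a "count vector" built, here is a rough LSA-style sketch. Everything in it is an assumption for illustration: the five-word vocabulary, the toy term-document counts, the particular tf-idf variant (idf = log(N / document frequency)), and the rank k = 2. It weights the counts, takes a singular value decomposition with NumPy, and keeps the top k dimensions as word vectors.

```python
import numpy as np

# Hypothetical toy corpus: rows are words, columns are documents.
words = ["cat", "dog", "pet", "stock", "market"]
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# One common tf-idf variant (an assumption; many others exist).
n_docs = counts.shape[1]
doc_freq = (counts > 0).sum(axis=1)
tfidf = counts * np.log(n_docs / doc_freq)[:, None]

# Low-rank approximation via the singular value decomposition.
u, s, vt = np.linalg.svd(tfidf, full_matrices=False)
k = 2
word_vectors = u[:, :k] * s[:k]      # one k-dimensional vector per word


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_cat_dog = cosine(word_vectors[0], word_vectors[1])
sim_cat_stock = cosine(word_vectors[0], word_vectors[3])
print(sim_cat_dog, sim_cat_stock)
```

Nothing here is trained to predict anything; the vectors are a deterministic transformation of co-occurrence counts, which is what puts the method on the "count" side of the distinction.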

As I said: they are assuming the count vs predict distinction is what explains the different performance of the two options. The main experiment of the paper, an extensive comparison of word2vec (representing the prediction-based techniques) vs DISSECT (representing the counting-based techniques) strongly supports this position. Two other techniques are also tried, though: SENNA and DM. Both of these did worse than word2vec, but the relative comparison of the two is much muddier than the comparison between DISSECT and word2vec. This weakens the conclusion somewhat. Are we expected to believe that new counting-based models will continue to be worse than new prediction-driven models?

If we believed that counting models had reached a state of near-perfection, with only small improvements left to be found, then the conclusion would make sense. Prediction-based vector representations appear to still be in their early days, with large unexplored areas. Presumably, they still have significant room for improvement. If the authors believe that this is not the case for counting-based vector representations, the conclusion makes sense.

However, work I'm involved with may undermine this conclusion.

I've been doing distributed vector stuff with Volkan Ustun and others here at ICT. Our forthcoming paper has some suggestive evidence, showing performance very close to Google's word2vec when trained with similar vector size and amount of data. We are not doing any neural network training. Instead, we are creating representations by summing together initial random representations for each word. This seems to fall firmly into count-based methods.
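To give a flavor of what "summing together initial random representations" can look like, here is a generic sketch in the style of random indexing. It is not the method from the forthcoming paper, whose details are not given here; the function name context_sum_vectors, the context window, the vector size, and the toy sentences are all assumptions. Each word gets a fixed random vector, and a word's learned representation is the sum of the random vectors of the words that appear near it.

```python
import numpy as np


def context_sum_vectors(sentences, dim=512, window=2, seed=0):
    """Build word vectors by summing the fixed random vectors of nearby words."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    initial = {w: rng.standard_normal(dim) for w in vocab}   # never updated
    learned = {w: np.zeros(dim) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    learned[w] += initial[s[j]]               # pure accumulation
    return learned


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy corpus.
sentences = [
    "the cat chased the mouse".split(),
    "the dog chased the mouse".split(),
    "investors sold the stock".split(),
    "investors bought the stock".split(),
]
vecs = context_sum_vectors(sentences)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_sold = cosine(vecs["cat"], vecs["sold"])
print(sim_cat_dog, sim_cat_sold)
```

There is no gradient and no prediction objective: words that occur in similar contexts end up with similar sums, which is why a scheme like this sits on the count-based side.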

The incredible thing about Volkan's technique is that it's essentially the first thing we thought of trying; what's going on is much simpler than what happens in word2vec. Yet, we seem to be getting similar results. (We should run a more direct comparison to be sure of what's going on.) If this is the case, it directly contradicts the conclusion of Don't Count, Predict!.

In any case, distributed vectors continue to offer a surprising amount of generality, and have some promise as a cross-task, cross-language, cross-modality unified representation.

DeepMind Papers

2014-02-07T17:46:00.001-08:00

There has been a lot of talk about Google's acquisition of DeepMind. I won't try to review the facts for those who don't know about it-- there's got to be about 100 roughly identical news articles you can find. What those news articles only rarely include is a link to DeepMind's research publications; and as far as I've seen, only to the paper on playing Atari games using model-free RL. (Maybe it's the most public-friendly.) Before the news broke, I didn't even know that DeepMind was putting out any information on what they were doing; their very blank company website made me think they were being hush-hush.

So. For the curious, here are all the papers I could find which include at least one author @deepmind.com:

Deep Autoregressive Networks

Playing Atari with Deep Reinforcement Learning

Neural Variational Inference and Learning in Belief Networks

Unit Tests for Stochastic Optimization

An Approximation of the Universal Intelligence Measure

Stochastic Back-Propagation and Variational Inference in Deep Latent Gaussian Models

Unsupervised Feature Learning by Deep Sparse Coding

Deterministic Policy Gradient Algorithms

Bayesian Hierarchical Community Discovery

Learning Word Embeddings Efficiently with Noise-Contrastive Estimation

Math vs Logic

2014-01-20T12:00:00.002-08:00

Math provides a series of games, which can be usefully applied to reality when their rules closely mirror those of some real system.

Originally, it would seem that math started out as just another part of language. Language itself has been described as a game: a useful set of rules which we follow in order to get things done.

Eventually, math developed into a sort of sub-language or sub-game with a clearly independent set of rules. The objects of mathematical language were much different from the typical objects of everyday language, being more abstract while simultaneously being atypically precise, with very definite behavior. This "definite behavior" constitutes the rules of the game. The Pythagorean cult nurtured the idea of formal derivations from axioms or postulates.

(Around this time, Plato's idea of a separate pure mathematical reality started to seem plausible to some folks.)

Notice, however, that math still seemed like a single game. Euclid's Elements provided a wonderfully unified world of mathematics, in which number theory was considered as a geometric topic.

I think it wasn't until the 1800s that it started to become clear that we want to view math as a rich set of different possible games, instead. Non-Euclidean geometry was discovered, and mathematicians started to explore variant geometries. The use of imaginary numbers became widely accepted due to the work of Gauss (and Euler in the previous century), and the quaternions were discovered. Group theory was being applied and developed.

Once we have this view, math becomes an exploration of all the possible systems we can define.

Within the same century, logic was gaining teeth and starting to look like a plausible foundation for all mathematical reasoning. It would be overstating things to claim that logic had not developed within the past two thousand years; however, developments in that century would overshadow all previous.

In using the number two thousand, I refer to Aristotle, who laid out the first formal system of logic around 350 BC. Aristotle provided a deep theory, but it was far from enough to account for all correct arguments. Euclid's Elements (written just 50 years later) may have used exceedingly rigorous arguments (a bright light in the history of mathematics), but they were still informal in the sense that there was no formal system of logic justifying every step. Instead, the reader must see that the steps are justified by force of reason. Aristotle's logic was simply too weak to do the job.

It was not until Frege, publishing in the late 1800s, that this became a possibility.

Frege attempted to set out a system of logic in which all of math could be formalized, and he went a long way toward achieving this. He invented modern logic. In the hands of others, it would become the language in which all of mathematics could be set out.

So, we see that logic steps in to provide a sort of super-game in which all the various games of mathematics can be played. It does so just as a unified picture of mathematics as a single game is crumbling.

My point is to answer a question which comes up now and then: what is the dividing line between math and logic? Is logic a sub-field of math, or is it the other way around? The reality is complex. Logic and math are clearly distinct: logic is an attempt to characterize justified arguments, whereas math merely relies heavily on these sorts of arguments. A system of logic can be viewed as just another mathematical system (just another game), but it must be admitted that logic has a different "flavor" than mathematics. I think the difference is in how we are trying to capture a very large space of possibilities within a logical system (even when we are studying very restricted systems of logic, such as logic with bounded quantification).

Ultimately, the difference is a historical one.

All facts cited here can be easily verified on Wikipedia. Thanks for reading!

You Think Too Much

2013-12-21T15:00:00.000-08:00

This first appeared as one of my few Facebook notes. I'm copying it here; perhaps it's a better place for it.

This is what the phrase "You're overthinking it" is like.

Mary and Bob sit down to a game of chess. Mary is an inexperienced player, and Bob is giving her some advice. He's being nice, trying to hint at the right moves without telling her what to do. So, Bob is making his moves very quickly, since he's experienced, and Mary is not difficult to play against (and he's going easy on her anyway). Mary is very uncertain of most of her moves, and spends a lot of time staring at the board indecisively.

At one point, Mary is looking at the board in confusion. Bob sees very plainly what move she needs to make; she should move her rook to defend her knight. Mary is looking very carefully at all the possible moves she could make, trying hard to evaluate which things might be good or bad for her, trying to think a few moves ahead.

"You're thinking too much," Bob says. "It's very simple."

This advice sounds helpful to Bob. From Bob's perspective, Mary is spending a lot of time thinking about many alternatives when she should be quickly hitting on the critical move. And it's true: if Mary were a better player, she would be thinking less here.

From Mary's perspective, this is not very helpful at all. She tries to take Bob's advice. She tries to think less about her move. She figures, if Bob says "It's simple", this must mean that she doesn't need to look several moves ahead to see the consequences. She looks for possible moves again, this time looking for things that have good consequences for her immediately.

Mary moves the pawn up to threaten one of Bob's pieces.

Bob takes Mary's knight.

Bob explains to a frustrated Mary what she could have done to avoid this. "See? You're overthinking it," he adds. To Bob, this feels like the right explanation for Mary's wrong move: she was thinking about all these other pieces, when she needed to be defending her knight.

The worst part is, Mary starts to be convinced, too. She admits that she was taking a lot of time to look several moves ahead in all kinds of situations that turned out to be irrelevant to what she needed to do. She tries to think less during the rest of the game, and makes many other mistakes as a result.

Where is AI Headed?

2013-12-15T17:05:00.001-08:00

I've spent a lot of effort on this blog arguing for the direction of higher expressiveness. Machine intelligence should be able to learn anything a human can learn, and in order for that to be possible, it should be able to conceive of any concept that a human can. I have proceeded with the belief that this is the direction to push in order for the field to make progress.

Yet, in some ways at least, the field is headed in the opposite direction.

I've often discussed the Chomsky hierarchy, and how most techniques at present fall very low on it. I've often discussed hierarchies "above" the Chomsky hierarchy; hierarchies of logic & truth, problems of uncomputability and undefinability. Reaching for the highest expression of form, the most general notion of pattern.

Machine learning has made artificial intelligence increasingly practical. Yet, the most practical techniques are often the least expressively powerful. Machine learning flourished once it abandoned the symbolic obsession of GOFAI. Fernando Pereira famously said: "The older I get, the further down the Chomsky Hierarchy I go."

There's a good reason for this, too. Highly structured techniques like logic induction and genetic programming (both of which would go high in the hierarchy) don't scale well. Commercial machine learning is large-scale, and increasingly so. I mentioned this in connection with word2vec last time: "Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result."

The "structure" I'm referring to provides more prior bias, which means more generalization capability. This is very useful when we want to come to the correct conclusion using small amounts of data. However, with more data, we can cover more and more cases without needing to actually make the generalization. At some point, the generalization becomes irrelevant in practice.

Take XML data. You can't parse XML with regular expressions.¹ Regular expressions are too low on the Chomsky hierarchy to form a proper model of what's going on. However, for the Large Text Compression Benchmark, which requires us to compress XML data, the leading technique is the PAQ compressor. Compression is equivalent to prediction, so the task amounts to making a predictive model of XML data. PAQ works by constructing a probabilistic model of the sequence of bits, similar to a PPM model. Such a model is not even capable of representing regular expressions. Learning regular expressions is like learning hidden Markov models, whereas PPM only allows us to learn fully observable Markov models. PAQ learns huge Markov models that get the job done.

The structure of XML requires a recursive generalization, to understand the nested expressions. Yet, PAQ does acceptably well, because the depth of the recursion is usually quite low.
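A toy illustration of this trade-off, which is neither PAQ nor PPM: a fixed-order, fully observable Markov model over characters, trained on bracket strings nested at most three deep. The training strings, the order `K = 6`, and all function names are assumptions invented for this sketch.

```python
from collections import Counter, defaultdict

def train(text, k):
    """Count which character follows each length-k context."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        counts[text[i - k:i]][text[i]] += 1
    return counts

def predict(counts, context):
    """Most frequent successor of the context, or None if it was never seen."""
    successors = counts.get(context)
    if not successors:
        return None
    return successors.most_common(1)[0][0]

def accuracy(counts, text, k):
    """Fraction of positions (from index k on) that are predicted correctly."""
    hits = sum(predict(counts, text[i - k:i]) == text[i] for i in range(k, len(text)))
    return hits / (len(text) - k)

K = 6
UNIT = "().(()).((()))."            # nesting depths 1, 2 and 3, each closed by "."
model = train(UNIT * 20, K)

shallow = accuracy(model, UNIT * 2, K)            # same depths as the training data
deep = accuracy(model, "()." + "((((()))))." , K)  # depth 5, never seen in training
print(shallow, deep)
```

The model memorizes every context that occurs at the depths it was trained on, so it predicts those perfectly, but it has no way to count brackets and breaks down on deeper nesting. That only matters if deep nesting actually shows up in the data.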

You can always push a problem lower down on the hierarchy if you're willing to provide more data (often exponentially more), and accept that it will learn the common cases and can't generalize the patterns to the uncommon ones. In practice, it's been an acceptable loss.

Part of the reason for this is that the data just keeps flowing. The simpler techniques require exponentially more data... and that's how much we're producing. It's only getting worse:

Has Big Data Made Anonymity Impossible? MIT Technology Review

At The New Yorker, Gary Marcus complains: Why Can't My Computer Understand Me? Reviewing the work of Hector Levesque, the article conveys a desire to "google-proof" AI, designing intelligence tests which are immune to the big-data approach. Using big data rather than common-sense logic to answer questions is seen as cheating. Levesque presents a series of problems which cannot (presently) be solved by such techniques, and calls on others to "stop bluffing".

I can't help but agree. Yet, it seems the tide of history is against us. As the amount of data continues to increase, dumb techniques will achieve better and better results.

Will this trend turn around at some point?

Gary Marcus points out that some information just isn't available on the web. Yet, this is a diminishing reality. As more and more of our lives are online (and as the population rises), more and more will be available in the global brain.

Artificial intelligence is evolving into a specific role in that global brain: a role which requires only simple association-like intelligence, fueled by huge amounts of data. Humans provide the logically structured thoughts, the prior bias, the recursive generalizations; that's a niche which machines are not currently required to fill. At present, this trend only seems to be increasing.

Should we give up structured AI?

I don't think so. We can forge a niche. We can climb the hierarchy. But it's not where the money is right now... and it may not be for some time.

1: Cthulhu will eat your face.

History of Distributed Representations

2013-12-09T19:23:00.000-08:00

Commenting on the previous post, a friend pointed out that "distributed representations" are not so new. I thought I would take a look at the history to clarify the situation.

In a very broad sense, I was discussing the technique of putting a potentially nonlinear problem into a linear vector space. This vague idea matches to many techniques in machine learning. A number of well-developed algorithms take advantage of linearity assumptions, including PCA, logistic regression, and SVM.¹ A common approach to machine learning is to find a number of features, which are just functions of your data, and use one of these techniques on the features (hoping they are close enough to linear). Another common technique, the kernel trick, projects features into a higher-dimensional space where the linearity assumption is more likely to get good results. Either way, a large part of the work to get good results is "feature engineering": choosing how to represent the data as a set of features to feed into the machine learning algorithm.
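As a small sketch of why moving to a different feature space helps: XOR cannot be separated by any line in its original two coordinates, but it becomes linearly separable once a product feature is added. The feature map, weights, and threshold below are hand-picked assumptions for illustration. A kernel method would apply such a map implicitly rather than computing it explicitly as done here.

```python
# XOR: no line in the original (x1, x2) plane separates the two classes.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def features(x):
    """Hand-engineered feature map: append the product x1 * x2."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def linear_classifier(weights, threshold, vector):
    """Predict 1 when the weighted sum exceeds the threshold."""
    score = sum(w * v for w, v in zip(weights, vector))
    return 1 if score > threshold else 0

WEIGHTS = (1, 1, -2)   # chosen by hand for this sketch, not learned
THRESHOLD = 0.5

predictions = [linear_classifier(WEIGHTS, THRESHOLD, features(x)) for x, _ in DATA]
labels = [y for _, y in DATA]
print(predictions == labels)
```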

We could even argue that probability theory itself is an example: probabilities are always linear, no matter how nonlinear the underlying problem being described. (The probability of an event is the sum of the probabilities of the ways it could happen.) This gives us nice results; for example, there is always a Nash equilibrium for a finite game if we allow probabilistic strategies. This is not the case if we only consider "pure" strategies.
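Matching pennies is the standard small example of this. The sketch below (payoff table and function names are my own choices) checks every pure strategy pair, then looks at the 50/50 mixture.

```python
# Payoff to the row player; the column player receives the negative (zero-sum).
ROW_PAYOFF = [[1, -1],
              [-1, 1]]

def pure_equilibria():
    """All pure strategy pairs where neither player gains by deviating alone."""
    found = []
    for r in range(2):
        for c in range(2):
            row_ok = all(ROW_PAYOFF[r][c] >= ROW_PAYOFF[r2][c] for r2 in range(2))
            col_ok = all(-ROW_PAYOFF[r][c] >= -ROW_PAYOFF[r][c2] for c2 in range(2))
            if row_ok and col_ok:
                found.append((r, c))
    return found

def row_payoffs_against(col_mix):
    """Expected payoff of each pure row action against a mixed column strategy."""
    return [sum(ROW_PAYOFF[r][c] * col_mix[c] for c in range(2)) for r in range(2)]

def col_payoffs_against(row_mix):
    """Expected payoff of each pure column action against a mixed row strategy."""
    return [-sum(ROW_PAYOFF[r][c] * row_mix[r] for r in range(2)) for c in range(2)]

print(pure_equilibria())
print(row_payoffs_against([0.5, 0.5]), col_payoffs_against([0.5, 0.5]))
```

No pure pair survives the check. Against a 50/50 opponent, both of a player's actions have the same expected payoff, so neither player can gain by deviating from 50/50, which makes that mixture an equilibrium.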

This theme is interesting to me, but I was trying to be much narrower in talking about recent developments in distributed representations. Like feature-based machine learning, a distributed representation will put data into a vector space to make it easier to work with. Unlike approaches relying on feature engineering, there is an emphasis on figuring out how to get the representations to "build themselves", often starting with randomly assigned vector representations.

The beginning of this kind of approach is probably latent semantic analysis (LSA), which is from 1988. LSA assigns 'semantic vectors' to words based on statistical analysis of the contexts those words occur in, based on the idea that words with similar meaning will have very similar statistics.

Given how old this technique is, the excitement around Google's release of the word2vec tool is striking. Reports spun it as deep learning for the masses. Deep learning is a much more recent wave of development. I think the term has lost much of its meaning in becoming a buzzword.² Calling word2vec "deep" takes this to farcical levels: the techniques of word2vec improve previous models by removing the hidden layer from the network. Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result.

One of the exciting things about word2vec is the good results with solving word analogies by vector math. The results of vector computations like France - Paris and Russia - Moscow are very similar, meaning we can approximately find the vector for a capital given the vector for the corresponding nation. The same trick works for a range of word relationships.

However, I've talked with people who had the incorrect impression that this is a new idea. I'm not sure exactly how old it is, but I've heard the idea mentioned before, and I did find a reference from 2004 which appears to use LSA to do the same basic thing. (I can't see the whole article on Google Books...)

One thing which I thought was really new was the emerging theme of combining vectors to form representations of compound entities. This, too, is quite old. I found a paper from 1994, which cites harder-to-find papers from 1993, 1990, and 1989 that also developed techniques to combine vectors to create representations of compound objects. Recent developments seem much more useful, but the basic idea was already present.

So, all told, it's a fairly long-standing area which has seen large improvements in the actual techniques employed, but whose central ideas were laid out (in one form or another) over 20 years ago.

1: By the way, don't get too hung up about what makes one machine learning technique "linear" and another "nonlinear". This is a false dichotomy. What I really mean is that a technique works in a vector space (which more or less means a space where + is defined and behaves very much like we expect), and relies "largely" on linear operations in this space. What does "linear" actually mean? A function F is linear if and only if F(x+y) = F(x) + F(y) and for scalar a, F(ax) = aF(x). PCA, for example, is justified by minimizing a squared error (a common theme), where the error is based on Euclidean distance, a linear operation. Notice that taking the square isn't linear, but PCA is still thought of as a linear approach.

2: Deep learning has come to mean almost any multi-layer neural network. The term caught on with the success related to Deep Belief Networks, which proposed specific new techniques. Things currently being called "deep learning" often have little in common with this. I feel the term has been watered down by people looking to associate their work with the success of others. This isn't all bad. The work on multi-layered networks seems to have produced real progress in reducing or eliminating the need for feature engineering.
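Returning to the linearity condition in footnote 1, here is a quick spot check of the two equations on a matrix map and on elementwise squaring. The matrix `M` and the sample points are arbitrary choices. Passing at one point does not prove linearity, but a single failure does disprove it.

```python
def add(x, y):
    return [a + b for a, b in zip(x, y)]

def scale(a, x):
    return [a * c for c in x]

def respects_linearity(f, x, y, a):
    """Check F(x+y) = F(x) + F(y) and F(ax) = aF(x) at one sample point."""
    return f(add(x, y)) == add(f(x), f(y)) and f(scale(a, x)) == scale(a, f(x))

M = [[2, 0],
     [1, 3]]

def matrix_map(v):
    """Multiply the vector v by the fixed matrix M."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def square_map(v):
    """Square each component of v."""
    return [c * c for c in v]

X, Y, A = [1, 2], [3, -1], 5
print(respects_linearity(matrix_map, X, Y, A), respects_linearity(square_map, X, Y, A))
```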

Distributed Representations

2013-12-03T16:49:00.000-08:00

Distributed vector representations are a set of techniques which take a domain (usually, words) and embed it into a linear space (representing each word as a large vector of numbers). Useful tasks can then be represented as manipulations of these embedded representations. The embedding can be created in a variety of ways; often, it is learned by optimizing task performance. SENNA demonstrated that representations learned for one task are often useful for others.

There are so many interesting advances being made in distributed vector representations, it seems that a nice toolset is emerging which will soon be considered a basic part of machine intelligence.

Google's word2vec assigns distributed vector representations to individual words and a few short phrases. These representations have been shown to give intuitively reasonable results on analogy tasks with simple vector math: king - man + woman is approximately equal to the vector for queen, for example. This is despite not being explicitly optimized for that task, again showing that these representations tend to be useful for a wide range of tasks.
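To make the arithmetic concrete, here is a sketch with hand-made three-dimensional vectors. These are not learned embeddings: the axes (royalty, gender, fruit) and the values are invented so that the analogy works exactly, whereas real word2vec vectors only satisfy it approximately.

```python
import math

# Hand-made 3-d vectors (axes: royalty, gender, fruit); not learned embeddings.
VECTORS = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

print(analogy("king", "man", "woman"))
```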

Similar approaches have aided machine translation tasks by turning word translation into a linear transform from one vector space to another.

One limitation of this approach is that we cannot do much to represent sentences. Sequences of words can be given somewhat useful representations by adding together the individual word representations, but this approach is limited.

Socher's RNN learns a matrix transform to compose two elements together and give them a score, which is then used for greedy parsing by composing together the highest-scoring items, with great success. This gives us useful vector representations for phrases and sentences.

Another approach which has been suggested is circular convolution. This combines vectors in a way which captures ordering information, unlike addition or multiplication. Impressively, the technique has solved Raven's Progressive Matrices problems:

http://eblerim.net/?page_id=2383
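A rough sketch of how circular convolution can carry structure that addition loses. The encoding is an assumption on my part, in the style of holographic reduced representations: each word is bound to a random role vector by circular convolution, and the bound pairs are summed. The vocabulary, the roles, the dimension `N`, and the random seed are all made up for the example.

```python
import math
import random

N = 256
rng = random.Random(0)

def random_vector():
    """Random vector with elements ~ N(0, 1/N), so its length is close to 1."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)]

def add(*vectors):
    return [sum(components) for components in zip(*vectors)]

def bind(a, b):
    """Circular convolution: out[i] = sum_j a[j] * b[(i - j) mod N]."""
    return [sum(a[j] * b[(i - j) % N] for j in range(N)) for i in range(N)]

def unbind(cue, trace):
    """Circular correlation, an approximate inverse of bind(cue, .)."""
    return [sum(cue[j] * trace[(j + k) % N] for j in range(N)) for k in range(N)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

WORDS = {w: random_vector() for w in ("dog", "bites", "man")}
ROLES = {r: random_vector() for r in ("subject", "verb", "object")}

def encode(subject, verb, obj):
    """Sum of role-word bindings for a three-word sentence."""
    return add(bind(ROLES["subject"], WORDS[subject]),
               bind(ROLES["verb"], WORDS[verb]),
               bind(ROLES["object"], WORDS[obj]))

def decode(trace, role):
    """Unbind a role and return the most similar known word."""
    noisy = unbind(ROLES[role], trace)
    return max(WORDS, key=lambda w: cosine(noisy, WORDS[w]))

# Plain addition cannot tell the two sentences apart...
sum_a = add(WORDS["dog"], WORDS["bites"], WORDS["man"])
sum_b = add(WORDS["man"], WORDS["bites"], WORDS["dog"])
same_sum = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(sum_a, sum_b))

# ...but the role-bound traces can.
trace_a = encode("dog", "bites", "man")
trace_b = encode("man", "bites", "dog")
print(same_sum, decode(trace_a, "subject"), decode(trace_b, "subject"))
```

Unbinding is only approximate, so the decoded vector is a noisy copy of the original word and has to be cleaned up by a nearest-neighbour lookup. Note that circular convolution is itself commutative. In this sketch the ordering information comes from binding each word to a distinct role.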

Then there's a project, COMPOSES, which seeks to create a language representation in which nouns get vector representations and other parts of speech get matrix representations (and possibly tensor representations?).

http://clic.cimec.unitn.it/composes/

I haven't looked into the details fully, but conceptually it makes sense: the parts of speech which intuitively represent modifiers are linear functions, while the parts of speech which are intuitively static objects are getting operated on by these functions.
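A minimal sketch of that picture, not taken from the COMPOSES project itself: a noun is a vector, an adjective is a matrix acting on it, and stacking adjectives corresponds to multiplying their matrices. The two-dimensional "noun space" and all the numbers are hypothetical.

```python
def mat_vec(m, v):
    """Apply matrix m to vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mat_mul(a, b):
    """Matrix product a @ b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

# Hypothetical 2-d noun space with axes (size, redness).
BALL = [1, 1]
BIG = [[2, 0], [0, 1]]   # doubles the size coordinate
RED = [[1, 0], [0, 3]]   # triples the redness coordinate

step_by_step = mat_vec(BIG, mat_vec(RED, BALL))   # big (red ball)
composed = mat_vec(mat_mul(BIG, RED), BALL)       # (big red) ball
print(step_by_step, composed)
```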

The following paper gives a related approach:

http://www.cs.utoronto.ca/~ilya/pubs/2008/mre.pdf

Here, everything is represented as a matrix of the same size. Representing the objects as functions is somewhat limiting, but the uniform representation makes it easy to jump to higher-level functions (modifiers on modifiers) without adding anything. This seems to have the potential to enable a surprisingly wide range of reasoning capabilities, given the narrow representation.

As the authors of that last paper mention, the approach can only support reasoning of a "memorized" sort. There is no mechanism which would allow chained logical inferences to answer questions. This seems like a good characterization of the general limitations of the broader set of techniques. The distributed representation of a word, phrase, image, or other object is a static encoding which represents, in some sense, a classification of the object into a fuzzy categorization system we've learned. How can we push the boundary here, allowing for a more complex reasoning? Can these vector representations be integrated into a more generally capable probabilistic logic system?

Progress in Logical Priors

2013-08-11T19:53:00.001-07:00

It's been a while since I've posted here. I've been having a lot of ideas, but I suppose the PhD student life makes me focus more on implementing than on speculating (which is a good thing).

I presented my first sole-authored paper (based on this blog post) at the AGI conference in December of last year, and it was cited by Marcus Hutter in an interesting paper about approximating universal intelligence (presented at this year's AGI conference, which was once again held in the summer and so has already taken place).

When I set out to write the paper, my main goal was to show that we gain something by representing beliefs as something like logical statements, rather than as something like programs. This allows our beliefs to decompose more easily, readily allows inference in "any direction" (whereas programs are naturally executed in one direction, producing specific results in a specific order), and also allows incomputable hypotheses to be dealt with in a partial way (dealing somewhat more gracefully with the possibility by explicitly representing it in the hypothesis class, but incompletely).

My desire to put forward this thesis was partly out of an annoyance with people invoking the Curry-Howard isomorphism all too often, to claim that logic and computation are really one and the same. I still think this is misguided, and not what the Curry-Howard isomorphism really says when you get down to it. The "programs are proofs" motto is misleading. There is no consensus on how to deal with Turing-complete programs in this way; Turing-complete programming languages seem to correspond to trivial logics where you can prove anything from anything!*

Annoyed, I wanted to show that there was a material difference between the two ways of representing knowledge.

As I wrote the paper and got feedback from colleagues, it became clear that I was fighting a losing fight for that thesis: although the first-order prior represented a new mathematical object with interesting features along the lines I was advocating, it would be possible to write a somewhat program-like representation with the same features. I would still argue each of the advantages I mentioned, and still argue against naive invocations of Curry-Howard, but I was trying to make these arguments too strong, and it wasn't working. In any case, this was a point that didn't need to be made in order for the paper to be interesting, for two reasons:

If desired, you could re-do everything in a more computational way. It would still be a new, interesting distribution with features similar to but different from the Solomonoff distribution.
A universal distribution over logic, done right, is interesting even if it had turned out to be somehow equivalent to the Solomonoff distribution.

So, all told, I downplayed the "logic is different from computation" side of the paper, and tried to focus more on the prior itself.

After submitting the paper, I went back to working on other things. Although I still thought about logical priors every so often, I didn't make very much conceptual progress for a while.

At the July MIRI workshop, I got the opportunity to spend time on the topic again, with other smart folks. We spent roughly a day going over the paper, and then discussed how to take things further.

The main problem with the first-order prior is that the probability of a universal statement does not approach 1 as we see more and more examples. This is because all the examples in the world will still be consistent with the statement "there exists a counterexample"; so, if we are randomly sampling sentences to compose a logical theory, the probability that we add that sentence doesn't drop below a certain minimum.

So, for example, if we are observing facts about the natural numbers, we will not converge to arbitrarily high probability for generalizations of these facts. To make it more concrete, we cannot arrive at arbitrarily high probabilities for the Goldbach conjecture by observing more and more examples of even numbers being written as the sum of two primes.
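For concreteness, "observing more and more examples" of Goldbach amounts to something like the following. The function names and the bound `LIMIT` are arbitrary choices for the sketch.

```python
def is_prime(n):
    """Trial division; fine for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def goldbach_witness(n):
    """Return primes (p, q) with p <= q and p + q == n, or None if there are none."""
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None

LIMIT = 1000
examples = {n: goldbach_witness(n) for n in range(4, LIMIT + 1, 2)}
print(all(w is not None for w in examples.values()))
```

Every even number in the range gets a witness, yet no finite run of this loop rules out a counterexample beyond `LIMIT`. That is exactly the residual probability the first-order prior keeps holding back.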

This isn't a bad thing in all cases. Holding back some fixed probability for the existence of a counterexample matches with the semantics of first-order logic, which is not supposed to be able to rule out omega-inconsistent theories. (Omega inconsistency is the situation where we deny a universal statement while simultaneously believing all the examples.)

For some domains, though, we really do want to rule out omega-inconsistency; the natural numbers are one of these cases. The reason the first-order prior allows some probability for omega-inconsistent possibilities is that first-order logic is unable to express the fact that natural numbers correspond exactly to the finite ones. ("Finite" cannot be properly characterized in first-order logic.) More expressive logics, such as second-order logic, can make this kind of assertion; so, we might hope to specify reasonable probability distributions over those logics which have the desired behavior.

Unfortunately, it is not difficult to show that the desired behavior is not approximable. If the probability of universal statements approaches 1 as we observe increasingly many examples, then it must equal 1 if we believe all the examples. Let's take an example. If we believe all the axioms of Peano arithmetic, then we may be able to prove all the examples of the Goldbach conjecture. In fact, we end up believing all true Pi_1 statements in the arithmetic hierarchy. But this implies that we believe all true Sigma_2 statements, if our beliefs are closed under implication. This in turn means that we believe all the examples of the Pi_3 universal statements, which means we must believe the true Pi_3 with probability 1, since we supposed that we believe universal statements if we believe their examples. And so on. This process can be used to argue that we must believe the true statements on every level of the hierarchy.
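The climb up the hierarchy can be written as one repeated step. This is my own schematic summary of the argument, with n standing for any level at which all true Pi_n statements are already believed (starting from n = 1):

```latex
\begin{align*}
&\text{believe all true } \Pi_n \text{ statements} \\
&\quad\Rightarrow \text{believe all true } \Sigma_{n+1} \text{ statements}
  && \text{(a true } \exists x\, \varphi(x) \text{ follows from a true instance } \varphi(k)\text{)} \\
&\quad\Rightarrow \text{believe every instance of each true } \Pi_{n+2} \text{ statement}
  && \text{(each instance is a true } \Sigma_{n+1} \text{ statement)} \\
&\quad\Rightarrow \text{believe all true } \Pi_{n+2} \text{ statements}
  && \text{(by the assumed convergence on universal statements)}
\end{align*}
```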

Since the hierarchy transcends every level of hypercomputation, there can be no hope of a convergent approximation for it. So, convergence of universal statements to probability 1 as we see more examples is (very) uncomputable. This may seem a bit surprising, given the naturalness of the idea.

Marcus Hutter has discussed distributions like this, and argues that it's OK: this kind of distribution doesn't try to capture our uncertainty about logically undecidable statements. Instead, his probability distribution represents the strong inductive power that we could have if we could infallibly arrive at correct mathematical beliefs.

Personally, though, I am much more interested in approximable distributions, and approaches which do try to represent the kind of uncertainty we have about undecidable mathematical statements.

My idea has been that we can get something interesting by requiring convergence on the Pi_1 statements only.

One motivation for this is that Pi_1 convergence guarantees that a logical probability distribution will eventually recognize the consistency of any axiomatic system, which sort-of gets around the 2nd incompleteness theorem: an AI based on this kind of distribution would eventually recognize that any axioms you give it to start with are consistent, which would allow it to gradually increase its logical strength as it came to recognize more mathematical truth. This plausibly seems like a step in the direction of self-trusting AI, one of the goals of MIRI.

The immediate objection to this is that the system still won't trust itself, because it is not a set of axioms, but rather, is a convergent approximation of a probability distribution. Convergence facts are higher up in the arithmetic hierarchy, which suggests that the system won't be able to trust itself even if it does become able to (eventually) trust axiomatic systems.

This intuition turns out to be wrong! There is a weak sense in which Pi_1 convergence implies self-trust. Correctness for Pi_1 implies that we believe the true Sigma_2 statements, which are statements of the form "There exists x such that for all y, R(x,y)" where R is some primitive recursive relation. Take R to be "if y is greater than x, then at time y in the approximation process, our probability of statement S is greater than c." (The arithmetic hierarchy can discuss the probability approximation process through a Gödel encoding.) The relevant Sigma_2 statements place lower bounds on the limiting probabilities from our probability approximation. We can state upper bounds in a similar way.
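Written out, with P_y(S) as my own notation for the approximate probability of S at time y, and reading R as the implication "if y is greater than x, then the probability exceeds c", the lower-bound and upper-bound statements are:

```latex
\[
\exists x \,\forall y \,\bigl( y > x \rightarrow P_y(S) > c \bigr)
\qquad\text{and}\qquad
\exists x \,\forall y \,\bigl( y > x \rightarrow P_y(S) < c' \bigr)
\]
```

If the approximation converges, the first statement implies that the limiting probability of S is at least c, and the second that it is at most c'.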

This shows that a probability distribution which has Pi_1 convergence will obey something strikingly like the probabilistic reflection principle which came out of a previous MIRI workshop. If its probabilities fall within specific bounds, it will believe that they do. (The converse does not hold: if it believes its probabilities fall within specific bounds, they need not actually do so.) This gives such a probability distribution a significant amount of self-knowledge.

So, Pi_1 convergence looks like a nice thing to have. But is it?

During the MIRI workshop, Will Sawin proved that this leads to bad (possibly unacceptable) results: any logically coherent, approximable probability distribution over statements in arithmetic which assigns probability 1 to true Pi_1 statements will assign probability 0 to some true Pi_2 statements. This seems like a rather severe error; the whole purpose of using probabilities to represent uncertainty about mathematical truth would be to allow "soft failure", where we don't have complete mathematical knowledge, but can assign reasonable probabilities so as to be less than completely in the dark. This theorem shows that we get hard failures if we try for Pi_1 convergence.

How concerned should we be? Some of the "hard failures" here correspond to the necessary failures in probabilistic reflection. These actually seem quite tolerable. There could be a lot more errors than that, though.

One fruitful idea might be to weaken the coherence requirement. The usual argument for coherence is the Dutch book argument; but this makes the assumption that bets will pay off, which does not apply here, since we may never face the truth or falsehood of certain mathematical statements. Intuitionistic probability comes out of a variation of the Dutch book argument for the case when bets not paying off at all is a possible outcome. This does not require that probabilities sum to 1, which means we can have a gap between the probability of X and the probability of not-X.

An extreme version of this was proposed by Marcello Herreshoff at the MIRI workshop; he suggested that we can get Pi_1 convergence by only sampling Pi_1 statements. This gets what we want, but results in probability gaps at higher levels in the hierarchy; it's possible that a sampled theory will never prove or disprove some complicated statements. (This is similar to the intuitionistic probability idea, but doesn't actually satisfy the intuitionistic coherence requirements. I haven't worked this out, though, so take what I'm saying with a grain of salt.)

We may even preserve some degree of probabilistic reflection this way, since the true Pi_1 still imply the true Sigma_2.

That particular approach seems rather extreme; perhaps too limiting. The general idea, though, may be promising: we may be able to get the advantages of Pi_1 convergence without the disadvantages.

*(Source: Last paragraph of this section on Wikipedia.)