Sunday, November 9, 2014

A List Of Nuances

Abram Demski and George Koleszarik
(This article is also cross-posted to LessWrong.)
Much of rationality is pattern-matching. An article on LessWrong might point out a thing to look for. Noticing this thing changes your reasoning in some way. This essay is a list of things to look for. These things are all associated, but the reader should take care not to lump them together. Each dichotomy is distinct, and although the brain will tend to abstract them into some sort of yin/yang correlated mush, in reality they have a more complicated structure; some things may be similar, but if possible, try to focus on the complex interrelationships.

  1. Map vs. Territory
    1. Eliezer’s sequences use this as a jump-off point for discussion of rationality.
    2. Many thinking mistakes are map vs. territory confusions.
      1. A map and territory mistake is a mix-up of seeming vs being.
      2. Humans need frequent reminders that we are not omniscient.
  2. Cached Thoughts vs. Thinking
    1. This document is a list of cached thoughts.
  3. Clusters vs. Properties
    1. These words could be used in different ways, but the distinction I want to point at is that of labels we put on things vs actual differences in things.
    2. The mind projection fallacy is the fallacy of thinking a mental category (a “cluster”) is an actual property things have.
      1. If we see something as good for one reason, we are likely to attribute other good properties to it, as if it had inherent goodness. This is called the halo effect. (If we see something as bad and infer other bad properties as a result, it is referred to as the reverse-halo effect.)
  4. Syntax vs. Semantics
    1. The syntax is the physical instantiation of the map. The semantics is the way we are meant to read the map; that is, the intended relationship to the territory.
  5. Semantics vs. Pragmatics
    1. The semantics is the literal contents of a message, whereas the pragmatics is the intended result of conveying the message.
      1. An example of a message with no semantics and only pragmatics is a command, such as “Stop!”.
      2. Almost no messages lack pragmatics, and for good reason. However, if you seek truth in a discussion, it is important to foster a willingness to say things with less pragmatic baggage.
      3. Usually when we say things, we do so with some “point” which is beyond the semantics of our statement. The point is usually to build up or knock down some larger item of discussion. This is not inherently a bad thing, but has a failure mode where arguments are battles and statements are weapons, and the cleverer arguer wins.
  6. Object-level vs. Meta-level
    1. The difference between making a map and writing a book about map-making.
    2. A good meta-level theory helps get things right at the object level, but it is usually impossible to get things right at the meta level before you’ve made significant progress at the object level.
  7. Seeming vs. Being
    1. We can only deal with how things seem, not how they are. Yet, we must strive to deal with things as they are, not as they seem.
      1. This is yet another reminder that we are not omniscient.
    2. If we optimize too hard for things which seem good rather than things which are good, we will get things which seem very good but which may only be somewhat good, or even bad.
    3. The dangerous cases are the cases where you do not notice there is a distinction.
      1. This is why humans need constant reminders that we are not omniscient.
    4. We must take care to notice the difference between how things seem to seem, and how they actually seem.
  8. Signal vs. Noise
    1. Not all information is equal. It is often the case that we desire certain sorts of information and desire to ignore other sorts.
    2. In a technical setting, this has to do with the error rate present in a communication channel; imperfections in the channel will corrupt some bits, making a need for redundancy in the message being sent.
    3. In a social setting, this is often used to refer to the amount of good information vs irrelevant information in a discussion. For example, letting a mediocre writer add material to a group blog might increase the absolute amount of good information, yet worsen the signal-to-noise ratio.
    4. Attention is a scarce resource; yes, everyone has something to teach you, but some people are much more efficient sources of wisdom than others.
  9. Selection Effects
      1. In many situations, if we can present evidence to a Bayesian agent without the agent knowing that we are being selective, we can convince the agent of anything we like. For example, if I want to convince you that smoking causes obesity, I could find many people who became obese after they started smoking.
      2. The solution to this is for the Bayesian agent to model where the information is coming from. If you know I am selecting people based on this criterion, then you will not take it as evidence of anything, because the evidence has been cherry-picked.
      3. Most of the information you receive is intensely filtered. Nothing comes to your attention with a good conscience.
    1. The silent evidence problem.
      1. Selection bias need not be the result of purposeful interference as in cherry-picking. Often, an unrelated process may hide some of the evidence needed. For example, we hear far more about successful people than unsuccessful ones. It is tempting to look at successful people and attempt to draw conclusions about what it takes to be successful. This approach suffers from the silent evidence problem: we also need to look at the unsuccessful people and examine what is different about the two groups.
  10. What You Mean vs. What You Think You Mean
    1. Very often, people will say something and then that thing will be refuted. The common response to this is to claim you meant something slightly different, which is more easily defended.
      1. We often do this without noticing, making it dangerous for thinking. It is an automatic response generated by our brains, not a conscious decision to defend ourselves from being discredited. You do this far more often than you notice. The brain fills in a false memory of what you meant without asking for permission.
  11. What You Mean vs. What the Others Think You Mean
  12. What You Optimize vs. What You Think You Optimize
    1. Evolution optimizes for reproduction but in doing so creates animals with a variety of goals which are correlated with reproduction.
    2. The people who value practice for its own sake do better than the people who only value being good at what they’re practicing.
    3. “Consequentialism is true, but virtue ethics is what works.”
  13. Stated Preferences vs. Revealed Preferences
    1. Revealed preferences are the preferences we can infer from your actions. These are usually different from your stated preferences.
        1. Food isn’t about nutrition.
        2. Clothes aren’t about comfort.
        3. Bedrooms aren’t about sleep.
        4. Marriage isn’t about love.
        5. Talk isn’t about information.
        6. Laughter isn’t about humour.
        7. Charity isn’t about helping.
        8. Church isn’t about God.
        9. Art isn’t about insight.
        10. Medicine isn’t about health.
        11. Consulting isn’t about advice.
        12. School isn’t about learning.
        13. Research isn’t about progress.
        14. Politics isn’t about policy.
        15. Going meta isn’t about the object level.
        16. Language isn’t about communication.
        17. The rationality movement isn’t about epistemology.
      1. Everything is actually about signalling.
      1. Never attribute to malice that which can be adequately explained by stupidity. The difference between stated preferences and revealed preferences does not indicate dishonest intent. We should expect the two to differ in the absence of a mechanism to align them.
    2. People, ideas, and organizations respond to incentives.
      1. Evolution selects humans who have reproductively selfish behavioral tendencies, but prosocial and idealistic stated preferences.
      2. Social forces select ideas for virality and comprehensibility as opposed to truth or even usefulness.
      3. Organizations are by default bad at being strategic about their own survival, but the ones that survive are the ones you see.
  14. What You Achieve vs. What You Think You Achieve
    1. Most of the consequences of our actions are totally unknown to us.
    2. It is impossible to optimize without proper feedback.
  15. What You Optimize vs. What You Actually Achieve
    1. Consequentialism is more about expected consequences than actual consequences.
  16. What You Seem Like vs. What You Are
    1. You can try to imagine yourself from the outside, but no one has the full picture.
  17. What Other People Seem Like vs. What They Are
    1. When people assume that they understand others, they are wrong.
  18. What People Look Like vs. What They Think They Look Like
    1. People underestimate the gap between stated preferences and revealed preferences.
  19. What Your Brain Does vs. What You Think It Does
    1. You are running on corrupted hardware.
      1. The brain’s machinations are fundamentally social; it automatically does things like signal, save face, etc., which distort the truth.
      1. Knowing that you are running on corrupted hardware should cause skepticism about the outputs of your thought-processes. Yet, too much skepticism will cause you to stumble, particularly when fast thinking is needed.
        1. Producing a correct result plus justification is harder than producing only the correct result.
        2. Justifications are important, but the correct result is more important.
        3. Much of our apparent self-reflection is confabulation, generating plausible explanations after the brain spits out an answer.
        4. Example: doing quick mental math. If you are good at this, attempting to explicitly justify every step as you go would likely slow you down.
        5. Example: impressions formed over a long period of time. Wrong or right, it is unlikely that you can explicitly give all your reasons for the impression. Requiring your own beliefs to be justifiable would preempt impressions that require lots of experience and/or many non-obvious chains of subconscious inference.
        6. Impressions are not beliefs and they are always useful data.
  20. Clever Argument vs. Truth-seeking; The Bottom Line
    1. People believe what they want to believe.
      1. Believing X for some reason unrelated to X being true is referred to as motivated cognition.
      2. Giving a smart person more information and more methods of argument may actually make their beliefs less accurate, because you are giving them more tools to construct clever arguments for what they want to believe.
    2. Your actual reason for believing X determines how well your belief correlates with the truth.
      1. If you believe X because you want to, any arguments you make for X, no matter how strong they sound, are devoid of informational content about X and should properly be ignored by a truth-seeker.
  21. Lumpers vs. Splitters
    1. A lumper is a thinker who attempts to fit things into overarching patterns. A splitter is a thinker who makes as many distinctions as possible, recognizing the importance of being specific and getting the details right.
    2. Specifically, some people want big Wikipedia and TVTropes articles that discuss many things, and others want smaller articles that discuss fewer things.
    3. This list of nuances is a lumper attempting to think more like a splitter.
  22. Fox vs. Hedgehog
    1. “A fox knows many things, but a hedgehog knows One Big Thing.” Closely related to a splitter, a fox is a thinker whose strength is in a broad array of knowledge. A hedgehog is a thinker who, in contrast, has one big idea and applies it everywhere.
    2. The fox mindset is better for making accurate judgements, according to Tetlock.
  23. Traps vs. Gardens
      1. Conversations tend to slide toward contentious and useless topics.
      2. Societies tend to decay.
      3. Thermodynamic equilibrium is entropic.
      4. Without proper institutions being already in place, it takes large amounts of constant effort and vigilance to stay out of traps.
    1. From the outside of a broken Molochian system it is easy to see how to fix it. But it cannot be fixed from the inside.
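
The silent evidence problem from item 9 can be made concrete with a small simulation. This is only a sketch under made-up assumptions: the trait, the 70% base rate, and the 5% success rate are all hypothetical, and success is deliberately generated independently of the trait. Looking only at the successful group suggests the trait matters; comparing both groups shows it does not.

```python
import random

def trait_rates(n_people=100_000, p_trait=0.7, p_success=0.05, seed=0):
    """Return the fraction of people with the trait, split by success.

    All parameters are hypothetical. Success is drawn independently of
    the trait, so any apparent link is an artifact of where we look.
    """
    rng = random.Random(seed)
    # success flag -> [number with trait, number of people]
    counts = {True: [0, 0], False: [0, 0]}
    for _ in range(n_people):
        has_trait = rng.random() < p_trait
        succeeded = rng.random() < p_success  # independent of the trait
        counts[succeeded][0] += has_trait
        counts[succeeded][1] += 1
    return {group: with_trait / total
            for group, (with_trait, total) in counts.items()}

rates = trait_rates()
print(f"trait rate among the successful:   {rates[True]:.2f}")
print(f"trait rate among the unsuccessful: {rates[False]:.2f}")
```

Both rates come out close to the base rate, which is the information the successful-only sample hides.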

Friday, July 18, 2014

Losing Faith in Factor Graphs

In my post Beliefs, I split AGI into two broad problems:

  1. What is the space of possible beliefs?
  2. How do beliefs interact?
The idea behind #1 was: can we formulate a knowledge representation which could in principle express any concept a human can conceive of?

#2 represents the more practical concern: how do we implement inference over these beliefs?

(This was needlessly narrow. I could at least have included something like: 3. How do beliefs conspire to produce actions? That's more like what my posts on procedural logic discuss. 1 and 2 alone don't exactly get you AGI. Nonetheless, these are central questions for me.)

Factor graphs are a fairly general representation for networks of probabilistic beliefs, subsuming Bayesian networks, Markov networks, and most other graphical models. Like those two, factor graphs are only as powerful as propositional logic, but can be a useful tool for representing more powerful belief structures as well. In other words, the factor graph "toolbox" includes some useful solutions to #2 which we may be able to apply to whatever solution for #1 we come up with. When I started graduate school at USC 3 years ago, I was basically on board with this direction. That's the direction taken by Sigma, the cognitive architecture effort I joined.

I've had several shifts of opinion since that time.

I. Learning is Inference

My first major shift (as I recall) was to give up on the idea of a uniform inference technique for both learning models and applying them ("induction" vs "deduction"). In principle, Bayes' Law tells us that learning is just probabilistic inference. In practice, unless you're using Monte Carlo methods, practical implementations tend to have very different algorithms for the two cases. There was a time when I hoped that parameter learning could be implemented via belief propagation in factor graphs, properly understood: we just need to find the appropriate way to structure the factor graph such that parameters are explicitly represented as variables we reason about.

It's technically possible, but not really such a good idea. One reason is that we don't usually care about the space of possible parameter settings, so long as we find one combination which predicts data well. There is no need to keep the spread of possibilities except so far as it facilitates making good predictions. On the other hand, we do usually care about maintaining a spread of possible values for the latent variables within the model itself, precisely because this does tend to help. (The distinction could be ambiguous! In principle there might not be a clean line between latent variables, model parameters, and even the so-called hyperparameters. In practice, the distinction is quite clear, though, and that's the point here.)

Instead, I ended up working on a perfectly normal gradient-descent learning algorithm for Sigma. Either you're already familiar with this term, or you've got a lot to learn about machine learning; so, I won't try to go into details. The main point is that this is a totally non-probabilistic method, which only believes a single parameter setting at any given time, but attempts to adjust these in response to data.
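
To illustrate the contrast (this is not Sigma's algorithm, just a toy sketch with made-up coin-flip data), here is a single parameter fitted two ways. The gradient method climbs the log-likelihood (equivalently, descends its negative) and only ever holds one value of `theta`. The grid posterior, assuming a uniform prior, keeps the whole spread of possible values. All names are hypothetical.

```python
# Made-up data: 1 = heads, 0 = tails.
flips = [1, 1, 0, 1, 1, 0, 1, 1]
heads = sum(flips)
tails = len(flips) - heads

def mean_log_likelihood_gradient(theta):
    """Derivative of the average Bernoulli log-likelihood w.r.t. theta."""
    return (heads / theta - tails / (1.0 - theta)) / len(flips)

# Point estimate: one parameter setting, nudged in response to the data.
theta = 0.5
learning_rate = 0.01
for _ in range(2000):
    theta += learning_rate * mean_log_likelihood_gradient(theta)

# Full spread: a posterior over a grid of candidate values (uniform prior).
grid = [i / 100 for i in range(1, 100)]
weights = [t ** heads * (1.0 - t) ** tails for t in grid]
total_weight = sum(weights)
posterior = [w / total_weight for w in weights]
posterior_mean = sum(t * p for t, p in zip(grid, posterior))

print(f"gradient point estimate: {theta:.3f}")
print(f"posterior mean on grid:  {posterior_mean:.3f}")
```

The point estimate is far cheaper to maintain; the posterior only earns its cost if the extra spread improves predictions.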

The next big shift in my thinking has been less easily summarized, and has been taking place over a longer period of time. Last year, I wrote a post attempting to think about it from a perspective of "local" vs "global" methods. At that time, my skepticism about whether factor graphs provide a good foundation to start with was already strong, but I don't think I articulated it very well.

II. Inference is Approximate

Inference in factor graphs is exponential time in the general case, which means that (unless the factor graph has a simple form) we need to use faster approximate inference.

This makes perfect sense if we assume that "inference" is a catch-all term referring to the way beliefs move around in the space of possible beliefs. It should be impossible to do exact inference on the whole belief space.
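
The blow-up is easy to see in a brute-force sketch. The example below assumes a hypothetical, densely connected factor graph over binary variables, with one pairwise factor (`pair_factor`, an invented function) between every pair. Exact marginalization by enumeration visits every joint assignment, so the work doubles with each added variable.

```python
from itertools import combinations, product

def pair_factor(a, b):
    """Hypothetical soft constraint: neighbours prefer to agree."""
    return 2.0 if a == b else 1.0

def brute_force_marginal(n_vars):
    """Return (P(x0 = 1), number of joint assignments enumerated)."""
    pairs = list(combinations(range(n_vars), 2))
    totals = [0.0, 0.0]  # unnormalized mass for x0 = 0 and x0 = 1
    n_assignments = 0
    for assignment in product((0, 1), repeat=n_vars):
        weight = 1.0
        for i, j in pairs:
            weight *= pair_factor(assignment[i], assignment[j])
        totals[assignment[0]] += weight
        n_assignments += 1
    return totals[1] / sum(totals), n_assignments

for n in (4, 8, 12):
    p, count = brute_force_marginal(n)
    print(f"{n} variables: P(x0=1) = {p:.3f}, assignments enumerated = {count}")
```

Smarter exact algorithms avoid full enumeration when the graph has a simple form, but for dense graphs like this one no general shortcut is known, which is why approximate inference takes over.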

If we concede that "inference" is distinct from "learning", though, we have a different situation. What's the point in learning a model that is so difficult to apply? If we are going to go ahead and approximate it with a simpler function, doesn't it make sense to learn the simpler function in the first place?

This is part of the philosophy behind sum-product networks (SPNs). Like neural networks, SPN inference is always linear time: you basically just have to propagate function values. Unlike neural networks, everything is totally probabilistic. This may be important for several reasons; it means we can always do reasoning on partial data (filling in the missing parts using the probabilistic model), and reason "in any direction" thanks to Bayes' Law (where neural networks tend to define a one-direction input-output relationship, making reasoning in the reverse direction difficult).

Why are SPNs so efficient?

III. Models are Factored Representations

The fundamental idea behind factor graphs is that we can build up complicated probability distributions by multiplying together simple pieces. Multiplication acts like a conjunction operation, putting probabilistic knowledge together. Suppose that we know a joint distribution connecting X with Y: P(X,Y). Suppose that we also know a conditional distribution, defining a probability on Z for any given value of X: P(Z|X). If we assume that Z and Y are independent given X, we can obtain the joint distribution across all three variables by multiplying the two probability functions together: P(X,Y,Z) = P(X,Y)P(Z|X).
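
As a quick numerical check of that product rule, here is a sketch with invented probability tables for three binary variables. It builds P(X,Y,Z) by multiplication, then confirms that the result is normalized, that summing out Z recovers P(X,Y), and that Y and Z are independent given X.

```python
# Invented tables for binary X, Y, Z.
p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
p_z_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.25, 1: 0.75}}

# P(X,Y,Z) = P(X,Y) * P(Z|X)
p_xyz = {
    (x, y, z): p_xy[(x, y)] * p_z_given_x[x][z]
    for (x, y) in p_xy
    for z in (0, 1)
}

total = sum(p_xyz.values())

def marginal_xy(x, y):
    """Sum Z out of the joint."""
    return sum(p for (xx, yy, _), p in p_xyz.items() if (xx, yy) == (x, y))

def conditional_independence_gap():
    """Largest |P(y,z|x) - P(y|x) P(z|x)| over all assignments."""
    gap = 0.0
    for (x, y, z), p in p_xyz.items():
        p_x = sum(q for (xx, _, _), q in p_xyz.items() if xx == x)
        p_y = sum(q for (xx, yy, _), q in p_xyz.items() if (xx, yy) == (x, y)) / p_x
        p_z = sum(q for (xx, _, zz), q in p_xyz.items() if (xx, zz) == (x, z)) / p_x
        gap = max(gap, abs(p / p_x - p_y * p_z))
    return gap

print(f"sum of joint: {total:.6f}")
print(f"conditional independence gap: {conditional_independence_gap():.2e}")
```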

A different way of looking at this is that it allows us to create probabilistic constraints connecting variables. A factor graph is essentially a network of these soft constraints. We would expect this to be a powerful representation, because we already know that constraints are a powerful representation for non-probabilistic systems. We would also expect inference to be very difficult, though, because solving systems of constraints is hard.

SPNs allow multiplication of distributions, but only when it does not introduce dependencies between distributions which must be "solved" in any way. To oversimplify just a smidge, we can multiply just in the case of total independence: P(X)P(Y). We are not allowed to use P(X|Y)P(Y), because if we start allowing those kinds of cross-distribution dependencies, things get tangled and inference becomes hard.

To supplement this reduced ability to multiply, we add in the ability to add. We compose complicated distributions as a series of sums and products of simpler distributions. (Hence the name.) Despite the simplifying requirements on the products, this turns out to be a rather powerful representation.

Just as we can think of a product as conjunctive, imposing a series of constraints on a system, we can think of a sum as being disjunctive, building up a probability distribution by enumeration of possibilities. This should bring to mind mixture models and clustering.
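
A toy sum-product network makes this concrete. The sketch below is a minimal illustration with invented weights, not a reference implementation: the root is a sum (a two-component mixture), each component is a product of independent Bernoulli leaves, and the classes `Leaf`, `Product`, and `Sum` are hypothetical names. Any joint or marginal query is one bottom-up pass; a variable missing from the evidence is marginalized out by letting its leaf evaluate to 1.

```python
class Leaf:
    """Bernoulli distribution over a single named variable."""
    def __init__(self, var, p_true):
        self.var = var
        self.p_true = p_true

    def value(self, evidence):
        if self.var not in evidence:
            return 1.0  # unobserved: sum over both values gives 1
        return self.p_true if evidence[self.var] else 1.0 - self.p_true

class Product:
    """Product of children over disjoint sets of variables."""
    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        result = 1.0
        for child in self.children:
            result *= child.value(evidence)
        return result

class Sum:
    """Weighted mixture of children; weights are assumed to sum to 1."""
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children

    def value(self, evidence):
        return sum(w * child.value(evidence) for w, child in self.weighted_children)

spn = Sum([
    (0.6, Product([Leaf("X", 0.9), Leaf("Y", 0.8)])),
    (0.4, Product([Leaf("X", 0.2), Leaf("Y", 0.1)])),
])

joint = spn.value({"X": True, "Y": True})   # P(X=1, Y=1)
marginal = spn.value({"X": True})           # P(X=1), Y summed out
print(f"P(X=1, Y=1) = {joint:.3f}")
print(f"P(X=1)      = {marginal:.3f}")
print(f"P(Y=1|X=1)  = {joint / marginal:.3f}")
```

This network is a tree, so each node is evaluated once per query. A network that shares sub-networks would need to cache node values to keep the pass linear in its size.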

Reasoning based on enumeration is faster because, in a sense, it's just the already-solved version of the constraint problem: you have to explicitly list the set of possibilities, as opposed to starting with all possibilities and listing constraints to narrow them down.

Yet, it's also more powerful in some cases. It turns out that SPNs are much better at representing probabilistic parsing than factor graphs are. Parsing is a technique which is essential in natural language processing, but it's also been used for other purposes; image parsing has been a thing for a long time, and I think it's a good way of trying to get a handle on a more intricately structured model of images and other data. A parse is elegantly represented via a sum of possibilities. It can be represented with constraints, and this approach has been successfully used. Those applications require special-purpose optimizations to avoid the exponential time inference associated with factor graphs and constraint networks, though.

The realization that factor graphs aren't very good at representing this is what really broke my resolve as far as factor graphs go. This indicated to me that factor graphs really were missing a critical representational capability; the ability to enumerate possibilities.

Grammar-like theories of general learning have been an old obsession of mine, which I had naively assumed could be handled well within the factor-graph world.

My new view suggested that inference and learning should both be a combination of sum-reasoning and product-reasoning. Sum-learning includes methods like clustering, boosting, and bagging: we learn enumerative models. Reasoning with these is quite fast. Product-learning splits reality into parts which can be modeled separately and then combined. These two learning steps interact. Through this process, we create inherently fast, grammar-like models of the world.

IV. Everything is Probabilistic

Around the same time I was coming to these conclusions, the wider research community was starting to get excited about the new distributed representations. My shift in thinking was taking me toward grammars, so I was quite excited about Richard Socher's RNN representation. This demonstrated using one fairly simple algorithm for both language parsing and image parsing, producing state-of-the-art results along with a learned representation that could be quite useful for other tasks. RNNs have continued to produce impressive results moving forward; in fact, I would go so far as to say that they are producing results which look like precisely what we want to see out of a nascent AGI technology. These methods produce cross-domain structured generalizations powerful enough to classify previously-unseen objects based on knowledge obtained by reading, which (as I said in my previous post) seems quite encouraging. Many other intriguing results have been published as well.

Unfortunately, it's not clear how to fit RNNs in with more directly probabilistic models. The vector representations at the heart of RNNs could be re-conceived as restricted Boltzmann machines (RBMs) or another similar probabilistic model, giving a distributed representation with a fully probabilistic semantics (and taking advantage of the progress in deep belief networks). However, this contradicts the conclusion of section II: an RBM is a complex probabilistic model which must be approximated. Didn't I just say that we should do away with overly-complex models if we know we'll just be approximating them?

Carrying forward the momentum from the previous section, it might be tempting to abandon probabilistic methods entirely, in favor of the new neural approaches. SPNs restrict the form of probabilistic models to ensure fast inference. But why accept these restrictions? Neural networks are always fast, and they don't put up barriers against certain sorts of complexity in models.

The objective functions for the neural models are still (often) probabilistic. A neural network can be trained to output a probability. We do not need everything inside the network to be a probability. We do lose something in this approach: it's harder to reverse a function (reasoning backwards via Bayes' Law) and perform other probabilistic manipulations. However, there may be solutions to these problems (such as training a network to invert another network).

V. Further Thought Needed

These neural models are impressive, and it seems as if a great deal could be achieved by extending what's been done and putting those pieces together into one multi-domain knowledge system. However, this could never yield AGI as it stands: as this paper notes (Section 6), vector representations do not perform any logical deduction to answer questions; rather, answers are baked-in during learning. These systems can often correctly answer totally new questions which they have not been trained on, but that is because the memorization of the other answers forced the vectors into the right "shape" to make the correct answers evident. While this technique is powerful, it can't capture aspects of intelligence which require thinking.

Similarly, it's not possible for all probabilistic models to be tractable. SPNs may be a powerful tool for creating fast probabilistic models, but the restrictions prevent us from modeling everything. Intelligence requires some inference to be difficult! So, we need a notion of difficult inference!

It seems like there is a place for both shallow and deep methods; unstructured and highly structured models. Fitting all these pieces together is the challenge.