Friday, February 7, 2014

DeepMind Papers

There has been a lot of talk about Google's acquisition of DeepMind. I won't try to review the facts for those who don't know about it-- there's got to be about 100 roughly identical news articles you can find. What those news articles only rarely include is a link to DeepMind's research publications; and as far as I've seen, only to the paper on playing Atari games using model-free RL. (Maybe it's the most public-friendly.) Before the news broke, I didn't even know that DeepMind was putting out any information on what they were doing; their very blank company website made me think they were being hush-hush.

So. For the curious, here are all the papers I could find which include at least one author

Deep Autoregressive Networks

Playing Atari with Deep Reinforcement Learning

Neural Variational Inference and Learning in Belief Networks

Unit Tests for Stochastic Optimization

An Approximation of the Universal Intelligence Measure

Stochastic Back-Propagation and Variational Inference in Deep Latent Gaussian Models

Unsupervised Feature Learning by Deep Sparse Coding

Deterministic Policy Gradient Algorithms

Bayesian Hierarchical Community Discovery

Monday, January 20, 2014

Math vs Logic

Math provides a series of games, which can be usefully applied to reality when their rules closely mirror those of some real system.

Originally, it would seem that math started out as just another part of language. Language itself has been described as a game: a useful set of rules which we follow in order to get things done.

Eventually, math developed into a sort of sub-language or sub-game with a clearly independent set of rules. The objects of mathematical language were much different from the typical objects of everyday language, being more abstract while simultaneously being atypically precise, with very definite behavior. This "definite behavior" constitutes the rules of the game. The Pythagorean cult nurtured the idea of formal derivations from axioms or postulates.

(Around this time, Plato's idea of a separate pure mathematical reality started to seem plausible to some folks.)

Notice, however, that math still seemed like a single game. Euclid's Elements provided a wonderfully unified world of mathematics, in which number theory was considered as a geometric topic.

I think it wasn't until the 1800s that it started to become clear that we want to view math as a rich set of different possible games, instead. Non-euclidean geometry was discovered, and mathematicians started to explore variant geometries. The use of imaginary numbers became widely accepted due to the work of Gauss (and Euler in the previous century), and the quaternions were discovered. Group theory was being applied and developed.

Once we have this view, math becomes an exploration of all the possible systems we can define.

Within the same century, logic was gaining teeth and starting to look like a plausible foundation for all mathematical reasoning. It would be overstating things to claim that logic had not developed within the past two thousand years; however, developments in that century would overshadow all previous.

In using the number two thousand, I refer to Aristotle, who laid out the first formal system of logic around 350 BC. Aristotle provided a deep theory, but it was far from enough to account for all correct arguments. Euclid's Elements (written just 50 years later) may have used exceedingly rigorous arguments (a bright light in the history of mathematics), but they were still informal in the sense that there was no formal system of logic justifying every step. Instead, the reader must see that the steps are justified by force of reason. Aristotle's logic was simply too weak to do the job.

It was not until Frege, publishing in the late 1800s, that this became a possibility.

Frege attempted to set out a system of logic in which all of math could be formalized, and he went a long way toward achieving this. He invented modern logic. In the hands of others, it would become the language in which all of mathematics could be set out.

So, we see that logic steps in to provide a sort of super-game in which all the various games of mathematics can be played. It does so just as a unified picture of mathematics as a single game is crumbling.

My point is to answer a question which comes up now and then: what is the dividing line between math and logic? Is logic a sub-field of math, or is it the other way around? The reality is complex. Logic and math are clearly distinct: logic is an attempt to characterize justified arguments, whereas math merely relies heavily on these sorts of arguments. A system of logic can be viewed as just another mathematical system (just another game), but it must be admitted that logic has a different "flavor" than mathematics. I think the difference is in how we are trying to capture a very large space of possibilities within a logical system (even when we are studying very restricted systems of logic, such as logic with bounded quantification).

Ultimately, the difference is a historical one.

All facts cited here can be easily verified on Wikipedia. Thanks for reading!

Saturday, December 21, 2013

You Think Too Much

This first appeared as one of my few facebook notes. I'm copying it here; perhaps it's a better place for it.

This is what the phrase "You're overthinking it" is like.

Mary and Bob sit down to a game of chess. Mary is an inexperienced player, and Bob is giving her some advice. He's being nice, trying to hint at the right moves without telling her what to do. So, Bob is making his moves very quickly, since he's experienced, and Mary is not difficult to play against (and he's going easy on her anyway). Mary is very uncertain of most of her moves, and spends a lot of time staring at the board indecisively.

At one point, Mary is looking at the board in confusion. Bob sees very plainly what move she needs to make; she should move her rook to defend her knight. Mary is looking very carefully at all the possible moves she could make, trying hard to evaluate which things might be good or bad for her, trying to think a few moves ahead.

"You're thinking too much," Bob says. "It's very simple."

This advice sounds helpful to Bob. From Bob's perspective, Mary is spending a lot of time thinking about many alternatives when she should be quickly hitting on the critical move. And it's true: if Mary were a better player, she would be thinking less here.

From Mary's perspective, this is not very helpful at all. She tries to take Bob's advice. She tries to think less about her move. She figures, if Bob says "It's simple", this must mean that she doesn't need to look several moves ahead to see the consequences. She looks for possible moves again, this time looking for things that have good consequences for her immediately.

Mary moves the pawn up to threaten one of Bob's pieces.

Bob takes Mary's knight.

Bob explains to a frustrated Mary what she could have done to avoid this. "See? You're overthinking it" he adds. To Bob, this feels like the right explanation for Mary's wrong move: she was thinking about all these other pieces, when she needed to be defending her knight.

The worst part is, Mary starts to be convinced, too. She admits that she was taking a lot of time to look several moves ahead in all kinds of situations that turned out to be irrelevant to what she needed to do. She tries to think less during the rest of the game, and makes many other mistakes as a result.

Sunday, December 15, 2013

Where is AI Headed?

I've spent a lot of effort on this blog arguing for the direction of higher expressiveness. Machine intelligence should be able to learn anything a human can learn, and in order for that to be possible, it should be able to conceive of any concept that a human can. I have proceeded with the belief that this is the direction to push in order for the field to make progress.

Yet, in some ways at least, the field is headed in the opposite direction.

I've often discussed the Chomsky hierarchy, and how most techniques at present fall very low on it. I've often discussed hierarchies "above" the Chomsky hierarchy; hierarchies of logic & truth, problems of uncomputability and undefinability. Reaching for the highest expression of form, the most general notion of pattern.

Machine learning has made artificial intelligence increasingly practical. Yet, the most practical techniques are often the least expressively powerful. Machine learning flourished once it abandoned the symbolic obsession of GOFAI. Fernando Pereira famously said: "The older I get, the further down the Chomsky Hierarchy I go."

There's a good reason for this, too. Highly structured techniques like logic induction and genetic programming (both of which would go high in the hierarchy) don't scale well. Commercial machine learning is large-scale, and increasingly so. I mentioned this in connection with word2vec last time: "Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result." 

The "structure" I'm referring to provides more prior bias, which means more generalization capability. This is very useful when we want to come to the correct conclusion using small amounts of data. However, with more data, we can cover more and more cases without needing to actually make the generalization. At some point, the generalization becomes irrelevant in practice.

Take XML data. You can't parse XML with regular expressions.1 Regular expressions are too low on the Chomsky hierarchy to form a proper model of what's going on. However, for the Large Text Compression Benchmark, which requires us to compress XML data, the leading technique is the PAQ compressor. Compression is equivalent to prediction, so the task amounts to making a predictive model of XML data. PAQ works by constructing a probabilistic model of the sequence of bits, similar to a PPM model. This is not even capable of representing regular expressions. Learning regular expressions is like learning hidden markov models. PPM allows us to learn fully observable markov models. PAQ learns huge markov models that get the job done.

The structure of XML requires a recursive generalization, to understand the nested expressions. Yet, PAQ does acceptably well, because the depth of the recursion is usually quite low.

You can always push a problem lower down on the hierarchy if you're willing to provide more data (often exponentially more), and accept that it will learn the common cases and can't generalize the patterns to the uncommon ones. In practice, it's been an acceptable loss.

Part of the reason for this is that the data just keeps flowing. The simpler techniques require exponentially more data... and that's how much we're producing. It's only getting worse:

Has Big Data Made Anonymity Impossible? MIT Technology Review
At The New Yorker, Gary Marcus complains: Why Can't My Computer Understand Me? Reviewing the work of Hector Levesque, the article conveys a desire to "google-proof" AI, designing intelligence tests which are immune to the big-data approach. Using big data rather than common-sense logic to answer facts is seen as cheating. Levesque presents a series of problems which cannot (presently) be solved by such techniques, and calls others to "stop bluffing".

I can't help but agree. Yet, it seems the tide of history is against us. As the amount of data continues to increase, dumb techniques will achieve better and better results.

Will this trend turn around at some point?

Gary Marcus points out that some information just isn't available on the web. Yet, this is a diminishing reality. As more and more of our lives are online (and as the population rises), more and more will be available in the global brain.

Artificial intelligence is evolving into a specific role in that global brain: a role which requires only simple association-like intelligence, fueled by huge amounts of data. Humans provide the logically structured thoughts, the prior bias, the recursive generalizations; that's a niche which machines are not currently required to fill. At the present, this trend only seems to be increasing.

Should we give up structured AI?

I don't think so. We can forge a niche. We can climb the hierarchy. But it's not where the money is right now... and it may not be for some time.

1: Cthulhu will eat your face.

Monday, December 9, 2013

History of Distributed Representations

Commenting on the previous post, a friend pointed out that "distributed representations" are not so new. I thought I would take a look at the history to clarify the situation.

In a very broad sense, I was discussing the technique of putting a potentially nonlinear problem into a linear vector space. This vague idea matches to many techniques in machine learning. A number of well-developed algorithms take advantage of linearity assumptions, including PCA, logistic regression, and SVM.1 A common approach to machine learning is to find a number of features, which are just functions of your data, and use one of these techniques on the features (hoping they are close enough to linear). Another common technique, the kernel trick, projects features into a higher-dimensional space where the linearity assumption is more likely to get good results. Either way, a large part of the work to get good results is "feature engineering": choosing how to represent the data as a set of features to feed into the machine learning algorithm.

We could even argue that probability theory itself is an example: probabilities are always linear, no matter how nonlinear the underlying problem being described. (The probability of an event is the sum of the ways it could happen.) This gives us nice results; for example, there is always a Nash equilibrium for a game if we allow probabilistic strategies. This is not the case if we only consider "pure" strategies.

This theme is interesting to me, but, I was trying to be much more narrow in talking about recent developments in distributed representations. Like feature-based machine learning, a distributed representation will put data into a vector space to make it easier to work with. Unlike approaches relying on feature engineering, there is an emphasis on figuring out how to get the representations to "build themselves", often starting with randomly assigned vector representations.

The beginning of this kind of approach is probably latent semantic analysis (LSA), which is from 1988. LSA assigns 'semantic vectors' to words based on statistical analysis of the contexts those words occur in, based on the idea that words with similar meaning will have very similar statistics.

Given how old this technique is, the excitement around Google's release of the word2vec tool is striking. Reports spun it as deep learning for the masses. Deep learning is a much more recent wave of development. I think the term has lost much of its meaning in becoming a buzzword.2 Calling word2vec "deep" takes this to farcical levels: the techniques of word2vec improve previous models by removing the hidden layer from the network. Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result.

One of the exciting things about word2vec is the good results with solving word analogies by vector math. The result of vector computations like France - Paris and Russia - Moscow are very similar, meaning we can approximately find the vector for a capital given the vector for the corresponding nation. The same trick works for a range of word relationships.

However, I've talked with people who had the incorrect impression that this is a new idea. I'm not sure exactly how old it is, but I've heard the idea mentioned before, and I did find a reference from 2004 which appears to use LSA to do the same basic thing. (I can't see the whole article on google books...)

One thing which I thought was really new was the emerging theme of combining vectors to form representations of compound entities. This, too, is quite old. I found a paper from 1994, which cites harder-to-find papers from 1993, 1990, and 1989 that also developed techniques to combine vectors to create representations of compound objects. Recent developments seem much more useful, but, the basic idea is present.

So, all told, it's a fairly long-standing area which has seen large improvements in the actual techniques employed, but, whose central ideas were laid out (in one form or another) over 20 years ago.

1: By the way, don't get too hung up about what makes one machine learning technique "linear" and another "nonlinear". This is a false dichotomy. What I really mean is that a technique works in a vector space (which more or less means a space where + is defined and behaves very much like we expect), and relies "largely" on linear operations in this space. What does "linear" actually mean? A function F is linear if and only if F(x+y) = F(x) + F(y) and for scalar a, F(ax) = aF(x). PCA, for example, is justified by minimizing a squared error (a common theme), where the error is based on euclidean distance, a linear operation. Notice that taking the square isn't linear, but PCA is still thought of as a linear approach.

2: Deep learning has come to mean almost any multi-layer neural network. The term caught on with the success related to Deep Belief Networks, which proposed specific new techniques. Things currently being called "deep learning" often have little in common with this. I feel the term has been watered down by people looking to associate their work with the success of others. This isn't all bad. The work on multi-layered networks seems to have produced real progress in reducing or eliminating the need for feature engineering.

Tuesday, December 3, 2013

Distributed Representations

Distributed vector representations are a set of techniques which take a domain (usually, words) and embed it into a linear space (representing each word as a large vector of numbers). Useful tasks can then be represented as manipulations of these embedded representations. The embedding can be created in a variety of ways; often, it is learned by optimizing task performance. SENNA demonstrated that representations learned for one task are often useful for others.

There are so many interesting advances being made in distributed vector representations, it seems that a nice toolset is emerging which will soon be considered a basic part of machine intelligence.

Google's word2vec assigns distributed vector representations to individual words and a few short phrases. These representations have been shown to give intuitively reasonable results on analogy tasks with simple vector math: king - man + woman is approximately equal to the vector for queen, for example. This is despite not being explicitly optimized for that task, again showing that these representations tend to be useful for a wide range of tasks.

Similar approaches have aided machine translation tasks by turning word translation into a linear transform from one vector space to another.

One limitation of this approach is that we cannot do much to represent sentences. Sequences of words can be given somewhat useful representations by adding together the individual word representations, but this approach is limited.

Socher's RNN learns a matrix transform to compose two elements together and give them a score, which is then used for greedy parsing by composing together the highest-scoring items, with great success. This gives us useful vector representations for phrases and sentences.

Another approach which has been suggested is circular convolution. This combines vectors in a way which captures ordering information, unlike addition or multiplication. Impressively, the technique has solved Raven progressive matrix problems:

Then there's a project, COMPOSES, which seeks to create a language representation in which nouns get vector representations and other parts of speech get matrix representations (and possibly tensor representations?).

I haven't looked into the details fully, but conceptually it makes sense: the parts of speech which intuitively represent modifiers are linear functions, while the parts of speech which are intuitively static objects are getting operated on by these functions.

The following paper gives a related approach:

Here, everything is represented as a matrix of the same size. Representing the objects as functions is somewhat limiting, but the uniform representation makes it easy to jump to higher-level functions (modifiers on modifiers) without adding anything. This seems to have the potential to enable a surprisingly wide range of reasoning capabilities, given the narrow representation.

As the authors of that last paper mention, the approach can only support reasoning of a "memorized" sort. There is no mechanism which would allow chained logical inferences to answer questions. This seems like a good characterization of the general limitations of the broader set of techniques. The distributed representation of a word, phrase, image, or other object is a static encoding which represents, in some sense, a classification of the object into a fuzzy categorization system we've learned. How can we push the boundary here, allowing for a more complex reasoning? Can these vector representations be integrated into a more generally capable probabilistic logic system?

Sunday, August 11, 2013

Progress in Logical Priors

It's been a while since I've posted here. I've been having a lot of ideas, but I suppose the phd student life makes me focus more on implementing than on speculating (which is a good thing).

I presented my first sole-authored paper (based on this blog post) at the AGI conference in December of last year, and it was cited by Marcus Hutter in an interesting paper about approximating universal intelligence (which was presented at this year's AGI conference, which was once again a summer conference, so already took place).

When I set out to write the paper, my main goal was to show that we gain something by representing beliefs as something like logical statements, rather than as something like programs. This allows our beliefs to decompose more easily, readily allows inference in "any direction" (whereas programs are naturally executed in one direction, producing specific results in a specific order), and also allowing incomputable hypotheses to be dealt with in a partial way (dealing somewhat more gracefully with the possibility by explicitly representing it in the hypothesis class, but incompletely).

My desire to put forward this thesis was partly out of an annoyance with people invoking the Curry-Howard isomorphism all-too-often, to claim that logic and computation are really one and the same. I still think this is misguided, and not what the curry-howard isomorphism really says when you get down to it. The "programs are proofs" motto is misleading. There is no consensus on how to deal with Turing-complete programs in this way; turing-complete programming languages seem to correspond to trivial logics where you can prove anything from anything!*

Annoyed, I wanted to show that there was a material difference between the two ways of representing knowledge.

As I wrote the paper and got feedback from colleagues, it became clear that I was fighting a losing fight for that thesis: although the first-order prior represented a new mathematical object with interesting features along the lines I was advocating, it would be possible to write a somewhat program-like representation with the same features. I would still argue each of the advantages I mentioned, and still argue against naive invocations of Curry-Howard, but I was trying to make these arguments too strong, and it wasn't working. In any case, this was a point that didn't need to be made in order for the paper to be interesting, for two reasons:

  1. If desired, you could re-do everything in a more computational way. It would still be a new, interesting distribution with features similar to but different from the Solomonoff distribution.
  2. A universal distribution over logic, done right, is interesting even if it had turned out to be somehow equivalent to the Solomonoff distribution.

So, all told, I downplayed the "logic is different from computation" side of the paper, and tried to focus more on the prior itself.

After submitting the paper, I went back to working on other things. Although I still thought about logical priors every so often, I didn't make very much conceptual progress for a while.

At the July MIRI workshop, I got the opportunity to spend time on the topic again, with other smart folks. We spend roughly a day going over the paper, and then discussed how to take things further.

The main problem with the first-order prior is that the probability of a universal statement does not approach 1 as we see more and more examples. This is because all the examples in the world will still be consistent with the statement "there exists a counterexample"; so, if we are randomly sampling sentences to compose a logical theory, the probability that we add that sentences doesn't drop below a certain minimum.

So, for example, if we are observing facts about the natural numbers, we will not converge to arbitrarily high probability for generalizations of these facts. To make it more concrete, we cannot arrive at arbitrarily high probabilities for the Goldbach conjecture by observing more and more examples of even numbers being written as the sum of two primes.

This isn't a bad thing in all cases. Holding back some fixed probability for the existence of a counterexample matches with the semantics of first-order logic, which is not supposed to be able to rule out omega-inconsistent theories. (Omega inconsistency is the situation where we deny a universal statement while simultaneously believing all the examples.)

For some domains, though, we really do want to rule out omega-inconsistency; the natural numbers are one of these cases. The reason the first-order prior allows some probability for omega-inconsistent possibilities is that first-order logic is unable to express the fact that natural numbers correspond exactly to the finite ones. ("Finite" cannot be properly characterized in first-order logic.) More expressive logics, such as second-order logic, can make this kind of assertion; so, we might hope to specify reasonable probability distributions over those logics which have the desired behavior.

Unfortunately, it is not difficult to show that the desired behavior is not approximable. If the probability of universal statements approaches 1 as we observe increasingly many examples, then it must equal 1 if we believe all the examples. Let's take an example. If we believe all the axioms of peano arithmetic, then we may be able to prove all the examples of the Goldbach conjecture. In fact, we end up believing all true Pi_1 statements in the arithmetic hierarchy. But this implies that we believe all true Sigma_2 statements, if our beliefs are closed under implication. This in turn means that we believe all the examples of the Pi_3 universal statements, which means we must believe the true Pi_3 with probability 1, since we supposed that we believe universal statements if we believe their examples. And so on. This process can be used to argue that we must believe the true statements on every level of the hierarchy.

Since the hierarchy transcends every level of hypercomputation, there can be no hope of a convergent approximation for it. So, convergence of universal statements to probability 1 as we see more examples is (very) uncomputable. This may seem a bit surprising, given the naturalness of the idea.

Marcus Hutter has discussed distributions like this, and argues that it's OK: this kind of distribution doesn't try to capture our uncertainty about logically undecidable statements. Instead, his probability distribution represents the strong inductive power that we could have if we could infallibly arrive at correct mathematical beliefs.

Personally, though, I am much more interested in approximable distributions, and approaches which do try to represent the kind of uncertainty we have about undecidable mathematical statements.

My idea has been that we can get something interesting by requiring convergence on the Pi_1 statements only.

One motivation for this is that Pi_1 convergence guarantees that a logical probability distribution will eventually recognize the consistency of any axiomatic system, which sort-of gets around the 2nd incompleteness theorem: an AI based on this kind of distribution would eventually recognize that any axioms you give it to start with are consistent, which would allow it to gradually increase its logical strength as it came to recognize more mathematical truth. This plausibly seems like a step in the direction of self-trusting AI, one of the goals of MIRI.

The immediate objection to this is that the system still won't trust itself, because it is not a set of axioms, but rather, is a convergent approximation of a probability distribution. Convergence facts are higher up in the arithmetic hierarchy, which suggests that the system won't be able to trust itself even if it does become able to (eventually) trust axiomatic systems.

This intuition turns out to be wrong! There is a weak sense in which Pi_1 convergence implies self-trust. Correctness for Pi_1 implies that we believe the true Sigma_2 statements, which are statements of the form "There exists x such that for all y, R(x,y)" where R is some primitive recursive relation. Take R to be "y is greater than x, and at time y in the approximation process, our probability of statement S is greater than c." (The arithmetic hierarchy can discuss the probability approximation process through a godel-encoding.) The relevant Sigma_2 statements place lower bounds on the limiting probabilities from our probability approximation. We can state upper bounds in a similar way.

This shows that a probability distribution which has Pi_1 convergence will obey something strikingly like the probabilistic reflection principle which came out of a previous MIRI workshop. If its probabilities fall within specific bounds, it will believe that (but the converse, that if it believes they fall within specific bounds, they do, does not hold). This gives such a probability distribution a significant amount of self-knowledge.

So, Pi_1 convergence looks like a nice thing to have. But is it?

During the MIRI workshop, Will Sawin proved that this leads to bad (possibly unacceptable) results: any logically coherent, approximable probability distribution over statements in arithmetic which assigns probability 1 to true pi_1 statements will assign probability 0 to some true pi_2 statements. This seems like a rather severe error; the whole purpose of using probabilities to represent uncertainty about mathematical truth would be to allow "soft failure", where we don't have complete mathematical knowledge, but can assign reasonable probabilities so as to be less than completely in the dark. This theorem shows that we get hard failures if we try for pi_1 convergence.

How concerned should we be? Some of the "hard failures" here correspond to the necessary failures in probabilistic reflection. These actually seem quite tolerable. There could be a lot more errors than that, though.

One fruitful idea might be to weaken the coherence requirement. The usual argument for coherence  is the dutch book argument; but this makes the assumption that bets will pay off, which does not apply here, since we may never face the truth or falsehood of certain mathematical statements. Intuitionistic probability comes out of a variation of the dutch book argument for the case when bets not paying off at all is a possible outcome. This does not require that probabilities sum to 1, which means we can have a gap between the probability of X and the probability of not-X.

An extreme version of this was proposed by Marcello Herreschoff at the MIRI workshop; he suggested that we can get Pi_1 convergence by only sampling Pi_1 statements. This gets what we want, but results in probability gaps at higher levels in the hierarchy; it's possible that a sampled theory will never prove or disprove some complicated statements. (This is similar to the intuitionistic probability idea, but doesn't actually satisfy the intuitionistic coherence requirements. I haven't worked this out, though, so take what I'm saying with a grain of salt.)

We may even preserve some degree of probabilistic reflection this way, since the true Pi_1 still imply the true Sigma_2.

That particular approach seems rather extreme; perhaps too limiting. The general idea, though, may be promising: we may be able to get the advantages of Pi_1 convergence without the disadvantages.

*(Source: Last paragraph of this section on wikipedia.)