Friday, July 18, 2014

Losing Faith in Factor Graphs

In my post Beliefs, I split AGI into two broad problems:


  1. What is the space of possible beliefs?
  2. How do beliefs interact?
The idea behind #1 was: can we formulate a knowledge representation which could in principle express any concept a human can conceive of?

#2 represents the more practical concern: how do we implement inference over these beliefs?

(This was needlessly narrow. I could at least have included something like: 3. How do beliefs conspire to produce actions? That's more like what my posts on procedural logic discuss. 1 and 2 alone don't exactly get you AGI. Nonetheless, these are central questions for me.)

Factor graphs are a fairly general representation for networks of probabilistic beliefs, subsuming Bayesian networks, Markov networks, and most other graphical models. Like those two, factor graphs are only as powerful as propositional logic, but they can be a useful tool for representing more powerful belief structures as well. In other words, the factor graph "toolbox" includes some useful solutions to #2 which we may be able to apply to whatever solution for #1 we come up with. When I started graduate school at USC three years ago, I was basically on board with this direction. That's the direction taken by Sigma, the cognitive architecture effort I joined.

I've had several shifts of opinion since that time.

I. Learning is Inference

My first major shift (as I recall) was to give up on the idea of a uniform inference technique for both learning models and applying them ("induction" vs "deduction"). In principle, Bayes' Law tells us that learning is just probabilistic inference. In practice, unless you're using Monte Carlo methods, implementations tend to have quite different algorithms for the two cases. There was a time when I hoped that parameter learning could be implemented via belief propagation in factor graphs, properly understood: we just need to find the appropriate way to structure the factor graph such that parameters are explicitly represented as variables we reason about.
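
To make the "learning is inference" view concrete, here is a minimal worked sketch (a conjugate coin-flip model, not the belief-propagation formulation discussed above): treating the parameter as just one more variable, Bayes' Law hands us the posterior directly.

```python
import numpy as np

# Minimal "learning is just inference" sketch: a coin with unknown bias theta and a
# Beta(a, b) prior. Bayes' Law gives the posterior over theta in closed form; learning
# the parameter is nothing more than updating a belief about one more variable.

def posterior_over_theta(a, b, flips):
    """Return the updated Beta(a, b) parameters after observing a list of 0/1 flips."""
    heads = sum(flips)
    return a + heads, b + (len(flips) - heads)

a, b = posterior_over_theta(1.0, 1.0, [1, 1, 0, 1])   # uniform prior, then 3 heads / 1 tail
theta = np.linspace(0.01, 0.99, 7)
unnormalized_posterior = theta ** (a - 1) * (1 - theta) ** (b - 1)
print(a, b)                      # (4.0, 2.0)
print(unnormalized_posterior)    # mass shifts toward higher values of theta
```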

It's technically possible, but not really such a good idea. One reason is that we don't usually care about the space of possible parameter settings, so long as we find one combination which predicts data well. There is no need to keep the spread of possibilities except insofar as it facilitates making good predictions. On the other hand, we do usually care about maintaining a spread of possible values for the latent variables within the model itself, precisely because this does tend to help. (The distinction can be ambiguous! In principle there might not be a clean line between latent variables, model parameters, and even the so-called hyperparameters. In practice, though, the distinction is quite clear, and that's the point here.)

Instead, I ended up working on a perfectly normal gradient-descent learning algorithm for Sigma. Either you're already familiar with this term, or you've got a lot to learn about machine learning; so, I won't try to go into details. The main point is that this is a totally non-probabilistic method, which only entertains a single parameter setting at any given time, but adjusts it in response to data.
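
For readers who do want a picture of what gradient descent looks like, here is a minimal sketch (a toy linear model with squared error, not the Sigma implementation): one parameter vector, nudged against the gradient on each pass over the data.

```python
import numpy as np

# Minimal gradient-descent sketch (toy problem, not the Sigma algorithm): keep a single
# weight vector w and nudge it against the gradient of squared error at every step.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # toy inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                        # one parameter setting at a time, no distribution over w
lr = 0.05
for step in range(200):
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)   # gradient of mean squared error w.r.t. w
    w -= lr * grad                         # adjust the parameters in response to data

print(w)  # should land near true_w
```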

The next big shift in my thinking has been less easily summarized, and has been taking place over a longer period of time. Last year, I wrote a post attempting to think about it from a perspective of "local" vs "global" methods. At that time, my skepticism about whether factor graphs provide a good foundation to start with was already strong, but I don't think I articulated it very well.

II. Inference is Approximate

Inference in factor graphs is exponential time in the general case, which means that (unless the factor graph has a simple form) we need to use faster approximate inference.

This makes perfect sense if we assume that "inference" is a catch-all term referring to the way beliefs move around in the space of possible beliefs. We should expect exact inference over the whole belief space to be impossible.

If we concede that "inference" is distinct from "learning", though, we have a different situation. What's the point in learning a model that is so difficult to apply? If we are going to go ahead and approximate it with a simpler function, doesn't it make sense to learn the simpler function in the first place?

This is part of the philosophy behind sum-product networks (SPNs). Like neural networks, SPNs always support linear-time inference: you basically just have to propagate function values. Unlike neural networks, everything is totally probabilistic. This may be important for several reasons; it means we can always do reasoning on partial data (filling in the missing parts using the probabilistic model), and reason "in any direction" thanks to Bayes' Law (whereas neural networks tend to define a one-direction input-output relationship, making reasoning in the reverse direction difficult).
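
Here is a minimal sketch of what that looks like (the network structure is invented purely for illustration): leaves are indicators, internal nodes are weighted sums or products, and a single bottom-up pass evaluates the whole thing, so the cost is linear in the network's size. Leaving a variable out of the evidence marginalizes it automatically.

```python
# A tiny SPN sketch (structure made up for illustration): leaves are indicators,
# internal nodes are weighted sums (mixtures) or products over disjoint variables.
# One bottom-up pass evaluates the network, so inference is linear in its size.

def leaf(var, value):
    # Indicator leaf: 1 if the evidence agrees (or the variable is unobserved), else 0.
    return lambda ev: 1.0 if ev.get(var, value) == value else 0.0

def product(*children):
    def node(ev):
        out = 1.0
        for c in children:
            out *= c(ev)
        return out
    return node

def weighted_sum(pairs):
    # pairs: list of (weight, child); the weights sum to 1.
    return lambda ev: sum(w * c(ev) for w, c in pairs)

# P(A, B) as a mixture of two fully factored components.
spn = weighted_sum([
    (0.6, product(leaf("A", 1), leaf("B", 1))),
    (0.4, product(leaf("A", 0), leaf("B", 1))),
])

print(spn({"A": 1, "B": 1}))  # joint probability of the full evidence: 0.6
print(spn({"B": 1}))          # A left unobserved, so it is marginalized out: 1.0
```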

Why are SPNs so efficient?

III. Models are Factored Representations

The fundamental idea behind factor graphs is that we can build up complicated probability distributions by multiplying together simple pieces. Multiplication acts like a conjunction operation, putting probabilistic knowledge together. Suppose that we know a joint distribution connecting X with Y: P(X, Y). Suppose that we also know a conditional distribution, defining a probability distribution on Z for any given value of X: P(Z|X). If we assume that Z and Y are independent given X, we can obtain the joint distribution across all three variables by multiplying the two probability functions together: P(X,Y,Z) = P(X,Y)P(Z|X).
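
A quick numerical check of that identity, with made-up binary tables:

```python
import numpy as np

# Worked check of the product rule in the text, with invented binary tables:
# if Z is independent of Y given X, then P(X,Y,Z) = P(X,Y) * P(Z|X).

P_xy = np.array([[0.3, 0.2],            # rows index X, columns index Y; sums to 1
                 [0.1, 0.4]])
P_z_given_x = np.array([[0.7, 0.3],     # rows index X, columns index Z; each row sums to 1
                        [0.2, 0.8]])

# Broadcast the two factors into a joint table indexed by (X, Y, Z).
P_xyz = P_xy[:, :, None] * P_z_given_x[:, None, :]

print(P_xyz.sum())                           # 1.0: still a proper distribution
print(np.allclose(P_xyz.sum(axis=2), P_xy))  # True: marginalizing Z recovers P(X,Y)
```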

A different way of looking at this is that it allows us to create probabilistic constraints connecting variables. A factor graph is essentially a network of these soft constraints. We would expect this to be a powerful representation, because we already know that constraints are a powerful representation for non-probabilistic systems. We would also expect inference to be very difficult, though, because solving systems of constraints is hard.

SPNs allow multiplication of distributions, but only when it does not introduce dependencies between distributions which must be "solved" in any way. To oversimplify just a smidge, we can only multiply in the case of total independence: P(X)P(Y). We are not allowed to use P(X|Y)P(Y), because if we start allowing those kinds of cross-distribution dependencies, things get tangled and inference becomes hard.

To supplement this reduced ability to multiply, we add in the ability to add. We compose complicated distributions as a series of sums and products of simpler distributions. (Hence the name.) Despite the simplifying requirements on the products, this turns out to be a rather powerful representation.

Just as we can think of a product as conjunctive, imposing a series of constraints on a system, we can think of a sum as being disjunctive, building up a probability distribution by enumeration of possibilities. This should bring to mind mixture models and clustering.

Reasoning based on enumeration is faster because, in a sense, it's just the already-solved version of the constraint problem: you have to explicitly list the set of possibilities, as opposed to starting with all possibilities and listing constraints to narrow them down.

Yet, it's also more powerful in some cases. It turns out that SPNs are much better at representing probabilistic parsing than factor graphs are. Parsing is a technique which is essential in natural language processing, but it's also been used for other purposes; image parsing has been a thing for a long time, and I think it's a good way of trying to get a handle on a more intricately structured model of images and other data. A parse is elegantly represented via a sum of possibilities. It can be represented with constraints, and this approach has been successfully used. Those applications require special-purpose optimizations to avoid the exponential time inference associated with factor graphs and constraint networks, though.

The realization that factor graphs aren't very good at representing this is what really broke my resolve. It indicated to me that factor graphs really were missing a critical representational capability: the ability to enumerate possibilities.

Grammar-like theories of general learning have been an old obsession of mine, which I had naively assumed could be handled well within the factor-graph world.

My new view suggested that inference and learning should both be a combination of sum-reasoning and product-reasoning. Sum-learning includes methods like clustering, boosting, and bagging: we learn enumerative models. Reasoning with these is quite fast. Product-learning splits reality into parts which can be modeled separately and then combined. These two learning steps interact. Through this process, we create inherently fast, grammar-like models of the world.

IV. Everything is Probabilistic

Around the same time I was coming to these conclusions, the wider research community was starting to get excited about the new distributed representations. My shift in thinking was taking me toward grammars, so I was quite excited about Richard Socher's RNN (recursive neural network) representation. This demonstrated using one fairly simple algorithm for both language parsing and image parsing, producing state-of-the-art results along with a learned representation that could be quite useful for other tasks. RNNs have continued to produce impressive results; in fact, I would go so far as to say that they are producing results which look like precisely what we want to see out of a nascent AGI technology. These methods produce cross-domain structured generalizations powerful enough to classify previously-unseen objects based on knowledge obtained by reading, which (as I said in my previous post) seems quite encouraging. Many other intriguing results have been published as well.

Unfortunately, it's not clear how to fit RNNs in with more directly probabilistic models. The vector representations at the heart of RNNs could be re-conceived as restricted Boltzmann machines (RBMs) or another similar probabilistic model, giving a distributed representation with a fully probabilistic semantics (and taking advantage of the progress in deep belief networks). However, this contradicts the conclusion of section II: an RBM is a complex probabilistic model which must be approximated. Didn't I just say that we should do away with overly-complex models if we know we'll just be approximating them?

Carrying forward the momentum from the previous section, it might be tempting to abandon probabilistic methods entirely, in favor of the new neural approaches. SPNs restrict the form of probabilistic models to ensure fast inference. But why accept these restrictions? Neural networks are always fast, and they don't put up barriers against certain sorts of complexity in models.

The objective functions for the neural models are still (often) probabilistic. A neural network can be trained to output a probability. We do not need everything inside the network to be a probability. We do lose something in this approach: it's harder to reverse a function (reasoning backwards via Bayes' Law) and perform other probabilistic manipulations. However, there may be solutions to these problems (such as training a network to invert another network).
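
As a concrete (toy) illustration of that point, here is a minimal sketch of a network whose hidden units are not probabilities at all, but whose output is squashed into a probability and trained on a cross-entropy (negative log-likelihood) objective. Everything here is invented for illustration.

```python
import numpy as np

# Sketch of "probabilistic objective, non-probabilistic insides": a tiny one-hidden-layer
# network whose output is a probability trained by negative log-likelihood, while the
# hidden units carry no probabilistic meaning at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy binary labels

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
w2 = rng.normal(scale=0.5, size=8);      b2 = 0.0

def forward(X):
    h = np.tanh(X @ W1 + b1)                 # hidden units: not probabilities
    p = 1 / (1 + np.exp(-(h @ w2 + b2)))     # output: a probability
    return h, p

for _ in range(300):
    h, p = forward(X)
    d_out = (p - y) / len(y)                 # gradient of mean cross-entropy at the output
    d_h = np.outer(d_out, w2) * (1 - h ** 2) # backpropagate through tanh
    w2 -= 1.0 * (h.T @ d_out); b2 -= 1.0 * d_out.sum()
    W1 -= 1.0 * (X.T @ d_h);   b1 -= 1.0 * d_h.sum(axis=0)

_, p = forward(X)
print(((p > 0.5) == y).mean())               # training accuracy of the learned probabilities
```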

V. Further Thought Needed

These neural models are impressive, and it seems as if a great deal could be achieved by extending what's been done and putting those pieces together into one multi-domain knowledge system. However, this could never yield AGI as it stands: as this paper notes (Section 6), vector representations do not perform any logical deduction to answer questions; rather, answers are baked in during learning. These systems can often correctly answer totally new questions which they have not been trained on, but that is because the memorization of the other answers forced the vectors into the right "shape" to make the correct answers evident. While this technique is powerful, it can't capture aspects of intelligence which require thinking.

Similarly, it's not possible for all probabilistic models to be tractable. SPNs may be a powerful tool for creating fast probabilistic models, but the restrictions prevent us from modeling everything. Intelligence requires some inference to be difficult! So, we need a notion of difficult inference!

It seems like there is a place for both shallow and deep methods; unstructured and highly structured models. Fitting all these pieces together is the challenge.

Tuesday, July 15, 2014

More Distributed Vectors

Since my last post on distributed vector representations, interest in this area has continued to spread across the research community. This exposition on Colah's blog is quite good, although it unfortunately perpetuates the confusing view that distributed representations are necessarily "deep learning". (In fact, as far as I can see, the trend is in the opposite direction: you can do better by using simpler networks so that you can train faster and scale up to larger datasets. This reflects a very general trend to which deep networks seem to be a rare exception.)

The story Colah tells is quite exciting. Vector representations (AKA "word embeddings") are able to perform well on a variety of tasks which they have not been trained on at all; they seem to "magically" encode general knowledge about language after being trained on just one language task. The form of this knowledge makes it easier to translate between languages, too, because the relationship structure between concepts is similar in the two languages. This even extends to image classification: there have been several successes with so-called zero-shot learning, where a system is able to correctly classify images even when it's never seen examples of those classes before, thanks to the general world knowledge provided by distributed word representations.

For example, it's possible to recognize a cat having only seen dogs, but having read about both dogs and cats.

(This seems rather encouraging!)

Colah mentions that while encoding has been very successful, there is a corresponding decoding problem which seems to be much harder. One paper is mentioned as a hopeful direction for solving this. Colah is talking about trying to decode representations coming out of RNNs, a representation I'm quite fond of because it gives (in some sense) a semantic parse of a sentence. However, another option which I'd like to see tried would be to decode representations based on the Paragraph Vector algorithm. This looks easier, and besides, Paragraph Vector actually got better results for sentiment analysis (one of the key domains where RNNs provide a natural solution). Again, we can point to the general trend of AI toward simpler models. RNNs are a way of combining semantic vectors with probabilistic context-free grammars; Paragraph Vector combines semantic vectors with a Markov model. Markov models are simpler and less powerful; therefore, by the contrarian logic of the field, we expect them to do better. And, they do.

All of this makes distributed vectors sound quite promising as a general-purpose representation of concepts within an AI system. In addition to aiding bilingual translation and image-word correspondence problems, RNNs have also been applied to predict links in common-sense knowledge bases, showing that the same model can also be applied to understand information presented in a more logical form (and perhaps to form a bridge between logical representations and natural language). I imagine that each additional task which the vectors are used on can add more implicit knowledge to the vector structure, further increasing the number of "zero-shot" generalizations it may get correct in future tasks. This makes me envision a highly general system which accumulates knowledge in vectors over the course of its existence, achieving lifelong learning as opposed to being re-trained on each task. Vector representations by themselves are obviously not sufficient for AGI (for example, there's no model of problem solving), but they could be a very useful tool within an AGI system.

I mentioned in a previous post that the idea of distributed vectors isn't really that new. One older type of word embedding is latent semantic analysis (LSA). Some researchers from the LSA tradition (Marco Baroni, Georgiana Dinu, & Germán Kruszewski) got annoyed at the recent hype, and decided that LSA-style embedding was not getting a fair trial; the new word embeddings had not been systematically compared to the older methods. This paper is the result. The verdict: the new stuff really is better!

When it's coming from people who openly admit that they hoped to prove the opposite, it sounds rather convincing. However, I do have some reservations. The title of their paper is: Don't count, predict! The authors refer to the older LSA-like methods as "count vectors", and newer methods as "predict vectors". The LSA-like methods rely on some transformation of raw co-occurrence counts, whereas the new neural methods train to predict something, such as predicting the current word given several previous words. (For example, LSA uses the singular value decomposition to make a low-rank approximation of tf-idf counts. Paragraph Vector trains to find a latent variable representing the paragraph topic, and from this and several previous words, predict each next word in the paragraph.)
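
To make the "count" side of that distinction concrete, here is a minimal LSA-flavored sketch (toy corpus, with a log reweighting standing in for tf-idf or PPMI): count co-occurrences, reweight, and take a truncated SVD.

```python
import numpy as np

# Sketch of a count-based ("don't predict") embedding: co-occurrence counts,
# a simple reweighting, then a low-rank SVD. Corpus and window size are toy choices.
corpus = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

weighted = np.log1p(counts)                  # stand-in for tf-idf / PPMI reweighting
U, S, Vt = np.linalg.svd(weighted)
dim = 5
word_vectors = U[:, :dim] * S[:dim]          # low-rank "count vectors"
print(word_vectors[idx["cat"]])
```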

As I said: they are assuming the count vs predict distinction is what explains the different performance of the two options. The main experiment of the paper, an extensive comparison of word2vec (representing the prediction-based techniques) vs DISSECT (representing the counting-based techniques) strongly supports this position. Two other techniques are also tried, though: SENNA and DM. Both of these did worse than word2vec, but the relative comparison of the two is much muddier than the comparison between DISSECT and word2vec. This weakens the conclusion somewhat. Are we expected to believe that new counting-based models will continue to be worse than new prediction-driven models?

The conclusion would make sense if counting models had reached a state of near-perfection, with only small improvements left to be found. Prediction-based vector representations, by contrast, appear to still be in their early days, with large unexplored areas and presumably significant room for improvement. If the authors believe the same cannot be said for counting-based representations, then the conclusion makes sense.

However, work I'm involved with may undermine this conclusion.

I've been doing distributed vector stuff with Volkan Ustun and others here at ICT. Our forthcoming paper has some suggestive evidence, showing performance very close to Google's word2vec when trained with similar vector size and amount of data. We are not doing any neural network training. Instead, we are creating representations by summing together initial random representations for each word. This seems to fall firmly into count-based methods.
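
To give a flavor of the general idea (this is only a generic sketch of the summation scheme, not the actual algorithm in our paper): give every word a fixed random vector, and represent each word by the sum of the random vectors of the words it co-occurs with. No gradients, no training in the usual sense.

```python
import numpy as np

# Generic random-summation sketch (not the algorithm from our forthcoming paper):
# fixed random vectors per word, context representations built purely by summing.
rng = np.random.default_rng(2)
corpus = [s.split() for s in ["the cat sat on the mat", "the dog sat on the log"]]
vocab = sorted({w for sent in corpus for w in sent})
dim = 50
random_vecs = {w: rng.normal(size=dim) for w in vocab}     # fixed, never updated
context_vecs = {w: np.zeros(dim) for w in vocab}

window = 2
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                context_vecs[w] += random_vecs[sent[j]]    # just sum, no learning

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print(cosine(context_vecs["cat"], context_vecs["dog"]))    # similar contexts => similar vectors
```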

The incredible thing about Volkan's technique is that it's essentially the first thing we thought of trying; what's going on is much simpler than what happens in word2vec. Yet, we seem to be getting similar results. (We should run a more direct comparison to be sure of what's going on.) If this is the case, it directly contradicts the conclusion of Don't Count, Predict!.

In any case, distributed vectors continue to offer a surprising amount of generality, and have some promise as a cross-task, cross-language, cross-modality unified representation.

Friday, February 7, 2014

DeepMind Papers

There has been a lot of talk about Google's acquisition of DeepMind. I won't try to review the facts for those who don't know about it-- there's got to be about 100 roughly identical news articles you can find. What those news articles only rarely include is a link to DeepMind's research publications; and when they do link one, as far as I've seen, it's only the paper on playing Atari games using model-free RL. (Maybe it's the most public-friendly.) Before the news broke, I didn't even know that DeepMind was putting out any information on what they were doing; their very blank company website made me think they were being hush-hush.

So. For the curious, here are all the papers I could find which include at least one author @deepmind.com:

Deep Autoregressive Networks

Playing Atari with Deep Reinforcement Learning

Neural Variational Inference and Learning in Belief Networks

Unit Tests for Stochastic Optimization

An Approximation of the Universal Intelligence Measure

Stochastic Back-Propagation and Variational Inference in Deep Latent Gaussian Models

Unsupervised Feature Learning by Deep Sparse Coding

Deterministic Policy Gradient Algorithms

Bayesian Hierarchical Community Discovery

Learning Word Embeddings Efficiently with Noise-Contrastive Estimation

Monday, January 20, 2014

Math vs Logic

Math provides a series of games, which can be usefully applied to reality when their rules closely mirror those of some real system.

It would seem that math started out as just another part of language. Language itself has been described as a game: a useful set of rules which we follow in order to get things done.

Eventually, math developed into a sort of sub-language or sub-game with a clearly independent set of rules. The objects of mathematical language were much different from the typical objects of everyday language, being more abstract while simultaneously being atypically precise, with very definite behavior. This "definite behavior" constitutes the rules of the game. The Pythagorean cult nurtured the idea of formal derivations from axioms or postulates.

(Around this time, Plato's idea of a separate pure mathematical reality started to seem plausible to some folks.)

Notice, however, that math still seemed like a single game. Euclid's Elements provided a wonderfully unified world of mathematics, in which number theory was considered as a geometric topic.

I think it wasn't until the 1800s that it started to become clear that we want to view math as a rich set of different possible games instead. Non-Euclidean geometry was discovered, and mathematicians started to explore variant geometries. The use of imaginary numbers became widely accepted due to the work of Gauss (and Euler in the previous century), and the quaternions were discovered. Group theory was being applied and developed.

Once we have this view, math becomes an exploration of all the possible systems we can define.

Within the same century, logic was gaining teeth and starting to look like a plausible foundation for all mathematical reasoning. It would be overstating things to claim that logic had not developed within the previous two thousand years; however, the developments of that century would overshadow everything that came before.

In using the number two thousand, I refer to Aristotle, who laid out the first formal system of logic around 350 BC. Aristotle provided a deep theory, but it was far from enough to account for all correct arguments. Euclid's Elements (written just 50 years later) may have used exceedingly rigorous arguments (a bright light in the history of mathematics), but those arguments were still informal in the sense that there was no formal system of logic justifying every step. Instead, the reader must see that the steps are justified by force of reason. Aristotle's logic was simply too weak to do the job.

It was not until Frege, publishing in the late 1800s, that this became a possibility.

Frege attempted to set out a system of logic in which all of math could be formalized, and he went a long way toward achieving this. He invented modern logic. In the hands of others, it would become the language in which all of mathematics could be set out.

So, we see that logic steps in to provide a sort of super-game in which all the various games of mathematics can be played. It does so just as a unified picture of mathematics as a single game is crumbling.

My point is to answer a question which comes up now and then: what is the dividing line between math and logic? Is logic a sub-field of math, or is it the other way around? The reality is complex. Logic and math are clearly distinct: logic is an attempt to characterize justified arguments, whereas math merely relies heavily on these sorts of arguments. A system of logic can be viewed as just another mathematical system (just another game), but it must be admitted that logic has a different "flavor" than mathematics. I think the difference is in how we are trying to capture a very large space of possibilities within a logical system (even when we are studying very restricted systems of logic, such as logic with bounded quantification).

Ultimately, the difference is a historical one.

All facts cited here can be easily verified on Wikipedia. Thanks for reading!

Saturday, December 21, 2013

You Think Too Much

This first appeared as one of my few facebook notes. I'm copying it here; perhaps it's a better place for it.

This is what the phrase "You're overthinking it" is like.

Mary and Bob sit down to a game of chess. Mary is an inexperienced player, and Bob is giving her some advice. He's being nice, trying to hint at the right moves without telling her what to do. So, Bob is making his moves very quickly, since he's experienced, and Mary is not difficult to play against (and he's going easy on her anyway). Mary is very uncertain of most of her moves, and spends a lot of time staring at the board indecisively.

At one point, Mary is looking at the board in confusion. Bob sees very plainly what move she needs to make; she should move her rook to defend her knight. Mary is looking very carefully at all the possible moves she could make, trying hard to evaluate which things might be good or bad for her, trying to think a few moves ahead.

"You're thinking too much," Bob says. "It's very simple."

This advice sounds helpful to Bob. From Bob's perspective, Mary is spending a lot of time thinking about many alternatives when she should be quickly hitting on the critical move. And it's true: if Mary were a better player, she would be thinking less here.

From Mary's perspective, this is not very helpful at all. She tries to take Bob's advice. She tries to think less about her move. She figures, if Bob says "It's simple", this must mean that she doesn't need to look several moves ahead to see the consequences. She looks for possible moves again, this time looking for things that have good consequences for her immediately.

Mary moves the pawn up to threaten one of Bob's pieces.

Bob takes Mary's knight.

Bob explains to a frustrated Mary what she could have done to avoid this. "See? You're overthinking it," he adds. To Bob, this feels like the right explanation for Mary's wrong move: she was thinking about all these other pieces, when she needed to be defending her knight.

The worst part is, Mary starts to be convinced, too. She admits that she was taking a lot of time to look several moves ahead in all kinds of situations that turned out to be irrelevant to what she needed to do. She tries to think less during the rest of the game, and makes many other mistakes as a result.

Sunday, December 15, 2013

Where is AI Headed?

I've spent a lot of effort on this blog arguing for the direction of higher expressiveness. Machine intelligence should be able to learn anything a human can learn, and in order for that to be possible, it should be able to conceive of any concept that a human can. I have proceeded with the belief that this is the direction to push in order for the field to make progress.

Yet, in some ways at least, the field is headed in the opposite direction.

I've often discussed the Chomsky hierarchy, and how most techniques at present fall very low on it. I've often discussed hierarchies "above" the Chomsky hierarchy; hierarchies of logic & truth, problems of uncomputability and undefinability. Reaching for the highest expression of form, the most general notion of pattern.

Machine learning has made artificial intelligence increasingly practical. Yet, the most practical techniques are often the least expressively powerful. Machine learning flourished once it abandoned the symbolic obsession of GOFAI. Fernando Pereira famously said: "The older I get, the further down the Chomsky Hierarchy I go."

There's a good reason for this, too. Highly structured techniques like logic induction and genetic programming (both of which would go high in the hierarchy) don't scale well. Commercial machine learning is large-scale, and increasingly so. I mentioned this in connection with word2vec last time: "Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result." 

The "structure" I'm referring to provides more prior bias, which means more generalization capability. This is very useful when we want to come to the correct conclusion using small amounts of data. However, with more data, we can cover more and more cases without needing to actually make the generalization. At some point, the generalization becomes irrelevant in practice.

Take XML data. You can't parse XML with regular expressions.1 Regular expressions are too low on the Chomsky hierarchy to form a proper model of what's going on. However, for the Large Text Compression Benchmark, which requires us to compress XML data, the leading technique is the PAQ compressor. Compression is equivalent to prediction, so the task amounts to making a predictive model of XML data. PAQ works by constructing a probabilistic model of the sequence of bits, similar to a PPM model. This kind of model is not even capable of representing regular expressions. Learning regular expressions is like learning hidden Markov models; PPM only allows us to learn fully observable Markov models. PAQ learns huge Markov models that get the job done.
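
To illustrate the spirit of this (a toy cousin of PPM, with none of PAQ's context mixing or escape mechanisms), here is a minimal order-n character model. It has no notion of nesting at all, yet it predicts the common surface patterns of XML just fine.

```python
from collections import Counter, defaultdict

# Toy order-n character model (a drastically simplified relative of PPM/PAQ):
# frequent surface patterns get predicted well despite the model knowing nothing
# about recursive structure.
def train(text, n=4):
    counts = defaultdict(Counter)
    for i in range(len(text) - n):
        counts[text[i:i + n]][text[i + n]] += 1
    return counts

def predict(counts, context, n=4):
    ctx = context[-n:]
    if ctx not in counts:
        return None
    return counts[ctx].most_common(1)[0][0]

xml = "<page><title>A</title></page><page><title>B</title></page>"
model = train(xml)
print(predict(model, "<page><tit"))   # 'l': the common surface pattern is memorized
```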

The structure of XML requires a recursive generalization, to understand the nested expressions. Yet, PAQ does acceptably well, because the depth of the recursion is usually quite low.

You can always push a problem lower down on the hierarchy if you're willing to provide more data (often exponentially more), and accept that it will learn the common cases and can't generalize the patterns to the uncommon ones. In practice, it's been an acceptable loss.

Part of the reason for this is that the data just keeps flowing. The simpler techniques require exponentially more data... and that's how much we're producing. It's only getting worse:


Has Big Data Made Anonymity Impossible? (MIT Technology Review)
At The New Yorker, Gary Marcus complains: Why Can't My Computer Understand Me? Reviewing the work of Hector Levesque, the article conveys a desire to "google-proof" AI, designing intelligence tests which are immune to the big-data approach. Using big data rather than common-sense logic to answer questions is seen as cheating. Levesque presents a series of problems which cannot (presently) be solved by such techniques, and calls on others to "stop bluffing".

I can't help but agree. Yet, it seems the tide of history is against us. As the amount of data continues to increase, dumb techniques will achieve better and better results.

Will this trend turn around at some point?

Gary Marcus points out that some information just isn't available on the web. Yet, this is a diminishing reality. As more and more of our lives are online (and as the population rises), more and more will be available in the global brain.

Artificial intelligence is evolving into a specific role in that global brain: a role which requires only simple association-like intelligence, fueled by huge amounts of data. Humans provide the logically structured thoughts, the prior bias, the recursive generalizations; that's a niche which machines are not currently required to fill. At present, this trend only seems to be increasing.

Should we give up structured AI?

I don't think so. We can forge a niche. We can climb the hierarchy. But it's not where the money is right now... and it may not be for some time.

1: Cthulhu will eat your face.

Monday, December 9, 2013

History of Distributed Representations

Commenting on the previous post, a friend pointed out that "distributed representations" are not so new. I thought I would take a look at the history to clarify the situation.

In a very broad sense, I was discussing the technique of putting a potentially nonlinear problem into a linear vector space. This vague idea matches to many techniques in machine learning. A number of well-developed algorithms take advantage of linearity assumptions, including PCA, logistic regression, and SVM.1 A common approach to machine learning is to find a number of features, which are just functions of your data, and use one of these techniques on the features (hoping they are close enough to linear). Another common technique, the kernel trick, projects features into a higher-dimensional space where the linearity assumption is more likely to get good results. Either way, a large part of the work to get good results is "feature engineering": choosing how to represent the data as a set of features to feed into the machine learning algorithm.
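
A toy sketch of that recipe (invented features, plain logistic regression on top): raw inputs that are not linearly separable become separable once we hand-engineer a few quadratic features. A kernel method would achieve a similar effect without writing the feature map out explicitly.

```python
import numpy as np

# Feature-engineering sketch: map raw data into hand-chosen features, then rely on a
# linear method in that feature space. The feature map here is made up for illustration.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)   # a circle: not linearly separable in X

def features(X):
    # Explicit quadratic features make the circle linearly separable.
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])

Phi = features(X)
w = np.zeros(Phi.shape[1]); b = 0.0
for _ in range(500):                                     # plain logistic regression on the features
    p = 1 / (1 + np.exp(-(Phi @ w + b)))
    g = (p - y) / len(y)
    w -= 2.0 * Phi.T @ g
    b -= 2.0 * g.sum()

p = 1 / (1 + np.exp(-(Phi @ w + b)))
print(((p > 0.5) == y).mean())                           # accuracy on the training data
```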

We could even argue that probability theory itself is an example: probabilities are always linear, no matter how nonlinear the underlying problem being described. (The probability of an event is the sum of the probabilities of the ways it could happen.) This gives us nice results; for example, there is always a Nash equilibrium for a game if we allow probabilistic strategies. This is not the case if we only consider "pure" strategies.

This theme is interesting to me, but, I was trying to be much more narrow in talking about recent developments in distributed representations. Like feature-based machine learning, a distributed representation will put data into a vector space to make it easier to work with. Unlike approaches relying on feature engineering, there is an emphasis on figuring out how to get the representations to "build themselves", often starting with randomly assigned vector representations.

The beginning of this kind of approach is probably latent semantic analysis (LSA), which is from 1988. LSA assigns 'semantic vectors' to words based on statistical analysis of the contexts those words occur in, based on the idea that words with similar meaning will have very similar statistics.

Given how old this technique is, the excitement around Google's release of the word2vec tool is striking. Reports spun it as deep learning for the masses. Deep learning is a much more recent wave of development. I think the term has lost much of its meaning in becoming a buzzword.2 Calling word2vec "deep" takes this to farcical levels: the techniques of word2vec improve previous models by removing the hidden layer from the network. Using very shallow learning makes the technique faster, allowing it to be trained on (much!) larger amounts of data. This gives a higher-quality result.

One of the exciting things about word2vec is the good results with solving word analogies by vector math. The results of vector computations like France - Paris and Russia - Moscow are very similar, meaning we can approximately find the vector for a capital given the vector for the corresponding nation. The same trick works for a range of word relationships.
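
The trick itself is just arithmetic plus a nearest-neighbor lookup. Here's a minimal sketch with made-up toy vectors (real experiments of course use vectors trained by word2vec):

```python
import numpy as np

# The analogy trick on invented toy vectors: look for the vocabulary word whose
# vector is closest to  Paris - France + Russia.
vecs = {
    "France": np.array([1.0, 0.0, 0.2]),
    "Paris":  np.array([1.0, 1.0, 0.2]),
    "Russia": np.array([0.0, 0.0, 0.9]),
    "Moscow": np.array([0.0, 1.0, 0.9]),
    "banana": np.array([0.3, -0.5, -0.7]),
}

def closest(target, exclude):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(vecs[w], target))

query = vecs["Paris"] - vecs["France"] + vecs["Russia"]
print(closest(query, exclude={"Paris", "France", "Russia"}))   # 'Moscow' on these toy vectors
```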

However, I've talked with people who had the incorrect impression that this is a new idea. I'm not sure exactly how old it is, but I've heard the idea mentioned before, and I did find a reference from 2004 which appears to use LSA to do the same basic thing. (I can't see the whole article on Google Books...)

One thing which I thought was really new was the emerging theme of combining vectors to form representations of compound entities. This, too, is quite old. I found a paper from 1994, which cites harder-to-find papers from 1993, 1990, and 1989 that also developed techniques to combine vectors to create representations of compound objects. Recent developments seem much more useful, but, the basic idea is present.

So, all told, it's a fairly long-standing area which has seen large improvements in the actual techniques employed, but, whose central ideas were laid out (in one form or another) over 20 years ago.

1: By the way, don't get too hung up on what makes one machine learning technique "linear" and another "nonlinear". This is a false dichotomy. What I really mean is that a technique works in a vector space (which more or less means a space where + is defined and behaves very much like we expect), and relies "largely" on linear operations in this space. What does "linear" actually mean? A function F is linear if and only if F(x+y) = F(x) + F(y) and, for scalar a, F(ax) = aF(x). PCA, for example, is justified by minimizing a squared error (a common theme), where the error is based on Euclidean distance. Notice that taking the square isn't linear, but PCA is still thought of as a linear approach.

2: Deep learning has come to mean almost any multi-layer neural network. The term caught on with the success related to Deep Belief Networks, which proposed specific new techniques. Things currently being called "deep learning" often have little in common with this. I feel the term has been watered down by people looking to associate their work with the success of others. This isn't all bad. The work on multi-layered networks seems to have produced real progress in reducing or eliminating the need for feature engineering.