Tuesday, July 15, 2014

More Distributed Vectors

Since my last post on distributed vector representations, interest in this area has continued to spread across the research community. This exposition on Colah's blog is quite good, although it unfortunately perpetuates the confusing view that distributed representations are necessarily "deep learning". (In fact, as far as I can see, the trend is in the opposite direction: you can do better by using simpler networks, because they train faster and scale up to larger datasets. This reflects a very general trend in AI toward simpler models, a trend to which deep networks seem to be a rare exception.)

The story Colah tells is quite exciting. Vector representations (AKA "word embeddings") are able to perform well on a variety of tasks which they have not been trained on at all; they seem to "magically" encode general knowledge about language after being trained on just one language task. The form of this knowledge makes it easier to translate between languages, too, because the relationship structure between concepts is similar in the two languages. This even extends to image classification: there have been several successes with so-called zero-shot learning, where a system is able to correctly classify images even when it's never seen examples of those classes before, thanks to the general world knowledge provided by distributed word representations.

For example, it's possible to recognize a cat having only seen dogs, but having read about both dogs and cats.

(This seems rather encouraging!)
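To make that concrete, here's a minimal sketch of the nearest-neighbor trick behind this kind of zero-shot setup. Everything here is a made-up toy (tiny dimensions, invented numbers, a hypothetical mapping W); the real systems learn the image-to-word-space mapping from much richer image features.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical pre-trained word vectors (real ones have hundreds of dimensions).
word_vecs = {
    "dog": np.array([0.9, 0.1, 0.3]),
    "cat": np.array([0.8, 0.2, 0.4]),
    "truck": np.array([0.1, 0.9, 0.2]),
}

# Suppose W maps image features into the word-vector space, and was
# trained only on images of dogs and trucks (never cats).
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.4, 0.3]])

def classify(image_features, candidate_labels):
    """Zero-shot classification: project the image into word-vector space
    and pick the label whose word vector is nearest (by cosine)."""
    projected = W @ image_features
    return max(candidate_labels, key=lambda w: cosine(projected, word_vecs[w]))

# An image of a cat can still be labelled "cat", because "cat" has a word
# vector even though no cat images were seen during training.
print(classify(np.array([0.85, 0.25]), ["dog", "cat", "truck"]))  # -> "cat"
```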

Colah mentions that while encoding has been very successful, there is a corresponding decoding problem which seems to be much harder. One paper is mentioned as a hopeful direction for solving this. Colah is talking about trying to decode representations coming out of RNNs (recursive neural networks), a representation I'm quite fond of because it gives (in some sense) a semantic parse of a sentence. However, another option which I'd like to see tried would be to decode representations based on the Paragraph Vector algorithm. This looks easier, and besides, Paragraph Vector actually got better results for sentiment analysis (one of the key domains where RNNs provide a natural solution). Again, we can point to the general trend of AI toward simpler models. RNNs are a way of combining semantic vectors with probabilistic context-free grammars; Paragraph Vector combines semantic vectors with a Markov model. Markov models are simpler and less powerful; therefore, by the contrarian logic of the field, we expect them to do better. And, they do.
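For readers who haven't seen it, here's a rough sketch of the sort of prediction step Paragraph Vector (the PV-DM variant) trains on: a per-paragraph vector is combined with the vectors of a few previous words and used to score every word in the vocabulary as the next word. All the sizes and parameters below are made up for illustration, and real implementations replace the full softmax with tricks like hierarchical softmax or negative sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "movie", "was", "great", "boring"]
V, dim = len(vocab), 8

# Hypothetical parameters: one vector per word, one per paragraph,
# and a softmax layer over the vocabulary.
word_vecs = rng.normal(scale=0.1, size=(V, dim))
para_vec  = rng.normal(scale=0.1, size=dim)        # latent "paragraph topic" vector
softmax_W = rng.normal(scale=0.1, size=(V, dim))

def predict_next(prev_word_ids):
    """PV-DM-style step: average the paragraph vector with the previous
    context words' vectors, then score every word in the vocabulary."""
    h = (para_vec + word_vecs[prev_word_ids].sum(axis=0)) / (1 + len(prev_word_ids))
    scores = softmax_W @ h
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

# The (untrained) probability distribution over the word following "the movie was ...";
# training nudges all three sets of parameters to make the observed next word likely.
probs = predict_next([vocab.index(w) for w in ["the", "movie", "was"]])
print(dict(zip(vocab, probs.round(3))))
```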

All of this makes distributed vectors sound quite promising as a general-purpose representation of concepts within an AI system. In addition to aiding bilingual translation and image-word correspondence problems, RNNs have also been applied to predict links in common-sense knowledge bases, showing that the same model can also be applied to understand information presented in a more logical form (and perhaps to form a bridge between logical representations and natural language). I imagine that each additional task which the vectors are used on can add more implicit knowledge to the vector structure, further increasing the number of "zero-shot" generalizations it may get correct in future tasks. This makes me envision a highly general system which accumulates knowledge in vectors over the course of its existence, achieving lifelong learning as opposed to being re-trained on each task. Vector representations by themselves are obviously not sufficient for AGI (for example, there's no model of problem solving), but they could be a very useful tool within an AGI system.
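To give a flavor of what "predicting links" can look like (this toy bilinear scorer is far simpler than the neural tensor networks actually used in that work, and all names and sizes here are invented), one can score a candidate fact by passing the two entity vectors through a learned per-relation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Hypothetical entity vectors (in practice, pre-trained word embeddings).
entity_vecs = {w: rng.normal(size=dim) for w in ["dog", "animal", "car", "fur"]}

# One matrix per relation; these would be learned from known facts.
relation_mats = {"is_a": rng.normal(scale=0.1, size=(dim, dim)),
                 "has":  rng.normal(scale=0.1, size=(dim, dim))}

def score(head, relation, tail):
    """Bilinear plausibility score for a knowledge-base triple."""
    return entity_vecs[head] @ relation_mats[relation] @ entity_vecs[tail]

# After training on known facts, high scores on unseen triples would suggest
# missing links, e.g. ("dog", "is_a", "animal") scoring above ("dog", "is_a", "car").
print(score("dog", "is_a", "animal"), score("dog", "is_a", "car"))
```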

I mentioned in a previous post that the idea of distributed vectors isn't really that new. One older type of word embedding is latent semantic analysis (LSA). Some researchers from the LSA tradition (Marco Baroni, Georgiana Dinu, & Germán Kruszewski) have gotten annoyed at the recent hype, and decided that LSA-style embeddings were not getting a fair trial; the new word embeddings had not been systematically compared to the older methods. This paper is the result. The verdict: the new stuff really is better!

When it's coming from people who openly admit that they hoped to prove the opposite, this sounds rather convincing. However, I do have some reservations. The title of their paper is: Don't count, predict! The authors refer to the older LSA-like methods as "count vectors", and the newer methods as "predict vectors". The LSA-like methods rely on some transformation of raw co-occurrence counts, whereas the new neural methods train to predict something, such as predicting the current word given several previous words. (For example, LSA uses the singular value decomposition to make a low-rank approximation of tf-idf counts. Paragraph Vector trains to find a latent variable representing the paragraph topic, and uses this, together with several previous words, to predict each next word in the paragraph.)
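The "count" recipe is easy to state in code. Here's a minimal sketch of LSA-style embedding with scikit-learn on a toy corpus (the corpus and the two-dimensional rank are obviously just for illustration): build a tf-idf weighted term-document matrix, then take a truncated SVD to get low-rank document and term vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets tumbled",
    "investors sold stocks and bonds",
]

# "Count" side: tf-idf weighted term-document counts, then a low-rank SVD.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # document-term matrix, tf-idf weighted

svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)          # low-rank document vectors
term_vecs = svd.components_.T            # low-rank term (word) vectors

print(dict(zip(tfidf.get_feature_names_out(), term_vecs.round(2))))
```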

In framing things this way, they are assuming that the count-vs-predict distinction is what explains the different performance of the two approaches. The main experiment of the paper, an extensive comparison of word2vec (representing the prediction-based techniques) vs. DISSECT (representing the counting-based techniques), strongly supports this position. Two other techniques are also tried, though: SENNA and DM. Both of these did worse than word2vec, but the relative comparison between the two is much muddier than the comparison between DISSECT and word2vec. This weakens the conclusion somewhat. Are we expected to believe that new counting-based models will continue to be worse than new prediction-based models?

The conclusion would make sense if counting-based models had reached a state of near-perfection, with only small improvements left to be found. Prediction-based vector representations, by contrast, appear to still be in their early days, with large unexplored areas, so presumably they have significant room for improvement. If the authors believe counting-based representations have no comparable room left, the conclusion follows.

However, work I'm involved with may undermine this conclusion.

I've been doing distributed vector stuff with Volkan Ustun and others here at ICT. Our forthcoming paper has some suggestive evidence, showing performance very close to Google's word2vec when trained with a similar vector size and amount of data. We are not doing any neural network training. Instead, we create representations by summing together initial random representations for each word. This seems to fall firmly on the count-based side of the divide.
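Our paper spells out the actual method; purely to illustrate the general sum-of-random-vectors flavor (this sketch is closer to classic random indexing than to our exact algorithm), something like the following already yields useful similarity structure with no training at all:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, window = 300, 2

corpus = "the cat sat on the mat while the dog sat on the rug".split()
vocab = sorted(set(corpus))

# Each word gets a fixed, sparse random "index" vector ...
index_vecs = {w: rng.choice([-1.0, 0.0, 1.0], size=dim, p=[0.05, 0.9, 0.05])
              for w in vocab}
# ... and a context vector built by summing the index vectors of its neighbours.
context_vecs = {w: np.zeros(dim) for w in vocab}

for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            context_vecs[w] += index_vecs[corpus[j]]

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# Words that appear in similar contexts ("cat" and "dog" here) end up with
# similar summed vectors, without any gradient-based training.
print(cosine(context_vecs["cat"], context_vecs["dog"]))
```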

The incredible thing about Volkan's technique is that it's essentially the first thing we thought of trying; what's going on is much simpler than what happens in word2vec. Yet, we seem to be getting similar results. (We should run a more direct comparison to be sure.) If that holds up, it directly contradicts the conclusion of Don't Count, Predict!

In any case, distributed vectors continue to offer a surprising amount of generality, and have some promise as a cross-task, cross-language, cross-modality unified representation.
