I wrote two posts in the spring about distributed vector representations of words, something our research group has been working on. One way of thinking about our method is as a random projection. A random projection is a technique to deal with high-dimensional data via low-dimensional summaries. This is very broadly useful. Turning a high-dimensional item into a lower-dimensional one is referred to as

*dimensionality reduction*, and there are many different techniques optimized for different applications.

### Random Projections

This is mathematically related to compressed sensing, a powerful signal processing technique which has emerged fairly recently.

The fundamental story as I see it is:

An

**-dimensional vector space has**

*n**n*orthogonal basis vectors (

*n*cardinal directions), which are all at right angles from each other. However, if we relax to

*approximate*orthogonality (

*almost*90 degree separation), the number of vectors we can pack in increases rapidly; especially for high

**n**. In fact, it increases so rapidly that

*even choosing vectors randomly*we are very likely to be able to fit exponentially many nearly-orthogonal vectors in. Let's say we can fit

*m*vectors, where

*m*is on the order of

**. (Look at theorem 6 here for the more precise formula.)**

*e*^{n}These pseudo-basis vectors define our random projection. We map the orthogonal basis vectors of a large space (the space we

*really*want to work with) down to this

*small*space of size

*n*, using one pseudo-orthogonal vector to represent each truly-othogonal vector in the higher space.

For example, if we want to represent a probability distribution on

*m*items, we would normally need that many numbers. This can be thought of as a point in the

**-dimensional space. However, we can approximately represent it with just**

*m**n*numbers by taking the projection. We will create some error, but we can approximately recover probabilities. This will work especially well if there are a few large probabilities and many small ones (and we don't care

*too*much about getting the small ones right).

### Scramble Graphs

The idea I had was to use this kind of representation within a probabilistic network. Let's say we are trying to represent a dynamic Bayesian network with variables that have thousands of possible values (for example, variables might be English words). The size of probabilistic relationships

This sparsity makes it possible to store the large tables needed for probabilistic language models. But, what if there's a different way? What if we want to work with distributed representations of words directly?

My idea was to apply the random projection to the probability tables inside the graph, and then train the reduced representation directly (so that we never try to store the large table). This yields a kind of tensor network. The "probability tables" are now represented abstractly, by a matrix (in the 2-variable case) or tensor (for more variables) which may have negative values and isn't required to sum to 1.

Think about the

Because the tensor network is defined by a random transform from the underlying network, I called these "scramble graphs".

In order to train the tensor network

(That's a fairly difficult constraint to maintain, but I hoped that an approximate solution would perform well enough. In any case, I didn't really end up getting that far.)

The interesting thing about this idea is that for the hidden variables (in the dynamic bayes network we're hypothetically trying to represent), we do not actually care what the underlying variables mean. We might think of them as some sort of topic model or what-have-you, but all we are really representing is the vectors.

The scramble graph looks an awful lot like a neural network with no nonlinearities. Why would we want a neural network with no nonlinearities? Aren't nonlinearities important?

Yes: the nonlinear transformations in neural networks serve an important role; without them, the "capacity" of the network (the ability to represent patterns) is greatly reduced. A multi-layer neural network with no nonlinearities is no more powerful than a single-layer network. The same statement applies to these tensor graphs: no added representation power is derived from stacking layers.

However: nonlinearity is also what makes it impossible to perform arbitrary probabilistic reasoning with neural networks. We can train them to output probabilities, but it is not easy to reverse the probability (via Bayes' Law), condition on partial information, and so on. Probabilistic models are always linear, and we need this to be able to easily do multi-directional reasoning (using a model for something it wasn't trained to do).

So, I thought scramble graphs might be an interesting compromise between neural networks and fully probabilistic models.

Unfortunately, it didn't work.

It seems like the "capacity" of the tensors is just too small. Vectors have exponential capacity in the sense I outlined at the beginning; they can usefully approximate an exponentially larger space. In my experiments, this property seems to go away when we jump to matrix-representations and tensor-representations. I tried to train a rank-3 matrix on artificial data (for which 100% accuracy was possible), using a distribution with the sparse property mentioned (so roughly 90% of the cases fell within 10% of the possibilities), but accuracy remained below 50%. Very approximately, the capacity seemed to be linear (rather than exponential) in the representation size: the fraction correct after training appeared to scale proportionately with the size of the tensor.

I don't know what the mathematics behind this phenomenon says. Perhaps I made some mistake in my training, or perhaps the sizes I used were too small to start seeing asymptotic effects (since the exponential capacity of vectors is asymptotic). I'm starting to think, though, that the math would confirm what I'm seeing: the exponential-capacity phenomenon is destroyed as soon as we move from vectors to matrices.

*between*these variables gets very large. English has about 10^{6}words. A table giving the probabilistic relationship between two words would need 10^{12}entries, and between three words, 10^{18}. Fortunately, these tables will be very sparse. 95% of the time we're using just the most common 5% of words, so we can restrict the vocabulary size to 10^{5}or even 10^{4}without doing too much damage. Furthermore, it's unlikely we're getting anywhere near 10^{12}words in training data, and impossible that we get 10^{18}. There are less than 10^{9}web pages, so if each page averaged 1,000 English words, we*could*possibly get near 10^{12}. (I don't know how many words the internet actually totals to.) Even then, because some pairs of words are much more common than others, our table of probabilities would be very sparse. It's much more likely that out data has just millions of words, though, which means we have at most millions of nonzero co-occurrence counts (and again, much less in practice due to the predominance of a few common co-occurrences).This sparsity makes it possible to store the large tables needed for probabilistic language models. But, what if there's a different way? What if we want to work with distributed representations of words directly?

My idea was to apply the random projection to the probability tables inside the graph, and then train the reduced representation directly (so that we never try to store the large table). This yields a kind of tensor network. The "probability tables" are now represented abstractly, by a matrix (in the 2-variable case) or tensor (for more variables) which may have negative values and isn't required to sum to 1.

Think about the

*tensor network*and the*underlying network.*In the tensor network, every variable from the underlying network has been replaced by a vector, and every probability table has been replaced by a tensor. The tensor network is an approximate representation of the underlying network; the relationship is defined by the fixed projection for each variable (to the smaller vector representations).Because the tensor network is defined by a random transform from the underlying network, I called these "scramble graphs".

In order to train the tensor network

*directly*without worrying about the underlying network, we want to find values for the tensors which cause the*underlying*network to at least approximately sum to 1 and have positive values, in the normal probabilistic way.(That's a fairly difficult constraint to maintain, but I hoped that an approximate solution would perform well enough. In any case, I didn't really end up getting that far.)

The interesting thing about this idea is that for the hidden variables (in the dynamic bayes network we're hypothetically trying to represent), we do not actually care what the underlying variables mean. We might think of them as some sort of topic model or what-have-you, but all we are really representing is the vectors.

The scramble graph looks an awful lot like a neural network with no nonlinearities. Why would we want a neural network with no nonlinearities? Aren't nonlinearities important?

Yes: the nonlinear transformations in neural networks serve an important role; without them, the "capacity" of the network (the ability to represent patterns) is greatly reduced. A multi-layer neural network with no nonlinearities is no more powerful than a single-layer network. The same statement applies to these tensor graphs: no added representation power is derived from stacking layers.

However: nonlinearity is also what makes it impossible to perform arbitrary probabilistic reasoning with neural networks. We can train them to output probabilities, but it is not easy to reverse the probability (via Bayes' Law), condition on partial information, and so on. Probabilistic models are always linear, and we need this to be able to easily do multi-directional reasoning (using a model for something it wasn't trained to do).

So, I thought scramble graphs might be an interesting compromise between neural networks and fully probabilistic models.

Unfortunately, it didn't work.

It seems like the "capacity" of the tensors is just too small. Vectors have exponential capacity in the sense I outlined at the beginning; they can usefully approximate an exponentially larger space. In my experiments, this property seems to go away when we jump to matrix-representations and tensor-representations. I tried to train a rank-3 matrix on artificial data (for which 100% accuracy was possible), using a distribution with the sparse property mentioned (so roughly 90% of the cases fell within 10% of the possibilities), but accuracy remained below 50%. Very approximately, the capacity seemed to be linear (rather than exponential) in the representation size: the fraction correct after training appeared to scale proportionately with the size of the tensor.

I don't know what the mathematics behind this phenomenon says. Perhaps I made some mistake in my training, or perhaps the sizes I used were too small to start seeing asymptotic effects (since the exponential capacity of vectors is asymptotic). I'm starting to think, though, that the math would confirm what I'm seeing: the exponential-capacity phenomenon is destroyed as soon as we move from vectors to matrices.

## No comments:

## Post a Comment