
The Johnson-Lindenstrauss Lemma

May 18, 2017

Today is a good day I think to talk about the Johnson-Lindenstrauss dimension reduction lemma!

Johnson and Lindenstrauss. It cannot be coincidental that they are both wearing the same type of glasses.
Left image from Johnson’s website, right image from Gil Kalai’s blogpost in memory of Lindenstrauss.

The setting is as follows. Suppose you have a large number n of points, living in the d-dimensional Euclidean space \mathbb{R}^{d} . We’ll call these points x_1,x_2,\ldots,x_n , and think of both the number of points n and the number of dimensions d as being very large, much larger than the measly “3” of our easy-to-grasp physical world.

Now, you may wonder, “But we do live in a three-dimensional world, thank you very much, so where would I obtain such high-dimensional data?” Well, these situations arise more and more often in computer science applications, where a point in d -dimensional space may represent an abstraction of some actual real-world thing.

For example, imagine this not-too-farfetched biology experiment. A biologist wants to understand how a particular gene works in a bacterium, and she does this using the following procedure: first, she makes small changes in the DNA sections that code for the protein. Then she replaces the original gene in the bacterium with the new, modified one. Finally, she measures the concentration of about 1000 different proteins in the bacterium. If the original gene was important, or if it was a key ingredient in a large protein pathway, then many of these concentrations will be different under the modified gene when compared to the unchanged gene.

Our biologist repeats this experiment many, many times, each time changing a different part of the gene in a different way. Ultimately, she has performed, say, about 500 experiments. Each experiment results in a vector of 1000 numbers, one for each protein. She now has 500 points living in a 1000-dimensional space*. Our poor biologist must now analyze these points, and from them understand the way in which the original gene affects the proteins and operates within the cell. She will run many statistical and geometrical algorithms on these data, and these algorithms usually have the downside that they really, really slow down as the dimension grows. What can she do?

* Actually, since a set of n points always spans a space of dimension no larger than n , these points only live in a 500-dimensional space. But that’s beside the point here, as we will see next.


Ok, back to mathematics. The Johnson-Lindenstrauss lemma roughly says the following. Suppose you have these n points living in \mathbb{R}^{d} . You can actually put them into a space of very small dimension – logarithmic in n – in such a way that the distances between the points are almost preserved: if two points were originally far away from each other, they’ll remain far away from each other, and if two points were close, they’ll remain close. More formally,

Lemma: Let 0 < \varepsilon <1 and let x_1,x_2,\ldots,x_n be n points in \mathbb{R}^{d} . Then there is a function f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k} such that for every two points x_i and x_j ,

(1-\varepsilon)||x_i - x_j||^2 \leq ||f(x_i)-f(x_j)||^2 \leq (1+\varepsilon)||x_i - x_j||^2

with k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3).

This lemma is good if you want to run geometric algorithms on your data. For example, in our biologist’s case, she might decide that if two modifications to the gene resulted in protein concentrations that are more or less the same (that is, two points in \mathbb{R}^d that are close to each other), then the modifications changed the gene in functionally the same manner. She might then decide to look for clusters of points which are close to each other. This is usually easier to do in lower-dimensional spaces. Applying a Johnson-Lindenstrauss transformation allows our biologist to run the algorithms faster and on more data, while (hopefully) not losing much accuracy.

Of course, this is useful only if the new dimension k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) is much smaller than the original dimension d. This often happens in applications; for example, in a future post, we will work with n points in n dimensions, and the transition from a space of dimension n to a space of dimension \log n will be quite substantial.
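To get a feel for the numbers, here is a minimal sketch (in Python, not part of the original post) that simply evaluates the lemma’s formula for k ; the function name and the example values of n and \varepsilon are my own choices.

import math

# Evaluate the lemma's target dimension k = 4 log n / (eps^2/2 - eps^3/3).
# (Illustrative helper, not code from the post.)
def jl_dimension(n, eps):
    return math.ceil(4 * math.log(n) / (eps**2 / 2 - eps**3 / 3))

# For a million points in a million dimensions:
print(jl_dimension(10**6, 0.25))   # about 2123 -- a huge drop from d = 10^6
print(jl_dimension(10**6, 0.5))    # about 664 -- coarser distances, but even smaller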

The proof we will present here follows a paper by Dasgupta and Gupta, which uses random projections. We’ll first illustrate this concept in two dimensions.

Suppose you have a set of points in the plane:

A one-dimensional subspace of the plane is just a line that goes through the origin. Once we have such a line, we can perform an orthogonal projection of the original points onto that line, giving us a new set of points, all on the line:

If we choose a different line, we get a different projection:

For some lines, the projection will totally wreck the distances between the original points; for example, in this image, we sent two different original points to the same point:

However, for other lines, it may be that the projection is able to preserve the ratio of distances fairly accurately.
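If you want to play with this picture yourself, here is a small Python sketch (mine, not from the post) that projects a handful of planar points onto a random line through the origin and prints how each pairwise distance fares; all names and parameters are illustrative.

import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((5, 2))           # five points in the plane

theta = rng.uniform(0, np.pi)                  # a random line through the origin
u = np.array([np.cos(theta), np.sin(theta)])   # unit vector along that line
coords = points @ u                            # signed coordinate of each projected point

for i in range(len(points)):
    for j in range(i + 1, len(points)):
        original = np.linalg.norm(points[i] - points[j])
        projected = abs(coords[i] - coords[j])
        print(f"pair ({i},{j}): original distance {original:.2f}, projected {projected:.2f}")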

The proof of the Johnson-Lindenstrauss lemma uses the same idea, only in higher dimensions. Instead of points in the plane, we have our original d -dimensional space; and instead of projecting to a line, we try to find a subspace of dimension k so that projecting onto it will almost preserve distances. [Actually, we project the points and then blow them up by a constant factor, since orthogonal projections can only decrease the length of vectors; but this doesn’t change the gist of the argument. From now on, when we say “distance preserving”, we’ll actually mean “up to a constant factor”. See the actual calculations for the exact treatment].

A priori, there are no grounds to think that such a subspace exists. However, it turns out that it indeed does, for a reason called “the concentration of measure in high dimensions”. Roughly, it says that in high dimensions, some random processes – such as the orthogonal projection of a point onto a random subspace, or the scalar product of a unit vector with a Gaussian vector – are heavily concentrated around their expected value. The probability for such processes to be even a tiny bit away from their mean is exponentially small in the dimension. We usually don’t see these phenomena in 3d, because the lowly and pathetic exponent of 3 isn’t enough to give a profound effect, but in higher dimensions they flourish.

After all of this verbosity, here is the proof outline: for any k , we can choose a random k dimensional subspace of \mathbb{R}^d and project our points onto it. Choose two particular points x_i and x_j from our original cast in \mathbb{R}^d . It may either be that the distance between them is preserved up to a factor of 1 \pm \varepsilon , or it may not be preserved; denote the probability that it is not preserved by some number p . We can consider this “distance distortion” event for all of the {n \choose 2} = (n^2 - n)/2 pairs of points, and by the union bound, the probability for at least one pair of points to have their distance distorted is less than p \cdot (n^2 - n)/2 . If we can show that the probability p is smaller than 2 / (n^2 -n) , then the probability that at least one distance is distorted is smaller than 1. This means there is a non-zero probability that all the distances are preserved! But then there has to exist a distance-preserving projection; otherwise it would be impossible for this probability to be greater than 0. And as it turns out, we can indeed make the probability that a single distance is distorted this small, by choosing k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) .

Actually, this proof not only shows that there exists such a transformation, it even tells you how to efficiently get one: the probability that a random projection gives a distance-preserving dimension reduction is actually quite high (and can be made very close to one if we just increase the number of dimensions by a bit). So you just need to take your points, project them randomly, and check that the distances are preserved; if not repeat, and sooner rather than later you are bound to get something that works.
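Here is a rough sketch of that recipe in Python (my own illustration, with made-up sizes n, d, k and tolerance \varepsilon , not the lemma’s prescribed values): draw a random k -dimensional subspace via QR of a Gaussian matrix, project, blow the points up by \sqrt{d/k} (the rescaling discussed later in the post), check all pairwise distances, and repeat if needed.

import numpy as np

rng = np.random.default_rng(1)

def random_projection(X, k, rng):
    # Orthonormal basis of a random k-dimensional subspace, followed by the
    # blow-up factor sqrt(d/k) discussed later in the post.
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return np.sqrt(d / k) * (X @ Q)

def max_relative_distortion(X, Y):
    # Largest relative error over all pairwise squared distances.
    n, worst = X.shape[0], 0.0
    for i in range(n):
        for j in range(i + 1, n):
            orig = np.sum((X[i] - X[j]) ** 2)
            new = np.sum((Y[i] - Y[j]) ** 2)
            worst = max(worst, abs(new - orig) / orig)
    return worst

n, d, k, eps = 100, 1000, 500, 0.25            # illustrative sizes, not the lemma's bound on k
X = rng.standard_normal((n, d))
for attempt in range(10):                      # "if not, repeat" -- one or two tries usually suffice
    Y = random_projection(X, k, rng)
    if max_relative_distortion(X, Y) <= eps:
        print(f"success on attempt {attempt + 1}")
        break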


Ok, let’s take a look at the technical details. All we have to show is that with high probability, the length of a vector is nearly preserved when we project it to a random k dimensional subspace. Now, when performing the calculations, instead of taking a fixed vector and a random subspace, we can take a fixed subspace and a random vector; all the results will be the same, since all we care about when projecting is the relative orientation between the vector and the subspace.

So let’s take a random unit vector, and see what happens to its length. A uniformly random unit vector can be obtained by sampling from a spherically symmetric distribution and normalizing the result. Specifically, let X_1,\ldots,X_d be d independent Gaussian random variables with mean 0 and variance 1; in other words, the random vector X = (X_1,\ldots, X_d) is distributed as a d dimensional Gaussian vector \mathcal{N}(0,\text{Id}_{d}) . Then the vector

Y = \frac{X}{||X||}

is a uniformly random unit vector. As for our fixed subspace, we’ll just take the space spanned by the first k coordinates, and denote the projection by Z = \frac{(X_1,\ldots ,X_k)}{||X||} . What is the expected value of the square of the length of Z ? Well, we know that the squared length of Y is 1; this is by design: we normalized it. We thus have:

||Y||^2 = 1 = \sum_{i = 1}^{d}\frac{X_i^2}{||X||^2}.

If this is true for ||Y||^2 , then it is also true for its expectation:

\mathbb{E}||Y||^2 = 1 = \sum_{i = 1}^{d}\mathbb{E}\frac{X_i^2}{||X||^2}.

Notice that the sum is symmetric in the X_i ’s: in expectation, there is no reason for one X_i to behave differently from the others. Thus each term should contribute equally to the sum, and so for every i ,

\mathbb{E}\frac{X_i^2}{||X||^2} = 1/d.

Since Z is just the first k terms of Y , we get

\mathbb{E}||Z||^2 = \sum_{i=1}^{k}\mathbb{E}\frac{{X_i}^2}{||X||^2} = k/d.

Oops, the expected value is k/d , which is smaller than 1, the original norm of the vector! But not to worry, this can easily be fixed: after we project to a random subspace of dimension k , we just blow up all the points by a factor of \sqrt{d/k} . That way, when we look at the norm squared, the expectation will be d/k times larger, and so it will indeed be 1. This multiplication will not affect any of our probabilistic statements, since if ||Z||^2 is concentrated around its mean, so is \frac {d}{k}||Z||^2 .
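As a sanity check, here is a short Monte Carlo experiment (my own, not from the post) that samples random unit vectors, keeps their first k coordinates, and confirms that the average squared length is about k/d , and about 1 after blowing up by \sqrt{d/k} ; the sizes are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
d, k, trials = 1000, 50, 5000

X = rng.standard_normal((trials, d))
Y = X / np.linalg.norm(X, axis=1, keepdims=True)   # uniformly random unit vectors
Z = Y[:, :k]                                       # keep only the first k coordinates

print(np.mean(np.sum(Z**2, axis=1)))               # close to k/d = 0.05
print(np.mean((d / k) * np.sum(Z**2, axis=1)))     # close to 1 after the blow-up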


Now to fulfil our promise and show that the squared length of Z is heavily concentrated around its mean. The calculations will be a bit lengthy, but do not despair (also, if you got up to here and understood everything, you are pretty much good-to-go for most of your Johnson-Lindenstrauss-ing needs).

We’ll show that for any positive real number \beta smaller than 1, the probability for ||Z||^2 to be smaller than \beta times its mean of k/d is exponentially small in the dimension (with a rate depending on \beta ); we’ll then take \beta = 1 - \varepsilon to show that the probability of shrinking a length by more than a 1 - \varepsilon factor is small. The same calculations can also be repeated for checking when ||Z||^2 is greater than \beta k / d for \beta > 1 , and then taking \beta = 1 + \varepsilon .

A standard method for showing such concentration of measure is using Markov’s inequality, which states that if W is a positive random variable, then for every a > 0 ,

\text{Pr}[W \geq a] \leq \frac{\mathbb{E}W}{a}.

Intuitively, all this inequality says is that if the expectation is small, then it’s impossible to have too large a probability of getting too large a value. Markov’s inequality is innocuous enough, but it becomes powerful when we combine it with exponential moments, that is, when we apply it to e^{tW} instead of to W itself. Let’s see how that works.
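To see why the exponential moment helps, here is a tiny numerical comparison (my own, under the assumption that W = X_1^2 + \ldots + X_k^2 for standard Gaussians X_i , whose moment generating function appears later in the post): the plain Markov bound \mathbb{E}W/a versus the optimized bound \min_t \mathbb{E}[e^{tW}]/e^{ta} versus the simulated tail probability.

import numpy as np

k, a = 20, 40.0
t_grid = np.linspace(0.001, 0.49, 1000)            # keep t < 1/2 so the exponential moment exists

markov_bound = k / a                               # E[W] = k for a sum of k squared Gaussians
exp_moment_bound = np.min((1 - 2 * t_grid) ** (-k / 2) * np.exp(-t_grid * a))

rng = np.random.default_rng(3)
W = np.sum(rng.standard_normal((200_000, k)) ** 2, axis=1)
print(markov_bound, exp_moment_bound, np.mean(W >= a))
# roughly 0.5 vs 0.05 vs 0.005: the exponential-moment bound is an order of magnitude sharper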

When written explicitly, ||Z||^2 becomes

||Z||^2 = \frac{X_1^2 + \ldots + X_k^2} {X_1^2 + \ldots + X_d^2}.

So the probability \text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] can be written as

= \text{Pr}[d(X_1^2 + \ldots + X_k^2) \leq k \beta(X_1^2 + \ldots + X_d^2) ]

= \text{Pr}[k \beta(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2) \geq 0]

=  \text{Pr}[e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))} \geq 1].

Here t is some number greater than 0, and we will optimize over it soon in order to get the best possible bounds. Now we invoke Markov’s inequality with a = 1 , getting

\leq  \mathbb{E}e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))}.

The various X_i ’s are independent, so the expectation of the product is the product of expectations:

\mathbb{E}e^{X_1 + X_2} = \mathbb{E}[e^{X_1}e^{X_2}] = \mathbb{E}[e^{X_1}]\mathbb{E}[e^{X_2}],

and after grouping together the X_i in our exponent, our probability is bounded by

=  \mathbb{E}[e^{t k \beta X_1^{2}}]^{d-k} \mathbb{E}[e^{t(k \beta - d)X_1^{2}}]^{k}

=  (1-2tk \beta)^{-(d-k)/2}(1-2t(k\beta-d))^{-k/2}.

This last equality we won’t go into, but it stems from knowledge about the moment generating function of Gaussians; in short, all that needs to be proven is that \mathbb{E}[e^{sX^2}] = 1/\sqrt{1-2s} for s<1/2 .
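That moment generating function fact is easy to check numerically; here is a quick simulation (mine, not from the post) comparing the sample average of e^{sX^2} with 1/\sqrt{1-2s} for a few values of s .

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal(2_000_000)
for s in (-1.0, -0.3, 0.1, 0.2):                   # any s < 1/2 works; these keep the simulation stable
    empirical = np.mean(np.exp(s * X**2))
    exact = 1 / np.sqrt(1 - 2 * s)
    print(f"s = {s:+.1f}: empirical {empirical:.4f}, exact {exact:.4f}")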

All that is left is to minimize this last expression. This we also won't do here, because it mainly involves annoying calculus which you can let WolframAlpha do for you. This finally gives:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \beta^{k/2}(1 + \frac{k(1-\beta)}{d-k})^{(d-k)/2}.

The expression on the right hand side is equal to e to the power of its logarithm:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2} \log \beta + \frac{d-k}{2} \log(1 + \frac{k(1-\beta)}{d-k})).

Finally, since \log(1+x) \leq x for all x \geq 0, we can replace the second \log factor by \frac{k(1-\beta)}{d-k} , giving

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2}(1-\beta + \log\beta)).

Now we just need to clean up. Put \beta = 1-\varepsilon , and get

\leq \exp(\frac{k}{2}(1-(1-\varepsilon) + \log(1-\varepsilon))).

By more calculus it can be shown that \log(1-x) \leq -x - x^2/2 for 0\leq x <1 , and at last we have

\leq \exp(-\frac{k \varepsilon^2}{4}).

And indeed, if we choose k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) , this bound will be smaller than 1/n^2 , just as we wanted (actually, for this case we could choose just k = 8 \log n / \varepsilon^2 , but to deal with the second case of \beta = 1+\varepsilon we will also need the \varepsilon^3/3 term). A similar calculation for \beta = 1+\varepsilon will give another factor of 1/n^2 , and in total the probability of having a large distortion will be at most 2/n^2 per pair of points.
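As a final sanity check, here is a short computation (my own) verifying that with k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) , the tail bound e^{-k\varepsilon^2/4} really is at most 1/n^2 , so the union bound over all {n \choose 2} pairs leaves a positive probability that no distance is distorted; the pairs of n and \varepsilon are arbitrary examples.

import math

for n, eps in [(100, 0.1), (10_000, 0.2), (1_000_000, 0.3)]:
    k = 4 * math.log(n) / (eps**2 / 2 - eps**3 / 3)
    tail = math.exp(-k * eps**2 / 4)
    print(f"n = {n:>9}, eps = {eps}: k = {k:7.0f}, "
          f"exp(-k eps^2/4) = {tail:.1e} <= 1/n^2 = {1 / n**2:.1e}")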