Posts Tagged ‘Geometry’

The Johnson-Lindenstrauss Lemma

May 18, 2017

Today is a good day I think to talk about the Johnson-Lindenstrauss dimension reduction lemma!

Johnson and Lindenstrauss. It cannot be coincidental that they are both wearing the same type of glasses.
Left image from Johnson’s website, right image from Gil Kalai’s blogpost in memory of Lindenstrauss.

The setting is as follows. Suppose you have a large number n points, living in the Euclidean space of dimension \mathbb{R}^{d} . We’ll call these points x_1,x_2,\ldots,x_n , and think of both the number of points n and the number of dimensions d as being very large, much larger than the measly “3” of our easy-to-grasp physical world.

Now, you may wonder, “But we do live in a three dimensional world thank you very much, so where would I obtain such high dimensional data?” Well, these situations arise increasingly more often in computer science applications, where a point in d -dimensional space may represent an abstraction of some actual real-world thing.

For example, imagine this not-too-farfetched biology experiment. A biologist wants to understand how a particular gene works in a bacterium, and she does this using the following procedure: first, she makes small changes in the DNA sections that code up the protein. Then she replaces the original gene in the bacterium by the new modified one. Finally, she measures the concentration of about 1000 different proteins in the bacterium. If the original gene was important, or if it was a key ingredient in a large protein pathway, then many of these concentrations will be different under the modified gene when compared to the unchanged gene.

Our biologist repeats this experiment many many times, each time changing a different part of the gene in a different way. Ultimately, she has performed say about 500 experiments. Each experiment results in a vector of 1000 numbers, one for each protein. She now has 500 points living in a 1000-dimensional space*. Our poor biologist must now analyze these points, and from them understand the way in which the original gene affects the proteins and operates within the cell. She will run many statistical and geometrical algorithms on these data, and these algorithms will usually have the downside that they really, really slow down as a function of dimension. What can she do?

* Actually, since a set of n points always spans a space of dimension no larger than n , these points only live in a 500-dimensional space. But that’s besides the point here, as we will see next.

Ok, back to mathematics. The Johnson-Lindenstrauss lemma roughly says the following. Suppose you have these n points living in \mathbb{R}^{d} . You can actually put them into a very small space – logarithmic in n – in such a way that the distances between the points are almost preserved: If two points were originally far away from each other, they’ll remain far away from each other, and if two points were close, they’ll remain close. More formally,

Lemma: Let 0 < \varepsilon <1 and let x_1,x_2,\ldots,x_n be n points in \mathbb{R}^{d} . Then there is a function f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k} such that for every two points x_i and x_j ,

(1-\varepsilon)||x_i - x_j||^2 \leq ||f(x_i)-f(x_j)||^2 \leq (1+\varepsilon)||x_i - x_j||^2

with k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3).

This lemma is good if you want to run geometric algorithms on your data. For example, in our biologist’s case, she might decide that if two modifications to the gene resulted in protein densities that are more or less the same (that is, two points in \mathbb{R}^d that are close to each other), then the modifications changed the gene in functionally the same manner. She might then decide to look for clusters of points which are close to each other. This is usually easier to do in smaller dimensional spaces. Applying a Johnson-Lindenstrauss transformation allows our biologist to run the algorithms faster and on more data, while (hopefully) not losing much accuracy.

Of course, this is useful only if the the new dimension k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) is much smaller than the original dimension d. This often happens in applications; for example, in a future post, we will work with n points in n dimensions, and the transition from a space of dimension n to space of dimension \log n will be quite substantial.

The proof we will present here follows a paper by Dasgupta and Gupta, which uses random projections. We’ll first illustrate this concept in two dimensions.

Suppose you have a set of points in the plane:

A one-dimensional subspace of the plane is just a line that goes through the origin. Once we have such a line, we can perform an orthogonal projection of the original points onto that line, giving us a new set of points, all on the line:

If we choose a different line, we get a different projection:

For some lines, the projection will totally wreck the distances between the original points; for example, in this image, we sent two different original points to the same point:

However, for other lines, it may be that the projection is able to preserve the ratio of distances fairly accurately.

The proof the Johnson-Lindenstrauss lemma uses the same idea, only in higher dimensions. Instead of points in the plane, we have our original d -dimensional space; and instead of projecting to a line, we try to find a subspace of dimension k so that projecting on to it will almost preserve distances. [Actually, we project the points and then blow them up by a constant factor, since orthogonal projections can only decrease the length of vectors; but this doesn’t change the gist of the argument. From now on when we say “distance preserving”, we’ll mean actually mean “up to a constant factor”. See the actual calculations for exact treatment].

A-priori, there is no grounds to think that such a subspace exists. However, it turns out that it indeed does, for a reason called “the concentration of measure in high dimensions” Roughly, it says that in high dimensions, some random processes – such as the orthogonal projection of a point onto a random subspace, or the scalar product of a unit vector with a Gaussian vector – are heavily concentrated around their expected value. The probability for such processes to be even a tiny bit away from their mean is exponentially small in the dimension. We usually don’t see these phenomena in 3d, because the lowly and pathetic exponent of 3 isn’t enough to give a profound effect, but in higher dimensions they flourish.

After all of this verbosity, here is the proof outline: for any k , we can choose a random k dimensional subspace of \mathbb{R}^d and project our points on to it. Choose two particular points x_i and x_j from our original cast in \mathbb{R}^d . It may either be that the distance between them is preserved up to a factor of 1 \pm \varepsilon , or it may not be preserved; denote the probability that it is not preserved is some number p . We can consider this “distance distortion” event for all of the {n \choose 2} = (n^2 - n)/2 pairs of points, and by the union bound, the probability for at least one pair of points to have their distance distorted is less than p \cdot (n^2 - n)/2 . If we can show that the probability p is smaller than 2 / (n^2 -n) , then the probability that at least one distance is distorted is smaller than 1. This means there is a non-zero probability that all the distances are preserved! But then there has to exist a distance-preserving projection; otherwise it would be impossible for this probability to be greater than 0. And as it turns out, we can indeed make the probability that a single distance is preserved be this small, by choosing k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) .

Actually, this proof not only shows that there exists such a transformation, it even tells you how to efficiently get one: the probability that a random projection gives a distance-preserving dimension reduction is actually quite high (and can be made very close to one if we just increase the number of dimensions by a bit). So you just need to take your points, project them randomly, and check that the distances are preserved; if not repeat, and sooner rather than later you are bound to get something that works.

Ok, let’s take a look the technical details. All we have to show is that with high probability, the length of a vector is nearly preserved when we project it to a random k dimensional subspace. Now, when performing the calculations, instead of taking a fixed vector and a random subspace, we can take a fixed subspace and a random vector; all the results will be the same, since all we care about when projecting is the relative orientation between the vector and the subspace.

So let’s take a random unit vector, and see what happens to its length. A uniformly random unit vector can be obtained by taking a random distribution which is spherically symmetric and normalizing it. Specifically, let X_1,\ldots,X_d be d independent Gaussian random variables with mean 0 and variance 1; in other words, the random vector X = (X_1,\ldots, X_d) distributes as a d dimensional Gaussian vector \mathcal{N}(0,\text{Id}_{d}) . Then the vector

Y = \frac{X}{||X||}

is a uniformly random unit vector. As for our fixed subspace, we’ll just take the space spanned by the first k coordinates, and denote the projection by Z = \frac{(X_1,\ldots ,X_k)}{||X||} . What is the expected value of the square of the length of Z ? Well, we know that the squared length of Y is 1; this is by design: we normalized it. We thus have:

||Y||^2 = 1 = \sum_{i = 1}^{d}\frac{X_i^2}{||X||^2}.

If this is true for ||Y||^2 , then it is also true for its expectation:

\mathbb{E}||Y||^2 = 1 = \sum_{i = 1}^{d}\mathbb{E}\frac{X_i^2}{||X||^2}.

Notice that the sum is symmetric in the X_i ‘s: in expectation, there is no reason for one X_i to behave differently from the other. Thus each term should contribute equally to the sum, and so for every i ,

\mathbb{E}\frac{X_i^2}{||X||^2} = 1/d.

Since Z is just the first k terms of Y , we get

\mathbb{E}||Z||^2 = \sum_{i=1}^{k}\mathbb{E}\frac{{X_i}^2}{||X||^2} = k/d.

Oops, the expected value is k/d , which is smaller than 1, the original norm of the vector! But not to worry, this can easily be fixed: after we project to a random subspace of dimension k , we just blow up all the points by a factor of \sqrt{d/k} . That way, when we look at the norm squared, the expectation will be d/k times larger, and the expectation will indeed by 1. This multiplication will not affect any of our probabilistic statements, since if ||Z||^2 is concentrated around its mean, so is \frac {d}{k}||Z||^2 .

Now to fulfil our promise and show that the squared length of Z is heavily concentrated around its mean. The calculations will be a bit lengthy, but do not despair (also, if you got up to here and understood everything, you are pretty much good-to-go for most of your Johnson-Lindenstrauss-ing needs).

We’ll show that for any positive real number \beta smaller than 1, the probability for ||Z||^2 to be \beta times smaller than its mean of k/d is exponentially small in the dimension and in \beta ; we’ll then take \beta = 1 - \varepsilon to show that the probability for a small deformation is small. The same calculations can also be repeated for checking when ||Z||^2 is greater than \beta for \beta > 1 , and then taking \beta = 1 + \varepsilon .

A standard method for showing such concentration of measure is using Markov’s inequality, which states that if W is a positive random variable, then for every a > 0 ,

\text{Pr}[W \geq a] \leq \frac{\mathbb{E}W}{a}.

Intuitively, all this inequality says is that if the expectation is small, then it’s impossible to have too large a probability of getting too large a value. Markov’s inequality is innocuous enough, but it becomes powerful when we combine it with exponential moments, that is, when we apply to e^{tW} instead of to W itself. Let’s see how that works.

When written explicitly, ||Z||^2 becomes

||Z||^2 = \frac{X_1^2 + \ldots + X_k^2} {X_1^2 + \ldots + X_d^2}.

So the probability \text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] can be written as

= \text{Pr}[d(X_1^2 + \ldots + X_k^2) \leq k \beta(X_1^2 + \ldots + X_k^2) ]

= \text{Pr}[k \beta(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2) \geq 0]

=  \text{Pr}[e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))} \geq 1].

Here t was some number greater than 0, and we will optimize over it soon in order to get the best possible bounds. Now we invoke Markov’s inequality with a = 1 , getting

\leq  \mathbb{E}e^{t(k(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))}.

The various X_i ‘s are independent, so the expectation of the product is the product of expectations:

\mathbb{E}e^{X_1 + X_2} = \mathbb{E}[e^{X_1}e^{X_2}] = \mathbb{E}[e^{X_1}]\mathbb{E}[e^{X_2}],

and after grouping together the X_i in our exponent, our probability is bounded by

=  \mathbb{E}[e^{t k \beta X_1^{2}}]^{d-k} \mathbb{E}[e^{t(k \beta - d)X_1^{2}}]^{k}

=  (1-2tk \beta)^{-(d-k)/2}(1-2k(k\beta-d))^{-k/2}.

This last equality we won’t go into, but it stems from knowledge about the moment generating function of Gaussians; in short, all that needs to be proven is that \mathbb{E}[e^{sX^2}] = 1/\sqrt{1-2s} for s<1/2 .

All that is left is to minimize this last expression. This we also won't do here, because it mainly involves annoying calculus which you can let WolframAlpha do for you. This finally gives:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \beta^{k/2}(1 + \frac{k(1-\beta)}{d-k})^{(d-k)/2}.

The expression on the right hand side is equal to e to the power of its logarithm:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2} \log \beta + \frac{d-k}{2} \log(1 + \frac{k(1-\beta)}{d-k})).

Finally, since \log(1+x) \leq x for all x \geq 0, we can replace the second \log factor by \frac{k(1-\beta)}{d-k} , giving

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2}(1-\beta + \log\beta)).

Now we just need to clean up. Put \beta = 1-\varepsilon , and get

\leq \exp(\frac{k}{2}(1-(1-\varepsilon) + \log(1-\varepsilon))).

By more calculus it can be shown that \log(1-x) \leq -x - x^2/2 for 0\leq x <1 , and at last we have

\leq \exp(\frac{k \varepsilon^2}{4}).

And indeed, if we choose k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) , this exponent will be smaller than 1/n^2 , just as we wanted (actually, we can choose just k = 8 \log n / (\varepsilon^2) , but to deal with the second case of \beta = 1+\varepsilon we will also need the \varepsilon^3/3 term). A similar calculation for \beta = 1+\varepsilon will give another factor of 1/n^2 , and in total the probability of having a large distortion will be just 2/n^2 per pair of points.

How many points does it take to define…

May 14, 2012

You know how it is. One moment you are relaxing on the grass with some friends discussing the weather, the next you start stretching your mind as to how many points it takes to define various shapes: ellipses, ellipsoids, triangles, what not. Let’s take a peek at this sort of problem.
We’ll start with circles. One of the first things that you learn in school about circles, is that any three noncollinear points define the circumference of one circle, and one circle only. Lets see how that is.
First we show that we can indeed construct a circle that will pass through all three points. Let’s label our points A, B, and C.

We draw the lines AB and BC, and find their midpoints.

Next we extend perpendiculars from the midpoints. They intersect at point O.

AOB is an isosceles triangle, since the line we extended from the midpoint is perpendicular to AB. That means that OA = OB. The same can be said about triangle BOC, and it follows that OB = OC. This means that OA = OB = OC. The point O is equidistant from all three points, and O is the center of the circle.

Having shown that our three points define a circle, now we need to show that only one circle is defined. Suppose that two circles were indeed defined. Then all three of our points would lie both on the first circle and on the second one. This means that our two circles intersect (otherwise there would be no point that is shared by both). But the intersection of two different circles can only occur at either one point (in which case they are tangent), or two points. This contradicts the fact that all three points are defined on both circles – this only happens when the two circles coincide exactly, meaning that they are the same.

Ok, but what about ellipses, which are a generalization of circles? Intuitively, it seems as if it should take four points to define an ellipse, since it’s basically a circle but with “2 radii” instead of one. Unfortunately, by the same reasoning shown above, this intuition does not hold.
Suppose that we have K points, and we want these points to define only a single ellipse. If the points do define two ellipses, then the points must be found on the circumference of both – so they must be on the intersections between the two ellipses. Two ellipses intersect at most at four points – so if we have only four points, we are not able to resolve the ambiguity between the two. In order for there to be a contradiction, there must be at least five distinct points. Note that this proof doesn’t say that five points are enough; it only says that four are not.

The ellipse (and therefore also the circle) are actually part of a family of curves called “conic sections”: circle, ellipse, parabola, hyperbola. These are the shapes you get when you look at the intersection between a cone and a plane.

Analytically, the equation governing the conic sections is of the form:

Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0

Each point we have at our disposal can be put in as the x and y values, which gives a single equation for A, B, C, D, E, and F. Hence, in the general case, in order to fully find all of the coefficients, we need five points in order to get five linear equations (there are six coefficients, but we can always set one of them to 1, scaling the others accordingly). So the actual answer for an ellipse is five.
Of course, not every five points will give you an ellipse – there might be no solution at all, or the solution might yield a parabola or a hyperbola. This equation also shows us why we only need three points for a circle – in this case, A = C, and B = 0, so there are two less coefficients to find.
As for an ellipsoid – a “3D ellipse” – well, it is a quadratic surface, which is defined by 10 coefficients, meaning that 9 points are needed.

What about a triangle? How many points does it take to define one? While a triangle might seem like a rather simple shape, it actually isn’t when compared with a circle or an ellipse. With the latter two, we have very strong rules about how to draw them – the fact that the points have to be all at an equal distance from the foci limits us greatly, when compared with the freedom of polygons. In fact, with polygons we might have ambiguity with any number of points, since all the points can lie on the same edge. There is no minimal number of points which always generates a unique triangle. Instead we’ll try to think – what is the minimal number points which can (but not necessarily will) generate a unique triangle?
The number that comes up the fastest when we say “triangle” is three, but it takes more than three points to define a triangle. Why, three points define a circle, and that circle can either be a bounding circle of a triangle, or be bounded within a different triangle, so obviously there is an ambiguity.

The minimum number of points that I managed to find that define a triangle is six. That doesn’t mean that every six points define a triangle; it just means that no configuration of five points can uniquely define one. The configuration looks like this:

The fact that there are three collinear triplets “forces” us to draw lines through them, which reduces the freedom we have when drawing the triangle. Notice that by removing just one point (either a center point or an edge point), we can suddenly make all sorts of other triangles.

That’s it for now on defining shapes with points. If you have some spare time, or are for some reason stuck walking back and forth along a fence with nothing to do, consider trying to figure out how many points you need for different shapes in your head; hours of fun are sure to follow.