Posts Tagged ‘Probability’

The Johnson-Lindenstrauss Lemma

May 18, 2017

Today is a good day I think to talk about the Johnson-Lindenstrauss dimension reduction lemma!

Johnson and Lindenstrauss. It cannot be coincidental that they are both wearing the same type of glasses.
Left image from Johnson’s website, right image from Gil Kalai’s blogpost in memory of Lindenstrauss.

The setting is as follows. Suppose you have a large number n points, living in the Euclidean space of dimension \mathbb{R}^{d} . We’ll call these points x_1,x_2,\ldots,x_n , and think of both the number of points n and the number of dimensions d as being very large, much larger than the measly “3” of our easy-to-grasp physical world.

Now, you may wonder, “But we do live in a three dimensional world thank you very much, so where would I obtain such high dimensional data?” Well, these situations arise increasingly more often in computer science applications, where a point in d -dimensional space may represent an abstraction of some actual real-world thing.

For example, imagine this not-too-farfetched biology experiment. A biologist wants to understand how a particular gene works in a bacterium, and she does this using the following procedure: first, she makes small changes in the DNA sections that code up the protein. Then she replaces the original gene in the bacterium by the new modified one. Finally, she measures the concentration of about 1000 different proteins in the bacterium. If the original gene was important, or if it was a key ingredient in a large protein pathway, then many of these concentrations will be different under the modified gene when compared to the unchanged gene.

Our biologist repeats this experiment many many times, each time changing a different part of the gene in a different way. Ultimately, she has performed say about 500 experiments. Each experiment results in a vector of 1000 numbers, one for each protein. She now has 500 points living in a 1000-dimensional space*. Our poor biologist must now analyze these points, and from them understand the way in which the original gene affects the proteins and operates within the cell. She will run many statistical and geometrical algorithms on these data, and these algorithms will usually have the downside that they really, really slow down as a function of dimension. What can she do?

* Actually, since a set of n points always spans a space of dimension no larger than n , these points only live in a 500-dimensional space. But that’s besides the point here, as we will see next.

Ok, back to mathematics. The Johnson-Lindenstrauss lemma roughly says the following. Suppose you have these n points living in \mathbb{R}^{d} . You can actually put them into a very small space – logarithmic in n – in such a way that the distances between the points are almost preserved: If two points were originally far away from each other, they’ll remain far away from each other, and if two points were close, they’ll remain close. More formally,

Lemma: Let 0 < \varepsilon <1 and let x_1,x_2,\ldots,x_n be n points in \mathbb{R}^{d} . Then there is a function f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k} such that for every two points x_i and x_j ,

(1-\varepsilon)||x_i - x_j||^2 \leq ||f(x_i)-f(x_j)||^2 \leq (1+\varepsilon)||x_i - x_j||^2

with k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3).

This lemma is good if you want to run geometric algorithms on your data. For example, in our biologist’s case, she might decide that if two modifications to the gene resulted in protein densities that are more or less the same (that is, two points in \mathbb{R}^d that are close to each other), then the modifications changed the gene in functionally the same manner. She might then decide to look for clusters of points which are close to each other. This is usually easier to do in smaller dimensional spaces. Applying a Johnson-Lindenstrauss transformation allows our biologist to run the algorithms faster and on more data, while (hopefully) not losing much accuracy.

Of course, this is useful only if the the new dimension k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) is much smaller than the original dimension d. This often happens in applications; for example, in a future post, we will work with n points in n dimensions, and the transition from a space of dimension n to space of dimension \log n will be quite substantial.

The proof we will present here follows a paper by Dasgupta and Gupta, which uses random projections. We’ll first illustrate this concept in two dimensions.

Suppose you have a set of points in the plane:

A one-dimensional subspace of the plane is just a line that goes through the origin. Once we have such a line, we can perform an orthogonal projection of the original points onto that line, giving us a new set of points, all on the line:

If we choose a different line, we get a different projection:

For some lines, the projection will totally wreck the distances between the original points; for example, in this image, we sent two different original points to the same point:

However, for other lines, it may be that the projection is able to preserve the ratio of distances fairly accurately.

The proof the Johnson-Lindenstrauss lemma uses the same idea, only in higher dimensions. Instead of points in the plane, we have our original d -dimensional space; and instead of projecting to a line, we try to find a subspace of dimension k so that projecting on to it will almost preserve distances. [Actually, we project the points and then blow them up by a constant factor, since orthogonal projections can only decrease the length of vectors; but this doesn’t change the gist of the argument. From now on when we say “distance preserving”, we’ll mean actually mean “up to a constant factor”. See the actual calculations for exact treatment].

A-priori, there is no grounds to think that such a subspace exists. However, it turns out that it indeed does, for a reason called “the concentration of measure in high dimensions” Roughly, it says that in high dimensions, some random processes – such as the orthogonal projection of a point onto a random subspace, or the scalar product of a unit vector with a Gaussian vector – are heavily concentrated around their expected value. The probability for such processes to be even a tiny bit away from their mean is exponentially small in the dimension. We usually don’t see these phenomena in 3d, because the lowly and pathetic exponent of 3 isn’t enough to give a profound effect, but in higher dimensions they flourish.

After all of this verbosity, here is the proof outline: for any k , we can choose a random k dimensional subspace of \mathbb{R}^d and project our points on to it. Choose two particular points x_i and x_j from our original cast in \mathbb{R}^d . It may either be that the distance between them is preserved up to a factor of 1 \pm \varepsilon , or it may not be preserved; denote the probability that it is not preserved is some number p . We can consider this “distance distortion” event for all of the {n \choose 2} = (n^2 - n)/2 pairs of points, and by the union bound, the probability for at least one pair of points to have their distance distorted is less than p \cdot (n^2 - n)/2 . If we can show that the probability p is smaller than 2 / (n^2 -n) , then the probability that at least one distance is distorted is smaller than 1. This means there is a non-zero probability that all the distances are preserved! But then there has to exist a distance-preserving projection; otherwise it would be impossible for this probability to be greater than 0. And as it turns out, we can indeed make the probability that a single distance is preserved be this small, by choosing k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) .

Actually, this proof not only shows that there exists such a transformation, it even tells you how to efficiently get one: the probability that a random projection gives a distance-preserving dimension reduction is actually quite high (and can be made very close to one if we just increase the number of dimensions by a bit). So you just need to take your points, project them randomly, and check that the distances are preserved; if not repeat, and sooner rather than later you are bound to get something that works.

Ok, let’s take a look the technical details. All we have to show is that with high probability, the length of a vector is nearly preserved when we project it to a random k dimensional subspace. Now, when performing the calculations, instead of taking a fixed vector and a random subspace, we can take a fixed subspace and a random vector; all the results will be the same, since all we care about when projecting is the relative orientation between the vector and the subspace.

So let’s take a random unit vector, and see what happens to its length. A uniformly random unit vector can be obtained by taking a random distribution which is spherically symmetric and normalizing it. Specifically, let X_1,\ldots,X_d be d independent Gaussian random variables with mean 0 and variance 1; in other words, the random vector X = (X_1,\ldots, X_d) distributes as a d dimensional Gaussian vector \mathcal{N}(0,\text{Id}_{d}) . Then the vector

Y = \frac{X}{||X||}

is a uniformly random unit vector. As for our fixed subspace, we’ll just take the space spanned by the first k coordinates, and denote the projection by Z = \frac{(X_1,\ldots ,X_k)}{||X||} . What is the expected value of the square of the length of Z ? Well, we know that the squared length of Y is 1; this is by design: we normalized it. We thus have:

||Y||^2 = 1 = \sum_{i = 1}^{d}\frac{X_i^2}{||X||^2}.

If this is true for ||Y||^2 , then it is also true for its expectation:

\mathbb{E}||Y||^2 = 1 = \sum_{i = 1}^{d}\mathbb{E}\frac{X_i^2}{||X||^2}.

Notice that the sum is symmetric in the X_i ‘s: in expectation, there is no reason for one X_i to behave differently from the other. Thus each term should contribute equally to the sum, and so for every i ,

\mathbb{E}\frac{X_i^2}{||X||^2} = 1/d.

Since Z is just the first k terms of Y , we get

\mathbb{E}||Z||^2 = \sum_{i=1}^{k}\mathbb{E}\frac{{X_i}^2}{||X||^2} = k/d.

Oops, the expected value is k/d , which is smaller than 1, the original norm of the vector! But not to worry, this can easily be fixed: after we project to a random subspace of dimension k , we just blow up all the points by a factor of \sqrt{d/k} . That way, when we look at the norm squared, the expectation will be d/k times larger, and the expectation will indeed by 1. This multiplication will not affect any of our probabilistic statements, since if ||Z||^2 is concentrated around its mean, so is \frac {d}{k}||Z||^2 .

Now to fulfil our promise and show that the squared length of Z is heavily concentrated around its mean. The calculations will be a bit lengthy, but do not despair (also, if you got up to here and understood everything, you are pretty much good-to-go for most of your Johnson-Lindenstrauss-ing needs).

We’ll show that for any positive real number \beta smaller than 1, the probability for ||Z||^2 to be \beta times smaller than its mean of k/d is exponentially small in the dimension and in \beta ; we’ll then take \beta = 1 - \varepsilon to show that the probability for a small deformation is small. The same calculations can also be repeated for checking when ||Z||^2 is greater than \beta for \beta > 1 , and then taking \beta = 1 + \varepsilon .

A standard method for showing such concentration of measure is using Markov’s inequality, which states that if W is a positive random variable, then for every a > 0 ,

\text{Pr}[W \geq a] \leq \frac{\mathbb{E}W}{a}.

Intuitively, all this inequality says is that if the expectation is small, then it’s impossible to have too large a probability of getting too large a value. Markov’s inequality is innocuous enough, but it becomes powerful when we combine it with exponential moments, that is, when we apply to e^{tW} instead of to W itself. Let’s see how that works.

When written explicitly, ||Z||^2 becomes

||Z||^2 = \frac{X_1^2 + \ldots + X_k^2} {X_1^2 + \ldots + X_d^2}.

So the probability \text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] can be written as

= \text{Pr}[d(X_1^2 + \ldots + X_k^2) \leq k \beta(X_1^2 + \ldots + X_k^2) ]

= \text{Pr}[k \beta(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2) \geq 0]

=  \text{Pr}[e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))} \geq 1].

Here t was some number greater than 0, and we will optimize over it soon in order to get the best possible bounds. Now we invoke Markov’s inequality with a = 1 , getting

\leq  \mathbb{E}e^{t(k(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))}.

The various X_i ‘s are independent, so the expectation of the product is the product of expectations:

\mathbb{E}e^{X_1 + X_2} = \mathbb{E}[e^{X_1}e^{X_2}] = \mathbb{E}[e^{X_1}]\mathbb{E}[e^{X_2}],

and after grouping together the X_i in our exponent, our probability is bounded by

=  \mathbb{E}[e^{t k \beta X_1^{2}}]^{d-k} \mathbb{E}[e^{t(k \beta - d)X_1^{2}}]^{k}

=  (1-2tk \beta)^{-(d-k)/2}(1-2k(k\beta-d))^{-k/2}.

This last equality we won’t go into, but it stems from knowledge about the moment generating function of Gaussians; in short, all that needs to be proven is that \mathbb{E}[e^{sX^2}] = 1/\sqrt{1-2s} for s<1/2 .

All that is left is to minimize this last expression. This we also won't do here, because it mainly involves annoying calculus which you can let WolframAlpha do for you. This finally gives:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \beta^{k/2}(1 + \frac{k(1-\beta)}{d-k})^{(d-k)/2}.

The expression on the right hand side is equal to e to the power of its logarithm:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2} \log \beta + \frac{d-k}{2} \log(1 + \frac{k(1-\beta)}{d-k})).

Finally, since \log(1+x) \leq x for all x \geq 0, we can replace the second \log factor by \frac{k(1-\beta)}{d-k} , giving

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2}(1-\beta + \log\beta)).

Now we just need to clean up. Put \beta = 1-\varepsilon , and get

\leq \exp(\frac{k}{2}(1-(1-\varepsilon) + \log(1-\varepsilon))).

By more calculus it can be shown that \log(1-x) \leq -x - x^2/2 for 0\leq x <1 , and at last we have

\leq \exp(\frac{k \varepsilon^2}{4}).

And indeed, if we choose k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) , this exponent will be smaller than 1/n^2 , just as we wanted (actually, we can choose just k = 8 \log n / (\varepsilon^2) , but to deal with the second case of \beta = 1+\varepsilon we will also need the \varepsilon^3/3 term). A similar calculation for \beta = 1+\varepsilon will give another factor of 1/n^2 , and in total the probability of having a large distortion will be just 2/n^2 per pair of points.

New paper on arXiv: indistinguishable sceneries on the Boolean hypercube

January 28, 2017

I’m happy to say that fellow student Uri Grupel and I uploaded a paper to the arXiv recently under the title “Indistinguishable sceneries on the Boolean hypercube” ( We had great fun working on it, and most of the theorems are actually pretty simple and do not use heavy mathematical machinery, so I’d like to share the motivation and results with you below. I’ll try to write a fairly simple exposition, without most of the technical terms and details. If you find this too simple, just read the paper!

Our paper shows that the scenery reconstruction problem for the n-dimensional hypercube is impossible for n \geq 4. It does this using the combinatorial properties of what we call “locally biased” and “locally balanced” functions, some properties of which the paper investigates.

Don’t know what scenery reconstruction is? Want to learn about and play around with locally biased and locally balanced functions? Read on.

We’ll start with a simpler-to-state problem. Suppose you are given a cycle graph, that is, a collection of vertices arranged in a cycle. For example, here is a cycle with 13 vertices:


Each vertex is colored either black or white. There are many different such possible colorings: if there are n vertices, then there are 2^n different colorings. However, the cycle has some symmetries, and some of these colorings can be considered to be the same. For example, in the following figure, if we rotate the center cycle 5 places clockwise, we get the cycle on the left. Similarly, if we reflect the center cycle around a vertical line, we get the cycle on the right. Such colorings are called “isomorphic”, and we will treat them all as “the same” for all intents and purposes.

Suppose now we place a person at one of the vertices at random. The person then performs a random walk: at each step, she randomly goes to the vertex either to her left or to her right, with each possibility equally likely to occur. Then she tells us the color of the vertex she is currently at. The result is a sequence of colors, one for each time step, which constitute the trace of the random walk on the cycle. Here is a short example of such a trace:


The idea you should keep in your mind is this: the coloring of the cycle is a “scenery”, and the agent, walking randomly along the cycle, tells you what she sees but *not* where she is going. The scenery reconstruction on the n-cycle is then framed as follows: suppose that you are given the length of the cycle n and also an infinitely long trace of scenery observations by your agent. Can you reconstruct the scenery itself? That is, can you find the original color of each vertex of the cycle?

As an easy example, if the original cycle was colored all black, then no matter where the agent goes, she will always report “black” at every time step. So if she reports “BBBBBBB…”, we can be confident that the original scenery is the all black cycle. Similarly, if the agent reports “BWBWBWB…”, we can be confident that the original scenery is the alternating black-white cycle (we can assume that in the long run, the agent will never be stuck in a “back and forth between two vertices only path). But what about more complicated sceneries, like the one given below? Can we always reconstruct even very bizzare colorings?


The (perhaps surprising) answer is: yes! It was shown that if you are given a long enough series of observations, you can always reconstruct the original coloring! (up to isomorphism, meaning, you cannot differentiate between colorings which are shifts or reflections of one another, but we already mentioned above that weview these to be “the same”). So while this basic question is closed, there are other open questions still to be answered. For example, how many observations do we need so that we will be able to reconstruct the coloring with very high probability?

Variants of the n-cycle reconstruction problem abound. What if, in our random walk, we don’t go left or right with equal probability? What if we can jump other distances? What if we color and walk on another graph other than the cycle?

The latter question is what Uri’s and my paper addresses. Instead of the n-cycle, we investigate the scenery reconstruction problem on the Boolean hypercube in dimension n. In dimension 1, the cube is just a line. In dimension 2, a square. In dimension 3, an actual cube. In higher dimensions visualizations start to get messy. One way to think about it is as follows: the n+1-dimensional hypercube is composed of just two identical n-dimensional hypercubes, with edges connecting matching vertices in the two cubes.


So again, we place an agent at one of the vertices at random, and she starts a random walk. This time, there are n possible choices of where to go in each step. Again, she reports only the color of the vertex she is currently at. Now can we reconstruct the original coloring?

For n=1, the hypercube is just two points connected by an edge, and there aren’t that many colorings on it, so it’s no big deal. For n=2, the hypercube is a square, which is actually a cycle of length 4, and we already know how to deal with cycles, so that’s no big deal either. For n=3, it can be shown that the methods used to solve for the cycle, combined with some case-by-base analysis, again are able to reconstruct the scenery. What about other dimensions?

The (perhaps surprising) answer is: no! Starting from n=4, there exist two colorings which are non-isomorphic (that is, they are not obtained by rotations or reflections of the cube), but which nevertheless cannot be distinguished from one another based on the sequence of color observations of our agent alone! So if we are given such a sequence, we can never be sure whether it came from one coloring or the other.

Here is such a pair in dimension n=4:


To see that these are indeed two different colorings (meaning, that one is not just a rotation or reflection of the other), note that in the left image there are two small rings of black vertices, each of size 4, while in the right image there is only one large ring, of size 8. This difference cannot be due to rotations and reflections alone.

If you look carefuly, you will see an interesting property of these two colorings: every vertex, no matter what color it is, has exactly two neighbors which are black, and exactly two neighbors which are white. It is this property that prohibits us from distinguishing between the two colorings: no matter where our agent is, at the next step she will report “black” with probability 1/2, and “white” with probability 1/2, for both colorings. In technical terms, the sequence of observations distributes as a sequence of iid Bernoulli variables with success probability of 1/2.

We can take this property and generalize it: we call a coloring “locally p-biased” if every vertex, no matter what color it is, has a p-fraction of its neighbors colored black, and a 1-p-fraction of its neighbors colored white. If there are two non-isomorphic such colorings on the hypercube, for any p, then the scenery reconstruction problem is impossible to solve on that hypercube. In fact, this statement is true for any graph, not just the hypercube, so finding locally biased colorings on different graphs shows that the scenerey reconstruction problem is impossible for those graphs. But we think that such colorings stand alone by their own right, and deserve study regardless of their application to scenery reconstruction; they are simply interesting combinatorially!

Our first result was to classify for which values of p an n-dimensional hypercube even has a locally biased coloring. We saw a 1/2-biased coloring for n=4. Obviously, a 1/2-biased coloring requires the dimension to be even, since each vertex has n neighbors, and exactly n/2 of those need to be black, and n/2 white. So is there a locally 1/2-biased coloring on the 6-dimensional hypercube? What about a 1/3-biased coloring? You can try it out yourself for a bit and convince yourself that there is no 1/3-biased coloring on the 3-dimensional cube. After a bit of work, we came up with the following theorem:

Theorem 1: There exists a locally p-biased coloring on the n dimensional hypercube if and only if p is of the form p = b/2^k and 2^k divides n.

This shows, for example, that there are no locally biased colorings on odd-dimensional hypercubes. But for even-dimensional hypercubes, there do exist such colorings, and then it’s interesting to start counting them: as we said before, if there are more than two colorings for the same p value, then the scenery reconstruction problem is impossible.

Our proof uses elements from coding theory, and is not very complicated. I’ll tell it in short in the following paragraphs: feel free to skip it if you don’t want to go ito the details.

One side, the “only if”, tells you which colorings are impossible. In the n-dimensional hypercube, each vertex has n neighbors, and there are 2^n vertices overall. Suppose that in a specific locally p-biased coloring, a total of l vertices are colored black. Then when you pick a random vertex, it is black with probability l/2^n. On the other hand, when the agent takes a random step, it will be black with probability p = m/n for some m. These two probabilities are equal, so

b/n = l/2^n.

Decomposing n into its prime powers, n = c2^k where c is odd. Then

l = 2^{n-k} \cdot m / c

and since l must be a whole integer (and not a fraction), we get that m = bc for some b. This gives p = b/2^k after a short calculation.

The other direction, which says that for all p values of the proper form you can build a locally p-biased coloring, is a bit harder. We start by building locally 1/n-biased colorings for n that is a power of two, that is, n=2^k. Such colorings have the peculiar property that they induce tiles which tile the hypercube. What do I mean by this? In such a coloring, every vertex, no matter the color, has exactly 1 black neighbor. In particular, this is true for black vertices. So black vertices always come together in connected pairs, and all the neighbors of such pairs are white. Further, these white neighbors already have their “1-black-neighbor” condition satisfied, so all their other neighbors are also white. This is exemplified in the following shape, or “tile”:


If we can “tile the hypercube” with these shapes, that is, if we can cover the hypercube graph with these tiles without any pair of tiles overlapping, we will have a locally 1/n-biased coloring. Why? Well, these tiles exactly satisfy the properties stated in the above paragraph, and every vertex in such a tiling will have exactly one black neighbor.

The proof constructs such tilings by using perfect codes; without going into detail, these are well-known methods of tiling the hypercube with a half-tile: instead of two black vertices and two rings of white vertices, these methods tile the hypercube with just a single black vertex and a single ring of white vertices. These are actually known as “balls” on the hypercube:


By taking two copies of a perfect code and connecting the black vertices, we obtain the needed tiling! This gives us a locally 1/n-biased coloring, for n that is a power of two (the perfect codes we need only exist in dimension 2^k-1, so when we take two copies of the hypercube together we get a hypercube of dimension n = 2^k).

To obtain a locally m/n-biased coloring, we join together m different disjoint 1/n-biased colorings. By m disjoint colorings, we mean that if a vertex is colored black in one coloring, then it is necessarily colored white in all the others. Finally, using a bit of algebraic manipulation which I will not go into here, we can extend these colorings into hypercubes of dimension c \cdot 2^k for any c, which is exactly what our theorem stated we could do.

The theorem shows existence of some locally p-biased colorings, but tells us nothing of the number of such colorings. In order to show that the scenery reconstruction problem is impossible, we need to find at least two different such colorings. That’s where our next two theorems come into play:

Theorem 2.1: As a function of n, there is at least a super-exponential number of different locally 1/n-biased colorings on the n dimensional hypercube.

Theorem 2.2: As a function of n, there are at least C \cdot 2^{\sqrt{n}} / \sqrt{n} different locally 1/2-biased colorings on the n dimensional hypercube, where C is some constant.

Theorem 2.1 is proved by counting the number of perfect codes, which were used in proving the previous theorem; we won’t go into it here.

The proof of Theorem 2.2 is more interesting, and relies on an algebraic treatment of our colorings. It goes as follows.

Up until now, we have treated our colorings as assigning the color either black or white to each vertex. We could have equally well assigned the numbers “+1” or “-1“, and all the theorems and statements would have stayed the same: all we care is that there are two distinct “colors”. But this description has an immediate advantage: we can now easily perform arithmetic operations on the colorings. For example, if both f_1 and f_2 are colorings, then their product f = f_1 \cdot f_2 is also a coloring, where by multiplication we mean that if f_1 assigns the color f_1(v) to a vertex v, and f_2 assigns the color f_2(v), then f(v) = f_1(v)\cdot f_2(v). In words, what this multiplication means is that f is black precisely when f_1 and f_2 have the same color. This algebraic representation is often easier to work with than the verbal one.

Now, the n-dimensional hypercube can be described as follows: it is the set of points with n coordinates, where each coordinate can be either +1 or -1 (you can easily verify that this fits in with our notions of 2d and 3d cubes; it works just the same for higher dimensions). Thus a coloring is just a function

f(x_1, x_2, \ldots, x_n)

which returns either 1 or -1, and all the x_i‘s are also either 1 or -1.

This representation lets us combine two locally 1/2-biased colorings on small hypercubes together into one locally 1/2-biased coloring on a larger hypercubes: Suppose that f_1 is such a coloring on the n dimensional hypercube, and f_2 is such a coloring on the m dimensional hypercube. Then the new coloring

f(x_1, x_2, \ldots, x_{n+m}) = f_1(x_1,\ldots,x_n) f_2(x_{n+1}, \ldots, x_{n+m})

is a locally 1/2-biased coloring on the n+m dimensional hypercube! Our method for counting the number of locally 1/2-biased colorings then goes as follows: we construct many high dimensional colorings from different lower-dimensional ones. In fact, our “elementary building blocks”, from which we build all other colorings, are basically generalizations of the two colorings of the 4-dimensional hypercube which we encountered earlier in this post. It can be shown (via Fourier decomposition, a method which we won’t go into here) that taking combinations of different low-dimensional colorings always gives non-isomorphic high-dimensional colorings. In other words, if we use different building blocks, then the resultant colorings will be distinct.

We can then get a bound on the number locally 1/2-biased colorings by counting in how many distinct ways we can add up low-dimensional colorings into a high-dimensional one. After a bit of arithmetic, we find that there are at least as many colorings as there are solutions to this inequality:

4a_1 + 8a_2 + \ldots + 4k a_k \leq n,

where a_1,\ldots,a_k are non-negative integers. The number of solutions to this inequality is at least C \cdot 2^{\sqrt{n}} / \sqrt{n} , giving us Theorem 2.2.

This, I think, is quite neat: it’s enough to find just two different colorings in order to show that the scenery reconstruction problem is impossible, but in fact there are many, many more of them. And what we gave was only a lower bound: it’s entirely plausible that there is a more clever way of counting these colorings, which will give an even larger number.

So far, so good. We saw that the scenery reconstruction problem is impossible for even-dimensional hypercubes, and even saw that there are different indistinguishable colorings. What about odd-dimensional cubes?

The answer comes from what we call “locally p-balanced” colorings. In these colorings, every vertex, no matter what color it is, has a p-fraction of its neighbors colored in its own color, and a 1-p-fraction of its neighbors colored the opposite color. These colorings also yield identical scenery observations: when our agent tells us the colors she sees, the next color will always be the same as the current one with probability p. For example, the following figure shows a locally 2/3-stable and a locally 1/3-stable coloring on the 3-dimensional cube.


Locally balanced colorings are different than locally biased colorings, although the two are related. For example, we saw that there is a strong restriction on the possible p values for locally biased colorings, but it turns out that for every p, there is at least one locally balanced coloring. On the other hand, if p = 1/n or p = (n-1)/n, then there is exactly one locally balanced coloring, while, as we saw, there are many, many locally biased ones.

The main point, though, is that every locally 1/2-biased coloring f on the n-dimensional hypercube can easily give a locally balanced coloring g on the n+1 dimensional hypercube, by:

g(x_1, \ldots, x_{n+1}) = g(x_1, \ldots, x_n) \cdot x_{n+1}.

You should verify for yourself that this is indeed a locally balanced coloring, and calculate its p value. This construction gives different locally balanced functions for all odd-dimensional hypercubes of dimension n \geq 5, showing that the scenery reconstruction problem is impossible for all n \geq 4. And that’s what we promised to show in the beginning of this post.

You should play around with locally biased and locally balanced colorings yourself; they serve as good brain teasers, and indeed we’ll end with a small riddle: here are two locally balanced colorings on the infinite one-dimensional and two-dimensional integer lattices. Can you find one for the three dimensional lattice?



Harmonic Hacking

April 3, 2013

Not long ago, I started learning Lisp, the dreaded programming language notorious for its (multiple (parentheses)). Specifically, Common Lisp, which is one of the major Lisp dialects (yes, lispers consider Lisp a language by all means). There are several good books on the topic, amongst them Peter Seibel’s Practical Common Lisp. In this book he suggests to use the “emacs” editor and an emacs Lisp mode called SLIME, which lets the user run Lisp directly in the editor, as well as offering other IDE features. I followed his advice, seeing it as an opportunity to finally discover what emacs is all about (it’s both amazingly wonderful and horribly despicable at the same time. Those of you who program should definitely try it, although the learning curve is a tad high).
SLIME has some nifty features, but the really interesting thing about this editor, is that every time you open the Lisp interpreter, it gives you some words of encouragement.


Why, yes, I suppose it could indeed be the start of a beautiful program! This is actually a very nice way to start your programming session, much better than the “startup tips” that many integrated development environments use.
But the really interesting thing about these words of encouragement, is that every time you open an interpreter, a new phrase is displayed. Opening a second instance, I now get:


Of course, it’s obvious that the sentences are built into the program – they are not generated with a random algorithm (say, via a Markovian process) but are just displayed at random. The most probable implementation is that there is a collection of such phrases, and at startup the program chooses one at random and prints it on the screen.
An interesting question to ask is: without checking the source code, how can I know how many such phrases there are in total? An experimental approach would be to open the program again and again and again, each time writing down the encouragement displayed. If there is a fixed number of phrases, then at some point no new phrases will be introduced. But how do I know when to stop looking? When do I finally decide that I have seen all the phrases that exist in the “words-of-encouragement-repository”? What if, by sheer chance, there was one poor sentence which was never selected? If I stop early, I will leave that one out.

Luckily, some basic probability can help us here and give us an answer which is accurate to within whatever error threshold we see fit.
Let’s start with a clean state. We’ll imagine a case in which we have no clue as to how many different sentences there are. There could be only one, there could be two, there could be 200. Call this number N.
The first time we run SLIME, we would get a sentence which we have never before seen, since we haven’t seen any sentences yet. The repository size is at least 1.
The second time we run SLIME, we might get the same sentence, or we might get a new one. The more sentences there are in the repository, the higher the chances of getting a new one: if there are N sentences overall, then there is only a \frac{1}{N} chance of getting the same first sentence.
If we got a new one, great! we know that the repository size is at least 2. But if not, we can ask for a third sentence. If that one is the same as the first, we still don’t know if there is more than one sentence; but, assuming there are N overall choices, the chances of getting the same sentence twice in a row is (\frac{1}{N})^2 . So we take a fourth one, and if THAT is the same, well, the chances of that happening are (\frac{1}{N})^3.
We see that if we poll again and again and keep getting the same sentence, then it isn’t very likely that N is very large, because high powers of \frac{1}{N} will go to zero very quickly for large N. In fact, if we were to keep getting only the same original sentence, we would very soon be convinced that there is only one sentence in the repository.

Let us generalize. Suppose that we have asked for sentences several times already, and we have already seen k different ones. We would like to know what the chances are that “this is it, there are only k sentences in the whole words-of-encouragement-repository, and no more”.
If there are indeed only k sentences, then no matter how many more times we ask, we will not get any new encouragement. If, however, the actual repository size is larger than k, then eventually, by chance, we might run across one that we haven’t seen before.
So our strategy is this: we keep asking for more sentences; if, after a long time, we don’t get any new ones, we can say that there are probably only k sentences. But, what is “long time”, and what is “probably”? These can be quantified exactly.
We have already seen k different sentences. That means that the repository size is at least k. But what if it were exactly k+1? That means that every time we ask for a sentence, there is a \frac{1}{k+1} chance of getting that extra one which we haven’t seen. This means that there is a \frac{k}{k+1} chance of getting one that we have already seen.
So if there are k+1, and not k, different phrases, then the chance of not seeing a new encouragement phrase s times in a row is (\frac{k}{k+1})^s . But \frac{k}{k+1} is a fraction smaller than 1. This means that if we keep asking for more sentences, but only get ones we already saw, the probability that there are actually k+1 different items in the repository drops lower and lower.
We can now define a threshold which says how confident we are that the repository indeed contains only k items. If we choose, for example, 95%, then we ask for more and more sentences until (\frac{k}{k+1})^s < 0.05 . Then, after s polls, we can declare “There is a 95% chance that there are only k, and not k+1, different phrases!” This makes clear what we mean by “long time” and “probably” – we choose our level of confidence, and wait for a streak until that confidence is reached. This might not be exactly the confidence that there are k sentences (and not any other number), but it suites our cause well.
Of course, it might be that during our search, a new item will indeed come up. Wonderful! We increased our knowledge about the repository. In this case, we need to update our k value, and reset the s count.

So I wrote a program (in Lisp, of course!) which does the following. It starts in a clean state, not knowing anything about the number of sentences in the words-of-encouragement-repository. It is then fed the yielded sentences, one by one. At all times, if the number of different sentences it has seen is k, it looks at the current streak s of non-new sentences it received, and asks itself how likely this streak is, assuming that the actual repository size is k+1. When the probability is less than a given threshold, it stops, proclaiming that within the requested confidence, there are k different sentences in the repository.
Note that if the repository size is finite, then this algorithm is bound to terminate – each new poll either increases the known repository size, or decreases the chance of the repository being bigger. Further, we only check k+1, and not all possible numbers larger than k, because (\frac{k}{k+z})^s drops to zero faster for larger z’s, so if we are confident that the repository is smaller than k+1, we are even more confident that it’s smaller than k+z.

The result? I gave the program a threshold of 0.01, which means 99% confidence. It terminated after processing 55 sentences, stating that there were 8 different phrases with a 0.00899 chance of error.


How does that compare with SLIME’s source code?

(defvar slime-words-of-encouragement
 '("Let the hacking commence!"                                    ; 1
   "Hacks and glory await!"                                       ; 2
   "Hack and be merry!"                                           ; 3
   "Your hacking starts... NOW!"                                  ; 4
   "May the source be with you!"                                  ; 5
   "Take this REPL, brother, and may it serve you well."          ; 6
   "Lemonodor-fame is but a hack away!"                           ; 7
   ,(format "%s, this could be the start of a beautiful program." ; 8
 "Scientifically-proven optimal words of hackerish encouragement.") ; Documentation

Victory for probability and Lisp.

Footnote too cool not to tell about:
A nice related question to ask is, “if there are N different sentences overall, then on average, how many times will it take to see all of them, if each time I select one at random?”
The best case, of course, is N times – if, by miraculous stroke of luck, each time you pick a sentence at random, you pick a different one. The worst case, potentially, is sort of never – there might be one which would, by a foul stroke of misfortune, never get selected (although the chances of this happening are practically 0). So, what is the average case?
Suppose that you have already seen k out of N sentences. Then the chance of seeing a new one is p = \frac{N-k}{N}. So on average, you would need to \frac{1}{p} = \frac{N}{N-k} attempts. So the total expected value is

\sum_{k=0}^{N-1} \frac{N}{N-k} =  N \sum_{k=0}^{N-1} \frac{1}{N-k} = N \sum_{k=1}^{N} \frac{1}{k}

In words, this is N, our repository size, times the harmonic series, H = \sum_{k=1}^{N} \frac{1}{k} . This series is divergent, meaning that as N goes to infinity, so does H. However, while it grows relatively quickly for small N, for larger ones additional elements barely have any effect. For example, when N = 11, H is about 3. If we want to double that, H = 6, then N = 227. If we want to double that, H = 12, then N = 91,380. So while the harmonic series’ effect on the expected value is not very large, it ensures that the average case will need more than N attempts.