## Archive for the ‘Mathematics’ Category

### The Johnson-Lindenstrauss Lemma

May 18, 2017

Today is a good day I think to talk about the Johnson-Lindenstrauss dimension reduction lemma!

Johnson and Lindenstrauss. It cannot be coincidental that they are both wearing the same type of glasses.
Left image from Johnson’s website, right image from Gil Kalai’s blogpost in memory of Lindenstrauss.

The setting is as follows. Suppose you have a large number $n$ points, living in the Euclidean space of dimension $\mathbb{R}^{d}$. We’ll call these points $x_1,x_2,\ldots,x_n$, and think of both the number of points $n$ and the number of dimensions $d$ as being very large, much larger than the measly “3” of our easy-to-grasp physical world.

Now, you may wonder, “But we do live in a three dimensional world thank you very much, so where would I obtain such high dimensional data?” Well, these situations arise increasingly more often in computer science applications, where a point in $d$-dimensional space may represent an abstraction of some actual real-world thing.

For example, imagine this not-too-farfetched biology experiment. A biologist wants to understand how a particular gene works in a bacterium, and she does this using the following procedure: first, she makes small changes in the DNA sections that code up the protein. Then she replaces the original gene in the bacterium by the new modified one. Finally, she measures the concentration of about 1000 different proteins in the bacterium. If the original gene was important, or if it was a key ingredient in a large protein pathway, then many of these concentrations will be different under the modified gene when compared to the unchanged gene.

Our biologist repeats this experiment many many times, each time changing a different part of the gene in a different way. Ultimately, she has performed say about 500 experiments. Each experiment results in a vector of 1000 numbers, one for each protein. She now has 500 points living in a 1000-dimensional space*. Our poor biologist must now analyze these points, and from them understand the way in which the original gene affects the proteins and operates within the cell. She will run many statistical and geometrical algorithms on these data, and these algorithms will usually have the downside that they really, really slow down as a function of dimension. What can she do?

* Actually, since a set of $n$ points always spans a space of dimension no larger than $n$, these points only live in a 500-dimensional space. But that’s besides the point here, as we will see next.

Ok, back to mathematics. The Johnson-Lindenstrauss lemma roughly says the following. Suppose you have these $n$ points living in $\mathbb{R}^{d}$. You can actually put them into a very small space – logarithmic in $n$ – in such a way that the distances between the points are almost preserved: If two points were originally far away from each other, they’ll remain far away from each other, and if two points were close, they’ll remain close. More formally,

Lemma: Let $0 < \varepsilon <1$ and let $x_1,x_2,\ldots,x_n$ be $n$ points in $\mathbb{R}^{d}$. Then there is a function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}$ such that for every two points $x_i$ and $x_j$,

$(1-\varepsilon)||x_i - x_j||^2 \leq ||f(x_i)-f(x_j)||^2 \leq (1+\varepsilon)||x_i - x_j||^2$

with $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3).$

This lemma is good if you want to run geometric algorithms on your data. For example, in our biologist’s case, she might decide that if two modifications to the gene resulted in protein densities that are more or less the same (that is, two points in $\mathbb{R}^d$ that are close to each other), then the modifications changed the gene in functionally the same manner. She might then decide to look for clusters of points which are close to each other. This is usually easier to do in smaller dimensional spaces. Applying a Johnson-Lindenstrauss transformation allows our biologist to run the algorithms faster and on more data, while (hopefully) not losing much accuracy.

Of course, this is useful only if the the new dimension $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3)$ is much smaller than the original dimension $d$. This often happens in applications; for example, in a future post, we will work with $n$ points in $n$ dimensions, and the transition from a space of dimension $n$ to space of dimension $\log n$ will be quite substantial.

The proof we will present here follows a paper by Dasgupta and Gupta, which uses random projections. We’ll first illustrate this concept in two dimensions.

Suppose you have a set of points in the plane:

A one-dimensional subspace of the plane is just a line that goes through the origin. Once we have such a line, we can perform an orthogonal projection of the original points onto that line, giving us a new set of points, all on the line:

If we choose a different line, we get a different projection:

For some lines, the projection will totally wreck the distances between the original points; for example, in this image, we sent two different original points to the same point:

However, for other lines, it may be that the projection is able to preserve the ratio of distances fairly accurately.

The proof the Johnson-Lindenstrauss lemma uses the same idea, only in higher dimensions. Instead of points in the plane, we have our original $d$-dimensional space; and instead of projecting to a line, we try to find a subspace of dimension $k$ so that projecting on to it will almost preserve distances. [Actually, we project the points and then blow them up by a constant factor, since orthogonal projections can only decrease the length of vectors; but this doesn’t change the gist of the argument. From now on when we say “distance preserving”, we’ll mean actually mean “up to a constant factor”. See the actual calculations for exact treatment].

A-priori, there is no grounds to think that such a subspace exists. However, it turns out that it indeed does, for a reason called “the concentration of measure in high dimensions” Roughly, it says that in high dimensions, some random processes – such as the orthogonal projection of a point onto a random subspace, or the scalar product of a unit vector with a Gaussian vector – are heavily concentrated around their expected value. The probability for such processes to be even a tiny bit away from their mean is exponentially small in the dimension. We usually don’t see these phenomena in 3d, because the lowly and pathetic exponent of 3 isn’t enough to give a profound effect, but in higher dimensions they flourish.

After all of this verbosity, here is the proof outline: for any $k$, we can choose a random $k$ dimensional subspace of $\mathbb{R}^d$ and project our points on to it. Choose two particular points $x_i$ and $x_j$ from our original cast in $\mathbb{R}^d$. It may either be that the distance between them is preserved up to a factor of $1 \pm \varepsilon$, or it may not be preserved; denote the probability that it is not preserved is some number $p$. We can consider this “distance distortion” event for all of the ${n \choose 2} = (n^2 - n)/2$ pairs of points, and by the union bound, the probability for at least one pair of points to have their distance distorted is less than $p \cdot (n^2 - n)/2$. If we can show that the probability $p$ is smaller than $2 / (n^2 -n)$, then the probability that at least one distance is distorted is smaller than 1. This means there is a non-zero probability that all the distances are preserved! But then there has to exist a distance-preserving projection; otherwise it would be impossible for this probability to be greater than 0. And as it turns out, we can indeed make the probability that a single distance is preserved be this small, by choosing $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3)$.

Actually, this proof not only shows that there exists such a transformation, it even tells you how to efficiently get one: the probability that a random projection gives a distance-preserving dimension reduction is actually quite high (and can be made very close to one if we just increase the number of dimensions by a bit). So you just need to take your points, project them randomly, and check that the distances are preserved; if not repeat, and sooner rather than later you are bound to get something that works.

Ok, let’s take a look the technical details. All we have to show is that with high probability, the length of a vector is nearly preserved when we project it to a random $k$ dimensional subspace. Now, when performing the calculations, instead of taking a fixed vector and a random subspace, we can take a fixed subspace and a random vector; all the results will be the same, since all we care about when projecting is the relative orientation between the vector and the subspace.

So let’s take a random unit vector, and see what happens to its length. A uniformly random unit vector can be obtained by taking a random distribution which is spherically symmetric and normalizing it. Specifically, let $X_1,\ldots,X_d$ be $d$ independent Gaussian random variables with mean 0 and variance 1; in other words, the random vector $X = (X_1,\ldots, X_d)$ distributes as a $d$ dimensional Gaussian vector $\mathcal{N}(0,\text{Id}_{d})$. Then the vector

$Y = \frac{X}{||X||}$

is a uniformly random unit vector. As for our fixed subspace, we’ll just take the space spanned by the first $k$ coordinates, and denote the projection by $Z = \frac{(X_1,\ldots ,X_k)}{||X||}$. What is the expected value of the square of the length of $Z$? Well, we know that the squared length of $Y$ is 1; this is by design: we normalized it. We thus have:

$||Y||^2 = 1 = \sum_{i = 1}^{d}\frac{X_i^2}{||X||^2}.$

If this is true for $||Y||^2$, then it is also true for its expectation:

$\mathbb{E}||Y||^2 = 1 = \sum_{i = 1}^{d}\mathbb{E}\frac{X_i^2}{||X||^2}.$

Notice that the sum is symmetric in the $X_i$‘s: in expectation, there is no reason for one $X_i$ to behave differently from the other. Thus each term should contribute equally to the sum, and so for every $i$,

$\mathbb{E}\frac{X_i^2}{||X||^2} = 1/d.$

Since $Z$ is just the first $k$ terms of $Y$, we get

$\mathbb{E}||Z||^2 = \sum_{i=1}^{k}\mathbb{E}\frac{{X_i}^2}{||X||^2} = k/d.$

Oops, the expected value is $k/d$, which is smaller than 1, the original norm of the vector! But not to worry, this can easily be fixed: after we project to a random subspace of dimension $k$, we just blow up all the points by a factor of $\sqrt{d/k}$. That way, when we look at the norm squared, the expectation will be $d/k$ times larger, and the expectation will indeed by 1. This multiplication will not affect any of our probabilistic statements, since if $||Z||^2$ is concentrated around its mean, so is $\frac {d}{k}||Z||^2$.

Now to fulfil our promise and show that the squared length of $Z$ is heavily concentrated around its mean. The calculations will be a bit lengthy, but do not despair (also, if you got up to here and understood everything, you are pretty much good-to-go for most of your Johnson-Lindenstrauss-ing needs).

We’ll show that for any positive real number $\beta$ smaller than 1, the probability for $||Z||^2$ to be $\beta$ times smaller than its mean of $k/d$ is exponentially small in the dimension and in $\beta$; we’ll then take $\beta = 1 - \varepsilon$ to show that the probability for a small deformation is small. The same calculations can also be repeated for checking when $||Z||^2$ is greater than $\beta$ for $\beta > 1$, and then taking $\beta = 1 + \varepsilon$.

A standard method for showing such concentration of measure is using Markov’s inequality, which states that if $W$ is a positive random variable, then for every $a > 0$,

$\text{Pr}[W \geq a] \leq \frac{\mathbb{E}W}{a}.$

Intuitively, all this inequality says is that if the expectation is small, then it’s impossible to have too large a probability of getting too large a value. Markov’s inequality is innocuous enough, but it becomes powerful when we combine it with exponential moments, that is, when we apply to $e^{tW}$ instead of to $W$ itself. Let’s see how that works.

When written explicitly, $||Z||^2$ becomes

$||Z||^2 = \frac{X_1^2 + \ldots + X_k^2} {X_1^2 + \ldots + X_d^2}.$

So the probability $\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}]$ can be written as

$= \text{Pr}[d(X_1^2 + \ldots + X_k^2) \leq k \beta(X_1^2 + \ldots + X_k^2) ]$

$= \text{Pr}[k \beta(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2) \geq 0]$

$= \text{Pr}[e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))} \geq 1].$

Here $t$ was some number greater than 0, and we will optimize over it soon in order to get the best possible bounds. Now we invoke Markov’s inequality with $a = 1$, getting

$\leq \mathbb{E}e^{t(k(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))}.$

The various $X_i$‘s are independent, so the expectation of the product is the product of expectations:

$\mathbb{E}e^{X_1 + X_2} = \mathbb{E}[e^{X_1}e^{X_2}] = \mathbb{E}[e^{X_1}]\mathbb{E}[e^{X_2}],$

and after grouping together the $X_i$ in our exponent, our probability is bounded by

$= \mathbb{E}[e^{t k \beta X_1^{2}}]^{d-k} \mathbb{E}[e^{t(k \beta - d)X_1^{2}}]^{k}$

$= (1-2tk \beta)^{-(d-k)/2}(1-2k(k\beta-d))^{-k/2}.$

This last equality we won’t go into, but it stems from knowledge about the moment generating function of Gaussians; in short, all that needs to be proven is that $\mathbb{E}[e^{sX^2}] = 1/\sqrt{1-2s}$ for $s<1/2$.

All that is left is to minimize this last expression. This we also won't do here, because it mainly involves annoying calculus which you can let WolframAlpha do for you. This finally gives:

$\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \beta^{k/2}(1 + \frac{k(1-\beta)}{d-k})^{(d-k)/2}.$

The expression on the right hand side is equal to $e$ to the power of its logarithm:

$\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2} \log \beta + \frac{d-k}{2} \log(1 + \frac{k(1-\beta)}{d-k})).$

Finally, since $\log(1+x) \leq x$ for all $x \geq 0$, we can replace the second $\log$ factor by $\frac{k(1-\beta)}{d-k}$, giving

$\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2}(1-\beta + \log\beta)).$

Now we just need to clean up. Put $\beta = 1-\varepsilon$, and get

$\leq \exp(\frac{k}{2}(1-(1-\varepsilon) + \log(1-\varepsilon))).$

By more calculus it can be shown that $\log(1-x) \leq -x - x^2/2$ for $0\leq x <1$, and at last we have

$\leq \exp(\frac{k \varepsilon^2}{4}).$

And indeed, if we choose $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3)$, this exponent will be smaller than $1/n^2$, just as we wanted (actually, we can choose just $k = 8 \log n / (\varepsilon^2)$, but to deal with the second case of $\beta = 1+\varepsilon$ we will also need the $\varepsilon^3/3$ term). A similar calculation for $\beta = 1+\varepsilon$ will give another factor of $1/n^2$, and in total the probability of having a large distortion will be just $2/n^2$ per pair of points.

### New paper on arXiv: indistinguishable sceneries on the Boolean hypercube

January 28, 2017

I’m happy to say that fellow student Uri Grupel and I uploaded a paper to the arXiv recently under the title “Indistinguishable sceneries on the Boolean hypercube” (https://arxiv.org/abs/1701.07667). We had great fun working on it, and most of the theorems are actually pretty simple and do not use heavy mathematical machinery, so I’d like to share the motivation and results with you below. I’ll try to write a fairly simple exposition, without most of the technical terms and details. If you find this too simple, just read the paper!

Our paper shows that the scenery reconstruction problem for the $n$-dimensional hypercube is impossible for $n \geq 4$. It does this using the combinatorial properties of what we call “locally biased” and “locally balanced” functions, some properties of which the paper investigates.

Don’t know what scenery reconstruction is? Want to learn about and play around with locally biased and locally balanced functions? Read on.

We’ll start with a simpler-to-state problem. Suppose you are given a cycle graph, that is, a collection of vertices arranged in a cycle. For example, here is a cycle with $13$ vertices:

Each vertex is colored either black or white. There are many different such possible colorings: if there are $n$ vertices, then there are $2^n$ different colorings. However, the cycle has some symmetries, and some of these colorings can be considered to be the same. For example, in the following figure, if we rotate the center cycle 5 places clockwise, we get the cycle on the left. Similarly, if we reflect the center cycle around a vertical line, we get the cycle on the right. Such colorings are called “isomorphic”, and we will treat them all as “the same” for all intents and purposes.

Suppose now we place a person at one of the vertices at random. The person then performs a random walk: at each step, she randomly goes to the vertex either to her left or to her right, with each possibility equally likely to occur. Then she tells us the color of the vertex she is currently at. The result is a sequence of colors, one for each time step, which constitute the trace of the random walk on the cycle. Here is a short example of such a trace:

The idea you should keep in your mind is this: the coloring of the cycle is a “scenery”, and the agent, walking randomly along the cycle, tells you what she sees but *not* where she is going. The scenery reconstruction on the $n$-cycle is then framed as follows: suppose that you are given the length of the cycle $n$ and also an infinitely long trace of scenery observations by your agent. Can you reconstruct the scenery itself? That is, can you find the original color of each vertex of the cycle?

As an easy example, if the original cycle was colored all black, then no matter where the agent goes, she will always report “black” at every time step. So if she reports “BBBBBBB…”, we can be confident that the original scenery is the all black cycle. Similarly, if the agent reports “BWBWBWB…”, we can be confident that the original scenery is the alternating black-white cycle (we can assume that in the long run, the agent will never be stuck in a “back and forth between two vertices only path). But what about more complicated sceneries, like the one given below? Can we always reconstruct even very bizzare colorings?

The (perhaps surprising) answer is: yes! It was shown that if you are given a long enough series of observations, you can always reconstruct the original coloring! (up to isomorphism, meaning, you cannot differentiate between colorings which are shifts or reflections of one another, but we already mentioned above that weview these to be “the same”). So while this basic question is closed, there are other open questions still to be answered. For example, how many observations do we need so that we will be able to reconstruct the coloring with very high probability?

Variants of the $n$-cycle reconstruction problem abound. What if, in our random walk, we don’t go left or right with equal probability? What if we can jump other distances? What if we color and walk on another graph other than the cycle?

The latter question is what Uri’s and my paper addresses. Instead of the $n$-cycle, we investigate the scenery reconstruction problem on the Boolean hypercube in dimension $n$. In dimension 1, the cube is just a line. In dimension 2, a square. In dimension 3, an actual cube. In higher dimensions visualizations start to get messy. One way to think about it is as follows: the $n+1$-dimensional hypercube is composed of just two identical $n$-dimensional hypercubes, with edges connecting matching vertices in the two cubes.

So again, we place an agent at one of the vertices at random, and she starts a random walk. This time, there are $n$ possible choices of where to go in each step. Again, she reports only the color of the vertex she is currently at. Now can we reconstruct the original coloring?

For $n=1$, the hypercube is just two points connected by an edge, and there aren’t that many colorings on it, so it’s no big deal. For $n=2$, the hypercube is a square, which is actually a cycle of length $4$, and we already know how to deal with cycles, so that’s no big deal either. For $n=3$, it can be shown that the methods used to solve for the cycle, combined with some case-by-base analysis, again are able to reconstruct the scenery. What about other dimensions?

The (perhaps surprising) answer is: no! Starting from $n=4$, there exist two colorings which are non-isomorphic (that is, they are not obtained by rotations or reflections of the cube), but which nevertheless cannot be distinguished from one another based on the sequence of color observations of our agent alone! So if we are given such a sequence, we can never be sure whether it came from one coloring or the other.

Here is such a pair in dimension $n=4$:

To see that these are indeed two different colorings (meaning, that one is not just a rotation or reflection of the other), note that in the left image there are two small rings of black vertices, each of size $4$, while in the right image there is only one large ring, of size $8$. This difference cannot be due to rotations and reflections alone.

If you look carefuly, you will see an interesting property of these two colorings: every vertex, no matter what color it is, has exactly two neighbors which are black, and exactly two neighbors which are white. It is this property that prohibits us from distinguishing between the two colorings: no matter where our agent is, at the next step she will report “black” with probability $1/2$, and “white” with probability $1/2$, for both colorings. In technical terms, the sequence of observations distributes as a sequence of iid Bernoulli variables with success probability of $1/2$.

We can take this property and generalize it: we call a coloring “locally $p$-biased” if every vertex, no matter what color it is, has a $p$-fraction of its neighbors colored black, and a $1-p$-fraction of its neighbors colored white. If there are two non-isomorphic such colorings on the hypercube, for any $p$, then the scenery reconstruction problem is impossible to solve on that hypercube. In fact, this statement is true for any graph, not just the hypercube, so finding locally biased colorings on different graphs shows that the scenerey reconstruction problem is impossible for those graphs. But we think that such colorings stand alone by their own right, and deserve study regardless of their application to scenery reconstruction; they are simply interesting combinatorially!

Our first result was to classify for which values of $p$ an $n$-dimensional hypercube even has a locally biased coloring. We saw a $1/2$-biased coloring for $n=4$. Obviously, a $1/2$-biased coloring requires the dimension to be even, since each vertex has $n$ neighbors, and exactly $n/2$ of those need to be black, and $n/2$ white. So is there a locally $1/2$-biased coloring on the $6$-dimensional hypercube? What about a $1/3$-biased coloring? You can try it out yourself for a bit and convince yourself that there is no $1/3$-biased coloring on the $3$-dimensional cube. After a bit of work, we came up with the following theorem:

Theorem 1: There exists a locally $p$-biased coloring on the $n$ dimensional hypercube if and only if $p$ is of the form $p = b/2^k$ and $2^k$ divides $n$.

This shows, for example, that there are no locally biased colorings on odd-dimensional hypercubes. But for even-dimensional hypercubes, there do exist such colorings, and then it’s interesting to start counting them: as we said before, if there are more than two colorings for the same $p$ value, then the scenery reconstruction problem is impossible.

Our proof uses elements from coding theory, and is not very complicated. I’ll tell it in short in the following paragraphs: feel free to skip it if you don’t want to go ito the details.

One side, the “only if”, tells you which colorings are impossible. In the $n$-dimensional hypercube, each vertex has $n$ neighbors, and there are $2^n$ vertices overall. Suppose that in a specific locally $p$-biased coloring, a total of $l$ vertices are colored black. Then when you pick a random vertex, it is black with probability $l/2^n$. On the other hand, when the agent takes a random step, it will be black with probability $p = m/n$ for some $m$. These two probabilities are equal, so

$b/n = l/2^n.$

Decomposing $n$ into its prime powers, $n = c2^k$ where $c$ is odd. Then

$l = 2^{n-k} \cdot m / c$

and since $l$ must be a whole integer (and not a fraction), we get that $m = bc$ for some $b$. This gives $p = b/2^k$ after a short calculation.

The other direction, which says that for all $p$ values of the proper form you can build a locally $p$-biased coloring, is a bit harder. We start by building locally $1/n$-biased colorings for $n$ that is a power of two, that is, $n=2^k$. Such colorings have the peculiar property that they induce tiles which tile the hypercube. What do I mean by this? In such a coloring, every vertex, no matter the color, has exactly $1$ black neighbor. In particular, this is true for black vertices. So black vertices always come together in connected pairs, and all the neighbors of such pairs are white. Further, these white neighbors already have their “$1$-black-neighbor” condition satisfied, so all their other neighbors are also white. This is exemplified in the following shape, or “tile”:

If we can “tile the hypercube” with these shapes, that is, if we can cover the hypercube graph with these tiles without any pair of tiles overlapping, we will have a locally $1/n$-biased coloring. Why? Well, these tiles exactly satisfy the properties stated in the above paragraph, and every vertex in such a tiling will have exactly one black neighbor.

The proof constructs such tilings by using perfect codes; without going into detail, these are well-known methods of tiling the hypercube with a half-tile: instead of two black vertices and two rings of white vertices, these methods tile the hypercube with just a single black vertex and a single ring of white vertices. These are actually known as “balls” on the hypercube:

By taking two copies of a perfect code and connecting the black vertices, we obtain the needed tiling! This gives us a locally $1/n$-biased coloring, for $n$ that is a power of two (the perfect codes we need only exist in dimension $2^k-1$, so when we take two copies of the hypercube together we get a hypercube of dimension $n = 2^k$).

To obtain a locally $m/n$-biased coloring, we join together $m$ different disjoint $1/n$-biased colorings. By $m$ disjoint colorings, we mean that if a vertex is colored black in one coloring, then it is necessarily colored white in all the others. Finally, using a bit of algebraic manipulation which I will not go into here, we can extend these colorings into hypercubes of dimension $c \cdot 2^k$ for any $c$, which is exactly what our theorem stated we could do.

The theorem shows existence of some locally $p$-biased colorings, but tells us nothing of the number of such colorings. In order to show that the scenery reconstruction problem is impossible, we need to find at least two different such colorings. That’s where our next two theorems come into play:

Theorem 2.1: As a function of $n$, there is at least a super-exponential number of different locally $1/n$-biased colorings on the $n$ dimensional hypercube.

Theorem 2.2: As a function of $n$, there are at least $C \cdot 2^{\sqrt{n}} / \sqrt{n}$ different locally $1/2$-biased colorings on the $n$ dimensional hypercube, where $C$ is some constant.

Theorem 2.1 is proved by counting the number of perfect codes, which were used in proving the previous theorem; we won’t go into it here.

The proof of Theorem 2.2 is more interesting, and relies on an algebraic treatment of our colorings. It goes as follows.

Up until now, we have treated our colorings as assigning the color either black or white to each vertex. We could have equally well assigned the numbers “$+1$” or “$-1$“, and all the theorems and statements would have stayed the same: all we care is that there are two distinct “colors”. But this description has an immediate advantage: we can now easily perform arithmetic operations on the colorings. For example, if both $f_1$ and $f_2$ are colorings, then their product $f = f_1 \cdot f_2$ is also a coloring, where by multiplication we mean that if $f_1$ assigns the color $f_1(v)$ to a vertex $v$, and $f_2$ assigns the color $f_2(v)$, then $f(v) = f_1(v)\cdot f_2(v)$. In words, what this multiplication means is that $f$ is black precisely when $f_1$ and $f_2$ have the same color. This algebraic representation is often easier to work with than the verbal one.

Now, the $n$-dimensional hypercube can be described as follows: it is the set of points with $n$ coordinates, where each coordinate can be either $+1$ or $-1$ (you can easily verify that this fits in with our notions of $2d$ and $3d$ cubes; it works just the same for higher dimensions). Thus a coloring is just a function

$f(x_1, x_2, \ldots, x_n)$

which returns either $1$ or $-1$, and all the $x_i$‘s are also either $1$ or $-1$.

This representation lets us combine two locally $1/2$-biased colorings on small hypercubes together into one locally $1/2$-biased coloring on a larger hypercubes: Suppose that $f_1$ is such a coloring on the $n$ dimensional hypercube, and $f_2$ is such a coloring on the $m$ dimensional hypercube. Then the new coloring

$f(x_1, x_2, \ldots, x_{n+m}) = f_1(x_1,\ldots,x_n) f_2(x_{n+1}, \ldots, x_{n+m})$

is a locally $1/2$-biased coloring on the $n+m$ dimensional hypercube! Our method for counting the number of locally $1/2$-biased colorings then goes as follows: we construct many high dimensional colorings from different lower-dimensional ones. In fact, our “elementary building blocks”, from which we build all other colorings, are basically generalizations of the two colorings of the $4$-dimensional hypercube which we encountered earlier in this post. It can be shown (via Fourier decomposition, a method which we won’t go into here) that taking combinations of different low-dimensional colorings always gives non-isomorphic high-dimensional colorings. In other words, if we use different building blocks, then the resultant colorings will be distinct.

We can then get a bound on the number locally $1/2$-biased colorings by counting in how many distinct ways we can add up low-dimensional colorings into a high-dimensional one. After a bit of arithmetic, we find that there are at least as many colorings as there are solutions to this inequality:

$4a_1 + 8a_2 + \ldots + 4k a_k \leq n,$

where $a_1,\ldots,a_k$ are non-negative integers. The number of solutions to this inequality is at least $C \cdot 2^{\sqrt{n}} / \sqrt{n}$, giving us Theorem 2.2.

This, I think, is quite neat: it’s enough to find just two different colorings in order to show that the scenery reconstruction problem is impossible, but in fact there are many, many more of them. And what we gave was only a lower bound: it’s entirely plausible that there is a more clever way of counting these colorings, which will give an even larger number.

So far, so good. We saw that the scenery reconstruction problem is impossible for even-dimensional hypercubes, and even saw that there are different indistinguishable colorings. What about odd-dimensional cubes?

The answer comes from what we call “locally $p$-balanced” colorings. In these colorings, every vertex, no matter what color it is, has a $p$-fraction of its neighbors colored in its own color, and a $1-p$-fraction of its neighbors colored the opposite color. These colorings also yield identical scenery observations: when our agent tells us the colors she sees, the next color will always be the same as the current one with probability $p$. For example, the following figure shows a locally $2/3$-stable and a locally $1/3$-stable coloring on the $3$-dimensional cube.

Locally balanced colorings are different than locally biased colorings, although the two are related. For example, we saw that there is a strong restriction on the possible $p$ values for locally biased colorings, but it turns out that for every $p$, there is at least one locally balanced coloring. On the other hand, if $p = 1/n$ or $p = (n-1)/n$, then there is exactly one locally balanced coloring, while, as we saw, there are many, many locally biased ones.

The main point, though, is that every locally $1/2$-biased coloring $f$ on the $n$-dimensional hypercube can easily give a locally balanced coloring $g$ on the $n+1$ dimensional hypercube, by:

$g(x_1, \ldots, x_{n+1}) = g(x_1, \ldots, x_n) \cdot x_{n+1}.$

You should verify for yourself that this is indeed a locally balanced coloring, and calculate its $p$ value. This construction gives different locally balanced functions for all odd-dimensional hypercubes of dimension $n \geq 5$, showing that the scenery reconstruction problem is impossible for all $n \geq 4$. And that’s what we promised to show in the beginning of this post.

You should play around with locally biased and locally balanced colorings yourself; they serve as good brain teasers, and indeed we’ll end with a small riddle: here are two locally balanced colorings on the infinite one-dimensional and two-dimensional integer lattices. Can you find one for the three dimensional lattice?

### To Carnegie Hall!

August 4, 2016

Recently a friend and I had a chat about music, and he asked me if I do any composition of my own. Unfortunately I do not. I suppose I could blame it on a lack of improvisation skill, which in turn originates from a desire to play pieces “as they were originally intended”, i.e. sticking to the sheet music, though I guess the true answer also has something to do with a fear of failure of some sort.
In general, classical music sports a rather bold distinction between “performer” and “composer”. The composer is the person who creates the music, the performer is the person(s) who executes the music, and they need not be the same gal (indeed, the composer may write for an instrument she does not even know how to play! or write for an entire orchestra / ensemble / etc). The fact that classical music is “classic” also contributes greatly to the distinction: most of the great composers of yore are dead; the best we mortals can do is echo their masterpieces.
But being a classical performer is no shame. Indeed, some performers have risen to a demigod stature among the population (ok, among a very particular slice of the population, but it is a demigod stature nonetheless). These men and women have brought the art of execution of art to a grandmaster level. They are experts in their field; they tune every staccato and accent to picometer precision. They know each intricacy of each phrase by heart, mind, and finger.

Why am I telling you this? Because while in music both performers and composers are abundant, and both are respectable careers to aspire to, it seems to me that in high level mathematics, it is mostly the “composition” that is lauded. By “mathematical composers”, I mean research mathematicians, who explore the boundaries of the science, try to invent new mathematical structures and understand existing ones, and in general, prove a bunch of theorems, lemmas, corollaries, claims, propositions and remarks.
By “mathematical performers”, I mean those who take the work of the composers, and give the audience such a breathtaking show, that they’ll get a three-time standing ovation, eventually being forced to return to the stage to give an encore in the form of a “Using volume to prove Sperner’s Lemma” proof.

Yeah, I know, there aren’t much of the latter, and I think that we are all the poorer for it. What I envision is a mathematical lecturer virtuoso. Someone who can, through all the jumbled, messed up and interwoven six-part counterpoint of a proof, bring out a clear and lucid melody that will ring and resonate loud truth in the ears of the audience. Someone who can aptly tame the fierce and complex mathematical topics that generation upon generation of graduate students have failed to grasp, and finally bestow knowledge upon the ignorant. Someone who has studied the ancient texts and knows by heart, by mind, and by finger each intricacy of each phrase. Who can tune every theorem and lemma to picometer precision. An orator of great rhetoric, brilliant diction, and perfect handwriting. A lecture-hall veteran, who practices six hours a day and in the rest of her time finds out the best way to build a lecture series on a wide, demanding topic. In short, a full-time, professional, high level mathematics teacher.

Of course, the profession “full time teacher” is not unheard of. Yet, as far as I know, most teachers – i.e. most of those whose profession is to teach, and indeed do hone their presentation technique – are aimed at educating elementary and high schools. The number of such teachers at the academia level is small, if not infinitesimal. They do exist, for sure – at the Technion, as far as I know, at least two mathematics lecturers hold a full time position: Aviv Censor and Aliza Malek. They constantly receive much praise and awards, and their lecture halls are crammed so tightly, people stand in the hallways and peek through open doors just to hear them talk (though alas, I never chanced to study under them; this is in part because most of the courses they teach are aimed at non-mathematicians, and in any case are intended for undergraduates). But such men and women are a rarity.

Why is this? It’s quite understandable that many people would prefer to go into research rather than performance, but even then I would expect to see more performers than we have so far. Two other immediate reasons are: 1) lack of paying customers, lack of demand. 2) low social status when compared to research mathematicians (“Oh, you don’t invent anything of your own?”).
But this isn’t so with music, and *should not* be so with mathematics. I can therefore only hope that I live to see the day, when Carnegie hall is filled to burst with excited concert-goers; and when the lights turn on after an hour and a half of a dazzling performance of “The Nash Embedding Theorem”, there will not be a man or woman left unmoved, their hearts pounding with reborn youth, the math as music to their ears.

### A primitive model for genetic recombination

August 17, 2015

Introduction:
I’m taking a class in general genetics at the Technion, and there we learned about genetic recombination, and in particular, homologous chromosome crossover: a phenomenon where chromosomes exchange sections between themselves during meiosis.
When this happens, some of the members of the population exhibit “recombinant genomes”, which are different than their parent genomes should supposedly generate. Surprisingly, this part of the population never exceeds 50%, even though at first look it seems as if it could.

In this post, we’ll see a model of chromosomal crossover statistics that explains this phenomenon, as well as giving an estimate to the physical distance between genes as a function of population statistics. I’ll assume you know some basic genetic terms, such as “dominant” and “heterozygote”, but I’ll explain about crossovers in general and describe the problem in more detail below. You can skip directly to “The model basics” if you already know about recombination.
The post will be about 50% biology and 50% math.

Biological background:
We’ll work through an example with the commonly used traits of EYE COLOR and WING-SIZE in fruit flies. Both are controlled by single genes found on the X chromosome.
A fly’s eyes can be either red or white, with red being a dominant quality. We’ll mark the dominant gene with R and the recessive gene with r. Thus, if a fly’s genome contains Rr or RR, it will have red eyes; otherwise, if it contains rr, it will have white eyes.
Similarly, a fly’s wings can be either long or short, with long being a dominant quality. We’ll mark the dominant gene with W, and the recessive with w, so long winged flies have Ww or WW, and short winged flies have ww as their genotype.

Suppose we have a heterozygote cis female. In other words, her genome contains both the dominant and the recessive genes (so she has RrWw in her genome), and both of the dominant genes are found on the same homologous chromosome. In other words, her two X chromosomes look like this:

During meiosis, her two homologous chromosomes duplicate and then separate, and we get two types of possible germ cells: RW and rw:

However, it is also possible for crossover to occur: two chromatids are “sliced” at some point, and then the two parts of each are glued to each other.

If this happens during meiosis, the outcome is four possible germ cells: RW, Rw, rW, rw:

Now, what happens when we mate our RrWw female with a white eyed, short winged male? Since these traits are found on the X chromosome, and a male fly only has one of those, he necessarily has the recessive allele, rw. We don’t care about the Y chromosome here.

Upon mating, the male fly will give the offspring either an X or a Y chromosome. Let’s ignore the males at this point, and focus just on the females. Since our male’s genotype is rw, we will get the following combinations: RrWw, rrww, Rrww, rrWw. All of these are phenotypically different, and each represents a different combination of red/white eye and long/short wing. The Rrww and rrWw genotypes are recombinant – they only exist in the population because of recombination.

Suppose now that the chance for recombination between R and W is some number q between 0 and 1. Then if we look at a very large collection of germ cells from the mother, we expect the following distribution:

RW should be $\frac{1}{2}(1-q)$ of the germ cell pool
rw should be $\frac{1}{2}(1-q)$ of the germ cell pool
Rw should be $\frac{1}{2}q$ of the germ cell pool
rW should be $\frac{1}{2}q$ of the germ cell pool

This is because q of the population should be recombinant, and whenever there is recombination we get an equal amount of Rw and rW.
After mating, when looking at the females, we only need to add the father’s recessive genes, and we get:

RrWw should be $\frac{1}{2}(1-q)$ of the population
rrww should be $\frac{1}{2}(1-q)$ of the population
Rrww should be $\frac{1}{2}q$ of the population
rrWw should be $\frac{1}{2}q$ of the population

Thus, Rrww and rrWw comprise $\frac{1}{2}q+\frac{1}{2}q = q$ of the population. This can be measured in real experimental trials, since each of the above genotypes translates into a different observable phenotype.
At this point in our theory, q can be any number between 0 and 1. If q is 0, then there is never any recombination, and the two genotypes RW and rw go hand in hand forever. If q is 1, then recombination always happens.
However, it is an empirical fact that the percentage of recombinant population is never more than 50%! The measured value of q is always less than or equal to 0.5.

There must be some mechanism that prevents recombination from happening too often. We can make appeals as to the utility of this mechanism, and wonder whether it is good or bad to have a small number or a large number of recombinations between genes – but for now, let’s try to think of an underlying model.

Image source: wikipedia

The model basics:
We treat the chromosome as a linear piece of DNA, with a length of “1 chromosome” – in essence, it is a line segment of length 1. The different genes are points on this line, and are therefore assigned a position 0pos1. In reality genes have some finite width on the DNA strands so a more accurate model will treat them as small intervals, but it will be easier to consider them as points.
We’ll assume that the gene that codes for eye color is on the left of the gene that codes for wing size. Denoting the position of the first by x and the second by y, we have this schematic for our chromosome:

The primary element in our model is the crossover event, or a cut. In this event, two homologous chromosomes are cut at a random place, distributed uniformly across its entire length. The chromosomes then swap strands at this position.

There are two options here. If the cut occurs in in interval between x and y, the genes will be swapped, and we have recombination. However, if the cut occurs outside the interval [x,y], then those two genes will not be affected. Since the cut distribution is uniform, the chance to land between the two genes is just y-x, so the probability of recombination is $q = y - x$.

This is a simple operation, and it’s tempting to think that it is the entire deal, but this is not so. In a crossover event, if two genes are far away from each other, meaning, at the opposite sides of the chromosome, then the probability of recombination can be very close to 1: nearly every cut we make will separate them. But we never observe a q above 0.5! There is obviously something more that we are missing here.

Image source: Science magazine

The answer: the above description is true only for a single crossover event – a single cut. However, there is no guarantee that a chromosome will undergo any crossovers at all during meiosis. Further, a chromosome may actually undergo several crossover events, as was experimentally discovered when looking at the recombination relations between a triplet of genes on the same chromosome. But look what happens when there are two crossover events in the same interval [x,y]: the strands are switched twice, and ultimately there is no recombination between the two genes!

We can now convince ourselves: whether or not we see recombination between two genes depends on the parity of the number of crossover events that occurred between them. When looking at the population statistics, what we ultimately see is the average of the parity of crossovers.
As an artificial example, suppose that during meiosis, there is a 50% chance of performing a single cut, and a 50% chance of not performing any cuts at all. In that case, for two far away genes, which are always separated by any cut, there is a 50% chance of getting recombination, and 50% chance of not getting it. In other words, q was reduced from 1 to 0.5. In general, in this case the observed probability of getting recombination is $q = \frac{1}{2}(y-x)$, as half the time we do not get a recombination at all.
Of course, there is no reason to assume that there is a 50% chance of getting no crossover event, and 50% of getting exactly one – the number of crossovers could behave in different ways – but we see that the actual percentage of recombinant population depends on the distribution of the number of crossover events in the chromosome. Which distribution should we choose?

A slightly flawed offer:
A simple choice would be a binomial distribution. The reasoning goes as follows: during meiosis, there are all sorts of enzymes floating about the chromosomes, which are responsible for cutting them up and gluing them back together. There may be a large number n of these enzymes floating about, but they only have a certain probability p of actually performing their duty. Of course, we assume that they act independently, even though in reality they may interfere with each other. So the number of crossovers depends on the numbers of “successes”, where a success is an enzyme doing its work properly, which happens with probability p. This means that the number of cuts distributes according to $C \sim Bin(n,p)$.

So assuming the number of crossover events distributes according to $C \sim Bin(n,p)$, what is the probability of getting an odd number of crossovers? Let’s take a moment to calculate it.

For any $n$, denote that probability by $P_n$. Suppose you already checked $n-1$ of the enzymes. Then with probability $P_{n-1}$, you already have an odd number of crossovers, so you don’t need any more of them. Further, with probability $1-P_{n-1}$, you have an even number, and you want another crossover to get an odd number. So the probability obeys the recurrence relation

$P_n = P_{n-1}(1-p)+(1-P_{n-1})p.$

with the initial condition that $P_0=0$, as if there are zero enzymes there are zero crossovers, which is an even number.
More nicely:

$P_n = P_{n-1}(1-2p)+p$

$P_0 = 0.$

If we look at just this equation:

$P_n = P_{n-1}(1-2p)$

we quickly see that the answer is $P_n= a \cdot (1-2p)^n$. However, we also have that additive +p in our original equation. It turns out we only need a small adjustment in order to compensate it though, and in this case we just have to add an extra constant, so that

$P_n = a \cdot (1-2p)^n + c.$

Since the equation is linear, this is actually very much like the particular solution of a differential equation, and we can find c directly by putting it into $P_n$ in the recurrence relation:

$c = c (1-2p) + p,$

which gives

$c = \frac{1}{2}.$

Taking into consideration the initial condition, the solution is then,

$P_n = \frac{1}{2} - \frac{1}{2}(1-2p)^n$

Wonderful! For very large n, the probability of getting an odd number of crossovers goes to 0.5! Even for relatively low probabilities p, the quantity $(1-2p)^n$ goes to 0 very quickly.

This gives an answer regarding two genes which are very far away: they are affected by every cut performed by the enzymes, and so their recombination probability is exactly the same as the probability for getting an odd number of cuts. But what about genes which are closer? For them we actually have to take into consideration the fact that not every cut the enzymes make will cause a crossover.
Notice the following: the number of cuts in every chromosome is distributed binomially, $C \sim Bin(n,p)$. If we already know the number of cuts to perform – say, k – then the number of cuts which affect the two genes at positions x and y is also distributed binomially as $Bin(k,y-x)$, since every cut has a probability of y-x of crossing the two genes. So the number of crossovers G between y and x, conditioned that $C = k$, is $Bin(k,y-x)$, and k itself distributes as $B(n,p)$.
Now comes the cool part: there is a theorem about binomial distributions which says the following: if X is a random variable that distributes binomially, $X \sim Bin(n,p)$, and Y is a random variable that conditioned on X distributes binomially, $Y|X = Bin(X,q)$, then Y is also binomial, $Y \sim Bin(n, pq)$! Using this theorem, the number of cuts S which swap between x and y goes as $S \sim Bin(n, p \cdot (y-x))$.
Now we can apply the same reasoning as before, only this time, a “success event” is not merely when the enzymes perform a crossover anywhere on the chromosome, but rather when they perform it in some place between x and y.
The final probability of getting recombination between two genes is then

$q = \frac{1}{2} - \frac{1}{2}(1-2p(y-x))^n$

This is very nice, and it gives us some asymptotics as well. For large values of $p(y-x)$, the second factor is negligible, and we have $q =\frac{1}{2}$. For small values of $p(y-x)$, the second factor can be expanded to first order, and the two $\frac{1}{2}$’s will cancel each other out, giving us $q \propto (y-x)$.

Slightly improving the binomial:
Overall, the model proves adequate in its predictions, and its simplicity is alluring. However, it is not without problems. For example, its two parameters – p and n – must somehow be found out, and it is not entirely clear how to do so. In fact, the very fact that we have a fixed n here seems out of place: by keeping it around, we assume that there is a constant number of enzymes working about, when it is much more reasonable that number varies from cell to cell. After all, when manufacturing hundreds or thousands of enzymes, there must be variation in the numbers.

Luckily, there is a simple way to fix this, which is actually firmly based on reality. Instead of assuming that the number of cuts the enzymes make is distributed binomially, we assume it follows a Poisson distribution, $C \sim Pois(\lambda)$, for a yet unknown $\lambda$. This actually makes a lot of sense when we remember that Poisson distributions are used in in real life to describe queues and manufacturing processes, when what we know is the average time it takes to perform a single event.
If the number of overall cuts has a Poisson distribution, how does the number of crossovers between x and y behave? Well, given that the number of cuts is k, the number of crossovers is still as before, $Bin(k, y-x)$. But again the theorems of probability smile upon us, and there is a theorem stating that if $C \sim Pois(\lambda)$ and conditioned on $C = k$ we have $S|C \sim Bin(C,y-x)$, then

$S \sim Pois(\lambda(y-x)).$

So the distribution of crossovers between x and y will also follow a Poisson distribution!
Now we only have to remember the simple trick, that

$Pois(\lambda)= \lim_{n \rightarrow \infty} Bin(n,\frac{\lambda}{n}).$

Thus, under the assumption of a Poisson distribution, the final probability of getting recombination between two genes is

$q = \frac{1}{2} - \lim_{n \rightarrow \infty} \frac{1}{2}(1-\frac{2 \lambda(y-x)}{n})^{\frac{1}{n}},$

or, more simply,

$q = \frac{1}{2} - \frac{1}{2}(1-e^{-2 \lambda (y-x)}).$

This again has the same desirable properties as before, but the model is simpler: we got rid of the annoying n parameter, and the probability parameter p was replaced by the rate parameter $\lambda$.
(Note: For small values of (y-x), the probability for recombination is $q=\lambda(y-x)$; if only we could set $\lambda = 1$ and get a direct relationship between q and the distance between genes…)

To conclude:
The percentage of recombinant phenotypes in a population of offspring is always smaller than 50%. This is not an arbitrary number, but stems from the underlying biological mechanism of recombination. Because multiple crossover events can occur between two genes, what’s ultimately important is the parity of the number of such events. When the total number of crossovers in a chromosome follows a Poisson distribution, the parity can be readily computed. It behaves like a fair coin toss for genes which are physically far away from each other, but is linear in the distance between the genes when they are close to each other.
This “Poisson crossover model” is very simple, and of course does not explain all there is about recombination (genes are not points on a line; distribution is probably not Poisson; events are not independent; there are “recombination hotspots” in chromosomes; the chromosome is a messy tangle, not all of which is accessible; etc). But it looks like a good starting point, and to me seems adequate at explaining the basic behaviour of recombination.

### Programming complex transformations

December 2, 2014

Facebook should change their “It’s complicated” status to “It’s complex”. That way, real people with imaginary girlfriends could spread the news to all their friends!

– Ancient proverb

We were learning about conformal maps in complex function theory. While we did plenty of “circles are sent to circles” and “connected components are sent to connected components”, it’s almost obvious that we barely got to see any actual map in action.
I remembered that I saw some very nice geometrical transforms in the wonderful Youtube series “Dimensions” by Leys, Ghys and Alvarez (link here), and decided to write a small program of my own that, given a picture, can apply any (reasonable) complex function to it, treating it as a transformation on pictures.

What does this mean? We can think of a picture as an array of pixels, each with its own $x$ and $y$ values. Treating the center of the picture as $(0,0)$, each pixel has a different coordinate. The pair $(x,y)$ in $\mathbb{R}^2$ can be treated as $z = x + iy$ in the complex plane. We can then take this complex number and put it in any complex function $f(z)$; the result is some other complex number, $w = a + ib$. We then interpret this new number as the coordinates of a new pixel; so in the new picture, the color at position $(a,b) = f(x+iy)$ will be the same as the color at position $(x,y)$ in the original picture.

If $f$ is a funky enough function, the results should be awesome, and this lets you understand a little better what all sorts of analytic functions do (analytic functions preserve angles between the two pictures, so however twisted things get, we’ll always have some sanity).

Here, look at what happened to our poor tiger (original image courtesy of Wikipedia commons):

This…

Turns to this…

(confession: ok, this mapping isn’t a standard conformal map; read more to see what’s actually happening here).

I’ll now describe some of the problems and solutions I ran into; if you just want to see more pretty pictures, feel free to do just that.

The naive way to get a mapping is to do just what was described in the above explanation: take the original picture, and for each pixel $(x,y)$, plot it at $f(x,y)$. Unfortunately this has several problems. These stem mainly from discretization: the coordinates of the pixels come in integer units, and two nearby pixels will always differ by at least 1 in one of their coordinates. Proper scaling can make this discretization as small as we wish, so effectively we can have any two nearby pixels differ by $\frac{1}{n}$ for any $n$ of our choice, but problems can still occur.

The first problem is that even if your function is onto, meaning that it’s possible to designate the color of every pixel in the new picture, the result may still have gaping holes or “isolated pixels”. This severity of this problem depends on the function itself (I guess it generally depends on whether the absolute value of its derivative is close to 1 or not), and for some choices of $f$, your end product might only be partially filled.
In this example, generated naively by $f = \sqrt{z}$, the top of the picture is falling to pieces (also, it’s evident that the edges of the picture are getting left out, though this is because our source picture is finite, not due to discretization).

A partial solution is to “fill in the blanks” by averaging over neighboring pixels: each blank pixel will take the average color its nearest neighbors. Assuming that the picture is not “too broken up”, this can work just fine – if holes are completely surrounded, you won’t really notice the difference (and it almost makes sense theoretically to do this, in terms of the middle-value theorem). Indeed, running this fix helps the picture quite a lot, although there are still untreated areas which cannot be helped:

The second problem is that some transformations can span over enormous scales. For example, with the transformation $z \rightarrow \frac{-1}{z}$, the interior of the unit disc exchanges places with the exterior. This means that when going over the pixels, ones very close to the origin are going to get sent way off to the edge of the new image, while ones far away are all going to be sent near the origin.

The result is that while the new image is very very large, most of the “interesting” things (aka – most of the actual original picture) is contained in a very, very small region. Look at this example of $z \rightarrow \frac{-1}{z}$:

Not very interesting, is it?

You can adjust for this phenomenon if you know where to cut off your picture, ignoring the endpoints that are way off. But how can you know in advance the size of your image, or which points are good and which are not? This generally requires analyzing your function beforehand, which we do not want to do.

As a (very) partial fix, I noticed that most of these image points are isolated – the discretization means that two remote points will probably not be directly near each other. I wrote some small cleanup code that finds these points and eliminates them, rescaling the picture appropriately. Of course, there will always be isolated points; in fact, due to the discrete nature, every point in the new picture is isolated, so in effect we have to specify how close two points have to be to each other to be considered neighbors or not; this is done according to the resolution of the target image.

In any case, the code I wrote works iteratively. Here are the first three iterations:

You can see that as we iterate, the pictures get better (and focus more on what’s important, the actual head of the tiger, instead of empty space). However, I would still consider this to be rather inadequate (although we do get a “particle-erosion” effect for free, which is cool).

A third problem with this method is that f doesn’t have to be a one-to-one function, meaning that there may be two possible colors for a given pixel in the generated picture. How do we decide which one to take? Do we combine? Do we override? This is a general problem not due to discretization, and I just ignored it here (the code overrides new pixels). Here is $f(z)=z^2$:

For our last image with this method, here is $\log(\text{tiger})$.

The naive method is riddled with problems and artifacts. However, there is a way that generally treats discretization better, and while it also has some black empty space, and some pixelated areas, it doesn’t tend to have small annoying holes (for nice functions, anyway).

The solution is this: instead of going over all the pixels $x,y$ in the original picture and computing the coordinates of the pixel in the new picture, $w = f(x+iy)$, we go over all the pixels $(a,b)$ in the new picture, and calculate the inverse $x+iy=f^{-1}(a+ib)$. If we do not exceed the size of the original picture, we basically ensure that we will not have holes in the result, because we actually go over each pixel and calculate its color, instead of having it “get picked by accident”, as was in the original method.

This method eliminates all the sporadic and isolated holes and points we had using the naive way. Here is $f(z) = \frac{-1}{z}$ using the inverse method:

Much better!
Of course, there are still a few problems:

1. The edges are pixelated – this is because they all draw from basically the same region – the immediate center of the tiger’s face – and a small region that is smeared over a large area is pixelated.
2. There is still a gaping hole in the center. In order to fill it up, we would need an even larger original tiger image – the inverse of these points is out of bounds. In fact, for this specific function, $\frac{1}{z}$, we always have some finite dark area at the center, since we do not work with infinite pictures. There are ways to overcome this, but not with simple finite image transformations.

One drawback of this method is that we need to directly specify the inverse of $f$. This may be simple, in case of simple functions like $\sqrt{z}$ or $\frac{1}{z}$, but in general it may not always be easy.
Further, there may be artifacts arising from this methods. For example, suppose we want to see the map $f(z) = \sqrt{z}$. In order to use the inverse method, we would have to calculate $z^2$ for all the points in the original image.
But the equation $w^2 = z$  has two solutions for $w$: both $\sqrt{z}$ and $- \sqrt{z}$! When using the inverse method while giving $z^2$ as an inverse function, we get a different image from the direct method:

And indeed, notice – the tiger has been replicated! This can never happen with the direct method, which by definition maps each point only once.

Now you see how I cheated you a bit with the first picture in this post – it cannot be the result of a conformal map, since there are clearly multiple instances of the tiger! In fact, it was created by calculating the inverse of $\frac{1}{z-100(1+i)} + \frac{2.5}{z+100(1+i)}$.

That’s it for now; happy mapping!

### They should have sent a complexity theorist

October 30, 2014

My O(1) readers are probably restlessly wondering where I’ve been, how I survived Israel’s freakishly sweaty summer, and what’s up in general.
Well, the truth is, I did manage to sweat under the Mediterranean sun, but most of the summer I spent in the United States. The official reason, given to the consulate and on my visa documents, was to do a project on “jump-starting a recent protocol on infinite randomness, using quantum-mechanical experiments”. The visa was approved, but I doubt that the people who stamped my passport understood what it was all about; hell, even I didn’t really know what I was going to do. I therefore dedicate this post to the men and women of the not-so-lubricated bureaucratic machinery that made my trip possible, in hopes that it will teach them all they ever wanted to know about unbounded randomness expansion using untrusted quantum devices, but were too afraid to ask. (Further dedication goes to Scott, who kindly took me under his wing and oversaw the project).
(The following post isn’t really technical, but feel free to skip over parts you don’t understand; there’s zounds more text that doesn’t require any fancy quantum mechanical stuff).

Randomness is important in life. For example, the optimal strategy in rock-paper-scissors is to pick each option with probability 1/3rd. This might seem easy, but it isn’t: humans are quite bad at imitating random sequences or won’t do so even if they know it’s best for them (best for them in theory; but then, what else is there?). It’d be much better if we had an infinite sequence of random bits that we could use whenever we wanted to. How do we go about getting such a sequence?
Ask any physicist, and she’ll tell you, “why it’s easy! Use quantum mechanics!” And indeed, quantum mechanics seems to be the place in nature where randomness comes not from a lack of prior information for us humans (i.e, a coin flip is random because we don’t know its precise position or the precise flipping force or the precise air pressure and currents), but is inherent in the physical reality itself. For most part, most reasonable “hidden variable” theories – theories in which the randomness observed in experiments stems from quantities that *are* deterministic, but we just don’t know them – have been ruled out.
So, the easiest way to get random bits using quantum mechanics is to take a quantum state that is in superposition – say a photon with both horizontal polarization (represented as the qubit $|0 \rangle$) and vertical polarization (represented as the qubit $|1 \rangle$) – and just measure which one it is. Thus, the overall state of the photon is the superposition $\frac{1}{\sqrt{2}}(|0 \rangle +|1 \rangle)$, and measuring its polarization yields $|0 \rangle$ with probability 0.5, and $|1 \rangle$ with probability 0.5.
So far so good. In an ideal world, we would be done here. We’d build a nice small black box with a large red button on top. Every time we press it, the box would create a superposed photon, measure it, and output the result. Infinite bits at the press of a button.
But alas, we do not live in an ideal world, and most of us, while avid rock-paper-scissors players, do not have the necessary equipment or training to build quantum mechanical boxes. Of course, in this capitalistic global entrepreneurship enterprise world we live in, this isn’t much of a problem – we can always count on the market adjusting to the needs of the people, and companies selling quantum mechanical random number generators will sprout up like mushrooms after the rain. Hey, they already have.
The problem with these companies is that you can never quite be sure that they are honest. How do you know that they aren’t selling you only a pseudorandom number generator, which uses a deterministic algorithm and a small random seed? There are statistical tests you can run on the output, but we don’t know yet if it’s possible to discern between a pseudorandom output or a truly random output in reasonable time. If they are cheating you in this way, then your entire “random” sequence is vulnerable.
Further, even if the company you bought your box from did give you truly random bits, how do you know that they were created on the spot? Perhaps the company generated a gigantic random string back in their HQ, and just put it in your box. Every time you press the big red button, you get a new bit out of that string. The output is indeed random, but it wouldn’t be secure – the company could sell information about your random bits to the highest bidder, and you would face a gratuitous defeat in the rock-paper-scissors nationals.
These two problems apply to any true random number generators, but if you are using quantum ones there is yet another issue: even if the company did generate the bits on the fly, they could still get information on your bits by quantum entanglement. In a jiffy, instead of creating a single photon in the state $\frac{1}{\sqrt{2}}(|0 \rangle +|1 \rangle)$, they’d create two photons in the entangled state $\frac{1}{\sqrt{2}}(|00 \rangle +|11 \rangle)$: a superposition of “both photons have horizontal polarization” and “both photons have vertical polarization”. One photon they’d put in your box, the other they’d keep for themselves back in their HQ. The rules of quantum mechanics then say that when you press the big red button and the box measures the state – say you got a $|1 \rangle$ – then when the company measures their photon, they also get a $|1 \rangle$. They always get what you got, and again, your information is not secure. This does not involve any communication between the box and the company – it would work even if you put it in an underground vault in Andromeda – its just a peculiar property of quantum states.
So right now things are looking pretty grim: there’s this wonderful startup idea – to produce random bits using quantum mechanical phenomena – but buyers don’t have a guarantee that the companies aren’t cheating them. And we all know where that leads to.
But not all is lost! For if we are allowed to tweak the way our quantum mechanical boxes operate, we can build a statistical test that has nothing to do with ordinary randomness / pseudorandomness tests, and that test guarantees honesty. Boxes which pass the test must produce random bits; they produce these bits on the fly; and they can only have a tiny amount of entanglement with the company HQ, giving the company almost no information about your random sequence. A magical cure for all our maladies!
To see how it works, we’ll look at the famous CHSH game. In this game, Alice and Bob are two players who play cooperatively in the following very realistic scenario: they are both put into separate rooms, and each one is presented with a bit: Alice is shown X, and Bob is shown Y. Based on that bit, they have to output a bit themselves: Alice outputs A, and Bob outputs B. They win the game if

$A \oplus B = X \wedge Y.$

They are allowed to decide on a strategy beforehand, but once they are put into the rooms, they cannot communicate.
Suppose that X and Y are chosen uniformly at random, that is, they each have a 0.5 probability of being 0, and 0.5 probability of being 1. What is the optimal strategy for Alice and Bob – the one which gives them the highest chances of winning?
Here is one strategy: Alice and Bob ignore the inputs X and Y completely, and always both output 0. So $A \oplus B = 0$, and this is equal to $X \wedge Y$ for 75% of the cases: whenever either X or Y is 0. The success rate for this strategy is 0.75.
It can be shown that there is no better deterministic strategy (you can do the truth table, if you want). But then there is also no better probabilistic strategy, since it would just be a convex combination of deterministic ones. So the best Alice and Bob can do, if they are not allowed to communicate, is to win 0.75 of the time.
Well, classically, that is true, but it is not true if they are allowed to share quantum resources. Specifically, if they each have one photon of the entangled state $\frac{1}{\sqrt{2}}(|00 \rangle +|11 \rangle)$, then once they are shown the bits, they can measure their photons from a set of agreed upon non-trivial measurement, and output whatever their measurements give. Their outputs will be completely random (individually), but correlated with each other. If they choose the right measurements, they can boost up their win rate to 0.84 (!). This is the best, known strategy that does not involve communication between the players. (For a billion details, see here)
But wait, there’s more! The CHSH game is robust, in the sense that if Alice and Bob have a success rate very close to 0.84, then with high probability they are using a strategy that is not very different than the known optimal one. This means that the bits they output are very close to random (what does it mean “very different strategy”? There is a formal definition which we won’t go into here, but as an example, the strategy “0.01 of the time output 0 and 0.99 of the time use the optimal strategy” is very close to optimal; so is “measure in a slightly different way than the optimal, so the correlations are changed just a bit”).
We now have a test for our random number generator box. Instead of having one big red button which measures a $\frac{1}{\sqrt{2}}(|0 \rangle +|1 \rangle)$ photon, we’ll ask the manufacturer to give us two boxes. These boxes will act as Alice and Bob: they will accept as input two random bits, and output bits of their own. We can play the CHSH game many many times, and measure their success rate: if its very close to 0.84, then they cannot have coordinated their outputs in advance; they cannot have used a deterministic algorithm; and the company cannot have a lot of information about the output (ok, we haven’t shown this here, but in a jiffy: in order to have information about it, the company needs to be entangled with the two boxes; but almost all the entanglement is “used up” in order to boost the win rate from 0.75 to 0.84, so “nothing is left” for the company to know).
This is what is commonly called (well, common among the relevant community) as “certified randomness”- the CHSH game can be used to test the randomness of our devices (in fact, there is a broad class of “XOR games” that can be used – games which are similar to the CHSH, but may involve different requirements or more players).
We would really like to say that we are done here, but the keen eyed among you must have already noticed the bug in the plan. We have a pair of boxes that, when given two random bits, output two correlated bits. We need random numbers just to test if a random number generator works. What’s worse, we put in two, and get back less than two. We are actually consuming bits in the process! Alas, the market for quantum random number generators is much more blooming than the one for quantum random number extinguishers.
But not all is lost! If we are allowed to tweak the inputs to the pair of boxes, we can create a test that uses less random bits than it puts out. The main concept is as follows: we still let the boxes play a lot of CHSH games, only now, instead of having totally random X and Y (requiring 2 random bits per game), we alter the input a bit: Most of the time, we’ll have X = 0 and Y = 0. This is like a “dud” game, and if the boxes anticipate this, they can potentially output 0,0, as described before. However, for a very small percentage of randomly selected inputs, X and Y are selected at random, as usual; these are called “real” games. On these games, if the boxes are to perform with high win rate, they have to be honest – they have to actually play the winning CHSH strategy. The point is that the real games are chosen at random, and the boxes have no idea which ones they are. If they play assuming the X = 0, Y =0 dud games, they run the risk of falling on real games and winning with only 0.75 probability. The trick of these tests, then, is to find out how to distribute the real games among the duds, how strict to be when deciding if the boxes pass the tests, etc. This type of test is called a randomness expansion protocol, in that it takes requires a certain number of bits (for choosing which games are duds and which are real, and also for the inputs of the real game), but outputs more than was used. Both polynomial and exponential expansions have been developed, and more recently, even infinite expansion! The latter is generally done by back-feeding the output as input for the boxes, but the details are a bit complicated, especially the whole “proving that it works against entangled opponents” thing. It means that you can start with a finite string of randomness (say, one you obtained from a trusted source), and expand it into one as long as you wish! There will be errors, but they grow exponentially smaller the more initial bits you use.
Personally, I think this whole thing is really cool. If you trust your quantum box, then generating an infinite random string is as easy as $|1 \rangle$ $|2 \rangle$ $|3 \rangle$. But even if you don’t, you can still obtain an infinite random string. It requires a bit more pyrotechnics, and it requires you to somehow obtain a small number of random bits elsewhere, but it’s possible. And actually, despite the fact that we called our boxes quantum, they don’t necessarily have to be. All they have to do is win the CHSH game with probability close to 0.84. Quantum mechanics isn’t the final say in physics; maybe we’ll find better theories which supersede it. Any mechanism which wins better than the classical 0.75 can be used in this type of protocol.
And that’s pretty much the gist of “a recent protocol on infinite randomness, using quantum-mechanical experiments”: a method to use untrusted quantum boxes in order to take a small number of random bits, and turn it into an unbounded number of (nearly) random bits. That’s it.
Where does your humble servant come into play in this whole ordeal? A very tiny part, and that’s the “jump starting” in the clause “jump starting a recent protocol on infinite randomness”. An initial random string is needed in order to run the protocol, and the question is: how large is that string, as a function of the errors you are willing to tolerate? (there are plenty of places where errors accumulate, but I skipped the discussion of errors and robustness because it really only complicates matters, and it’s complicated enough as it is).
So that’s what I set to find out. I basically gave bounds for various expressions described in the protocols which relate to the number of random bits outputted. The answer? Well it depends on the error, of course. But, let’s say, the bound I got is on the order of O(1,000,000) for reasonable error. For the rest, you’ll have to read my paper, I guess.

### Microwaves and birthdays

September 10, 2014

Without doubt, one of the largest differences between the USA and Israel is the microwave ovens. Whereas almost all microwaves I encountered in Israel had either analog rotary timers or preset “30 sec or 1 min” buttons, here in the states there is an overwhelming prevalence (100% of an astounding three cases) of numpad microwaves.

Seemingly, all you have to do is put in the number of minutes / seconds you want to heat, and fin, you are done.
But wait; is it minutes, or seconds? What happens if I put in a three or four digit number? Do I have to be an arithmetic expert to operate my microwave?
In what can only be stated as the “non-continuity of microwave space-time”, the input parsing is simple: if you put in xx:yy, it will run for xx minutes, and yy seconds. Simple and intuitive. The thing is, nothing constrains yy to be smaller than 60. 1:99 is as valid input as any, and will indeed run for 1 minute and 99 seconds (=159 seconds total). 2:00 is also a valid input, running for 2 minutes, 0 seconds (=120 seconds total).
This is the natural way to handle user input, and I totally approve of it, if only for the programmers’ and designers’ sake for not handling annoying details. There is a nice time discontinuity when you plot the actual cooking time against the numbers you punch in, if you arrange them in lexicographical order:

Starting from 60 seconds cook time, the user has two choices of how she wants the input to be shaped, rather than just the feeble one available with a rotary timer. This is in agreement with the USA’s enhanced economic and political freedom; it is no wonder that these microwaves are more prevalent here (as for me, you can find me standing dumbstruck in front of the machines, trying to decide which number I should punch in).

As the title of the above plot suggests, it is interesting to see how different minute lengths affect our options. The shorter the minute, the more overlap there will be, and the more options you will have, until finally, for SECONDS_PER_MINUTE = 1, we have 100(!) different options of input. Here is the example for a 30 second minute:

On the other hand, given that we work in base 10 and that our two digit numbers only go up to 99, if we had a longer minute (and kept the input method the same, allowing only two “seconds” digits), we would have gaps:

Not every desired time can be reached; we will likely not be seeing any 200 second minutes in the Imperial system any time soon.

This whole ordeal reminded me of a wonderful fact I stumbled upon that has to do with discretizing age. Consider the standard high school algebra question: Albert is X years old, and his son Jorgenhausfer is Y years old. When will Albert be twice as old as his son?
The question is easy, but one can also ask for how long Albert is twice as old as his son. It turns out that Albert will be twice as old as Jorgenhausfer for exactly one year, but that time period may be split into two sections, depending on their birthdays! I can do no better justice to the issue than the discussion given here:
http://www.math.brown.edu/~banchoff/twiceasold/