## Posts Tagged ‘Statistics’

### The Johnson-Lindenstrauss Lemma

May 18, 2017

Today is a good day I think to talk about the Johnson-Lindenstrauss dimension reduction lemma!

Johnson and Lindenstrauss. It cannot be coincidental that they are both wearing the same type of glasses.
Left image from Johnson’s website, right image from Gil Kalai’s blogpost in memory of Lindenstrauss.

The setting is as follows. Suppose you have a large number $n$ of points, living in the $d$-dimensional Euclidean space $\mathbb{R}^{d}$. We’ll call these points $x_1,x_2,\ldots,x_n$, and think of both the number of points $n$ and the number of dimensions $d$ as being very large, much larger than the measly “3” of our easy-to-grasp physical world.

Now, you may wonder, “But we do live in a three dimensional world thank you very much, so where would I obtain such high dimensional data?” Well, these situations arise increasingly often in computer science applications, where a point in $d$-dimensional space may represent an abstraction of some actual real-world thing.

For example, imagine this not-too-farfetched biology experiment. A biologist wants to understand how a particular gene works in a bacterium, and she does this using the following procedure: first, she makes small changes in the DNA sections that code up the protein. Then she replaces the original gene in the bacterium by the new modified one. Finally, she measures the concentration of about 1000 different proteins in the bacterium. If the original gene was important, or if it was a key ingredient in a large protein pathway, then many of these concentrations will be different under the modified gene when compared to the unchanged gene.

Our biologist repeats this experiment many many times, each time changing a different part of the gene in a different way. Ultimately, she has performed say about 500 experiments. Each experiment results in a vector of 1000 numbers, one for each protein. She now has 500 points living in a 1000-dimensional space*. Our poor biologist must now analyze these points, and from them understand the way in which the original gene affects the proteins and operates within the cell. She will run many statistical and geometrical algorithms on these data, and these algorithms will usually have the downside that they really, really slow down as a function of dimension. What can she do?

* Actually, since a set of $n$ points always spans a space of dimension no larger than $n$, these points only live in a 500-dimensional space. But that’s beside the point here, as we will see next.

Ok, back to mathematics. The Johnson-Lindenstrauss lemma roughly says the following. Suppose you have these $n$ points living in $\mathbb{R}^{d}$. You can actually put them into a very small space – logarithmic in $n$ – in such a way that the distances between the points are almost preserved: If two points were originally far away from each other, they’ll remain far away from each other, and if two points were close, they’ll remain close. More formally,

Lemma: Let $0 < \varepsilon <1$ and let $x_1,x_2,\ldots,x_n$ be $n$ points in $\mathbb{R}^{d}$. Then there is a function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}$ such that for every two points $x_i$ and $x_j$,

$(1-\varepsilon)||x_i - x_j||^2 \leq ||f(x_i)-f(x_j)||^2 \leq (1+\varepsilon)||x_i - x_j||^2$

with $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3).$

This lemma is good if you want to run geometric algorithms on your data. For example, in our biologist’s case, she might decide that if two modifications to the gene resulted in protein densities that are more or less the same (that is, two points in $\mathbb{R}^d$ that are close to each other), then the modifications changed the gene in functionally the same manner. She might then decide to look for clusters of points which are close to each other. This is usually easier to do in smaller dimensional spaces. Applying a Johnson-Lindenstrauss transformation allows our biologist to run the algorithms faster and on more data, while (hopefully) not losing much accuracy.

Of course, this is useful only if the the new dimension $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3)$ is much smaller than the original dimension $d$. This often happens in applications; for example, in a future post, we will work with $n$ points in $n$ dimensions, and the transition from a space of dimension $n$ to space of dimension $\log n$ will be quite substantial.
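To make this concrete, here is a small numerical sketch in Python with NumPy. A scaled Gaussian random matrix (a standard stand-in for the random-subspace projection discussed below) is applied to random points, and we check the pairwise squared-distance ratios:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

n, d, eps = 200, 10_000, 0.5
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))  # k from the lemma

X = rng.normal(size=(n, d))               # n points in R^d
P = rng.normal(size=(d, k)) / np.sqrt(k)  # scaled random Gaussian map

Y = X @ P                                 # the reduced points in R^k

# ratio of squared distances after/before, over a sample of pairs
ratios = [np.sum((Y[i] - Y[j])**2) / np.sum((X[i] - X[j])**2)
          for i, j in combinations(range(20), 2)]

print(min(ratios), max(ratios))  # all within [1 - eps, 1 + eps]
```

Even with a fairly crude $\varepsilon$ of 0.5, the reduction from 10,000 to a couple hundred dimensions keeps every sampled distance ratio inside the promised band.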

The proof we will present here follows a paper by Dasgupta and Gupta, which uses random projections. We’ll first illustrate this concept in two dimensions.

Suppose you have a set of points in the plane:

A one-dimensional subspace of the plane is just a line that goes through the origin. Once we have such a line, we can perform an orthogonal projection of the original points onto that line, giving us a new set of points, all on the line:

If we choose a different line, we get a different projection:

For some lines, the projection will totally wreck the distances between the original points; for example, in this image, we sent two different original points to the same point:

However, for other lines, it may be that the projection is able to preserve the ratio of distances fairly accurately.

The proof of the Johnson-Lindenstrauss lemma uses the same idea, only in higher dimensions. Instead of points in the plane, we have our original $d$-dimensional space; and instead of projecting to a line, we try to find a subspace of dimension $k$ so that projecting onto it will almost preserve distances. [Actually, we project the points and then blow them up by a constant factor, since orthogonal projections can only decrease the length of vectors; but this doesn’t change the gist of the argument. From now on when we say “distance preserving”, we’ll actually mean “up to a constant factor”. See the actual calculations for exact treatment].

A priori, there are no grounds to think that such a subspace exists. However, it turns out that it indeed does, for a reason called “the concentration of measure in high dimensions”. Roughly, it says that in high dimensions, some random processes – such as the orthogonal projection of a point onto a random subspace, or the scalar product of a unit vector with a Gaussian vector – are heavily concentrated around their expected value. The probability for such processes to be even a tiny bit away from their mean is exponentially small in the dimension. We usually don’t see these phenomena in 3d, because the lowly and pathetic exponent of 3 isn’t enough to give a profound effect, but in higher dimensions they flourish.

After all of this verbosity, here is the proof outline: for any $k$, we can choose a random $k$ dimensional subspace of $\mathbb{R}^d$ and project our points onto it. Choose two particular points $x_i$ and $x_j$ from our original cast in $\mathbb{R}^d$. It may either be that the distance between them is preserved up to a factor of $1 \pm \varepsilon$, or it may not be preserved; denote by $p$ the probability that it is not preserved. We can consider this “distance distortion” event for all of the ${n \choose 2} = (n^2 - n)/2$ pairs of points, and by the union bound, the probability for at least one pair of points to have their distance distorted is less than $p \cdot (n^2 - n)/2$. If we can show that the probability $p$ is smaller than $2 / (n^2 -n)$, then the probability that at least one distance is distorted is smaller than 1. This means there is a non-zero probability that all the distances are preserved! But then there has to exist a distance-preserving projection; otherwise it would be impossible for this probability to be greater than 0. And as it turns out, we can indeed make the probability that a single distance is not preserved this small, by choosing $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3)$.

Actually, this proof not only shows that there exists such a transformation, it even tells you how to efficiently get one: the probability that a random projection gives a distance-preserving dimension reduction is actually quite high (and can be made very close to one if we just increase the number of dimensions by a bit). So you just need to take your points, project them randomly, and check that the distances are preserved; if not repeat, and sooner rather than later you are bound to get something that works.
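The “project, check, repeat” recipe can be sketched as follows. This is a minimal sketch, assuming a Gaussian random matrix as the random map; the helper name `jl_reduce` is my own:

```python
import numpy as np

def jl_reduce(points, eps, rng=None):
    """Project points to k dimensions at random; retry until every pairwise
    distance is preserved to within a 1 +/- eps factor. (A sketch: a scaled
    Gaussian matrix stands in for a projection onto a random subspace.)"""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = points.shape
    k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))

    def sq_dists(A):
        return np.sum((A[:, None, :] - A[None, :, :])**2, axis=-1)

    orig = sq_dists(points)
    mask = orig > 0                      # skip the zero diagonal
    while True:
        reduced = points @ (rng.normal(size=(d, k)) / np.sqrt(k))
        ratio = sq_dists(reduced)[mask] / orig[mask]
        if ratio.min() >= 1 - eps and ratio.max() <= 1 + eps:
            return reduced
```

In practice a single draw almost always succeeds, so the loop rarely repeats at all.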

Ok, let’s take a look at the technical details. All we have to show is that with high probability, the length of a vector is nearly preserved when we project it to a random $k$ dimensional subspace. Now, when performing the calculations, instead of taking a fixed vector and a random subspace, we can take a fixed subspace and a random vector; all the results will be the same, since all we care about when projecting is the relative orientation between the vector and the subspace.

So let’s take a random unit vector, and see what happens to its length. A uniformly random unit vector can be obtained by taking a random distribution which is spherically symmetric and normalizing it. Specifically, let $X_1,\ldots,X_d$ be $d$ independent Gaussian random variables with mean 0 and variance 1; in other words, the random vector $X = (X_1,\ldots, X_d)$ distributes as a $d$ dimensional Gaussian vector $\mathcal{N}(0,\text{Id}_{d})$. Then the vector

$Y = \frac{X}{||X||}$

is a uniformly random unit vector. As for our fixed subspace, we’ll just take the space spanned by the first $k$ coordinates, and denote the projection by $Z = \frac{(X_1,\ldots ,X_k)}{||X||}$. What is the expected value of the square of the length of $Z$? Well, we know that the squared length of $Y$ is 1; this is by design: we normalized it. We thus have:

$||Y||^2 = 1 = \sum_{i = 1}^{d}\frac{X_i^2}{||X||^2}.$

If this is true for $||Y||^2$, then it is also true for its expectation:

$\mathbb{E}||Y||^2 = 1 = \sum_{i = 1}^{d}\mathbb{E}\frac{X_i^2}{||X||^2}.$

Notice that the sum is symmetric in the $X_i$‘s: in expectation, there is no reason for one $X_i$ to behave differently from the other. Thus each term should contribute equally to the sum, and so for every $i$,

$\mathbb{E}\frac{X_i^2}{||X||^2} = 1/d.$

Since $Z$ is just the first $k$ terms of $Y$, we get

$\mathbb{E}||Z||^2 = \sum_{i=1}^{k}\mathbb{E}\frac{{X_i}^2}{||X||^2} = k/d.$

Oops, the expected value is $k/d$, which is smaller than 1, the original norm of the vector! But not to worry, this can easily be fixed: after we project to a random subspace of dimension $k$, we just blow up all the points by a factor of $\sqrt{d/k}$. That way, when we look at the norm squared, the expectation will be $d/k$ times larger, and the expectation will indeed be 1. This multiplication will not affect any of our probabilistic statements, since if $||Z||^2$ is concentrated around its mean, so is $\frac {d}{k}||Z||^2$.
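The $k/d$ expectation, and the effect of the $\sqrt{d/k}$ blow-up, can be checked by a quick simulation (arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, trials = 500, 50, 20_000

X = rng.normal(size=(trials, d))   # i.i.d. Gaussian vectors, one per row
Z_sq = np.sum(X[:, :k]**2, axis=1) / np.sum(X**2, axis=1)  # ||Z||^2

print(Z_sq.mean())            # close to k/d = 0.1
print((d / k) * Z_sq.mean())  # close to 1 after the blow-up
```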

Now to fulfil our promise and show that the squared length of $Z$ is heavily concentrated around its mean. The calculations will be a bit lengthy, but do not despair (also, if you got up to here and understood everything, you are pretty much good-to-go for most of your Johnson-Lindenstrauss-ing needs).

We’ll show that for any positive real number $\beta$ smaller than 1, the probability for $||Z||^2$ to be smaller than $\beta$ times its mean of $k/d$ is exponentially small in the dimension and in $\beta$; we’ll then take $\beta = 1 - \varepsilon$ to show that the probability of a large downward deformation is small. The same calculations can also be repeated for checking when $||Z||^2$ is greater than $\beta$ times its mean for $\beta > 1$, and then taking $\beta = 1 + \varepsilon$.

A standard method for showing such concentration of measure is using Markov’s inequality, which states that if $W$ is a positive random variable, then for every $a > 0$,

$\text{Pr}[W \geq a] \leq \frac{\mathbb{E}W}{a}.$

Intuitively, all this inequality says is that if the expectation is small, then it’s impossible to have too large a probability of getting too large a value. Markov’s inequality is innocuous enough, but it becomes powerful when we combine it with exponential moments, that is, when we apply it to $e^{tW}$ instead of to $W$ itself. Let’s see how that works.

When written explicitly, $||Z||^2$ becomes

$||Z||^2 = \frac{X_1^2 + \ldots + X_k^2} {X_1^2 + \ldots + X_d^2}.$

So the probability $\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}]$ can be written as

$= \text{Pr}[d(X_1^2 + \ldots + X_k^2) \leq k \beta(X_1^2 + \ldots + X_d^2) ]$

$= \text{Pr}[k \beta(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2) \geq 0]$

$= \text{Pr}[e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))} \geq 1].$

Here $t$ was some number greater than 0, and we will optimize over it soon in order to get the best possible bounds. Now we invoke Markov’s inequality with $a = 1$, getting

$\leq \mathbb{E}e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))}.$

The various $X_i$‘s are independent, so the expectation of the product is the product of expectations:

$\mathbb{E}e^{X_1 + X_2} = \mathbb{E}[e^{X_1}e^{X_2}] = \mathbb{E}[e^{X_1}]\mathbb{E}[e^{X_2}],$

and after grouping together the $X_i$ in our exponent, our probability is bounded by

$= \mathbb{E}[e^{t k \beta X_1^{2}}]^{d-k} \mathbb{E}[e^{t(k \beta - d)X_1^{2}}]^{k}$

$= (1-2tk \beta)^{-(d-k)/2}(1-2t(k\beta-d))^{-k/2}.$

This last equality we won’t go into, but it stems from knowledge about the moment generating function of Gaussians; in short, all that needs to be proven is that $\mathbb{E}[e^{sX^2}] = 1/\sqrt{1-2s}$ for $s<1/2$.

All that is left is to minimize this last expression. This we also won't do here, because it mainly involves annoying calculus which you can let WolframAlpha do for you. This finally gives:

$\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \beta^{k/2}(1 + \frac{k(1-\beta)}{d-k})^{(d-k)/2}.$

The expression on the right hand side is equal to $e$ to the power of its logarithm:

$\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2} \log \beta + \frac{d-k}{2} \log(1 + \frac{k(1-\beta)}{d-k})).$

Finally, since $\log(1+x) \leq x$ for all $x \geq 0$, we can replace the second $\log$ factor by $\frac{k(1-\beta)}{d-k}$, giving

$\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2}(1-\beta + \log\beta)).$
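As a quick sanity check on this bound, we can estimate the left-hand probability by simulating $||Z||^2$ directly, using the Gaussian construction from above (arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, beta, trials = 200, 20, 0.5, 100_000

# ||Z||^2 for many random unit vectors, with the subspace held fixed
X = rng.normal(size=(trials, d))
Z_sq = np.sum(X[:, :k]**2, axis=1) / np.sum(X**2, axis=1)

empirical = np.mean(Z_sq <= beta * k / d)
bound = np.exp((k / 2) * (1 - beta + np.log(beta)))

print(empirical, bound)  # the empirical tail probability lies below the bound
```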

Now we just need to clean up. Put $\beta = 1-\varepsilon$, and get

$\leq \exp(\frac{k}{2}(1-(1-\varepsilon) + \log(1-\varepsilon))).$

By more calculus it can be shown that $\log(1-x) \leq -x - x^2/2$ for $0\leq x <1$, and at last we have

$\leq \exp(-\frac{k \varepsilon^2}{4}).$

And indeed, if we choose $k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3)$, this bound will be smaller than $1/n^2$, just as we wanted (actually, for this case we could choose just $k = 8 \log n / \varepsilon^2$, but to deal with the second case of $\beta = 1+\varepsilon$ we will also need the $\varepsilon^3/3$ term). A similar calculation for $\beta = 1+\varepsilon$ will give another factor of $1/n^2$, and in total the probability of having a large distortion will be just $2/n^2$ per pair of points.
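A small helper (the name `jl_dimension` is my own) computes this $k$ and confirms the per-pair failure bound against $1/n^2$:

```python
import numpy as np

def jl_dimension(n, eps):
    """Target dimension k = 4 log n / (eps^2/2 - eps^3/3), rounded up.
    (jl_dimension is a hypothetical helper name, not a library function.)"""
    return int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))

n, eps = 100_000, 0.1
k = jl_dimension(n, eps)
print(k)                                    # grows like log n, not like d
print(np.exp(-k * eps**2 / 4) <= 1 / n**2)  # the per-pair failure bound holds
```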

### Death and the Abacus

August 24, 2015

There’s a nicely morbid Perry Bible Fellowship comic showing a kid on his birthday:

That’s what you get with strict determinism: the day we die is already decided on the day we are born; on the day the universe was born, in fact. But since we ordinary mortals do not have access to Death’s all powerful computing machine that lets him calculate the end of all tales (an abacus?!), we have to use statistics. Specifically, we like to use life expectancy, which is basically the average age of death of a certain population at a certain time.
Calculating it is rather easy for times far away in the past. What was the life expectancy for infants in 1900? Just look at all the people who were born in 1900, and take the average of their life spans. For modern times it’s a bit more complicated, since there are still so many yucky living people who spoil your statistics, but we get good results under reasonable assumptions.

Here’s a cool thing about life: if you haven’t died so far (and, as you are reading this, I assume you haven’t), you will statistically outlive the average baby born on the same year as you. In other words, the older you are, the older you die. This is pretty obvious, but it’s a nice pat on the back as it shows your accomplishments in the field of “not dying”. Are you 30? Congratulations! You should no longer be afraid of dying from chicken pox, drowning in a bucket, baby measles and child-malnutrition; and your chances of going to war are severely reduced. So you have to take all those deaths out of the equation, and the net result is that your projected age goes up.

But there is a contrasting force in this whole ordeal: the older you are, the less time you have to live – because you’ve already lived some portion of your life. The life expectancy of an infant born in the United States in 2013 is about 78 years. She has 78 years ahead of her. The life expectancy of a 90 year old in the United States in 2013 is 94. She will die older (she has already passed the 78 year mark), but only has 4 years ahead of her.
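Both quantities, the expected age at death and the years left ahead, come straight out of a mortality table. Here is a toy sketch with entirely made-up numbers (not real US data):

```python
# Toy mortality table, made-up numbers (not real US data):
# deaths[a] = number of people in a cohort who died at age a.
deaths = {5: 200, 40: 100, 70: 400, 90: 300}

def expected_age_at_death(current_age):
    """Average age at death among those still alive at current_age."""
    alive = {a: c for a, c in deaths.items() if a >= current_age}
    return sum(a * c for a, c in alive.items()) / sum(alive.values())

print(expected_age_at_death(0))        # a newborn's expectancy: 60.0
print(expected_age_at_death(41))       # higher, early deaths are excluded
print(expected_age_at_death(41) - 41)  # but fewer years left ahead
```

In this toy cohort a 41-year-old will die at about 78.6 on average, older than the newborn's 60, yet has only about 37.6 years left against the newborn's 60: exactly the two contrasting forces described above.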

This leads to the question: what is “the best age to be alive”, in the sense that “your whole life ahead of you” is the longest period of time? This depends crucially on the mortality statistics: we can imagine that in a world where most children don’t make it to age 10, people who are 30 will have more to look forward to than five year olds.

In fact, this is what happened in Massachusetts during the 1850’s (data found here):

In orange, we see how many years a person has left to live as a function of age in Massachusetts in 1850. We see that a baby just born has roughly 40 years ahead of her, while a child who made it to age 10 has about 47 years ahead of her! In the 1850’s, during childhood, every year you grow older actually increases your life expectancy! In view of the comic at the beginning of the post, for these children, every time they celebrate their birthday, Death should add a bead to their life total (statistically speaking, of course; deterministically we are all doomed anyway). Newborn babies and 20-year olds can expect to live the same amount of time.
The yellow line plots the function y = x. The place where it crosses the orange line is the “half-life” point – the age where you have put as much time behind you as you have in front of you.
Finally, in blue, we see the average age of death. It continuously rises as a function of the age of the person in question. Notice that it is always above the yellow line y = x: This just means that when you look at a specific living person, that person will die older than she is right now. Unfortunately the data I have only goes up to 80 years, but eventually the blue line and the yellow line should coincide, at the age of the oldest person who ever lived.

But that was way back a hundred and fifty years ago. With the advancement of modern medicine, food, and infrastructure, are things any different now? Indeed they are. Here is the United States data for 2013 (data found here):

First, life expectancy went up; that’s a no-brainer. But what’s interesting is that now the orange line no longer has a peak in the beginning. Further, the blue curve stays almost flat for the first ~40 years or so of life: The difference in the expected age of someone who is 40 and someone who was just born is a mere 4 years; contrast this to the whopping 29 years in 1850.
We have so completely decimated child and youth mortality that it no longer “pays off” to grow older, in terms of gains in life expectancy. Looking at the data shows that this phenomenon – the lack of an orange peak – started in the 1920’s, at least in the United States.
So the Death and the Abacus comic is indeed relevant – but only for the modern era. In the past, getting to 10 would have warranted a real celebration – for that is the age where “you’ve got your whole life ahead of you” carries the most weight. But today? The countdown starts with your first gasping scream for air.

### A primitive model for genetic recombination

August 17, 2015

Introduction:
I’m taking a class in general genetics at the Technion, and there we learned about genetic recombination, and in particular, homologous chromosome crossover: a phenomenon where chromosomes exchange sections between themselves during meiosis.
When this happens, some of the members of the population exhibit “recombinant genomes”, which are different from those their parents’ genomes would supposedly generate. Surprisingly, this part of the population never exceeds 50%, even though at first glance it seems as if it could.

In this post, we’ll see a model of chromosomal crossover statistics that explains this phenomenon, as well as giving an estimate of the physical distance between genes as a function of population statistics. I’ll assume you know some basic genetic terms, such as “dominant” and “heterozygote”, but I’ll explain about crossovers in general and describe the problem in more detail below. You can skip directly to “The model basics” if you already know about recombination.
The post will be about 50% biology and 50% math.

Biological background:
We’ll work through an example with the commonly used traits of EYE COLOR and WING-SIZE in fruit flies. Both are controlled by single genes found on the X chromosome.
A fly’s eyes can be either red or white, with red being a dominant quality. We’ll mark the dominant gene with R and the recessive gene with r. Thus, if a fly’s genome contains Rr or RR, it will have red eyes; otherwise, if it contains rr, it will have white eyes.
Similarly, a fly’s wings can be either long or short, with long being a dominant quality. We’ll mark the dominant gene with W, and the recessive with w, so long winged flies have Ww or WW, and short winged flies have ww as their genotype.

Suppose we have a heterozygote cis female. In other words, her genome contains both the dominant and the recessive genes (so she has RrWw in her genome), and both of the dominant genes are found on the same homologous chromosome. In other words, her two X chromosomes look like this:

During meiosis, her two homologous chromosomes duplicate and then separate, and we get two types of possible germ cells: RW and rw:

However, it is also possible for crossover to occur: two chromatids are “sliced” at some point, and then the two parts of each are glued to each other.

If this happens during meiosis, the outcome is four possible germ cells: RW, Rw, rW, rw:

Now, what happens when we mate our RrWw female with a white eyed, short winged male? Since these traits are found on the X chromosome, and a male fly only has one of those, he necessarily has the recessive allele, rw. We don’t care about the Y chromosome here.

Upon mating, the male fly will give the offspring either an X or a Y chromosome. Let’s ignore the males at this point, and focus just on the females. Since our male’s genotype is rw, we will get the following combinations: RrWw, rrww, Rrww, rrWw. All of these are phenotypically different, and each represents a different combination of red/white eye and long/short wing. The Rrww and rrWw genotypes are recombinant – they only exist in the population because of recombination.

Suppose now that the chance for recombination between R and W is some number q between 0 and 1. Then if we look at a very large collection of germ cells from the mother, we expect the following distribution:

RW should be $\frac{1}{2}(1-q)$ of the germ cell pool
rw should be $\frac{1}{2}(1-q)$ of the germ cell pool
Rw should be $\frac{1}{2}q$ of the germ cell pool
rW should be $\frac{1}{2}q$ of the germ cell pool

This is because q of the population should be recombinant, and whenever there is recombination we get an equal amount of Rw and rW.
After mating, when looking at the females, we only need to add the father’s recessive genes, and we get:

RrWw should be $\frac{1}{2}(1-q)$ of the population
rrww should be $\frac{1}{2}(1-q)$ of the population
Rrww should be $\frac{1}{2}q$ of the population
rrWw should be $\frac{1}{2}q$ of the population

Thus, Rrww and rrWw comprise $\frac{1}{2}q+\frac{1}{2}q = q$ of the population. This can be measured in real experimental trials, since each of the above genotypes translates into a different observable phenotype.
At this point in our theory, q can be any number between 0 and 1. If q is 0, then there is never any recombination, and the two genotypes RW and rw go hand in hand forever. If q is 1, then recombination always happens.
However, it is an empirical fact that the percentage of recombinant population is never more than 50%! The measured value of q is always less than or equal to 0.5.
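Estimating $q$ from an experiment is just a matter of counting phenotypes. A toy numerical example (the counts here are made up):

```python
# Hypothetical phenotype counts among female offspring (made-up numbers):
counts = {"RrWw": 420, "rrww": 430, "Rrww": 70, "rrWw": 80}

recombinant = counts["Rrww"] + counts["rrWw"]  # the recombinant genotypes
q = recombinant / sum(counts.values())
print(q)  # 0.15, the measured recombination frequency (always <= 0.5)
```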

There must be some mechanism that prevents recombination from happening too often. We can make appeals as to the utility of this mechanism, and wonder whether it is good or bad to have a small number or a large number of recombinations between genes – but for now, let’s try to think of an underlying model.

Image source: wikipedia

The model basics:
We treat the chromosome as a linear piece of DNA, with a length of “1 chromosome” – in essence, it is a line segment of length 1. The different genes are points on this line, and are therefore assigned a position $0 \leq \text{pos} \leq 1$. In reality genes have some finite width on the DNA strands so a more accurate model will treat them as small intervals, but it will be easier to consider them as points.
We’ll assume that the gene that codes for eye color is on the left of the gene that codes for wing size. Denoting the position of the first by x and the second by y, we have this schematic for our chromosome:

The primary element in our model is the crossover event, or a cut. In this event, two homologous chromosomes are cut at a random place, distributed uniformly along the chromosome’s entire length. The chromosomes then swap strands at this position.

There are two options here. If the cut occurs in the interval between x and y, the genes will be swapped, and we have recombination. However, if the cut occurs outside the interval [x,y], then those two genes will not be affected. Since the cut distribution is uniform, the chance to land between the two genes is just y-x, so the probability of recombination is $q = y - x$.

This is a simple operation, and it’s tempting to think that it is the entire deal, but this is not so. In a crossover event, if two genes are far away from each other, meaning at opposite ends of the chromosome, then the probability of recombination can be very close to 1: nearly every cut we make will separate them. But we never observe a q above 0.5! There is obviously something more that we are missing here.

Image source: Science magazine

The answer: the above description is true only for a single crossover event – a single cut. However, there is no guarantee that a chromosome will undergo any crossovers at all during meiosis. Further, a chromosome may actually undergo several crossover events, as was experimentally discovered when looking at the recombination relations between a triplet of genes on the same chromosome. But look what happens when there are two crossover events in the same interval [x,y]: the strands are switched twice, and ultimately there is no recombination between the two genes!

We can now convince ourselves: whether or not we see recombination between two genes depends on the parity of the number of crossover events that occurred between them. When looking at the population statistics, what we ultimately see is the average of the parity of crossovers.
As an artificial example, suppose that during meiosis, there is a 50% chance of performing a single cut, and a 50% chance of not performing any cuts at all. In that case, for two far away genes, which are always separated by any cut, there is a 50% chance of getting recombination, and 50% chance of not getting it. In other words, q was reduced from 1 to 0.5. In general, in this case the observed probability of getting recombination is $q = \frac{1}{2}(y-x)$, as half the time we do not get a recombination at all.
Of course, there is no reason to assume that there is a 50% chance of getting no crossover event, and 50% of getting exactly one – the number of crossovers could behave in different ways – but we see that the actual percentage of recombinant population depends on the distribution of the number of crossover events in the chromosome. Which distribution should we choose?

A slightly flawed offer:
A simple choice would be a binomial distribution. The reasoning goes as follows: during meiosis, there are all sorts of enzymes floating about the chromosomes, which are responsible for cutting them up and gluing them back together. There may be a large number n of these enzymes floating about, but they only have a certain probability p of actually performing their duty. Of course, we assume that they act independently, even though in reality they may interfere with each other. So the number of crossovers depends on the numbers of “successes”, where a success is an enzyme doing its work properly, which happens with probability p. This means that the number of cuts distributes according to $C \sim Bin(n,p)$.

So assuming the number of crossover events distributes according to $C \sim Bin(n,p)$, what is the probability of getting an odd number of crossovers? Let’s take a moment to calculate it.

For any $n$, denote that probability by $P_n$. Suppose you already checked $n-1$ of the enzymes. Then with probability $P_{n-1}$, you already have an odd number of crossovers, so you don’t need any more of them. Further, with probability $1-P_{n-1}$, you have an even number, and you want another crossover to get an odd number. So the probability obeys the recurrence relation

$P_n = P_{n-1}(1-p)+(1-P_{n-1})p.$

with the initial condition that $P_0=0$, as if there are zero enzymes there are zero crossovers, which is an even number.
More nicely:

$P_n = P_{n-1}(1-2p)+p$

$P_0 = 0.$

If we look at just this equation:

$P_n = P_{n-1}(1-2p)$

we quickly see that the answer is $P_n= a \cdot (1-2p)^n$. However, we also have that additive +p in our original equation. It turns out we only need a small adjustment in order to compensate for it, though, and in this case we just have to add an extra constant, so that

$P_n = a \cdot (1-2p)^n + c.$

Since the equation is linear, this is actually very much like the particular solution of a differential equation, and we can find c directly by substituting it for $P_n$ in the recurrence relation:

$c = c (1-2p) + p,$

which gives

$c = \frac{1}{2}.$

Taking into consideration the initial condition, the solution is then,

$P_n = \frac{1}{2} - \frac{1}{2}(1-2p)^n$

Wonderful! For very large n, the probability of getting an odd number of crossovers goes to 0.5! Even for relatively low probabilities p, the quantity $(1-2p)^n$ goes to 0 very quickly.
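The closed form can be sanity-checked against the recurrence directly (the function names are my own):

```python
def parity_prob_recursive(n, p):
    """P_n computed from the recurrence P_n = P_{n-1}(1 - 2p) + p, P_0 = 0."""
    P = 0.0
    for _ in range(n):
        P = P * (1 - 2 * p) + p
    return P

def parity_prob_closed(n, p):
    """The closed form P_n = 1/2 - (1/2)(1 - 2p)^n."""
    return 0.5 - 0.5 * (1 - 2 * p)**n

for n, p in [(0, 0.1), (1, 0.1), (5, 0.3), (50, 0.02)]:
    assert abs(parity_prob_recursive(n, p) - parity_prob_closed(n, p)) < 1e-12
print("closed form agrees with the recurrence")
```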

This gives an answer regarding two genes which are very far away: they are affected by every cut performed by the enzymes, and so their recombination probability is exactly the same as the probability for getting an odd number of cuts. But what about genes which are closer? For them we actually have to take into consideration the fact that not every cut the enzymes make will cause a crossover.
Notice the following: the number of cuts in every chromosome is distributed binomially, $C \sim Bin(n,p)$. If we already know the number of cuts to perform – say, k – then the number of cuts which affect the two genes at positions x and y is also distributed binomially as $Bin(k,y-x)$, since every cut has a probability of y-x of crossing the two genes. So the number of crossovers G between x and y, conditioned on $C = k$, is $Bin(k,y-x)$, and k itself distributes as $Bin(n,p)$.
Now comes the cool part: there is a theorem about binomial distributions which says the following: if X is a random variable that distributes binomially, $X \sim Bin(n,p)$, and Y is a random variable that conditioned on X distributes binomially, $Y|X \sim Bin(X,q)$, then Y is also binomial, $Y \sim Bin(n, pq)$! Using this theorem, the number of cuts S which swap between x and y goes as $S \sim Bin(n, p \cdot (y-x))$.
Now we can apply the same reasoning as before, only this time, a “success event” is not merely when the enzymes perform a crossover anywhere on the chromosome, but rather when they perform it in some place between x and y.
The final probability of getting recombination between two genes is then

$q = \frac{1}{2} - \frac{1}{2}(1-2p(y-x))^n$

This is very nice, and it gives us some asymptotics as well. For large values of $np(y-x)$, the second term is negligible, and we have $q =\frac{1}{2}$. For small values, the second term can be expanded to first order, the two $\frac{1}{2}$’s cancel each other out, and we are left with $q \approx np(y-x) \propto (y-x)$.

Slightly improving the binomial:
Overall, the model proves adequate in its predictions, and its simplicity is alluring. However, it is not without problems. For example, its two parameters – p and n – must somehow be estimated, and it is not entirely clear how to do so. Moreover, a fixed n seems out of place: keeping it around assumes there is a constant number of enzymes at work, when it is much more reasonable that this number varies from cell to cell. After all, when a cell manufactures hundreds or thousands of enzymes, there is bound to be variation in their numbers.

Luckily, there is a simple way to fix this, which is actually firmly based in reality. Instead of assuming that the number of cuts the enzymes make is distributed binomially, we assume it follows a Poisson distribution, $C \sim Pois(\lambda)$, for a yet unknown $\lambda$. This makes a lot of sense when we remember that Poisson distributions are used in real life to describe queues and manufacturing processes, where what we know is the average rate at which single events occur.
If the overall number of cuts has a Poisson distribution, how does the number of crossovers between x and y behave? Well, given that the number of cuts is k, the number of crossovers is still $Bin(k, y-x)$, as before. But again the theorems of probability smile upon us, and there is a theorem stating that if $C \sim Pois(\lambda)$ and, conditioned on $C = k$, we have $S \sim Bin(k,y-x)$, then

$S \sim Pois(\lambda(y-x)).$

So the distribution of crossovers between x and y will also follow a Poisson distribution!
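The same kind of numerical check works for the Poisson case. The sketch below (names are mine; the Poisson sampler uses Knuth's multiplication method) confirms that the mean of the thinned variable comes out as $\lambda(y-x)$:

```python
import math
import random

def poisson_sample(lam, rng=random):
    """Sample from Pois(lam) via Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def thinned_poisson_mean(lam, frac, trials=200_000):
    """C ~ Pois(lam); S | C ~ Bin(C, frac). Return the sample mean of S."""
    total = 0
    for _ in range(trials):
        c = poisson_sample(lam)
        total += sum(random.random() < frac for _ in range(c))
    return total / trials

lam, frac = 4.0, 0.25  # e.g. 4 cuts on average, genes a quarter-length apart
mean_s = thinned_poisson_mean(lam, frac)
print(mean_s, lam * frac)  # both close to 1.0
```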
Now we only have to remember the simple trick, that

$Pois(\lambda)= \lim_{n \rightarrow \infty} Bin(n,\frac{\lambda}{n}).$

Thus, under the assumption of a Poisson distribution, the final probability of getting recombination between two genes is

$q = \frac{1}{2} - \lim_{n \rightarrow \infty} \frac{1}{2}\left(1-\frac{2 \lambda(y-x)}{n}\right)^{n},$

or, more simply,

$q = \frac{1}{2} - \frac{1}{2}e^{-2 \lambda (y-x)}.$

This again has the same desirable properties as before, but the model is simpler: we got rid of the annoying n parameter, and the probability parameter p was replaced by the rate parameter $\lambda$.
(Note: For small values of (y-x), the probability for recombination is $q=\lambda(y-x)$; if only we could set $\lambda = 1$ and get a direct relationship between q and the distance between genes…)
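For concreteness, the final formula can be wrapped in a tiny function (naming is mine), which makes the two regimes easy to see: linear for small distances, saturating at one half for large ones:

```python
import math

def recombination_prob(distance, lam=1.0):
    """Probability of recombination between two genes separated by
    `distance`, under the Poisson crossover model with rate lam."""
    return 0.5 - 0.5 * math.exp(-2 * lam * distance)

for d in (0.01, 0.1, 0.5, 2.0):
    print(d, round(recombination_prob(d), 4))
```

With $\lambda = 1$, the small-distance regime indeed gives a value of q approximately equal to the distance itself, as noted above.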

To conclude:
The percentage of recombinant phenotypes in a population of offspring is always smaller than 50%. This is not an arbitrary number, but stems from the underlying biological mechanism of recombination. Because multiple crossover events can occur between two genes, what’s ultimately important is the parity of the number of such events. When the total number of crossovers in a chromosome follows a Poisson distribution, the parity can be readily computed. It behaves like a fair coin toss for genes which are physically far away from each other, but is linear in the distance between the genes when they are close to each other.
This “Poisson crossover model” is very simple, and of course does not explain all there is about recombination (genes are not points on a line; distribution is probably not Poisson; events are not independent; there are “recombination hotspots” in chromosomes; the chromosome is a messy tangle, not all of which is accessible; etc). But it looks like a good starting point, and to me seems adequate at explaining the basic behaviour of recombination.

### Bash Distribution

October 29, 2011

Bash.org, the “Quote Database”, hosts a collection of user-submitted quotes from IRC. Users can submit excerpts of chat sessions from the IRC channels they participate in, and if approved, the quotes are uploaded to the site for everyone to see. Viewers can then browse the site and rank the quotes, adding either +1 or -1 to their score. Some of the popular ones have a score of over 30,000. Of course, in the days of YouTube and Facebook, this concept isn’t new to us, but the site has been around since 1999, when I could only use either the telephone or the Internet, and the modem made sure that everyone in the house knew you were connected.
The site is fun to browse, but anyone statistically inclined ought also to be interested in the distribution of quote rankings. How many high-ranking quotes are there? How many have negative scores? More generally, can we find a function that describes how many quotes have a certain score x? The first thing that comes to mind, of course, is the traditional Gaussian distribution, which appears in many places in nature. But we must remember that scores are not the sum of many independent random variables: it’s not as if, for every quote, each viewer votes +1 or -1 at random. There are higher-quality quotes and lower-quality ones, and we expect this to be reflected somehow in the distribution. In fact, history and nature have taught us that such distributions often come in the form of a power law, meaning that the number of quotes having a score of x is some function:

$f(x) = \alpha x^{c}$

Well, you can’t know for sure until you find out. So, faithful to my tradition, I wrote a small python script which goes over all the quotes on Bash.org, and saves their rating. There are 22,083 quotes currently stashed in their DB, and here is their distribution:

Notice the very long tail of the positive ratings, and the relatively shorter one of the negative ratings. Also note that the curve is asymmetric, and falls off much more slowly to the right than to the left. The majority of the quotes are positively ranked. Indeed, this does not look like any Gaussian. Here again is the distribution, but with the scores divided into bins of size 10, to allow a smoother line. In red is the Gaussian with the same mean and standard deviation as the quote distribution.

The quote score is indeed a power curve – two power curves, actually: one for the left ascending side, and one for the descending side.
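As an aside, the exponent of such a power curve is usually estimated by ordinary least squares on the log-log data, since $\log f = \log \alpha + c \log x$. Here is a minimal sketch on synthetic data (the actual quote data is not reproduced here):

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y against log x; returns (alpha, c)
    such that y ~ alpha * x**c."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    c = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) \
        / sum((a - mx) ** 2 for a in lx)
    alpha = math.exp(my - c * mx)
    return alpha, c

xs = list(range(10, 200, 10))
ys = [3.0 * x ** -1.7 for x in xs]  # synthetic data with a known exponent
alpha, c = fit_power_law(xs, ys)
print(round(alpha, 3), round(c, 3))  # -> 3.0 -1.7
```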
We could have stopped our analysis here and have been happy about it, but two weeks ago I read Mark Buchanan’s “The Social Atom”, which talks about how complex social phenomena can be explained by simple atomistic human behavior. For example, the distribution of wealth across a market obeys a power law, and can be generated by a simple money-trading model. The book itself is a rather light read and is very interesting; I recommend it to anyone interested in economics, sociology, physics, or computer modeling.
Inspired by the book, I created a simple model for ranking the quotes. In our model, there are people, and there are bash quotes. Each quote has a quality value, uniformly distributed, and each person has a “standard”, also uniformly distributed. A person likes a quote if its quality is above his standard. Thus, there are some people with high standards who will like only a small number of quotes, and some people with lower standards who will be entertained by almost anything thrown at them. If a person dislikes a quote, he will lower its score by 1. If a person likes a quote, he will give it a +1 rating, and also share it with his friends. Thus, low-quality quotes remain unheard of for most of the population, while high-quality ones spread throughout it.
Initially, these quotes are all on the web, and each person has a certain chance of “stumbling upon” a certain quote: running across it on the Internet in a search, reaching it through some external site, or just visiting Bash.org. The simulation is then as follows: for each quote, see who stumbles upon it, and let them tell their friends, until it stops spreading through the network.
This is obviously a very simple model, and there are a lot of factors which aren’t taken into consideration. For example, there are users who run across bash quotes and spread them to their friends, but do not rank them. There are also people who rank them, but do not spread them. And of course, the way we decide whether a person likes a quote or not seems quite arbitrary.
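The actual simulation code is not reproduced here, but a minimal sketch of such a model could look as follows (all parameter values and the random friendship wiring are illustrative assumptions, not necessarily those used for the run shown):

```python
import random
from collections import deque

def simulate_scores(n_people=2000, n_quotes=500, n_friends=3,
                    stumble_prob=0.01, seed=0):
    """Toy sharing model: each person has a uniform 'standard', each
    quote a uniform 'quality'; likers vote +1 and share with friends,
    dislikers vote -1 and stay silent."""
    rng = random.Random(seed)
    standards = [rng.random() for _ in range(n_people)]
    # random friendship lists (illustrative wiring, not a real social graph)
    friends = [rng.sample(range(n_people), n_friends) for _ in range(n_people)]
    scores = []
    for _ in range(n_quotes):
        quality = rng.random()
        # each person has a small chance of stumbling upon the quote
        seen = {i for i in range(n_people) if rng.random() < stumble_prob}
        queue = deque(seen)
        score = 0
        while queue:
            person = queue.popleft()
            if quality > standards[person]:
                score += 1
                for f in friends[person]:
                    if f not in seen:
                        seen.add(f)
                        queue.append(f)
            else:
                score -= 1
        scores.append(score)
    return scores

scores = simulate_scores()
print(min(scores), max(scores))  # low-quality quotes sink, high-quality ones soar
```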
Nonetheless, the model, when run with the correct parameters, shows some features of the real distribution which we saw above. Here are the results from a run, where each agent has three friends with whom he shares the quotes. Again, we divide into bins of 10 for a smooth line.

The entire power curve is shifted to the left into the negative region, and there are no signs whatsoever of the long negative tail. However, despite these setbacks, we can be quite happy that we managed to capture one of the main features of the original distribution: the power-law fall of the high ranking quotes, and the long tail. I’m quite sure that not many modifications are needed in order to include the left side of the graph into the model – some feature which allows negatively ranked quotes to propagate even though the users have no reason to annoy their friends with low quality content.
Using this model, we can try to discover other phenomena which are not present in the original Bash site. For example, what happens if the site is “heavily moderated”, and initially only one person discovers each quote? In that case, he alone decides whether the quote enters the network at all, and the shape of the graph then depends greatly on how low or high his standard is.
Further, the results described above were achieved by setting the number of friends with whom each person shares quotes to a relatively low number – three or four. This may fit the Bash.org model, since it is an old and rather niche site. However, popularity contests on sites like YouTube should look vastly different, since their content is often shared on social networks such as Facebook, and the “number of friends” parameter can skyrocket to several hundred. This parameter has a significant influence on the shape of the curve.
Overall, we saw how we can explain a popularity or rating distribution using a model based on a social network of people who share content with each other. There are still many things we can do with such a model, and the specific one I gave certainly does not fit every site on the net that ranks its items. The beauty is that once we have built a model that fits our initial data, we can test it with other parameters in order to make predictions, or to say how the world would have behaved if things were a little different.

### Random Chess Conquests

October 13, 2010

In our last statistical analysis of chess, we looked at the relative frequencies with which the different squares of the chessboard are stepped on, by any piece. The games analyzed were played by masters and grandmasters, who exhibit remarkable intuition and control of the game. I speculate that the distribution amongst amateur players and beginners will show noticeable differences, but unfortunately databases of such games are scarcer than those of the famous and beloved players. In that respect, our quest to see whether we can improve ourselves as chess players and better understand the game using only pretty color diagrams will have to wait a while.
However, not all is lost, and there is another nice comparison which can be made using the frequency maps: looking at randomly played games. We all know that chess is a game of skill, intuition, cleverness, and knowledge. Hence, playing moves at random does not seem like a very good idea. Sure, occasionally a good move will be made by chance, but most of the time the result will be the brutal destruction of your pieces; even new players, when they are at a loss and do not know what to do, try to stick to some formation, expand, or protect their pieces. Thus, if a player monitors his games and notices that his distribution map bears too many striking similarities to those of random games, perhaps he should worry that he is doing something wrong.
Of course, playing a random game does not mean that the frequency distribution will be 1/64 for every tile, for we do not just pick a place on the board at random and put a piece on it. Each side starts with his army, and there is a very limited number of moves that he can make on his first turn. Even in later turns, there are many tiles that are unreachable, and the closer you move a piece to your opponent’s forces, the higher the chances that he will strike it down. In general, we absolutely cannot hope to obtain a uniform color diagram. Similarly, the number of turns in the game plays a crucial role in the distributions. In the late game, there is much freedom of movement, although not many pieces; in the early game, there is only a fixed set of squares and pieces which can move into them. Thus, when taking the average distribution of several random games, the number of moves in each game has to be taken into consideration. It would therefore be of considerable interest to see how random games behave.
For this purpose, a (not-so-small) Python script was written, capable of playing random chess games. During each turn, a piece was randomly selected and moved to a random legal position (taking checks into consideration, of course; however, two moves were not implemented: castling and en passant). The games were recorded and saved, allowing the frequency-analyzer script from our previous discussion to operate on them (effectively, the standard PGN notation was used, although in a degenerate form). For each master’s game previously analyzed, the script played a simulated random game against itself, with the same number of moves as the master’s. Effectively, though, the total number of random moves was smaller than in the original, since some games were cut short by checkmates.
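My original script is not shown here, but the same experiment can nowadays be reproduced in a few lines with the python-chess library (an external dependency; note that, unlike the script described above, it does implement castling and en passant):

```python
import random

import chess  # python-chess; handles move legality, checks, and game-over detection

def random_game(max_moves=80, seed=None):
    """Play uniformly random legal moves until the game ends or a move cap is hit."""
    rng = random.Random(seed)
    board = chess.Board()
    moves = []
    while not board.is_game_over() and len(moves) < max_moves:
        move = rng.choice(list(board.legal_moves))
        moves.append(board.san(move))  # record in standard algebraic notation
        board.push(move)
    return moves, board.result()

moves, result = random_game(seed=1)
print(len(moves), result)  # result is "*" if the cap was reached first
```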
The color maps are as follows:
[White distribution – colored in red. Black distribution – colored in blue. Total – colored in purple]

We observe a considerable difference between the random distribution and that of the chess masters, and a variety of effects on which we should comment.
First we note that the four central rows (3,4,5,6) are much more inhabited than the border ones (1,2,7,8), where the pieces initially start at the beginning of the game. This may happen due to the fact that random moves, as their name so aptly implies, do not aim to go anywhere in particular. There is no “hunt for the king”, or any chase after more valuable pieces, and accordingly, there is no defence whatsoever against any such attacks. Rather than hinder us, this fact actually makes it easier to analyze the result, and the color maps can be described as follows: the more pieces that can reach a certain square, the brighter it is, and that is all. It can therefore be said that these maps serve the purpose of showing a “time-lapse” of all the threatened tiles in a random chess game. This is true only for a mass of random games, for in a single match each tile is not stepped on a great number of times. It is important to realize that this statement cannot be justifiably claimed about the previous article’s color maps, since in those games there was an overall strategy guiding the movement of the pieces. Thus, the fact that a tile is threatened by several pieces does not deductively imply that the master player will move any piece to that tile; each of his moves is calculated and serves to gain him material or position, or to defend himself, and most of the time these moves do not coincide with the amount of pieces that can step on a certain tile (an obvious example: when retreating with the queen in order to prevent it from being captured).
This elucidates why the center rows are so populated – simply put, more pieces can move there than to the bottom and upper rows. Why is this? First and foremost, when the game starts, the bottom and top two rows are already fully populated. There is no option to move any piece other than the pawns and knights, and they can only step on the center of the playing field. As the game progresses, more and more holes gape up in the remote rows, giving pieces a chance to step there. A very interesting comparison to the physical world opens up now: the whole process is remarkably similar in concept to conduction electron bands in semiconductors. A pure semiconductor at absolute zero temperature has a full valence band of electrons, and an empty conduction band (in the chess game, these are the remote rows, and the central rows, respectively). Progress of the game is equivalent to heating up the semiconductor, in which case some electrons from the valence band transfer to the conduction band – these are the pieces which move forward towards the central rows. These open up holes in the lower rows, which can then be filled by other pieces, giving more freedom of movement (lower resistance). As a way of comparison, here is a schematic (courtesy of http://www.chemistry.wustl.edu) of electron bands in semiconductors. The image on the left is the initial position; in the image on the right some electrons (chess pieces) have moved up and created holes in the structure of the initial position.

In general, we can also try to use thermodynamics to study this random chess game. We begin with a highly ordered arrangement, but end up with scattered pieces lying about. As the game progresses, more and more pieces can move about and extract themselves from their initial positions. A kinetic-matter analogy would be attaching two containers of gas to one another. The gases would start at each end of their respective containers, but with time they diffuse to other regions and intermix with each other. Collisions between gas particles are equivalent to the fact that most chess pieces cannot move through one another (all but the knight). Thus, although almost all of the pieces can move in all directions, we never return to the same position, or even one resembling it, and generally tend to spread over the chessboard, just as in gas diffusion.
Of course, we should not get carried away. Two very important and decisive factors differentiate the random chess game from a classical thermodynamic example. The first is that the displacement processes of our “particles” are asymmetric, due to the movement of the pawns. Pawns cannot go backwards under any circumstances except when promoting. In the thermodynamic world, this would be represented by a filter or membrane which prevents molecules from travelling backwards, severely complicating the ordeal. Second, and probably more important, is that pieces get captured as the game progresses. Chess, after all, consists of two armies unleashing all their wrath upon each other. This means that certain pieces won’t get very far before they are destroyed. Hence the analogy to thermodynamics is not exact, and I am not inclined to think that one can simply “apply” the known physical formulas to the game. However, it is certainly a good place to start. A good approach would be to try to quantify mathematically the behaviour of each piece as a physical particle in a degenerate “ensemble” of gas, perhaps dealing at first with more trivial chess cases instead of diving head first into all the complex rules and layout.
We now fully understand why the central regions are so much brighter than the outer rows. At first, the central rows are the only places the pieces can go. Also, when the pawns (constituting half the initial pieces) make their first moves, they only allow pieces to escape the bottom two rows, further contributing to this effect. As time progresses, more and more holes open up, and the center becomes more cluttered. If this were a normal ideal gas, the pieces would spread out uniformly, and overall, if games were played ad infinitum, we would expect a much smoother distribution. However, this is not the case in chess, which is limited by checkmates and stalemates. Furthermore, pieces get captured, with chances increasing the closer they are to the opponent, which reduces the chance of raising the border rows’ frequencies, and pawns can only advance in one direction. Thus, when playing for a finite and not-so-large number of turns, the central rows are much more popular than the outer ones. One immediate conclusion is that when playing against an opponent who moves randomly, putting your king in the corner will greatly increase its safety (of course, most people don’t need strategy tips against random moves).
In conclusion, even this artificial scenario – a chess game carried out completely by random moves – can lead to some interesting investigation. The fact that it is random allows us to compare it to other statistical phenomena in nature, such as gas diffusion. A simple and straightforward analogy won’t hold, of course, but it could be very worthwhile to investigate what this direction yields. Hopefully, from this we might be able to state something intelligent about non-random situations, although this is just a speculation.

One final remark about the nature of random games. While it may seem like these games often end in stalemates or are generally uninteresting, this is far from the truth. Overall, when running the random-game-playing script, about one in thirty games ended in checkmate, with no clear preference for either black or white. Amongst these victories was the fastest checkmate possible in the game:
1. f3 e5
2. g4 Qh4# 0-1

It was quite amusing to see this come up in a random game. Other interesting endgames are presented below. Some of them involved promotions, some had a full board, others very sparse; such is the nature of random games.

### Chess Territorial Disputes

October 5, 2010

Introduction

Chess, often dubbed “The Game of Kings”, is not considered an easy game. It has many basic rules, which often deter new players from learning it, unlike Go or Othello (Reversi), whose rules are simpler and fewer in number. Even once the rules are grasped, the concepts of the game and the ability to look ahead do not come easily to most. However, while many structures and strategies are subtle and require a serious length of time to recognize and master, a few aspects can be seen with relative ease. For example, any rookie will recognize that the queen, having such generous freedom of movement, is generally more powerful than any other piece in the game. A slightly more complicated facet is board control. While discerning when positional advantage outweighs material (or vice versa) is one of the more advanced distinctions a chess player can make, some traits are easily observed, the most important of which is control of the center. Pieces in the center of the board can exercise their strength to its full potential, allowing them to threaten the largest number of opposing pieces and giving them a wide area should retreat be necessary. On the contrary, when pushed to the sides, bishops and knights are often cornered and captured, or are otherwise rendered useless or forced to retreat to unhelpful posts. Thus, chess can be seen as a game for control of the center. He who manages to establish a hold there gains a tactical advantage over his opponent.
Of course, the middle squares start empty, and are of equal distance from both players’ armies. It would seem only logical that they be the main site for activity, since in the beginning of the game, the other side of the board is heavily defended by a great number of opposing pieces. Fighting to take control of them involves both aggressive moves and calculated withdrawals, most of which we could assume to revolve around the four central squares.
It is therefore of a certain interest to see how chess moves are distributed across the board. Will they all be focused around the center? Which squares should be crowned the “most popular”? Which squares do professional players abhor, and which attract more attention? Can we note any peculiar activity, or in general explain the various phenomena behind this distribution? Can beginners learn anything merely by looking at the statistics of chess pros?
With these questions in mind, I set out to carry out a simple calculation: to find the relative frequencies with which game pieces step onto the different squares. In other words, to discover which squares are stepped on more, and which less, in the average chess game. I chose to ignore all distinctions between pieces, and count just the number of visits. The eventual result should be a chessboard with a number between zero and one attached to each square, denoting the percentage of times it was stepped on by any piece, out of the total number of moves performed in a large number of games.

Method

The first step involved is to acquire a large set of chess games. Since we are looking at a purely statistical property of chess, and considering the fact that it is quite doubtful that any single match can correctly represent the total distribution, a large amount of data is needed. The fantastic site http://www.chessgames.com/ serves perfectly for this purpose. It contains an archive of over 567,000 games from over 150 years, played by masters and grandmasters from all over the world. While information can be readily searched and filtered, I did not want any biases towards a certain playing style dictated by personality or year. Hence, the site’s random game option was used in order to get the moves.
ChessGames presents the matches in PGN format (Portable Game Notation; example), which is basically a text-based way to represent the moves. The site offers a choice of several Java viewers which assist the display of these games; however, since we are only interested in the data, this is unneeded. Luckily, each page contains a link to the raw PGN data. Thus, a small Python script was written using the handy urllib2. It continually accesses random games, finds the download link, and saves the moves to a file, where they will later be analyzed. Since it bears a striking similarity to the script I wrote for accessing Google’s search results, I was afraid that once again I would be identified as an “automated bot”. However, it seems as if ChessGames is not as busy a site as Google, nor is it often the target of denial of service attacks; my script ran without fault.
A collection of about 500 games seemed enough for this simple purpose; in the end, 509 games were downloaded. A second Python script was written with the purpose of tallying the moves in this large collection of games. For each game, the script loaded the data, parsed the PGN format, and calculated how many times each square was visited, by any piece. Castling was counted as two moves, by the rook and the king, increasing the count of the appropriate squares by one. A distinction was made between the white and black sides, for while each possesses identical pieces, white always starts and thus tends to be more aggressive. I speculated that this might cause a difference in the distribution on the chessboard. Of course, the total number of moves for white and black pieces combined was also calculated. The results are as follows. Remember that white starts at rows 1,2 and black starts at rows 7,8.
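The counting script itself is not reproduced; as a sketch, here is how the same tally could be computed today with the python-chess library (an assumption on my part; the original parsed the PGN by hand). Castling is counted as two moves, as described above:

```python
import io
from collections import Counter

import chess
import chess.pgn  # python-chess PGN parser

def square_visit_counts(pgn_stream):
    """Tally how many times each square is moved onto, per side,
    over all games in a PGN text stream."""
    counts = {chess.WHITE: Counter(), chess.BLACK: Counter()}
    while (game := chess.pgn.read_game(pgn_stream)) is not None:
        board = game.board()
        for move in game.mainline_moves():
            counts[board.turn][chess.square_name(move.to_square)] += 1
            # castling also moves the rook; count its landing square too
            if board.is_castling(move):
                file = "d" if board.is_queenside_castling(move) else "f"
                rank = "1" if board.turn == chess.WHITE else "8"
                counts[board.turn][file + rank] += 1
            board.push(move)
    return counts

sample = io.StringIO("1. e4 e5 2. Nf3 Nc6 *")
tally = square_visit_counts(sample)
print(tally[chess.WHITE]["e4"], tally[chess.BLACK]["e5"])  # 1 1
```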

Absolute:

White : 20182
=====================
8 |   64   39   70   87   78   66   32   33
7 |   98  102  132  142  148  134  101   75
6 |   99  119  224  221  195  280  142  141
5 |  156  343  283  560  605  347  454  181
4 |  332  279  697  974  832  568  366  337
3 |  243  433  864  564  627 1029  462  336
2 |   79  197  313  586  582  267  318  136
1 |   88  226  379  487  422  704  544  160
---------------------------------------
a    b    c    d    e    f    g    h

Black : 19882
=====================
8 |  102  221  353  449  454  756  599  185
7 |   67  247  303  691  714  317  377  138
6 |  351  421  854  648  682 1046  490  328
5 |  329  371  701  825  773  497  345  281
4 |  156  380  306  512  442  248  326  166
3 |  123  142  238  168  163  188  121   96
2 |  107  121  122  112  109  110   85   54
1 |   42   34   56   65   75   45   21   34
---------------------------------------
a    b    c    d    e    f    g    h

Total : 40064
=====================
8 |  166  260  423  536  532  822  631  218
7 |  165  349  435  833  862  451  478  213
6 |  450  540 1078  869  877 1326  632  469
5 |  485  714  984 1385 1378  844  799  462
4 |  488  659 1003 1486 1274  816  692  503
3 |  366  575 1102  732  790 1217  583  432
2 |  186  318  435  698  691  377  403  190
1 |  130  260  435  552  497  749  565  194
---------------------------------------
a    b    c    d    e    f    g    h


Relative:

White : 1.0
=====================
8 | 0.003 0.001 0.003 0.004 0.003 0.003 0.001 0.001
7 | 0.004 0.005 0.006 0.007 0.007 0.006 0.005 0.003
6 | 0.004 0.005 0.011 0.010 0.009 0.013 0.007 0.006
5 | 0.007 0.016 0.014 0.027 0.029 0.017 0.022 0.008
4 | 0.016 0.013 0.034 0.048 0.041 0.028 0.018 0.016
3 | 0.012 0.021 0.042 0.027 0.031 0.050 0.022 0.016
2 | 0.003 0.009 0.015 0.029 0.028 0.013 0.015 0.006
1 | 0.004 0.011 0.018 0.024 0.020 0.034 0.026 0.007
---------------------------------------
a     b     c     d     e     f     g     h

Black : 1.0
=====================
8 | 0.005 0.011 0.017 0.022 0.022 0.038 0.030 0.009
7 | 0.003 0.012 0.015 0.034 0.035 0.015 0.018 0.006
6 | 0.017 0.021 0.042 0.032 0.034 0.052 0.024 0.016
5 | 0.016 0.018 0.035 0.041 0.038 0.024 0.017 0.014
4 | 0.007 0.019 0.015 0.025 0.022 0.012 0.016 0.008
3 | 0.006 0.007 0.011 0.008 0.008 0.009 0.006 0.004
2 | 0.005 0.006 0.006 0.005 0.005 0.005 0.004 0.002
1 | 0.002 0.001 0.002 0.003 0.003 0.002 0.001 0.001
---------------------------------------
a     b     c     d     e     f     g     h

Total : 1.0
=====================
8 | 0.004 0.006 0.010 0.013 0.013 0.020 0.015 0.005
7 | 0.004 0.008 0.010 0.020 0.021 0.011 0.011 0.005
6 | 0.011 0.013 0.026 0.021 0.021 0.033 0.015 0.011
5 | 0.012 0.017 0.024 0.034 0.034 0.021 0.019 0.011
4 | 0.012 0.016 0.025 0.037 0.031 0.020 0.017 0.012
3 | 0.009 0.014 0.027 0.018 0.019 0.030 0.014 0.010
2 | 0.004 0.007 0.010 0.017 0.017 0.009 0.010 0.004
1 | 0.003 0.006 0.010 0.013 0.012 0.018 0.014 0.004
---------------------------------------
a     b     c     d     e     f     g     h


And of course, we cannot resist drawing a color map with intensity proportional to the relative frequency (created with OpenGL):

White:

Black:

Total:

Discussion

What can we learn from looking at these statistics? The following will not be an exhaustive approach; rather, we will look at interesting and distinguishable points we can derive from these data.
First, our main hypothesis has been affirmed: the center is indeed the focal point of the game of chess. The corners are rarely visited, as are the squares a2, h2, a7, h7. From my own experience I know that usually the only pieces to travel there are the king, when hard pressed, and the rook, most often to defend lonely pawns. Of course, there is a vast gorge separating the factual observation – “the center squares are the most popular” – from any immediate conclusions. However, seeing this distribution, one could try a statistical approach to playing chess: move one’s pieces so that they threaten the squares most visited by one’s opponent. Any real strategy would have to be incredibly more complex than this, but it may be an interesting course to follow.
One immediately apparent observation is that the board and the game are not symmetrical, a consequence of the king and queen being unequal pieces. One result is that the kings' "knight squares" (f3, f6) are visited by the corresponding color much more often – 25% – than the queens' "knight squares" (c3, c6). Another asymmetry is castling. Castling kingside usually gives the king better protection. It is also easier to do, since there are fewer squares on which the king can be checked, and only two pieces – the knight and the bishop – have to be moved out of their initial positions beforehand. Castling queenside requires moving the queen as well as the knight and bishop, and tends to leave the king more exposed (with the advantage of placing the rook on a more offensive file). These reasons, along with some minor others, should lead players to castle kingside more often. This is indeed the case, as can easily be seen in all the color maps – the right side of rows 1 and 8 is brighter than the left, and the king's protective area is clearly recognizable.
Is white more aggressive than black? At first glance it seems so. Comparing black's relative frequency to white's square by square shows that white's front row (row 4) is trodden on 5% more than black's front row (row 5). Whether this difference is statistically significant given the size of the data set was not verified.
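One way to check would be a two-proportion z-test on the raw visit counts of the two rows. A sketch in pure Python, with made-up counts (the raw counts behind the relative frequencies are not reproduced here):

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-proportion z-test: are the visit rates k1/n1 and k2/n2
    significantly different? Returns (z, two-sided p-value)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                     # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: row 4 visited 17850 times and row 5 visited 17000 times,
# each out of 100000 recorded square visits for the respective color.
z, p = two_proportion_z(17850, 100000, 17000, 100000)
print(z > 0, p < 0.05)
```

With counts this large even a 5% relative difference comes out significant; with a much smaller data set the same relative gap might not.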

Conclusions

While we have just scratched the surface of the investigation, it seems there are many interesting facts one can discover by analyzing where chess pieces tend to go. A more rigorous, thorough, and mathematical approach could probably reach far more decisive conclusions; however, we now have the tools to approach the problem. Extending the research might include numerical calculations, and taking into consideration the types of pieces and when they visit different squares.

Unfortunately, the programs written are much too large to be uploaded inline on this website.