A primitive model for genetic recombination

I’m taking a class in general genetics at the Technion, where we learned about genetic recombination and, in particular, homologous chromosome crossover: a phenomenon in which chromosomes exchange sections with each other during meiosis.
When this happens, some members of the population exhibit “recombinant genomes”, which differ from what their parents’ genomes would otherwise be expected to generate. Surprisingly, this part of the population never exceeds 50%, even though at first glance it seems as if it could.

In this post, we’ll see a model of chromosomal crossover statistics that explains this phenomenon and also gives an estimate of the physical distance between genes as a function of population statistics. I’ll assume you know some basic genetic terms, such as “dominant” and “heterozygote”, but I’ll explain crossovers in general and describe the problem in more detail below. You can skip directly to “The model basics” if you already know about recombination.
The post will be about 50% biology and 50% math.

Biological background:
We’ll work through an example with the commonly used traits of eye color and wing size in fruit flies. Both are controlled by single genes found on the X chromosome.
A fly’s eyes can be either red or white, with red being the dominant trait. We’ll mark the dominant allele with R and the recessive allele with r. Thus, if a fly’s genome contains Rr or RR, it will have red eyes; otherwise, if it contains rr, it will have white eyes.
Similarly, a fly’s wings can be either long or short, with long being the dominant trait. We’ll mark the dominant allele with W, and the recessive with w, so long winged flies have Ww or WW, and short winged flies have ww as their genotype.

Image source: https://theconversation.com/animals-in-research-drosophila-the-fruit-fly-13571 (note: the wings here are curled, not long/short)

Suppose we have a heterozygote cis female. That is, her genome contains both the dominant and the recessive alleles (so she has RrWw in her genome), and both of the dominant alleles are found on the same homologous chromosome. Her two X chromosomes look like this:


During meiosis, her two homologous chromosomes duplicate and then separate, and we get two types of possible germ cells: RW and rw:


However, it is also possible for crossover to occur: two chromatids are “sliced” at some point, and then the two parts of each are glued to each other.


If this happens during meiosis, the outcome is four possible germ cells: RW, Rw, rW, rw:


Now, what happens when we mate our RrWw female with a white eyed, short winged male? Since these traits are found on the X chromosome, and a male fly only has one of those, he necessarily carries only the recessive alleles, rw. We don’t care about the Y chromosome here.


Upon mating, the male fly will give the offspring either an X or a Y chromosome. Let’s ignore the males at this point, and focus just on the females. Since our male’s genotype is rw, we will get the following combinations: RrWw, rrww, Rrww, rrWw. All of these are phenotypically different, and each represents a different combination of red/white eye and long/short wing. The Rrww and rrWw genotypes are recombinant – they only exist in the population because of recombination.

Suppose now that the chance for recombination between R and W is some number q between 0 and 1. Then if we look at a very large collection of germ cells from the mother, we expect the following distribution:

RW should be \frac{1}{2}(1-q) of the germ cell pool
rw should be \frac{1}{2}(1-q) of the germ cell pool
Rw should be \frac{1}{2}q of the germ cell pool
rW should be \frac{1}{2}q of the germ cell pool

This is because a fraction q of the germ cells should be recombinant, and whenever there is recombination we get equal amounts of Rw and rW.
After mating, when looking at the females, we only need to add the father’s recessive genes, and we get:

RrWw should be \frac{1}{2}(1-q) of the population
rrww should be \frac{1}{2}(1-q) of the population
Rrww should be \frac{1}{2}q of the population
rrWw should be \frac{1}{2}q of the population


Thus, Rrww and rrWw comprise \frac{1}{2}q+\frac{1}{2}q = q of the population. This can be measured in real experimental trials, since each of the above genotypes translates into a different observable phenotype.
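As a small sketch of how this bookkeeping could be used, the snippet below computes the expected genotype fractions among the daughters for a given q, and estimates q back from phenotype counts. The function names and the counts are hypothetical, made up purely for illustration; they are not experimental data.

```python
def expected_female_offspring(q):
    """Expected genotype fractions among daughters for recombination probability q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    return {
        "RrWw": 0.5 * (1 - q),  # parental: red eyes, long wings
        "rrww": 0.5 * (1 - q),  # parental: white eyes, short wings
        "Rrww": 0.5 * q,        # recombinant: red eyes, short wings
        "rrWw": 0.5 * q,        # recombinant: white eyes, long wings
    }


def estimate_q(counts):
    """Estimate q as the fraction of recombinant daughters in observed counts."""
    total = sum(counts.values())
    recombinant = counts.get("Rrww", 0) + counts.get("rrWw", 0)
    return recombinant / total


fractions = expected_female_offspring(0.2)
print(fractions)

# Hypothetical counts, invented for this example only.
made_up_counts = {"RrWw": 410, "rrww": 390, "Rrww": 95, "rrWw": 105}
print(estimate_q(made_up_counts))
```

The estimate is just the recombinant fraction, which is why q can be read directly off a count of phenotypes.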
At this point in our theory, q can be any number between 0 and 1. If q is 0, then there is never any recombination, and the two genotypes RW and rw go hand in hand forever. If q is 1, then recombination always happens.
However, it is an empirical fact that the percentage of recombinant population is never more than 50%! The measured value of q is always less than or equal to 0.5.

There must be some mechanism that prevents recombination from happening too often. We can make appeals as to the utility of this mechanism, and wonder whether it is good or bad to have a small number or a large number of recombinations between genes – but for now, let’s try to think of an underlying model.

Image source: wikipedia

The model basics:
We treat the chromosome as a linear piece of DNA, with a length of “1 chromosome” – in essence, it is a line segment of length 1. The different genes are points on this line, and are therefore assigned a position 0 \leq pos \leq 1. In reality genes have some finite width on the DNA strands so a more accurate model will treat them as small intervals, but it will be easier to consider them as points.
We’ll assume that the gene that codes for eye color is on the left of the gene that codes for wing size. Denoting the position of the first by x and the second by y, we have this schematic for our chromosome:


The primary element in our model is the crossover event, or a cut. In this event, two homologous chromosomes are cut at a random place, distributed uniformly across their entire length. The chromosomes then swap strands at this position.

There are two options here. If the cut occurs in the interval between x and y, the genes will be swapped, and we have recombination. However, if the cut occurs outside the interval [x,y], then those two genes will not be affected. Since the cut distribution is uniform, the chance to land between the two genes is just y-x, so the probability of recombination is q = y - x .
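The single-cut picture is easy to check numerically. Here is a minimal Monte Carlo sketch, assuming exactly one uniformly placed cut per meiosis; the function name, the gene positions 0.2 and 0.7, and the seed are arbitrary choices for illustration.

```python
import random


def single_cut_recombination_rate(x, y, trials=100_000, seed=0):
    """Fraction of trials in which one uniform cut lands between positions x and y."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = sum(1 for _ in range(trials) if x < rng.random() < y)
    return hits / trials


estimate = single_cut_recombination_rate(0.2, 0.7)
print(estimate)  # should come out close to y - x = 0.5
```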


This is a simple operation, and it’s tempting to think that it is the whole story, but this is not so. If two genes are far away from each other, meaning at opposite ends of the chromosome, then under this description the probability of recombination can be very close to 1: nearly every cut we make will separate them. But we never observe a q above 0.5! There is obviously something more that we are missing here.

Image source: Science magazine

The answer: the above description is true only for a single crossover event – a single cut. However, there is no guarantee that a chromosome will undergo any crossovers at all during meiosis. Further, a chromosome may actually undergo several crossover events, as was experimentally discovered when looking at the recombination relations between a triplet of genes on the same chromosome. But look what happens when there are two crossover events in the same interval [x,y]: the strands are switched twice, and ultimately there is no recombination between the two genes!


We can now convince ourselves: whether or not we see recombination between two genes depends on the parity of the number of crossover events that occurred between them. When looking at the population statistics, what we ultimately see is the average of the parity of crossovers.
As an artificial example, suppose that during meiosis, there is a 50% chance of performing a single cut, and a 50% chance of not performing any cuts at all. In that case, for two far away genes, which are always separated by any cut, there is a 50% chance of getting recombination, and 50% chance of not getting it. In other words, q was reduced from 1 to 0.5. In general, in this case the observed probability of getting recombination is q = \frac{1}{2}(y-x) , as half the time we do not get a recombination at all.
Of course, there is no reason to assume that there is a 50% chance of getting no crossover event, and 50% of getting exactly one – the number of crossovers could behave in different ways – but we see that the actual percentage of recombinant population depends on the distribution of the number of crossover events in the chromosome. Which distribution should we choose?
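To play with different choices, here is a rough simulation sketch of the parity idea: draw a number of cuts from some distribution, place each cut uniformly, and call the outcome recombinant when an odd number of cuts falls between x and y. The names `simulate_recombination` and `zero_or_one_cut` are hypothetical helpers, and the gene positions are arbitrary.

```python
import random


def simulate_recombination(x, y, sample_cut_count, trials=100_000, seed=0):
    """Estimate q as the fraction of meioses with an odd number of cuts in (x, y)."""
    rng = random.Random(seed)
    recombinant = 0
    for _ in range(trials):
        cuts = sample_cut_count(rng)  # total number of cuts on the chromosome
        between = sum(1 for _ in range(cuts) if x < rng.random() < y)
        recombinant += between % 2    # only the parity matters
    return recombinant / trials


def zero_or_one_cut(rng):
    """The artificial example: 50% chance of one cut, 50% chance of none."""
    return 1 if rng.random() < 0.5 else 0


estimate = simulate_recombination(0.1, 0.9, zero_or_one_cut)
print(estimate)  # should come out close to 0.5 * (0.9 - 0.1) = 0.4
```

Any other cut-count distribution can be tried by passing a different sampler function.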

A slightly flawed offer:
A simple choice would be a binomial distribution. The reasoning goes as follows: during meiosis, there are all sorts of enzymes floating about the chromosomes, which are responsible for cutting them up and gluing them back together. There may be a large number n of these enzymes floating about, but each has only a certain probability p of actually performing its duty. Of course, we assume that they act independently, even though in reality they may interfere with each other. So the number of crossovers equals the number of “successes”, where a success is an enzyme doing its work properly, which happens with probability p. This means that the number of cuts is distributed according to C \sim Bin(n,p) .


So assuming the number of crossover events distributes according to C \sim Bin(n,p) , what is the probability of getting an odd number of crossovers? Let’s take a moment to calculate it.

For any n , denote that probability by P_n . Suppose you already checked n-1 of the enzymes. Then with probability P_{n-1} , you already have an odd number of crossovers, so you need the last enzyme to do nothing, which happens with probability 1-p . Further, with probability 1-P_{n-1} , you have an even number, and you need the last enzyme to perform another crossover, which happens with probability p . So the probability obeys the recurrence relation

P_n = P_{n-1}(1-p)+(1-P_{n-1})p.

with the initial condition that P_0=0 , as if there are zero enzymes there are zero crossovers, which is an even number.
More nicely:

P_n = P_{n-1}(1-2p)+p

P_0 = 0.

If we look at just this equation:

P_n = P_{n-1}(1-2p)

we quickly see that the answer is P_n= a \cdot (1-2p)^n for some constant a . However, we also have that additive +p in our original equation. It turns out we only need a small adjustment in order to compensate for it, and in this case we just have to add an extra constant, so that

P_n = a \cdot (1-2p)^n + c.

Since the equation is linear, this is very much like finding a particular solution of a linear differential equation, and we can find c directly by substituting it for both P_n and P_{n-1} in the recurrence relation:

c = c (1-2p) + p,

which gives

c = \frac{1}{2}.

The initial condition P_0 = 0 gives a = -\frac{1}{2} , so the solution is

P_n = \frac{1}{2} -  \frac{1}{2}(1-2p)^n
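As a quick sanity check, the recurrence can be iterated directly and compared with the closed form. This is only a numerical sketch; the function names and the value p = 0.1 are arbitrary choices for illustration.

```python
def odd_probability_recurrence(n, p):
    """Iterate P_k = P_{k-1} * (1 - 2p) + p starting from P_0 = 0."""
    prob_odd = 0.0
    for _ in range(n):
        prob_odd = prob_odd * (1 - 2 * p) + p
    return prob_odd


def odd_probability_closed_form(n, p):
    """Closed form 1/2 - 1/2 * (1 - 2p)^n."""
    return 0.5 - 0.5 * (1 - 2 * p) ** n


for n in (1, 5, 50):
    print(n, odd_probability_recurrence(n, 0.1), odd_probability_closed_form(n, 0.1))
```

The two columns should agree up to floating point error, and both approach 0.5 as n grows.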

Wonderful! For any p strictly between 0 and 1, the probability of getting an odd number of crossovers goes to 0.5 as n grows! Even for relatively low probabilities p, the quantity (1-2p)^n goes to 0 very quickly.

This gives an answer regarding two genes which are very far away: they are affected by every cut performed by the enzymes, and so their recombination probability is exactly the same as the probability of getting an odd number of cuts. But what about genes which are closer? For them we actually have to take into consideration the fact that not every cut the enzymes make falls between the two genes.
Notice the following: the number of cuts in every chromosome is distributed binomially, C \sim Bin(n,p) . If we already know the number of cuts to perform, say k, then the number of cuts which affect the two genes at positions x and y is also distributed binomially as Bin(k,y-x) , since every cut has a probability of y-x of landing between the two genes. So the number of crossovers between x and y, conditioned on C = k , is Bin(k,y-x) , and k itself is distributed as Bin(n,p) .
Now comes the cool part: there is a theorem about binomial distributions which says the following: if X is a random variable that is distributed binomially, X \sim Bin(n,p) , and Y is a random variable that conditioned on X is distributed binomially, Y|X \sim Bin(X,r) , then Y is also binomial, Y \sim Bin(n, pr) ! Using this theorem, the number of cuts S which swap between x and y goes as S \sim Bin(n, p \cdot (y-x)) .
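The theorem can be checked exactly for small cases by summing over the number of cuts. The sketch below does this with plain probability mass functions; the helper names, the shorthand d for y-x, and the values n = 12, p = 0.3, d = 0.4 are all arbitrary choices made here for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(K = k) for K ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def thinned_pmf(s, n, p, d):
    """P(S = s) when C ~ Bin(n, p) and S given C = k is Bin(k, d)."""
    return sum(binom_pmf(k, n, p) * binom_pmf(s, k, d) for k in range(s, n + 1))


n, p, d = 12, 0.3, 0.4  # d plays the role of y - x
max_gap = max(
    abs(thinned_pmf(s, n, p, d) - binom_pmf(s, n, p * d)) for s in range(n + 1)
)
print(max_gap)  # difference between the two distributions, up to rounding
```

Intuitively, each of the n enzymes independently both cuts and lands between the two genes with probability p times d, which is exactly a Bin(n, pd) count.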
Now we can apply the same reasoning as before, only this time, a “success event” is not merely when the enzymes perform a crossover anywhere on the chromosome, but rather when they perform it in some place between x and y.
The final probability of getting recombination between two genes is then

q = \frac{1}{2} -  \frac{1}{2}(1-2p(y-x))^n

This is very nice, and it gives us some asymptotics as well. When n \cdot p(y-x) is large (and p(y-x) stays below \frac{1}{2} , as it does for small p), the second term is negligible, and we have q \approx \frac{1}{2} . When n \cdot p(y-x) is small, the second term can be expanded to first order, and the two \frac{1}{2} ’s will cancel each other out, giving us q \approx np(y-x) , so that q \propto (y-x) .
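The small-distance limit can be spelled out in one line. The shorthand d = y - x is introduced here only to keep the expansion readable.

```latex
\begin{align*}
(1 - 2pd)^n &= 1 - 2npd + O\big((npd)^2\big), \qquad d = y - x, \\
q &= \frac{1}{2} - \frac{1}{2}(1 - 2pd)^n \approx \frac{1}{2} - \frac{1}{2}\,(1 - 2npd) = npd .
\end{align*}
```

So for nearby genes the recombinant fraction grows linearly with the distance, with slope np, the expected number of cuts per chromosome.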

Slightly improving the binomial:
Overall, the model proves adequate in its predictions, and its simplicity is alluring. However, it is not without problems. For example, its two parameters, p and n, must somehow be estimated, and it is not entirely clear how to do so. In fact, the very fact that we have a fixed n here seems out of place: by keeping it around, we assume that there is a constant number of enzymes at work, when it is much more reasonable that the number varies from cell to cell. After all, when manufacturing hundreds or thousands of enzymes, there must be variation in the numbers.

Luckily, there is a simple way to fix this, which is actually well grounded in reality. Instead of assuming that the number of cuts the enzymes make is distributed binomially, we assume it follows a Poisson distribution, C \sim Pois(\lambda) , for a yet unknown \lambda . This actually makes a lot of sense when we remember that Poisson distributions are used in real life to describe queues and manufacturing processes, when what we know is the average rate at which events occur.
If the number of overall cuts has a Poisson distribution, how does the number of crossovers between x and y behave? Well, given that the number of cuts is k, the number of crossovers is still as before, Bin(k, y-x) . But again the theorems of probability smile upon us, and there is a theorem stating that if C \sim Pois(\lambda) and conditioned on C = k we have S|C \sim Bin(C,y-x) , then

S \sim Pois(\lambda(y-x)).
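For readers who want to see where this comes from, here is a short derivation sketch. It sums over the total number of cuts k and uses d = y - x as shorthand, a symbol introduced here only for this calculation.

```latex
\begin{align*}
P(S = s) &= \sum_{k=s}^{\infty} \frac{e^{-\lambda}\lambda^{k}}{k!}\binom{k}{s} d^{s}(1-d)^{k-s} \\
&= \frac{e^{-\lambda}(\lambda d)^{s}}{s!}\sum_{j=0}^{\infty}\frac{\big(\lambda(1-d)\big)^{j}}{j!} \\
&= \frac{e^{-\lambda}(\lambda d)^{s}}{s!}\, e^{\lambda(1-d)} = \frac{e^{-\lambda d}(\lambda d)^{s}}{s!}.
\end{align*}
```

The last expression is exactly the probability mass function of a Poisson variable with mean \lambda d.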

So the distribution of crossovers between x and y will also follow a Poisson distribution!
Now we only have to remember the simple trick, that

Pois(\lambda)= \lim_{n \rightarrow \infty} Bin(n,\frac{\lambda}{n}).

Thus, under the assumption of a Poisson distribution, the final probability of getting recombination between two genes is

q = \frac{1}{2} - \lim_{n \rightarrow \infty} \frac{1}{2}(1-\frac{2 \lambda(y-x)}{n})^{n},

or, more simply,

q = \frac{1}{2} - \frac{1}{2}e^{-2 \lambda (y-x)}.
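Here is a minimal sketch of how the Poisson result could be used in both directions. It takes q = \frac{1}{2}(1 - e^{-2 \lambda (y-x)}), which is the probability that a Poisson variable with mean \lambda(y-x) is odd, checks that claim by summing the Poisson probabilities over odd counts, and inverts the formula to recover a distance from a measured q. The function names are hypothetical, and \lambda is assumed to be known, which in practice it is not.

```python
import math


def recombination_probability(distance, lam):
    """q = 1/2 * (1 - exp(-2 * lam * distance)) for genes separated by `distance`."""
    return 0.5 * (1.0 - math.exp(-2.0 * lam * distance))


def odd_poisson_probability(mean, terms=200):
    """Sum exp(-mean) * mean^k / k! over odd k, as a direct check of the formula."""
    total = 0.0
    term = math.exp(-mean)  # the k = 0 term
    for k in range(1, terms):
        term *= mean / k    # turn the (k-1) term into the k term
        if k % 2 == 1:
            total += term
    return total


def distance_from_recombination(q, lam):
    """Invert the formula: distance = -ln(1 - 2q) / (2 * lam), valid for 0 <= q < 1/2."""
    if not 0.0 <= q < 0.5:
        raise ValueError("q must satisfy 0 <= q < 0.5")
    return -math.log(1.0 - 2.0 * q) / (2.0 * lam)


q = recombination_probability(0.3, lam=2.0)
print(q, odd_poisson_probability(2.0 * 0.3))
print(distance_from_recombination(q, lam=2.0))  # should recover roughly 0.3
```

Note that the inverse blows up as q approaches 0.5: genes that look independent could be at almost any large distance from each other.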

This again has the same desirable properties as before, but the model is simpler: we got rid of the annoying n parameter, and the probability parameter p was replaced by the rate parameter \lambda .
(Note: For small values of (y-x), the probability for recombination is q \approx \lambda(y-x) ; if only we could set \lambda = 1 and get a direct relationship between q and the distance between genes…)

To conclude:
The percentage of recombinant phenotypes in a population of offspring never exceeds 50%. This is not an arbitrary number, but stems from the underlying biological mechanism of recombination. Because multiple crossover events can occur between two genes, what’s ultimately important is the parity of the number of such events. When the total number of crossovers in a chromosome follows a Poisson distribution, the parity can be readily computed. It behaves like a fair coin toss for genes which are physically far away from each other, but is linear in the distance between the genes when they are close to each other.
This “Poisson crossover model” is very simple, and of course does not explain all there is about recombination (genes are not points on a line; distribution is probably not Poisson; events are not independent; there are “recombination hotspots” in chromosomes; the chromosome is a messy tangle, not all of which is accessible; etc). But it looks like a good starting point, and to me seems adequate at explaining the basic behaviour of recombination.
