Posts Tagged ‘Biology’

The Johnson-Lindenstrauss Lemma

May 18, 2017

Today is a good day I think to talk about the Johnson-Lindenstrauss dimension reduction lemma!

Johnson and Lindenstrauss. It cannot be coincidental that they are both wearing the same type of glasses.
Left image from Johnson’s website, right image from Gil Kalai’s blogpost in memory of Lindenstrauss.

The setting is as follows. Suppose you have a large number n points, living in the Euclidean space of dimension \mathbb{R}^{d} . We’ll call these points x_1,x_2,\ldots,x_n , and think of both the number of points n and the number of dimensions d as being very large, much larger than the measly “3” of our easy-to-grasp physical world.

Now, you may wonder, “But we do live in a three dimensional world thank you very much, so where would I obtain such high dimensional data?” Well, these situations arise increasingly more often in computer science applications, where a point in d -dimensional space may represent an abstraction of some actual real-world thing.

For example, imagine this not-too-farfetched biology experiment. A biologist wants to understand how a particular gene works in a bacterium, and she does this using the following procedure: first, she makes small changes in the DNA sections that code up the protein. Then she replaces the original gene in the bacterium by the new modified one. Finally, she measures the concentration of about 1000 different proteins in the bacterium. If the original gene was important, or if it was a key ingredient in a large protein pathway, then many of these concentrations will be different under the modified gene when compared to the unchanged gene.

Our biologist repeats this experiment many many times, each time changing a different part of the gene in a different way. Ultimately, she has performed say about 500 experiments. Each experiment results in a vector of 1000 numbers, one for each protein. She now has 500 points living in a 1000-dimensional space*. Our poor biologist must now analyze these points, and from them understand the way in which the original gene affects the proteins and operates within the cell. She will run many statistical and geometrical algorithms on these data, and these algorithms will usually have the downside that they really, really slow down as a function of dimension. What can she do?

* Actually, since a set of n points always spans a space of dimension no larger than n , these points only live in a 500-dimensional space. But that’s besides the point here, as we will see next.

Ok, back to mathematics. The Johnson-Lindenstrauss lemma roughly says the following. Suppose you have these n points living in \mathbb{R}^{d} . You can actually put them into a very small space – logarithmic in n – in such a way that the distances between the points are almost preserved: If two points were originally far away from each other, they’ll remain far away from each other, and if two points were close, they’ll remain close. More formally,

Lemma: Let 0 < \varepsilon <1 and let x_1,x_2,\ldots,x_n be n points in \mathbb{R}^{d} . Then there is a function f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{k} such that for every two points x_i and x_j ,

(1-\varepsilon)||x_i - x_j||^2 \leq ||f(x_i)-f(x_j)||^2 \leq (1+\varepsilon)||x_i - x_j||^2

with k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3).

This lemma is good if you want to run geometric algorithms on your data. For example, in our biologist’s case, she might decide that if two modifications to the gene resulted in protein densities that are more or less the same (that is, two points in \mathbb{R}^d that are close to each other), then the modifications changed the gene in functionally the same manner. She might then decide to look for clusters of points which are close to each other. This is usually easier to do in smaller dimensional spaces. Applying a Johnson-Lindenstrauss transformation allows our biologist to run the algorithms faster and on more data, while (hopefully) not losing much accuracy.

Of course, this is useful only if the the new dimension k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) is much smaller than the original dimension d. This often happens in applications; for example, in a future post, we will work with n points in n dimensions, and the transition from a space of dimension n to space of dimension \log n will be quite substantial.

The proof we will present here follows a paper by Dasgupta and Gupta, which uses random projections. We’ll first illustrate this concept in two dimensions.

Suppose you have a set of points in the plane:

A one-dimensional subspace of the plane is just a line that goes through the origin. Once we have such a line, we can perform an orthogonal projection of the original points onto that line, giving us a new set of points, all on the line:

If we choose a different line, we get a different projection:

For some lines, the projection will totally wreck the distances between the original points; for example, in this image, we sent two different original points to the same point:

However, for other lines, it may be that the projection is able to preserve the ratio of distances fairly accurately.

The proof the Johnson-Lindenstrauss lemma uses the same idea, only in higher dimensions. Instead of points in the plane, we have our original d -dimensional space; and instead of projecting to a line, we try to find a subspace of dimension k so that projecting on to it will almost preserve distances. [Actually, we project the points and then blow them up by a constant factor, since orthogonal projections can only decrease the length of vectors; but this doesn’t change the gist of the argument. From now on when we say “distance preserving”, we’ll mean actually mean “up to a constant factor”. See the actual calculations for exact treatment].

A-priori, there is no grounds to think that such a subspace exists. However, it turns out that it indeed does, for a reason called “the concentration of measure in high dimensions” Roughly, it says that in high dimensions, some random processes – such as the orthogonal projection of a point onto a random subspace, or the scalar product of a unit vector with a Gaussian vector – are heavily concentrated around their expected value. The probability for such processes to be even a tiny bit away from their mean is exponentially small in the dimension. We usually don’t see these phenomena in 3d, because the lowly and pathetic exponent of 3 isn’t enough to give a profound effect, but in higher dimensions they flourish.

After all of this verbosity, here is the proof outline: for any k , we can choose a random k dimensional subspace of \mathbb{R}^d and project our points on to it. Choose two particular points x_i and x_j from our original cast in \mathbb{R}^d . It may either be that the distance between them is preserved up to a factor of 1 \pm \varepsilon , or it may not be preserved; denote the probability that it is not preserved is some number p . We can consider this “distance distortion” event for all of the {n \choose 2} = (n^2 - n)/2 pairs of points, and by the union bound, the probability for at least one pair of points to have their distance distorted is less than p \cdot (n^2 - n)/2 . If we can show that the probability p is smaller than 2 / (n^2 -n) , then the probability that at least one distance is distorted is smaller than 1. This means there is a non-zero probability that all the distances are preserved! But then there has to exist a distance-preserving projection; otherwise it would be impossible for this probability to be greater than 0. And as it turns out, we can indeed make the probability that a single distance is preserved be this small, by choosing k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) .

Actually, this proof not only shows that there exists such a transformation, it even tells you how to efficiently get one: the probability that a random projection gives a distance-preserving dimension reduction is actually quite high (and can be made very close to one if we just increase the number of dimensions by a bit). So you just need to take your points, project them randomly, and check that the distances are preserved; if not repeat, and sooner rather than later you are bound to get something that works.

Ok, let’s take a look the technical details. All we have to show is that with high probability, the length of a vector is nearly preserved when we project it to a random k dimensional subspace. Now, when performing the calculations, instead of taking a fixed vector and a random subspace, we can take a fixed subspace and a random vector; all the results will be the same, since all we care about when projecting is the relative orientation between the vector and the subspace.

So let’s take a random unit vector, and see what happens to its length. A uniformly random unit vector can be obtained by taking a random distribution which is spherically symmetric and normalizing it. Specifically, let X_1,\ldots,X_d be d independent Gaussian random variables with mean 0 and variance 1; in other words, the random vector X = (X_1,\ldots, X_d) distributes as a d dimensional Gaussian vector \mathcal{N}(0,\text{Id}_{d}) . Then the vector

Y = \frac{X}{||X||}

is a uniformly random unit vector. As for our fixed subspace, we’ll just take the space spanned by the first k coordinates, and denote the projection by Z = \frac{(X_1,\ldots ,X_k)}{||X||} . What is the expected value of the square of the length of Z ? Well, we know that the squared length of Y is 1; this is by design: we normalized it. We thus have:

||Y||^2 = 1 = \sum_{i = 1}^{d}\frac{X_i^2}{||X||^2}.

If this is true for ||Y||^2 , then it is also true for its expectation:

\mathbb{E}||Y||^2 = 1 = \sum_{i = 1}^{d}\mathbb{E}\frac{X_i^2}{||X||^2}.

Notice that the sum is symmetric in the X_i ‘s: in expectation, there is no reason for one X_i to behave differently from the other. Thus each term should contribute equally to the sum, and so for every i ,

\mathbb{E}\frac{X_i^2}{||X||^2} = 1/d.

Since Z is just the first k terms of Y , we get

\mathbb{E}||Z||^2 = \sum_{i=1}^{k}\mathbb{E}\frac{{X_i}^2}{||X||^2} = k/d.

Oops, the expected value is k/d , which is smaller than 1, the original norm of the vector! But not to worry, this can easily be fixed: after we project to a random subspace of dimension k , we just blow up all the points by a factor of \sqrt{d/k} . That way, when we look at the norm squared, the expectation will be d/k times larger, and the expectation will indeed by 1. This multiplication will not affect any of our probabilistic statements, since if ||Z||^2 is concentrated around its mean, so is \frac {d}{k}||Z||^2 .

Now to fulfil our promise and show that the squared length of Z is heavily concentrated around its mean. The calculations will be a bit lengthy, but do not despair (also, if you got up to here and understood everything, you are pretty much good-to-go for most of your Johnson-Lindenstrauss-ing needs).

We’ll show that for any positive real number \beta smaller than 1, the probability for ||Z||^2 to be \beta times smaller than its mean of k/d is exponentially small in the dimension and in \beta ; we’ll then take \beta = 1 - \varepsilon to show that the probability for a small deformation is small. The same calculations can also be repeated for checking when ||Z||^2 is greater than \beta for \beta > 1 , and then taking \beta = 1 + \varepsilon .

A standard method for showing such concentration of measure is using Markov’s inequality, which states that if W is a positive random variable, then for every a > 0 ,

\text{Pr}[W \geq a] \leq \frac{\mathbb{E}W}{a}.

Intuitively, all this inequality says is that if the expectation is small, then it’s impossible to have too large a probability of getting too large a value. Markov’s inequality is innocuous enough, but it becomes powerful when we combine it with exponential moments, that is, when we apply to e^{tW} instead of to W itself. Let’s see how that works.

When written explicitly, ||Z||^2 becomes

||Z||^2 = \frac{X_1^2 + \ldots + X_k^2} {X_1^2 + \ldots + X_d^2}.

So the probability \text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] can be written as

= \text{Pr}[d(X_1^2 + \ldots + X_k^2) \leq k \beta(X_1^2 + \ldots + X_k^2) ]

= \text{Pr}[k \beta(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2) \geq 0]

=  \text{Pr}[e^{t(k \beta (X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))} \geq 1].

Here t was some number greater than 0, and we will optimize over it soon in order to get the best possible bounds. Now we invoke Markov’s inequality with a = 1 , getting

\leq  \mathbb{E}e^{t(k(X_1^2 + \ldots + X_d^2) - d(X_1^2 + \ldots + X_k^2))}.

The various X_i ‘s are independent, so the expectation of the product is the product of expectations:

\mathbb{E}e^{X_1 + X_2} = \mathbb{E}[e^{X_1}e^{X_2}] = \mathbb{E}[e^{X_1}]\mathbb{E}[e^{X_2}],

and after grouping together the X_i in our exponent, our probability is bounded by

=  \mathbb{E}[e^{t k \beta X_1^{2}}]^{d-k} \mathbb{E}[e^{t(k \beta - d)X_1^{2}}]^{k}

=  (1-2tk \beta)^{-(d-k)/2}(1-2k(k\beta-d))^{-k/2}.

This last equality we won’t go into, but it stems from knowledge about the moment generating function of Gaussians; in short, all that needs to be proven is that \mathbb{E}[e^{sX^2}] = 1/\sqrt{1-2s} for s<1/2 .

All that is left is to minimize this last expression. This we also won't do here, because it mainly involves annoying calculus which you can let WolframAlpha do for you. This finally gives:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \beta^{k/2}(1 + \frac{k(1-\beta)}{d-k})^{(d-k)/2}.

The expression on the right hand side is equal to e to the power of its logarithm:

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2} \log \beta + \frac{d-k}{2} \log(1 + \frac{k(1-\beta)}{d-k})).

Finally, since \log(1+x) \leq x for all x \geq 0, we can replace the second \log factor by \frac{k(1-\beta)}{d-k} , giving

\text{Pr}[||Z||^2 \leq \frac{\beta k}{d}] \leq \exp(\frac{k}{2}(1-\beta + \log\beta)).

Now we just need to clean up. Put \beta = 1-\varepsilon , and get

\leq \exp(\frac{k}{2}(1-(1-\varepsilon) + \log(1-\varepsilon))).

By more calculus it can be shown that \log(1-x) \leq -x - x^2/2 for 0\leq x <1 , and at last we have

\leq \exp(\frac{k \varepsilon^2}{4}).

And indeed, if we choose k = 4 \log n / (\varepsilon^2/2 - \varepsilon^3/3) , this exponent will be smaller than 1/n^2 , just as we wanted (actually, we can choose just k = 8 \log n / (\varepsilon^2) , but to deal with the second case of \beta = 1+\varepsilon we will also need the \varepsilon^3/3 term). A similar calculation for \beta = 1+\varepsilon will give another factor of 1/n^2 , and in total the probability of having a large distortion will be just 2/n^2 per pair of points.

Of Mark and of Biology

November 14, 2009

As some of you might know, I wanted to take biology during high school, but instead chose physics, for various reasons. However, this does not mean that my interest in biology lessened, and I still find the topic quite fascinating, especially when related to microbiology or physics. During the weekend, I have written two short articles, presented below. They are somewhat technical, but also interesting, so please do try to read them. The first is on the Isoacceptor and its roles, and the second is on Thermophilus Proteins and how they effect various processes within the cell.

1) Review of the Isoacceptor TRNA
The isoacceptor trnaleu anticodon gag reads the nucleophilic group of how the cognate trnas can interact with proteins such as the ribosome crystals. From this, an antisecretory effectùit reduces the bacteria archaea, and a stable n aminoacylated minihelix – functioning as reflected by the sequential model of base pair of the peptidyl – transfers trobro and sugars in the catalytic mechanism of the geometry of the s subunit. From the next report from these differences, the p sitebound trna is in termination, and s crystal structure of the codon reading and trnas enter the first identified low resolution. In analogy with rf, and how they induce ester bond hydrolysis of numerous bundles of the e in the ribosome, making this reaction site and a post antibiotic drugs and s subunit works. They inferred a survey of inhibition allosteric inhibitors, although comparison of less stringently monitored explaining the s and qvist established with the structures is a site. While leaving humans, ribosomes differ: most frequent occurrence of near cognate asl alone. This may reflect a protein synthesis such as the situation changed it recruits – all other bacterial re growth – this quest against resistance mutants is not shown at all.

2) Thermophilus Proteins and the Cell
The proteins carry out of the regulatory molecules as essential, but there is the cell that can receive one of the aminoglycosides. Subsequently, the s ribosomal proteins i.e. site – while those in figure seemed to their susceptibilities tested, aminoglycosides are shown in the streptomyces along with non cognate asl alone and intramuscularly. Some promising successes in recent experimental studies show their large proteins edit ribosome and might be administered and effective drugs, and it was also induced to fit together ribosomes being either one part in the surface, and collaborators ogle, and site bound in the closed conformation. Increased rate of the united statesthermophilus proteins can be found in the activation entropy. This could be explained as well as a site aminoacyl trna. In summary, the closed conformation is awarded to stabilize the archaeon h marismortui since the minor groove of one amino acid and crescentin filaments are for the high resolution, and may display symmetry model put forth by the microsome fraction. This is termed absolute subtype of gtp hydrolysis in a number of Æoh of s subunit for subsequent structure without disrupting or deletions in the ribosome with s bound, when bpg binds to three dimensional chromosomes.

You may notice that these articles aren’t very easy to read or understand, but you may blame it on their very technical nature. However, the real reason for this, is that they were not written by me at all, but rather by a very simple Markov chain. Markov chains are models, in which going from one state to another depends only on the current state, and not on the development of the entire model. This means that we construct a scientific article, one word at a time, each time based only on what we have so far. The choice of which word to choose next is done randomly; however, we do not just pick a random word from the English dictionary, but rather do so intelligently, based on a probability distribution. Prior to writing the articles, we give the program many other articles to “read” and analyse (in this case: several pages from Wikipedia about various proteins were chosen). During this process, it notes which words tend to come after each other. For example, in academic biology, the phrase “protein synthesis” is much more abundant than the phrase “protein cats”. When writing the article, if our current word is “protein”, and we want to choose the next one, we are much more likely to choose the word to be “synthesis” rather than “cat”, in order to produce a paper that mimics what is written by university professors. In essence, we start with a single word or phrase, and slowly, word by word, build up the article.

The model I have written is very simple, and does not claim to be comprehensive, robust, efficient or complete. It is a “depth 1” Markov model, meaning that the next word is chosen based only on the previous word. A “depth 2” model, on the other hand, might look at the previous two words, and so on. The higher the depth, the more accurate are the sentences that will be produced, but also the more memory and CPU the program will consume. It is not difficult to produce a simple depth 2 model, but a depth 1 was sufficient for me in this case. Also, it does not deal with punctuation or grammar, although one can construct a model that does that. Overall, Markov models are used thoroughly in the industry and the academy, since they can generate quite good results. More complicated models called “Hidden Markov Models” try deal with several variables at once (unlike this one, which only deals with choosing a new word). For example, one can build a program that writes music, with note pitches, harmony, and tempo and rhythm all being variables which must be considered.
All in all, I am quite happy with the results, especially considering the simplicity of the model. In the above articles, I made less than 20 edits each, most of which consisted of adding proper periods and commas. A serious model can do this automatically, and generate results which barely need to be edited in order to look reasonable. Notice that if you compare the articles to real academic research papers, you will probably be as confused reading one as reading the other. So a final word: do not let technical jargon and long complicated sentences blind you to the fact, that some people write useless and utter unreadable crap.

The python script:

class Markov:

    def __init__(self):
        self.wordsDict = {}

    def addFileStatistics(self, filename):
        f = open(filename)
        text =

    def _addNextWordStatistics(self, text):
        noCharactersText = text
        # Removing characters we don't want to deal with
        for character in ["e.g.", ".", ",", "'", "!" ,"?" ,":",
                ,";", "'", '"', "\\", "/", "`", "%", "(", ")",
                "-", "_", "*", "[", "]", "\n", "\r"]:
            noCharactersText = noCharactersText.replace(character," ")
        words = noCharactersText.split()
        words = [word.lower() for word in words]
        for i in xrange( len(words)-1 ):
            word = words[i]
            nextWord = words[i+1]
            if word not in self.wordsDict:
                self.wordsDict[word] = []                


    def generateText(self, startingWord, numOfWords):
        newText = [startingWord]
        lastWord = newText[0]                        

        for i in xrange(numOfWords):
            if lastWord in self.wordsDict:
                concurrents = self.wordsDict[lastWord]
                concurrents = self.wordsDict \

            nextWord = random.choice(concurrents)
            lastWord = nextWord

        return "".join([word + " " for word in newText])