A short note about amino acids and Hamming distance

It’s always interesting to see how work in one area of mathematics seems to help or imply results in a totally different field. Not long ago, I read a bit on the problem of random packing – if you park cars at random on a street, how many can you fit before you run out of space? While working on this problem and its variants, the mathematicians Itoh and Solomon discovered a seemingly surprising connection to… molecular biology.


DNA and RNA are composed of 4 types of nucleotides, denoted C, G, A, T (in DNA) or C, G, A, U (in RNA). When the cell creates proteins, it reads these in triplets called codons; each codon codes either for a start/stop instruction or for a single amino acid.
Since there are 4 types of nucleotides, there are 4*4*4 = 64 different combinations; potentially, each one could code for a different amino acid. However, proteins are built from only 20 amino acids. The “start” instruction also encodes an amino acid, so together with the “stop” instruction, the 64 nucleotide combinations describe only 21 distinct functions. There is therefore some redundancy, and several combinations describe the same acid or command. The following table summarizes which acid is encoded by which triplet:


There are several interesting phenomena in this table. I am no professional in the field of biochemistry, but I will give my humble thoughts on why they occur.

First, note that the amino acids tend to be encoded in clusters. For example, Proline is encoded by CCU, CCC, CCA, or CCG; in other words, any codon that begins with CC. Two questions may arise: 1) Why are the triplets so similar to each other – why aren’t they scattered throughout the entire table? 2) Why is there variation only in the last nucleotide? Why don’t we see acids being encoded by “any codon that ends with CC”?
The answer to the first is probably rooted in resistance to mutations. Suppose that, due to a random mutation, one of the letters of a codon changes. It could be quite harmful if the encoded amino acid also changed, since that can alter the protein’s function. If the mutation is random – i.e. every position has an equal chance of changing into every other letter – then an amino acid encoded by “everything that starts with CC” is resistant to 33% of possible point mutations, since any mutation in the last letter leaves the encoding unchanged. However, if the same acid had codons that all differ from one another by two or three nucleotides, chances are that any single mutation would change it.
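The 33% figure is easy to verify by brute force. Here is a minimal Python sketch (the codon set is the Proline example discussed below; the function name is mine): of the nine possible single-letter mutations of CCU, exactly the three that hit the third position are silent.

```python
# Proline is encoded by any codon beginning with CC.
PROLINE_CODONS = {"CCU", "CCC", "CCA", "CCG"}
NUCLEOTIDES = "CGAU"

def point_mutations(codon):
    """Yield every codon reachable by changing exactly one letter."""
    for i, original in enumerate(codon):
        for n in NUCLEOTIDES:
            if n != original:
                yield codon[:i] + n + codon[i + 1:]

mutations = list(point_mutations("CCU"))
silent = [m for m in mutations if m in PROLINE_CODONS]
print(f"{len(silent)} of {len(mutations)} point mutations are silent")  # 3 of 9
```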
This can be stated more formally in terms of the Hamming distance, which, given two strings of equal length, counts the number of positions in which they differ. For example, the distance between CCA and CCU is 1, while the distance between CCA and GAU is 3. Codons that are further away from each other have a smaller chance of turning into one another by mutation, and scattering an acid’s encoding across the whole table would mean a large Hamming distance from one of its codons to another.
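In code, the Hamming distance is a one-liner. A minimal Python sketch (the function name is mine), reproducing the two examples above:

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming_distance("CCA", "CCU"))  # → 1
print(hamming_distance("CCA", "GAU"))  # → 3
```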
As for the second question – why the variation appears only in the last nucleotide – I do not know, and can only speculate. One idea is that in the distant past, organisms were simpler and used fewer building blocks: codons were of length two rather than three, and in the transition from two to three, a nucleotide was added at the end. This idea has many “maybes” in it. Another idea, suggested by my friend Eyal, is that the ribosome, which is responsible for assembling proteins, has lower accuracy after already processing two nucleotides, and hence we want redundancy in the third.

A second question regarding the distribution of encodings in the table above is: why are there 21 different functions? Why not more, and why not fewer? It is here that a mathematical model, found by Itoh and Solomon, may provide some insight.
A codon may be written as a sequence of three letters from the four-letter alphabet CGAU. We can also write it in binary as a string of length six: C = 00, G = 01, A = 10, U = 11. Consider all 64 binary strings, which correspond to all possible codons. Choose one at random and write it down. Now keep choosing strings at random from the ones you haven’t chosen yet, but write each one down only if its Hamming distance from every string already written down is at least 2. At the end, all the strings you have written down differ from each other in more than 1 bit. Sometimes the process ends with only 14 strings written down, and other times with a full 32 (for example, the 32 strings with an even number of 1s, any two of which differ in at least 2 bits).
The natural question to ask is: if you repeat this process, what is the average number of strings that you eventually write down? I did not solve this analytically, but I did write a Lisp program to calculate it. The average over 10000 runs turns out to be about 20.1. This is strikingly close to the 21 different functions encoded by codons…
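The experiment is easy to replicate. Here is a minimal Python sketch of the same process (the original was in Lisp; function names are mine). Drawing the remaining strings uniformly at random until none are left is equivalent to shuffling all 64 strings once and scanning them greedily:

```python
import random

def hamming(a, b):
    """Hamming distance between two integers viewed as 6-bit strings."""
    return bin(a ^ b).count("1")

def greedy_code(n_bits=6, min_dist=2):
    """Scan all 2**n_bits strings in random order, keeping each one
    whose distance from everything kept so far is at least min_dist."""
    strings = list(range(2 ** n_bits))
    random.shuffle(strings)
    kept = []
    for s in strings:
        if all(hamming(s, c) >= min_dist for c in kept):
            kept.append(s)
    return kept

trials = 10000
average = sum(len(greedy_code()) for _ in range(trials)) / trials
print(round(average, 1))  # typically close to 20
```

Note that the maximum of 32 is realized by, e.g., the even-parity strings, since any two of them differ in an even (hence ≥ 2) number of bits.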
Of course, this may be entirely coincidental. However the modern genetic apparatus developed, it was certainly more complex than picking out random strings and checking whether they are close to each other. Further, a Hamming distance of 2 may still describe codons that differ by only one nucleotide, since each letter takes two bits. However, one cannot help but think that there is indeed a connection, for as we established above, resistance to mutations comes from a large Hamming distance between the codons of different amino acids. In any case, this is just a quotation of a result, not a proposed model.
Still, it is interesting to see that our own genetics are basically governed by the same considerations and principles we use when designing an error-resistant communication system and encoding. The fact that basic research on the random parking of cars (eventually, in more than one dimension) led to insights about information is also a comforting reminder that you never know where you’ll go when tackling a seemingly innocent problem.

3 thoughts on “A short note about amino acids and Hamming distance”

  1. Pretty much on the nose – a degenerate genetic code is more robust to mutations (although I’m not sure if there are any experimental tests of this – they’d be pretty fiddly) and there’s also a chemical reason: the third position in the tRNA molecules’ anti-codon can “wobble” a bit, allowing for non-standard base-pair bindings. That means some of the tRNA adapters (there’s like 40 of them in all) for a given amino acid actually bind by default (and as far as I know with similar affinity) to more than one codon in the mRNA.
    Anyway, your point about 21 being a very likely number for a random choice process is interesting, especially if it could be modelled onto an evolutionary process

  2. I just read an abstract from the current issue of Science that claims there are differences in expression of “identical” triplets. It might interest you.

  3. Very nice observation. I also wonder if the triplet codons are built in such a way as to help solve the problem of a missing/extra letter in the sequence that codes for a protein. It can create gibberish, or the mechanism may even guess the extra or missing letter and correct for it.
