The Genetic Code

Certain mathematical physicists can’t help wondering why the genetic code works exactly the way it does. As you probably know, DNA is a helix bridged by pairs of bases, which come in 4 kinds:

adenine (A)
thymine (T)
cytosine (C)
guanine (G)

Because of how they’re shaped, A can only connect to T, while C can only connect to G.

When DNA is copied to ‘messenger RNA’ as part of the process of making proteins, the T gets copied to uracil, or U. The other three bases stay the same.
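The pairing rule and the T-to-U substitution can be sketched in a few lines of Python. This is only an illustration: the sequence is made up, the function names are hypothetical, and the sketch glosses over the fact that the cell actually builds the RNA against the complementary (template) strand.

```python
# Watson-Crick pairing in DNA: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_strand(strand: str) -> str:
    """Return the partner strand, read in the opposite direction."""
    return "".join(PAIR[base] for base in reversed(strand))

def to_mrna(coding_strand: str) -> str:
    """Rewrite a DNA coding strand in mRNA letters: T becomes U, the rest stay."""
    return coding_strand.replace("T", "U")

dna = "ATGGCCTGA"  # hypothetical example sequence, not from any real gene
print(complementary_strand(dna))  # TCAGGCCAT
print(to_mrna(dna))               # AUGGCCUGA
```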

A protein is made of lots of amino acids. A sequence of three bases forms a ‘codon’, which codes for a single amino acid. Here’s some messenger RNA with the codons indicated:

But here’s where it gets tricky: while there are 4³ = 64 codons, they code for only 20 amino acids. Typically more than one codon codes for the same amino acid. There are two exceptions. One is the amino acid tryptophan, which is encoded only by UGG. The other is methionine, which is encoded only by AUG. AUG is also the ‘start codon’, which tells the cell where the code for a protein starts. So, methionine shows up at the start of every protein (or maybe just most?), at least at first. It’s usually removed later in the protein manufacture process.

There are also three ‘stop codons’, which mark the end of a protein. They have cute names:

• UAG (‘amber’)
• UAA (‘ochre’)
• UGA (‘opal’)
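These counts are easy to confirm by writing out the standard genetic code. The sketch below assumes the usual compact encoding: a 64-letter string of one-letter amino acid abbreviations, with codons ordered UUU, UUC, UUA, UUG, UCU, ... and `*` marking a stop codon. The variable names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon; '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

counts = Counter(CODE.values())
stops = sorted(c for c, aa in CODE.items() if aa == "*")
# Amino acids encoded by exactly one codon.
singles = {aa: c for c, aa in CODE.items() if counts[aa] == 1}

print(len(CODE))        # 64
print(len(counts) - 1)  # 20 amino acids, not counting '*'
print(stops)            # ['UAA', 'UAG', 'UGA']
print(singles)          # {'W': 'UGG', 'M': 'AUG'}
```

Here W is tryptophan and M is methionine.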

But look at the actual pattern of which codons code for which amino acids:

It looks sort of regular… but also sort of irregular! Note how:

• Almost all amino acids either have 4 codons coding for them, or 2.
• If 4 codons code for the same amino acid, it’s because we can change the last base without any effect.
• If 2 codons code for the same amino acid, it’s because we can change the last base from U to C or from A to G without any effect.
• The amino acid tryptophan, with just one codon coding for it, is right next to the 3 stop codons.

And so on…
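The regularities listed above can be checked against the standard codon table. Here is a minimal sketch in Python, assuming the usual compact 64-letter encoding of the standard code (one-letter amino acid abbreviations, `*` for stop); the variable names are just illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in the order UUU, UUC, UUA, UUG, UCU, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many codons does each amino acid get, and how common is each count?
degeneracy = Counter(aa for aa in CODE.values() if aa != "*")
histogram = sorted(Counter(degeneracy.values()).items())
print(histogram)  # [(1, 2), (2, 9), (3, 1), (4, 5), (6, 3)]

# Group codons by their first two bases and look at the third.
prefixes = ["".join(p) for p in product(BASES, repeat=2)]
whole = [p for p in prefixes if len({CODE[p + b] for b in BASES}) == 1]
uc_differ = [p for p in prefixes if CODE[p + "U"] != CODE[p + "C"]]
ag_differ = [p for p in prefixes if CODE[p + "A"] != CODE[p + "G"]]

print(len(whole))  # 8 blocks where the last base does not matter at all
print(uc_differ)   # []  (swapping U and C at the end never matters)
print(ag_differ)   # ['UG', 'AU']  (the tryptophan and methionine blocks)
```

So 14 of the 20 amino acids have exactly 2 or 4 codons, and the only places where swapping A and G in the last position matters are the two blocks containing the single-codon amino acids.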

This is what attracts the mathematical physicists I’m talking about. They’re wondering what the pattern is here! Saying the patterns are coincidental, a “frozen accident of history”, won’t please these people.

Though I certainly don’t vouch for their findings, I sympathize with the impulse to find order amid chaos. Here are some papers I’ve seen:

• José Eduardo M. Hornos, Yvone M. M. Hornos and Michael Forger, Symmetry and symmetry breaking: an algebraic approach to the genetic code, International Journal of Modern Physics B, 13 (1999), 2795-2885.

After a very long review of symmetry in physics, starting with the Big Bang and moving up through the theory of Lie algebras and Cartan’s classification of simple Lie algebras, the authors describe their program:

The first step in the search for symmetries in the genetic code consists in selecting a simple Lie algebra and an irreducible representation of this Lie algebra on a vector space of dimension 64: such a representation will in the following be referred to as a codon representation.

There turn out to be 11 choices. Then they look at Lie subalgebras of these Lie algebras that have codon representations, and try to organize the codons for the same amino acid into irreducible representations of these subalgebras. This follows the ‘symmetry breaking’ strategy that particle physicists use to organize particles into families (but with less justification, it seems to me). They show:

There is no symmetry breaking pattern through chains of subalgebras capable of reproducing exactly the degeneracies of the genetic code.

This is not the end of the paper, however!

Here’s another paper, which seems to focus on how the genetic code might be robust against small errors:

• Miguel A. Jimenez-Montano, Carlos R. de la Mora-Basanez, and Thorsten Poeschel, The hypercube structure of the genetic code explains conservative and non-conservative amino acid substitutions in vivo and in vitro.

And here’s another:

• S. Petoukhov, The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts.

But these three papers seem rather ‘Platonic’ in inspiration: they don’t read like biology papers. What papers on the genetic code do biologists like best? I know there’s a lot of research on the origin of this code.

Maybe some of these would be interesting. I haven’t read any of them! But they seem a bit more mainstream than the ones I just listed:

• T. A. Ronneberg, L. F. Landweber, S. J. Freeland, Testing a biosynthetic theory of the genetic code: fact or artifact?, Proc. Natl. Acad. Sci. U.S.A. 97 (2000), 13690–13695.

It has long been conjectured that the canonical genetic code evolved from a simpler primordial form that encoded fewer amino acids (e.g. Crick 1968). The most influential form of this idea, “code coevolution” (Wong 1975) proposes that the genetic code coevolved with the invention of biosynthetic pathways for new amino acids. It further proposes that a comparison of modern codon assignments with the conserved metabolic pathways of amino acid biosynthesis can inform us about this history of code expansion. Here we re-examine the biochemical basis of this theory to test the validity of its statistical support. We show that the theory’s definition of “precursor-product” amino acid pairs is unjustified biochemically because it requires the energetically unfavorable reversal of steps in extant metabolic pathways to achieve desired relationships. In addition, the theory neglects important biochemical constraints when calculating the probability that chance could assign precursor-product amino acids to contiguous codons. A conservative correction for these errors reveals a surprisingly high 23% probability that apparent patterns within the code are caused purely by chance. Finally, even this figure rests on post hoc assumptions about primordial codon assignments, without which the probability rises to 62% that chance alone could explain the precursor-product pairings found within the code. Thus we conclude that coevolution theory cannot adequately explain the structure of the genetic code.

• Pavel V. Baranov, Maxime Venin and Gregory Provan, Codon size reduction as the origin of the triplet genetic code, PLoS ONE 4 (2009), e5708.

The genetic code appears to be optimized in its robustness to missense errors and frameshift errors. In addition, the genetic code is near-optimal in terms of its ability to carry information in addition to the sequences of encoded proteins. As evolution has no foresight, optimality of the modern genetic code suggests that it evolved from less optimal code variants. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination that can encode the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet, recent experimental evidence on quadruplet decoding, as well as the discovery of organisms with ambiguous and dual decoding, suggest that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. To explore this possibility we designed a mathematical model of the evolution of primitive digital coding systems which can decode nucleotide sequences into protein sequences. These coding systems can evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point-mutations. The replication rates of such coding systems depend on the accuracy of the generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of the complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions.

What’s the “recent experimental evidence on quadruplet decoding”, and what organisms have “ambiguous” or “dual” decoding?

• Tsvi Tlusty, A model for the emergence of the genetic code as a transition in a noisy information channel, J. Theor. Biol. 249 (2007), 331–342.

The genetic code maps the sixty-four nucleotide triplets (codons) to twenty amino-acids. Some argue that the specific form of the code with its twenty amino-acids might be a ‘frozen accident’ because of the overwhelming effects of any further change. Others see it as a consequence of primordial biochemical pathways and their evolution. Here we examine a scenario in which evolution drives the emergence of a genetic code by selecting for an amino-acid map that minimizes the impact of errors. We treat the stochastic mapping of codons to amino-acids as a noisy information channel with a natural fitness measure. Organisms compete by the fitness of their codes and, as a result, a genetic code emerges at a supercritical transition in the noisy channel, when the mapping of codons to amino-acids becomes nonrandom. At the phase transition, a small expansion is valid and the emergent code is governed by smooth modes of the Laplacian of errors. These modes are in turn governed by the topology of the error-graph, in which codons are connected if they are likely to be confused. This topology sets an upper bound – which is related to the classical map-coloring problem – on the number of possible amino-acids. The suggested scenario is generic and may describe a mechanism for the formation of other error-prone biological codes, such as the recognition of DNA sites by proteins in the transcription regulatory network.

• Tsvi Tlusty, A colorful origin for the genetic code: Information theory, statistical mechanics and the emergence of molecular codes, Phys. Life. Rev. 7 (2010), 362–376.

• S. J. Freeland, T. Wu and N. Keulmann, The case for an error minimizing standard genetic code, Orig. Life Evol. Biosph. 33 (2003), 457–477.

• G. Sella and D. Ardell, The coevolution of genes and genetic codes: Crick’s frozen accident revisited, J. Mol. Evol. 63 (2006), 297–313.

27 Responses to The Genetic Code

  1. DavidTweed says:

    I remember seeing an empirical computer simulation which seemed to be similar to the Tsvi Tlusty paper: given the structures have to be “close enough to form bonds”, particularly during copying, in 3-D within the “jostling” cellular environment, the coding for codons is most “robust” in that:

    1. using a coding with shorter, more single-variants decreases DNA molecule length so fewer “steps” need to be decoded for a given cell’s usage, but then a minor error is more likely to stop/give erroneous transcription.

    2. using a coding with more codons coded by multiple longer sequences it’s less likely to get instantaneous errors, but the DNA molecule length increases.

    So the argument was that the actual code sits at the point which overall maximises the number of fully correct transcriptions in the presence of both effects.

    (IIRC there were some arguable assumptions in the simulation.)

  2. Rod Carvalho says:

    The link to the first paper (Hornos et al.) is broken.

  3. John Baez says:

    If we look at this chart:

    we can see it as broken into 16 ‘blocks’ of 4 codons. Except for three blocks that include start or stop codons, all these blocks code for either 1 or 2 amino acids.

    Puzzle. Is there a pattern to these 1’s and 2’s?

    • streamfortyseven says:

      Don’t know if this helps, but put the column for the second base “C” first in order, then the column for the base “U” second in order, then the column for the base “G” third in order, and the column for the base “A” last in order, so the ordering for the second base columns is “C”, “U”, “G”, “A”. Even so, if there’s a pattern I don’t see it. Maybe if someone were able to calculate electrostatic potential surfaces for the bases, and the resultant codons, patterns might be visible… same thing for the amino acids, see if there’s symmetry in there someplace.

    • John Baez says:

      Well, let me start by writing out the 1’s and 2’s that come from the chart above. The three quirky blocks that include start or stop codons will be written as X:

      2 1 X X
      1 1 2 1
      X 1 2 2
      1 1 2 1

      Nobody should take this too seriously, but I can’t help wanting to make guesses for these 3 blocks, like this:

      2 1 2 2
      1 1 2 1
      1 1 2 2
      1 1 2 1

      I find this tantalizingly close to having a pattern. If I could reinterpret the block including the start codon as a 2, I’d be much happier:

      2 1 2 2
      1 1 2 1
      2 1 2 2
      1 1 2 1

      Each row is obtained from the previous one by reflecting it and switching 1’s and 2’s.

      But that doesn’t seem fair. If we count start and stop codons along with amino acids, we get

      2 1 2 3
      1 1 2 1
      2 1 2 2
      1 1 2 1
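For anyone who wants to play with these grids, they can be generated from the standard codon table rather than read off the chart by eye. The following is a rough Python sketch: rows are the first base and columns the second base, both in the order U, C, A, G, which is an assumption about how the chart is laid out. The helper names `block` and `reflect_and_swap` are made up for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon; '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def block(first, second):
    """Set of meanings ('*' = stop) among the 4 codons sharing two leading bases."""
    return {CODE[first + second + third] for third in BASES}

# Count stop as one extra meaning, and count M (methionine/start) as usual.
counts = [[len(block(a, b)) for b in BASES] for a in BASES]
# Write X for blocks containing a stop codon or the start codon AUG (M).
marked = [["X" if block(a, b) & {"*", "M"} else str(len(block(a, b)))
           for b in BASES] for a in BASES]

for row in marked:
    print(" ".join(row))
print(counts)  # [[2, 1, 2, 3], [1, 1, 2, 1], [2, 1, 2, 2], [1, 1, 2, 1]]

def reflect_and_swap(row):
    """Reverse a row and exchange 1's with 2's."""
    return [3 - n for n in reversed(row)]

# The guessed grid, with the start-codon block reinterpreted as a 2.
guess = [[2, 1, 2, 2], [1, 1, 2, 1], [2, 1, 2, 2], [1, 1, 2, 1]]
pattern_holds = all(reflect_and_swap(guess[i]) == guess[i + 1] for i in range(3))
print(pattern_holds)  # True
```

The first loop prints the grid with X's, and `counts` reproduces the last grid, where start and stop codons are counted along with amino acids.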

  4. David Corfield says:

    Is there a reason why exchanging U and C as third base never makes a difference, and only rarely in the case of A and G? Yet there’s no sign of a greater affinity of U with C than with G in other places.

  5. Blake Stacey says:

    OK, I can’t remember where I might’ve read that bit about a “primordial genetic code” with shorter codons (it’s not in Stryer’s Biochemistry, which I just pulled off the shelf and checked), but I have found discussions of related ideas.

  6. Aaron Golas says:

    Speaking to the ambiguity of the third base in the codon, there’s also the wobble base pair hypothesis (Wikipedia summary, with a couple links to papers in the reference section). The idea is that, when tRNA binds to mRNA, the first two bases of the codon are more strongly selective about their binding than the third base, for which (due to the structure of the tRNA) different pairings might have similar thermodynamic stability. For example, a 5′-GAA-3′ tRNA anticodon would ideally (following traditional Watson-Crick base pairing) bind a 5′-UUC-3′ codon, but due to wobble at the third base it could also bind to a 5′-UUU-3′ codon with about equal stability. The U-C and A-G groupings appear to reflect more common wobble substitutions. (Note that UUC and UUU both code for Phe!)

    Remember, also, that the primary nucleotide bases can be modified, which affects base pairing. For example, inosine (I, modified from A) can pair about equally well with A, C, and U. Ah, chemistry. :-P

    Thus, we have two different redundancies at the third base of the codon: protection from point mutation, and protection from alternate tRNA binding during translation.
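To make the wobble idea concrete, here is a small Python sketch. It assumes a simplified table of Crick's wobble rules for the first (5′) base of the anticodon, including inosine, and ignores every other modified base; the function name `codons_read_by` is hypothetical.

```python
# Codon third bases that can pair with a given anticodon first (5') base:
# Watson-Crick partners plus Crick's wobble partners. 'I' is inosine.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}
# Strict Watson-Crick pairing in RNA, used for the other two positions.
WATSON_CRICK = {"A": "U", "U": "A", "C": "G", "G": "C"}

def codons_read_by(anticodon: str) -> list[str]:
    """Codons (5'->3') that a 5'->3' anticodon can bind under these rules."""
    first, second, third = anticodon
    # The strands pair antiparallel, so anticodon position 3 meets codon
    # position 1, and anticodon position 1 meets the wobbly codon position 3.
    stem = WATSON_CRICK[third] + WATSON_CRICK[second]
    return [stem + base for base in WOBBLE[first]]

print(codons_read_by("GAA"))  # ['UUC', 'UUU']
```

Under these simplified rules the GAA anticodon reads both UUC and UUU, the two phenylalanine codons, matching the example above.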

  7. John Furey says:

    There is no mathematical reason there are only 20 (or so!) standard amino acids, since there is no physico-chemical reason either. Is that some reality-based reasoning theorem?

    There are in fact nonstandard amino acids in nature, and there are more than twice as many artificial ones that have been made via transcription:

    • Wang Q, Parrish AR, Wang L (March 2009). “Expanding the genetic code for biological studies”. Chem. Biol. 16(3), 323–336.

    That’s just using standard cellular techniques with ribosomes etc., not even counting other unnatural isomers, D amino acids, etc. Many more can be produced using other techniques e.g. solid state synthesis.

  8. Graham says:

    JB: “So, methionine shows up at the start of every protein (or maybe just most?)”

    Bacteria use a modified form of methionine, but otherwise this is universal. (Molecular Biology of the Cell, Fifth Edition, Alberts et al, p380).

    JB: What’s the “recent experimental evidence on quadruplet decoding”

    I think that is a reference to a technique in synthetic biology where they’ve managed to make artificial ribosomes and tRNAs which use a quadruplet code. I think this stuff is in its very very early stages.

    JB: and what organisms have “ambiguous” or “dual” decoding?

    I guess this is referring to frame shifts. (There are also cases where both strands of DNA are protein coding, in an overlapping but not exactly overlapping way.)

    • John Baez says:

      Thanks for all the answers, Graham!

      I think frame shifts are really cool. For those not in the know, this is the idea: when reading DNA to create proteins, the cell reads 3 bases at a time and turns each 3-letter ‘word’ into an amino acid, as sketched in this blog entry. So, if a mistake happens and the cell starts reading 1 or 2 bases down from where it should, a completely different protein is made. For example,


      may become:


      And this can cause nasty diseases like Tay-Sachs.

      But life, being the wonderfully flexible thing it is, has also figured out ways to exploit this possibility to its advantage! The same stretch of DNA can be deliberately read in 3 different ways to create different proteins!

      For example, the hepatitis C virus does this. A relevant buzzword is translational frameshift.
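The effect of shifting the reading frame is easy to see in a toy example. This Python sketch reads one made-up stretch of mRNA in all three frames, using the standard codon table in its usual compact 64-letter encoding; the sequence and the function names are invented for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon; '*' marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(mrna: str, frame: int = 0) -> list[str]:
    """Split mRNA into complete 3-letter words, starting `frame` bases in."""
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]

def translate(mrna: str, frame: int = 0) -> str:
    """One-letter amino acid string for the given reading frame."""
    return "".join(CODE[c] for c in codons(mrna, frame))

mrna = "AUGCAUGGCUAA"  # hypothetical sequence, not from a real gene
for frame in range(3):
    print(frame, codons(mrna, frame), translate(mrna, frame))
# 0 ['AUG', 'CAU', 'GGC', 'UAA'] MHG*
# 1 ['UGC', 'AUG', 'GCU'] CMA
# 2 ['GCA', 'UGG', 'CUA'] AWL
```

The same twelve letters give three unrelated amino acid sequences, which is why an accidental shift is so damaging and a deliberate one so economical.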

  9. I am learning a lot from doing these linkfests. You’ll notice that starting with this one, I’m getting a little more organized [….]

  10. Since we were talking about the genetic code here recently, I’ll wrap up by mentioning one thing learned about this from Susan’s talk […]

  11. DavidTweed says:

    Sort of related to this: this news report says that other amino acids (outside the 20) can be added to proteins changing some of their biochemical properties. (This particular example appears to be “weighing down” a molecule so that it gets processed by the kidneys slower, enabling more of the biologically active part to be processed.) So this suggests that it’s not the case that simply “there are 20 worthwhile amino acids, so evolution figured out how to code for that 20” but that it’s as much a decision to have 20 synthesisable amino acids as the decision how to represent them.

  12. Tom Kriske says:

    Using just the four standard nucleotides and the mechanics of Young Tableaux, one can generate the 20 amino acids. Is this widely appreciated?

      • Tom Kriske says:

        Begin with just the symmetric numbering of 3 boxes over 4 elements, i.e. (111), (112), (113), (114), (122), (123), (124), (133), (134), (144), (222), (223), (224), (233), (234), (244), (333), (334), (344), (444). Notice there are 20. Now consider the 24 (4!) combinations which permute the number assignments and allow A, G, C, T to take any number from one to four. Therein one finds a unique linear combination of 4 number assignments which renders at least one occurrence of each of the 20 AAs in our genetic code. Plus the stops.
        Kind of cool to play around with! There’s a number of symmetries that show up between A and T and C and G. Also, it’s interesting to note that although the masses of the nucleotides are all quite different, the mass of A+T = C+G to within 1%.
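The count of 20 is just the number of ways to choose 3 items from 4 with repetition allowed and order ignored. A quick Python check, covering only this counting step and not the assignment of amino acids, might look like this:

```python
from itertools import combinations_with_replacement
from math import comb

# Weakly increasing fillings of 3 boxes with the numbers 1..4.
numberings = ["".join(map(str, t))
              for t in combinations_with_replacement(range(1, 5), 3)]

print(numberings[:5])       # ['111', '112', '113', '114', '122']
print(len(numberings))      # 20
print(comb(4 + 3 - 1, 3))   # 20, the 'multiset' binomial coefficient
```

This is the same 20 that appears as the dimension of the symmetric cube of a 4-dimensional vector space.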

        • Tom Kriske says:

          And as an addendum to the above, it’s worth pointing out that within this tableau formalism one can take the purely antisymmetric vertical boxes to be the double stranded couples A/T & C/G; and moreover the 3 box mixed states, symmetrized along a strand and antisymmetrized across a strand, form what I refer to as a binary codon core, with some interesting relations among themselves as well, and give some insight into what a binary genetic code might have looked like back in the day. Also, using hook length, one can determine the dimension of these various representations, and through a product rule generate higher dimensional reps – potentially generating entire proteins. I know, some hand waving there! But interesting.

  13. Tom Kriske says:

    And as an aside, knowing your fondness for the number 24, consider the equation:
    Summation on i from 0 to 4 of {(2^i mod 3) times the binomial coefficient (4, i)} = 24.

    The terms are 1 + 8 + 6 + 8 + 1. Break the 8’s in half to get two 4’s each and you get the 7 conjugacy classes of the binary tetrahedral group, of order 24; the mod3 coming from the fact that the binary tetrahedral group is isomorphic to SL(2,3).

    The joy and mystery of mathematics!
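The arithmetic in this sum can be checked in a couple of lines of Python. The sketch only confirms the numbers; it says nothing about why they should match the class sizes of the binary tetrahedral group, which are 1, 1, 4, 4, 4, 4 and 6.

```python
from math import comb

# Terms (2^i mod 3) * C(4, i) for i = 0, ..., 4.
terms = [(2**i % 3) * comb(4, i) for i in range(5)]
print(terms)       # [1, 8, 6, 8, 1]
print(sum(terms))  # 24

# Break each 8 into two 4's, leaving the other terms alone.
pieces = []
for t in terms:
    pieces += [4, 4] if t == 8 else [t]
print(sorted(pieces))  # [1, 1, 4, 4, 4, 4, 6]
```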
