Certain mathematical physicists can’t help wondering why the genetic code works exactly the way it does. As you probably know, DNA is a helix bridged by pairs of bases, which come in 4 kinds:
Because of how they’re shaped, A can only connect to T:
while C can only connect to G:
A protein is made of lots of amino acids. A sequence of three base pairs forms a ‘codon’, which codes for a single amino acid. Here’s some messenger RNA with the codons indicated:
But here’s where it gets tricky: while there are 4³ = 64 codons, they code for only 20 amino acids. Typically more than one codon codes for the same amino acid. There are two exceptions. One is the amino acid tryptophan, which is encoded only by UGG. The other is methionine, which is encoded only by AUG. AUG is also the ‘start codon’, which tells the cell where the code for a protein starts. So, methionine shows up at the start of every protein (or at least most), at least at first. It’s usually removed later in the protein manufacturing process.
There are also three ‘stop codons’, which mark the end of a protein. They have cute names:
• UAG (‘amber’)
• UAA (‘ochre’)
• UGA (‘opal’)
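The arithmetic here is easy to check mechanically: with 4 bases and 3 positions per codon, there are 4³ = 64 possible codons, and the start and stop codons are among them. A quick Python sketch:

```python
# Sanity check: 4 bases, 3 positions, so 4^3 = 64 codons in all.
from itertools import product

BASES = "UCAG"
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))   # 64

START = "AUG"
STOPS = {"UAG": "amber", "UAA": "ochre", "UGA": "opal"}
assert START in codons and all(s in codons for s in STOPS)
```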
But look at the actual pattern of which codons code for which amino acids:
It looks sort of regular… but also sort of irregular! Note how:
• Almost all amino acids have either 4 or 2 codons coding for them.
• When 4 codons code for the same amino acid, it’s because we can change the last base without any effect.
• When 2 codons code for the same amino acid, it’s because we can change the last base from U to C, or from A to G, without any effect.
• The amino acid tryptophan, with just one codon coding for it, sits right next to the 3 stop codons.
And so on…
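These regularities can be checked mechanically. Here’s a little Python sketch using the standard codon table, encoded as the familiar 64-letter string (codons ordered by first, second, then third base, each running over U, C, A, G; ‘*’ marks a stop codon):

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code as one letter per amino acid, '*' for stop,
# codons ordered UUU, UUC, UUA, UUG, UCU, ... (first base slowest).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
table = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}

# How many codons code for each amino acid?
degeneracy = Counter(table.values())

# Changing the last base from U to C never changes the amino acid:
prefixes = ["".join(c) for c in product(BASES, repeat=2)]
assert all(table[p + "U"] == table[p + "C"] for p in prefixes)

# Changing A to G in the last position fails in exactly two places:
# AUA/AUG (isoleucine vs methionine) and UGA/UGG (stop vs tryptophan).
exceptions = [p for p in prefixes if table[p + "A"] != table[p + "G"]]
print(sorted(degeneracy.items()))
print(exceptions)   # ['UG', 'AU']
```

The degeneracy count also shows why ‘almost all’: leucine, serine and arginine each get 6 codons, which appear in the table as a quartet plus a doublet, and isoleucine gets 3.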
This is what attracts the mathematical physicists I’m talking about. They’re wondering: what is the pattern here? Saying the patterns are coincidental, a “frozen accident of history”, won’t satisfy these people.
Though I certainly don’t vouch for their findings, I sympathize with the impulse to find order amid chaos. Here are some papers I’ve seen:
• José Eduardo M. Hornos, Yvone M. M. Hornos and Michael Forger, Symmetry and symmetry breaking: an algebraic approach to the genetic code, International Journal of Modern Physics B 13 (1999), 2795–2885.
After a very long review of symmetry in physics, starting with the Big Bang and moving up through the theory of Lie algebras and Cartan’s classification of simple Lie algebras, the authors describe their program:
The first step in the search for symmetries in the genetic code consists in selecting a simple Lie algebra and an irreducible representation of this Lie algebra on a vector space of dimension 64: such a representation will in the following be referred to as a codon representation.
There turn out to be 11 choices. Then they look at Lie subalgebras of these Lie algebras that have codon representations, and try to organize the codons for the same amino acid into irreducible representations of these subalgebras. This follows the ‘symmetry breaking’ strategy that particle physicists use to organize particles into families (but with less justification, it seems to me). They show:
There is no symmetry breaking pattern through chains of subalgebras capable of reproducing exactly the degeneracies of the genetic code.
This is not the end of the paper, however!
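Concretely, a successful chain of subalgebras would have to break the 64-dimensional codon representation into multiplets matching the degeneracies one reads off the standard table (papers differ on how to treat the stop codons):

\[
64 \;=\; \underbrace{3 \times 6}_{\text{Leu, Ser, Arg}} \;+\; \underbrace{5 \times 4}_{\text{Val, Pro, Thr, Ala, Gly}} \;+\; \underbrace{1 \times 3}_{\text{Ile}} \;+\; \underbrace{9 \times 2}_{\text{doublets}} \;+\; \underbrace{2 \times 1}_{\text{Met, Trp}} \;+\; \underbrace{3}_{\text{stop}}.
\]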
Here’s another paper, which seems to focus on how the genetic code might be robust against small errors:
• Miguel A. Jimenez-Montano, Carlos R. de la Mora-Basanez, and Thorsten Poeschel, The hypercube structure of the genetic code explains conservative and non-conservative amino acid substitutions in vivo and in vitro.
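To make the ‘hypercube’ in that title concrete, here’s a sketch of the general idea, using a 2-bit encoding of the bases that I’ve chosen for illustration; the paper’s actual assignment may differ. Each codon becomes a vertex of the 6-dimensional hypercube, and a single-base substitution moves you along one or two edges:

```python
from itertools import product

# Hypothetical 2-bit encoding of the bases, chosen so that the
# 'transition' mutations U<->C and A<->G flip just one bit.
BITS = {"U": (0, 0), "C": (0, 1), "A": (1, 0), "G": (1, 1)}

def vertex(codon):
    """Map a codon to a vertex of the 6-cube."""
    return tuple(bit for base in codon for bit in BITS[base])

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

codons = ["".join(c) for c in product("UCAG", repeat=3)]
vertices = {c: vertex(c) for c in codons}
assert len(set(vertices.values())) == 64   # a bijection onto the 6-cube

# A transition in the third base is one step on the cube...
print(hamming(vertices["AUG"], vertices["AUA"]))   # 1
# ...while changing all three bases 'diagonally' is the worst case, six.
print(hamming(vertices["UUU"], vertices["GGG"]))   # 6
```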
And here’s another:
But these three papers seem rather ‘Platonic’ in inspiration: they don’t read like biology papers. What papers on the genetic code do biologists like best? I know there’s a lot of research on the origin of this code.
Maybe some of these would be interesting. I haven’t read any of them! But they seem a bit more mainstream than the ones I just listed:
• T. A. Ronneberg, L. F. Landweber and S. J. Freeland, Testing a biosynthetic theory of the genetic code: fact or artifact?, Proc. Natl. Acad. Sci. U.S.A. 97 (2000), 13690–13695.
It has long been conjectured that the canonical genetic code evolved from a simpler primordial form that encoded fewer amino acids (e.g. Crick 1968). The most influential form of this idea, “code coevolution” (Wong 1975) proposes that the genetic code coevolved with the invention of biosynthetic pathways for new amino acids. It further proposes that a comparison of modern codon assignments with the conserved metabolic pathways of amino acid biosynthesis can inform us about this history of code expansion. Here we re-examine the biochemical basis of this theory to test the validity of its statistical support. We show that the theory’s definition of “precursor-product” amino acid pairs is unjustified biochemically because it requires the energetically unfavorable reversal of steps in extant metabolic pathways to achieve desired relationships. In addition, the theory neglects important biochemical constraints when calculating the probability that chance could assign precursor-product amino acids to contiguous codons. A conservative correction for these errors reveals a surprisingly high 23% probability that apparent patterns within the code are caused purely by chance. Finally, even this figure rests on post hoc assumptions about primordial codon assignments, without which the probability rises to 62% that chance alone could explain the precursor-product pairings found within the code. Thus we conclude that coevolution theory cannot adequately explain the structure of the genetic code.
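The statistical question here, namely how often chance alone would put ‘precursor-product’ amino acids on neighboring codons, has the shape of a simple permutation test. Below is my own sketch of such a test; the precursor-product pairs are made up purely for illustration, and this is certainly not the null model or the pair list the authors actually use:

```python
import random
from itertools import product

BASES = "UCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = ["".join(c) for c in product(BASES, repeat=3)]
table = dict(zip(codons, AA))

def contiguous(c1, c2):
    """Call two codons contiguous if they differ in a single base."""
    return sum(a != b for a, b in zip(c1, c2)) == 1

def score(assignment, pairs):
    """Count pairs with at least one contiguous codon assignment."""
    total = 0
    for x, y in pairs:
        xs = [c for c in codons if assignment[c] == x]
        ys = [c for c in codons if assignment[c] == y]
        total += any(contiguous(a, b) for a in xs for b in ys)
    return total

PAIRS = [("S", "C"), ("D", "E")]   # hypothetical pairs, for illustration only

observed = score(table, PAIRS)
random.seed(0)
# Null model: reshuffle which codon gets which letter, preserving degeneracies.
null = [score(dict(zip(codons, random.sample(AA, len(AA)))), PAIRS)
        for _ in range(1000)]
p = sum(s >= observed for s in null) / len(null)
print(observed, p)
```

The paper’s point is that the answer depends heavily on how ‘precursor-product’ and ‘contiguous’ are defined, which is exactly what this kind of test makes explicit.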
• Pavel V. Baranov, Maxime Venin and Gregory Provan, Codon size reduction as the origin of the triplet genetic code, PLoS ONE 4 (2009), e5708.
The genetic code appears to be optimized in its robustness to missense errors and frameshift errors. In addition, the genetic code is near-optimal in terms of its ability to carry information in addition to the sequences of encoded proteins. As evolution has no foresight, optimality of the modern genetic code suggests that it evolved from less optimal code variants. The length of codons in the genetic code is also optimal, as three is the minimal nucleotide combination that can encode the twenty standard amino acids. The apparent impossibility of transitions between codon sizes in a discontinuous manner during evolution has resulted in an unbending view that the genetic code was always triplet. Yet, recent experimental evidence on quadruplet decoding, as well as the discovery of organisms with ambiguous and dual decoding, suggest that the possibility of the evolution of triplet decoding from living systems with non-triplet decoding merits reconsideration and further exploration. To explore this possibility we designed a mathematical model of the evolution of primitive digital coding systems which can decode nucleotide sequences into protein sequences. These coding systems can evolve their nucleotide sequences via genetic events of Darwinian evolution, such as point-mutations. The replication rates of such coding systems depend on the accuracy of the generated protein sequences. Computer simulations based on our model show that decoding systems with codons of length greater than three spontaneously evolve into predominantly triplet decoding systems. Our findings suggest a plausible scenario for the evolution of the triplet genetic code in a continuous manner. This scenario suggests an explanation of how protein synthesis could be accomplished by means of long RNA-RNA interactions prior to the emergence of the complex decoding machinery, such as the ribosome, that is required for stabilization and discrimination of otherwise weak triplet codon-anticodon interactions.
What’s the “recent experimental evidence on quadruplet decoding”, and what organisms have “ambiguous” or “dual” decoding?
• Tsvi Tlusty, A model for the emergence of the genetic code as a transition in a noisy information channel, J. Theor. Biol. 249 (2007), 331–342.
The genetic code maps the sixty-four nucleotide triplets (codons) to twenty amino-acids. Some argue that the specific form of the code with its twenty amino-acids might be a ‘frozen accident’ because of the overwhelming effects of any further change. Others see it as a consequence of primordial biochemical pathways and their evolution. Here we examine a scenario in which evolution drives the emergence of a genetic code by selecting for an amino-acid map that minimizes the impact of errors. We treat the stochastic mapping of codons to amino-acids as a noisy information channel with a natural fitness measure. Organisms compete by the fitness of their codes and, as a result, a genetic code emerges at a supercritical transition in the noisy channel, when the mapping of codons to amino-acids becomes nonrandom. At the phase transition, a small expansion is valid and the emergent code is governed by smooth modes of the Laplacian of errors. These modes are in turn governed by the topology of the error-graph, in which codons are connected if they are likely to be confused. This topology sets an upper bound – which is related to the classical map-coloring problem – on the number of possible amino-acids. The suggested scenario is generic and may describe a mechanism for the formation of other error-prone biological codes, such as the recognition of DNA sites by proteins in the transcription regulatory network.
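For reference, the ‘classical map-coloring problem’ here is presumably Heawood’s bound: a graph drawn on an orientable surface of genus $g$ can be properly colored with at most

\[
\left\lfloor \frac{7 + \sqrt{1 + 48g}}{2} \right\rfloor
\]

colors (for the sphere, $g = 0$, this gives 4, the four-color theorem; Ringel and Youngs showed the bound is attained for $g \ge 1$). Roughly speaking, in Tlusty’s picture the error-graph of easily confused codons lives on some surface and the amino acids play the role of colors, so a formula like this caps how many amino acids such a code can distinguish.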
• Tsvi Tlusty, A colorful origin for the genetic code: Information theory, statistical mechanics and the emergence of molecular codes, Phys. Life. Rev. 7 (2010), 362–376.
• S. J. Freeland, T. Wu and N. Keulmann, The case for an error minimizing standard genetic code, Orig. Life Evol. Biosph. 33 (2003), 457–477.
• G. Sella and D. Ardell, The coevolution of genes and genetic codes: Crick’s frozen accident revisited, J. Mol. Evol. 63 (2006), 297–313.