This week Lisa and I are visiting her 90-year-old mother in Montréal. Friday I’m giving a talk at the Université du Québec à Montréal. The main person I know there is André Joyal, an expert on category theory and algebraic topology. So, I decided to give a talk explaining how some ideas from these supposedly ‘pure’ branches of math show up in biology.
My talk is called ‘Operads and the Tree of Life’.
In biology, trees are very important:
So are trees of a more abstract sort: phylogenetic trees describe the history of evolution. The biggest phylogenetic tree is the ‘Tree of Life’. It includes all the organisms on our planet, alive now or anytime in the past. Here’s a rough sketch of this enormous tree:
Its structure is far from fully understood. So, biologists typically study smaller phylogenetic trees, like this tree of dog-like species made by Elaine Ostrander:
Abstracting still further, we can also think of a tree as a kind of purely mathematical structure, like this:
Trees are important in combinatorics, but also in algebraic topology. The reason is that in algebraic topology we get pushed into studying spaces equipped with enormous numbers of operations. We’d get hopelessly lost without a good way of drawing these operations. We can draw an operation with n inputs and one output as a little tree like this:
We can also draw the various ways of composing these operations. Composing them is just like building a big tree out of little trees!
An operation with n inputs and one output is called an n-ary operation. In the late 1960s, various mathematicians including Boardmann and Vogt realized that spaces with tons of n-ary operations were crucial to algebraic topology. To handle all these operations, Peter May invented the concept of an operad. This formalizes the way operations can be drawn as trees. By now operads are a standard tool, not just in topology, but also in algebraic geometry, string theory and many other subjects.
But how do operads show up in biology?
When attending a talk by Susan Holmes on phylogenetic trees, I noticed that her work on phylogenetic trees was closely related to a certain operad. And when I discussed her work here, James Griffin pointed out that this operad can be built using a slight variant of a famous construction due to Boardman and Vogt: their so-called ‘W construction’!
I liked the idea that trees and operads in topology might be related to phylogenetic trees. And thinking further, I found that the relation was real, and far from a coincidence. In fact, phylogenetic trees can be seen as operations in a certain operad… and this operad is closely related to the way computational biologists model DNA evolution as a branching sort of random walk.
That’s what I’d like to explain now.
I’ll be a bit sketchy, because I’d rather get across the basic ideas than the technicalities. I could even be wrong about some fine points, and I’d be glad to talk about those in the comments. But the overall picture is solid.
First, let’s ponder the mathematical structure of a phylogenetic tree. First, it’s a tree: a connected graph with no circuits. Second, it’s a rooted tree, meaning it has one vertex which is designated the root. And third, the leaves are labelled.
I should explain the third part! For any rooted tree, the vertices with just one edge coming out of them are called leaves. If the root is drawn at the bottom of the tree, the leaves are usually drawn at the top. In biology, the leaves are labelled by names of species: these labels matter. In mathematics, we can label the leaves by numbers where is the number of leaves.
Summarizing all this, we can say a phylogenetic tree should at least be a leaf-labelled rooted tree.
That’s not all there is to it. But first, a comment. When you see a phylogenetic tree drawn by a biologist, it’ll pretty much always a binary tree, meaning that as we move up any edge, away from the root, it either branches into two new edges or ends in a leaf. The reason is that while species often split into two as they evolve, it is less likely for a species to split into three or more new species all at once.
So, the phylogenetic trees we see in biology are usually leaf-labeled rooted binary trees. However, we often want to guess such a tree from some data. In this game, trees that aren’t binary become important too!
Why? Well, here another fact comes into play. In a phylogenetic tree, typically each edge can be labeled with a number saying how much evolution occurred along that edge. But as this number goes to zero, we get a tree that’s not binary anymore. So, we think of non-binary trees as conceptually useful ‘borderline cases’ between binary trees.
So, it’s good to think about phylogenetic trees that aren’t necessarily binary… and have edges labelled by numbers. Let’s make this into a formal definition:
Definition A phylogenetic tree is a leaf-labeled rooted tree where each edge not touching a leaf is labeled by a positive real number called its length.
By the way, I’m not claiming that biologists actually use this definition. I’ll write for the set of phylogenetic trees with leaves. This becomes a topological space in a fairly obvious way, where we can trace out a continuous path by continuously varying the edge lengths of a tree. But when some edge lengths approach zero, our graph converges to one where the vertices at ends of these edges ‘fuse into one’, leaving us with a graph with fewer vertices.
Here’s an example for you to check your understanding of what I just said. With the topology I’m talking about, there’s a continuous path in that looks like this:
These trees are upside-down, but don’t worry about that. You can imagine this path as a process where biologists slowly change their minds about a phylogenetic tree as new data dribbles in. As they change their minds, the tree changes shape in a continuous way.
For more on the space of phylogenetic trees, see:
• Louis Billera, Susan Holmes and Karen Vogtmann, Geometry of the space of phylogenetic trees, Advances in Applied Mathematics 27 (2001), 733-767.
How are phylogenetic trees related to operads? I have three things to say about this. First, they are the operations of an operad:
Theorem 1. There is an operad called the phylogenetic operad, or whose space of n-ary operations is
If you don’t know what an operad is, I’d better tell you now. They come in different flavors, and technically I’ll be using ‘symmetric topological operads’. But instead of giving the full definition, which you can find on the nLab, I think it’s better if I sketch some of the key points.
For starters, an operad consists of a topological space for each . The point in are called the n-ary operations of You can visualize an n-ary operation as a black box with input wires and one output wire:
Of course, this also looks like a tree.
We can permute the inputs of an n-ary operation and get a new n-ary operation, so we have an action of the permutation group on . You visualize this as permuting input wires:
More importantly, we can compose operations! If we have an n-ary operation , and n more operations, say , we can compose with all the rest and get an operation called
Here’s how you should imagine it:
Composition and permutation must obey some laws, all of which are completely plausible if you draw them as pictures. For example, the associative law makes a composite of composites like this well-defined:
Now, these pictures look a lot like trees. So it shouldn’t come as a shock that phylogenetic trees are the operations of some operad But let’s sketch why it’s true.
First, we can permute the ‘inputs’—meaning the labels on the leaves—of any phylogenetic tree and get a new phylogenetic tree. This is obvious.
Second, and more importantly, we can ‘compose’ phylogenetic trees. How do we do this? Simple: we glue the roots of a bunch of phylogenetic trees to the leaves of another and get a new one!
More precisely, suppose we have a phylogenetic tree with leaves, say . And suppose we have more, say Then we can glue the roots of to the leaves of to get a new phylogenetic tree called
Third and finally, all the operad laws hold. Since these laws all look obvious when you draw them using pictures, this is really easy to show.
If you’ve been paying careful attention, you should be worrying about something now. In operad theory, we think of an operation as having n inputs and one output. For example, this guy has 3 inputs and one output:
But in biology, we think of a phylogenetic tree as having one input and n outputs. We start with one species (or other grouping of organisms) at the bottom of the tree, let it evolve and branch, and wind up with n of them!
In other words, operad theorists read a tree from top to bottom, while biologists read it from bottom to top.
Luckily, this isn’t a serious problem. Mathematicians often use a formal trick where they take an operation with n inputs and one output and think of it as having one input and n outputs. They use the prefix ‘co-’ to indicate this formal trick.
So, we could say that phylogenetic trees stand for ‘co-operations’ rather than operations. Soon this trick will come in handy. But not just yet!
The W construction
Boardman and Vogt had an important construction for getting new operads for old, called the ‘W construction’. Roughly speaking, if you start with an operad , this gives a new operad whose operations are leaf-labelled rooted trees where:
1) all vertices except leaves are labelled by operations of and a vertex with n input edges must be labelled by an n-ary operation of
2) all edges except those touching the leaves are labelled by numbers in .
If you think about it, the operations of are strikingly similar to phylogenetic trees, except that:
1) in phylogenetic trees the vertices don’t seem to be labelled by operations of a operad,
2) we use arbitrary nonnegative numbers to label edges, instead of numbers in .
The second point is a real difference, but it doesn’t matter much: if Boardman and Vogt had used nonnegative numbers instead of numbers in to label edges in the W construction, it would have worked just as well. Technically, they’d get a ‘weakly equivalent’ operad.
The first point is not a real difference. You see, there’s an operad called which has exactly one operation of each arity. So, labelling vertices by operations of is a completely trivial process.
As a result, we conclude:
Theorem 2. The phylogenetic operad is weakly equivalent to .
If you’re not an expert on operads (such a person is called an ‘operadchik’), you may be wondering what stands for. The point is that operads have ‘algebras’, where the abstract operations of the operad are realized as actual operations on some topological space. And the algebras of are precisely commutative topological monoids: that is, topological spaces equipped with a commutative associative product!
Branching Markov processes and evolution
By now, if you haven’t fallen asleep, you should be brimming with questions, such as:
1) What does it mean that phylogenetic trees are the operations of some operad ? Why should we care?
2) What does it mean to apply the W construction to the operad ? What’s the significance of doing this?
3) What does it mean that is weakly equivalent to ? You can see the definition of weak equivalence here, but it’s pretty technical, so it needs some explanation.
The answers to questions 2) and 3) take us quickly into fairly deep waters of category theory and algebraic topology—deep, that is, if you’ve never tried to navigate them. However, these waters are well-trawled by numerous experts, and I have little to say about questions 2) and 3) that they don’t already know. So given how long this talk already is, I’ll instead try to answer question 1). This is where some ideas from biology come into play.
I’ll summarize my answer in a theorem, and then explain what the theorem means:
Theorem 3. Given any continuous-time Markov process on a finite set , the vector space whose basis is naturally becomes a coalgebra of the phylogenetic operad.
Impressive, eh? But this theorem is really just saying that biologists are already secretly using the phylogenetic operad.
Biologists who try to infer phylogenetic trees from present-day genetic data often use simple models where the genotype of each species follows a ‘random walk’. Also, species branch in two at various times. These models are called Markov models.
The simplest Markov model for DNA evolution is the Jukes–Cantor model. Consider a genome of fixed length: that is, one or more pieces of DNA having a total of base pairs. For example, this tiny genome has base pairs, just enough to illustrate the 4 possible choices, which are called A, T, C and G:
Since there are 4 possible choices for each base pair, there are possible genotypes with base pairs. In the human genome, is about . So, there are about
genotypes of this length. That’s a lot!
As time passes, the Jukes–Cantor model says that the human genome randomly walks through this enormous set of possibilities, with each base pair having the same rate of randomly flipping to any other base pair.
Biologists have studied many ways to make this model more realistic in many ways, but in a Markov model of DNA evolution we’ll typically have some finite set of possible genotypes, together with some random walk on this set. But the term ‘random walk’ is a bit imprecise: what I really mean is a ‘continuous-time Markov process’. So let me define that.
Fix a finite set . For each time and pair of points i, j in , a continuous-time Markov process gives a number saying the probability that starting at the point i at time zero, the random walk will go to the point j at time . We can think of these numbers as forming an square matrix at each time . We demand that four properties hold:
1) depends continuously on .
2) For all s, t we have .
3) is the identity matrix.
4) For all j and t we have:
All these properties make a lot of sense if you think a bit, though condition 2) says that the random walk does not change character with the passage of time, which would be false given external events like, say, ice ages. As far as math jargon goes, conditions 1)-3) say that is a continuous one-parameter semigroup, while condition 4) together with the fact that says that at each time, is a stochastic matrix.
Let be the vector space whose basis is . To avoid getting confused, let’s write for the basis vector corresponding to . Any probability distribution on gives a vector in . Why? Because it gives a probability for each , and we can think of these as the components of a vector .
Similarly, for any time , we can think of the matrix as a linear operator
So, if we start with some probability distribution of genotypes, and let them evolve for a time according to our continuous-time Markov process, by the end the probability distribution will be .
But species do more than evolve this way: they also branch! A phylogenetic tree describes a way for species to evolve and branch.
So, you might hope that any phylogenetic tree gives a ‘co-operation’ that takes one probability distribution as input and returns n probability distributions as output.
That’s true. But these n probability distributions will be correlated, so it’s better to think of them as a single probability distribution on the set This can be seen as a vector in the vector space , the tensor product of n copies of
So, any phylogenetic tree gives a linear operator from to . We’ll call it
because we’ll build it starting from the Markov process
Here’s a sketch of how we build it—I’ll give a more precise account in the next and final section. A phylogenetic tree is made of a bunch of vertices and edges. So, I just need to give you an operator for each vertex and each edge, and you can compose them and tensor them to get the operator :
1) For each vertex with one edge coming in and n coming out:
we need an operator
that describes what happens when one species branches into n species. This operator takes the probability distribution we put in and makes n identical and perfectly correlated copies. To define this operator, we use the fact that the vector space has a basis labelled by the genotypes Here’s how the operator is defined:
2) For each edge of length , we need an operator that describes a random walk of length This operator is provided by our continuous-time Markov process: it’s
And that’s it! By combining these two kinds of operators, one for ‘branching’ and one for ‘random walking’, we get a systematic way to take any phylogenetic tree and get an operator
In fact, these operators obey just the right axioms to make into what’s called a ‘coalgebra’ of the phylogenetic operad. But to see this—that is, to prove Theorem 3—it helps to use a bit more operad technology.
I haven’t even defined coalgebras of operads yet. And I don’t think I’ll bother. Why not? Well, while the proof of Theorem 3 is fundamentally trivial, it’s sufficiently sophisticated that only operadchiks would enjoy it without a lengthy warmup. And you’re probably getting tired by now.
So, to most of you reading this: bye! It was nice seeing you! And I hope you sensed the real point of this talk:
Some of the beautiful structures used in algebraic topology are also lurking in biology. These structures may or may not be useful in biology… but we’ll never know if we don’t notice them and say what they are! So, it makes sense for mathematicians to spend some time looking for them.
Now, let me sketch a proof of Theorem 3. It follows from a more general theorem:
Theorem 4. Suppose is an object in some symmetric monoidal topological category . Suppose that is equipped with an action of the additive monoid . Suppose also that is a cocommutative coalgebra. Then naturally becomes a coalgebra of the phylogenetic operad.
How does this imply Theorem 3? In Theorem 3, is the category of finite-dimensional real vector space. The action of on is the continuous-time Markov process. And becomes a cocommutative coalgebra because it’s a vector space with a distinguished basis, namely the finite set . This makes into a cocommutative coalgebra in the usual way, where the comultiplication:
‘duplicates’ basis vectors:
while the counit:
These correspond to species splitting in two and species going extinct, respectively. (Biologists trying to infer phylogenetic trees often ignore extinction, but it’s mathematically and biologically natural to include it.) So, all the requirements are met to apply Theorem 4 and make into coalgebra of the phylogenetic operad.
But how do we prove Theorem 4? It follows immediately from Theorem 5:
Theorem 5. The phylogenetic operad is the coproduct of the operad and the additive monoid , viewed as an operad with only 1-ary operations.
Given how coproducts works, this means that an algebra of both and is automatically an algebra of . In other words, any commutative algebra with an action of is an algebra of . Dualizing, it follows that any cocommutative coalgebra with an action of is an coalgebra of And that’s Theorem 4!
But why is Theorem 5 true? First of all, I should emphasize that the idea of using it was suggested by Tom Leinster in our last blog conversation on the phylogenetic operad. And in fact, Tom proved a result very similar to Theorem 5 here:
• Tom Leinster, Coproducts of operads, and the W-construction, 14 September 2000.
He gives an explicit description of the coproduct of an operad and a monoid, viewed as an operad with only unary operations. He works with non-symmetric, non-topological operads, but his ideas also work for symmetric, topological ones. Applying his ideas to the coproduct of and , we see that we get the phylogenetic operad!
And so, phylogenetic trees turn out to be related to coproducts of operads. Who’d have thought it? But we really don’t have as many fundamentally different ideas as you might think: it’s hard to have new ideas. So if you see biologists and algebraic topologists both drawing pictures of trees, you should expect that they’re related.