• John Baez, Operads and the tree of life, 6 July 2011.

In trying to make the ideas precise I recruited the help of Nina Otter, who was then a graduate student at ETH Zürich, and who is now a grad student at Oxford working on mathematical biology with Heather Harrington.

It took us quite a while to work out all the details, and I could never have done it myself. But now we’re done! Here’s our paper:

• John Baez and Nina Otter, Operads and phylogenetic trees.

By the way, when I gave my talk at UQAM, the fellow in charge of the combinatorics seminar, Franco Saliola, said he had been to a nice talk by the mathematical biologist Lior Pachter. Have you heard of him? His website says:

I work on the fundamental problem of comparative genomics: the determination of the origins and evolutionary history of the nucleotides in all extant genomes. My work incorporates various aspects of genomics, including the reconstruction of ancestral genomes (paleogenomics), the modeling of genome dynamics (phylogenomics and systems biology) and the assignment of function to genome elements (functional genomics and epigenomics).

In addition to working on algorithms and mathematical foundations for comparative genomics, I also work on genome projects and perform large scale computational analyses. I have been a member of the mouse, rat, chicken and fly genome sequencing consortia, and the ENCODE project.

My research draws on tools from discrete mathematics, algebra and statistics. I am also interested in questions in these subjects that are motivated by biology problems.

Trees have a topology and they can have branch lengths or node times. As far as the topology alone is concerned, the main observation is that phylogenetic trees are unbalanced. I have just added an image to

http://www.azimuthproject.org/azimuth/show/Tree+of+life

to show what I mean. This imbalance shows up at every level, down to trees with 4 leaves, which is the smallest size at which imbalance can occur. All the obvious mathematical models (e.g. a birth-death process assuming constant rates of birth and death) make more balanced trees than this.
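To make the 4-leaf case concrete, here is a minimal sketch that computes how often a pure-birth (Yule) process, i.e. the birth-death model with the death rate set to zero, produces a balanced 4-leaf tree. The function name `yule_root_split` is made up for this illustration, and the model is an assumption chosen for simplicity, not necessarily the one used for the image.

```python
from fractions import Fraction

def yule_root_split(n_leaves):
    """Exact distribution of the number of leaves left of the root in a Yule tree.

    Assumption: pure-birth process, where each new split hits one of the
    current leaves uniformly at random.
    """
    # Start with a root whose two children are leaves: one leaf on the left.
    dist = {1: Fraction(1)}
    for total in range(2, n_leaves):
        new = {}
        for left, p in dist.items():
            # The split lands on the left side with probability left/total.
            new[left + 1] = new.get(left + 1, 0) + p * Fraction(left, total)
            new[left] = new.get(left, 0) + p * Fraction(total - left, total)
        dist = new
    return dist

split = yule_root_split(4)
p_balanced = split[2]               # root splits the leaves 2 + 2
p_unbalanced = split[1] + split[3]  # root splits the leaves 1 + 3
print(p_balanced, p_unbalanced)     # prints: 1/3 2/3
```

So under this model one third of 4-leaf trees are balanced; the claim above is that real phylogenetic trees are balanced less often than that.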

As far as node times are concerned, it will depend on context. For example, for gene trees within a species, coalescent theory (http://en.wikipedia.org/wiki/Coalescent_theory) gives a useful model. Looking backward in time, the coalescences happen very quickly, then slow down. For a gene sampled from, say, 10 individuals, and assuming a constant population size, the expected time for the first coalescence is proportional to 1/(10*9), the additional time to the next coalescence is proportional to 1/(9*8), and so on down to the last pair to meet, with expected time proportional to 1/(2*1). Looking forwards in time, this gives a tree that grows faster than exponential.
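As a rough illustration of those waiting times, the sketch below tabulates the relative expected time spent with k lineages, taken as 1/(k(k-1)) with the constant of proportionality dropped. The function name `coalescent_waits` is hypothetical, and the units are arbitrary.

```python
def coalescent_waits(n):
    """Relative expected waiting times while k = n, n-1, ..., 2 lineages remain.

    Assumption: constant population size, so the wait with k lineages
    is proportional to 1/(k*(k-1)).
    """
    return {k: 1.0 / (k * (k - 1)) for k in range(n, 1, -1)}

waits = coalescent_waits(10)
total = sum(waits.values())  # telescopes to 1 - 1/n in these units

for k, w in waits.items():
    print(f"{k} -> {k - 1} lineages: {w:.4f} ({w / total:.1%} of total depth)")
```

In these units the last pair alone accounts for more than half of the expected depth of the tree, which is the "very quickly, then slow down" pattern described above.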

“In other words, that you can work only with trees whose branches all make it to the present, without any harm. I’ve been hoping to find some context in which this actually causes problems. Do you know about that?”

It can cause problems if you want to estimate dates. For example, there are 22 species of Crocodylia (crocodiles, alligators and gharials), and they separated from the rest of the tree of life a very long time ago (let's say 100My for the sake of argument). If you assumed there were no extinctions, you would estimate the time of the most recent common ancestor of extant Crocodylia (that is, the time of the first speciation in the phylogenetic tree for Crocodylia) to be a very long time ago as well. Unlike the coalescence case, the expected times go like 1/21, 1/20, …, 1/2, 1/1 going back in time, so, very roughly, you might estimate this time as 1/(1 + 1/2 + … + 1/21) = 1/3.64 = 0.27 of 100My since the ancestor species separated from the rest of the tree of life, that is, 73My ago. More realistically, there will have been many extinctions, and quite likely there were once many more than 22 species in this group. In this case, it could easily be that the most recent common ancestor of extant Crocodylia is very recent.
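The back-of-envelope estimate above can be reproduced in a few lines. This is only a sketch of that rough argument: it assumes a pure-birth process with no extinction, uses the illustrative 100My figure, and the names `first_split_fraction`, `stem_age_my` and `crown_age_my` are made up here.

```python
def first_split_fraction(n_species):
    """Fraction of the total time that elapses before the first speciation.

    Assumption: no extinction, so the expected wait with k lineages
    is proportional to 1/k, for k = 1, ..., n_species - 1.
    """
    waits = [1.0 / k for k in range(1, n_species)]  # 1/1, 1/2, ..., 1/(n-1)
    return waits[0] / sum(waits)

stem_age_my = 100.0                        # illustrative figure from the text
frac = first_split_fraction(22)            # about 0.27
crown_age_my = stem_age_my * (1.0 - frac)  # about 73 My ago
print(f"first split after {frac:.2f} of the time, i.e. {crown_age_my:.0f}My ago")
```

Allowing for extinction would change the waiting times and could move the estimated common ancestor much closer to the present, as the comment goes on to say.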

It also causes problems if you know some dates from fossils and want to estimate speciation and extinction rates. This article might be a good place to start.

“Estimating diversification rates from phylogenetic information”. By Ricklefs R E, Trends Ecol Evol. 2007.

http://www.bio-nica.info/biblioteca/Ricklefs2007PhylogeneticInformation.pdf

Finally, ‘phylogenetics’ means estimating trees from genes and ‘phylogenomics’ means estimating trees from whole genomes.

Actually biologists do assume a probability distribution on the branching patterns, and the choice can affect the results. You are right that little is known about speciation rates and extinction rates, so guesses have to be made.

Thanks for the correction! I only know a little about phylogenomics, so it’s nice to know that when I say something wrong, there’s a chance you’ll appear and correct me.

What branching patterns might be particularly likely and/or unlikely?

Also, you mention ‘extinction rates’. One person told me that extinction events were completely ignored in the process of guessing a phylogenetic tree from present-day DNA data. In other words, that you can work only with trees whose branches all make it to the present, without any harm. I’ve been hoping to find some context in which this actually causes problems. Do you know about that?

(Of course we need to think about extinction events when we also use DNA data from extinct species… but I’m not talking about that now.)

http://www.bunniestudios.com/blog/?p=353