I used unshuffles in my work on Lie 2-algebras, and one needs shuffles to make the tensor algebra of a vector space into a bialgebra. But I haven’t been shuffling much lately.

]]>It appears in many branches of maths incl. operads (Loday/Vallette: Algebraic Operads – of which I know nothing) and Markov chains (Persi Diaconis). But it has not been explicitly appreciated in tensor calculus (Veblen already used it implicitly) – which I plan to change…

]]>You said earlier that the phylogenetic trees you consider are more fancy than the ones we consider. Maybe if you explained them to me I could see if they were operations in a mathematically interesting operad. (I can’t promise to do anything more useful than that.)

That’s fine, I don’t have much hope this is going to lead anywhere very useful. It might give me a different way of looking at something.

I am working with the multispecies coalescent model. This model has a bunch of ‘gene trees’ sitting inside a species tree. The individual trees are quite ordinary, but the gene trees are constrained by the species tree. There are some pictures (Figs 1 and 9) in this paper: http://mbe.oxfordjournals.org/content/27/3/570.full. The model incorporates a simple model from population genetics along each branch of the species tree.

The trees are binary and ultrametric. The tips of the species tree are labelled with species. I’ll just talk about a single gene tree. They are all basically the same sort of thing, so if you understand one gene tree, you understand them all. The gene tree will have at least as many tips as the species tree, usually more. At the tips are DNA sequences which have been sampled from individuals belonging to one of the species. The constraint is that going back in time, two DNA sequences cannot meet (coalesce) until the species they belong to have met.

A more biological description. Suppose you have two human DNA sequences for a particular gene. (You can generally get two sequences from each individual organism so they could be from the same person.) Suppose the human population is 50,000. That might sound crazy given the current population, but over the last six million years it is a reasonable guess. That means 100,000 copies in the human gene pool, and so using a very simple model of population genetics, the probability that this pair meets is 1/100,000 per generation, say 25 years. So the probability they have not met by six million years into the past is about exp(-6/2.5) or 9%. If they don’t meet, it is an example of ‘incomplete lineage sorting’. Once they’re back that far, they can meet DNA sequences from chimps. It means that a gene tree can have a different topology to the species tree.
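The arithmetic above can be checked with a minimal sketch. It assumes the very simple model described in the comment: a constant chance of 1/100,000 per generation that the pair meets, and 25-year generations. The function name and parameter names are made up for illustration.

```python
import math

def prob_not_coalesced(years, generation_time, gene_copies):
    """Chance that two lineages have NOT met after `years`, assuming a
    constant probability 1/gene_copies of meeting in each generation."""
    generations = years / generation_time
    # Discrete generations: survive every generation without meeting.
    exact = (1.0 - 1.0 / gene_copies) ** generations
    # Continuous-time approximation used in the text: exp(-6/2.5) = exp(-2.4).
    approx = math.exp(-generations / gene_copies)
    return exact, approx

exact, approx = prob_not_coalesced(
    years=6_000_000, generation_time=25, gene_copies=100_000
)
print(f"discrete: {exact:.4f}  exponential approximation: {approx:.4f}")
```

Both values come out at roughly 9%, so the exponential shortcut loses nothing at this scale.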

I knew about the latex thing, and even put one ‘latex’ into my post. Then I got distracted by something else, and clicked on post before adding the rest. ‘infinitesimal stochastic’ is fine. Biologists just say ‘rate matrix’.

]]>Cool! I’ll check it out.

]]>Thanks!

The main thing I learnt was that composability is the key thing about operads (whereas I was thinking vaguely of topology, metric, or measure).

Right. An operad is a bunch of operations that you can compose. I attempted to say this in the blog article:

… trees are often used to describe the composition of $n$-ary operations. This is formalized in the theory of operads. An ‘operad’ is an algebraic structure where for each natural number $n$ we have a set whose elements are considered as abstract $n$-ary operations, not necessarily operating on anything yet. An element can be depicted as a planar tree with one vertex and $n$ labelled leaves:

We can compose these operations in a tree-like way to get new operations:

and an associative law holds, making this sort of composite unambiguous:
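The tree-like composition and the associative law can be illustrated with a small sketch. It treats abstract operations as planar trees and composes them by grafting. The names `Op`, `LEAF`, `corolla`, `arity` and `compose` are invented for this example, and permutations of the leaf labels are left out to keep it short.

```python
from dataclasses import dataclass

LEAF = "*"  # an unlabelled input; on its own it acts as the identity operation

@dataclass(frozen=True)
class Op:
    name: str
    inputs: tuple  # each entry is LEAF or another Op

def corolla(name, n):
    """An abstract n-ary operation: one vertex with n leaves."""
    return Op(name, (LEAF,) * n)

def arity(tree):
    """Number of leaves of a tree, i.e. the arity of the composite operation."""
    if not isinstance(tree, Op):
        return 1
    return sum(arity(child) for child in tree.inputs)

def compose(f, *gs):
    """Graft the trees gs onto the leaves of f, from left to right."""
    if len(gs) != arity(f):
        raise ValueError("need exactly one tree per leaf of f")
    pending = iter(gs)

    def graft(tree):
        if not isinstance(tree, Op):
            return next(pending)
        return Op(tree.name, tuple(graft(child) for child in tree.inputs))

    return graft(f)

f, g1, g2 = corolla("f", 2), corolla("g1", 2), corolla("g2", 1)
h1, h2, h3 = corolla("h1", 1), corolla("h2", 3), corolla("h3", 2)

# Associativity: composing in two stages gives the same tree either way.
lhs = compose(compose(f, g1, g2), h1, h2, h3)
rhs = compose(f, compose(g1, h1, h2), compose(g2, h3))
print(lhs == rhs, arity(lhs))
```

Because the composite is just a tree, the two bracketings are literally the same object, which is one way to see why the composite is unambiguous.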

The formula you mention here:

The formula for a single step in the calculation looks like this. Suppose a vertex $v$ in the tree has $k$ children, which have vectors $x_1, \dots, x_k$, and the branches between $v$ and its children have lengths $t_1, \dots, t_k$. Suppose $Q$ is an $n$ by $n$ infinitesimal stochastic matrix which gives the rates at which the $n$ members of the state set $X$ turn into one another. Then we make vectors $\exp(t_i Q)\, x_i$, and multiply these element-wise to make the vector at $v$.

is implicit in some stuff discussed in our paper, and even in the blog article here, though we were hiding it amid clouds of jargon, to make certain sorts of mathematician able to understand it better. The idea, as we see it, is that the phylogenetic operad is built from two simpler ones:

• one called $[0,\infty)$ that describes time evolution, where the genome randomly changes according to a Markov process,

and

• one called $\mathrm{Com}$ that describes branching, where the genome gets duplicated, or ‘*n*-plicated’.

Technically the phylogenetic operad is the ‘coproduct’ of these two other operads.

Of course I’m not claiming this viewpoint is better for normal humans.

You said earlier that the phylogenetic trees *you* consider are more fancy than the ones we consider. Maybe if you explained them to me I could see if they were operations in a mathematically interesting operad. (I can’t promise to do anything more useful than that.)

By the way, I fixed your LaTeX: when you post a comment there are instructions directly above that say how to use LaTeX here, but almost nobody ever reads them. In the process I took the liberty of changing “stochastic” to “infinitesimal stochastic”. A **stochastic** matrix is one that describes a finite amount of time evolution for a Markov process: namely, a square matrix of nonnegative numbers where the columns sum to 1. We need some other name for a matrix $H$ such that $\exp(tH)$ is stochastic for all $t \ge 0$. Some people call this an **intensity matrix** or **stochastic Hamiltonian**. But I really want an adjective: I can’t grammatically say “now let us prove that this matrix is intensity”. Also, since in physics one speaks of the Hamiltonian of a particular physical system, one should not also use that word to denote a matrix with particular mathematical properties.

So, I use the adjective **infinitesimal stochastic** to denote a matrix $H$ such that $\exp(tH)$ is stochastic for all $t \ge 0$. This goes along with a tradition where we think of $H$ as describing an ‘infinitesimal’ amount of time evolution, and $\exp(tH)$ as describing a ‘finite’ amount of time evolution.
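Here is a rough numerical sketch of the distinction. It uses the standard equivalent condition for a matrix to be infinitesimal stochastic under the column convention above: the off-diagonal entries are nonnegative and each column sums to zero. The 2-state matrix `H` and all function names are hypothetical, and the matrix exponential is a plain truncated Taylor series with repeated squaring, good enough for a toy case but not a production routine.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(h, t, terms=30, squarings=10):
    """Approximate exp(t*h): Taylor series on t*h / 2**squarings, then square."""
    n = len(h)
    scale = t / (2 ** squarings)
    small = [[scale * h[i][j] for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def is_infinitesimal_stochastic(h, tol=1e-12):
    """Off-diagonal entries nonnegative, every column sums to zero."""
    n = len(h)
    off_diag = all(h[i][j] >= 0 for i in range(n) for j in range(n) if i != j)
    columns = all(abs(sum(h[i][j] for i in range(n))) < tol for j in range(n))
    return off_diag and columns

def is_stochastic(u, tol=1e-9):
    """All entries nonnegative, every column sums to one."""
    n = len(u)
    nonneg = all(u[i][j] >= -tol for i in range(n) for j in range(n))
    columns = all(abs(sum(u[i][j] for i in range(n)) - 1.0) < tol
                  for j in range(n))
    return nonneg and columns

# Hypothetical example: state 0 -> 1 at rate 2, state 1 -> 0 at rate 1.
H = [[-2.0, 1.0],
     [2.0, -1.0]]

print(is_infinitesimal_stochastic(H))
print([is_stochastic(mat_exp(H, t)) for t in (0.1, 1.0, 10.0)])
```

Note that `H` itself is not stochastic (it has negative entries), which is exactly why a separate adjective is useful.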

You might be interested in the smooth blowup of the Billera-Holmes-Vogtmann model for tree space described in

• Navigation in tree spaces, *Adv. Appl. Math.* **67** (2015), 75–95.

It’s also connected to operads (more precisely, to the Deligne-Mumford-Knudsen operad defined by genus zero **real** algebraic curves); its spaces are aspherical, with an interesting (system of) fundamental group(s), analogous in some ways to the operad defined by the braid groups. It has an intrinsic pseudometric, with wall-crossing phenomena that look suspiciously biological…

Best, sincerely

```
(:+{)} Jack
```

]]>Thanks, John. The main thing I learnt was that composability is the key thing about operads (whereas I was thinking vaguely of topology, metric, or measure).

A Markov model maps each tree to a matrix that describes how a probability distribution on the genome data of the species at the root of the tree stochastically evolves to probability distributions on the genome data of the descendant species at the leaves. Ways of gluing trees together get sent to ways of combining these matrices.

(I’m being quite rough here, just to get the point across with a minimum of distraction.)

Of course this may not be how you usually think of a Markov model, like the Jukes–Cantor model,

Actually, that is very much how I think about phylogenetic analysis as a whole. Except, upside down and inside out.

We start with probability distributions on the genome data of the descendant species at the leaves. These are what we observe in the DNA of various present-day organisms. (Usually these are regarded as known sequences, i.e., as points rather than distributions, but that’s for computational efficiency, not realism.) Then using Felsenstein’s tree-pruning algorithm, we work recursively back to the root, calculating vectors in $\mathbb{R}^X$, where $X$ is the set of possible states. (The vectors are called ‘conditional likelihoods of subtrees’ or ‘partial likelihoods’.) An extra step at the root converts the vector there to a number, which is the probability of the observed data, given the tree. In other words, it is the likelihood of the tree, given the observations. This likelihood is The Central Object of study in phylogenetic analysis (though I don’t suppose many biologists would put it that way!).

The formula for a single step in the calculation looks like this. Suppose a vertex $v$ in the tree has $k$ children, which have vectors $x_1, \dots, x_k$, and the branches between $v$ and its children have lengths $t_1, \dots, t_k$. Suppose $Q$ is an $n$ by $n$ infinitesimal stochastic matrix which gives the rates at which the $n$ members of the state set $X$ turn into one another. Then we make vectors $\exp(t_i Q)\, x_i$, and multiply these element-wise to make the vector at $v$.

This seems the best bet for turning something that interests me into something ‘operadic’.
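The single pruning step described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the Jukes–Cantor model with all substitution rates equal to 1, for which the transition matrix has a well-known closed form (and is symmetric, so the row versus column convention does not matter). The tiny three-tip tree, the uniform distribution at the root, and the function names are all made up for the example.

```python
import math

STATES = "ACGT"

def jukes_cantor(t, rate=1.0):
    """Closed form of exp(t*Q) when every off-diagonal entry of Q is `rate`."""
    decay = math.exp(-4.0 * rate * t)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def prune_step(children):
    """One step of tree pruning.

    `children` is a list of (branch_length, vector) pairs. Each child vector
    is pushed up its branch, then the results are multiplied element-wise.
    """
    result = [1.0] * 4
    for t, x in children:
        p = jukes_cantor(t)
        moved = [sum(p[a][b] * x[b] for b in range(4)) for a in range(4)]
        result = [r * m for r, m in zip(result, moved)]
    return result

def tip(base):
    """Vector for an observed base: a point rather than a distribution."""
    return [1.0 if s == base else 0.0 for s in STATES]

# Hypothetical ultrametric tree ((A:0.1, A:0.1):0.2, C:0.3).
inner = prune_step([(0.1, tip("A")), (0.1, tip("A"))])
root = prune_step([(0.2, inner), (0.3, tip("C"))])

# Extra step at the root: average over an assumed uniform root distribution.
likelihood = sum(0.25 * v for v in root)
print(likelihood)
```

A useful sanity check on a sketch like this is that the likelihoods of all 64 possible tip patterns on the same tree add up to 1.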

I already knew the Billera-Holmes-Vogtmann paper. And yes, that does seem about halfway between you and me. As I read it, their main reason why biologists should care is that it provides (in principle) ways of summarising The Central Object.

]]>Thank you John, I will look for that.

(I just realized btw that my question as phrased is a hybrid one, since a presumably small-but-important subset of the genetic info which is in principle available at a given time exists in already-dead organisms.

So to answer the Q, one must consider not only ‘evolutionary decay’ – the generation-over-generation loss of information concerning the biosphere’s phylogenetic state at a given earlier time due to evolutionary change – but also ‘physical decay’ – the degradation of genetic info in dead organisms.)

]]>