The Mathematics of Biodiversity (Part 3)

27 June, 2012

We tend to think of biodiversity as a good thing, but sometimes it’s deadly. Yesterday Andrei Korobeinikov gave a talk on ‘Viral evolution within a host’, which was mainly about AIDS.

The virus that causes this disease, HIV, can reproduce very fast. In an untreated patient near death there are between 10^{10} and 10^{12} new virions per day! Remember, a virion is an individual virus particle. The virus also has a high mutation rate: about 3 × 10^{-5} mutations per generation for each base—that is, each molecule of A, U, C, or G in the RNA of the virus. That may not seem like a lot, but if you multiply it by 10^{12} you’ll see that a huge number of new variants of each base arise within the body of a single patient.

So, evolution is at work within you as you die.

And in fact, many scientists believe that the diversity of the virus eventually overwhelms your immune system! Although it’s apparently not quite certain, it seems that while the body generates B cells and T cells to attack different variants of HIV as they arise, they eventually can’t keep up with the sheer number of variants.

Of course, the fact that the HIV virus attacks the immune system makes the disease even worse. Here in blue you see the number of T cells per cubic millimeter of blood, and in red you see the number of virions per cubic centimeter of blood for a typical untreated patient:

Mathematicians and physicists have looked at some very simple models to get a qualitative understanding of these issues. One famous paper that started this off is:

• Lev S. Tsimring, Herbert Levine and David A. Kessler, RNA virus evolution via a fitness-space model, Phys. Rev. Lett. 76 (1996), 4440–4443.

The idea here is to say that at any time t the viruses have a probability density p(r,t) of having fitness r. In fact the different genotypes of the virus form a cloud in a higher-dimensional space, but these authors treat that space as 1-dimensional, with fitness as its one coordinate, just to keep things simple. They then write down an equation for how the population density changes with time:

\displaystyle{\frac{\partial }{\partial t}p(r,t) = (r - \langle r \rangle)\, p(r,t) + D \frac{\partial^2 }{\partial r^2}p(r,t) - \frac{\partial}{\partial r}(v_{\mathrm{drift}}\, p(r,t)) }

This is a replication-mutation-drift equation. If we just had

\displaystyle{\frac{\partial }{\partial t}p(r,t) = (r - \langle r \rangle)\, p(r,t) }

this would be a version of the replicator equation, which I explained recently in Information Geometry (Part 9). Here

\displaystyle{ \langle r \rangle = \int_0^\infty r p(r,t) dr }

is the mean fitness, and the replicator equation says that the fraction of organisms of a given type grows at a rate proportional to how much their fitness exceeds the mean fitness: that’s where the (r - \langle r \rangle) comes from.

If we just had

\displaystyle{\frac{\partial }{\partial t}p(r,t) =  D \frac{\partial^2 }{\partial r^2}p(r,t) }

this would be the heat equation, which describes diffusion occurring at a rate D. This models the mutation of the virus, though not in a very realistic way.

If we just had

\displaystyle{\frac{\partial}{\partial t} p(r,t) =  - \frac{\partial}{\partial r}(v_{\mathrm{drift}} \, p(r,t)) }

the fitness of the virus would increase at a rate equal to the drift velocity v_{\mathrm{drift}}.

If we include both the diffusion and drift terms:

\displaystyle{\frac{\partial }{\partial t} p(r,t) =  D \frac{\partial^2 }{\partial r^2}p(r,t) - \frac{\partial}{\partial r}(v_{\mathrm{drift}} \, p(r,t)) }

we get the Fokker–Planck equation. This is a famous model of something that’s spreading while also drifting along at a constant velocity: for example, a drop of ink in moving water. Its solutions look like this:

Here we start with stuff concentrated at one point, and it spreads out into a Gaussian while drifting along.
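If you’d like to see this concretely, here’s a little Python sketch (mine, not from the paper) that integrates the diffusion-plus-drift equation with explicit finite differences, starting from a sharp spike:

```python
import numpy as np

def fokker_planck(p0, D, v, dx, dt, steps):
    """Explicit finite-difference integration of
    dp/dt = D d^2p/dr^2 - v dp/dr, with periodic boundaries."""
    p = p0.copy()
    for _ in range(steps):
        diffusion = D * (np.roll(p, -1) - 2*p + np.roll(p, 1)) / dx**2
        drift = v * (np.roll(p, -1) - np.roll(p, 1)) / (2*dx)
        p = p + dt * (diffusion - drift)
    return p

r = np.linspace(0, 10, 500)
dx = r[1] - r[0]

# start with a narrow spike of probability at r = 2
p0 = np.exp(-(r - 2.0)**2 / 0.01)
p0 /= p0.sum() * dx

# evolve for total time 2.0 with drift velocity 1.0
p = fokker_planck(p0, D=0.05, v=1.0, dx=dx, dt=1e-4, steps=20000)

mean = (r * p).sum() * dx
print(mean)  # close to 2 + v*t = 4: the peak drifted while spreading out
```

The mean moves at the drift velocity while the variance grows linearly in time, just as the picture suggests.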

By the way, watch out: what biologists call ‘genetic drift’ is actually a form of diffusion, not what physicists call ‘drift’.

More recently, people have looked at another very simple model. You can read about it here:

• Martin A. Nowak and Robert M. May, Virus Dynamics: Mathematical Principles of Immunology and Virology, Oxford University Press, Oxford, 2000.

In this model the variables are:

• the number of healthy human cells of some type, \mathrm{H}(t)

• the number of infected human cells of that type, \mathrm{I}(t)

• the number of virions, \mathrm{V}(t)

These are my names for variables, not theirs. It’s just a sick joke that these letters spell out ‘HIV’.

Chemists like to describe how molecules react and turn into other molecules using ‘chemical reaction networks’. You’ve seen these if you’ve taken chemistry, but I’ve been explaining more about the math of these starting in Network Theory (Part 17). We can also use them here! Though Nowak and May probably didn’t put it this way, we can consider a chemical reaction network with the following 6 reactions:

• the production of a healthy cell:

\longrightarrow \mathrm{H}

• the infection of a healthy cell by a virion:

\mathrm{H} + \mathrm{V} \longrightarrow \mathrm{I}

• the production of a virion by an infected cell:

\mathrm{I} \longrightarrow \mathrm{I} + \mathrm{V}

• the death of a healthy cell:

\mathrm{H} \longrightarrow

• the death of an infected cell:

\mathrm{I} \longrightarrow

• the death of a virion:

\mathrm{V} \longrightarrow

Using a standard recipe which I explained earlier, we can get from this chemical reaction network to some ‘rate equations’ saying how the numbers of healthy cells, infected cells and virions change with time:

\displaystyle{ \frac{d\mathrm{H}}{dt}  =   \alpha - \beta \mathrm{H}\mathrm{V} - \gamma \mathrm{H} }

\displaystyle{ \frac{d\mathrm{I}}{dt}  = \beta \mathrm{H}\mathrm{V} - \delta \mathrm{I} }

\displaystyle{ \frac{d\mathrm{V}}{dt}  = - \beta \mathrm{H}\mathrm{V} + \epsilon \mathrm{I} - \zeta \mathrm{V} }

The Greek letters are constants called ‘rate constants’, and there’s one for each of the 6 reactions. The equations we get this way are exactly those described by Nowak and May!
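Just for fun, here’s a quick Python sketch of these rate equations, using Euler’s method and rate constants that are completely made up for illustration:

```python
def hiv_model(H, I, V, alpha, beta, gamma, delta, epsilon, zeta, dt, steps):
    """Euler-integrate the rate equations for healthy cells H,
    infected cells I and virions V."""
    for _ in range(steps):
        dH = alpha - beta * H * V - gamma * H
        dI = beta * H * V - delta * I
        dV = -beta * H * V + epsilon * I - zeta * V
        H, I, V = H + dt * dH, I + dt * dI, V + dt * dV
    return H, I, V

# rate constants chosen purely for illustration
H, I, V = hiv_model(H=1000.0, I=0.0, V=10.0,
                    alpha=10.0, beta=0.001, gamma=0.01,
                    delta=0.5, epsilon=5.0, zeta=3.0,
                    dt=0.01, steps=100_000)
print(H, I, V)  # settles near the chronic steady state H = 1000/3, I = 40/3, V = 20
```

With these constants the infection takes hold and the system settles into a chronic state with a reduced but nonzero count of healthy cells.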

What Andrei Korobeinikov is doing is unifying the ideas behind the two models I’ve described here. Alas, I don’t have the energy to explain how. Indeed, I don’t even have the energy to explain what the models I’ve described actually predict. Sad, but true.

I don’t see anything online about Korobeinikov’s new work, but you can read some of his earlier work here:

• Andrei Korobeinikov, Global properties of basic virus dynamics models.

• Suzanne M. O’Regan, Thomas C. Kelly, Andrei Korobeinikov, Michael J. A. O’Callaghan and Alexei V. Pokrovskii, Lyapunov functions for SIR and SIRS epidemic models, Appl. Math. Lett. 23 (2010), 446–448.

The SIR and SIRS models are models of disease that also arise from chemical reaction networks. I explained them back in Network Theory (Part 3). That was before I introduced the terminology of chemical reaction networks… back then I was talking about ‘stochastic Petri nets’, which are an entirely equivalent formalism. Here’s the stochastic Petri net for the SIRS model:

Puzzle: Draw the stochastic Petri net for the HIV model discussed above. It should have 3 yellow circles and 6 aqua squares.

Information Geometry (Part 13)

26 June, 2012

Last time I gave a sketchy overview of evolutionary game theory. Now let’s get serious.

I’ll start by explaining ‘Nash equilibria’ for 2-person games. These are situations where neither player can profit by changing what they’re doing. Then I’ll introduce ‘mixed strategies’, where the players can choose among several strategies with different probabilities. Then I’ll introduce evolutionary game theory, where we think of each strategy as a species, and its probability as the fraction of organisms that belong to that species.

Back in Part 9, I told you about the ‘replicator equation’, which says how these fractions change with time thanks to natural selection. Now we’ll see how this leads to the idea of an ‘evolutionarily stable strategy’. And finally, we’ll see that when evolution takes us toward such a stable strategy, the amount of information the organisms have ‘left to learn’ keeps decreasing!

Nash equilibria

We can describe a certain kind of two-person game using a payoff matrix, which is an n \times n matrix A_{ij} of real numbers. We think of A_{ij} as the payoff that either player gets if they choose strategy i and their opponent chooses strategy j.

Note that in this kind of game, there’s no significant difference between the ‘first player’ and the ‘second player’: either player wins an amount A_{ij} if they choose strategy i and their opponent chooses strategy j. So, this kind of game is called symmetric even though the matrix A_{ij} may not be symmetric. Indeed, it’s common for this matrix to be antisymmetric, meaning A_{ij} = - A_{ji}, since in this case what one player wins, the other loses. Games with this extra property are called zero-sum games. But we won’t limit ourselves to those!

We say a strategy i is a symmetric Nash equilibrium if

A_{ii} \ge A_{ji}

for all j. This means that if both players use strategy i, neither gains anything by switching to another strategy.

For example, suppose our matrix is

\left( \begin{array}{rr} -1 & -12 \\  0 & -3 \end{array} \right)

Then we’ve got the Prisoner’s Dilemma exactly as described last time! Here strategy 1 is cooperate and strategy 2 is defect. If a player cooperates and so does his opponent, he wins

A_{11} = -1

meaning he gets one month in jail. We include a minus sign because ‘winning a month in jail’ is not a good thing. If the player cooperates but his opponent defects, he gets a whole year in jail:

A_{12} = -12

If he defects but his opponent cooperates, he doesn’t go to jail at all:

A_{21} = 0

And if they both defect, they both get three months in jail:

A_{22} = -3

You can see that defecting is a Nash equilibrium, since

A_{22} \ge A_{12}

So, oddly, if our prisoners know game theory and believe Nash equilibria are best, they’ll both be worse off than if they cooperate and don’t betray each other.
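We can check this definition mechanically. Here’s a small Python sketch (my own illustration, not from any game-theory library) that finds the symmetric pure Nash equilibria of a payoff matrix:

```python
def symmetric_pure_nash(A):
    """Return the pure strategies i with A[i][i] >= A[j][i] for all j:
    if both players use i, neither gains by unilaterally switching."""
    n = len(A)
    return [i for i in range(n)
            if all(A[i][i] >= A[j][i] for j in range(n))]

# Prisoner's Dilemma: strategy 0 = cooperate, strategy 1 = defect
A = [[-1, -12],
     [ 0,  -3]]
print(symmetric_pure_nash(A))  # [1]: defecting is the only symmetric Nash equilibrium
```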


Nash equilibria for mixed strategies

So far we’ve been assuming that with 100% certainty, each player chooses one strategy i = 1,2,3,\dots, n. Since we’ll be considering more general strategies in a minute, let’s call these pure strategies.

Now let’s throw some probability theory into the stew! Let’s allow the players to pick different pure strategies with different probabilities. So, we define a mixed strategy to be a probability distribution on the set of pure strategies. In other words, it’s a list of n nonnegative numbers

p_i \ge 0

that sum to one:

\displaystyle{ \sum_{i=1}^n p_i = 1 }

Say I choose the mixed strategy p while you, my opponent, choose the mixed strategy q. Say our choices are made independently. Then the probability that I choose the pure strategy i while you choose j is

p_i q_j

so the expected value of my winnings is

\displaystyle{ \sum_{i,j = 1}^n p_i A_{ij} q_j }

or using vector notation

p \cdot A q

where the dot is the usual dot product on \mathbb{R}^n.

We can easily adapt the concept of Nash equilibrium to mixed strategies. A mixed strategy q is a symmetric Nash equilibrium if for any other mixed strategy p,

q \cdot A q \ge  p \cdot A q

This means that if both you and I are playing the mixed strategy q, I can’t improve my expected winnings by unilaterally switching to the mixed strategy p. And neither can you, because the game is symmetric!
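Here’s a small Python sketch (again just an illustration) that tests this condition. Since p \cdot A q is linear in p, it’s maximized over mixed strategies at some pure strategy, so it’s enough to compare q against each pure strategy:

```python
import numpy as np

def payoff(p, A, q):
    """Expected winnings p . A q when I play mixed strategy p against q."""
    return p @ A @ q

def is_symmetric_nash(q, A, tol=1e-9):
    """q is a symmetric Nash equilibrium iff no pure strategy does better
    against q than q does against itself."""
    base = payoff(q, A, q)
    return all(A[i] @ q <= base + tol for i in range(len(A)))

# rock-paper-scissors: win +1, lose -1, tie 0
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
q = np.array([1/3, 1/3, 1/3])
print(is_symmetric_nash(q, A))                           # True: uniform play is an equilibrium
print(is_symmetric_nash(np.array([1.0, 0.0, 0.0]), A))   # False: pure rock loses to paper
```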

If this were a course on game theory, I would now do some examples. But it’s not, so I’ll just send you to page 6 of Sandholm’s paper: he looks at some famous games like ‘hawks and doves’ and ‘rock paper scissors’.

Evolutionarily stable strategies

We’re finally ready to discuss evolutionarily stable strategies. To do this, let’s reinterpret the ‘pure strategies’ i = 1,2,3, \dots n as species. Here I don’t necessarily mean species in the classic biological sense: I just mean different kinds of self-replicating entities, or replicators. For example, they could be different alleles of the same gene.

Similarly, we’ll reinterpret the ‘mixed strategy’ p as describing a mixed population of replicators, where the fraction of replicators belonging to the ith species is p_i. These numbers are still probabilities: p_i is the probability that a randomly chosen replicator will belong to the ith species.

We’ll reinterpret the payoff matrix A_{ij} as a fitness matrix. In our earlier discussion of the replicator equation, we assumed that the population P_i of the ith species grew according to the replicator equation

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) P_i }

where the fitness function f_i is any smooth function of the populations of each kind of replicator.

But in evolutionary game theory it’s common to start by looking at a simple special case where

\displaystyle{f_i(P_1, \dots, P_n)   = \sum_{j=1}^n A_{ij} p_j }


Here

\displaystyle{ p_j = \frac{P_j}{\sum_k P_k} }

is the fraction of replicators who belong to the jth species.

What does this mean? The idea is that we have a well-mixed population of game players—or replicators. Each one has its own pure strategy—or species. Each one randomly roams around and ‘plays games’ with each other replicator it meets. It gets to reproduce at a rate proportional to its expected winnings.

This is unrealistic in all sorts of ways, but it’s mathematically cute, and it’s been studied a lot, so it’s good to know about. Today I’ll explain evolutionarily stable strategies only in this special case. Later I’ll go back to the general case.

Suppose that we select a sample of replicators from the overall population. What is the mean fitness of the replicators in this sample? For this, we need to know the probability that a replicator from this sample belongs to the ith species. Say it’s q_i. Then the mean fitness of our sample is

\displaystyle{ \sum_{i,j=1}^n q_i A_{ij} p_j }

This is just a weighted average of the fitnesses in our earlier formula. But using the magic of vectors, we can write this sum as

q \cdot A p

We already saw this type of expression in the last section! It’s my expected winnings if I play the mixed strategy q and you play the mixed strategy p.

John Maynard Smith defined q to be an evolutionarily stable strategy if, when we add a small population of ‘invaders’ distributed according to any other probability distribution p, the original population is more fit than the invaders.

In simple terms: a small ‘invading’ population will do worse than the population as a whole.

Mathematically, this means:

q \cdot A ((1-\epsilon)q + \epsilon p) >  p \cdot A ((1-\epsilon)q + \epsilon p)

for all mixed strategies p \ne q and all sufficiently small \epsilon > 0. Here

(1-\epsilon)q + \epsilon p

is the population we get by replacing an \epsilon-sized portion of our original population by invaders.

Puzzle: Show that q is an evolutionarily stable strategy if and only if these two conditions hold for all mixed strategies p:

q \cdot A q \ge p \cdot A q

and also, for all q \ne p,

q \cdot A q = p \cdot A q \; \implies \; q \cdot A p > p \cdot A p

The first condition says that q is a symmetric Nash equilibrium. In other words, the invaders can’t on average be better playing against the original population than members of the original population are. The second says that if the invaders are just as good at playing against the original population, they must be worse at playing against each other! The combination of these conditions means the invaders won’t take over.
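Here’s a numerical sketch of these two conditions in Python. It spot-checks them against randomly sampled mixed strategies, so it’s evidence rather than proof; the hawk–dove payoffs below (value 2, cost 4) are a standard textbook example, not something from this post:

```python
import numpy as np

def is_ess(q, A, n_samples=1000, tol=1e-12, seed=0):
    """Spot-check Maynard Smith's two ESS conditions from the puzzle
    against randomly sampled mixed strategies p (a sketch, not a proof)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        p = rng.dirichlet(np.ones(len(A)))   # a random mixed strategy
        qAq, pAq = q @ A @ q, p @ A @ q
        if pAq > qAq + tol:
            return False   # invaders beat q against q: not even Nash
        if abs(pAq - qAq) <= tol and q @ A @ p <= p @ A @ p + tol:
            return False   # invaders tie against q but don't do worse
                           # against themselves
    return True

# hawk-dove with value 2 and cost 4: the mixed strategy
# 'half hawk, half dove' is the classic evolutionarily stable strategy
A = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])
print(is_ess(np.array([0.5, 0.5]), A))   # True
print(is_ess(np.array([0.0, 1.0]), A))   # False: all-dove is invadable by hawks
```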

Again, I should do some examples… but instead I’ll refer you to page 9 of Sandholm’s paper, and also these course notes:

• Samuel Alizon and Daniel Cownden, Evolutionary games and evolutionarily stable strategies.

• Samuel Alizon and Daniel Cownden, Replicator dynamics.

The decrease of relative information

Now comes the punchline… but with a slight surprise twist at the end. Last time we let

P = (P_1, \dots , P_n)

be a population that evolves with time according to the replicator equation, and we let p be the corresponding probability distribution. We supposed q was some fixed probability distribution. We saw that the relative information

I(q,p) = \displaystyle{ \sum_i \ln \left(\frac{q_i}{ p_i }\right) q_i }


obeys

\displaystyle{ \frac{d}{dt} I(q,p) = (p - q) \cdot f(P) }

where f(P) is the vector of fitness functions. So, this relative information can never increase if

(p - q) \cdot f(P) \le 0

for all P.

We can adapt this to the special case we’re looking at now. Remember, right now we’re assuming

\displaystyle{f_i(P_1, \dots, P_n)   = \sum_{j=1}^n A_{ij} p_j }


or in vector notation,

f(P) = A p

Thus, the relative information will never increase if

(p - q) \cdot A p \le 0

or in other words,

q \cdot A p \ge p \cdot A p  \qquad \qquad \qquad \qquad \qquad \qquad  (1)

Now, this looks very similar to the conditions for an evolutionarily stable strategy as stated in the Puzzle above. But it’s not the same! That’s the surprise twist.

Remember, the Puzzle says that q is an evolutionarily stable strategy if for all mixed strategies p we have

q \cdot A q \ge p \cdot A q  \qquad \qquad \qquad \qquad \qquad \qquad (2)

and also

q \cdot A q = p \cdot A q \; \implies \; q \cdot A p > p \cdot A p  \qquad \; (3)

Note that condition (1), the one we want, is neither condition (2) nor condition (3)! This drove me crazy for almost a day.

I kept thinking I’d made a mistake, like mixing up p and q somewhere. You’ve got to mind your p’s and q’s in this game!

But the solution turned out to be this. After Maynard Smith came up with his definition of ‘evolutionarily stable state’, another guy came up with a different definition:

• Bernhard Thomas, On evolutionarily stable sets, J. Math. Biology 22 (1985), 105–115.

For him, an evolutionarily stable strategy obeys

q \cdot A q \ge p \cdot A q  \qquad \qquad \qquad \qquad \qquad \qquad (2)

and also

q \cdot A p \ge p \cdot A p  \qquad \qquad \qquad \qquad \qquad \qquad  (1)

Condition (1) is stronger than condition (3), so he renamed Maynard Smith’s evolutionarily stable strategies weakly evolutionarily stable strategies. And condition (1) guarantees that the relative information I(q,p) can never increase. So, now we’re happy.

Except for one thing: why should we switch from Maynard Smith’s perfectly sensible concept of evolutionarily stable state to this new stronger one? I don’t really know, except that

• it’s not much stronger


• it lets us prove the theorem we want!

So, it’s a small mystery for me to mull over. If you have any good ideas, let me know.
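Here’s a little Python experiment (mine, not Harper’s) illustrating the theorem. For the hawk–dove game with value 2 and cost 4, the mixed equilibrium q = (1/2, 1/2) satisfies condition (1), and along the replicator flow the relative information I(q,p) indeed keeps dropping:

```python
import numpy as np

def replicator_step(p, A, dt):
    """One Euler step of the replicator equation with fitness f = A p:
    dp_i/dt = p_i ((A p)_i - p . A p)."""
    f = A @ p
    return p + dt * p * (f - p @ f)

def relative_info(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return float(np.sum(q * np.log(q / p)))

A = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])
q = np.array([0.5, 0.5])          # satisfies condition (1) for this game

p = np.array([0.9, 0.1])          # start far from equilibrium
infos = []
for _ in range(2000):
    infos.append(relative_info(q, p))
    p = replicator_step(p, A, dt=0.01)

print(infos[0], infos[-1])  # the relative information decreases toward zero
```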

Information Geometry (Part 12)

24 June, 2012

Last time we saw that if a population evolves toward an ‘evolutionarily stable state’, then the amount of information our population has ‘left to learn’ can never increase! It must always decrease or stay the same.

This result sounds wonderful: it’s a lot like the second law of thermodynamics, which says entropy must always increase. Of course there are some conditions for this wonderful result to hold. The main condition is that the population evolves according to the replicator equation. But the other is the existence of an evolutionarily stable state. Last time I wrote down the rather odd-looking definition of ‘evolutionarily stable state’ without justifying it. I need to do that soon. But if you’ve never thought about evolutionary game theory, I think giving you a little background will help. So today let me try that.

Evolutionary game theory

We’ve been thinking of evolution as similar to inference or learning. In this analogy, organisms are like ‘hypotheses’, and the population ‘does experiments’ to see if these hypotheses make ‘correct predictions’ (i.e., can reproduce) or not. The successful ones are reinforced while the unsuccessful ones are weeded out. As a result, the population ‘learns’. And under the conditions of the theorem we discussed last time, the relative information—the amount ‘left to learn’—goes down!

While you might object to various points of this analogy, it’s useful—and that’s really all you can ask of an analogy. It’s useful because it lets us steal chunks of math from the subjects of Bayesian inference and machine learning and apply them to the study of biodiversity and evolution! This is what Marc Harper has been doing:

• Marc Harper, Information geometry and evolutionary game theory.

• Marc Harper, The replicator equation as an inference dynamic.

But now let’s bring in another analogy, also contained in Harper’s work. We can also think of evolution as similar to a game. In this analogy, organisms are like ‘strategies’—or if you prefer, they have strategies. The winners get to reproduce, while the losers don’t. John Maynard Smith started developing this analogy in 1973, and eventually wrote a whole book on it:

• John Maynard Smith, Evolution and the Theory of Games, Cambridge University Press, 1982.

As far as I can tell, evolutionary game theory has brought almost as many chunks of math to game theory as it has taken from it. Maybe it’s just my ignorance showing, but it seems that game theory becomes considerably deeper when we think about games that many players play again and again, with the winners getting to reproduce, while the losers are eliminated.

According to William Sandholm:

The birth of evolutionary game theory is marked by the publication of a series of papers by mathematical biologist John Maynard Smith. Maynard Smith adapted the methods of traditional game theory, which were created to model the behavior of rational economic agents, to the context of biological natural selection. He proposed his notion of an evolutionarily stable strategy (ESS) as a way of explaining the existence of ritualized animal conflict.

Maynard Smith’s equilibrium concept was provided with an explicit dynamic foundation through a differential equation model introduced by Taylor and Jonker. Schuster and Sigmund, following Dawkins, dubbed this model the replicator dynamic, and recognized the close links between this game-theoretic dynamic and dynamics studied much earlier in population ecology and population genetics. By the 1980s, evolutionary game theory was a well-developed and firmly established modeling framework in biology.

Towards the end of this period, economists realized the value of the evolutionary approach to game theory in social science contexts, both as a method of providing foundations for the equilibrium concepts of traditional game theory, and as a tool for selecting among equilibria in games that admit more than one. Especially in its early stages, work by economists in evolutionary game theory hewed closely to the interpretation set out by biologists, with the notion of ESS and the replicator dynamic understood as modeling natural selection in populations of agents genetically programmed to behave in specific ways. But it soon became clear that models of essentially the same form could be used to study the behavior of populations of active decision makers. Indeed, the two approaches sometimes lead to identical models: the replicator dynamic itself can be understood not only as a model of natural selection, but also as one of imitation of successful opponents.

While the majority of work in evolutionary game theory has been undertaken by biologists and economists, closely related models have been applied to questions in a variety of fields, including transportation science, computer science, and sociology. Some paradigms from evolutionary game theory are close relatives of certain models from physics, and so have attracted the attention of workers in this field. All told, evolutionary game theory provides a common ground for workers from a wide range of disciplines.

The Prisoner’s Dilemma

In game theory, the most famous example is the Prisoner’s Dilemma. In its original form, this ‘game’ is played just once:

Two men are arrested, but the police don’t have enough information to convict them. So they separate the two men, and offer both the same deal: if one testifies against his partner (or defects), and the other remains silent (and thus cooperates with his partner), the defector goes free and the cooperator goes to jail for 12 months. If both remain silent, both are sentenced to only 1 month in jail for a minor charge. If they both defect, they both receive a 3-month sentence. Each prisoner must choose either to defect or cooperate with his partner in crime; neither gets to hear what the other decides. What will they do?

Traditional game theory emphasizes the so-called ‘Nash equilibrium’ for this game, in which both prisoners defect. Why don’t they both cooperate? They’d both be better off if they both cooperated. However, for them to both cooperate is ‘unstable’: either one could shorten their sentence by defecting! By definition, a Nash equilibrium has the property that neither player can improve his situation by unilaterally changing his strategy.

In the Prisoner’s Dilemma, the Nash equilibrium is not very nice: both parties would be happier if they’d only cooperate. That’s why it’s called a ‘dilemma’. Perhaps the most tragic example today is global warming. All players would be better off if everyone cooperated to reduce carbon emissions, but each player is better off still if everyone else cooperates while they alone keep emitting more carbon.

For this and many other reasons, people have been interested in ‘solving’ the Prisoner’s Dilemma: that is, finding reasons why cooperation might be favored over defection.

This book got people really excited in seeing what evolutionary game theory has to say about the Prisoner’s Dilemma:

• Robert Axelrod, The Evolution of Cooperation, Basic Books, New York, 1984. (A related article with the same title is available online.)

The idea is that under certain circumstances, strategies that are ‘nicer’ than defection will gradually take over. The most famous of these strategies is ‘tit for tat’, meaning that you cooperate the first time and after that do whatever your opponent just did. I won’t go into this further, because it’s a big digression and I’m already digressing too far. I’ll just mention that from the outlook of evolutionary game theory, the Prisoner’s Dilemma is still full of surprises. Just this week, some fascinating new work has been causing a stir:

• William Press and Freeman Dyson, Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent, Edge, 18 June 2012.

I hope I’ve succeeded in giving you a vague superficial sense of the history of evolutionary game theory and why it’s interesting. Next time I’ll get serious about the task at hand, which is to understand ‘evolutionarily stable strategies’. If you want to peek ahead, try this nice paper:

• William H. Sandholm, Evolutionary game theory, 12 November 2007.

This is where I got the long quote by Sandholm on the history of evolutionary game theory. The original quote contained lots of references; if you’re interested in those, go to page 3 of this paper.

The Mathematics of Biodiversity (Part 2)

24 June, 2012

How likely is it that the next thing we see is one of a brand new kind? That sounds like a hard question. Last time I told you about the Good–Turing rule for answering this question.

The discussion that blog entry triggered has been very helpful! Among other things, it got Lou Jost more interested in this subject. Two days ago, he showed me the following simple argument for the Good–Turing estimate.

Suppose there are finitely many species of orchid. Suppose the fraction of orchids belonging to the ith species is p_i.

Suppose we start collecting orchids. Suppose each time we find one, the chance that it’s an orchid of the ith species is p_i. Of course this is not true in reality! For example, it’s harder to find a tiny orchid, like this:

than a big one. But never mind.

Say we collect a total of N orchids. What is the probability that we find no orchids of the ith species? It is

(1 - p_i)^N

Similarly, the probability that we find exactly one orchid of the ith species is

N p_i (1 - p_i)^{N-1}

And so on: these are the first two terms in a binomial series.

Let n_1 be the expected number of singletons: species for which we find exactly one orchid of that species. Then

\displaystyle{ n_1 = \sum_i N p_i (1 - p_i)^{N-1} }

Let D be the coverage deficit: the expected fraction of the total population consisting of species that remain undiscovered. Given our assumptions, this is the same as the chance that the next orchid we find will be of a brand new species.


We have

\displaystyle{ D = \sum_i p_i (1-p_i)^N }

since p_i is the fraction of orchids belonging to the ith species and (1-p_i)^N is the chance that this species remains undiscovered.

Lou Jost pointed out that the formulas for n_1 and D are very similar! In particular,

\displaystyle{ \frac{n_1}{N} = \sum_i p_i (1 - p_i)^{N-1} }

should be very close to

\displaystyle{ D = \sum_i p_i (1 - p_i)^N }

when N is large. So, we should have

\displaystyle{ D \approx \frac{n_1}{N} }

In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of orchids that are singletons now.
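Here’s a quick numerical check of this in Python, using a made-up species-abundance distribution (three common species and a long tail of rare ones):

```python
import numpy as np

def coverage_deficit(p, N):
    """D = sum_i p_i (1 - p_i)^N: the chance that the next individual
    found belongs to a still-undiscovered species."""
    return float(np.sum(p * (1 - p)**N))

def expected_singletons(p, N):
    """n_1 = sum_i N p_i (1 - p_i)^(N - 1): expected number of species
    represented by exactly one individual in the sample."""
    return float(np.sum(N * p * (1 - p)**(N - 1)))

# made-up abundances: 3 common species plus 80 rare ones
p = np.array([0.3, 0.2, 0.1] + [0.4 / 80] * 80)
N = 200

D = coverage_deficit(p, N)
ratio = expected_singletons(p, N) / N
print(D, ratio)  # the two numbers agree to within about half a percent
```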

Of course it would be nice to turn these ‘shoulds’ into precise theorems! Theorem 1 in this paper does that:

• David McAllester and Robert E. Schapire, On the convergence rate of Good–Turing estimators, February 17, 2000.

By the way: the only difference between the formulas for n_1/N and D is that the first contains the exponent N-1, while the second contains the exponent N. So, Lou Jost’s argument is a version of Boris Borcic’s ‘time-reversal’ idea:

Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.

The Mathematics of Biodiversity (Part 1)

21 June, 2012

I’m in Barcelona now, and I want to blog about this:

Research Program on the Mathematics of Biodiversity, June-July 2012, Centre de Recerca Matemàtica, Barcelona, Spain. Organized by Ben Allen, Silvia Cuadrado, Tom Leinster, Richard Reeve and John Woolliams.

We’re having daily informal talks and there’s no way I can blog about all of them, talk to people here, and still get enough work done. So, I’ll just mention a few things that strike me! For example, this morning Lou Jost told me about an interesting paper by I. J. Good.

I’d known of I. J. Good as one of the guys who came up with the concept of a ‘technological singularity’. In 1963 he wrote:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

He was a British mathematician who worked as a cryptologist at Bletchley Park with Alan Turing. After World War II, he continued to work with Turing on the design of computers and Bayesian statistics at the University of Manchester. Later he moved to the US. In 1968, thanks to his interest in artificial intelligence, he served as consultant for Stanley Kubrick’s film 2001: A Space Odyssey. He died in 2009.

Good was also a big chess enthusiast, and worked on writing programs to play chess. He’s the guy in front here:

If you want to learn more about his work on chess, click on this photo!

But the paper Lou Jost mentioned is on a rather different subject:

• Irving John Good, The population frequency of species and the estimation of population parameters, Biometrika 40 (1953), 237–264.

Let me just state one result, sloppily, without any details or precise hypotheses!

Puzzle: Suppose you go into the jungles of Ecuador and start collecting orchids. You count the number of orchids of each different species that you find. You get a list of numbers, something like this:

14, 10, 8, 6, 2, 1, 1, 1

What is the chance that the next orchid you find will belong to a new species?

Good gives a rule of thumb for solving problems of this type:

\displaystyle{ \frac{n_1}{N} }

Here N is the total number of orchids you collected, and n_i is the number of species for which you found exactly i orchids. In our example,

n_1 = 3

since we found just one orchid of three different species: those are the three 1’s at the end of our list. Furthermore,

N = 14 + 10 + 8 + 6 + 2 + 1 + 1 + 1 = 43

So here is Good’s estimate of the chance that the next orchid you collect will be of a new species:

\displaystyle{ \frac{n_1}{N} = \frac{3}{43} }

Good’s argument is nontrivial—and of course it depends on some assumptions on the nature of the distribution of populations of different species! Since he doesn’t state these assumptions succinctly and I haven’t read the paper carefully yet, I’m afraid you’ll have to read the paper to find out what they are.

Of course the math works for samples of anything that comes in distinct types, not just species of organisms! Good considers four examples:

• moths captured in a light-trap at Rothamsted, England,

• words in American newspapers,

• nouns in Macaulay’s essay on Bacon,

• chess openings in games published by the British Chess Magazine in 1951.

By comparing a small sample to a bigger one, he studies how well his rule works in practice, and apparently it does okay.
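Good’s rule of thumb is trivial to compute. Here’s a quick Python sketch (the function name is mine, not Good’s) applied to the orchid counts from the puzzle:

```python
def prob_new_species(counts):
    """Good-Turing estimate of the chance that the next sample belongs
    to an unseen species: n_1 / N, where n_1 is the number of species
    seen exactly once and N is the total sample size."""
    n1 = sum(1 for c in counts if c == 1)
    N = sum(counts)
    return n1 / N

# The orchid counts from the puzzle: three singleton species, 43 orchids total.
counts = [14, 10, 8, 6, 2, 1, 1, 1]
print(prob_new_species(counts))  # 3/43 ≈ 0.0698
```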

In his paper, I. J. Good thanks Alan Turing for coming up with the basic idea. In fact he says Turing gave an ‘intuitive demonstration’ of it—but he doesn’t give this intuitive demonstration, and according to Lou Jost he actually admits somewhere that he forgot it.

You can read more about the idea here:

Good–Turing frequency estimation.

By the way, Lou Jost is not only an expert on biodiversity and its relation to entropy! He lives in the jungles of Ecuador and has discovered over 60 new species of orchids, including the world’s smallest:

He found it in Ecuador, and the petals are just a few cells thick! (Typically, the news reports say he found it in Bolivia and the petals are just one cell thick.)

He said:

I found it among the roots of another plant that I had collected, another small orchid which I took back to grow in my greenhouse to get it to flower. A few months later I saw that down among the roots was a tiny little plant that I realised was more interesting than the bigger orchid. Looking at the flower is often the best way to be able to identify which species of orchid you’ve got hold of – and can tell you whether you’re looking at an unknown species or not.

Information Geometry (Part 11)

7 June, 2012

Last time we saw that given a bunch of different species of self-replicating entities, the entropy of their population distribution can go either up or down as time passes. This is true even in the pathetically simple case where all the replicators have constant fitness—so they don’t interact with each other, and don’t run into any ‘limits to growth’.

This is a bit of a bummer, since it would be nice to use entropy to explain how replicators are always extracting information from their environment, thanks to natural selection.

Luckily, a slight variant of entropy, called ‘relative entropy’, behaves better. When our replicators have an ‘evolutionary stable state’, the relative entropy is guaranteed to always change in the same direction as time passes!

Thanks to Einstein, we’ve all heard that times and distances are relative. But how is entropy relative?

It’s easy to understand if you think of entropy as lack of information. Say I have a coin hidden under my hand. I tell you it’s heads-up. How much information did I just give you? Maybe 1 bit? That’s true if you know it’s a fair coin and I flipped it fairly before covering it up with my hand. But what if you put the coin down there yourself a minute ago, heads up, and I just put my hand over it? Then I’ve given you no information at all. The difference is the choice of ‘prior’: that is, what probability distribution you attributed to the coin before I gave you my message.

My love affair with relative entropy began in college when my friend Bruce Smith and I read Hugh Everett’s thesis, The Relative State Formulation of Quantum Mechanics. This was the origin of what’s now often called the ‘many-worlds interpretation’ of quantum mechanics. But it also has a great introduction to relative entropy. Instead of talking about ‘many worlds’, I wish people would say that Everett explained some of the mysteries of quantum mechanics using the fact that entropy is relative.

Anyway, it’s nice to see relative entropy showing up in biology.

Relative Entropy

Inscribe an equilateral triangle in a circle. Randomly choose a line segment joining two points of this circle. What is the probability that this segment is longer than a side of the triangle?

This puzzle is called Bertrand’s paradox, because different ways of solving it give different answers. To crack the paradox, you need to realize that it’s meaningless to say you’ll “randomly” choose something until you say more about how you’re going to do it.

In other words, you can’t compute the probability of an event until you pick a recipe for computing probabilities. Such a recipe is called a probability measure.

This applies to computing entropy, too! The formula for entropy clearly involves a probability distribution, even when our set of events is finite:

S = - \sum_i p_i \ln(p_i)

But this formula conceals a fact that becomes obvious when our set of events is infinite. Now the sum becomes an integral:

S = - \int_X p(x) \ln(p(x)) \, d x

And now it’s clear that this formula makes no sense until we choose the measure d x. On a finite set we have a god-given choice of measure, called counting measure. Integrals with respect to this are just sums. But in general we don’t have such a god-given choice. And even for finite sets, working with counting measure is a choice: we are choosing to believe that in the absence of further evidence, all options are equally likely.

Taking this fact into account, it seems like we need two things to compute entropy: a probability distribution p(x), and a measure d x. That’s on the right track. But an even better way to think of it is this:

\displaystyle{ S = - \int_X  \frac{p(x) dx}{dx} \ln \left(\frac{p(x) dx}{dx}\right) \, dx }

Now we see the entropy depends on two measures: the probability measure p(x)  dx we care about, but also the measure d x. Their ratio is important, but that’s not enough: we also need one of these measures to do the integral. Above I used the measure dx to do the integral, but we can also use p(x) dx if we write

\displaystyle{ S = - \int_X \ln \left(\frac{p(x) dx}{dx}\right) p(x) dx }

Either way, we are computing the entropy of one measure relative to another. So we might as well admit it, and talk about relative entropy.

The entropy of the measure d \mu relative to the measure d \nu is defined by:

\begin{array}{ccl} S(d \mu, d \nu) &=& \displaystyle{ - \int_X \frac{d \mu(x) }{d \nu(x)} \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right)  d\nu(x) } \\   \\  &=& \displaystyle{ - \int_X  \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right) d\mu(x) } \end{array}

The second formula is simpler, but the first looks more like summing -p \ln(p), so they’re both useful.

Since we’re taking entropy to be lack of information, we can also get rid of the minus sign and define relative information by

\begin{array}{ccl} I(d \mu, d \nu) &=& \displaystyle{ \int_X \frac{d \mu(x) }{d \nu(x)} \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right)  d\nu(x) } \\   \\  &=& \displaystyle{  \int_X  \ln \left(\frac{d \mu(x)}{ d\nu(x) }\right) d\mu(x) } \end{array}

If you thought something was randomly distributed according to the probability measure d \nu, but then you discover it’s randomly distributed according to the probability measure d \mu, how much information have you gained? The answer is I(d\mu,d\nu).

For more on relative entropy, read Part 6 of this series, where I gave some examples illustrating how it works. Those should convince you that it’s a useful concept.

Okay: now let’s switch back to a more lowbrow approach. In the case of a finite set, we can revert to thinking of our two measures as probability distributions, and write the information gain as

I(q,p) = \displaystyle{  \sum_i  \ln \left(\frac{q_i}{p_i }\right) q_i}

If you want to sound like a Bayesian, call p the prior probability distribution and q the posterior probability distribution. Whatever you call them, I(q,p) is the amount of information you get if you thought p and someone tells you “no, q!”
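In the finite case this is one line of code. Here’s a minimal sketch (the function name is mine), using the coin example from earlier: if your prior said ‘fair coin’ and you then learn it’s certainly heads, you gain ln 2 nats, i.e. one bit.

```python
import math

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i): the information gained on
    updating from the prior p to the posterior q, in nats.
    Terms with q_i = 0 contribute nothing, since x ln(x) -> 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

prior = [0.5, 0.5]      # you thought the coin was fair
posterior = [1.0, 0.0]  # then you learned it's certainly heads
print(relative_information(posterior, prior))  # ln 2 ≈ 0.693 nats = 1 bit
```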

We’ll use this idea to think about how a population gains information about its environment as time goes by, thanks to natural selection. The rest of this post will be an exposition of Theorem 1 in this paper:

• Marc Harper, The replicator equation as an inference dynamic.

Harper says versions of this theorem have previously appeared in work by Ethan Akin, and independently in work by Josef Hofbauer and Karl Sigmund. He also credits others here. An idea this good is rarely noticed by just one person.

The change in relative information

So: consider n different species of replicators. Let P_i be the population of the ith species, and assume these populations change according to the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where each function f_i depends smoothly on all the populations. And as usual, we let

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

be the fraction of replicators in the ith species.

Let’s study the relative information I(q,p) where q is some fixed probability distribution. We’ll see something great happens when q is a stable equilibrium solution of the replicator equation. In this case, the relative information can never increase! It can only decrease or stay constant.

We’ll think about what all this means later. First, let’s see that it’s true! Remember,

\begin{array}{ccl} I(q,p) &=& \displaystyle{ \sum_i  \ln \left(\frac{q_i}{ p_i }\right) q_i }  \\ \\ &=&  \displaystyle{ \sum_i  \Big(\ln(q_i) - \ln(p_i) \Big) q_i } \end{array}

and only p_i depends on time, not q_i, so

\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(q,p)}  &=& \displaystyle{ - \frac{d}{dt} \sum_i \ln(p_i)  q_i }\\   \\  &=& \displaystyle{ - \sum_i \frac{\dot{p}_i}{p_i} \, q_i } \end{array}

where \dot{p}_i is the rate of change of the probability p_i. We saw a nice formula for this in Part 9:

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where we used the abbreviation

f_i(P) = f_i(P_1, \dots, P_n)

for the fitness of the ith species, and

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

is the mean fitness of the species. So, we get

\displaystyle{ \frac{d}{dt} I(q,p) } = \displaystyle{ - \sum_i \Big( f_i(P) - \langle f(P) \rangle  \Big) \, q_i }

Nice, but we can fiddle with this expression to get something more enlightening. Remember, the numbers q_i sum to one. So:

\begin{array}{ccl}  \displaystyle{ \frac{d}{dt} I(q,p) } &=&  \displaystyle{  \langle f(P) \rangle - \sum_i f_i(P) q_i  } \\  \\ &=& \displaystyle{  \sum_i f_i(P) (p_i - q_i)  }  \end{array}

where in the last step I used the definition of the mean fitness. This result looks even cuter if we treat the numbers f_i(P) as the components of a vector f(P), and similarly for the numbers p_i and q_i. Then we can use the dot product of vectors to say

\displaystyle{ \frac{d}{dt} I(q,p) = f(P) \cdot (p - q) }

So, the relative information I(q,p) can never increase if

f(P) \cdot (p - q) \le 0

for all choices of the population P.

And now something really nice happens: this is also the condition for q to be an evolutionarily stable state. This concept goes back to John Maynard Smith, the founder of evolutionary game theory. In 1982 he wrote:

A population is said to be in an evolutionarily stable state if its genetic composition is restored by selection after a disturbance, provided the disturbance is not too large.

I will explain the math next time—I need to straighten out some things in my mind first. But the basic idea is compelling: an evolutionarily stable state is like a situation where our replicators ‘know all there is to know’ about the environment and each other. In any other state, the population has ‘something left to learn’—and the amount left to learn is the relative information we’ve been talking about! But as time goes on, the information still left to learn decreases!
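Here’s a minimal numerical check of the theorem (a sketch with made-up numbers, not Harper’s code). With constant fitnesses, the evolutionarily stable state q concentrates all the population on the fittest species, and a crude Euler integration of the replicator equation shows I(q,p) falling steadily toward zero:

```python
import math

def simulate(f, p, q, dt=0.01, steps=500, report=100):
    """Euler-integrate dp_i/dt = (f_i - <f>) p_i and record the
    relative information I(q, p) = sum_i q_i ln(q_i / p_i) every
    `report` steps."""
    vals = []
    for step in range(steps + 1):
        if step % report == 0:
            vals.append(sum(qi * math.log(qi / pi)
                            for qi, pi in zip(q, p) if qi > 0))
        mean = sum(fi * pi for fi, pi in zip(f, p))
        p = [pi + dt * (fi - mean) * pi for fi, pi in zip(f, p)]
    return vals

# Constant fitnesses; q is the evolutionarily stable state in which
# the fittest species (fitness 3) has completely taken over.
f = [1.0, 2.0, 3.0]
p0 = [0.6, 0.3, 0.1]
q = [0.0, 0.0, 1.0]
print([round(v, 3) for v in simulate(f, p0, q)])  # monotonically decreasing
```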

Note: in the real world, nature has never found an evolutionarily stable state… except sometimes approximately, on sufficiently short time scales, in sufficiently small regions. So we are still talking about an idealization of reality! But that’s okay, as long as we know it.

Information Geometry (Part 10)

4 June, 2012

Last time I began explaining the tight relation between three concepts:

• entropy,

• information—or more precisely, lack of information,

• biodiversity.

The idea is to consider n different species of ‘replicators’. A replicator is any entity that can reproduce itself, like an organism, a gene, or a meme. A replicator can come in different kinds, and a ‘species’ is just our name for one of these kinds. If P_i is the population of the ith species, we can interpret the fraction

\displaystyle{ p_i = \frac{P_i}{\sum_j P_j} }

as a probability: the probability that a randomly chosen replicator belongs to the ith species. This suggests that we define entropy just as we do in statistical mechanics:

\displaystyle{ S = - \sum_i p_i \ln(p_i) }

In the study of statistical inference, entropy is a measure of uncertainty, or lack of information. But now we can interpret it as a measure of biodiversity: it’s zero when just one species is present, and small when a few species have much larger populations than all the rest, but gets big otherwise.

Our goal here is to play these viewpoints off against each other. In short, we want to think of natural selection, and even biological evolution, as a process of statistical inference—or in simple terms, learning.

To do this, let’s think about how entropy changes with time. Last time we introduced a simple model called the replicator equation:

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) \, P_i }

where each population grows at a rate proportional to some ‘fitness functions’ f_i. We can get some intuition by looking at the pathetically simple case where these functions are actually constants, so

\displaystyle{ \frac{d P_i}{d t} = f_i \, P_i }

The equation then becomes trivial to solve:

\displaystyle{ P_i(t) = e^{t f_i } P_i(0)}

Last time I showed that in this case, the entropy will eventually decrease. It will go to zero as t \to +\infty whenever one species is fitter than all the rest and starts out with a nonzero population—since then this species will eventually take over.

But remember, the entropy of a probability distribution is its lack of information. So the decrease in entropy signals an increase in information. And last time I argued that this makes perfect sense. As the fittest species takes over and biodiversity drops, the population is acquiring information about its environment.

However, I never said the entropy is always decreasing, because that’s false! Even in this pathetically simple case, entropy can increase.

Suppose we start with many replicators belonging to one very unfit species, and a few belonging to various more fit species. The probability distribution p_i will start out sharply peaked, so the entropy will start out low:

Now think about what happens when time passes. At first the unfit species will rapidly die off, while the population of the other species slowly grows:


So the probability distribution will, for a while, become less sharply peaked. Thus, for a while, the entropy will increase!

This seems to conflict with our idea that the population’s entropy should decrease as it acquires information about its environment. But in fact this phenomenon is familiar in the study of statistical inference. If you start out with strongly held false beliefs about a situation, the first effect of learning more is to become less certain about what’s going on!

Get it? Say you start out by assigning a high probability to some wrong guess about a situation. The entropy of your probability distribution is low: you’re quite certain about what’s going on. But you’re wrong. When you first start suspecting you’re wrong, you become more uncertain about what’s going on. Your probability distribution flattens out, and the entropy goes up.

So, sometimes learning involves a decrease in information—false information. There’s nothing about the mathematical concept of information that says this information is true.
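This rise-and-fall is easy to see numerically, using the exact solution P_i(t) = e^{t f_i} P_i(0) from above. In this sketch (the numbers are made up for illustration), one unfit species starts with 91% of the population while three fitter species split the rest; the entropy first climbs as the unfit majority dies off, then drops as the fittest species takes over:

```python
import math

def entropy(p):
    """Shannon entropy S = -sum_i p_i ln(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def p_at(t, f, P0):
    """Exact solution of the constant-fitness replicator equation:
    P_i(t) = exp(f_i t) P_i(0), normalized to a probability distribution."""
    P = [math.exp(fi * t) * Pi for fi, Pi in zip(f, P0)]
    total = sum(P)
    return [Pi / total for Pi in P]

f = [1.0, 5.0, 5.5, 6.0]       # one unfit species, three fitter ones
P0 = [0.91, 0.03, 0.03, 0.03]  # the unfit species dominates at t = 0
for t in [0.0, 0.3, 0.6, 2.0, 8.0]:
    print(t, round(entropy(p_at(t, f, P0)), 3))
```

The printed entropy rises for a while before falling toward zero: exactly the ‘unlearning false beliefs’ effect described above.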

Given this, it’s good to work out a formula for the rate of change of entropy, which will let us see more clearly when it goes down and when it goes up. To do this, first let’s derive a completely general formula for the time derivative of the entropy of a probability distribution. Following Sir Isaac Newton, we’ll use a dot to stand for a time derivative:

\begin{array}{ccl} \displaystyle{  \dot{S}} &=& \displaystyle{ -  \frac{d}{dt} \sum_i p_i \ln (p_i)} \\   \\  &=& - \displaystyle{ \sum_i \dot{p}_i \ln (p_i) + \dot{p}_i }  \end{array}

In the last term we took the derivative of the logarithm and got a factor of 1/p_i which cancelled the factor of p_i. But since

\displaystyle{  \sum_i p_i = 1 }

we know

\displaystyle{ \sum_i \dot{p}_i = 0 }

so this last term vanishes:

\displaystyle{ \dot{S}= -\sum_i \dot{p}_i \ln (p_i) }

Nice! To go further, we need a formula for \dot{p}_i. For this we might as well return to the general replicator equation, dropping the pathetically special assumption that the fitness functions are actually constants. Then we saw last time that

\displaystyle{ \dot{p}_i = \Big( f_i(P) - \langle f(P) \rangle  \Big) \, p_i }

where we used the abbreviation

f_i(P) = f_i(P_1, \dots, P_n)

for the fitness of the ith species, and defined the mean fitness to be

\displaystyle{ \langle f(P) \rangle = \sum_i f_i(P) p_i  }

Using this cute formula for \dot{p}_i, we get the final result:

\displaystyle{ \dot{S} = - \sum_i \Big( f_i(P) - \langle f(P) \rangle \Big) \, p_i \ln (p_i) }

This is strikingly similar to the formula for entropy itself. But now each term in the sum includes a factor saying how much more fit than average, or less fit, that species is. The quantity - p_i \ln(p_i) is always nonnegative, since the graph of -x \ln(x) looks like this:

So, the ith term contributes positively to the change in entropy if the ith species is fitter than average, but negatively if it’s less fit than average.

This may seem counterintuitive!

Puzzle 1. How can we reconcile this fact with our earlier observations about the case when the fitness of each species is population-independent? Namely: a) if initially most of the replicators belong to one very unfit species, the entropy will rise at first, but b) in the long run, when the fittest species present take over, the entropy drops?

If this seems too tricky, look at some examples! The first illustrates observation a); the second illustrates observation b):

Puzzle 2. Suppose we have two species, one with fitness equal to 1 initially constituting 90% of the population, the other with fitness equal to 10 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 1, & &  p_1(0) = 0.9 \\ \\                            f_2 = 10 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?

Puzzle 3. Suppose we have two species, one with fitness equal to 10 initially constituting 90% of the population, and the other with fitness equal to 1 initially constituting just 10% of the population:

\begin{array}{ccc} f_1 = 10, & &  p_1(0) = 0.9 \\ \\                            f_2 = 1 , & & p_2(0) = 0.1   \end{array}

At what rate does the entropy change at t = 0? Which species is responsible for most of this change?
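If you work through the puzzles by hand, you can check your answers against a direct numerical evaluation of the formula for \dot{S} above (a quick sketch; the function name is mine):

```python
import math

def entropy_rate(f, p):
    """dS/dt = -sum_i (f_i - <f>) p_i ln(p_i) for the replicator equation."""
    mean = sum(fi * pi for fi, pi in zip(f, p))
    return -sum((fi - mean) * pi * math.log(pi) for fi, pi in zip(f, p))

# Puzzle 2: unfit majority, fit minority.
print(round(entropy_rate([1.0, 10.0], [0.9, 0.1]), 3))  # positive: entropy rising

# Puzzle 3: fit majority, unfit minority.
print(round(entropy_rate([10.0, 1.0], [0.9, 0.1]), 3))  # negative: entropy falling
```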

I had to work through these examples to understand what’s going on. Now I do, and it all makes sense.

Next time

Still, it would be nice if there were some quantity that always goes down with the passage of time, reflecting our naive idea that the population gains information from its environment, and thus loses entropy, as time goes by.

Often there is such a quantity. But it’s not the naive entropy: it’s the relative entropy. I’ll talk about that next time. In the meantime, if you want to prepare, please reread Part 6 of this series, where I explained this concept. Back then, I argued that whenever you’re tempted to talk about entropy, you should talk about relative entropy. So, we should try that here.

There’s a big idea lurking here: information is relative. How much information a signal gives you depends on your prior assumptions about what that signal is likely to be. If this is true, perhaps biodiversity is relative too.