## Maximum Entropy and Ecology

21 February, 2013

I already talked about John Harte’s book on how to stop global warming. Since I’m trying to apply information theory and thermodynamics to ecology, I was also interested in this book of his:

John Harte, Maximum Entropy and Ecology, Oxford U. Press, Oxford, 2011.

There’s a lot in this book, and I haven’t absorbed it all, but let me try to briefly summarize his maximum entropy theory of ecology. This aims to be “a comprehensive, parsimonious, and testable theory of the distribution, abundance, and energetics of species across spatial scales”. One great thing is that he makes quantitative predictions using this theory and compares them to a lot of real-world data. But let me just tell you about the theory.

It’s heavily based on the principle of maximum entropy (MaxEnt for short), and there are two parts:

Two MaxEnt calculations are at the core of the theory: the first yields all the metrics that describe abundance and energy distributions, and the second describes the spatial scaling properties of species’ distributions.

### Abundance and energy distributions

The first part of Harte’s theory is all about a conditional probability distribution

$R(n,\epsilon | S_0, N_0, E_0)$

which he calls the ecosystem structure function. Here:

$S_0$: the total number of species under consideration in some area.

$N_0$: the total number of individuals under consideration in that area.

$E_0$: the total rate of metabolic energy consumption of all these individuals.

Given this,

$R(n,\epsilon | S_0, N_0, E_0) \, d \epsilon$

is the probability that given $S_0, N_0, E_0,$ if a species is picked from the collection of species, then it has $n$ individuals, and if an individual is picked at random from that species, then its rate of metabolic energy consumption is in the interval $(\epsilon, \epsilon + d \epsilon).$

Here of course $d \epsilon$ is ‘infinitesimal’, meaning that we take a limit where it goes to zero to make this idea precise (if we’re doing analytical work) or take it to be very small (if we’re estimating $R$ from data).

I believe that when we ‘pick a species’ we’re treating them all as equally probable, not weighting them according to their number of individuals.

Clearly $R$ obeys some constraints. First, since it’s a probability distribution, it obeys the normalization condition:

$\displaystyle{ \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) = 1 }$

Second, since the average number of individuals per species is $N_0/S_0,$ we have:

$\displaystyle{ \sum_n \int d \epsilon \; n R(n,\epsilon | S_0, N_0, E_0) = N_0 / S_0 }$

Third, since the average over species of the total rate of metabolic energy consumption of individuals within the species is $E_0/ S_0,$ we have:

$\displaystyle{ \sum_n \int d \epsilon \; n \epsilon R(n,\epsilon | S_0, N_0, E_0) = E_0 / S_0 }$

Harte’s theory is that $R$ maximizes entropy subject to these three constraints. Here entropy is defined by

$\displaystyle{ - \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) \ln(R(n,\epsilon | S_0, N_0, E_0)) }$

Harte uses this theory to calculate $R,$ and tests the results against data from about 20 ecosystems. For example, he predicts the abundance of species as a function of their rank, with rank 1 being the most abundant, rank 2 being the second most abundant, and so on. And he gets results like this:

The data here are from:

• Green, Harte, and Ostling’s work on a serpentine grassland,

• Luquillo’s work on a 10.24-hectare tropical forest, and

• Cocoli’s work on a 2-hectare wet tropical forest.

The fit looks good to me… but I should emphasize that I haven’t had time to study these matters in detail. For more, you can read this paper, at least if your institution subscribes to this journal:

• J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

### Spatial abundance distribution

The second part of Harte’s theory is all about a conditional probability distribution

$\Pi(n | A, n_0, A_0)$

This is the probability that $n$ individuals of a species are found in a region of area $A$ given that it has $n_0$ individuals in a larger region of area $A_0.$

$\Pi$ obeys two constraints. First, since it’s a probability distribution, it obeys the normalization condition:

$\displaystyle{ \sum_n \Pi(n | A, n_0, A_0) = 1 }$

Second, since the mean value of $n$ across regions of area $A$ equals $n_0 A/A_0,$ we have

$\displaystyle{ \sum_n n \Pi(n | A, n_0, A_0) = n_0 A/A_0 }$

Harte’s theory is that $\Pi$ maximizes entropy subject to these two constraints. Here entropy is defined by

$\displaystyle{- \sum_n \Pi(n | A, n_0, A_0)\ln(\Pi(n | A, n_0, A_0)) }$

Harte explains two approaches to use this idea to derive ‘scaling laws’ for how $n$ varies with $n$. And again, he compares his predictions to real-world data, and get results that look good to my (amateur, hasty) eye!

I hope sometime I can dig deeper into this subject. Do you have any ideas, or knowledge about this stuff?

## The Mathematics of Biodiversity (Part 8)

14 July, 2012

Last time I mentioned that estimating entropy from real-world data is important not just for measuring biodiversity, but also for another area of biology: neurobiology!

When you look at something, neurons in your eye start firing. But how, exactly, is their firing related to what you see? Questions like this are hard! Answering them— ‘cracking the neural code’—is a big challenge. To make progress, neuroscientists are using information theory. But as I explained last time, estimating information from experimental data is tricky.

Romain Brasselet, now a postdoc at the Max Planck Institute for Biological Cybernetics at Tübingen, is working on these topics. He sent me a nice email explaining this area.

This is a bit of a digression, but the Mathematics of Biodiversity program in Barcelona has been extraordinarily multidisciplinary, with category theorists rubbing shoulders with ecologists, immunologists and geneticists. One of the common themes is entropy and its role in biology, so I think it’s worth posting Romain’s comments here. This is what he has to say…

### Information in neurobiology

I will try to explain why neurobiologists are today very interested in reliable estimates of entropy/information and what are the techniques we use to obtain them.

The activity of sensory as well as more central neurons is known to be modulated by external stimulations. In 1926, in a seminal paper, Adrian observed that neurons in the sciatic nerve of the frog fire action potentials (or spikes) when some muscle in the hindlimb is stretched. In addition, he observed that the frequency of the spikes increases with the amplitude of the stretching.

• E.D. Adrian, The impulses produced by sensory nerve endings. (1926).

For another very nice example, in 1962, Hubel and Wiesel found neurons in the cat visual cortex whose activity depends on the orientation of a visual stimulus, a simple black line over white background: some neurons fire preferentially for one orientation of the line (Hubel and Wiesel were awarded the 1981 Nobel Prize in Physiology for their work). This incidentally led to the concept of “receptive field” which is of tremendous importance in neurobiology—but though it’s fascinating, it’s a different topic.

Good, we are now able to define what makes a neuron tick. The problem is that neural activity is often very “noisy”: when the exact same stimulus is presented many times, the responses appear to be very different from trial to trial. Even careful observation cannot necessarily reveal correlations between the stimulations and the neural activity. So we would like a measure capable of capturing the statistical dependencies between the stimulation and the response of the neuron to know if we can say something about the stimulation just by observing the response of a neuron, which is essentially the task of the brain. In particular, we want a fundamental measure that does not rely on any assumption about the functioning of the brain. Information theory provides the tools to do this, that is why we like to use it: we often try to measure the mutual information between stimuli and responses.

To my knowledge, the first paper using information theory in neuroscience was by MacKay and McCulloch in 1952:

• Donald M. Mackay and Warren S. McCulloch, The limiting information capacity of a neuronal link, Bulletin of Mathematical Biophysics 14 (1952), 127–135.

But information theory was not used in neuroscience much until the early 90′s. It started again with a paper by Bialek et al. in 1991:

• W. Bialek, F. Rieke, R. R. de Ruyter van Steveninck and D. Warland, Reading a neural code, Science 252 (1991), 1854–1857.

However, when applying information-theoretic methods to biological data, we often have a limited sampling of the neural response, we are usually very happy when we have 50 trials for a given stimulus. Why is this limited sample a problem?

During the major part of the 20th century, following Adrian’s finding, the paradigm for the neural code was the frequency of the spikes or, equivalently, the number of spikes in a window of time. But in the early 90′s, it was observed that the exact timing of spikes is (in some cases) reliable across trials. So instead of considering the neural response as a single number (the number of spikes), the temporal patterns of spikes started to be taken into account. But time is continuous, so to be able to do actual computations, time was discretized and a neural response became a binary string.

Now, if you consider relevant time-scales, say, a 100 millisecond time window with a 1 millisecond bin with a firing frequency of about 50 per second, then your response space is huge and the estimates of information with only 50 trials are not reliable anymore. That’s why a lot of efforts have been carried to overcome the limited sampling bias.

Now, getting at the techniques developed in this field, John already mentioned the work by Liam Paninski, but here are other very interesting references:

• Stefano Panzeri and Alessandro Treves, Analytical estimates of limited sampling biases in different information measures, Network: Computation in Neural Systems 7 (1996), 87–107.

They computed the first-order bias of the information (related to the Miller–Madow correction) and then used a Bayesian technique to estimate the number of responses not included in the sample but that would be in an infinite sample (a goal similar to that of Good’s rule of thumb).

• S.P. Strong, R. Koberle, R.R. de Ruyter van Steveninck, and W. Bialek, Entropy and information in neural spike trains, Phys. Rev. Lett. 80 (1998), 197–200.

The entropy (or if you prefer, information) estimate can be expanded in a power series in $N$ (the sample size) around the true value. By computing the estimate for various values of $N$ and fitting it with a parabola, it is possible to estimate the value of the entropy as $N \rightarrow \infty.$

These approaches are also well-known:

• Ilya Nemenman, Fariel Shafee and William Bialek, Entropy and inference, revisited, 2002.

• Alexander Kraskov, Harald Stögbauer and Peter Grassberger, Estimating mutual information, Phys. Rev. E. 69 (2004), 066138.

Actually, Stefano Panzeri has quite a few impressive papers about this problem, and recently with colleagues he has made public a free Matlab toolbox for information theory (www.ibtb.org) implementing various correction methods.

Finally, the work by Jonathan Victor is worth mentioning, since he provided (to my knowledge again) the first estimate of mutual information using geometry. This is of particular interest with respect to the work by Christina Cobbold and Tom Leinster on measures of biodiversity that take the distance between species into account:

• J. D. Victor and K. P. Purpura, Nature and precision of temporal coding in visual cortex: a metric-space analysis, Journal of Neural Physiology 76 (1996), 1310–1326.

He introduced a distance between sequences of spikes and from this, derived a lower bound on mutual information.

• Jonathan D. Victor, Binless strategies for estimation of information from neural data, Phys. Rev. E. 66 (2002), 051903.

Taking inspiration from work by Kozachenko and Leonenko, he obtained an estimate of the information based on the distances between the closest responses.

Without getting too technical, that’s what we do in neuroscience about the limited sampling bias. The incentive is that obtaining reliable estimates is crucial to understand the ‘neural code’, the holy grail of computational neuroscientists.

## The Mathematics of Biodiversity (Part 7)

12 July, 2012

How ignorant are you?

Do you know?

Do you know how much don’t you know?

It seems hard to accurately estimate your lack of knowledge. It even seems hard to say precisely how hard it is. But the cool thing is, we can actually extract an interesting math question from this problem. And one answer to this question leads to the following conclusion:

There’s no unbiased way to estimate how ignorant you are.

But the devil is in the details. So let’s see the details!

The Shannon entropy of a probability distribution is a way of measuring how ignorant we are when this probability distribution describes our knowledge.

For example, suppose all we care about is whether this ancient Roman coin will land heads up or tails up:

If we know there’s a 50% chance of it landing heads up, that’s a Shannon entropy of 1 bit: we’re missing one bit of information.

But suppose for some reason we know for sure it’s going to land heads up. For example, suppose we know the guy on this coin is the emperor Pupienus Maximus, a egomaniac who had lead put on the back of all coins bearing his likeness, so his face would never hit the dirt! Then the Shannon entropy is 0: we know what’s going to happen when we toss this coin.

Or suppose we know there’s a 90% it will land heads up, and a 10% chance it lands tails up. Then the Shannon entropy is somewhere in between. We can calculate it like this:

$- 0.9 \log_2 (0.9) - 0.1 \log_2 (0.1) = 0.46899...$

so that’s how many bits of information we’re missing.

But now suppose we have no idea. Suppose we just start flipping the coin over and over, and seeing what happens. Can we estimate the Shannon entropy?

Here’s a naive way to do it. First, use your experimental data to estimate the probability that that the coin lands heads-up. Then, stick that probability into the formula for Shannon entropy. For example, say we flip the coin 3 times and it lands head-up once. Then we can estimate the probability of it landing heads-up as 1/3, and tails-up as 2/3. So we can estimate that the Shannon entropy is

$\displaystyle{ - \frac{1}{3} \log_2 (\frac{1}{3}) -\frac{2}{3} \log_2 (\frac{2}{3}) = 0.918... }$

But it turns out that this approach systematically underestimates the Shannon entropy!

Say we have a coin that lands up a certain fraction of the time, say $p.$ And say we play this game: we flip our coin $n$ times, see what we get, and estimate the Shannon entropy using the simple recipe I just illustrated.

Of course, our estimate will depend on the luck of the game. But on average, it will be less than the actual Shannon entropy, which is

$- p \log_2 (p) - (1-p) \log_2 (1-p)$

We can prove this mathematically. But it shouldn’t be surprising. After all, if $n = 1,$ we’re playing a game where we flip the coin just once. And with this game, our naive estimate of the Shannon entropy will always be zero! Each time we play the game, the coin will either land heads up 100% of the time, or tails up 100% of the time!

If we play the game with more coin flips, the error gets less severe. In fact it approaches zero as the number of coin flips gets ever larger, so that $n \to \infty.$ The case where you flip the coin just once is an extreme case—but extreme cases can be good to think about, because they can indicate what may happen in less extreme cases.

One moral here is that naively generalizing on the basis of limited data can make you feel more sure you know what’s going on than you actually are.

I hope you knew that already!

But we can also say, in a more technical way, that the naive way of estimating Shannon entropy is a biased estimator: the average value of the estimator is different from the value of the quantity being estimated.

Here’s an example of an unbiased estimator. Say you’re trying to estimate the probability that the coin will land heads up. You flip it $n$ times and see that it lands up $m$ times. You estimate that the probability is $m/n.$ That’s the obvious thing to do, and it turns out to be unbiased.

Statisticians like to think about estimators, and being unbiased is one way an estimator can be ‘good’. Beware: it’s not the only way! There are estimators that are unbiased, but whose standard deviation is so huge that they’re almost useless. It can be better to have an estimate of something that’s more accurate, even though on average it’s a bit too low. So sometimes, a biased estimator can be more useful than an unbiased estimator.

Nonetheless, my ears perked up when Lou Jost mentioned that there is no unbiased estimator for Shannon entropy. In rough terms, the moral is that:

There’s no unbiased way to estimate how ignorant you are.

I think this is important. For example, it’s important because Shannon entropy is also used as a measure of biodiversity. Instead of flipping a coin repeatedly and seeing which side lands up, now we go out and collect plants or animals, and see which species we find. The relative abundance of different species defines a probability distribution on the set of species. In this language, the moral is:

There’s no unbiased way to estimate biodiversity.

But of course, this doesn’t mean we should give up. We may just have to settle for an estimator that’s a bit biased! And people have spent a bunch of time looking for estimators that are less biased than the naive one I just described.

By the way, equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: there are many measures of biodiversity. The Shannon entropy is just a special case of the Rényi entropy, which depends on a parameter $q$: we get Shannon entropy when $q = 1.$

As $q$ gets smaller, the Rényi entropy gets more and more sensitive to rare species—or shifting back to the language of probability theory, rare events. It’s the rare events that make Shannon entropy hard to estimate, so I imagine there should be theorems about estimators for Rényi entropy, which say it gets harder to estimate as $q$ gets smaller. Do you know such theorems?

Also, I should add that biodiversity is better captured by the ‘Hill numbers’, which are functions of the Rényi entropy, than by the Rényi entropy itself. (See here for the formulas.) Since these functions are nonlinear, the lack of an unbiased estimator for Rényi entropy doesn’t instantly imply the same for the Hill numbers. So there are also some obvious questions about unbiased estimators for Hill numbers. Do you know answers to those?

Here are some papers on estimators for entropy. Most of these focus on estimating the Shannon entropy of a probability distribution on a finite set.

This old classic has a proof that the ‘naive’ estimator of Shannon entropy is biased, and estimates on the bias:

• Bernard Harris, The statistical estimation of entropy in the non-parametric case, Army Research Office, 1975.

He shows the bias goes to zero as we increase the number of samples: the number I was calling $n$ in my coin flip example. In fact he shows the bias goes to zero like $O(1/n).$ This is big big O notation which means that as $n \to +\infty,$ the bias is bounded by some constant times $1/n.$ This constant depends on the size of our finite set—or, if you want to do better, the class number, which is the number of elements on which our probability distribution is nonzero.

Using this idea, he shows that you can find a less biased estimator if you have a probability distribution $p_i$ on a finite set and you know that exactly $k$ of these probabilities are nonzero. To do this, just take the ‘naive’ estimator I described earlier and add $(k-1)/2n.$ This is called the Miller–Madow bias correction. The bias of this improved estimator goes to zero like $O(1/n^2).$

The problem is that in practice you don’t know ahead of time how many probabilities are nonzero! In applications to biodiversity this would amount to knowing ahead of time how many species exist, before you go out looking for them.

But what about the theorem that there’s no unbiased estimator for Shannon entropy? The best reference I’ve found is this:

• Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191-1254.

In Proposition 8 of Appendix A, Paninski gives a quick proof that there is no unbiased estimator of Shannon entropy for probability distributions on a finite set. But his paper goes far beyond this. Indeed, it seems like a pretty definitive modern discussion of the whole subject of estimating entropy. Interestingly, this subject is dominated by neurobiologists studying entropy of signals in the brain! So, lots of his examples involve brain signals.

Another overview, with tons of references, is this:

• J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: an overview.

This paper focuses on the situation where don’t know ahead of time how many probabilities are nonzero:

• Anne Chao and T.-J. Shen, Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample, Environmental and Ecological Statistics 10 (2003), 429&–443.

In 2003 there was a conference on the problem of estimating entropy, whose webpage has useful information. As you can see, it was dominated by neurobiologists:

Estimation of entropy and information of undersampled probability distributions: theory, algorithms, and applications to the neural code, Whistler, British Columbia, Canada, 12 December 2003.

By the way, I was very confused for a while, because these guys claim to have found an unbiased estimator of Shannon entropy:

• Stephen Montgomery Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.

However, their way of estimating entropy has a funny property: in the language of biodiversity, it’s only well-defined if our samples include at least one species of each organism. So, we cannot compute this estimate for an arbitary list of $n$ samples. This means it’s not estimator in the usual sense—the sense that Paninski is using! So it doesn’t really contradict Paninski’s result.

To wrap up, let me state Paninski’s result in a mathematically precise way. Suppose $p$ is a probability distribution on a finite set $X$. Suppose $S$ is any number we can compute from $p$: that is, any real-valued function on the set of probability distributions. We’ll be interested in the case where $S$ is the Shannon entropy:

$\displaystyle{ S = -\sum_{x \in X} p(x) \, \log p(x) }$

Here we can use whatever base for the logarithm we like: earlier I was using base 2, but that’s not sacred. Define an estimator to be any function

$\hat{S}: X^n \to \mathbb{R}$

The idea is that given $n$ samples from the set $X,$ meaning points $x_1, \dots, x_n \in X,$ the estimator gives a number $\hat{S}(x_1, \dots, x_n)$. This number is supposed to estimate some feature of the probability distribution $p$: for example, its entropy.

If the samples are independent and distributed according to the distribution $p,$ the sample mean of the estimator will be

$\displaystyle{ \langle \hat{S} \rangle = \sum_{x_1, \dots, x_n \in X} \hat{S}(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) }$

The bias of the estimator is the difference between the sample mean of the estimator and actual value of $S$:

$\langle \hat{S} \rangle - S$

The estimator $\hat{S}$ is unbiased if this bias is zero for all $p.$

Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy! The proof is very short…

Okay, that’s all for today.

I’m back in Singapore now; I learned so much at the Mathematics of Biodiversity conference that there’s no way I’ll be able to tell you all that information. I’ll try to write a few more blog posts, but please be aware that my posts so far give a hopelessly biased and idiosyncratic view of the conference, which would be almost unrecognizable to most of the participants. There are a lot of important themes I haven’t touched on at all… while this business of entropy estimation barely came up: I just find it interesting!

If more of you blogged more, we wouldn’t have this problem.

## The Mathematics of Biodiversity (Part 6)

6 July, 2012

Here are two fun botany stories I learned today from Lou Jost.

### The decline and fall of the Roman Empire

I thought Latin was a long-dead language… except in Finland, where 75,000 people regularly listen to the news in Latin. That’s cool, but surely the last time someone seriously needed to write in Latin was at least a century ago… right?

No! Until the beginning of 2012, botanists reporting new species were required to do so in Latin.

Like this:

Arbor ad 8 alta, raminculis sparse pilosis, trichomatis 2-2.5 mm longis. Folia persistentia; laminae anisophyllae, foliis majoribus ellipticus, 12-23.5 cm longis, 6-13 cm latis, minoribus orbicularis, ca 8.5 cm longis, 7.5 cm latis, apice acuminato et caudato, acuminibus 1.5-2 cm longis, basi rotundata ad obtusam, margine integra, supra sericea, trichomatis 2.5-4 mm longis, appressis, pagina inferiore sericea ad pilosam, trichomatis 2-3 mm longis; petioli 4-7 mm longi. Inflorescentia terminalis vel axillaris, cymosa, 8-10 cm latis. Flores bisexuales; calyx tubularis, ca. 6 mm longus, 10-costatus; corolla alba, tubularis, 5-lobata; stamina 5, filis 8-10 mm longis, pubescentia ad insertionem.

The International Botanical Congress finally voted last year to drop this requirement. So, the busy people who are discovering about 2000 species of plants, algae and fungi each year no longer need to file their reports in the language of the Roman Empire.

### Orchid Fever

The first person who publishes a paper on a new species of plant gets to name it. Sometimes the competition is fierce, as for the magnificent orchid shown above, Phragmipedium kovachii.

Apparently one guy beat another, his archenemy, by publishing an article just a few days earlier. But the other guy took his revenge by getting the first guy arrested for illegally taking an endangered orchid out of Peru. The first guy wound up getting two years’ probation and a \$1,000 fine.

But, he got his name on the orchid!

I believe the full story appears here:

• Eric Hansen, Orchid Fever: A Horticultural Tale of Love, Lust, and Lunacy, Vintage Books, New York, 2001.

You can read a summary here.

### Ecominga

By the way, Lou Jost is not only a great discoverer of new orchid species and a biologist deeply devoted to understanding the mathematics of biodiversity. He also runs a foundation called Ecominga, which runs a number of nature reserves in Ecuador, devoted to preserving the amazing biodiversity of the Upper Pastaza Watershed. This area contains over 190 species of plants not found anywhere else in the world, as well as spectacled bears, mountain tapirs, and an enormous variety of birds.

The forests here are being cut down… but Ecominga has bought thousands of hectares in key locations, and is protecting them. They need money to pay the locals who patrol and run the reserves. It’s not a lot of money in the grand scheme of things—a few thousand dollars a month. So if you’re interested, go to the Ecominga website, check out the information and reports and pictures, and think about giving them some help! Or for that matter, contract me and I’ll put you in touch with him.

## The Mathematics of Biodiversity (Part 5)

3 July, 2012

I’d be happy to get your feedback on these slides of the talk I’m giving the day after tomorrow:

• John Baez, Diversity, entropy and thermodynamics, 6 July 2012, Exploratory Conference on the Mathematics of Biodiversity, Centre de Recerca Matemàtica, Barcelona.

Abstract: As is well known, some popular measures of biodiversity are formally identical to measures of entropy developed by Shannon, Rényi and others. This fact is part of a larger analogy between thermodynamics and the mathematics of biodiversity, which we explore here. Any probability distribution can be extended to a 1-parameter family of probability distributions where the parameter has the physical meaning of ‘temperature’. This allows us to introduce thermodynamic concepts such as energy, entropy, free energy and the partition function in any situation where a probability distribution is present—for example, the probability distribution describing the relative abundances of different species in an ecosystem. The Rényi entropy of this probability distribution is closely related to the change in free energy with temperature. We give one application of thermodynamic ideas to population dynamics, coming from the work of Marc Harper: as a population approaches an ‘evolutionary optimum’, the amount of Shannon information it has ‘left to learn’ is nonincreasing. This fact is closely related to the Second Law of Thermodynamics.

This talk is rather different than the one I’d envisaged giving! There was a lot of interest in my work on Rényi entropy and thermodynamics, because Rényi entropies—and their exponentials, called the Hill numbers—are an important measure of biodiversity. So, I decided to spend a lot of time talking about that.

## The Mathematics of Biodiversity (Part 4)

2 July, 2012

Today the conference part of this program is starting:

Research Program on the Mathematics of Biodiversity, June-July 2012, Centre de Recerca Matemàtica, Barcelona, Spain. Organized by Ben Allen, Silvia Cuadrado, Tom Leinster, Richard Reeve and John Woolliams.

Lou Jost kicked off the proceedings with an impassioned call to think harder about fundamental concepts:

Then Tom Leinster gave an introduction to some of these concepts, and Lou explained how they show up in ecology, genetics, economics and physics.

Suppose we have $n$ different species on an island. Suppose a fraction $p_i$ of the organisms belong to the $i$th species. So,

$\displaystyle{ \sum_{i=1}^n p_i = 1}$

and mathematically we can treat these numbers as probabilities.

People have many ways to compute the ‘biodiversity’ from these numbers. Some of these can be wildly misleading when applied incorrectly, and this has led to shocking errors. For example, in genetics, a commonly used formula for determining when plants or animals on a bunch of islands will split into separate species is completely wrong.

In fact, if we’re not careful, some measures of biodiversity can fool us into thinking we’re saving most of the biodiversity when we’re actually losing almost all of it!

One good example involves measures of similarity between tropical butterflies in the canopy (the top of the forest) and the understory (the bottom). According to Lou Just, some published studies say the similarity is about 95%. That sounds like the two communities are almost the same. However, almost no butterflies living in the canopy live in the understory, and vice versa! The problem is that mathematics is being used inappropriately.

Here are four famous measures of biodiversity:

Species richness. This is just the number of species:

$n$

Shannon entropy. This is the expected amount of information you gain when someone tells you which species an organism belongs to:

$\displaystyle{ - \sum_{i=1}^n p_i \ln(p_i) }$

• The inverse Simpson index. This is the reciprocal of the probability that two randomly chosen organisms belong to the same species:

$\displaystyle{ 1 \big/ \sum_{i=1}^n p_i^2 }$

The probability that two organisms belong to the same species is called the Simpson index:

$\displaystyle{ \sum_{i=1}^n p_i^2 }$

This is used in economics as a measure of the concentration of wealth, where $p_i$ is the fraction of wealth owned by the $i$th individual. Be careful: there’s a lot of different jargon in different fields, so it’s easy to get confused at first! For example, the probability that two organisms belong to different species is often called the Gini–Simpson index:

$\displaystyle{ 1 - \sum_{i=1}^n p_i^2 }$

It was introduced by the statistician Corrado Gini a century ago, in 1912 and the ecologist Edward H. Simpson in 1949. It’s also called the heterozygosity in genetics.

• The Berger–Parker index. This is the fraction of organisms that belong to the most common species:

$\mathrm{max} \, p_i$

So, unlike the other main ones I’ve listed, this quantity tends to go down when biodiversity goes up. To fix this we could take its reciprocal, as we did with the Simpson index.

What a mess, eh? But here’s some good news: all these quantities are functions of a single quantity, the Rényi entropy:

$\displaystyle{ H_q(p) = \frac{1}{1 -q} \ln \sum_{i=1}^n p_i^q }$

for various values of the parameter $q.$

I’ve written about the Rényi entropies and their role in thermodynamics before on this blog. I’ll also talk about it later in this conference, and I’ll show you my slides. So, I won’t repeat that story here. Suffice it to say that Rényi entropies are fascinating but still a bit mysterious to me.

But one of Lou Jost’s main points is that we can make bad mistakes if we work with Rényi entropies when we should be working with their exponentials, which are called Hill numbers and denoted by a $D$, for ‘diversity’:

$\displaystyle{ {}^qD(p) = e^{H_q(p)} = \left(\sum_{i=1}^n p_i^q \right)^{\frac{1}{1-q}} }$

These were introduced by M. O. Hill in 1973. One reason they’re good is that they are effective numbers. This means that if all the species are equally common, the Hill number equals the number of species, regardless of $q$:

$p_i = \frac{1}{n} \; \Longrightarrow \; {}^qD(p) = n$

So, they’re a way of measuring an ‘effective’ number of species in situations where species are not all equally common.

A closely related fact is that the Hill numbers obey the replication principle. This means that if we have probability distributions on two finite sets, each with Hill number $X$ for some choice of $q,$ and we combine them with equal weights to get a probability distribution on the disjoint union of those sets, the resulting distribution has Hill number $2X.$

Another good fact is that the Hill numbers are as large as possible when all the probabilities $p_i$ are equal. They’re as small as possible, namely 1, when one of the $p_i$ equals 1 and the rest are zero.

Let’s see how all the measures of biodiversity I listed are either Hill numbers or can easily be converted to Hill numbers. We’ll also see that at $q = 0,$ the Hill number treats all species that are present in an equal way, regardless of their abundance. As $q$ increases, it counts more abundant species more heavily, since we’re raising the probabilities $p_i$ to a bigger power. And when $q = \infty$, we only care about the most abundant species: none of the others matter at all!

Here goes:

• The species richness is the limit of the Hill numbers as $q \to 0$ from above:

$\displaystyle{ \lim_{q \to 0^+} {}^qD (p) = n }$

So, we can just call this ${}^0D(p).$

• The exponential of the Shannon entropy is the limit of the Hill numbers as $q \to 1$:

$\displaystyle{ \lim_{q \to 1} {}^qD(p) = \exp\left(- \sum_{i=1}^n p_i \ln(p_i)\right) }$

So, we can just call this ${}^1D(p).$

• The inverse Simpson index is the Hill number at $q = 2$:

$\displaystyle{ {}^2D(p) = 1 \big/ \sum_{i=1}^n p_i^2 }$

• The reciprocal of the Berger–Parker index is the limit of Hill numbers as $q \to +\infty$:

$\displaystyle{ \lim_{q \to +\infty} {}^qD(p) = 1 \big/ \mathrm{max} \, p_i }$

so we can call this quantity ${}^\infty D(p).$

These facts mean that understanding Hill numbers will help us understand lots of measures of biodiversity! And the good properties of Hill numbers will help us avoid dangerous mistakes.

For mathematicians, a good challenge is to find theorems uniquely characterizing the Hill numbers…. preferably with assumptions that biologists will accept as plausible facts about ‘diversity’. Some theorems like this already exist for specific choices of $q,$ but it will be better to characterize the function ${}^q D$ for all values of $q$ in one blow. Tom Leinster is working on such a theorem now.

Another important task is to generalize Hill numbers to take into account things like:

• ‘distances’ between species, measured either genetically, phylogenetically or functionally,

• ‘values’ for species, measured either economically or
any other way.

There’s a lot of work on this, and many of the talks here conference will discuss these generalizations.

## The Mathematics of Biodiversity (Part 3)

27 June, 2012

We tend to think of biodiversity as a good thing, but sometimes it’s deadly. Yesterday Andrei Korobeinikov gave a talk on ‘Viral evolution within a host’, which was mainly about AIDS.

The virus that causes this disease, HIV, can reproduce very fast. In an untreated patient near death there are between 1010 and 1012 new virions per day! Remember, a virion is an individual virus particle. The virus also has a high mutation rate: about 3 × 10-5 mutations per generation for each base—that is, each molecule of A,T,C, or G in the RNA of the virus. That may not seem like a lot, but if you multiply it by 1012 you’ll see that a huge number of new variations of each base arise within the body of a single patient.

So, evolution is at work within you as you die.

And in fact, many scientists believe that the diversity of the virus eventually overwhelms your immune system! Although it’s apparently not quite certain, it seems that while the body generates B cells and T cells to attack different variants of HIV as they arise, they eventually can’t keep up with the sheer number of variants.

Of course, the fact that the HIV virus attacks the immune system makes the disearse even worse. Here in blue you see the number of T cells per cubic millimeter of blood, and in red you see the number of virions per cubic centimeter of blood for a typical untreated patient:

Mathematicians and physicists have looked at some very simple models to get a qualitative understanding of these issues. One famous paper that started this off is:

• Lev S. Tsimring, Herbert Levine and David A. Kessler, RNA virus evolution via a fitness-space model, Phys. Rev. Lett. 76 (1996), 4440–4443.

The idea here is to say that at any time $t$ the viruses have a probability density $p(r,t)$ of having fitness $r$. In fact the different genotypes of the virus form a cloud in a higher-dimensional space, but these authors are treating that space is 1-dimensional, with fitness as its one coordinate, just to keep things simple. They then write down an equation for how the population density changes with time:

$\displaystyle{\frac{\partial }{\partial t}p(r,t) = (r - \langle r \rangle)\, p(r,t) + D \frac{\partial^2 }{\partial r}p(r,t) - \frac{\partial}{\partial r}(v_{\mathrm{drift}}\, p(r,t)) }$

This is a replication-mutation-drift equation. If we just had

$\displaystyle{\frac{\partial }{\partial t}p(r,t) = (r - \langle r \rangle)\, p(r,t) }$

this would be a version of the replicator equation, which I explained recently in Information Geometry (Part 9). Here

$\displaystyle{ \langle r \rangle = \int_0^\infty r p(r,t) dr }$

is the mean fitness, and the replicator equations says that the fraction of organisms of a given type grows at a rate proportional to how much their fitness exceeds the mean fitness: that’s where the $(r - \langle r \rangle)$ comes from.

$\displaystyle{\frac{\partial }{\partial t}p(r,t) = D \frac{\partial^2 }{\partial r^2}p(r,t) }$

this would be the heat equation, which describes diffusion occurring at a rate $D$. This models the mutation of the virus, though not in a very realistic way.

$\displaystyle{\frac{\partial}{\partial t} p(r,t) = - \frac{\partial}{\partial r}(v_{\mathrm{drift}} \, p(r,t)) }$

the fitness of the virus would increase at rate equal to the drift velocity $v_{\mathrm{drift}}$.

If we include both the diffusion and drift terms:

$\displaystyle{\frac{\partial }{\partial t} p(r,t) = D \frac{\partial^2 }{\partial r^2}p(r,t) - \frac{\partial}{\partial r}(v_{\mathrm{drift}} \, p(r,t)) }$

we get the Fokker–Planck equation. This is a famous model of something that’s spreading while also drifting along at a constant velocity: for example, a drop of ink in moving water. Its solutions look like this:

Here we start with stuff concentrated at one point, and it spreads out into a Gaussian while drifting along.

By the way, watch out: what biologists call ‘genetic drift’ is actually a form of diffusion, not what physicists call ‘drift’.

More recently, people have looked at another very simple model. You can read about it here:

• Martin A. Nowak, and R. M. May, Virus Dynamics, Oxford University Press, Oxford, 2000.

In this model the variables are:

• the number of healthy human cells of some type, $\mathrm{H}(t)$

• the number of infected human cells of that type, $\mathrm{I}(t)$

• the number of virions, $\mathrm{V}(t)$

These are my names for variables, not theirs. It’s just a sick joke that these letters spell out ‘HIV’.

Chemists like to describe how molecules react and turn into other molecules using ‘chemical reaction networks’. You’ve seen these if you’ve taken chemistry, but I’ve been explaining more about the math of these starting in Network Theory (Part 17). We can also use them here! Though May and Nowak probably didn’t put it this way, we can consider a chemical reaction network with the following 6 reactions:

• the production of a healthy cell:

$\longrightarrow \mathrm{H}$

• the infection of a healthy cell by a virion:

$\mathrm{H} + \mathrm{V} \longrightarrow \mathrm{I}$

• the production of a virion by an infected cell:

$\mathrm{I} \longrightarrow \mathrm{I} + \mathrm{V}$

• the death of a healthy cell:

$\mathrm{H} \longrightarrow$

• the death of a infected cell:

$\mathrm{I} \longrightarrow$

• the death of a virion:

$\mathrm{V} \longrightarrow$

Using a standard recipe which I explained, we can get from this chemical reaction network to some ‘rate equations’ saying how the number of healthy cells, infected cells and virions changes with time:

$\displaystyle{ \frac{d\mathrm{H}}{dt} = \alpha - \beta \mathrm{H}\mathrm{V} - \gamma \mathrm{H} }$

$\displaystyle{ \frac{d\mathrm{I}}{dt} = \beta \mathrm{H}\mathrm{V} - \delta \mathrm{I} }$

$\displaystyle{ \frac{d\mathrm{V}}{dt} = - \beta \mathrm{H}\mathrm{V} + \epsilon \mathrm{I} - \zeta \mathrm{V} }$

The Greek letters are constants called ‘rate constants’, and there’s one for each of the 6 reactions. The equations we get this way are exactly those described by Nowak and May!

What Andrei Korobeinikov is to unify the ideas behind the two models I’ve described here. Alas, I don’t have the energy to explain how. Indeed, I don’t even have the energy to explain what the models I’ve described actually predict. Sad, but true.

I don’t see anything online about Korobeinikov’s new work, but you can read some of his earlier work here:

• Andrei Korobeinikov, Global properties of basic virus dynamics models.

• Suzanne M. O’Regan, Thomas C. Kelly, Andrei Korobeinikov, Michael J. A. O’Callaghan and Alexei V. Pokrovskii, Lyapunov functions for SIR and SIRS epidemic models, Appl. Math. Lett. 23 (2010), 446-448.

The SIR and SIRS models are models of disease that also arise from chemical reaction networks. I explained them back in Network Theory (Part 3). That was before I introduced the terminology of chemical reaction networks… back then I was talking about ‘stochastic Petri nets’, which are an entirely equivalent formalism. Here’s the stochastic Petri net for the SIRS model:

Puzzle: Draw the stochastic Petri net for the HIV model discussed above. It should have 3 yellow circles and 6 aqua squares.

## Information Geometry (Part 13)

26 June, 2012

Last time I gave a sketchy overview of evolutionary game theory. Now let’s get serious.

I’ll start by explaining ‘Nash equilibria’ for 2-person games. These are situations where neither player can profit by changing what they’re doing. Then I’ll introduce ‘mixed strategies’, where the players can choose among several strategies with different probabilities. Then I’ll introduce evolutionary game theory, where we think of each strategy as a species, and its probability as the fraction of organisms that belong to that species.

Back in Part 9, I told you about the ‘replicator equation’, which says how these fractions change with time thanks to natural selection. Now we’ll see how this leads to the idea of an ‘evolutionarily stable strategy’. And finally, we’ll see that when evolution takes us toward such a stable strategy, the amount of information the organisms have ‘left to learn’ keeps decreasing!

### Nash equilibria

We can describe a certain kind of two-person game using a payoff matrix, which is an $n \times n$ matrix $A_{ij}$ of real numbers. We think of $A_{ij}$ as the payoff that either player gets if they choose strategy $i$ and their opponent chooses strategy $j.$

Note that in this kind of game, there’s no significant difference between the ‘first player’ and the ‘second player’: either player wins an amount $A_{ij}$ if they choose strategy $i$ and their opponent chooses strategy $j.$ So, this kind of game is called symmetric even though the matrix $A_{ij}$ may not be symmetric. Indeed, it’s common for this matrix to be antisymmetric, meaning $A_{ij} = - A_{ji},$ since in this case what one player wins, the other loses. Games with this extra property are called zero-sum games. But we won’t limit ourselves to those!

We say a strategy $i$ is a symmetric Nash equilibrium if

$A_{ii} \ge A_{ji}$

for all $j.$ This means that if both players use strategy $i,$ neither gains anything by switching to another strategy.

For example, suppose our matrix is

$\left( \begin{array}{rr} -1 & -12 \\ 0 & -3 \end{array} \right)$

Then we’ve got the Prisoner’s Dilemma exactly as described last time! Here strategy 1 is cooperate and strategy 2 is defect. If a player cooperates and so does his opponent, he wins

$A_{11} = -1$

meaning he gets one month in jail. We include a minus sign because ‘winning a month in jail’ is not a good thing. If the player cooperates but his opponent defects, he gets a whole year in jail:

$A_{12} = -12$

If he defects but his opponent cooperates, he doesn’t go to jail at all:

$A_{21} = 0$

And if they both defect, they both get three months in jail:

$A_{22} = -3$

You can see that defecting is a Nash equilibrium, since

$A_{22} \ge A_{12}$

So, oddly, if our prisoners know game theory and believe Nash equilibria are best, they’ll both be worse off than if they cooperate and don’t betray each other.

### Nash equilibria for mixed strategies

So far we’ve been assuming that with 100% certainty, each player chooses one strategy $i = 1,2,3,\dots, n.$ Since we’ll be considering more general strategies in a minute, let’s call these pure strategies.

Now let’s throw some probability theory into the stew! Let’s allow the players to pick different pure strategies with different probabilities. So, we define a mixed strategy to be a probability distribution on the set of pure strategies. In other words, it’s a list of $n$ nonnegative numbers

$p_i \ge 0$

that sum to one:

$\displaystyle{ \sum_{i=1}^n p_i = 1 }$

Say I choose the mixed strategy $p$ while you, my opponent, choose the mixed strategy $q.$ Say our choices are made independently. Then the probability that I choose the pure strategy $i$ while you chose $j$ is

$p_i q_j$

so the expected value of my winnings is

$\displaystyle{ \sum_{i,j = 1}^n p_i A_{ij} q_j }$

or using vector notation

$p \cdot A q$

where the dot is the usual dot product on $\mathbb{R}^n.$

We can easily adapt the concept of Nash equilibrium to mixed strategies. A mixed strategy $q$ is a symmetric Nash equilibrium if for any other mixed strategy $p,$

$q \cdot A q \ge p \cdot A q$

This means that if both you and I are playing the mixed strategy $q,$ I can’t improve my expected winnings by unilaterally switching to the mixed strategy $p.$ And neither can you, because the game is symmetric!

If this were a course on game theory, I would now do some examples. But it’s not, so I’ll just send you to page 6 of Sandholm’s paper: he looks at some famous games like ‘hawks and doves’ and ‘rock paper scissors’.

### Evolutionarily stable strategies

We’re finally ready to discuss evolutionarily stable strategies. To do this, let’s reinterpret the ‘pure strategies’ $i = 1,2,3, \dots n$ as species. Here I don’t necessarily mean species in the classic biological sense: I just mean different kinds of self-replicating entities, or replicators. For example, they could be different alleles of the same gene.

Similarly, we’ll reinterpret the ‘mixed strategy’ $p$ as describing a mixed population of replicators, where the fraction of replicators belonging to the $i$th species is $p_i.$ These numbers are still probabilities: $p_i$ is the probability that a randomly chosen replicator will belong to the $i$th species.

We’ll reinterpret the payoff matrix $A_{ij}$ as a fitness matrix. In our earlier discussion of the replicator equation, we assumed that the population $P_i$ of the $i$th species grew according to the replicator equation

$\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) P_i }$

where the fitness function $f_i$ is any smooth function of the populations of each kind of replicator.

But in evolutionary game theory it’s common to start by looking at a simple special case where

$\displaystyle{f_i(P_1, \dots, P_n) = \sum_{j=1}^n A_{ij} p_j }$

where

$\displaystyle{ p_j = \frac{P_j}{\sum_k P_k} }$

is the fraction of replicators who belong to the $j$th species.

What does this mean? The idea is that we have a well-mixed population of game players—or replicators. Each one has its own pure strategy—or species. Each one randomly roams around and ‘plays games’ with each other replicator it meets. It gets to reproduce at a rate proportional to its expected winnings.

This is unrealistic in all sorts of ways, but it’s mathematically cute, and it’s been studied a lot, so it’s good to know about. Today I’ll explain evolutionarily stable strategies only in this special case. Later I’ll go back to the general case.

Suppose that we select a sample of replicators from the overall population. What is the mean fitness of the replicators in this sample? For this, we need to know the probability that a replicator from this sample belongs to the $i$th species. Say it’s $q_j.$ Then the mean fitness of our sample is

$\displaystyle{ \sum_{i,j=1}^n q_i A_{ij} p_j }$

This is just a weighted average of the fitnesses in our earlier formula. But using the magic of vectors, we can write this sum as

$q \cdot A p$

We already saw this type of expression in the last section! It’s my expected winnings if I play the mixed strategy $q$ and you play the mixed strategy $p.$

John Maynard Smith defined $q$ to be evolutionarily stable strategy if when we add a small population of ‘invaders’ distributed according to any other probability distribution $p,$ the original population is more fit than the invaders.

In simple terms: a small ‘invading’ population can’t do better than the population as a whole.

Mathematically, this means:

$q \cdot A ((1-\epsilon)q + \epsilon p) > p \cdot A ((1-\epsilon)q + \epsilon p)$

for all mixed strategies $p$ and all sufficiently small $\epsilon \ge 0 .$ Here

$(1-\epsilon)q + \epsilon p$

is the population we get by replacing an $\epsilon$-sized portion of our original population by invaders.

Puzzle: Show that $q$ is a weakly evolutionarily stable strategy if and only these two conditions hold for all mixed stategies $p:$

$q \cdot A q \ge p \cdot A q$

and also

$q \cdot A q = p \cdot A q \; \implies \; q \cdot A p > p \cdot A p$

The first condition says that $q$ is a symmetric Nash equilibrium. In other words, the invaders can’t on average be better playing against the original population than members of the original population are. The second says that if the invaders are just as good at playing against the original population, they must be worse at playing against each other! The combination of these conditions means the invaders won’t take over.

Again, I should do some examples… but instead I’ll refer you to page 9 of Sandholm’s paper, and also these course notes:

• Samuel Alizon and Daniel Cownden, Evolutionary games and evolutionarily stable strategies.

• Samuel Alizon and Daniel Cownden, Replicator dynamics.

### The decrease of relative information

Now comes the punchline… but with a slight surprise twist at the end. Last time we let

$P = (P_1, \dots , P_n)$

be a population that evolves with time according to the replicator equation, and we let $p$ be the corresponding probability distribution. We supposed $q$ was some fixed probability distribution. We saw that the relative information

$I(q,p) = \displaystyle{ \sum_i \ln \left(\frac{q_i}{ p_i }\right) q_i }$

obeys

$\displaystyle{ \frac{d}{dt} I(q,p) = (p - q) } \cdot f(P)$

where $f(P)$ is the vector of fitness functions. So, this relative information can never increase if

$(p - q) \cdot f(P) \le 0$

for all $P$.

We can adapt this to the special case we’re looking at now. Remember, right now we’re assuming

$\displaystyle{f_i(P_1, \dots, P_n) = \sum_{j=1}^n A_{ij} p_j }$

so

$f(P) = A p$

Thus, the relative information will never increase if

$(p - q) \cdot A p \le 0$

or in other words,

$q \cdot A p \ge p \cdot A p \qquad \qquad \qquad \qquad \qquad \qquad (1)$

Now, this looks very similar to the conditions for an evolutionary stable strategy as stated in the Puzzle above. But it’s not the same! That’s the surprise twist.

Remember, the Puzzle says that $q$ is an evolutionarily stable state if for all mixed strategies $p$ we have

$q \cdot A q \ge p \cdot A q \qquad \qquad \qquad \qquad \qquad \qquad (2)$

and also

$q \cdot A q = p \cdot A q \; \implies \; q \cdot A p > p \cdot A p \qquad \; (3)$

Note that condition (1), the one we want, is neither condition (2) nor condition (3)! This drove me crazy for almost a day.

I kept thinking I’d made a mistake, like mixing up $p$ and $q$ somewhere. You’ve got to mind your p’s and q’s in this game!

But the solution turned out to be this. After Maynard Smith came up with his definition of ‘evolutionarily stable state’, another guy came up with a different definition:

• Bernhard Thomas, On evolutionarily stable sets, J. Math. Biology 22 (1985), 105–115.

For him, an evolutionarily stable strategy obeys

$q \cdot A q \ge p \cdot A q \qquad \qquad \qquad \qquad \qquad \qquad (2)$

and also

$q \cdot A p \ge p \cdot A p \qquad \qquad \qquad \qquad \qquad \qquad (1)$

Condition (1) is stronger than condition (3), so he renamed Maynard Smith’s evolutionarily stable strategies weakly evolutionarily stable strategies. And condition (1) guarantees that the relative information $I(q,p)$ can never increase. So, now we’re happy.

Except for one thing: why should we switch from Maynard Smith’s perfectly sensible concept of evolutionarily stable state to this new stronger one? I don’t really know, except that

• it’s not much stronger

and

• it lets us prove the theorem we want!

So, it’s a small mystery for me to mull over. If you have any good ideas, let me know.

## Information Geometry (Part 12)

24 June, 2012

Last time we saw that if a population evolves toward an ‘evolutionarily stable state’, then the amount of information our population has ‘left to learn’ can never increase! It must always decrease or stay the same.

This result sounds wonderful: it’s a lot like the second law of thermodynamics, which says entropy must always increase. Of course there are some conditions for this wonderful result to hold. The main condition is that the population evolves according to the replicator equation. But the other is the existence of an evolutionarily stable state. Last time I wrote down the rather odd-looking definition of ‘evolutionary stable state’ without justifying it. I need to do that soon. But if you’ve never thought about evolutionary game theory, I think giving you a little background will help. So today let me try that.

### Evolutionary game theory

We’ve been thinking of evolution as similar to inference or learning. In this analogy, organisms are like ‘hypotheses’, and the population ‘does experiments’ to see if these hypotheses make ‘correct predictions’ (i.e., can reproduce) or not. The successful ones are reinforced while the unsuccessful ones are weeded out. As a result, the population ‘learns’. And under the conditions of the theorem we discussed last time, the relative information—the amount ‘left to learn’—goes down!

While you might object to various points of this analogy, it’s useful—and that’s really all you can ask of an analogy. It’s useful because it lets us steal chunks of math from the subjects of Bayesian inference and machine learning and apply them to the study of biodiversity and evolution! This is what Marc Harper has been doing:

• Marc Harper, Information geometry and evolutionary game theory.

• Marc Harper, The replicator equation as an inference dynamic.

But now let’s bring in another analogy, also contained in Harper’s work. We can also think of evolution as similar to a game. In this analogy, organisms are like ‘strategies’—or if you prefer, they have strategies. The winners get to reproduce, while the losers don’t. John Maynard Smith started developing this analogy in 1973, and eventually wrote a whole book on it:

• John Maynard Smith, Evolution and the Theory of Games, Cambridge University Press, 1982.

As far as I can tell, evolutionary game theory has brought almost as many chunks of math to game theory as it has taken from it. Maybe it’s just my ignorance showing, but it seems that game theory becomes considerably deeper when we think about games that many players play again and again, with the winners getting to reproduce, while the losers are eliminated.

According to William Sandholm:

The birth of evolutionary game theory is marked by the publication of a series of papers by mathematical biologist John Maynard Smith. Maynard Smith adapted the methods of traditional game theory, which were created to model the behavior of rational economic agents, to the context of biological natural selection. He proposed his notion of an evolutionarily stable strategy (ESS) as a way of explaining the existence of ritualized animal conflict.

Maynard Smith’s equilibrium concept was provided with an explicit dynamic foundation through a diff erential equation model introduced by Taylor and Jonker. Schuster and Sigmund, following Dawkins, dubbed this model the replicator dynamic, and recognized the close links between this game-theoretic dynamic and dynamics studied much earlier in population ecology and population genetics. By the 1980s, evolutionary game theory was a well-developed and firmly established modeling framework in biology.

Towards the end of this period, economists realized the value of the evolutionary approach to game theory in social science contexts, both as a method of providing foundations for the equilibrium concepts of traditional game theory, and as a tool for selecting among equilibria in games that admit more than one. Especially in its early stages, work by economists in evolutionary game theory hewed closely to the interpretation set out by biologists, with the notion of ESS and the replicator dynamic understood as modeling natural selection in populations of agents genetically programmed to behave in specific ways. But it soon became clear that models of essentially the same form could be used to study the behavior of populations of active decision makers. Indeed, the two approaches sometimes lead to identical models: the replicator dynamic itself can be understood not only as a model of natural selection, but also as one of imitation of successful opponents.

While the majority of work in evolutionary game theory has been undertaken by biologists and economists, closely related models have been applied to questions in a variety of fields, including transportation science, computer science, and sociology. Some paradigms from evolutionary game theory are close relatives of certain models from physics, and so have attracted the attention of workers in this field. All told, evolutionary game theory provides a common ground for workers from a wide range of disciplines.

### The Prisoner’s Dilemma

In game theory, the most famous example is the Prisoner’s Dilemma. In its original form, this ‘game’ is played just once:

Two men are arrested, but the police don’t have enough information to convict them. So they separate the two men, and offer both the same deal: if one testifies against his partner (or defects), and the other remains silent (and thus cooperates with his partner), the defector goes free and the cooperator goes to jail for 12 months. If both remain silent, both are sentenced to only 1 month in jail for a minor charge. If they both defect, they both receive a 3-month sentence. Each prisoner must choose either to defect or cooperate with his partner in crime; neither gets to hear what the other decides. What will they do?

Traditional game theory emphasizes the so-called ‘Nash equilibrium’ for this game, in which both prisoners defect. Why don’t they both cooperate? They’d both be better off if they both cooperated. However, for them to both cooperate is ‘unstable’: either one could shorten their sentence by defecting! By definition, a Nash equilibrium has the property that neither player can improve his situation by unilaterally changing his strategy.

In the Prisoner’s Dilemma, the Nash equilibrium is not very nice: both parties would be happier if they’d only cooperate. That’s why it’s called a ‘dilemma’. Perhaps the most tragic example today is global warming. Even if all players would be better off if all cooperate to reduce carbon emissions, any one will be better off if everybody except themselves cooperates while they emit more carbon.

For this and many other reasons, people have been interested in ‘solving’ the Prisoner’s Dilemma: that is, finding reasons why cooperation might be favored over defection.

This book got people really excited in seeing what evolutionary game theory has to say about the Prisoner’s Dilemma:

• Robert Axelrod, The Evolution of Cooperation, Basic Books, New York, 1984. (A related article with the same title is available online.)

The idea is that under certain circumstances, strategies that are ‘nicer’ than defection will gradually take over. The most famous of these strategies is ‘tit for tat’, meaning that you cooperate the first time and after that do whatever your opponent just did. I won’t go into this further, because it’s a big digression and I’m already digressing too far. I’ll just mention that from the outlook of evolutionary game theory, the Prisoner’s Dilemma is still full of surprises. Just this week, some fascinating new work has been causing a stir:

• William Press and Freeman Dyson, Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent, Edge, 18 June 2012.

I hope I’ve succeeded in giving you a vague superficial sense of the history of evolutionary game theory and why it’s interesting. Next time I’ll get serious about the task at hand, which is to understand ‘evolutionarily stable strategies’. If you want to peek ahead, try this nice paper:

• William H. Sandholm, Evolutionary game theory, 12 November 2007.

This is where I got the long quote by Sandholm on the history of evolutionary game theory. The original quote contained lots of references; if you’re interested in those, go to page 3 of this paper.

## The Mathematics of Biodiversity (Part 2)

24 June, 2012

How likely is it that the next thing we see is one of a brand new kind? That sounds like a hard question. Last time I told you about the Good–Turing rule for answering this question.

The discussion that blog entry triggered has been very helpful! Among other things, it got Lou Jost more interested in this subject. Two days ago, he showed me the following simple argument for the Good–Turing estimate.

Suppose there are finitely many species of orchid. Suppose the fraction of orchids belonging to the $i$th species is $p_i.$

Suppose we start collecting orchids. Suppose each time we find one, the chance that it’s an orchid of the $i$th species is $p_i.$ Of course this is not true in reality! For example, it’s harder to find a tiny orchid, like this:

than a big one. But never mind.

Say we collect a total of $N$ orchids. What is the probability that we find no orchids of the $i$th species? It is

$(1 - p_i)^N$

Similarly, the probability that we find exactly one orchid of the $i$th species is

$N p_i (1 - p_i)^{N-1}$

And so on: these are the first two terms in a binomial series.

Let $n_1$ be the expected number of singletons: species for which we find exactly one orchid of that species. Then

$\displaystyle{ n_1 = \sum_i N p_i (1 - p_i)^{N-1} }$

Let $D$ be the coverage deficit: the expected fraction of the total population consisting of species that remain undiscovered. Given our assumptions, this is the same as the chance that the next orchid we find will be of a brand new species.

Then

$\displaystyle{ D = \sum_i p_i (1-p_i)^N }$

since $p_i$ is the fraction of orchids belonging to the $i$th species and $(1-p_i)^N$ is the chance that this species remains undiscovered.

Lou Jost pointed out that the formulas for $n_1$ and $D$ are very similar! In particular,

$\displaystyle{ \frac{n_1}{N} = \sum_i p_i (1 - p_i)^{N-1} }$

should be very close to

$\displaystyle{ D = \sum_i p_i (1 - p_i)^N }$

when $N$ is large. So, we should have

$\displaystyle{ D \approx \frac{n_1}{N} }$

In other words: the chance that the next orchid we find is of a brand new species should be close to the fraction of orchids that are singletons now.

Of course it would be nice to turn these ‘shoulds’ into precise theorems! Theorem 1 in this paper does that:

• David McAllester and Robert E. Schapire, On the convergence rate of Good–Turing estimators, February 17, 2000.

By the way: the only difference between the formulas for $n_1/N$ and $D$ is that the first contains the exponent $N-1,$ while the second contains the exponent $N.$ So, Lou Jost’s argument is a version of Boris Borcic’s ‘time-reversal’ idea:

Good’s estimate is what you immediately obtain if you time-reverse your sampling procedure, e.g., if you ask for the probability that there is a change in the number of species in your sample when you randomly remove a specimen from it.