## The Mathematics of Biodiversity (Part 4)

Today the conference part of this program is starting:

Research Program on the Mathematics of Biodiversity, June-July 2012, Centre de Recerca Matemàtica, Barcelona, Spain. Organized by Ben Allen, Silvia Cuadrado, Tom Leinster, Richard Reeve and John Woolliams.

Lou Jost kicked off the proceedings with an impassioned call to think harder about fundamental concepts:

Then Tom Leinster gave an introduction to some of these concepts, and Lou explained how they show up in ecology, genetics, economics and physics.

Suppose we have $n$ different species on an island. Suppose a fraction $p_i$ of the organisms belong to the $i$th species. So,

$\displaystyle{ \sum_{i=1}^n p_i = 1}$

and mathematically we can treat these numbers as probabilities.

People have many ways to compute the ‘biodiversity’ from these numbers. Some of these can be wildly misleading when applied incorrectly, and this has led to shocking errors. For example, in genetics, a commonly used formula for determining when plants or animals on a bunch of islands will split into separate species is completely wrong.

In fact, if we’re not careful, some measures of biodiversity can fool us into thinking we’re saving most of the biodiversity when we’re actually losing almost all of it!

One good example involves measures of similarity between tropical butterflies in the canopy (the top of the forest) and the understory (the bottom). According to Lou Jost, some published studies say the similarity is about 95%. That sounds as if the two communities are almost the same. However, almost no butterflies living in the canopy live in the understory, and vice versa! The problem is that the mathematics is being used inappropriately.

Here are four famous measures of biodiversity:

• Species richness. This is just the number of species:

$n$

• Shannon entropy. This is the expected amount of information you gain when someone tells you which species an organism belongs to:

$\displaystyle{ - \sum_{i=1}^n p_i \ln(p_i) }$

• The inverse Simpson index. This is the reciprocal of the probability that two randomly chosen organisms belong to the same species:

$\displaystyle{ 1 \big/ \sum_{i=1}^n p_i^2 }$

The probability that two organisms belong to the same species is called the Simpson index:

$\displaystyle{ \sum_{i=1}^n p_i^2 }$

This is used in economics as a measure of the concentration of wealth, where $p_i$ is the fraction of wealth owned by the $i$th individual. Be careful: there’s a lot of different jargon in different fields, so it’s easy to get confused at first! For example, the probability that two organisms belong to different species is often called the Gini–Simpson index:

$\displaystyle{ 1 - \sum_{i=1}^n p_i^2 }$

It was introduced by the statistician Corrado Gini a century ago, in 1912, and later by the ecologist Edward H. Simpson in 1949. It’s also called the heterozygosity in genetics.

• The Berger–Parker index. This is the fraction of organisms that belong to the most common species:

$\mathrm{max} \, p_i$

So, unlike the other measures I’ve listed, this quantity tends to go down when biodiversity goes up. To fix this we could take its reciprocal, as we did with the Simpson index.
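To make these four measures concrete, here’s a quick numerical sketch in Python, using a made-up abundance distribution:

```python
import math

# Hypothetical abundances for a 4-species community (they sum to 1).
p = [0.7, 0.2, 0.05, 0.05]

richness = len(p)                              # species richness: n
shannon = -sum(pi * math.log(pi) for pi in p)  # Shannon entropy
simpson = sum(pi**2 for pi in p)               # Simpson index
inv_simpson = 1 / simpson                      # inverse Simpson index
berger_parker = max(p)                         # Berger–Parker index

print(richness, shannon, inv_simpson, berger_parker)
```

Note how the Berger–Parker index is large (0.7) precisely because one species dominates, while the inverse Simpson index comes out well below the species richness of 4.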

What a mess, eh? But here’s some good news: all these quantities are functions of a single quantity, the Rényi entropy:

$\displaystyle{ H_q(p) = \frac{1}{1 -q} \ln \sum_{i=1}^n p_i^q }$

for various values of the parameter $q.$

I’ve written about the Rényi entropies and their role in thermodynamics before on this blog. I’ll also talk about them later in this conference, and I’ll show you my slides. So, I won’t repeat that story here. Suffice it to say that Rényi entropies are fascinating but still a bit mysterious to me.

But one of Lou Jost’s main points is that we can make bad mistakes if we work with Rényi entropies when we should be working with their exponentials, which are called Hill numbers and denoted by a $D$, for ‘diversity’:

$\displaystyle{ {}^qD(p) = e^{H_q(p)} = \left(\sum_{i=1}^n p_i^q \right)^{\frac{1}{1-q}} }$

These were introduced by M. O. Hill in 1973. One reason they’re good is that they are effective numbers. This means that if all the species are equally common, the Hill number equals the number of species, regardless of $q$:

$p_i = \frac{1}{n} \; \Longrightarrow \; {}^qD(p) = n$

So, they’re a way of measuring an ‘effective’ number of species in situations where species are not all equally common.
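Here’s a minimal numerical check of the effective-number property, in plain Python (the function name is mine):

```python
def hill(p, q):
    """Hill number of order q (for q != 1) of a probability distribution p."""
    return sum(pi**q for pi in p) ** (1 / (1 - q))

# When all n species are equally common, the Hill number is n for every q.
n = 5
uniform = [1 / n] * n
for q in (0.5, 2, 3):
    print(q, hill(uniform, q))  # each ≈ 5.0
```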

A closely related fact is that the Hill numbers obey the replication principle. This means that if we have probability distributions on two finite sets, each with Hill number $X$ for some choice of $q,$ and we combine them with equal weights to get a probability distribution on the disjoint union of those sets, the resulting distribution has Hill number $2X.$
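We can sketch the replication principle numerically by gluing together two copies of the same community, each with Hill number $X$ (the function name and the numbers here are mine, for illustration):

```python
def hill(p, q):
    """Hill number of order q (for q != 1) of a probability distribution p."""
    return sum(pi**q for pi in p) ** (1 / (1 - q))

p = [0.5, 0.3, 0.2]
q = 2
X = hill(p, q)

# Equal-weight combination on the disjoint union of two copies of the set:
combined = [pi / 2 for pi in p] * 2
print(hill(combined, q), 2 * X)  # the two values agree
```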

Another good fact is that the Hill numbers are as large as possible when all the probabilities $p_i$ are equal. They’re as small as possible, namely 1, when one of the $p_i$ equals 1 and the rest are zero.

Let’s see how all the measures of biodiversity I listed are either Hill numbers or can easily be converted to Hill numbers. We’ll also see that at $q = 0,$ the Hill number treats all species that are present in an equal way, regardless of their abundance. As $q$ increases, it counts more abundant species more heavily, since we’re raising the probabilities $p_i$ to a bigger power. And when $q = \infty$, we only care about the most abundant species: none of the others matter at all!

Here goes:

• The species richness is the limit of the Hill numbers as $q \to 0$ from above:

$\displaystyle{ \lim_{q \to 0^+} {}^qD (p) = n }$

So, we can just call this ${}^0D(p).$

• The exponential of the Shannon entropy is the limit of the Hill numbers as $q \to 1$:

$\displaystyle{ \lim_{q \to 1} {}^qD(p) = \exp\left(- \sum_{i=1}^n p_i \ln(p_i)\right) }$

So, we can just call this ${}^1D(p).$

• The inverse Simpson index is the Hill number at $q = 2$:

$\displaystyle{ {}^2D(p) = 1 \big/ \sum_{i=1}^n p_i^2 }$

• The reciprocal of the Berger–Parker index is the limit of Hill numbers as $q \to +\infty$:

$\displaystyle{ \lim_{q \to +\infty} {}^qD(p) = 1 \big/ \mathrm{max} \, p_i }$

so we can call this quantity ${}^\infty D(p).$
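All four identifications above can be checked numerically. Here’s a small Python sketch (the `hill` function is my own naming, and the $q \to 0$ and $q \to \infty$ limits are approximated by extreme values of $q$):

```python
import math

def hill(p, q):
    """Hill number of order q, with q = 1 taken as exp(Shannon entropy)."""
    if q == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p if pi > 0))
    return sum(pi**q for pi in p if pi > 0) ** (1 / (1 - q))

p = [0.7, 0.2, 0.05, 0.05]
print(hill(p, 1e-9))   # ≈ 4, the species richness
print(hill(p, 1))      # exp(Shannon entropy)
print(hill(p, 2))      # the inverse Simpson index
print(hill(p, 1000))   # ≈ 1/0.7, the reciprocal of the Berger–Parker index
```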

These facts mean that understanding Hill numbers will help us understand lots of measures of biodiversity! And the good properties of Hill numbers will help us avoid dangerous mistakes.

For mathematicians, a good challenge is to find theorems uniquely characterizing the Hill numbers… preferably with assumptions that biologists will accept as plausible facts about ‘diversity’. Some theorems like this already exist for specific choices of $q,$ but it would be better to characterize the function ${}^q D$ for all values of $q$ in one blow. Tom Leinster is working on such a theorem now.

Another important task is to generalize Hill numbers to take into account things like:

• ‘distances’ between species, measured either genetically, phylogenetically or functionally,

• ‘values’ for species, measured either economically or
any other way.

There’s a lot of work on this, and many of the talks at this conference will discuss these generalizations.

### 13 Responses to The Mathematics of Biodiversity (Part 4)

1. John Baez says:

In case anyone read the post when it first came out, there’s been an improvement since then. Here are the slides of Lou Jost’s talk:

They have some great examples of how biologists have made mistakes about biodiversity by using math incorrectly. Besides the examples listed in my blog post, there are measures of how similar two ecosystems are that can go up when you add different species to each ecosystem.

• romain says:

There’s something that puzzles me in these measures of biodiversity. (Maybe Lou Jost addressed this problem to some extent, but it’s rather hard to read slides without the speech.)

It seems that species are always taken as predefined and that all individuals inside a species are considered identical, so that the variability “within” species is not taken into account. Yet it seems that this is also an important part of biodiversity. In an abstract fitness space, it would amount to having each species cover a subspace instead of a single point (I hope I’m not making a straw man here). And with that picture in mind, it may be better to have a few large subspaces than a lot of dust scattered over the fitness space.

A bit of an exaggeration to make my question clear: is it better to have a given number of individuals of a single species (Homo sapiens) with variability from Himbas to Inuits living under all climates, or to have the same number distributed among 5 species of (somewhat similar) orchids that all live in the Ecuadorian forest?

If I’m not wrong (which is possible), all these measures would argue for the orchids (the “dust” case).

• John Baez says:

Romain wrote:

It seems that species are always taken as predefined and that all individuals inside a species are considered identical. So that the variability “within” species is not taken into account.

There’s nothing in the underlying mathematics that forces us to take this attitude: I’m using the word ‘species’ for an element of the set $\{1, \dots, n\}$ on which our probability distribution is given, but it doesn’t need to be a set of biological species in the traditional sense.

Even better, we could consider a set equipped with a distance function, so that individuals contribute to the biodiversity more when they are more dissimilar from other individuals that are present. This would answer your objection, at least in theory. Incorporating a distance function requires new mathematical ideas—but luckily, these have been developed very nicely here:

• Christina Cobbold and Tom Leinster, Measuring diversity: the importance of species similarity, Ecology 93 (2012), 477–489.

The title says ‘species similarity’ but mathematically we could just as well talk about the similarity of one individual to another, and how this affects biodiversity.

In practice, however, you are right. Researchers have found it difficult to get enough data about variability within species to include this effect when measuring biodiversity. I’ve heard people at this conference complaining about this. There are a lot of interesting issues—mathematical, statistical and experimental—involved in trying to remedy this problem!

• romain says:

Ok, I hadn’t realized the elements of the set could differ from the usual species.

Just thinking out loud here, but wouldn’t mutual information rather than entropy be helpful in such cases?
If we consider a set $S$ of species and a set $R$ of individuals, then we can define entropies $H(S)$ and $H(R)$, and also the entropy $H(R|s_i)$ on the set of individuals of a single species. These different measures could account for the biodiversity of species, across species and within species. Though I’m not sure exactly how to interpret it in terms of biodiversity, we could even compute a mutual information between these two sets.

Funnily enough, in 2009, in a different context (computational neuroscience), I defined an entropy on sets equipped with a distance that is identical to that of Tom Leinster (or Ricotta and Szeidl 2006, which I did not know at that time). But instead of using Rényi entropy, I stuck to Shannon and studied the mutual information (MI) on sets equipped with distances.

In biodiversity, the MI would tell you about how species are different compared to their inner variability. Or conversely, their inner variability would appear as a reference “scale” to measure the between-species variability.

But again, just thinking out loud here (and incidentally, indecently advertising my work…)

• John Baez says:

It would be nice to get a reference to your paper where you defined entropy for probability distributions on sets with distance functions! Tom and Christina would be interested.

• John Baez says:

Sandrine Pavoine has a cool paper that studies variability within species of butterflies:

• V. Stevens, S. Pavoine and M. Baguette, Variation within and between closely related species uncovers high intra-specific variability in dispersal, PloS ONE (2010) 5:e11123.

• romain says:

Here’s a paper in Neural Computation about my work:

(the interesting part is between p.7 and p.14.)

It only deals with mutual information between 2 sets of which only one is equipped with a distance. That’s why there is a caveat in the additivity property. The version with 2 sets equipped with a distance is more elegant but yet to be published.

The gap between communication theory and biodiversity is rather wide and I don’t claim that it can be useful here. But it’s interesting to see that similar ideas appeared in so different fields.

2. gwman3 says:

I’m sure John knows this but for those of us whose math background is less extensive, a slight tweak to the Hill numbers (namely, changing 1-q to q in the outer exponent’s denominator) yields the “generalized mean” to which some of us may have been briefly exposed in late high school or early college.

I remember being impressed at how the generalized mean had as special cases every other mean I’d seen (arithmetic, geometric, harmonic, RMS), with min and max thrown in for good measure.
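For the curious, here’s a quick sketch of that generalized (power) mean in Python, with made-up numbers:

```python
import math

def power_mean(xs, t):
    """Generalized (power) mean of positive numbers xs, with exponent t."""
    n = len(xs)
    if t == 0:
        return math.prod(xs) ** (1 / n)  # the t -> 0 limit: geometric mean
    return (sum(x**t for x in xs) / n) ** (1 / t)

xs = [1.0, 2.0, 4.0]
print(power_mean(xs, 1))    # arithmetic mean
print(power_mean(xs, 0))    # geometric mean
print(power_mean(xs, -1))   # harmonic mean
print(power_mean(xs, 2))    # root mean square
```

(And as $t \to +\infty$ or $t \to -\infty$ the power mean approaches the max or min of the list.)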

• John Baez says:

Good point! Tom Leinster, who is working on the mathematical foundations of biodiversity, has also worked on the math of generalized means—read his blog article on this subject! He’s hoping to use theorems that uniquely characterize the generalized means to get theorems that uniquely characterize the Hill numbers.

3. […] Rényi entropies—and their exponentials, called the Hill numbers—are an important measure of biodiversity. So, I decided to spend a lot of time talking about those […]

4. John Baez says:

Here are two papers that attempt to uniquely characterize the Hill numbers:

• S. Chakravarty and W. Eichhorn, An axiomatic characterization of a generalized entropy index of concentration, Journal of Productivity Analysis 2 (1991) 103–112.

• Heikki Pursiainen, Consistent Aggregation Methods and Index Number Theory, Ph.D. thesis, Faculty of Social Sciences, University of Helsinki, 2005.

5. […] Equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: for starters, there are many measures of biodiversity. […]

6. As I explained a while ago, Rényi entropies are important ways of measuring biodiversity. But here’s what I learned just now, […]