Fisher’s Fundamental Theorem (Part 3)

Last time we stated and proved a simple version of Fisher’s fundamental theorem of natural selection, which says that under some conditions, the rate of increase of the mean fitness equals the variance of the fitness. But the conditions we gave were very restrictive: namely, that the fitness of each species of replicator is constant, not depending on how many of these replicators there are, or any other replicators.

To broaden the scope of Fisher’s fundamental theorem we need to do one of two things:

1) change the left side of the equation: talk about some quantity other than the rate of change of mean fitness.

2) change the right side of the equation: talk about some quantity other than the variance in fitness.

Or we could do both! People have spent a lot of time generalizing Fisher’s fundamental theorem. I don’t think there are, or should be, any hard rules on what counts as a generalization.

But today we’ll take alternative 1). We’ll show the square of something called the ‘Fisher speed’ always equals the variance in fitness. One nice thing about this result is that we can drop the restrictive condition I mentioned. Another nice thing is that the Fisher speed is a concept from information theory! It’s defined using the Fisher metric on the space of probability distributions.

And yes—that metric is named after the same guy who proved Fisher’s fundamental theorem! So, arguably, Fisher should have proved this generalization of Fisher’s fundamental theorem. But in fact it seems that I was the first to prove it, around February 1st, 2017. Some similar results were already known, and I will discuss those someday. But they’re a bit different.

A good way to think about the Fisher speed is that it’s ‘the rate at which information is being updated’. A population of replicators of different species gives a probability distribution. Like any probability distribution, this has information in it. As the populations of our replicators change, the Fisher speed tells us the rate at which this information is being updated. So, in simple terms, we’ll show

The square of the rate at which information is updated is equal to the variance in fitness.

This is quite a change from Fisher’s original idea, namely:

The rate of increase of mean fitness is equal to the variance in fitness.

But it has the advantage of always being true… as long as the population dynamics are described by the general framework we introduced last time. So let me remind you of the general setup, and then prove the result!

The setup

We start out with population functions P_i \colon \mathbb{R} \to (0,\infty), one for each species of replicator i = 1,\dots,n, obeying the Lotka–Volterra equation

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) P_i }

for some differentiable functions f_i \colon (0,\infty)^n \to \mathbb{R} called fitness functions. The probability of a replicator being in the ith species is

\displaystyle{  p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)} }

Using the Lotka–Volterra equation we showed last time that these probabilities obey the replicator equation

\displaystyle{ \dot{p}_i = \left( f_i(P) - \overline f(P) \right)  p_i }

Here P is short for the whole list of populations (P_1(t), \dots, P_n(t)), and

\displaystyle{ \overline f(P) = \sum_j f_j(P) p_j  }

is the mean fitness.
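Here is a small numerical sanity check of the step from the Lotka–Volterra equation to the replicator equation. It is only a sketch: the three fitness functions are hypothetical ones I made up for illustration, and the only thing being tested is that differentiating p_i = P_i / \sum_j P_j with the quotient rule gives the same answer as the right side of the replicator equation.

```python
def fitness(P):
    # Hypothetical population-dependent fitness functions (not from the post).
    total = sum(P)
    return [1.0 - 0.1 * total, 0.5 - 0.05 * P[1], 0.8 - 0.02 * P[0] * P[2]]

def probabilities(P):
    total = sum(P)
    return [Pi / total for Pi in P]

def p_dot_from_lotka_volterra(P):
    # Quotient rule applied to p_i = P_i / sum_j P_j, using dP_i/dt = f_i(P) P_i.
    f = fitness(P)
    P_dot = [fi * Pi for fi, Pi in zip(f, P)]
    total, total_dot = sum(P), sum(P_dot)
    return [(Pd * total - Pi * total_dot) / total**2 for Pd, Pi in zip(P_dot, P)]

def p_dot_from_replicator(P):
    # Right side of the replicator equation: (f_i - mean fitness) * p_i.
    f = fitness(P)
    p = probabilities(P)
    mean_f = sum(fj * pj for fj, pj in zip(f, p))
    return [(fi - mean_f) * pi for fi, pi in zip(f, p)]

P = [2.0, 3.0, 5.0]  # arbitrary positive populations
lhs = p_dot_from_lotka_volterra(P)
rhs = p_dot_from_replicator(P)
gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(gap)  # expected to be at the level of floating-point rounding error
```

Note that the components of either velocity sum to zero, as they must for the probabilities to keep summing to 1.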

The Fisher metric

The space of probability distributions on the set \{1, \dots, n\} is called the (n-1)-simplex

\Delta^{n-1} = \{ (x_1, \dots, x_n) : \; x_i \ge 0, \; \displaystyle{ \sum_{i=1}^n x_i = 1 } \}

It’s called \Delta^{n-1} because it’s (n-1)-dimensional. When n = 3 it is a triangle, so it looks like the letter \Delta.

The Fisher metric is a Riemannian metric on the interior of the (n-1)-simplex. That is, given a point p in the interior of \Delta^{n-1} and two tangent vectors v,w at this point the Fisher metric gives a number

g(v,w) = \displaystyle{ \sum_{i=1}^n \frac{v_i w_i}{p_i}  }

Here we are describing the tangent vectors v,w as vectors in \mathbb{R}^n with the property that the sum of their components is zero: that’s what makes them tangent to the (n-1)-simplex. And we’re demanding that p be in the interior of the simplex to avoid dividing by zero, since on the boundary of the simplex we have p_i = 0 for at least one choice of i.

If we have a probability distribution p(t) moving around in the interior of the (n-1)-simplex as a function of time, its Fisher speed is

\displaystyle{ \sqrt{g(\dot{p}(t), \dot{p}(t))} = \sqrt{\sum_{i=1}^n \frac{\dot{p}_i(t)^2}{p_i(t)}} }

if the derivative \dot{p}(t) exists. This is the usual formula for the speed of a curve moving in a Riemannian manifold, specialized to the case at hand.
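To make these formulas concrete, here is a minimal sketch of the Fisher metric and the Fisher speed as plain Python functions. The names fisher_metric and fisher_speed, and the sample point and velocity, are my own choices for illustration. The speed is the square root of g(\dot{p}, \dot{p}).

```python
import math

def fisher_metric(p, v, w):
    # g(v, w) = sum_i v_i w_i / p_i, for p in the interior of the simplex.
    return sum(vi * wi / pi for vi, wi, pi in zip(v, w, p))

def fisher_speed(p, p_dot):
    # Speed of a curve in the Fisher metric: sqrt(g(p_dot, p_dot)).
    return math.sqrt(fisher_metric(p, p_dot, p_dot))

p = [0.2, 0.3, 0.5]            # a point in the interior of the 2-simplex
p_dot = [0.01, -0.03, 0.02]    # components sum to zero, so this is tangent
speed = fisher_speed(p, p_dot)
print(speed)
```

The code does not guard against p_i = 0; like the metric itself, it only makes sense in the interior of the simplex.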

Now we’ve got all the formulas we’ll need to prove the result we want. But for those who don’t already know and love it, it’s worthwhile saying a bit more about the Fisher metric.

The factor of 1/p_i in the Fisher metric changes the geometry of the simplex so that it becomes round, like a portion of a sphere.
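One way to see the roundness, which I am adding here as a standard fact rather than something derived in this post: the change of variables x_i = 2 \sqrt{p_i} sends the simplex to a portion of the sphere of radius 2, and the Fisher length of a tangent vector equals the ordinary Euclidean length of its image. The sketch below checks this numerically at one arbitrarily chosen point. The helper names are hypothetical.

```python
import math

def to_sphere(p):
    # x_i = 2 sqrt(p_i); then sum_i x_i^2 = 4 sum_i p_i = 4.
    return [2.0 * math.sqrt(pi) for pi in p]

def pushforward(p, v):
    # Derivative of 2 sqrt(p_i) in the direction v: v_i / sqrt(p_i).
    return [vi / math.sqrt(pi) for vi, pi in zip(v, p)]

p = [0.2, 0.3, 0.5]
v = [0.01, -0.03, 0.02]   # tangent vector: components sum to zero

x = to_sphere(p)
u = pushforward(p, v)

radius_squared = sum(xi**2 for xi in x)
fisher_length_squared = sum(vi**2 / pi for vi, pi in zip(v, p))
euclidean_length_squared = sum(ui**2 for ui in u)
radial_component = sum(xi * ui for xi, ui in zip(x, u))  # zero if u is tangent to the sphere

print(radius_squared, fisher_length_squared, euclidean_length_squared, radial_component)
```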

But the reason the Fisher metric is important, I think, is its connection to relative information. Given two probability distributions p, q \in \Delta^{n-1}, the information of q relative to p is

\displaystyle{ I(q,p) = \sum_{i = 1}^n q_i \ln\left(\frac{q_i}{p_i}\right)   }

You can show this is the expected amount of information gained if p was your prior distribution and you receive information that causes you to update your prior to q. So, sometimes it’s called the information gain. It’s also called relative entropy or—my least favorite, since it sounds so mysterious—the Kullback–Leibler divergence.

Suppose p(t) is a smooth curve in the interior of the (n-1)-simplex. We can ask the rate at which information is gained as time passes. Perhaps surprisingly, a calculation gives

\displaystyle{ \frac{d}{dt} I(p(t), p(t_0))\Big|_{t = t_0} = 0 }

That is, in some sense ‘to first order’ no information is being gained at any moment t_0 \in \mathbb{R}. However, we have

\displaystyle{  \frac{d^2}{dt^2} I(p(t), p(t_0))\Big|_{t = t_0} =  g(\dot{p}(t_0), \dot{p}(t_0))}

So, the square of the Fisher speed has a nice interpretation in terms of relative entropy!

For a derivation of these last two equations, see Part 7 of my posts on information geometry. For more on the meaning of relative entropy, see Part 6.
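The two derivative formulas above can also be checked numerically with finite differences. This is only a sketch under assumptions of my own: the test curve p(t), with p_i(t) proportional to w_i e^{a_i t}, is a hypothetical example, and the step size h is an arbitrary small number.

```python
import math

def relative_information(q, p):
    # I(q, p) = sum_i q_i ln(q_i / p_i)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

A = [1.0, 0.5, -0.3]   # hypothetical constants defining the test curve
W = [0.2, 0.3, 0.5]    # p(0)

def curve(t):
    weights = [wi * math.exp(ai * t) for wi, ai in zip(W, A)]
    total = sum(weights)
    return [x / total for x in weights]

def velocity(t):
    # Exact derivative of this particular curve: (a_i - mean of a) * p_i.
    p = curve(t)
    mean_a = sum(ai * pi for ai, pi in zip(A, p))
    return [(ai - mean_a) * pi for ai, pi in zip(A, p)]

t0, h = 0.0, 1e-3
p0 = curve(t0)
I_plus = relative_information(curve(t0 + h), p0)
I_zero = relative_information(curve(t0), p0)
I_minus = relative_information(curve(t0 - h), p0)

first_derivative = (I_plus - I_minus) / (2 * h)             # central difference
second_derivative = (I_plus - 2 * I_zero + I_minus) / h**2  # second difference
fisher_speed_squared = sum(v**2 / p for v, p in zip(velocity(t0), p0))

print(first_derivative, second_derivative, fisher_speed_squared)
```

Up to finite-difference error, the first number should be close to zero and the last two should agree.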

The result

It’s now extremely easy to show what we want, but let me state it formally so all the assumptions are crystal clear.

Theorem. Suppose the functions P_i \colon \mathbb{R} \to (0,\infty) obey the Lotka–Volterra equations:

\displaystyle{ \dot P_i = f_i(P) P_i}

for some differentiable functions f_i \colon (0,\infty)^n \to \mathbb{R} called fitness functions. Define probabilities and the mean fitness as above, and define the variance of the fitness by

\displaystyle{ \mathrm{Var}(f(P)) =  \sum_j ( f_j(P) - \overline f(P))^2 \, p_j }

Then if none of the populations P_i are zero, the square of the Fisher speed of the probability distribution p(t) = (p_1(t), \dots , p_n(t)) is the variance of the fitness:

g(\dot{p}, \dot{p})  = \mathrm{Var}(f(P))

Proof. The proof is near-instantaneous. We take the square of the Fisher speed:

\displaystyle{ g(\dot{p}, \dot{p}) = \sum_{i=1}^n \frac{\dot{p}_i(t)^2}{p_i(t)} }

and plug in the replicator equation:

\displaystyle{ \dot{p}_i = (f_i(P) - \overline f(P)) p_i }

We obtain:

\begin{array}{ccl} \displaystyle{ g(\dot{p}, \dot{p})} &=& \displaystyle{ \sum_{i=1}^n \left( f_i(P) - \overline f(P) \right)^2 p_i } \\ \\ &=& \mathrm{Var}(f(P)) \end{array}

as desired.   █
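For anyone who likes to see a theorem hold on actual numbers, here is a sketch that checks it along a crudely simulated trajectory. Everything specific is an assumption of mine: the fitness functions are hypothetical, and forward Euler is just the simplest way to produce a sequence of positive populations. Since the identity holds at each point P separately, the accuracy of the integrator does not matter for the check. The velocity \dot{p} is computed from the Lotka–Volterra equation by the quotient rule, not from the replicator equation, so the comparison is not circular.

```python
def fitness(P):
    # Hypothetical population-dependent fitness functions, for illustration only.
    x, y, z = P
    return [1.0 - 0.5 * y,
            -0.8 + 0.3 * x - 0.1 * z,
            0.2 * y - 0.05 * (x + y + z)]

def fisher_speed_squared(P):
    # p_dot from dP_i/dt = f_i(P) P_i via the quotient rule, then sum_i p_dot_i^2 / p_i.
    f = fitness(P)
    P_dot = [fi * Pi for fi, Pi in zip(f, P)]
    total, total_dot = sum(P), sum(P_dot)
    p = [Pi / total for Pi in P]
    p_dot = [(Pd * total - Pi * total_dot) / total**2 for Pd, Pi in zip(P_dot, P)]
    return sum(pd**2 / pi for pd, pi in zip(p_dot, p))

def fitness_variance(P):
    # Var(f(P)) = sum_j (f_j - mean fitness)^2 p_j
    f = fitness(P)
    total = sum(P)
    p = [Pi / total for Pi in P]
    mean_f = sum(fj * pj for fj, pj in zip(f, p))
    return sum((fj - mean_f)**2 * pj for fj, pj in zip(f, p))

def euler_step(P, dt):
    # Forward-Euler step of the Lotka-Volterra equation; adequate for a sanity check.
    return [Pi + dt * fi * Pi for fi, Pi in zip(fitness(P), P)]

P = [2.0, 3.0, 5.0]
max_gap = 0.0
for _ in range(200):
    max_gap = max(max_gap, abs(fisher_speed_squared(P) - fitness_variance(P)))
    P = euler_step(P, 0.01)

print(max_gap)  # expected to be at the level of floating-point rounding error
```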

It’s hard to imagine anything simpler than this. We see that given the Lotka–Volterra equation, what causes information to be updated is nothing more and nothing less than variance in fitness! But there are other variants of Fisher’s fundamental theorem worth discussing, so I’ll talk about those in future posts.

10 Responses to Fisher’s Fundamental Theorem (Part 3)

  1. Toby Bartels says:

    Typo: You have a stray comma in the Lotka–Volterra equation (the first time).

    This reminds me to remark that these generalized Lotka–Volterra equations hardly say anything more than that the quantities are governed by a system of time-independent differential equations at all. What they do say beyond that is that \dot{P}_i = 0 when P_i = 0 (and that any implicit continuity assumptions you make about the differential equations hold even more strongly at zero). This is quite reasonable, because if the population is zero at any time, then it can hardly change thereafter. But you only look in the interior where none of the P_i are zero anyway, so now it says nothing extra again.

    • John Baez says:

      Yes, the Lotka–Volterra equations are scarcely more than the general system of first-order time-independent ODE! I should have said this. It’s interesting you can get anything out of them at all.

      The main reason for writing

      \dot{P}_i = f_i(P_1, \dots, P_n) P_i

      instead of

      \dot{P}_i = g_i(P_1, \dots, P_n)

      is that the functions f_i do useful things for us when we bring in probabilities: their means, variances and such are interesting.

      But it’s also important that the Lotka–Volterra equations imply that if P_i(t) vanishes at some time it vanishes at all later times (since I’m assuming the f_i are differentiable—Lipschitz would be enough for this result). This means there’s “no true novelty”: no species can come into existence if it’s not there already. This means we’re not doing a full-fledged study of mutation. So we’re studying “natural selection” but not this other important aspect of evolution. (There are also lots of other aspects of evolution that we’re not getting into, of course.)

    • Marc Harper says:

      Similarly, any dynamic confined to the interior of the simplex is actually a replicator equation. This is called “forward-invariance”.

      If a dynamic \dot{p}_i = F_i(p) is forward-invariant on the interior of the simplex, then we must have that \sum_j{F_j(p)} = 0 since \sum_i{\dot{p}_i} = 0. So we can rewrite the dynamic as follows:

      \dot{p}_i = F_i(p) = p_i \frac{F_i(p)}{p_i} =  p_i\frac{F_i(p)}{p_i} - p_i \sum_j{p_j \frac{F_j(p)}{p_j}}

      which is a replicator equation with fitness functions N_i(p) = F_i(p) / p_i.

      AFAIK this was first noticed by Dashiell Fryer and myself in [1] and [2]. It was also reported essentially as written above recently in [3], which notes a similar observation due to Smale in 1976 (before the common form of replicator equation was defined, typically attributed to Taylor and Jonker in 1978).

      [1] Fryer (2012) On the existence of general equilibrium in finite games and general game dynamics. arXiv:1201.2384
      [2] Harper, Fryer (2014). Lyapunov Functions for Time-Scale Dynamics on Riemannian Geometries of the Simplex, DGAA
      [3] Raju, Krishnaprasad “Lie algebra structure of fitness and replicator control” (2020) arXiv:2005.09792

  2. Phillip Helbig says:

    Off topic, but what is your take on the fact that a mathematical physicist has won the Nobel Prize for physics? Is that the first time that has happened? Actually, Penrose is a mathematician by training. (Fair enough I suppose since physicist Edward Witten, who also has a degree with a major in history and minor in linguistics and worked as a journalist and studied economics and maths for a while before coming to physics—not that it took long since he was a full professor at the IAS at Princeton at just barely 26—has won the Fields Medal.)

  3. Mike Stay says:

    It seems strange that the acceleration of the information is equal to a squared speed. Since information is measured in bits, d²/dt² I is bits per second squared, which means the Fisher speed has units of “square root of bits” per second.

    • Phillip Helbig says:

      But is that meaningful? Think of something like flux density which is power per area per hertz. But since power is energy per time this is energy per area per time per hertz, but hertz has units of 1 over time so they cancel and we are left with energy per area which is formally correct in a sense but intuitively removed from the concept of flux density.

      • Phillip Helbig says:

        On a related note, power is energy per time and of course time is money and knowledge is power. Solving for money, we see that it diverges as knowledge goes to zero, regardless of the energy (e.g. tweeting) expended, which explains why Donald Trump earns more than I do. :-)

    • John Baez says:

      It seems strange that the acceleration of the information is equal to a squared speed. Since information is measured in bits, d²/dt² I is bits per second squared, which means the Fisher speed has units of “square root of bits” per second.

      Since bits are normally regarded as dimensionless this is tolerable, but I agree it’s strange. Fundamentally what’s strange is that as you start with p = q and start moving q away from p, the relative information I(q,p) doesn’t change to first order—only to second order! “To first order, you’re never learning anything”. So relative information is not like distance. It’s more like the square of distance.

      But of course this follows from the fact that I(q,p) depends smoothly on q, I(q,p) \ge 0 and I(q,p) = 0 when q = p. A smooth nonnegative function that vanishes at some point must have vanishing derivative at that point. It can have nonvanishing second derivative, though!

      And this is actually connected in a nice way to how distance |x-y| is the square root of a fundamentally simpler quantity (x-y)^2.

  4. John (if I may),

    ‘Fitness’ and ‘natural selection’ lead to the question, “What exactly is it, which has the fitness to be selected by nature?”

    In the SEP article on fitness there is some math and a section on “How the Problems of Defining Biological Individuality Affect the Notion of Fitness.” Here, the question of “What is it that’s fit?” raises a key issue in immunology. And then, a book is cited: “The Limits of the Self: Immunology and Biological Identity” by Thomas Pradeu.

    The issue is how the immune system knows to attack an ‘other,’ and when it overreacts (as sometimes in COVID) to attack its ‘self.’ Here is a review of The Limits of Self–

    This question is outside those addressed by Shannon-theoretic information. A different mathematics of information seems to be required. For example, say there is uncertainty among three possible entities within the detection process of the immune system: other1, other2, other3.

    In Shannon’s theory of information, if any one of these three is ultimately detected, the amount of information (or ‘surprisal’) is the same. In this situation, Shannon’s theory makes no difference between other1, other2, or other3. The best text on this characteristic of Shannon information that I’ve read is by Fred Dretske– Chapter 1 of his book ‘Knowledge and the Flow of Information.’

    Put another way, in this case, the problem for a mathematical theory of information is to detect the immunological ‘self.’ As well as others.

    Now consider situation theory, channel theory, or ‘informationalism’ as introduced by Jon Barwise (the last, shortly before he passed away). My apologies for not formatting the math using latex. But here is some text on how that mathematical theory of information works to identify self and others:

    First the self is a constituent of a situation. For example, say that along the lines of The Limits of Self, the self is a continuous process, symbolized in ordinary Petri nets by a transition (the continuous process), with one of its arrows pointed to a place labelled ‘self’ (a possibility), and then an arrow from that place back into the transition (the continuous process). This Petri net could symbolize the self inside its situation.

    Now add an element of information (an ‘infon’) to this situation– that the self knows that it is in this situation in which it exists, or in which it occurs.

    And then– if there are any others in this situation, by knowing the situation in which they, as well as its self, exist, it therefore knows them as well.

    This machinery is detailed using mathematical symbols in Barwise’s Situation in Logic– in his chapter on common knowledge.
