Statistical Laws of Darwinian Evolution

guest post by Matteo Smerlak

Biologists like Steven J. Gould like to emphasize that evolution is unpredictable. They have a point: there is absolutely no way an alien visiting the Earth 400 million years ago could have said:

Hey, I know what’s gonna happen here. Some descendants of those ugly fish will grow wings and start flying in the air. Others will walk the surface of the Earth for a few million years, but they’ll get bored and they’ll eventually go back to the oceans; when they do, they’ll be able to chat across thousands of kilometers using ultrasound. Yet others will grow arms, legs, fur, they’ll climb trees and invent BBQ, and, sooner or later, they’ll start wondering “why all this?”.

Nor can we tell if, a week from now, the flu virus will mutate, become highly pathogenic and forever remove the furry creatures from the surface of the Earth.

Evolution isn’t gravity—we can’t tell in which directions things will fall down.

One reason we can’t predict the outcomes of evolution is that genomes evolve in a super-high dimensional combinatorial space, which a ginormous number of possible turns at every step. Another is that living organisms interact with one another in a massively non-linear way, with, feedback loops, tipping points and all that jazz.

Life’s a mess, if you want my physicist’s opinion.

But that doesn’t mean that nothing can be predicted. Think of statistics. Nobody can predict who I’ll vote for in the next election, but it’s easy to tell what the distribution of votes in the country will be like. Thus, for continuous variables which arise as sums of large numbers of independent components, the central limit theorem tells us that the distribution will always be approximately normal. Or take extreme events: the max of N independent random variables is distributed according to a member of a one-parameter family of so-called “extreme value distributions”: this is the content of the famous Fisher–Tippett–Gnedenko theorem.

So this is the problem I want to think about in this blog post: is evolution ruled by statistical laws? Or, in physics terms: does it exhibit some form of universality?

Fitness distributions are the thing

One lesson from statistical physics is that, to uncover universality, you need to focus on relevant variables. In the case of evolution, it was Darwin’s main contribution to figure out the main relevant variable: the average number of viable offspring, aka fitness, of an organism. Other features—physical strength, metabolic efficiency, you name it—matter only insofar as they are correlated with fitness. If we further assume that fitness is (approximately) heritable, meaning that descendants have the same fitness as their ancestors, we get a simple yet powerful dynamical principle called natural selection: in a given population, the lineage with the highest fitness eventually dominates, i.e. its fraction goes to one over time. This principle is very general: it applies to genes and species, but also to non-living entities such as algorithms, firms or language. The general relevance of natural selection as a evolutionary force is sometimes referred to as “Universal Darwinism”.

The general idea of natural selection is pictured below (reproduced from this paper):

It’s not hard to write down an equation which expresses natural selection in general terms. Consider an infinite population in which each lineage grows with some rate x. (This rate is called the log-fitness or Malthusian fitness to contrast it with the number of viable offspring w=e^{x\Delta t} with \Delta t the lifetime of a generation. It’s more convenient to use x than w in what follows, so we’ll just call x “fitness”). Then the distribution of fitness at time t satisfies the equation

\displaystyle{ \frac{\partial p_t(x)}{\partial t} =\left(x-\int d y\, y\, p_t(y)\right)p_t(x) }

whose explicit solution in terms of the initial fitness distribution p_0(x):

\displaystyle{ p_t(x)=\frac{e^{x t}p_0(x)}{\int d y\, e^{y t}p_0(y)} }

is called the Cramér transform of p_0(x) in large deviations theory. That is, viewed as a flow in the space of probability distributions, natural selection is nothing but a time-dependent exponential tilt. (These equations and the results below can be generalized to include the effect of mutations, which are critical to maintain variation in the population, but we’ll skip this here to focus on pure natural selection. See my paper referenced below for more information.)

An immediate consequence of these equations is that the mean fitness \mu_t=\int dx\, x\, p_t(x) grows monotonically in time, with a rate of growth given by the variance \sigma_t^2=\int dx\, (x-\mu_t)^2\, p_t(x):

\displaystyle{ \frac{d\mu_t}{dt}=\sigma_t^2\geq 0 }

The great geneticist Ronald Fisher (yes, the one in the extreme value theorem!) was very impressed with this relationship. He thought it amounted to an biological version of the second law of thermodynamics, writing in his 1930 monograph

Professor Eddington has recently remarked that “The law that entropy always increases—the second law of thermodynamics—holds, I think, the supreme position among the laws of nature”. It is not a little instructive that so similar a law should hold the supreme position among the biological sciences.

Unfortunately, this excitement hasn’t been shared by the biological community, notably because this Fisher “fundamental theorem of natural selection” isn’t predictive: the mean fitness \mu_t grows according to the fitness variance \sigma_t^2, but what determines the evolution of \sigma_t^2? I can’t use the identity above to predict the speed of evolution in any sense. Geneticists say it’s “dynamically insufficient”.

Two limit theorems

But the situation isn’t as bad as it looks. The evolution of p_t(x) may be decomposed into the evolution of its mean \mu_t, of its variance \sigma_t^2, and of its shape or type

\overline{p}_t(x)=\sigma_t p_t(\sigma_t x+\mu_t).

(We also call \overline{p}_t(x) the “standardized fitness distribution”.) With Ahmed Youssef we showed that:

• If p_0(x) is supported on the whole real line and decays at infinity as

-\ln\int_x^{\infty}p_0(y)d y\underset{x\to\infty}{\sim} x^{\alpha}

for some \alpha > 1, then \mu_t\sim t^{\overline{\alpha}-1}, \sigma_t^2\sim t^{\overline{\alpha}-2} and \overline{p}_t(x) converges to the standard normal distribution as t\to\infty. Here \overline{\alpha} is the conjugate exponent to \alpha, i.e. 1/\overline{\alpha}+1/\alpha=1.

• If p_0(x) has a finite right-end point x_+ with

p(x)\underset{x\to x_+}{\sim} (x_+-x)^\beta

for some \beta\geq0, then x_+-\mu_t\sim t^{-1}, \sigma_t^2\sim t^{-2} and \overline{p}_t(x) converges to the flipped gamma distribution

\displaystyle{ p^*_\beta(x)= \frac{(1+\beta)^{(1+\beta)/2}}{\Gamma(1+\beta)} \Theta[x-(1+\beta)^{1/2}] }

\displaystyle { e^{-(1+\beta)^{1/2}[(1+\beta)^{1/2}-x]}\Big[(1+\beta)^{1/2}-x\Big]^\beta }

Here and below the symbol \sim means “asymptotically equivalent up to a positive multiplicative constant”; \Theta(x) is the Heaviside step function. Note that p^*_\beta(x) becomes Gaussian in the limit \beta\to\infty, i.e. the attractors of cases 1 and 2 form a continuous line in the space of probability distributions; the other extreme case, \beta\to0, corresponds to a flipped exponential distribution.

The one-parameter family of attractors p_\beta^*(x) is plotted below:

These results achieve two things. First, they resolve the dynamical insufficiency of Fisher’s fundamental theorem by giving estimates of the speed of evolution in terms of the tail behavior of the initial fitness distribution. Second, they show that natural selection is indeed subject to a form of universality, whereby the relevant statistical structure turns out to be finite dimensional, with only a handful of “conserved quantities” (the \alpha and \beta exponents) controlling the late-time behavior of natural selection. This amounts to a large reduction in complexity and, concomitantly, an enhancement of predictive power.

(For the mathematically-oriented reader, the proof of the theorems above involves two steps: first, translate the selection equation into a equation for (cumulant) generating functions; second, use a suitable Tauberian theorem—the Kasahara theorem—to relate the behavior of generating functions at large values of their arguments to the tail behavior of p_0(x). Details in our paper.)

It’s useful to consider the convergence of fitness distributions to the attractors p_\beta^*(x) for 0\leq\beta\leq \infty in the skewness-kurtosis plane, i.e. in terms of the third and fourth cumulants of p_t(x).

The red curve is the family of attractors, with the normal at the bottom right and the flipped exponential at the top left, and the dots correspond to numerical simulations performed with the classical Wright–Fisher model and with a simple genetic algorithm solving a linear programming problem. The attractors attract!

Conclusion and a question

Statistics is useful because limit theorems (the central limit theorem, the extreme value theorem) exist. Without them, we wouldn’t be able to make any population-level prediction. Same with statistical physics: it only because matter consists of large numbers of atoms, and limit theorems hold (the H-theorem, the second law), that macroscopic physics is possible in the first place. I believe the same perspective is useful in evolutionary dynamics: it’s true that we can’t predict how many wings birds will have in ten million years, but we can tell what shape fitness distributions should have if natural selection is true.

I’ll close with an open question for you, the reader. In the central limit theorem as well as in the second law of thermodynamics, convergence is driven by a Lyapunov function, namely entropy. (In the case of the central limit theorem, it’s a relatively recent result by Arstein et al.: the entropy of the normalized sum of n i.i.d. random variables, when it’s finite, is a monotonically increasing function of n.) In the case of natural selection for unbounded fitness, it’s clear that entropy will also be eventually monotonically increasing—the normal is the distribution with largest entropy at fixed variance and mean.

Yet it turns out that, in our case, entropy isn’t monotonic at all times; in fact, the closer the initial distribution p_0(x) is to the normal distribution, the later the entropy of the standardized fitness distribution starts to increase. Or, equivalently, the closer the initial distribution p_0(x) to the normal, the later its relative entropy with respect to the normal. Why is this? And what’s the actual Lyapunov function for this process (i.e., what functional of the standardized fitness distribution is monotonic at all times under natural selection)?

In the plots above the blue, orange and green lines correspond respectively to

\displaystyle{ p_0(x)\propto e^{-x^2/2-x^4}, \quad p_0(x)\propto e^{-x^2/2-.01x^4}, \quad p_0(x)\propto e^{-x^2/2-.001x^4} }


• S. J. Gould, Wonderful Life: The Burgess Shale and the Nature of History, W. W. Norton & Co., New York, 1989.

• M. Smerlak and A. Youssef, Limiting fitness distributions in evolutionary dynamics, 2015.

• R. A. Fisher, The Genetical Theory of Natural Selection, Oxford University Press, Oxford, 1930.

• S. Artstein, K. Ball, F. Barthe and A. Naor, Solution of Shannon’s problem on the monotonicity of entropy, J. Am. Math. Soc. 17 (2004), 975–982.

34 Responses to Statistical Laws of Darwinian Evolution

  1. I assume that first formula should read w=e^{x \Delta t}? I can’t see any reference to w at all, other than you mentioning it’s more convenient than x, but the self-referential formula x=e^{x \Delta t} seems off. It would have the solution x=-\frac{W_k\left(t\right)}{t} , k \in \mathbb{Z} where W is the product log function.

    • p_\beta^* is unfortunately cut off. Not sure if that’s fixable: Seems like a lengthy equation.

      • what do you mean by “the closer the initial distribution from the normal”? Is something missing in those two sentences? I can’t quite parse that. By “the normal”, do you mean “the normal distrubition”?

        Ok, and now for a more interesting questions: In the Skewness/Kurtosis graph, I assume the grey graph is a \cosh(x) or maybe an x^2+1. Will all evolutionary trajectories stay above that graph or what is its relevance? Clearly, the two shown processes do stay solidly above.

        This looks like a really nice result! So I’m just totally wildly guessing because I know way too little about this to be able to do an actually educated guess but: Could it be that the Lyapunov function in this case is the dissipation you previously talked about?

        • The blue-shaded region is the subset of the skewness-kurtosis plane available to unimodal distributions according to the Klaassen–Mokveld–van Es inequality [C. A. J. Klaassen, P. J. Mokveld, and B. van Es, “Squared skewness minus kurtosis bounded by 186/125 for unimodal distributions,” Stat. Probabil. Lett. 50 no. 2, (Nov., 2000) 131–135]

      • John Baez says:

        I fixed the big equation; thanks! That was my fault.

        what do you mean by “the closer the initial distribution from the normal”?

        I’ll guess Matteo means “the closer the initial distribution to the normal”. I.e., a little problem with idiomatic English.

        By “the normal”, do you mean “the normal distribution”?

        I bet that’s what he means. So, “the closer the initial distribution is to the normal distribution…”

    • Thanks Kram, you’re right: this should read w=e^{x\Delta t}. I mentioned w just because it’s commonly by many biologists.

  2. It would be interesting to look at this from an Information Geometry angle. Having a Riemann metric as well as a the above vector field would suggest looking at the covariant derivative of the the vector field and the lie derivative of the metric wrt the field. I guess that would require choosing a parametric family to define the Fisher metric. Could the flow satisfy some king of extremal principle in that setting?

  3. domenico says:

    I must read everything carefully, but can a quantum equation (with a right potential maybe using the corrispondence principle) describe a Universal Darwinism system?

    • Hi Domenico, thanks for your suggestion. However, if by quantum equation you mean “Schrödinger equation”, then I’d say no: the latter is linear, but the equation for p_t(x) isn’t.

      • domenico says:

        Thank you, I understand, but if a classical system can be turned in a quantum system (for example a thermodynamic or economic system, with a transformation of the variables in quantum variables), then could be possible approximate a real dynamic with a quantum dynamic (a classical Hamiltonian that approximate the real dynamic for little variation of variables), so that could be write the final state like a probability distribution (I am thinking to meteorology)?

  4. nad says:

    It’s not hard to write down an equation which expresses natural selection in general terms.

    Where does this come from? Is this an observation or assumption ?
    In the link you provided it is written:

    The Fundamental Theorem of Natural Selection describes, quantitatively, how a diverse population subject to selective pressure evolves. It states that the rate of change of the average of a quantitative trait is proportional to the hereditable part of the variance of that trait times the strength of selection for it.

    It seems (?) that you tried to put this law into formulas, setting the strength of selection to one. ???

    An immediate consequence of these equations is that the mean fitness \mu_t=\int dx\, x\, p_t(x) grows monotonically in time, with a rate of growth given by the variance \sigma_t^2=\mu_t=\int dx\, (x-\mu_t)^2\, p_t(x): \displaystyle{ \frac{d\mu_t}{dt}=\sigma_t^2\geq 0 }

    There seems to be a typo or \mu_t is used here in an ambigous way. In particular I can’t see how you get \sigma_t^2 by taking the time derivative of your first definition \mu_t=\int dx\, x\, p_t(x) but maybe that can be clarified if the typo has been clarified.

    • You’re right nad, this is a typo: \mu_t to nothing do it in the definition for the variance. (Sorry! I was surprised by John’s posting of what was still a draft.)

      About your first question, the basic equation of natural selection can derived as follows. Suppose you have organisms with a range of growth rates (i.e. fitness) x. Then the number n_t(x) of individuals with fitness x grows as \partial n_t(x)/\partial t=xn_t(x). The corresponding probability density, which must remain normalized, then satisfies the equation in the post.

      • John Baez says:

        I was surprised by John’s posting of what was still a draft.

        Whoops! I thought you wanted me to post it. I sent you an email saying “I’ll post it in a while!”, and 3 days later I did. Sorry!

        I removed the first \mu_t in this equation:

        \sigma_t^2=\mu_t=\int dx\, (x-\mu_t)^2\, p_t(x)

        I hope that does the job!

  5. Bruce Smith says:

    Just to make sure I understand — am I right that this is a simplified model in which there are no mutations, no interaction between organisms, no change in environment over time, and only single-parent reproduction? (That level of simplification is the only way I can see to legitimize leaving out any separation between genotype and phenotype.)

    • Bruce, I notice the same thing. What the article needs is some background concerning ecological diversity. Evolution does not occur without the competition from other species inhabiting similar ecological niches. So the statistics are those described by a relative abundance distribution (RAD).

      Variations in fitness are not simply there to find an optimal setting for an individual species to survive, but to find alternatives to other species competing for the same resources.

      Species do not evolve in isolation, so I think if we can add the entropic dispersion that contribute to the fat-tails of nearly every characterized RAD, then others can see how it fits into a unified model of ecological diversity.

      So I don’t think I can make any really constructive comments unless the bigger picture is included.

    • You’re right about the assumptions, Bruce, though mutations are discussed in my paper as well. What I’m describing here is the mathematics of natural selection, and natural selection alone. Clearly that’s especially relevant for prokaryotic evolution—not so much for ecosystems including social mammals etc.

      We can add other determinants of evolution in models when that’s needed. That doesn’t mean that we shouldn’t try to understand each effect separately, in its own terms, does it? You can’t understand viscous fluids like if you don’t have a decent understand of inviscid fluids. I think even aircraft engineers would agree with that.

      • Bruce Smith says:

        That doesn’t mean that we shouldn’t try to understand each effect separately, in its own terms, does it?

        Absolutely, simplification is very worthwhile. I’m not criticizing it at all, just wanting to make sure I understand correctly exactly what was included and what was left out.

    • John Baez says:

      Evolutionary biologists have some words to help delineate the various aspects of evolution. From Wikipedia:

      Microevolution is the change in allele frequencies that occurs over time within a population. This change is due to four different processes: mutation, selection (natural and artificial), gene flow, and genetic drift. This change happens over a relatively short (in evolutionary terms) amount of time compared to the changes termed ‘macroevolution’ which is where greater differences in the population occur.

      However, as you might suspect, they argue a lot about these things, because there are lots of complexities and lots of ways to try to divide them into more manageable pieces. They also talk about ‘mesoevolution’, e.g.:

      • Ehab Abouheif, Parallelism as the pattern and process of mesoevolution.

      I heard a cool talk by a guy who wrote a book about a unified mathematical theory of micro- and mesoevolution, but now I can’t remember his name! I was very excited at the time, and wanted to learn more. I think he was German.

  6. linasv says:

    Interesting post, until I get near the end, with the toy GA-learning example. The toy may work as described, but it strongly contradicts my personal experience. In the models that I train, I see fitness curves that resemble the so-called “learning curves”, where the fitness plateaus for a while, and then there is a rapid improvement in fitness, followed by another plateau.

    Understanding this was not entirely easy. Of course, one can hand-wave about how the discovery of something new results in a rapid improvement of fitness. The hard part was this realization: the GA algorithm didn’t simply take the fittest exemplar, and improve it … quite the opposite — it took some fairly mediocre, relatively unfit exemplar, and made a change that put it well out in front of the fittest ones from the population at the previous timestep. Then its a matter of fine-tuning this discovery for a while, allowing rapid improvement, and then the change in fitness plateaus.

    In my case, I was doing just machine learning, but a similar behavior is known in biology: I forget certain details, but the “best” cell-membrane calcium ion channels are evolved from mediocre general-purpose ion channels. The point was that the initial mutation essentially damaged the general-purpose ion channel, making it unfit, but this damage opened a path to a much more efficient (but specialized!) calcium channel. (I lost the reference for this, its somewhere on the net).

    Its not clear to me that what you describe here is still applicable to these fits-n-starts, “flight of swallow” type curves. Perhaps — I guess the change in mean-fitness as square of variance does this, — the bigger the variance, the larger the pool of “unfit”, “damaged” individuals that can be drawn upon to discover a fitter solution. Could use some pondering.

    Here are two URL’s that describe this: the first (later) one in general, the second (earlier) one in greater detail:

    • linasv says:

      Write first, think later. What I talk about, and what you present, can be reconciled if one realizes that Fisher accidentally conflated “distribution of fitness” with “diversity of population”. Naively, one might think they are the same, or are linearly related: but this is absolutely false.

      In my machine learning code, I was careful to maintain a reasonably broad “distribution of fitness” as this clearly seemed to improve the rate of learning — as suggested by the Fisher eqn. However, I later observed that during the plateau regions, the genetic diversity of the population had dramatically narrowed: although the population had a wide range of fitness scores, its actual genetic diversity was small, and most individuals were minor variants of each-other.

      My solution was to somehow promote diversity, and this article now makes clear: Fisher was wrong: its not the “distribution of fitness” that drives the improvement in the mean, its the “genetic diversity of the population”!

      You may feel otherwise, but for me, this is a fore-head slap “but of course” moment!

  7. Torbjörn Larsson says:

    Considering that prokaryote growth rates can differ many orders of magnitude, I can see the advantages with such a simplified model, when more generally differential reproduction is the fitness measure.

    Else there are some laws in population genetics that I have seen in my haphazard assimilation of the material. One is the use of Nash equilibrium for some allele frequencies. Another is the law that speciation happens if the crossbreeding rate between sub-populations is less than 1/generation on average, independent of population size. There is also the Hardy-Weinberg Law, the most hailed (and most controversial) of them all.

    Speaking of the underlying population genetics, linasv has identified a complexity: if selection is too large during some time, a (finite) population can run out of variation capital, meaning random drift will increase.But we can always add complexity.

    the closer the initial distribution p_0(x) to the normal, the later its relative entropy with respect to the normal starts to increase. Why is this?”

    Since I haven’t delved into the math (which is why I don’t attempt the corresponding Lyapunov question), I can just guess based on dynamic intuition.

    The finite fitness distribution should be closer in some sense to a single environmental factor behind the selective drive, since a normal distribution implies many factors. Hence it should converge faster on a solution that satisfies the environmental constraints.

    • “Considering that prokaryote growth rates can differ many orders of magnitude, “

      That’s the key. Apply a MaxEnt distribution to growth rates and use that for modeling relative abundance distributions of species.

  8. Torbjörn Larsson says:

    Re Fisher’s Fundamental Theorem in its guise as the breeder’s equation [ ; ], I just found this interesting discussion on its application. It discusses how evolutionary biologists get around, or fail to get around, the differences between quantitative genetic variation (instantiated in the breeder’s equation) and the adaptive potential (instantiated in the neutral variation).

    It also discusses the failure of another “law”, the “extinction vortex” of small N. Turns out Fisher’s FT/the breeder equation isn’t very applicable to (at least some) small eukaryote populations, nor are models of adaptive potential or the adaptive potential itself.

    Eukaryotes have a lot of tricks up theur sleeves, a “capacity to colonize new habitats”, a sometimes “affinity for [localized] environments”, and “a partially duplicated genome which may act to buffer the loss of genetic diversity at small N”. The upshot is that “we very cautiously suggest that response to selection at small N might be more extensive than previously assumed in evolutionary and conservation biology.”

    [ ]

    • John Baez says:

      Here’s a nice paper about the mathematics of “small N”—that is, effects that happen when you’ve got a small population:

      • Marc Harper and Dashiell Fryer, Entropic equilibria selection of stationary extrema in finite populations.

      Abstract. We propose the entropy of random Markov trajectories originating and terminating at a state as a measure of the stability of a state of a Markov process. These entropies can be computed in terms of the entropy rates and stationary distributions of Markov processes. We apply this definition of stability to local maxima and minima of the stationary distribution of the Moran process with mutation and show that variations in population size, mutation rate, and strength of selection all affect the stability of the stationary extrema.

  9. there is no problem with that paper though i think much of it is well known; the problem with it is as was somewhat mentioned in the comments, fitnesses are not independent. this is why why since 1980 or before many people have been studying interdependent fitnesses, including group selection, and limit theorems for correlated variables. highly nonlinear.

  10. […] via Statistical Laws of Darwinian Evolution — Azimuth […]

You can use Markdown or HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it.

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.