This week I’m giving a talk on biology and information:

• John Baez, Biology as information dynamics, talk for Biological Complexity: Can it be Quantified?, a workshop at the Beyond Center, 2 February 2017.

While preparing this talk, I discovered a cool fact. I doubt it’s new, but I haven’t exactly seen it elsewhere. I came up with it while trying to give a precise and general statement of ‘Fisher’s fundamental theorem of natural selection’. I *won’t* start by explaining that theorem, since my version looks rather different than Fisher’s, and I came up with mine precisely because I had trouble understanding his. I’ll say a bit more about this at the end.

Here’s my version:

The square of the rate at which a population learns information is the variance of its fitness.

This is a nice advertisement for the virtues of diversity: more variance means faster learning. But it requires some explanation!

### The setup

Let’s start by assuming we have $n$ different kinds of self-replicating entities with populations $P_1, \dots, P_n$. As usual, these could be all sorts of things:

• molecules of different chemicals

• organisms belonging to different species

• genes of different alleles

• restaurants belonging to different chains

• people with different beliefs

• game-players with different strategies

• etc.

I’ll call them **replicators** of different **species**.

Let’s suppose each population $P_i$ is a function of time that grows at a rate equal to this population times its ‘fitness’. I explained the resulting equation back in Part 9, but it’s pretty simple:

$$ \frac{dP_i}{dt} = f_i(P_1, \dots, P_n) \, P_i $$

Here $f_i$ is a completely arbitrary smooth function of all the populations! We call it the **fitness** of the *i*th species.

This equation is important, so we want a short way to write it. I’ll often write $f_i(P_1, \dots, P_n)$ simply as $f_i$, and $dP_i/dt$ simply as $\dot P_i$. With these abbreviations, which any red-blooded physicist would take for granted, our equation becomes simply this:

$$ \dot P_i = f_i P_i $$

Next, let $p_i$ be the probability that a randomly chosen organism is of the *i*th species:

$$ p_i = \frac{P_i}{\sum_j P_j} $$

Starting from our equation describing how the populations evolve, we can figure out how these probabilities evolve. The answer is called the **replicator equation**:

$$ \dot p_i = \Big( f_i(P_1, \dots, P_n) - \langle f(P_1, \dots, P_n) \rangle \Big) \, p_i $$

Here $\langle f \rangle$ is the average fitness of all the replicators, or **mean fitness**:

$$ \langle f \rangle = \sum_j f_j \, p_j $$

In what follows I’ll abbreviate the replicator equation as follows:

$$ \dot p_i = \big( f_i - \langle f \rangle \big) \, p_i $$
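If you like to see equations in action, here’s a quick numerical sketch of the replicator equation, using a toy example I just made up (three species with constant fitnesses 1, 2, 3). A simple Euler integration shows the probabilities stay normalized while the fittest species takes over:

```python
import numpy as np

def replicator_step(p, f, dt):
    """One Euler step of the replicator equation: dp_i/dt = (f_i - <f>) p_i."""
    mean_f = np.dot(f(p), p)               # mean fitness <f> = sum_i f_i p_i
    return p + dt * (f(p) - mean_f) * p

# Toy example: constant fitnesses for three species.
f = lambda p: np.array([1.0, 2.0, 3.0])
p = np.array([1/3, 1/3, 1/3])
for _ in range(1000):                       # integrate up to t = 10
    p = replicator_step(p, f, 0.01)

print(p.sum())     # stays ≈ 1.0: the step exactly preserves normalization
print(p.argmax())  # 2: the species with the highest fitness dominates
```

Note that the Euler step preserves $\sum_i p_i = 1$ exactly, since $\sum_i \dot p_i = \langle f \rangle - \langle f \rangle \sum_i p_i = 0$.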

### The result

Okay, now let’s figure out how fast the probability distribution

$$ p(t) = \big( p_1(t), \dots, p_n(t) \big) $$

changes with time. For this we need to choose a way to measure the length of the vector

$$ \dot p(t) = \big( \dot p_1(t), \dots, \dot p_n(t) \big) $$

And here information geometry comes to the rescue! We can use the Fisher information metric, which is a Riemannian metric on the space of probability distributions.

I’ve talked about the Fisher information metric in many ways in this series. The most important fact is that as a probability distribution $p(t)$ changes with time, its speed

$$ \left\| \frac{dp}{dt} \right\| $$

as measured using the Fisher information metric can be seen as the *rate at which information is learned*. I’ll explain that later. Right now I just want a simple *formula* for the Fisher information metric. Suppose $v$ and $w$ are two tangent vectors to the point $p$ in the space of probability distributions. Then the **Fisher information metric** is given as follows:

$$ \langle v, w \rangle = \sum_i \frac{v_i w_i}{p_i} $$

Using this we can calculate the speed at which $p(t)$ moves when it obeys the replicator equation. Actually the square of the speed is simpler:

$$ \left\| \frac{dp}{dt} \right\|^2 = \sum_i \frac{\dot p_i^2}{p_i} = \sum_i \big( f_i - \langle f \rangle \big)^2 p_i $$

The answer has a nice meaning, too! It’s just the variance of the fitness: that is, the square of its standard deviation.

So, if you’re willing to buy my claim that the speed is the rate at which our population learns new information, then we’ve seen that *the square of the rate at which a population learns information is the variance of its fitness!*
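If you want to check the algebra without doing it by hand, here’s a small numerical check (a toy sketch, with a randomly chosen probability distribution and arbitrary fitness values) confirming that the Fisher-metric speed squared equals the variance of the fitness:

```python
import numpy as np

def speed_squared(p, f):
    """Square of the Fisher-metric speed, sum_i (pdot_i)^2 / p_i,
    where pdot comes from the replicator equation."""
    mean_f = np.dot(f, p)
    pdot = (f - mean_f) * p          # replicator equation: pdot_i = (f_i - <f>) p_i
    return np.sum(pdot**2 / p)

def variance(p, f):
    """Variance of the fitness: sum_i (f_i - <f>)^2 p_i."""
    mean_f = np.dot(f, p)
    return np.dot((f - mean_f)**2, p)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5))        # a random probability distribution
f = rng.normal(size=5)               # arbitrary fitness values at this point

print(np.isclose(speed_squared(p, f), variance(p, f)))  # True
```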

### Fisher’s fundamental theorem

Now, how is this related to Fisher’s fundamental theorem of natural selection? First of all, what *is* Fisher’s fundamental theorem? Here’s what Wikipedia says about it:

It uses some mathematical notation but is not a theorem in the mathematical sense.

It states:

“The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”

Or in more modern terminology:

“The rate of increase in the mean fitness of any organism at any time ascribable to natural selection acting through changes in gene frequencies is exactly equal to its genetic variance in fitness at that time”.

Largely as a result of Fisher’s feud with the American geneticist Sewall Wright about adaptive landscapes, the theorem was widely misunderstood to mean that the average fitness of a population would always increase, even though models showed this not to be the case. In 1972, George R. Price showed that Fisher’s theorem was indeed correct (and that Fisher’s proof was also correct, given a typo or two), but did not find it to be of great significance. The sophistication that Price pointed out, and that had made understanding difficult, is that the theorem gives a formula for part of the change in gene frequency, and not for all of it. This is a part that can be said to be due to natural selection.

Price’s paper is here:

• George R. Price, Fisher’s ‘fundamental theorem’ made clear, *Annals of Human Genetics* **36** (1972), 129–140.

I don’t find it very clear, perhaps because I didn’t spend enough time on it. But I think I get the idea.

My result *is* a theorem in the mathematical sense, though quite an easy one. I assume a population distribution evolves according to the replicator equation and derive an equation whose right-hand side matches that of Fisher’s original equation: the variance of the fitness.

But my left-hand side is different: it’s the square of the speed of the corresponding probability distribution, where speed is measured using the ‘Fisher information metric’. This metric was discovered by the same guy, Ronald Fisher, but I don’t think he used it in *his* work on the fundamental theorem!

Something a bit similar to my statement appears as Theorem 2 of this paper:

• Marc Harper, Information geometry and evolutionary game theory.

and for that theorem he cites:

• Josef Hofbauer and Karl Sigmund, *Evolutionary Games and Population Dynamics*, Cambridge University Press, Cambridge, 1998.

However, his Theorem 2 really concerns the rate of increase of fitness, like Fisher’s fundamental theorem. Moreover, he assumes that the probability distribution flows along the gradient of a function, and I’m not assuming that. Indeed, my version applies to situations where the probability distribution moves round and round in periodic orbits!

### Relative information and the Fisher information metric

The key to generalizing Fisher’s fundamental theorem is thus to focus on the speed at which $p(t)$ moves, rather than the increase in fitness. Why do I call this speed the ‘rate at which the population learns information’? It’s because we’re measuring this speed using the Fisher information metric, which is closely connected to relative information, also known as relative entropy or the Kullback–Leibler divergence.

I explained this back in Part 7, but that explanation seems hopelessly technical to me now, so here’s a faster one, which I created while preparing my talk.

The information of a probability distribution $q$ **relative to** a probability distribution $p$ is

$$ I(q, p) = \sum_i q_i \ln\!\left( \frac{q_i}{p_i} \right) $$

It says how much information you learn if you start with a hypothesis saying that the probability of the *i*th situation was $p_i$, and then update this to a new hypothesis saying it’s $q_i$.

Now suppose you have a hypothesis that’s changing with time in a smooth way, given by a time-dependent probability distribution $p(t)$. Then a calculation shows that

$$ \left. \frac{d}{dt} I\big(p(t), p(t_0)\big) \right|_{t = t_0} = 0 $$

for all times $t_0$. This seems paradoxical at first. I like to jokingly put it this way:

To first order, you’re never learning anything.

However, as long as the velocity $\dot p(t_0)$ is nonzero, we have

$$ \left. \frac{d^2}{dt^2} I\big(p(t), p(t_0)\big) \right|_{t = t_0} > 0 $$

so we can say

To second order, you’re always learning something… unless your opinions are fixed.
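In case you want to check those two slogans, the calculation is short, assuming $p(t)$ stays in the interior of the simplex:

$$ \frac{d}{dt} I\big(p(t), p(t_0)\big) = \sum_i \dot p_i(t) \left( \ln \frac{p_i(t)}{p_i(t_0)} + 1 \right) $$

At $t = t_0$ the logarithms vanish, and $\sum_i \dot p_i = 0$ because the probabilities always sum to 1, so the first derivative is zero. Differentiating once more and setting $t = t_0$ kills the logarithm terms again, leaving

$$ \left. \frac{d^2}{dt^2} I\big(p(t), p(t_0)\big) \right|_{t = t_0} = \sum_i \frac{\dot p_i(t_0)^2}{p_i(t_0)} $$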

This lets us define a ‘rate of learning’—that is, a ‘speed’ at which the probability distribution moves. *And this is precisely the speed given by the Fisher information metric!*

In other words:

$$ \left. \frac{d^2}{dt^2} I\big(p(t), p(t_0)\big) \right|_{t = t_0} = \left\| \dot p(t_0) \right\|^2 $$

where the length is given by the Fisher information metric. Indeed, this formula can be used to *define* the Fisher information metric. From this definition we can easily work out the concrete formula I gave earlier.

In summary: as a probability distribution moves around, the relative information between the new probability distribution and the original one grows approximately as the *square* of time, not linearly. So, to talk about a ‘rate at which information is learned’, we need to use the above formula, involving a second time derivative. This rate is just the speed at which the probability distribution moves, measured using the Fisher information metric. And when we have a probability distribution describing how many replicators are of different species, and it’s evolving according to the replicator equation, this speed is also just the variance of the fitness!
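If you’d like to see this quadratic growth concretely, here’s a small numerical sketch (using a made-up straight-line path of probability distributions, purely for illustration): the relative information between $p(t)$ and $p(0)$ should be close to $\tfrac{1}{2} \|\dot p(0)\|^2 t^2$ for small $t$.

```python
import numpy as np

def rel_info(q, p):
    """Relative information (KL divergence) I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return np.sum(q * np.log(q / p))

# A smooth path of probability distributions p(t), starting in the interior.
p0 = np.array([0.5, 0.3, 0.2])
v = np.array([0.1, -0.04, -0.06])    # velocity vector; components sum to 0

def path(t):
    return p0 + t * v

# Fisher-metric speed squared at t = 0:
speed2 = np.sum(v**2 / p0)

# I(path(t), p0) grows like (1/2) * speed2 * t^2 for small t,
# so these ratios should approach 1 as t shrinks:
ratios = []
for t in [1e-2, 1e-3]:
    ratios.append(rel_info(path(t), p0) / (0.5 * speed2 * t**2))
print(ratios)
```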

At which? a population …

(The first statement of the theorem)

Yes, “at which”. Thanks!

This is some deeply cool stuff …

Thanks! I think so too!

That is one of the most trivial theorems of all time (also profound). Like Hardy–Weinberg equilibrium (a quadratic equation—G. H. Hardy has his name on this, and also knew Ramanujan). (George Price quit science to serve the poor in the name of Christianity.)

Last Friday I was discussing a bit of information geometry with Amos Golan at AU (‘American U’—Wash., DC, a state in America)—he knows D. Wolpert (SFI), A. Caticha (SUNY), Marc Harper, etc.

It’s also very well known that Fisher’s fundamental theorem is wrong—totally inapplicable to the real world. It’s basically the analog or equivalent of the ‘efficient market theory’ of economics (K. Arrow, Samuelson). Good theory, but not applicable to reality. There are a whole lot of nonlinearities in there for real.

The connections between Fisher information, the Cramér–Rao bound, the Heisenberg uncertainty principle, and K–L divergence (a different name for relative information) are interesting.

Hi John

I am not sure if you have heard of the contest at FQXI which discusses very much this subject. So I think your entry would be a winner. I had similar thoughts myself and I hope I have the time to write an essay for the contest.

http://fqxi.org/community/forum/category/31425?sort=date

The essays, due March 3rd, are supposed to address subjects such as this:

What general features — like information processing, computation, learning, complexity thresholds, and/or departures from equilibrium — allow (or proscribe) agency?

How are goals (versus accomplishments) linked to “arrows of time”?

What separates systems that are intelligent from those that are not? Can we measure this separation objectively and without requiring reference to humans?

What is the relationship between causality – the explanation of events in terms of causes – and teleology – the explanation of events in terms of purposes?

Is goal-oriented behavior a physical or cosmic trend, an accident or an imperative?

I don’t think I have anything interesting to say about goal-directed behavior—at least, not that I want to write an essay about! I could write something about information theory in biology, but that’s not the main point of this contest.

Comments—

The Azimuth blog post by Matteo Smerlak from 2012, ‘The mathematical origin of irreversibility’ (which I don’t remember seeing), overlaps with this one. I would have tried to go to grad school if studying this field seemed available—the people I talked to in theoretical biology basically weren’t aware of or interested in it.

There is a very large literature in biology and also philosophy of science discussing whether Fisher’s FT is an analog of the 2nd law of thermodynamics—a statistical theory—or instead a Newtonian, deterministic theory of motion under the action of ‘forces’—physical or biological (fitness maximization). This goes back to Wright and Fisher.

Most of the discussions in the philosophical and biological literature are not at the mathematical level of Azimuth.

https://www.edge.org/conversation/steven_pinker-the-false-allure-of-group-selection — this also mentions Price’s equation. (My position is basically the same as many of the commentators—S. Pinker is basically wrong. I think this is because biologists don’t study statistical mechanics. As is pointed out, this debate is really just about definitions or vocabulary. The math is the same—however, depending on your interpretation, you either follow the math into trying to compute every Newtonian/deterministic trajectory, or else you take the Boltzmann/Gibbs/Poincaré/statistical mechanics route and decide you have to ‘coarse grain’ some of the information out.)

To me it’s hard to put together into one coherent argument or framework the several posts on this topic on Azimuth and elsewhere, which use different mathematical notation for Fisher’s information and so on.

You may be interested in the article ‘Information dynamics’ by Amos Golan in Minds and Machines. I’ve only read the abstract since it’s behind a paywall and libraries are difficult for me to get to.

Most (or all) of the FQXI questions have been discussed off and on for the last 100 years in the literature, I think. (FQXI essays I’ve read are of very mixed quality—it is interesting that so many people are thinking about these issues, from professional scientists to amateurs.)

I think one could make an argument that information, entropy, and teleology are all connected. ‘Information wants to be free.’ There is also ‘free energy’ E − TS. See Carnot (of the Carnot cycle)—‘The Motive Power of Fire’. That equation appears to occur in Azimuth blog posts, but often under a different name and notation. There are even suggestive theorems in mathematical logic on this.

Another question is why anyone should care about this unless you get paid for it.

Nice result! Do higher orders say something interesting too? For instance, would the skewness of the fitness say something about how information accelerates?

Good question! While

when we take higher derivatives things get more complicated, because the fitness $f_i$ is an arbitrary smooth function of the populations $P_1, \dots, P_n$, whose corresponding probability distribution I’m calling $p$. That is, we really have

$$ \dot p_i = \Big( f_i(P_1, \dots, P_n) - \langle f \rangle \Big) \, p_i $$

and taking another time derivative requires that we use the product rule, the chain rule, and also the basic equation

$$ \dot P_i = f_i P_i $$

You can try it yourself. It gets a little hairy, at first glance, but it could still be interesting.

There are also simplified models where the fitness functions take some special simple form, and then things may be prettier.

http://www.pnas.org/content/81/9/2840 — ‘frequency-dependent fitness’ (and there is also density-dependent fitness, etc.)

There are so many variants/discussions of these topics, each with its own notation, it’s impossible for me to follow them. There was a huge discussion of whether Fisher’s FTNS was an analog of the 2nd law of thermodynamics, or rather more like a Newtonian potential problem. I think the discussions basically say it can be seen either way (just like ‘entropic gravity’).

“The information of a probability distribution q relative to a probability distribution q is”

There seems to be an error

Definitely an error. The formula is for the information of $q$ relative to $p$. I’ll fix this sentence. Thanks!

John re: “…my version applies to situations where the probability distribution moves round and round in periodic orbits!”-

Does this mean that a population could “learn” at a high rate for a long time, and end up in exactly the state in which it started?

If so, this seems at odds with nontechnical usage, in which “learn” usually indicates a progressive activity with something to show for it at the end (you know – a more faithful representation of reality, increased system uptime, a lower golf handicap or something).

Am I just confused, or should “learn” turn to “churn”?-)

Yes! I agree that this is different than the colloquial use of “learning”. Furthermore, I’m not using “learning” in a way that suggests anything learned is *true*. In my talk, I pointed out that you could be learning “facts” or “alternative facts”. We can imagine someone converting to a religion and “learning” a lot of information, and then quitting that religion and “learning” a bunch more. A more accurate expression would be “updating one’s hypothesis”, or even better “changing one’s probability distribution”. But I needed a word that was short.

It’s all explained in the equations: there’s relative information, or “information left to learn about a given hypothesis”, which decreases as your hypothesis gets closer and closer to that one, and there’s the “Fisher information metric”, which measures your “speed of learning” as you move through the space of hypotheses (that is, probability distributions). The latter has no directional sense.

When you measure the speed of learning using the Fisher information metric, it’s sort of like the speedometer in your car. You can also measure the total arclength of a path in the space of hypotheses, which is like the odometer in your car.

You can rack up quite a big mileage as you take a trip around the country and return to where you started. From a certain point of view the trip was pointless since you didn’t wind up going anywhere: you could have just stayed home! But from another point of view you were “going places” even though you wound up where you started.

Thank you John.

This is sort of irrelevant, but normally one thinks in simplest terms that Darwinian evolution has a trajectory that maximizes fitness. So it’s like a ball rolling down a hill. At equilibrium it stops. In simplest form Fisher’s theorem states this—except it’s a statistical process (almost intuitive—no different from why hot coffee cools, or why an ideal gas observed over time reaches a statistical equilibrium of maximum entropy).

In reality the theorem is inapplicable because it assumes ‘bean bag genetics’—all genes are independent—when in fact they are linked via epistasis, frequency-dependent and other effects. (There are very similar models in genetics and economics which show why these systems in general do not reach some global optimum. Instead they tend to stay in ‘basins of attraction’, attractors or potential wells, for varying times. Not everything is an ideal gas.)

Hence it’s not like a perfectly round ball rolling down a perfectly smooth hill, or an ideal gas undergoing Brownian motion leading to a Gaussian. Instead the system can cycle, end up stuck in a nook or cranny, etc. One needs concepts from chaos and ergodic theory to understand this. Many discussions in biology, economics, and physics are either confused on this or not well presented.

I am certainly not an expert on this, but it seems to me that mathematicians should not call “Fisher’s fundamental theorem” a theorem. It is rather a hypothesis. Right?

You, on the other hand, could show that for your system Fisher’s hypothesis is not true and even more. You could give a mathematically sound description of a replacement. I would really like to see that on the arxiv to better understand the details.

Some comments criticize the lack of applicability. To address this (valid) point one could try to take the (new) hypothesis away from the above system and formulate it for say Markov chains. All that with the hope to establish a connection to ergodic and/or dynamical systems theory.

I think Fisher’s FT is a theorem, based on assumptions or axioms. The problem with theorems used for real systems like biology and economics is that the assumptions are unrealistic (simplified) in most cases. (I think there are a few simple cases where FFT does hold—simple organisms like bacteria which live in a dilute or uncrowded environment under constant conditions.)

If you look at the George Price paper linked to in the blog post you will see on page 2 exactly what assumptions Fisher used to derive his theorem. It only applies to additive genetic variance—that organism and gene fitnesses do not depend on environmental conditions.

Hence it doesn’t apply to systems with epistasis and dominance (in which the fitness of a gene depends on interactions with other genes), population density (meaning that organisms or genes are indifferent to how crowded they are—a billion rats in a small cage are as fit as if there are 10), frequency dependence (like saying a cook who has a ‘perfect recipe’ involving specific combinations of ingredients is no better than one who just randomly throws in ingredients—e.g. a cup of coffee which is 90% coffee and 10% sugar is as good or ‘fit’ as one which is 10% coffee and 90% sugar), and it assumes exogenous changes in environment don’t matter (e.g. climate change).

This is why FFT in its most commonly used limited form is viewed as bean bag genetics. (If this were true, things like plant and animal breeding for better varieties would be impossible.)

Price’s form of FFT (which is the full theorem, including terms commonly dropped out) avoids this problem because it includes the ‘nonlinear’ terms usually left out. The limited form is an ‘ideal gas model’—it assumes genes are like Boltzmann’s perfectly elastic colliding point particles. Real particles have interactions, and are ‘frustrated’—they cannot collide with every other particle due to time and space constraints.

This is why FFT is similar to general equilibrium economics. That also is a theorem—that economies, like simple genetic systems, always move to some optima and eventually an equilibrium. But this assumes humans have perfect information, interact (trade) with everyone else in the economy, and act for self-interest by maximizing an unchanging utility function or set of preferences. Assuming that, economies do move to unique equilibria which are ‘Pareto optimal’.

Once you throw in some nonlinearities (e.g. interdependent utility functions), or environmental effects (transaction costs, human effect on the environment, or learning—changing preferences), in general there are multiple equilibria or none at all (at least unique ones)—the system just cycles or is chaotic.

Interesting to me is that even assuming bean bag or ideal gas type models, sometimes you get results in good agreement with observations—just as many physical phenomena are approximately Gaussian, though they don’t arise by some random diffusion process (but rather often by the way data are selected and aggregated).

Thanks for the explanation!

My comment was inspired by the ergodic hypothesis. For some systems it is a theorem, whereas for others it is not true. I have interpreted John’s “The square of the speed is the variance of the fitness” as such a hypothesis: true for some systems, like the replicator equation, and false for others.

Actually, I have a joint paper with C. Schwarz on non-equilibrium economics. I am quite familiar with this topic.

The ‘ergodic hypothesis’ or theory is wayyy up there in basic issues. If you look at the original proofs of this, by von Neumann, Wiener, Birkhoff, you can see immediately that they prove their assumptions—it’s a variant of the CLT, the central limit theorem. It took KAM (Kolmogorov/Arnold/Moser) to fix this up. (See ‘Harmonic analysis as the exploitation of symmetry’ by G. Mackey, 1980—free online, Bulletin of the AMS. My prof told me to read this old classic, which I did; it’s a bible. He also tried to flunk me, which he couldn’t do since I passed the tests—I just didn’t go to class.)

I looked at your web page. Non-commutative relations in economics are quite good. I know of only 2 good papers on ergodic theory in economics, and they do not go far.

Time-dependent Cobb–Douglas utility functions are better, but only go so far. (And actually, if you look, Cobb–Douglas utility functions are a version of entropy.)

Uwe wrote:

I believe Fisher stated it as a mathematical result with assumptions, a conclusion and a mathematical argument leading from the assumptions to the conclusion. I haven’t read his original paper, and I should, but researchers considered his proof flawed until George Price fixed it up, as the Wikipedia passage I quoted says.

Earlier this article says Fisher’s theorem is not a theorem, which seems to contradict this claim! That’s partially because Wikipedia is written by a multitude of authors, but also it seems to reflect the confusion and controversy among biologists about the status of Fisher’s theorem.

I should read Fisher’s original paper and try to make the Wikipedia article a bit clearer. In particular, it never clearly states the theorem, including the assumptions that imply the conclusion!

I can show that for certain systems Fisher’s assumptions don’t hold and his conclusion doesn’t hold either, but my more general assumptions *do* hold, and so my conclusions do.

Okay, thanks for your vote. I’ve been thinking this might be worthwhile. I’m trying to think of a good title—one that would simultaneously say I’m trying to generalize Fisher’s fundamental theorem and that this new theorem has quite different conclusions.

P.S. I think a lot if not all of this can be done via Markov chains/processes. However, these are higher-order Markov processes (usually termed non-Markov). The next state depends on the current one, and many previous ones—it’s state dependent—and all states are not created equal (e.g. the US electoral college).

My “improved Fisher’s theorem”, like the original, deals with the time evolution of probability distributions via a *nonlinear* equation. I’m pretty sure this can’t be described by Markovian dynamics, even higher-order, because those equations are linear in the probability distribution.

On the other hand, my improved Fisher’s theorem, like the original, does not apply to Markov processes, because my theorem applies to processes for which if $p_i = 0$ at some particular time then $p_i = 0$ at all later times. In biological terms, it disallows ‘mutation’ or other sources of ‘novelty’. By ‘novelty’ I mean that the probability of a randomly chosen organism being of the *i*th species is zero at some time but nonzero later.

What if e.g. ?

John wrote: My “improved Fisher’s theorem”, like the original, deals with the time evolution of probability distributions via a nonlinear equation. I’m pretty sure this can’t be described by Markovian dynamics, even higher-order, because those equations are linear in the probability distribution.

Nonlinearity alone is not sufficient to rule out a Markovian approach as long as you allow infinite dimensional state spaces. E.g. in dynamical systems transfer (or Perron-Frobenius) operators are used to capture properties of non-linear systems (Ruelle, Lasota-Yorke). These operators are linear and, if considered on a measure space, also Markov.

What I have in mind is a result like:

The following are equivalent.

1) The square of the speed is the variance of the fitness.

2) The generator of the whatever (e.g. special Markov chain) satisfies some technical condition.

Now that might not make much sense, but as I said, I am still a beginner and trying to understand what you are doing.

I think you are correct.

The replicator equation in general is already nonlinear. (The way it’s written makes it look linear.)

The Perron–Frobenius theorem is one way to hide all the nonlinearities. (Ruelle goes through all this, though I don’t understand the details.)

Langevin equations are the natural way to turn a replicator dynamic into a Markov process and then, in the limit, into a standard stochastic DE (Fokker–Planck). This is only true in the infinite limit.

So, it’s equivalent or it’s not. If you look at discrete vs continuous dynamics of the logistic equation, in one case it has an equilibrium solution (the continuous case); in the other case it’s chaotic.

You may find interesting N. Behera’s ‘Optimality principles in evolutionary genetics’ in Current Science (1999)—not very technical. He tries to find a Lagrangian for this dynamics and finds one, L = 1/8 ((your speed of learning) + (variance of fitness)),

and points out its limited applications, but seems to come to conclusions similar to yours.

(I can’t do math notation on this computer.)

Here’s a link

http://www.iisc.ernet.in/currsci/nov10/GENERALARTICLES.PDF

It’s the last article of that magazine—last few pages.

James Crow also discussed Behera’s approach in Evolution 56 (2002), in ‘Here’s to Fisher, additive genetic variance, and the fundamental theorem of natural selection’.

(free on JSTOR)

I’m trying to get an intuitive picture, but basically the speed of learning (the Fisher information metric or relative entropy) equals the rate of change of mean fitness. They are both kinds of distances, and evolution (or rate of change) is proportional to these distances. This seems to make sense—in the general case, the mean fitness can always be changing—go up and down. (I took that also to be the point of the Curtsinger paper in PNAS I linked to.) One’s rate of learning is the square of that—it’s kind of like a kinetic energy v², while mean fitness is more like velocity v. (This may not make much sense.)

(You likely are right about non-Markov processes—I can’t think now how to turn the replicator dynamic into one.

Some possible relations may be discussed in https://arxiv.org/abs/cond-mat/0409655 since Markov processes are often associated with stochastic DEs.)