The Mathematical Origin of Irreversibility

guest post by Matteo Smerlak

Introduction

Thermodynamical dissipation and adaptive evolution are two faces of the same Markovian coin!

Consider this. The Second Law of Thermodynamics states that the entropy of an isolated thermodynamic system can never decrease; Landauer’s principle maintains that the erasure of information inevitably causes dissipation; Fisher’s fundamental theorem of natural selection asserts that any fitness difference within a population leads to adaptation in an evolution process governed by natural selection. Diverse as they are, these statements have two common characteristics:

1. they express the irreversibility of certain natural phenomena, and

2. the dynamical processes underlying these phenomena involve an element of randomness.

Doesn’t this suggest to you the following question: Could it be that thermal phenomena, forgetful information processing and adaptive evolution are governed by the same stochastic mechanism?

The answer is—yes! The key to this rather profound connection resides in a universal property of Markov processes discovered recently in the context of non-equilibrium statistical mechanics, and known as the ‘fluctuation theorem’. Typically stated in terms of ‘dissipated work’ or ‘entropy production’, this result can be seen as an extension of the Second Law of Thermodynamics to small systems, where thermal fluctuations cannot be neglected. But it is actually much more than this: it is the mathematical underpinning of irreversibility itself, be it thermodynamical, evolutionary, or else. To make this point clear, let me start by giving a general formulation of the fluctuation theorem that makes no reference to physics concepts such as ‘heat’ or ‘work’.

The mathematical fact

Consider a system randomly jumping between states a, b,\dots with (possibly time-dependent) transition rates \gamma_{a b}(t) where a is the state prior to the jump, while b is the state after the jump. I’ll assume that this dynamics defines a (continuous-time) Markov process, namely that the numbers \gamma_{a b} are the matrix entries of an infinitesimal stochastic matrix, which means that its off-diagonal entries are non-negative and that its columns sum up to zero.

Now, each possible history \omega=(\omega_t)_{0\leq t\leq T} of this process can be characterized by the sequence of occupied states a_{j} and by the times \tau_{j} at which the transitions a_{j-1}\longrightarrow a_{j} occur (1\leq j\leq N):

\omega=(\omega_{0}=a_{0}\overset{\tau_{1}}{\longrightarrow} a_{1} \overset{\tau_{2}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T}).

Define the skewness \sigma_{j}(\tau_{j}) of each of these transitions to be the logarithmic ratio of transition rates:

\displaystyle{\sigma_{j}(\tau_{j}):=\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}

Also define the self-information of the system in state a at time t by:

i_a(t):= -\ln\pi_{a}(t)

where \pi_{a}(t) is the probability that the system is in state a at time t, given some prescribed initial distribution \pi_{a}(0). This quantity is also sometimes called the surprisal, as it measures the ‘surprise’ of finding out that the system is in state a at time t.

Then the following identity—the detailed fluctuation theorem—holds:

\mathrm{Prob}[\Delta i-\Sigma=-A] = e^{-A}\;\mathrm{Prob}[\Delta i-\Sigma=A]

where

\displaystyle{\Sigma:=\sum_{j}\sigma_{j}(\tau_{j})}

is the cumulative skewness along a trajectory of the system, and

\Delta i= i_{a_N}(T)-i_{a_0}(0)

is the variation of self-information between the end points of this trajectory.

This identity has an immediate consequence: if \langle\,\cdot\,\rangle denotes the average over all realizations of the process, then we have the integral fluctuation theorem:

\langle e^{-\Delta i+\Sigma}\rangle=1,

which, by the convexity of the exponential and Jensen’s inequality, implies:

\langle \Delta i\rangle=\Delta S\geq\langle\Sigma\rangle.

In short: the mean variation of self-information, aka the variation of Shannon entropy

\displaystyle{ S(t):= \sum_{a}\pi_{a}(t)i_a(t) }

is bounded from below by the mean cumulative skewness of the underlying stochastic trajectory.

This is the fundamental mathematical fact underlying irreversibility. To unravel its physical and biological consequences, it suffices to consider the origin and interpretation of the ‘skewness’ term in different contexts. (By the way, people usually call \Sigma the ‘entropy production’ or ‘dissipation function’—but how tautological is that?)
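The integral fluctuation theorem and the inequality that follows from it are easy to test numerically. Here is a minimal Monte Carlo sketch for a two-state system with constant rates. The rate values, the initial distribution, the time window T and all function names are illustrative assumptions, not part of the argument above; \pi_a(T) is taken from the closed-form solution of the two-state master equation.

```python
import math
import random

# Illustrative assumptions: constant rates, K[a][b] is the a -> b rate.
K = {0: {1: 1.0}, 1: {0: 2.0}}
PI0 = [0.9, 0.1]   # prescribed initial distribution pi_a(0)
T = 1.0            # length of the time window


def pi_at(t):
    """Closed-form solution of the two-state master equation."""
    k01, k10 = K[0][1], K[1][0]
    eq0 = k10 / (k01 + k10)                      # stationary weight of state 0
    p0 = eq0 + (PI0[0] - eq0) * math.exp(-(k01 + k10) * t)
    return [p0, 1.0 - p0]


def sample_path(rng):
    """Simulate one history and return (delta_i, Sigma) for it."""
    a0 = 0 if rng.random() < PI0[0] else 1
    a, t, sigma = a0, 0.0, 0.0
    while True:
        b = 1 - a                                # the only possible target
        t += rng.expovariate(K[a][b])            # waiting time in state a
        if t > T:
            break
        sigma += math.log(K[b][a] / K[a][b])     # skewness of the jump a -> b
        a = b
    delta_i = -math.log(pi_at(T)[a]) + math.log(PI0[a0])
    return delta_i, sigma


def estimate(n_paths=20000, seed=0):
    """Monte Carlo averages of exp(-delta_i + Sigma), delta_i and Sigma."""
    rng = random.Random(seed)
    tot_exp = tot_di = tot_sigma = 0.0
    for _ in range(n_paths):
        di, sigma = sample_path(rng)
        tot_exp += math.exp(-di + sigma)
        tot_di += di
        tot_sigma += sigma
    return tot_exp / n_paths, tot_di / n_paths, tot_sigma / n_paths


if __name__ == "__main__":
    mean_exp, mean_di, mean_sigma = estimate()
    print("<exp(-delta_i + Sigma)> ~", mean_exp)   # expected close to 1
    print("<delta_i> ~", mean_di, " <Sigma> ~", mean_sigma)
```

Up to sampling noise, the first average should sit near 1 and the mean of \Delta i should exceed the mean of \Sigma. A two-state chain is the simplest possible case; a richer state space would need a numerical solution of the master equation for \pi_a(T).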

The physical and biological consequences

Consider first the standard stochastic-thermodynamic scenario where a physical system is kept in contact with a thermal reservoir at inverse temperature \beta and undergoes thermally induced transitions between states a, b,\dots. By virtue of the detailed balance condition:

\displaystyle{ e^{-\beta E_{a}(t)}\gamma_{a b}(t)=e^{-\beta E_{b}(t)}\gamma_{b a}(t),}

the skewness \sigma_{j}(\tau_{j}) of each such transition is \beta times the energy difference between the states a_{j} and a_{j-1}, namely the heat received from the reservoir during the transition. Hence, the mean cumulative skewness \langle \Sigma\rangle is nothing but \beta\langle Q\rangle, with Q the total heat received by the system along the process. It follows from the integral fluctuation theorem that

\langle e^{-\Delta i+\beta Q}\rangle=1

and therefore

\Delta S\geq\beta\langle Q\rangle

which is of course Clausius’ inequality. In a computational context where the control parameter is the entropy variation itself (such as in a bit-erasure protocol, where \Delta S=-\ln 2), this inequality in turn expresses Landauer’s principle: it is impossible to decrease the self-information of the system’s state without dissipating a minimal amount of heat into the environment (in this case -\langle Q\rangle \geq k T\ln 2, the ‘Landauer bound’). More general situations (several types of reservoirs, Maxwell-demon-like feedback controls) can be treated along the same lines, and the various forms of the Second Law derived from the detailed fluctuation theorem.
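To get a feeling for the size of the Landauer bound, here is a small sketch that evaluates k T \ln 2 for one erased bit. The temperature of 300 K is an assumed, roughly room-temperature value and the function name is hypothetical.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_bound(temperature_kelvin):
    """Minimal heat in joules dissipated when erasing one bit: k T ln 2."""
    return K_BOLTZMANN * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    # 300 K is an assumption standing in for "room temperature".
    print(landauer_bound(300.0))
```

At 300 K this comes out at about 2.87 × 10^-21 joules per bit.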

Now, many would agree that evolutionary dynamics is a wholly different business from thermodynamics; in particular, notions such as ‘heat’ or ‘temperature’ are clearly irrelevant to Darwinian evolution. However, the stochastic framework of Markov processes is relevant to describe the genetic evolution of a population, and this fact alone has important consequences. As a simple example, consider the time evolution of mutant fixations x_{a} in a population, with a ranging over the possible genotypes. In a ‘symmetric mutation scheme’, which I understand is biological parlance for ‘reversible Markov process’, meaning one that obeys detailed balance, the ratio between the a\mapsto b and b\mapsto a transition rates is completely determined by the fitnesses f_{a} and f_b of a and b, according to

\displaystyle{\frac{\gamma_{a b}}{\gamma_{b a}} =\left(\frac{f_{b}}{f_{a}}\right)^{\nu} }

where \nu is a model-dependent function of the effective population size [Sella2005]. Along a given history of mutant fixations, the cumulative skewness \Sigma is therefore given by minus the fitness flux:

\displaystyle{\Phi=\nu\sum_{j}(\ln f_{a_j}-\ln f_{a_{j-1}}).}

The integral fluctuation theorem then becomes the fitness flux theorem:

\displaystyle{ \langle e^{-\Delta i -\Phi}\rangle=1}

discussed recently by Mustonen and Lässig [Mustonen2010] and implying Fisher’s fundamental theorem of natural selection as a special case. (Incidentally, the ‘fitness flux theorem’ derived in this reference is more general than this; for instance, it does not rely on the ‘symmetric mutation scheme’ assumption above.) The ensuing inequality

\langle \Phi\rangle\geq-\Delta S

shows that a positive fitness flux is “an almost universal evolutionary principle of biological systems” [Mustonen2010], with negative contributions limited to time intervals with a systematic loss of adaptation (\Delta S > 0). This statement may well be the closest thing to a version of the Second Law of Thermodynamics applying to evolutionary dynamics.
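The inequality \langle \Phi\rangle\geq-\Delta S can be illustrated on a toy model. The sketch below assumes three genotypes with constant fitnesses, \nu=1, and rates \gamma_{ab}=\mu (f_b/f_a)^{\nu/2}, which satisfy the ratio condition above. All of these choices, and the function names, are illustrative assumptions. Because the fitnesses are constant here, the sum defining \Phi telescopes, so its mean only depends on \pi(0) and \pi(T).

```python
import math

# Illustrative assumptions: three genotypes, constant fitnesses, and rates
# gamma_ab = MU * (f_b / f_a)**(NU / 2), so gamma_ab / gamma_ba = (f_b / f_a)**NU.
FITNESS = [1.0, 2.0, 4.0]
NU = 1.0
MU = 1.0
N = len(FITNESS)


def rate(a, b):
    """Transition rate from genotype a to genotype b."""
    return MU * (FITNESS[b] / FITNESS[a]) ** (NU / 2.0)


def master_rhs(pi):
    """Right-hand side of the master equation: gain minus loss."""
    return [
        sum(rate(b, a) * pi[b] - rate(a, b) * pi[a] for b in range(N) if b != a)
        for a in range(N)
    ]


def evolve(pi0, T, steps=20000):
    """Integrate the master equation with small explicit Euler steps."""
    pi, dt = list(pi0), T / steps
    for _ in range(steps):
        d = master_rhs(pi)
        pi = [p + dt * dp for p, dp in zip(pi, d)]
    return pi


def entropy(pi):
    """Shannon entropy of a distribution."""
    return -sum(p * math.log(p) for p in pi if p > 0.0)


def mean_fitness_flux(pi0, piT):
    """<Phi> for constant fitnesses, where the sum over jumps telescopes."""
    return NU * sum((piT[a] - pi0[a]) * math.log(FITNESS[a]) for a in range(N))


if __name__ == "__main__":
    pi0 = [1.0 / N] * N                 # start from the uniform distribution
    piT = evolve(pi0, T=2.0)
    print("<Phi>    =", mean_fitness_flux(pi0, piT))
    print("-Delta S =", entropy(pi0) - entropy(piT))
```

Starting from the uniform distribution, the entropy can only go down, so the mean fitness flux has to be positive in this example. With time-dependent fitnesses the telescoping shortcut no longer applies and one would have to sample trajectories instead.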

It is really quite remarkable that thermodynamical dissipation and Darwinian evolution can be reduced to the same stochastic mechanism, and that notions such as ‘fitness flux’ and ‘heat’ can arise as two faces of the same mathematical coin, namely the ‘skewness’ of Markovian transitions. After all, the phenomenon of life is in itself a direct challenge to thermodynamics, isn’t it? When thermal phenomena tend to increase the world’s disorder, life strives to bring about and maintain exquisitely fine spatial and chemical structures—which is why Schrödinger famously proposed to define life as negative entropy. Could there be a more striking confirmation of his intuition—and a reconciliation of evolution and thermodynamics in the same go—than the fundamental inequality of adaptive evolution \langle\Phi\rangle\geq-\Delta S?

Surely the detailed fluctuation theorem for Markov processes has other applications, pertaining neither to thermodynamics nor adaptive evolution. Can you think of any?

Proof of the fluctuation theorem

I am a physicist, but knowing that many readers of John’s blog are mathematicians, I’ll do my best to frame—and prove—the FT as an actual theorem.

Let (\Omega,\mathcal{T},p) be a probability space and (\,\cdot\,)^{\dagger}:\Omega\to \Omega a measurable involution of \Omega. Denote by p^{\dagger} the pushforward probability measure through this involution, and

\displaystyle{ R=\ln \frac{d p}{d p^\dagger} }

the logarithm of the corresponding Radon-Nikodym derivative (we assume p^\dagger and p are mutually absolutely continuous). Then the following lemmas are true, with (1)\Rightarrow(2)\Rightarrow(3):

Lemma 1. The detailed fluctuation relation:

\forall A\in\mathbb{R} \quad  p\big(R^{-1}(-A) \big)=e^{-A}p \big(R^{-1}(A) \big)

Lemma 2. The integral fluctuation relation:

\displaystyle{\int_{\Omega} d p(\omega)\,e^{-R(\omega)}=1 }

Lemma 3. The positivity of the Kullback-Leibler divergence:

D(p\,\Vert\, p^{\dagger}):=\int_{\Omega} d p(\omega)\,R(\omega)\geq 0.

These are basic facts which anyone can show: (2)\Rightarrow(3) by Jensen’s inequality, (1)\Rightarrow(2) trivially, and (1) follows from R(\omega^{\dagger})=-R(\omega) and the change of variables theorem, as follows,

\begin{array}{ccl} \displaystyle{ \int_{R^{-1}(-A)} d p(\omega)} &=& \displaystyle{ \int_{R^{-1}(A)}d p^{\dagger}(\omega) } \\ \\ &=& \displaystyle{ \int_{R^{-1}(A)} d p(\omega)\, e^{-R(\omega)} } \\ \\ &=& \displaystyle{ e^{-A} \int_{R^{-1}(A)} d p(\omega)} .\end{array}
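The three lemmas can be checked by hand on a finite sample space, where the Radon-Nikodym derivative is just a ratio of point masses. The sketch below uses five outcomes, an involution that swaps two pairs and fixes one point, and arbitrary weights; all of these are illustrative assumptions.

```python
import math

# Illustrative finite sample space {0, ..., 4}. DAGGER is an involution:
# it swaps 0 <-> 1 and 2 <-> 3 and fixes 4. The weights are arbitrary.
DAGGER = [1, 0, 3, 2, 4]
WEIGHTS = [5.0, 1.0, 3.0, 2.0, 4.0]
P = [w / sum(WEIGHTS) for w in WEIGHTS]              # the measure p
P_DAGGER = [P[DAGGER[w]] for w in range(len(P))]     # the pushforward p^dagger
R = [math.log(P[w] / P_DAGGER[w]) for w in range(len(P))]


def level_prob(value, tol=1e-12):
    """p(R^{-1}(value)): total probability of the outcomes where R = value."""
    return sum(P[w] for w in range(len(P)) if abs(R[w] - value) < tol)


def integral_relation():
    """Left-hand side of Lemma 2: the p-average of exp(-R)."""
    return sum(P[w] * math.exp(-R[w]) for w in range(len(P)))


def kl_divergence():
    """Left-hand side of Lemma 3: the p-average of R."""
    return sum(P[w] * R[w] for w in range(len(P)))


if __name__ == "__main__":
    for A in sorted(set(round(r, 12) for r in R)):
        # Lemma 1: both columns should agree for every value A taken by R.
        print(A, level_prob(-A), math.exp(-A) * level_prob(A))
    print("integral relation:", integral_relation())
    print("KL divergence:", kl_divergence())
```

Note that the fixed point of the involution has R=0, which is the borderline case where the detailed relation holds trivially.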

But here is the beauty: if

• (\Omega,\mathcal{T},p) is actually a Markov process defined over some time interval [0,T] and valued in some (say discrete) state space \Sigma, with the instantaneous probability \pi_{a}(t)=p\big(\{\omega_{t}=a\} \big) of each state a\in\Sigma satisfying the master equation (aka Kolmogorov equation)

\displaystyle{ \frac{d\pi_{a}(t)}{dt}=\sum_{b\neq a}\Big(\gamma_{b a}(t)\pi_{b}(t)-\gamma_{a b}(t)\pi_{a}(t)\Big),}

and

• the dagger involution is time-reversal, that is \omega^{\dagger}_{t}:=\omega_{T-t},

then for a given path

\displaystyle{\omega=(\omega_{0}=a_{0}\overset{\tau_{1}}{\longrightarrow} a_{1} \overset{\tau_{2}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T})\in\Omega}

the logarithmic ratio R(\omega) decomposes into ‘variation of self-information’ and ‘cumulative skewness’ along \omega:

\displaystyle{ R(\omega)=\underbrace{\Big(\ln\pi_{a_0}(0)-\ln\pi_{a_N}(T) \Big)}_{\Delta i(\omega)}-\underbrace{\sum_{j=1}^{N}\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}_{\Sigma(\omega)}.}

This is easy to see if one writes the probability of a path explicitly as

\displaystyle{p(\omega)=\pi_{a_{0}}(0)\left[\prod_{j=1}^{N}\phi_{a_{j-1}}(\tau_{j-1},\tau_{j})\gamma_{a_{j-1}a_{j}}(\tau_{j})\right]\phi_{a_{N}}(\tau_{N},T)}

where

\displaystyle{ \phi_{a}(\tau,\tau')=\phi_{a}(\tau',\tau)=\exp\Big(-\sum_{b\neq a}\int_{\tau}^{\tau'}dt\, \gamma_{a b}(t)\Big)}

is the probability that the process remains in the state a between the times \tau and \tau'. It follows from the above lemma that

Theorem. Let (\Omega,\mathcal{T},p) be a Markov process and let \Delta i,\Sigma:\Omega\rightarrow \mathbb{R} be defined as above. Then we have

1. The detailed fluctuation theorem:

\forall A\in\mathbb{R}, p\big((\Delta i-\Sigma)^{-1}(-A) \big)=e^{-A}p \big((\Delta i-\Sigma)^{-1}(A) \big)

2. The integral fluctuation theorem:

\int_{\Omega} d p(\omega)\,e^{-\Delta i(\omega)+\Sigma(\omega)}=1

3. The ‘Second Law’ inequality:

\displaystyle{ \Delta S:=\int_{\Omega} d p(\omega)\,\Delta i(\omega)\geq \int_{\Omega} d p(\omega)\,\Sigma(\omega)}

The same theorem can be formulated for other kinds of Markov processes as well, including diffusion processes (in which case it follows from the Girsanov theorem).
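The decomposition of R(\omega) used in the proof can also be checked numerically on a single path. The sketch below makes several assumptions that are not spelled out above: the rates are constant, the reversed history is taken to start from the final-time distribution \pi(T) (this is where the \ln\pi_{a_N}(T) term comes from), and the numbers in PI_0 and PI_T are placeholders rather than solutions of a master equation, since the identity being tested is purely algebraic. All names are hypothetical.

```python
import math

# Illustrative assumptions: constant rates (RATES[a][b] is the a -> b rate)
# and placeholder values for the distributions pi(0) and pi(T).
RATES = {0: {1: 1.0, 2: 0.5}, 1: {0: 2.0, 2: 1.0}, 2: {0: 1.0, 1: 1.5}}
PI_0 = {0: 0.5, 1: 0.3, 2: 0.2}
PI_T = {0: 0.4, 1: 0.25, 2: 0.35}
T = 3.0


def log_path_density(states, jump_times, log_initial):
    """ln p(omega): initial weight, survival factors phi, and jump rates."""
    logp, t_prev = log_initial, 0.0
    for j, t in enumerate(jump_times):
        a, b = states[j], states[j + 1]
        logp -= sum(RATES[a].values()) * (t - t_prev)    # ln phi_a
        logp += math.log(RATES[a][b])                    # ln gamma_ab
        t_prev = t
    logp -= sum(RATES[states[-1]].values()) * (T - t_prev)
    return logp


def log_ratio(states, jump_times):
    """R(omega), with the reversed path weighted by pi(T) at its start."""
    rev_states = states[::-1]
    rev_times = [T - t for t in reversed(jump_times)]
    forward = log_path_density(states, jump_times, math.log(PI_0[states[0]]))
    backward = log_path_density(rev_states, rev_times, math.log(PI_T[states[-1]]))
    return forward - backward


def delta_i_minus_sigma(states):
    """Delta i - Sigma computed directly from the end points and the jumps."""
    delta_i = math.log(PI_0[states[0]]) - math.log(PI_T[states[-1]])
    sigma = sum(
        math.log(RATES[b][a] / RATES[a][b]) for a, b in zip(states, states[1:])
    )
    return delta_i - sigma


if __name__ == "__main__":
    path, times = [0, 1, 2, 1], [0.4, 1.1, 2.5]
    print(log_ratio(path, times), delta_i_minus_sigma(path))
```

The two printed numbers should agree up to rounding: the survival factors cancel between the forward and reversed histories because each state is occupied for the same duration in both, leaving only the end-point weights and the ratios of jump rates.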

References

Landauer’s principle was introduced here:

• [Landauer1961] R. Landauer, Irreversibility and heat generation in the computing process, IBM Journal of Research and Development 5 (1961), 183–191.

and is now being verified experimentally by various groups worldwide.

The ‘fundamental theorem of natural selection’ was derived by Fisher in his book:

• [Fisher1930] R. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.

His derivation has long been considered obscure, even perhaps wrong, but apparently the theorem is now well accepted. I believe the first Markovian models of genetic evolution appeared here:

• [Fisher1922] R. A. Fisher, On the dominance ratio, Proc. Roy. Soc. Edinb. 42 (1922), 321–341.

• [Wright1931] S. Wright, Evolution in Mendelian populations, Genetics 16 (1931), 97–159.

Fluctuation theorems are reviewed here:

• [Sevick2008] E. Sevick, R. Prabhakar, S. R. Williams, and D. J. Searles, Fluctuation theorems, Ann. Rev. Phys. Chem. 59 (2008), 603–633.

Two of the key ideas for the ‘detailed fluctuation theorem’ discussed here are due to Crooks:

• [Crooks1999] Gavin Crooks, The entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences, Phys. Rev. E 60 (1999), 2721–2726.

who identified (E_{a_{j}}(\tau_{j})-E_{a_{j-1}}(\tau_{j})) as heat, and Seifert:

• [Seifert2005] Udo Seifert, Entropy production along a stochastic trajectory and an integral fluctuation theorem, Phys. Rev. Lett. 95 (2005), 4.

who understood the relevance of the self-information in this context.

The connection between statistical physics and evolutionary biology is discussed here:

• [Sella2005] G. Sella and A.E. Hirsh, The application of statistical physics to evolutionary biology, Proc. Nat. Acad. Sci. USA 102 (2005), 9541–9546.

and the ‘fitness flux theorem’ is derived in

• [Mustonen2010] V. Mustonen and M. Lässig, Fitness flux and ubiquity of adaptive evolution, Proc. Nat. Acad. Sci. USA 107 (2010), 4248–4253.

Schrödinger’s famous discussion of the physical nature of life was published here:

• [Schrödinger1944] E. Schrödinger, What is Life?, Cambridge University Press, Cambridge, 1944.

46 Responses to The Mathematical Origin of Irreversibility

  1. Okay, this is all good and impressive, but it still doesn’t resolve Loschmidt’s paradox. How can one explain irreversible thermodynamics from time-symmetric fundamental laws?

    • John Baez says:

      This post is not trying to explain that paradox, and I would not personally have chosen the title ‘The Mathematical Origin of Irreversibility’, because that suggests otherwise.

      • That’s right, it’s not about this paradox, especially because in the thermodynamical context this kind of Markov dynamics does not describe closed systems: the transitions are induced by a “bath” which you choose not to describe explicitly. So there’s lots of information being discarded from the very start (it’s a “mesoscopic” description) and therefore this theorem can’t teach anything about “fundamental laws”. (Incidentally, “discarded information” was also Boltzmann’s answer to Loschmidt: the Stosszahl Ansatz discards pair correlations between particles).

        But suppose you’re happy with that (Loschmidt not in the room?), and ask “whence irreversibility” in this context. In other words, can you tell which trajectories contribute to the mean growth of entropy, and why? The answer is in the theorem: the entropy-producing trajectories are the “skew” ones, those which undergo transitions a\rightarrow b even though \gamma_{ab}<\gamma_{ba}. When this happens, entropy grows.

        That's what I mean by "origin", not something related to "fundamental laws". Any suggestions for a better name?

        • It is interesting to connect this to Loschmidt’s paradox and see where we can go with the connection and what the fluctuation theorem can say in this regard; I am definitely not trying to criticize your post. It would also be interesting to see whether we can derive an analogous theorem from a quantum probability perspective, and whether this has anything more “fundamental” to say about the “origin” of “irreversibility”, seeing as how the fluctuation theorem is derived from a classical probability perspective.

        • One more point which people often make, in connection with the Loschmidt paradox. For a random variable R(\omega) (such as \Delta i-\beta Q, as above) to satisfy the integral fluctuation theorem

          \langle e^{-R}\rangle=1

          and therefore obey

          \langle R\rangle \ge 0,

          it is necessary that some paths \omega have

          R(\omega)<0,

          i.e. are "second-law violating"! They are exponentially suppressed, but still, they must be there!

          So in a way (and with considerable hindsight) Loschmidt was right: "second-law violating" paths (R(\omega)<0) are essential to the second law (\langle R \rangle \ge 0)!

        • andreo says:

          I work a lot with fluctuation theorems in stochastic processes and I found Matteo’s post a neat (and precise) synthesis, thanks a lot!

          I’d like to mention an interesting fact, also related to Loschmidt’s paradox. Fluctuation theorems were originally conceived in the framework of *deterministic* dynamical systems (at the beginning of the ’90s) and – a few years later – they were imported into the realm of (Markovian and not only) stochastic processes. The quantity which satisfies the theorem in deterministic systems is the phase space contraction rate: this means that a system which conserves phase space (e.g. a Hamiltonian system) is not even a candidate for the theorem. (Very) roughly speaking, the idea is that one does not study irreversibility of a closed deterministic system, but only of an open one, which (being a portion of a closed – Hamiltonian – one) has some continuous “leak” of information which implies some production of entropy. Still one may encounter the paradox: models of deterministic systems exist which have time-reversible microscopic equations and a phase space contraction rate whose average is positive and whose fluctuations satisfy the fluctuation theorem. An example is a classical molecular fluid under stationary shear and “thermostatted” in order to avoid heating. There one may “see the paradox at work” in its full glory [see for instance Phys. Rev. E 50, 1645–1648 (1994) ].

          Of course when a paradox is formalized in mathematical terms, it evaporates: one can see the “error” and solve it. Unfortunately the solution is not simple. It basically involves dynamical instability (chaos – i.e. positivity of some Lyapunov exponents – is certainly sufficient but there is still a debate on its necessity), and singularity of phase space density. Dynamical instability explains that two trajectories which are each the time-reverse of the other are both solutions of the equation of motion, but if one small piece of the first is dynamically stable, the corresponding time-reversed small piece of the other is unstable, and this eventually leads to very different “path densities” in phase space: trajectories producing entropy are much more probable than trajectories violating the 2nd principle. The time-reversal symmetry of microscopic equations of motion is somehow not satisfied by the evolution of density in phase space, thanks to the fact that such a density in a phase-space-contracting system tends to become singular (fractal), because of contraction. I am sure that an expert in dynamical systems could explain it much better; my expertise is the stochastic realm, which is not really interested in the issue for the reason explained by Matteo and John in their replies. Hopefully the reference above [and a nice review by Evans and Searles, Advances in Physics 51, 1529 (2002) ] can shed some light on the issue.

      • John Baez says:

        Okay, having fixed some typos (and deleted discussion of those typos), I can now try to understand what Matteo was actually saying. By the way, I’m using the obvious trick to keep the discussion from getting very skinny, which I encourage others to do too, at least for substantial comments.

        Matteo wrote:

        One more point which people often make, in connection with the Loschmidt paradox. For a random variable R(\omega) [...] to satisfy the integral fluctuation theorem

        \langle e^{-R}\rangle=1

        and therefore obey

        \langle R\rangle \ge 0,

        it is necessary that some paths \omega have

        R(\omega)<0

        Let me expand on this to make sure I understand it, and maybe help some other people understand it.

        Suppose a random variable R has the property that the mean of \exp(-R) is 1:

        \langle e^{-R}\rangle=1

        Since the logarithm function is concave, the logarithm of the mean is greater than or equal to the mean of the logarithm.

        So, logarithm of 1 is greater than or equal to the mean of the logarithm of \exp(-R).

        In other words, 0 is greater than or equal to the mean of -R.

        In other words, the mean of R is nonnegative:

        \langle R\rangle \ge 0

        On the other hand, R can’t be positive everywhere, or \exp(-R) would be less than 1 everywhere, which would imply

        \langle e^{-R}\rangle < 1

        contradicting our assumption.

        In short: if

        \langle e^{-R}\rangle =1

        then the mean of R is ≥ 0 but R can’t be > 0 everywhere.

        (This is almost what you said above, but it turns out that it’s really only necessary that some paths have R \le 0, not R < 0. Aren’t mathematicians annoying?)

        So:

        The integral fluctuation theorem simultaneously implies that entropy can’t decrease on average, and that it has a chance of not increasing.

        Neat!

        • Correct, but let’s not forget that R(\omega^{\dagger})=-R(\omega)! (As in the appendix, dagger denotes time-reversal.) So whenever R takes a given positive value, it actually also takes the opposite value for some other path! The point is that the probability of the latter is exponentially smaller than that of the former, according to the detailed fluctuation theorem.

  2. Arrow says:

    I am not a mathematician so I cannot appreciate the mathematical significance of all of this but from a general perspective it doesn’t seem very enlightening to me.

    It’s a bit like if we model a cow by a sphere and a moon by a sphere we can discover that the same equation for volume applies to them both.

    What I mean by this is if we only focus on the irreversibility of simple models of irreversible computation and irreversible evolution then we can find connections but those seem to be more due to those simplifications than the real similarity between the processes.

    For example in the real world fitness of almost every population in the history of Earth dropped to zero after some time. So in this context their evolution was not only reversible it was reversed completely.

    In fact I can state a theorem similar to the one about the positive fitness flux though with an opposite conclusion. If we start with some initial population with a certain initial fitness its fitness will decrease to zero in a finite time and that in general fitness decreases except for the rare periods when it doesn’t. This theorem has a very strong empirical support and unlike the “positive flux” one it applies to the real world instead of an idealized model of evolution.

    • Now, that’s not a comment I expected! Are you saying that species don’t generally evolve to improve their fitness? That the opposite has “very strong empirical support”? Are you making some subtle point about evolution here (that it’s not monotonic, that all species eventually go extinct, something along these lines), or are you plainly denying natural selection and evolution?

      • Arrow says:

        The point I’m making is that while evolution itself does improve fitness that certainly doesn’t mean that fitness in the real world keeps improving. On the contrary in most cases improvements due to evolution prove insufficient and species go extinct.

        If you look at the history of life on Earth almost all species that ever existed are now extinct. Their evolution failed to keep up with changes in natural environment and fitness of their populations reached zero. Only a very small fraction managed to survive in some form to this day.

        Homo sapiens is a good example, we know of many early branches of homo yet all but one of them are now extinct. Evolution of neanderthals failed, the fitness of their population reached zero and they are gone. Ours may very well do the same eventually.

        So while evolution as an abstract process may be irreversible, the gains in fitness due to evolution in the real world are very much reversible and in most (and possibly all) cases only temporary.

        • Oh well, I’m happy enough if I can understand that. The “real world” is not something we scientists can say much about, can we? So we do models, and understand aspects of it. And that’s good, right?

    • John Baez says:

      I think it’s worth pondering why Arrow’s observation doesn’t contradict the results Matteo presented… since that will help us go further.

      Matteo’s results are about time-translation-invariant Markov processes. So, they’re a good model of games where organisms randomly choose other organisms to compete against and randomly succeed (get to reproduce) or fail (don’t get to reproduce), with probabilities that depend on who is playing the game but don’t change with time.

      In such situations, the fittest organisms will, on average, take over.

      This is the idea behind the ‘fitness flux theorem’ presented here, as far as I can tell.

      However, in the real world the assumption of time translation invariance is only a good approximation in limited regimes. There are occasional ‘crises’, like ice ages and meteor impacts and colliding continents. When these occur, what used to be the fittest organisms are no longer the fittest! So, we have extinction events.

      We could model this by dropping the assumption of time translation invariance. That would be fun to try.

      Or, over very long time scales, we might model these occasional crises as occurring randomly, and fold them into the overall Markov process!

      In reality this Markov process would still not be time translation invariant. For example, asteroid impacts have gradually become less and less common over the last 3.5 billion years.

      But we could still learn something by looking at models where this Markov process is time translation invariant. What we’d see is that very rare random extinction events would make it very likely that a given sample path (‘history of the world’) saw fitness rise, crash, rise, crash, etc. But for the average over all these sample paths, fitness would rise!

      There’s a lot more to think about here…

      • I did present the fitness-flux theorem in the time-translation invariant case, but that was just for simplicity; you don’t need to assume that (the authors of the paper on the fitness-flux theorem do not, I think, and in the proof of the fluctuation theorem I don’t either). I think the point is more that the inequality \langle\Phi\rangle\geq-\Delta S does not in itself tell you that fitness must improve over time: that depends on the actual \Delta S. But when entropy does decrease, meaning that the population actually adapts to its environment, then it must be in the direction of positive fitness flux.

        • Graham says:

          I don’t follow that. It seems that if the environment changes, what counts as fit changes, so the definition of Phi changes.

          What happens in an oscillating environment, where the population keeps adapting, then adapting back?

        • Graham says:

          I just remembered this paper.

          http://www.sekj.org/PDF/anzf40/anzf40-185.pdf

          It has simulations of populations constantly adapting to a changing environment, and sometimes going extinct.

        • The fitness flux \Phi is defined as a sum over all transitions along the process. For each transition j, it compares the forward and backward transition rates *at time \tau_j*. You’re right that if the environmental conditions change along the process, what counts as fit also changes; what matters is the fitness variation at each transition. In other words, in a changing environment \Phi depends on the actual times at which the transitions took place: a transition that increases fitness at some time may not increase fitness at some other time. But did I get your question right, Graham?

  3. John Baez says:

    Matteo wrote:

    Surely the detailed fluctuation theorem for Markov processes has other applications, pertaining neither to thermodynamics nor adaptive evolution. Can you think of any?

    You may not consider these separate, but besides thermodynamics and biological evolution I like to think about two other examples: game theory and machine learning.

    I talked about evolutionary game theory and statistical mechanics in part 12 and part 13 of the information geometry series. I want to dramatically improve what I said there based on your post here.

    You can think of evolutionary game theory as being about biological evolution, where a mixed strategy is a mixture of genotypes and mixed strategies change as genotypes undergo natural selection. But you can also think of it as being about economic evolution, where players gradually change their mixed strategies in a deliberate attempt to optimize something. A lot of ideas from statistical mechanics still apply, but this also leads to new models that don’t make sense for biological evolution. A good very easy introduction is here:

    • William H. Sandholm, Evolutionary game theory, 12 November 2007.

    Sandholm mentions some other places where the same ideas show up:

    The birth of evolutionary game theory is marked by the publication of a series of papers by mathematical biologist John Maynard Smith. Maynard Smith adapted the methods of traditional game theory, which were created to model the behavior of rational economic agents, to the context of biological natural selection. He proposed his notion of an evolutionarily stable strategy (ESS) as a way of explaining the existence of ritualized animal conflict.

    Maynard Smith’s equilibrium concept was provided with an explicit dynamic foundation through a differential equation model introduced by Taylor and Jonker. Schuster and Sigmund, following Dawkins, dubbed this model the replicator dynamic, and recognized the close links between this game-theoretic dynamic and dynamics studied much earlier in population ecology and population genetics. By the 1980s, evolutionary game theory was a well-developed and firmly established modeling framework in biology.

    Towards the end of this period, economists realized the value of the evolutionary approach to game theory in social science contexts, both as a method of providing foundations for the equilibrium concepts of traditional game theory, and as a tool for selecting among equilibria in games that admit more than one. Especially in its early stages, work by economists in evolutionary game theory hewed closely to the interpretation set out by biologists, with the notion of ESS and the replicator dynamic understood as modeling natural selection in populations of agents genetically programmed to behave in specific ways. But it soon became clear that models of essentially the same form could be used to study the behavior of populations of active decision makers. Indeed, the two approaches sometimes lead to identical models: the replicator dynamic itself can be understood not only as a model of natural selection, but also as one of imitation of successful opponents.

    While the majority of work in evolutionary game theory has been undertaken by biologists and economists, closely related models have been applied to questions in a variety of fields, including transportation science, computer science, and sociology. Some paradigms from evolutionary game theory are close relatives of certain models from physics, and so have attracted the attention of workers in this field. All told, evolutionary game theory provides a common ground for workers from a wide range of disciplines.

    (I think he includes references, which I deleted here because it’s annoying to see things like [13,14] when you can’t see what they refer to.)

    He mentions computer science, but it’s really more specifically machine learning that tries to combine ideas from evolutionary biology and statistical mechanics to develop systems that optimize their predictive power. So, it would be nice to see what the detailed fluctuation theorem has to say in economics and machine learning.

  4. John Baez says:

    I was looking for a semi-popular book on life and nonequilibrium thermodynamics, and I discovered this book:

    • Eric D. Schneider and Dorion Sagan, Into the Cool: Energy Flow, Thermodynamics and Life, University of Chicago Press, Chicago, London, 2005.

    I read a review by Craig Callender of the philosophy department at U.C. San Diego. It mentions Crooks’ fluctuation theorem, so I thought I’d quote part of it. It’s slightly relevant to what we’re talking about:

    Supported by many similar examples, Schneider and Sagan elevate the idea that ‘nature abhors a gradient’ to the status of a law of nature. Parts I and II of the book deal with the physics. Time and again we see that physical systems take surprising turns in trying to compensate for being out of equilibrium. Parts III and IV deal with the life sciences. Some of the more surprising turns, the authors argue, are the origin of life, evolution, regularities in ecology, human health and even economics. The central mechanisms of each of these fields, the authors claim, follow from their general principle that nature seeks to reduce gradients. Chemistry, cells, life, and so on, are all attempts by matter to efficiently dissipate energy due to various gradients, e.g., the temperature gradient due to the sun. In short, the authors see Bénard cells everywhere they look.

    The problem with their hypothesis, like the problem with the Gaia hypothesis (the reader may recall), is that when left vague one sees it confirmed everywhere, but when provided with rigorous content, it seems false. Nowhere in the book is the main claim developed in any technical or even conceptual detail. Understood in full generality, however, it’s hard to imagine something happening that couldn’t be put in the form of a gradient reduction. Bénard convection is due to a temperature gradient, whirlpools to gradients in gravitational potential energy, the rise of new species to “underutilized gradients and habitats” (241), Taylor vortices and hurricanes to pressure gradients, and arbitrage in finance due to price gradients. What are the constraints on the theory? It seems the gradients don’t even have to be measured by a thermodynamic parameter.

    By contrast, if we sharpen the claim it’s probably false. By “non-equilibrium thermodynamics” [or "NET"] let’s agree to mean roughly the theory described in a book like Beyond Classical Thermodynamics by Hans Christian Öttinger. If understood as the assertion that the central features of these fields follow from NET, I don’t believe this has been established or even rendered plausible. No attempt has been made to (say) apply Onsager’s, Crooks or Jarzynski’s fluctuation theorems to the various fields or to apply the complicated physics of Prigogine to capital. The hypothesis seems particularly overblown when extended to systems whose variables aren’t even thermodynamic.

    What provide the hypothesis its air of plausibility are two related claims that are true. First, there are deep analogies between the various subjects, often expressed in a common mathematical structure. Treating traffic flow, gene flow, and monetary flow with the master’s equations originally designed for the statistical mechanics of gas molecules often has been successful. But the authors want to go further. The similarities are not merely analogies for them (see p.282 on economics). Second, in some cases biological and even economic systems are nonequilibrium thermodynamic systems. The systems are macroscopic and as such admit a thermodynamic description. From this fact one may infer many useful generalizations in the biological and economic sciences. Indeed, NET has enjoyed demonstrable success in understanding biological motors in the cell.

    • Thanks for the quote! I should have said somewhere that the “detailed fluctuation theorem” and “Crooks theorem” are essentially the same, just like the “integral fluctuation theorem” and the “Jarzynski identity”.

      Callender remarks that these have not been applied to life or economics. Not so anymore! The “fitness flux theorem” is an example of a non-physics application, and I claim there will be more. As I said in the post, the fluctuation theorem is really a universal mathematical property of Markov processes; it’s like the central limit theorem, if you wish: it’s so universal that it’s bound to appear everywhere.

      • By the way, it’s only almost always true that “Nature abhors a gradient”, as stated on the back cover of this book. In some cases (which of course do not spoil the relevance of the book’s main point!), dissipative dynamics can actually produce a gradient from no gradient! If you’re interested, I’ve discussed this surprising effect (which you can think of as a version of the famous “ratchet effect”) in [arXiv:1206.3441].

    • This summer I’ve read a nice book by J. Scales Avery, the title is “Information Theory and Evolution”, by World Scientific. It is semipopular and perhaps gets close to what John is looking for. The book describes in a very pleasant (and never trivial) way many phenomena (from biology at all size/time scales, to culture, technology, economy etc.) where the same picture emerges: order in a subsystem grows, while in the total system it cannot grow. However it is not very satisfying in the few paragraphs where (physics/mathematics) formalism is used: the observations are repeated in information language, but there are no hints on how to get a predictive theory.

      In summary, there is a powerful principle saying that in a closed system order cannot grow, nevertheless we are surrounded by open systems where order grows. And the only thing we are able to say is “well, there is no contradiction”…. This is frustrating.

    • Don Foster says:

      With regard to the notions of fittingness and gradients of one sort and another, I am curious about the origins of concerted action. If this comment is way off track, would you kindly ignore it or simply delete it.

      Consider that when you take a stroll in the woods, metric tons of raw matter around you are acting in concert, there is a high degree of fittingness on many meta-levels, everything is mutually finding its proper angle of repose and surprisingly, for a great deal of matter, that angle is nearly vertical. There is consensus, a congruous entwining of energy paths, and a multiplicity of ‘formal’ agreements that are both dynamic and enduring. You belong in this scene to the very level of complex DNA chemistry that is resonant within the world around you.

      Now, in an ideal gas at maximal entropy, I would expect each molecule to be traveling on its own unique trajectory, each with its own information theoretic distinction. There would be no concerted action, no consensus as to path. Does the trend toward increasing entropy and the final state thereof allow us to roughly characterize the nature of energy? That is, if there is utility in the notion ‘nature abhors a gradient’ is there also some utility in the notion that energy eschews pathways?

      If we tentatively accept that idea, then how is it possible to view the world about us as resulting from the evolving ‘chemistry’ and entwinement of energy pathways?

      ‘Into the Cool: Energy Flow, Thermodynamics and Life’ sounds very similar in scope to Howard T Odum’s book, ‘Environment, Power and Society’, published in 1971, both in search of general organizing principles.

      General organizing principles are useful and in that regard I am wondering if it is useful to identify some grand counterpoise to energy as being catalytic in the emergence of path. Pathways emerge on gradients, not at equilibrium. On a deep level, could path be viewed as emergent between countervailing gradients of distinction?

  5. arch1 says:

    Matteo, thanks for the great posting which I am struggling to understand. I was encouraged when the Kolmogorov equation looked intuitive based on my interpretation of the transition rates, which is roughly (my first use of latex, bear with me:-):

    \gamma_{ab}(t)dt = \textrm{pr\{system in state } a \textrm{ transitions to state } b \textrm{ in } [t,t+dt]\}

    However, when I read your description of them more carefully I became puzzled. If the transition rates constitute a matrix in which the off-diagonal entries are non-negative and the column sums are zero, that implies the diagonal entries are negative, which implies that the diagonal entries (at least) are not everyday probabilities.

    I followed up on the “infinitesimal stochastic matrix” hotlink you provided, but the discussion there assumes physics background (Hamiltonians etc.) which I don’t have.

    Is there a (hopefully simple:-) math-oriented description of these transition rates somewhere? (I’m encouraged by your section title “The mathematical fact” to believe that such a description is at least possible:-)

    Thanks much!

    • Those gamma’s are not probabilities, but “transition rates”: they tell you how much the probability that the system is in a given state changes over time. The diagonal elements of an “infinitesimal stochastic matrix” \gamma_{aa}=-\sum_{b\neq a}\gamma_{ab}, in particular, are negative because the probability that the state remains a can only decrease over time: at some point or another, the system will jump to another state b\neq a.
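To make the distinction between rates and probabilities concrete, here is a minimal sketch in plain Python. The three states and the rate values are made up for illustration, and the Euler time-stepping is just one simple way to integrate the master equation, not something taken from the post. It fills in the diagonal as \gamma_{aa}=-\sum_{b\neq a}\gamma_{ab} and checks that, although the diagonal entries are negative, the probabilities stay non-negative and keep summing to 1.

```python
# Hypothetical 3-state example; the rate values are made up for illustration.
# rates[a][b] is the transition rate gamma_ab from state a to state b (a != b).
rates = [
    [0.0, 2.0, 1.0],
    [0.5, 0.0, 0.5],
    [1.0, 3.0, 0.0],
]
n = len(rates)

# Fill in the diagonal: gamma_aa = -sum_{b != a} gamma_ab, so the rates out of
# each state a, together with the diagonal entry, sum to zero.
gamma = [row[:] for row in rates]
for a in range(n):
    gamma[a][a] = -sum(rates[a][b] for b in range(n) if b != a)

def step(p, dt):
    """One Euler step of the master equation dp_b/dt = sum_a gamma_ab p_a."""
    return [p[b] + dt * sum(gamma[a][b] * p[a] for a in range(n)) for b in range(n)]

p = [1.0, 0.0, 0.0]          # start in state 0 with certainty
for _ in range(10_000):      # integrate up to t = 10 with dt = 0.001
    p = step(p, dt=1e-3)

print("diagonal entries:", [gamma[a][a] for a in range(n)])
print("p(t=10):", [round(x, 4) for x in p])
print("total probability:", round(sum(p), 10))
```

Whether the index a is laid out along rows or columns is only a storage choice in this sketch; what matters is that the sum over the destination state b vanishes for each starting state a.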

  6. arch1 says:

    Trying the latex part again w/ space before trailing dollar sign..

    \gamma_a_b(t)dt = pr{system in state a transitions to state b in [t..t+dt]}

    And once more with just the LHS..

    \gamma_a_b(t)dt

    • John Baez says:

      Your problem never had anything to do with the use of

      $latex $

      That was always fine. You don’t need a space before the trailing dollar sign, though you do need one after “latex”. The problem is that you’re writing:

      \gamma_a_b

      which LaTeX can’t parse. It can’t tell whether you want b to be a subscript of a, like this:

      \gamma_{a_b}

      which gives

      \gamma_{a_b}

      or what you really want, which is to have b be a subscript of \gamma, like this:

      \gamma_{ab}

      which gives

      \gamma_{ab}

      • arch1 says:

        Thanks John! I should have realized that my syntax was unlikely to be correct.

        It didn’t help that http://www.texify.com/links.php managed to interpret it as I had intended (almost – I now see that it places the b very slightly higher(!) than the a).

        Does anyone have a pointer to an online LaTeX interpreter which is decently compatible with the one used here?

  7. Bruce Smith says:

    Just reporting some minor typos; feel free to delete this when they’re addressed:

    1. you say, near the top of the main post

    … by the times \tau_{j} at which the transitions a_{j-1}\longrightarrow a_{j} occur …

    but the following formula shows such transitions labelled with j-1 rather than j .

    2. Some of the most recent comments above this one contain an error message “formula does not parse”.

  8. An interesting/related paper was recently published in PRL, “Thermodynamics of Prediction” by S. Still, with Crooks as a coauthor, here is the arXiv link: http://arxiv.org/abs/1203.3271

    I’ll try to explain it a bit to encourage you to read it.

    Related to Jarzynski’s work and the fluctuation theorem is the idea of measuring equilibrium quantities by forcing a system out of equilibrium and observing the response, see: http://www.physics.berkeley.edu/research/liphardt/pdfs/JarzynskiTest.pdf

    In Still’s paper they observe that much of the literature assumes that the driving signal is given/known explicitly, while in nature and biological systems this is most often not the case, hence they study stochastic driving signals.

    The idea is that implicit in the dynamics of such a forced system is a model of its (stochastic) environment. They then ask how the quality of this model is related to thermodynamic efficiency.
    How do you measure the quality of a model?
    For them a good model should have predictive power and not be overly complicated. This comes down to a balance within a system’s memory: you break it into two parts, a useful, predictive part and the other “useless nostalgia”. It is this latter part that is related to dissipation, so having less nostalgia means less dissipation, which means better thermodynamic efficiency.

    Here is a whirlwind tour of how the paper makes this idea precise, taking many quotes straight from the paper:
    “the dynamics of system are modeled by discrete time Markovian conditional state-to-state transition probabilities”
    For the driving signal they assume only that its changes are governed by some probability density.
    Their system is in contact with a heat bath and starts in equilibrium. It then goes through a bunch of steps where the environment forces it out of equilibrium and then it relaxes. The environment does work on the system in each driving step and heat flows at each relaxation step.
    If you let it relax all the way to equilibrium any additional free energy gets dissipated as heat back to the environment. The additional free energy is given by the Kullback-Leibler divergence, or the relative entropy between the current state/distribution and the equilibrium one. The change in non-equilibrium free energy is the sum of the change in equilibrium free energy and this additional piece: \Delta F_{neq} = \Delta F_{eq} + F_{add}
    Dissipation work is given by the difference between work done on the system and the non-equilibrium change in free energy: W_{diss} = W - \Delta F_{neq}
    Excess work is given by the difference in work done on the system and the equilibrium free-energy (the work done for the quasistatic case): W_{ex} = W - \Delta F_{eq}
    The excess work minus the dissipation work gives you, W_{ex} - W_{diss} = \Delta F_{neq} - \Delta F_{eq} = F_{add} , which is precisely the additional free energy (the KL divergence of the current distribution relative to the equilibrium one).
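The bookkeeping in the last few lines can be checked with a few numbers. The following is only a sketch: the two distributions, the work W and the equilibrium free energy change are hypothetical values, and energies are assumed to be measured in units of k_B T so that the additional free energy is just the Kullback-Leibler divergence.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

kT = 1.0                       # assumption: energies in units of k_B T
p_eq = [0.5, 0.3, 0.2]         # hypothetical equilibrium distribution
p_now = [0.7, 0.2, 0.1]        # hypothetical current (non-equilibrium) distribution

F_add = kT * kl_divergence(p_now, p_eq)   # additional free energy

W = 1.5                        # hypothetical work done on the system
dF_eq = 0.4                    # hypothetical change in equilibrium free energy
dF_neq = dF_eq + F_add         # Delta F_neq = Delta F_eq + F_add

W_diss = W - dF_neq            # dissipated work
W_ex = W - dF_eq               # excess work

print("F_add         =", F_add)
print("W_ex - W_diss =", W_ex - W_diss)
```

Since the relative entropy is never negative, this also shows that the excess work is always at least as large as the dissipated work.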

    Now for the information/prediction half, which I understand even less!
    They look at Shannon’s (symmetric) mutual information of the system state and the external driving signal, both at a certain time t, this is called the system’s ‘instantaneous memory’, I[s_{t},x_{t}] where s is for system and x is for external signal. The ‘instantaneous predictive power’ is, I[s_{t}, x_{t+1}] , or the mutual information of a state at time t and the driving signal at t+1. The difference of the two is the ‘instantaneous nonpredictive information;’ “it represents useless nostalgia and provides a measure for the ineffectiveness of the implicit model.” (So memory-power=information, kidding)
    The paper then shows that this instantaneous nostalgia is proportional to the average work dissipated as t goes to t+1.
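As a toy illustration of these definitions (not the model used in the paper), suppose the driving signal is a binary variable that flips with some probability at each step, and the system state is a noisy copy of the current signal. The flip and noise probabilities below are hypothetical. The sketch computes the instantaneous memory I[s_{t},x_{t}], the predictive power I[s_{t},x_{t+1}] and their difference, the nostalgia.

```python
import math
from itertools import product

def mutual_information(joint):
    """I[A;B] in nats from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

flip = 0.2    # hypothetical: probability the signal changes between t and t+1
noise = 0.1   # hypothetical: probability the system state misreads the signal

# Joint distribution of (s_t, x_t, x_{t+1}) for binary variables.
joint3 = {}
for s, x, x_next in product((0, 1), repeat=3):
    p_x = 0.5
    p_s = 1 - noise if s == x else noise
    p_next = 1 - flip if x_next == x else flip
    joint3[(s, x, x_next)] = p_x * p_s * p_next

# Marginalise to the two pairs that enter the definitions.
memory_joint, predictive_joint = {}, {}
for (s, x, x_next), p in joint3.items():
    memory_joint[(s, x)] = memory_joint.get((s, x), 0.0) + p
    predictive_joint[(s, x_next)] = predictive_joint.get((s, x_next), 0.0) + p

I_mem = mutual_information(memory_joint)        # I[s_t, x_t]
I_pred = mutual_information(predictive_joint)   # I[s_t, x_{t+1}]
nostalgia = I_mem - I_pred

print("memory:", I_mem, "predictive power:", I_pred, "nostalgia:", nostalgia)
```

In this toy setup the system only knows about the future signal through the present one, so the predictive power cannot exceed the memory and the nostalgia is non-negative.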

    They derive a lower bound on the total dissipation and use it to refine Landauer’s principle. They then discuss this in relation to biological systems, where the systems have adapted to their environment’s forcing, asking if minimizing nostalgia is a factor driving things towards energetic efficiency.

    This paper is more about thermodynamics and prediction (as the title suggests) than about reversibility. I don’t understand it yet, but there are hints of connections here, not only among biology, chemistry and nature, but also information and complexity (algorithmic information theory). I guess that’s why there is all the talk of Kolmogorov above. I’m new at this, and I’m sure many of you reading this have a better big picture! It seems though that in Still’s paper, a good model is about balancing complexity with predictability, and that this is done by not having any nostalgia, which doesn’t sound practical for us nostalgic humans! I would like to learn more though about mutual information in biological systems, and how some systems we understand a little bit have some ‘memory’ built in, either about their environment, or even their own dynamics.

    So read the paper, it does a better job explaining itself!

    • John Baez says:

      I’ve really got to read this… thanks for the tantalizing summary!

    • Don Foster says:

      I wonder if the dynamic of evolution could be modeled via an audio analogue. That is, the environmental forcing could be seen as being produced by a complex audio waveform and the genetic population modeled as a resonant surface, a complex Chladni plate.
      Would the resonance of that surface produce a complementary change in the driving signal? This seems to be the case in biological systems.

  9. [...] The Mathematical Origin of Irreversibility (johncarlosbaez.wordpress.com) [...]

  10. Jon Rowlands says:

    The properties of the Radon-Nikodym derivative are invoked all through the proof, and understanding the end result really seems to come down to understanding this object. Can you give any intuition about it?

    • John Baez says:

      If you have a measure d \mu you can get another measure by multiplying it by a function f:

      d \nu = f  d\mu

      Conversely, under some conditions you can figure out f knowing the measures d \mu and d \nu. This trick is called the Radon-Nikodym derivative because if you just follow your nose it looks like a derivative

      \displaystyle{ \frac{d \nu}{d \mu} = f }

      It’s really more like division: the measure d \nu divided by the measure d \mu is the function f.

      If this is too abstract for you, imagine d x is the usual thing that shows up in integrals and define

      d \mu(x) = \alpha(x)  \, d x

      d \nu(x) = \beta(x)  \, d x

      for functions \alpha and \beta. Then the Radon-Nikodym derivative at the point x is

      \displaystyle{ \frac{d \nu(x)}{d \mu(x)} = \frac{\beta(x)}{\alpha(x)} }

      whenever this exists, that is, whenever you’re not dividing by zero.

      I’m stating everything in a way that leaves out the technical fine print that’s needed for an actual theorem. For that, try Wikipedia. But you said you wanted intuition, and the intuition is a lot simpler than the Wikipedia article makes it seem.
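The "division" picture can be tested numerically. In this sketch the two densities are Gaussians chosen only as an example, the test function and the integration routine are likewise assumptions, and f is the function satisfying d \nu = f d\mu, namely \beta(x)/\alpha(x) when d \mu = \alpha \, dx and d \nu = \beta \, dx. Integrating any function against d \nu should give the same answer as integrating it times f against d \mu.

```python
import math

def alpha(x):
    """Density of mu with respect to dx: a standard Gaussian (assumed example)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def beta(x):
    """Density of nu with respect to dx: a Gaussian with mean 1 (assumed example)."""
    return math.exp(-(x - 1) ** 2 / 2) / math.sqrt(2 * math.pi)

def f(x):
    """Radon-Nikodym derivative d(nu)/d(mu), defined wherever alpha(x) != 0."""
    return beta(x) / alpha(x)

def integrate(h, lo=-10.0, hi=10.0, n=20_000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    dx = (hi - lo) / n
    return sum(h(lo + (i + 0.5) * dx) for i in range(n)) * dx

g = math.cos   # any reasonable test function will do

lhs = integrate(lambda x: g(x) * beta(x))           # integral of g d(nu)
rhs = integrate(lambda x: g(x) * f(x) * alpha(x))   # integral of g f d(mu)

print("integral of g d(nu)   =", lhs)
print("integral of g f d(mu) =", rhs)
```

For these two Gaussians the ratio simplifies to f(x) = e^{x - 1/2}, which is a handy sanity check on the code.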

  11. Hi there

    This has been on my “to read” list for ages and I finally got to it. But I’m having trouble seeing how the “immediate consequence”

    \displaystyle{ \langle e^{-\Delta i + \Sigma} \rangle = 1 }

    follows from

    \displaystyle{\frac{p(\Delta i - \Sigma = -A)}{p(\Delta i - \Sigma = A)} = e^{-A}}

    Is there any chance you could sketch out the derivation a little, or just point me towards a reference?

    I tried searching for “integral fluctuation theorem”, and found this, which seems to derive something similar, but it’s written in very different notation and it’s hard to connect it to what you’ve written here.

    • John Baez says:

      I don’t know if Matteo is listening, but I’ll think about this while I’m proctoring a 3-hour real analysis final today.

      • Thanks, I really appreciate that!

      • John Baez says:

        So, we have a random variable X and we’re trying to show

        \displaystyle{ \frac{p(X = -A)}{p(X = A)} = e^{-A} \;  \mathrm{ if } \; A \ge 0 }

        implies

        \langle e^X \rangle = 1

        For starters, let’s try an example. Suppose X can only equal 1 or -1. Say the probability it equals 1 is p. Then this probability is e times the probability that X equals -1, so

        e(1 - p) = p

        so

        \displaystyle{  p = \frac{e}{1+e}, \qquad (1-p) = \frac{1}{1+e} }

        So, the mean value of e^X is

        p e^1 + (1-p) e^{-1} = \frac{e^2}{1+e} + \frac{e^{-1}}{1+e} =  \frac{e^2 + e^{-1}}{e+1} \simeq 2.09

        So it’s not 1. So either I’m making a dumb mistake or the claim needs to be fixed somehow. I think I can show that in general

        \displaystyle{ \frac{p(X = -A)}{p(X = A)} = e^{-A} \; \mathrm{ if } \; A \ge 0 }

        implies

        \langle e^X \rangle \ge 1

        or more generally, if we have a probability density function p(x):

        \displaystyle{ p(x) \ge 0 ,\qquad \int_{-\infty}^\infty p(x) \, dx = 1 }

        and

        \displaystyle{ \frac{p(-x)}{p(x)} = e^{-x} \; \textrm{ if } \; x \ge 0 }

        then

        \displaystyle{  \int_{-\infty}^\infty e^x p(x) \, dx \ge 1 }

        But I don’t know if this is the right way to fix the claim: replace the equation with an inequality. I have a feeling that changing things a bit could give us an equation.
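One possible reading, offered as a sketch rather than as the commenters' own resolution: the relation quoted from the post has a minus sign in the exponent, \langle e^{-\Delta i + \Sigma} \rangle = 1, which with X = \Delta i - \Sigma is the mean of e^{-X}, not of e^{X}. The two-point example above can be used to compare the two averages numerically.

```python
import math

# Two-point example from the comment: X = +1 with probability p, X = -1
# otherwise, with p(X = -1) / p(X = +1) = e^{-1}.
p = math.e / (1 + math.e)
q = 1 - p

mean_exp_plus = p * math.exp(1) + q * math.exp(-1)    # <e^{X}>
mean_exp_minus = p * math.exp(-1) + q * math.exp(1)   # <e^{-X}>

print("ratio check:", q / p, "vs", math.exp(-1))
print("<e^X>  =", mean_exp_plus)
print("<e^-X> =", mean_exp_minus)
```

The same thing seems to happen for a general density. If p(-x) = e^{-x} p(x) holds for x \ge 0, then substituting x \to -x shows it holds for x < 0 as well, so the integral of e^{-x} p(x) equals the integral of p(-x), which is 1.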

  12. Matt Kuenzel says:

    Would this be a reasonable picture of the idea:

    Start with a row of cups each containing some amount of liquid.

    Let an exchange be the following: randomly select a source cup and a destination cup and some quantity less than the amount in the source cup. Move the quantity between the two cups.

    Let a series of exchanges be a mixing.

    The rate of mixing intuitively is expressed by the skewness. (If the skewness were zero all the amounts would be constant.)

    The result can be stated this way: the information difference between the starting and ending states is always non-negative and a function of the mixing rate.

    Also, the so-called “Data Processing Inequality” seems relevant:
    Consider a Markov chain X -> Y -> Z with Z = f(Y). Then I(X, Z) is less than or equal to I(X, Y). [I is mutual information.]

    Lastly, there are questions in the comments regarding the time-invariance of the transition matrix and how that limits the result. Would it be possible to define a “larger” transition matrix where each state S would be replaced by a series of states (S, t) for each time t? The purpose being to construct a constant transition matrix and thereby applying the theorem to seemingly time-variant processes?
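The construction in the last paragraph can be sketched for a discrete-time chain, which is a simplification of the continuous-time setting of the post. The two-state matrices below are hypothetical. Each state s is replaced by pairs (s, t), and a single constant matrix moves (s, t) to (s', t+1) with the probability that the time-dependent chain would use at step t.

```python
# Hypothetical two-state chain whose transition probabilities change with time.
# P_t[t][s][s2] = probability of moving from state s to state s2 at time step t.
P_t = [
    [[0.9, 0.1], [0.2, 0.8]],   # step 0 -> 1
    [[0.5, 0.5], [0.5, 0.5]],   # step 1 -> 2
    [[0.1, 0.9], [0.7, 0.3]],   # step 2 -> 3
]
T = len(P_t)
n = len(P_t[0])

# Augmented states are pairs (s, t), flattened to the index t * n + s.
size = n * (T + 1)
big = [[0.0] * size for _ in range(size)]
for t in range(T):
    for s in range(n):
        for s2 in range(n):
            big[t * n + s][(t + 1) * n + s2] = P_t[t][s][s2]
# States at the final time are made absorbing so every row still sums to 1.
for s in range(n):
    big[T * n + s][T * n + s] = 1.0

def push(dist, matrix):
    """Push a row vector of probabilities one step through a transition matrix."""
    m = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(m))
            for j in range(len(matrix[0]))]

# Evolve with the time-dependent matrices ...
p = [1.0, 0.0]
for t in range(T):
    p = push(p, P_t[t])

# ... and with the single constant matrix on the augmented space.
q = [0.0] * size
q[0] = 1.0                      # state 0 at time 0
for _ in range(T):
    q = push(q, big)

print("time-dependent chain:", p)
print("augmented chain:     ", q[T * n:])
```

Both computations give the same final distribution, so the bookkeeping works. Whether the theorem says anything useful about the augmented chain, where time can only move forward, is a separate question that this sketch does not address.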
