Mathematics of the Environment (Part 7)

13 November, 2012

Last time we saw how the ice albedo effect could, in theory, make the Earth’s climate have two stable states. In a very simple model, we saw that a hot Earth might stay hot since it’s dark and absorbs lots of sunlight, while a cold Earth might stay cold—since it’s icy and white and reflects most of the sunlight.

If you haven’t tried it yet, make sure to play around with this program pioneered by Lesley de Cruz and then implemented in Java by Allan Erskine:

Temperature dynamics.

The explanations were all given in Part 6 so I won’t repeat them here!

This week, we’ll see how noise affects this simple climate model. We’ll borrow lots of material from here:

Glyn Adgie and Tim van Beek, Increasing the signal-to-noise ratio with more noise.

And we’ll use software written by these guys together with Allan Erskine. The power of the Azimuth Project knows no bounds!

Stochastic differential equations

The Milankovich cycles are periodic changes in how the Earth orbits the Sun. One question is: can these changes be responsible for the ice ages? At first sight it seems impossible, because the changes are simply too small. But it turns out that we can find a dynamical system where a small periodic external force is actually strengthened by random ‘noise’ in the system. This phenomenon has been dubbed ‘stochastic resonance’ and has been proposed as an explanation for the ice ages. It also shows up in many other phenomena:

• Roberto Benzi, Stochastic resonance: from climate to biology.

But to understand it, we need to think a little about stochastic differential equations.

A lot of systems can be described by ordinary differential equations:

\displaystyle{ \frac{d x}{d t} = f(x, t)}

If f is nice, the time evolution of the system will be a nice smooth function x(t), like the trajectory of a thrown stone. But there are situations where we have some kind of noise, a chaotic, fluctuating influence, that we would like to take into account. This could be, for example, turbulence in the air flow around a rocket. Or, in our case, short-term fluctuations of the Earth’s weather. If we take this into account, we get an equation of the form

\displaystyle{ \frac{d x}{d t} = f(x, t) + w(t) }

where w is a ‘random function’ which models the noise. Typically this noise is just a simplified way to take into account rapidly changing fine-grained aspects of the system at hand. This way we do not have to explicitly model these aspects, which is often impossible.

White noise

We’ll look at a model of this sort:

\displaystyle{ \frac{d x}{d t} = f(x, t) + w(t) }

where w(t) is ‘white noise’. But what’s that?

Very naively speaking, white noise is a random function that typically looks very wild, like this:

White noise actually has a sound, too: it sounds like this! The idea is that you can take a random function like the one graphed above, and use it to drive the speakers of your computer, to produce sound waves of an equally wild sort. And it sounds like static.

However, all this is naive. Why? The concept of a ‘random function’ is not terribly hard to define, at least if you’ve taken the usual year-long course on real analysis that we force on our grad students at U.C. Riverside: it’s a probability measure on some space of functions. But white noise is a bit too spiky and singular to be a random function: it’s a random distribution.

Distributions were envisioned by Dirac but formalized later by Laurent Schwartz and others. A distribution D(t) is a bit like a function, but often distributions are too nasty to have well-defined values at points! Instead, all that makes sense are expressions like this:

\displaystyle{ \int_{-\infty}^\infty D(t) f(t) \, d t}

where we multiply the distribution by any compactly supported smooth function f(t) and then integrate it. Indeed, we specify a distribution just by saying what all these integrals equal. For example, the Dirac delta \delta(t) is a distribution defined by

\displaystyle{ \int_{-\infty}^\infty \delta(t) f(t) \, d t = f(0) }

If you try to imagine the Dirac delta as a function, you run into a paradox: it should be zero everywhere except at t = 0, but its integral should equal 1. So, if you try to graph it, the region under the graph should be an infinitely skinny infinitely tall spike whose area is 1:

But that’s impossible, at least in the standard framework of mathematics—so such a function does not really exist!

Similarly, white noise w(t) is too spiky to be an honest random function, but if we multiply it by any compactly supported smooth function f(t) and integrate, we get a random variable

\displaystyle{ \int_{-\infty}^\infty w(t) f(t) \, d t }

whose probability distribution is a Gaussian with mean zero and standard deviation equal to

\displaystyle{ \sqrt{ \int_{-\infty}^\infty f(t)^2 \, d t} }

(Note: the word ‘distribution’ has a completely different meaning when it shows up in the phrase probability distribution! I’m assuming you’re comfortable with that meaning already.)

Indeed, the above formulas make sense and are true, not just when f is compactly supported and smooth, but whenever

\displaystyle{ \sqrt{ \int_{-\infty}^\infty f(t)^2 \, d t} } < \infty

If you know about Gaussians and you know about this sort of integral, which shows up all over math and physics, you’ll realize that white noise is an extremely natural concept!
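
If you like, you can check this numerically. Here’s a minimal Python sketch: the test function f is an arbitrary choice, and we discretize white noise the way we will below, as independent Gaussian increments with standard deviation \sqrt{\Delta t}:

```python
import numpy as np

rng = np.random.default_rng(0)

dt = 1e-3
t = np.arange(0.0, 10.0, dt)
f = np.exp(-(t - 5.0)**2)     # a smooth test function, effectively compactly supported

# Approximate the integral of w(t) f(t) dt by sum_i f(t_i) dW_i, where the
# increments dW_i are independent Gaussians with mean 0 and std sqrt(dt).
samples = np.array([np.sum(f * rng.normal(0.0, np.sqrt(dt), size=t.size))
                    for _ in range(5000)])

print(samples.std())                   # empirical standard deviation
print(np.sqrt(np.sum(f**2) * dt))      # sqrt of the integral of f^2: should match
```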

Brownian motion

While white noise is not a random function, its integral

\displaystyle{ W(s) = \int_0^s w(t) \, dt }

turns out to be a well-defined random function. It has a lot of names, including: the Wiener process, Brownian motion, red noise and—as a kind of off-color joke—brown noise.

The capital W here stands for Wiener: that is, Norbert Wiener, the famously absent-minded MIT professor who studied this random function… and also invented cybernetics:

Brownian noise sounds like this.

Puzzle: How does it sound different from white noise, and why?

It looks like this:

Here we are zooming in closer and closer, while rescaling the vertical axis as well. We see that Brownian noise is self-similar: if W(t) is Brownian noise, so is W(c t)/\sqrt{c} for all c > 0.

More importantly for us, Brownian noise is the solution of this differential equation:

\displaystyle{ \frac{d W}{d t} = w(t) }

where w(t) is white noise. This is essentially true by definition, but making it rigorous takes some work.

Fancier stochastic differential equations

\displaystyle{ \frac{d x}{d t} = f(x, t) + w(t) }

take even more work to rigorously formulate and solve. You can read about them here:

Stochastic differential equation, Azimuth Library.

It’s actually much easier to explain the difference equations we use to approximately solve these stochastic differential equations on the computer. Suppose we discretize time into steps like this:

t_i = i \Delta t

where \Delta t > 0 is some fixed number, our ‘time step’. Then we can define

x(t_{i+1}) = x(t_i) + f(x(t_i), t_i) \; \Delta t + w_i

where w_i are independent Gaussian random variables with mean zero and standard deviation

\sqrt{ \Delta t}

The square root in this formula comes from the definition I gave of white noise.

If we use a random number generator to crank out random numbers w_i distributed in this way, we can write a program to work out the numbers x(t_i) if we are given some initial value x(0). And if the time step \Delta t is small enough, we can hope to get a ‘good approximation to the true solution’. Of course defining what we mean by a ‘good approximation’ is tricky here… but I think it’s more important to just plunge in and see what happens, to get a feel for what’s going on here.
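
Here is a minimal sketch of this scheme (the Euler–Maruyama method) in Python, assuming NumPy. The drift f(x,t) = -x in the usage example at the end is just an illustrative choice, giving an Ornstein–Uhlenbeck process; it’s not anything from the Azimuth code:

```python
import numpy as np

def euler_maruyama(f, x0, dt, n_steps, rng):
    """Approximately solve dx/dt = f(x,t) + w(t), with w(t) white noise."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        w_i = rng.normal(0.0, np.sqrt(dt))   # mean 0, standard deviation sqrt(dt)
        x[i + 1] = x[i] + f(x[i], i * dt) * dt + w_i
    return x

rng = np.random.default_rng(0)
path = euler_maruyama(lambda x, t: -x, x0=1.0, dt=0.01, n_steps=10_000, rng=rng)
```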

Stochastic resonance

Let’s do an example of this equation:

\displaystyle{ \frac{d x}{d t} = f(x, t) + w(t) }

which exhibits ‘stochastic resonance’. Namely:

\displaystyle{ \frac{d x}{d t} = x(t) - x(t)^3 + A \;  \sin(t)  + \sqrt{2D} w(t) }

Here A and D are constants we get to choose. If we set them both to zero we get:

\displaystyle{ \frac{d x}{d t} = x(t) - x(t)^3 }

This has stable equilibrium solutions at x = \pm 1 and an unstable equilibrium in between at x = 0. So, this is a bistable model similar to the one we’ve been studying, but mathematically simpler!

Then we can add an oscillating time-dependent term:

\displaystyle{ \frac{d x}{d t} = x(t) - x(t)^3 + A \;  \sin(t) }

which wiggles the system back and forth. This can make it jump from one equilibrium to another.

And then we can add on noise:

\displaystyle{ \frac{d x}{d t} = x(t) - x(t)^3 + A \;  \sin(t)  + \sqrt{2D} w(t) }

Let’s see what the solutions look like!

In the following graphs, the green curve is A \sin(t), while the red curve is x(t). Here is a simulation with a low level of noise:

low noise level

As you can see, within the time of the simulation there is no transition from the stable state at 1 to the one at -1. If we were doing a climate model, this would be like the Earth staying in the warm state.

Here is a simulation with a high noise level:

high noise level

The solution x(t) jumps around wildly. Just by inspecting the graph, you don’t see any pattern in it, do you?

But finally, here is a simulation where the noise level is not too small, and not too big:

medium noise level

Here we see the noise helping the solution x(t) hop from around x = -1 to x = 1, or vice versa. The solution x(t) is not at all ‘periodic’—it’s quite random. But still, it tends to hop back and forth thanks to the combination of the sinusoidal term A \sin(t) and the noise.

Glyn Adgie, Allan Erskine, Jim Stuttard and Tim van Beek have created an online model that lets you solve this stochastic differential equation for different values of A and D. You can try it here:

Stochastic resonance example.

You can change the values using the sliders under the graphic and see what happens. You can also choose different ‘random seeds’, which means that the random numbers used in the simulation will be different.
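
If you’d rather tinker offline, here’s a rough Python sketch of the same model, using the discretization from the last section. This is not the code behind the online demo, and the particular values of A, D and the time step are just starting points to play with:

```python
import numpy as np

rng = np.random.default_rng(42)      # plays the role of the 'random seed' choice

A, D = 0.3, 0.2                      # forcing amplitude and noise strength
dt, T = 0.01, 200.0
n = int(T / dt)

x = np.empty(n + 1)
x[0] = 1.0                           # start in the stable equilibrium at x = 1
for i in range(n):
    t = i * dt
    w = rng.normal(0.0, np.sqrt(dt))
    x[i + 1] = x[i] + (x[i] - x[i]**3 + A * np.sin(t)) * dt + np.sqrt(2 * D) * w

# Plot x against A*sin(t) to look for noise-assisted hopping between -1 and 1.
```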

To read more about stochastic resonance, go here:

Stochastic resonance, Azimuth Library.

In future weeks I hope to say more about the actual evidence that stochastic resonance plays a role in our glacial cycles! It would also be great to go back to our climate model from last time and add noise. We’re working on that.


The Mathematical Origin of Irreversibility

8 October, 2012

guest post by Matteo Smerlak

Introduction

Thermodynamical dissipation and adaptive evolution are two faces of the same Markovian coin!

Consider this. The Second Law of Thermodynamics states that the entropy of an isolated thermodynamic system can never decrease; Landauer’s principle maintains that the erasure of information inevitably causes dissipation; Fisher’s fundamental theorem of natural selection asserts that any fitness difference within a population leads to adaptation in an evolution process governed by natural selection. Diverse as they are, these statements have two common characteristics:

1. they express the irreversibility of certain natural phenomena, and

2. the dynamical processes underlying these phenomena involve an element of randomness.

Doesn’t this suggest to you the following question: Could it be that thermal phenomena, forgetful information processing and adaptive evolution are governed by the same stochastic mechanism?

The answer is—yes! The key to this rather profound connection resides in a universal property of Markov processes discovered recently in the context of non-equilibrium statistical mechanics, and known as the ‘fluctuation theorem’. Typically stated in terms of ‘dissipated work’ or ‘entropy production’, this result can be seen as an extension of the Second Law of Thermodynamics to small systems, where thermal fluctuations cannot be neglected. But it is actually much more than this: it is the mathematical underpinning of irreversibility itself, be it thermodynamical, evolutionary, or else. To make this point clear, let me start by giving a general formulation of the fluctuation theorem that makes no reference to physics concepts such as ‘heat’ or ‘work’.

The mathematical fact

Consider a system randomly jumping between states a, b,\dots with (possibly time-dependent) transition rates \gamma_{a b}(t) where a is the state prior to the jump, while b is the state after the jump. I’ll assume that this dynamics defines a (continuous-time) Markov process, namely that the numbers \gamma_{a b} are the matrix entries of an infinitesimal stochastic matrix, which means that its off-diagonal entries are non-negative and that its columns sum up to zero.

Now, each possible history \omega=(\omega_t)_{0\leq t\leq T} of this process can be characterized by the sequence of occupied states a_{j} and by the times \tau_{j} at which the transitions a_{j-1}\longrightarrow a_{j} occur (1\leq j\leq N):

\omega=(\omega_{0}=a_{0}\overset{\tau_{1}}{\longrightarrow} a_{1} \overset{\tau_{2}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T}).

Define the skewness \sigma_{j}(\tau_{j}) of each of these transitions to be the logarithmic ratio of transition rates:

\displaystyle{\sigma_{j}(\tau_{j}):=\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}

Also define the self-information of the system in state a at time t by:

i_a(t):= -\ln\pi_{a}(t)

where \pi_{a}(t) is the probability that the system is in state a at time t, given some prescribed initial distribution \pi_{a}(0). This quantity is also sometimes called the surprisal, as it measures the ‘surprise’ of finding out that the system is in state a at time t.

Then the following identity—the detailed fluctuation theorem—holds:

\mathrm{Prob}[\Delta i-\Sigma=-A] = e^{-A}\;\mathrm{Prob}[\Delta i-\Sigma=A] \;

where

\displaystyle{\Sigma:=\sum_{j}\sigma_{j}(\tau_{j})}

is the cumulative skewness along a trajectory of the system, and

\Delta i= i_{a_N}(T)-i_{a_0}(0)

is the variation of self-information between the end points of this trajectory.

This identity has an immediate consequence: if \langle\,\cdot\,\rangle denotes the average over all realizations of the process, then we have the integral fluctuation theorem:

\langle e^{-\Delta i+\Sigma}\rangle=1,

which, by the convexity of the exponential and Jensen’s inequality, implies:

\langle \Delta i\rangle=\Delta S\geq\langle\Sigma\rangle.

In short: the mean variation of self-information, aka the variation of Shannon entropy

\displaystyle{ S(t):= \sum_{a}\pi_{a}(t)i_a(t) }

is bounded from below by the mean cumulative skewness of the underlying stochastic trajectory.

This is the fundamental mathematical fact underlying irreversibility. To unravel its physical and biological consequences, it suffices to consider the origin and interpretation of the ‘skewness’ term in different contexts. (By the way, people usually call \Sigma the ‘entropy production’ or ‘dissipation function’—but how tautological is that?)
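
Before moving on, here is a small numerical sanity check of the integral fluctuation theorem: a Python sketch with made-up rates, not a proof. We simulate a two-state Markov process with constant transition rates, started in a non-stationary distribution, and check that \langle e^{-\Delta i+\Sigma}\rangle \approx 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two states {0, 1}; g[a, b] is the (made-up, constant) rate of jumping a -> b.
g = np.array([[0.0, 1.0],
              [2.0, 0.0]])
T = 1.0
pi_init = np.array([0.9, 0.1])       # a non-stationary initial distribution

lam = g[0, 1] + g[1, 0]              # relaxation rate of the two-state chain
p0_ss = g[1, 0] / lam                # stationary probability of state 0

def pi(t):
    """Exact solution of the two-state master equation."""
    p0 = p0_ss + (pi_init[0] - p0_ss) * np.exp(-lam * t)
    return np.array([p0, 1.0 - p0])

def trajectory():
    """Gillespie simulation up to time T; returns (a_0, a_T, cumulative skewness)."""
    a = rng.choice(2, p=pi_init)
    a0, t, Sigma = a, 0.0, 0.0
    while True:
        t += rng.exponential(1.0 / g[a, 1 - a])     # waiting time in state a
        if t > T:
            return a0, a, Sigma
        Sigma += np.log(g[1 - a, a] / g[a, 1 - a])  # skewness of the jump a -> 1-a
        a = 1 - a

pi_T = pi(T)
vals = []
for _ in range(100_000):
    a0, aT, Sigma = trajectory()
    delta_i = np.log(pi_init[a0]) - np.log(pi_T[aT])   # i_{a_T}(T) - i_{a_0}(0)
    vals.append(np.exp(-delta_i + Sigma))

print(np.mean(vals))   # should be close to 1
```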

The physical and biological consequences

Consider first the standard stochastic-thermodynamic scenario where a physical system is kept in contact with a thermal reservoir at inverse temperature \beta and undergoes thermally induced transitions between states a, b,\dots. By virtue of the detailed balance condition:

\displaystyle{ e^{-\beta E_{a}(t)}\gamma_{a b}(t)=e^{-\beta E_{b}(t)}\gamma_{b a}(t),}

the skewness \sigma_{j}(\tau_{j}) of each such transition is \beta times the energy difference between the states a_{j} and a_{j-1}, namely the heat received from the reservoir during the transition. Hence, the mean cumulative skewness \langle \Sigma\rangle is nothing but \beta\langle Q\rangle, with Q the total heat received by the system along the process. It follows from the detailed fluctuation theorem that

\langle e^{-\Delta i+\beta Q}\rangle=1

and therefore

\Delta S\geq\beta\langle Q\rangle

which is of course Clausius’ inequality. In a computational context where the control parameter is the entropy variation itself (such as in a bit-erasure protocol, where \Delta S=-\ln 2), this inequality in turn expresses Landauer’s principle: it is impossible to decrease the self-information of the system’s state without dissipating a minimal amount of heat into the environment (in this case -Q \geq k T\ln2, the ‘Landauer bound’). More general situations (several types of reservoirs, Maxwell-demon-like feedback controls) can be treated along the same lines, and the various forms of the Second Law derived from the detailed fluctuation theorem.

Now, many would agree that evolutionary dynamics is a wholly different business from thermodynamics; in particular, notions such as ‘heat’ or ‘temperature’ are clearly irrelevant to Darwinian evolution. However, the stochastic framework of Markov processes is relevant to describe the genetic evolution of a population, and this fact alone has important consequences. As a simple example, consider the time evolution of mutant fixations x_{a} in a population, with a ranging over the possible genotypes. In a ‘symmetric mutation scheme’, which I understand is biological parlance for ‘reversible Markov process’, meaning one that obeys detailed balance, the ratio between the a\mapsto b and b\mapsto a transition rates is completely determined by the fitnesses f_{a} and f_b of a and b, according to

\displaystyle{\frac{\gamma_{a b}}{\gamma_{b a}} =\left(\frac{f_{b}}{f_{a}}\right)^{\nu} }

where \nu is a model-dependent function of the effective population size [Sella2005]. Along a given history of mutant fixations, the cumulated skewness \Sigma is therefore given by minus the fitness flux:

\displaystyle{\Phi=\nu\sum_{j}(\ln f_{a_j}-\ln f_{a_{j-1}}).}

The integral fluctuation theorem then becomes the fitness flux theorem:

\displaystyle{ \langle e^{-\Delta i -\Phi}\rangle=1}

discussed recently by Mustonen and Lässig [Mustonen2010] and implying Fisher’s fundamental theorem of natural selection as a special case. (Incidentally, the ‘fitness flux theorem’ derived in this reference is more general than this; for instance, it does not rely on the ‘symmetric mutation scheme’ assumption above.) The ensuing inequality

\langle \Phi\rangle\geq-\Delta S

shows that a positive fitness flux is “an almost universal evolutionary principle of biological systems” [Mustonen2010], with negative contributions limited to time intervals with a systematic loss of adaptation (\Delta S > 0). This statement may well be the closest thing to a version of the Second Law of Thermodynamics applying to evolutionary dynamics.

It is really quite remarkable that thermodynamical dissipation and Darwinian evolution can be reduced to the same stochastic mechanism, and that notions such as ‘fitness flux’ and ‘heat’ can arise as two faces of the same mathematical coin, namely the ‘skewness’ of Markovian transitions. After all, the phenomenon of life is in itself a direct challenge to thermodynamics, isn’t it? While thermal phenomena tend to increase the world’s disorder, life strives to bring about and maintain exquisitely fine spatial and chemical structures—which is why Schrödinger famously proposed to define life as negative entropy. Could there be a more striking confirmation of his intuition—and a reconciliation of evolution and thermodynamics in the same go—than the fundamental inequality of adaptive evolution \langle\Phi\rangle\geq-\Delta S?

Surely the detailed fluctuation theorem for Markov processes has other applications, pertaining neither to thermodynamics nor adaptive evolution. Can you think of any?

Proof of the fluctuation theorem

I am a physicist, but knowing that many readers of John’s blog are mathematicians, I’ll do my best to frame—and prove—the FT as an actual theorem.

Let (\Omega,\mathcal{T},p) be a probability space and (\,\cdot\,)^{\dagger}:\Omega\to \Omega a measurable involution of \Omega. Denote by p^{\dagger} the pushforward probability measure under this involution, and

\displaystyle{ R=\ln \frac{d p}{d p^\dagger} }

the logarithm of the corresponding Radon-Nikodym derivative (we assume p^\dagger and p are mutually absolutely continuous). Then the following lemmas are true, with (1)\Rightarrow(2)\Rightarrow(3):

Lemma 1. The detailed fluctuation relation:

\forall A\in\mathbb{R} \quad  p\big(R^{-1}(-A) \big)=e^{-A}p \big(R^{-1}(A) \big)

Lemma 2. The integral fluctuation relation:

\displaystyle{\int_{\Omega} d p(\omega)\,e^{-R(\omega)}=1 }

Lemma 3. The positivity of the Kullback-Leibler divergence:

D(p\,\Vert\, p^{\dagger}):=\int_{\Omega} d p(\omega)\,R(\omega)\geq 0.

These are basic facts which anyone can show: (2)\Rightarrow(3) by Jensen’s inequality, (1)\Rightarrow(2) trivially, and (1) follows from R(\omega^{\dagger})=-R(\omega) and the change of variables theorem, as follows,

\begin{array}{ccl} \displaystyle{ \int_{R^{-1}(-A)} d p(\omega)} &=& \displaystyle{ \int_{R^{-1}(A)}d p^{\dagger}(\omega) } \\ \\ &=& \displaystyle{ \int_{R^{-1}(A)} d p(\omega)\, e^{-R(\omega)} } \\ \\ &=& \displaystyle{ e^{-A} \int_{R^{-1}(A)} d p(\omega)} .\end{array}

But here is the beauty: if

• (\Omega,\mathcal{T},p) is actually a Markov process defined over some time interval [0,T] and valued in some (say discrete) state space, with the instantaneous probability \pi_{a}(t)=p\big(\{\omega_{t}=a\} \big) of each state a satisfying the master equation (aka Kolmogorov equation)

\displaystyle{ \frac{d\pi_{a}(t)}{dt}=\sum_{b\neq a}\Big(\gamma_{b a}(t)\pi_{b}(t)-\gamma_{a b}(t)\pi_{a}(t)\Big),}

and

• the dagger involution is time-reversal, that is \omega^{\dagger}_{t}:=\omega_{T-t},

then for a given path

\displaystyle{\omega=(\omega_{0}=a_{0}\overset{\tau_{1}}{\longrightarrow} a_{1} \overset{\tau_{2}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T})\in\Omega}

the logarithmic ratio R(\omega) decomposes into ‘variation of self-information’ and ‘cumulative skewness’ along \omega:

\displaystyle{ R(\omega)=\underbrace{\Big(\ln\pi_{a_0}(0)-\ln\pi_{a_N}(T) \Big)}_{\Delta i(\omega)}-\underbrace{\sum_{j=1}^{N}\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}_{\Sigma(\omega)}.}

This is easy to see if one writes the probability of a path explicitly (with the convention \tau_{0}=0) as

\displaystyle{p(\omega)=\pi_{a_{0}}(0)\left[\prod_{j=1}^{N}\phi_{a_{j-1}}(\tau_{j-1},\tau_{j})\gamma_{a_{j-1}a_{j}}(\tau_{j})\right]\phi_{a_{N}}(\tau_{N},T)}

where

\displaystyle{ \phi_{a}(\tau,\tau')=\phi_{a}(\tau',\tau)=\exp\Big(-\sum_{b\neq a}\int_{\tau}^{\tau'}dt\, \gamma_{a b}(t)\Big)}

is the probability that the process remains in the state a between the times \tau and \tau'. It follows from the above lemma that

Theorem. Let (\Omega,\mathcal{T},p) be a Markov process and let \Delta i,\Sigma:\Omega\rightarrow \mathbb{R} be defined as above. Then we have

1. The detailed fluctuation theorem:

\forall A\in\mathbb{R}, p\big((\Delta i-\Sigma)^{-1}(-A) \big)=e^{-A}p \big((\Delta i-\Sigma)^{-1}(A) \big)

2. The integral fluctuation theorem:

\int_{\Omega} d p(\omega)\,e^{-\Delta i(\omega)+\Sigma(\omega)}=1

3. The ‘Second Law’ inequality:

\displaystyle{ \Delta S:=\int_{\Omega} d p(\omega)\,\Delta i(\omega)\geq \int_{\Omega} d p(\omega)\,\Sigma(\omega)}

The same theorem can be formulated for other kinds of Markov processes as well, including diffusion processes (in which case it follows from the Girsanov theorem).

References

Landauer’s principle was introduced here:

• [Landauer1961] R. Landauer, Irreversibility and heat generation in the computing process, IBM Journal of Research and Development 5 (1961), 183–191.

and is now being verified experimentally by various groups worldwide.

The ‘fundamental theorem of natural selection’ was derived by Fisher in his book:

• [Fisher1930] R. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.

His derivation has long been considered obscure, even perhaps wrong, but apparently the theorem is now well accepted. I believe the first Markovian models of genetic evolution appeared here:

• [Fisher1922] R. A. Fisher, On the dominance ratio, Proc. Roy. Soc. Edinb. 42 (1922), 321–341.

• [Wright1931] S. Wright, Evolution in Mendelian populations, Genetics 16 (1931), 97–159.

Fluctuation theorems are reviewed here:

• [Sevick2008] E. Sevick, R. Prabhakar, S. R. Williams, and D. J. Searles, Fluctuation theorems, Ann. Rev. Phys. Chem. 59 (2008), 603–633.

Two of the key ideas for the ‘detailed fluctuation theorem’ discussed here are due to Crooks:

• [Crooks1999] Gavin Crooks, The entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences, Phys. Rev. E 60 (1999), 2721–2726.

who identified E_{a_{j}}(\tau_{j})-E_{a_{j-1}}(\tau_{j}) as heat, and Seifert:

• [Seifert2005] Udo Seifert, Entropy production along a stochastic trajectory and an integral fluctuation theorem, Phys. Rev. Lett. 95 (2005), 040602.

who understood the relevance of the self-information in this context.

The connection between statistical physics and evolutionary biology is discussed here:

• [Sella2005] G. Sella and A.E. Hirsh, The application of statistical physics to evolutionary biology, Proc. Nat. Acad. Sci. USA 102 (2005), 9541–9546.

and the ‘fitness flux theorem’ is derived in

• [Mustonen2010] V. Mustonen and M. Lässig, Fitness flux and ubiquity of adaptive evolution, Proc. Nat. Acad. Sci. USA 107 (2010), 4248–4253.

Schrödinger’s famous discussion of the physical nature of life was published here:

• [Schrödinger1944] E. Schrödinger, What is Life?, Cambridge University Press, Cambridge, 1944.


An Entropy Challenge

29 August, 2012

If you like computer calculations, here’s a little challenge for you. Oscar Dahlsten may have solved it, but we’d love for you to check his work. It’s pretty important for the foundations of thermodynamics, but you don’t need to know any physics or even anything beyond a little algebra to tackle it! First I’ll explain it in really simple terms, then I’ll remind you a bit of why it matters.

We’re looking for two lists of nonnegative numbers, of the same length, listed in decreasing order:

p_1 \ge p_2 \ge \cdots \ge p_n \ge 0

q_1 \ge q_2 \ge \cdots \ge q_n \ge 0

that sum to 1:

p_1 + \cdots + p_n = 1

q_1 + \cdots + q_n = 1

and that obey this inequality:

\displaystyle{ \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta  \le  \frac{1}{1 - \beta} \ln \sum_{i=1}^n q_i^\beta }

for all 0 < \beta < \infty (ignoring \beta = 1), yet do not obey these inequalities:

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n.

Oscar’s proposed solution is this:

p = (0.4, 0.29, 0.29, 0.02)

q = (0.39, 0.31, 0.2, 0.1)

Can you see if this works? Is there a simpler example, like one with lists of just 3 numbers?
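
Here’s a quick Python sketch to check the proposed solution numerically. Sampling \beta on a finite grid can’t prove the Rényi inequality for all \beta, so a positive result here is only suggestive:

```python
import numpy as np

p = np.array([0.4, 0.29, 0.29, 0.02])
q = np.array([0.39, 0.31, 0.2, 0.1])

def renyi(p, beta):
    return np.log(np.sum(p ** beta)) / (1.0 - beta)

# Check H_beta(p) <= H_beta(q) on a grid of beta values, skipping beta = 1:
betas = np.concatenate([np.linspace(0.01, 0.99, 99),
                        np.linspace(1.01, 100.0, 1000)])
print(all(renyi(p, b) <= renyi(q, b) for b in betas))

# p majorizes q only if every partial sum of p dominates q's; look for a False:
print(np.cumsum(p) >= np.cumsum(q))
```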

This question came up near the end of my post More Second Laws of Thermodynamics. I phrased the question with a bit more jargon, and said a lot more about its significance. Suppose we have two probability distributions on a finite set, say p and q. We say p majorizes q if

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n, when we write both lists of numbers in decreasing order. This means p is ‘less flat’ than q, so it should have less entropy. And indeed it does: not just for ordinary entropy, but also for Rényi entropy! The Rényi entropy of p is defined by

\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta  }

where 0 < \beta < 1 or 1 < \beta < \infty. We can also define Rényi entropy for \beta = 0, 1, \infty by taking a limit, and at \beta = 1 we get the ordinary entropy

\displaystyle{ H_1(p) = - \sum_{i = 1}^n p_i \ln (p_i) }

The question is whether majorization is more powerful than Rényi entropy as a tool to tell when one probability distribution is less flat than another. I know that if p majorizes q, its Rényi entropy is less than that of q for all 0 \le \beta \le \infty. Your mission, should you choose to accept it, is to show the converse is not true.


More Second Laws of Thermodynamics

24 August, 2012

Oscar Dahlsten is visiting the Centre for Quantum Technologies, so we’re continuing some conversations about entropy that we started last year, back when the Entropy Club was active. But now Jamie Vicary and Brendan Fong are involved in the conversations.

I was surprised when Oscar told me that for a large class of random processes, the usual second law of thermodynamics is just one of infinitely many laws saying that various kinds of disorder increase. I’m annoyed that nobody ever told me about this before! It’s as if they told me about conservation of energy but not conservation of schmenergy, and phlenergy, and zenergy…

So I need to tell you about this. You may not understand it, but at least I can say I tried. I don’t want you blaming me for concealing all these extra second laws of thermodynamics!

Here’s the basic idea. Not all random processes are guaranteed to make entropy increase. But a bunch of them always make probability distributions flatter in a certain precise sense. This makes the entropy of the probability distribution increase. But when you make a probability distribution flatter in this sense, a bunch of other quantities increase too! For example, besides the usual entropy, there are infinitely many other kinds of entropy, called ‘Rényi entropies’, one for each number between 0 and ∞. And a doubly stochastic operator makes all the Rényi entropies increase! This fact is a special case of Theorem 10 here:

• Tim van Erven and Peter Harremoës, Rényi divergence and majorization.

Let me state this fact precisely, and then say a word about how this is related to quantum theory and ‘the collapse of the wavefunction’.

To keep things simple let’s talk about probability distributions on a finite set, though Erven and Harremoës generalize it all to a measure space.

How do we make precise the concept that one probability distribution is flatter than another? You know it when you see it, at least some of the time. For example, suppose I have some system in thermal equilibrium at some temperature, and the probabilities of it being in various states look like this:

Then say I triple the temperature. The probabilities flatten out:

But how can we make this concept precise in a completely general way? We can do it using the concept of ‘majorization’. If one probability distribution is less flat than another, people say it ‘majorizes’ that other one.

Here’s the definition. Say we have two probability distributions p and q on the same set. For each one, list the probabilities in decreasing order:

p_1 \ge p_2 \ge \cdots \ge p_n

q_1 \ge q_2 \ge \cdots \ge q_n

Then we say p majorizes q if

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n. So, the idea is that the biggest probabilities in the distribution p add up to more than the corresponding biggest ones in q.

In 1960, Alfred Rényi defined a generalization of the usual Shannon entropy that depends on a parameter \beta. If p is a probability distribution on a finite set, its Rényi entropy of order \beta is defined to be

\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_i p_i^\beta }

where 0 \le \beta \le \infty. Well, to be honest: if \beta is 0, 1, or \infty we have to define this by taking a limit where we let \beta creep up to that value. But the limit exists, and when \beta = 1 we get the usual Shannon entropy

\displaystyle{ H_1(p) = - \sum_i p_i \ln(p_i) }

As I explained a while ago, Rényi entropies are important ways of measuring biodiversity. But here’s what I learned just now, from the paper by Erven and Harremoës:

Theorem 1. If a probability distribution p majorizes a probability distribution q, its Rényi entropies are smaller:

\displaystyle{ H_\beta(p) \le H_\beta(q) }

for all 0 \le \beta < \infty.

And here’s what makes this fact so nice. If you do something to a classical system in a way that might involve some randomness, we can describe your action using a stochastic matrix. An n \times n matrix T is called stochastic if whenever p \in \mathbb{R}^n is a probability distribution, so is T p. This is equivalent to saying:

• the matrix entries of T are all \ge 0, and

• each column of T sums to 1.

If T is stochastic, it’s not necessarily true that the entropy of T p is greater than or equal to that of p, not even for the Shannon entropy.

Puzzle 1. Find a counterexample.

However, entropy does increase if we use specially nice stochastic matrices called ‘doubly stochastic’ matrices. People say a matrix T is doubly stochastic if it’s stochastic and it maps the probability distribution

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

to itself. This is the most spread-out probability distribution of all: every other probability distribution majorizes this one.

Why do they call such matrices ‘doubly’ stochastic? Well, if you’ve got a stochastic matrix, each column sums to one. But a stochastic operator is doubly stochastic if and only if each row sums to 1 as well.

Here’s a really cool fact:

Theorem 2. If T is doubly stochastic, p majorizes T p for any probability distribution p \in \mathbb{R}^n. Conversely, if a probability distribution p majorizes a probability distribution q, then q = T p for some doubly stochastic matrix T.

Taken together, Theorems 1 and 2 say that doubly stochastic transformations increase entropy… but not just Shannon entropy! They increase all the different Rényi entropies, as well. So if time evolution is described by a doubly stochastic matrix, we get lots of ‘second laws of thermodynamics’, saying that all these different kinds of entropy increase!
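
Here’s a small numerical illustration of this, a Python sketch. One easy way to produce a doubly stochastic matrix is to average random permutation matrices, since any such average is doubly stochastic:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5

def renyi(p, beta):
    p = p[p > 0]
    if np.isclose(beta, 1.0):
        return -np.sum(p * np.log(p))        # Shannon entropy, the beta -> 1 limit
    return np.log(np.sum(p ** beta)) / (1.0 - beta)

# A doubly stochastic matrix: an average of random permutation matrices.
T = np.mean([np.eye(n)[rng.permutation(n)] for _ in range(10)], axis=0)

p = rng.dirichlet(np.ones(n))                # a random probability distribution
q = T @ p

for beta in [0.0, 0.5, 1.0, 2.0, 10.0]:
    print(beta, renyi(p, beta) <= renyi(q, beta) + 1e-12)   # all True
```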

Finally, what does all this have to do with quantum mechanics, and collapsing the wavefunction? There are different things to say, but this is the simplest:

Theorem 3. Given two probability distributions p, q \in \mathbb{R}^n, then p majorizes q if and only if there exists a self-adjoint matrix D with eigenvalues p_i and diagonal entries q_i.

The matrix D will be a density matrix: a self-adjoint matrix with nonnegative eigenvalues and trace equal to 1. We use such matrices to describe mixed states in quantum mechanics.

Theorem 3 gives a precise sense in which preparing a quantum system in some state, letting time evolve, and then measuring it ‘increases randomness’.

How? Well, suppose we have a quantum system whose Hilbert space is \mathbb{C}^n. If we prepare the system in a mixture of the standard basis states with probabilities p_i, we can describe it with a diagonal density matrix D_0. Then suppose we wait a while and some unitary time evolution occurs. The system is now described by a new density matrix

D = U D_0 \, U^{-1}

where U is some unitary operator. If we then do a measurement to see which of the standard basis states our system now lies in, we’ll get the different possible results with probabilities q_i, the diagonal entries of D. But the eigenvalues of D will still be the numbers p_i. So, by the theorem, p majorizes q!

So, not only Shannon entropy but also all the Rényi entropies will increase!
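
Here’s a toy check of this story in Python; the eigenvalue list p and the random unitary are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

p = np.array([0.5, 0.3, 0.15, 0.05])   # eigenvalues of the initial density matrix
D0 = np.diag(p)

# A random unitary from the QR decomposition of a complex Gaussian matrix.
Z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
U, _ = np.linalg.qr(Z)

D = U @ D0 @ U.conj().T
q = np.diag(D).real                    # measurement probabilities in the standard basis

def majorizes(p, q):
    p, q = np.sort(p)[::-1], np.sort(q)[::-1]
    return bool(np.all(np.cumsum(p) >= np.cumsum(q) - 1e-12))

print(majorizes(p, q), majorizes(q, p))   # True, and (typically) False
```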

Of course, there are some big physics questions lurking here. Like: what about the real world? In the real world, do lots of different kinds of entropy tend to increase, or just some?

Of course, there’s a huge famous old problem about how reversible time evolution can be compatible with any sort of law saying that entropy must always increase! Still, there are some arguments, going back to Boltzmann’s H-theorem, which show entropy increases under some extra conditions. So then we can ask if other kinds of entropy, like Rényi entropy, increase as well. This will be true whenever we can argue that time evolution is described by doubly stochastic matrices. Theorem 3 gives a partial answer, but there’s probably much more to say.

I don’t have much more to say right now, though. I’ll just point out that while doubly stochastic matrices map the ‘maximally smeared-out’ probability distribution

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

to itself, a lot of this theory generalizes to stochastic matrices that map exactly one other probability distribution to itself. We need to work with relative Rényi entropy instead of Rényi entropy, and so on, but I don’t think these adjustments are really a big deal. And there are nice theorems that let you know when a stochastic matrix maps exactly one probability distribution to itself, based on the Perron–Frobenius theorem.

References

I already gave you a reference for Theorem 1, namely the paper by Erven and Harremoës, though I don’t think they were the first to prove this particular result: they generalize it quite a lot.

What about Theorem 2? It goes back at least to here:

• Barry C. Arnold, Majorization and the Lorenz Order: A Brief Introduction, Springer Lecture Notes in Statistics 43, Springer, Berlin, 1987.

The partial order on probability distributions given by majorization is also called the ‘Lorenz order’, but mainly when we consider probability distributions on infinite sets. This name presumably comes from the Lorenz curve, a measure of income inequality. This curve shows, for the bottom x% of households, what percentage y% of the total income they have:

Puzzle 2. If you’ve got two different probability distributions of incomes, and one majorizes the other, how are their Lorenz curves related?

When we generalize majorization by letting some other probability distribution take the place of

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

it seems people call it the ‘Markov order’. Here’s a really fascinating paper on that, which I’m just barely beginning to understand:

• A. N. Gorban, P. A. Gorban and G. Judge, Entropy: the Markov ordering approach, Entropy 12 (2010), 1145–1193.

What about Theorem 3? Apparently it goes back to here:

• A. Uhlmann, Wiss. Z. Karl-Marx-Univ. Leipzig 20 (1971), 633.

though I only know this thanks to a more recent paper:

• Michael A. Nielsen, Conditions for a class of entanglement transformations, Phys. Rev. Lett. 83 (1999), 436–439.

By the way, Nielsen’s paper contains another very nice result about majorization! Suppose you have states \psi and \phi of a 2-part quantum system. You can trace out one part and get density matrices describing mixed states of the other part, say D_\psi and D_\phi. Then Nielsen shows you can get from \psi to \phi using ‘local operations and classical communication’ if and only if D_\phi majorizes D_\psi. Note that things are going backwards here compared to how they’ve been going in the rest of this post: if we can get from \psi to \phi, then all forms of entropy go down when we go from D_\psi to D_\phi! This ‘anti-second-law’ behavior is confusing at first, but familiar to me by now.

When I first learned all this stuff, I naturally thought of the following question—maybe you did too, just now. If p, q \in \mathbb{R}^n are probability distributions and

\displaystyle{ H_\beta(p) \le H_\beta(q) }

for all 0 \le \beta < \infty, is it true that p majorizes q?

Apparently the answer must be no, because Klimesh has gone to quite a bit of work to obtain a weaker conclusion: not that p majorizes q, but that p \otimes r majorizes q \otimes r for some probability distribution r \in \mathbb{R}^m. He calls this catalytic majorization, with r serving as a ‘catalyst’:

• Matthew Klimesh, Inequalities that collectively completely characterize the catalytic majorization relation.

I thank Vlatko Vedral here at the CQT for pointing this out!

Finally, here is a good general introduction to majorization, pointed out by Vasileios Anagnostopoulos:

• T. Ando, Majorization, doubly stochastic matrices, and comparison of eigenvalues, Linear Algebra and its Applications 118 (1989), 163–248.


Network Theory (Part 24)

23 August, 2012

Now we’ve reached the climax of our story so far: we’re ready to prove the deficiency zero theorem. First let’s talk about it informally a bit. Then we’ll review the notation, and then—hang on to your seat!—we’ll give the proof.

The crucial trick is to relate a bunch of chemical reactions, described by a ‘reaction network’ like this:

to a simpler problem where a system randomly hops between states arranged in the same pattern:

This is sort of amazing, because we’ve thrown out lots of detail. It’s also amazing because this simpler problem is linear. In the original problem, the chance that a reaction turns a B + E into a D is proportional to the number of B’s times the number of E’s. That’s nonlinear! But in the simplified problem, the chance that your system hops from state 4 to state 3 is just proportional to the probability that it’s in state 4 to begin with. That’s linear.

The wonderful thing is that, at least under some conditions, we can find equilibrium solutions of our original problem starting from equilibrium solutions of the simpler problem.

Let’s roughly sketch how it works, and where we are so far. Our simplified problem is described by an equation like this:

\displaystyle{ \frac{d}{d t} \psi = H \psi }

where \psi is a function giving the probability of being in each state, and H describes the probability per time of hopping from one state to another. We can easily understand quite a lot about the equilibrium solutions, where \psi doesn’t change at all:

H \psi = 0

because this is a linear equation. We did this in Part 23. Of course, when I say ‘easily’, that’s a relative thing: we needed to use the Perron–Frobenius theorem, which Jacob introduced in Part 20. But that’s a well-known theorem in linear algebra, and it’s easy to apply here.

In Part 22, we saw that the original problem was described by an equation like this, called the ‘rate equation’:

\displaystyle{ \frac{d x}{d t} = Y H x^Y  }

Here x is a vector whose entries describe the amount of each kind of chemical: the amount of A’s, the amount of B’s, and so on. The matrix H is the same as in the simplified problem, but Y is a matrix that says how many times each chemical shows up in each spot in our reaction network:

The key thing to notice is x^Y, where we take a vector and raise it to the power of a matrix. We explained this operation back in Part 22. It’s this operation that says how many B + E pairs we have, for example, given the number of B’s and the number of E’s. It’s this that makes the rate equation nonlinear.

Now, we’re looking for equilibrium solutions of the rate equation, where the rate of change is zero:

Y H x^Y = 0

But in fact we’ll do even better! We’ll find solutions of this:

H x^Y = 0

And we’ll get these by taking our solutions of this:

H \psi = 0

and adjusting them so that

\psi = x^Y

while \psi remains a solution of H \psi = 0.

But: how do we do this ‘adjusting’? That’s the crux of the whole business! That’s what we’ll do today.

Remember, \psi is a function that gives a probability for each ‘state’, or numbered box here:

The picture here consists of two pieces, called ‘connected components’: the piece containing boxes 0 and 1, and the piece containing boxes 2, 3 and 4. It turns out that we can multiply \psi by a function that’s constant on each connected component, and if we had H \psi = 0 to begin with, that will still be true afterward. The reason is that there’s no way for \psi to ‘leak across’ from one component to another. It’s like having water in separate buckets. You can increase the amount of water in one bucket, and decrease it another, and as long as the water’s surface remains flat in each bucket, the whole situation remains in equilibrium.

That’s sort of obvious. What’s not obvious is that we can adjust \psi this way so as to ensure

\psi = x^Y

for some x.

And indeed, it’s not always true! It’s only true if our reaction network obeys a special condition. It needs to have ‘deficiency zero’. We defined this concept back in Part 21, but now we’ll finally use it for something. It turns out to be precisely the right condition to guarantee we can tweak any positive function on our set of states, multiplying it by a function that’s constant on each connected component, and get a new function \psi with

\psi = x^Y

When all is said and done, that is the key to the deficiency zero theorem.

Review

The battle is almost upon us—we’ve got one last chance to review our notation. We start with a stochastic reaction network:

This consists of:

• finite sets of transitions T, complexes K and species S,

• a map r: T \to (0,\infty) giving a rate constant for each transition,

source and target maps s,t : T \to K saying where each transition starts and ends,

• a one-to-one map Y : K \to \mathbb{N}^S saying how each complex is made of species.

Then we extend s, t and Y to linear maps:

Then we put inner products on these vector spaces as described last time, which lets us ‘turn around’ the maps s and t by taking their adjoints:

s^\dagger, t^\dagger : \mathbb{R}^K \to \mathbb{R}^T

More surprisingly, we can ‘turn around’ Y and get a nonlinear map using ‘matrix exponentiation’:

\begin{array}{ccc} \mathbb{R}^S &\to& \mathbb{R}^K \\                               x     &\mapsto&   x^Y \end{array}

This is most easily understood by thinking of x as a row vector and Y as a matrix:

Remember, complexes are made out of species. The matrix entry Y_{i j} says how many things of the jth species there are in a complex of the ith kind. If \psi \in \mathbb{R}^K says how many complexes there are of each kind, Y \psi \in \mathbb{R}^S says how many things there are of each species. Conversely, if x \in \mathbb{R}^S says how many things there are of each species, x^Y \in \mathbb{R}^K says how many ways we can build each kind of complex from them.

So, we get these maps:

Next, the boundary operator

\partial : \mathbb{R}^T \to \mathbb{R}^K

describes how each transition causes a change in complexes:

\partial = t - s

As we saw last time, there is a Hamiltonian

H : \mathbb{R}^K \to \mathbb{R}^K

describing a Markov processes on the set of complexes, given by

H = \partial s^\dagger

But the star of the show is the rate equation. This describes how the number of things of each species changes with time. We write these numbers in a list and get a vector x \in \mathbb{R}^S with nonnegative components. The rate equation says:

\displaystyle{ \frac{d x}{d t} = Y H x^Y }

We can read this as follows:

• x says how many things of each species we have now.

• x^Y says how many complexes of each kind we can build from these species.

• s^\dagger x^Y says how many transitions of each kind can originate from these complexes, with each transition weighted by its rate.

• H x^Y = \partial s^\dagger x^Y is the rate of change of the number of complexes of each kind, due to these transitions.

• Y H x^Y is the rate of change of the number of things of each species.
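
Here’s a tiny Python sketch walking through these stages with made-up ingredients: two species A and B, two complexes A + B and 2B, and one transition each way. The matrices below are invented for illustration, and we use the formula x^Y = \exp(Y^\dagger \ln x) proved later in this post:

```python
import numpy as np

# Rows: complexes, columns: species.  Complex 0 = A + B, complex 1 = 2B.
Ycomp = np.array([[1, 1],
                  [0, 2]])

# Transitions 0 -> 1 with rate 2 and 1 -> 0 with rate 1, packaged as the
# Hamiltonian H (infinitesimal stochastic: columns sum to zero).
H = np.array([[-2.0,  1.0],
              [ 2.0, -1.0]])

x = np.array([0.3, 0.7])           # amounts of species A and B

xY = np.exp(Ycomp @ np.log(x))     # x^Y: one entry per complex
dxdt = Ycomp.T @ (H @ xY)          # Y H x^Y, the right-hand side of the rate equation
print(xY, dxdt)
```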

The zero deficiency theorem

We are looking for equilibrium solutions of the rate equation, where the number of things of each species doesn’t change at all:

Y H x^Y = 0

In fact we will find complex balanced equilibrium solutions, where even the number of complexes of each kind doesn’t change:

H x^Y = 0

More precisely, we have:

Deficiency Zero Theorem (Child’s Version). Suppose we have a reaction network obeying these two conditions:

1. It is weakly reversible, meaning that whenever there’s a transition from one complex \kappa to another \kappa', there’s a directed path of transitions going back from \kappa' to \kappa.

2. It has deficiency zero, meaning \mathrm{im} \partial  \cap \mathrm{ker} Y = \{ 0 \} .

Then for any choice of rate constants there exists a complex balanced equilibrium solution of the rate equation where all species are present in nonzero amounts. In other words, there exists x \in \mathbb{R}^S with all components positive and such that:

H x^Y = 0

Proof. Because our reaction network is weakly reversible, the theorems in Part 23 show there exists \psi \in (0,\infty)^K with

H \psi = 0

This \psi may not be of the form x^Y, but we shall adjust \psi so that it becomes of this form, while still remaining a solution of H \psi = 0. To do this, we need a couple of lemmas:

Lemma 1. \mathrm{ker} \partial^\dagger + \mathrm{im} Y^\dagger = \mathbb{R}^K.

Proof. We need to use a few facts from linear algebra. If V is a finite-dimensional vector space with inner product, the orthogonal complement L^\perp of a subspace L \subseteq V consists of vectors that are orthogonal to everything in L:

L^\perp = \{ v \in V : \quad \forall w \in L \quad \langle v, w \rangle = 0 \}

We have

(L \cap M)^\perp = L^\perp + M^\perp

where L and M are subspaces of V and + denotes the sum of two subspaces: that is, the smallest subspace containing both. Also, if T: V \to W is a linear map between finite-dimensional vector spaces with inner product, we have

(\mathrm{ker} T)^\perp = \mathrm{im} T^\dagger

and

(\mathrm{im} T)^\perp = \mathrm{ker} T^\dagger

Now, because our reaction network has deficiency zero, we know that

\mathrm{im} \partial \cap \mathrm{ker} Y = \{ 0 \}

Taking the orthogonal complement of both sides, we get

(\mathrm{im} \partial \cap \mathrm{ker} Y)^\perp = \mathbb{R}^K

and using the rules we mentioned, we obtain

\mathrm{ker} \partial^\dagger + \mathrm{im} Y^\dagger = \mathbb{R}^K

as desired.   █

Now, given a vector \phi in \mathbb{R}^K or \mathbb{R}^S with all positive components, we can define the logarithm of such a vector, component-wise:

(\ln \phi)_i = \ln (\phi_i)

Similarly, for any vector \phi in either of these spaces, we can define its exponential in a component-wise way:

(\exp \phi)_i = \exp(\phi_i)

These operations are inverse to each other. Moreover:

Lemma 2. The nonlinear operator

\begin{array}{ccc} \mathbb{R}^S &\to& \mathbb{R}^K \\                               x     &\mapsto&   x^Y \end{array}

is related to the linear operator

\begin{array}{ccc} \mathbb{R}^S &\to& \mathbb{R}^K \\                               x     &\mapsto&   Y^\dagger x \end{array}

by the formula

x^Y = \exp(Y^\dagger \ln x )

which holds for all x \in (0,\infty)^S.

Proof. A straightforward calculation. By the way, this formula would look a bit nicer if we treated \ln x as a row vector and multiplied it on the right by Y: then we would have

x^Y = \exp((\ln x) Y)

The problem is that we are following the usual convention of multiplying vectors by matrices on the left, yet writing the matrix on the right in x^Y. Taking the transpose Y^\dagger of the matrix Y serves to compensate for this.   █

Now, given our vector \psi \in (0,\infty)^K with H \psi = 0, we can take its logarithm and get \ln \psi \in \mathbb{R}^K. Lemma 1 says that

\mathbb{R}^K = \mathrm{ker} \partial^\dagger + \mathrm{im} Y^\dagger

so we can write

\ln \psi =  \alpha + Y^\dagger \beta

where \alpha \in \mathrm{ker} \partial^\dagger and \beta \in \mathbb{R}^S. Moreover, we can write

\beta = \ln x

for some x \in (0,\infty)^S, so that

\ln \psi = \alpha + Y^\dagger (\ln x)

Exponentiating both sides component-wise, we get

\psi  =   \exp(\alpha) \; \exp(Y^\dagger (\ln x))

where at right we are taking the component-wise product of vectors. Thanks to Lemma 2, we conclude that

\psi = \exp(\alpha) x^Y

So, we have taken \psi and almost written it in the form x^Y—but not quite! We can adjust \psi to make it be of this form:

\exp(-\alpha) \psi = x^Y

Clearly all the components of \exp(-\alpha) \psi are positive, since the same is true for both \psi and \exp(-\alpha). So, the only remaining task is to check that

H(\exp(-\alpha) \psi) = 0

We do this using two lemmas:

Lemma 3. If H \psi = 0 and \alpha \in \mathrm{ker} \partial^\dagger, then H(\exp(-\alpha) \psi) = 0.

Proof. It is enough to check that multiplication by \exp(-\alpha) commutes with the Hamiltonian H, since then

H(\exp(-\alpha) \psi) = \exp(-\alpha) H \psi = 0

Recall from Part 23 that H is the Hamiltonian of a Markov process associated to this ‘graph with rates’:

As noted here:

• John Baez and Brendan Fong, A Noether theorem for Markov processes.

multiplication by some function on K commutes with H if and only if that function is constant on each connected component of this graph. Such functions are called conserved quantities.

So, it suffices to show that \exp(-\alpha) is constant on each connected component. For this, it is enough to show that \alpha itself is constant on each connected component. But this will follow from the next lemma, since \alpha \in \mathrm{ker} \partial^\dagger.   █

Lemma 4. A function \alpha \in \mathbb{R}^K is a conserved quantity iff \partial^\dagger \alpha = 0 . In other words, \alpha is constant on each connected component of the graph s, t: T \to K iff \partial^\dagger \alpha = 0 .

Proof. Suppose \partial^\dagger \alpha = 0, or in other words, \alpha \in \mathrm{ker} \partial^\dagger, or in still other words, \alpha \in (\mathrm{im} \partial)^\perp. To show that \alpha is constant on each connected component, it suffices to show that whenever we have two complexes connected by a transition, like this:

\tau: \kappa \to \kappa'

then \alpha takes the same value at both these complexes:

\alpha_\kappa = \alpha_{\kappa'}

To see this, note

\partial \tau = t(\tau) - s(\tau) = \kappa' - \kappa

and since \alpha \in (\mathrm{im} \partial)^\perp, we conclude

\langle \alpha, \kappa' - \kappa \rangle = 0

But calculating this inner product, we see

\alpha_{\kappa'} - \alpha_{\kappa} = 0

as desired.

For the converse, we simply turn the argument around: if \alpha is constant on each connected component, we see \langle \alpha, \kappa' - \kappa \rangle = 0 whenever there is a transition \tau : \kappa \to \kappa'. It follows that \langle \alpha, \partial \tau \rangle = 0 for every transition \tau, so \alpha \in (\mathrm{im} \partial)^\perp .

And thus concludes the proof of the lemma!   █

And thus concludes the proof of the theorem!   █

And thus concludes this post!


Network Theory (Part 23)

21 August, 2012

We’ve been looking at reaction networks, and we’re getting ready to find equilibrium solutions of the equations they give. To do this, we’ll need to connect them to another kind of network we’ve studied. A reaction network is something like this:

It’s a bunch of complexes, which are sums of basic building-blocks called species, together with arrows called transitions going between the complexes. If we know a number for each transition describing the rate at which it occurs, we get an equation called the ‘rate equation’. This describes how the amount of each species changes with time. We’ve been talking about this equation ever since the start of this series! Last time, we wrote it down in a new very compact form:

\displaystyle{ \frac{d x}{d t} = Y H x^Y  }

Here x is a vector whose components are the amounts of each species, while H and Y are certain matrices.

But now suppose we forget how each complex is made of species! Suppose we just think of them as abstract things in their own right, like numbered boxes:

We can use these boxes to describe states of some system. The arrows still describe transitions, but now we think of these as ways for the system to hop from one state to another. Say we know a number for each transition describing the probability per time at which it occurs:

Then we get a ‘Markov process’—or in other words, a random walk where our system hops from one state to another. If \psi is the probability distribution saying how likely the system is to be in each state, this Markov process is described by this equation:

\displaystyle{ \frac{d \psi}{d t} = H \psi  }

This is simpler than the rate equation, because it’s linear. But the matrix H is the same—we’ll see that explicitly later on today.

What’s the point? Well, our ultimate goal is to prove the deficiency zero theorem, which gives equilibrium solutions of the rate equation. That means finding x with

Y H x^Y = 0

Today we’ll find all equilibria for the Markov process, meaning all \psi with

H \psi = 0

Then next time we’ll show some of these have the form

\psi = x^Y

So, we’ll get

H x^Y = 0

and thus

Y H x^Y = 0

as desired!

So, let’s get to work.

The Markov process of a graph with rates

We’ve been looking at stochastic reaction networks, which are things like this:

However, we can build a Markov process starting from just part of this information:

Let’s call this thing a ‘graph with rates’, for lack of a better name. We’ve been calling the things in K ‘complexes’, but now we’ll think of them as ‘states’. So:

Definition. A graph with rates consists of:

• a finite set of states K,

• a finite set of transitions T,

• a map r: T \to (0,\infty) giving a rate constant for each transition,

source and target maps s,t : T \to K saying where each transition starts and ends.

Starting from this, we can get a Markov process describing how a probability distribution \psi on our set of states will change with time. As usual, this Markov process is described by a master equation:

\displaystyle{ \frac{d \psi}{d t} = H \psi }

for some Hamiltonian:

H : \mathbb{R}^K \to \mathbb{R}^K

What is this Hamiltonian, exactly? Let’s think of it as a matrix where H_{i j} is the probability per time for our system to hop from the state j to the state i. This looks backwards, but don’t blame me—blame the guys who invented the usual conventions for matrix algebra. Clearly if i \ne j this probability per time should be the sum of the rate constants of all transitions going from j to i:

\displaystyle{ i \ne j \quad \Rightarrow \quad H_{i j} =  \sum_{\tau: j \to i} r(\tau) }

where we write \tau: j \to i when \tau is a transition with source j and target i.

Now, we saw in Part 11 that for a probability distribution to remain a probability distribution as it evolves in time according to the master equation, we need H to be infinitesimal stochastic: its off-diagonal entries must be nonnegative, and the sum of the entries in each column must be zero.

The first condition holds already, and the second one tells us what the diagonal entries must be. So, we’re basically done describing H. But we can summarize it this way:

Puzzle 1. Think of \mathbb{R}^K as the vector space consisting of finite linear combinations of elements \kappa \in K. Then show

\displaystyle{  H \kappa = \sum_{s(\tau) = \kappa} r(\tau) (t(\tau) - s(\tau)) }
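
In case you’d like to experiment, here’s a little Python sketch that builds H from a graph with rates, following exactly this recipe. The states, transitions and rate constants are invented for illustration:

import numpy as np

# An invented graph with rates: 3 states, and transitions given as
# (source, target, rate constant) triples.
n = 3
transitions = [(0, 1, 2.0), (1, 0, 1.0), (1, 2, 0.5), (2, 1, 0.5)]

# Off-diagonal: H[i, j] is the sum of rate constants of transitions from j to i.
# Diagonal: fixed by making each column sum to zero.
H = np.zeros((n, n))
for source, target, rate in transitions:
    H[target, source] += rate
    H[source, source] -= rate

# Sanity check: H is infinitesimal stochastic.
assert np.allclose(H.sum(axis=0), 0)
assert np.all(H[~np.eye(n, dtype=bool)] >= 0)
print(H)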

Equilibrium solutions of the master equation

Now we’ll classify equilibrium solutions of the master equation, meaning \psi \in \mathbb{R}^K with

H \psi = 0

We’ll only do this when our graph with rates is ‘weakly reversible’. This concept doesn’t actually depend on the rates, so let’s be general and say:

Definition. A graph is weakly reversible if for every edge \tau : i \to j, there is a directed path going back from j to i, meaning that we have edges

\tau_1 : j \to j_1 , \quad \tau_2 : j_1 \to j_2 , \quad \dots, \quad \tau_n: j_{n-1} \to i

This graph with rates is not weakly reversible:

but this one is:

The good thing about the weakly reversible case is that we get one equilibrium solution of the master equation for each component of our graph, and all equilibrium solutions are linear combinations of these. This is not true in general! For example, this guy is not weakly reversible:

It has only one component, but the master equation has two linearly independent equilibrium solutions: one that vanishes except at the state 0, and one that vanishes except at the state 2.

The idea of a ‘component’ is supposed to be fairly intuitive—our graph falls apart into pieces called components—but we should make it precise. As explained in Part 21, the graphs we’re using here are directed multigraphs, meaning things like

s, t : E \to V

where E is the set of edges (our transitions) and V is the set of vertices (our states). There are actually two famous concepts of ‘component’ for graphs of this sort: ‘strongly connected’ components and ‘connected’ components. We only need connected components, but let me explain both concepts, in a futile attempt to slake your insatiable thirst for knowledge.

Two vertices i and j of a graph lie in the same strongly connected component iff you can find a directed path of edges from i to j and also one from j back to i.

Remember, a directed path from i to j looks like this:

i \to a \to b \to c \to j

Here’s a path from i to j that is not directed:

i \to a \leftarrow b \to c \to j

and I hope you can write down the obvious but tedious definition of an ‘undirected path’, meaning a path made of edges that don’t necessarily point in the correct direction. Given that, we say two vertices i and j lie in the same connected component iff you can find an undirected path going from i to j. In this case, there will automatically also be an undirected path going from j to i.

For example, i and j lie in the same connected component here, but not the same strongly connected component:

i \to a \leftarrow b \to c \to j

Here’s a graph with one connected component and 3 strongly connected components, which are marked in blue:

For the theory we’re looking at now, we only care about connected components, not strongly connected components! However:

Puzzle 2. Show that for a weakly reversible graph, the connected components are the same as the strongly connected components.
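
I won’t spoil the puzzle, but if you’d rather let the computer find the components of a specific graph, scipy computes both kinds at once. Here’s a sketch using the example path above, with the vertices i, a, b, c, j numbered 0 through 4:

import numpy as np
from scipy.sparse.csgraph import connected_components

# Adjacency matrix of the directed graph i -> a <- b -> c -> j.
A = np.zeros((5, 5))
for source, target in [(0, 1), (2, 1), (2, 3), (3, 4)]:
    A[source, target] = 1

n_weak, weak_labels = connected_components(A, connection='weak')
n_strong, strong_labels = connected_components(A, connection='strong')
print(n_weak, weak_labels)      # 1 connected component
print(n_strong, strong_labels)  # 5 strongly connected components: no directed cycles here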

With these definitions out of the way, we can state today’s big theorem:

Theorem. Suppose H is the Hamiltonian of a weakly reversible graph with rates:

Then for each connected component C \subseteq K, there exists a unique probability distribution \psi_C \in \mathbb{R}^K that is positive on that component, zero elsewhere, and is an equilibrium solution of the master equation:

H \psi_C = 0

Moreover, these probability distributions \psi_C form a basis for the space of equilibrium solutions of the master equation. So, the dimension of this space is the number of components of K.

Proof. We start by assuming our graph has one connected component. We use the Perron–Frobenius theorem, as explained in Part 20. This applies to ‘nonnegative’ matrices, meaning those whose entries are all nonnegative. That is not true of H itself, but only its diagonal entries can be negative, so if we choose a large enough number c > 0, H + c I will be nonnegative.

Since our graph is weakly reversible and has one connected component, it follows straight from the definitions that the operator H + c I will also be ‘irreducible’ in the sense of Part 20. The Perron–Frobenius theorem then swings into action, and we instantly conclude several things.

First, H + c I has a positive real eigenvalue r such that any other eigenvalue, possibly complex, has absolute value \le r. Second, there is an eigenvector \psi with eigenvalue r and all positive components. Third, any other eigenvector with eigenvalue r is a scalar multiple of \psi.

Subtracting c, it follows that \lambda = r - c is the eigenvalue of H with the largest real part. We have H \psi = \lambda \psi, and any other vector with this property is a scalar multiple of \psi.

We can show that in fact \lambda = 0. To do this we copy an argument from Part 20. First, since \psi is positive we can normalize it to be a probability distribution:

\displaystyle{ \sum_{i \in K} \psi_i = 1 }

Since H is infinitesimal stochastic, \exp(t H) sends probability distributions to probability distributions:

\displaystyle{ \sum_{i \in K} (\exp(t H) \psi)_i = 1 }

for all t \ge 0. On the other hand,

\displaystyle{ \sum_{i \in K} (\exp(t H)\psi)_i = \sum_{i \in K} e^{t \lambda} \psi_i = e^{t \lambda} }

so we must have \lambda = 0.

We conclude that when our graph has one connected component, there is a probability distribution \psi \in \mathbb{R}^K that is positive everywhere and has H \psi = 0. Moreover, any \phi \in \mathbb{R}^K with H \phi = 0 is a scalar multiple of \psi.

When K has several components, the matrix H is block diagonal, with one block for each component. So, we can run the above argument on each component C \subseteq K and get a probability distribution \psi_C \in \mathbb{R}^K that is positive on C. We can then check that H \psi_C = 0 and that every \phi \in \mathbb{R}^K with H \phi = 0 can be expressed as a linear combination of these probability distributions \psi_C in a unique way.   █

This result must be absurdly familiar to people who study Markov processes, but I haven’t bothered to look up a reference yet. Do you happen to know a good one? I’d like to see one that generalizes this theorem to graphs that aren’t weakly reversible. I think I see how it goes. We don’t need that generalization right now, but it would be good to have around.
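
In the meantime, here’s a numerical sketch of the theorem in Python. We build H for an invented weakly reversible graph with two connected components, check that the kernel of H is 2-dimensional, and extract one positive equilibrium distribution per component:

import numpy as np
from scipy.linalg import null_space
from scipy.sparse.csgraph import connected_components

# Invented example: two disjoint 2-cycles, so the graph is weakly reversible
# and has two connected components.
n = 4
transitions = [(0, 1, 1.0), (1, 0, 2.0), (2, 3, 3.0), (3, 2, 1.0)]
H = np.zeros((n, n))
adjacency = np.zeros((n, n))
for source, target, rate in transitions:
    H[target, source] += rate
    H[source, source] -= rate
    adjacency[source, target] = 1

print(null_space(H).shape[1])   # dim ker H = 2 = number of connected components

# One equilibrium distribution psi_C per connected component:
n_comp, labels = connected_components(adjacency, connection='weak')
for c in range(n_comp):
    states = np.where(labels == c)[0]
    block = H[np.ix_(states, states)]   # H is block diagonal, one block per component
    psi = null_space(block)[:, 0]
    psi = psi / psi.sum()               # normalize to a probability distribution
    psi_C = np.zeros(n)
    psi_C[states] = psi
    print(psi_C)                        # positive on C, zero elsewhere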

The Hamiltonian, revisited

One last small piece of business: last time I showed you a very slick formula for the Hamiltonian H. I’d like to prove it agrees with the formula I gave this time.

We start with any graph with rates:

We extend s and t to linear maps between vector spaces:

We define the boundary operator just as we did last time:

\partial = t - s

Then we put an inner product on the vector spaces \mathbb{R}^T and \mathbb{R}^K. So, for \mathbb{R}^K we let the elements of K be an orthonormal basis, but for \mathbb{R}^T we define the inner product in a more clever way involving the rate constants:

\displaystyle{ \langle \tau, \tau' \rangle = \frac{1}{r(\tau)} \delta_{\tau, \tau'} }

where \tau, \tau' \in T. This lets us define adjoints of the maps s, t and \partial, via formulas like this:

\langle s^\dagger \phi, \psi \rangle = \langle \phi, s \psi \rangle

Then:

Theorem. The Hamiltonian for a graph with rates is given by

H = \partial s^\dagger

Proof. It suffices to check that this formula agrees with the formula for H given in Puzzle 1:

\displaystyle{   H \kappa = \sum_{s(\tau) = \kappa} r(\tau) (t(\tau) - s(\tau)) }

Here we are using the complex \kappa \in K as a name for one of the standard basis vectors of \mathbb{R}^K. Similarly, we shall write things like \tau or \tau' for basis vectors of \mathbb{R}^T.

First, we claim that

\displaystyle{ s^\dagger \kappa = \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \tau }

To prove this it’s enough to check that taking the inner product of either side with any basis vector \tau', we get results that agree. On the one hand:

\begin{array}{ccl}  \langle \tau' , s^\dagger \kappa \rangle &=&   \langle s \tau', \kappa \rangle \\  \\  &=& \delta_{s(\tau'), \kappa}    \end{array}

On the other hand:

\begin{array}{ccl} \displaystyle{ \langle \tau', \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \tau \rangle } &=&  \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \langle \tau', \tau \rangle   \\  \\  &=& \displaystyle{ \sum_{\tau: \; s(\tau) = \kappa} \delta_{\tau', \tau} }   \\  \\  &=&   \delta_{s(\tau'), \kappa}   \end{array}

where the factor of 1/r(\tau) in the inner product on \mathbb{R}^T cancels the visible factor of r(\tau). So indeed the results match.

Using this formula for s^\dagger \kappa we now see that

\begin{array}{ccl}  H \kappa &=& \partial s^\dagger \kappa   \\  \\  &=& \partial \displaystyle{ \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, \tau }    \\  \\  &=& \displaystyle{ \sum_{\tau: \; s(\tau) = \kappa} r(\tau) \, (t(\tau) - s(\tau)) }  \end{array}

which is precisely what we want.   █

I hope you see through the formulas to their intuitive meaning. As usual, the formulas are just a way of precisely saying something that makes plenty of sense. If \kappa is some state of our Markov process, s^\dagger \kappa is the sum of all transitions starting at this state, weighted by their rates. Applying \partial to a transition tells us what change in state it causes. So \partial s^\dagger \kappa tells us the rate at which things change when we start in the state \kappa. That’s why \partial s^\dagger is the Hamiltonian for our Markov process. After all, the Hamiltonian tells us how things change:

\displaystyle{ \frac{d \psi}{d t} = H \psi }
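
Here’s a quick numerical check of the theorem, on an invented graph with rates. Writing S and T for the matrices of s and t, and R for the diagonal matrix of rate constants, the weighted inner product above makes the adjoint work out to s^\dagger = R S^\top:

import numpy as np

# An invented graph with rates: 3 states, transitions (source, target, rate).
transitions = [(0, 1, 2.0), (1, 0, 1.0), (1, 2, 0.5)]
n_states, n_trans = 3, len(transitions)

S = np.zeros((n_states, n_trans))   # the matrix of s
T = np.zeros((n_states, n_trans))   # the matrix of t
r = np.array([rate for _, _, rate in transitions])
for tau, (source, target, _) in enumerate(transitions):
    S[source, tau] = 1
    T[target, tau] = 1

s_dagger = np.diag(r) @ S.T         # the adjoint for <tau, tau'> = delta / r(tau)
H_slick = (T - S) @ s_dagger        # H = (boundary operator) s-dagger

# The Hamiltonian from Puzzle 1, built entry by entry:
H_direct = np.zeros((n_states, n_states))
for source, target, rate in transitions:
    H_direct[target, source] += rate
    H_direct[source, source] -= rate

assert np.allclose(H_slick, H_direct)   # the two formulas agree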

Okay, we’ve got all the machinery in place. Next time we’ll prove the deficiency zero theorem!


Network Theory (Part 20)

6 August, 2012

guest post by Jacob Biamonte

We’re in the middle of a battle: in addition to our typical man versus equation scenario, it’s a battle between two theories. For those good patrons following the network theory series, you know the two opposing forces well. It’s our old friends, at it again:

Stochastic Mechanics vs Quantum Mechanics!

Today we’re reporting live from a crossroads, and we’re facing a skirmish that gives rise to what some might consider a paradox. Let me sketch the main thesis before we get our hands dirty with the gory details.

First I need to tell you that the battle takes place at the intersection of stochastic and quantum mechanics. We recall from Part 16 that there is a class of operators called ‘Dirichlet operators’ that are valid Hamiltonians for both stochastic and quantum mechanics. In other words, you can use them to generate time evolution both for old-fashioned random processes and for quantum processes!

Staying inside this class allows the theories to fight it out on the same turf. We will be considering a special subclass of Dirichlet operators, which we call ‘irreducible Dirichlet operators’. These are the ones where starting in any state in our favorite basis of states, we have a nonzero chance of winding up in any other. When considering this subclass, we found something interesting:

Thesis. Let H be an irreducible Dirichlet operator with n eigenstates. In stochastic mechanics, there is only one valid state that is an eigenvector of H: the unique so-called ‘Perron–Frobenius state’. The other n-1 eigenvectors are forbidden states of a stochastic system: the stochastic system is either in the Perron–Frobenius state, or in a superposition of at least two eigenvectors. In quantum mechanics, all n eigenstates of H are valid states.

This might sound like a riddle, but today as we’ll prove, riddle or not, it’s a fact. If it makes sense, well that’s another issue. As John might have said, it’s like a bone kicked down from the gods up above: we can either choose to chew on it, or let it be. Today we are going to do a bit of chewing.

One of the many problems with this post is that John had a nut loose on his keyboard. It was not broken! I’m saying he wrote enough blog posts on this stuff to turn them into a book. I’m supposed to be compiling the blog articles into a massive LaTeX file, but I wrote this instead.

Another problem is that this post somehow seems to use just about everything said before, so I’m going to have to do my best to make things self-contained. Please bear with me as I try to recap what’s been done. For those of you familiar with the series, a good portion of the background for what we’ll cover today can be found in Part 12 and Part 16.

At the intersection of two theories

As John has mentioned in his recent talks, the typical view of how quantum mechanics and probability theory come into contact looks like this:

The idea is that quantum theory generalizes classical probability theory by considering observables that don’t commute.

That’s perfectly valid, but we’ve been exploring an alternative view in this series. Here quantum theory doesn’t subsume probability theory, but they intersect:

What goes in the middle you might ask? As odd as it might sound at first, John showed in Part 16 that electrical circuits made of resistors constitute the intersection!

For example, a circuit like this:

gives rise to a Hamiltonian H that’s good both for stochastic mechanics and quantum mechanics. Indeed, he found that the power dissipated by a circuit made of resistors is related to the familiar quantum theory concept known as the expectation value of the Hamiltonian!

\textrm{power} = -2 \langle \psi, H \psi \rangle

Oh—and you might think we made a mistake and wrote our Ω (ohm) symbols upside down. We didn’t. It happens that ℧ is the symbol for a ‘mho’—a unit of conductance that’s the reciprocal of an ohm. Check out Part 16 for the details.

Stochastic mechanics versus quantum mechanics

Let’s recall how states, time evolution, symmetries and observables work in the two theories. Today we’ll fix a basis for our vector space of states, and we’ll assume it’s finite-dimensional so that all vectors have n components over either the complex numbers \mathbb{C} or the reals \mathbb{R}. In other words, we’ll treat our space as either \mathbb{C}^n or \mathbb{R}^n. In this fashion, linear operators that map such spaces to themselves will be represented as square matrices.

Vectors will be written as \psi_i where the index i runs from 1 to n, and we think of each choice of the index as a state of our system—but since we’ll be using that word in other ways too, let’s call it a configuration. It’s just a basic way our system can be.

States

Besides the configurations i = 1,\dots, n, we have more general states that tell us the probability or amplitude of finding our system in one of these configurations:

Stochastic states are n-tuples of nonnegative real numbers:

\psi_i \geq 0

The probability of finding the system in the ith configuration is defined to be \psi_i. For these probabilities to sum to one, \psi_i needs to be normalized like this:

\displaystyle{ \sum_i \psi_i = 1 }

or in the notation we’re using in these articles:

\displaystyle{ \langle \psi \rangle = 1 }

where we define

\displaystyle{ \langle \psi \rangle = \sum_i \psi_i }

Quantum states are n-tuples of complex numbers:

\psi_i \in \mathbb{C}

The probability of finding the system in the ith configuration is defined to be |\psi_i|^2. For these probabilities to sum to one, \psi needs to be normalized like this:

\displaystyle{ \sum_i |\psi_i|^2 = 1 }

or in other words

\langle \psi,  \psi \rangle = 1

where the inner product of two vectors \psi and \phi is defined by

\displaystyle{ \langle \psi, \phi \rangle = \sum_i \overline{\psi}_i \phi_i }

Now, the usual way to turn a quantum state \psi into a stochastic state is to take the absolute value of each number \psi_i and then square it. However, if the numbers \psi_i happen to be nonnegative, we can also turn \psi into a stochastic state simply by multiplying it by a number to ensure \langle \psi \rangle = 1.

This is very unorthodox, but it lets us evolve the same vector \psi either stochastically or quantum-mechanically, using the recipes I’ll describe next. In physics jargon these correspond to evolution in ‘real time’ and ‘imaginary time’. But don’t ask me which is which: from a quantum viewpoint stochastic mechanics uses imaginary time, but from a stochastic viewpoint it’s the other way around!

Time evolution

Time evolution works similarly in stochastic and quantum mechanics, but with a few big differences:

• In stochastic mechanics the state changes in time according to the master equation:

\displaystyle{ \frac{d}{d t} \psi(t) = H \psi(t) }

which has the solution

\psi(t) = \exp(t H) \psi(0)

• In quantum mechanics the state changes in time according to Schrödinger’s equation:

\displaystyle{ \frac{d}{d t} \psi(t) = -i H \psi(t) }

which has the solution

\psi(t) = \exp(-i t H) \psi(0)

The operator H is called the Hamiltonian. The properties it must have depend on whether we’re doing stochastic mechanics or quantum mechanics:

• We need H to be infinitesimal stochastic for time evolution given by \exp(t H) to send stochastic states to stochastic states. In other words, we need that (i) its columns sum to zero and (ii) its off-diagonal entries are real and nonnegative:

\displaystyle{ \sum_i H_{i j}=0 }

i\neq j \Rightarrow H_{i j}\geq 0

• We need H to be self-adjoint for time evolution given by \exp(-i t H) to send quantum states to quantum states. So, we need

H = H^\dagger

where we recall that the adjoint of a matrix is the conjugate of its transpose:

\displaystyle{ (H^\dagger)_{i j} := \overline{H}_{j i} }

We are concerned with the case where the operator H generates both a valid quantum evolution and also a valid stochastic one:

H is a Dirichlet operator if it’s both self-adjoint and infinitesimal stochastic.
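
To make this concrete, here’s a small Python sketch with an arbitrary 2×2 example. It checks both conditions, then evolves the same vector stochastically via \exp(t H) and quantum-mechanically via \exp(-i t H):

import numpy as np
from scipy.linalg import expm

H = np.array([[-1.0,  1.0],
              [ 1.0, -1.0]])

# Dirichlet operator = self-adjoint + infinitesimal stochastic.
assert np.allclose(H, H.conj().T)
assert np.allclose(H.sum(axis=0), 0)
assert np.all(H[~np.eye(2, dtype=bool)] >= 0)

t = 1.3
psi = np.array([0.9, 0.1])           # a stochastic state: nonnegative, sums to 1

psi_t = expm(t * H) @ psi            # stochastic evolution
print(psi_t, psi_t.sum())            # still a stochastic state: the sum stays 1

phi = psi / np.linalg.norm(psi)      # the same vector, normalized in the 2-norm
phi_t = expm(-1j * t * H) @ phi      # quantum evolution
print(phi_t, np.linalg.norm(phi_t))  # still a quantum state: the norm stays 1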

We will soon go further and zoom in on this intersection! But first let’s finish our review.

Symmetries

As John explained in Part 12, besides states and observables we need symmetries, which are transformations that map states to states. These include the evolution operators which we only briefly discussed in the preceding subsection.

• A linear map U that sends quantum states to quantum states is called an isometry, and isometries are characterized by this property:

U^\dagger U = 1

• A linear map U that sends stochastic states to stochastic states is called a stochastic operator, and stochastic operators are characterized by these properties:

\displaystyle{ \sum_i U_{i j} = 1 }

and

U_{i j} \geq 0

A notable difference here is that in our finite-dimensional situation, isometries are always invertible, but stochastic operators may not be! If U is an n \times n matrix that’s an isometry, U^\dagger is its inverse. So, we also have

U U^\dagger = 1

and we say U is unitary. But if U is stochastic, it may not have an inverse—and even if it does, its inverse is rarely stochastic. This explains why in stochastic mechanics time evolution is often not reversible, while in quantum mechanics it always is.

Puzzle 1. Suppose U is a stochastic n \times n matrix whose inverse is stochastic. What are the possibilities for U?

It is quite hard for an operator to be a symmetry in both stochastic and quantum mechanics, especially in our finite-dimensional situation:

Puzzle 2. Suppose U is an n \times n matrix that is both stochastic and unitary. What are the possibilities for U?

Observables

‘Observables’ are real-valued quantities that can be measured, or predicted, given a specific theory.

• In quantum mechanics, an observable is given by a self-adjoint matrix O, and the expected value of the observable O in the quantum state \psi is

\displaystyle{ \langle \psi , O \psi \rangle = \sum_{i,j} \overline{\psi}_i O_{i j} \psi_j }

• In stochastic mechanics, an observable O has a value O_i in each configuration i, and the expected value of the observable O in the stochastic state \psi is

\displaystyle{ \langle O \psi \rangle = \sum_i O_i \psi_i }

We can turn an observable in stochastic mechanics into an observable in quantum mechanics by making a diagonal matrix whose diagonal entries are the numbers O_i.
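
In code, the two kinds of expected value look like this (all the numbers are invented):

import numpy as np

O_values = np.array([1.0, 2.0, 3.0])   # an observable: one number per configuration

# Stochastic expected value <O psi> = sum_i O_i psi_i:
psi_stochastic = np.array([0.5, 0.3, 0.2])
print(O_values @ psi_stochastic)

# Quantum expected value <psi, O psi>, with O as a diagonal matrix:
O_matrix = np.diag(O_values)
psi_quantum = np.array([1, 1j, 0]) / np.sqrt(2)
print(np.real(psi_quantum.conj() @ O_matrix @ psi_quantum))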

From graphs to matrices

Back in Part 16, John explained how a graph with positive numbers on its edges gives rise to a Hamiltonian in both quantum and stochastic mechanics—in other words, a Dirichlet operator.

Here’s how this works. We’ll consider simple graphs: graphs without arrows on their edges, with at most one edge from one vertex to another, with no edges from a vertex to itself. And we’ll only look at graphs with finitely many vertices and edges. We’ll assume each edge is labelled by a positive number, like this:

If our graph has n vertices, we can create an n \times n matrix A where A_{i j} is the number labelling the edge from i to j, if there is such an edge, and 0 if there’s not. This matrix is symmetric, with real entries, so it’s self-adjoint. So A is a valid Hamiltonian in quantum mechanics.

How about stochastic mechanics? Remember that a Hamiltonian H in stochastic mechanics needs to be ‘infinitesimal stochastic’. So, its off-diagonal entries must be nonnegative, which is indeed true for our A, but also the sums of its columns must be zero, which is not true when our A is nonzero.

But now comes the best news you’ve heard all day: we can improve A to an infinitesimal stochastic operator in a way that is completely determined by A itself! This is done by subtracting a diagonal matrix L whose entries are the sums of the columns of A:

L_{j j} = \sum_i A_{i j}

i \ne j \Rightarrow L_{i j} = 0

It’s easy to check that

H = A - L

is still self-adjoint, but now also infinitesimal stochastic. So, it’s a Dirichlet operator: a good Hamiltonian for both stochastic and quantum mechanics!
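
Here’s the whole recipe as a few lines of Python, with an invented labelled graph:

import numpy as np

# An invented simple graph on 3 vertices with positive edge labels:
# edge {0,1} labelled 2, edge {1,2} labelled 0.5.
n = 3
A = np.zeros((n, n))
for i, j, label in [(0, 1, 2.0), (1, 2, 0.5)]:
    A[i, j] = A[j, i] = label    # undirected edges, so A is symmetric

L = np.diag(A.sum(axis=0))       # the column sums of A, on the diagonal
H = A - L

# H is a Dirichlet operator: self-adjoint and infinitesimal stochastic.
assert np.allclose(H, H.T)
assert np.allclose(H.sum(axis=0), 0)
assert np.all(H[~np.eye(n, dtype=bool)] >= 0)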

In Part 16, we saw a bit more: every Dirichlet operator arises this way. It’s easy to see. You just take your Dirichlet operator and make a graph with one edge for each nonzero off-diagonal entry. Then you label the edge with this entry.

So, Dirichlet operators are essentially the same as finite simple graphs with edges labelled by positive numbers.

Now, a simple graph can consist of many separate ‘pieces’, called components. Then there’s no way for a particle hopping along the edges to get from one component to another, either in stochastic or quantum mechanics. So we might as well focus our attention on graphs with just one component. These graphs are called ‘connected’. In other words:

Definition. A simple graph is connected if it is nonempty and there is a path of edges connecting any vertex to any other.

Our goal today is to understand more about Dirichlet operators coming from connected graphs. For this we need to learn the Perron–Frobenius theorem. But let’s start with something easier.

Perron’s theorem

In quantum mechanics it’s good to think about observables that have positive expected values:

\langle \psi, O \psi \rangle > 0

for every quantum state \psi \in \mathbb{C}^n. These are called positive definite. But in stochastic mechanics it’s good to think about matrices that are positive in a more naive sense:

Definition. An n \times n real matrix T is positive if all its entries are positive:

T_{i j} > 0

for all 1 \le i, j \le n.

Similarly:

Definition. A vector \psi \in \mathbb{R}^n is positive if all its components are positive:

\psi_i > 0

for all 1 \le i \le n.

We’ll also define nonnegative matrices and vectors in the same way, replacing > 0 by \ge 0. A good example of a nonnegative vector is a stochastic state.

In 1907, Perron proved the following fundamental result about positive matrices:

Perron’s Theorem. Given a positive square matrix T, there is a positive real number r, called the Perron–Frobenius eigenvalue of T, such that r is an eigenvalue of T and any other eigenvalue \lambda of T has |\lambda| < r. Moreover, there is a positive vector \psi \in \mathbb{R}^n with T \psi = r \psi. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of T must be a scalar multiple of \psi.

In other words, if T is positive, it has a unique eigenvalue with the largest absolute value. This eigenvalue is positive. Up to a constant factor, it has a unique eigenvector. We can choose this eigenvector to be positive. And then, up to a constant factor, it’s the only nonnegative eigenvector of T.
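
It’s fun to check these conclusions numerically. Here’s a sketch using a random positive matrix:

import numpy as np

rng = np.random.default_rng(0)
T = rng.uniform(0.1, 1.0, size=(4, 4))   # a random positive matrix

eigenvalues, eigenvectors = np.linalg.eig(T)
k = np.argmax(np.abs(eigenvalues))
r = eigenvalues[k].real                  # the Perron-Frobenius eigenvalue is real

# Every other eigenvalue is strictly smaller in absolute value:
assert np.all(np.delete(np.abs(eigenvalues), k) < r)

# The corresponding eigenvector can be chosen positive:
psi = eigenvectors[:, k].real
psi = psi / psi.sum()                    # fixes the overall sign and scale
assert np.all(psi > 0)
print(r, psi)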

From matrices to graphs

The conclusions of Perron’s theorem don’t hold for matrices that are merely nonnegative. For example, these matrices

\left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) , \qquad \left( \begin{array}{cc} 0 & 1 \\ 0 & 0 \end{array} \right)

are nonnegative, but they violate lots of the conclusions of Perron’s theorem.

Nonetheless, in 1912 Frobenius published an impressive generalization of Perron’s result. In its strongest form, it doesn’t apply to all nonnegative matrices; only to those that are ‘irreducible’. So, let us define those.

We’ve seen how to build a matrix from a graph. Now we need to build a graph from a matrix! Suppose we have an n \times n matrix T. Then we can build a graph G_T with n vertices where there is an edge from the ith vertex to the jth vertex if and only if T_{i j} \ne 0.

But watch out: this is a different kind of graph! It’s a directed graph, meaning the edges have directions, there’s at most one edge going from any vertex to any vertex, and we do allow an edge going from a vertex to itself. There’s a stronger concept of ‘connectivity’ for these graphs:

Definition. A directed graph is strongly connected if there is a directed path of edges going from any vertex to any other vertex.

So, you have to be able to walk along edges from any vertex to any other vertex, but always following the direction of the edges! Using this idea we define irreducible matrices:

Definition. A nonnegative square matrix T is irreducible if its graph G_T is strongly connected.

The Perron–Frobenius theorem

Now we are ready to state:

The Perron–Frobenius Theorem. Given an irreducible nonnegative square matrix T, there is a positive real number r, called the Perron–Frobenius eigenvalue of T, such that r is an eigenvalue of T and any other eigenvalue \lambda of T has |\lambda| \le r. Moreover, there is a positive vector \psi \in \mathbb{R}^n with T\psi = r \psi. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of T must be a scalar multiple of \psi.

The only conclusion of this theorem that’s weaker than those of Perron’s theorem is that there may be other eigenvalues with |\lambda| = r. For example, this matrix is irreducible and nonnegative:

\left( \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array} \right)

Its Perron–Frobenius eigenvalue is 1, but it also has -1 as an eigenvalue. In general, Perron–Frobenius theory says quite a lot about the other eigenvalues on the circle |\lambda| = r, but we won’t need that fancy stuff here.

Perron–Frobenius theory is useful in many ways, from highbrow math to ranking football teams. We’ll need it not just today but also later in this series. There are many books and other sources of information for those that want to take a closer look at this subject. If you’re interested, you can search online or take a look at these:

• Dimitrios Noutsos, Perron Frobenius theory and some extensions, 2008. (Includes proofs of the basic theorems.)

• V. S. Sunder, Perron Frobenius theory, 18 December 2009. (Includes applications to graph theory, Markov chains and von Neumann algebras.)

• Stephen Boyd, Lecture 17: Perron Frobenius theory, Winter 2008-2009. (Includes a max-min characterization of the Perron–Frobenius eigenvalue and applications to Markov chains, economics, population growth and power control.)

I have not taken a look myself, but if anyone is interested and can read German, the original work appears here:

• Oskar Perron, Zur Theorie der Matrizen, Math. Ann. 64 (1907), 248–263.

• Georg Frobenius, Über Matrizen aus nicht negativen Elementen, S.-B. Preuss Acad. Wiss. Berlin (1912), 456–477.

And, of course, there’s this:

• Wikipedia, Perron–Frobenius theorem.

It’s got a lot of information.

Irreducible Dirichlet operators

Now comes the payoff. We saw how to get a Dirichlet operator H from any finite simple graph with edges labelled by positive numbers. Now let’s apply Perron–Frobenius theory to prove our thesis.

Unfortunately, the matrix H is rarely nonnegative. If you remember how we built it, you’ll see its off-diagonal entries will always be nonnegative… but its diagonal entries can be negative.

Luckily, we can fix this just by adding a big enough multiple of the identity matrix to H! The result is a nonnegative matrix

T = H + c I

where c > 0 is some large number. This matrix T has the same eigenvectors as H. The off-diagonal matrix entries of T are the same as those of H, so T_{i j} is nonzero for i \ne j exactly when the graph we started with has an edge from i to j. So, for i \ne j, the graph G_T will have a directed edge going from i to j precisely when our original graph had an edge from i to j. And that means that if our original graph was connected, G_T will be strongly connected. Thus, by definition, the matrix T is irreducible!

Since T is nonnegative and irreducible, the Perron–Frobenius theorem swings into action and we conclude:

Lemma. Suppose H is the Dirichlet operator coming from a connected finite simple graph with edges labelled by positive numbers. Then the eigenvalues of H are real. Let \lambda be the largest eigenvalue. Then there is a positive vector \psi \in \mathbb{R}^n with H\psi = \lambda \psi. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of H must be a scalar multiple of \psi.

Proof. The eigenvalues of H are real since H is self-adjoint. Notice that if r is the Perron–Frobenius eigenvalue of T = H + c I and

T \psi = r \psi

then

H \psi = (r - c)\psi

By the Perron–Frobenius theorem the number r is positive, and it has the largest absolute value of any eigenvalue of T. Thanks to the subtraction, the eigenvalue r - c may not have the largest absolute value of any eigenvalue of H. It is, however, the largest eigenvalue of H, so we take this as our \lambda. The rest follows from the Perron–Frobenius theorem.   █

But in fact we can improve this result, since the largest eigenvalue \lambda is just zero. Let’s also make up a definition, to make our result sound more slick:

Definition. A Dirichlet operator is irreducible if it comes from a connected finite simple graph with edges labelled by positive numbers.

This meshes nicely with our earlier definition of irreducibility for nonnegative matrices. Now:

Theorem. Suppose H is an irreducible Dirichlet operator. Then H has zero as its largest real eigenvalue. There is a positive vector \psi \in \mathbb{R}^n with H\psi = 0. Any other vector with this property is a scalar multiple of \psi. Furthermore, any nonnegative vector that is an eigenvector of H must be a scalar multiple of \psi.

Proof. Choose \lambda as in the Lemma, so that H\psi = \lambda \psi. Since \psi is positive we can normalize it to be a stochastic state:

\displaystyle{ \sum_i \psi_i = 1 }

Since H is a Dirichlet operator, \exp(t H) sends stochastic states to stochastic states, so

\displaystyle{ \sum_i (\exp(t H) \psi)_i = 1 }

for all t \ge 0. On the other hand,

\displaystyle{ \sum_i (\exp(t H)\psi)_i = \sum_i e^{t \lambda} \psi_i = e^{t \lambda} }

so we must have \lambda = 0.   █

What’s the point of all this? One point is that there’s a unique stochastic state \psi that’s an equilibrium state: since H \psi = 0, it doesn’t change with time. It’s also globally stable: since all the other eigenvalues of H are negative, all other stochastic states converge to this one as time goes forward.

An example

There are many examples of irreducible Dirichlet operators. For instance, in Part 15 we talked about graph Laplacians. The Laplacian of a connected simple graph is always irreducible. But let us try a different sort of example, coming from the picture of the resistors we saw earlier:

Let’s create a matrix A whose entry A_{i j} is the number labelling the edge from i to j if there is such an edge, and zero otherwise:

A = \left(  \begin{array}{ccccc}   0 & 2 & 1 & 0 & 1 \\   2 & 0 & 0 & 1 & 1 \\   1 & 0 & 0 & 2 & 1 \\   0 & 1 & 2 & 0 & 1 \\   1 & 1 & 1 & 1 & 0  \end{array}  \right)

Remember how the game works. The matrix A is already a valid Hamiltonian for quantum mechanics, since it’s self-adjoint. However, to get a valid Hamiltonian for both stochastic and quantum mechanics—in other words, a Dirichlet operator—we subtract the diagonal matrix L whose entries are the sums of the columns of A. In this example it just so happens that the column sums are all 4, so L = 4 I, and our Dirichlet operator is

H = A - 4 I = \left(  \begin{array}{rrrrr}   -4 & 2 & 1 & 0 & 1 \\   2 & -4 & 0 & 1 & 1 \\   1 & 0 & -4 & 2 & 1 \\   0 & 1 & 2 & -4 & 1 \\   1 & 1 & 1 & 1 & -4  \end{array}  \right)

We’ve set up this example so it’s easy to see that the vector \psi = (1,1,1,1,1) has

H \psi = 0

So, this is the unique eigenvector for the eigenvalue 0. We can use Mathematica to calculate the remaining eigenvalues of H. The set of eigenvalues is

\{0, -7, -5, -5, -3 \}

As we expect from our theorem, the largest real eigenvalue is 0. By design, the eigenstate associated to this eigenvalue is

| v_0 \rangle = (1, 1, 1, 1, 1)

(This funny notation for vectors is common in quantum mechanics, so don’t worry about it.) All the other eigenvectors fail to be nonnegative, as predicted by the theorem. They are:

| v_1 \rangle = (1, -1, -1, 1, 0)
| v_2 \rangle = (-1, 0, -1, 0, 2)
| v_3 \rangle = (-1, 1, -1, 1, 0)
| v_4 \rangle = (-1, -1, 1, 1, 0)
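
If you don’t have Mathematica handy, numpy reproduces all of this. Here’s a sketch:

import numpy as np

A = np.array([[0, 2, 1, 0, 1],
              [2, 0, 0, 1, 1],
              [1, 0, 0, 2, 1],
              [0, 1, 2, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)
H = A - 4 * np.eye(5)

eigenvalues, eigenvectors = np.linalg.eigh(H)   # H is symmetric, so eigh works
print(np.round(eigenvalues, 6))                 # [-7, -5, -5, -3, 0]

# eigh sorts eigenvalues in ascending order, so the last column is the
# eigenvector for the eigenvalue 0; normalized, it's the Perron-Frobenius state:
v0 = eigenvectors[:, -1]
print(v0 / v0.sum())                            # [0.2, 0.2, 0.2, 0.2, 0.2]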

To compare the quantum and stochastic states, consider first |v_0\rangle. This is the only eigenvector that can be normalized to a stochastic state. Remember, a stochastic state must have nonnegative components. This rules out |v_1\rangle through to |v_4\rangle as valid stochastic states, no matter how we normalize them! However, these are allowed as states in quantum mechanics, once we normalize them correctly. For a stochastic system to be in a state other than the Perron–Frobenius state, it must be a linear combination of at least two eigenstates. For instance,

\psi_a = (1-a)|v_0\rangle + a |v_1\rangle

can be normalized to give a stochastic state only if 0 \leq a \leq \frac{1}{2}.

And, it’s easy to see that it works this way for any irreducible Dirichlet operator, thanks to our theorem. So, our thesis has been proved true!

Puzzles on irreducibility

Let us conclude with a couple more puzzles. There are lots of ways to characterize irreducible nonnegative matrices; we don’t need to mention graphs. Here’s one:

Puzzle 3. Let T be a nonnegative n \times n matrix. Show that T is irreducible if and only if for all 1 \le i, j \le n we have (T^m)_{i j} > 0 for some natural number m.

You may be confused because today we explained the usual concept of irreducibility for nonnegative matrices, but also defined a concept of irreducibility for Dirichlet operators. Luckily there’s no conflict: Dirichlet operators aren’t nonnegative matrices, but if we add a big multiple of the identity to a Dirichlet operator it becomes a nonnegative matrix, and then:

Puzzle 4. Show that a Dirichlet operator H is irreducible if and only if the nonnegative operator H + c I (where c is any sufficiently large constant) is irreducible.

Irreducibility is also related to the nonexistence of interesting conserved quantities. In Part 11 we saw a version of Noether’s Theorem for stochastic mechanics. Remember that an observable O in stochastic mechanics assigns a number O_i to each configuration i = 1, \dots, n. We can make a diagonal matrix with O_i as its diagonal entries, and by abuse of language we call this O as well. Then we say O is a conserved quantity for the Hamiltonian H if the commutator [O,H] = O H - H O vanishes.

Puzzle 5. Let H be a Dirichlet operator. Show that H is irreducible if and only if every conserved quantity O for H is a constant, meaning that for some c \in \mathbb{R} we have O_i = c for all i. (Hint: examine the proof of Noether’s theorem.)

In fact this works more generally:

Puzzle 6. Let H be an infinitesimal stochastic matrix. Show that H + c I is an irreducible nonnegative matrix for all sufficiently large c if and only if every conserved quantity O for H is a constant.

