Quantropy (Part 4)

11 November, 2013

There’s a new paper on the arXiv:

• John Baez and Blake Pollard, Quantropy.

Blake is a physics grad student at U. C. Riverside who plans to do his thesis with me.

If you have carefully read all my previous posts on quantropy (Part 1, Part 2 and Part 3), there’s only a little new stuff here. But still, it’s better organized, and less chatty.

And in fact, Blake came up with a lot of new stuff for this paper! He studied the quantropy of the harmonic oscillator, and tweaked the analogy between statistical mechanics and quantum mechanics in an interesting way. Unfortunately, we needed to put a version of this paper on the arXiv by a deadline, and our writeup of this new work wasn’t quite ready (my fault). So, we’ll put that other stuff in a new version—or, I’m thinking now, a separate paper.

But here are two new things.

First, putting this paper on the arXiv had the usual good effect of revealing some existing work on the same topic. Joakim Munkhammar emailed me and pointed out this paper, which is free online:

• Joakim Munkhammar, Canonical relational quantum mechanics from information theory, Electronic Journal of Theoretical Physics 8 (2011), 93–108.

You’ll see it cites Garrett Lisi’s paper and pushes forward in various directions. There seems to be a typo where he writes the path integral

$Z = \displaystyle{ \int e^{-\alpha S(q) } D q}$

and says

In order to fit the purpose Lisi concluded that the Lagrange multiplier value $\alpha \equiv 1/i \hbar.$ In similarity with Lisi’s approach we shall also assume that the arbitrary scaling-part of the constant $\alpha$ is in fact $1/\hbar.$

I’m pretty sure he means $1/i\hbar,$ given what he writes later. However, he speaks of ‘maximizing entropy’, which is not quite right for a complex-valued quantity; Blake and I prefer to give this new quantity a new name, and speak of ‘finding a stationary point of quantropy’.

But in a way these are small issues; being a mathematician, I’m quicker to spot tiny technical defects than to absorb significant new ideas, and it will take a while to really understand Munkhammar’s paper.

Second, while writing our paper, Blake and I noticed another similarity between the partition function of a classical ideal gas and the partition function of a quantum free particle. Both are given by an integral like this:

$Z = \displaystyle{\int e^{-\alpha S(q) } D q }$

where $S$ is a quadratic function of $q \in \mathbb{R}^n.$ Here $n$ is the number of degrees of freedom for the particles in the ideal gas, or the number of time steps for a free particle on a line (where we are discretizing time). The only big difference is that

$\alpha = 1/kT$

for the ideal gas, but

$\alpha = 1/i \hbar$

for the free particle.
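To make this concrete, here is a quick numerical sketch of the one-dimensional case (my own illustration, not from the paper). For a quadratic action and real $\alpha$ this is an ordinary Gaussian integral; the quantum case, with imaginary $\alpha,$ becomes an oscillatory Fresnel integral that needs more care, so only the real case is checked here.

```python
import numpy as np

def Z_numeric(alpha, S=lambda q: q**2 / 2, L=40.0, N=400001):
    # brute-force the integral of exp(-alpha * S(q)) dq over [-L, L]
    q = np.linspace(-L, L, N)
    dq = q[1] - q[0]
    return np.sum(np.exp(-alpha * S(q))) * dq

alpha = 1.7
# for quadratic S(q) = q^2/2 the exact answer is sqrt(2*pi/alpha)
assert abs(Z_numeric(alpha) - np.sqrt(2 * np.pi / alpha)) < 1e-6
```

Note that `Z_numeric` returns a number with the units of $q$: that is the dimensional ambiguity discussed next.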

In both cases there’s an ambiguity in the answer! The reason is that to do this integral, we need to pick a measure $D q.$ The obvious guess is Lebesgue measure

$dq = dq_1 \cdots dq_n$

on $\mathbb{R}^n.$ But this can’t be right, on physical grounds!

The reason is that the partition function $Z$ needs to be dimensionless, but $d q$ has units. To correct this, we need to divide $dq$ by some dimensionful quantity to get $D q.$

In the case of the ideal gas, this dimensionful quantity involves the ‘thermal de Broglie wavelength’ of the particles in the gas. And this brings Planck’s constant into the game, even though we’re not doing quantum mechanics: we’re studying the statistical mechanics of a classical ideal gas!

That’s weird and interesting. It’s not the only place where we see that classical statistical mechanics is incomplete or inconsistent, and we need to introduce some ideas from quantum physics to get sensible answers. The most famous one is the ultraviolet catastrophe. What are all the rest?

In the case of the free particle, we need to divide by a quantity with dimensions of length$^n$ to make

$dq = dq_1 \cdots dq_n$

dimensionless, since each $dq_i$ has dimensions of length. The easiest way is to introduce a length scale $\Delta x$ and divide each $dq_i$ by that. This is commonly done when people study the free particle. This length scale drops out of the final answer for the questions people usually care about… but not for the quantropy.

Similarly, Planck’s constant drops out of the final answer for some questions about the classical ideal gas, but not for its entropy!

So there’s an interesting question here, about what this new length scale $\Delta x$ means, if anything. One might argue that quantropy is a bad idea, and the need for this new length scale to make it unambiguous is just proof of that. However, the mathematical analogy to quantum mechanics is so precise that I think it’s worth going a bit further out on this limb, and thinking a bit more about what’s going on.

Some weird sort of déjà vu phenomenon seems to be going on. Once upon a time, people tried to calculate the partition functions of classical systems. They discovered they were infinite or ambiguous until they introduced Planck’s constant, and eventually quantum mechanics. Then Feynman introduced the path integral approach to quantum mechanics. In this approach one is again computing partition functions, but now with a new meaning, and with complex rather than real exponentials. But these partition functions are again infinite or ambiguous… for very similar mathematical reasons! And at least in some cases, we can remove the ambiguity using the same trick as before: introducing a new constant. But then… what?

Are we stuck in an infinite loop here? What, if anything, is the meaning of this ‘second Planck’s constant’? Does this have anything to do with second quantization? (I don’t see how, but I can’t resist asking.)

Entropy and Information in Biological Systems

2 November, 2013

John Harte is an ecologist who uses maximum entropy methods to predict the distribution, abundance and energy usage of species. Marc Harper uses information theory in bioinformatics and evolutionary game theory. Harper, Harte and I are organizing a workshop on entropy and information in biological systems, and I’m really excited about it!

It’ll take place at the National Institute for Mathematical and Biological Synthesis in Knoxville, Tennessee. It’s tentatively scheduled for October 22–24, 2014. When the date gets confirmed, I’ll post an advertisement so you can apply to attend.

Writing the proposal was fun, because we got to pull together lots of interesting people who are applying information theory and entropy to biology in quite different ways. So, here it is!

Proposal

Ever since Shannon initiated research on information theory in 1948, there have been hopes that the concept of information could serve as a tool to help systematize and unify work in biology. The link between information and entropy was noted very early on, and it suggested that a full thermodynamic understanding of biology would naturally involve the information processing and storage that are characteristic of living organisms. However, the subject is full of conceptual pitfalls for the unwary, and progress has been slower than initially expected. Premature attempts at ‘grand syntheses’ have often misfired. But applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful, such as:

• the maximum entropy principle in ecology,
• Shannon and Rényi entropies as measures of biodiversity,
• information theory in evolutionary game theory,
• information and the thermodynamics of individual cells.

Because they work in diverse fields, researchers in these specific topics have had little opportunity to trade insights and take stock of the progress so far. The aim of the workshop is to do just this.

In what follows, participants’ names are in boldface, while the main goals of the workshop are in italics.

Roderick Dewar is a key advocate of the principle of Maximum Entropy Production, which says that biological systems—and indeed all open, non-equilibrium systems—act to produce entropy at the maximum rate. Along with others, he has applied this principle to make testable predictions in a wide range of biological systems, from ATP synthesis [DJZ2006] to respiration and photosynthesis of individual plants [D2010] and plant communities. He has also sought to derive this principle from ideas in statistical mechanics [D2004, D2009], but it remains controversial.

The first goal of this workshop is to study the validity of this principle.

While they may be related, the principle of Maximum Entropy Production should not be confused with the MaxEnt inference procedure, which says that we should choose the probabilistic hypothesis with the highest entropy subject to the constraints provided by our data. MaxEnt was first explicitly advocated by Jaynes. He noted that it is already implicit in the procedures of statistical mechanics, but convincingly argued that it can also be applied to situations where entropy is more ‘informational’ than ‘thermodynamic’ in character.

Recently John Harte has applied MaxEnt in this way to ecology, using it to make specific testable predictions for the distribution, abundance and energy usage of species across spatial scales and across habitats and taxonomic groups [Harte2008, Harte2009, Harte2011]. Annette Ostling is an expert on other theories that attempt to explain the same data, such as the ‘neutral model’ [AOE2008, ODLSG2009, O2005, O2012]. Dewar has also used MaxEnt in ecology [D2008], and he has argued that it underlies the principle of Maximum Entropy Production.

Thus, a second goal of this workshop is to familiarize all the participants with applications of the MaxEnt method to ecology, compare it with competing approaches, and study whether MaxEnt provides a sufficient justification for the principle of Maximum Entropy Production.

Entropy is not merely a predictive tool in ecology: it is also widely used as a measure of biodiversity. Here Shannon’s original concept of entropy naturally generalizes to ‘Rényi entropy’, which depends on a parameter $\alpha \ge 0$. This equals

$\displaystyle{ H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha }$

where $p_i$ is the fraction of organisms of the $i$th type (which could mean species, some other taxon, etc.). In the limit $\alpha \to 1$ this reduces to the Shannon entropy:

$\displaystyle{ H(p) = - \sum_i p_i \log p_i }$

As $\alpha$ increases, we give less weight to rare types of organisms. Christina Cobbold and Tom Leinster have described a systematic and highly flexible system of biodiversity measurement, with Rényi entropy at its heart [CL2012]. They consider both the case where all we have are the numbers $p_i$, and the more subtle case where we take the distance between different types of organisms into account.
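For readers who want to play with these formulas, here is a minimal sketch of the two entropies; the probabilities are made up for illustration.

```python
import numpy as np

def renyi(p, alpha):
    """Renyi entropy of order alpha; reduces to Shannon entropy as alpha -> 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # ignore zero-probability types
    if np.isclose(alpha, 1.0):        # the Shannon limit
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p**alpha)) / (1 - alpha)

p = [0.5, 0.25, 0.25]
shannon = renyi(p, 1.0)
# continuity at alpha = 1: Renyi entropy approaches Shannon entropy
assert abs(renyi(p, 1.000001) - shannon) < 1e-4
# larger alpha weights rare types less, giving a smaller entropy
assert renyi(p, 2.0) < shannon < renyi(p, 0.5)
```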

John Baez has explained the role of Rényi entropy in thermodynamics [B2011], and together with Tom Leinster and Tobias Fritz he has proved other theorems characterizing entropy which explain its importance for information processing [BFL2011]. However, these ideas have not yet been connected to the widespread use of entropy in biodiversity studies. More importantly, the use of entropy as a measure of biodiversity has not been clearly connected to MaxEnt methods in ecology. Does the success of MaxEnt methods imply a tendency for ecosystems to maximize biodiversity subject to the constraints of resource availability? This seems surprising, but a more nuanced statement along these general lines might be correct.

So, a third goal of this workshop is to clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

As the amount of data to analyze in genomics continues to surpass the ability of humans to analyze it, we can expect automated experiment design to become ever more important. In Chris Lee and Marc Harper’s RoboMendel program [LH2013], a mathematically precise concept of ‘potential information’—how much information is left to learn—plays a crucial role in deciding what experiment to do next, given the data obtained so far. It will be useful for them to interact with William Bialek, who has expertise in estimating entropy from empirical data and using it to constrain properties of models [BBS, BNS2001, BNS2002], and Susanne Still, who applies information theory to automated theory building and biology [CES2010, PS2012].

However, there is another link between biology and potential information. Harper has noted that in an ecosystem where the population of each type of organism grows at a rate proportional to its fitness (which may depend on the fraction of organisms of each type), the quantity

$\displaystyle{ I(q||p) = \sum_i q_i \ln(q_i/p_i) }$

always decreases if there is an evolutionarily stable state [Harper2009]. Here $p_i$ is the fraction of organisms of the $i$th genotype at a given time, while $q_i$ is this fraction in the evolutionarily stable state. This quantity is often called the Shannon information of $q$ ‘relative to’ $p$. But in fact, it is precisely the same as Lee and Harper’s potential information! Indeed, there is a precise mathematical analogy between evolutionary games and processes where a probabilistic hypothesis is refined by repeated experiments.
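Here is a minimal simulation of this fact. The payoff matrix is a hawk-dove style game chosen by me for illustration, not an example from Harper’s paper; its evolutionarily stable state is the mixed strategy $(1/2, 1/2).$

```python
import numpy as np

A = np.array([[-1.0, 2.0],
              [ 0.0, 1.0]])          # payoff matrix; mixed ESS at (1/2, 1/2)
q = np.array([0.5, 0.5])             # the evolutionarily stable state
p = np.array([0.9, 0.1])             # initial population fractions

def relative_info(q, p):
    return np.sum(q * np.log(q / p))

dt, history = 0.01, []
for _ in range(5000):
    history.append(relative_info(q, p))
    fitness = A @ p                              # fitness of each type
    p = p + dt * p * (fitness - p @ fitness)     # replicator step (Euler)

# I(q||p) decreases monotonically toward zero along the trajectory
assert all(b <= a + 1e-12 for a, b in zip(history, history[1:]))
assert history[-1] < 1e-3
```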

Thus, a fourth goal of this workshop is to develop the concept of evolutionary games as ‘learning’ processes in which information is gained over time.

We shall try to synthesize this with Carl Bergstrom and Matina Donaldson-Matasci’s work on the ‘fitness value of information’: a measure of how much increase in fitness a population can obtain per bit of extra information [BL2004, DBL2010, DM2013]. Following Harper, we shall consider not only relative Shannon entropy, but also relative Rényi entropy, as a measure of information gain [Harper2011].

A fifth and final goal of this workshop is to study the interplay between information theory and the thermodynamics of individual cells and organelles.

Susanne Still has studied the thermodynamics of prediction in biological systems [BCSS2012]. And in a celebrated related piece of work, Jeremy England used thermodynamic arguments to derive a lower bound for the amount of entropy generated during a process of self-replication of a bacterial cell [England2013]. Interestingly, he showed that E. coli comes within a factor of 3 of this lower bound.

In short, information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The time is ripe for an investigative workshop that brings together experts from different fields and lets them share insights and methods and begin to tackle some of the big remaining questions.

Bibliography

[AOE2008] D. Alonso, A. Ostling and R. Etienne, The assumption of symmetry and species abundance distributions, Ecology Letters 11 (2008), 93–105.

[TMMABB2012] D. Amodei, W. Bialek, M. J. Berry II, O. Marre, T. Mora, and G. Tkacik, The simplest maximum entropy model for collective behavior in a neural network, arXiv:1207.6319 (2012).

[B2011] J. Baez, Rényi entropy and free energy, arXiv:1102.2098 (2011).

[BFL2011] J. Baez, T. Fritz and T. Leinster, A characterization of entropy in terms of information loss, Entropy 13 (2011), 1945–1957.

[BS2012] J. Baez and M. Stay, Algorithmic thermodynamics, Math. Struct. Comp. Sci. 22 (2012), 771–787.

[BCSS2012] A. J. Bell, G. E. Crooks, S. Still and D. A Sivak, The thermodynamics of prediction, Phys. Rev. Lett. 109 (2012), 120604.

[BL2004] C. T. Bergstrom and M. Lachmann, Shannon information and biological fitness, in IEEE Information Theory Workshop 2004, IEEE, 2004, pp. 50–54.

[BBS] M. J. Berry II, W. Bialek and E. Schneidman, An information theoretic approach to the functional classification of neurons, in Advances in Neural Information Processing Systems 15, MIT Press, 2005.

[BNS2001] W. Bialek, I. Nemenman and N. Tishby, Predictability, complexity and learning, Neural Computation 13 (2001), 2409–2463.

[BNS2002] W. Bialek, I. Nemenman and F. Shafee, Entropy and inference, revisited, in Advances in Neural Information Processing Systems 14, MIT Press, 2002.

[CL2012] C. Cobbold and T. Leinster, Measuring diversity: the importance of species similarity, Ecology 93 (2012), 477–489.

[CES2010] J. P. Crutchfield, S. Still and C. Ellison, Optimal causal inference: estimating stored information and approximating causal architecture, Chaos 20 (2010), 037111.

[D2004] R. C. Dewar, Maximum entropy production and non-equilibrium statistical mechanics, in Non-Equilibrium Thermodynamics and Entropy Production: Life, Earth and Beyond, eds. A. Kleidon and R. Lorenz, Springer, New York, 2004, 41–55.

[DJZ2006] R. C. Dewar, D. Juretić and P. Županović, The functional design of the rotary enzyme ATP synthase is consistent with maximum entropy production, Chem. Phys. Lett. 430 (2006), 177–182.

[D2008] R. C. Dewar, A. Porté, Statistical mechanics unifies different ecological patterns, J. Theor. Bio. 251 (2008), 389–403.

[D2009] R. C. Dewar, Maximum entropy production as an inference algorithm that translates physical assumptions into macroscopic predictions: don’t shoot the messenger, Entropy 11 (2009), 931–944.

[D2010] R. C. Dewar, Maximum entropy production and plant optimization theories, Phil. Trans. Roy. Soc. B 365 (2010) 1429–1435.

[DBL2010] M. C. Donaldson-Matasci, C. T. Bergstrom, and M. Lachmann, The fitness value of information, Oikos 119 (2010), 219–230.

[DM2013] M. C. Donaldson-Matasci, G. DeGrandi-Hoffman, and A. Dornhaus, Bigger is better: honey bee colonies as distributed information-gathering systems, Animal Behaviour 85 (2013), 585–592.

[England2013] J. L. England, Statistical physics of self-replication, J. Chem. Phys. 139 (2013), 121923.

[ODLSG2009] J. L. Green, J. K. Lake, J. P. O’Dwyer, A. Ostling and V. M. Savage, An integrative framework for stochastic, size-structured community assembly, PNAS 106 (2009), 6170–6175.

[Harper2009] M. Harper, Information geometry and evolutionary game theory, arXiv:0911.1383 (2009).

[Harper2011] M. Harper, Escort evolutionary game theory, Physica D 240 (2011), 1411–1415.

[Harte2008] J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

[Harte2009] J. Harte, A. Smith and D. Storch, Biodiversity scales from plots to biomes with a universal species-area curve, Ecology Letters 12 (2009), 789–797.

[Harte2011] J. Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics, Oxford U. Press, Oxford, 2011.

[LH2013] M. Harper and C. Lee, Basic experiment planning via information metrics: the RoboMendel problem, arXiv:1210.4808 (2012).

[O2005] A. Ostling, Neutral theory tested by birds, Nature 436 (2005), 635.

[O2012] A. Ostling, Do fitness-equalizing tradeoffs lead to neutral communities?, Theoretical Ecology 5 (2012), 181–194.

[PS2012] D. Precup and S. Still, An information-theoretic approach to curiosity-driven reinforcement learning, Theory in Biosciences 131 (2012), 139–148.

Relative Entropy (Part 2)

2 July, 2013

In the first part of this mini-series, I described how various ideas important in probability theory arise naturally when you start doing linear algebra using only the nonnegative real numbers.

But after writing it, I got an email from a rather famous physicist saying he got “lost at line two”. So, you’ll be happy to hear that the first part is not a prerequisite for the remaining parts! I wrote it just to intimidate that guy.

Tobias Fritz and I have proved a theorem characterizing the concept of relative entropy, which is also known as ‘relative information’, ‘information gain’ or—most terrifying and least helpful of all—’Kullback-Leibler divergence’. In this second part I’ll introduce two key players in this theorem. The first, $\mathrm{FinStat},$ is a category where:

• an object consists of a system with finitely many states, and a probability distribution on those states

and

• a morphism consists of a deterministic ‘measurement process’ mapping states of one system to states of another, together with a ‘hypothesis’ that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

The second, $\mathrm{FP},$ is a subcategory of $\mathrm{FinStat}.$ It has all the same objects, but only morphisms where the hypothesis is ‘optimal’. This means that if the observer measures the system many times, and uses the probability distribution of their observations together with their hypothesis to guess the probability distribution of states of the system, they get the correct answer (in the limit of many measurements).

In this part all I will really do is explain precisely what $\mathrm{FinStat}$ and $\mathrm{FP}$ are. But to whet your appetite, let me explain how we can use them to give a new characterization of relative entropy!

Suppose we have any morphism in $\mathrm{FinStat}.$ In other words: suppose we have a deterministic measurement process, together with a hypothesis that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

Then we have two probability distributions on the states of the system being measured! First, the ‘true’ probability distribution. Second, the probability that the observer will guess based on their observations.

Whenever we have two probability distributions on the same set, we can compute the entropy of the first relative to the second. This describes how surprised you’ll be if you discover the probability distribution is really the first, when you thought it was the second.

So: any morphism in $\mathrm{FinStat}$ will have a relative entropy. It will describe how surprised the observer will be when they discover the true probability distribution, given what they had guessed.

But this amount of surprise will be zero if their hypothesis was ‘optimal’ in the sense I described. So, the relative entropy will vanish on morphisms in $\mathrm{FP}.$

Our theorem says this fact almost characterizes the concept of relative entropy! More precisely, it says that any convex-linear continuous functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on the subcategory $\mathrm{FP}$ must equal some constant times the relative entropy.

Don’t be scared! This should not make sense to you yet, since I haven’t said how I’m thinking of $[0,\infty]$ as a category, nor what a ‘convex-linear continuous functor’ is, nor how relative entropy gives one. I will explain all that later. I just want you to get a vague idea of where I’m going.

Now let me explain the categories $\mathrm{FinStat}$ and $\mathrm{FP}.$ We need to warm up a bit first.

FinStoch

A stochastic map $f : X \leadsto Y$ is different from an ordinary function, because instead of assigning a unique element of $Y$ to each element of $X,$ it assigns a probability distribution on $Y$ to each element of $X.$ So you should imagine it as being like a function ‘with random noise added’, so that $f(x)$ is not a specific element of $Y,$ but instead has a probability of taking on different values. This is why I’m using a weird wiggly arrow to denote a stochastic map.

More formally:

Definition. Given finite sets $X$ and $Y,$ a stochastic map $f : X \leadsto Y$ assigns a real number $f_{yx}$ to each pair $x \in X, y \in Y,$ such that fixing any element $x,$ the numbers $f_{yx}$ form a probability distribution on $Y.$ We call $f_{yx}$ the probability of $y$ given $x.$

In more detail:

$f_{yx} \ge 0$ for all $x \in X,$ $y \in Y.$

and

$\displaystyle{ \sum_{y \in Y} f_{yx} = 1}$ for all $x \in X.$

Note that we can think of $f : X \leadsto Y$ as a $Y \times X$-shaped matrix of numbers. A matrix obeying the two properties above is called stochastic. This viewpoint is nice because it reduces the problem of composing stochastic maps to matrix multiplication. It’s easy to check that multiplying two stochastic matrices gives a stochastic matrix. So, composing stochastic maps gives a stochastic map.
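Here is a minimal sketch of that check; the matrices are arbitrary illustrative choices.

```python
import numpy as np

# columns index X, rows index Y: f[y, x] is the probability of y given x
f = np.array([[0.7, 0.2],
              [0.3, 0.8]])          # stochastic map f : X ~> Y
g = np.array([[0.5, 0.1],
              [0.5, 0.9]])          # stochastic map g : Y ~> Z

def is_stochastic(m):
    # nonnegative entries, and each column sums to 1
    return bool(np.all(m >= 0) and np.allclose(m.sum(axis=0), 1.0))

assert is_stochastic(f) and is_stochastic(g)
# composing stochastic maps is just matrix multiplication,
# and the product of stochastic matrices is stochastic
assert is_stochastic(g @ f)
```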

We thus get a category:

Definition. Let $\mathrm{FinStoch}$ be the category of finite sets and stochastic maps between them.

In case you’re wondering why I’m restricting attention to finite sets, it’s merely because I want to keep things simple. I don’t want to worry about whether sums or integrals converge.

FinProb

Now take your favorite 1-element set and call it $1.$ A function $p : 1 \to X$ is just a point of $X.$ But a stochastic map $p : 1 \leadsto X$ is something more interesting: it’s a probability distribution on $X.$

Why? Because it gives a probability distribution on $X$ for each element of $1,$ but that set has just one element.

Last time I introduced the rather long-winded phrase finite probability measure space to mean a finite set with a probability distribution on it. But now we’ve seen a very quick way to describe such a thing within $\mathrm{FinStoch}:$

And this gives a quick way to think about a measure-preserving function between finite probability measure spaces! It’s just a commutative triangle like this:

Note that the horizontal arrow $f: X \to Y$ is not wiggly. The straight arrow means it’s an honest function, not a stochastic map. But a function is a special case of a stochastic map! So it makes sense to compose a straight arrow with a wiggly arrow—and the result is, in general, a wiggly arrow. So, it makes sense to demand that this triangle commutes, and this says that the function $f: X \to Y$ is measure-preserving.

Let me work through the details, in case they’re not clear.

First: how is a function a special case of a stochastic map? Here’s how. If we start with a function $f: X \to Y,$ we get a matrix of numbers

$f_{yx} = \delta_{y,f(x)}$

where $\delta$ is the Kronecker delta. So, each element $x \in X$ gives a probability distribution that’s zero except at $f(x).$

Given this, we can work out what this commuting triangle really says:

If we use $p_x$ to stand for the probability distribution that $p: 1 \leadsto X$ puts on $X,$ and similarly for $q_y,$ the commuting triangle says

$\displaystyle{ q_y = \sum_{x \in X} \delta_{y,f(x)} p_x}$

or in other words:

$\displaystyle{ q_y = \sum_{x \in X : f(x) = y} p_x }$

or if you like:

$\displaystyle{ q_y = \sum_{x \in f^{-1}(y)} p_x }$

In this situation people say $q$ is $p$ pushed forward along $f$, and they say $f$ is a measure-preserving function.
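A tiny sketch of pushing forward, with a made-up function and distribution:

```python
import numpy as np

nX, nY = 4, 2
f = [0, 0, 1, 1]                     # a function f : X -> Y, listed as f(x)
F = np.zeros((nY, nX))
for x, y in enumerate(f):
    F[y, x] = 1.0                    # F[y, x] = delta_{y, f(x)}

p = np.array([0.1, 0.2, 0.3, 0.4])   # a probability distribution on X
q = F @ p                            # pushforward of p along f
# q_y is the sum of p_x over the inverse image f^{-1}(y)
assert np.allclose(q, [0.3, 0.7])
```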

So, we’ve used $\mathrm{FinStoch}$ to describe another important category:

Definition. Let $\mathrm{FinProb}$ be the category of finite probability measure spaces and measure-preserving functions between them.

I can’t resist mentioning another variation:

A commuting triangle like this is a measure-preserving stochastic map. In other words, $p$ gives a probability measure on $X,$ $q$ gives a probability measure on $Y,$ and $f: X \leadsto Y$ is a stochastic map with

$\displaystyle{ q_y = \sum_{x \in X} f_{yx} p_x }$

FinStat

The category we really need for relative entropy is a bit more subtle. An object is a finite probability measure space:

but a morphism looks like this:

The whole diagram doesn’t commute, but the two equations I wrote down hold. The first equation says that $f: X \to Y$ is a measure-preserving function. In other words, this triangle, which we’ve seen before, commutes:

The second equation says that $f \circ s$ is the identity, or in math jargon, $s$ is a section for $f.$

But what does that really mean?

The idea is that $X$ is the set of ‘states’ of some system, while $Y$ is a set of possible ‘observations’ you might make. The function $f$ is a ‘measurement process’. You ‘measure’ the system using $f,$ and if the system is in the state $x$ you get the observation $f(x).$ The probability distribution $p$ says the probability that the system is in any given state, while $q$ says the probability that you get any given observation when you do your measurement.

Note: we are assuming for now that there’s no random noise in the observation process! That’s why $f$ is a function instead of a stochastic map.

But what about $s?$ That’s the fun part: $s$ describes your ‘hypothesis’ about the system’s state given a particular measurement! If you measure the system and get a result $y \in Y,$ you guess it’s in the state $x$ with probability $s_{xy}.$

And we don’t want this hypothesis to be really dumb: that’s what

$f \circ s = 1_Y$

says. You see, this equation says that

$\sum_{x \in X} \delta_{y', f(x)} s_{xy} = \delta_{y' y}$

or in other words:

$\sum_{x \in f^{-1}(y')} s_{xy} = \delta_{y' y}$

If you think about it, this implies $s_{xy} = 0$ unless $f(x) = y$: for $y' \ne y$ the left side is a sum of nonnegative numbers that must equal zero, so each term vanishes.

So, if you make an observation $y,$ you will guess the system is in state $x$ with probability zero unless $f(x) = y.$ In short, you won’t make a really dumb guess about the system’s state.
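A small sketch checking the section condition for a made-up $f$ and hypothesis $s$; as matrices, the composite $f \circ s$ is the product of the matrix of $f$ with the matrix of $s.$

```python
import numpy as np

f = [0, 0, 1]                        # f : X -> Y with X = {0,1,2}, Y = {0,1}
F = np.zeros((2, 3))
for x, y in enumerate(f):
    F[y, x] = 1.0                    # F[y, x] = delta_{y, f(x)}

# a hypothesis s : Y ~> X: given observation y, guess state x with prob s[x, y]
s = np.array([[0.6, 0.0],
              [0.4, 0.0],
              [0.0, 1.0]])
# f o s = 1_Y: each guess is supported on the inverse image f^{-1}(y)
assert np.allclose(F @ s, np.eye(2))
```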

Here’s how we compose morphisms:

We get a measure-preserving function $g \circ f : X \to Z$ and a stochastic map going back, $s \circ t : Z \leadsto X.$ You can check that these obey the required equations:

$g \circ f \circ p = r$

$g \circ f \circ s \circ t = 1_Z$

So, we get a category:

Definition. Let $\mathrm{FinStat}$ be the category where an object is a finite probability measure space:

a morphism is a diagram obeying these equations:

and composition is defined as above.

FP

As we’ve just seen, a morphism in $\mathrm{FinStat}$ consists of a ‘measurement process’ $f$ and a ‘hypothesis’ $s:$

But sometimes we’re lucky and our hypothesis is optimal, in the sense that

$s \circ q = p$

Conceptually, this says that if you take the probability distribution $q$ on our observations and use it to guess a probability distribution for the system’s state using our hypothesis $s,$ you get the correct answer: $p.$

Mathematically, it says that this diagram commutes:

In other words, $s$ is a measure-preserving stochastic map.

There’s a subcategory of $\mathrm{FinStat}$ with all the same objects, but only these ‘optimal’ morphisms. It’s important, but the name we have for it is not very exciting:

Definition. Let $\mathrm{FP}$ be the subcategory of $\mathrm{FinStat}$ where an object is a finite probability measure space

and a morphism is a diagram obeying these equations:

Why do we call this category $\mathrm{FP}$? Because it’s a close relative of $\mathrm{FinProb},$ where a morphism, you’ll remember, looks like this:

The point is that for a morphism in $\mathrm{FP},$ the conditions on $s$ are so strong that they completely determine it unless there are observations that happen with probability zero—that is, unless there are $y \in Y$ with $q_y = 0.$ To see this, note that

$s \circ q = p$

actually says

$\sum_{y \in Y} s_{xy} q_y = p_x$

for any choice of $x \in X.$ But we’ve already seen $s_{xy} = 0$ unless $f(x) = y,$ so the sum has just one term, and the equation says

$s_{x,f(x)} q_{f(x)} = p_x$

We can solve this for $s_{x,f(x)},$ so $s$ is completely determined… unless $q_{f(x)} = 0.$

This covers the case when $y = f(x).$ We also can’t figure out $s_{x,y}$ if $y$ isn’t in the image of $f.$

So, to be utterly precise, $s$ is determined by $p, q$ and $f$ unless there’s an element $y \in Y$ that has $q_y = 0.$ Except for this special case, a morphism in $\mathrm{FP}$ is just a morphism in $\mathrm{FinProb}.$ But in this special case, a morphism in $\mathrm{FP}$ has a little extra information: an arbitrary probability distribution on the inverse image of each point $y$ with this property.

In short, $\mathrm{FP}$ is the same as $\mathrm{FinProb}$ except that our observer’s ‘optimal hypothesis’ must provide a guess about the state of the system given an observation, even in cases of observations that occur with probability zero.
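Here is a sketch of this construction for a made-up example where no observation has probability zero, so the optimal hypothesis is completely determined by $p,$ $q$ and $f.$

```python
import numpy as np

f = [0, 0, 1]                        # f : X -> Y with X = {0,1,2}, Y = {0,1}
p = np.array([0.25, 0.25, 0.5])      # distribution on states X
q = np.array([0.5, 0.5])             # its pushforward along f

# the optimal hypothesis: s_{x, f(x)} = p_x / q_{f(x)}, zero elsewhere
s = np.zeros((3, 2))
for x, y in enumerate(f):
    s[x, y] = p[x] / q[y]

assert np.allclose(s.sum(axis=0), 1.0)   # each column is a distribution
assert np.allclose(s @ q, p)             # s o q = p: the guess is correct
```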

I’m going into these nitpicky details for two reasons. First, we’ll need $\mathrm{FP}$ for our characterization of relative entropy. But second, Tom Leinster already ran into this category in his work on entropy and category theory! He discussed it here:

• Tom Leinster, An operadic introduction to entropy.

Despite the common theme of entropy, he arrived at it from a very different starting-point.

Conclusion

So, I hope that next time I can show you something like this:

and you’ll say “Oh, that’s a probability distribution on the states of some system!” Intuitively, you should think of the wiggly arrow $p$ as picking out a ‘random element’ of the set $X.$

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, sending a probability distribution on the states of the measured system to a probability distribution on observations!”

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with a hypothesis about the system’s state, given what is observed!”

And I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with an optimal hypothesis about the system’s state, given what is observed!”

I don’t count on it… but I can hope.

Postscript

And speaking of unrealistic hopes, if I were really optimistic I would hope you noticed that $\mathrm{FinStoch}$ and $\mathrm{FinProb},$ which underlie the more fancy categories I’ve discussed today, were themselves constructed starting from linear algebra over the nonnegative numbers $[0,\infty)$ in Part 1. That ‘foundational’ work is not really needed for what we’re doing now. However, I like the fact that we’re ultimately getting the concept of relative entropy starting from very little: just linear algebra, using only nonnegative numbers!

Relative Entropy (Part 1)

20 June, 2013

I’m trying to finish off a paper that Tobias Fritz and I have been working on, which gives a category-theoretic (and Bayesian!) characterization of relative entropy. It’s a kind of sequel to our paper with Tom Leinster, in which we characterized entropy.

That earlier paper was developed in conversations on the n-Category Café. It was a lot of fun; I sort of miss that style of working. Also, to get warmed up, I need to think through some things I’ve thought about before. So, I might as well write them down here.

The idea

There are many categories related to probability theory, and they’re related in many ways. Last summer—on the 24th of August 2012, according to my notes here—Jamie Vicary, Brendan Fong and I worked through a bunch of these relationships. I need to write them down now, even if they’re not all vitally important to my paper with Tobias. They’re sort of buzzing around my brain like flies.

(Tobias knows this stuff too, and this is how we think about probability theory, but we weren’t planning to stick it in our paper. Maybe we should.)

Let’s restrict attention to probability measures on finite sets, and related structures. We could study these questions more generally, and we should, but not today. What we’ll do is give a unified purely algebraic description of:

• finite sets

• measures on finite sets

• probability measures on finite sets

and various kinds of maps between these:

• functions

• bijections

• measure-preserving functions

• stochastic maps

Finitely generated free [0,∞)-modules

People often do linear algebra over a field, which is—roughly speaking—a number system where you can add, subtract, multiply and divide. But algebraists have long realized that a lot of linear algebra still works with a commutative ring, where you can’t necessarily divide. It gets more complicated, but also a lot more interesting.

But in fact, a lot still works with a commutative rig, where we can’t necessarily subtract either! Something I keep telling everyone is that linear algebra over rigs is a good idea for studying things like probability theory, thermodynamics, and the principle of least action.

Today we’ll start with the rig of nonnegative real numbers with their usual addition and multiplication; let’s call this $[0,\infty) .$ The idea is that measure theory, and probability theory, are closely related to linear algebra over this rig.

Let $C$ be the category with finitely generated free $[0,\infty)$-modules as objects, and module homomorphisms as morphisms. I’ll call these morphisms maps.

Puzzle. Do we need to say ‘free’ here? Are there finitely generated modules over $[0,\infty)$ that aren’t free?

Every finitely generated free $[0,\infty)$-module is isomorphic to $[0,\infty)^S$ for some finite set $S .$ In other words, it’s isomorphic to $[0,\infty)^n$ for some $n = 0, 1, 2, \dots .$ So, $C$ is equivalent to the category where objects are natural numbers, a morphism from $m$ to $n$ is an $n \times m$ matrix of numbers in $[0,\infty) ,$ and composition is done by matrix multiplication. I’ll also call this equivalent category $C.$

We can take tensor products of finitely generated free modules, and this makes $C$ into a symmetric monoidal †-category. This means we can draw maps using string diagrams in the usual way. However, I’m feeling lazy so I’ll often write equations when I could be drawing diagrams.

One of the rules of the game is that all these equations will make sense in any symmetric monoidal †-category. So we could, if we wanted, generalize ideas from probability theory this way. If you want to do this, you’ll need to know that $[0,\infty)$ is the unit for the tensor product in $C.$ We’ll be seeing this guy $[0,\infty)$ a lot. So if you want to generalize, replace $C$ by any symmetric monoidal †-category, and replace $[0,\infty)$ by the unit for the tensor product.
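In the matrix picture all of this is very concrete: morphisms are matrices with entries in $[0,\infty),$ composition is matrix multiplication, the tensor product is the Kronecker product, and the dagger is the transpose. A quick sketch in Python with made-up matrices (nothing here is from any paper):

```python
import numpy as np

# A morphism m -> n in C is an n x m matrix with entries in [0, infinity).
f = np.array([[1.0, 2.0],
              [0.5, 0.0],
              [3.0, 1.0]])        # a morphism 2 -> 3
g = np.array([[1.0, 0.0, 2.0]])   # a morphism 3 -> 1

h = g @ f           # composition g o f : 2 -> 1
t = np.kron(f, g)   # tensor product f (x) g : 2*3 -> 3*1
dagger = f.T        # the dagger-structure: a morphism 3 -> 2
```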

Finite sets

There’s a way to see the category of finite sets lurking in $C,$ which we can borrow from this paper:

• Bob Coecke, Dusko Pavlovic and Jamie Vicary, A new description of orthogonal bases.

For any finite set $S ,$ we get a free finitely generated $[0,\infty)$-module, namely $[0,\infty)^S .$ This comes with some structure:

• a multiplication $m: [0,\infty)^S \otimes [0,\infty)^S \to [0,\infty)^S ,$ coming from pointwise multiplication of $[0,\infty)$-valued functions on $S$

• the unit for this multiplication, an element of $[0,\infty)^S,$ which we can write as a morphism $i: [0,\infty) \to [0,\infty)^S$

• a comultiplication, obtained by taking the diagonal map $\Delta : S \to S \times S$ and promoting it to a linear map $\Delta : [0,\infty)^S \to [0, \infty)^S \otimes [0,\infty)^S$

• a counit for this comultiplication, obtained by taking the unique map to the terminal set $! : S \to 1$ and promoting it to a linear map $e: [0,\infty)^S \to [0,\infty)$

These morphisms $m, i, \Delta, e$ make

$x = [0,\infty)^S$

into a commutative Frobenius algebra in $C .$ That’s a thing where the unit, counit, multiplication and comultiplication obey these laws:

(I drew these back when I was feeling less lazy.) This Frobenius algebra is also ‘special’, meaning it obeys this:

And it’s also a †-Frobenius algebra, meaning that the counit and comultiplication are obtained from the unit and multiplication by ‘flipping’ them using the †-category structure. (If we think of a morphism in $C$ as a matrix, its dagger is its transpose.)

Conversely, suppose we have any special commutative †-Frobenius algebra $x .$ Then using the ideas in the paper by Coecke, Pavlovic and Vicary we can recover a basis for $x ,$ consisting of the vectors $e_i \in x$ with

$\Delta(e_i) = e_i \otimes e_i$

This basis forms a set $S$ such that

$x \cong [0,\infty)^S$

for some specified isomorphism in $C.$ Furthermore, this is an isomorphism of special commutative †-Frobenius algebras!

In case you’re wondering, these vectors $e_i$ correspond to the functions on $S$ that are zero everywhere except at one point $i \in S,$ where they equal 1.

In short, a special commutative †-Frobenius algebra in $C$ is just a fancy way of talking about a finite set. This may seem silly, but it’s a way to start describing probability theory using linear algebra very much as we do with quantum theory. This analogy between quantum theory and probability theory is so interesting that it deserves a book.
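If you want to see these laws with your own eyes, here is a check in the matrix picture, where the multiplication becomes an $n \times n^2$ matrix, the comultiplication is its transpose, and the counit is the row vector of ones (my own sanity check, not from any paper):

```python
import numpy as np

n = 3  # S = {0, 1, 2}

# multiplication m: [0,inf)^S (x) [0,inf)^S -> [0,inf)^S, pointwise product:
# it sends e_i (x) e_j to e_i if i = j, and to 0 otherwise
M = np.zeros((n, n * n))
for k in range(n):
    M[k, k * n + k] = 1.0

D = M.T                  # comultiplication: the dagger of m
e = np.ones((1, n))      # counit: sum the entries
unit = np.ones((n, 1))   # unit of the multiplication
I = np.eye(n)

assert np.allclose(M @ D, I)                              # 'special'
assert np.allclose(np.kron(I, M) @ np.kron(D, I), D @ M)  # Frobenius law
assert np.allclose(np.kron(e, I) @ D, I)                  # counit law
assert np.allclose(M @ np.kron(unit, I), I)               # unit law
```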

Functions and bijections

Now suppose we have two special commutative †-Frobenius algebras in $C$, say $x$ and $y .$

Suppose $f : x \to y$ is a Frobenius algebra homomorphism: that is, a map preserving all the structure—the unit, counit, multiplication and comultiplication. Then it comes from an isomorphism of finite sets. This lets us find $FinSet_0 ,$ the groupoid of finite sets and bijections, inside $C.$

Alternatively, suppose $f : x \to y$ is just a coalgebra homomorphism: that is, a map preserving just the counit and comultiplication. Then it comes from an arbitrary function between finite sets. This lets us find $FinSet ,$ the category of finite sets and functions, inside $C .$

But what if $f$ preserves just the counit? This sounds like a dry, formal question. But it’s not: the answer is something useful, a ‘stochastic map’.

Stochastic maps

A stochastic map from a finite set $S$ to a finite set $T$ is a map sending each point of $S$ to a probability measure on $T .$

We can think of this as a $T \times S$-shaped matrix of numbers in $[0,\infty) ,$ where a given column gives the probability that a given point in $S$ goes to any point in $T .$ The sum of the numbers in each column will be 1. And conversely, any $T \times S$-shaped matrix of numbers in $[0,\infty) ,$ where each column sums to 1, gives a stochastic map from $S$ to $T .$

But now let’s describe this idea using the category $C.$ We’ve seen a finite set is the same as a special commutative †-Frobenius algebra. So, say we have two of these, $x$ and $y .$ Our matrix of numbers in $[0,\infty)$ is just a map

$f: x \to y$

So, we just need a way to state the condition that each column in the matrix sums to 1. And this condition simply says that $f$ preserves the counit:

$\epsilon_y \circ f = \epsilon_x$

where $\epsilon_x : x \to [0,\infty)$ is the counit for $x ,$ and similarly for $\epsilon_y .$

To understand this, note that if we use the canonical isomorphism

$x \cong [0,\infty)^S$

the counit $\epsilon_x$ can be seen as the map

$[0,\infty)^S \to [0,\infty)$

that takes any $S$-tuple of numbers and sums them up. In other words, it’s integration with respect to counting measure. So, the equation

$\epsilon_y \circ f = \epsilon_x$

says that if we take any $S$-tuple of numbers, multiply it by the matrix $f ,$ and then sum up the entries of the resulting $T$-tuple, it’s the same as if we summed up the original $S$-tuple. But this says precisely that each column of the matrix $f$ sums to 1.

So, we can use our formalism to describe $\mathrm{FinStoch},$ the category with finite sets as objects and stochastic maps as morphisms. We’ve seen this category is equivalent to the category with special commutative †-Frobenius algebras in $C$ as objects and counit-preserving maps as morphisms.
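In matrices, the counit-preservation condition takes two lines to check (with an arbitrary column-stochastic matrix of my own choosing):

```python
import numpy as np

# a stochastic map from S (3 elements) to T (2 elements): a T x S matrix
f = np.array([[0.1, 0.5, 1.0],
              [0.9, 0.5, 0.0]])

eps_S = np.ones((1, 3))   # counit of [0,inf)^S: integrate against counting measure
eps_T = np.ones((1, 2))   # counit of [0,inf)^T

# counit preservation eps_T o f = eps_S says exactly: each column sums to 1
assert np.allclose(eps_T @ f, eps_S)
```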

Finite measure spaces

Now let’s use our formalism to describe finite measure spaces—by which, beware, I mean finite sets equipped with measures! To do this, we’ll use a special commutative †-Frobenius algebra $x$ in $C$ together with any map

$\mu: [0,\infty) \to x$

Starting from these, we get a specified isomorphism

$x \cong [0,\infty)^S$

and $\mu$ sends the number 1 to a vector in $[0,\infty)^S$: that is, a function on $S$ taking values in $[0,\infty) .$ Multiplying this function by counting measure, we get a measure on $S .$

Puzzle. How can we describe this measure without the annoying use of counting measure?

Conversely, any measure on a finite set gives a special commutative †-Frobenius algebra $x$ in $C$ equipped with a map from $[0,\infty) .$

So, we can say a finite measure space is a special commutative †-Frobenius algebra in $C$ equipped with a map

$\mu: [0,\infty) \to x$

And given two of these,

$\mu: [0,\infty) \to x , \qquad \nu: [0,\infty) \to y$

and a coalgebra morphism

$f : x \to y$

obeying this equation

$f \circ \mu = \nu$

then we get a measure-preserving function between finite measure spaces! If you’re a category theorist, you’ll draw this equation as a commutative triangle:

Conversely, any measure-preserving function between finite measure spaces obeys this equation. So, we get an algebraic way of describing the category $\mathrm{FinMeas} ,$ with finite measure spaces as objects and measure-preserving maps as morphisms.
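For example, here is a deterministic measure-preserving map in the matrix picture: a function $g : S \to T$ becomes a matrix of 0s and 1s, and the equation $f \circ \mu = \nu$ says $\nu$ is the pushforward of $\mu$ (illustrative numbers of my own):

```python
import numpy as np

S, T = 3, 2
g = [0, 0, 1]            # a function S -> T, listing the image of each point

# promote g to a linear map f: a T x S matrix of 0s and 1s
f = np.zeros((T, S))
for x, y in enumerate(g):
    f[y, x] = 1.0

mu = np.array([0.3, 0.2, 1.5])   # a measure on S, as a map [0,inf) -> [0,inf)^S
nu = np.array([0.5, 1.5])        # its pushforward along g

# f is a measure-preserving map from (S, mu) to (T, nu):  f o mu = nu
assert np.allclose(f @ mu, nu)
```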

Finite probability measure spaces

I’m mainly interested in probability measures. So suppose $x$ is a special commutative †-Frobenius algebra in $C$ equipped with a map

$\mu: [0,\infty) \to x$

We’ve seen this gives a finite measure space. But this is a probability measure space if and only if

$e \circ \mu = 1$

where

$e : x \to [0,\infty)$

is the counit for $x .$ The equation simply says the total integral of our measure $\mu$ is 1.

So, we get a way of describing the category $\mathrm{FinProb} ,$ which has finite probability measure spaces as objects and measure-preserving maps as morphisms. Given finite probability measure spaces described this way:

$\mu: [0,\infty) \to x , \qquad \nu: [0,\infty) \to y$

a measure-preserving function is a coalgebra morphism

$f : x \to y$

such that the obvious triangle commutes:

$f \circ \mu = \nu$

Measure-preserving stochastic maps

Say we have two finite measure spaces. Then we can ask whether a stochastic map from one to the other is measure-preserving. And we can answer this question in the language of $C .$

Remember, a finite measure space is a special commutative †-Frobenius algebra $x$ in $C$ together with a map

$\mu: [0,\infty) \to x$

Say we have another one:

$\nu: [0,\infty) \to y$

A stochastic map is just a map

$f: x \to y$

that preserves the counit:

$\epsilon_y \circ f = \epsilon_x$

But it’s a measure-preserving stochastic map if also

$f \circ \mu = \nu$
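Putting the two conditions together in the matrix picture: a measure-preserving stochastic map is a column-stochastic matrix carrying $\mu$ to $\nu .$ A tiny check (made-up numbers):

```python
import numpy as np

# a stochastic map f: columns sum to 1
f = np.array([[0.7, 0.2],
              [0.3, 0.8]])

mu = np.array([0.5, 0.5])    # a probability measure on the source
nu = np.array([0.45, 0.55])  # a probability measure on the target

assert np.allclose(np.ones(2) @ f, np.ones(2))  # preserves the counit
assert np.allclose(f @ mu, nu)                  # preserves the measure
```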

Next…

There’s a lot more to say; I haven’t gotten anywhere near what Tobias and I are doing! But it’s pleasant to have this basic stuff written down.

Maximum Entropy and Ecology

21 February, 2013

I already talked about John Harte’s book on how to stop global warming. Since I’m trying to apply information theory and thermodynamics to ecology, I was also interested in this book of his:

John Harte, Maximum Entropy and Ecology, Oxford U. Press, Oxford, 2011.

There’s a lot in this book, and I haven’t absorbed it all, but let me try to briefly summarize his maximum entropy theory of ecology. This aims to be “a comprehensive, parsimonious, and testable theory of the distribution, abundance, and energetics of species across spatial scales”. One great thing is that he makes quantitative predictions using this theory and compares them to a lot of real-world data. But let me just tell you about the theory.

It’s heavily based on the principle of maximum entropy (MaxEnt for short), and there are two parts:

Two MaxEnt calculations are at the core of the theory: the first yields all the metrics that describe abundance and energy distributions, and the second describes the spatial scaling properties of species’ distributions.

Abundance and energy distributions

The first part of Harte’s theory is all about a conditional probability distribution

$R(n,\epsilon | S_0, N_0, E_0)$

which he calls the ecosystem structure function. Here:

$S_0$: the total number of species under consideration in some area.

$N_0$: the total number of individuals under consideration in that area.

$E_0$: the total rate of metabolic energy consumption of all these individuals.

Given this,

$R(n,\epsilon | S_0, N_0, E_0) \, d \epsilon$

is the probability that given $S_0, N_0, E_0,$ if a species is picked from the collection of species, then it has $n$ individuals, and if an individual is picked at random from that species, then its rate of metabolic energy consumption is in the interval $(\epsilon, \epsilon + d \epsilon).$

Here of course $d \epsilon$ is ‘infinitesimal’, meaning that we take a limit where it goes to zero to make this idea precise (if we’re doing analytical work) or take it to be very small (if we’re estimating $R$ from data).

I believe that when we ‘pick a species’ we’re treating them all as equally probable, not weighting them according to their number of individuals.

Clearly $R$ obeys some constraints. First, since it’s a probability distribution, it obeys the normalization condition:

$\displaystyle{ \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) = 1 }$

Second, since the average number of individuals per species is $N_0/S_0,$ we have:

$\displaystyle{ \sum_n \int d \epsilon \; n R(n,\epsilon | S_0, N_0, E_0) = N_0 / S_0 }$

Third, since the average over species of the total rate of metabolic energy consumption of individuals within the species is $E_0/ S_0,$ we have:

$\displaystyle{ \sum_n \int d \epsilon \; n \epsilon R(n,\epsilon | S_0, N_0, E_0) = E_0 / S_0 }$

Harte’s theory is that $R$ maximizes entropy subject to these three constraints. Here entropy is defined by

$\displaystyle{ - \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) \ln(R(n,\epsilon | S_0, N_0, E_0)) }$
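If you carry out this constrained maximization with Lagrange multipliers $\lambda_1, \lambda_2$ for the second and third constraints—a standard MaxEnt computation, done in detail in Harte’s book—you get a Boltzmann-like answer:

```latex
R(n,\epsilon \mid S_0, N_0, E_0) \;=\; \frac{1}{Z(\lambda_1,\lambda_2)}\,
   e^{-\lambda_1 n \,-\, \lambda_2 n \epsilon},
\qquad
Z(\lambda_1,\lambda_2) \;=\; \sum_n \int d\epsilon \;
   e^{-\lambda_1 n - \lambda_2 n \epsilon}
```

where $\lambda_1$ and $\lambda_2$ are fixed by demanding that the two mean constraints hold.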

Harte uses this theory to calculate $R,$ and tests the results against data from about 20 ecosystems. For example, he predicts the abundance of species as a function of their rank, with rank 1 being the most abundant, rank 2 being the second most abundant, and so on. And he gets results like this:

The data here are from:

• Green, Harte, and Ostling’s work on a serpentine grassland,

• the Luquillo 10.24-hectare tropical forest plot, and

• the Cocoli 2-hectare wet tropical forest plot.

The fit looks good to me… but I should emphasize that I haven’t had time to study these matters in detail. For more, you can read this paper, at least if your institution subscribes to this journal:

• J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

Spatial abundance distribution

The second part of Harte’s theory is all about a conditional probability distribution

$\Pi(n | A, n_0, A_0)$

This is the probability that $n$ individuals of a species are found in a region of area $A$ given that it has $n_0$ individuals in a larger region of area $A_0.$

$\Pi$ obeys two constraints. First, since it’s a probability distribution, it obeys the normalization condition:

$\displaystyle{ \sum_n \Pi(n | A, n_0, A_0) = 1 }$

Second, since the mean value of $n$ across regions of area $A$ equals $n_0 A/A_0,$ we have

$\displaystyle{ \sum_n n \Pi(n | A, n_0, A_0) = n_0 A/A_0 }$

Harte’s theory is that $\Pi$ maximizes entropy subject to these two constraints. Here entropy is defined by

$\displaystyle{- \sum_n \Pi(n | A, n_0, A_0)\ln(\Pi(n | A, n_0, A_0)) }$
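This particular MaxEnt problem is easy to solve numerically: the maximizer is a truncated geometric distribution $\Pi(n) \propto x^n$ for $n = 0, \dots, n_0,$ with $x$ fixed by the mean constraint. Here’s a sketch of the computation (my own illustration; Harte treats this analytically):

```python
def maxent_pi(n0, mean):
    """Maximize entropy of Pi(n), n = 0..n0, subject to normalization
    and to sum_n n Pi(n) = mean.  Lagrange multipliers give Pi(n) ~ x**n;
    find x by bisection on the mean, which increases with x."""
    def mean_of(x):
        w = [x**n for n in range(n0 + 1)]
        return sum(n * wn for n, wn in enumerate(w)) / sum(w)
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        mid = (lo * hi) ** 0.5          # bisect on a log scale
        if mean_of(mid) < mean:
            lo = mid
        else:
            hi = mid
    x = (lo * hi) ** 0.5
    w = [x**n for n in range(n0 + 1)]
    Z = sum(w)
    return [wn / Z for wn in w]

# e.g. a cell with A/A0 = 1/4 of the area, and n0 = 10 individuals overall
pi = maxent_pi(n0=10, mean=2.5)
```

The resulting distribution is monotonically decreasing in $n$ whenever $A/A_0 < 1/2,$ matching the intuition that a small cell is unlikely to hold most of the individuals.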

Harte explains two approaches to using this idea to derive ‘scaling laws’ for how $n$ varies with $A$. And again, he compares his predictions to real-world data, and gets results that look good to my (amateur, hasty) eye!

I hope sometime I can dig deeper into this subject. Do you have any ideas, or knowledge about this stuff?

Network Theory for Economists

15 January, 2013

Tomorrow I’m giving a talk in the econometrics seminar at U.C. Riverside. I was invited to speak on my work on network theory, so I don’t feel too bad about the fact that I’ll be saying only a little about economics and practically nothing about econometrics. Still, I’ve tried to slant the talk in a way that emphasizes possible applications to economics and game theory. Here are the slides:

For long-time readers here the fun comes near the end. I explain how reaction networks can be used to describe evolutionary games. I point out that in certain classes of evolutionary games, evolution tends to increase ‘fitness’, and/or lead the players to a ‘Nash equilibrium’. For precise theorems you’ll have to click the links in my talk and read the references!

I conclude with an example: a game with three strategies and 7 Nash equilibria. Here evolution makes the proportion of these three strategies follow these flow lines, at least in the limit of large numbers of players:

This picture is from William Sandholm’s nice expository paper:

• William H. Sandholm, Evolutionary game theory, 2007.

I mentioned it before in Information Geometry (Part 12), en route to showing a proof that some quantity always decreases in a class of evolutionary games. Sometime I want to tell the whole story linking:

and

But not today! Think of these talk slides as a little appetizer.

John Harte

27 October, 2012

Earlier this week I gave a talk on the Mathematics of Planet Earth at the University of Southern California, and someone there recommended that I look into John Harte’s work on maximum entropy methods in ecology. He works at U.C. Berkeley.

I checked out his website and found that his goals resemble mine: save the planet and understand its ecosystems. He’s a lot further along than I am, since he comes from a long background in ecology while I’ve just recently blundered in from mathematical physics. I can’t really say what I think of his work since I’m just learning about it. But I thought I should point out its existence.

This free book is something a lot of people would find interesting:

• John and Mary Ellen Harte, Cool the Earth, Save the Economy: Solving the Climate Crisis Is EASY, 2008.

EASY? Well, it’s an acronym. Here’s the basic idea of the US-based plan described in this book:

Any proposed energy policy should include these two components:

Technical/Behavioral: What resources and technologies are to be used to supply energy? On the demand side, what technologies and lifestyle changes are being proposed to consumers?

Incentives/Economic Policy: How are the desired supply and demand options to be encouraged or forced? Here the options include taxes, subsidies, regulations, permits, research and development, and education.

And a successful energy policy should satisfy the AAA criteria:

Availability. The climate crisis will rapidly become costly to society if we do not take action expeditiously. We need to adopt now those technologies that are currently available, provided they meet the following two additional criteria:

Affordability. Because of the central role of energy in our society, its cost to consumers should not increase significantly. In fact, a successful energy policy could ultimately save consumers money.

Acceptability. All energy strategies have environmental, land use, and health and safety implications; these must be acceptable to the public. Moreover, while some interest groups will undoubtedly oppose any particular energy policy, political acceptability at a broad scale is necessary.

Our strategy for preventing climate catastrophe and achieving energy independence includes:

Energy Efficient Technology at home and at the workplace. Huge reductions in home energy use can be achieved with available technologies, including more efficient appliances such as refrigerators, water heaters, and light bulbs. Home retrofits and new home design features such as “smart” window coatings, lighter-colored roofs where there are hot summers, better home insulation, and passive solar designs can also reduce energy use. Together, energy efficiency in home and industry can save the U.S. up to approximately half of the energy currently consumed in those sectors, and at no net cost—just by making different choices. Sounds good, doesn’t it?

Automobile Fuel Efficiency. Phase in higher Corporate Average Fuel Economy (CAFE) standards for automobiles, SUVs and light trucks by requiring vehicles to go 35 miles per gallon of gas (mpg) by 2015, 45 mpg by 2020, and 60 mpg by 2030. This would rapidly wipe out our dependence on foreign oil and cut emissions from the vehicle sector by two-thirds. A combination of plug-in hybrid, lighter car body materials, re-design and other innovations could readily achieve these standards. This sounds good, too!

Solar and Wind Energy. Rooftop photovoltaic panels and solar water heating units should be phased in over the next 20 years, with the goal of solar installation on 75% of U.S. homes and commercial buildings by 2030. (Not all roofs receive sufficient sunlight to make solar panels practical for them.) Large wind farms, solar photovoltaic stations, and solar thermal stations should also be phased in so that by 2030, all U.S. electricity demand will be supplied by existing hydroelectric, existing and possibly some new nuclear, and, most importantly, new solar and wind units. This will require investment in expansion of the grid to bring the new supply to the demand, and in research and development to improve overnight storage systems. Achieving this goal would reduce our dependence on coal to practically zero. More good news!

You are part of the answer. Voting wisely for leaders who promote the first three components is one of the most important individual actions one can make. Other actions help, too. Just as molecules make up mountains, individual actions taken collectively have huge impacts. Improved driving skills, automobile maintenance, reusing and recycling, walking and biking, wearing sweaters in winter and light clothing in summer, installing timers on thermostats and insulating houses, carpooling, paying attention to energy efficiency labels on appliances, and many other simple practices and behaviors hugely influence energy consumption. A major education campaign, both in schools for youngsters and by the media for everyone, should be mounted to promote these consumer practices.

No part of EASY can be left out; all parts are closely integrated. Some parts might create much larger changes—for example, more efficient home appliances and automobiles—but all parts are essential. If, for example, we do not achieve the decrease in electricity demand that can be brought about with the E of EASY, then it is extremely doubtful that we could meet our electricity needs with the S of EASY.

It is equally urgent that once we start implementing the plan, we aggressively export it to other major emitting nations. We can reduce our own emissions all we want, but the planet will continue to warm if we can’t convince other major global emitters to reduce their emissions substantially, too.

What EASY will achieve. If no actions are taken to reduce carbon dioxide emissions, in the year 2030 the U.S. will be emitting about 2.2 billion tons of carbon in the form of carbon dioxide. This will be an increase of 25% from today’s emission rate of about 1.75 billion tons per year of carbon. By following the EASY plan, the U.S. share in a global effort to solve the climate crisis (that is, prevent catastrophic warming) will result in U.S. emissions of only about 0.4 billion tons of carbon by 2030, which represents a little less than 25% of 2007 carbon dioxide emissions. Stated differently, the plan provides a way to eliminate 1.8 billion tons per year of carbon by that date.

We must act urgently: in the 14 months it took us to write this book, atmospheric CO2 levels rose by several billion tons of carbon, and more climatic consequences have been observed. Let’s assume that we conserve our forests and other natural carbon reservoirs at our current levels, as well as maintain our current nuclear and hydroelectric plants (or replace them with more solar and wind generators). Here’s what implementing EASY will achieve, as illustrated by Figure 3.1 on the next page.

Please check out this book and help me figure out if the numbers add up! I could also use help understanding his research, for example:

• John Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics, Oxford University Press, Oxford, 2011.

The book is not free but the first chapter is.

This paper looks really interesting too:

• J. Harte, T. Zillio, E. Conlisk and A. B. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

Again, it’s not freely available—tut tut. Ecologists should follow physicists and make their work free online; if you’re serious about saving the planet you should let everyone know what you’re doing! However, the abstract is visible to all, and of course I can use my academic superpowers to get ahold of the paper for myself:

Abstract: The biodiversity scaling metrics widely studied in macroecology include the species-area relationship (SAR), the scale-dependent species-abundance distribution (SAD), the distribution of masses or metabolic energies of individuals within and across species, the abundance-energy or abundance-mass relationship across species, and the species-level occupancy distributions across space. We propose a theoretical framework for predicting the scaling forms of these and other metrics based on the state-variable concept and an analytical method derived from information theory. In statistical physics, a method of inference based on information entropy results in a complete macro-scale description of classical thermodynamic systems in terms of the state variables volume, temperature, and number of molecules. In analogy, we take the state variables of an ecosystem to be its total area, the total number of species within any specified taxonomic group in that area, the total number of individuals across those species, and the summed metabolic energy rate for all those individuals. In terms solely of ratios of those state variables, and without invoking any specific ecological mechanisms, we show that realistic functional forms for the macroecological metrics listed above are inferred based on information entropy. The Fisher log series SAD emerges naturally from the theory. The SAR is predicted to have negative curvature on a log-log plot, but as the ratio of the number of species to the number of individuals decreases, the SAR becomes better and better approximated by a power law, with the predicted slope z in the range of 0.14-0.20. Using the 3/4 power mass-metabolism scaling relation to relate energy requirements and measured body sizes, the Damuth scaling rule relating mass and abundance is also predicted by the theory. 
We argue that the predicted forms of the macroecological metrics are in reasonable agreement with the patterns observed from plant census data across habitats and spatial scales. While this is encouraging, given the absence of adjustable fitting parameters in the theory, we further argue that even small discrepancies between data and predictions can help identify ecological mechanisms that influence macroecological patterns.

The Mathematical Origin of Irreversibility

8 October, 2012

guest post by Matteo Smerlak

Introduction

Thermodynamical dissipation and adaptive evolution are two faces of the same Markovian coin!

Consider this. The Second Law of Thermodynamics states that the entropy of an isolated thermodynamic system can never decrease; Landauer’s principle maintains that the erasure of information inevitably causes dissipation; Fisher’s fundamental theorem of natural selection asserts that any fitness difference within a population leads to adaptation in an evolution process governed by natural selection. Diverse as they are, these statements have two common characteristics:

1. they express the irreversibility of certain natural phenomena, and

2. the dynamical processes underlying these phenomena involve an element of randomness.

Doesn’t this suggest to you the following question: Could it be that thermal phenomena, forgetful information processing and adaptive evolution are governed by the same stochastic mechanism?

The answer is—yes! The key to this rather profound connection resides in a universal property of Markov processes discovered recently in the context of non-equilibrium statistical mechanics, and known as the ‘fluctuation theorem’. Typically stated in terms of ‘dissipated work’ or ‘entropy production’, this result can be seen as an extension of the Second Law of Thermodynamics to small systems, where thermal fluctuations cannot be neglected. But it is actually much more than this: it is the mathematical underpinning of irreversibility itself, be it thermodynamical, evolutionary, or otherwise. To make this point clear, let me start by giving a general formulation of the fluctuation theorem that makes no reference to physics concepts such as ‘heat’ or ‘work’.

The mathematical fact

Consider a system randomly jumping between states $a, b,\dots$ with (possibly time-dependent) transition rates $\gamma_{a b}(t)$ where $a$ is the state prior to the jump, while $b$ is the state after the jump. I’ll assume that this dynamics defines a (continuous-time) Markov process, namely that the numbers $\gamma_{a b}$ are the matrix entries of an infinitesimal stochastic matrix, which means that its off-diagonal entries are non-negative and that its columns sum up to zero.

Now, each possible history $\omega=(\omega_t)_{0\leq t\leq T}$ of this process can be characterized by the sequence of occupied states $a_{j}$ and by the times $\tau_{j}$ at which the transitions $a_{j-1}\longrightarrow a_{j}$ occur $(1\leq j\leq N)$:

$\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T}).$

Define the skewness $\sigma_{j}(\tau_{j})$ of each of these transitions to be the logarithmic ratio of transition rates:

$\displaystyle{\sigma_{j}(\tau_{j}):=\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}$

Also define the self-information of the system in state $a$ at time $t$ by:

$i_a(t):= -\ln\pi_{a}(t)$

where $\pi_{a}(t)$ is the probability that the system is in state $a$ at time $t$, given some prescribed initial distribution $\pi_{a}(0)$. This quantity is also sometimes called the surprisal, as it measures the ‘surprise’ of finding out that the system is in state $a$ at time $t$.

Then the following identity—the detailed fluctuation theorem—holds:

$\mathrm{Prob}[\Delta i-\Sigma=-A] = e^{-A}\;\mathrm{Prob}[\Delta i-\Sigma=A]$

where

$\displaystyle{\Sigma:=\sum_{j}\sigma_{j}(\tau_{j})}$

is the cumulative skewness along a trajectory of the system, and

$\Delta i= i_{a_N}(T)-i_{a_0}(0)$

is the variation of self-information between the end points of this trajectory.

This identity has an immediate consequence: if $\langle\,\cdot\,\rangle$ denotes the average over all realizations of the process, then we have the integral fluctuation theorem:

$\langle e^{-\Delta i+\Sigma}\rangle=1,$

which, by the convexity of the exponential and Jensen’s inequality, implies:

$\langle \Delta i\rangle=\Delta S\geq\langle\Sigma\rangle.$

In short: the mean variation of self-information, aka the variation of Shannon entropy

$\displaystyle{ S(t):= \sum_{a}\pi_{a}(t)i_a(t) }$

is bounded from below by the mean cumulative skewness of the underlying stochastic trajectory.
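Before moving on, the integral fluctuation theorem is easy to check numerically. Here is a minimal Monte Carlo sketch for a two-state process; the rates, time horizon and initial distribution are made-up illustration values, not anything from the theorem itself. It Gillespie-samples histories, accumulates the skewness $\Sigma$ and the self-information change $\Delta i$ along each one, and averages $e^{-\Delta i+\Sigma}$:

```python
import math
import random

random.seed(0)

# Two-state process with constant rates: rate(a, b) = rate of jumping a -> b.
# The rates, horizon and initial distribution are assumed example values.
g01, g10 = 2.0, 1.0
T = 1.0
pi0_init = 0.9        # probability of state 0 at time 0

def pi0(t):
    """Exact master-equation solution for P(state 0 at time t)."""
    k = g01 + g10
    stationary = g10 / k
    return stationary + (pi0_init - stationary) * math.exp(-k * t)

def rate(a, b):
    return g01 if (a, b) == (0, 1) else g10

def one_sample():
    """Gillespie-sample one history on [0, T]; return exp(-Delta_i + Sigma)."""
    a = 0 if random.random() < pi0_init else 1
    a0, t, sigma = a, 0.0, 0.0
    while True:
        t += random.expovariate(rate(a, 1 - a))   # waiting time in state a
        if t > T:
            break
        # skewness of this jump: ln(reverse rate / forward rate)
        sigma += math.log(rate(1 - a, a) / rate(a, 1 - a))
        a = 1 - a
    p_start = pi0(0.0) if a0 == 0 else 1.0 - pi0(0.0)
    p_end = pi0(T) if a == 0 else 1.0 - pi0(T)
    delta_i = math.log(p_start) - math.log(p_end)
    return math.exp(-delta_i + sigma)

N = 200000
avg = sum(one_sample() for _ in range(N)) / N
print(avg)   # close to 1, as the integral fluctuation theorem predicts
```

The average converges to 1 up to sampling noise, even though individual histories give values of $e^{-\Delta i+\Sigma}$ far from 1.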

This is the fundamental mathematical fact underlying irreversibility. To unravel its physical and biological consequences, it suffices to consider the origin and interpretation of the ‘skewness’ term in different contexts. (By the way, people usually call $\Sigma$ the ‘entropy production’ or ‘dissipation function’—but how tautological is that?)

The physical and biological consequences

Consider first the standard stochastic-thermodynamic scenario where a physical system is kept in contact with a thermal reservoir at inverse temperature $\beta$ and undergoes thermally induced transitions between states $a, b,\dots$. By virtue of the detailed balance condition:

$\displaystyle{ e^{-\beta E_{a}(t)}\gamma_{a b}(t)=e^{-\beta E_{b}(t)}\gamma_{b a}(t),}$

the skewness $\sigma_{j}(\tau_{j})$ of each such transition is $\beta$ times the energy difference between the states $a_{j}$ and $a_{j-1}$, namely the heat received from the reservoir during the transition. Hence, the mean cumulative skewness $\langle \Sigma\rangle$ is nothing but $\beta\langle Q\rangle,$ with $Q$ the total heat received by the system along the process. It follows from the detailed fluctuation theorem that

$\langle e^{-\Delta i+\beta Q}\rangle=1$

and therefore

$\Delta S\geq\beta\langle Q\rangle$

which is of course Clausius’ inequality. In a computational context where the control parameter is the entropy variation itself (such as in a bit-erasure protocol, where $\Delta S=-\ln 2$), this inequality in turn expresses Landauer’s principle: it is impossible to decrease the self-information of the system’s state without dissipating a minimal amount of heat into the environment (in this case $-Q \geq k T\ln2$, the ‘Landauer bound’). More general situations (several types of reservoirs, Maxwell-demon-like feedback controls) can be treated along the same lines, and the various forms of the Second Law derived from the detailed fluctuation theorem.
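For a sense of scale, the Landauer bound $kT\ln 2$ is a tiny quantity at room temperature; here is the arithmetic (300 K is an assumed temperature, not part of the principle):

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # assumed room temperature, K

# Minimal heat that must be dissipated to erase one bit of information
landauer = k_B * T * math.log(2)
print(f"{landauer:.3e} J")   # about 2.87e-21 J
```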

Now, many would agree that evolutionary dynamics is a wholly different business from thermodynamics; in particular, notions such as ‘heat’ or ‘temperature’ are clearly irrelevant to Darwinian evolution. However, the stochastic framework of Markov processes is relevant to describe the genetic evolution of a population, and this fact alone has important consequences. As a simple example, consider the time evolution of mutant fixations $x_{a}$ in a population, with $a$ ranging over the possible genotypes. In a ‘symmetric mutation scheme’, which I understand is biological parlance for ‘reversible Markov process’, meaning one that obeys detailed balance, the ratio between the $a\mapsto b$ and $b\mapsto a$ transition rates is completely determined by the fitnesses $f_{a}$ and $f_b$ of $a$ and $b$, according to

$\displaystyle{\frac{\gamma_{a b}}{\gamma_{b a}} =\left(\frac{f_{b}}{f_{a}}\right)^{\nu} }$

where $\nu$ is a model-dependent function of the effective population size [Sella2005]. Along a given history of mutant fixations, the cumulative skewness $\Sigma$ is therefore given by minus the fitness flux:

$\displaystyle{\Phi=\nu\sum_{j}(\ln f_{a_j}-\ln f_{a_{j-1}}).}$

The integral fluctuation theorem then becomes the fitness flux theorem:

$\displaystyle{ \langle e^{-\Delta i -\Phi}\rangle=1}$

discussed recently by Mustonen and Lässig [Mustonen2010] and implying Fisher’s fundamental theorem of natural selection as a special case. (Incidentally, the ‘fitness flux theorem’ derived in this reference is more general than this; for instance, it does not rely on the ‘symmetric mutation scheme’ assumption above.) The ensuing inequality

$\langle \Phi\rangle\geq-\Delta S$

shows that a positive fitness flux is “an almost universal evolutionary principle of biological systems” [Mustonen2010], with negative contributions limited to time intervals with a systematic loss of adaptation ($\Delta S > 0$). This statement may well be the closest thing to a version of the Second Law of Thermodynamics applying to evolutionary dynamics.

It is really quite remarkable that thermodynamical dissipation and Darwinian evolution can be reduced to the same stochastic mechanism, and that notions such as ‘fitness flux’ and ‘heat’ can arise as two faces of the same mathematical coin, namely the ‘skewness’ of Markovian transitions. After all, the phenomenon of life is in itself a direct challenge to thermodynamics, isn’t it? When thermal phenomena tend to increase the world’s disorder, life strives to bring about and maintain exquisitely fine spatial and chemical structures—which is why Schrödinger famously proposed to define life as negative entropy. Could there be a more striking confirmation of his intuition—and a reconciliation of evolution and thermodynamics in the same go—than the fundamental inequality of adaptive evolution $\langle\Phi\rangle\geq-\Delta S$?

Surely the detailed fluctuation theorem for Markov processes has other applications, pertaining neither to thermodynamics nor adaptive evolution. Can you think of any?

Proof of the fluctuation theorem

I am a physicist, but knowing that many readers of John’s blog are mathematicians, I’ll do my best to frame—and prove—the FT as an actual theorem.

Let $(\Omega,\mathcal{T},p)$ be a probability space and $(\,\cdot\,)^{\dagger}\colon\Omega\to \Omega$ a measurable involution of $\Omega$. Denote by $p^{\dagger}$ the pushforward probability measure through this involution, and

$\displaystyle{ R=\ln \frac{d p}{d p^\dagger} }$

the logarithm of the corresponding Radon-Nikodym derivative (we assume $p^\dagger$ and $p$ are mutually absolutely continuous). Then the following lemmas are true, with $(1)\Rightarrow(2)\Rightarrow(3)$:

Lemma 1. The detailed fluctuation relation:

$\forall A\in\mathbb{R} \quad p\big(R^{-1}(-A) \big)=e^{-A}p \big(R^{-1}(A) \big)$

Lemma 2. The integral fluctuation relation:

$\displaystyle{\int_{\Omega} d p(\omega)\,e^{-R(\omega)}=1 }$

Lemma 3. The positivity of the Kullback-Leibler divergence:

$D(p\,\Vert\, p^{\dagger}):=\int_{\Omega} d p(\omega)\,R(\omega)\geq 0.$

These are basic facts which anyone can show: $(2)\Rightarrow(3)$ by Jensen’s inequality, $(1)\Rightarrow(2)$ trivially, and $(1)$ follows from $R(\omega^{\dagger})=-R(\omega)$ and the change of variables theorem, as follows,

$\begin{array}{ccl} \displaystyle{ \int_{R^{-1}(-A)} d p(\omega)} &=& \displaystyle{ \int_{R^{-1}(A)}d p^{\dagger}(\omega) } \\ \\ &=& \displaystyle{ \int_{R^{-1}(A)} d p(\omega)\, e^{-R(\omega)} } \\ \\ &=& \displaystyle{ e^{-A} \int_{R^{-1}(A)} d p(\omega)} .\end{array}$

But here is the beauty: if

• $(\Omega,\mathcal{T},p)$ is actually a Markov process defined over some time interval $[0,T]$ and valued in some (say discrete) state space $\mathcal{S}$, with the instantaneous probability $\pi_{a}(t)=p\big(\{\omega_{t}=a\} \big)$ of each state $a\in\mathcal{S}$ satisfying the master equation (aka Kolmogorov equation)

$\displaystyle{ \frac{d\pi_{a}(t)}{dt}=\sum_{b\neq a}\Big(\gamma_{b a}(t)\pi_{b}(t)-\gamma_{a b}(t)\pi_{a}(t)\Big),}$

and

• the dagger involution is time-reversal, that is $\omega^{\dagger}_{t}:=\omega_{T-t},$

then for a given path

$\displaystyle{\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T})\in\Omega}$

the logarithmic ratio $R(\omega)$ decomposes into ‘variation of self-information’ and ‘cumulative skewness’ along $\omega$:

$\displaystyle{ R(\omega)=\underbrace{\Big(\ln\pi_{a_0}(0)-\ln\pi_{a_N}(T) \Big)}_{\Delta i(\omega)}-\underbrace{\sum_{j=1}^{N}\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}_{\Sigma(\omega)}.}$

This is easy to see if one writes the probability of a path explicitly as

$\displaystyle{p(\omega)=\pi_{a_{0}}(0)\left[\prod_{j=1}^{N}\phi_{a_{j-1}}(\tau_{j-1},\tau_{j})\gamma_{a_{j-1}a_{j}}(\tau_{j})\right]\phi_{a_{N}}(\tau_{N},T)}$

where

$\displaystyle{ \phi_{a}(\tau,\tau')=\phi_{a}(\tau',\tau)=\exp\Big(-\sum_{b\neq a}\int_{\tau}^{\tau'}dt\, \gamma_{a b}(t)\Big)}$

is the probability that the process remains in the state $a$ between the times $\tau$ and $\tau'$. It follows from the above lemma that

Theorem. Let $(\Omega,\mathcal{T},p)$ be a Markov process and let $i,\Sigma:\Omega\rightarrow \mathbb{R}$ be defined as above. Then we have

1. The detailed fluctuation theorem:

$\forall A\in\mathbb{R}, p\big((\Delta i-\Sigma)^{-1}(-A) \big)=e^{-A}p \big((\Delta i-\Sigma)^{-1}(A) \big)$

2. The integral fluctuation theorem:

$\int_{\Omega} d p(\omega)\,e^{-\Delta i(\omega)+\Sigma(\omega)}=1$

3. The ‘Second Law’ inequality:

$\displaystyle{ \Delta S:=\int_{\Omega} d p(\omega)\,\Delta i(\omega)\geq \int_{\Omega} d p(\omega)\,\Sigma(\omega)}$

The same theorem can be formulated for other kinds of Markov processes as well, including diffusion processes (in which case it follows from the Girsanov theorem).

References

Landauer’s principle was introduced here:

• [Landauer1961] R. Landauer, Irreversibility and heat generation in the computing process, IBM Journal of Research and Development 5 (1961), 183–191.

and is now being verified experimentally by various groups worldwide.

The ‘fundamental theorem of natural selection’ was derived by Fisher in his book:

• [Fisher1930] R. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.

His derivation has long been considered obscure, even perhaps wrong, but apparently the theorem is now well accepted. I believe the first Markovian models of genetic evolution appeared here:

• [Fisher1922] R. A. Fisher, On the dominance ratio, Proc. Roy. Soc. Edinb. 42 (1922), 321–341.

• [Wright1931] S. Wright, Evolution in Mendelian populations, Genetics 16 (1931), 97–159.

Fluctuation theorems are reviewed here:

• [Sevick2008] E. Sevick, R. Prabhakar, S. R. Williams, and D. J. Searles, Fluctuation theorems, Ann. Rev. Phys. Chem. 59 (2008), 603–633.

Two of the key ideas for the ‘detailed fluctuation theorem’ discussed here are due to Crooks:

• [Crooks1999] Gavin Crooks, The entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences, Phys. Rev. E 60 (1999), 2721–2726.

who identified $E_{a_j}(\tau_{j})-E_{a_{j-1}}(\tau_{j})$ as heat, and Seifert:

• [Seifert2005] Udo Seifert, Entropy production along a stochastic trajectory and an integral fluctuation theorem, Phys. Rev. Lett. 95 (2005), 040602.

who understood the relevance of the self-information in this context.

The connection between statistical physics and evolutionary biology is discussed here:

• [Sella2005] G. Sella and A.E. Hirsh, The application of statistical physics to evolutionary biology, Proc. Nat. Acad. Sci. USA 102 (2005), 9541–9546.

and the ‘fitness flux theorem’ is derived in

• [Mustonen2010] V. Mustonen and M. Lässig, Fitness flux and ubiquity of adaptive evolution, Proc. Nat. Acad. Sci. USA 107 (2010), 4248–4253.

Schrödinger’s famous discussion of the physical nature of life was published here:

• [Schrödinger1944] E. Schrödinger, What is Life?, Cambridge University Press, Cambridge, 1944.

An Entropy Challenge

29 August, 2012

If you like computer calculations, here’s a little challenge for you. Oscar Dahlsten may have solved it, but we’d love for you to check his work. It’s pretty important for the foundations of thermodynamics, but you don’t need to know any physics or even anything beyond a little algebra to tackle it! First I’ll explain it in really simple terms, then I’ll remind you a bit of why it matters.

We’re looking for two lists of nonnegative numbers, of the same length, listed in decreasing order:

$p_1 \ge p_2 \ge \cdots \ge p_n \ge 0$

$q_1 \ge q_2 \ge \cdots \ge q_n \ge 0$

that sum to 1:

$p_1 + \cdots + p_n = 1$

$q_1 + \cdots + q_n = 1$

and that obey this inequality:

$\displaystyle{ \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta \le \frac{1}{1 - \beta} \ln \sum_{i=1}^n q_i^\beta }$

for all $0 < \beta < \infty$ (ignoring $\beta = 1$), yet do not obey these inequalities:

$p_1 + \cdots + p_k \ge q_1 + \cdots + q_k$

for all $1 \le k \le n.$

Oscar’s proposed solution is this:

$p = (0.4, 0.29, 0.29, 0.02)$

$q = (0.39, 0.31, 0.2, 0.1)$

Can you see if this works? Is there a simpler example, like one with lists of just 3 numbers?
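If you'd rather let a computer do the checking, here is a short script (my own sketch, not Oscar's) that tests the proposed solution: it evaluates the Rényi-type inequality on a grid of $\beta$ values and checks the partial-sum (majorization) inequalities directly:

```python
import math

p = [0.4, 0.29, 0.29, 0.02]
q = [0.39, 0.31, 0.2, 0.1]

def renyi(dist, beta):
    """Rényi entropy of order beta (beta != 1)."""
    return math.log(sum(x**beta for x in dist)) / (1 - beta)

def majorizes(a, b):
    """Does a majorize b? Compare partial sums in decreasing order."""
    a, b = sorted(a, reverse=True), sorted(b, reverse=True)
    return all(sum(a[:k]) >= sum(b[:k]) - 1e-12 for k in range(1, len(a) + 1))

# Grid of beta values in (0, 10), skipping beta = 1, plus two large values
betas = [0.01 * i for i in range(1, 1000) if i != 100] + [10.0, 100.0]
renyi_ok = all(renyi(p, b) <= renyi(q, b) + 1e-12 for b in betas)

print("H_beta(p) <= H_beta(q) on the grid:", renyi_ok)   # True
print("p majorizes q:", majorizes(p, q))                 # False: fails at k = 2
```

The partial sums give $0.4 + 0.29 = 0.69 < 0.39 + 0.31 = 0.70$, so majorization already fails at $k = 2$, while the Rényi inequalities hold across the grid. A grid check is of course not a proof for all $\beta$, but it strongly supports Oscar's example.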

This question came up near the end of my post More Second Laws of Thermodynamics. I phrased the question with a bit more jargon, and said a lot more about its significance. Suppose we have two probability distributions on a finite set, say $p$ and $q.$ We say $p$ majorizes $q$ if

$p_1 + \cdots + p_k \ge q_1 + \cdots + q_k$

for all $1 \le k \le n,$ when we write both lists of numbers in decreasing order. This means $p$ is ‘less flat’ than $q$, so it should have less entropy. And indeed it does: not just for ordinary entropy, but also for Rényi entropy! The Rényi entropy of $p$ is defined by

$\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta }$

where $0 < \beta < 1$ or $1 < \beta < \infty$. We can also define Rényi entropy for $\beta = 0, 1, \infty$ by taking a limit, and at $\beta = 1$ we get the ordinary entropy

$\displaystyle{ H_1(p) = - \sum_{i = 1}^n p_i \ln (p_i) }$

The question is whether majorization is more powerful than Rényi entropy as a tool to tell when one probability distribution is less flat than another. I know that if $p$ majorizes $q,$ its Rényi entropy is less than that of $q$ for all $0 \le \beta \le \infty.$ Your mission, should you choose to accept it, is to show the converse is not true.

More Second Laws of Thermodynamics

24 August, 2012

Oscar Dahlsten is visiting the Centre for Quantum Technologies, so we’re continuing some conversations about entropy that we started last year, back when the Entropy Club was active. But now Jamie Vicary and Brendan Fong are involved in the conversations.

I was surprised when Oscar told me that for a large class of random processes, the usual second law of thermodynamics is just one of infinitely many laws saying that various kinds of disorder increase. I’m annoyed that nobody ever told me about this before! It’s as if they told me about conservation of energy but not conservation of schmenergy, and phlenergy, and zenergy…

So I need to tell you about this. You may not understand it, but at least I can say I tried. I don’t want you blaming me for concealing all these extra second laws of thermodynamics!

Here’s the basic idea. Not all random processes are guaranteed to make entropy increase. But a bunch of them always make probability distributions flatter in a certain precise sense. This makes the entropy of the probability distribution increase. But when you make a probability distribution flatter in this sense, a bunch of other quantities increase too! For example, besides the usual entropy, there are infinitely many other kinds of entropy, called ‘Rényi entropies’, one for each number between 0 and ∞. And a doubly stochastic operator makes all the Rényi entropies increase! This fact is a special case of Theorem 10 here:

• Tim van Erven and Peter Harremoës, Rényi divergence and majorization.

Let me state this fact precisely, and then say a word about how this is related to quantum theory and ‘the collapse of the wavefunction’.

To keep things simple let’s talk about probability distributions on a finite set, though van Erven and Harremoës generalize it all to a measure space.

How do we make precise the concept that one probability distribution is flatter than another? You know it when you see it, at least some of the time. For example, suppose I have some system in thermal equilibrium at some temperature, and the probabilities of it being in various states look like this:

Then say I triple the temperature. The probabilities flatten out:

But how can we make this concept precise in a completely general way? We can do it using the concept of ‘majorization’. If one probability distribution is less flat than another, people say it ‘majorizes’ that other one.

Here’s the definition. Say we have two probability distributions $p$ and $q$ on the same set. For each one, list the probabilities in decreasing order:

$p_1 \ge p_2 \ge \cdots \ge p_n$

$q_1 \ge q_2 \ge \cdots \ge q_n$

Then we say $p$ majorizes $q$ if

$p_1 + \cdots + p_k \ge q_1 + \cdots + q_k$

for all $1 \le k \le n.$ So, the idea is that the biggest probabilities in the distribution $p$ add up to more than the corresponding biggest ones in $q.$

In 1960, Alfred Rényi defined a generalization of the usual Shannon entropy that depends on a parameter $\beta.$ If $p$ is a probability distribution on a finite set, its Rényi entropy of order $\beta$ is defined to be

$\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_i p_i^\beta }$

where $0 \le \beta < \infty.$ Well, to be honest: if $\beta$ is 0, 1, or $\infty$ we have to define this by taking a limit where we let $\beta$ creep up to that value. But the limit exists, and when $\beta = 1$ we get the usual Shannon entropy

$\displaystyle{ H_1(p) = - \sum_i p_i \ln(p_i) }$
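The $\beta \to 1$ limit is easy to see numerically. Here is a tiny check with an arbitrary example distribution (the distribution is mine, chosen only for illustration):

```python
import math

p = [0.5, 0.25, 0.25]   # an arbitrary example distribution

def renyi(dist, beta):
    """Rényi entropy of order beta (beta != 1)."""
    return math.log(sum(x**beta for x in dist)) / (1 - beta)

# Shannon entropy of p: here it equals (3/2) ln 2, about 1.0397
shannon = -sum(x * math.log(x) for x in p)

# As beta -> 1 from either side, the Rényi entropy approaches the Shannon entropy:
for beta in (0.9, 0.99, 0.999, 1.001, 1.01, 1.1):
    print(beta, renyi(p, beta))
print(shannon)
```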

As I explained a while ago, Rényi entropies are important ways of measuring biodiversity. But here’s what I learned just now, from the paper by van Erven and Harremoës:

Theorem 1. If a probability distribution $p$ majorizes a probability distribution $q,$ its Rényi entropies are smaller:

$\displaystyle{ H_\beta(p) \le H_\beta(q) }$

for all $0 \le \beta < \infty.$

And here’s what makes this fact so nice. If you do something to a classical system in a way that might involve some randomness, we can describe your action using a stochastic matrix. An $n \times n$ matrix $T$ is called stochastic if whenever $p \in \mathbb{R}^n$ is a probability distribution, so is $T p.$ This is equivalent to saying:

• the matrix entries of $T$ are all $\ge 0,$ and

• each column of $T$ sums to 1.

If $T$ is stochastic, it’s not necessarily true that the entropy of $T p$ is greater than or equal to that of $p,$ not even for the Shannon entropy.

Puzzle 1. Find a counterexample.

However, entropy does increase if we use specially nice stochastic matrices called ‘doubly stochastic’ matrices. People say a matrix $T$ is doubly stochastic if it’s stochastic and it maps the probability distribution

$\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }$

to itself. This is the most spread-out probability distribution of all: every other probability distribution majorizes this one.

Why do they call such matrices ‘doubly’ stochastic? Well, if you’ve got a stochastic matrix, each column sums to one. But a stochastic operator is doubly stochastic if and only if each row sums to 1 as well.

Here’s a really cool fact:

Theorem 2. If $T$ is doubly stochastic, $p$ majorizes $T p$ for any probability distribution $p \in \mathbb{R}^n.$ Conversely, if a probability distribution $p$ majorizes a probability distribution $q,$ then $q = T p$ for some doubly stochastic matrix $T$.

Taken together, Theorems 1 and 2 say that doubly stochastic transformations increase entropy… but not just Shannon entropy! They increase all the different Rényi entropies, as well. So if time evolution is described by a doubly stochastic matrix, we get lots of ‘second laws of thermodynamics’, saying that all these different kinds of entropy increase!
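Theorems 1 and 2 are easy to illustrate together. The sketch below uses a made-up doubly stochastic matrix and a made-up distribution; it checks that $p$ majorizes $T p$ and that a few Rényi entropies all go up:

```python
import math

# An assumed example of a doubly stochastic matrix:
# rows and columns each sum to 1.
T = [[0.5, 0.3, 0.2],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]

p = [0.7, 0.2, 0.1]   # an arbitrary probability distribution
Tp = [sum(T[i][j] * p[j] for j in range(3)) for i in range(3)]

def renyi(dist, beta):
    """Rényi entropy of order beta (beta != 1)."""
    return math.log(sum(x**beta for x in dist)) / (1 - beta)

def majorizes(a, b):
    """Does a majorize b? Compare partial sums in decreasing order."""
    a, b = sorted(a, reverse=True), sorted(b, reverse=True)
    return all(sum(a[:k]) >= sum(b[:k]) - 1e-12 for k in range(1, len(a) + 1))

print(Tp)                       # flatter than p: [0.43, 0.32, 0.25]
print(majorizes(p, Tp))         # True, by Theorem 2
print(all(renyi(p, b) <= renyi(Tp, b) for b in (0.5, 2, 5)))   # True, by Theorem 1
```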

Finally, what does all this have to do with quantum mechanics, and collapsing the wavefunction? There are different things to say, but this is the simplest:

Theorem 3. Given two probability distributions $p, q \in \mathbb{R}^n$, then $p$ majorizes $q$ if and only if there exists a self-adjoint matrix $D$ with eigenvalues $p_i$ and diagonal entries $q_i.$

The matrix $D$ will be a density matrix: a self-adjoint matrix with nonnegative eigenvalues and trace equal to 1. We use such matrices to describe mixed states in quantum mechanics.

Theorem 3 gives a precise sense in which preparing a quantum system in some state, letting time evolve, and then measuring it ‘increases randomness’.

How? Well, suppose we have a quantum system whose Hilbert space is $\mathbb{C}^n.$ If we prepare the system in a mixture of the standard basis states with probabilities $p_i,$ we can describe it with a diagonal density matrix $D_0.$ Then suppose we wait a while and some unitary time evolution occurs. The system is now described by a new density matrix

$D = U D_0 \, U^{-1}$

where $U$ is some unitary operator. If we then do a measurement to see which of the standard basis states our system now lies in, we’ll get the different possible results with probabilities $q_i,$ the diagonal entries of $D.$ But the eigenvalues of $D$ will still be the numbers $p_i.$ So, by the theorem, $p$ majorizes $q$!

So, not only Shannon entropy but also all the Rényi entropies will increase!
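This little story can be checked numerically with NumPy; the initial mixture below is an assumed example, and the unitary is drawn at random via a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(1)

def majorizes(a, b):
    """Does a majorize b? Compare partial sums in decreasing order."""
    a, b = np.sort(a)[::-1], np.sort(b)[::-1]
    return bool(np.all(np.cumsum(a) >= np.cumsum(b) - 1e-12))

p = np.array([0.5, 0.3, 0.15, 0.05])   # assumed initial mixture probabilities
D0 = np.diag(p)                         # diagonal density matrix

# A random unitary from the QR decomposition of a complex Gaussian matrix
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
U, _ = np.linalg.qr(M)

D = U @ D0 @ U.conj().T                 # density matrix after time evolution
q = np.real(np.diag(D))                 # measurement probabilities

print(majorizes(p, q))   # True: the eigenvalues p majorize the diagonal q
print(np.allclose(np.sort(np.linalg.eigvalsh(D)), np.sort(p)))   # True
```

This is really the Schur–Horn story: the diagonal of a self-adjoint matrix is always majorized by its eigenvalues, whatever unitary you pick.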

Of course, there are some big physics questions lurking here. Like: what about the real world? In the real world, do lots of different kinds of entropy tend to increase, or just some?

Of course, there’s a huge famous old problem about how reversible time evolution can be compatible with any sort of law saying that entropy must always increase! Still, there are some arguments, going back to Boltzmann’s H-theorem, which show entropy increases under some extra conditions. So then we can ask if other kinds of entropy, like Rényi entropy, increase as well. This will be true whenever we can argue that time evolution is described by doubly stochastic matrices. Theorem 3 gives a partial answer, but there’s probably much more to say.

I don’t have much more to say right now, though. I’ll just point out that while doubly stochastic matrices map the ‘maximally smeared-out’ probability distribution

$\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }$

to itself, a lot of this theory generalizes to stochastic matrices that map exactly one other probability distribution to itself. We need to work with relative Rényi entropy instead of Rényi entropy, and so on, but I don’t think these adjustments are really a big deal. And there are nice theorems that let you know when a stochastic matrix maps exactly one probability distribution to itself, based on the Perron–Frobenius theorem.

References

I already gave you a reference for Theorem 1, namely the paper by van Erven and Harremoës, though I don’t think they were the first to prove this particular result: they generalize it quite a lot.

What about Theorem 2? It goes back at least to here:

• Barry C. Arnold, Majorization and the Lorenz Order: A Brief Introduction, Springer Lecture Notes in Statistics 43, Springer, Berlin, 1987.

The partial order on probability distributions given by majorization is also called the ‘Lorenz order’, but mainly when we consider probability distributions on infinite sets. This name presumably comes from the Lorenz curve, a measure of income inequality. This curve shows, for the bottom x% of households, what percentage y% of the total income they have.

Puzzle 2. If you’ve got two different probability distributions of incomes, and one majorizes the other, how are their Lorenz curves related?

When we generalize majorization by letting some other probability distribution take the place of

$\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }$

it seems people call it the ‘Markov order’. Here’s a really fascinating paper on that, which I’m just barely beginning to understand:

• A. N. Gorban, P. A. Gorban and G. Judge, Entropy: the Markov ordering approach, Entropy 12 (2010), 1145–1193.

What about Theorem 3? Apparently it goes back to here:

• A. Uhlmann, Wiss. Z. Karl-Marx-Univ. Leipzig 20 (1971), 633.

though I only know this thanks to a more recent paper:

• Michael A. Nielsen, Conditions for a class of entanglement transformations, Phys. Rev. Lett. 83 (1999), 436–439.

By the way, Nielsen’s paper contains another very nice result about majorization! Suppose you have states $\psi$ and $\phi$ of a 2-part quantum system. You can trace out one part and get density matrices describing mixed states of the other part, say $D_\psi$ and $D_\phi$. Then Nielsen shows you can get from $\psi$ to $\phi$ using ‘local operations and classical communication’ if and only if $D_\phi$ majorizes $D_\psi$. Note that things are going backwards here compared to how they’ve been going in the rest of this post: if we can get from $\psi$ to $\phi$, then all forms of entropy go down when we go from $D_\psi$ to $D_\phi$! This ‘anti-second-law’ behavior is confusing at first, but familiar to me by now.

When I first learned all this stuff, I naturally thought of the following question—maybe you did too, just now. If $p, q \in \mathbb{R}^n$ are probability distributions and

$\displaystyle{ H_\beta(p) \le H_\beta(q) }$

for all $0 \le \beta < \infty$, is it true that $p$ majorizes $q$?

Apparently the answer must be no, because Klimesh has gone to quite a bit of work to obtain a weaker conclusion: not that $p$ majorizes $q$, but that $p \otimes r$ majorizes $q \otimes r$ for some probability distribution $r \in \mathbb{R}^m.$ He calls this catalytic majorization, with $r$ serving as a ‘catalyst’:

• Matthew Klimesh, Inequalities that collectively completely characterize the catalytic majorization relation, arXiv:0709.3680.

I thank Vlatko Vedral here at the CQT for pointing this out!

Finally, here is a good general introduction to majorization, pointed out by Vasileios Anagnostopoulos:

• T. Ando, Majorization, doubly stochastic matrices, and comparison of eigenvalues, Linear Algebra and its Applications 118 (1989), 163–248.