Relative Entropy (Part 1)

20 June, 2013

I’m trying to finish off a paper that Tobias Fritz and I have been working on, which gives a category-theoretic (and Bayesian!) characterization of relative entropy. It’s a kind of sequel to our paper with Tom Leinster, in which we characterized entropy.

That earlier paper was developed in conversations on the n-Category Café. It was a lot of fun; I sort of miss that style of working. Also, to get warmed up, I need to think through some things I’ve thought about before. So, I might as well write them down here.

The idea

There are many categories related to probability theory, and they’re related in many ways. Last summer—on the 24th of August 2012, according to my notes here—Jamie Vicary, Brendan Fong and I worked through a bunch of these relationships. I need to write them down now, even if they’re not all vitally important to my paper with Tobias. They’re sort of buzzing around my brain like flies.

(Tobias knows this stuff too, and this is how we think about probability theory, but we weren’t planning to stick it in our paper. Maybe we should.)

Let’s restrict attention to probability measures on finite sets, and related structures. We could study these questions more generally, and we should, but not today. What we’ll do is give a unified purely algebraic description of:

• finite sets

• measures on finite sets

• probability measures on finite sets

and various kinds of maps between these:

• functions

• bijections

• measure-preserving functions

• stochastic maps

Finitely generated free [0,∞)-modules

People often do linear algebra over a field, which is—roughly speaking—a number system where you can add, subtract, multiply and divide. But algebraists have long realized that a lot of linear algebra still works with a commutative ring, where you can’t necessarily divide. It gets more complicated, but also a lot more interesting.

But in fact, a lot still works with a commutative rig, where we can’t necessarily subtract either! Something I keep telling everyone is that linear algebra over rigs is a good idea for studying things like probability theory, thermodynamics, and the principle of least action.

Today we’ll start with the rig of nonnegative real numbers with their usual addition and multiplication; let’s call this [0,\infty) . The idea is that measure theory, and probability theory, are closely related to linear algebra over this rig.

Let C be the category with of finitely generated free [0,\infty) -modules as objects, and module homomorphisms as morphisms. I’ll call these morphisms maps.

Puzzle. Do we need to say ‘free’ here? Are there finitely generated modules over [0,\infty) that aren’t free?

Every finitely generated free [0,\infty)-module is isomorphic to [0,\infty)^S for some finite set S . In other words, it’s isomorphic to [0,\infty)^n for some n = 0, 1, 2, \dots . So, C is equivalent to the category where objects are natural numbers, a morphism from m to n is an m \times n matrix of numbers in [0,\infty) , and composition is done by matrix multiplication. I’ll also call this equivalent category C.

We can take tensor products of finitely generated free modules, and this makes C into a symmetric monoidal †-category. This means we can draw maps using string diagrams in the usual way. However, I’m feeling lazy so I’ll often write equations when I could be drawing diagrams.

One of the rules of the game is that all these equations will make sense in any symmetric monoidal †-category. So we could, if we wanted, generalize ideas from probability theory this way. If you want to do this, you’ll need to know that [0,\infty) is the unit for the tensor product in C. We’ll be seeing this guy [0,\infty) a lot. So if you want to generalize, replace C by any symmetric monoidal †-category, and replace [0,\infty) by the unit for the tensor product.

Finite sets

There’s a way to see the category of finite sets lurking in C, which we can borrow from this paper:

• Bob Coecke, Dusko Pavlovic and Jamie Vicary, A new description of orthogonal bases.

For any finite set S , we get a free finitely generated [0,\infty) -module, namely [0,\infty)^S . This comes with some structure:

• a multiplication m: [0,\infty)^S \otimes [0,\infty)^S \to [0,\infty)^S , coming from pointwise multiplication of [0,\infty) -valued functions on S

• the unit for this multiplication, an element of [0,\infty)^S, which we can write as a morphism i: [0,\infty) \to [0,\infty)^S

• a comultiplication, obtained by taking the diagonal map \Delta : S \to S \times S and promoting it to a linear map \Delta : [0,\infty)^S \to [0, \infty)^S \otimes [0,\infty)^S

• a counit for this comultiplication, obtained by taking the unique map to the terminal set ! : S \to 1 and promoting it to a linear map e: [0,\infty)^S \to [0,\infty)

These morphisms m, i, \Delta, e make

x = [0,\infty)^S

into a commutative Frobenius algebra in C . That’s a thing where the unit, counit, multiplication and comultiplication obey these laws:

(I drew these back when I was feeling less lazy.) This Frobenius algebra is also ‘special’, meaning it obeys this:

And it’s also a †-Frobenius algebra, meaning that the counit and comultiplication are obtained from the unit and multiplication by ‘flipping’ them using the †category structure. (If we think of a morphism in C as a matrix, its dagger is its transpose.)

Conversely, suppose we have any special commutative †-Frobenius algebra x . Then using the ideas in the paper by Coecke, Pavlovich and Vicary we can recover a basis for x , consisting of the vectors e_i \in x with

\Delta(e_i) = e_i \otimes e_i

This basis forms a set S such that

x \cong [0,\infty)^S

for some specified isomorphism in C. Furthermore, this is an isomorphism of special commutative †-Frobenius algebras!

In case you’re wondering, these vectors e_i correspond to the functions on S that are zero everywhere except at one point i \in S, where they equal 1.

In short, a special commutative †-Frobenius algebra in C is just a fancy way of talking about a finite set. This may seem silly, but it’s a way to start describing probability theory using linear algebra very much as we do with quantum theory. This analogy between quantum theory and probability theory is so interesting that it deserves a book.

Functions and bijections

Now suppose we have two special commutative †-Frobenius algebra in C, say x and y .

Suppose f : x \to y is a Frobenius algebra homomorphism: that is, a map preserving all the structure—the unit, counit, multiplication and comultiplication. Then it comes from an isomorphism of finite sets. This lets us find \mathrm{FinSet}_0 , the groupoid of finite sets and bijections, inside C.

Alternatively, suppose f : x \to y is just a coalgebra homomorphism: that is a map preserving just the counit and comultiplication. Then it comes from an arbitrary function between finite sets. This lets us find FinSet , the category of finite sets and functions, inside C .

But what if f preserves just the counit? This sounds like a dry, formal question. But it’s not: the answer is something useful, a ‘stochastic map’.

Stochastic maps

A stochastic map from a finite set S to a finite set T is a map sending each point of S to a probability measure on T .

We can think of this as a T \times S -shaped matrix of numbers in [0,\infty) , where a given column gives the probability that a given point in S goes to any point in T . The sum of the numbers in each column will be 1. And conversely, any T \times S -shaped matrix of numbers in [0,\infty) , where each column sums to 1, gives a stochastic map from S to T .

But now let’s describe this idea using the category C. We’ve seen a finite set is the same as a special commutative †-Frobenius algebra. So, say we have two of these, x and y . Our matrix of numbers in [0,\infty) is just a map

f: x \to y

So, we just need a way to state the condition that each column in the matrix sums to 1. And this condition simply says that f preserves the counit:

\epsilon_y \circ f = \epsilon_x

where \epsilon_x : x \to [0,\infty) is the counit for x , and similarly for \epsilon_y .

To understand this, note that if we use the canonical isomorphism

x \cong [0,\infty)^S

the counit \epsilon_x can be seen as the map

[0,\infty)^S \to [0,\infty)

that takes any S -tuple of numbers and sums them up. In other words, it’s integration with respect to counting measure. So, the equation

\epsilon_y \circ f = \epsilon_x

says that if we take any S -tuple of numbers, multiply it by the matrix f , and then sum up the entries of the resulting T -tuple, it’s the same as if we summed up the original S -tuple. But this says precisely that each column of the matrix f sums to 1.

So, we can use our formalism to describe \mathrm{FinStoch}, the category with finite sets as objects and stochastic maps as morphisms. We’ve seen this category is equivalent to the category with special commutative †-Frobenius algebras in C as objects and counit-preserving maps as morphisms.

Finite measure spaces

Now let’s use our formalism to describe finite measure spaces—by which, beware, I mean a finite sets equipped with measures! To do this, we’ll use a special commutative †-Frobenius algebra x in C together with any map

\mu: [0,\infty) \to x

Starting from these, we get a specified isomorphism

x \cong [0,\infty)^S

and \mu sends the number 1 to a vector in [0,\infty)^S : that is, a function on S taking values in [0,\infty) . Multiplying this function by counting measure, we get a measure on S .

Puzzle. How can we describe this measure without the annoying use of counting measure?

Conversely, any measure on a finite set gives a special commutative †-Frobenius algebra x in C equipped with a map from [0,\infty) .

So, we can say a finite measure space is a special commutative †-Frobenius algebra in C equipped with a map

\mu: [0,\infty) \to x

And given two of these,

\mu: [0,\infty) \to x , \qquad  \nu: [0,\infty) \to y

and a coalgebra morphism

f : x \to y

obeying this equation

f \circ \mu  = \nu

then we get a measure-preserving function between finite measure spaces! If you’re a category theorist, you’ll draw this equation as a commutative triangle:

Conversely, any measure-preserving function between finite measure spaces obeys this equation. So, we get an algebraic way of describing the category \mathrm{FinMeas} , with finite measure spaces as objects and measure-preserving maps as morphisms.

Finite probability measure spaces

I’m mainly interested in probability measures. So suppose x is a special commutative †-Frobenius algebra in C equipped with a map

\mu: [0,\infty) \to x

We’ve seen this gives a finite measure space. But this is a probability measure space if and only if

e \circ \mu = 1

where

e : x \to [0,\infty)

is the counit for x . The equation simply says the total integral of our measure \mu is 1.

So, we get a way of describing the category \mathrm{FinProb} , which has finite probability measure spaces as objects and measure-preserving maps as objects. Given finite probability measure spaces described this way:

\mu: [0,\infty) \to x , \qquad  \nu: [0,\infty) \to y

a measure-preserving function is a coalgebra morphism

f : x \to y

such that the obvious triangle commutes:

f \circ \mu  = \nu

Measure-preserving stochastic maps

Say we have two finite measure spaces. Then we can ask whether a stochastic map from one to the other is measure-preserving. And we can answer this question in the language of C .

Remember, a finite measure space is a special commutative †-Frobenius algebra x in C together with a map

\mu: [0,\infty) \to x

Say we have another one:

\nu: [0,\infty) \to y

A stochastic map is just a map

f: x \to y

that preserves the counit:

\epsilon_y \circ f = \epsilon_x

But it’s a measure-preserving stochastic map if also

f \circ \mu  = \nu

Next…

There’s a lot more to say; I haven’t gotten anywhere near what Tobias and I are doing! But it’s pleasant to have this basic stuff written down.


Maximum Entropy and Ecology

21 February, 2013

I already talked about John Harte’s book on how to stop global warming. Since I’m trying to apply information theory and thermodynamics to ecology, I was also interested in this book of his:

John Harte, Maximum Entropy and Ecology, Oxford U. Press, Oxford, 2011.

There’s a lot in this book, and I haven’t absorbed it all, but let me try to briefly summarize his maximum entropy theory of ecology. This aims to be “a comprehensive, parsimonious, and testable theory of the distribution, abundance, and energetics of species across spatial scales”. One great thing is that he makes quantitative predictions using this theory and compares them to a lot of real-world data. But let me just tell you about the theory.

It’s heavily based on the principle of maximum entropy (MaxEnt for short), and there are two parts:

Two MaxEnt calculations are at the core of the theory: the first yields all the metrics that describe abundance and energy distributions, and the second describes the spatial scaling properties of species’ distributions.

Abundance and energy distributions

The first part of Harte’s theory is all about a conditional probability distribution

R(n,\epsilon | S_0, N_0, E_0)

which he calls the ecosystem structure function. Here:

S_0: the total number of species under consideration in some area.

N_0: the total number of individuals under consideration in that area.

E_0: the total rate of metabolic energy consumption of all these individuals.

Given this,

R(n,\epsilon | S_0, N_0, E_0) \, d \epsilon

is the probability that given S_0, N_0, E_0, if a species is picked from the collection of species, then it has n individuals, and if an individual is picked at random from that species, then its rate of metabolic energy consumption is in the interval (\epsilon, \epsilon + d \epsilon).

Here of course d \epsilon is ‘infinitesimal’, meaning that we take a limit where it goes to zero to make this idea precise (if we’re doing analytical work) or take it to be very small (if we’re estimating R from data).

I believe that when we ‘pick a species’ we’re treating them all as equally probable, not weighting them according to their number of individuals.

Clearly R obeys some constraints. First, since it’s a probability distribution, it obeys the normalization condition:

\displaystyle{ \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) = 1 }

Second, since the average number of individuals per species is N_0/S_0, we have:

\displaystyle{ \sum_n \int d \epsilon \; n R(n,\epsilon | S_0, N_0, E_0) = N_0 / S_0 }

Third, since the average over species of the total rate of metabolic energy consumption of individuals within the species is E_0/ S_0, we have:

\displaystyle{ \sum_n \int d \epsilon \; n \epsilon R(n,\epsilon | S_0, N_0, E_0) = E_0 / S_0 }

Harte’s theory is that R maximizes entropy subject to these three constraints. Here entropy is defined by

\displaystyle{ - \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) \ln(R(n,\epsilon | S_0, N_0, E_0)) }

Harte uses this theory to calculate R, and tests the results against data from about 20 ecosystems. For example, he predicts the abundance of species as a function of their rank, with rank 1 being the most abundant, rank 2 being the second most abundant, and so on. And he gets results like this:

The data here are from:

• Green, Harte, and Ostling’s work on a serpentine grassland,

• Luquillo’s work on a 10.24-hectare tropical forest, and

• Cocoli’s work on a 2-hectare wet tropical forest.

The fit looks good to me… but I should emphasize that I haven’t had time to study these matters in detail. For more, you can read this paper, at least if your institution subscribes to this journal:

• J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

Spatial abundance distribution

The second part of Harte’s theory is all about a conditional probability distribution

\Pi(n | A, n_0, A_0)

This is the probability that n individuals of a species are found in a region of area A given that it has n_0 individuals in a larger region of area A_0.

\Pi obeys two constraints. First, since it’s a probability distribution, it obeys the normalization condition:

\displaystyle{ \sum_n  \Pi(n | A, n_0, A_0) = 1 }

Second, since the mean value of n across regions of area A equals n_0 A/A_0, we have

\displaystyle{ \sum_n n \Pi(n | A, n_0, A_0) = n_0 A/A_0 }

Harte’s theory is that \Pi maximizes entropy subject to these two constraints. Here entropy is defined by

\displaystyle{- \sum_n  \Pi(n | A, n_0, A_0)\ln(\Pi(n | A, n_0, A_0)) }

Harte explains two approaches to use this idea to derive ‘scaling laws’ for how n varies with n. And again, he compares his predictions to real-world data, and get results that look good to my (amateur, hasty) eye!

I hope sometime I can dig deeper into this subject. Do you have any ideas, or knowledge about this stuff?


Network Theory for Economists

15 January, 2013

Tomorrow I’m giving a talk in the econometrics seminar at U.C. Riverside. I was invited to speak on my work on network theory, so I don’t feel too bad about the fact that I’ll be saying only a little about economics and practically nothing about econometrics. Still, I’ve tried to slant the talk in a way that emphasizes possible applications to economics and game theory. Here are the slides:

Network Theory.

For long-time readers here the fun comes near the end. I explain how reaction networks can be used to describe evolutionary games. I point out that in certain classes of evolutionary games, evolution tends to increase ‘fitness’, and/or lead the players to a ‘Nash equilibrium’. For precise theorems you’ll have to click the links in my talk and read the references!

I conclude with an example: a game with three strategies and 7 Nash equilibria. Here evolution makes the proportion of these three strategies follow these flow lines, at least in the limit of large numbers of players:

This picture is from William Sandholm’s nice expository paper:

• William H. Sandholm, Evolutionary game theory, 2007.

I mentioned it before in Information Geometry (Part 12), en route to showing a proof that some quantity always decreases in a class of evolutionary games. Sometime I want to tell the whole story linking:

reaction networks
evolutionary games
the 2nd law of thermodynamics

and

Fisher’s fundamental theorem of natural selection.

But not today! Think of these talk slides as a little appetizer.


John Harte

27 October, 2012

Earlier this week I gave a talk on the Mathematics of Planet Earth at the University of Southern California, and someone there recommended that I look into John Harte’s work on maximum entropy methods in ecology. He works at U.C. Berkeley.

I checked out his website and found that his goals resemble mine: save the planet and understand its ecosystems. He’s a lot further along than I am, since he comes from a long background in ecology while I’ve just recently blundered in from mathematical physics. I can’t really say what I think of his work since I’m just learning about it. But I thought I should point out its existence.

This free book is something a lot of people would find interesting:

• John and Mary Ellen Harte, Cool the Earth, Save the Economy: Solving the Climate Crisis Is EASY, 2008.

EASY? Well, it’s an acronym. Here’s the basic idea of the US-based plan described in this book:

Any proposed energy policy should include these two components:

Technical/Behavioral: What resources and technologies are to be used to supply energy? On the demand side, what technologies and lifestyle changes are being proposed to consumers?

Incentives/Economic Policy: How are the desired supply and demand options to be encouraged or forced? Here the options include taxes, subsidies, regulations, permits, research and development, and education.

And a successful energy policy should satisfy the AAA criteria:

Availability. The climate crisis will rapidly become costly to society if we do not take action expeditiously. We need to adopt now those technologies that are currently available, provided they meet the following two additional criteria:

Affordability. Because of the central role of energy in our society, its cost to consumers should not increase significantly. In fact, a successful energy policy could ultimately save consumers money.

Acceptability. All energy strategies have environmental, land use, and health and safety implications; these must be acceptable to the public. Moreover, while some interest groups will undoubtedly oppose any particular energy policy, political acceptability at a broad scale is necessary.

Our strategy for preventing climate catastrophe and achieving energy independence includes:

Energy Efficient Technology at home and at the workplace. Huge reductions in home energy use can be achieved with available technologies, including more efficient appliances such as refrigerators, water heaters, and light bulbs. Home retrofits and new home design features such as “smart” window coatings, lighter-colored roofs where there are hot summers, better home insulation, and passive solar designs can also reduce energy use. Together, energy efficiency in home and industry can save the U.S. up to approximately half of the energy currently consumed in those sectors, and at no net cost—just by making different choices. Sounds good, doesn’t it?

Automobile Fuel Efficiency. Phase in higher Corporate Average Fuel Economy (CAFE) standards for automobiles, SUVs and light trucks by requiring vehicles to go 35 miles per gallon of gas (mpg) by 2015, 45 mpg by 2020, and 60 mpg by 2030. This would rapidly wipe out our dependence on foreign oil and cut emissions from the vehicle sector by two-thirds. A combination of plug-in hybrid, lighter car body materials, re-design and other innovations could readily achieve these standards. This sounds good, too!

Solar and Wind Energy. Rooftop photovoltaic panels and solar water heating units should be phased in over the next 20 years, with the goal of solar installation on 75% of U.S. homes and commercial buildings by 2030. (Not all roofs receive sufficient sunlight to make solar panels practical for them.) Large wind farms, solar photovoltaic stations, and solar thermal stations should also be phased in so that by 2030, all U.S. electricity demand will be supplied by existing hydroelectric, existing and possibly some new nuclear, and, most importantly, new solar and wind units. This will require investment in expansion of the grid to bring the new supply to the demand, and in research and development to improve overnight storage systems. Achieving this goal would reduce our dependence on coal to practically zero. More good news!

You are part of the answer. Voting wisely for leaders who promote the first three components is one of the most important individual actions one can make. Other actions help, too. Just as molecules make up mountains, individual actions taken collectively have huge impacts. Improved driving skills, automobile maintenance, reusing and recycling, walking and biking, wearing sweaters in winter and light clothing in summer, installing timers on thermostats and insulating houses, carpooling, paying attention to energy efficiency labels on appliances, and many other simple practices and behaviors hugely influence energy consumption. A major education campaign, both in schools for youngsters and by the media for everyone, should be mounted to promote these consumer practices.

No part of EASY can be left out; all parts are closely integrated. Some parts might create much larger changes—for example, more efficient home appliances and automobiles—but all parts are essential. If, for example, we do not achieve the decrease in electricity demand that can be brought about with the E of EASY, then it is extremely doubtful that we could meet our electricity needs with the S of EASY.

It is equally urgent that once we start implementing the plan, we aggressively export it to other major emitting nations. We can reduce our own emissions all we want, but the planet will continue to warm if we can’t convince other major global emitters to reduce their emissions substantially, too.

What EASY will achieve. If no actions are taken to reduce carbon dioxide emissions, in the year 2030 the U.S. will be emitting about 2.2 billion tons of carbon in the form of carbon dioxide. This will be an increase of 25% from today’s emission rate of about 1.75 billion tons per year of carbon. By following the EASY plan, the U.S. share in a global effort to solve the climate crisis (that is, prevent catastrophic warming) will result in U.S emissions of only about 0.4 billion tons of carbon by 2030, which represents a little less than 25% of 2007 carbon dioxide emissions.128 Stated differently, the plan provides a way to eliminate 1.8 billion tons per year of carbon by that date.

We must act urgently: in the 14 months it took us to write this book, atmospheric CO2 levels rose by several billion tons of carbon, and more climatic consequences have been observed. Let’s assume that we conserve our forests and other natural carbon reservoirs at our current levels, as well as maintain our current nuclear and hydroelectric plants (or replace them with more solar and wind generators). Here’s what implementing EASY will achieve, as illustrated by Figure 3.1 on the next page.

Please check out this book and help me figure out if the numbers add up! I could also use help understanding his research, for example:

• John Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics, Oxford University Press, Oxford, 2011.

The book is not free but the first chapter is.

This paper looks really interesting too:

• J. Harte, T. Zillio, E. Conlisk and A. B. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–-2711.

Again, it’s not freely available—tut tut. Ecologists should follow physicists and make their work free online; if you’re serious about saving the planet you should let everyone know what you’re doing! However, the abstract is visible to all, and of course I can use my academic superpowers to get ahold of the paper for myself:

Abstract: The biodiversity scaling metrics widely studied in macroecology include the species-area relationship (SAR), the scale-dependent species-abundance distribution (SAD), the distribution of masses or metabolic energies of individuals within and across species, the abundance-energy or abundance-mass relationship across species, and the species-level occupancy distributions across space. We propose a theoretical framework for predicting the scaling forms of these and other metrics based on the state-variable concept and an analytical method derived from information theory. In statistical physics, a method of inference based on information entropy results in a complete macro-scale description of classical thermodynamic systems in terms of the state variables volume, temperature, and number of molecules. In analogy, we take the state variables of an ecosystem to be its total area, the total number of species within any specified taxonomic group in that area, the total number of individuals across those species, and the summed metabolic energy rate for all those individuals. In terms solely of ratios of those state variables, and without invoking any specific ecological mechanisms, we show that realistic functional forms for the macroecological metrics listed above are inferred based on information entropy. The Fisher log series SAD emerges naturally from the theory. The SAR is predicted to have negative curvature on a log-log plot, but as the ratio of the number of species to the number of individuals decreases, the SAR becomes better and better approximated by a power law, with the predicted slope z in the range of 0.14-0.20. Using the 3/4 power mass-metabolism scaling relation to relate energy requirements and measured body sizes, the Damuth scaling rule relating mass and abundance is also predicted by the theory. We argue that the predicted forms of the macroecological metrics are in reasonable agreement with the patterns observed from plant census data across habitats and spatial scales. While this is encouraging, given the absence of adjustable fitting parameters in the theory, we further argue that even small discrepancies between data and predictions can help identify ecological mechanisms that influence macroecological patterns.


The Mathematical Origin of Irreversibility

8 October, 2012

guest post by Matteo Smerlak

Introduction

Thermodynamical dissipation and adaptive evolution are two faces of the same Markovian coin!

Consider this. The Second Law of Thermodynamics states that the entropy of an isolated thermodynamic system can never decrease; Landauer’s principle maintains that the erasure of information inevitably causes dissipation; Fisher’s fundamental theorem of natural selection asserts that any fitness difference within a population leads to adaptation in an evolution process governed by natural selection. Diverse as they are, these statements have two common characteristics:

1. they express the irreversibility of certain natural phenomena, and

2. the dynamical processes underlying these phenomena involve an element of randomness.

Doesn’t this suggest to you the following question: Could it be that thermal phenomena, forgetful information processing and adaptive evolution are governed by the same stochastic mechanism?

The answer is—yes! The key to this rather profound connection resides in a universal property of Markov processes discovered recently in the context of non-equilibrium statistical mechanics, and known as the ‘fluctuation theorem’. Typically stated in terms of ‘dissipated work’ or ‘entropy production’, this result can be seen as an extension of the Second Law of Thermodynamics to small systems, where thermal fluctuations cannot be neglected. But it is actually much more than this: it is the mathematical underpinning of irreversibility itself, be it thermodynamical, evolutionary, or else. To make this point clear, let me start by giving a general formulation of the fluctuation theorem that makes no reference to physics concepts such as ‘heat’ or ‘work’.

The mathematical fact

Consider a system randomly jumping between states a, b,\dots with (possibly time-dependent) transition rates \gamma_{a b}(t) where a is the state prior to the jump, while b is the state after the jump. I’ll assume that this dynamics defines a (continuous-time) Markov process, namely that the numbers \gamma_{a b} are the matrix entries of an infinitesimal stochastic matrix, which means that its off-diagonal entries are non-negative and that its columns sum up to zero.

Now, each possible history \omega=(\omega_t)_{0\leq t\leq T} of this process can be characterized by the sequence of occupied states a_{j} and by the times \tau_{j} at which the transitions a_{j-1}\longrightarrow a_{j} occur (0\leq j\leq N):

\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T}).

Define the skewness \sigma_{j}(\tau_{j}) of each of these transitions to be the logarithmic ratio of transition rates:

\displaystyle{\sigma_{j}(\tau_{j}):=\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}

Also define the self-information of the system in state a at time t by:

i_a(t):= -\ln\pi_{a}(t)

where \pi_{a}(t) is the probability that the system is in state a at time t, given some prescribed initial distribution \pi_{a}(0). This quantity is also sometimes called the surprisal, as it measures the ‘surprise’ of finding out that the system is in state a at time t.

Then the following identity—the detailed fluctuation theorem—holds:

\mathrm{Prob}[\Delta i-\Sigma=-A] = e^{-A}\;\mathrm{Prob}[\Delta i-\Sigma=A] \;

where

\displaystyle{\Sigma:=\sum_{j}\sigma_{j}(\tau_{j})}

is the cumulative skewness along a trajectory of the system, and

\Delta i= i_{a_N}(T)-i_{a_0}(0)

is the variation of self-information between the end points of this trajectory.

This identity has an immediate consequence: if \langle\,\cdot\,\rangle denotes the average over all realizations of the process, then we have the integral fluctuation theorem:

\langle e^{-\Delta i+\Sigma}\rangle=1,

which, by the convexity of the exponential and Jensen’s inequality, implies:

\langle \Delta i\rangle=\Delta S\geq\langle\Sigma\rangle.

In short: the mean variation of self-information, aka the variation of Shannon entropy

\displaystyle{ S(t):= \sum_{a}\pi_{a}(t)i_a(t) }

is bounded from below by the mean cumulative skewness of the underlying stochastic trajectory.

This is the fundamental mathematical fact underlying irreversibility. To unravel its physical and biological consequences, it suffices to consider the origin and interpretation of the ‘skewness’ term in different contexts. (By the way, people usually call \Sigma the ‘entropy production’ or ‘dissipation function’—but how tautological is that?)

The physical and biological consequences

Consider first the standard stochastic-thermodynamic scenario where a physical system is kept in contact with a thermal reservoir at inverse temperature \beta and undergoes thermally induced transitions between states a, b,\dots. By virtue of the detailed balance condition:

\displaystyle{ e^{-\beta E_{a}(t)}\gamma_{a b}(t)=e^{-\beta E_{b}(t)}\gamma_{b a}(t),}

the skewness \sigma_{j}(\tau_{j}) of each such transition is \beta times the energy difference between the states a_{j} and a_{j-1}, namely the heat received from the reservoir during the transition. Hence, the mean cumulative skewness \langle \Sigma\rangle is nothing but \beta\langle Q\rangle, with Q the total heat received by the system along the process. It follows from the detailed fluctuation theorem that

\langle e^{-\Delta i+\beta Q}\rangle=1

and therefore

\Delta S\geq\beta\langle Q\rangle

which is of course Clausius’ inequality. In a computational context where the control parameter is the entropy variation itself (such as in a bit-erasure protocol, where \Delta S=-\ln 2), this inequality in turn expresses Landauer’s principle: it impossible to decrease the self-information of the system’s state without dissipating a minimal amount of heat into the environment (in this case -Q \geq k T\ln2, the ‘Landauer bound’). More general situations (several types of reservoirs, Maxwell-demon-like feedback controls) can be treated along the same lines, and the various forms of the Second Law derived from the detailed fluctuation theorem.

Now, many would agree that evolutionary dynamics is a wholly different business from thermodynamics; in particular, notions such as ‘heat’ or ‘temperature’ are clearly irrelevant to Darwinian evolution. However, the stochastic framework of Markov processes is relevant to describe the genetic evolution of a population, and this fact alone has important consequences. As a simple example, consider the time evolution of mutant fixations x_{a} in a population, with a ranging over the possible genotypes. In a ‘symmetric mutation scheme’, which I understand is biological parlance for ‘reversible Markov process’, meaning one that obeys detailed balance, the ratio between the a\mapsto b and b\mapsto a transition rates is completely determined by the fitnesses f_{a} and f_b of a and b, according to

\displaystyle{\frac{\gamma_{a b}}{\gamma_{b a}} =\left(\frac{f_{b}}{f_{a}}\right)^{\nu} }

where \nu is a model-dependent function of the effective population size [Sella2005]. Along a given history of mutant fixations, the cumulated skewness \Sigma is therefore given by minus the fitness flux:

\displaystyle{\Phi=\nu\sum_{j}(\ln f_{a_j}-\ln f_{a_{j-1}}).}

The integral fluctuation theorem then becomes the fitness flux theorem:

\displaystyle{ \langle e^{-\Delta i -\Phi}\rangle=1}

discussed recently by Mustonen and Lässig [Mustonen2010] and implying Fisher’s fundamental theorem of natural selection as a special case. (Incidentally, the ‘fitness flux theorem’ derived in this reference is more general than this; for instance, it does not rely on the ‘symmetric mutation scheme’ assumption above.) The ensuing inequality

\langle \Phi\rangle\geq-\Delta S

shows that a positive fitness flux is “an almost universal evolutionary principle of biological systems” [Mustonen2010], with negative contributions limited to time intervals with a systematic loss of adaptation (\Delta S > 0). This statement may well be the closest thing to a version of the Second Law of Thermodynamics applying to evolutionary dynamics.

It is really quite remarkable that thermodynamical dissipation and Darwinian evolution can be reduced to the same stochastic mechanism, and that notions such as ‘fitness flux’ and ‘heat’ can arise as two faces of the same mathematical coin, namely the ‘skewness’ of Markovian transitions. After all, the phenomenon of life is in itself a direct challenge to thermodynamics, isn’t it? When thermal phenomena tend to increase the world’s disorder, life strives to bring about and maintain exquisitely fine spatial and chemical structures—which is why Schrödinger famously proposed to define life as negative entropy. Could there be a more striking confirmation of his intuition—and a reconciliation of evolution and thermodynamics in the same go—than the fundamental inequality of adaptive evolution \langle\Phi\rangle\geq-\Delta S?

Surely the detailed fluctuation theorem for Markov processes has other applications, pertaining neither to thermodynamics nor adaptive evolution. Can you think of any?

Proof of the fluctuation theorem

I am a physicist, but knowing that many readers of John’s blog are mathematicians, I’ll do my best to frame—and prove—the FT as an actual theorem.

Let (\Omega,\mathcal{T},p) be a probability space and (\,\cdot\,)^{\dagger}=\Omega\to \Omega a measurable involution of \Omega. Denote p^{\dagger} the pushforward probability measure through this involution, and

\displaystyle{ R=\ln \frac{d p}{d p^\dagger} }

the logarithm of the corresponding Radon-Nikodym derivative (we assume p^\dagger and p are mutually absolutely continuous). Then the following lemmas are true, with (1)\Rightarrow(2)\Rightarrow(3):

Lemma 1. The detailed fluctuation relation:

\forall A\in\mathbb{R} \quad  p\big(R^{-1}(-A) \big)=e^{-A}p \big(R^{-1}(A) \big)

Lemma 2. The integral fluctuation relation:

\displaystyle{\int_{\Omega} d p(\omega)\,e^{-R(\omega)}=1 }

Lemma 3. The positivity of the Kullback-Leibler divergence:

D(p\,\Vert\, p^{\dagger}):=\int_{\Omega} d p(\omega)\,R(\omega)\geq 0.

These are basic facts which anyone can show: (2)\Rightarrow(3) by Jensen’s inequality, (1)\Rightarrow(2) trivially, and (1) follows from R(\omega^{\dagger})=-R(\omega) and the change of variables theorem, as follows,

\begin{array}{ccl} \displaystyle{ \int_{R^{-1}(-A)} d p(\omega)} &=& \displaystyle{ \int_{R^{-1}(A)}d p^{\dagger}(\omega) } \\ \\ &=& \displaystyle{ \int_{R^{-1}(A)} d p(\omega)\, e^{-R(\omega)} } \\ \\ &=& \displaystyle{ e^{-A} \int_{R^{-1}(A)} d p(\omega)} .\end{array}

But here is the beauty: if

(\Omega,\mathcal{T},p) is actually a Markov process defined over some time interval [0,T] and valued in some (say discrete) state space \Sigma, with the instantaneous probability \pi_{a}(t)=p\big(\{\omega_{t}=a\} \big) of each state a\in\Sigma satisfying the master equation (aka Kolmogorov equation)

\displaystyle{ \frac{d\pi_{a}(t)}{dt}=\sum_{b\neq a}\Big(\gamma_{b a}(t)\pi_{a}(t)-\gamma_{a b}(t)\pi_{b}(t)\Big),}

and

• the dagger involution is time-reversal, that is \omega^{\dagger}_{t}:=\omega_{T-t},

then for a given path

\displaystyle{\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T})\in\Omega}

the logarithmic ratio R(\omega) decomposes into ‘variation of self-information’ and ‘cumulative skewness’ along \omega:

\displaystyle{ R(\omega)=\underbrace{\Big(\ln\pi_{a_0}(0)-\ln\pi_{a_N}(T) \Big)}_{\Delta i(\omega)}-\underbrace{\sum_{j=1}^{N}\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}_{\Sigma(\omega)}.}

This is easy to see if one writes the probability of a path explicitly as

\displaystyle{p(\omega)=\pi_{a_{0}}(0)\left[\prod_{j=1}^{N}\phi_{a_{j-1}}(\tau_{j-1},\tau_{j})\gamma_{a_{j-1}a_{j}}(\tau_{j})\right]\phi_{a_{N}}(\tau_{N},T)}

where

\displaystyle{ \phi_{a}(\tau,\tau')=\phi_{a}(\tau',\tau)=\exp\Big(-\sum_{b\neq a}\int_{\tau}^{\tau'}dt\, \gamma_{a b}(t)\Big)}

is the probability that the process remains in the state a between the times \tau and \tau'. It follows from the above lemma that

Theorem. Let (\Omega,\mathcal{T},p) be a Markov process and let i,\Sigma:\Omega\rightarrow \mathbb{R} be defined as above. Then we have

1. The detailed fluctuation theorem:

\forall A\in\mathbb{R}, p\big((\Delta i-\Sigma)^{-1}(-A) \big)=e^{-A}p \big((\Delta i-\Sigma)^{-1}(A) \big)

2. The integral fluctuation theorem:

\int_{\Omega} d p(\omega)\,e^{-\Delta i(\omega)+\Sigma(\omega)}=1

3. The ‘Second Law’ inequality:

\displaystyle{ \Delta S:=\int_{\Omega} d p(\omega)\,\Delta i(\omega)\geq \int_{\Omega} d p(\omega)\,\Sigma(\omega)}

The same theorem can be formulated for other kinds of Markov processes as well, including diffusion processes (in which case it follows from the Girsanov theorem).

References

Landauer’s principle was introduced here:

• [Landauer1961] R. Landauer, Irreversibility and heat generation in the computing process}, IBM Journal of Research and Development 5, (1961) 183–191.

and is now being verified experimentally by various groups worldwide.

The ‘fundamental theorem of natural selection’ was derived by Fisher in his book:

• [Fisher1930] R. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.

His derivation has long been considered obscure, even perhaps wrong, but apparently the theorem is now well accepted. I believe the first Markovian models of genetic evolution appeared here:

• [Fisher1922] R. A. Fisher, On the dominance ratio, Proc. Roy. Soc. Edinb. 42 (1922), 321–341.

• [Wright1931] S. Wright, Evolution in Mendelian populations, Genetics 16 (1931), 97–159.

Fluctuation theorems are reviewed here:

• [Sevick2008] E. Sevick, R. Prabhakar, S. R. Williams, and D. J. Searles, Fluctuation theorems, Ann. Rev. Phys. Chem. 59 (2008), 603–633.

Two of the key ideas for the ‘detailed fluctuation theorem’ discussed here are due to Crooks:

• [Crooks1999] Gavin Crooks, The entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences, Phys. Rev. E 60 (1999), 2721–2726.

who identified (E_{a}(\tau_{j})-E_{a}(\tau_{j-1})) as heat, and Seifert:

• [Seifert2005] Udo Seifert, Entropy production along a stochastic trajectory and an integral fluctuation theorem, Phys. Rev. Lett. 95 (2005), 4.

who understood the relevance of the self-information in this context.

The connection between statistical physics and evolutionary biology is discussed here:

• [Sella2005] G. Sella and A.E. Hirsh, The application of statistical physics to evolutionary biology, Proc. Nat. Acad. Sci. USA 102 (2005), 9541–9546.

and the ‘fitness flux theorem’ is derived in

• [Mustonen2010] V. Mustonen and M. Lässig, Fitness flux and ubiquity of adaptive evolution, Proc. Nat. Acad. Sci. USA 107 (2010), 4248–4253.

Schrödinger’s famous discussion of the physical nature of life was published here:

• [Schrödinger1944] E. Schrödinger, What is Life?, Cambridge University Press, Cambridge, 1944.


An Entropy Challenge

29 August, 2012

If you like computer calculations, here’s a little challenge for you. Oscar Dahlsten may have solved it, but we’d love for you to check his work. It’s pretty important for the foundations of thermodynamics, but you don’t need to know any physics or even anything beyond a little algebra to tackle it! First I’ll explain it in really simple terms, then I’ll remind you a bit of why it matters.

We’re looking for two lists of nonnegative numbers, of the same length, listed in decreasing order:

p_1 \ge p_2 \ge \cdots \ge p_n \ge 0

q_1 \ge q_2 \ge \cdots \ge q_n \ge 0

that sum to 1:

p_1 + \cdots + p_n = 1

q_1 + \cdots + q_n = 1

and that obey this inequality:

\displaystyle{ \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta  \le  \frac{1}{1 - \beta} \ln \sum_{i=1}^n q_i^\beta }

for all 0 < \beta < \infty (ignoring \beta = 1), yet do not obey these inequalities:

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n.

Oscar’s proposed solution is this:

p = (0.4, 0.29, 0.29, 0.02)

q = (0.39, 0.31, 0.2, 0.1)

Can you see if this works? Is there a simpler example, like one with lists of just 3 numbers?

This question came up near the end of my post More Second Laws of Thermodynamics. I phrased the question with a bit more jargon, and said a lot more about its significance. Suppose we have two probability distributions on a finite set, say p and q. We say p majorizes q if

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n, when we write both lists of numbers in decreasing order. This means p is ‘less flat’ than q, so it should have less entropy. And indeed it does: not just for ordinary entropy, but also for Rényi entropy! The Rényi entropy of p is defined by

\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta  }

where 0 < \beta < 1 or 1 < \beta < \infty. We can also define Rényi entropy for \beta = 0, 1, \infty by taking a limit, and at \beta = 1 we get the ordinary entropy

\displaystyle{ H_1(p) = - \sum_{i = 1}^n p_i \ln (p_i) }

The question is whether majorization is more powerful than Rényi entropy as a tool to to tell when one probability distribution is less flat than another. I know that if p majorizes q, its Rényi entropy is less than than that of q for all 0 \le \beta \le \infty. Your mission, should you choose to accept it, is to show the converse is not true.


More Second Laws of Thermodynamics

24 August, 2012

Oscar Dahlsten is visiting the Centre for Quantum Technologies, so we’re continuing some conversations about entropy that we started last year, back when the Entropy Club was active. But now Jamie Vicary and Brendan Fong are involved in the conversations.

I was surprised when Oscar told me that for a large class of random processes, the usual second law of thermodynamics is just one of infinitely many laws saying that various kinds of disorder increase. I’m annoyed that nobody ever told me about this before! It’s as if they told me about conservation of energy but not conservation of schmenergy, and phlenergy, and zenergy

So I need to tell you about this. You may not understand it, but at least I can say I tried. I don’t want you blaming me for concealing all these extra second laws of thermodynamics!

Here’s the basic idea. Not all random processes are guaranteed to make entropy increase. But a bunch of them always make probability distributions flatter in a certain precise sense. This makes the entropy of the probability distribution increase. But when you make a probability distribution flatter in this sense, a bunch of other quantities increase too! For example, besides the usual entropy, there are infinitely many other kinds of entropy, called ‘Rényi entropies’, one for each number between 0 and ∞. And a doubly stochastic operator makes all the Rényi entropies increase! This fact is a special case of Theorem 10 here:

• Tim van Erven and Peter Harremoës, Rényi divergence and majorization.

Let me state this fact precisely, and then say a word about how this is related to quantum theory and ‘the collapse of the wavefunction’.

To keep things simple let’s talk about probability distributions on a finite set, though Erven and Harremoës generalize it all to a measure space.

How do we make precise the concept that one probability distribution is flatter than another? You know it when you see it, at least some of the time. For example, suppose I have some system in thermal equilibrium at some temperature, and the probabilities of it being in various states look like this:

Then say I triple the temperature. The probabilities flatten out:

But how can we make this concept precise in a completely general way? We can do it using the concept of ‘majorization’. If one probability distribution is less flat than another, people say it ‘majorizes’ that other one.

Here’s the definition. Say we have two probability distributions p and q on the same set. For each one, list the probabilities in decreasing order:

p_1 \ge p_2 \ge \cdots \ge p_n

q_1 \ge q_2 \ge \cdots \ge q_n

Then we say p majorizes q if

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n. So, the idea is that the biggest probabilities in the distribution p add up to more than the corresponding biggest ones in q.

In 1960, Alfred Rényi defined a generalization of the usual Shannon entropy that depends on a parameter \beta. If p is a probability distribution on a finite set, its Rényi entropy of order \beta is defined to be

\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_i p_i^\beta }

where 0 \le \beta < \infty. Well, to be honest: if \beta is 0, 1, or \infty we have to define this by taking a limit where we let \beta creep up to that value. But the limit exists, and when \beta = 1 we get the usual Shannon entropy

\displaystyle{ H_1(p) = - \sum_i p_i \ln(p_i) }

As I explained a while ago, Rényi entropies are important ways of measuring biodiversity. But here’s what I learned just now, from the paper by Erven and Harremoës:

Theorem 1. If a probability distribution p majorizes a probability distribution q, its Rényi entropies are smaller:

\displaystyle{ H_\beta(p) \le H_\beta(q) }

for all 0 \le \beta < \infty.

And here’s what makes this fact so nice. If you do something to a classical system in a way that might involve some randomness, we can describe your action using a stochastic matrix. An n \times n matrix T is called stochastic if whenever p \in \mathbb{R}^n is a probability distribution, so is T p. This is equivalent to saying:

• the matrix entries of T are all \ge 0, and

• each column of T sums to 1.

If T is stochastic, it’s not necessarily true that the entropy of T p is greater than or equal to that of p, not even for the Shannon entropy.

Puzzle 1. Find a counterexample.

However, entropy does increase if we use specially nice stochastic matrices called ‘doubly stochastic’ matrices. People say a matrix T doubly stochastic if it’s stochastic and it maps the probability distribution

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

to itself. This is the most spread-out probability distribution of all: every other probability distribution majorizes this one.

Why do they call such matrices ‘doubly’ stochastic? Well, if you’ve got a stochastic matrix, each column sums to one. But a stochastic operator is doubly stochastic if and only if each row sums to 1 as well.

Here’s a really cool fact:

Theorem 2. If T is doubly stochastic, p majorizes T p for any probability distribution p \in \mathbb{R}^n. Conversely, if a probability distribution p majorizes a probability distribution q, then q = T p for some doubly stochastic matrix T.

Taken together, Theorems 1 and 2 say that doubly stochastic transformations increase entropy… but not just Shannon entropy! They increase all the different Rényi entropies, as well. So if time evolution is described by a doubly stochastic matrix, we get lots of ‘second laws of thermodynamics’, saying that all these different kinds of entropy increase!

Finally, what does all this have to do with quantum mechanics, and collapsing the wavefunction? There are different things to say, but this is the simplest:

Theorem 3. Given two probability distributions p, q \in \mathbb{R}^n, then p majorizes q if and only there exists a self-adjoint matrix D with eigenvalues p_i and diagonal entries q_i.

The matrix D will be a density matrix: a self-adjoint matrix with positive eigenvalues and trace equal to 1. We use such matrices to describe mixed states in quantum mechanics.

Theorem 3 gives a precise sense in which preparing a quantum system in some state, letting time evolve, and then measuring it ‘increases randomness’.

How? Well, suppose we have a quantum system whose Hilbert space is \mathbb{C}^n. If we prepare the system in a mixture of the standard basis states with probabilities p_i, we can describe it with a diagonal density matrix D_0. Then suppose we wait a while and some unitary time evolution occurs. The system is now described by a new density matrix

D = U D_0 \, U^{-1}

where U is some unitary operator. If we then do a measurement to see which of the standard basis states our system now lies in, we’ll get the different possible results with probabilities q_i, the diagonal entries of D. But the eigenvalues of D will still be the numbers p_i. So, by the theorem, p majorizes q!

So, not only Shannon entropy but also all the Rényi entropies will increase!

Of course, there are some big physics questions lurking here. Like: what about the real world? In the real world, do lots of different kinds of entropy tend to increase, or just some?

Of course, there’s a huge famous old problem about how reversible time evolution can be compatible with any sort of law saying that entropy must always increase! Still, there are some arguments, going back to Boltzmann’s H-theorem, which show entropy increases under some extra conditions. So then we can ask if other kinds of entropy, like Rényi entropy, increase as well. This will be true whenever we can argue that time evolution is described by doubly stochastic matrices. Theorem 3 gives a partial answer, but there’s probably much more to say.

I don’t have much more to say right now, though. I’ll just point out that while doubly stochastic matrices map the ‘maximally smeared-out’ probability distribution

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

to itself, a lot of this theory generalizes to stochastic matrices that map exactly one other probability distribution to itself. We need to work with relative Rényi entropy instead of Rényi entropy, and so on, but I don’t think these adjustments are really a big deal. And there are nice theorems that let you know when a stochastic matrix maps exactly one probability distribution to itself, based on the Perron–Frobenius theorem.

References

I already gave you a reference for Theorem 1, namely the paper by Erven and Harremoës, though I don’t think they were the first to prove this particular result: they generalize it quite a lot.

What about Theorem 2? It goes back at least to here:

• Barry C. Arnold, Majorization and the Lorenz Order: A Brief Introduction, Springer Lecture Notes in Statistics 43, Springer, Berlin, 1987.

The partial order on probability distributions given by majorization is also called the ‘Lorenz order’, but mainly when we consider probability distributions on infinite sets. This name presumably comes from the Lorenz curve, a measure of income inequality. This curve shows for the bottom x% of households, what percentage y% of the total income they have:

Puzzle 2. If you’ve got two different probability distributions of incomes, and one majorizes the other, how are their Lorenz curves related?

When we generalize majorization by letting some other probability distribution take the place of

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

it seems people call it the ‘Markov order’. Here’s a really fascinating paper on that, which I’m just barely beginning to understand:

• A. N. Gorban, P. A. Gorban and G. Judge, Entropy: the Markov ordering approach, Entropy 12 (2010), 1145–1193.

What about Theorem 3? Apparently it goes back to here:

• A. Uhlmann, Wiss. Z. Karl-Marx-Univ. Leipzig 20 (1971), 633.

though I only know this thanks to a more recent paper:

• Michael A. Nielsen, Conditions for a class of entanglement transformations, Phys. Rev. Lett. 83 (1999), 436–439.

By the way, Nielsen’s paper contains another very nice result about majorization! Suppose you have states \psi and \phi of a 2-part quantum system. You can trace out one part and get density matrices describing mixed states of the other part, say D_\psi and D_\phi. Then Nielsen shows you can get from \psi to \phi using ‘local operations and classical communication’ if and only if D_\phi majorizes D_\psi. Note that things are going backwards here compared to how they’ve been going in the rest of this post: if we can get from \psi to \phi, then all forms of entropy go down when we go from D_\psi to D_\phi! This ‘anti-second-law’ behavior is confusing at first, but familiar to me by now.

When I first learned all this stuff, I naturally thought of the following question—maybe you did too, just now. If p, q \in \mathbb{R}^n are probability distributions and

\displaystyle{ H_\beta(p) \le H_\beta(q) }

for all 0 \le \beta < \infty, is it true that p majorizes q?

Apparently the answer must be no, because Klimesh has gone to quite a bit of work to obtain a weaker conclusion: not that p majorizes q, but that p \otimes r majorizes q \otimes r for some probability distribution r \in \mathbb{R}^m. He calls this catalytic majorization, with r serving as a ‘catalyst’:

• Matthew Klimesh, Inequalities that collectively completely characterizes the catalytic majorization relation.

I thank Vlatko Vedral here at the CQT for pointing this out!

Finally, here is a good general introduction to majorization, pointed out by Vasileios Anagnostopoulos:

• T. Ando, Majorization, doubly stochastic matrices, and comparison of eigenvalues, Linear Algebra and its Applications 118 (1989), 163-–248.


Follow

Get every new post delivered to your Inbox.

Join 3,095 other followers