Maximum Entropy and Ecology

21 February, 2013

I already talked about John Harte’s book on how to stop global warming. Since I’m trying to apply information theory and thermodynamics to ecology, I was also interested in this book of his:

John Harte, Maximum Entropy and Ecology, Oxford U. Press, Oxford, 2011.

There’s a lot in this book, and I haven’t absorbed it all, but let me try to briefly summarize his maximum entropy theory of ecology. This aims to be “a comprehensive, parsimonious, and testable theory of the distribution, abundance, and energetics of species across spatial scales”. One great thing is that he makes quantitative predictions using this theory and compares them to a lot of real-world data. But let me just tell you about the theory.

It’s heavily based on the principle of maximum entropy (MaxEnt for short), and there are two parts:

Two MaxEnt calculations are at the core of the theory: the first yields all the metrics that describe abundance and energy distributions, and the second describes the spatial scaling properties of species’ distributions.

Abundance and energy distributions

The first part of Harte’s theory is all about a conditional probability distribution

R(n,\epsilon | S_0, N_0, E_0)

which he calls the ecosystem structure function. Here:

S_0: the total number of species under consideration in some area.

N_0: the total number of individuals under consideration in that area.

E_0: the total rate of metabolic energy consumption of all these individuals.

Given this,

R(n,\epsilon | S_0, N_0, E_0) \, d \epsilon

is the probability that given S_0, N_0, E_0, if a species is picked from the collection of species, then it has n individuals, and if an individual is picked at random from that species, then its rate of metabolic energy consumption is in the interval (\epsilon, \epsilon + d \epsilon).

Here of course d \epsilon is ‘infinitesimal’, meaning that we take a limit where it goes to zero to make this idea precise (if we’re doing analytical work) or take it to be very small (if we’re estimating R from data).

I believe that when we ‘pick a species’ we’re treating them all as equally probable, not weighting them according to their number of individuals.

Clearly R obeys some constraints. First, since it’s a probability distribution, it obeys the normalization condition:

\displaystyle{ \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) = 1 }

Second, since the average number of individuals per species is N_0/S_0, we have:

\displaystyle{ \sum_n \int d \epsilon \; n R(n,\epsilon | S_0, N_0, E_0) = N_0 / S_0 }

Third, since the average over species of the total rate of metabolic energy consumption of individuals within the species is E_0/ S_0, we have:

\displaystyle{ \sum_n \int d \epsilon \; n \epsilon R(n,\epsilon | S_0, N_0, E_0) = E_0 / S_0 }

Harte’s theory is that R maximizes entropy subject to these three constraints. Here entropy is defined by

\displaystyle{ - \sum_n \int d \epsilon \; R(n,\epsilon | S_0, N_0, E_0) \ln(R(n,\epsilon | S_0, N_0, E_0)) }

Harte uses this theory to calculate R, and tests the results against data from about 20 ecosystems. For example, he predicts the abundance of species as a function of their rank, with rank 1 being the most abundant, rank 2 being the second most abundant, and so on. And he gets results like this:

The data here are from:

• Green, Harte, and Ostling’s work on a serpentine grassland,

• Luquillo’s work on a 10.24-hectare tropical forest, and

• Cocoli’s work on a 2-hectare wet tropical forest.

The fit looks good to me… but I should emphasize that I haven’t had time to study these matters in detail. For more, you can read this paper, at least if your institution subscribes to this journal:

• J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

Spatial abundance distribution

The second part of Harte’s theory is all about a conditional probability distribution

\Pi(n | A, n_0, A_0)

This is the probability that n individuals of a species are found in a region of area A given that it has n_0 individuals in a larger region of area A_0.

\Pi obeys two constraints. First, since it’s a probability distribution, it obeys the normalization condition:

\displaystyle{ \sum_n  \Pi(n | A, n_0, A_0) = 1 }

Second, since the mean value of n across regions of area A equals n_0 A/A_0, we have

\displaystyle{ \sum_n n \Pi(n | A, n_0, A_0) = n_0 A/A_0 }

Harte’s theory is that \Pi maximizes entropy subject to these two constraints. Here entropy is defined by

\displaystyle{- \sum_n  \Pi(n | A, n_0, A_0)\ln(\Pi(n | A, n_0, A_0)) }

Harte explains two approaches to use this idea to derive ‘scaling laws’ for how n varies with n. And again, he compares his predictions to real-world data, and get results that look good to my (amateur, hasty) eye!

I hope sometime I can dig deeper into this subject. Do you have any ideas, or knowledge about this stuff?


Network Theory for Economists

15 January, 2013

Tomorrow I’m giving a talk in the econometrics seminar at U.C. Riverside. I was invited to speak on my work on network theory, so I don’t feel too bad about the fact that I’ll be saying only a little about economics and practically nothing about econometrics. Still, I’ve tried to slant the talk in a way that emphasizes possible applications to economics and game theory. Here are the slides:

Network Theory.

For long-time readers here the fun comes near the end. I explain how reaction networks can be used to describe evolutionary games. I point out that in certain classes of evolutionary games, evolution tends to increase ‘fitness’, and/or lead the players to a ‘Nash equilibrium’. For precise theorems you’ll have to click the links in my talk and read the references!

I conclude with an example: a game with three strategies and 7 Nash equilibria. Here evolution makes the proportion of these three strategies follow these flow lines, at least in the limit of large numbers of players:

This picture is from William Sandholm’s nice expository paper:

• William H. Sandholm, Evolutionary game theory, 2007.

I mentioned it before in Information Geometry (Part 12), en route to showing a proof that some quantity always decreases in a class of evolutionary games. Sometime I want to tell the whole story linking:

reaction networks
evolutionary games
the 2nd law of thermodynamics

and

Fisher’s fundamental theorem of natural selection.

But not today! Think of these talk slides as a little appetizer.


John Harte

27 October, 2012

Earlier this week I gave a talk on the Mathematics of Planet Earth at the University of Southern California, and someone there recommended that I look into John Harte’s work on maximum entropy methods in ecology. He works at U.C. Berkeley.

I checked out his website and found that his goals resemble mine: save the planet and understand its ecosystems. He’s a lot further along than I am, since he comes from a long background in ecology while I’ve just recently blundered in from mathematical physics. I can’t really say what I think of his work since I’m just learning about it. But I thought I should point out its existence.

This free book is something a lot of people would find interesting:

• John and Mary Ellen Harte, Cool the Earth, Save the Economy: Solving the Climate Crisis Is EASY, 2008.

EASY? Well, it’s an acronym. Here’s the basic idea of the US-based plan described in this book:

Any proposed energy policy should include these two components:

Technical/Behavioral: What resources and technologies are to be used to supply energy? On the demand side, what technologies and lifestyle changes are being proposed to consumers?

Incentives/Economic Policy: How are the desired supply and demand options to be encouraged or forced? Here the options include taxes, subsidies, regulations, permits, research and development, and education.

And a successful energy policy should satisfy the AAA criteria:

Availability. The climate crisis will rapidly become costly to society if we do not take action expeditiously. We need to adopt now those technologies that are currently available, provided they meet the following two additional criteria:

Affordability. Because of the central role of energy in our society, its cost to consumers should not increase significantly. In fact, a successful energy policy could ultimately save consumers money.

Acceptability. All energy strategies have environmental, land use, and health and safety implications; these must be acceptable to the public. Moreover, while some interest groups will undoubtedly oppose any particular energy policy, political acceptability at a broad scale is necessary.

Our strategy for preventing climate catastrophe and achieving energy independence includes:

Energy Efficient Technology at home and at the workplace. Huge reductions in home energy use can be achieved with available technologies, including more efficient appliances such as refrigerators, water heaters, and light bulbs. Home retrofits and new home design features such as “smart” window coatings, lighter-colored roofs where there are hot summers, better home insulation, and passive solar designs can also reduce energy use. Together, energy efficiency in home and industry can save the U.S. up to approximately half of the energy currently consumed in those sectors, and at no net cost—just by making different choices. Sounds good, doesn’t it?

Automobile Fuel Efficiency. Phase in higher Corporate Average Fuel Economy (CAFE) standards for automobiles, SUVs and light trucks by requiring vehicles to go 35 miles per gallon of gas (mpg) by 2015, 45 mpg by 2020, and 60 mpg by 2030. This would rapidly wipe out our dependence on foreign oil and cut emissions from the vehicle sector by two-thirds. A combination of plug-in hybrid, lighter car body materials, re-design and other innovations could readily achieve these standards. This sounds good, too!

Solar and Wind Energy. Rooftop photovoltaic panels and solar water heating units should be phased in over the next 20 years, with the goal of solar installation on 75% of U.S. homes and commercial buildings by 2030. (Not all roofs receive sufficient sunlight to make solar panels practical for them.) Large wind farms, solar photovoltaic stations, and solar thermal stations should also be phased in so that by 2030, all U.S. electricity demand will be supplied by existing hydroelectric, existing and possibly some new nuclear, and, most importantly, new solar and wind units. This will require investment in expansion of the grid to bring the new supply to the demand, and in research and development to improve overnight storage systems. Achieving this goal would reduce our dependence on coal to practically zero. More good news!

You are part of the answer. Voting wisely for leaders who promote the first three components is one of the most important individual actions one can make. Other actions help, too. Just as molecules make up mountains, individual actions taken collectively have huge impacts. Improved driving skills, automobile maintenance, reusing and recycling, walking and biking, wearing sweaters in winter and light clothing in summer, installing timers on thermostats and insulating houses, carpooling, paying attention to energy efficiency labels on appliances, and many other simple practices and behaviors hugely influence energy consumption. A major education campaign, both in schools for youngsters and by the media for everyone, should be mounted to promote these consumer practices.

No part of EASY can be left out; all parts are closely integrated. Some parts might create much larger changes—for example, more efficient home appliances and automobiles—but all parts are essential. If, for example, we do not achieve the decrease in electricity demand that can be brought about with the E of EASY, then it is extremely doubtful that we could meet our electricity needs with the S of EASY.

It is equally urgent that once we start implementing the plan, we aggressively export it to other major emitting nations. We can reduce our own emissions all we want, but the planet will continue to warm if we can’t convince other major global emitters to reduce their emissions substantially, too.

What EASY will achieve. If no actions are taken to reduce carbon dioxide emissions, in the year 2030 the U.S. will be emitting about 2.2 billion tons of carbon in the form of carbon dioxide. This will be an increase of 25% from today’s emission rate of about 1.75 billion tons per year of carbon. By following the EASY plan, the U.S. share in a global effort to solve the climate crisis (that is, prevent catastrophic warming) will result in U.S emissions of only about 0.4 billion tons of carbon by 2030, which represents a little less than 25% of 2007 carbon dioxide emissions.128 Stated differently, the plan provides a way to eliminate 1.8 billion tons per year of carbon by that date.

We must act urgently: in the 14 months it took us to write this book, atmospheric CO2 levels rose by several billion tons of carbon, and more climatic consequences have been observed. Let’s assume that we conserve our forests and other natural carbon reservoirs at our current levels, as well as maintain our current nuclear and hydroelectric plants (or replace them with more solar and wind generators). Here’s what implementing EASY will achieve, as illustrated by Figure 3.1 on the next page.

Please check out this book and help me figure out if the numbers add up! I could also use help understanding his research, for example:

• John Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics, Oxford University Press, Oxford, 2011.

The book is not free but the first chapter is.

This paper looks really interesting too:

• J. Harte, T. Zillio, E. Conlisk and A. B. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–-2711.

Again, it’s not freely available—tut tut. Ecologists should follow physicists and make their work free online; if you’re serious about saving the planet you should let everyone know what you’re doing! However, the abstract is visible to all, and of course I can use my academic superpowers to get ahold of the paper for myself:

Abstract: The biodiversity scaling metrics widely studied in macroecology include the species-area relationship (SAR), the scale-dependent species-abundance distribution (SAD), the distribution of masses or metabolic energies of individuals within and across species, the abundance-energy or abundance-mass relationship across species, and the species-level occupancy distributions across space. We propose a theoretical framework for predicting the scaling forms of these and other metrics based on the state-variable concept and an analytical method derived from information theory. In statistical physics, a method of inference based on information entropy results in a complete macro-scale description of classical thermodynamic systems in terms of the state variables volume, temperature, and number of molecules. In analogy, we take the state variables of an ecosystem to be its total area, the total number of species within any specified taxonomic group in that area, the total number of individuals across those species, and the summed metabolic energy rate for all those individuals. In terms solely of ratios of those state variables, and without invoking any specific ecological mechanisms, we show that realistic functional forms for the macroecological metrics listed above are inferred based on information entropy. The Fisher log series SAD emerges naturally from the theory. The SAR is predicted to have negative curvature on a log-log plot, but as the ratio of the number of species to the number of individuals decreases, the SAR becomes better and better approximated by a power law, with the predicted slope z in the range of 0.14-0.20. Using the 3/4 power mass-metabolism scaling relation to relate energy requirements and measured body sizes, the Damuth scaling rule relating mass and abundance is also predicted by the theory. We argue that the predicted forms of the macroecological metrics are in reasonable agreement with the patterns observed from plant census data across habitats and spatial scales. While this is encouraging, given the absence of adjustable fitting parameters in the theory, we further argue that even small discrepancies between data and predictions can help identify ecological mechanisms that influence macroecological patterns.


The Mathematical Origin of Irreversibility

8 October, 2012

guest post by Matteo Smerlak

Introduction

Thermodynamical dissipation and adaptive evolution are two faces of the same Markovian coin!

Consider this. The Second Law of Thermodynamics states that the entropy of an isolated thermodynamic system can never decrease; Landauer’s principle maintains that the erasure of information inevitably causes dissipation; Fisher’s fundamental theorem of natural selection asserts that any fitness difference within a population leads to adaptation in an evolution process governed by natural selection. Diverse as they are, these statements have two common characteristics:

1. they express the irreversibility of certain natural phenomena, and

2. the dynamical processes underlying these phenomena involve an element of randomness.

Doesn’t this suggest to you the following question: Could it be that thermal phenomena, forgetful information processing and adaptive evolution are governed by the same stochastic mechanism?

The answer is—yes! The key to this rather profound connection resides in a universal property of Markov processes discovered recently in the context of non-equilibrium statistical mechanics, and known as the ‘fluctuation theorem’. Typically stated in terms of ‘dissipated work’ or ‘entropy production’, this result can be seen as an extension of the Second Law of Thermodynamics to small systems, where thermal fluctuations cannot be neglected. But it is actually much more than this: it is the mathematical underpinning of irreversibility itself, be it thermodynamical, evolutionary, or else. To make this point clear, let me start by giving a general formulation of the fluctuation theorem that makes no reference to physics concepts such as ‘heat’ or ‘work’.

The mathematical fact

Consider a system randomly jumping between states a, b,\dots with (possibly time-dependent) transition rates \gamma_{a b}(t) where a is the state prior to the jump, while b is the state after the jump. I’ll assume that this dynamics defines a (continuous-time) Markov process, namely that the numbers \gamma_{a b} are the matrix entries of an infinitesimal stochastic matrix, which means that its off-diagonal entries are non-negative and that its columns sum up to zero.

Now, each possible history \omega=(\omega_t)_{0\leq t\leq T} of this process can be characterized by the sequence of occupied states a_{j} and by the times \tau_{j} at which the transitions a_{j-1}\longrightarrow a_{j} occur (0\leq j\leq N):

\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T}).

Define the skewness \sigma_{j}(\tau_{j}) of each of these transitions to be the logarithmic ratio of transition rates:

\displaystyle{\sigma_{j}(\tau_{j}):=\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}

Also define the self-information of the system in state a at time t by:

i_a(t):= -\ln\pi_{a}(t)

where \pi_{a}(t) is the probability that the system is in state a at time t, given some prescribed initial distribution \pi_{a}(0). This quantity is also sometimes called the surprisal, as it measures the ‘surprise’ of finding out that the system is in state a at time t.

Then the following identity—the detailed fluctuation theorem—holds:

\mathrm{Prob}[\Delta i-\Sigma=-A] = e^{-A}\;\mathrm{Prob}[\Delta i-\Sigma=A]

where

\displaystyle{\Sigma:=\sum_{j}\sigma_{j}(\tau_{j})}

is the cumulative skewness along a trajectory of the system, and

\Delta i= i_{a_N}(T)-i_{a_0}(0)

is the variation of self-information between the end points of this trajectory.

This identity has an immediate consequence: if \langle\,\cdot\,\rangle denotes the average over all realizations of the process, then we have the integral fluctuation theorem:

\langle e^{-\Delta i+\Sigma}\rangle=1,

which, by the convexity of the exponential and Jensen’s inequality, implies:

\langle \Delta i\rangle=\Delta S\geq\langle\Sigma\rangle.

In short: the mean variation of self-information, aka the variation of Shannon entropy

\displaystyle{ S(t):= \sum_{a}\pi_{a}(t)i_a(t) }

is bounded from below by the mean cumulative skewness of the underlying stochastic trajectory.

This is the fundamental mathematical fact underlying irreversibility. To unravel its physical and biological consequences, it suffices to consider the origin and interpretation of the ‘skewness’ term in different contexts. (By the way, people usually call \Sigma the ‘entropy production’ or ‘dissipation function’—but how tautological is that?)

The physical and biological consequences

Consider first the standard stochastic-thermodynamic scenario where a physical system is kept in contact with a thermal reservoir at inverse temperature \beta and undergoes thermally induced transitions between states a, b,\dots. By virtue of the detailed balance condition:

\displaystyle{ e^{-\beta E_{a}(t)}\gamma_{a b}(t)=e^{-\beta E_{b}(t)}\gamma_{b a}(t),}

the skewness \sigma_{j}(\tau_{j}) of each such transition is \beta times the energy difference between the states a_{j} and a_{j-1}, namely the heat received from the reservoir during the transition. Hence, the mean cumulative skewness \langle \Sigma\rangle is nothing but \beta\langle Q\rangle, with Q the total heat received by the system along the process. It follows from the detailed fluctuation theorem that

\langle e^{-\Delta i+\beta Q}\rangle=1

and therefore

\Delta S\geq\beta\langle Q\rangle

which is of course Clausius’ inequality. In a computational context where the control parameter is the entropy variation itself (such as in a bit-erasure protocol, where \Delta S=-\ln 2), this inequality in turn expresses Landauer’s principle: it impossible to decrease the self-information of the system’s state without dissipating a minimal amount of heat into the environment (in this case -Q \geq k T\ln2, the ‘Landauer bound’). More general situations (several types of reservoirs, Maxwell-demon-like feedback controls) can be treated along the same lines, and the various forms of the Second Law derived from the detailed fluctuation theorem.

Now, many would agree that evolutionary dynamics is a wholly different business from thermodynamics; in particular, notions such as ‘heat’ or ‘temperature’ are clearly irrelevant to Darwinian evolution. However, the stochastic framework of Markov processes is relevant to describe the genetic evolution of a population, and this fact alone has important consequences. As a simple example, consider the time evolution of mutant fixations x_{a} in a population, with a ranging over the possible genotypes. In a ‘symmetric mutation scheme’, which I understand is biological parlance for ‘reversible Markov process’, meaning one that obeys detailed balance, the ratio between the a\mapsto b and b\mapsto a transition rates is completely determined by the fitnesses f_{a} and f_b of a and b, according to

\displaystyle{\frac{\gamma_{a b}}{\gamma_{b a}} =\left(\frac{f_{b}}{f_{a}}\right)^{\nu} }

where \nu is a model-dependent function of the effective population size [Sella2005]. Along a given history of mutant fixations, the cumulated skewness \Sigma is therefore given by minus the fitness flux:

\displaystyle{\Phi=\nu\sum_{j}(\ln f_{a_j}-\ln f_{a_{j-1}}).}

The integral fluctuation theorem then becomes the fitness flux theorem:

\displaystyle{ \langle e^{-\Delta i -\Phi}\rangle=1}

discussed recently by Mustonen and Lässig [Mustonen2010] and implying Fisher’s fundamental theorem of natural selection as a special case. (Incidentally, the ‘fitness flux theorem’ derived in this reference is more general than this; for instance, it does not rely on the ‘symmetric mutation scheme’ assumption above.) The ensuing inequality

\langle \Phi\rangle\geq-\Delta S

shows that a positive fitness flux is “an almost universal evolutionary principle of biological systems” [Mustonen2010], with negative contributions limited to time intervals with a systematic loss of adaptation (\Delta S > 0). This statement may well be the closest thing to a version of the Second Law of Thermodynamics applying to evolutionary dynamics.

It is really quite remarkable that thermodynamical dissipation and Darwinian evolution can be reduced to the same stochastic mechanism, and that notions such as ‘fitness flux’ and ‘heat’ can arise as two faces of the same mathematical coin, namely the ‘skewness’ of Markovian transitions. After all, the phenomenon of life is in itself a direct challenge to thermodynamics, isn’t it? When thermal phenomena tend to increase the world’s disorder, life strives to bring about and maintain exquisitely fine spatial and chemical structures—which is why Schrödinger famously proposed to define life as negative entropy. Could there be a more striking confirmation of his intuition—and a reconciliation of evolution and thermodynamics in the same go—than the fundamental inequality of adaptive evolution \langle\Phi\rangle\geq-\Delta S?

Surely the detailed fluctuation theorem for Markov processes has other applications, pertaining neither to thermodynamics nor adaptive evolution. Can you think of any?

Proof of the fluctuation theorem

I am a physicist, but knowing that many readers of John’s blog are mathematicians, I’ll do my best to frame—and prove—the FT as an actual theorem.

Let (\Omega,\mathcal{T},p) be a probability space and (\,\cdot\,)^{\dagger}=\Omega\to \Omega a measurable involution of \Omega. Denote p^{\dagger} the pushforward probability measure through this involution, and

\displaystyle{ R=\ln \frac{d p}{d p^\dagger} }

the logarithm of the corresponding Radon-Nikodym derivative (we assume p^\dagger and p are mutually absolutely continuous). Then the following lemmas are true, with (1)\Rightarrow(2)\Rightarrow(3):

Lemma 1. The detailed fluctuation relation:

\forall A\in\mathbb{R} \quad  p\big(R^{-1}(-A) \big)=e^{-A}p \big(R^{-1}(A) \big)

Lemma 2. The integral fluctuation relation:

\displaystyle{\int_{\Omega} d p(\omega)\,e^{-R(\omega)}=1 }

Lemma 3. The positivity of the Kullback-Leibler divergence:

D(p\,\Vert\, p^{\dagger}):=\int_{\Omega} d p(\omega)\,R(\omega)\geq 0.

These are basic facts which anyone can show: (2)\Rightarrow(3) by Jensen’s inequality, (1)\Rightarrow(2) trivially, and (1) follows from R(\omega^{\dagger})=-R(\omega) and the change of variables theorem, as follows,

\begin{array}{ccl} \displaystyle{ \int_{R^{-1}(-A)} d p(\omega)} &=& \displaystyle{ \int_{R^{-1}(A)}d p^{\dagger}(\omega) } \\ \\ &=& \displaystyle{ \int_{R^{-1}(A)} d p(\omega)\, e^{-R(\omega)} } \\ \\ &=& \displaystyle{ e^{-A} \int_{R^{-1}(A)} d p(\omega)} .\end{array}

But here is the beauty: if

(\Omega,\mathcal{T},p) is actually a Markov process defined over some time interval [0,T] and valued in some (say discrete) state space \Sigma, with the instantaneous probability \pi_{a}(t)=p\big(\{\omega_{t}=a\} \big) of each state a\in\Sigma satisfying the master equation (aka Kolmogorov equation)

\displaystyle{ \frac{d\pi_{a}(t)}{dt}=\sum_{b\neq a}\Big(\gamma_{b a}(t)\pi_{a}(t)-\gamma_{a b}(t)\pi_{b}(t)\Big),}

and

• the dagger involution is time-reversal, that is \omega^{\dagger}_{t}:=\omega_{T-t},

then for a given path

\displaystyle{\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T})\in\Omega}

the logarithmic ratio R(\omega) decomposes into ‘variation of self-information’ and ‘cumulative skewness’ along \omega:

\displaystyle{ R(\omega)=\underbrace{\Big(\ln\pi_{a_0}(0)-\ln\pi_{a_N}(T) \Big)}_{\Delta i(\omega)}-\underbrace{\sum_{j=1}^{N}\ln\frac{\gamma_{a_{j}a_{j-1}}(\tau_{j})}{\gamma_{a_{j-1}a_{j}}(\tau_{j})}}_{\Sigma(\omega)}.}

This is easy to see if one writes the probability of a path explicitly as

\displaystyle{p(\omega)=\pi_{a_{0}}(0)\left[\prod_{j=1}^{N}\phi_{a_{j-1}}(\tau_{j-1},\tau_{j})\gamma_{a_{j-1}a_{j}}(\tau_{j})\right]\phi_{a_{N}}(\tau_{N},T)}

where

\displaystyle{ \phi_{a}(\tau,\tau')=\phi_{a}(\tau',\tau)=\exp\Big(-\sum_{b\neq a}\int_{\tau}^{\tau'}dt\, \gamma_{a b}(t)\Big)}

is the probability that the process remains in the state a between the times \tau and \tau'. It follows from the above lemma that

Theorem. Let (\Omega,\mathcal{T},p) be a Markov process and let i,\Sigma:\Omega\rightarrow \mathbb{R} be defined as above. Then we have

1. The detailed fluctuation theorem:

\forall A\in\mathbb{R}, p\big((\Delta i-\Sigma)^{-1}(-A) \big)=e^{-A}p \big((\Delta i-\Sigma)^{-1}(A) \big)

2. The integral fluctuation theorem:

\int_{\Omega} d p(\omega)\,e^{-\Delta i(\omega)+\Sigma(\omega)}=1

3. The ‘Second Law’ inequality:

\displaystyle{ \Delta S:=\int_{\Omega} d p(\omega)\,\Delta i(\omega)\geq \int_{\Omega} d p(\omega)\,\Sigma(\omega)}

The same theorem can be formulated for other kinds of Markov processes as well, including diffusion processes (in which case it follows from the Girsanov theorem).

References

Landauer’s principle was introduced here:

• [Landauer1961] R. Landauer, Irreversibility and heat generation in the computing process}, IBM Journal of Research and Development 5, (1961) 183–191.

and is now being verified experimentally by various groups worldwide.

The ‘fundamental theorem of natural selection’ was derived by Fisher in his book:

• [Fisher1930] R. Fisher, The Genetical Theory of Natural Selection, Clarendon Press, Oxford, 1930.

His derivation has long been considered obscure, even perhaps wrong, but apparently the theorem is now well accepted. I believe the first Markovian models of genetic evolution appeared here:

• [Fisher1922] R. A. Fisher, On the dominance ratio, Proc. Roy. Soc. Edinb. 42 (1922), 321–341.

• [Wright1931] S. Wright, Evolution in Mendelian populations, Genetics 16 (1931), 97–159.

Fluctuation theorems are reviewed here:

• [Sevick2008] E. Sevick, R. Prabhakar, S. R. Williams, and D. J. Searles, Fluctuation theorems, Ann. Rev. Phys. Chem. 59 (2008), 603–633.

Two of the key ideas for the ‘detailed fluctuation theorem’ discussed here are due to Crooks:

• [Crooks1999] Gavin Crooks, The entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences, Phys. Rev. E 60 (1999), 2721–2726.

who identified (E_{a}(\tau_{j})-E_{a}(\tau_{j-1})) as heat, and Seifert:

• [Seifert2005] Udo Seifert, Entropy production along a stochastic trajectory and an integral fluctuation theorem, Phys. Rev. Lett. 95 (2005), 4.

who understood the relevance of the self-information in this context.

The connection between statistical physics and evolutionary biology is discussed here:

• [Sella2005] G. Sella and A.E. Hirsh, The application of statistical physics to evolutionary biology, Proc. Nat. Acad. Sci. USA 102 (2005), 9541–9546.

and the ‘fitness flux theorem’ is derived in

• [Mustonen2010] V. Mustonen and M. Lässig, Fitness flux and ubiquity of adaptive evolution, Proc. Nat. Acad. Sci. USA 107 (2010), 4248–4253.

Schrödinger’s famous discussion of the physical nature of life was published here:

• [Schrödinger1944] E. Schrödinger, What is Life?, Cambridge University Press, Cambridge, 1944.


An Entropy Challenge

29 August, 2012

If you like computer calculations, here’s a little challenge for you. Oscar Dahlsten may have solved it, but we’d love for you to check his work. It’s pretty important for the foundations of thermodynamics, but you don’t need to know any physics or even anything beyond a little algebra to tackle it! First I’ll explain it in really simple terms, then I’ll remind you a bit of why it matters.

We’re looking for two lists of nonnegative numbers, of the same length, listed in decreasing order:

p_1 \ge p_2 \ge \cdots \ge p_n \ge 0

q_1 \ge q_2 \ge \cdots \ge q_n \ge 0

that sum to 1:

p_1 + \cdots + p_n = 1

q_1 + \cdots + q_n = 1

and that obey this inequality:

\displaystyle{ \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta  \le  \frac{1}{1 - \beta} \ln \sum_{i=1}^n q_i^\beta }

for all 0 < \beta < \infty (ignoring \beta = 1), yet do not obey these inequalities:

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n.

Oscar’s proposed solution is this:

p = (0.4, 0.29, 0.29, 0.02)

q = (0.39, 0.31, 0.2, 0.1)

Can you see if this works? Is there a simpler example, like one with lists of just 3 numbers?

This question came up near the end of my post More Second Laws of Thermodynamics. I phrased the question with a bit more jargon, and said a lot more about its significance. Suppose we have two probability distributions on a finite set, say p and q. We say p majorizes q if

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n, when we write both lists of numbers in decreasing order. This means p is ‘less flat’ than q, so it should have less entropy. And indeed it does: not just for ordinary entropy, but also for Rényi entropy! The Rényi entropy of p is defined by

\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_{i=1}^n p_i^\beta  }

where 0 < \beta < 1 or 1 < \beta < \infty. We can also define Rényi entropy for \beta = 0, 1, \infty by taking a limit, and at \beta = 1 we get the ordinary entropy

\displaystyle{ H_1(p) = - \sum_{i = 1}^n p_i \ln (p_i) }

The question is whether majorization is more powerful than Rényi entropy as a tool to to tell when one probability distribution is less flat than another. I know that if p majorizes q, its Rényi entropy is less than than that of q for all 0 \le \beta \le \infty. Your mission, should you choose to accept it, is to show the converse is not true.


More Second Laws of Thermodynamics

24 August, 2012

Oscar Dahlsten is visiting the Centre for Quantum Technologies, so we’re continuing some conversations about entropy that we started last year, back when the Entropy Club was active. But now Jamie Vicary and Brendan Fong are involved in the conversations.

I was surprised when Oscar told me that for a large class of random processes, the usual second law of thermodynamics is just one of infinitely many laws saying that various kinds of disorder increase. I’m annoyed that nobody ever told me about this before! It’s as if they told me about conservation of energy but not conservation of schmenergy, and phlenergy, and zenergy

So I need to tell you about this. You may not understand it, but at least I can say I tried. I don’t want you blaming me for concealing all these extra second laws of thermodynamics!

Here’s the basic idea. Not all random processes are guaranteed to make entropy increase. But a bunch of them always make probability distributions flatter in a certain precise sense. This makes the entropy of the probability distribution increase. But when you make a probability distribution flatter in this sense, a bunch of other quantities increase too! For example, besides the usual entropy, there are infinitely many other kinds of entropy, called ‘Rényi entropies’, one for each number between 0 and ∞. And a doubly stochastic operator makes all the Rényi entropies increase! This fact is a special case of Theorem 10 here:

• Tim van Erven and Peter Harremoës, Rényi divergence and majorization.

Let me state this fact precisely, and then say a word about how this is related to quantum theory and ‘the collapse of the wavefunction’.

To keep things simple let’s talk about probability distributions on a finite set, though Erven and Harremoës generalize it all to a measure space.

How do we make precise the concept that one probability distribution is flatter than another? You know it when you see it, at least some of the time. For example, suppose I have some system in thermal equilibrium at some temperature, and the probabilities of it being in various states look like this:

Then say I triple the temperature. The probabilities flatten out:

But how can we make this concept precise in a completely general way? We can do it using the concept of ‘majorization’. If one probability distribution is less flat than another, people say it ‘majorizes’ that other one.

Here’s the definition. Say we have two probability distributions p and q on the same set. For each one, list the probabilities in decreasing order:

p_1 \ge p_2 \ge \cdots \ge p_n

q_1 \ge q_2 \ge \cdots \ge q_n

Then we say p majorizes q if

p_1 + \cdots + p_k \ge q_1 + \cdots + q_k

for all 1 \le k \le n. So, the idea is that the biggest probabilities in the distribution p add up to more than the corresponding biggest ones in q.

In 1960, Alfred Rényi defined a generalization of the usual Shannon entropy that depends on a parameter \beta. If p is a probability distribution on a finite set, its Rényi entropy of order \beta is defined to be

\displaystyle{ H_\beta(p) = \frac{1}{1 - \beta} \ln \sum_i p_i^\beta }

where 0 \le \beta < \infty. Well, to be honest: if \beta is 0, 1, or \infty we have to define this by taking a limit where we let \beta creep up to that value. But the limit exists, and when \beta = 1 we get the usual Shannon entropy

\displaystyle{ H_1(p) = - \sum_i p_i \ln(p_i) }

As I explained a while ago, Rényi entropies are important ways of measuring biodiversity. But here’s what I learned just now, from the paper by Erven and Harremoës:

Theorem 1. If a probability distribution p majorizes a probability distribution q, its Rényi entropies are smaller:

\displaystyle{ H_\beta(p) \le H_\beta(q) }

for all 0 \le \beta < \infty.

And here’s what makes this fact so nice. If you do something to a classical system in a way that might involve some randomness, we can describe your action using a stochastic matrix. An n \times n matrix T is called stochastic if whenever p \in \mathbb{R}^n is a probability distribution, so is T p. This is equivalent to saying:

• the matrix entries of T are all \ge 0, and

• each column of T sums to 1.

If T is stochastic, it’s not necessarily true that the entropy of T p is greater than or equal to that of p, not even for the Shannon entropy.

Puzzle 1. Find a counterexample.

However, entropy does increase if we use specially nice stochastic matrices called ‘doubly stochastic’ matrices. People say a matrix T doubly stochastic if it’s stochastic and it maps the probability distribution

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

to itself. This is the most spread-out probability distribution of all: every other probability distribution majorizes this one.

Why do they call such matrices ‘doubly’ stochastic? Well, if you’ve got a stochastic matrix, each column sums to one. But a stochastic operator is doubly stochastic if and only if each row sums to 1 as well.

Here’s a really cool fact:

Theorem 2. If T is doubly stochastic, p majorizes T p for any probability distribution p \in \mathbb{R}^n. Conversely, if a probability distribution p majorizes a probability distribution q, then q = T p for some doubly stochastic matrix T.

Taken together, Theorems 1 and 2 say that doubly stochastic transformations increase entropy… but not just Shannon entropy! They increase all the different Rényi entropies, as well. So if time evolution is described by a doubly stochastic matrix, we get lots of ‘second laws of thermodynamics’, saying that all these different kinds of entropy increase!

Finally, what does all this have to do with quantum mechanics, and collapsing the wavefunction? There are different things to say, but this is the simplest:

Theorem 3. Given two probability distributions p, q \in \mathbb{R}^n, then p majorizes q if and only there exists a self-adjoint matrix D with eigenvalues p_i and diagonal entries q_i.

The matrix D will be a density matrix: a self-adjoint matrix with positive eigenvalues and trace equal to 1. We use such matrices to describe mixed states in quantum mechanics.

Theorem 3 gives a precise sense in which preparing a quantum system in some state, letting time evolve, and then measuring it ‘increases randomness’.

How? Well, suppose we have a quantum system whose Hilbert space is \mathbb{C}^n. If we prepare the system in a mixture of the standard basis states with probabilities p_i, we can describe it with a diagonal density matrix D_0. Then suppose we wait a while and some unitary time evolution occurs. The system is now described by a new density matrix

D = U D_0 \, U^{-1}

where U is some unitary operator. If we then do a measurement to see which of the standard basis states our system now lies in, we’ll get the different possible results with probabilities q_i, the diagonal entries of D. But the eigenvalues of D will still be the numbers p_i. So, by the theorem, p majorizes q!

So, not only Shannon entropy but also all the Rényi entropies will increase!

Of course, there are some big physics questions lurking here. Like: what about the real world? In the real world, do lots of different kinds of entropy tend to increase, or just some?

Of course, there’s a huge famous old problem about how reversible time evolution can be compatible with any sort of law saying that entropy must always increase! Still, there are some arguments, going back to Boltzmann’s H-theorem, which show entropy increases under some extra conditions. So then we can ask if other kinds of entropy, like Rényi entropy, increase as well. This will be true whenever we can argue that time evolution is described by doubly stochastic matrices. Theorem 3 gives a partial answer, but there’s probably much more to say.

I don’t have much more to say right now, though. I’ll just point out that while doubly stochastic matrices map the ‘maximally smeared-out’ probability distribution

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

to itself, a lot of this theory generalizes to stochastic matrices that map exactly one other probability distribution to itself. We need to work with relative Rényi entropy instead of Rényi entropy, and so on, but I don’t think these adjustments are really a big deal. And there are nice theorems that let you know when a stochastic matrix maps exactly one probability distribution to itself, based on the Perron–Frobenius theorem.

References

I already gave you a reference for Theorem 1, namely the paper by Erven and Harremoës, though I don’t think they were the first to prove this particular result: they generalize it quite a lot.

What about Theorem 2? It goes back at least to here:

• Barry C. Arnold, Majorization and the Lorenz Order: A Brief Introduction, Springer Lecture Notes in Statistics 43, Springer, Berlin, 1987.

The partial order on probability distributions given by majorization is also called the ‘Lorenz order’, but mainly when we consider probability distributions on infinite sets. This name presumably comes from the Lorenz curve, a measure of income inequality. This curve shows for the bottom x% of households, what percentage y% of the total income they have:

Puzzle 2. If you’ve got two different probability distributions of incomes, and one majorizes the other, how are their Lorenz curves related?

When we generalize majorization by letting some other probability distribution take the place of

\displaystyle{ p_0 = (\frac{1}{n}, \dots, \frac{1}{n}) }

it seems people call it the ‘Markov order’. Here’s a really fascinating paper on that, which I’m just barely beginning to understand:

• A. N. Gorban, P. A. Gorban and G. Judge, Entropy: the Markov ordering approach, Entropy 12 (2010), 1145–1193.

What about Theorem 3? Apparently it goes back to here:

• A. Uhlmann, Wiss. Z. Karl-Marx-Univ. Leipzig 20 (1971), 633.

though I only know this thanks to a more recent paper:

• Michael A. Nielsen, Conditions for a class of entanglement transformations, Phys. Rev. Lett. 83 (1999), 436–439.

By the way, Nielsen’s paper contains another very nice result about majorization! Suppose you have states \psi and \phi of a 2-part quantum system. You can trace out one part and get density matrices describing mixed states of the other part, say D_\psi and D_\phi. Then Nielsen shows you can get from \psi to \phi using ‘local operations and classical communication’ if and only if D_\phi majorizes D_\psi. Note that things are going backwards here compared to how they’ve been going in the rest of this post: if we can get from \psi to \phi, then all forms of entropy go down when we go from D_\psi to D_\phi! This ‘anti-second-law’ behavior is confusing at first, but familiar to me by now.

When I first learned all this stuff, I naturally thought of the following question—maybe you did too, just now. If p, q \in \mathbb{R}^n are probability distributions and

\displaystyle{ H_\beta(p) \le H_\beta(q) }

for all 0 \le \beta < \infty, is it true that p majorizes q?

Apparently the answer must be no, because Klimesh has gone to quite a bit of work to obtain a weaker conclusion: not that p majorizes q, but that p \otimes r majorizes q \otimes r for some probability distribution r \in \mathbb{R}^m. He calls this catalytic majorization, with r serving as a ‘catalyst’:

• Matthew Klimesh, Inequalities that collectively completely characterizes the catalytic majorization relation.

I thank Vlatko Vedral here at the CQT for pointing this out!

Finally, here is a good general introduction to majorization, pointed out by Vasileios Anagnostopoulos:

• T. Ando, Majorization, doubly stochastic matrices, and comparison of eigenvalues, Linear Algebra and its Applications 118 (1989), 163-–248.


The Noisy Channel Coding Theorem

28 July, 2012

Here’s a charming, easily readable tale of Claude Shannon and how he came up with information theory:

• Erico Guizzo, The Essential Message: Claude Shannon and the Making of Information Theory.

I hadn’t known his PhD thesis was on genetics! His master’s thesis introduced Boolean logic to circuit design. And as a kid, he once set up a telegraph line to a friend’s house half a mile away.

So, he was perfectly placed to turn information into a mathematical quantity, deeply related to entropy, and prove some now-famous theorems about it.

These theorems set limits on how much information we can transmit through a noisy channel. More excitingly, they say we can cook up coding schemes that let us come as close as we want to this limit, with an arbitrarily low probability of error.

As Erico Guizzo points out, these results are fundamental to the ‘information age’ we live in today:

Can we transmit, say, a high-resolution picture over a telephone line? How long will that take? Is there a best way to do it?

Before Shannon, engineers had no clear answers to these questions. At that time, a wild zoo of technologies was in operation, each with a life of its own—telephone, telegraph, radio, television, radar, and a number of other systems developed during the war. Shannon came up with a unifying, general theory of communication. It didn’t matter whether you transmitted signals using a copper wire, an optical fiber, or a parabolic dish. It didn’t matter if you were transmitting text, voice, or images. Shannon envisioned communication in abstract, mathematical terms; he defined what the once fuzzy concept of “information” meant for communication engineers and proposed a precise way to quantify it. According to him, the information content of any kind of message could be measured in binary digits, or just bits—a name suggested by a colleague at Bell Labs. Shannon took the bit as the fundamental unit in information theory. It was the first time that the term appeared in print.

So, I want to understand Shannon’s theorems and their proofs—especially because they clarify the relation between information and entropy, two concepts I’d like to be an expert on. It’s sort of embarrassing that I don’t already know this stuff! But I thought I’d post some preliminary remarks anyway, in case you too are trying to learn this stuff, or in case you can help me.

There are various different theorems I should learn. For example:

• The source coding theorem says it’s impossible to compress a stream of data to make the average number of bits per symbol in the compressed data less than the Shannon information of the source, without some of the data almost certainly getting lost. However, you can make the number of bits per symbols arbitrarily close to the Shannon entropy with a probability of error as small as you like.

• the noisy channel coding theorem is a generalization to data sent over a noisy channel.

The proof of the noisy channel coding theorem seems not so bad—there’s a sketch of a proof in the Wikipedia article on this theorem. But many theorems have a hard lemma at their heart, and for this one it’s a result in probability theory called the asymptotic equipartition property.

You should not try to dodge the hard lemma at the heart of the theorem you’re trying to understand: there’s a reason it’s there. So what’s the asymptotic equipartition property?

Here’s a somewhat watered-down statement that gets the basic idea across. Suppose you have a method of randomly generating letters—for example, a probability distribution on the set of letters. Suppose you randomly generate a string of n letters, and compute -(1/n) times the logarithm of the probability that you got that string. Then as n \to \infty this number ‘almost surely’ approaches some number S. What’s this number S? It’s the entropy of the probability distribution you used to generate those letters!

(Almost surely is probability jargon for ‘with probability 100%’, which is not the same as ‘always’.)

This result is really cool—definitely worth understanding in its own right! It says that while many strings are possible, the ones you’re most likely to see lie in a certain ‘typical set’. The ‘typical’ strings are the ones where when you compute -(1/n) times the log of their probability, the result is close to S. How close? Well, you get to pick that.

The typical strings are not individually the most probable strings! But if you randomly generate a string, it’s very probable that it lies in the typical set. That sounds a bit paradoxical, but if you think about it, you’ll see it’s not. Think of repeatedly flipping a coin that has a 90% chance of landing heads up. The most probable single outcome is that it lands heads up every time. But the typical outcome is that it lands up close to 90% of the time. And, there are lots of ways this can happen. So, if you flip the coin a bunch of times, there’s a very high chance that the outcome is typical.

It’s easy to see how this result is the key to the noisy channel coding theorem. The ‘typical set’ has few elements compared to the whole set of strings with n letters. So, you can make short codes for the strings in this set, and compress your message that way, and this works almost all the time. How much you can compress your message depends on the entropy S.

So, we’re seeing the link between information and entropy!

The actual coding schemes that people use are a lot trickier than the simple scheme I’m hinting at here. When you read about them, you see scary things like this:

But presumably they’re faster to implement, hence more practical.

The first coding schemes that come really close to the Shannon limit are the turbo codes. Surprisingly, these codes were developed only in 1993! They’re used in 3G mobile communications and deep space satellite communications.

One key trick is to use, not one decoder, but two. These two decoders keep communicating with each other and improving their guesses about the signal they’re received, until they agree:

This iterative process continues until the two decoders come up with the same hypothesis for the m-bit pattern of the payload, typically in 15 to 18 cycles. An analogy can be drawn between this process and that of solving cross-reference puzzles like crossword or sudoku. Consider a partially completed, possibly garbled crossword puzzle. Two puzzle solvers (decoders) are trying to solve it: one possessing only the “down” clues (parity bits), and the other possessing only the “across” clues. To start, both solvers guess the answers (hypotheses) to their own clues, noting down how confident they are in each letter (payload bit). Then, they compare notes, by exchanging answers and confidence ratings with each other, noticing where and how they differ. Based on this new knowledge, they both come up with updated answers and confidence ratings, repeating the whole process until they converge to the same solution.

This can be seen as “an instance of loopy belief propagation in Bayesian networks.”

By the way, the picture I showed you above is a flowchart of the decoding scheme for a simple turbo code. You can see the two decoders, and maybe the loop where data gets fed back to the decoders.

While I said this picture is “scary”, I actually like it because it’s an example of network theory applied to real-life problems.


The Mathematics of Biodiversity (Part 8)

14 July, 2012

Last time I mentioned that estimating entropy from real-world data is important not just for measuring biodiversity, but also for another area of biology: neurobiology!

When you look at something, neurons in your eye start firing. But how, exactly, is their firing related to what you see? Questions like this are hard! Answering them— ‘cracking the neural code’—is a big challenge. To make progress, neuroscientists are using information theory. But as I explained last time, estimating information from experimental data is tricky.

Romain Brasselet, now a postdoc at the Max Planck Institute for Biological Cybernetics at Tübingen, is working on these topics. He sent me a nice email explaining this area.

This is a bit of a digression, but the Mathematics of Biodiversity program in Barcelona has been extraordinarily multidisciplinary, with category theorists rubbing shoulders with ecologists, immunologists and geneticists. One of the common themes is entropy and its role in biology, so I think it’s worth posting Romain’s comments here. This is what he has to say…

Information in neurobiology

I will try to explain why neurobiologists are today very interested in reliable estimates of entropy/information and what are the techniques we use to obtain them.

The activity of sensory as well as more central neurons is known to be modulated by external stimulations. In 1926, in a seminal paper, Adrian observed that neurons in the sciatic nerve of the frog fire action potentials (or spikes) when some muscle in the hindlimb is stretched. In addition, he observed that the frequency of the spikes increases with the amplitude of the stretching.

• E.D. Adrian, The impulses produced by sensory nerve endings. (1926).

For another very nice example, in 1962, Hubel and Wiesel found neurons in the cat visual cortex whose activity depends on the orientation of a visual stimulus, a simple black line over white background: some neurons fire preferentially for one orientation of the line (Hubel and Wiesel were awarded the 1981 Nobel Prize in Physiology for their work). This incidentally led to the concept of “receptive field” which is of tremendous importance in neurobiology—but though it’s fascinating, it’s a different topic.

Good, we are now able to define what makes a neuron tick. The problem is that neural activity is often very “noisy”: when the exact same stimulus is presented many times, the responses appear to be very different from trial to trial. Even careful observation cannot necessarily reveal correlations between the stimulations and the neural activity. So we would like a measure capable of capturing the statistical dependencies between the stimulation and the response of the neuron to know if we can say something about the stimulation just by observing the response of a neuron, which is essentially the task of the brain. In particular, we want a fundamental measure that does not rely on any assumption about the functioning of the brain. Information theory provides the tools to do this, that is why we like to use it: we often try to measure the mutual information between stimuli and responses.

To my knowledge, the first paper using information theory in neuroscience was by MacKay and McCulloch in 1952:

• Donald M. Mackay and Warren S. McCulloch, The limiting information capacity of a neuronal link, Bulletin of Mathematical Biophysics 14 (1952), 127–135.

But information theory was not used in neuroscience much until the early 90′s. It started again with a paper by Bialek et al. in 1991:

• W. Bialek, F. Rieke, R. R. de Ruyter van Steveninck and D. Warland, Reading a neural code, Science 252 (1991), 1854–1857.

However, when applying information-theoretic methods to biological data, we often have a limited sampling of the neural response, we are usually very happy when we have 50 trials for a given stimulus. Why is this limited sample a problem?

During the major part of the 20th century, following Adrian’s finding, the paradigm for the neural code was the frequency of the spikes or, equivalently, the number of spikes in a window of time. But in the early 90′s, it was observed that the exact timing of spikes is (in some cases) reliable across trials. So instead of considering the neural response as a single number (the number of spikes), the temporal patterns of spikes started to be taken into account. But time is continuous, so to be able to do actual computations, time was discretized and a neural response became a binary string.

Now, if you consider relevant time-scales, say, a 100 millisecond time window with a 1 millisecond bin with a firing frequency of about 50 per second, then your response space is huge and the estimates of information with only 50 trials are not reliable anymore. That’s why a lot of efforts have been carried to overcome the limited sampling bias.

Now, getting at the techniques developed in this field, John already mentioned the work by Liam Paninski, but here are other very interesting references:

• Stefano Panzeri and Alessandro Treves, Analytical estimates of limited sampling biases in different information measures, Network: Computation in Neural Systems 7 (1996), 87–107.

They computed the first-order bias of the information (related to the Miller–Madow correction) and then used a Bayesian technique to estimate the number of responses not included in the sample but that would be in an infinite sample (a goal similar to that of Good’s rule of thumb).

• S.P. Strong, R. Koberle, R.R. de Ruyter van Steveninck, and W. Bialek, Entropy and information in neural spike trains, Phys. Rev. Lett. 80 (1998), 197–200.

The entropy (or if you prefer, information) estimate can be expanded in a power series in N (the sample size) around the true value. By computing the estimate for various values of N and fitting it with a parabola, it is possible to estimate the value of the entropy as N \rightarrow \infty.

These approaches are also well-known:

• Ilya Nemenman, Fariel Shafee and William Bialek, Entropy and inference, revisited, 2002.

• Alexander Kraskov, Harald Stögbauer and Peter Grassberger, Estimating mutual information, Phys. Rev. E. 69 (2004), 066138.

Actually, Stefano Panzeri has quite a few impressive papers about this problem, and recently with colleagues he has made public a free Matlab toolbox for information theory (www.ibtb.org) implementing various correction methods.

Finally, the work by Jonathan Victor is worth mentioning, since he provided (to my knowledge again) the first estimate of mutual information using geometry. This is of particular interest with respect to the work by Christina Cobbold and Tom Leinster on measures of biodiversity that take the distance between species into account:

• J. D. Victor and K. P. Purpura, Nature and precision of temporal coding in visual cortex: a metric-space analysis, Journal of Neural Physiology 76 (1996), 1310–1326.

He introduced a distance between sequences of spikes and from this, derived a lower bound on mutual information.

• Jonathan D. Victor, Binless strategies for estimation of information from neural data, Phys. Rev. E. 66 (2002), 051903.

Taking inspiration from work by Kozachenko and Leonenko, he obtained an estimate of the information based on the distances between the closest responses.

Without getting too technical, that’s what we do in neuroscience about the limited sampling bias. The incentive is that obtaining reliable estimates is crucial to understand the ‘neural code’, the holy grail of computational neuroscientists.


The Mathematics of Biodiversity (Part 7)

12 July, 2012

How ignorant are you?

Do you know?

Do you know how much don’t you know?

It seems hard to accurately estimate your lack of knowledge. It even seems hard to say precisely how hard it is. But the cool thing is, we can actually extract an interesting math question from this problem. And one answer to this question leads to the following conclusion:

There’s no unbiased way to estimate how ignorant you are.

But the devil is in the details. So let’s see the details!

The Shannon entropy of a probability distribution is a way of measuring how ignorant we are when this probability distribution describes our knowledge.

For example, suppose all we care about is whether this ancient Roman coin will land heads up or tails up:

If we know there’s a 50% chance of it landing heads up, that’s a Shannon entropy of 1 bit: we’re missing one bit of information.

But suppose for some reason we know for sure it’s going to land heads up. For example, suppose we know the guy on this coin is the emperor Pupienus Maximus, a egomaniac who had lead put on the back of all coins bearing his likeness, so his face would never hit the dirt! Then the Shannon entropy is 0: we know what’s going to happen when we toss this coin.

Or suppose we know there’s a 90% it will land heads up, and a 10% chance it lands tails up. Then the Shannon entropy is somewhere in between. We can calculate it like this:

- 0.9 \log_2 (0.9) - 0.1 \log_2 (0.1) = 0.46899...

so that’s how many bits of information we’re missing.

But now suppose we have no idea. Suppose we just start flipping the coin over and over, and seeing what happens. Can we estimate the Shannon entropy?

Here’s a naive way to do it. First, use your experimental data to estimate the probability that that the coin lands heads-up. Then, stick that probability into the formula for Shannon entropy. For example, say we flip the coin 3 times and it lands head-up once. Then we can estimate the probability of it landing heads-up as 1/3, and tails-up as 2/3. So we can estimate that the Shannon entropy is

\displaystyle{ - \frac{1}{3} \log_2 (\frac{1}{3}) -\frac{2}{3} \log_2 (\frac{2}{3}) = 0.918... }

But it turns out that this approach systematically underestimates the Shannon entropy!

Say we have a coin that lands up a certain fraction of the time, say p. And say we play this game: we flip our coin n times, see what we get, and estimate the Shannon entropy using the simple recipe I just illustrated.

Of course, our estimate will depend on the luck of the game. But on average, it will be less than the actual Shannon entropy, which is

- p \log_2 (p) - (1-p) \log_2 (1-p)

We can prove this mathematically. But it shouldn’t be surprising. After all, if n = 1, we’re playing a game where we flip the coin just once. And with this game, our naive estimate of the Shannon entropy will always be zero! Each time we play the game, the coin will either land heads up 100% of the time, or tails up 100% of the time!

If we play the game with more coin flips, the error gets less severe. In fact it approaches zero as the number of coin flips gets ever larger, so that n \to \infty. The case where you flip the coin just once is an extreme case—but extreme cases can be good to think about, because they can indicate what may happen in less extreme cases.

One moral here is that naively generalizing on the basis of limited data can make you feel more sure you know what’s going on than you actually are.

I hope you knew that already!

But we can also say, in a more technical way, that the naive way of estimating Shannon entropy is a biased estimator: the average value of the estimator is different from the value of the quantity being estimated.

Here’s an example of an unbiased estimator. Say you’re trying to estimate the probability that the coin will land heads up. You flip it n times and see that it lands up m times. You estimate that the probability is m/n. That’s the obvious thing to do, and it turns out to be unbiased.

Statisticians like to think about estimators, and being unbiased is one way an estimator can be ‘good’. Beware: it’s not the only way! There are estimators that are unbiased, but whose standard deviation is so huge that they’re almost useless. It can be better to have an estimate of something that’s more accurate, even though on average it’s a bit too low. So sometimes, a biased estimator can be more useful than an unbiased estimator.

Nonetheless, my ears perked up when Lou Jost mentioned that there is no unbiased estimator for Shannon entropy. In rough terms, the moral is that:

There’s no unbiased way to estimate how ignorant you are.

I think this is important. For example, it’s important because Shannon entropy is also used as a measure of biodiversity. Instead of flipping a coin repeatedly and seeing which side lands up, now we go out and collect plants or animals, and see which species we find. The relative abundance of different species defines a probability distribution on the set of species. In this language, the moral is:

There’s no unbiased way to estimate biodiversity.

But of course, this doesn’t mean we should give up. We may just have to settle for an estimator that’s a bit biased! And people have spent a bunch of time looking for estimators that are less biased than the naive one I just described.

By the way, equating ‘biodiversity’ with ‘Shannon entropy’ is sloppy: there are many measures of biodiversity. The Shannon entropy is just a special case of the Rényi entropy, which depends on a parameter q: we get Shannon entropy when q = 1.

As q gets smaller, the Rényi entropy gets more and more sensitive to rare species—or shifting back to the language of probability theory, rare events. It’s the rare events that make Shannon entropy hard to estimate, so I imagine there should be theorems about estimators for Rényi entropy, which say it gets harder to estimate as q gets smaller. Do you know such theorems?

Also, I should add that biodiversity is better captured by the ‘Hill numbers’, which are functions of the Rényi entropy, than by the Rényi entropy itself. (See here for the formulas.) Since these functions are nonlinear, the lack of an unbiased estimator for Rényi entropy doesn’t instantly imply the same for the Hill numbers. So there are also some obvious questions about unbiased estimators for Hill numbers. Do you know answers to those?

Here are some papers on estimators for entropy. Most of these focus on estimating the Shannon entropy of a probability distribution on a finite set.

This old classic has a proof that the ‘naive’ estimator of Shannon entropy is biased, and estimates on the bias:

• Bernard Harris, The statistical estimation of entropy in the non-parametric case, Army Research Office, 1975.

He shows the bias goes to zero as we increase the number of samples: the number I was calling n in my coin flip example. In fact he shows the bias goes to zero like O(1/n). This is big big O notation which means that as n \to +\infty, the bias is bounded by some constant times 1/n. This constant depends on the size of our finite set—or, if you want to do better, the class number, which is the number of elements on which our probability distribution is nonzero.

Using this idea, he shows that you can find a less biased estimator if you have a probability distribution p_i on a finite set and you know that exactly k of these probabilities are nonzero. To do this, just take the ‘naive’ estimator I described earlier and add (k-1)/2n. This is called the Miller–Madow bias correction. The bias of this improved estimator goes to zero like O(1/n^2).

The problem is that in practice you don’t know ahead of time how many probabilities are nonzero! In applications to biodiversity this would amount to knowing ahead of time how many species exist, before you go out looking for them.

But what about the theorem that there’s no unbiased estimator for Shannon entropy? The best reference I’ve found is this:

• Liam Paninski, Estimation of entropy and mutual information, Neural Computation 15 (2003) 1191-1254.

In Proposition 8 of Appendix A, Paninski gives a quick proof that there is no unbiased estimator of Shannon entropy for probability distributions on a finite set. But his paper goes far beyond this. Indeed, it seems like a pretty definitive modern discussion of the whole subject of estimating entropy. Interestingly, this subject is dominated by neurobiologists studying entropy of signals in the brain! So, lots of his examples involve brain signals.

Another overview, with tons of references, is this:

• J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. van der Meulen, Nonparametric entropy estimation: an overview.

This paper focuses on the situation where don’t know ahead of time how many probabilities are nonzero:

• Anne Chao and T.-J. Shen, Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample, Environmental and Ecological Statistics 10 (2003), 429&–443.

In 2003 there was a conference on the problem of estimating entropy, whose webpage has useful information. As you can see, it was dominated by neurobiologists:

Estimation of entropy and information of undersampled probability distributions: theory, algorithms, and applications to the neural code, Whistler, British Columbia, Canada, 12 December 2003.

By the way, I was very confused for a while, because these guys claim to have found an unbiased estimator of Shannon entropy:

• Stephen Montgomery Smith and Thomas Schürmann, Unbiased estimators for entropy and class number.

However, their way of estimating entropy has a funny property: in the language of biodiversity, it’s only well-defined if our samples include at least one species of each organism. So, we cannot compute this estimate for an arbitary list of n samples. This means it’s not estimator in the usual sense—the sense that Paninski is using! So it doesn’t really contradict Paninski’s result.

To wrap up, let me state Paninski’s result in a mathematically precise way. Suppose p is a probability distribution on a finite set X. Suppose S is any number we can compute from p: that is, any real-valued function on the set of probability distributions. We’ll be interested in the case where S is the Shannon entropy:

\displaystyle{ S = -\sum_{x \in X} p(x) \, \log p(x) }

Here we can use whatever base for the logarithm we like: earlier I was using base 2, but that’s not sacred. Define an estimator to be any function

\hat{S}: X^n \to \mathbb{R}

The idea is that given n samples from the set X, meaning points x_1, \dots, x_n \in X, the estimator gives a number \hat{S}(x_1, \dots, x_n). This number is supposed to estimate some feature of the probability distribution p: for example, its entropy.

If the samples are independent and distributed according to the distribution p, the sample mean of the estimator will be

\displaystyle{ \langle \hat{S} \rangle = \sum_{x_1, \dots, x_n \in X} \hat{S}(x_1, \dots, x_n) \, p(x_1) \cdots p(x_n) }

The bias of the estimator is the difference between the sample mean of the estimator and actual value of S:

\langle \hat{S} \rangle - S

The estimator \hat{S} is unbiased if this bias is zero for all p.

Proposition 8 of Paninski’s paper says there exists no unbiased estimator for entropy! The proof is very short…

Okay, that’s all for today.

I’m back in Singapore now; I learned so much at the Mathematics of Biodiversity conference that there’s no way I’ll be able to tell you all that information. I’ll try to write a few more blog posts, but please be aware that my posts so far give a hopelessly biased and idiosyncratic view of the conference, which would be almost unrecognizable to most of the participants. There are a lot of important themes I haven’t touched on at all… while this business of entropy estimation barely came up: I just find it interesting!

If more of you blogged more, we wouldn’t have this problem.


The Mathematics of Biodiversity (Part 5)

3 July, 2012

I’d be happy to get your feedback on these slides of the talk I’m giving the day after tomorrow:

• John Baez, Diversity, entropy and thermodynamics, 6 July 2012, Exploratory Conference on the Mathematics of Biodiversity, Centre de Recerca Matemàtica, Barcelona.

Abstract: As is well known, some popular measures of biodiversity are formally identical to measures of entropy developed by Shannon, Rényi and others. This fact is part of a larger analogy between thermodynamics and the mathematics of biodiversity, which we explore here. Any probability distribution can be extended to a 1-parameter family of probability distributions where the parameter has the physical meaning of ‘temperature’. This allows us to introduce thermodynamic concepts such as energy, entropy, free energy and the partition function in any situation where a probability distribution is present—for example, the probability distribution describing the relative abundances of different species in an ecosystem. The Rényi entropy of this probability distribution is closely related to the change in free energy with temperature. We give one application of thermodynamic ideas to population dynamics, coming from the work of Marc Harper: as a population approaches an ‘evolutionary optimum’, the amount of Shannon information it has ‘left to learn’ is nonincreasing. This fact is closely related to the Second Law of Thermodynamics.

This talk is rather different than the one I’d envisaged giving! There was a lot of interest in my work on Rényi entropy and thermodynamics, because Rényi entropies—and their exponentials, called the Hill numbers—are an important measure of biodiversity. So, I decided to spend a lot of time talking about that.


Follow

Get every new post delivered to your Inbox.

Join 721 other followers