Entropy and Information in Biological Systems (Part 1)

2 November, 2013

John Harte is an ecologist who uses maximum entropy methods to predict the distribution, abundance and energy usage of species. Marc Harper uses information theory in bioinformatics and evolutionary game theory. Harper, Harte and I are organizing a workshop on entropy and information in biological systems, and I’m really excited about it!

It’ll take place at the National Institute for Mathematical and Biological Synthesis in Knoxville Tennesee. We are scheduling it for Wednesday-Friday, April 8-10, 2015. When the date gets confirmed, I’ll post an advertisement so you can apply to attend.

Writing the proposal was fun, because we got to pull together lots of interesting people who are applying information theory and entropy to biology in quite different ways. So, here it is!

Proposal

Ever since Shannon initiated research on information theory in 1948, there have been hopes that the concept of information could serve as a tool to help systematize and unify work in biology. The link between information and entropy was noted very early on, and it suggested that a full thermodynamic understanding of biology would naturally involve the information processing and storage that are characteristic of living organisms. However, the subject is full of conceptual pitfalls for the unwary, and progress has been slower than initially expected. Premature attempts at ‘grand syntheses’ have often misfired. But applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful, such as:

• the maximum entropy principle in ecology,
• Shannon and Rényi entropies as measures of biodiversity,
• information theory in evolutionary game theory,
• information and the thermodynamics of individual cells.

Because they work in diverse fields, researchers in these specific topics have had little opportunity to trade insights and take stock of the progress so far. The aim of the workshop is to do just this.

In what follows, participants’ names are in boldface, while the main goals of the workshop are in italics.

Roderick Dewar is a key advocate of the principle of Maximum Entropy Production, which says that biological systems—and indeed all open, non-equilibrium systems—act to produce entropy at the maximum rate. Along with others, he has applied this principle to make testable predictions in a wide range of biological systems, from ATP synthesis [DJZ2006] to respiration and photosynthesis of individual plants [D2010] and plant communities. He has also sought to derive this principle from ideas in statistical mechanics [D2004, D2009], but it remains controversial.

The first goal of this workshop is to study the validity of this principle.

While they may be related, the principle of Maximum Entropy Production should not be confused with the MaxEnt inference procedure, which says that we should choose the probabilistic hypothesis with the highest entropy subject to the constraints provided by our data. MaxEnt was first explicitly advocated by Jaynes. He noted that it is already implicit in the procedures of statistical mechanics, but convincingly argued that it can also be applied to situations where entropy is more ‘informational’ than ‘thermodynamic’ in character.

Recently John Harte has applied MaxEnt in this way to ecology, using it to make specific testable predictions for the distribution, abundance and energy usage of species across spatial scales and across habitats and taxonomic groups [Harte2008, Harte2009, Harte2011]. Annette Ostling is an expert on other theories that attempt to explain the same data, such as the ‘neutral model’ [AOE2008, ODLSG2009, O2005, O2012]. Dewar has also used MaxEnt in ecology [D2008], and he has argued that it underlies the principle of Maximum Entropy Production.

Thus, a second goal of this workshop is to familiarize all the participants with applications of the MaxEnt method to ecology, compare it with competing approaches, and study whether MaxEnt provides a sufficient justification for the principle of Maximum Entropy Production.

Entropy is not merely a predictive tool in ecology: it is also widely used as a measure of biodiversity. Here Shannon’s original concept of entropy naturally generalizes to ‘Rényi entropy’, which depends on a parameter \alpha \ge 0. This equals

\displaystyle{ H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha  }

where p_i is the fraction of organisms of the ith type (which could mean species, some other taxon, etc.). In the limit \alpha \to 1 this reduces to the Shannon entropy:

\displaystyle{  H(p) = - \sum_i p_i \log p_i }

As \alpha increases, we give less weight to rare types of organisms. Christina Cobbold and Tom Leinster have described a systematic and highly flexible system of biodiversity measurement, with Rényi entropy at its heart [CL2012]. They consider both the case where all we have are the numbers p_i, and the more subtle case where we take the distance between different types of organisms into account.

John Baez has explained the role of Rényi entropy in thermodynamics [B2011], and together with Tom Leinster and Tobias Fritz he has proved other theorems characterizing entropy which explain its importance for information processing [BFL2011]. However, these ideas have not yet been connected to the widespread use of entropy in biodiversity studies. More importantly, the use of entropy as a measure of biodiversity has not been clearly connected to MaxEnt methods in ecology. Does the success of MaxEnt methods imply a tendency for ecosystems to maximize biodiversity subject to the constraints of resource availability? This seems surprising, but a more nuanced statement along these general lines might be correct.

So, a third goal of this workshop is to clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

As the amount of data to analyze in genomics continues to surpass the ability of humans to analyze it, we can expect automated experiment design to become ever more important. In Chris Lee and Marc Harper’s RoboMendel program [LH2013], a mathematically precise concept of ‘potential information’—how much information is left to learn—plays a crucial role in deciding what experiment to do next, given the data obtained so far. It will be useful for them to interact with William Bialek, who has expertise in estimating entropy from empirical data and using it to constrain properties of models [BBS, BNS2001, BNS2002], and Susanne Still, who applies information theory to automated theory building and biology [CES2010, PS2012].

However, there is another link between biology and potential information. Harper has noted that in an ecosystem where the population of each type of organism grows at a rate proportional to its fitness (which may depend on the fraction of organisms of each type), the quantity

\displaystyle{ I(q||p) = \sum_i q_i \ln(q_i/p_i) }

always decreases if there is an evolutionarily stable state [Harper2009]. Here p_i is the fraction of organisms of the ith genotype at a given time, while q_i is this fraction in the evolutionarily stable state. This quantity is often called the Shannon information of q ‘relative to’ p. But in fact, it is precisely the same as Lee and Harper’s potential information! Indeed, there is a precise mathematical analogy between evolutionary games and processes where a probabilistic hypothesis is refined by repeated experiments.

Thus, a fourth goal of this workshop is to develop the concept of evolutionary games as ‘learning’ processes in which information is gained over time.

We shall try to synthesize this with Carl Bergstrom and Matina Donaldson-Matasci’s work on the ‘fitness value of information’: a measure of how much increase in fitness a population can obtain per bit of extra information [BL2004, DBL2010, DM2013]. Following Harper, we shall consider not only relative Shannon entropy, but also relative Rényi entropy, as a measure of information gain [Harper2011].

A fifth and final goal of this workshop is to study the interplay between information theory and the thermodynamics of individual cells and organelles.

Susanne Still has studied the thermodynamics of prediction in biological systems [BCSS2012]. And in a celebrated related piece of work, Jeremy England used thermodynamic arguments to a derive a lower bound for the amount of entropy generated during a process of self-replication of a bacterial cell [England2013]. Interestingly, he showed that E. coli comes within a factor of 3 of this lower bound.

In short, information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The time is ripe for an investigative workshop that brings together experts from different fields and lets them share insights and methods and begin to tackle some of the big remaining questions.

Bibliography

[AOE2008] D. Alonso, A. Ostling and R. Etienne, The assumption of symmetry and species abundance distributions, Ecology Letters 11 (2008), 93–105.

[TMMABB2012} D. Amodei, W. Bialek, M. J. Berry II, O. Marre, T. Mora, and G. Tkacik, The simplest maximum entropy model for collective behavior in a neural network, arXiv:1207.6319 (2012).

[B2011] J. Baez, Rényi entropy and free energy, arXiv:1102.2098 (2011).

[BFL2011] J. Baez, T. Fritz and T. Leinster, A characterization of entropy in terms of information loss, Entropy 13 (2011), 1945–1957.

[B2011] J. Baez and M. Stay, Algorithmic thermodynamics, Math. Struct. Comp. Sci. 22 (2012), 771–787.

[BCSS2012] A. J. Bell, G. E. Crooks, S. Still and D. A Sivak, The thermodynamics of prediction, Phys. Rev. Lett. 109 (2012), 120604.

[BL2004] C. T. Bergstrom and M. Lachmann, Shannon information and biological fitness, in IEEE Information Theory Workshop 2004, IEEE, 2004, pp. 50-54.

[BBS] M. J. Berry II, W. Bialek and E. Schneidman, An information theoretic approach to the functional classification of neurons, in Advances in Neural Information Processing Systems 15, MIT Press, 2005.

[BNS2001] W. Bialek, I. Nemenman and N. Tishby, Predictability, complexity and learning, Neural Computation 13 (2001), 2409–2463.

[BNS2002] W. Bialek, I. Nemenman and F. Shafee, Entropy and inference, revisited, in Advances in Neural Information Processing Systems 14, MIT Press, 2002.

[CL2012] C. Cobbold and T. Leinster, Measuring diversity: the importance of species similarity, Ecology 93 (2012), 477–489.

[CES2010] J. P. Crutchfield, S. Still and C. Ellison, Optimal causal inference: estimating stored information and approximating causal architecture, Chaos 20 (2010), 037111.

[D2004] R. C. Dewar, Maximum entropy production and non-equilibrium statistical mechanics, in Non-Equilibrium Thermodynamics and Entropy Production: Life, Earth and Beyond, eds. A. Kleidon and R. Lorenz, Springer, New York, 2004, 41–55.

[DJZ2006] R. C. Dewar, D. Juretíc, P. Zupanovíc, The functional design of the rotary enzyme ATP synthase is consistent with maximum entropy production, Chem. Phys. Lett. 430 (2006), 177–182.

[D2008] R. C. Dewar, A. Porté, Statistical mechanics unifies different ecological patterns, J. Theor. Bio. 251 (2008), 389–403.

[D2009] R. C. Dewar, Maximum entropy production as an inference algorithm that translates physical assumptions into macroscopic predictions: don’t shoot the messenger, Entropy 11 (2009), 931–944.

[D2010] R. C. Dewar, Maximum entropy production and plant optimization theories, Phil. Trans. Roy. Soc. B 365 (2010) 1429–1435.

[DBL2010} M. C. Donaldson-Matasci, C. T. Bergstrom, and
M. Lachmann, The fitness value of information, Oikos 119 (2010), 219-230.

[DM2013] M. C. Donaldson-Matasci, G. DeGrandi-Hoffman, and A. Dornhaus, Bigger is better: honey bee colonies as distributed information-gathering systems, Animal Behaviour 85 (2013), 585–592.

[England2013] J. L. England, Statistical physics of self-replication, J. Chem. Phys. 139 (2013), 121923.

[ODLSG2009} J. L. Green, J. K. Lake, J. P. O’Dwyer, A. Ostling and V. M. Savage, An integrative framework for stochastic, size-structured community assembly, PNAS 106 (2009), 6170–6175.

[Harper2009] M. Harper, Information geometry and evolutionary game theory, arXiv:0911.1383 (2009).

[Harper2011] M. Harper, Escort evolutionary game theory, Physica D 240 (2011), 1411–1415.

[Harte2008] J. Harte, T. Zillio, E. Conlisk and A. Smith, Maximum entropy and the state-variable approach to macroecology, Ecology 89 (2008), 2700–2711.

[Harte2009] J. Harte, A. Smith and D. Storch, Biodiversity scales from plots to biomes with a universal species-area curve, Ecology Letters 12 (2009), 789–797.

[Harte2011] J. Harte, Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics, Oxford U. Press, Oxford, 2011.

[LH2013] M. Harper and C. Lee, Basic experiment planning via information metrics: the RoboMendel problem, arXiv:1210.4808 (2012).

[O2005] A. Ostling, Neutral theory tested by birds, Nature 436 (2005), 635.

[O2012] A. Ostling, Do fitness-equalizing tradeoffs lead to neutral communities?, Theoretical Ecology 5 (2012), 181–194.

[PS2012] D. Precup and S. Still, An information-theoretic approach to curiosity-driven reinforcement learning, Theory in Biosciences 131 (2012), 139–148.


Autocatalysis in Reaction Networks

11 October, 2013

guest post by Manoj Gopalkrishnan

Since this is my first time writing a blog post here, let me start with a word of introduction. I am a computer scientist at the Tata Institute of Fundamental Research, broadly interested in connections between Biology and Computer Science, with a particular interest in reaction networks. I first started thinking about them during my Ph.D. at the Laboratory for Molecular Science. My fascination with them has been predominantly mathematical. As a graduate student, I encountered an area with rich connections between combinatorics and dynamics, and surprisingly easy-to-state and compelling unsolved conjectures, and got hooked.

There is a story about Richard Feynman that he used to take bets with mathematicians. If any mathematician could make Feynman understand a mathematical statement, then Feynman would guess whether or not the statement was true. Of course, Feynman was in a habit of winning these bets, which allowed him to make the boast that mathematics, especially in its obsession for proof, was essentially irrelevant, since a relative novice like himself could after a moment’s thought guess at the truth of these mathematical statements. I have always felt Feynman’s claim to be unjust, but have often wondered what mathematical statement I would put to him so that his chances of winning were no better than random.

Today I want to tell you of a result about reaction networks that I have recently discovered with Abhishek Deshpande. The statement seems like a fine candidate to throw at Feynman because until we proved it, I would not have bet either way about its truth. Even after we obtained a short and elementary proof, I do not completely ‘see’ why it must be true. I am hoping some of you will be able to demystify it for me. So, I’m just going to introduce enough terms to be able to make the statement of our result, and let you think about how to prove it.

John and his colleagues have been talking about reaction networks as Petri nets in the network theory series on this blog. As discussed in part 2 of that series, a Petri net is a diagram like this:

Following John’s terminology, I will call the aqua squares ‘transitions’ and the yellow circles ‘species’. If we have some number #rabbit of rabbits and some number #wolf of wolves, we draw #rabbit many black dots called ‘tokens’ inside the yellow circle for rabbit, and #wolf tokens inside the yellow circle for wolf, like this:

Here #rabbit = 4 and #wolf = 3. The predation transition consumes one ‘rabbit’ token and one ‘wolf’ token, and produces two ‘wolf’ tokens, taking us here:

John explained in parts 2 and 3 how one can put rates on different transitions. For today I am only going to be concerned with ‘reachability:’ what token states are reachable from what other token states. John talked about this idea in part 25.

By a complex I will mean a population vector: a snapshot of the number of tokens in each species. In the example above, (#rabbit, #wolf) is a complex. If y, y' are two complexes, then we write

y \to y'

if we can get from y to y' by a single transition in our Petri net. For example, we just saw that

(4,3)\to (3,4)

via the predation transition.

Reachability, denoted \to^*, is the transitive closure of the relation \to. So y\to^* y' (read y' is reachable from y) iff there are complexes

y=y_0,y_1,y_2,\dots,y_k =y'

such that

y_0\to y_1\to\cdots\to y_{k-1}\to y_k.

For example, here (5,1) \to^* (1, 5) by repeated predation.

I am very interested in switches. After all, a computer is essentially a box of switches! You can build computers by connecting switches together. In fact, that’s how early computers like the Z3 were built. The CMOS gates at the heart of modern computers are essentially switches. By analogy, the study of switches in reaction networks may help us understand biochemical circuits.

A siphon is a set of species that is ‘switch-offable’. That is, if there are no tokens in the siphon states, then they will remain absent in future. Equivalently, the only reactions that can produce tokens in the siphon states are those that require tokens from the siphon states before they can fire. Note that no matter how many rabbits there are, if there are no wolves, there will continue to be no wolves. So {wolf} is a siphon. Similarly, {rabbit} is a siphon, as is the union {rabbit, wolf}. However, when Hydrogen and Oxygen form Water, {Water} is not a siphon.

For another example, consider this Petri net:

The set {HCl, NaCl} is a siphon. However, there is a conservation law: whenever an HCl token is destroyed, an NaCl token is created, so that #HCl + #NaCl is invariant. If both HCl and NaCl were present to begin with, the complexes where both are absent are not reachable. In this sense, this siphon is not ‘really’ switch-offable. As a first pass at capturing this idea, we will introduce the notion of ‘critical set’.

A conservation law is a linear expression involving numbers of tokens that is invariant under every transition in the Petri net. A conservation law is positive if all the coefficients are non-negative. A critical set of states is a set that does not contain the support of a positive conservation law.

For example, the support of the positive conservation law #HCl + #NaCl is {HCl, NaCl}, and hence no set containing this set is critical. Thus {HCl, NaCl} is a siphon, but not critical. On the other hand, the set {NaCl} is critical but not a siphon. {HCl} is a critical siphon. And in our other example, {Wolf, Rabbit} is a critical siphon.

Of particular interest to us will be minimal critical siphons, the minimal sets among critical siphons. Consider this example:

Here we have two transitions:

X \to 2Y

and

2X \to Y

The set \{X,Y\} is a critical siphon. But so is the smaller set \{X\}. So, \{X,Y\} is not minimal.

We define a self-replicable set to be a set A of species such that there exist complexes y and y' with y\to^* y' such that for all i \in A we have

y'_i > y_i

So, there are transitions that accomplish the job of creating more tokens for all the species in A. In other words: these species can ‘replicate themselves’.

We define a drainable set by changing the > to a <. So, there are transitions that accomplish the job of reducing the number of tokens for all the species in A. These species can ‘drain away’.

Now here comes the statement:

Every minimal critical siphon is either drainable or self-replicable!

We prove it in this paper:

• Abhishek Deshpande and Manoj Gopalkrishnan, Autocatalysis in reaction networks.

But first note that the statement becomes false if the critical siphon is not minimal. Look at this example again:

The set \{X,Y\} is a critical siphon. However \{X,Y\} is neither self-replicable (since every reaction destroys X) nor drainable (since every reaction produces Y). But we’ve already seen that \{X,Y\} is not minimal. It has a critical subsiphon, namely \{X\}. This one is minimal—and it obeys our theorem, because it is drainable.

Checking these statements is a good way to make sure you understand the concepts! I know I’ve introduced a lot of terminology here, and it takes a while to absorb.

Anyway: our proof that every minimal critical siphon is either drainable or self-replicable makes use of a fun result about matrices. Consider a real square matrix with a sign pattern like this:

\left( \begin{array}{cccc} <0 & >0 & \cdots & > 0 \\ >0 & <0 & \cdots &> 0 \\ \vdots & \vdots & <0 &> 0 \\ >0 & >0 & \cdots & <0 \end{array} \right)

If the matrix is full-rank then there is a positive linear combination of the rows of the matrix so that all the entries are nonzero and have the same sign. In fact, we prove something stronger in Theorem 5.9 of our paper. At first, we thought this statement about matrices should be equivalent to one of the many well-known alternative statements of Farkas’ lemma, like Gordan’s theorem.

However, we could not find a way to make this work, so we ended up proving it by a different technique. Later, my colleague Jaikumar Radhakrishnan came up with a clever proof that uses Farkas’ lemma twice. However, so far we have not obtained the stronger result in Theorem 5.9 with this proof technique.

My interest in the result that every minimal critical siphon is either drainable or self-replicable is not purely aesthetic (though aesthetics is a big part of it). There is a research community of folks who are thinking of reaction networks as a programming language, and synthesizing molecular systems that exhibit sophisticated dynamical behavior as per specification:

International Conference on DNA Computing and Molecular Programming.

Foundations of Nanoscience: Self-Assembled Architectures and Devices.

Molecular Programming Architectures, Abstractions, Algorithms and Applications.

Networks that exhibit some kind of catalytic behavior are a recurring theme among such systems, and even more so in biochemical circuits.

Here is an example of catalytic behavior:

A + C \to B + C

The ‘catalyst’ C helps transform A to B. In the absence of C, the reaction is turned off. Hence, catalysts are switches in chemical circuits! From this point of view, it is hardly surprising that they are required for the synthesis of complex behaviors.

In information processing, one needs amplification to make sure that a signal can propagate through a circuit without being overwhelmed by errors. Here is a chemical counterpart to such amplification:

A + C \to 2C

Here the catalyst C catalyzes its own production: it is an ‘autocatalyst’, or a self-replicating species. By analogy, autocatalysis is key for scaling synthetic molecular systems.

Our work deals with these notions on a network level. We generalize the notion of catalysis in two ways. First, we allow a catalyst to be a set of species instead of a single species; second, its absence can turn off a reaction pathway instead of a single reaction. We propose the notion of self-replicable siphons as a generalization of the notion of autocatalysis. In particular, ‘weakly reversible’ networks have critical siphons precisely when they exhibit autocatalytic behavior. I was led to this work when I noticed the manifestation of this last statement in many examples.

Another hope I have is that perhaps one can study the dynamics of each minimal critical siphon of a reaction network separately, and then somehow be able to answer interesting questions about the dynamics of the entire network, by stitching together what we know for each minimal critical siphon. On the synthesis side, perhaps this could lead to a programming language to synthesize a reaction network that will achieve a specified dynamics. If any of this works out, it would be really cool! I think of how abelian group theory (and more broadly, the theory of abelian categories, which includes categories of vector bundles) benefits from a fundamental theorem that lets you break a finite abelian group into parts that are easy to study—or how number theory benefits from a special case, the fundamental theorem of arithmetic. John has also pointed out that reaction networks are really presentations of symmetric monoidal categories, so perhaps this could point the way to a Fundamental Theorem for Symmetric Monoidal Categories.

And then there is the Global Attractor Conjecture, a
long-standing open problem concerning the long-term behavior of solutions to the rate equations. Now that is a whole story by itself, and will have to wait for another day.


Levels of Excellence

29 September, 2013

 

Over on Google+, a computer scientist at McGill named Artem Kaznatcheev passed on this great description of what it’s like to learn math, written by someone who calls himself ‘man after midnight’:

The way it was described to me when I was in high school was in terms of ‘levels’.

Sometimes, in your mathematics career, you find that your slow progress, and careful accumulation of tools and ideas, has suddenly allowed you to do a bunch of new things that you couldn’t possibly do before. Even though you were learning things that were useless by themselves, when they’ve all become second nature, a whole new world of possibility appears. You have “leveled up”, if you will. Something clicks, but now there are new challenges, and now, things you were barely able to think about before suddenly become critically important.

It’s usually obvious when you’re talking to somebody a level above you, because they see lots of things instantly when those things take considerable work for you to figure out. These are good people to learn from, because they remember what it’s like to struggle in the place where you’re struggling, but the things they do still make sense from your perspective (you just couldn’t do them yourself).

Talking to somebody two or levels above you is a different story. They’re barely speaking the same language, and it’s almost impossible to imagine that you could ever know what they know. You can still learn from them, if you don’t get discouraged, but the things they want to teach you seem really philosophical, and you don’t think they’ll help you—but for some reason, they do.

Somebody three levels above is actually speaking a different language. They probably seem less impressive to you than the person two levels above, because most of what they’re thinking about is completely invisible to you. From where you are, it is not possible to imagine what they think about, or why. You might think you can, but this is only because they know how to tell entertaining stories. Any one of these stories probably contains enough wisdom to get you halfway to your next level if you put in enough time thinking about it.

What follows is my rough opinion on how this looks in a typical path towards a Ph.D. in math. Obviously this is rather subjective, and makes math look too linear, but I think it’s a useful thought experiment.

Consider the change that a person undergoes in first mastering elementary algebra. Let’s say that that’s one level. This student is now comfortable with algebraic manipulation and the idea of variables.

The next level may come somewhere during a first calculus course. The student now understands the concept of the infinitely small, of slope at a point, and can reason about areas, physical motion, and optimization.

Many stop here, believing that they have finally learned math. Those who do not stop, might proceed through multivariable calculus and perhaps a basic linear algebra course with the tools they currently possess. Their next level comes when they find themselves suffering through an abstract algebra course, and have to once again reshape their whole thought process just to squeak by with a C.

Once this student masters all of that, the rest of the undergraduate curriculum at their university might be a breeze. But not so with graduate school. They gain a level their first year. They gain another their third year. And they are horrified to discover that they are expected to gain a third level before they graduate. This level is the hardest of them all, because it is the first one that consists in mastering material that has been created largely by the student.

I don’t know how many levels there are after that. At least three.

So, the bad news is, you never do see the whole picture (though you see the old picture shrink down to a tiny point), and you can’t really explain what you do see. But the good news is that the world of mathematics is so rich and exciting and wonderful that even your wildest dreams about it cannot possibly compare. It is not like seeing the Matrix—it is like seeing the Matrix within the Matrix within the Matrix within the Matrix within the Matrix.

As he points out, this talk of ‘levels’ is too linear. You can be much better at algebraic geometry than your friend, but way behind them in probability theory. Or even within a field like algebraic geometry, you might be able to understand sheaf cohomology better than your friend, yet still way behind in some classical topic like elliptic curves.

To have worthwhile conversations with someone who is not evenly matched with you in some subject, it’s often good for one of you to play ‘student’ while the other plays ‘teacher’. Playing teacher is an ego boost, and it helps organize your thoughts – but playing student is a great way to amass knowledge and practice humility… and a good student can help the teacher think about things in new ways.

Taking turns between who is teacher and who is student helps keep things from becoming unbalanced. And it’s especially fun when some subject can only be understood with the combined knowledge of both players.

I have a feeling good mathematicians spend a lot of time playing these games—we often hear of famous teams like Atiyah, Bott and Singer, or even bigger ones like the French collective called ‘Bourbaki’. For about a decade, I played teacher/student games with James Dolan, and it was really productive. I should probably find a new partner to learn the new kinds of math I’m working on now. Trying to learn things by yourself is a huge disadvantage if you want to quickly rise to higher ‘levels’.

If we took things a bit more seriously and talked about them more, maybe a lot of us could get better at things faster.

 

 

Indeed, after I passed on these remarks, T.A. Abinandanan, a professor of materials science in Bangalore, pointed out this study on excellence in swimming:

• Daniel Chambliss, The mundanity of excellence.

Chambliss emphasizes that in swimming there really are discrete levels of excellence, because there are different kinds of swimming competitions, each with their own different ethos. Here are some of his other main points:

1) Excellence comes from qualitative changes in behavior, not just quantitative ones. More time practicing is not good enough. Nor is simply moving your arms faster! A low-level breaststroke swimmer does very different things than a top-ranked one. The low-level swimmer tends to pull her arms far back beneath her, kick the legs out very wide without bringing them together at the finish, lift herself high out of the water on the turn, and fail to go underwater for a long ways after the turn. The top-ranked one sculls her arms out to the side and sweeps back in, kicks narrowly with the feet finishing together, stays low on the turns, and goes underwater for a long distance after the turn. They’re completely different!

2) The different levels of excellence in swimming are like different worlds, with different rules. People can move up or down within a level by putting in more or less effort, but going up a level requires something very different—see point 1).

3) Excellence is not the product of socially deviant personalities. The best swimmers aren’t “oddballs,” nor are they loners—kids who have given up “the normal teenage life”.

4) Excellence does not come from some mystical inner quality of the athlete. Rather, it comes from learning how to do lots of things right.

5) The best swimmers are more disciplined. They’re more likely to be strict with their training, come to workouts on time, watch what they eat, sleep regular hours, do proper warmups before a meet, and the like.

6) Features of the sport that low-level swimmers find unpleasant, excellent swimmers enjoy. What others see as boring – swimming back and forth over a black line for two hours, say – the best swimmers find peaceful, even meditative, or challenging, or therapeutic. They enjoy hard practices, look forward to difficult competitions, and try to set difficult goals.

7) The best swimmers don’t spend a lot of time dreaming about big goals like winning the Olympics. They concentrate on “small wins”: clearly defined minor achievements that can be rather easily done, but produce real effects.

8) The best swimmers don’t “choke”. Faced with what seems to be a tremendous challenge or a strikingly unusual event such as the Olympic Games, they take it as a normal, manageable situation. One way they do this is by sticking to the same routines. Chambliss calls this the “mundanity of excellence”.

I’ve just paraphrased chunks of the paper. The whole thing is worth reading! I can’t help wondering how much these lessons apply to other areas. He gives an example that could easily apply to mathematics—a

more personal example of failing to maintain a sense of mundanity, from the world of academia: the inability to finish the doctoral thesis, the hopeless struggle for the magnum opus. Upon my arrival to graduate school some 12 years ago, I was introduced to an advanced student we will call Michael. Michael was very bright, very well thought of by his professors, and very hard working, claiming (apparently truthfully) to log a minimum of twelve hours a day at his studies. Senior scholars sought out his comments on their manuscripts, and their acknowledgements always mentioned him by name. All the signs pointed to a successful career. Yet seven years later, when I left the university, Michael was still there-still working 12 hours a day, only a bit less well thought of. At last report, there he remains, toiling away: “finishing up,” in the common expression.

In our terms, Michael could not maintain his sense of mundanity. He never accepted that a dissertation is a mundane piece of work, nothing more than some words which one person writes and a few other people read. He hasn’t learned that the real exams, the true tests (such as the dissertation requirement) in graduate school are really designed to discover whether at some point one is willing just to turn the damn thing in.


Quantum Network Theory (Part 2)

13 August, 2013

guest post by Tomi Johnson

Last time I told you how a random walk called the ‘uniform escape walk’ could be used to analyze a network. In particular, Google uses it to rank nodes. For the case of an undirected network, the steady state of this random walk tells us the degrees of the nodes—that is, how many edges come out of each node.

Now I’m going to prove this to you. I’ll also exploit the connection between this random walk and a quantum walk, also introduced last time. In particular, I’ll connect the properties of this quantum walk to the degrees of a network by exploiting its relationship with the random walk.

This is pretty useful, considering how tricky these quantum walks can be. As the parts of the world that we model using quantum mechanics get bigger and have more complicated structures, like biological network, we need all the help in understanding quantum walks that we can get. So I’d better start!

Flashback

Starting with any (simple, connected) graph, we can get an old-fashioned ‘stochastic’ random walk on this graph, but also a quantum walk. The first is the uniform escape stochastic walk, where the walker has an equal probability per time of walking along any edge leaving the node they are standing at. The second is the related quantum walk we’re going to study now. These two walks are generated by two matrices, which we called S and Q. The good thing is that these matrices are similar, in the technical sense.

We studied this last time, and everything we learned is summarized here:

Diagram outlining the main concepts (again)

where:

G is a simple graph that specifies

A the adjacency matrix (the generator of a quantum walk) with elements A_{i j} equal to unity if nodes i and j are connected, and zero otherwise (A_{i i} = 0), which subtracted from

D the diagonal matrix of degrees D_{i i} = \sum_j A_{i j} gives

L = D - A the symmetric Laplacian (generator of stochastic and quantum walks), which when normalized by D returns both

S = L D^{-1} the generator of the uniform escape stochastic walk and

Q = D^{-1/2} L D^{-1/2} the quantum walk generator to which it is similar!

Now I hope you remember where we are. Next I’ll talk you through the mathematics of the uniform escape stochastic walk S and how it connects to the degrees of the nodes in the large-time limit. Then I’ll show you how this helps us solve aspects of the quantum walk generated by Q.

Stochastic walk

The uniform escape stochastic walk generated by S is popular because it has a really useful stationary state.

To recap from Part 20 of the network theory series, a stationary state of a stochastic walk is one that does not change in time. By the master equation

\displaystyle{ \frac{d}{d t} \psi(t) = -S \psi(t)}

the stationary state must be an eigenvector of S with eigenvalue 0.

A fantastic pair of theorems hold:

• There is always a unique (up to multiplication by a positive number) positive eigenvector \pi of S with eigenvalue 0. That is, there is a unique stationary state \pi.

• Regardless of the initial state \psi(0), any solution of the master equation approaches this stationary state \pi in the large-time limit:

\displaystyle{ \lim_{t \rightarrow \infty} \psi(t) = \pi }

To find this unique stationary state, consider the Laplacian L, which is both infinitesimal stochastic and symmetric. Among other things, this means the rows of L sum to zero:

\displaystyle{ \sum_j L_{i j} = 0 }

Thus, the ‘all ones’ vector \mathbf{1} is an eigenvector of L with zero eigenvalue:

L \mathbf{1} = 0

Inserting the identity I = D^{-1} D into this equation we then find D \mathbf{1} is a zero eigenvector of S:

L \mathbf{1} =  ( L D^{-1} ) ( D \mathbf{1} ) = S ( D \mathbf{1} ) = 0

Therefore we just need to normalize this to get the large-time stationary state of the walk:

\displaystyle{ \pi = \frac{D \mathbf{1}}{\sum_i D_{i i}} }

If we write i for the basis vector that is zero except at the ith node of our graph, and 1 at that node, the inner product \langle i , \pi \rangle is large-time probability of finding a walker at that node. The equation above implies this is proportional to the degree D_{i i} of node i.

We can check this for the following graph:

Illustration of a simple graph

We find that \pi is

\displaystyle{ \left( \begin{matrix} 1/6 \\ 1/6 \\ 1/4 \\ 1/4 \\ 1/6 \end{matrix} \right) }

which implies large-time probability 1/6 for nodes 1, 2 and 5, and 1/4 for nodes 3 and 4. Comparing this to the original graph, this exactly reflects the arrangement of degrees, as we knew it must.

Math works!

The quantum walk

Next up is the quantum walk generated by Q. Not a lot is known about quantum walks on networks of arbitrary geometry, but below we’ll see some analytical results are obtained by exploiting the similarity of S and Q.

Where to start? Well, let’s start at the bottom, what quantum physicists call the ground state. In contrast to stochastic walks, for a quantum walk every eigenvector \phi_k of Q is a stationary state of the quantum walk. (In Puzzle 5, at the bottom of this page, I ask you to prove this). The stationary state \phi_0 is of particular interest physically and mathematically. Physically, since eigenvectors of the Q correspond to states of well-defined energy equal to the associated eigenvalue, \phi_0 is the state of lowest energy, energy zero, hence the name ‘ground state’. (In Puzzle 3, I ask you to prove that all eigenvalues of Q are non-negative, so zero really does correspond to the ground state.)

Mathematically, the relationship between eigenvectors implied by the similarity of S and Q means

\phi_0 \propto D^{-1/2} \pi \propto  D^{1/2} \mathbf{1}

So in the ground state, the probability of our quantum walker being found at node i is

| \langle i , \phi_0 \rangle |^2 \propto | \langle i , D^{1/2} \rangle \mathbf{1} |^2 = D_{i i}

Amazingly, this probability is proportional to the degree and so is exactly the same as \langle i , \pi \rangle, the probability in the stationary state \pi of the stochastic walk!

In short: a zero energy quantum walk Q leads to exactly the same distribution of the walker over the nodes as in the large-time limit of the uniform escape stochastic walk S. The classically important notion of degree distribution also plays a role in quantum walks!

This is already pretty exciting. What else can we say? If you are someone who feels faint at the sight of quantum mechanics, well done for getting this far, but watch out for what’s coming next.

What if the walker starts in some other initial state? Is there some quantum walk analogue of the unique large-time state of a stochastic walk?

In fact, the quantum walk in general does not converge to a stationary state. But there is a probability distribution that can be thought to characterize the quantum walk in the same way as the large-time state characterizes the stochastic walk. It’s the large-time average probability vector P.

If you didn’t know the time that had passed since the beginning of a quantum walk, then the best estimate for the probability of your measuring the walker to be at node i would be the large-time average probability

\displaystyle{ \langle i , P \rangle = \lim_{T \rightarrow \infty} \frac{1}{T} \int_0^T | \psi_i (t) |^2 d t }

There’s a bit that we can do to simplify this expression. As usual in quantum mechanics, let’s start with the trick of diagonalizing Q. This amounts to writing

\displaystyle{  Q= \sum_k \epsilon_k \Phi_k }

where \Phi_k are projectors onto the eigenvectors \phi_k of Q, and \epsilon_k are the corresponding eigenvalues of Q. If we insert this equation into

\psi(t)  = e^{-Q t} \psi(0)

we get

\displaystyle{  \psi(t)  = \sum_k e^{-\epsilon_k t} \Phi_k \psi(0) }

and thus

\displaystyle{ \langle i , P \rangle = \lim_{T \rightarrow \infty} \frac{1}{T} \int_0^T | \sum_k e^{-i \epsilon_k t} \langle i, \Phi_k \psi (0) \rangle |^2 d t }

Due to the integral over all time, the interference between terms corresponding to different eigenvalues averages to zero, leaving:

\displaystyle{ \langle i , P \rangle = \sum_k | \langle i, \Phi_k \psi(0) \rangle |^2 }

The large-time average probability is then the sum of terms contributed by the projections of the initial state onto each eigenspace.

So we have a distribution that characterizes a quantum walk for a general initial state, but it’s a complicated beast. What can we say about it?

Our best hope of understanding the large-time average probability is through the term | \langle i, \Phi_0 \psi (0) \rangle |^2 associated with the zero energy eigenspace, since we know everything about this space.

For example, we know the zero energy eigenspace is one-dimensional and spanned by the eigenvector \phi_0. This means that the projector is just the usual outer product

\Phi_0 = | \phi_0 \rangle \langle \phi_0 | = \phi_0 \phi_0^\dagger

where we have normalized \phi_0 according to the inner product \langle \phi_0, \phi_0\rangle = 1. (If you’re wondering why I’m using all these angled brackets, well, they’re a notation named after Dirac that is adored by quantum physicists.)

The zero eigenspace contribution to the large-time average probability then breaks nicely into two:

\begin{array}{ccl} | \langle i, \Phi_0 \psi (0) \rangle |^2 &=& | \langle i, \phi_0\rangle \; \langle \phi_0,  \psi (0) \rangle |^2  \\  \\  &=& | \langle i, \phi_0\rangle |^2 \; | \langle \phi_0 , \psi (0) \rangle |^2 \\   \\  &=& \langle i ,  \pi \rangle \; | \langle \phi_0 ,  \psi (0) \rangle |^2 \end{array}

This is just the product of two probabilities:

• first, the probability \langle i , \pi \rangle for a quantum state in the zero energy eigenspace to be at node i, as we found above,

and

• second, the probability | \langle \phi_0, \psi (0)\rangle |^2 of being in this eigenspace to begin with. (Remember, in quantum mechanics the probability of measuring the system to have an energy is the modulus squared of the projection of the state onto the associated eigenspace, which for the one-dimensional zero energy eigenspace means just the inner product with the ground state.)

This is all we need to say something interesting about the large-time average probability for all states. We’ve basically shown that we can break the large-time probability vector P into a sum of two normalized probability vectors:

P = (1- \eta) \pi + \eta \Omega

the first \pi being the stochastic stationary state associated with the zero energy eigenspace, and the second $\Omega$ associated with the higher energy eigenspaces, with

\displaystyle{ \langle i , \Omega \rangle = \frac{ \sum_{k\neq 0} | \langle i, \Phi_k  \psi (0) \rangle |^2  }{ \eta} }

The weight of each term is governed by the parameter

\eta =  1 - | \langle \phi_0, \psi (0)\rangle |^2

which you could think of as the quantumness of the result. This is one minus the probability of the walker being in the zero energy eigenspace, or equivalently the probability of the walker being outside the zero energy eigenspace.

So even if we don’t know \Omega, we know its importance is controlled by a parameter \eta that governs how close the large-time average distribution P of the quantum walk is to the corresponding stochastic stationary distribution \pi.

What do we mean by ‘close’? Find out for yourself:

Puzzle 1. Show, using a triangle inequality, that the trace distance between the two characteristic stochastic and quantum distributions \{ \langle i , P \rangle \}_i and \{ \langle i , \pi \rangle \}_i is upper-bounded by 2 \eta.

Can we say anything physical about when the quantumness \eta is big or small?

Because the eigenvalues of Q have a physical interpretation in terms of energy, the answer is yes. The quantumness \eta is the probability of being outside the zero energy state. Call the next lowest eigenvalue \Delta = \min_{k \neq 0} \epsilon_k the energy gap. If the quantum walk is not in the zero energy eigenspace then it must be in an eigenspace of energy greater or equal to \Delta. Therefore the expected energy E of the quantum walker must bound the quantumness E \ge \eta \Delta.

This tells us that a quantum walk with a low energy is similar to a stochastic walk in the large-time limit. We already knew this was exactly true in the zero energy limit, but this result goes further.

So little is known about quantum walks on networks of arbitrary geometry that we were very pleased to find this result. It says there is a special case in which the walk is characterized by the degree distribution of the network, and a clear physical parameter that bounds how far the walk is from this special case.

Also, in finding it we learned that the difficulties of the initial state dependence, enhanced by the lack of convergence to a stationary state, could be overcome for a quantum walk, and that the relationships between quantum and stochastic walks extend beyond those with shared generators.

What next?

That’s all for the latest bit of idea sharing at the interface between stochastic and quantum systems.

I hope I’ve piqued your interest about quantum walks. There’s so much still left to work out about this topic, and your help is needed!

Other questions we have include: What holds analytically about the form of the quantum correction? Numerically it is known that the so-called quantum correction \Omega tends to enhance the probability of being found on nodes of low degree compared to \pi. Can someone explain why? What happens if a small amount of stochastic noise is added to a quantum walk? Or a lot of noise?

It’s difficult to know who is best placed to answer these questions: experts in quantum physics, graph theory, complex networks or stochastic processes? I suspect it’ll take a bit of help from everyone.

Background reading

A couple of textbooks with comprehensive sections on non-negative matrices and continuous-time stochastic processes are:

• Peter Lancaster and Miron Tismenetsky, The Theory of Matrices: with Applications, 2nd edition, Academic Press, San Diego, 1985.

• James R. Norris, Markov Chains, Cambridge University Press, Cambridge, 1997.

There is, of course, the book that arose from the Azimuth network theory series, which considers several relationships between quantum and stochastic processes on networks:

• John Baez and Jacob Biamonte, A Course on Quantum Techniques for Stochastic Mechanics, 2012.

Another couple of books on complex networks are:

• Mark Newman, Networks: An Introduction, Oxford University Press, Oxford, 2010.

• Ernesto Estrada, The Structure of Complex Networks: Theory and Applications, Oxford University Press, Oxford, 2011. Note that the first chapter is available free online.

There are plenty more useful references in our article on this topic:

• Mauro Faccin, Tomi Johnson, Jacob Biamonte, Sabre Kais and Piotr Migdał, Degree distribution in quantum walks on complex networks.

Puzzles for the enthusiastic

Sadly I didn’t have space to show proofs of all the theorems I used. So here are a few puzzles that guide you to doing the proofs for yourself:

Stochastic walks and stationary states

Puzzle 2. (For the hard core.) Prove there is always a unique positive eigenvector for a stochastic walk generated by S. You’ll need the assumption that the graph G is connected. It’s not simple, and you’ll probably need help from a book, perhaps one of those above by Lancaster and Tismenetsky, and Norris.

Puzzle 3. Show that the eigenvalues of S (and therefore Q) are non-negative. A good way to start this proof is to apply the Perron-Frobenius theorem to the non-negative matrix M = - S + I \max_i S_{i i}. This implies that M has a positive eigenvalue r equal to its spectral radius

r = \max_k | \lambda_k |

where \lambda_k are the eigenvalues of M, and the associated eigenvector v is positive. Since S = - M + I \max_i S_{i i}, it follows that S shares the eigenvectors of M and the associated eigenvalues are related by inverted translation:

\epsilon_k = - \lambda_k + \max_i S_{i i}

Puzzle 4. Prove that regardless of the initial state \psi(0), the zero eigenvector \pi is obtained in the large-time limit \lim_{t \rightarrow \infty} \psi(t) = \pi of the walk generated by S. This breaks down into two parts:

(a) Using the approach from Puzzle 5, to show that S v  = \epsilon_v v, the positivity of v and the infinitesimal stochastic property \sum_i S_{i j} = 0 imply that \epsilon_v = \epsilon_0 = 0 and thus v = \pi is actually the unique zero eigenvector and stationary state of S (its uniqueness follows from puzzle 4, you don’t need to re-prove it).

(b) By inserting the decomposition S = \sum_k \epsilon_k \Pi_k into e^{-S t} and using the result of puzzle 5, complete the proof.

(Though I ask you to use the diagonalizability of S, the main results still hold if the generator is irreducible but not diagonalizable.)

Quantum walks

Here are a couple of extra puzzles for those of you interested in quantum mechanics:

Puzzle 5. In quantum mechanics, probabilities are given by the moduli squared of amplitudes, so multiplying a state by a number of modulus unity has no physical effect. By inserting

\displaystyle{ Q= \sum_k \epsilon_k \Phi_k }

into the quantum time evolution matrix e^{-Q t}, show that if

\psi(0) = \phi_k

then

\psi(t)  = e^{ - i \epsilon_k t} \psi(0)

hence \phi_k is a stationary state in the quantum sense, as probabilities don’t change in time.

Puzzle 6. By expanding the initial state \psi(0) in terms of the complete orthogonal basis vectors \phi_k show that for a quantum walk \psi(t) never converges to a stationary state unless it began in one.


Quantum Network Theory (Part 1)

5 August, 2013

guest post by Tomi Johnson

If you were to randomly click a hyperlink on this web page and keep doing so on each page that followed, where would you end up?

As an esteemed user of Azimuth, I’d like to think you browse more intelligently, but the above is the question Google asks when deciding how to rank the world’s web pages.

Recently, together with the team (Mauro Faccin, Jacob Biamonte and Piotr Migdał) at the ISI Foundation in Turin, we attended a workshop in which several of the attendees were asking a similar question with a twist. “What if you, the web surfer, behaved quantum mechanically?”

Now don’t panic! I have no reason to think you might enter a superposition of locations or tunnel through a wall. This merely forms part of a recent drive towards understanding the role that network science can play in quantum physics.

As we’ll find, playing with quantum networks is fun. It could also become a necessity. The size of natural systems in which quantum effects have been identified has grown steadily over the past few years. For example, attention has recently turned to explaining the remarkable efficiency of light-harvesting complexes, comprising tens of molecules and thousands of atoms, using quantum mechanics. If this expansion continues, perhaps quantum physicists will have to embrace the concepts of complex networks.

To begin studying quantum complex networks, we found a revealing toy model. Let me tell you about it. Like all good stories, it has a beginning, a middle and an end. In this part, I’ll tell you the beginning and the middle. I’ll introduce the stochastic walk describing the randomly clicking web surfer mentioned above and a corresponding quantum walk. In part 2 the story ends with the bounding of the difference between the two walks in terms of the energy of the walker.

But for now I’ll start by introducing you to a graph, this time representing the internet!

If this taster gets you interested, there are more details available here:

• Mauro Faccin, Tomi Johnson, Jacob Biamonte, Sabre Kais and Piotr Migdał, Degree distribution in quantum walks on complex networks, arXiv:1305.6078 (2013).

What does the internet look like from above?

As we all know, the idea of the internet is to connect computers to each other. What do these connections look like when abstracted as a network, with each computer a node and each connection an edge?

The internet on a local scale, such as in your house or office, might look something like this:

 

Local network

with several devices connected to a central hub. Each hub connects to other hubs, and so the internet on a slightly larger scale might look something like this:

 

Regional network

What about the full global, not local, structure of the internet? To answer this question, researchers have developed representations of the whole internet, such as this one:

 

Global network

While such representations might be awe inspiring, how can we make any sense of them? Or are they merely excellent desktop wallpapers and new-age artworks?

In terms of complex network theory, there’s actually a lot that can be said that is not immediately obvious from the above representation.

For example, we find something very interesting if we plot the number of web pages with different incoming links (called degree) on a log-log axis. What is found for the African web is the following:

Power law degree distribution

This shows that very few pages are linked to by a very large number others, while a very large number of pages receive very few links. More precisely, what this shows is a power law distribution, the signature of which is a straight line on a log-log axis.

In fact, power law distributions arise in a diverse number of real world networks, human-built networks such as the internet and naturally occurring networks. It is often discussed alongside the concept of the preferential attachment; highly connected nodes seem to accumulate connections more quickly. We all know of a successful blog whose success had led to an increased presence and more success. That’s an example of preferential attachment.

It’s clear then that degree is an important concept in network theory, and its distribution across the nodes a useful characteristic of a network. Degree gives one indication of how important a node is in a network.

And this is where stochastic walks come in. Google, who are in the business of ranking the importance of nodes (web pages) in a network (the web), use (up to a small modification) the idealized model of a stochastic walker (web surfer) who randomly hops to connected nodes (follows one of the links on a page). This is called the uniform escape model, since the total rate of leaving any node is set to be the same for all nodes. Leaving the walker to wander for a long while, Google then takes the probability of the walker being on a node to rank the importance of that node. In the case that the network is undirected (all links are reciprocated) this long-time probability, and therefore the rank of the node, is proportional to the degree of the node.

So node degrees and the uniform escape model play an important role in the fields of complex networks and stochastic walks. But can they tell us anything about the much more poorly understood topics of quantum networks and quantum walks? In fact, yes, and demonstrating that to you is the purpose of this pair of articles.

Before we move on to the interesting bit, the math, it’s worth just listing a few properties of quantum walks that make them hard to analyze, and explaining why they are poorly understood. These are the difficulties we will show how to overcome below.

No convergence. In a stochastic walk, if you leave the walker to wander for a long time, eventually the probability of finding a walker at a node converges to a constant value. In a quantum walk, this doesn’t happen, so the walk can’t be characterized so easily by its long-time properties.

Dependence on initial states. In some stochastic walks the long-time properties of the walk are independent of the initial state. It is possible to characterize the stochastic walk without referring to the initialization of the walker. Such a characterization is not so easy in quantum walks, since their evolution always depends on the initialization of the walker. Is it even possible then to say something useful that applies to all initializations?

Stochastic and quantum generators differ. Those of you familiar with the network theory series know that some generators produce both stochastic and quantum walks (see part 16 for more details). However, most stochastic walk generators, including that for the uniform escape model, do not generate quantum walks and vice versa. How do we then compare stochastic and quantum walks when their generators differ?

With the task outlined, let’s get started!

Graphs and walks

In the next couple of sections I’m going to explain the diagram below to you. If you’ve been following the network theory series, in particular part 20, you’ll find parts of it familiar. But as it’s been a while since the last post covering this topic, let’s start with the basics.

Diagram outlining the main concepts

A simple graph G can be used to define both stochastic and quantum walks. A simple graph is something like this:

Illustration of a simple graph

where there is at most one edge between any two nodes, there are no edges from a node to itself and all edges are undirected. To avoid complications, let’s stick to simple graphs with a finite number n of nodes. Let’s also assume you can get from every node to every other node via some combination of edges i.e. the graph is connected.

In the particular example above the graph represents a network of n = 5 nodes, where nodes 3 and 4 have degree (number of edges) 3, and nodes 1, 2 and 5 have degree 2.

Every simple graph defines a matrix A, called the adjacency matrix. For a network with n nodes, this matrix is of size n \times n, and each element A_{i j} is unity if there is an edge between nodes i and j, and zero otherwise (let’s use this basis for the rest of this post). For the graph drawn above the adjacency matrix is

\left( \begin{matrix} 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{matrix} \right)

By construction, every adjacency matrix is symmetric:

A =A^T

(the T means the transposition of the elements in the node basis) and further, because each A is real, it is self-adjoint:

A=A^\dagger

(the \dagger means conjugate transpose).

This is nice, since (as seen in parts 16 and 20) a self-adjoint matrix generates a continuous-time quantum walk.

To recap from the series, a quantum walk is an evolution arising from a quantum walker moving on a network.

A state of a quantum walk is represented by a size n complex column vector \psi. Each element \langle i , \psi \rangle of this vector is the so-called amplitude associated with node i and the probability of the walker being found on that node (if measured) is the modulus of the amplitude squared |\langle i , \psi \rangle|^2. Here i is the standard basis vector with a single non-zero ith entry equal to unity, and \langle u , v \rangle = u^\dagger v is the usual inner product.

A quantum walk evolves in time according to the Schrödinger equation

\displaystyle{ \frac{d}{d t} \psi(t)= - i H \psi(t) }

where H is called the Hamiltonian. If the initial state is \psi(0) then the solution is written as

\psi(t) = \exp(- i t H) \psi(0)

The probabilities | \langle i , \psi (t) \rangle |^2 are guaranteed to be correctly normalized when the Hamiltonian H is self-adjoint.

There are other matrices that are defined by the graph. Perhaps the most familiar is the Laplacian, which has recently been a topic on this blog (see parts 15, 16 and 20 of the series, and this recent post).

The Laplacian L is the n \times n matrix

L = D - A

where the degree matrix D is an n \times n diagonal matrix with elements given by the degrees

\displaystyle{ D_{i i}=\sum_{j} A_{i j} }

For the graph drawn above, the degree matrix and Laplacian are:

\left( \begin{matrix} 2 & 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 & 0 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 0 & 2 \end{matrix} \right)  \qquad \mathrm{and} \qquad \left( \begin{matrix} 2 & -1 & 0 & -1 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 3 & -1 & -1 \\ -1 & 0 & -1 & 3 & -1 \\ 0 & 0 & -1 & -1 & 2 \end{matrix} \right)

The Laplacian is self-adjoint and generates a quantum walk.

The Laplacian has another property; it is infinitesimal stochastic. This means that its off diagonal elements are non-positive and its columns sum to zero. This is interesting because an infinitesimal stochastic matrix generates a continuous-time stochastic walk.

To recap from the series, a stochastic walk is an evolution arising from a stochastic walker moving on a network.

A state of a stochastic walk is represented by a size n non-negative column vector \psi. Each element \langle i , \psi \rangle of this vector is the probability of the walker being found on node i.

A stochastic walk evolves in time according to the master equation

\displaystyle{ \frac{d}{d t} \psi(t)= - H \psi(t) }

where H is called the stochastic Hamiltonian. If the initial state is \psi(0) then the solution is written

\psi(t) = \exp(- t H) \psi(0)

The probabilities \langle i , \psi (t) \rangle are guaranteed to be non-negative and correctly normalized when the stochastic Hamiltonian H is infinitesimal stochastic.

So far, I have just presented what has been covered on Azimuth previously. However, to analyze the important uniform escape model we need to go beyond the class of (Dirichlet) generators that produce both quantum and stochastic walks. Further, we have to somehow find a related quantum walk. We’ll see below that both tasks are achieved by considering the normalized Laplacians: one generating the uniform escape stochastic walk and the other a related quantum walk.

Normalized Laplacians

The two normalized Laplacians are:

• the asymmetric normalized Laplacian S = L D^{-1} (that generates the uniform escape Stochastic walk) and

• the symmetric normalized Laplacian Q = D^{-1/2} L D^{-1/2} (that generates a Quantum walk).

For the graph drawn above the asymmetric normalized Laplacian S is

\left( \begin{matrix} 1 & -1/2 & 0 & -1/3 & 0 \\ -1/2 & 1 & -1/3 & 0 & 0 \\ 0 & -1/2 & 1 & -1/3 & -1/2 \\ -1/2 & 0 & -1/3 & 1 & -1/2 \\ 0 & 0 & -1/3 & -1/3 & 1 \end{matrix} \right)

The identical diagonal elements indicates that the total rates of leaving each node are identical, and the equality within each column of the other non-zero elements indicates that the walker is equally likely to hop to any node connected to its current node. This is the uniform escape model!

For the same graph the symmetric normalized Laplacian Q is

\left( \begin{matrix} 1 & -1/2 & 0 & -1/\sqrt{6}  & 0 \\ -1/2 & 1 & -1/\sqrt{6}  & 0 & 0 \\ 0 & -1/\sqrt{6}  & 1 & -1/3 & -1/\sqrt{6}  \\ -1/\sqrt{6} & 0 & -1/3 & 1 & -1/\sqrt{6}  \\ 0 & 0 & -1/\sqrt{6}  & -1/\sqrt{6}  & 1 \end{matrix} \right)

That the diagonal elements are identical in the quantum case indicates that all nodes are of equal energy, this is type of quantum walk usually considered.

Puzzle 1. Show that in general S is infinitesimal stochastic but not self-adjoint.

Puzzle 2. Show that in general Q is self-adjoint but not infinitesimal stochastic.

So a graph defines two matrices: one S that generates a stochastic walk, and one Q that generates a quantum walk. The natural question to ask is whether these walks are related. The answer is that they are!

Underpinning this relationship is the mathematical property that S and Q are similar. They are related by the following similarity transformation

S = D^{1/2} Q D^{-1/2}

which means that any eigenvector \phi_k of Q associated to eigenvalue \epsilon_k gives a vector

\pi_k \propto D^{1/2} \phi_k

that is an eigenvector of S with the same eigenvalue! To show this, insert the identity I = D^{-1/2} D^{1/2} into

Q \phi_k = \epsilon_k \phi_k

and multiply from the left with D^{1/2} to obtain

\begin{aligned} (D^{1/2}  Q D^{-1/2} ) (D^{1/2} \phi_k) &= \epsilon_k ( D^{1/2} \phi_k ) \\ S \pi_k &= \epsilon_k \pi_k  \end{aligned}

The same works in the opposite direction. Any eigenvector \pi_k of S gives an eigenvector

\phi_k \propto D^{-1/2} \pi_k

of Q with the same eigenvalue \epsilon_k.

The mathematics is particularly nice because Q is self-adjoint. A self-adjoint matrix is diagonalizable, and has real eigenvalues and orthogonal eigenvectors.

As a result, the symmetric normalized Laplacian can be decomposed as

Q = \sum_k \epsilon_k \Phi_k

where \epsilon_k is real and \Phi_k are orthogonal projectors. Each \Phi_k acts as the identity only on vectors in the space spanned by \phi_k and as zero on all others, such that

\Phi_k \Phi_\ell = \delta_{k \ell} \Phi_k.

Multiplying from the left by D^{1/2} and the right by D^{-1/2} results in a similar decomposition for S:

S = \sum_k \epsilon_k \Pi_k

with orthogonal projectors

\Pi_k  = D^{1/2} \Phi_k D^{-1/2}

I promised above that I would explain the following diagram:

Diagram outlining the main concepts (again)

Let’s summarize what it represents now:

G is a simple graph that specifies

A the adjacency matrix (generator of a quantum walk), which subtracted from

D the diagonal matrix of the degrees gives

L the symmetric Laplacian (generator of stochastic and quantum walks), which when normalized by D returns both

S the generator of the uniform escape stochastic walk and

Q the quantum walk generator to which it is similar!

What next?

Sadly, this is where we’ll finish for now.

We have all the ingredients necessary to study the walks generated by the normalized Laplacians and exploit the relationship between them.

Next time, in part 2, I’ll talk you through the mathematics of the uniform escape stochastic walk S and how it connects to the degrees of the nodes in the long-time limit. Then I’ll show you how this helps us solve aspects of the quantum walk generated by Q.

In other news

Before I leave you, let me tell you about a workshop the ISI team recently attended (in fact helped organize) at the Institute of Quantum Computing, on the topic of quantum computation and complex networks. Needless to say, there were talks on papers related to quantum mechanics and networks!

Some researchers at the workshop gave exciting talks based on numerical examinations of what happens if a quantum walk is used instead of a stochastic walk to rank the nodes of a network:

• Giuseppe Davide Paparo and Miguel Angel Martín-Delgado, Google in a quantum network, Sci. Rep. 2 (2012), 444.

• Eduardo Sánchez-Burillo, Jordi Duch, Jesús Gómez-Gardenes and David Zueco, Quantum navigation and ranking in complex networks, Sci. Rep. 2 (2012), 605.

Others attending the workshop have numerically examined what happens when using quantum computers to represent the stationary state of a stochastic process:

• Silvano Garnerone, Paolo Zanardi and Daniel A. Lidar, Adiabatic quantum algorithm for search engine ranking, Phys. Rev. Lett. 108 (2012), 230506.

It was a fun workshop and we plan to organize/attend more in the future!


Monte Carlo Methods in Climate Science

23 July, 2013

joint with David Tweed

One way the Azimuth Project can help save the planet is to get bright young students interested in ecology, climate science, green technology, and stuff like that. So, we are writing an article for Math Horizons, an American magazine for undergraduate math majors. This blog article is a draft of that. You can also see it in PDF form here.

We’d really like to hear your comments! There are severe limits on including more detail, since the article should be easy to read and short. So please don’t ask us to explain more stuff: we’re most interested to know if you sincerely don’t understand something, or feel that students would have trouble understanding something. For comparison, you can see sample Math Horizons articles here.

Introduction

They look placid lapping against the beach on a calm day, but the oceans are actually quite dynamic. The ocean currents act as ‘conveyor belts’, transporting heat both vertically between the water’s surface and the depths and laterally from one area of the globe to another. This effect is so significant that the temperature and precipitation patterns can change dramatically when currents do.

For example: shortly after the last ice age, northern Europe experienced a shocking change in climate from 10,800 to 9,500 BC. At the start of this period temperatures plummeted in a matter of decades. It became 7° Celsius colder, and glaciers started forming in England! The cold spell lasted for over a thousand years, but it ended as suddenly as it had begun.

Why? The most popular theory is that that a huge lake in North America formed by melting glaciers burst its bank—and in a massive torrent lasting for years, the water from this lake rushed out to the northern Atlantic ocean. By floating atop the denser salt water, this fresh water blocked a major current: the Atlantic Meridional Overturning Circulation. This current brings warm water north and helps keep northern Europe warm. So, when iit shut down, northern Europe was plunged into a deep freeze.

Right now global warming is causing ice sheets in Greenland to melt and release fresh water into the North Atlantic. Could this shut down the Atlantic Meridional Overturning Circulation and make the climate of Northern Europe much colder? In 2010, Keller and Urban [KU] tackled this question using a simple climate model, historical data, probability theory, and lots of computing power. Their goal was to understand the spectrum of possible futures compatible with what we know today.

Let us look at some of the ideas underlying their work.

Box models

The earth’s physical behaviour, including the climate is far too complex to simulate from the bottom up using basic physical principles, at least for now. The most detailed models today can take days to run on very powerful computers. So to make reasonable predictions on a laptop in a tractable time-frame, geophysical modellers use some tricks.

First, it is possible to split geophysical phenomena into ‘boxes’ containing strongly related things. For example: atmospheric gases, particulate levels and clouds all affect each other strongly; likewise the heat content, currents and salinity of the oceans all interact strongly. However, the interactions between the atmosphere and the oceans are weaker, and we can approximately describe them using just a few settings, such as the amount of atmospheric CO2 entering or leaving the oceans. Clearly these interactions must be consistent—for example, the amount of CO2 leaving the atmosphere box must equal the amount entering the ocean box—but breaking a complicated system into parts lets different specialists focus on different aspects; then we can combine these parts and get an approximate model of entire planet. The box model used by Keller and Urban is shown in Figure 1.



1. The box model used by Keller and Urban.

 
Second, it turn out that simple but effective box models can be distilled from the complicated physics in terms of forcings and feedbacks. Essentially a forcing is a measured input to the system, such as solar radiation or CO2 released by burning fossil fuels. As an analogy, consider a child on a swing: the adult’s push every so often is a forcing. Similarly a feedback describes how the current ‘box variables’ influence future ones. In the swing analogy, one feedback is how the velocity will influence the future height. Specifying feedbacks typically uses knowledge of the detailed low-level physics to derive simple, tractable functional relationships between groups of large-scale observables, a bit like how we derive the physics of a gas by thinking about collisions of lots of particles.

However, it is often not feasible to get actual settings for the parameters in our model starting from first principles. In other words, often we can get the general form of the equations in our model, but they contain a lot of constants that we can estimate only by looking at historical data.

Probability modeling

Suppose we have a box model that depends on some settings S. For example, in Keller and Urban’s model, S is a list of 18 numbers. To keep things simple, suppose the settings are element of some finite set. Suppose we also have huge hard disc full of historical measurements, and we want to use this to find the best estimate of S. Because our data is full of ‘noise’ from other, unmodeled phenomena we generally cannot unambiguously deduce a single set of settings. Instead we have to look at things in terms of probabilities. More precisely, we need to study the probability that S take some value s given that the measurements take some value. Let’s call the measurements M, and again let’s keep things simple by saying M takes values in some finite set of possible measurements.

The probability that S = s given that M takes some value m is called the conditional probability P(S=s | M=m). How can we compute this conditional probability? This is a somewhat tricky problem.

One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values S = s, the model predicts measurements M=m. This again is a conditional probability, but now it is called P(M=m|S=s).

This is not what we want: it’s backwards! But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

\displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }

Here P(S = s) is the probability that the settings take a specific value s, and similarly for P(M = m). Bayes’ rule is quite easy to prove, and it is actually a general rule that applies to any random variables, not just the settings and the measurements in our problem [Y]. It underpins most methods of figuring out hidden quantities from observed ones. For this reason, it is widely used in modern statistics and data analysis [K].

How does Bayes’ rule help us here? When we repeatedly run our model with randomly chosen settings, we have control over P(S = s). As mentioned, we can compute P(M=m| S=s). Finally, P(M = m) is independent of our choice of settings. So, we can use Bayes’ rule to compute P(S = s | M = m) up to a constant factor. And since probabilities must sum to 1, we can figure out this constant.

This lets us do many things. It lets us find the most likely values of the settings for our model, given our hard disc full of observed data. It also lets us find the probability that the settings lie within some set. This is important: if we’re facing the possibility of a climate disaster, we don’t just want to know the most likely outcome. We would like to know to know that with 95% probability, the outcome will lie in some range.

An example

Let us look at an example much simpler than that considered by Keller and Urban. Suppose our measurements are real numbers m_0,\dots, m_T related by

m_{t+1} = s m_t - m_{t-1} + N_t

Here s, a real constant, is our ‘setting’, while N_t is some ‘noise’: an independent Gaussian random variable for each time t, each with mean zero and some fixed standard deviation. Then the measurements m_t will have roughly sinusoidal behavior but with irregularity added by the noise at each time step, as illustrated in Figure 2.



2. The example system: red are predicted measurements for a given value of the settings, green is another simulation for the same s value and blue is a simulation for a slightly different s.

 
Note how there is no clear signal from either the curves or the differences that the green curve is at the correct setting value while the blue one has the wrong one: the noise makes it nontrivial to estimate s. This is a baby version of the problem faced by Keller and Urban.

Markov Chain Monte Carlo

Having glibly said that we can compute the conditional probability P(M=m | S=s), how do we actually do this? The simplest way would be to run our model many, many times with the settings set at S=s and determine the fraction of times it predicts measurements equal to m. This gives us an estimate of P(M=m | S=s). Then we can use Bayes’ rule to work out P(M=m|S=s), at least up to a constant factor.

Doing all this by hand would be incredibly time consuming and error prone, so computers are used for this task. In our example, we do this in Figure 3. As we keep running our model over and over, the curve showing P(M=m |S=s) as a function of s settles down to the right answer.


3. The estimates of P(M=m | S=s) as a function of s using uniform sampling, ending up with 480 samples at each point.

 

However, this is computationally inefficient, as shown in the probability distribution for small numbers of samples. This has quite a few ‘kinks’, which only disappear later. The problem is that there are lots of possible choices of s to try. And this is for a very simple model!

When dealing with the 18 settings involved in the model of Keller and Urban, trying every combination would take far too long. A way to avoid this is Markov Chain Monte Carlo sampling. Monte Carlo is famous for its casinos, so a ‘Monte Carlo’ algorithm is one that uses randomness. A ‘Markov chain’ is a random walk: for example, where you repeatedly flip a coin and take one step right when you get heads, and one step right when you get tails. So, in Markov Chain Monte Carlo, we perform a random walk through the collection of all possible settings, collecting samples.

The key to making this work is that at each step on the walk a proposed modification s' to the current settings s is generated randomly—but it may be rejected if it does not seem to improve the estimates. The essence of the rule is:

The modification s \mapsto s' is randomly accepted with a probability equal to the ratio

\displaystyle{ \frac{P(M=m | S=s')}{ P(M=m | S=s)} }

Otherwise the walk stays at the current position.

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from which can be used to compute P(M=m | S=s). Then we can compute P(S = s | M = m) using Bayes’ rule.

Figure 4 shows the results of using the Markov Chain Monte Carlo procedure to figure out P(S= s| M= m) in our example.


4. The estimates of P(S = s|M = m) curves using Markov Chain Monte Carlo, showing the current distribution estimate at increasing intervals. The red line shows the current position of the random walk. Again the kinks are almost gone in the final distribution.

 

Note that the final distribution has only peformed about 66 thousand simulations in total, while the full sampling peformed over 1.5 million. The key advantage of Markov Chain Monte Carlo is that it avoids performing many simulations in areas where the probability is low, as we can see from the way the walk path remains under the big peak in the probability density almost all the time. What is more impressive is that it achieves this without any global view of the probability density, just by looking at how P(M=m | S=s) changes when we make small changes in the settings. This becomes even more important as we move to dealing with systems with many more dimensions and settings, where it proves very effective at finding regions of high probability density whatever their shape.

Why is it worth doing so much work to estimate the probability distribution for settings for a climate model? One reason is that we can then estimate probabilities of future events, such as the collapse of the Atlantic Meridional Ocean Current. And what’s the answer? According to Keller and Urban’s calculation, this current will likely weaken by about a fifth in the 21st century, but a complete collapse is unlikely before 2300. This claim needs to be checked in many ways—for example, using more detailed models. But the importance of the issue is clear, and we hope we have made the importance of good mathematical ideas for climate science clear as well.

Exploring the topic

The Azimuth Project is a group of scientists, engineers and computer programmers interested in questions like this [A]. If you have questions, or want to help out, just email us. Versions of the computer programs we used in this paper will be made available here in a while.

Here are some projects you can try, perhaps with the help of Kruschke’s textbook [K]:

• There are other ways to do setting estimation using time series: compare some to MCMC in terms of accuracy and robustness.

• We’ve seen a 1-dimensional system with one setting. Simulate some multi-dimensional and multi-setting systems. What new issues arise?

Acknowledgements. We thank Nathan Urban and other
members of the Azimuth Project for many helpful discussions.

References

[A] Azimuth Project, http://www.azimuthproject.org.

[KU] Klaus Keller and Nathan Urban, Probabilistic hindcasts and projections of the coupled climate, carbon cycle and Atlantic meridional overturning circulation system: a Bayesian fusion of century-scale measurements with a simple model, Tellus A 62 (2010), 737–750. Also available free online.

[K] John K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R and BUGS, Academic Press, New York, 2010.

[Y] Eliezer S. Yudkowsky, An intuitive explanation of Bayes’ theorem.


Negative Probabilities

19 July, 2013

 

The physicists Dirac and Feynman, both bold when it came to new mathematical ideas, both said we should think about negative probabilities.

These days, Kolmogorov’s axioms for probabilities are used to justify formulating probability theory in terms of measure theory. Mathematically, the theory of measures that take negative or even complex values is well-developed. So, to the extent that probability theory is just measure theory, you can say a lot is known about negative probabilities.

But probability theory is not just measure theory; it adds its own distinctive ideas. To get these into the picture, we really need to ask some basic questions, like: what could it mean to say something had a negative chance of happening?

I really have no idea.

In this paper:

• Paul Dirac, The physical interpretation of quantum mechanics, Proc. Roy. Soc. London A 180 (1942), 1–39.

Dirac wrote:

Negative energies and probabilities should not be considered as nonsense. They are well-defined concepts mathematically, like a negative of money.

In fact, I think negative money could have been the origin of negative numbers. Venetian bankers started writing numbers in red to symbolize debts—hence the phrase ‘in the red’ for being in debt. So, you could say negative numbers were invented to formalize the idea of debt and make accounting easier. Bankers couldn’t really get rich if negative money didn’t exist.

A negative dollar is a dollar you owe someone. But how can you owe someone a probability? I haven’t figured this out.

Unsurprisingly, the clearest writing about negative probabilities that I’ve found is by Feynman:

• Richard P. Feynman, Negative probability, in Quantum Implications: Essays in Honour of David Bohm, eds. F. David Peat and Basil Hiley, Routledge & Kegan Paul Ltd, London, 1987, pp. 235–248.

He emphasizes that even if the final answer of a calculation must be positive, negative numbers are often allowed to appear in intermediate steps… and that this can happen with probabilities.

Let me quote some:

Some twenty years ago one problem we theoretical physicists had was that if we combined the principles of quantum mechanics and those of relativity plus certain tacit assumptions, we seemed only able to produce theories (the quantum field theories) which gave infinity for the answer to certain questions. These infinities are kept in abeyance (and now possibly eliminated altogether) by the awkward process of renormalization. In an attempt to understand all this better, and perhaps to make a theory which would give only finite answers from the start, I looked into the” tacit assumptions” to see if they could be altered.

One of the assumptions was that the probability for an event must always be a positive number. Trying to think of negative probabilities gave me cultural shock at first, but when I finally got easy with the concept I wrote myself a note so I wouldn’t forget my thoughts. I think that Prof. Bohm has just the combination of imagination and boldness to find them interesting and amusing. I am delighted to have this opportunity to publish them in such an appropriate place. I have taken the opportunity to add some further, more recent, thoughts about applications to two state systems.

Unfortunately I never did find out how to use the freedom of allowing probabilities to be negative to solve the original problem of infinities in quantum field theory!


It is usual to suppose that, since the probabilities of events must be positive, a theory which gives negative numbers for such quantities must be absurd. I should show here how negative probabilities might be interpreted. A negative number, say of apples, seems like an absurdity. A man starting a day with five apples who gives away ten and is given eight during the day has three left. I can calculate this in two steps: 5 -10 = -5 and -5 + 8 + 3. The final answer is satisfactorily positive and correct although in the intermediate steps of calculation negative numbers appear. In the real situation there must be special limitations of the time in which the various apples are received and given since he never really has a negative number, yet the use of negative numbers as an abstract calculation permits us freedom to do our mathematical calculations in any order simplifying the analysis enormously, and permitting us to disregard inessential details. The idea of negative numbers is an exceedingly fruitful mathematical invention. Today a person who balks at making a calculation in this way is considered backward or ignorant, or to have some kind of a mental block. It is the purpose of this paper to point out that we have a similar strong block against negative probabilities. By discussing a number of examples, I hope to show that they are entirely rational of course, and that their use simplifies calculation and thought in a number of applications in physics.

First let us consider a simple probability problem, and how we usually calculate things and then see what would happen if we allowed some of our normal probabilities in the calculations to be negative. Let us imagine a roulette wheel with, for simplicity, just three numbers: 1, 2, 3. Suppose however, the operator by control of a switch under the table can put the wheel into one of two conditions A, B in each of which the probability of 1, 2, 3 are different. If the wheel is in condition A, the probability of 1 is p1A = 0.3 say, of 2 is p2A = 0.6, of 3 is p3A =0.1. But if the wheel is in condition B, these probabilities are

p1B = 0.1, p2B = 0.4, p3B = 0.5

say as in the table.

      Cond. A     Cond. B
1 0.3 0.1
2 0.6 0.4
3 0.1 0.5

We, of course, use the table in this way: suppose the operator puts the wheel into condition A 7/10 of the time and into B the other 3/10 of the time at random. (That is the probability of condition A, pA = 0.7, and of B, pB = 0.3.) Then the probability of getting 1 is

Prob. 1 = 0.7 (0.3) + 0.3 (0.1) = 0.24,

etc.

[…]

Now, however, suppose that some of the conditional probabilities are negative, suppose the table reads so that, as we shall say, if the system is in condition B the probability of getting 1 is -0.4. This sounds absurd but we must say it this way if we wish that our way of thought and language be precisely the same whether the actual quantities pi α in our calculations are positive or negative. That is the essence of the mathematical use of negative numbers—to permit an efficiency in reasoning so that various cases can be considered together by the same line of reasoning, being assured that intermediary steps which are not readily interpreted (like -5 apples) will not lead to absurd results. Let us see what p1B = -0.4 “means” by seeing how we calculate with it.

He gives an example showing how meaningful end results can sometimes arise even if the conditional probabilities like p1B are negative or greater than 1.

It is not my intention here to contend that the final probability of a verifiable physical event can be negative. On the other hand, conditional probabilities and probabilities of imagined intermediary states may be negative in a calculation of probabilities of physical events or states. If a physical theory for calculating probabilities yields a negative probability for a given situation under certain assumed conditions, we need not conclude the theory is incorrect. Two other possibilities of interpretation exist. One is that the conditions (for example, initial conditions) may not be capable of being realized in the physical world. The other possibility is that the situation for which the probability appears to be negative is not one that can be verified directly. A combination of these two, limitation of verifiability and freedom in initial conditions, may also be a solution to the apparent difficulty.

The rest of this paper illustrates these points with a number of examples drawn from physics which are less artificial than our roulette wheel. Since the result must ultimately have a positive probability, the question may be asked, why not rearrange the calculation so that the probabilities are positive in all the intermediate states? The same question might be asked of an accountant who subtracts the total disbursements before adding the total receipts. He stands a chance of going through an intermediary negative sum. Why not rearrange the calculation? Why bother? There is nothing mathematically wrong with this method of calculating and it frees the mind to think clearly and simply in a situation otherwise quite complicated. An analysis in terms of various states or conditions may simplify a calculation at the expense of requiring negative probabilities for these states. It is not really much expense.

Our first physical example is one in which one· usually uses negative probabilities without noticing it. It is not a very profound example and is practically the same in content as our previous example. A particle diffusing in one dimension in a rod has a probability of being at x at time t of P(x,t) satisfying

\partial P(x,t)/\partial t = -\partial^2 P(x,t)/\partial x^2

Suppose at x =0 and x =\pi the rod has absorbers at both ends so that P(x,t) = 0 there. Let the probability of being at x at t = 0 be given as P(x,0) =f(x). What is P(x,t) thereafter? It is

\displaystyle{ P(x,t) = \sum_{n=1}^\infty P_n \; \sin x \;\exp(-n^2 t) }

where P_n is given by

x \displaystyle{  f(x) = \sum_{n = 1}^\infty P_n \; \sin n x }

or

x \displaystyle{ P_n = \frac{2}{\pi} \int_0^\pi f(x) \sin nx  \; dx }

The easiest way of analyzing this (and the way used if P(x,t) is a temperature, for example) is to say that there are certain distributions that behave in an especially simple way. If f(x) starts as \sin nx it will remain that shape, simply decreasing with time, as e^{-n^2 t} Any distribution f(x) can be thought of as a superposition of such sine waves. But f(x) cannot be \sin nx if f(x) is a probability and probabilities must always be positive. Yet the analysis is so simple this way that no one has really objected for long.

He also gives examples from quantum mechanics, but the interesting thing about the examples above is that they’re purely classical—and the second one, at least, is something physicists are quite used to.

Sometimes it’s good to temporarily put aside making sense of ideas and just see if you can develop rules to consistently work with them. For example: the square root of -1. People had to get good at using it before they understood what it really was: a rotation by a quarter turn in the plane.

Along those, lines, here’s an interesting attempt to work with negative probabilities:

• Gábor J. Székely, Half of a coin: negative probabilities, Wilmott Magazine (July 2005), 66–68.

He uses rigorous mathematics to study something that sounds absurd: ‘half a coin’. Suppose you make a bet with an ordinary fair coin, where you get 1 dollar if it comes up heads and 0 dollars if it comes up tails. Next, suppose you want this bet to be the same as making two bets involving two separate ‘half coins’. Then you can do it if a half coin has infinitely many sides numbered 0,1,2,3, etc., and you win n dollars when side number n comes up….

… and if the probability of side n coming up obeys a special formula…

and if this probability can be negative sometimes!

This seems very bizarre, but the math is solid, even if the problem of interpreting it may drive you insane.

Let’s see how it works. Consider a game G where the probability of winning n = 0, 1, 2, \dots dollars is g(n). Then we can summarize this game using a generating function:

\displaystyle{ G(z) = \sum_{n = 0}^\infty g(n) , z^n }

Now suppose you play two independent games like this, G and another one, say H, with generating function

\displaystyle{ H(z) = \sum_{n = 0}^\infty h(n) , z^n }

Then there’s a new game GH that consists of playing both games. The reason I’m writing it as GH is that its generating function is the product

\displaystyle{ G(z) H(z) = \sum_{m,n = 0}^\infty g(m) h(n) z^{m+n} }

See why? With probability g(m) h(n) you win m dollars in game G and n dollars in game H, for a total of m + n dollars.

The game where you flip a fair coin and win 1 dollar if it lands heads up and 0 dollars if lands tails up has generating function

\displaystyle{ G(z) = \frac{1}{2}(1 + z) }

The half-coin is an imaginary game H such that playing two copies of this game is the same as playing the game G. If such a game really existed, we would have

G(z) = H(z)^2

so

\displaystyle{ H(z) = \sqrt{\frac{1}{2}(1 + z)} }

However, if you work out the Taylor series of this function, every even term is negative except for the zeroth term. So, this game can exist only if we allow negative probabilities.

(Experts on generating functions and combinatorics will enjoy how the coefficients of the Taylor series of H(z) involves the Catalan numbers.)

By the way, it’s worth remembering that for a long time mathematicians believed that negative numbers made no sense. As late as 1758 the British mathematician Francis Maseres claimed that negative numbers

… darken the very whole doctrines of the equations and make dark of the things which are in their nature excessively obvious and simple.

So opinions on these things can change. And since I’ve spent a lot of time working on ‘sets with fractional cardinality’, and have made lots of progress on that idea, and other strange ideas, I like to spend a little time now and then investigating other nonsensical-sounding generalizations of familiar concepts.

This paper by Mark Burgin has a nice collection of references on negative probability:

• Mark Burgin, Interpretations of negative probability.

He valiantly tries to provide a frequentist interpretation of negative probabilities. He needs ‘negative events’ to get negative frequencies of events occurring, and he gives this example:

To better understand how negative elementary events appear and how negative probability emerges, consider the following example. Let us consider the situation when an attentive person A with the high knowledge of English writes some text T. We may ask what the probability is for the word “texxt” or “wrod” to appear in his text T. Conventional probability theory gives 0 as the answer. However, we all know that there are usually misprints. So, due to such a misprint this word may appear but then it would be corrected. In terms of extended probability, a negative value (say, -0.1) of the probability for the word “texxt” to appear in his text T means that this word may appear due to a misprint but then it’ll be corrected and will not be present in the text T.

Maybe he’s saying that the misprint occurs with probability 0.1 and then it ‘de-occurs’ with the same probability, giving a total probability of

0.1 - 0.1 = 0

I’m not sure.

Here’s another paper on the subject:

• Espen Gaarder Haug, Why so negative to negative probabilities?, Wilmott Magazine.

It certainly gets points for a nice title! However, like Burgin’s paper, I find it a lot less clear than what Feynman wrote.

Notice that like Székely’s paper, Haug’s originally appeared in the Wilmott Magazine. I hadn’t heard of that, but it’s about finance. So it seems that the bankers, having invented negative numbers to get us into debt, are now struggling to invent negative probabilities! In fact Haug’s article tries some applications of negative probabilities to finance.

Scary.

For further discussion, with some nice remarks by the quantum physics experts Matt Leifer and Michael Nielsen, see the comments on my Google+ post on this topic. Matt Leifer casts cold water on the idea of using negative probabilities in quantum theory. On the other hand, Michael Nielsen points out some interesting features of the Wigner quasiprobability distribution, which is the best possible attempt to assign a probability density for a quantum particle to have any given position and momentum. It can be negative! But if you integrate it over all momenta, you get the probability density for the particle having any given position:

|\psi(x)|^2

And if you integrate it over all positions, you get the probability density for the particle having any given momentum:

|\widehat{\psi}(p)|^2


Follow

Get every new post delivered to your Inbox.

Join 3,095 other followers