I’ll start you off with two puzzles. Their relevance should become clear by the end of this post:

• **Puzzle 1.** Suppose I have a box of jewels. The average value of a jewel in the box is $10. I randomly pull one out of the box. What’s the probability that its value is at least $100?

• **Puzzle 2.** Suppose I have a box full of numbers—they can be arbitrary real numbers. Their average is zero, and their standard deviation is 10. I randomly pull one out. What’s the probability that it’s at least 100?

Before you complain, I’ll admit: in both cases, you can’t actually tell me the probability. But you *can* say *something* about the probability! What’s the most you can say?

### Noether theorems

Some good news: Brendan Fong, who worked here with me, has now gotten a scholarship to do his PhD at the University of Oxford! He’s talking to with people like Bob Coecke and Jamie Vicary, who work on diagrammatic and category-theoretic approaches to quantum theory.

But we’ve also finished a paper on good old-fashioned probability theory:

• John Baez and Brendan Fong, A Noether theorem for Markov processes.

This is based on a result Brendan proved in the network theory series on this blog. But we go further in a number of ways.

What’s the basic idea?

For months now I’ve been pushing the idea that we can take ideas from quantum mechanics and push them over to ‘stochastic mechanics’, which differs in that we work with probabilities rather than amplitudes. Here we do this for Noether’s theorem.

I should warn you: here I’m using ‘Noether’s theorem’ in an extremely general way to mean any result relating *symmetries* and *conserved quantities*. There are many versions. We prove a version that applies to Markov processes, which are random processes of the nicest sort: those where the rules don’t change with time, and the state of the system in the future only depends on its state now, not the past.

In quantum mechanics, there’s a very simple relation between symmetries and conserved quantities: an observable commutes with the Hamiltonian if and only if its expected value remains constant in time for every state. For Markov processes this is no longer true. But we show the next best thing: an observable commutes with the Hamiltonian if and only if both its expected value *and standard deviation* are constant in time for every state!

Now, we explained this stuff very simply and clearly back in Part 11 and Part 13 of the network theory series. We also tried to explain it clearly in the paper. So now let me explain it in a complicated, confusing way, for people who prefer that.

(Judging from the papers I read, that’s a lot of people!)

I’ll start by stating the quantum theorem we’re trying to mimic, and then state the version for Markov processes.

### Noether’s theorem: quantum versions

For starters, suppose both our Hamiltonian and the observable are *bounded* self-adjoint operators. Then we have this:

**Noether’s Theorem, Baby Quantum Version.** Let and be bounded self-adjoint operators on some Hilbert space. Then

if and only if for all states obeying Schrödinger’s equation

the expected value is constant as a function of

What if is an unbounded self-adjoint operator? That’s no big deal: we can get a bounded one by taking where is any bounded measurable function. But Hamiltonians are rarely bounded for fully realistic quantum systems, and we can’t mess with the Hamiltonian without changing Schrödinger’s equation! So, we definitely want a version of Noether’s theorem that lets be unbounded.

It’s a bit tough to make the equation precise in a useful way when is unbounded, because then is only densely defined. If doesn’t map the domain of to itself, it’s hard to know what even means! We could demand that *does* preserve the domain of but a better workaround is instead to say that

for all Then we get this:

**Noether’s Theorem, Full-fledged Quantum Version.** Let and be self-adjoint operators on some Hilbert space, with being bounded. Then

if and only if for all states

the expected value is constant as a function of

Here of course we’re using the fact that is what we get when we *solve* Schrödinger’s equation with initial data

But in fact, this version of Noether’s theorem follows instantly from a simpler one:

**Noether’s Theorem, Simpler Quantum Version.** Let be a unitary operator and let be a bounded self-adjoint operator on some Hilbert space. Then

if and only if for all states

This version applies to a single unitary operator instead of the 1-parameter unitary group

It’s incredibly easy to prove. And this is is the easiest version to copy over to the Markov case! However, the proof over there is not quite so easy.

### Noether’s theorem: stochastic versions

In stochastic mechanics we describe states using probability distributions, not vectors in a Hilbert space. We also need a new concept of ‘observable’, and unitary operators will be replaced by ‘stochastic operators’.

Suppose that is a -finite measure space with a measure we write simply as . Then probability distributions on lie in Let’s define an **observable** to be any element of the dual space allowing us to define the expected valued of in the probability distribution to be

The angle brackets are supposed to remind you of quantum mechanics, but we don’t have an inner product on a Hilbert space anymore! Instead, we have a pairing between and Probability distributions live in while observables live in But we can also think of an observable as a bounded operator on namely the operator of multiplying by the function

Let’s say an operator

is **stochastic** if it’s bounded and it maps probability distributions to probability distributions. Equivalently, is stochastic if it’s linear and it obeys

and

for all .

A Markov process, or technically a **Markov semigroup**, is a collection of operators

for such that:

• is stochastic for all

• depends continuously on

• for all

•

By the Hille–Yosida theorem, any Markov semigroup may be written as

for some operator , called its **Hamiltonian**. However, is typically unbounded and only densely defined. This makes it difficult to work with the commutator So, we should borrow a trick from quantum mechanics and work with the commutator instead. This amounts to working directly with the Markov semigroup instead of its Hamiltonian. And then we have:

**Noether’s Theorem, Full-fledged Stochastic Version.** Suppose is a -finite measure space and

is a Markov semigroup. Suppose is an observable. Then

for all if and only if for all probability distributions on , and are constant as a function of

In plain English: time evolution commutes with an observable if the mean and standard deviation of that observable never change with time. As in the quantum case, this result follows instantly from a simpler one, which applies to a single stochastic operator:

**Noether’s Theorem, Simpler Stochastic Version.** Suppose is a -finite measure space and

is stochastic operator. Suppose is an observable. Then

if and only if for all probability distributions on

and

It looks simple, but the proof is a bit tricky! It’s easy to see that implies those other equations; the work lies in showing the converse. The reason is that implies

for *all* , not just 1 and 2. The expected values of the powers of are more or less what people call its moments. So, we’re saying *all* the moments of are unchanged when we apply to an arbitrary probability distribution, given that we know this fact for the first two.

The proof is fairly technical but also sort of cute: we use Chebyshev’s inequality, which says that the probability of a random variable taking a value at least standard deviations away from its mean is less than or equal to I’ve always found this to be an amazing fact, but now it seems utterly obvious. You can figure out the proof yourself if you do the puzzles at the start of this post.

But now I’ll let you read our paper! And I’m really hoping you’ll spot mistakes, or places it can be improved.

Let J be the set of all jewels, and E the set of expensive jewels: those valued at $100 or more. We want to know something about the probability |E|/|J|.

The average value of all jewels is $10, so if we write for the value of jewel :

We also have:

from which it follows that:

Combining these inequalities, we obtain:

So the probability that a randomly chosen jewel’s value is at least $100 is at most one tenth.

Great—excellent! This is basically the proof of

Markov’s Inequality.If a nonnegative random variable has mean the probability that its value is at least isYour proof takes and but apart from that it’s quite general.

However, we can be even more general. We don’t need the jewels to be equally probable, as long as we interpret ‘average value’ as the expected value with respect to the probability distribution which we use to randomly choose a jewel.

In this more general case, Markov’s inequality follows from this obvious fact:

Fact.Let be a measurable function on a measure space If the measure of the setis , then the integral of is at least

Just take to be a probability measure space; then this says the mean of the random variable is at least times the probability that And that’s another way of stating Markov’s inequality.

The first inequality used for Puzzle 1, where the sum over the full set J was greater than or equal to the sum over the subset E, was only true because we could assume that the value of a jewel would always be non-negative. To do something analogous when we have negative numbers, we need to square them … so it’s lucky we’ve been told their standard deviation.

Define B to be the set of numbers in the box and L the subset of large numbers:

Since the average of the numbers in B is zero and their standard deviation is 10, their average square is just equal to their variance, 100, giving us:

But from the definition of L we also have:

So we have:

Great! This is basically the proof of:

Chebyshev’s Inequality.If a real-valued random variable has standard deviation the probability that it differs from its mean by at least isAnd as you hint, we reduce this to Markov’s inequality by taking our random variable, subtracting off its mean, and squaring it. This gives a new random variable that’s nonnegative, whose mean is the variance of the original one—that is, If we then apply Markov’s inequality to this new random variable, we get the above result.

A more pithy statement is:

Chebyshev’s Inequality.The probability that a real-valued random variable differs from its mean by at least standard deviations is isThis can’t be bettered without knowing more about the random variable. A Gaussian random variable has a much smaller probability of differing a lot from its mean. But that’s misleading: ‘fat-tailed’ probability distributions give random variables that are more likely to differ a lot from their mean.

So: if we think all random variables are Gaussian, we’ll be surprised by lots of drastic events!But Chebyshev’s inequality puts a limit on how fat-tailed a fat-tailed probability distribution can actually be while still having a finite standard deviation.

We could guess this result by noticing that the probability distribution

is just slightly too fat-tailed to have its standard deviation be finite! Why? Because if you multiply it by you get a function that’s not quite integrable, since it decays like But if you integrate from to infinity, you get an answer that decays like just as Chebyshev’s inequality suggests.

But we need to tweak this guess to get a function with finite standard deviation that makes Chebyshev’s inequality an equality! And I believe we can only make it an equality for one value of at a time, not a bunch. (I’m not sure about that.)

A perspective that’s not so much about details as much as context.

John wrote:

So: if we think all random variables are Gaussian, we’ll be surprised by lots of drastic events!Nouriel Taleb, the ex-trader who popularised the “Black Swan” idea, has been writing about how this isn’t quite the right way to look at extreme events. His basic point is that in many situations there aren’t that many more drastic events compared to Gaussian-ish events, but that the distributions will generate really, really drastic events (far more drastic than the still-tractable fat-tailed distributions) that can completely change planned responsive behaviour. The crucial point is that if fat-tails just mean a greater frequency events were slightly more extreme, one could still imagine sensibly working with sums/averages/other aggregates over a

groupof future events and choosing responses to “optimise the response” to this model. However, if there’s a chance of some event, still relativley rare, butsoextreme happening that’s so drastic its occurrence changes the whole context, this kind of analysis becomes much, much less useful. (As a user of probabilistic modeling, it exposes a more difficult question: in order toestimateboth “model structure” (distributions, dependencies, etc)andprobabilities to come up with a usable model. To get a good estimate the distribution of drastic events one must see at least a few examples of drastic events, but what if a drastic event changes the context afterwards so that the model structure becomes different after one? How can one get areasonably goodmodel that’s still applicable for future prediction? Of course, this overstates things.)Taleb’s suggestion is that mathematical-probabilistic modeling should be used for “everyday events” (which may still use distributions with fatter-tailed distributions than the Gaussian), but that attempts to bring genuinely drastic events into this framework is futile, and needs a different kind of analysis. I’m not sure I completely agree with this viewpoint, but I do think that real-world use of mathematical models that appear to “tame” extreme events should be treated with caution.

The probability distributions that make Chebyshev’s inequality an equality have several ‘spikes’ in them. So it’s not surprising you can do better for

unimodalprobability distributions meaning those where has a single maximum. But I just learned about this:Vysochanskiï–Petunin Inequality.If a real-valued random variable has a unimodal probability distribution, the probability that it differs from its mean by at least standard deviations is at most at least whenChebyshev’s inequality would give a bound of not

I’m sure there must be lots of more sophisticated theorems that come after Markov’s inequality, Chebyshev’s inequality and the Vysochanskiï–Petunin inequality. Obviously we can copy the proof of Markov’s inequality and Chebyshev’s inequality and get better bounds on the probability of large deviations if we know more about the higher moments of a probability distribution. But I bet there are more interesting things to say, and I’d like to hear some of them from the experts out there! Please don’t point me to a book… I just want to chat.

Hmm, I think you can actually calculate the precise probability if you think more like a Bayesian and less like a Baezian. The probability distribution should correspond to maximal ignorance, subject to the constraints of our knowledge. For Puzzle 1, our Ignorance, as a functional of the probability of pulling a jewel of value , is

The constraints are that the probabilities must sum to one,

and that the expected value is 10,

Using Lagrange multipliers, solve for the probability distribution extremizing

The variation of this is

which vanishes for

We get the constants by satisfying the constraints, which demand and . And so, like magic, we have the precise probability distribution consistent with our knowledge:

Once we have this, we can answer any relevant question. What’s the probability of pulling a jeweled coin valued at least $100?

(This same approach relates to our discussion of quantropy. I still think it’s better to extremize complex ignorance and find the complex probability distribution than to make up something new and extremize it for the amplitude. But, that’s a different discussion.)

That’s excellent!

Gaussians are famous for being the probability distributions that maximize entropy subject to constraints on the mean and variance. The quadratic stuff inside the exponential in a Gaussian comes from the formulas for the mean and variance. You’re noting that an exponential maximizes entropy subject to constraints on the mean—at least for a probability distribution on nonnegative numbers. (If we didn’t have that fine print we’d be sunk!) The linear stuff inside your exponential comes from the formula for the mean.

Of course you’re answering a slightly different question than the one I intended. I was asking what we could say

for sureabout some question given some constraints on a probability distribution. You’re calculating what’s theexpectedanswer to a question based on a probability distribution that maximizes entropy subject to that constraint.They’re both interesting: we just have to decide which we’re interested in. My question focuses our attention on the ‘worst-case scenario’, while your focuses attention on some sort of ‘expected’ scenario.

It’s interesting how the probability you get is so much smaller than what’s allowed by Markov’s inequality. Markov’s inequality can’t be improved in the worst-case scenario! But the worst-case scenario is a fat-tailed distribution, while the exponential distribution you get is quite skinny-tailed.

The worst-case scenario, by the way, is when you have a 90% chance of picking out a completely worthless jewel, and a 10% chance of picking out a $100 jewel. Then the expected value of the jewel you pull from the box is $10, and the probability of it being worth at least $100 is 10%. Markov’s inequality says the probability can never be higher than that, given that the expected value of the jewel is $100.

I don’t understand what you mean, for at least three reasons, but if you want to explain that a bit more, please leave a comment on the latest quantropy post!

Yes, I think the Baezian thinking here is “what is the best you can say for sure, assuming the worst consistent probability distribution,” while the Bayesian thinking is “what can you say, assuming the proper consistent probability distribution?” Without other information, isn’t the correct probability distribution the one that maximises entropy? I understand that this approach allows the puzzles to be answered in a way different than intended, but I think it correctly answers them the way they were asked! To pose the intended question, you’d have to add the “for sure,” which is a meta constraint on the probability distribution. Without that information, I think the correct probability distribution is determined by the principle of maximum entropy. After all (and this is delightfully meta) this is the probability distribution you should have when you don’t know what the probability distribution should be.

Garrett wrote:

That’s a widely accepted viewpoint in some communities, controversial in others. But again: my question wasn’t intended to ask about what was true of the ‘correct’ probability distribution given some constraints, but rather, what was true of

allprobability distributions obeying those constraints.Admittedly, my question was phrased in a way that could admit either interpretation! So, it’s cool that you thought of this other interpretation. Both questions are important; which one is the ‘right’ question would depend on the details of what we were planning to do with the answer.

But if you read my whole blog post, it becomes clear that my question was really just a trick to get people to reinvent Markov’s theorem and Chebyshev’s theorem. Greg Egan and a bunch of people on Google+ succeeded in doing that, so it did its job.

Yes, I know I’m making trouble! I guess I’m trying to persuade you to consider the principle of maximum entropy seriously. I didn’t really get it until I understood at as a way to calculate the correct probability distribution consistent with one’s ignorance. My oblique motivation is that if you do take it seriously, then you can also use it as a starting point for qm instead of quantropy. (Happy to discuss via email if you like, as I don’t wish to derail discussions.)

I

dotake the principle of maximum entropy seriously. I’ve been going around advocating it for years! Lately I’ve been using it in all sorts of ways.Everyone should think about E. T. Jaynes’ famous puzzle:

“If you have an unfair die and the expected value of the number that comes up when you roll it is 4, what’s the probability that you roll a 6?”At first it seems impossible. But then Jaynes points out that in statistical mechanics we solve similar problems every day! We pick the probability distribution that maximizes entropy subject to the constraints we know, and compute expected values using that.

So, it was just a comitragic mistake on my part that when I posed puzzles meant to illustrate Markov’s theorem and Chebyshev’s theorem, I phrased them loosely enough that they could also have been illustrations of the principle of maximum entropy!

I really prefer to talk about things publicly, but let’s do it here, in the original quantropy thread. To get the ball rolling, I’ll ask a question: what’s “complex ignorance” and more importantly, how do you “maximize” a complex-valued function? I really hope that someday there will be a Fourth Annual Workshop on Complex Ignorance—this title nicely describes a lot of branches of theoretical physics. But I don’t want to wait for that to get the answer to my questions!

Works for me.

I side with Garrett on this one. To get the 10% value, we need to assume that there are a finite number of jewels in the box. In this case 10 gives the Markov inequality based on counting statistics.

But say we had no knowledge of the number of jewels in the box. According to our ignorance of that number, we have to assume a uniform distribution which can range anywhere from 0 to a very large number. The mean of a uniform is simply 1/2 of that very large number. That would certainly be larger than 10, correct?

Based on that, I would assume the value is closer to Garrett’s 0.000045 number than 0.1.

Thanks for a word problem that makes one think.

Over on Google+, here’s what people said:

John Baez wrote:

Andrea C G Mennucci wrote:

Jan Moren wrote:

Michael Stankowski wrote:

Jan Moren wrote:

Michael Stankowski wrote:

Cod Codliness wrote:

Jan Moren wrote:

Cod Codliness wrote:

David Guild wrote:

Satyr Icon wrote:

Michael Stankowski wrote:

Heikki Arponen wrote:

John Baez wrote:

John Baez wrote:

Andrea C G Mennucci wrote:

Samuel Cook wrote:

Samuel Cook wrote:

John Baez wrote:

I’m a bit out of sleep these days… but the theorem looks a bit too amazing to be new or non-vacuous. For Brownian motion I get O must be harmonic, hence constant.

Sorry, I forgot there are nice Riemannian manifolds which have lots of non-constant bounded harmonic functions.

Martin Gisser wrote:

What examples are you thinking of here? For a

compactRiemannian manifold, the only harmonic functions are locally constant functions. Thus, the space of harmonic functions is a vector space whose dimension equals the number of connected components. This is a bit of ‘Hodge theory’.I don’t know examples of (noncompact) connected Riemannian manifolds that have bounded harmonic functions that aren’t constant.

I might even be able to use our theorem to prove this, at least for

completeconnected Riemannian manifolds, where the Laplacian is self-adjoint so Brownian motion is well-defined.My intuition makes me suspect there are

notany conserved quantities for Brownian motion on such a manifold – except for constants!But that’s not the right kind of example to illustrate our result. It’s easy to make up Markov processes that have plenty of conserved quantities. You can do it even when the state space is a finite set. Our theorem is easier to prove in that case: we do that in the first part of the paper. The proof makes it clear what’s going on. The idea is simply that is partitioned into subsets, such that the probability of moving from one subset to another is zero.

When isn’t a finite set the proof gets harder, because all these subsets may have measure zero. For example, consider foliating a Riemannian manifold by submanifolds, and then considering Brownian motion restricted by the condition that the particle stays on the same leaf of the foliation. Any bounded measurable function that’s constant on each leaf will be a conserved quantity!

Hyperbolic space and its quotients.

Fascinating stuff: Complete manifolds with ends at infinity that suck up Brownian motion (“parabolic” ends). (Problem, they need not be Brownian complete, i.e. Brownian motion might “explode” in finite time, and then it’s not Markovian in your sense.) See e.g. work of Peter Li. A theorem of Li and Tam states that the number of parabolic ends is bounded by the dimension of the space of

boundedharmonic functions withfinite Dirichlet integral. (J. Diff. Geom. 35 (1992) 359-383)Googling for “Liouville property” (i.e. constancy of bounded harmonic functions) I found this fascinating paper:

* Anna Erschler: Liouville property for groups and manifolds, Invent. Math. 155 (2004) 55-80

—————————-

Given the vast and old literature on Markov processes, the Markov Noether theorem should be quite sensational. What a pretty little gem!

Since this paper is an offshoot of the network theory project, we’re mainly thinking about Markov processes coming from chemical reaction networks, where a state consists of a collection of molecules, and we have conserved quantities like ‘the number of nitrogen atoms’.

But thanks to you, I’m now trying to imagine geometrical examples. The only really nice ones that leap to mind come from foliations of Riemannian manifolds, where we require that a particle stay on a given leaf while doing its random walk.

Argh. Mestupid should have checked the converse first: A bounded harmonic function does not necessarily commute with the Brownian motion semigroup! Luckily. If it does, it must be locally constant.

Thanks, this was interesting.

1. In the proof of Theorem 1, (i)=>(ii), perhaps you’d like to state more clearly what is when is not a diagonal matrix, as this notation is introduced for diagonal matrices. Maybe the notation used (probably by mistype) 5 lines above the end of the proof: is better.

2. In the proof of Theorem 1, (iii)=>(iv), (second line) you must mean .

Wow, thanks! Very useful edits.

In my blog posts on this topic I always used the notation

for the integral of an function Then in writing the latter part of this paper I started liking the notation

for the integral of the product of an function and a function This looks more like the pairing of functions you see in quantum mechanics. It’s interestingly similar and interestingly different.

So, we went back to the first part and mechanically changed all expressions of the form to But this doesn’t make sense when is an operator on that’s not multiplication by an function! Or at least, we never defined it. So thanks—I need to go back and fix this somehow.

I have some comment to the first puzzle (sorry for quality, I am only learning commenting in WordPress). It was partially inspired by environmental examples, there some random values often may be quite well described by log-normal distributions. Possibly, it may be explained by the consideration of composition of many random factors similarly with derivation of the normal distribution using sums. The thing is less clear for me, is that standard deviation for such distributions is very often almost equal to average. If cost of things would be described by the same way, we could calculate more concrete values.

Let’s have some random variable x with normal distribution with mean m and variance d, then variable X = exp(x) has log-normal distribution with mean M = exp(m + d/2) and for some k probability to obtain X > k M is

If to suggest that standard deviation of X is equal to mean, then d = log 2 and for k=10 we get probability near 0.012%. It is also possible to ask about distribution with maximal probability for given k. Because maximum of that expression is for d=2 log k and it is equal to , for k=10 the probability is about 1.59%

I’m a bit pressed for time, so this might be coming out from left field (or worse, completely wrong), but when I looked at this a bit myself here:

Network Theory and Discrete Calculus – Noether’s Theorem

I vaguely remember that if the expectation of and is zero, then it means itself must be constant on all connected components.

In other words, if , it means itself is constant.

This seems significantly less interesting the the quantum version. The stuff I looked at was discrete in time as well, so maybe things are more interesting in continuous time (?).

Eric wrote:

If the expectation of is zero, then has to be

zerowherever the probability distribution is supported.Maybe you meant something like this:

Theorem.Given a Markov process on a finite set of states , and has the property that the expectation values of and don’t change with time regardless of the initial conditions, then is constant on each connected component of the ‘transition graph’ of that Markov process.This is part of our Theorem 1. We prove it for continuous-time Markov processes, but the result is also true for discrete-time ones (also known as ‘Markov chains’).

Morally speaking the situation is the same when the state space is an infinite measure space, but technically speaking it’s a lot more subtle.

Btw, I’ll mention to readers who care about priority that your calculations are based to some extent on Brendan’s original calculations which went into the proof of Theorem 1. I’ve already had trouble publishing one paper because I talked about it on this blog a bunch and a referee somehow concluded that meant the paper contained nothing new!

Sorry. Of course I meant expectations are constant. Was in a rush.

My calculations are certainly based, in part, on Brendan’s original. I simply took his ideas and reformulated them using discrete calculus.

The bit where you say “time evolution is also random” bugs me a bit, because under the standard interpretations of quantum theory time evolution is perfectly deterministic, except during measurements.

Yeah—my point was that in both quantum mechanics and stochastic mechanics, the state of the system at time is a deterministic function of the state at time zero, but this state describes the

probability distributionof results you get when you measure any observable. So if you say time evolution is random in a so-called ‘random walk’, maybe you should say the same for time evolution in quantum mechanics too. But if you don’t, you shouldn’t. Anyway, my first sentence here is what I meant to say, so let me fix my blog post so it says that!Thanks.

While I really do like this exploration of the analogy between stochastic and quantum, I do think that the randomness in the stochastic case is a much different beast. Measurements are very different in the two cases. Obviously in both cases measuring changes future evolution. The difference crops up in that even unconditioned measurements change final distributions in quantum mechanics. In stochastic mechanics we can imagine that we “peek” at every time step, without changing anything. In that sense, considering every step random gives no observable differences. This is just not possible in quantum mechanics.

(You’re well aware of all this of course, but it’s something that regularly gets swept under the rug, and I think is worth pointing out when making analogies of this sort.)