A Noether Theorem for Markov Processes

I’ll start you off with two puzzles. Their relevance should become clear by the end of this post:

Puzzle 1. Suppose I have a box of jewels. The average value of a jewel in the box is $10. I randomly pull one out of the box. What’s the probability that its value is at least $100?

Puzzle 2. Suppose I have a box full of numbers—they can be arbitrary real numbers. Their average is zero, and their standard deviation is 10. I randomly pull one out. What’s the probability that it’s at least 100?

Before you complain, I’ll admit: in both cases, you can’t actually tell me the probability. But you can say something about the probability! What’s the most you can say?

Noether theorems

Some good news: Brendan Fong, who worked here with me, has now gotten a scholarship to do his PhD at the University of Oxford! He’s talking to with people like Bob Coecke and Jamie Vicary, who work on diagrammatic and category-theoretic approaches to quantum theory.

But we’ve also finished a paper on good old-fashioned probability theory:

• John Baez and Brendan Fong, A Noether theorem for Markov processes.

This is based on a result Brendan proved in the network theory series on this blog. But we go further in a number of ways.

What’s the basic idea?

For months now I’ve been pushing the idea that we can take ideas from quantum mechanics and push them over to ‘stochastic mechanics’, which differs in that we work with probabilities rather than amplitudes. Here we do this for Noether’s theorem.

I should warn you: here I’m using ‘Noether’s theorem’ in an extremely general way to mean any result relating symmetries and conserved quantities. There are many versions. We prove a version that applies to Markov processes, which are random processes of the nicest sort: those where the rules don’t change with time, and the state of the system in the future only depends on its state now, not the past.

In quantum mechanics, there’s a very simple relation between symmetries and conserved quantities: an observable commutes with the Hamiltonian if and only if its expected value remains constant in time for every state. For Markov processes this is no longer true. But we show the next best thing: an observable commutes with the Hamiltonian if and only if both its expected value and standard deviation are constant in time for every state!

Now, we explained this stuff very simply and clearly back in Part 11 and Part 13 of the network theory series. We also tried to explain it clearly in the paper. So now let me explain it in a complicated, confusing way, for people who prefer that.

(Judging from the papers I read, that’s a lot of people!)

I’ll start by stating the quantum theorem we’re trying to mimic, and then state the version for Markov processes.

Noether’s theorem: quantum versions

For starters, suppose both our Hamiltonian H and the observable O are bounded self-adjoint operators. Then we have this:

Noether’s Theorem, Baby Quantum Version. Let H and O be bounded self-adjoint operators on some Hilbert space. Then

[H,O] = 0

if and only if for all states \psi(t) obeying Schrödinger’s equation

\displaystyle{ \frac{d}{d t} \psi(t) = -i H \psi(t) }

the expected value \langle \psi(t), O \psi(t) \rangle is constant as a function of t.

What if O is an unbounded self-adjoint operator? That’s no big deal: we can get a bounded one by taking f(O) where f is any bounded measurable function. But Hamiltonians are rarely bounded for fully realistic quantum systems, and we can’t mess with the Hamiltonian without changing Schrödinger’s equation! So, we definitely want a version of Noether’s theorem that lets H be unbounded.

It’s a bit tough to make the equation [H,O] = 0 precise in a useful way when H is unbounded, because then H is only densely defined. If O doesn’t map the domain of H to itself, it’s hard to know what [H,O] = HO - OH even means! We could demand that H does preserve the domain of O, but a better workaround is instead to say that

[\mathrm{exp}(-itH), O] = 0

for all t. Then we get this:

Noether’s Theorem, Full-fledged Quantum Version. Let H and O be self-adjoint operators on some Hilbert space, with O being bounded. Then

[\mathrm{exp}(-itH),O] = 0

if and only if for all states

\psi(t) = \mathrm{exp}(-itH) \psi

the expected value \langle \psi(t), O \psi(t) \rangle is constant as a function of t.

Here of course we’re using the fact that \mathrm{exp}(-itH) \psi is what we get when we solve Schrödinger’s equation with initial data \psi.

But in fact, this version of Noether’s theorem follows instantly from a simpler one:

Noether’s Theorem, Simpler Quantum Version. Let U be a unitary operator and let O be a bounded self-adjoint operator on some Hilbert space. Then

[U,O] = 0

if and only if for all states \psi,

\langle U \psi, O U \psi \rangle = \langle \psi, O \psi \rangle.

This version applies to a single unitary operator U instead of the 1-parameter unitary group

U(t) = \exp(-i t H)

It’s incredibly easy to prove. And this is is the easiest version to copy over to the Markov case! However, the proof over there is not quite so easy.

Noether’s theorem: stochastic versions

In stochastic mechanics we describe states using probability distributions, not vectors in a Hilbert space. We also need a new concept of ‘observable’, and unitary operators will be replaced by ‘stochastic operators’.

Suppose that X is a \sigma-finite measure space with a measure we write simply as dx. Then probability distributions \psi on X lie in L^1(X). Let’s define an observable O to be any element of the dual space L^\infty(X), allowing us to define the expected valued of O in the probability distribution \psi to be

\langle O, \psi \rangle = \int_X O(x) \psi(x) \, dx

The angle brackets are supposed to remind you of quantum mechanics, but we don’t have an inner product on a Hilbert space anymore! Instead, we have a pairing between L^1(X) and L^\infty(X). Probability distributions live in L^1(X), while observables live in L^\infty(X). But we can also think of an observable O as a bounded operator on L^1(X), namely the operator of multiplying by the function O.

Let’s say an operator

U : L^1(X) \to L^1(X)

is stochastic if it’s bounded and it maps probability distributions to probability distributions. Equivalently, U is stochastic if it’s linear and it obeys

\psi \ge 0 \implies U \psi \ge 0


\int_X (U\psi)(x) \, dx = \int_X \psi(x) \, dx

for all \psi \in L^1(X).

A Markov process, or technically a Markov semigroup, is a collection of operators

U(t) : L^1(X) \to L^1(X)

for t \ge 0 such that:

U(t) is stochastic for all t \ge 0.

U(t) depends continuously on t.

U(s+t) = U(s)U(t) for all s,t \ge 0.

U(0) = I.

By the Hille–Yosida theorem, any Markov semigroup may be written as

U(t) = \exp(tH)

for some operator H, called its Hamiltonian. However, H is typically unbounded and only densely defined. This makes it difficult to work with the commutator [H,O]. So, we should borrow a trick from quantum mechanics and work with the commutator [\exp(tH),O] instead. This amounts to working directly with the Markov semigroup instead of its Hamiltonian. And then we have:

Noether’s Theorem, Full-fledged Stochastic Version. Suppose X is a \sigma-finite measure space and

U(t) : L^1(X) \to L^1(X)

is a Markov semigroup. Suppose O is an observable. Then

[U(t),O] = 0

for all t \ge 0 if and only if for all probability distributions \psi on X, \langle O, U(t) \psi \rangle and \langle O^2, U(t) \psi \rangle are constant as a function of t.

In plain English: time evolution commutes with an observable if the mean and standard deviation of that observable never change with time. As in the quantum case, this result follows instantly from a simpler one, which applies to a single stochastic operator:

Noether’s Theorem, Simpler Stochastic Version. Suppose X is a \sigma-finite measure space and

U : L^1(X) \to L^1(X)

is stochastic operator. Suppose O is an observable. Then

[U,O] = 0

if and only if for all probability distributions \psi on X,

\langle O, U \psi \rangle = \langle O, \psi \rangle


\langle O^2, U \psi \rangle = \langle O^2, \psi \rangle

It looks simple, but the proof is a bit tricky! It’s easy to see that [U,O] = 0 implies those other equations; the work lies in showing the converse. The reason is that [U,O] = 0 implies

\langle O^n, U \psi \rangle = \langle O^n, \psi \rangle

for all n, not just 1 and 2. The expected values of the powers of O are more or less what people call its moments. So, we’re saying all the moments of O are unchanged when we apply U to an arbitrary probability distribution, given that we know this fact for the first two.

The proof is fairly technical but also sort of cute: we use Chebyshev’s inequality, which says that the probability of a random variable taking a value at least k standard deviations away from its mean is less than or equal to 1/k^2. I’ve always found this to be an amazing fact, but now it seems utterly obvious. You can figure out the proof yourself if you do the puzzles at the start of this post.

But now I’ll let you read our paper! And I’m really hoping you’ll spot mistakes, or places it can be improved.

32 Responses to A Noether Theorem for Markov Processes

  1. Greg Egan says:

    Puzzle 1. Suppose I have a box of jewels. The average value of a jewel in the box is $10. I randomly pull one out of the box. What’s the probability that its value is at least $100?

    Let J be the set of all jewels, and E the set of expensive jewels: those valued at $100 or more. We want to know something about the probability |E|/|J|.

    The average value of all jewels is $10, so if we write V(x) for the value of jewel x:

    10 = \frac{1}{|J|} \sum_{\ x \in J} V(x) \ge \frac{1}{|J|} \sum_{\ x \in E} V(x)

    We also have:

    E = \{x | V(x) \ge 100\}

    from which it follows that:

    \frac{1}{|J|} \sum_{\ x \in E} V(x) \ge \frac{1}{|J|} \sum_{\ x \in E} 100 = 100 \frac{|E|}{|J|}

    Combining these inequalities, we obtain:

    \frac{|E|}{|J|} \le \frac{1}{10}

    So the probability that a randomly chosen jewel’s value is at least $100 is at most one tenth.

    • John Baez says:

      Great—excellent! This is basically the proof of

      Markov’s Inequality. If a nonnegative random variable has mean A, the probability that its value is at least B is \le A/B.

      Your proof takes A = 10 and B = 100, but apart from that it’s quite general.

      However, we can be even more general. We don’t need the jewels to be equally probable, as long as we interpret ‘average value’ as the expected value with respect to the probability distribution which we use to randomly choose a jewel.

      In this more general case, Markov’s inequality follows from this obvious fact:

      Fact. Let f : X \to [0,\infty) be a measurable function on a measure space X. If the measure of the set

      \{ x \in X: f(x) \ge B \}

      is \epsilon, then the integral of f is at least \epsilon B.

      Just take X to be a probability measure space; then this says the mean of the random variable f is at least B times the probability that f \ge B. And that’s another way of stating Markov’s inequality.

  2. Greg Egan says:

    Puzzle 2. Suppose I have a box full of numbers—they can be arbitrary real numbers. Their average is zero, and their standard deviation is 10. I randomly pull one out. What’s the probability that it’s at least 100?

    The first inequality used for Puzzle 1, where the sum over the full set J was greater than or equal to the sum over the subset E, was only true because we could assume that the value of a jewel would always be non-negative. To do something analogous when we have negative numbers, we need to square them … so it’s lucky we’ve been told their standard deviation.

    Define B to be the set of numbers in the box and L the subset of large numbers:

    L = \{x \in B | x \ge 100\}

    Since the average of the numbers in B is zero and their standard deviation is 10, their average square is just equal to their variance, 100, giving us:

    100 = \frac{1}{|B|} \sum_{\ x \in B} x^2 \ge \frac{1}{|B|} \sum_{\ x \in L} x^2

    But from the definition of L we also have:

    \frac{1}{|B|} \sum_{\ x \in L} x^2 \ge \frac{1}{|B|} \sum_{\ x \in L} 100^2 = 100^2 \frac{|L|}{|B|}

    So we have:

    \frac{|L|}{|B|} \le \frac{1}{100}

    • John Baez says:

      Great! This is basically the proof of:

      Chebyshev’s Inequality. If a real-valued random variable has standard deviation A, the probability that it differs from its mean by at least B is \le A^2/B^2.

      And as you hint, we reduce this to Markov’s inequality by taking our random variable, subtracting off its mean, and squaring it. This gives a new random variable that’s nonnegative, whose mean is the variance of the original one—that is, A^2. If we then apply Markov’s inequality to this new random variable, we get the above result.

      A more pithy statement is:

      Chebyshev’s Inequality. The probability that a real-valued random variable differs from its mean by at least k standard deviations is is 1/k^2.

      This can’t be bettered without knowing more about the random variable. A Gaussian random variable has a much smaller probability of differing a lot from its mean. But that’s misleading: ‘fat-tailed’ probability distributions give random variables that are more likely to differ a lot from their mean.

      So: if we think all random variables are Gaussian, we’ll be surprised by lots of drastic events!

      But Chebyshev’s inequality puts a limit on how fat-tailed a fat-tailed probability distribution can actually be while still having a finite standard deviation.

      We could guess this result by noticing that the probability distribution

      p(x) = \left\{ \begin{array}{cl} c/x^3 &x \ge 1 \\ 0 &x > 1 \end{array} \right.

      is just slightly too fat-tailed to have its standard deviation be finite! Why? Because if you multiply it by x^2, you get a function that’s not quite integrable, since it decays like 1/x. But if you integrate p from B to infinity, you get an answer that decays like 1/B^2, just as Chebyshev’s inequality suggests.

      But we need to tweak this guess to get a function with finite standard deviation that makes Chebyshev’s inequality an equality! And I believe we can only make it an equality for one value of B at a time, not a bunch. (I’m not sure about that.)

      • dave tweed says:

        A perspective that’s not so much about details as much as context.

        John wrote:

        So: if we think all random variables are Gaussian, we’ll be surprised by lots of drastic events!

        Nouriel Taleb, the ex-trader who popularised the “Black Swan” idea, has been writing about how this isn’t quite the right way to look at extreme events. His basic point is that in many situations there aren’t that many more drastic events compared to Gaussian-ish events, but that the distributions will generate really, really drastic events (far more drastic than the still-tractable fat-tailed distributions) that can completely change planned responsive behaviour. The crucial point is that if fat-tails just mean a greater frequency events were slightly more extreme, one could still imagine sensibly working with sums/averages/other aggregates over a group of future events and choosing responses to “optimise the response” to this model. However, if there’s a chance of some event, still relativley rare, but so extreme happening that’s so drastic its occurrence changes the whole context, this kind of analysis becomes much, much less useful. (As a user of probabilistic modeling, it exposes a more difficult question: in order to estimate both “model structure” (distributions, dependencies, etc) and probabilities to come up with a usable model. To get a good estimate the distribution of drastic events one must see at least a few examples of drastic events, but what if a drastic event changes the context afterwards so that the model structure becomes different after one? How can one get a reasonably good model that’s still applicable for future prediction? Of course, this overstates things.)

        Taleb’s suggestion is that mathematical-probabilistic modeling should be used for “everyday events” (which may still use distributions with fatter-tailed distributions than the Gaussian), but that attempts to bring genuinely drastic events into this framework is futile, and needs a different kind of analysis. I’m not sure I completely agree with this viewpoint, but I do think that real-world use of mathematical models that appear to “tame” extreme events should be treated with caution.

    • John Baez says:

      The probability distributions that make Chebyshev’s inequality an equality have several ‘spikes’ in them. So it’s not surprising you can do better for unimodal probability distributions p, meaning those where p has a single maximum. But I just learned about this:

      Vysochanskiï–Petunin Inequality. If a real-valued random variable has a unimodal probability distribution, the probability that it differs from its mean by at least k standard deviations is at most 4/(9 k^2), at least when

      k > \sqrt{8/3} = 1.63299...

      Chebyshev’s inequality would give a bound of 1/k^2, not 4/(9k^2).

    • John Baez says:

      I’m sure there must be lots of more sophisticated theorems that come after Markov’s inequality, Chebyshev’s inequality and the Vysochanskiï–Petunin inequality. Obviously we can copy the proof of Markov’s inequality and Chebyshev’s inequality and get better bounds on the probability of large deviations if we know more about the higher moments of a probability distribution. But I bet there are more interesting things to say, and I’d like to hear some of them from the experts out there! Please don’t point me to a book… I just want to chat.

  3. Garrett says:

    Hmm, I think you can actually calculate the precise probability if you think more like a Bayesian and less like a Baezian. The probability distribution should correspond to maximal ignorance, subject to the constraints of our knowledge. For Puzzle 1, our Ignorance, as a functional of the probability of pulling a jewel of value v, is

    I = \int_0^\infty dv \, p(v) \, \ln \, p(v)

    The constraints are that the probabilities must sum to one,

    1 = \int_0^\infty dv \, p(v)

    and that the expected value is 10,

    10 = \int_0^\infty dv \, v \, p(v)

    Using Lagrange multipliers, solve for the probability distribution extremizing

    \int_0^\infty dv \{ p \, \ln \, p + \lambda p + \mu v p \}

    The variation of this is

    \int_0^\infty dv \, \delta p \{ \ln \, p + 1 + \lambda + \mu v \}

    which vanishes for

    p(v) = p_0 e^{-\mu v}

    We get the constants by satisfying the constraints, which demand p_0 = \mu and \mu = 1/10. And so, like magic, we have the precise probability distribution consistent with our knowledge:

    p(v) = \frac{1}{10} e^{- \frac{v}{10}}

    Once we have this, we can answer any relevant question. What’s the probability of pulling a jeweled coin valued at least $100?

    P(v \geq 100) = \int_{100}^\infty dv p(v) = \int_{100}^\infty dv \frac{1}{10} e^{- \frac{v}{10}} = e^{-10} \simeq 0.000045

    (This same approach relates to our discussion of quantropy. I still think it’s better to extremize complex ignorance and find the complex probability distribution than to make up something new and extremize it for the amplitude. But, that’s a different discussion.)

    • John Baez says:

      That’s excellent!

      Gaussians are famous for being the probability distributions that maximize entropy subject to constraints on the mean and variance. The quadratic stuff inside the exponential in a Gaussian comes from the formulas for the mean and variance. You’re noting that an exponential maximizes entropy subject to constraints on the mean—at least for a probability distribution on nonnegative numbers. (If we didn’t have that fine print we’d be sunk!) The linear stuff inside your exponential comes from the formula for the mean.

      Of course you’re answering a slightly different question than the one I intended. I was asking what we could say for sure about some question given some constraints on a probability distribution. You’re calculating what’s the expected answer to a question based on a probability distribution that maximizes entropy subject to that constraint.

      They’re both interesting: we just have to decide which we’re interested in. My question focuses our attention on the ‘worst-case scenario’, while your focuses attention on some sort of ‘expected’ scenario.

      It’s interesting how the probability you get is so much smaller than what’s allowed by Markov’s inequality. Markov’s inequality can’t be improved in the worst-case scenario! But the worst-case scenario is a fat-tailed distribution, while the exponential distribution you get is quite skinny-tailed.

      The worst-case scenario, by the way, is when you have a 90% chance of picking out a completely worthless jewel, and a 10% chance of picking out a $100 jewel. Then the expected value of the jewel you pull from the box is $10, and the probability of it being worth at least $100 is 10%. Markov’s inequality says the probability can never be higher than that, given that the expected value of the jewel is $100.

      I still think it’s better to extremize complex ignorance and find the complex probability distribution than to make up something new and extremize it for the amplitude. But, that’s a different discussion.

      I don’t understand what you mean, for at least three reasons, but if you want to explain that a bit more, please leave a comment on the latest quantropy post!

      • Garrett says:

        Yes, I think the Baezian thinking here is “what is the best you can say for sure, assuming the worst consistent probability distribution,” while the Bayesian thinking is “what can you say, assuming the proper consistent probability distribution?” Without other information, isn’t the correct probability distribution the one that maximises entropy? I understand that this approach allows the puzzles to be answered in a way different than intended, but I think it correctly answers them the way they were asked! To pose the intended question, you’d have to add the “for sure,” which is a meta constraint on the probability distribution. Without that information, I think the correct probability distribution is determined by the principle of maximum entropy. After all (and this is delightfully meta) this is the probability distribution you should have when you don’t know what the probability distribution should be.

      • John Baez says:

        Garrett wrote:

        Without other information, isn’t the correct probability distribution the one that maximises entropy?

        That’s a widely accepted viewpoint in some communities, controversial in others. But again: my question wasn’t intended to ask about what was true of the ‘correct’ probability distribution given some constraints, but rather, what was true of all probability distributions obeying those constraints.

        Admittedly, my question was phrased in a way that could admit either interpretation! So, it’s cool that you thought of this other interpretation. Both questions are important; which one is the ‘right’ question would depend on the details of what we were planning to do with the answer.

        But if you read my whole blog post, it becomes clear that my question was really just a trick to get people to reinvent Markov’s theorem and Chebyshev’s theorem. Greg Egan and a bunch of people on Google+ succeeded in doing that, so it did its job.

        • Garrett says:

          Yes, I know I’m making trouble! I guess I’m trying to persuade you to consider the principle of maximum entropy seriously. I didn’t really get it until I understood at as a way to calculate the correct probability distribution consistent with one’s ignorance. My oblique motivation is that if you do take it seriously, then you can also use it as a starting point for qm instead of quantropy. (Happy to discuss via email if you like, as I don’t wish to derail discussions.)

        • John Baez says:

          I do take the principle of maximum entropy seriously. I’ve been going around advocating it for years! Lately I’ve been using it in all sorts of ways.

          Everyone should think about E. T. Jaynes’ famous puzzle: “If you have an unfair die and the expected value of the number that comes up when you roll it is 4, what’s the probability that you roll a 6?”

          At first it seems impossible. But then Jaynes points out that in statistical mechanics we solve similar problems every day! We pick the probability distribution that maximizes entropy subject to the constraints we know, and compute expected values using that.

          So, it was just a comitragic mistake on my part that when I posed puzzles meant to illustrate Markov’s theorem and Chebyshev’s theorem, I phrased them loosely enough that they could also have been illustrations of the principle of maximum entropy!

          My oblique motivation is that if you do take it seriously, then you can also use it as a starting point for qm instead of quantropy. (Happy to discuss via email if you like, as I don’t wish to derail discussions.)

          I really prefer to talk about things publicly, but let’s do it here, in the original quantropy thread. To get the ball rolling, I’ll ask a question: what’s “complex ignorance” and more importantly, how do you “maximize” a complex-valued function? I really hope that someday there will be a Fourth Annual Workshop on Complex Ignorance—this title nicely describes a lot of branches of theoretical physics. But I don’t want to wait for that to get the answer to my questions!

        • Garrett says:

          Works for me.

        • I side with Garrett on this one. To get the 10% value, we need to assume that there are a finite number of jewels in the box. In this case 10 gives the Markov inequality based on counting statistics.

          But say we had no knowledge of the number of jewels in the box. According to our ignorance of that number, we have to assume a uniform distribution which can range anywhere from 0 to a very large number. The mean of a uniform is simply 1/2 of that very large number. That would certainly be larger than 10, correct?

          Based on that, I would assume the value is closer to Garrett’s 0.000045 number than 0.1.

          Thanks for a word problem that makes one think.

  4. John Baez says:

    Over on Google+, here’s what people said:

    John Baez wrote:

    Brendan Fong and I have a new paper out! It generalizes Noether’s theorem from quantum theory to probability theory. Sounds fancy, eh? But it’s related to this simple puzzle:

    Suppose I have a box of jewels. The average value of a jewel in the box is $10. I randomly pull one out. What’s the probability that its value is at least $100?

    Okay, I admit it – you can’t actually tell me the probability. But you can say something about it! What’s the most you can say?

    Andrea C G Mennucci wrote:

    Probability is between zero and 10%

    EDIT: since some people are interested, I will be more precise.

    Probability p has to be between zero and 10%. If the jewels in the box are equiprobable in the drawing, then p has to be rational.

    Vice versa. Choose p between zero and 10%. If p is rational, say p=n/m, then a box containing m equiprobable jewels, n of them of value 100$ and (m-n) jewels of value (10m-100n)/(m-n)$ will suffice.

    If p is not rational, then a solution may be found with two jewels, (non equiprobable in the drawing), a jewel of value 100$ and of probability p, and another jewel of probability (1-p) and value (10-100p)/(1-p).

    Jan Moren wrote:

    Well, you don’t know anything about the distribution. The only thing you can say is that it’s not 100%. Could have all jewels be worth exactly $10 for instance.

    Michael Stankowski wrote:

    Jan Moren: you can tell that the probability is between zero and 10%. To get 10%, you must have 9 out of 10 gems have a value of zero, and 1 out of 10 have a value of 100. You can’t have a value lower than 0 to counterbalance a higher occurrence of values > 100 to get an average of ten dollars, therefore that is the maximum percentage. Therefore, the probability of a value of at least $100 must be 1 in 10 or less.

    Jan Moren wrote:

    Irrespective of the distribution?

    Edit: That would require the value to always be non-negative. As in, you don’t need to pay for taking a jewel out of the box, they’re not coated with poison or any other fanciful ideas.

    Edit2: Which, I admit, is a stretch. And I didn’t think of your argument; thanks for clarifying it.

    Michael Stankowski wrote:

    I was assuming non negative values, yes.

    Cod Codliness wrote:

    Michael Stankowski wrote:

    You can tell that the probability is between zero and 10%. To get 10%, you must have 9 out of 10 gems have a value of zero, and 1 out of 10 have a value of 100.

    That surely presupposes there are 10 jewels in the box? What if there are 30,000? I’m not a maths expert so don’t bite me if that’s a stupid pair of questions!

    Jan Moren wrote:

    Cod Codliness that’s the neat part, that I didn’t get at first: That will get you a lower probability, but there’s no (non-negative) set of jewels that will give you more than 10%.

    Cod Codliness wrote:

    Yeah, just realised that – if the average is $10 then the more jewels there are the lower the chances of picking a jewel worth $100 at random. D’oh. More coffee is required!

    David Guild wrote:

    Actually the number of jewels is totally irrelevant. As long as gems have non-negative value, then (probability of pulling a $100 or better gem > 10%) implies that (average value > $10). Since the average is exactly $10, then the probability can’t be more than 10%.

    Satyr Icon wrote:

    Please note the wording though. At least $100. So your answers should perhaps be at most 10%!?

    Michael Stankowski wrote:

    Satyr Icon: that’s how it’s worded already…
    I said that the maximum percentage was 10%, and that the minimum was 0%, assuming non-negative values.

    Heikki Arponen wrote:

    Interesting paper. A word about the jargon though: I think Noether’s theorem refers (usually) to the classical action formulation and states that to each symmetry there is a (classically) conserved quantity. Its generalization to (quantum or statistical) field theories is usually called a Ward identity. Time evolution can of course be described as a time translation, so this particular case would be a Ward identity associated to time translation. But then again, maybe a group of scientists has “hijacked” the terminology, as I suppose tends to happen in all fields…

    …OR I just didn’t read carefully enough, again…

    John Baez wrote:

    I was assuming the jewels have non-negative values.

    So, Andrea C G Mennucci, Michael Stankowski and David Guild gave the answer I wanted, and have ably argued for its correctness! This puzzle is a special case of Markov’s inequality, which says that if the mean of a nonnegative random variable is A, the probability of it being greater than or equal to B is at most A/B:

    Anyone who can solve this puzzle should be able to prove Markov’s inequality, since the argument is just the same! The puzzle is just the case A = 10, B = 100.

    Congratulations, Andrea C G Mennucci, Michael Stankowski and David Guild!

    John Baez wrote:

    So here’s a slightly harder puzzle along the same lines. I have a box of numbers: real numbers, which can be positive or negative. The average of these numbers is zero, and their standard deviation is 10. I randomly pick a number from the box. What can you say about the probability that it’s at least 100?

    Andrea C G Mennucci wrote:

    I call myself out :-) I teach probability.

    Samuel Cook wrote:

    At most 1% on the standard deviation 10 problem.

    Samuel Cook wrote:

    Also, on the jewel problem, was it stated that all jewels are equally likely to be selected? Just to be devil’s advocate, if a jewel that is worth over $100 much more likely to be “randomly” selected then the problem depends on the distribution of the jewel (sizes?). (It may have been stated they have equal probability when selected, but I am being lazy just so that I can say something about it).

    John Baez wrote:

    I was not assuming the jewels were equally likely to be selected, just that each one is selected with some probability. By the “average value” of a jewel, I meant the expected value of a jewel randomly chosen according to that probability distribution.

    My wording was a bit sloppy because I was trying to be terse.

    If I’d wanted to be really precise, I would have said this: given a probability measure on the set of nonnegative real numbers, and assuming its mean is 10, what can we say about the measure of the set of numbers greater than or equal to 100?

  5. I’m a bit out of sleep these days… but the theorem looks a bit too amazing to be new or non-vacuous. For Brownian motion I get O must be harmonic, hence constant.

    • Sorry, I forgot there are nice Riemannian manifolds which have lots of non-constant bounded harmonic functions.

      • John Baez says:

        Martin Gisser wrote:

        Sorry, I forgot there are nice Riemannian manifolds which have lots of non-constant bounded harmonic functions.

        What examples are you thinking of here? For a compact Riemannian manifold, the only harmonic functions are locally constant functions. Thus, the space of harmonic functions is a vector space whose dimension equals the number of connected components. This is a bit of ‘Hodge theory’.

        I don’t know examples of (noncompact) connected Riemannian manifolds that have bounded harmonic functions that aren’t constant.

        I might even be able to use our theorem to prove this, at least for complete connected Riemannian manifolds, where the Laplacian is self-adjoint so Brownian motion is well-defined.
        My intuition makes me suspect there are not any conserved quantities for Brownian motion on such a manifold – except for constants!

        But that’s not the right kind of example to illustrate our result. It’s easy to make up Markov processes that have plenty of conserved quantities. You can do it even when the state space X is a finite set. Our theorem is easier to prove in that case: we do that in the first part of the paper. The proof makes it clear what’s going on. The idea is simply that X is partitioned into subsets, such that the probability of moving from one subset to another is zero.

        When X isn’t a finite set the proof gets harder, because all these subsets may have measure zero. For example, consider foliating a Riemannian manifold by submanifolds, and then considering Brownian motion restricted by the condition that the particle stays on the same leaf of the foliation. Any bounded measurable function that’s constant on each leaf will be a conserved quantity!

        • What examples are you thinking of here?

          Hyperbolic space and its quotients.

          Fascinating stuff: Complete manifolds with ends at infinity that suck up Brownian motion (“parabolic” ends). (Problem, they need not be Brownian complete, i.e. Brownian motion might “explode” in finite time, and then it’s not Markovian in your sense.) See e.g. work of Peter Li. A theorem of Li and Tam states that the number of parabolic ends is bounded by the dimension of the space of bounded harmonic functions with finite Dirichlet integral. (J. Diff. Geom. 35 (1992) 359-383)

          Googling for “Liouville property” (i.e. constancy of bounded harmonic functions) I found this fascinating paper:

          * Anna Erschler: Liouville property for groups and manifolds, Invent. Math. 155 (2004) 55-80

          We introduce a method to estimate the entropy of random walks on groups. We apply this method to exhibit the existence of compact manifolds with amenable fundamental groups such that the universal cover is not Liouville.


          Given the vast and old literature on Markov processes, the Markov Noether theorem should be quite sensational. What a pretty little gem!

        • John Baez says:

          Since this paper is an offshoot of the network theory project, we’re mainly thinking about Markov processes coming from chemical reaction networks, where a state consists of a collection of molecules, and we have conserved quantities like ‘the number of nitrogen atoms’.

          But thanks to you, I’m now trying to imagine geometrical examples. The only really nice ones that leap to mind come from foliations of Riemannian manifolds, where we require that a particle stay on a given leaf while doing its random walk.

      • Argh. Mestupid should have checked the converse first: A bounded harmonic function does not necessarily commute with the Brownian motion semigroup! Luckily. If it does, it must be locally constant.

  6. Orr says:

    Thanks, this was interesting.

    1. In the proof of Theorem 1, (i)=>(ii), perhaps you’d like to state more clearly what is \langle A, \psi \rangle when A is not a diagonal matrix, as this notation is introduced for diagonal matrices. Maybe the notation used (probably by mistype) 5 lines above the end of the proof: \langle O^2 H e_j \rangle is better.

    2. In the proof of Theorem 1, (iii)=>(iv), (second line) you must mean H_{ij} \neq 0.

    • John Baez says:

      Wow, thanks! Very useful edits.

      In my blog posts on this topic I always used the notation

      \langle \psi \rangle

      for the integral of an L^1 function \psi. Then in writing the latter part of this paper I started liking the notation

      \langle O, \psi \rangle

      for the integral of the product of an L^\infty function O and a L^1 function \psi. This looks more like the pairing of L^2 functions you see in quantum mechanics. It’s interestingly similar and interestingly different.

      So, we went back to the first part and mechanically changed all expressions of the form \langle A \psi \rangle to \langle A, \psi \rangle. But this doesn’t make sense when A is an operator on L^1 that’s not multiplication by an L^\infty function! Or at least, we never defined it. So thanks—I need to go back and fix this somehow.

  7. I have some comment to the first puzzle (sorry for quality, I am only learning commenting in WordPress). It was partially inspired by environmental examples, there some random values often may be quite well described by log-normal distributions. Possibly, it may be explained by the consideration of composition of many random factors similarly with derivation of the normal distribution using sums. The thing is less clear for me, is that standard deviation for such distributions is very often almost equal to average. If cost of things would be described by the same way, we could calculate more concrete values.

    Let’s have some random variable x with normal distribution with mean m and variance d, then variable X = exp(x) has log-normal distribution with mean M = exp(m + d/2) and for some k probability to obtain X > k M is

    1-CDF(k M) = (1-{\rm erf}((\log k + d/2)/\sqrt{2d}))/2.

    If to suggest that standard deviation of X is equal to mean, then d = log 2 and for k=10 we get probability near 0.012%. It is also possible to ask about distribution with maximal probability for given k. Because maximum of that expression is for d=2 log k and it is equal to (1-erf \sqrt{\log k})/2, for k=10 the probability is about 1.59%

  8. Eric says:

    I’m a bit pressed for time, so this might be coming out from left field (or worse, completely wrong), but when I looked at this a bit myself here:

    Network Theory and Discrete Calculus – Noether’s Theorem

    I vaguely remember that if the expectation of O and O^2 is zero, then it means O itself must be constant on all connected components.

    In other words, if [O,H] = 0, it means O itself is constant.

    This seems significantly less interesting the the quantum version. The stuff I looked at was discrete in time as well, so maybe things are more interesting in continuous time (?).

    • John Baez says:

      Eric wrote:

      I vaguely that if the expectation of O and O^2 is zero, then it means O itself must be constant on all connected components.

      If the expectation of O^2 is zero, then O has to be zero wherever the probability distribution is supported.

      Maybe you meant something like this:

      Theorem. Given a Markov process on a finite set of states X, and O: X \to \mathbb{R} has the property that the expectation values of O and O^2 don’t change with time regardless of the initial conditions, then O is constant on each connected component of the ‘transition graph’ of that Markov process.

      This is part of our Theorem 1. We prove it for continuous-time Markov processes, but the result is also true for discrete-time ones (also known as ‘Markov chains’).

      Morally speaking the situation is the same when the state space X is an infinite measure space, but technically speaking it’s a lot more subtle.

      Btw, I’ll mention to readers who care about priority that your calculations are based to some extent on Brendan’s original calculations which went into the proof of Theorem 1. I’ve already had trouble publishing one paper because I talked about it on this blog a bunch and a referee somehow concluded that meant the paper contained nothing new!

      • Eric says:

        Sorry. Of course I meant expectations are constant. Was in a rush.

        My calculations are certainly based, in part, on Brendan’s original. I simply took his ideas and reformulated them using discrete calculus.

  9. Aaron says:

    The bit where you say “time evolution is also random” bugs me a bit, because under the standard interpretations of quantum theory time evolution is perfectly deterministic, except during measurements.

    • John Baez says:

      Yeah—my point was that in both quantum mechanics and stochastic mechanics, the state of the system at time t is a deterministic function U(t) \psi of the state at time zero, but this state describes the probability distribution of results you get when you measure any observable. So if you say time evolution is random in a so-called ‘random walk’, maybe you should say the same for time evolution in quantum mechanics too. But if you don’t, you shouldn’t. Anyway, my first sentence here is what I meant to say, so let me fix my blog post so it says that!

      • Aaron says:


        While I really do like this exploration of the analogy between stochastic and quantum, I do think that the randomness in the stochastic case is a much different beast. Measurements are very different in the two cases. Obviously in both cases measuring changes future evolution. The difference crops up in that even unconditioned measurements change final distributions in quantum mechanics. In stochastic mechanics we can imagine that we “peek” at every time step, without changing anything. In that sense, considering every step random gives no observable differences. This is just not possible in quantum mechanics.

        (You’re well aware of all this of course, but it’s something that regularly gets swept under the rug, and I think is worth pointing out when making analogies of this sort.)

  10. I’ll be speaking at a conference celebrating the centenary of Emmy Noether’s work connecting symmetries and conservation laws:

    The Philosophy and Physics of Noether’s Theorems, 5-6 October 2018, Fischer Hall, 1-4 Suffolk Street, London, UK. Organized by Bryan W. Roberts (LSE) and Nicholas Teh (Notre Dame).

You can use Markdown or HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.