## Entropy and Uncertainty

I was going to write about a talk at the CQT, but I found a preprint lying on a table in the lecture hall, and it was so cool I’ll write about that instead:

• Mario Berta, Matthias Christandl, Roger Colbeck, Joseph M. Renes, Renato Renner, The uncertainty principle in the presence of quantum memory, Nature Physics, July 25, 2010.

Actually I won’t talk about the paper per se, since it’s better if I tell you a more basic result that I first learned from reading this paper: the entropic uncertainty principle!

Everyone loves the concept of entropy, and everyone loves the uncertainty principle. Even folks who don’t understand ’em still love ’em. They just sound so mysterious and spooky and dark. I love ’em too. So, it’s nice to see a mathematical relation between them.

I explained entropy back here, so let me say a word about the uncertainty principle. It’s a limitation on how accurately you can measure two things at once in quantum mechanics. Sometimes you can only know a lot about one thing if you don’t know much about the other. This happens when those two things “fail to commute”.

Mathematically, the usual uncertainty principle says this:

$\Delta A \cdot \Delta B \ge \frac{1}{2} |\langle [A,B] \rangle |$

In plain English: the uncertainty in $A$ times the uncertainty in $B$ is bigger than the absolute value of the expected value of their commutator

$[A,B] = A B - B A$

Whoops! That started off as plain English, but it degenerated into plain gibberish near the end… which is probably why most people don’t understand the uncertainty principle. I don’t think I’m gonna cure that today, but let me just nail down the math a bit.

Suppose $A$ and $B$ are observables — and to keep things really simple, by observable I’ll just mean a self-adjoint $n \times n$ matrix. Suppose $\psi$ is a state: that is, a unit vector in $\mathbb{C}^n$. Then the expected value of $A$ in the state $\psi$ is the average answer you get when you measure that observable in that state. Mathematically it’s equal to

$\langle A \rangle = \langle \psi, A \psi \rangle$

Sorry, there are a lot of angle brackets running around here: the ones at right stand for the inner product in $\mathbb{C}^n$, which I’m assuming you understand, while the ones at left are being defined by this equation. They’re just a shorthand.

Once we can compute averages, we can compute standard deviations, so we define the standard deviation of an observable $A$ in the state $\psi$ to be $\Delta A$ where

$(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2$

Got it? Just like in probability theory. So now I hope you know what every symbol here means:

$\Delta A \cdot \Delta B \ge \frac{1}{2} |\langle [A,B] \rangle |$

and if you’re a certain sort of person you can have fun going home and proving this. Hint: it takes an inequality to prove an inequality. Other hint: what’s the most important inequality in the universe?

But now for the fun part: entropy!

Whenever you have an observable $A$ and a state $\psi$, you get a probability distribution: the distribution of outcomes when you measure that observable in that state. And this probability distribution has an entropy! Let’s call the entropy $S(A)$. I’ll define it a bit more carefully later.

But the point is: this entropy is really a very nice way to think about our uncertainty, or ignorance, of the observable $A$. It’s better, in many ways, than the standard deviation. For example, it doesn’t change if we multiply $A$ by 2. The standard deviation doubles, but we’re not twice as ignorant!

Entropy is invariant under lots of transformations of our observables. So we should want an uncertainty principle that only involves entropy. And here it is, the entropic uncertainty principle:

$S(A) + S(B) \ge \mathrm{log} \, \frac{1}{c}$

Here $c$ is defined as follows. To keep things simple, suppose that $A$ is nondegenerate, meaning that all its eigenvalues are distinct. If it’s not, we can tweak it a tiny bit and it will be. Let its eigenvectors be called $\phi_i$. Similarly, suppose $B$ is nondegenerate and call its eigenvectors $\chi_j$. Then we let

$c = \mathrm{max}_{i,j} |\langle \phi_i, \chi_j \rangle|^2$

Note this becomes 1 when there’s an eigenvector of $A$ that’s also an eigenvector of $B$. In this case its possible to find a state where we know both observables precisely, and in this case also

$\mathrm{log}\, \frac{1}{c} = 0$

And that makes sense: in this case $S(A) + S(B)$, which measures our ignorance of both observables, is indeed zero.

But if there’s no eigenvector of $A$ that’s also an eigenvector of $B$, then $c$ is smaller than 1, so

$\mathrm{log} \, \frac{1}{c} > 0$

so the entropic uncertainty principle says we really must have some ignorance about either $A$ or $B$ (or both).

So the entropic uncertainty principle makes intuitive sense. But let me define the entropy $S(A)$, to make the principle precise. If $\phi_i$ are the eigenvectors of $A$, the probabilities of getting various outcomes when we measure $A$ in the state $\psi$ are

$p_i = |\langle \phi_i, \psi \rangle|^2$

So, we define the entropy by

$S(A) = - \sum_i p_i \; \mathrm{log}\, p_i$

Here you can use any base for your logarithm, as long as you’re consistent. Mathematicians and physicists use e, while computer scientists, who prefer integers, settle for the best known integer approximation: 2.

Just kidding! Darn — now I’ve insulted all the computer scientists. I hope none of them reads this.

Who came up with this entropic uncertainty principle? I’m not an expert on this, so I’ll probably get this wrong, but I gather it came from an idea of Deutsch:

• David Deutsch, Uncertainty in quantum measurements, Phys. Rev. Lett. 50 (1983), 631-633.

Then it got improved and formulated as a conjecture by Kraus:

• K. Kraus, Complementary observables and uncertainty relations, Phys. Rev. D 35 (1987), 3070-3075.

and then that conjecture was proved here:

• H. Maassen and J. B. Uffink, Generalized entropic uncertainty relations, Phys. Rev. Lett. 60 (1988), 1103-1106.

The paper I found in the lecture hall proves a more refined version where the system being measured — let’s call it $X$ — is entangled to the observer’s memory apparatus — let’s call it $O$. In this situation they show

$S(A|O) + S(B|O) \ge S(X|O) + \mathrm{log} \, \frac{1}{c}$

where I’m using a concept of “conditional entropy”: the entropy of something given something else. Here’s their abstract:

The uncertainty principle, originally formulated by Heisenberg, clearly illustrates the difference between classical and quantum mechanics. The principle bounds the uncertainties about the outcomes of two incompatible measurements, such as position and momentum, on a particle. It implies that one cannot predict the outcomes for both possible choices of measurement to arbitrary precision, even if information about the preparation of the particle is available in a classical memory. However, if the particle is prepared entangled with a quantum memory, a device that might be available in the not-too-distant future, it is possible to predict the outcomes for both measurement choices precisely. Here, we extend the uncertainty principle to incorporate this case, providing a lower bound on the uncertainties, which depends on the amount of entanglement between the particle and the quantum memory. We detail the application of our result to witnessing entanglement and to quantum key distribution.

By the way, on a really trivial note…

My wisecrack about 2 being the best known integer approximation to e made me wonder: since 3 is actually closer to e, are there some applications where ternary digits would theoretically be better than binary ones? I’ve heard of "trits" but I don’t actually know any applications where they’re optimal.

Oh — here’s one.

### 50 Responses to Entropy and Uncertainty

1. DavidTweed says:

About bits vs trits: in addition to issues about physical representation, computer scientists think in terms of bits because a lot of the computer stuff is based on boolean logic (even when the variables are “boolean values” which aren’t strictly represented as single bits).

There are some situations where one might want to use a tri-valued logic with “true”, “false” and “unknown”. You can, with a bit of thought about what you want to acheive, write down satisfactory truth tables for all the logical operations (and, or, etc) and generally extend this to a both a reasoning logic and something you might want to implement in programs. But I don’t think there’s ever been a big enough use to actually build a machine with physical trits rather than encoding the values into integers on a bit-based machine. For an example, see

http://www.c2.com/cgi/wiki?ThreeValuedLogic

• Tim van Beek says:

Of course all truly object oriented programming languages have trits instead of bits, because the variable

val myBoolean : Boolean

may be true, false or nil = null.

• John Baez says:

Interesting stuff!

There’s a bit about multi-valued logic in electronic circuits on Wikipedia.

That page David cites has some very interesting discussions but also a little wackiness, like this:

BottomType did not originate in Category Theory, because I first ran into it in the 1970s in some theoretical CS books (the Anatomy of Lisp, 1979, and another one on something like Denotational Semantics from circa 1973-1975), and Category Theory was very new at that time.

Huh? Category theory has been around since 1945, and what these folks are calling “Bottom Type” is what category theorists call an “initial object”.

Topos theory, a branch of category theory, can be seen as a very sophisticated and beautiful multi-valued logic. But three-valued logic goes back a lot earlier, to Łukasiewicz, around 1917. So he deserves a lot of credit in the subject of multi-valued logic. (He also invented “Polish notation”, but for some reason people saw fit only to credit his country for that — maybe because they thought it was too hard to say “Łukasiewicz”.)

I’ve also heard that some work in the tradition of Buddhist logic investigated multivalued logics, but I don’t know much about that, and I kinda doubt Łukasiewicz was influenced by that. (Dunno.)

• Tim van Beek says:

Category theory has been around since 1945…

Maybe the author intended to say that the notion of BottomType was invented without any knowledge of/reference to category theory, (although the relative clause beginning with “because…” may indicate that this is not so…).

Bottom Type” is what category theorists call an “initial object”.

What is an arrow in the category with types as objects? “a arrow b” = a extends b? (This would make the BottomType an initial object).

…maybe because they thought it was too hard to say “Łukasiewicz”.

Yes, definitely! Polish has several versions of “s” and “sh”, I’m not capable to distinguish those let alone pronouncing them.

• DavidTweed says:

Clearly the category theory remark is a misunderstanding of when category theory became widely known in CS as opposed to when it was created by mathematicians. I can believe that the notion of a bottom value (with a polymorphic type) was independently “invented” before those times.

But I wouldn’t defend that wiki much, it was the first reasonably explicit website on tri-valued logic that a web-search found.

• C. McCann says:

Category theory has been around since 1945, and what these folks are calling “Bottom Type” is what category theorists call an “initial object”.

I’m pretty sure that this is not the case, if only on the grounds that what they’re calling “Bottom Type” there is at least two distinct concepts in the context of at least three possible categories. They all relate to bottom elements of some poset (though not necessarily one on types), except for the one concept of the lot that’s most relevant to multivalued logic–namely, nullable references to a boolean value, which are merely isomorphic to any other three-element enumerated type.

So there’s a lot of confusion and misuse of terminology in that discussion, but I think I should leave it at that because most people here are probably not interested in programming language theory, and it’s all veering off-topic for the post anyway.

• John Baez says:

Tim wrote:

Maybe the author intended to say that the notion of BottomType was invented without any knowledge of/reference to category theory…

Yes. I don’t actually know the early history of ‘types’ in computer science. By now it’s clear that it’s really good to formalize types using category theory, but I don’t know when this was realized, and which ideas on types were:

1) first invented by computer scientists and then independently reinvented by category theorists (in different language),

2) first invented by category theorists and then independently reinvented by computer scientists (in different language),

3) first invented by computer scientists and then consciously borrowed by category theorists,

4) first invented by category theorists and then consciously borrowed by computer scientists.

By now the two communities are communicating pretty well, so there’s a lot of 3) and 4). But that was not always so.

It’s also quite possible that when the author wrote:

BottomType did not originate in Category Theory, because I first ran into it in the 1970s in some theoretical CS books (the Anatomy of Lisp, 1979, and another one on something like Denotational Semantics from circa 1973-1975), Category Theory was very new at that time.

he was taking a very computer-science-centered point of view, and meant “category theory was very new to computer scientists at that time”.

I just know a few dates. McCarthy invented the programming language Lisp, based on the lambda calculus, 1958. In 1965, an infuential paper by Landin pointed out an analogy between the lambda calculus and the language ALGOL. And in 1980, the deep relation between the typed lambda calculus and cartesian closed categories was
discovered by Lambek.

Tim writes:

What is an arrow in the category with types as objects?

You can imagine a category where objects are data types and a morphism f: A → B is an equivalence class of programs that takes data of type A as input and outputs data of type B. For more details, try the introduction Mike Stay and I wrote.

He was the computer scientist in this team; I was the category theorist, so don’t expect me to know anything about actual computer languages! But you can imagine ‘extension’ of types as a trivial sort of program: for example, a program where you input an integer and it outputs that integer viewed as a floating-point. Often, of course, this and other forms of ‘type conversion’ are built in, via ‘polymorphism’. But still, you can think of ‘extension’ and other forms of ‘type conversion’ as degenerate cases of programs, and all programs as all giving rise to morphisms in a category whose objects are types.

And then the ‘void type’ or ‘bottom type’ is the initial object.

C. McCann wrote:

I’m pretty sure that this is not the case, if only on the grounds that what they’re calling “Bottom Type” there is at least two distinct concepts in the context of at least three possible categories. They all relate to bottom elements of some poset (though not necessarily one on types), except for the one concept of the lot that’s most relevant to multivalued logic–namely, nullable references to a boolean value, which are merely isomorphic to any other three-element enumerated type.

As you probably know (but others may not), any poset can be seen as a category, and then a bottom element for this poset is the same as an initial object for that category.

Since I’m more of a category theorist, the only kind of ‘bottom type’ that I understand is an initial object. I don’t understand ‘nullable references to a boolean value’, and why they deserve the name ‘bottom type’ too. But I’d be glad to learn.

• Tim van Beek says:

C. McCann wrote:

…most people here are probably not interested in programming language theory…

Maybe, but at least some people are.
It would seem that JB is interested in the connection of category theory and programming languages, and I happen to be a software developer, so I’m interested in programming language from a very pragmatic POV.

I don’t understand ‘nullable references to a boolean value’, and why they deserve the name ‘bottom type’ too. But I’d be glad to learn.

I don’t want to preempt the answer of C. McCann, here is how I understand this statement, using Java nomenclature:

NULL is a special memory address that indicates that a reference to an object is not initialized. Java distinguishes classes and “primitive types”, the latter are not derived from “Object”. “Object” is the root class of the inheritance tree.

The primitive types include boolean. Primitive types in Java cannot be set to NULL, this is possible with references to objects only: A primitive type is always initialized with a default value.

Java also has “wrapper classes” for the primitive types, these have the same names but start with a capital letter, example:

boolean and Boolean.

So, boolean is not NULL-able, but Boolean is:

// this results in a compiler error:
boolean isPrimitive = null;

// this initializes the variable isPrimitive with the default value false:
boolean isPrimitive;

Boolean isWrapperClass = null;

Java also knows the “instanceof” operator, that compares a reference A and a class C and returns true if the class of A is derived from C. “instanceof” will always return true if A is NULL.

Example:

Boolean isNull = null;

if(isNull instanceof String)
{
// we’ll end up here as long as “isNull” is NULL.
}

The “primitive type inconsistency” of Java was introduced for performance considerations, I’d guess, and has often been critized. The successor languages C# and Scala don’t have “primitive types”, in these languages everything is an object :-)

• Giampiero Campa says:

On the other hand, considering high impedance as a possible state of a digital terminal (which it is), all digital circuits are actually based on 3 values, high, low and floating:

http://en.wikipedia.org/wiki/Three-state_logic

• John Baez says:

• John F says:

Multivalued digital logic is easy. Even with all non-negatives in a five volt pulse train system, using a 2.5 V level defines the typical unbalanced ternary.

The problem is transmission. Binary transmits well – e.g. “something or nothing”, or “either or” Morse dit dit dah – but multivalues do not because of shaping. The pulses aren’t perfectly timed rectangles.

• I think that from either a conceptual or physical standpoint unknown states like U an X are neither states nor values, just a way or representing uncertainty, so i am not sure that they should be treated as equal to 0 or 1.

Similarly, the don’t care value, -, is a characteristic of a given function or truth table, so it definitely does not belong there.

Distinguishing Z from W is really a subtlety, while H and L might actually be useful, even if i still believe that grouping HiZ states under Z is fine for most purposes.

Anyway, Z is probably hard to detect, so it’s not practical to use Z to either store or transmit information, and therefore we are “stuck” with binary logic for practical purposes.

2. Joe says:

As a big fan of your book on gauge fields, knots, and gravity, I’m glad you liked our paper! In case you’re interested, I wrote up a short not-too-technical explanation here.

• John Baez says:

Great, thanks! I’m sorry I didn’t get around to actually explaining your paper… but now people here can read your explanation. It looks nice!

3. Justin Scheiner says:

Thanks! I’ve been lurking on your site(s) for a while and the topics are generally out my league. This article was right on target for me.

I’m a computer science guy and have found decimal ternary useful for understanding cantor sets. :D

• John Baez says:

Glad you liked the article! It’s always nice getting feedback, even feedback like “huh?” I know what the folks who comment on this blog are like, but I have little sense for who is reading it and not commenting.

4. Mike Stay says:

How does this relate to Fisher information?

• Mike Stay says:

OK, the Fisher information is the variance of the “score”; he introduced it in 1925. Everett used it for talking about the uncertainty principle in his book on the many worlds interpretation in 1973.

The “score” of $X$ is $\frac{\partial}{\partial \theta} \mathrm{log}\, P(X|\theta)$.

The variance of a quantity $Y$ is $\langle Y^2\rangle - \langle Y\rangle^2.$

The Fisher information in a density matrix $\rho$ is
$\displaystyle I_\rho = \int \rho(\vec{r})\left[\vec{\nabla}_D\ln \rho(\vec{r})\right]^2 d^Dr$

http://iopscience.iop.org/1367-2630/8/12/330/fulltext

• John Baez says:

Mike wrote:

OK, the Fisher information is the variance of the “score”; Everett used it for talking about the uncertainty principle in his book on the many worlds interpretation in 1973.

The problem is, I don’t have any intuition for what’s going on in those formulas you wrote down — except the variance, which I learned back in high school.

I always have trouble understanding Fisher information. I first tried understanding it back when Frieden started claiming that it was earth-shatteringly important throughout all science, but I failed miserably.

Later I got some help from David Corfield.

There’s a quantity you know well, called the “information gain” or “relative entropy” S(p,q): how much information you gain when your hypothesis about some system was given by the probability distribution q and someone tells you “no, it’s p”.

For example, when I flip a coin and hide it from you, you hypothesize that it has a 50% chance of being heads and a 50% chance of being tails: that’s your p. Then I lift my hand and you see it’s heads up. So q is 100% heads, 0% tails. The information gain works out to be one bit. This example is too easy, but it conveys the flavor.

(Some obscurantists call information gain the “Kullback-Leibler divergence”. No criticism of Kullback and Leibler intended, but that term completely obscures the meaning of the concept, whereas I could pretty much reinvent the math on my own starting from the phrase “relative entropy” or “information gain”.)

I learned about information gain when I studied Everett’s thesis in college, working with my friend Bruce Smith. It made a lot of sense. But I don’t remember Everett talking about Fisher information.

The information gain does not define a metric:

S(p,q) ≠ S(q,p)

but it does obey the triangle inequality, so you can symmetrize it and get a metric on probability measures! That’s called the “information metric”.

And in science, you often have hypotheses parametrized by a bunch of numbers. In this situation, what you really have is a function from some manifold to a space of probability measures. So, the information metric on probability measures gives you a metric on that manifold! And lots of times it’s a Riemannian metric.

This makes a lot of intuitive sense: your manifold is a space of “hypotheses”, and there’s a distance between hypotheses, that says (in a symmetrized way) how much information you get if you started out believing one hypothesis and then learn the other is true.

There are people working on machine learning who think about hypotheses in this geometrical way, and think about geodesics in this manifold, and stuff like that. The buzzword is “information geometry”. David Corfield has been telling us about information geometry for years now, on various blogs

And then, if I remember right, the Fischer information is related in some simple way to the Riemann curvature tensor of this Riemannian metric!

Unfortunately I forget the details. Hmm, try this and
this.

But unfortunately, I’m stuck on some more basic points.

First of all, why the heck should we care about the curvature of the information metric? What is its conceptual meaning?

Second, the Fischer information itself seems to define some sort of metric on probability distributions! So what’s going on? How could the curvature of one metric be another metric?

I’m probably making some sort of terrible mistake here, right about the point where Fisher information raises its ugly head. Maybe the Fischer information is the information metric, slightly repackaged — that would be more sensible than having it be the curvature of the information metric.

Yeah, I’m hoping that’s true. That would make the world a simpler, better place.

• One thing is to get straight on how Riemannian manifold, metric, distance, divergence, and connection fit together – an exercise in differential geometry.

I may well have this wrong, but a Riemannian manifold is a manifold equipped with a Riemannian metric (one of those things that gives a real number when fed two tangent vectors). A Riemannian metric gives rise to a (symmetric) distance via geodesics.

It may be possible to define many connections on a Riemannian manifold which are compatible with the metric. Each of these connections gives rise to a type of geodesic.

One connection compatible with the Fisher information metric gives rise to a geodesic with the property that the associated (nonsymmetric) distance is relative entropy.

The Fisher information metric is the unique Riemannian metric invariant under reparameterization. Its curvature is studied in problems such as estimator accuracy.

• Tim van Beek says:

We already mentioned Fisher information on this blog briefly
here.

My simplistic view of Fisher information was this: We have a family of probability distributions (aka random variables) that can be parametrized by a finite set of parameters such that they form a manifold.

The Fisher metric is a Riemann metric on this manifold. It is unique in the following sense:

Definition: A stochastic map is a linear map on the algebra of random variables that preserves positivity and takes 1 to itself.

If a metric measures the ability to distinguish points, then the distance between two points should be reduced – at most – by any stochastic map, that is, a stochastic map should “muddy the waters” and reduce the ability to distinguish points.

It is possible to prove that the Fisher-Rao metric is the only one (up to multiples) that has this property:

Cencov, N. N. (1982). Statistical Decision Rules and Optimal Inference (Providence, RI, American Mathematical Society).

With regard to one interpretation of the metric distance, let me cite myself :-):

Let’s say we can obtain iid (independent identically distributed) samples from a fixed, but unkown, probability distribution (= point of our manifold), and would like to use these to estimate the parameters of this point.

What we would like to have is a uniformly minimum variance unbiased estimator T. Now, the Cramér-Rao lower bound gives a lower bound on the variance of all unbiased estimators, it is proportional to the inverse of the Fisher information matrix I.

In this sense, somewhat simplifying, the farther away a point is from the point where you get your samples from, the less samples you need, on the average, to find out that you are not there :-)

• Chris Goddard says:

Hi, John.

Interesting blog; I’m enjoying the variety of posts so far.

Unfortunately, I’m stuck on some more basic points.

First of all, why the heck should we care about the curvature of the information metric? What is its conceptual meaning?

Second, the Fischer information itself seems to define some sort of metric on probability distributions! So what’s going on? How could the curvature of one metric be another metric?”

Now, I don’t know much about the standard approaches to information geometry through Machine Learning etc, but I’ll bite.

These two questions, according to my understanding, are related. This is because it is possible to define the Fisher Information in the following fairly simple way –

Let $\{ TM \times TM \rightarrow R \}$ be the class of metrics on a space $M$ with differentiable structure. Choose a trivial basis for this space and extend over the manifold. Say for $R^n$ this would be $x_i \otimes x_j$, but of course this extends. Broadly speaking, think of this as a basis for $GL(n)$ – more interesting examples can then be constructed by merely considering the space locally as $GL(n)$, e.g. bundles etc.

Then one can take a section of this space over the manifold $M$ , as $g : TM \times TM \rightarrow R$ . For clarity I will refer to this as the information metric , and assume that it is Riemannian.

Then the connection between this and the Fisher information is via the following rather beautiful observation. View the information metric as a smooth sampling of the space of possibilities for the Riemannian metric at each point in the manifold. Then we can interpret it probabilistically as a pdf

$f(m,\alpha) = \delta( g(m) - \alpha)$

which indeed trivially has the property that if one integrates over the space of metrics for fixed m, we obtain 1.

From this the Fisher information functional can be constructed-

$\int_M \int_A f(m,\alpha) \partial ln f(m,\alpha) \partial ln f(m,\alpha) d\alpha dm$

and for this choice of geometric objects it can be demonstrated that the above reduces to

$\int_M R(g) dm,$

nothing other than the integral of the scalar curvature over the original space.

• Tim van Beek says:

Second, the Fischer information itself seems to define some sort of metric on probability distributions! So what’s going on? How could the curvature of one metric be another metric?

Sorry if I muddy the waters with my comment, but take a look a a simple example where our manifold is some $\mathbb{R}^n$, let the Kullback-Leibler divergence be $S(p, q)$ and compute the Hessian of it, that is choose vectors $u, v$ and parameters $t, s$, differentiate $S(p + tu, q + sv)$ for $t, s$, at $t=s=0$ and get the Fisher information $F(u, v)$.

(How do I use latex? Sorry, I already forgot…).

That is the Hessian of the real valued function $S(p, q)$ is the Fisher information.

• John Baez says:

Your comment didn’t muddy the waters. It should help!

If we can understand the example of $\mathbb{R}^n$ well enough, we should be able to understand any manifold, since they all look locally like $\mathbb{R}^n$.

Two basic questions:

1) In this formulation, what’s the Fisher information a function of? You wrote $F(u,v)$ but I guess it’s really a function of $p,q,u,v$. In other words, it’s a function on the tangent bundle of $\mathbb{R}^n \times \mathbb{R}^n$. Is that right?

For some reason I had been thinking that Fisher information showed up only when we took $p = q$. Then it becomes a symmetric bilinear function of the two tangent vectors $u, v$, which is about what we expect for a Riemannian metric.

2) What’s the point? I understand the conceptual meaning of ‘relative entropy’ (or as you seem to insist on calling it, despite my already announced disgust, ‘Kullback-Leibler divergence’.) But why should we take the Hessian of the relative entropy? What good is that? What does it mean?

I converted your comment into LaTeX. To do LaTeX, use dollar signs as usual (single dollar signs only), but write “latex” directly after the first dollar sign, without a space. Thus,

$latex p \in \mathbb{R}^n$

gives

$p \in \mathbb{R}^n$

• Tim van Beek says:

1) Yes, that’s right, I was lazy.

2) The relative entropy induces a topology on the space of probability distributions. I implicitly fixed a coordinate system, the Hessian with respect to this – arbitrary but fixed – coordinate system defines a Riemann metric that is compatible with this topology.

This metric is the Fisher metric, which is unique in the sense I described.

All of this is just an observation, I don’t know if there is any “point”…I don’t understand how the Fisher metric is supposed to be the “curvature” of a “metric” derived from the relative entropy, my educated guess is that this is just an unconventional phrasing of the connection I pointed out, but I don’t know…

• Oh, it looks like what I said wasn’t quite right. There’s a unique connection associated with the Fisher information metric so that the inner product of vectors is invariant under transport by that connection. This is the Levi-Civita or Riemannian connection. But this connection gives rise to a distance which is not relative entropy (KL-divergence).

However, there is a whole one-parameter family of pairs of dual connections such that the inner product is preserved when the first vector is transported according to the first connection and the second vector by the second. (The Riemannian connection is the self-dual member of this family.)

One special member of this family has dual connections often denoted the e-connection and m-connection. It’s the e-connection which gives rise to the relative entropy divergence.

There’s then a bunch of clever stuff involving Legendre transformations between dual coordinate systems as in here.

Tim’s returning us from the e-connection, or rather its form of divergence, to the Fisher information metric.

• John F says:

The Jensen-Shannon metric is better.
http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence

BTW, Jensen’s work in this area was in the 1890s, predating whatever other dates mentioned above, published in 1904 I think.

• John F says:

Crooks had a nice article “Measuring Thermodynamic Length”
http://prl.aps.org/pdf/PRL/v99/i10/e100602

Equation 6 therein may provide the aha you’re looking for.

• John Baez says:

Aha!

Very, very nice — thanks! This helps immensely.

This article by Crooks should be readable to anyone who is comfortable with the usual statistical mechanics game of computing the means, variances, and higher moments of observables by repeatedly differentiating the logarithm of the partition function. And you don’t need a subscription to Phys. Rev. Lett. to read it! It’s free here:

• Gavin E. Crooks, Measuring thermodynamic length,
arXiv:0706.0559.

• Tim van Beek says:

Aha!

(What’s “fractional distillation”?)

But I’m still getting confused by statements like this one:

The square root of the
Jensen-Shannon divergence is a metric between probability
distributions [25]. However, unlike a Riemannian
metric, the Jensen-Shannon metric space is not an intrinsic
length space. There may not be a mid point b between
points a and c such that d(a, b) + d(b, c) = d(a, c)
and consequentially we cannot naturally measure path
lengths. However, on any metric space we can define a
new intrinsic metric by measuring the distance along continuous
paths.

How does the author define “divergence” and “square root of a divergence”?

And what do the last two sentences mean?

• John Baez says:

As for the last two sentence in the text you quote: the author is apparently defining a metric space to be intrinsic if for any two points a and c there exists a midpoint, meaning a point b with

d(a, b) + d(b, c) = d(a, c).

And then he’s claiming that given any metric space with metric d, we can define a new metric d’ that is intrinsic, by letting d'(a,b) be the infimum of the “arclengths” of continuous paths from a to b. I’m not sure, but I think I see a reasonable way to define the “arclength” of a continuous path in any metric space (which could, of course, be infinite).

And I think he’s also hinting that for an intrinsic metric space, the metric d’ is the same as the metric d.

In other words: for an intrinsic metric space, the distance between two points equals the infimum of the arclengths of all continuous paths from one point to another. This is pretty easy to see: just take your two points and repeatedly take choose midpoints to create, in the limit, a continuous path from one point to the other, whose arclength is the distance between these points!

He is saying all this a fairly concise way.

• John F says:

Tim,

numerically a divergence is simply a monotonic nonnegative measure of (usually highly multidimensional) difference, and can’t necessarily be massaged into a norm or anything. Divergences are mostly commonly used as class separability indicators, for instance in hyperspectral geospatial imaging applications. The most useful divergence implemented in most GIS software is called Transform Divergence, which despite its vague name is fast to calculate. The square root is numerically the square root.

The Jensen-Shannon metric is a metric that doesn’t require coordinatizing to define. I think the last sentence is saying that it could be coordinatized if you wanted to.

5. Blake Stacey says:

If you like entropic uncertainty principles, you’d probably also like entropic Bell inequalities.

6. On trits, back in my undergrad, a friend got me to amuse myself with the “balanced base 3” representation of integers. A familier sequence of integers might, in this notation, look like
“… NN, N0, NP, 0N, 00, 0P, PN, P0, PP…”
It’s a cute little thing, and it elucidates a conundrum of Meziriac — although it also makes some familiar digit-crunching algorithms somewhat messier — or I never thought enough about them, maybe.

7. DavidTweed says:

Going off topic, I’m starting to see a couple of user names from TheOilDrum show up here (such as WebHubTel), so I’ll point out that there’s currently a draft page on peak oil page under development on the Azimuth wiki.

http://www.azimuthproject.org/azimuth/show/Peak+oil

I’m slowly adding to it, but any contributions (particularly corrections of errors) are very welcome to that or any other wiki page. It’s John Baez’ project, but here’s my interpretation of a couple of points to bear in mind with respect to wiki articles:

1. Azimuth, with an intended audience of scientists, is trying to have its articles maintain the distinction between things like “observationally confirmed facts” (ideally with references as close to the source as possible), “model based projections” and “intuitive conjectures”, “speculations”, etc. It’s also preferred to be explicit about any uncertainty that exists. Part of why it’s gone slowly so far is me trying to figure out which stuff falls into each category (along with references).

2. Azimuth’s goal is providing an intro to all “environmental/sustainability” problems, not just energy.

3. Information about possible solutions are also very relevant, but advocacy should maintain the same sort of distinctions as 1.

(I’m not trying to scare anybody off from contributing, just trying to state what the goals for wiki articles are.) Information about how to get a wiki account can be found at

http://www.azimuthproject.org/azimuth/show/HomePage

8. The question of bits, trits, or nats is bypassed by using the base-free version, namely the antilog of Shannon’s entropy:

$H_{m}(p)=2^{H(p)}=\prod_{i}(\frac{1}{p_{i}})^{p_{i}}$,

where the usual Shannon (additive) entropy is:

$H(p)=\sum_{i}p_{i}\mathrm{log}_{2}\,(\frac{1}{p_{i}})$

for a finite probability distribution $p = \{p_{1},\ldots,p_{n}\}$. Ironically, the basic property often used to justify Shannon’s additive entropy formula, namely the asymptotic equipartition property (see Cover and Thomas, chapter 3), actually develops the base-free multiplicative form given above, and then Shannon’s additive form is presented by choosing a base such as 2 and then writing $H_{m}(p)=2^{H(p)}$. Every math teacher has to get students to realize that the probabilities of independent events multiply rather than add, but Shannon “intuited” that “information” should then add so he developed his theory using the additive version, which involved choosing some base for the logs, rather than using the base-free multiplicative version of his entropy.

• That’s Thomas Cover and Joy Thomas, Elements of Information Theory, Wiley, 1991.

• WebHubTel says:

I think that definition of entropy is also known a “perplexity”, which is a measure of surprise.

It might have a better intuitive notion behind it as you can lower the entropy by constraining a system and thus effectively reducing the state space and therefore the surprise. It has a more direct combinatorial feel to it, IMO.

I might have that interpretation wrong but it usually gets stated as the math relation and people leave it there. I always like to see an intuitive explanation for these things.

• John Baez says:

You can think of the entropy of a probability distribution to be the average value of the surprise you feel when, starting out knowing this probability distribution, you discover the actual value of the random variable it describes.

Here I’m doing not what David suggests, but the usual thing: treating surprise as additive. Suppose I feel one tiny bit of surprise when you flip a fair coin and I discover it lands heads up. Then if we treat surprise as additive, I’ll feel $n$ bits of surprise if you flip the coin $n$ times and it lands heads up every time. So, this event, which has probability $2^{-n}$, delivers $n$ bits of surprise. So, it seems sensible to define the surprise of an event of probability $p$ to be

$- \mathrm{log}_2 p$

Thus, if we are in a situation where there are a bunch of possible outcomes, with the ith outcome having probability $p_i$, the average surprise I’ll feel is

$- \sum_i p_i \mathrm{log}_2 p_i$

And this is the entropy!

People usually say “expected value” where I’ve been saying “average” above, so we can summarize by saying the entropy is the expected surprise.

This sounds a bit funny, because we think of a surprise as being unexpected. “Expected surprise” sounds a bit paradoxical. But if we remember that “expected” means “average” here, and wrap our heads around the question “how surprised do you expect to be?”, we’ll see the answer is entropy. The entropy of a probability distribution tells us how surprised we should expect to be.

David wisely suggests treating surprise as multiplicative rather than additive, to avoid the need for logarithms, and avoid the need to choose a specific base (like 2) for our logarithms. We could retell the whole story I just told using that system, if we wanted.

• DavidTweed says:

Going off topic, I’m starting to see a couple of user names from TheOilDrum show up here (such as WebHubTel), so I’ll point out that there’s currently a draft page on peak oil page under development on the Azimuth wiki.

http://www.azimuthproject.org/azimuth/show/Peak+oil

I’m slowly adding to it, but any contributions (particularly corrections of errors) are very welcome to that or any other wiki page. It’s John Baez’ project, but here’s my interpretation of a couple of points to bear in mind with respect to wiki articles:

1. Azimuth, with an intended audience of scientists, is trying to have its articles maintain the distinction between things like “observationally confirmed facts” (ideally with references as close to the source as possible), “model based projections” and “intuitive conjectures”, “speculations”, etc. It’s also preferred to be explicit about any uncertainty that exists. Part of why it’s gone slowly so far is me trying to figure out which stuff falls into each category (along with references).

2. Azimuth’s goal is providing an intro to all “environmental/sustainability” problems, not just energy.

3. Information about possible solutions are also very relevant, but advocacy should maintain the same sort of distinctions as 1.

(I’m not trying to scare anybody off from contributing, just trying to state what the goals for wiki articles are.) Information about how to get a wiki account can be found at

http://www.azimuthproject.org/azimuth/show/HomePage

• DavidTweed says:

Reading my comment back, it’s a little unclear in (1). It’s perfectly ok to have relevant speculations in an article, but the wording should make it clear that’s what they are (likewise with the other levels of evidence).

• WebHubTel says:

Responding to DavidTweed, I would be happy to contribute to the Azimuth Wiki. I infer what is different about the Azimuth Wiki versus something like Wikipedia, is that Azimuth is a bit more open to speculative approaches. In contrast, Wikipedia is really set up to go the accepted wisdom route.

• John Baez says:

There are lot of differences between the Azimuth Project and Wikipedia. About the only thing they have in common is that they’re wikis and they’re trying to provide reliable information! The Azimuth Project is not an encyclopedia: it’s a focal point for scientists and engineers interested in saving the planet. We’re trying to provide accurate, reliable information, so speculations should be clearly labelled as such, and attributed to whoever is doing the speculating.

For details, go here. And if you want to contribute, please join the Azimuth Forum, so you can log your changes and discuss issues with the rest of us.

9. As John says, one could redo the whole theory using a multiplicative notion of surprise. Indeed, one common notion of surprise or unexpectedness is the reciprocal of probability, and as one information theorist put it: “[Shannon] defined a quantity which he called ‘amount-of-information’, which is essentially a logarithmic measure of the statistical unexpectedness (reciprocal of probability) of the message concerned.” [MacKay, Donald M. 1969. Information, Mechanism and Meaning. Cambridge: MIT Press, p. 79.] Using the reciprocal probability measure, the “average” of independent events is their geometric mean: $H_{m}(p)=\prod_{i}(\frac{1}{p_{i}})^{p_{i}}=2^{H(p)}$ so the multiplicative base-free version of Shannon’s entropy is in that sense also the “average surprise.”

There is another way to approach this matter that avoids the “tarpit” of subjective notions like “surprise.” One commonly illustrates a probability 1/n as being a choice of one out of a set of n equiprobable alternatives. Thus a probability p is modeled as a choice out of 1/p equiprobabile options which is sometimes called the “number-equivalent” of the probability or of the proportion (in a nonprobabilistic context). Now $p = \{p_{1},\ldots,p_{n}\}$ can be interpreted either as a probability distribution or a set of proportions with each $p_{i}$ interpreted as a choice of one among $\frac{1}{p_{i}}$ equal alternatives. Then the multiplicative entropy is the (geometric) average number of equal alternatives.

This is relevant to the question of interpreting entropy raised by one commenter. Shannon’s additive entropy is the average number of yes-or-no questions needed to single out a specific equal alternative, while the multiplicative version of his entropy is just the average number of equal alternatives.

One use of entropy in mathematical biology is to measure species diversity. The mathematical biologist Robert H. MacArthur noted that it is the multiplicative version of entropy that has the more intuitive interpretation as the average number of equally common species:

“Notice that if all N species are equally common, each is a proportion 1/N of the total. Thus the [Shannon entropy] measure … equals log N, so the [Shannon entropy] measure of equally common species is simply the logarithm of the number of equally common species, meaning that E equally common species would have the same diversity as the N unequally common species in our census. ”

“Returning to the example of a census of 99 individuals of one species and 1 of a second, we calculate H = … =0.0560 [as the Shannon entropy using natural logs]. For a census of fifty individuals of each of the two species we would get H = … =0.693. To convert these back to ‘equally common species’, we take e^0.0560 = 1.057 for the first census and e^0.693 = 2.000 for the second. These numbers, 1.057 and 2, accord much more closely with our intuition of how diverse the areas actually are,… .”

[MacArthur, Robert H. 1965. Patterns of Species Diversity. Biol. Rev. 40, p. 514.]

If the Shannon additive entropy using natural logs is $H_{e}(p)$, then MacArthur is noting that the easier-to-interpret notion is: $H_{m}(p)=\prod_{i}(\frac{1}{p_{i}})^{p_{i}}=2^{H(p)}=e^{H_{e}(p)}$.

The more common measure of species diversity is the Gini-Simpson index [see Rao, C. R. 1982. Diversity and Dissimilarity Coefficients: A Unified Approach. Theoretical Population Biology. 21: 24-43] which is just the previously mentioned logical entropy: $h(p)=\sum_{i}p_{i}(1-p_{i})$ and which is easily interpreted as the probability of getting distinct species or outcomes in two independent draws.

I have gone into this matter only to ‘loosen’ the usual attachment to Shannon’s specific additive formula. There is no point to redo information theory routinely substituting the antilog for Shannon’s entropy since nothing mathematically new would be obtained [although it might be interesting to see what results could be rederived using logical entropy–see Ellerman, David 2009. Counting Distinctions: On the Conceptual Foundations of Shannon’s Information Theory. Synthese. 168 (1 (May)): 119-149, for a beginning].

Incidentally, the old comparison of Shannon’s additive entropy to the entropy formula derived in statistical mechanics is somewhat less than it seems since that derivation depends on using a specific version of Stirling’s approximation for n!. Add some higher-order terms to the approximation and a different formula emerges. This is not important numerically but it is precisely to the point if one is making a point about functional forms.

• WebHubTel says:

One reciprocal probability measure is the odds function. This is defined as Odds = P/(1-P) where P is a cumulative probability. What is interesting about the idea of odds is that it has taken hold as an intuitive measure of probability for many people.

Interesting if you invert the odds function, then you get the power law function P = 1/(1+1/Odds), where Odds is just a relative measure against the median (in this case, the median is 1). I think this function is the key to species diversity and lots of other behaviors. It comes up as a ratio distribution between two exponential distributions for one. Since species diversify by rate mechanisms (think mutations) may be Rate = dX/dTime you can see how this may naturally come about. Noth dX and dTime have natural distributions so that the rato may have something different.

I haven’t been able to deduce this yet but I think that the P=1/(1+1/Odds) distribution is the maximum entropy distribution for some unknown constraint. It definitely doesn’t have any finite moments.

http://mobjectivist.blogspot.com/2010/04/entroplet-species-area-relationships.html
Plus some links in there. Fun messing around with this stuff.

10. westy31 says:

John wote:

are there some applications where ternary digits would theoretically be better than binary ones?

One is the most efficient way to weigh something with a balance.

If you put the object to be weighed on one side, and then weights on the other side, then binary weights (1,2,4,8…) are the most efficient, they use the least weights. But if you use both sides of the balance, subtracting the weights put together with the weighed objects, then ternary weights (1,3,9,27,…) are the most efficient. You can weigh anything upto 13 with just 3 weights:

* 1=1
* 2=3-1
* 3=3
* 4=3+1
* 5=9-3-1
* 6=9-3
* 7=9-3+1
* 8 =9-1
* 9=9
* 10=9+1
* 11=9+3-1
* 12=9+3
* 13=9+3+1

Similarly, coins of 1,3,8,27,.. are the most efficient way of paying if you allow the seller to give change. (Binary is the best if you don’t allow change)

Gerard

11. […] was pointed out by John F in our discussion of entropy and […]

12. Sniffnoy says:

Regarding ternary: I’ve been doing some stuff with a kind of terrible sequence known as integer complexity and the fact that 3 is closer to e than 2 comes up there. I’ll use ||n|| to denote the complexity of n, the smallest number of 1s needed to write n using addition and multiplication – well, you can go see the Sloane entry for yourself. This fact isn’t relevant for finding an upper bound – the upper bound based on binary representation is better than the one based on ternary. But if you kind of invert the question and ask, I have k ones, what’s the largest number I can make with them using addition and multiplication? Well ideally you’d group them into groups of e and multiply those together to get e^(k/e), but seeing as that doesn’t make a lot of sense, the actual maximum is 3^(k/3). (If k is divisible by 3, anyway; otherwise you have a group of 2 or 4 left over.) So you get ||n||>=3log_3(n). And this fact leads to 3 being nice in other ways with respect to this sequence, which results in you ending up looking at ternary representations after all for a bunch of things…

• Sniffnoy says:

Oops, I missed the “radix economy” note. Which has of course the same reason – 3 being the integer maximizing $n^{1/n}$, or minimizing n/(log n)…

13. john e gray says:

This misses reference to pioneering work by Weinhold that I first read about when I started reading Physics Today, found in the introductory article in Physics Today:

• F. Weinhold, Geometry and Thermodynamics, Physics Today 29(3), 23-30 (1976).

The pioneering insight was the usage of concavity found in the property equations of state surface in a geometric fashion. Riemann was introduced by Gilmore (Drexel Univ) extended this to Reimannian geometry in the 80’s. A number of papers can be found on his website:

http://www.physics.drexel.edu/~bob/Thermodynamics_pgm.html

People interested in this subject can find much of interest on this webpage.

This site uses Akismet to reduce spam. Learn how your comment data is processed.