Entropy and Uncertainty

19 October, 2010

I was going to write about a talk at the CQT, but I found a preprint lying on a table in the lecture hall, and it was so cool I’ll write about that instead:

• Mario Berta, Matthias Christandl, Roger Colbeck, Joseph M. Renes, Renato Renner, The uncertainty principle in the presence of quantum memory, Nature Physics, July 25, 2010.

Actually I won’t talk about the paper per se, since it’s better if I tell you a more basic result that I first learned from reading this paper: the entropic uncertainty principle!

Everyone loves the concept of entropy, and everyone loves the uncertainty principle. Even folks who don’t understand ’em still love ’em. They just sound so mysterious and spooky and dark. I love ’em too. So, it’s nice to see a mathematical relation between them.

I explained entropy back here, so let me say a word about the uncertainty principle. It’s a limitation on how accurately you can measure two things at once in quantum mechanics. Sometimes you can only know a lot about one thing if you don’t know much about the other. This happens when those two things “fail to commute”.

Mathematically, the usual uncertainty principle says this:

\Delta A \cdot \Delta B \ge \frac{1}{2} |\langle [A,B] \rangle |

In plain English: the uncertainty in A times the uncertainty in B is at least half the absolute value of the expected value of their commutator

[A,B] = A B - B A

Whoops! That started off as plain English, but it degenerated into plain gibberish near the end… which is probably why most people don’t understand the uncertainty principle. I don’t think I’m gonna cure that today, but let me just nail down the math a bit.

Suppose A and B are observables — and to keep things really simple, by observable I’ll just mean a self-adjoint n \times n matrix. Suppose \psi is a state: that is, a unit vector in \mathbb{C}^n. Then the expected value of A in the state \psi is the average answer you get when you measure that observable in that state. Mathematically it’s equal to

\langle A \rangle = \langle \psi, A \psi \rangle

Sorry, there are a lot of angle brackets running around here: the ones at right stand for the inner product in \mathbb{C}^n, which I’m assuming you understand, while the ones at left are being defined by this equation. They’re just a shorthand.

Once we can compute averages, we can compute standard deviations, so we define the standard deviation of an observable A in the state \psi to be \Delta A where

(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2

Got it? Just like in probability theory. So now I hope you know what every symbol here means:

\Delta A \cdot \Delta B \ge \frac{1}{2} |\langle [A,B] \rangle |

and if you’re a certain sort of person you can have fun going home and proving this. Hint: it takes an inequality to prove an inequality. Other hint: what’s the most important inequality in the universe?
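
If you’d rather see this inequality in action than prove it, here’s a little numerical check: just NumPy, with two Pauli matrices and a state I picked arbitrarily, nothing tied to the discussion above.

    import numpy as np

    def expectation(A, psi):
        # <A> = <psi, A psi>
        return np.vdot(psi, A @ psi)

    def std_dev(A, psi):
        # (Delta A)^2 = <A^2> - <A>^2
        return np.sqrt(expectation(A @ A, psi).real - expectation(A, psi).real ** 2)

    A = np.array([[0, 1], [1, 0]], dtype=complex)     # Pauli x
    B = np.array([[0, -1j], [1j, 0]], dtype=complex)  # Pauli y
    psi = np.array([1, 0], dtype=complex)             # "spin up" along z

    lhs = std_dev(A, psi) * std_dev(B, psi)
    rhs = 0.5 * abs(expectation(A @ B - B @ A, psi))
    print(lhs, ">=", rhs)   # 1.0 >= 1.0: this particular state saturates the bound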

But now for the fun part: entropy!

Whenever you have an observable A and a state \psi, you get a probability distribution: the distribution of outcomes when you measure that observable in that state. And this probability distribution has an entropy! Let’s call the entropy S(A). I’ll define it a bit more carefully later.

But the point is: this entropy is really a very nice way to think about our uncertainty, or ignorance, of the observable A. It’s better, in many ways, than the standard deviation. For example, it doesn’t change if we multiply A by 2. The standard deviation doubles, but we’re not twice as ignorant!

Entropy is invariant under lots of transformations of our observables. So we should want an uncertainty principle that only involves entropy. And here it is, the entropic uncertainty principle:

S(A) + S(B) \ge \mathrm{log} \, \frac{1}{c}

Here c is defined as follows. To keep things simple, suppose that A is nondegenerate, meaning that all its eigenvalues are distinct. If it’s not, we can tweak it a tiny bit and it will be. Let its eigenvectors be called \phi_i. Similarly, suppose B is nondegenerate and call its eigenvectors \chi_j. Then we let

c = \mathrm{max}_{i,j} |\langle \phi_i, \chi_j \rangle|^2

Note this becomes 1 when there’s an eigenvector of A that’s also an eigenvector of B. In this case it’s possible to find a state where we know both observables precisely, and in this case also

\mathrm{log}\, \frac{1}{c} = 0

And that makes sense: in this case S(A) + S(B), which measures our ignorance of both observables, is indeed zero.

But if there’s no eigenvector of A that’s also an eigenvector of B, then c is smaller than 1, so

\mathrm{log} \, \frac{1}{c} > 0

so the entropic uncertainty principle says we really must have some ignorance about either A or B (or both).

So the entropic uncertainty principle makes intuitive sense. But let me define the entropy S(A), to make the principle precise. If \phi_i are the eigenvectors of A, the probabilities of getting various outcomes when we measure A in the state \psi are

p_i = |\langle \phi_i, \psi \rangle|^2

So, we define the entropy by

S(A) = - \sum_i p_i \; \mathrm{log}\, p_i

Here you can use any base for your logarithm, as long as you’re consistent. Mathematicians and physicists use e, while computer scientists, who prefer integers, settle for the best known integer approximation: 2.

Just kidding! Darn — now I’ve insulted all the computer scientists. I hope none of them reads this.
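
Here’s a numerical sanity check of the entropic uncertainty principle, in the same spirit as before. The observables and the state are my own toy choices, and the eigenvectors come from numpy.linalg.eigh, since both matrices are self-adjoint.

    import numpy as np

    def measurement_entropy(A, psi):
        # S(A): entropy of the outcome distribution when measuring A in the state psi
        _, eigvecs = np.linalg.eigh(A)
        p = np.abs(eigvecs.conj().T @ psi) ** 2   # p_i = |<phi_i, psi>|^2
        p = p[p > 1e-12]                          # drop zero terms: 0 log 0 = 0
        return -np.sum(p * np.log(p))

    A = np.array([[0, 1], [1, 0]], dtype=complex)    # Pauli x
    B = np.array([[1, 0], [0, -1]], dtype=complex)   # Pauli z
    psi = np.array([1, 0], dtype=complex)

    phis = np.linalg.eigh(A)[1]                      # eigenvectors of A, as columns
    chis = np.linalg.eigh(B)[1]                      # eigenvectors of B, as columns
    c = np.max(np.abs(phis.conj().T @ chis) ** 2)    # c = max_{i,j} |<phi_i, chi_j>|^2

    print(measurement_entropy(A, psi) + measurement_entropy(B, psi), ">=", np.log(1 / c))
    # here c = 1/2, so the bound is log 2, and this particular state saturates it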

Who came up with this entropic uncertainty principle? I’m not an expert on this, so I’ll probably get this wrong, but I gather it came from an idea of Deutsch:

• David Deutsch, Uncertainty in quantum measurements, Phys. Rev. Lett. 50 (1983), 631-633.

Then it got improved and formulated as a conjecture by Kraus:

• K. Kraus, Complementary observables and uncertainty relations, Phys. Rev. D 35 (1987), 3070-3075.

and then that conjecture was proved here:

• H. Maassen and J. B. Uffink, Generalized entropic uncertainty relations, Phys. Rev. Lett. 60 (1988), 1103-1106.

The paper I found in the lecture hall proves a more refined version where the system being measured — let’s call it X — is entangled with the observer’s memory apparatus — let’s call it O. In this situation they show

S(A|O) + S(B|O) \ge S(X|O) + \mathrm{log} \, \frac{1}{c}

where I’m using a concept of “conditional entropy”: the entropy of something given something else. Here’s their abstract:

The uncertainty principle, originally formulated by Heisenberg, clearly illustrates the difference between classical and quantum mechanics. The principle bounds the uncertainties about the outcomes of two incompatible measurements, such as position and momentum, on a particle. It implies that one cannot predict the outcomes for both possible choices of measurement to arbitrary precision, even if information about the preparation of the particle is available in a classical memory. However, if the particle is prepared entangled with a quantum memory, a device that might be available in the not-too-distant future, it is possible to predict the outcomes for both measurement choices precisely. Here, we extend the uncertainty principle to incorporate this case, providing a lower bound on the uncertainties, which depends on the amount of entanglement between the particle and the quantum memory. We detail the application of our result to witnessing entanglement and to quantum key distribution.

By the way, on a really trivial note…

My wisecrack about 2 being the best known integer approximation to e made me wonder: since 3 is actually closer to e, are there some applications where ternary digits would theoretically be better than binary ones? I’ve heard of "trits" but I don’t actually know any applications where they’re optimal.

Oh — here’s one.


Algorithmic Thermodynamics (Part 1)

12 October, 2010

My grad student Mike Stay and I put a new paper on the arXiv:

• John Baez and Mike Stay, Algorithmic thermodynamics.

Mike has a master’s degree in computer science, and he’s working for Google on a project called Caja. This is a system for letting people write programs in JavaScript while protecting the end users from dirty tricks the programmers might have tried. With me, he’s mainly working on the intersection of computer science and category theory, trying to bring 2-categories into the game. But Mike also knows a lot about algorithmic information theory, a subject with fascinating connections to thermodynamics. So, it was natural for us to work on that too.

Let me just tell you a little about what we did.



Around 1948, the electrical engineer Claude Shannon came up with a mathematical theory of information. Here is a quote that gives a flavor of his thoughts:

The whole point of a message is that it should contain something new.

Say you have a source of information, for example a mysterious radio transmission from an extraterrestrial civilization. Suppose every day you get a signal sort of like this:

00101101011111101010101011011101110

How much information are you getting? If the message always looks like this, presumably not much:

00000000000000000000000000000000000

Shannon came up with a precise formula for the information. But beware: it’s not really a formula for the information of a particular message. It’s a formula for the average information of a message chosen from some probability distribution. It’s this:

- \sum_i p_i \;\mathrm{log}(p_i)

where we sum over all possible messages, and p_i is the probability of the ith message.

So, for example, suppose you keep getting the same message. Then every message has probability 0 except for one message, which has probability 1. In each term of the sum, either p_i is zero or \mathrm{log}(p_i) is zero (using the convention that 0 \, \mathrm{log}\, 0 = 0), so the information is zero.

That seems vaguely plausible in the example where every day we get this message:

000000000000000000000000000000000000000

It may seem less plausible if every day we get this message:

011010100010100010100010000010100000100

It looks like the aliens are trying to tell us which numbers are prime! 1 is not prime, 2 is, 3 is, 4 is not, 5 is, and so on. Aren’t we getting some information?

Maybe so: this could be considered a defect of Shannon information. On the other hand, you might be willing to admit that if we keep getting the same message every day, the second time we get it we’re not getting any new information. Or, you might not be willing to admit this — it’s actually a subtle issue, and I don’t feel like arguing.

But at the very least, you’ll probably admit that the second time you get the same message, you get less new information than the first time. The third time you get even less, and so on. So it’s quite believable that in the long run, the average amount of new information per message approaches 0 in this case. For Shannon, information means new information.

On the other hand, suppose we are absolutely unable to predict each new bit we get from the aliens. Suppose our ability to predict the next bit is no better than our ability to predict whether a fair coin comes up heads or tails. Then Shannon’s formula says we are getting the same amount of new information with every bit: namely, \mathrm{log}(2).

If we take the logarithm using base 2 here, we get 1 — so we say we’re getting one bit of information. If we take it using base e, as physicists prefer, we get \ln(2) — and we say we’re getting \ln(2) nats of information. One bit equals \ln(2) nats.
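
In case the bookkeeping helps, here’s the fair-coin case worked out in a few lines of Python. This is entirely standard, nothing specific to the aliens.

    import math

    p = [0.5, 0.5]                              # a fair coin
    bits = -sum(q * math.log2(q) for q in p)    # base 2
    nats = -sum(q * math.log(q) for q in p)     # base e
    print(bits, nats, nats / math.log(2))       # 1.0  0.693...  1.0: one bit is ln(2) nats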

There’s a long road from these reflections to a full justification of the specific formula for Shannon information! To begin marching down that road you can read his original paper, A mathematical theory of communication.

Anyway: it soon became clear that Shannon’s formula was almost the same as the usual formula for “entropy”, which goes back to Josiah Willard Gibbs. Gibbs was actually the first engineer to get a Ph.D. in the United States, back in 1863… but he’s mainly famous as a mathematician, physicist and chemist.



Entropy is a measure of disorder. Suppose we have a box of stuff — solid, liquid, gas, whatever. There are many possible states this stuff can be in: for example, the atoms can be in different positions, and have different velocities. Suppose we only know the probability that the stuff is in any one of the allowed states. If the ith state is occupied with probability p_i, Gibbs said the entropy of our box of stuff is

- k \sum_i p_i \; \mathrm{log}(p_i)

Here k is a constant called Boltzmann’s constant.

There’s a wonderful story here, which I don’t have time to tell in detail. The way I wrote Shannon’s formula for information and Gibbs’ formula for entropy, you’d think only a moron would fail to instantly grasp that they’re basically the same. But historically, it took some work.

The appearance of Boltzmann’s constant hints at why. It shows up because people had ideas about entropy, and the closely related concept of temperature, long before they realized the full mathematical meaning of these concepts! So entropy traditionally came in units of “joules/kelvin”, and physicists had a visceral understanding of it. But dividing by Boltzmann’s constant, we can translate that notion of entropy into the modern, more abstract way of thinking of entropy as information!

Henceforth I’ll work in units where k = 1, as modern mathematical physicists do, and treat entropy and information as the same concept.

Closely related to information and entropy is a third concept, which I will call Kolmogorov complexity, though it was developed by many people — including Martin-Löf, Solomonoff, Kolmogorov, Levin, and Chaitin — and it has many names, including descriptive complexity, Kolmogorov-Chaitin complexity, algorithmic entropy, and program-size complexity. You may be intimidated by all these names, but you shouldn’t be: when lots of people keep rediscovering something and giving it new names, it’s usually because this thing is pathetically simple.

So, what is Kolmogorov complexity? It’s a way of measuring the information in a single message rather than a probability distribution of messages. And it’s just the length of the shortest computer program that prints out this message.

I suppose some clarification may be needed here:

1) I’m only talking about programs without any input, which either calculate for a while, print out a message, and halt… or calculate forever and never halt.

2) Of course the length of the shortest program that prints out the desired message depends on the programming language. But there are theorems saying it doesn’t depend “too much” on the language. So don’t worry about it: just pick your favorite language and stick with that.

If you think about it, Kolmogorov complexity is a really nice concept. The Kolmogorov complexity of a string of a million 0’s is a lot less than a million: you can write a short program that says “print a million 0’s”. But the Kolmogorov complexity of a highly unpredictable string of a million 0’s and 1’s is about a million: you basically need to include that string in your program and then say “print this string”. Those are the two extremes, but in general the complexity will be somewhere in between.
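
You can’t actually compute Kolmogorov complexity, since it’s uncomputable, but an ordinary compressor gives a crude upper bound in the same spirit: the compressed data together with a decompressor is one particular program that prints the string. Here’s a rough illustration using Python’s zlib. The exact numbers will vary, and this is only a stand-in, not the real thing.

    import os
    import zlib

    regular = b"0" * 1_000_000             # a million 0's
    unpredictable = os.urandom(1_000_000)  # a million bytes of noise

    print(len(zlib.compress(regular)))        # on the order of a thousand bytes
    print(len(zlib.compress(unpredictable)))  # about a million bytes: no shorter description found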

Various people — the bigshots listed above, and others too — soon realized that Kolmogorov complexity is deeply connected to Shannon information. They have similar properties, but they’re also directly related. It’s another great story, and I urge you to learn it. For that, I recommend:

• Ming Li and Paul Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, Berlin, 2008.

To understand the relation a bit better, Mike and I started thinking about probability measures on the set of programs. People had already thought about this — but we thought about it a bit more the way physicists do.

Physicists like to talk about something called a Gibbs ensemble. Suppose we have a set X and a function

F: X \to \mathbb{R}

Then the Gibbs ensemble is the probability distribution on X that maximizes entropy subject to the condition that F has some specified average, or “expected value”.

So, to find the Gibbs ensemble, we need to find a probability distribution p : X \to \mathbb{R} that maximizes

- \sum_{i \in X} p_i \; \mathrm{log}(p_i)

subject to the constraint that

\sum_{i \in X} p_i \; F(i) = \langle F \rangle

where \langle F \rangle is some number, the expected value of F. Finding the probability distribution that does the job is an exercise in Lagrange multipliers. I won’t do it. There’s a nice formula for the answer, but we won’t need it here.
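
If you want to see a Gibbs ensemble without doing the Lagrange multipliers by hand, here’s a numerical sketch: a tiny set X, a made-up observable F with a made-up target mean, and scipy’s constrained optimizer doing the maximizing. None of this comes from our paper; it’s just to make the definition concrete.

    import numpy as np
    from scipy.optimize import minimize

    F = np.array([0.0, 1.0, 2.0, 3.0])   # an observable on a 4-element set X
    target_mean = 1.0                    # the specified expected value <F>

    def neg_entropy(p):
        p = np.clip(p, 1e-12, None)      # avoid log(0)
        return np.sum(p * np.log(p))     # minus the entropy

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},             # probabilities sum to 1
        {"type": "eq", "fun": lambda p: np.dot(p, F) - target_mean},  # <F> is as specified
    ]
    result = minimize(neg_entropy, x0=np.full(4, 0.25), method="SLSQP",
                      bounds=[(0.0, 1.0)] * 4, constraints=constraints)

    p = result.x
    print(p)              # the maximum-entropy distribution with <F> = 1
    print(np.dot(p, F))   # check: approximately 1.0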

What you really need to know is something more important: why Gibbs invented the Gibbs ensemble! He invented it to solve some puzzles that sound completely impossible at first.

For example, suppose you have a box of stuff and you don’t know which state it’s in. Suppose you only know the expected value of its energy, say \langle E \rangle. What’s the probability that this stuff is in its ith state?

This may sound like an insoluble puzzle: how can we possibly know? But Gibbs proposed an answer! He said, basically: find the probability distribution p that maximizes entropy subject to the constraint that the mean value of energy is \langle E \rangle. Then the answer is p_i.

In other words: use the Gibbs ensemble.

Now let’s come back to Kolmogorov complexity.

Imagine randomly picking a program out of a hat. What’s the probability that you pick a certain program? Again, this sounds like an impossible puzzle. But you can answer it if you know the expected value of the length of this program! Then you just use the Gibbs ensemble.

What does this have to do with Kolmogorov complexity? Here I’ll be a bit fuzzy, because the details are in our paper, and I want you to read that.

Suppose we start out with the Gibbs ensemble I just mentioned. In other words, I have a program in a hat, and all you know is the expected value of its length.

But now suppose I tell you the message that this program prints out. Now you know more. How much more information do you have now? The Kolmogorov complexity of the message — that’s how much!

(Well, at least this is correct up to some error bounded by a constant.)

The main fuzzy thing in what I just said is “how much more information do you have?” You see, I’ve explained information, but I haven’t explained “information gain”. Information is a quantity you compute from one probability distribution. Information gain is a quantity you compute from two. More precisely,

\sum_i p_i \; \mathrm{log}(p_i/q_i)

is the information you gain when you thought the probability distribution was q, but then someone comes along and tells you it’s p.
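
For what it’s worth, here’s this formula in action on two made-up distributions: q is the prior you believed, p is what you’re later told is correct.

    import math

    q = [0.5, 0.5]      # your prior: a fair coin
    p = [0.9, 0.1]      # the distribution you learn is actually correct

    gain = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    print(gain)         # about 0.37 nats of information gained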

In fact, we argue that information gain is more fundamental than information. This is a Bayesian idea: q is your “prior”, the probability distribution you thought was true, and the information you gain upon hearing that the distribution is really p should be defined relative to this prior. When people think they’re talking about information without any prior, they are really using a prior that’s so bland that they don’t notice it’s there: a so-called “uninformative prior”.

But I digress. To help you remember the story so far, let me repeat myself. Up to a bounded error, the Kolmogorov complexity of a message is the information gained when you start out only knowing the expected length of a program, and then learn which message the program prints out.

But this is just the beginning of the story. We’ve seen how Kolmogorov complexity is related to Gibbs ensembles. Now that we have Gibbs ensembles running around, we can go ahead and do thermodynamics! We can talk about quantities analogous to temperature, pressure, and so on… and all the usual thermodynamic relations hold! We can even take the ideas about steam engines and apply them to programs!

But for that, please read our paper. Here’s the abstract, which says what we really do:

Algorithmic entropy can be seen as a special case of entropy as studied in statistical mechanics. This viewpoint allows us to apply many techniques developed for use in thermodynamics to the subject of algorithmic information theory. In particular, suppose we fix a universal prefix-free Turing machine and let X be the set of programs that halt for this machine. Then we can regard X as a set of ‘microstates’, and treat any function on X as an ‘observable’. For any collection of observables, we can study the Gibbs ensemble that maximizes entropy subject to constraints on expected values of these observables. We illustrate this by taking the log runtime, length, and output of a program as observables analogous to the energy E, volume V and number of molecules N in a container of gas. The conjugate variables of these observables allow us to define quantities which we call the ‘algorithmic temperature’ T, ‘algorithmic pressure’ P and ‘algorithmic potential’ μ, since they are analogous to the temperature, pressure and chemical potential. We derive an analogue of the fundamental thermodynamic relation dE = T dS – P dV + μ dN, and use it to study thermodynamic cycles analogous to those for heat engines. We also investigate the values of T, P and μ for which the partition function converges. At some points on the boundary of this domain of convergence, the partition function becomes uncomputable. Indeed, at these points the partition function itself has nontrivial algorithmic entropy.

