Algorithmic Thermodynamics (Part 2)

Here are some notes for a talk tomorrow at the Centre for Quantum Technologies. You can think of this as a kind of followup to my first post on this subject.

Introduction

The idea of entropy arose in thermodynamics, and its connection to probability theory is fundamental to statistical mechanics. Its connection to information was developed later by Shannon. We now think of entropy and information as two sides of the same coin.

But there’s another concept, called algorithmic information, which was developed in work on logic and computer science. This concept is related to Shannon’s earlier notion of information, but it also seems rather different.

For example, Shannon’s ideas let us compute the information of a probability distribution on bit strings. So if we detect a radio signal from an extraterrestrial life form, that sends us a string of a dozen 0’s and 1’s each day, and it seems the string is chosen randomly according to some probability distribution, we can calculate the information per message. But if I just show you a single bit string, like this:

011101100001

it makes no sense to ask what its information is. On the other hand, we can define its algorithmic information.

Nonetheless, my goal here is to show you that algorithmic information can really be seen as a special case of the old probabilistic concept of information. This lets us take all the familiar tools from statistical mechanics and thermodynamics, and apply them to algorithmic information theory. Mike Stay and I started to do this here:

• John Baez and Mike Stay, Algorithmic thermodynamics.

Unfortunately, I’ll only have time to sketch a bit of what we did!

Algorithmic information – first definition

Here’s a definition of algorithmic information. Later we’ll see a better one that’s almost equivalent.

Fix a programming language where a program is a finite bit string and its output, if it halts, is again a finite bit string. Then the algorithmic information of a finite bit string is the length of the shortest program that has that bit string as output.

So, for example, the algorithmic information of a string of a trillion 0’s is low, because you can write a short program that prints this out. On the other hand, the algorithmic information of a long “random” string of bits will be high, because the shortest program to print it out will be a program that says “print out this string: 01011100101100…” — so the program is approximately as long as the string itself: slightly longer, but only by a fixed amount.
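
To make this concrete, here is a toy illustration in Python, using the length of Python source code as a rough stand-in for program length (only a sketch of the idea, not a real measure of algorithmic information):

    import random

    # A short program whose output is a trillion 0's:
    short_program = "print('0' * 10**12)"

    # A program that must spell out a "random" 1000-bit string literally:
    random_bits = ''.join(random.choice('01') for _ in range(1000))
    long_program = "print(" + repr(random_bits) + ")"

    print(len(short_program))  # about 20 characters, no matter how many 0's we want
    print(len(long_program))   # roughly 1000 characters: the string itself,
                               # plus a small fixed overhead for print(...)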

Of course you may ask what “random” means here. I haven’t defined it! But the point is, people have used these ideas to give a definition of what it means for a string of bits to be random. There are different ways to do it. For example: a bit string is ε-random if the shortest program that prints it out is at least ε times as long as the string itself.

You may also wonder: “Doesn’t the definition of algorithmic information depend on the details of our programming language?” And the answer is, “Yes, but not very much.” More precisely, for any two universal programming languages, there’s a constant K such that for any finite bit string, the algorithmic information of that string will differ by at most K depending on which language we use.

Ordinary information

Let’s see how algorithmic information compares to the usual concept of information introduced by Shannon. The usual concept works like this:

If an event of probability p occurs, we define the information gained by learning this event has occurred to be - \log p. We take the logarithm because probabilities of independent events multiply, and we want information to add. The minus sign makes the information positive. We can use any base we want for the logarithm: physicists like e, while computer scientists favor 2.

If there’s a set X of possible outcomes, where the outcome x \in X occurs with probability p_x, then the average or expected amount of information we gain by learning the outcome is

S(p) = - \sum_{x \in X} p_x \log p_x

We call this the entropy of the probability distribution p.
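
Here is a minimal sketch of this formula in Python, using base 2 so the answer comes out in bits; the distribution is a made-up example:

    from math import log2

    def entropy(p):
        """Shannon entropy: -sum_x p_x log2(p_x), in bits (terms with p_x = 0 contribute 0)."""
        return -sum(px * log2(px) for px in p if px > 0)

    print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits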

Now, ordinary information doesn’t look very much like algorithmic information! But there’s a second definition of algorithmic information, almost equivalent to the first one, that makes the relation clearer.

Algorithmic information – second definition

First, a minor shift in viewpoint. Before we defined algorithmic information for a finite bit string. Now let’s define it for a natural number. This change in viewpoint is no big deal, since we can set up a one-to-one correspondence between natural numbers and finite bit strings that’s easy to compute.

We’ll still think of our programs as finite bit strings. Not every such string gives a program that halts. We also may demand that bit strings have special features to count as programs. For example, we may demand that programs end in the string 11111, just like a Fortran program must end with the word END.

Let X be the set of programs that halt. Without loss of generality, we can assume that X is prefix-free. This means that if x \in X, no other bit string starting with the string x is also in X. For example, if the string 11111 marks the end of a program, and is only allowed to show up at the end of the program, X will be prefix-free.

Let V(x) be the length of the program x \in X. It’s easy to see that if X is prefix-free,

\sum_{x \in X} 2^{-V(x)} \le 1

Making this sum finite is the reason we want the prefix-free condition — you’ll see exactly why in a minute.
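
Here is why, in miniature: think of each x \in X as the set of infinite bit sequences beginning with x, a set of ‘measure’ 2^{-V(x)}; prefix-freeness makes these sets disjoint, so the measures sum to at most 1. A quick numerical check on a toy prefix-free set, made up purely for illustration:

    def is_prefix_free(strings):
        """True if no string in the set is a proper prefix of another."""
        return not any(x != y and y.startswith(x) for x in strings for y in strings)

    X = ['0', '10', '110', '1110']            # a toy prefix-free set of "programs"
    print(is_prefix_free(X))                  # True
    print(sum(2**-len(x) for x in X))         # 0.9375, which is <= 1 as promised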

Let N(x) be the output of the program x \in X. Remember, we now think of the output as a natural number. So, we have functions

V, N : X \to \mathbb{N}

We define the algorithmic information of a number n \in \mathbb{N} to be

S(n) = - \log \sum_{x \in X, N(x) = n} 2^{-V(x)}

So: we do a sum over all programs x having the number n as output. We sum up 2^{-V(x)}, where V(x) is the length of the program x. The sum converges because

\sum_{x \in X} 2^{-V(x)} \le 1

That’s where the prefix-free condition comes into play.
Finally, we take minus the logarithm of this sum.

This seems a bit complicated! But suppose there’s just one program x having the number n as output. Then the sum consists of one term, the logarithm cancels out the exponential, and the minus signs cancel too, so we get

S(n) = V(x)

This matches our first definition: the algorithmic information of a number is the length of the shortest program having that number as output.

Of course in this case the shortest program is the only program having that output. If we have more than one program with n as output, the second definition of algorithmic entropy gives a smaller answer than the first definition:

S_{\mathrm{second}}(n) \le S_{\mathrm{first}}(n)

However, there’s a famous theorem, proved by Leonid Levin in 1974, that says the new definition is not very different from the old one. The difference is bounded by a constant. More precisely: for any universal prefix-free programming language, there’s a constant K such that for every n \in \mathbb{N},

S_{\mathrm{second}}(n) \ge S_{\mathrm{first}}(n) - K

Since the algorithmic information depends on our programming language, it was only well-defined within an additive constant in the first place. So, switching from the first definition to the second one doesn’t significantly change the concept. But, it makes the relation to ordinary information a lot easier to see!

From now on we’ll use the second definition.
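
To see the second definition in action, here is a toy computation in Python. The table of halting programs and their outputs is entirely made up, and in reality X is infinite and undecidable, so the sum can only be approximated; but the arithmetic is the same:

    from math import log2

    # Made-up toy table: halting program (bit string) -> output (natural number).
    programs = {'0': 5, '10': 7, '110': 5, '1110': 2}

    def algorithmic_information(n):
        total = sum(2**-len(x) for x, out in programs.items() if out == n)
        return -log2(total)

    print(algorithmic_information(7))  # 2.0: only the program '10' outputs 7
    print(algorithmic_information(5))  # -log2(2**-1 + 2**-3) = 0.678...,
                                       # a bit less than the shortest program's length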

Algorithmic information versus ordinary information

To relate algorithmic information to ordinary information, we need to get logarithms and probabilities into the game. We see a logarithm here:

S(n) = - \log \sum_{x \in X, N(x) = n} 2^{-V(x)}

but it’s in a funny place, outside the sum — and where are the probabilities?

To solve both these puzzles, let’s define a probability distribution p on the set of natural numbers by

p_n = \frac{1}{Z} \sum_{x \in X, N(x) = n} 2^{-V(x)}

Here Z is a normalizing constant, to make the probabilities sum to 1. Taking

Z = \sum_{x \in X} 2^{-V(x)}

ensures that

\sum_{n \in \mathbb{N}} p_n = 1

Now, notice that the sum in the definition of S(n) is exactly Z p_n, so

S(n) = - \log p_n \; - \; \log Z

So, up to an additive constant, the algorithmic entropy of the number n is -\log p_n. But -\log p_n is just the information gained upon learning the number n, if we started out knowing only that n was chosen randomly according to the probability distribution p.

What’s the meaning of the probability distribution p? Simple: p_n is the probability that the number n is the output of a randomly chosen program… where ‘randomly chosen’ means chosen according to a certain specific rule. The rule is that increasing the length of a program by 1 makes it half as probable. In other words, the probability q_x of the program x is

q_x = \frac{1}{Z} 2^{-V(x)}

where Z is the normalization factor chosen to make these probabilities sum to 1.
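
One way to realize this distribution: build a program by flipping a fair coin for each successive bit, stopping as soon as the bits so far form a program. Since X is prefix-free, this hits each program x with probability 2^{-V(x)}, and conditioning on hitting any program at all divides by Z. Here is a sketch with the same toy prefix-free set as before:

    import random

    X = ['0', '10', '110', '1110']

    def random_program(max_bits=30):
        """Flip fair coins until the bits so far form a program (or give up)."""
        bits = ''
        while len(bits) < max_bits:
            bits += random.choice('01')
            if bits in X:
                return bits
        return None

    hits = [s for s in (random_program() for _ in range(100000)) if s is not None]
    # Each program x is hit with probability 2**-len(x); conditioning on a hit
    # divides by Z, giving q_x = 2**-len(x) / Z.
    print(hits.count('10') / len(hits))   # close to 0.25 / 0.9375 = 0.2666...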

So, the relation between algorithmic information and ordinary information is this:

The algorithmic information of a number is the information gained by learning that number, if beforehand we only knew that it was the output of a randomly chosen program.

Algorithmic thermodynamics

Why should the probability of the program x \in X be defined as

q_x = \frac{1}{Z} 2^{-V(x)} ?

There is no reason we need to use this probability distribution. However, there’s something good about it. It’s an example of a Gibbs state, meaning a probability distribution that maximizes entropy subject to a constraint on the expected value of some observable. Nature likes to maximize entropy, so Gibbs states are fundamental to statistical mechanics. So is the quantity Z: it’s called the partition function.

The idea works like this: suppose we want to maximize the entropy of a probability distribution p on the set of programs, subject to a constraint on the expected value of the length of the program. Then the answer is

p_x = \frac{1}{Z} e^{-\gamma V(x)}

where \gamma is some number, and the partition function Z ensures that the probabilities sum to 1:

Z = \sum_{x \in X} e^{- \gamma V(x)}

So, p equals our previous probability distribution q when

\gamma = \ln 2

However, it is also interesting to consider other values of \gamma, and in fact Tadaki has already done so:

• K. Tadaki, A generalization of Chaitin’s halting probability and halting self-similar sets, Hokkaido Math. J. 31 (2002), 219–253.

The partition function converges for \gamma \ge \ln 2 but diverges otherwise.
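
Here is a small numerical sketch of this maximization on a made-up finite set of program lengths (a finite toy, so the sum converges for every \gamma > 0): we pick a target expected length and solve for the \gamma whose Gibbs state achieves it. With the target set to 26/15, which is the expected length of the coin-flipping distribution q on our toy set, the answer comes out to be \ln 2:

    from math import exp, log

    lengths = [1, 2, 3, 4]          # lengths of the toy programs '0', '10', '110', '1110'

    def expected_length(gamma):
        weights = [exp(-gamma * v) for v in lengths]
        Z = sum(weights)
        return sum(w / Z * v for w, v in zip(weights, lengths))

    # Bisection: expected length decreases as gamma increases.
    target, lo, hi = 26/15, 1e-6, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_length(mid) > target:
            lo = mid
        else:
            hi = mid
    print(mid, log(2))   # the fitted gamma is essentially ln 2 = 0.693...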

More generally, we can look for the probability distribution that maximizes entropy subject to constraints on the expected value of several observables. For example, let:

E(x) = the logarithm of the runtime of the program x

V(x) = the length of the program x

N(x) = the output of the program x

Then the probability distribution that maximizes entropy subject to constraints on the expected values of these three quantities is

p_x = \frac{1}{Z} e^{-\beta E(x) - \gamma V(x) - \delta N(x)}

where now the partition function is

Z = \sum_{x \in X} e^{-\beta E(x) - \gamma V(x) - \delta N(x)}

We’ve chosen our notation to remind experts on statistical mechanics that they’ve seen formulas like this before. The exact same formulas describe a piston full of gas in thermal equilibrium! From a formal, purely mathematical viewpoint:

E is analogous to the internal energy of the gas, and \beta is analogous to 1/T where T is the temperature in units where Boltzmann’s constant is 1.

V is analogous to the volume of the gas, and \gamma is analogous to P/T where P is the pressure.

N is analogous to the number of molecules in the gas, and \delta is analogous to -\mu/T where \mu is the chemical potential.

The analogy here is quite arbitrary. I’m not saying that the length of a program is profoundly similar to the volume of a cylinder of gas; we could have chosen the letter E to stand for the length of a program and everything would still work.

But the analogy works. In other words: now we can take a lot of familiar facts about thermodynamics and instantly translate them into analogous facts about algorithmic information theory! For example, define the entropy of the probability distribution p in the usual way:

S(p) = - \sum_{x \in X} p_x \ln p_x

We can think of this as a function of either T, P, and \mu or the expected values of E, V and N. In thermodynamics we learn lots of equations like this:

\left. \frac{\partial S}{\partial E} \right|_{V,N} = \frac{1}{T}

where by standard abuse of notation we’re using E, V, N to stand for their expected values. And, all these equations remain true in algorithmic thermodynamics!
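
Here is a numerical sanity check on a made-up toy table of values of E, V and N (nothing here comes from a real programming language). It verifies two identities that hold for any Gibbs state of this form: S = \beta \langle E \rangle + \gamma \langle V \rangle + \delta \langle N \rangle + \ln Z, with entropy measured in nats, and \partial \ln Z / \partial \beta = - \langle E \rangle at fixed \gamma and \delta:

    from math import exp, log

    # Made-up toy data: (E, V, N) = (log runtime, length, output) for a few "programs".
    programs = [(0.0, 1, 5), (1.0, 2, 7), (1.6, 3, 5), (2.1, 4, 2)]

    def stats(beta, gamma, delta):
        weights = [exp(-beta*E - gamma*V - delta*N) for E, V, N in programs]
        Z = sum(weights)
        p = [w / Z for w in weights]
        S = -sum(px * log(px) for px in p)
        means = [sum(px * prog[i] for px, prog in zip(p, programs)) for i in range(3)]
        return S, Z, means

    beta, gamma, delta = 0.7, 0.69, 0.3
    S, Z, (E, V, N) = stats(beta, gamma, delta)
    print(S, beta*E + gamma*V + delta*N + log(Z))   # these two numbers agree

    h = 1e-6                                        # finite-difference check of d(ln Z)/d(beta) = -<E>
    dlogZ = (log(stats(beta + h, gamma, delta)[1]) - log(Z)) / h
    print(dlogZ, -E)                                # these agree too, up to the finite-difference error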

The interesting part is figuring out what all these equations really mean in the context of algorithmic thermodynamics. For a start on that, try our paper.

For example, we call T algorithmic temperature. If you allow programs to run longer, more of them will halt and give an answer. Thanks to the equation

\left. \frac{\partial S}{\partial E} \right|_{V,N} = \frac{1}{T}

the algorithmic temperature is roughly the number of times you have to double the runtime in order to double the number of programs that satisfy the constraints on length and output.

We also consider the algorithmic analogues of thermodynamic cycles, familiar from old work on steam engines. Charles Babbage described a computer powered by a steam engine; we describe a heat engine powered by programs! The significance of this line of thinking remains fairly mysterious. However, I’m hoping that it will help point the way toward a further synthesis of algorithmic information theory and thermodynamics.

41 Responses to Algorithmic Thermodynamics (Part 2)

  1. Brandon says:

    Very interesting. Thanks for sharing!

  2. John F says:

    One of my favorite math tricks involves performing the transformation of variables using log of sum, maybe second favorite only to
    x'' = 1/2 d/dx (x')^2

    I wonder if using a Jensen-Shannon information, roughly how much of the Shannon information is not exactly extensive (even more roughly the difference between log of sum and sum of log), would improve your Levin inequality results.

    • John Baez says:

      Can you explain a bit more, John? I don’t know what Jensen-Shannon information is. I could look it up, and it will probably turn out to be one of the many variants of Shannon information that I’ve seen and failed to pay proper attention to… but I’m getting ready for my talk so I’ll just be lazy and ask you!

      By the way, it would be interesting to watch some mathematicians chew on that equation of yours and try to figure out what it means. I think I know what it means, but only because I studied physics. Your garden-variety mathematician might say

      \frac{1}{2} \frac{d}{dx} (x')^2=  x' \frac{d}{dx} x' = x' \frac{d}{dx} \frac{dx}{dt} =  x' \frac{d^2 x}{dx \,dt}

      and then interchange the mixed partial derivatives to get

      \frac{d^2 x}{dx \,dt} = \frac{d}{d t} \frac{dx}{dx} = \frac{d}{d t} 1 = 0

      thus concluding

      \frac{1}{2} \frac{d}{dx} (x')^2 = 0

      I challenge mathematicians worldwide to figure out what you really meant when you said

      \frac{1}{2} \frac{d}{dx} (x')^2 = x''

      and why my calculation above is wrong given this intended interpretation.

      • John F says:

        The Jensen-Shannon divergence JSD
        http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
        is probably better defined in Eqs 5-6 in “Properties of Classical and Quantum Jensen-Shannon Divergence” eprint
        http://eprints.pascal-network.org/archive/00005071/01/QJSD.pdf
        JSD is a divergence (a measure of difference between distributions), but if H in Eq 5 is an entropy formula like the Shannon entropy, then JSD is also an entropy (expected information), being simply a difference of entropies. You can use other formulas for H besides Shannon entropy, but the result of the calculation is still some kind of measure of how far some kind of information is from being purely extensive.

        For the derivative equation above, well, I gave it away when I said it was actually for changing variables rather than calculating.

    • John F says:

      This paper “Saving Private Randomness” explores cryptographic applications of a couple of points under discussion.
      http://www.cs.technion.ac.il/~harnik/PrivateRandomness.pdf
      Money quote
      “The secret input to our one-way function need not consist of k uniform independent bits: inputs from any distribution of entropy k suffice”

  3. Justin says:

    Using a logarithm to define algorithmic entropy, you get Gibbs states. It’s cool, but not mysterious, don’t you think? The logarithm imposes extensivity doesn’t it? (broadly speaking)

    • John Baez says:

      The main thing that’s mysterious to me is the fact that:

      1) Mike and I developed a detailed analogy between computation and thermodynamics, up to the point where we can describe the computational analogue of a Carnot cycle.

      2) There’s another relation between computation and thermodynamics, namely Landauer’s principle, which says roughly that a computer that erases one bit of information at temperature T must convert at least \ln 2 \, k T of energy into heat.

      (In my blog entry I set Boltzmann’s constant k equal to 1, but I don’t have the heart to do it here. Seeing k T triggers a Pavlovian response among people working on thermodynamics, and I want that now.)

      3) I don’t see the connection between 1) and 2). The analogy in 1) seems ‘purely formal’: it doesn’t require that we think about computation as a physical process subject to the laws of thermodynamics. The relation in 2), on the other hand, involves treating a computer as an actual physical device, and computation as a process subject to the laws of thermodynamics.

      So, there is something left to understand: the connection between an analogy and a relation.

      Normally I would keep this puzzle secret until I figured out the answer — every scientist wants a well-stocked private supply of puzzles and mysteries! But today I’m in the mood for giving this one away.

      There’s a fellow here at the CQT, Oscar Dahlsten, who is a real expert on Landauer’s principle. He recently gave a great talk on related ideas, which unfortunately I didn’t manage to blog about:

      Quantifying the relation between information and work

      Abstract: I will give the key points from two related papers: arXiv:1009.1630, The thermodynamic meaning of negative entropy, and arXiv:0908.0424, The work value of information. These papers deal with Landauer’s erasure principle, whereby information erasure costs work, as well as the inverse process wherein work is gained at the cost of losing information (Szilard’s engine). The first paper shows one can quantify these phenomena in a quite general way using the recently developed smooth entropy approach (the von Neumann entropy turns out to be inadequate). The second one shows how to interpret negative conditional entropy (which arises as a consequence of entanglement) as work gain from information erasure. Joint work with Aaberg, del Rio, Renner, Rieper and Vedral.

      So, I’m sort of hoping he and I can solve my mystery before anyone here does! But if someone reading this solves it first, please cite me and let me know what you did, so I don’t waste time reinventing the wheel.

      • Justin says:

        I found part 1 section C of the first paper pretty illuminating. I’ll have to spend some time on the appendices when I get a chance. I only read part of the second paper — smoothing of min and max entropy seems like a useful idea.

        By no means am I trying to trivialize your work on algorithmic thermodynamics. (In fact, last month I read your paper on it with some enthusiasm.) My commentary is this: when you set up an extensive/logarithmic entropy, then extremize it to some constraints on expectation values… it formally reduces to the same Lagrange multiplier problem that gives Gibbs states, every time. So, asking why the deep connection, when you’re formally going through the same process, by construction, seems like it’s begging the question or at least missing the point. I’ll concede it’s quite possible that I’ve missed something subtle.

        • John Baez says:

          Justin wrote:

          My commentary is this: when you set up an extensive/logarithmic entropy, then extremize it to some constraints on expectation values… it formally reduces to the same Lagrange multiplier problem that gives Gibbs states, every time. So, asking why the deep connection, when you’re formally going through the same process, by construction, seems like it’s begging the question or at least missing the point.

          Right: to ask this would be missing the point. And just to be clear: I didn’t ask this.

          In my blog entry I said something was mysterious:

          Charles Babbage described a computer powered by a steam engine; we describe a heat engine powered by programs! The significance of this line of thinking remains fairly mysterious.

          But the mystery is not that maximizing entropy subject to constraints on expectation values gives us Gibbs states — that’s just obvious. The mystery is the one I described in my comment above: the connection between that purely formal analogy and Landauer’s principle.

          Luckily, when I gave my talk at the CQT today, Mile Gu made some comments that gave me some ideas on how to understand this mystery.

      • Justin says:

        In 1), the process of learning the output of a randomly chosen program seems similar to moving from a mixed state to a pure state, i.e. 2) erasure. Ok, sounds interesting.

        I was thrown off by all the time spent here, and in your paper, on Gibbs states, Carnot cycles, and Maxwell relations, which all follow inevitably from your entropic form.

        • John Baez says:

          Okay, I see. My paper, and my talk, was mainly about stuff I understand. What’s mysterious and fascinating is the stuff I don’t. I didn’t want to tip my hand about that until now.

          (This is a general problem with science: the incentives push us to keep good questions secret until we’ve found answers. Often this leads to a situation where most of the papers on a subject fail to mention an important open question even though the authors all know about it.)

      • Giampiero Campa says:

        Well, i haven’t looked at this deeply enough, but isn’t 2) the result of the fact that you have to do some work on the system to erase the information ?

        Like for example, if you have an empty room with a glass of wine on a table, (you know that the wine is in the glass) and you want to erase the information, you have to spill the wine all over the room, which costs some work, one way or another, and therefore you increase the heat …

        I would bet that the same happens for a computer memory, or anytime you want to “unknow” something, like some reverse maxwell demon. I am not sure what would be the analogous algorithmic process.

        Anyway, i am sure i’m oversimplifying somehow.

    • John Baez says:

      Giampiero wrote:

      Well, I haven’t looked at this deeply enough, but isn’t 2) the result of the fact that you have to do some work on the system to erase the information?

      Item 2) is called Landauer’s principle. It says roughly that any physical system that erases one bit of information at temperature T must convert at least \ln 2 \, k T of energy into heat.

      Here’s how I think of it: if we could really erase one bit of information in some memory apparatus, without any side-effect, two states of the apparatus would get mapped to one. But time evolution in both classical and quantum mechanics is fundamentally reversible. So, the information about which state the apparatus was in cannot truly vanish! Instead, it’s merely converted into information we don’t care about. Information we don’t care about is called ‘entropy’ in this context, so the entropy of something must go up by one bit.

      Now suppose that everything is happening at close to thermal equilibrium, so that temperature is well-defined. Using the relation between entropy and heat, it follows that the heat energy of the apparatus or its surroundings must go up by \ln 2 \, k T, where T is the temperature.
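
      Just to put a number on it, here is a quick back-of-the-envelope in Python, taking Boltzmann’s constant at its standard value and room temperature T = 300 K:

          from math import log

          k = 1.380649e-23            # Boltzmann's constant, J/K
          T = 300.0                   # roughly room temperature, in kelvin
          print(log(2) * k * T)       # about 2.9e-21 joules per erased bit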

      Your guess that this principle is related to Maxwell’s daemon is exactly right! Landauer’s principle is often invoked as the ‘reason’ Maxwell’s daemon can’t violate the second law of thermodynamics.

      The daemon can look into the two-part box of gas, measure the position and velocity of molecules in the box, and decrease the entropy of the gas by cleverly opening and closing the little door between the two parts at just the right times to let molecules go from the right to the left, but not vice versa.

      However, for the daemon to forget the information it has measured, it must convert energy into heat, by Landauer’s principle! And if you do the math, this turns out to imply that overall, the entropy of the whole system — box plus daemon — increases.

      But is Landauer’s principle really true? I sketched one argument for it. Maxwell’s daemon gives another argument: if Landauer’s principle weren’t true, we could violate the second law of thermodynamics!

      Of course, if you’re trying to use Landauer’s principle to prove that Maxwell’s daemon can’t violate the second law of thermodynamics, this is a circular argument, so it’s not very satisfactory.

      But if you already believe in the second law, this argument may be good enough.

      People have been arguing about this for many decades. Not everyone believes in Landauer’s principle! I won’t try to dig any deeper here. I gave a rather advanced reference last time I mentioned this principle — a paper by Charles Bennett reviewing lots of arguments people have had about Landauer’s principle.

      But here’s something a lot easier, to start with:

      Landauer’s principle.

      • Justin says:

        I would like to add that the reason the daemon has to forget the information it measures is that the daemon itself is a finite physical system and thus has finite memory.

        I presume you are using the alternative spelling “daemon” here as a purely tongue-in-cheek CS reference so I have followed suit.

      • John Baez says:

        Justin wrote:

        I would like to add…

        Good point.

        I presume you are using the alternative spelling “daemon” here as a purely tongue-in-cheek CS reference so I have followed suit.

        That’s a great reason for me to use the spelling “daemon”, so I’ll keep right on! But that’s actually not why I used it.

        I tend to use “demon” to mean an evil being, perhaps even a Christian “devil” with a tail and horns, while saving “daemon” to mean the original Greek sort of thing, which was not necessarily evil. Daemons can be good, and they often serve as intermediaries between the divine and the human, a bit like angels do in Christian mythology. They get things done: that makes the word “daemon” especially appropriate for computer science, but also for what we’re talking about.

        So, I thought the usual spelling was “Maxwell’s daemon” — and when I visited the Wikipedia article I was surprised to see the spelling was “demon”, and amused to see the tail and horns in the illustration there.

  4. There’s a typo in the definition of ‘prefix-free’: an errant i that snuck in somehow.

  5. streamfortyseven says:

    Probably a dumb question, but suppose that “Without loss of generality, we can assume that X is prefix-free. This means that if x \in X, no other bit string starting with the string x is also in X.”

    Then it seems to me that you (a) can’t have recursion, (b) no subprograms are allowed, and (c) there’s no looping allowed, i.e. you execute an instruction and never return to the place where that instruction is physically located.

    • John Baez says:

      None of (a), (b), or (c) is true. Take your favorite programming language and add two new rules:

      1) all programs must end with the word END, and

      2) this word cannot appear anywhere else in the program.

      Now you’ve got a programming language that’s prefix-free: if you take a program, and make it longer by sticking on a bunch of symbols after the END, the result is not a program anymore, because it violates rule 2).

      This is why the prefix-free condition is completely innocuous.

      Of course I’m illustrating it now for programs written using letters of the alphabet. In my blog entry I used programs written in binary, and suggested that we could use 11111 to mean ‘END’.

      This makes only a slight difference. For programs written in binary it’s natural to define the probability of a program x of length V(x) to be proportional to 2^{-V(x)}; for programs written in an alphabet with n letters it’s better to use n^{-V(x)}. The prefix-free condition then guarantees that the sum over all programs

      \sum_{x} n^{-V(x)}

      converges. That’s why we want this condition. It’s really just a technical hack, as far as I can see.

      • Tim van Beek says:

        Maybe streamfortyseven thought about execution flow rather than a (compiled) program.

        If we choose any programming language common today (written with words from the Latin alphabet), this language will have both keywords and syntax rules (at least). This means that only a very tiny subset of all strings consisting of numbers and Latin letters, commonly called “alphanumeric strings”, are valid programs.

        The “trick” with the “end” keyword simply says that we ignore all of the complicated restrictions of a real programming language and simply impose only one, namely that there is a dedicated keyword for the end of the program.

        Thus the set of alphanumeric strings that we sum over is still much too big, but our restriction is strong enough to make our sums converge.

        This reminds me of an exercise in calculus: We know that the harmonic sum \sum_{n \in \mathbb{N}} \frac{1}{n} diverges. But it converges if one removes all natural numbers that contain the digit 9.

        For the convergence proof, it is enough to count the numbers that survive: among the k-digit numbers, only 8 \cdot 9^{k-1} avoid the digit 9, and each of them is at least 10^{k-1}, so the sum of their reciprocals is bounded by a geometric series.
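
        In Python, just to see the numbers (an illustration, not a proof):

            # At most 8 * 9**(k-1) k-digit numbers avoid the digit 9, each >= 10**(k-1),
            # so the depleted harmonic series is bounded by sum_k 8 * (9/10)**(k-1) = 80.
            print(sum(8 * 9**(k-1) / 10**(k-1) for k in range(1, 200)))   # close to 80

            # A partial sum of the depleted series itself, far below that bound:
            print(sum(1/n for n in range(1, 10**6) if '9' not in str(n)))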

        • John Baez says:

          Great point, Tim.

          By the way, you’ve been making a mistake lately when typing TeX on this blog. I’ve been correcting it, but if I keep correcting it without complaining, you’ll keep making it. You’re writing things like

          $\latex \sqrt{2}$

          when you should be writing

          $latex \sqrt{2}$

          The former produces

          $\latex \sqrt{2}$

          while the latter produces

          \sqrt{2}

          In other words, you need “latex” right after the first dollar sign, not “\latex”.

        • Tim van Beek says:

          Thanks, I keep making it because I’m a slow learner, I noticed my mistake after the posting :-)

          Is there no way that authors can be enabled to edit their posts?

        • John Baez says:

          Tim wrote:

          Is there no way that authors can be enabled to edit their posts?

          Not with this free WordPress-hosted blog. One can pay to get a WordPress blog that’s hosted wherever you like, and then you can use various plugins to improve the blog. But so far the minor expense and (more importantly) the unknown amount of extra time required have prevented me from doing it.

          If you look at Peter Woit’s blog you’ll see it’s a WordPress blog hosted on a site of his own choosing, which allows commenters to see a preview of their comments.

          For an argumentative blog like his, a system like that is better than a system that lets commenters go back in time and rewrite their comments. (There’s somebody we all know, who likes to do that to make other people’s comments look stupid.)

          But unfortunately, with a free WordPress-hosted blog I can neither let you preview your comments nor rewrite them. So you just have to learn to be perfect.

      • streamfortyseven says:

        @Tim: yes, that’s what I was thinking about, pretty much. I’m thinking of a program in the sense of byte-length instructions, the way I used to do things long ago in my assembly language days.

        • streamfortyseven says:

          assuming that there’s two bytes per step, one instruction and one data byte, each byte four bits long. If you have an FSA, though, (recalling from 33 years ago), you might have problems: 0010 0101 0011 1100 1110 0001 1111 might halt prematurely?

        • streamfortyseven says:

          after thinking about it some more, ignore the comment above… sorry.

  6. john e gray says:

    A minor technical point, logarithmic entropy is not necessarily extensive. See Hill’s (Dover) books on thermodynamics of small scale systems. Tsallis introduced Tsallis entropy in 1988 in J. Phys. A as an attempt to deal with the question of having an entropy that preserved extensive behavior for systems beyond those traditionally dealt with in physics, such as those found in ecology, economics, living systems, etc. There has been a lot of discussion in J. Phys. A over the last 22 years about the question of capturing the essential features of an entropy measure in a broader context, including a book of articles published by Addison-Wesley edited by Gell-Mann and Tsallis.

    The more interesting point about algorithmic thermodynamics is: how is computation related to learning in complex systems? Does algorithmic thermodynamics inform the choice or optimization of algorithms used to control a complex system such as an ecology or other feedback systems? This seems particularly relevant to the ultimate goal of Azimuth, where learning would seem to be a necessary component of creating and maintaining self-substantiating systems. Learning in terms of pattern recognition has been studied by EE’s, with Renyi entropy being a popular choice, but what has been discussed here seems to raise broader and more interesting questions by bringing in more interesting notions of systems.

    • Justin says:

      A minor technical point, logarithmic entropy is not necessarily extensive.

      Gibbs and Shannon entropy are extensive in any formulation I’ve seen. The other common logarithmic entropy is von Neumann’s quantum entropy, which has a weaker condition of sub-additivity, recovering extensivity when you restrict only to independent subsystems.
      Logarithmic just means using the standard logarithm, not any q-logarithms or other generalizations.

      Tsallis introduced Tsallis entropy in 1988 in J. Phys. A as an attempt to deal with the question of having an entropy that preserved extensive behavior…

      Tsallis entropy is non-extensive, as is Renyi entropy. Irony bonus points for mentioning Tsallis’ book, Nonextensive Entropy.

    • John Baez says:

      I think you guys are both right, using different definitions of ‘extensive’.

      Justin appears to be saying that given a physical system made of two parts, and the system is in a mixed state where the states of the two parts are uncorrelated, the total entropy is the sum of the entropies of the two parts. This is a mathematical fact about Gibbs-Shannon-von Neumann entropy.

      John seems to be saying that if a physical system is made of two parts, the total entropy may not be the sum of the entropies of the two parts if the states of the two parts are correlated. This is another mathematical fact about Gibbs-Shannon-von Neumann entropy.

      In many everyday situations, the states of two parts of a larger system are correlated, so their Gibbs-Shannon-von Neumann entropies don’t quite add to give the entropy of the whole system — but the correction is negligible.

      For example, if I have a box of air in thermal equilibrium, I believe the entropy of the left half plus the entropy of the right half is not exactly equal to the entropy of the whole box. But the correction is incredibly small, and people almost always neglect it — unless the box has just a few molecules of air in it.

      John wrote:

      The more interesting point about algorithmic thermodynamics is: how is computation related to learning in complex systems? Does algorithmic thermodynamics inform the choice or optimization of algorithms used to control a complex system such as an ecology or other feedback systems? This seems particularly relevant to the ultimate goal of Azimuth, where learning would seem to be a necessary component of creating and maintaining self-substantiating systems.

      I work on lots of things that seem unrelated at first — but often they turn out to be related, which proves that my interests are not as far-ranging as I like to think.

      When I decided to spend time at the Centre for Quantum Technologies, quit working on n-categories, and start thinking more about quantum mechanics, computation, information, but also environmental issues, I hoped that someday these new interests would start to interact in a synergetic way. It might be starting to happen, ever so slightly — but it will probably take a few more years to build up speed.

      Thanks, everyone here, for helping it happen faster! I really hope that we can all, together, learn a lot and figure out some brand new things.

      I want this to be a cooperative project, so I’m always on the lookout for interesting ‘guest posts’ for this blog. Miguel’s post is a nice example, and soon we’ll see another one, by Renato Irrugia. I hope everyone here keeps that option in mind. I think you can probably tell what kind of posts I like: long on information and explanation, short on opinion, politics or grumpiness.

      • Justin says:

        I think you guys are both right, using different definitions of ‘extensive’.

        Justin appears to be saying that given a physical system made of two parts, and the system is in a mixed state where the states of the two parts are uncorrelated, the total entropy is the sum of the entropies of the two parts. This is a mathematical fact about Gibbs-Shannon-von Neumann entropy.

        John seems to be saying that if a physical system is made of two parts, the total entropy may not be the sum of the entropies of the two parts if the states of the two parts are correlated. This is another mathematical fact about Gibbs-Shannon-von Neumann entropy.

        In many everyday situations, the states of two parts of a larger system are correlated, so their Gibbs-Shannon-von Neumann entropies don’t quite add to give the entropy of the whole system — but the correction is negligible.

        For example, if I have a box of air in thermal equilibrium, I believe the entropy of the left half plus the entropy of the right half is not exactly equal to the entropy of the whole box. But the correction is incredibly small, and people almost always neglect it — unless the box has just a few molecules of air in it.

        John, I assign a high probability that you are wrong here. If you look at what I’ve seen some authors call the Shannon-Khinchin axioms, you’ll find that independence of subsystems is an unnecessarily strong requirement for Gibbs-Shannon entropy. The substantially weaker condition that the composite system depend on conditional probabilities of its components will uniquely give you Gibbs-Shannon entropy. These entropic forms remain extensive even for correlated subsystems. This characterization is biconditional (iff). Of course, this is for non-operator-valued observables as it is well-known that von Neumann entropy does not share this property.

        I’ll be glad to share references.

        My response to the original post tackled two issues. One, the claim that logarithmic entropy is non-extensive. In the classical case, it is strictly extensive. In the quantum case, the weaker subadditivity follows, becoming extensivity when considering independent subsystems. Two, the claim that Tsallis entropy preserved extensivity. I assure you it does not. It is a deformation of the extensive Gibbs-Shannon entropy designed precisely to account for systems with non-extensive behavior.

        I really hope to clear this up.

        • John Baez says:

          Justin wrote:

          The substantially weaker condition that the composite system depend on conditional probabilities of its components will uniquely give you Gibbs-Shannon entropy.

          Thanks, that’s very interesting and good to know!

          [Note added later: but, as Justin later pointed out, it’s not true.]

          So, you’ve succeeded in enhancing the last-minute worries that made me stick “I believe” in this sentence:

          For example, if I have a box of air in thermal equilibrium, I believe the entropy of the left half plus the entropy of the right half is not exactly equal to the entropy of the whole box.

          At the last minute, I realized that I’d need to do a calculation (or look one up) for some particular models of the equilibrium state of a gas before being sure about this issue.

          In an ‘ideal’ gas, with noninteracting molecules, I can almost do the calculation in my head, and see that the entropy of the whole box of gas is equal to the sum of the entropies of the two halves!

          But in a real-world gas, where the molecules interact, I was hoping it’s false.

          Are you telling me it’s true for any sort of gas treated classically?

          If so, I was fooled by the examples I know better, which come from condensed matter physics. These examples involve quantum mechanics.

          I will now say some stuff you probably know — but I want everyone in the universe to know:

          If we take quantum mechanics into account, it’s easy to get very simple examples of two-part systems where the entropy of the whole is zero but the sum of the entropies of the parts is not.

          For example, suppose we take a system made of two spin-1/2 particles in the Bell state. Then the state of the whole system is pure so its entropy is zero. However, the state of each half is a mixed state, and has entropy equal to 1 bit. This is the famous phenomenon of ‘entanglement entropy’.

          For a more exciting example of the same phenomenon, consider a spin chain: a one-dimensional lattice of spin-1/2 particles interacting by (say) nearest-neighbor interactions. Then in its ground state, the entropy of the total system is zero. But the entropy of each half is typically nonzero, due to entanglement.

          In higher-dimensional condensed matter problems, the entanglement entropy is often roughly proportional to the area of the boundary dividing the two halves of the system:

          • J. Eisert, M. Cramer, M. B. Plenio, Area laws for the entanglement entropy – a review.

          I was guessing that a similar correction could happen even in the classical case. Of course it won’t be entanglement that does it. I was imagining something like this: suppose you flip one coin and then set down another coin that’s heads-down iff the first coin was heads-up. Then the entropy of each half of the system is 1 bit, but the entropy of the whole system is not 2 bits: it’s still just 1 bit.

          Since this effect requires a mixed state in classical mechanics, it can’t happen in the ground state (if there’s a unique ground state). But I was hoping it might happen in some thermal equilibrium states of interesting classical systems at nonzero temperature.

          Like, for example, a gas made of interacting particles.

        • Justin says:

          Sigh, I have to correct myself twice — once for being misleading and once for being flat-out wrong.

          Everything is vague to a degree you do not realize till you have tried to make it precise.
          -Bertrand Russell

          Physics is a constant reminder of how often I am wrong after making such things precise.

          1) An entropy satisfying some simple axioms and:
          S[I \; \mathrm{and} \; II] = S[I] + \sum_i p_i S[II|i]
          will be of the Gibbs-Shannon form. Forgive me if the LaTeX is a bit sloppy. Despite what I said previously, this is only extensive for uncorrelated subsystems, as you had thought all along! For instance, your example of the classical coins would be S[composite] = 1 bit + 0 bits since the entropy of the second coin conditioned on the knowledge of the state of the first coin is zero, i.e. it is completely determined. Extensivity is violated in exactly the manner you described. I started thinking how absurd it would be if correlations implied nothing interesting… although perhaps it is as absurd how successful the assumption of un-correlation has been, historically! (viz. ideal gases, metals)

          2) The Tsallis entropy is indeed non-extensive for independent subsystems, but (!) it can be extensive in some cases for correlated subsystems. I had forgotten this important property, so my criticism there was misleading.

          I hope I did not damage anyone’s intuition any more than my own ego with this series of posts.

        • John Baez says:

          Thanks, Justin! Don’t feel bad: intellectual honesty is vastly more important than infallibility (especially since nobody attains the latter state).

          After I wrote my sentence about a pair of correlated classical coins:

          I was imagining something like this: suppose you flip one coin and then set down another coin that’s heads-down iff the first coin was heads-up. Then the entropy of each half of the system is 1 bit, but the entropy of the whole system is not 2 bits: it’s still just 1 bit.

          and did something else for half an hour, I thought wait a minute! — here’s a perfect example of what I was talking about: if a classical system made of two parts is in a state where the two parts are correlated, the entropy of the whole will be less than the sum of the parts.

          So okay, it sounds like we agree that entropy of a box of classical gas in thermal equilibrium could be a bit less than the entropy of the left half, plus the entropy of the right half.

          Btw, I’ve been staying out of the discussion of Tsallis entropy since I don’t really understand it. Indeed, I’m just finally starting to get interested in alternative definitions of entropy. They always seemed pointless to me until Oscar Dahlsten explained that they arise naturally when we think about — for example — a ‘risk-averse’ person, who doesn’t care about the expected amount of work he can extract from a given system, but only the work he can extract in a worst-case scenario.

        • Justin says:

          Generalized entropies have some interesting stability properties relating “closeness” of the entropy functionals to “closeness” of the underlying distributions. There are connections to information geometry as well. It gets pretty muddy from there. You might say, one of my own mysteries.

          Thanks for fixing my html and latex tags above.

  7. Giampiero Campa says:

    I was also wondering whether there are some connections between the algorithmic complexity and the computational complexity (computation time as a function of the size of the input). Perhaps treating the input of the program as an initial state (that is, incorporating it into the program, assuming it doesn’t change) you can somehow connect the two. But i am not sure if it would make a lot of sense.

    More generally, a program is a finite state machine, which is a particular case of a dynamical system. So extending this theory to “open systems” you might be able to establish connections to general systems that process information (as opposed to single inputless programs). Many years ago I did actually read the first 3-4 chapters of a book by Haddad et al. that touches on similar topics:

    http://press.princeton.edu/titles/8122.html

    However i think it’s different because it relies on the scale of the system to be large enough (many subsystems). Again, i am not sure whether a connection between the approaches would make a lot of sense, or if it’d be even trivial.

    • John Baez says:

      Giampiero wrote:

      I was also wondering whether there are some connections between the algorithmic complexity and the computational complexity (computation time as a function of the size of the input).

      Not quite, but there’s a generalization of algorithmic complexity called the Levin complexity, where programs that take longer to run are judged to have more complexity.

      And this too connects nicely to statistical mechanics! To define it, you just copy everything I said in my talk, but take the Gibbs state

      p_x = \frac{1}{Z} e^{- \gamma V(x)}

      where V(x) is the length of the program x, and replace it with the more general Gibbs state

      p_x = \frac{1}{Z} e^{-\beta E(x) - \gamma V(x)}

      where E(x) is the logarithm of the program’s runtime. This ‘penalizes’ programs that take a long time to run.

      Mike and I discuss this in the conclusion of our paper.

      The great advantage of the Levin complexity over the original algorithmic complexity is that it’s computable.
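
      Here is a toy sketch of the idea in Python; the table of programs, runtimes and outputs is made up, and the point is just that the cost of an output is the minimum, over programs producing it, of length plus log runtime, so slow programs are penalized:

          from math import log2

          # Made-up toy table: program -> (runtime in steps, output).
          programs = {'0': (4, 5), '10': (1024, 5), '110': (8, 7)}

          def levin_cost(n):
              return min(len(x) + log2(t) for x, (t, out) in programs.items() if out == n)

          print(levin_cost(5))   # min(1 + 2, 2 + 10) = 3: the slow program loses
          print(levin_cost(7))   # 3 + 3 = 6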

      More generally, a program is a finite state machine, which is a particular case of a dynamical system. So extending this theory to “open systems” you might be able to establish connections to general systems that process information (as opposed to single inputless programs).

      These sound like great ideas to explore. Luckily there are some people at the CQT who seem interested in talking about them… and, of course, their generalizations to quantum computers.

  8. […] Algorithmic Thermodynamics (Part 2) […]

  9. john e gray says:

    John B, I did not mean to start a food fight with Justin after seeing his comments. What I was referring to was the history reviewed in

    [4] S. R. Addison … “Is extensivity a fundamental property of entropy?”, J. Phys. A. Gen. 34 (2001), 7733-7737.

    In particular, this was motivated by the editorial by John Maddox in Nature in 1993, “Is entropy extensive?”, in reference to the Bekenstein-Hawking entropy of black holes, which started a debate reexamining these issues because the entropy of a black hole is proportional to the area rather than the volume. Tsallis has a database of hundreds of papers related to non-extensive physics and entropy that he regularly updates. The history of entropy from its inception is discussed in Jaynes:

    [5] E. T. Jaynes, in Maximum Entropy and Bayesian Methods, edited by C.R. Smith, G. J. Erickson, and P.O. Neudorfer, (Kluwer Academic, 1992).

    where he notes that the original definition of entropy, due to Clausius, as the integral of dQ/T, is path dependent, so Jaynes noted it has nothing to do with extensivity per se. It depends on the particular physical system. For the Gibbs-Boltzmann entropy, this is still the case for micro-systems as well, though most physical systems appear to be extensive for the logarithmic measure. Hill, well before many others, in this paper

    [6] T. L. Hill, J. Chem. Phys. 36 (1962), 3182.

    reviewed different types of systems, for some of which the logarithmic measure is extensive and for some of which it isn’t. This is a point of discussion in nano-systems… Hill’s discussion of this is available online in Nano-systems Letters. (He is an example to us all, in his nineties and still active.) It is this that led Addison to suggest generalizing the notion of extensivity might be useful. I will quit rambling and not post anymore.

    One final note, not related, for you readers: SIAM has posted the following:

    The Mathematics and Climate Research Network is a nation-wide NSF funded initiative. Our goal is to define, develop and solve mathematical problems that arise in climate science. A number of postdoctoral positions will be available starting in Summer or Fall, 2011. The successful applicants will be known as Ed Lorenz Postdoctoral Fellows in the Mathematics of Climate and will have an affiliation with one of the network nodes. The topics of research will range from sea-ice processes to paleoclimate studies and data assimilation in climate models. The network has twelve funded nodes and a number of other collaborating institutions. For more details, see http://www.mathclimate.org.

    to the special function SIAG, which may be of interest to any children with new PhD’s. As always, it is a pleasure to learn from your posts.

    • John Baez says:

      John E. Gray wrote:

      John B, I did not mean to start a food fight with Justin after seeing his comments.

      Don’t worry: I wouldn’t characterize anything that happened here as a ‘food fight’, at worst a heated discussion of the dessert menu.

      In particular, this was motivated by the editorial by John Maddox in Nature in 1993 “Is entropy extensive?” in reference to the Bekenstein-Hawking entropy of black holes that started a debate reexamining these issues because the entropy of a black hole is proportional to the area rather than the volume.

      Interesting!

      As I mentioned, ordinary Shannon entropy fails to add for correlated systems, and the correction terms are often roughly proportional to the area of the boundary between systems. This has made people wonder if black hole entropy arises from this mechanism, and much ink has been spilt on this question.

      I’ve never read anything trying to relate black hole entropy to Tsallis entropy. But that’s probably because I’m a bit scared of Tsallis entropy. Now that I’m looking, I’m seeing a bit about this combination of ideas…

      Thanks for mentioning the announcement regarding the Mathematics and Climate Research Network! I’ll post a blog article about that.

    • Justin says:

      Well, if it was a food fight, I think I was the only one who ended up with egg on my face. :)

      I think we had read the same materials and started with different base cases in mind. I’ll remind myself and others in discussions of extensivity to explicitly specify whether (non-)independent subsystems are under consideration or if they are excluded. That changes the context considerably.

  10. […] once we realize that algorithmic entropy is a special case of ordinary entropy. […]
