Over at the n-Category Café some of us have been trying an experiment: writing a math paper in full public view, both on that blog and on its associated wiki: the nLab. One great thing about doing things this way is that people can easily chip in with helpful suggestions. It’s also more fun! Both of these tend to speed up the process.
Like Frankenstein’s monster, our paper’s main result was initially jolted into life by huge blasts of power: in this case, not lightning but category theory. It was awesome to behold, but too scary for this blog.
First Tom Leinster realized that the concept of entropy fell out — unexpectedly, but very naturally — from considerations involving ‘operads’, which are collections of abstract operations. He was looking at a particular operad where the operations are ‘convex linear combinations’, and he discovered that this operad has entropy lurking in its heart. Then Tobias Fritz figured out a nice way to state Tom’s result without mentioning operads. By now we’ve taught the monster table manners, found it shoes that fit, and it’s ready for polite society:
• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.
The idea goes like this. Say you’ve got a finite set $X$ with a probability measure on it, meaning a number $p_i$ for each point $i \in X$, obeying

$$p_i \ge 0, \qquad \sum_{i \in X} p_i = 1.$$
Then the Shannon entropy of $p$ is defined by

$$S(p) = -\sum_{i \in X} p_i \ln p_i.$$
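If you like to see definitions in action, here is a tiny sketch of mine (not part of the paper) that computes this entropy in Python, using the usual convention that $0 \ln 0 = 0$:

```python
# A minimal sketch (not from the paper): Shannon entropy of a finite
# probability distribution, with natural logarithms and the convention 0 ln 0 = 0.
import math

def shannon_entropy(probs):
    """S(p) = -sum_i p_i ln(p_i) for an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # ln 2 ~ 0.693: a fair coin flip
print(shannon_entropy([1.0, 0.0]))   # 0.0: a certain outcome carries no information
print(shannon_entropy([0.25] * 4))   # ln 4 ~ 1.386: two fair coin flips
```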
This funny-looking formula can be justified in many ways. Our new way involves focusing not on entropy itself, but on changes in entropy. This makes sense for lots of reasons. For example, in physics we don’t usually measure entropy directly. Instead, we measure changes in entropy, using the fact that a system at temperature $T$ absorbing a tiny amount of heat $\delta Q$ in a reversible way will experience an entropy change of $\delta Q / T$. But our real reason for focusing on changes in entropy is that it gives a really slick theorem.
Suppose we have two finite sets with probability measures, say $(X, p)$ and $(Y, q)$. Then we define a morphism $f : (X, p) \to (Y, q)$ to be a measure-preserving function: in other words, a function $f : X \to Y$ for which the probability of any point $j$ in $Y$ is the sum of the probabilities of the points $i$ in $X$ with $f(i) = j$:

$$q_j = \sum_{i \in X : \, f(i) = j} p_i.$$
A morphism of this sort is a deterministic process that carries one random situation to another. For example, if I have a random integer between -10 and 10, chosen according to some probability distribution, and I square it, I get a random integer between 0 and 100. A process of this sort can never increase entropy: given any morphism $f : (X, p) \to (Y, q)$, we have

$$S(q) \le S(p).$$
Since the second law of thermodynamics says that entropy always increases, this may seem counterintuitive or even paradoxical! But there’s no paradox here. It makes more intuitive sense if you think of entropy as information, and the function $f$ as some kind of data processing that doesn’t introduce any additional randomness. Such a process can only decrease the amount of information. For example, squaring the number -5 gives the same answer as squaring 5, so if I tell you “this number squared is 25”, I’m giving you less information than if I said “this number is -5”.
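Here is a quick numerical version of that squaring example (my own sketch, with an arbitrarily chosen distribution): push a probability measure on the integers from -10 to 10 forward along $i \mapsto i^2$ and watch the entropy drop.

```python
# Sketch of the squaring example: an arbitrary distribution p on {-10, ..., 10}
# is pushed forward along f(i) = i^2 to a distribution q on {0, 1, 4, ..., 100}.
# The pushforward merges i with -i, so it can only lose information.
import math
from collections import defaultdict

def shannon_entropy(p):
    return -sum(x * math.log(x) for x in p.values() if x > 0)

def pushforward(p, f):
    q = defaultdict(float)
    for i, p_i in p.items():
        q[f(i)] += p_i          # q_j is the sum of p_i over all i with f(i) = j
    return dict(q)

weights = {i: 1 + abs(i) for i in range(-10, 11)}    # an arbitrary choice of weights
total = sum(weights.values())
p = {i: w / total for i, w in weights.items()}
q = pushforward(p, lambda i: i * i)

print(shannon_entropy(p), shannon_entropy(q))        # S(p) >= S(q)
print(shannon_entropy(p) - shannon_entropy(q))       # the information loss of squaring
```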
For this reason, we call the difference $S(p) - S(q)$ the information loss of the morphism $f$. And here’s our characterization of Shannon entropy in terms of information loss:
First, let’s write a morphism $f : (X, p) \to (Y, q)$ as $f : p \to q$ for short. Suppose $F$ is a function that assigns to any such morphism a number $F(f) \in [0, \infty)$, which we think of as its information loss. And suppose that $F$ obeys three axioms:
1. Functoriality. Whenever we can compose morphisms $f : p \to q$ and $g : q \to r$, we demand that

$$F(g \circ f) = F(g) + F(f).$$
In other words: when we do a process consisting of two stages, the amount of information lost in the whole process is the sum of the amounts lost in each stage!
2. Convex linearity. Suppose we have two finite sets equipped with probability measures, say $(X, p)$ and $(X', p')$, and a real number $0 \le \lambda \le 1$. Then there is a probability measure $\lambda p \oplus (1 - \lambda) p'$ on the disjoint union of the two sets, obtained by weighting the two measures by $\lambda$ and $1 - \lambda$, respectively. Similarly, given morphisms $f : p \to q$ and $f' : p' \to q'$, there is an obvious morphism from $\lambda p \oplus (1 - \lambda) p'$ to $\lambda q \oplus (1 - \lambda) q'$. Let’s call this morphism $\lambda f \oplus (1 - \lambda) f'$. We demand that

$$F(\lambda f \oplus (1 - \lambda) f') = \lambda F(f) + (1 - \lambda) F(f').$$
In other words: if we flip a probability-λ coin to decide whether to do one process or another, the information lost is λ times the information lost by the first process plus (1 – λ) times the information lost by the second!
3. Continuity. The same function between finite sets can be thought of as a measure-preserving map in different ways, by changing the measures on these sets. In this situation the quantity $F(f)$ should depend continuously on the measures in question.
In other words: if we slightly change the probability measures a process is applied to, the information it loses changes only slightly.
Then we conclude that there exists a constant $c \ge 0$ such that for any morphism $f : (X, p) \to (Y, q)$, we have

$$F(f) = c \, (S(p) - S(q)).$$
In other words: the information loss is some multiple of the change in Shannon entropy!
What’s pleasing about this theorem is that the three axioms are pretty natural, and it’s hard to see the formula

$$-\sum_{i \in X} p_i \ln p_i$$
hiding in them… but it’s actually there.
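To make the axioms feel concrete, here is a small numerical check (again just a sketch of mine with arbitrarily chosen distributions, not the paper’s proof) that the entropy difference $S(p) - S(q)$ really does satisfy functoriality and convex linearity:

```python
# Sketch: check axioms 1 and 2 for F(f) = S(p) - S(q) on small examples.
import math
from collections import defaultdict

def S(p):                                   # Shannon entropy of a dict {point: probability}
    return -sum(x * math.log(x) for x in p.values() if x > 0)

def pushforward(p, f):
    q = defaultdict(float)
    for i, p_i in p.items():
        q[f(i)] += p_i
    return dict(q)

def loss(p, f):                             # information loss of f: (X, p) -> (Y, q)
    return S(p) - S(pushforward(p, f))

p  = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}
p2 = {0: 0.5, 1: 0.25, 2: 0.25}
f  = lambda i: i * i                        # squaring merges i with -i
f2 = lambda i: min(i, 1)                    # merges 1 with 2
g  = lambda j: j % 2                        # parity

# Axiom 1 (functoriality): the losses of f and g add up to the loss of g o f,
# because (S(p) - S(q)) + (S(q) - S(r)) telescopes to S(p) - S(r).
q = pushforward(p, f)
print(abs(loss(p, f) + loss(q, g) - loss(p, lambda i: g(f(i)))) < 1e-9)    # True

# Axiom 2 (convex linearity): weight p and p2 by lam and 1 - lam on a disjoint
# union (tags 'a' and 'b'), then apply f to one part and f2 to the other.
lam = 0.3
combined = {('a', i): lam * x for i, x in p.items()}
combined.update({('b', i): (1 - lam) * x for i, x in p2.items()})
h = lambda t: (t[0], f(t[1]) if t[0] == 'a' else f2(t[1]))

lhs = loss(combined, h)
rhs = lam * loss(p, f) + (1 - lam) * loss(p2, f2)
print(abs(lhs - rhs) < 1e-9)                                               # True
```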
(We also prove a version of this theorem for Tsallis entropy, in case you care. This obeys a mutant version of axiom 2, namely:

$$F(\lambda f \oplus (1 - \lambda) f') = \lambda^\alpha F(f) + (1 - \lambda)^\alpha F(f'),$$
where $\alpha$ is a parameter with $0 < \alpha < \infty$. Tsallis entropy is a close relative of Rényi entropy, which I discussed here earlier. Just as Rényi entropy is a kind of q-derivative of the free energy, the Tsallis entropy is a q-derivative of the partition function. I’m not sure either of them is really important, but when you’re trying to uniquely characterize Shannon entropy, it’s nice for it to have some competitors to fight against, and these are certainly the main two. Both of them depend on a parameter $\alpha$ and reduce to the Shannon entropy at a certain value of that parameter.)
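And here is the analogous check in the Tsallis case (again my own sketch, using the standard formula $S_\alpha(p) = \bigl(1 - \sum_i p_i^\alpha\bigr)/(\alpha - 1)$ for the Tsallis entropy): the difference $S_\alpha(p) - S_\alpha(q)$ obeys the mutant axiom, with the weights raised to the power $\alpha$.

```python
# Sketch: the mutant convex linearity axiom for Tsallis information loss,
# using S_alpha(p) = (1 - sum_i p_i^alpha) / (alpha - 1).
from collections import defaultdict

def tsallis(p, alpha):
    return (1 - sum(x ** alpha for x in p.values())) / (alpha - 1)

def pushforward(p, f):
    q = defaultdict(float)
    for i, p_i in p.items():
        q[f(i)] += p_i
    return dict(q)

def loss(p, f, alpha):
    return tsallis(p, alpha) - tsallis(pushforward(p, f), alpha)

alpha, lam = 2.0, 0.3
p  = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}
p2 = {0: 0.5, 1: 0.25, 2: 0.25}
f  = lambda i: i * i           # merges i with -i
f2 = lambda i: min(i, 1)       # merges 1 with 2

combined = {('a', i): lam * x for i, x in p.items()}
combined.update({('b', i): (1 - lam) * x for i, x in p2.items()})
h = lambda t: (t[0], f(t[1]) if t[0] == 'a' else f2(t[1]))

lhs = loss(combined, h, alpha)
rhs = lam ** alpha * loss(p, f, alpha) + (1 - lam) ** alpha * loss(p2, f2, alpha)
print(abs(lhs - rhs) < 1e-9)   # True: the weights get raised to the power alpha
```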