Second, it doesn’t follow from the definition that relative entropy is nonnegative. The problem is that you can’t a priori control the signs of the logarithms. It’s a nontrivial fact that this is true.

I’m not sure what you’re asserting is true.

However, when $p$ and $q$ are probability distributions on a finite set, I believe that

$\displaystyle{ I(p\|q) = \sum_i p_i \ln\left(\frac{p_i}{q_i}\right) \ge 0 }$

though it may be infinite: we define $0 \ln(0/q_i) = 0$, but $p_i \ln(p_i/0) = +\infty$ when $p_i > 0$, to handle cases involving the logarithm of zero.
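Here's a quick numerical sanity check of the nonnegativity claim, using the conventions just stated (the function name is my own):

```python
import math

def rel_entropy(p, q):
    """Relative entropy I(p||q) = sum_i p_i ln(p_i/q_i), with the
    conventions 0 ln(0/q) = 0 and p ln(p/0) = +inf when p > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue              # 0 ln(0/q) = 0
        if qi == 0:
            return math.inf       # p ln(p/0) = +inf when p > 0
        total += pi * math.log(pi / qi)
    return total

print(rel_entropy([0.5, 0.5], [0.9, 0.1]))   # strictly positive
print(rel_entropy([0.5, 0.5], [0.5, 0.5]))   # 0.0, since p = q
print(rel_entropy([0.5, 0.5], [1.0, 0.0]))   # inf: q vanishes where p doesn't
```

Of course a finite number of examples proves nothing; the actual proof of nonnegativity typically goes through Jensen's inequality applied to the convex function $x \ln x$.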

I believe the same sort of inequality holds whenever $p$ and $q$ are probability measures and $p$ is absolutely continuous with respect to $q$. If you have a counterexample, I’d like to see it!

You might like the argument on page 16 here: it gives non-negativity for a generalization of relative information to ‘non-normalized probability distributions’ on finite sets.

[…] relative entropy does not reduce to ordinary entropy with a uniform prior, even up to a constant. It reduces to the negative of ordinary entropy up to a constant. (This is why I’m tempted to call “relative entropy” the negative of what you wrote.)

Yeah, the missing minus sign annoys me too. Lots of people call this quantity relative entropy, but in a later paper I wrote:

Most of our results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We shall use the first term. Given two probability distributions $p$ and $q$ on a finite set $X$, their **relative information**, or more precisely the **information of $p$ relative to $q$**, is

$\displaystyle{ I(p\|q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) }$

We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I$ decreases with time under various conditions. The reason is that the Shannon entropy

$\displaystyle{ S(p) = -\sum_{i \in X} p_i \ln p_i }$

contains a minus sign that is missing from the definition of relative information.
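The reduction mentioned earlier, relative information with a uniform prior being the negative of Shannon entropy up to a constant, is easy to check numerically. A small sketch (helper names are my own), verifying $I(p\|u) = \ln n - S(p)$ for the uniform distribution $u$ on $n$ points:

```python
import math

def rel_information(p, q):
    """I(p||q) = sum_i p_i ln(p_i/q_i), with the 0 ln 0 = 0 convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    """S(p) = -sum_i p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]
n = len(p)
u = [1.0 / n] * n            # uniform prior

lhs = rel_information(p, u)
rhs = math.log(n) - shannon_entropy(p)
print(abs(lhs - rhs) < 1e-12)   # True: I(p||u) = ln n - S(p)
```

The algebra behind it is one line: $\sum_i p_i \ln(n p_i) = \ln n + \sum_i p_i \ln p_i$, so the minus sign in $S$ is exactly what flips.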


I said ‘real line’, but that wasn’t quite right: relative entropy can be infinite, so we should really think of it as taking values in $[0,\infty]$, and a function from a topological space taking values in $[0,\infty]$ is lower semicontinuous iff it’s continuous when $[0,\infty]$ is given the upper topology.
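To see concretely why only *lower* semicontinuity can hold, here's a standard kind of example (a sketch, with my own helper function): take $p_n \to p$ where $p_n$ puts a little mass outside the support of $q$, so $I(p_n\|q) = \infty$ for every $n$ even though $I(p\|q) = 0$.

```python
import math

def rel_entropy(p, q):
    """I(p||q) with the conventions 0 ln 0 = 0 and p ln(p/0) = +inf."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

q = [1.0, 0.0]
# p_n = (1 - 1/n, 1/n) converges to p = (1, 0) as n grows, but:
for n in [10, 100, 1000]:
    print(rel_entropy([1 - 1/n, 1/n], q))   # inf every time
print(rel_entropy([1.0, 0.0], q))           # 0.0 at the limit point
```

The value at the limit point ($0$) is $\le$ the liminf along the sequence ($\infty$), which is exactly lower semicontinuity; continuity fails because the values never converge to $0$.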

A guy named Faddeev characterized entropy using only equations and a continuity requirement. In a previous paper written with a mutual friend, Tobias and I reformulated Faddeev’s characterization in category-theoretic terms:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.

In our new paper we started by trying to phrase Petz’s characterization of relative entropy in category-theoretic terms. Petz’s theorem turned out to be flawed, so Tobias had to fix it. But we still owe our inspiration to him, and we got a characterization involving only equations and a lower semicontinuity requirement. (For what it’s worth, lower semicontinuity can be thought of as continuity with respect to a funny topology on the real line.)

For a comparison of Faddeev’s characterization of entropy and the Shannon–Khinchine characterization, try this:

• Shigeru Furuichi, On uniqueness theorems for Tsallis entropy and Tsallis relative entropy.

This covers something more general called ‘Tsallis entropy’, which depends on a parameter $q$, but if you take the limit $q \to 1$ you get back to Shannon entropy.
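Assuming the usual natural-log normalization of Tsallis entropy, $S_q(p) = \frac{1}{q-1}\bigl(1 - \sum_i p_i^q\bigr)$, the limit is easy to watch numerically (a sketch; by L’Hôpital, $S_q \to -\sum_i p_i \ln p_i$ as $q \to 1$):

```python
import math

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

def shannon_entropy(p):
    """S(p) = -sum_i p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]
target = shannon_entropy(p)
# As q -> 1 the Tsallis entropy approaches the Shannon entropy:
for q in [1.5, 1.1, 1.01, 1.001]:
    print(abs(tsallis_entropy(p, q) - target))  # gap shrinks toward 0
```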

Our paper with Tom Leinster covered the more general Tsallis case, but now that we’re doing relative entropy we’re having enough trouble just handling the Shannon case!
