Network Theory III

In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy.

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

• Brendan Fong, Causal Theories: a Categorical Perspective on Bayesian Networks.

I took a slightly different approach, by saying that a causal theory $\mathcal{C}_G$ is the free category with products on certain objects and morphisms coming from a directed acyclic graph $G$. In his thesis he said $\mathcal{C}_G$ was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

3 Responses to Network Theory III

Where does the data processing inequality come from?

It seems there are typos on page 10

$S(X,p)-S(X,r) \rightarrow S(X,p)-S(Z,r)$

and

$S(X,p)-S(X,q)+S(X,q)-S(X,r) \to$ $S(X,p)-S(Y,q)+S(Y,q)-S(Z,r)$

It seems you get the information loss between $p$ and $\tilde{q}$ from the relative Shannon entropy

$-\sum_i p_i \ln{\frac{p_i}{q_i}}$

via setting

$q_i = \tilde{q}_i^{\frac{\tilde{q}_i}{p_i}}$

(possibly up to a minus sign), so this condition that the functor should vanish on morphisms in FP seems somehow connected with a kind of convex structure on FinStat. Do you use this in the proof? (Sorry, no time to look at the proof.)

• John Baez says:

It seems there are typos on page 10

$S(X,p)-S(X,r) \rightarrow S(X,p)-S(Z,r)$

and

$S(X,p)-S(X,q)+S(X,q)-S(X,r) \to$
$S(X,p)-S(Y,q)+S(Y,q)-S(Z,r)$

Wow, you’re right! I’m not surprised I made this typo; I’m surprised that nobody attending my talk noticed it. This proves something I should have remembered: at a talk, almost nobody carefully reads equations on the slides, because they are too busy listening to the speaker. So, it’s good to avoid showing them equations. (Equations are a bit easier to absorb when sitting at your computer at home.)

Where does the data processing inequality come from?

In my talk I mentioned the data processing inequality because it’s one way to prove the fact I needed: when you apply a function to a random variable, its entropy goes down. I figured that since I was talking to an audience of computer scientists working on quantum information theory, a lot of them would know and love the data processing inequality. But it’s probably easier to show the inequality I needed directly.

All I needed was this. Suppose you have a list of probabilities $p_i$ that sum to 1. Suppose you partition them into bunches, add up the numbers in each bunch and get numbers $q_i$: a shorter list of probabilities that sum to 1. Then

$S(q) \le S(p)$                  (♥)

or in other words

$\displaystyle{ - \sum_i q_i \ln q_i \le - \sum_i p_i \ln p_i }$

I imagine the easiest way to show this is by induction, starting from this: if $p_1, p_2$ are two probabilities whose sum is at most 1, and

$q = p_1 + p_2$

then

$-q \ln q \;\le\; -p_1 \ln p_1 \; - \; p_2 \ln p_2$

I forget how to show this, but it should be easy, so I’ll leave it as an exercise for the reader!
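If you'd rather check inequality (♥) numerically than prove it, here is a minimal sketch in Python (the function names `entropy` and `coarse_grain` are my own, not from the paper): it coarse-grains a probability distribution by adding up probabilities in bunches and confirms that the Shannon entropy does not increase.

```python
import math

def entropy(probs):
    """Shannon entropy S = -sum_i p_i ln p_i (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def coarse_grain(probs, partition):
    """Add up the probabilities in each bunch of indices, giving a shorter list."""
    return [sum(probs[i] for i in bunch) for bunch in partition]

p = [0.1, 0.2, 0.3, 0.4]
q = coarse_grain(p, [[0, 1], [2, 3]])   # q is approximately [0.3, 0.7]

# Inequality (♥): entropy goes down (or stays the same) under coarse-graining.
assert entropy(q) <= entropy(p)
```

Running this with other distributions and partitions gives the same verdict, as the inductive argument predicts.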

The Scholarpedia page on mutual information gives a statement and easy proof of the data processing inequality—search it for the phrase ‘data processing’.

This inequality says that if we have two random variables, $X$ and $Y$, whose mutual information is $I(X;Y)$, and a third random variable $Z$ that is a function of $Y$, we have $I(X;Z) \le I(X;Y)$. In other words, there's no way that $Z$ can know more about $X$ than $Y$ did, because $Z$ is obtained from $Y$.

If we apply this in the case where $X = Y$, we get $I(X;Z) \le I(X;X)$. Using the definition of mutual information we can simplify both sides and get $S(Z) \le S(X)$.

So, I'm saying that if $Z$ is a function of $X$ then

$S(Z) \le S(X)$

and this is another way of stating inequality (♥) above. However, you need to know about random variables and mutual information to understand this—so if you don't, please ignore what I just said! It's easier to just prove (♥) directly.
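Here is a small Python illustration of that special case (my own toy example, not from the talk): $X$ is uniform on $\{0,1,2,3\}$ and $Z = f(X)$ with $f(x) = x \bmod 2$. Since $Z$ is a deterministic function of $X$, we have $I(X;Z) = S(Z)$ and $I(X;X) = S(X)$, so the data processing inequality reduces to $S(Z) \le S(X)$.

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy S = -sum_i p_i ln p_i."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# X uniform on {0, 1, 2, 3}; Z = f(X) with f(x) = x mod 2
p_X = {x: 0.25 for x in range(4)}
f = lambda x: x % 2

# Distribution of Z: the pushforward of p_X along f
p_Z = Counter()
for x, p in p_X.items():
    p_Z[f(x)] += p

# Data processing inequality, in the deterministic case: S(Z) <= S(X)
S_X = entropy(p_X.values())   # ln 4
S_Z = entropy(p_Z.values())   # ln 2
assert S_Z <= S_X
```

Pushing the distribution of $X$ forward along $f$ is exactly the coarse-graining in inequality (♥): the fibers of $f$ are the bunches.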

It seems you get the information loss between $p$ and $\tilde{q}$ from the relative Shannon entropy

$\displaystyle{ -\sum_i p_i \ln{\frac{p_i}{q_i}} }$

via setting

$\displaystyle{ q_i = \tilde{q}_i^{\frac{\tilde{q}_i}{p_i}} }$

No, we don’t do anything like this—I’d never even thought of this trick. Raising a probability to a power like that looks very scary! I’ve done similar things when introducing ‘temperature’ into probability theory, but then you need to divide by a fudge factor (the partition function) to make sure the probabilities still sum to one.
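For concreteness, here is a sketch of the standard 'temperature' trick alluded to above (the function name `temper` is my own): raise each probability to a power and then divide by the partition function so the results still sum to one.

```python
import math

def temper(probs, beta):
    """Raise each probability to the power beta, then renormalize by
    the partition function Z so the results still sum to 1."""
    powered = [p ** beta for p in probs]
    Z = sum(powered)                 # the partition function (the 'fudge factor')
    return [w / Z for w in powered]

p = [0.1, 0.2, 0.7]
q = temper(p, 0.5)                   # 'hotter' distribution: closer to uniform
assert abs(sum(q) - 1) < 1e-12       # still a probability distribution
```

Without the division by $Z$, the powered weights generally do not sum to one, which is exactly why raising probabilities to a power on its own looks scary.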

2. […] John Baez’s This Week’s Finds and Azimuth blog also make math sexy. But there’s too much stuff to fit into this tiny list. I’ll just highlight his Network Theory series. Some of the topics it touches on include information theory and Bayesian networks. […]