In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

I took a slightly different approach, by saying that a causal theory is the free category with products on certain objects and morphisms coming from a directed acyclic graph . In his thesis he said was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

Where does the data processing inequality come from?

It seems there are typos on page 10

and

It seems you get the information loss between and from the relative Shannon entropy

via setting

(eventually apart from a minus sign), so this condition that the functor should vanish on morphisms in FP seems somehow connected with a kind of convex structure on FinStat. Do you use this in the proof? (Sorry, no time to look at the proof).

Wow, you’re right! I’m not surprised I made this typo; I’m surprised that nobody attending my talk noticed it. This proves something I should have remembered: at a talk, almost nobody carefully reads equations on the slides, because they are too busy listening to the speaker. So, it’s good to avoid showing them equations. (Equations are a bit easier to absorb when sitting at your computer at home.)

Where does the data processing inequality come from?

In my talk I mentioned the data processing inequality because it’s one way to prove the fact I needed: when you apply a function to a random variable, its entropy goes down. I figured that since I was talking to an audience of computer scientists working on quantum information theory, a lot of them would know and love the data processing inequality. But it’s probably easier to show the inequality I needed directly.

All I needed was this. Suppose you have a list of probabilities that sum to 1. Suppose you partition them into bunches, add up the numbers in each bunch and get numbers : a shorter list of probabilities that sum to 1. Then

(♥)

or in other words

I imagine the easiest way to show this is to show it inductively, starting from this: if are two probabilities that sum to ≤ 1, and

then

I forget how to show this, but it should be easy, so I’ll leave it as an exercise for the reader!

The Scholarpedia page on mutual information gives a statement and easy proof of the data processing inequality—search it for the phrase ‘data processing’.

This inequality says that if we have two random variables, X and Y, whose mutual information is I(X;Y), and a third random variable Z that is a function of Y, we have I(X;Z) ≤ I(X;Y). In other words, there’s no way that Z can know more about X than Y did, because Z is obtained from Y.

If we apply this in the case where X = Y, we get I(X;Z) ≤ I(X;X). Using the definition of mutual information we can simplify both sides and get S(Z) ≤ S(X).

So, I’m saying that if Z is a function of X then

S(Z) ≤ S(X)

and this is another way of stating inequality (♥) above. However, you need to know about random variables and mutual information to understand this—so if you don’t, please ignore what I just said! It’s easier to just prove (♥) directly.

It seems you get the information loss between and from the relative Shannon entropy

via setting

No, we don’t do anything like this—I’d never even thought of this trick. Raising a probability to a power like that looks very scary! I’ve done similar things when introducing ‘temperature’ into probability theory, but then you need to divide by a fudge factor (the partition function) to make sure the probabilities still sum to one.

[…] John Baez’s This Week’s Finds and Azimuth blog also make math sexy. But there’s too much stuff to fit into this tiny list. I’ll just highlight his Network Theory series. Some of the topics it touches on include information theory and Bayesian networks. […]

You can use HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it. Cancel reply

You need the word 'latex' right after the first dollar sign, and it needs a space after it. Double dollar signs don't work, and other limitations apply, some described here. You can't preview comments here, but I'm happy to fix errors.

Where does the data processing inequality come from?

It seems there are typos on page 10

and

It seems you get the information loss between and from the relative Shannon entropy

via setting

(eventually apart from a minus sign), so this condition that the functor should vanish on morphisms in FP seems somehow connected with a kind of convex structure on FinStat. Do you use this in the proof? (Sorry, no time to look at the proof).

Nadja wrote:

Wow, you’re right! I’m not surprised I made this typo; I’m surprised that nobody attending my talk noticed it. This proves something I should have remembered: at a talk, almost nobody carefully reads equations on the slides, because they are too busy listening to the speaker. So, it’s good to avoid showing them equations. (Equations are a bit easier to absorb when sitting at your computer at home.)

In my talk I mentioned the data processing inequality because it’s one way to prove the fact I needed:

when you apply a function to a random variable, its entropy goes down. I figured that since I was talking to an audience of computer scientists working on quantum information theory, a lot of them would know and love the data processing inequality. But it’s probably easier to show the inequality I neededdirectly.All I needed was this. Suppose you have a list of probabilities that sum to 1. Suppose you partition them into bunches, add up the numbers in each bunch and get numbers : a shorter list of probabilities that sum to 1. Then

(♥)

or in other words

I imagine the easiest way to show this is to show it inductively, starting from this: if are two probabilities that sum to ≤ 1, and

then

I forget how to show this, but it should be easy, so I’ll leave it as an exercise for the reader!

The Scholarpedia page on mutual information gives a statement and easy proof of the data processing inequality—search it for the phrase ‘data processing’.

This inequality says that if we have two random variables, X and Y, whose mutual information is I(X;Y), and a third random variable Z that is a function of Y, we have I(X;Z) ≤ I(X;Y). In other words,

there’s no way that Z can know more about X than Y did, because Z is obtained from Y.If we apply this in the case where X = Y, we get I(X;Z) ≤ I(X;X). Using the definition of mutual information we can simplify both sides and get S(Z) ≤ S(X).

So, I’m saying that if Z is a function of X then

S(Z) ≤ S(X)

and this is another way of stating inequality (♥) above. However, you need to know about random variables and mutual information to understand this—so if you don’t, please ignore what I just said! It’s easier to just prove (♥) directly.

No, we don’t do anything like this—I’d never even thought of this trick. Raising a probability to a power like that looks very scary! I’ve done similar things when introducing ‘temperature’ into probability theory, but then you need to divide by a fudge factor (the partition function) to make sure the probabilities still sum to one.

[…] John Baez’s This Week’s Finds and Azimuth blog also make math sexy. But there’s too much stuff to fit into this tiny list. I’ll just highlight his Network Theory series. Some of the topics it touches on include information theory and Bayesian networks. […]