Relative Entropy (Part 1)

I’m trying to finish off a paper that Tobias Fritz and I have been working on, which gives a category-theoretic (and Bayesian!) characterization of relative entropy. It’s a kind of sequel to our paper with Tom Leinster, in which we characterized entropy.

That earlier paper was developed in conversations on the n-Category Café. It was a lot of fun; I sort of miss that style of working. Also, to get warmed up, I need to think through some things I’ve thought about before. So, I might as well write them down here.

The idea

There are many categories related to probability theory, and they’re related in many ways. Last summer—on the 24th of August 2012, according to my notes here—Jamie Vicary, Brendan Fong and I worked through a bunch of these relationships. I need to write them down now, even if they’re not all vitally important to my paper with Tobias. They’re sort of buzzing around my brain like flies.

(Tobias knows this stuff too, and this is how we think about probability theory, but we weren’t planning to stick it in our paper. Maybe we should.)

Let’s restrict attention to probability measures on finite sets, and related structures. We could study these questions more generally, and we should, but not today. What we’ll do is give a unified purely algebraic description of:

• finite sets

• measures on finite sets

• probability measures on finite sets

and various kinds of maps between these:

• functions

• bijections

• measure-preserving functions

• stochastic maps

Finitely generated free [0,∞)-modules

People often do linear algebra over a field, which is—roughly speaking—a number system where you can add, subtract, multiply and divide. But algebraists have long realized that a lot of linear algebra still works with a commutative ring, where you can’t necessarily divide. It gets more complicated, but also a lot more interesting.

But in fact, a lot still works with a commutative rig, where we can’t necessarily subtract either! Something I keep telling everyone is that linear algebra over rigs is a good idea for studying things like probability theory, thermodynamics, and the principle of least action.

Today we’ll start with the rig of nonnegative real numbers with their usual addition and multiplication; let’s call this [0,\infty) . The idea is that measure theory, and probability theory, are closely related to linear algebra over this rig.

Let C be the category with of finitely generated free [0,\infty) -modules as objects, and module homomorphisms as morphisms. I’ll call these morphisms maps.

Puzzle. Do we need to say ‘free’ here? Are there finitely generated modules over [0,\infty) that aren’t free?

Every finitely generated free [0,\infty)-module is isomorphic to [0,\infty)^S for some finite set S . In other words, it’s isomorphic to [0,\infty)^n for some n = 0, 1, 2, \dots . So, C is equivalent to the category where objects are natural numbers, a morphism from m to n is an m \times n matrix of numbers in [0,\infty) , and composition is done by matrix multiplication. I’ll also call this equivalent category C.

We can take tensor products of finitely generated free modules, and this makes C into a symmetric monoidal †-category. This means we can draw maps using string diagrams in the usual way. However, I’m feeling lazy so I’ll often write equations when I could be drawing diagrams.

One of the rules of the game is that all these equations will make sense in any symmetric monoidal †-category. So we could, if we wanted, generalize ideas from probability theory this way. If you want to do this, you’ll need to know that [0,\infty) is the unit for the tensor product in C. We’ll be seeing this guy [0,\infty) a lot. So if you want to generalize, replace C by any symmetric monoidal †-category, and replace [0,\infty) by the unit for the tensor product.

Finite sets

There’s a way to see the category of finite sets lurking in C, which we can borrow from this paper:

• Bob Coecke, Dusko Pavlovic and Jamie Vicary, A new description of orthogonal bases.

For any finite set S , we get a free finitely generated [0,\infty) -module, namely [0,\infty)^S . This comes with some structure:

• a multiplication m: [0,\infty)^S \otimes [0,\infty)^S \to [0,\infty)^S , coming from pointwise multiplication of [0,\infty) -valued functions on S

• the unit for this multiplication, an element of [0,\infty)^S, which we can write as a morphism i: [0,\infty) \to [0,\infty)^S

• a comultiplication, obtained by taking the diagonal map \Delta : S \to S \times S and promoting it to a linear map \Delta : [0,\infty)^S \to [0, \infty)^S \otimes [0,\infty)^S

• a counit for this comultiplication, obtained by taking the unique map to the terminal set ! : S \to 1 and promoting it to a linear map e: [0,\infty)^S \to [0,\infty)

These morphisms m, i, \Delta, e make

x = [0,\infty)^S

into a commutative Frobenius algebra in C . That’s a thing where the unit, counit, multiplication and comultiplication obey these laws:

(I drew these back when I was feeling less lazy.) This Frobenius algebra is also ‘special’, meaning it obeys this:

And it’s also a †-Frobenius algebra, meaning that the counit and comultiplication are obtained from the unit and multiplication by ‘flipping’ them using the †category structure. (If we think of a morphism in C as a matrix, its dagger is its transpose.)

Conversely, suppose we have any special commutative †-Frobenius algebra x . Then using the ideas in the paper by Coecke, Pavlovich and Vicary we can recover a basis for x , consisting of the vectors e_i \in x with

\Delta(e_i) = e_i \otimes e_i

This basis forms a set S such that

x \cong [0,\infty)^S

for some specified isomorphism in C. Furthermore, this is an isomorphism of special commutative †-Frobenius algebras!

In case you’re wondering, these vectors e_i correspond to the functions on S that are zero everywhere except at one point i \in S, where they equal 1.

In short, a special commutative †-Frobenius algebra in C is just a fancy way of talking about a finite set. This may seem silly, but it’s a way to start describing probability theory using linear algebra very much as we do with quantum theory. This analogy between quantum theory and probability theory is so interesting that it deserves a book.

Functions and bijections

Now suppose we have two special commutative †-Frobenius algebra in C, say x and y .

Suppose f : x \to y is a Frobenius algebra homomorphism: that is, a map preserving all the structure—the unit, counit, multiplication and comultiplication. Then it comes from an isomorphism of finite sets. This lets us find \mathrm{FinSet}_0 , the groupoid of finite sets and bijections, inside C.

Alternatively, suppose f : x \to y is just a coalgebra homomorphism: that is a map preserving just the counit and comultiplication. Then it comes from an arbitrary function between finite sets. This lets us find FinSet , the category of finite sets and functions, inside C .

But what if f preserves just the counit? This sounds like a dry, formal question. But it’s not: the answer is something useful, a ‘stochastic map’.

Stochastic maps

A stochastic map from a finite set S to a finite set T is a map sending each point of S to a probability measure on T .

We can think of this as a T \times S -shaped matrix of numbers in [0,\infty) , where a given column gives the probability that a given point in S goes to any point in T . The sum of the numbers in each column will be 1. And conversely, any T \times S -shaped matrix of numbers in [0,\infty) , where each column sums to 1, gives a stochastic map from S to T .

But now let’s describe this idea using the category C. We’ve seen a finite set is the same as a special commutative †-Frobenius algebra. So, say we have two of these, x and y . Our matrix of numbers in [0,\infty) is just a map

f: x \to y

So, we just need a way to state the condition that each column in the matrix sums to 1. And this condition simply says that f preserves the counit:

\epsilon_y \circ f = \epsilon_x

where \epsilon_x : x \to [0,\infty) is the counit for x , and similarly for \epsilon_y .

To understand this, note that if we use the canonical isomorphism

x \cong [0,\infty)^S

the counit \epsilon_x can be seen as the map

[0,\infty)^S \to [0,\infty)

that takes any S -tuple of numbers and sums them up. In other words, it’s integration with respect to counting measure. So, the equation

\epsilon_y \circ f = \epsilon_x

says that if we take any S -tuple of numbers, multiply it by the matrix f , and then sum up the entries of the resulting T -tuple, it’s the same as if we summed up the original S -tuple. But this says precisely that each column of the matrix f sums to 1.

So, we can use our formalism to describe \mathrm{FinStoch}, the category with finite sets as objects and stochastic maps as morphisms. We’ve seen this category is equivalent to the category with special commutative †-Frobenius algebras in C as objects and counit-preserving maps as morphisms.

Finite measure spaces

Now let’s use our formalism to describe finite measure spaces—by which, beware, I mean a finite sets equipped with measures! To do this, we’ll use a special commutative †-Frobenius algebra x in C together with any map

\mu: [0,\infty) \to x

Starting from these, we get a specified isomorphism

x \cong [0,\infty)^S

and \mu sends the number 1 to a vector in [0,\infty)^S : that is, a function on S taking values in [0,\infty) . Multiplying this function by counting measure, we get a measure on S .

Puzzle. How can we describe this measure without the annoying use of counting measure?

Conversely, any measure on a finite set gives a special commutative †-Frobenius algebra x in C equipped with a map from [0,\infty) .

So, we can say a finite measure space is a special commutative †-Frobenius algebra in C equipped with a map

\mu: [0,\infty) \to x

And given two of these,

\mu: [0,\infty) \to x , \qquad  \nu: [0,\infty) \to y

and a coalgebra morphism

f : x \to y

obeying this equation

f \circ \mu  = \nu

then we get a measure-preserving function between finite measure spaces! If you’re a category theorist, you’ll draw this equation as a commutative triangle:

Conversely, any measure-preserving function between finite measure spaces obeys this equation. So, we get an algebraic way of describing the category \mathrm{FinMeas} , with finite measure spaces as objects and measure-preserving maps as morphisms.

Finite probability measure spaces

I’m mainly interested in probability measures. So suppose x is a special commutative †-Frobenius algebra in C equipped with a map

\mu: [0,\infty) \to x

We’ve seen this gives a finite measure space. But this is a probability measure space if and only if

e \circ \mu = 1

where

e : x \to [0,\infty)

is the counit for x . The equation simply says the total integral of our measure \mu is 1.

So, we get a way of describing the category \mathrm{FinProb} , which has finite probability measure spaces as objects and measure-preserving maps as objects. Given finite probability measure spaces described this way:

\mu: [0,\infty) \to x , \qquad  \nu: [0,\infty) \to y

a measure-preserving function is a coalgebra morphism

f : x \to y

such that the obvious triangle commutes:

f \circ \mu  = \nu

Measure-preserving stochastic maps

Say we have two finite measure spaces. Then we can ask whether a stochastic map from one to the other is measure-preserving. And we can answer this question in the language of C .

Remember, a finite measure space is a special commutative †-Frobenius algebra x in C together with a map

\mu: [0,\infty) \to x

Say we have another one:

\nu: [0,\infty) \to y

A stochastic map is just a map

f: x \to y

that preserves the counit:

\epsilon_y \circ f = \epsilon_x

But it’s a measure-preserving stochastic map if also

f \circ \mu  = \nu

Next…

There’s a lot more to say; I haven’t gotten anywhere near what Tobias and I are doing! But it’s pleasant to have this basic stuff written down.

19 Responses to Relative Entropy (Part 1)

  1. Toby Bartels says:

    I usually use the rig [0,\infty] for measure theory (with 0 \cdot \infty = 0, so that it is a rig; see also nLab: lower real number#arithmetic for a constructive treatment that makes this happen automatically) rather than [0,\infty). This doesn’t really matter for probability measures, but normally one allows positive measures (even on finite measure spaces) to be infinite.

    • John Baez says:

      Good point! I don’t feel a pressing need for \infty here, but I think everything would work just as well with it… and when we get to relative entropy, we’ll see it really needs to take the value \infty whenever you have a prior that assigns zero probability to an event that turns out to happen, leaving you ‘infinitely surprised’. I hadn’t thought that that use of [0,\infty] could be extended to supplant the use of [0,\infty) here, since I (implicitly) thought of entropy and probability as being different ‘types of thing’. But maybe using [0,\infty] everywhere will make things nicer somehow.

  2. saibod says:

    It’s great to see this written down! Typo: $\dagger $-category is missing the “latex”.

    Concerning the first puzzle on finitely generated [0,\infty)-modules: most of them are not free, and these are among my favorite mathematical objects: polyhedral convex cones! The simplest non-free example is a 3-dimensional convex cone spanned by four rays such as (1,\pm 1,\pm 1). Including these in the category will take us into the realm of general probabilistic theories, where the cones are non-classical state spaces. I don’t see how to retain the †-structure, though…

    There are other [0,\infty)-modules which are not even embeddable into a vector space. The most notorious ones are semilattices with a bottom element, i.e. commutative idempotent monoids (M,\lor,0). Such a gadget becomes a [0,\infty)-module if one takes multiplication by positive scalars to act trivially.

    Concerning the second puzzle, what about using \mu:x\to[0,\infty) rather than \mu:[0,\infty)\to x? I’m probably missing something, it’s early morning ;)

    • John Baez says:

      Thanks! Are you trying to keep your identity secret? I’d like to let people know who you are.

      More later… thanks a lot for help on that first puzzle. The second puzzle I think I understand.

      • Tobias Fritz says:

        I’m not trying to keep my identity secret, I just need to get to grips with WordPress ;) So this comment is just an experiment for trying to see whether it puts my name down now…

    • John Baez says:

      I actually think it’s better to describe a measure on a finite set using a map \mu : [0,\infty) \to x rather than a map \mu : x \to [0,\infty). Of course using the †-category structure we can freely turn one into the other, but still…

      The reason is that measures are covariant; we want to be able to take a measure \mu : [0,\infty) \to x and push it forward along f : x \to y and get a new measure on y. And in this approach. it’s easy: just compose them and get f \circ \mu : [0,\infty) \to y. Note this works when f is actually a function between finite sets, but also more generally when f is a stochastic map or even more generally when it’s what I was calling a `map’, meaning any homomorphism of [0,\infty)-modules.

      Anyway, here’s a slick way to see how a map \mu : [0,\infty) \to x gives a measure on S when x = [0,\infty)^S, without mentioning counting measure. First you note that a map f : x \to [0,\infty) is the same as a [0,\infty)-valued function on S. Then you note that the composite map

      f \circ \mu : [0,\infty) \to [0,\infty)

      amounts to multiplying by some number in [0,\infty). We can think of this number as the integral of the function f with respect to the measure corresponding to \mu.

      In short, measures are maps \mu : [0,\infty) \to x. Functions are maps f : x \to [0,\infty). Measures and functions are dual, both in the sense that we can compose them and get numbers, and in the sense that we can ‘turn one around’ using the †-structure and get the other!

  3. John Huerta says:

    I think the simplest nonfree example is 1d, isn’t it? I’m thinking about:

    X = \frac{[0,\infty)^2}{(x,x) \sim (0,0)}

    As a space, this has to be 1-dimensional, as one can easily check, and the unique 1d free [0,\infty) module is spanned by one element. Yet no one element of X spans it.

    • Tobias Fritz says:

      Isn’t this just the real line, considered as an [0,\infty)-module? Your quotient sounds a lot like the construction of formally turning an abelian monoid into an abelian group by throwing in inverses. This can also be applied to [0,\infty)-modules, in which case it turns every module into a vector space.

      So yes, the simplest non-free module is 1-dimensional, namely just the real line. D’oh!

      In fact, the “simplest” example is probably just \{0,1\}, considered as a join semilattice…

      • John Huerta says:

        Hey, you’re right! I didn’t think of that because I wasn’t sure if the real line would be finitely generated, but of course it is! It’s generated by 1 and -1. And that’s how I thought of my example: two generators such that the sum is zero.

        Your second example is more interesting, though, because who ever thinks of \mathbb{Z}_2 as a [0,\infty)-module?

        • Tobias Fritz says:

          Actually what’s really going on here is that \{0,1\} is the Boolean semiring (rig) with 1+1=1, which comes with a rig homomorphism [0,\infty)\rightarrow\{0,1\}. This is how the Boolean rig \{0,1\} becomes a [0,\infty)-module.

          More generally, the \{0,1\}-modules are precisely the join-semilattices, hence all these become [0,\infty)-modules via the above homomorphism.

          (Writing \mathbb{Z}_2 for the Boolean rig may be a bit misleading, since the addition is not the usual one!)

        • Tobias Fritz says:

          Sorry, I should have explained that the homomorphism [0,\infty)\to\{0,1\} is the map which sends 0 to 0 and every positive number to 1.

        • John Baez says:

          You probably know this, but I should point out: if you think of [0,\infty) as the rig that governs probability theory, then its quotient \{0,1\} is the rig that governs possibility theory.

          Numbers in [0,\infty) are relative probabilities. In \{0,1\}, 0 means ‘impossible’ and 1 means ‘possible’. So, the quotient map sends nonzero probabilities to ‘possible’, and zero probability to ‘impossible’.

        • Toby Bartels says:

          So what’s really really going on is that \{0,1\} is the rig of truth values under the operations OR and AND. This makes sense, since probability theory is generalized logic.

    • John Baez says:

      Hey—great to see you here, John! This example

      X = \frac{[0,\infty)^2}{(x,x) \sim (0,0)}

      is confusing to me. Is that quotient a quotient of sets, or of [0,\infty)-modules? Since you’re trying to create a [0,\infty)-module I’ll assume the latter.

      If so, you’re taking [0,\infty)^2, viewed as a free [0,\infty)-module with two generators, and you’re creating a quotient [0,\infty)-module in which all elements of the form (x,x) are set equal to zero.

      But doing that forces

      (x,x) + (y,z) \sim (y,z)

      That is, all elements along diagonal lines get identified. So, every element becomes equivalent to one of the form (x,0) or one of the form (0,x). And

      (x,0) + (0,x) = (x,x) \sim (0,0)

      so every element has a negative. Now I’m thinking Tobias was right and your gadget is isomorphic to \mathbb{R} as a [0,\infty)-module. Let me just check that addition works correctly: I want (x,0) plus (0,y) to equal (x-y,0) if x is bigger, and (0,y-x) if y is bigger. I’ll just do the first case:

      (x,0) + (0,y) = (x,y) = (x-y,0) + (y,y) \sim (x-y,0)

      Yup! So, you’ve got a topological 1-dimensional [0,\infty)^2 module that requires two generators, and it’s isomorphic to \mathbb{R}.

      Nice.

  4. In the first part of this mini-series, I described how various ideas important in probability theory arise naturally when you start doing linear algebra using only the nonnegative real numbers. But Tobias Fritz and I have proved a theorem characterizing the concept of relative entropy, which is also known as ‘relative information’, ‘information gain’ or—most terrifying and least helpful of all—’Kullback-Leibler divergence’. And in this second part, I’ll introduce two key players in this theorem.

  5. In the last couple days I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:

    Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

    Relative Entropy (Part 2): a category related to statistical inference, \mathrm{FinStat}, and how relative entropy defines a functor on this category.

    But then Tobias Fritz noticed a big problem.

  6. Tobias Fritz, a postdoc at the Perimeter Institute, is working with me on category-theoretic aspects of information theory. We published a paper on entropy with Tom Leinster, and we’ve got a followup on relative entropy that’s almost done. I should be working on it right this instant! But for now, read the series of posts here on Azimuth: Relative Entropy Part 1, Part 2 and Part 3. […]

  7. Now Tobias Fritz and I have finally finished our paper on this subject:

    A Bayesian characterization of relative entropy.

You can use HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 2,824 other followers