I’m having a great time at a workshop on Biological and Bio-Inspired Information Theory in Banff, Canada. You can see videos of the talks online. There have been lots of good talks so far, but this one really blew my mind:

• Naftali Tishby, Sensing and acting under information constraints—a principled approach to biology and intelligence, 28 October 2014.

Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was *great!*

It was about the ‘action-perception loop':

This is the feedback loop in which living organisms—like us—*take actions* depending on our goals and what we perceive, and *perceive things* depending on the actions we take and the state of the world.

How do we do this so well? Among other things, we need to balance the *cost* of storing information about the *past* against the *payoff* of achieving our desired goals in the *future*.

Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.

It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.

### Rate-distortion theory

**Rate-distortion theory** is a branch of information theory which seeks to find the minimum rate at which bits must be communicated over a noisy channel so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.

For the basic definitions and a statement of the theorem, try this:

• Natasha Devroye, Rate-distortion theory, course notes, University of Chicago, Illinois, Fall 2009.

One of the people organizing this conference is a big expert on rate-distortion theory, and he wrote a book about it.

• Toby Berger, *Rate Distortion Theory: A Mathematical Basis for Data Compression*, Prentice–Hall, 1971.

Unfortunately it’s out of print and selling for $259 used on Amazon! An easier option might be this:

• Thomas M. Cover and Joy A. Thomas, *Elements of Information Theory*, Chapter 10: Rate Distortion Theory, Wiley, New York, 2006.

### The Bellman equation

The **Bellman equation** reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation says what to do at each time step.

If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the **Euler–Lagrange equation**.

And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.

Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!

Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point in space at time If we call this least possible action the **Hamilton–Jacobi equation** says

where is the particle’s momentum at the endpoint of its path, and is its energy there.

If we replace derivatives by differences, and talk about *maximizing total reward* instead of *minimizing action*, we get Bellman’s equation:

• Bellman equation, Wikipedia.

### Markov decision processes

Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.

In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set of **states**, and time proceeds in discrete steps. At each time step, the probability to hop from state to state is some fixed number

This sort of thing is called a **Markov chain**, or if you feel the need to be more insistent, a **discrete-time Markov chain**.

A **Markov decision process** is a generalization where an outside agent gets to change these probabilities! The agent gets to choose **actions** from some set If at a given time he chooses the action the probability of the system hopping from state to state is Needless to say, these probabilities have to sum to one for any fixed

That would already be interesting, but the real fun is that there’s also a **reward** This is a real number saying how much joy or misery the agent experiences if he does action and the system hops from to

The problem is to choose a **policy**—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!

If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward time steps in the future gets multiplied by where is some **discount factor**. This extra tweak is easily handled, and you can see it all here:

• Markov decision process, Wikipedia.

### Partially observable Markov decision processes

There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action and the system hops from state to state he sees something: a point in some set of **observations**. He makes the observation with probability

(I don’t know why this probability depends on but not Maybe it ultimately doesn’t matter much.)

Again, the goal is to choose a **policy** that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.

Partially observable Markov decision processes are also called **POMPD**s. If you want to learn about them, try these:

• Partially observable Markov decision process, Wikipedia.

• Tony Cassandra, Partially observable Markov decision processes.

The latter includes an introduction *without any formulas* to POMDPs and how to choose optimal policies. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).

### The action-perception loop

I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:

Nafthali Tishby has a nice picture of it that’s more abstract:

We’re assuming time comes in discrete steps, just to keep things simple.

At each time

• the **world** has some state and

• the **agent** has some state

Why the letter ? This stands for **memory**: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.

At each time

• the agent takes an **action** which affects the world’s next state, and

• the world provides a **sensation** to the agent, which affect’s the agent’s next state.

This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!

We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they

• get a reward depending on the action taken and the resulting state of the world

but

• pay a price for the information stored from sensations.

Tishby formulates this optimization question as something like a partially observed Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!

So, everything I sketched fits together somehow!

I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.

For details, I highly recommend this paper:

• Naftali Tishby and Daniel Polani, Information theory of decisions and actions, in *Perception-Reason-Action Cycle: Models, Algorithms and System*. Vassilis, Hussain and Taylor, Springer, Berlin, 2010.

I’m going to print this out, put it by my bed, and read it every night until I’ve absorbed it.