I’m having a great time at a workshop on Biological and Bio-Inspired Information Theory in Banff, Canada. You can see videos of the talks online. There have been lots of good talks so far, but this one really blew my mind:

• Naftali Tishby, Sensing and acting under information constraints—a principled approach to biology and intelligence, 28 October 2014.

Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was *great!*

It was about the ‘action-perception loop’:

This is the feedback loop in which living organisms—like us—*take actions* depending on our goals and what we perceive, and *perceive things* depending on the actions we take and the state of the world.

How do we do this so well? Among other things, we need to balance the *cost* of storing information about the *past* against the *payoff* of achieving our desired goals in the *future*.

Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.

It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.

### Rate-distortion theory

**Rate-distortion theory** is a branch of information theory which seeks to find the minimum rate at which bits must be communicated over a noisy channel so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.

For the basic definitions and a statement of the theorem, try this:

• Natasha Devroye, Rate-distortion theory, course notes, University of Chicago, Illinois, Fall 2009.

One of the people organizing this conference is a big expert on rate-distortion theory, and he wrote a book about it.

• Toby Berger, *Rate Distortion Theory: A Mathematical Basis for Data Compression*, Prentice–Hall, 1971.

Unfortunately it’s out of print and selling for $259 used on Amazon! An easier option might be this:

• Thomas M. Cover and Joy A. Thomas, *Elements of Information Theory*, Chapter 10: Rate Distortion Theory, Wiley, New York, 2006.

### The Bellman equation

The **Bellman equation** reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation says what to do at each time step.

If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the **Euler–Lagrange equation**.

And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.

Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!

Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point in space at time If we call this least possible action the **Hamilton–Jacobi equation** says

where is the particle’s momentum at the endpoint of its path, and is its energy there.

If we replace derivatives by differences, and talk about *maximizing total reward* instead of *minimizing action*, we get Bellman’s equation:

• Bellman equation, Wikipedia.

### Markov decision processes

Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.

In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set of **states**, and time proceeds in discrete steps. At each time step, the probability to hop from state to state is some fixed number

This sort of thing is called a **Markov chain**, or if you feel the need to be more insistent, a **discrete-time Markov chain**.

A **Markov decision process** is a generalization where an outside agent gets to change these probabilities! The agent gets to choose **actions** from some set If at a given time he chooses the action the probability of the system hopping from state to state is Needless to say, these probabilities have to sum to one for any fixed

That would already be interesting, but the real fun is that there’s also a **reward** This is a real number saying how much joy or misery the agent experiences if he does action and the system hops from to

The problem is to choose a **policy**—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!

If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward time steps in the future gets multiplied by where is some **discount factor**. This extra tweak is easily handled, and you can see it all here:

• Markov decision process, Wikipedia.

### Partially observable Markov decision processes

There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action and the system hops from state to state he sees something: a point in some set of **observations**. He makes the observation with probability

(I don’t know why this probability depends on but not Maybe it ultimately doesn’t matter much.)

Again, the goal is to choose a **policy** that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.

Partially observable Markov decision processes are also called **POMPD**s. If you want to learn about them, try these:

• Partially observable Markov decision process, Wikipedia.

• Tony Cassandra, Partially observable Markov decision processes.

The latter includes an introduction *without any formulas* to POMDPs and how to choose optimal policies. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).

### The action-perception loop

I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:

Nafthali Tishby has a nice picture of it that’s more abstract:

We’re assuming time comes in discrete steps, just to keep things simple.

At each time

• the **world** has some state and

• the **agent** has some state

Why the letter ? This stands for **memory**: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.

At each time

• the agent takes an **action** which affects the world’s next state, and

• the world provides a **sensation** to the agent, which affect’s the agent’s next state.

This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!

We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they

• get a reward depending on the action taken and the resulting state of the world

but

• pay a price for the information stored from sensations.

Tishby formulates this optimization question as something like a partially observed Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!

So, everything I sketched fits together somehow!

I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.

For details, I highly recommend this paper:

• Naftali Tishby and Daniel Polani, Information theory of decisions and actions, in *Perception-Reason-Action Cycle: Models, Algorithms and System*. Vassilis, Hussain and Taylor, Springer, Berlin, 2010.

I’m going to print this out, put it by my bed, and read it every night until I’ve absorbed it.

I come from this field, sort of (my very first conference paper was about Q-learning), and there’s a few wrinkles to this.

Most important is perhaps that real agents generally do not use exponential discounting of future rewards, but hyperbolic discounting.

For instance, if I offer you 100 tonight, or 101 tomorrow, you’re likely to take me up on the offer right now. But let’s say I offer you 100 in exactly a year, or 101 in a year and a day. Exponential discounting predicts that you will take the 100, whereas hyperbolic discounting – and real organisms – will go for 101 a day later.

So the Bellman equation doesn’t really hold for real-life agents. There’s ways to deal with it, but things start to get a little messy.

I do not think that such an easy argument can be used against exponential discounting.

When the offer is made by a trusted authority people might take the 101 dollars a day later (because they do not need the 100 dollars today) and in pensions many of us take the early lower pension over the higher later pension.

I gave a simple example as an illustration. You can frequently find that organisms don’t conform to exponential discounting in their decision making (technically it’s frequently not exactly hyperbolic either).

Tishby’s framework allows for hyperbolic discounting nicely, see http://www.dagstuhl.de/mat/Files/11/11131/11131.TishbyNaftali.Slides.pdf Information Gain Discounting slide.

Cool! I hadn’t gotten into the discussion of hyperbolic discounting here because ‘discounting’ is not what matters to me in Tishby’s work. However, it’s nice to see people have thought about it.

I was surprised that there was so little about vision at this workshop (going by the titles). Vision is the most information-rich sense that we have, and there are clearly informational constraints. We have 100 million photoreceptive cells per retina, but only one million fibres in an optic nerve. One third of the brain is devoted to vision. And there is evidence from change blindness (http://en.wikipedia.org/wiki/Change_blindness) that we don’t remember much about what we just saw, even 100ms later. This seems a much better testing ground for Tishby’s theory than sleeping cats and mice swimming in milk.

The book Active Vision by John M Findlay and Iain D Gilchrist, 2003, makes a strong case for vision being an active process rather than a passive one. Especially important are eye movements, which depend on what you’re trying to do, and what you’ve seen (and remembered) so far. Each saccade, you change what you see. The saccades even give you discrete time steps, more or less.

There was a fun talk about information theory in vision, and certainly a lot of people alluded to it. But the application of information theory people talked about most was to intercellular molecular communication!

Actually, the colleagues in Bielefeld got the sequence exactly backwards: instead of sensation – perception – reaction, it should be action – sensation – perception.

Stimulus-response organization of behavior may occur, but I have yet to find a good example. Even reflexes, when studied closely, only appear to be organized according to S-R concepts. If such S-R organization exists, then it is a highly derived trait and a rather rare exception.

Since they’re describing it as a loop, I don’t think they’re claiming there’s a best starting-point for that loop. And they said ‘action’, not ‘reaction’. However, what you say is interesting.

There’s an interesting paper on information seeking, which alludes to some of the topics mentioned here…so I can’t wait to watch the talk! How we handle information, obviously, impacts hugely how we interact with our environment + learn + seek out information.

Here’s the paper – http://www.cell.com/trends/cognitive-sciences/abstract/S1364-6613(13)00205-2

For POMDP-type models, based on what I understand, how does it handle varying probabilities? In an actual environment – agent interaction situation, the probability of the event is not always constant. How does that affect the optimal strategy?

I imagine POMDP models either 1) need to be generalized to allow the agent to update its policies as time goes on, based on new data, or 2) need to be able to handle enough contingencies that any change in conditions is simply treated as another observation.

I hope you get what I mean. In case 1) we have actual

learningwith the passage of time as we run the model, while in case 2) all the learning must occur before we start running the model; the model needs to be sufficiently well-prepared to be already familiar with all situations it comes across.Clearly approach 1) is more ambitious and more interesting. I wonder what various systems like self-driving cars actually do! Does one of Google’s cars actually learn as it drives, for example learn more about the probability that another car will cut in front of it as it drives along a particular highway you use to go do work, perhaps in a way that depends on the time of day? This could be good, but it’s probably harder to get right.

I remember an old discussion on sci.physics.research where James Dolan discussed a type of linear algebra based on Bellman’s Dynamic Programming where addition was replaced by the minimization of costs and multiplication was replaced by the addition of costs. The transition costs for traversing a graph roughly corresponded to the action of classical mechanics so that minimizing the cost of a path from point a to point b corresponded to the least action principle. From what I remember, this approach really showed the close relationship between statistical, classical, and quantum mechanics, although I also remember never quite grokking the whole picture. Does this ring a bell for anyone? Does anyone know if this approach is written up anywhere?

I spent lots of time discussing this with James, and you can read bits of it written up in my Winter 2007 quantum gravity seminar notes, especially here:

• Week 12 (Jan. 30) – Classical, quantum and statistical mechanics as "matrix mechanics". In quantum mechanics we use linear algebra over the ring of complex numbers; in classical mechanics everything is formally the same, but we instead use the rig R

^{min}= R ∪ {+∞} where the addition is min and the multiplication is +. As a warmup for bringing statistical mechanics into the picture – and linear algebra over yet another rig – we recall how the dynamics of particles becomes the statics of strings after Wick rotation. Blog entry.• Week 13 (Feb. 6) – Statistical mechanics and deformation of rigs. Statistical mechanics (or better, "thermal statics") as matrix mechanics over a rig R

^{T}that depends on the temperature T. As T → 0, the rig R^{T}reduces to R^{min}and thermal statics reduces to classical statics, just as quantum dynamics reduces to classical dynamics as Planck’s constant approaches zero. Tropical mathematics, idempotent analysis and Maslov dequantization. Blog entry.Supplementary reading: Grigori L. Litvinov, The Maslov dequantization, idempotent and tropical mathematics: a very brief introduction. Also see the longer version here.

• Week 14 (Feb. 13) – An example of path-integral quantization: the free particle on a line (part 1). Blog entry.

• Week 15 (Feb. 20) – The free particle on a line (part 2). Showing the path-integral approach agrees with the Hamiltonian approach. Fourier transforms and Gaussian integrals. Blog entry.