I’m having a great time at a workshop on Biological and Bio-Inspired Information Theory in Banff, Canada. You can see videos of the talks online. There have been lots of good talks so far, but this one really blew my mind:
• Naftali Tishby, Sensing and acting under information constraints—a principled approach to biology and intelligence, 28 October 2014.
Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was great!
It was about the ‘action-perception loop’:
This is the feedback loop in which living organisms—like us—take actions depending on our goals and what we perceive, and perceive things depending on the actions we take and the state of the world.
How do we do this so well? Among other things, we need to balance the cost of storing information about the past against the payoff of achieving our desired goals in the future.
Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.
It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.
Rate-distortion theory is a branch of information theory which seeks to find the minimum rate at which bits must be communicated over a noisy channel so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.
For the basic definitions and a statement of the theorem, try this:
• Natasha Devroye, Rate-distortion theory, course notes, University of Chicago, Illinois, Fall 2009.
One of the people organizing this conference is a big expert on rate-distortion theory, and he wrote a book about it.
• Toby Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice–Hall, 1971.
Unfortunately it’s out of print and selling for $259 used on Amazon! An easier option might be this:
• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Chapter 10: Rate Distortion Theory, Wiley, New York, 2006.
The Bellman equation
The Bellman equation reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation says what to do at each time step.
If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the Euler–Lagrange equation.
And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.
Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!
Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point in space at time If we call this least possible action the Hamilton–Jacobi equation says
where is the particle’s momentum at the endpoint of its path, and is its energy there.
If we replace derivatives by differences, and talk about maximizing total reward instead of minimizing action, we get Bellman’s equation:
• Bellman equation, Wikipedia.
Markov decision processes
Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.
In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set of states, and time proceeds in discrete steps. At each time step, the probability to hop from state to state is some fixed number
This sort of thing is called a Markov chain, or if you feel the need to be more insistent, a discrete-time Markov chain.
A Markov decision process is a generalization where an outside agent gets to change these probabilities! The agent gets to choose actions from some set If at a given time he chooses the action the probability of the system hopping from state to state is Needless to say, these probabilities have to sum to one for any fixed
That would already be interesting, but the real fun is that there’s also a reward This is a real number saying how much joy or misery the agent experiences if he does action and the system hops from to
The problem is to choose a policy—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!
If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward time steps in the future gets multiplied by where is some discount factor. This extra tweak is easily handled, and you can see it all here:
• Markov decision process, Wikipedia.
Partially observable Markov decision processes
There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action and the system hops from state to state he sees something: a point in some set of observations. He makes the observation with probability
(I don’t know why this probability depends on but not Maybe it ultimately doesn’t matter much.)
Again, the goal is to choose a policy that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.
Partially observable Markov decision processes are also called POMPDs. If you want to learn about them, try these:
• Partially observable Markov decision process, Wikipedia.
• Tony Cassandra, Partially observable Markov decision processes.
The latter includes an introduction without any formulas to POMDPs and how to choose optimal policies. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).
The action-perception loop
I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:
Nafthali Tishby has a nice picture of it that’s more abstract:
We’re assuming time comes in discrete steps, just to keep things simple.
At each time
• the world has some state and
• the agent has some state
Why the letter ? This stands for memory: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.
At each time
• the agent takes an action which affects the world’s next state, and
• the world provides a sensation to the agent, which affect’s the agent’s next state.
This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!
We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they
• get a reward depending on the action taken and the resulting state of the world
• pay a price for the information stored from sensations.
Tishby formulates this optimization question as something like a partially observed Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!
So, everything I sketched fits together somehow!
I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.
For details, I highly recommend this paper:
• Naftali Tishby and Daniel Polani, Information theory of decisions and actions, in Perception-Reason-Action Cycle: Models, Algorithms and System. Vassilis, Hussain and Taylor, Springer, Berlin, 2010.
I’m going to print this out, put it by my bed, and read it every night until I’ve absorbed it.