## Why Google Gave Up

5 January, 2015

I was disappointed when Google gave up. In 2007, the company announced a bold initiative to fight global warming:

### Google’s Goal: Renewable Energy Cheaper than Coal

Creates renewable energy R&D group and supports breakthrough technologies

Mountain View, Calif. (November 27, 2007) – Google (NASDAQ: GOOG) today announced a new strategic initiative to develop electricity from renewable energy sources that will be cheaper than electricity produced from coal. The newly created initiative, known as RE<C, will focus initially on advanced solar thermal power, wind power technologies, enhanced geothermal systems and other potential breakthrough technologies. RE<C is hiring engineers and energy experts to lead its research and development work, which will begin with a significant effort on solar thermal technology, and will also investigate enhanced geothermal systems and other areas. In 2008, Google expects to spend tens of millions on research and development and related investments in renewable energy. As part of its capital planning process, the company also anticipates investing hundreds of millions of dollars in breakthrough renewable energy projects which generate positive returns.

But in 2011, Google shut down the program. I never heard why. Recently two engineers involved in the project have given a good explanation:

• Ross Koningstein and David Fork, What it would really take to reverse climate change, 18 November 2014.

But the short version is this. They couldn’t find a way to accomplish their goal: producing a gigawatt of renewable power more cheaply than a coal-fired plant — and in years, not decades.

And since then, they’ve been reflecting on their failure and they’ve realized something even more sobering. Even if they’d been able to realize their best-case scenario — a 55% carbon emissions cut by 2050 — it would not bring atmospheric CO2 back below 350 ppm during this century.

This is not surprising to me.

What would we need to accomplish this? They say two things. First, a cheap dispatchable, distributed power source:

Consider an average U.S. coal or natural gas plant that has been in service for decades; its cost of electricity generation is about 4 to 6 U.S. cents per kilowatt-hour. Now imagine what it would take for the utility company that owns that plant to decide to shutter it and build a replacement plant using a zero-carbon energy source. The owner would have to factor in the capital investment for construction and continued costs of operation and maintenance—and still make a profit while generating electricity for less than $0.04/kWh to$0.06/kWh.

That’s a tough target to meet. But that’s not the whole story. Although the electricity from a giant coal plant is physically indistinguishable from the electricity from a rooftop solar panel, the value of generated electricity varies. In the marketplace, utility companies pay different prices for electricity, depending on how easily it can be supplied to reliably meet local demand.

“Dispatchable” power, which can be ramped up and down quickly, fetches the highest market price. Distributed power, generated close to the electricity meter, can also be worth more, as it avoids the costs and losses associated with transmission and distribution. Residential customers in the contiguous United States pay from $0.09/kWh to$0.20/kWh, a significant portion of which pays for transmission and distribution costs. And here we see an opportunity for change. A distributed, dispatchable power source could prompt a switchover if it could undercut those end-user prices, selling electricity for less than $0.09/kWh to$0.20/kWh in local marketplaces. At such prices, the zero-carbon system would simply be the thrifty choice.

But “dispatchable”, they say, means “not solar”.

Second, a lot of carbon sequestration:

While this energy revolution is taking place, another field needs to progress as well. As Hansen has shown, if all power plants and industrial facilities switch over to zero-carbon energy sources right now, we’ll still be left with a ruinous amount of CO2 in the atmosphere. It would take centuries for atmospheric levels to return to normal, which means centuries of warming and instability. To bring levels down below the safety threshold, Hansen’s models show that we must not only cease emitting CO2 as soon as possible but also actively remove the gas from the air and store the carbon in a stable form. Hansen suggests reforestation as a carbon sink. We’re all for more trees, and we also exhort scientists and engineers to seek disruptive technologies in carbon storage.

How to achieve these two goals? They say government and energy businesses should spend 10% of employee time on “strange new ideas that have the potential to be truly disruptive”.

## Networks in Climate Science

29 November, 2014

What follows is draft of a talk I’ll be giving at the Neural Information Processing Seminar on December 10th. The actual talk may contain more stuff—for example, more work that Dara Shayda has done. But I’d love comments now, so I’m posting this now and hoping you can help out.

You can click on any of the pictures to see where it came from or get more information.

### Preliminary throat-clearing

I’m very flattered to be invited to speak here. I was probably invited because of my abstract mathematical work on networks and category theory. But when I got the invitation, instead of talking about something I understood, I thought I’d learn about something a bit more practical and talk about that. That was a bad idea. But I’ll try to make the best of it.

I’ve been trying to learn climate science. There’s a subject called ‘complex networks’ where people do statistical analyses of large graphs like the worldwide web or Facebook and draw conclusions from it. People are trying to apply these ideas to climate science. So that’s what I’ll talk about. I’ll be reviewing a lot of other people’s work, but also describing some work by a project I’m involved in, the Azimuth Project.

The Azimuth Project is an all-volunteer project involving scientists and programmers, many outside academia, who are concerned about environmental issues and want to use their skills to help. This talk is based on the work of many people in the Azimuth Project, including Jan Galkowski, Graham Jones, Nadja Kutz, Daniel Mahler, Blake Pollard, Paul Pukite, Dara Shayda, David Tanzer, David Tweed, Steve Wenner and others. Needless to say, I’m to blame for all the mistakes.

### Climate variability and El Niño

Okay, let’s get started.

You’ve probably heard about the ‘global warming pause’. Is this a real thing? If so, is it due to ‘natural variability’, heat going into the deep oceans, some combination of both, a massive failure of our understanding of climate processes, or something else?

Here is chart of global average air temperatures at sea level, put together by NASA’s Goddard Institute of Space Science:

You can see a lot of fluctuations, including a big dip after 1940 and a tiny dip after 2000. That tiny dip is the so-called ‘global warming pause’. What causes these fluctuations? That’s a big, complicated question.

One cause of temperature fluctuations is a kind of cycle whose extremes are called El Niño and La Niña.

A lot of things happen during an El Niño. For example, in 1997 and 1998, a big El Niño, we saw all these events:

El Niño is part of an irregular cycle that happens every 3 to 7 years, called the El Niño Southern Oscillation or ENSO. Two strongly correlated signs of an El Niño are:

1) Increased sea surface temperatures in a patch of the Pacific called the Niño 3.4 region. The temperature anomaly in this region—how much warmer it is than usual for that time of year—is called the Niño 3.4 index.

2) A decrease in air pressures in the western side of the Pacific compared to those further east. This is measured by the Southern Oscillation Index or SOI.

You can see the correlation here:

El Niños are important because they can cause billions of dollars of economic damage. They also seem to bring heat stored in the deeper waters of the Pacific into the atmosphere. So, one reason for the ‘global warming pause’ may be that we haven’t had a strong El Niño since 1998. The global warming pause might end with the next El Niño. For a while it seemed we were due for a big one this fall, but that hasn’t happened.

### Teleconnections

The ENSO cycle is just one of many cycles involving teleconnections: strong correlations between weather at distant locations, typically thousands of kilometers. People have systematically looked for these teleconnections using principal component analysis of climate data, and also other techniques.

The ENSO cycle shows up automatically when you do this kind of study. It stands out as the biggest source of climate variability on time scales greater than a year and less than a decade. Some others include:

For example, the Pacific Decadal Oscillation is a longer-period relative of the ENSO, centered in the north Pacific:

### Complex network theory

Recently people have begun to study teleconnections using ideas from ‘complex network theory’.

What’s that? In complex network theory, people often start with a weighted graph: that is, a set $N$ of nodes and for any pair of nodes $i, j \in N,$ a weight $A_{i j},$ which can be any nonnegative real number.

Why is this called a weighted graph? It’s really just a matrix of nonnegative real numbers!

The reason is that we can turn any weighted graph into a graph by drawing an edge from node $j$ to node $i$ whenever $A_{i j} >0.$ This is a directed graph, meaning that we should draw an arrow pointing from $j$ to $i.$ We could have an edge from $i$ to $j$ but not vice versa! Note that we can also have an edge from a node to itself.

Conversely, if we have any directed graph, we can turn it into a weighted graph by choosing the weight $A_{i j} = 1$ when there’s an edge from $j$ to $i,$ and $A_{i j} = 0$ otherwise.

For example, we can make a weighted graph where the nodes are web pages and $A_{i j}$ is the number of links from the web page $j$ to the web page $i.$

People in complex network theory like examples of this sort: large weighted graphs that describe connections between web pages, or people, or cities, or neurons, or other things. The goal, so far, is to compute numbers from weighted graphs in ways that describe interesting properties of these complex networks—and then formulate and test hypotheses about the complex networks we see in real life.

### The El Niño basin

Here’s a very simple example of what we can do with a weighted graph. For any node $i,$ we can sum up the weights of edges going into $i:$

$\sum_{j \in N} A_{j i}$

This is called the degree of the node $i.$ For example, if lots of people have web pages with lots of links to yours, your webpage will have a high degree. If lots of people like you on Facebook, you will have a high degree.

So, the degree is some measure of how ‘important’ a node is.

People have constructed climate networks where the nodes are locations on the Earth’s surface, and the weight $A_{i j}$ measures how correlated the weather is at the $i$th and $j$th location. Then, the degree says how ‘important’ a given location is for the Earth’s climate—in some vague sense.

For example, in Complex networks in climate dynamics, Donges et al take surface air temperature data on a grid and compute the correlation between grid points.

More precisely, let $T_i(t)$ be the temperature at the $i$th grid point at month $t$ after the average for that month in all years under consideration has been subtracted off, to eliminate some seasonal variations. They compute the Pearson correlation $A_{i j}$ of $T_i(t)$ and $T_j(t)$ for each pair of grid points $i, j.$ The Pearson correlation is the simplest measure of linear correlation, normalized to range between -1 and 1.

We could construct a weighted graph this way, and it would be symmetric, or undirected:

$A_{i j} = A_{j i}$

However, Donges et al prefer to work with a graph rather than a weighted graph. So, they create a graph where there is an edge from $i$ to $j$ (and also from $j$ to $i$) when $|A_{i j}|$ exceeds a certain threshold, and no edge otherwise.

They can adjust this threshold so that any desired fraction of pairs $i, j$ actually have an edge between them. After some experimentation they chose this fraction to be 0.5%.

A certain patch dominates the world! This is the El Niño basin. The Indian Ocean comes in second.

(Some details, which I may not say:

The Pearson correlation is the covariance

$\Big\langle \left( T_i - \langle T_i \rangle \right) \left( T_j - \langle T_j \rangle \right) \Big\rangle$

normalized by dividing by the standard deviation of $T_i$ and the standard deviation of $T_j.$

The reddest shade of red in the above picture shows nodes that are connected to 5% or more of the other nodes. These nodes are connected to at least 10 times as many nodes as average.)

The Pearson correlation detects linear correlations. A more flexible measure is mutual information: how many bits of information knowing the temperature at time $t$ at grid point $i$ tells you about the temperature at the same time at grid point $j.$

Donges et al create a climate network this way as well, putting an edge between nodes if their mutual information exceeds a certain cutoff. They choose this cutoff so that 0.5% of node pairs have an edge between them, and get the following map:

The result is almost indistinguishable in the El Niño basin. So, this feature is not just an artifact of focusing on linear correlations.

### El Niño breaks climate links

We can also look at how climate networks change with time—and in particular, how they are affected by El Niños. This is the subject of a 2008 paper by Tsonis and Swanson, Topology and predictability of El Niño and La Niña networks.

They create a climate network in a way that’s similar to the one I just described. The main differences are that they:

1. separately create climate networks for El Niño and La Niña time periods;
2. create a link between grid points when their Pearson correlation has absolute value greater than $0.5;$

3. only use temperature data from November to March in each year, claiming that summertime introduces spurious links.

They get this map for La Niña conditions:

and this map for El Niño conditions:

They conclude that “El Niño breaks climate links”.

This may seem to contradict what I just said a minute ago. But it doesn’t! While the El Niño basin is a region where the surface air temperatures are highly correlated to temperatures at many other points, when an El Niño actually occurs it disrupts correlations between temperatures at different locations worldwide—and even in the El Niño basin!

For the rest of the talk I want to focus on a third claim: namely, that El Niños can be predicted by means of an increase in correlations between temperatures within the El Niño basin and temperatures outside this region. This claim was made in a recent paper by Ludescher et al. I want to examine it somewhat critically.

### Predicting El Niños

People really want to predict El Niños, because they have huge effects on agriculture, especially around the Pacific ocean. However, it’s generally regarded as very hard to predict El Niños more than 6 months in advance. There is also a spring barrier: it’s harder to predict El Niños through the spring of any year.

It’s controversial how much of the unpredictability in the ENSO cycle is due to chaos intrinsic to the Pacific ocean system, and how much is due to noise from outside the system. Both may be involved.

There are many teams trying to predict El Niños, some using physical models of the Earth’s climate, and others using machine learning techniques. There is a kind of competition going on, which you can see at a National Oceanic and Atmospheric Administration website.

The most recent predictions give a sense of how hard this job is:

When the 3-month running average of the Niño 3.4 index exceeds 0.5°C for 5 months, we officially declare that there is an El Niño.

As you can see, it’s hard to be sure if there will be an El Niño early next year! However, the consensus forecast is yes, a weak El Niño. This is the best we can do, now. Right now multi-model ensembles have better predictive skill than any one model.

### The work of Ludescher et al

The Azimuth Project has carefully examined a 2013 paper by Ludescher et al called Very early warning of next El Niño, which uses a climate network for El Niño prediction.

They build their climate network using correlations between daily surface air temperature data between points inside the El Niño basin and certain points outside this region, as shown here:

The red dots are the points in their version of the El Niño basin.

(Next I will describe Ludescher’s procedure. I may omit some details in the actual talk, but let me include them here.)

The main idea of Ludescher et al is to construct a climate network that is a weighted graph, and to say an El Niño will occur if the average weight of edges between points in the El Niño basin and points outside this basin exceeds a certain threshold.

As in the other papers I mentioned, Ludescher et al let $T_i(t)$ be the surface air temperature at the $i$th grid point at time $t$ minus the average temperature at that location at that time of year in all years under consideration, to eliminate the most obvious seasonal effects.

They consider a time-delayed covariance between temperatures at different grid points:

$\langle T_i(t) T_j(t - \tau) \rangle - \langle T_i(t) \rangle \langle T_j(t - \tau) \rangle$

where $\tau$ is a time delay, and the angle brackets denote a running average over the last year, that is:

$\displaystyle{ \langle f(t) \rangle = \frac{1}{365} \sum_{d = 0}^{364} f(t - d) }$

where $t$ is the time in days.

They normalize this to define a correlation $C_{i,j}^t(\tau)$ that ranges from -1 to 1.

Next, for any pair of nodes $i$ and $j,$ and for each time $t,$ they determine the maximum, the mean and the standard deviation of $|C_{i,j}^t(\tau)|,$ as the delay $\tau$ ranges from -200 to 200 days.

They define the link strength $S_{i,j}(t)$ as the difference between the maximum and the mean value of $|C_{i,j}^t(\tau)|,$ divided by its standard deviation.

Finally, they let $S(t)$ be the average link strength, calculated by averaging $S_{i j}(t)$ over all pairs $i,j$ where $i$ is a grid point inside their El Niño basin and $j$ is a grid point outside this basin, but still in their larger rectangle.

Here is what they get:

The blue peaks are El Niños: episodes where the Niño 3.4 index is over 0.5°C for at least 5 months.

The red line is their ‘average link strength’. Whenever this exceeds a certain threshold $\Theta = 2.82,$ and the Niño 3.4 index is not already over 0.5°C, they predict an El Niño will start in the following calendar year.

Ludescher et al chose their threshold for El Niño prediction by training their algorithm on climate data from 1948 to 1980, and tested it on data from 1981 to 2013. They claim that with this threshold, their El Niño predictions were correct 76% of the time, and their predictions of no El Niño were correct in 86% of all cases.

On this basis they claimed—when their paper was published in February 2014—that the Niño 3.4 index would exceed 0.5 by the end of 2014 with probability 3/4.

The latest data as of 1 December 2014 seems to say: yes, it happened!

### Replication and critique

Graham Jones of the Azimuth Project wrote code implementing Ludescher et al’s algorithm, as best as we could understand it, and got results close to theirs, though not identical. The code is open-source; one goal of the Azimuth Project is to do science ‘in the open’.

More interesting than the small discrepancies between our calculation and theirs is the question of whether ‘average link strengths’ between points in the El Niño basin and points outside are really helpful in predicting El Niños.

Steve Wenner, a statistician helping the Azimuth Project, noted some ambiguities in Ludescher et al‘s El Niño prediction rules and disambiguated them in a number of ways. For each way he used Fischer’s exact test to compute the $p$-value of the null hypothesis that Ludescher et al‘s El Niño prediction does not improve the odds that what they predict will occur.

The best he got (that is, the lowest $p$-value) was 0.03. This is just a bit more significant than the conventional 0.05 threshold for rejecting a null hypothesis.

Do high average link strengths between points in the El Niño basin and points elsewhere in the Pacific really increase the chance that an El Niño is coming? It is hard to tell from the work of Ludescher et al.

One reason is that they treat El Niño as a binary condition, either on or off depending on whether the Niño 3.4 index for a given month exceeds 0.5 or not. This is not the usual definition of El Niño, but the real problem is that they are only making a single yes-or-no prediction each year for 65 years: does an El Niño occur during this year, or not? 31 of these years (1950-1980) are used for training their algorithm, leaving just 34 retrodictions and one actual prediction (1981-2013, and 2014).

So, there is a serious problem with small sample size.

We can learn a bit by taking a different approach, and simply running some linear regressions between the average link strength and the Niño 3.4 index for each month. There are 766 months from 1950 to 2013, so this gives us more data to look at. Of course, it’s possible that the relation between average link strength and Niño is highly nonlinear, so a linear regression may not be appropriate. But it is at least worth looking at!

Daniel Mahler and Dara Shayda of the Azimuth Project did this and found the following interesting results.

### Simple linear models

Here is a scatter plot showing the Niño 3.4 index as a function of the average link strength on the same month:

(Click on these scatter plots for more information.)

The coefficient of determination, $R^2,$ is 0.0175. In simple terms, this means that the average link strength in a given month explains just 1.75% of the variance of the Niño 3.4 index. That’s quite low!

Here is a scatter plot showing the Niño 3.4 index as a function of the average link strength six months earlier:

Now $R^2$ is 0.088. So, the link strength explains 8.8% of the variance in the Niño 3.4 index 6 months later. This is still not much—but interestingly, it’s much more than when we try to relate them at the same moment in time! And the $p$-value is less than $2.2 \cdot 10^{-16},$ so the effect is statistically significant.

Of course, we could also try to use Niño 3.4 to predict itself. Here is the Niño 3.4 index plotted against the Niño 3.4 index six months earlier:

Now $R^2 = 0.162.$ So, this is better than using the average link strength!

That doesn’t sound good for average link strength. But now let’s could try to predict Niño 3.4 using both itself and the average link strength 6 months earlier. Here is a scatter plot showing that:

Here the $x$ axis is an optimally chosen linear combination of average and link strength and Niño 3.4: one that maximizes $R^2$.

In this case we get $R^2 = 0.22.$

### Conclusions

What can we conclude from this?

Using a linear model, the average link strength on a given month accounts for only 8% of the variance of Niño 3.4 index 6 months in the future. That sounds bad, and indeed it is.

However, there are more interesting things to say than this!

Both the Niño 3.4 index and the average link strength can be computed from the surface air temperature of the Pacific during some window in time. The Niño 3.4 index explains 16% of its own variance 6 months into the future; the average link strength explains 8%, and taken together they explain 22%. So, these two variables contain a fair amount of independent information about the Niño 3.4 index 6 months in the future.

Furthermore, they explain a surprisingly large amount of its variance for just 2 variables.

For comparison, Mahler used a random forest variant called ExtraTreesRegressor to predict the Niño 3.4 index 6 months into the future from much larger collections of data. Out of the 778 months available he trained the algorithm on the first 400 and tested it on the remaining 378.

The result: using a full world-wide grid of surface air temperature values at a given moment in time explains only 23% of the Niño 3.4 index 6 months into the future. A full grid of surface air pressure values does considerably better, but still explains only 34% of the variance. Using twelve months of the full grid of pressure values only gets around 37%.

From this viewpoint, explaining 22% of the variance with just two variables doesn’t look so bad!

Moreover, while the Niño 3.4 index is maximally correlated with itself at the same moment in time, for obvious reasons, the average link strength is maximally correlated with the Niño 3.4 index 10 months into the future:

(The lines here occur at monthly intervals.)

However, we have not tried to determine if the average link strength as Ludescher et al define it is optimal in this respect. Graham Jones has shown that simplifying their definition of this quantity doesn’t change it much. Maybe modifying their definition could improve it. There seems to be a real phenomenon at work here, but I don’t think we know exactly what it is!

My talk has avoided discussing physical models of the ENSO, because I wanted to focus on very simple, general ideas from complex network theory. However, it seems obvious that really understanding the ENSO requires a lot of ideas from meteorology, oceanography, physics, and the like. I am not advocating a ‘purely network-based approach’.

## A Second Law for Open Markov Processes

15 November, 2014

guest post by Blake Pollard

What comes to mind when you hear the term ‘random process’? Do you think of Brownian motion? Do you think of particles hopping around? Do you think of a drunkard staggering home?

Today I’m going to tell you about a version of the drunkard’s walk with a few modifications. Firstly, we don’t have just one drunkard: we can have any positive real number of drunkards. Secondly, our drunkards have no memory; where they go next doesn’t depend on where they’ve been. Thirdly, there are special places, such as entrances to bars, where drunkards magically appear and disappear.

The second condition says that our drunkards satisfy the Markov property, making their random walk into a Markov process. The third condition is really what I want to tell you about, because it makes our Markov process into a more general ‘open Markov process’.

There are a collection of places the drunkards can be, for example:

$V= \{ \text{bar},\text{sidewalk}, \text{street}, \text{taco truck}, \text{home} \}$

We call this set $V$ the set of states. There are certain probabilities associated with traveling between these places. We call these transition rates. For example it is more likely for a drunkard to go from the bar to the taco truck than to go from the bar to home so the transition rate between the bar and the taco truck should be greater than the transition rate from the bar to home. Sometimes you can’t get from one place to another without passing through intermediate places. In reality the drunkard can’t go directly from the bar to the taco truck: he or she has to go from the bar to sidewalk to the taco truck.

This information can all be summarized by drawing a directed graph where the positive numbers labelling the edges are the transition rates:

For simplicity we draw only three states: home, bar, taco truck. Drunkards go from home to the bar and back, but they never go straight from home to the taco truck.

We can keep track of where all of our drunkards are using a vector with 3 entries:

$\displaystyle{ p(t) = \left( \begin{array}{c} p_h(t) \\ p_b(t) \\ p_{tt}(t) \end{array} \right) \in \mathbb{R}^3 }$

We call this our population distribution. The first entry $p_h$ is the number of drunkards that are at home, the second $p_b$ is how many are at the bar, and the third $p_{tt}$ is how many are at the taco truck.

There is a set of coupled, linear, first-order differential equations we can write down using the information in our graph that tells us how the number of drunkards in each place change with time. This is called the master equation:

$\displaystyle{ \frac{d p}{d t} = H p }$

where $H$ is a 3×3 matrix which we call the Hamiltonian. The off-diagonal entries are nonnegative:

$H_{ij} \geq 0, i \neq j$

and the columns sum to zero:

$\sum_i H_{ij}=0$

We call a matrix satisfying these conditions infinitesimal stochastic. Stochastic matrices have columns that sum to one. If we take the exponential of an infinitesimal stochastic matrix we get one whose columns sum to one, hence the label ‘infinitesimal’.

The Hamiltonian for the graph above is

$H = \left( \begin{array}{ccc} -2 & 5 & 10 \\ 2 & -12 & 0 \\ 0 & 7 & -10 \end{array} \right)$

John has written a lot about Markov processes and infinitesimal stochastic Hamiltonians in previous posts.

Given two vectors $p,q \in \mathbb{R}^3$ describing the populations of drunkards which obey the same master equation, we can calculate the relative entropy of $p$ relative to $q$:

$\displaystyle{ S(p,q) = \sum_{ i \in V} p_i \ln \left( \frac{p_i}{q_i} \right) }$

This is an example of a ‘divergence’. In statistics, a divergence a way of measuring the distance between probability distributions, which may not be symmetrical and may even not obey the triangle inequality.

The relative entropy is important because it decreases monotonically with time, making it a Lyapunov function for Markov processes. Indeed, it is a well known fact that

$\displaystyle{ \frac{dS(p(t),q(t) ) } {dt} \leq 0 }$

This is true for any two population distributions which evolve according to the same master equation, though you have to allow infinity as a possible value for the relative entropy and negative infinity for its time derivative.

Why is entropy decreasing? Doesn’t the Second Law of Thermodynamics say entropy increases?

Don’t worry: the reason is that I have not put a minus sign in my definition of relative entropy. Put one in if you like, and then it will increase. Sometimes without the minus sign it’s called the Kullback–Leibler divergence. This decreases with the passage of time, saying that any two population distributions $p(t)$ and $q(t)$ get ‘closer together’ as they get randomized with the passage of time.

That itself is a nice result, but I want to tell you what happens when you allow drunkards to appear and disappear at certain states. Drunkards appear at the bar once they’ve had enough to drink and once they are home for long enough they can disappear. The set of places where drunkards can appear or disappear $B$ is called the set of boundary states.  So for the above process

$B = \{ \text{home},\text{bar} \}$

is the set of boundary states. This changes the way in which the population of drunkards changes with time!

The drunkards at the taco truck obey the master equation. For them,

$\displaystyle{ \frac{dp_{tt}}{dt} = 7p_b -10 p_{tt} }$

still holds. But because the populations can appear or disappear at the boundary states the master equation no longer holds at those states! Instead it is useful to define the flow of drunkards into the $i^{th}$ state by

$\displaystyle{ \frac{Dp_i}{Dt} = \frac{dp_i}{dt}-\sum_j H_{ij} p_j}$

This quantity describes by how much the rate of change of the populations at the boundary states differ from that given by the master equation.

The reason why we are interested in open Markov processes is because you can take two open Markov processes and glue them together along some subset of their boundary states to get a new open Markov process! This allows us to build up or break down complicated Markov processes using open Markov processes as the building blocks.

For example we can draw the graph corresponding to the drunkards’ walk again, only now we will distinguish boundary states from internal states by coloring internal states blue and having boundary states be white:

Consider another open Markov process with states

$V=\{ \text{home},\text{work},\text{bar} \}$

where

$B=\{ \text{home}, \text{bar}\}$

are the boundary states, leaving

$I=\{\text{work}\}$

as an internal state:

Since the boundary states of this process overlap with the boundary states of the first process we can compose the two to form a new Markov process:

Notice the boundary states are now internal states. I hope any Markov process that could approximately model your behavior has more interesting nodes! There is a nice way to figure out the Hamiltonian of the composite from the Hamiltonians of the pieces, but we will leave that for another time.

We can ask ourselves, how does relative entropy change with time in open Markov processes? You can read my paper for the details, but here is the punchline:

$\displaystyle{ \frac{dS(p(t),q(t) ) }{dt} \leq \sum_{i \in B} \frac{Dp_i}{Dt}\frac{\partial S}{\partial p_i} + \frac{Dq_i}{Dt}\frac{\partial S}{\partial q_i} }$

This is a version of the Second Law of Thermodynamics for open Markov processes.

It is important to notice that the sum is only over the boundary states! This inequality tells us that relative entropy still decreases inside our process, but depending on the flow of populations through the boundary states the relative entropy of the whole process could either increase or decrease! This inequality will be important when we study how the relative entropy changes in different parts of a bigger more complicated process.

That is all for now, but I leave it as an exercise for you to imagine a Markov process that describes your life. How many states does it have? What are the relative transition rates? Are there states you would like to spend more or less time in? Are there states somewhere you would like to visit?

Here is my paper, which proves the above inequality:

• Blake Pollard, A Second Law for open Markov processes.

If you have comments or corrections, let me know!

## Network Theory Seminar (Part 4)

5 November, 2014

Since I was in Banff, my student Franciscus Rebro took over this week and explained more about cospan categories. These are a tool for constructing categories where the morphisms are networks such as electrical circuit diagrams, signal flow diagrams, Markov processes and the like. For some more details see:

• John Baez and Brendan Fong, A compositional framework for passive linear networks.

Cospan categories are really best thought of as bicategories, and Franciscus gets into this aspect too.

## Network Theory (Part 33)

4 November, 2014

Last time I came close to describing the ‘black box functor’, which takes an electrical circuit made of resistors

and sends it to its behavior as viewed from outside. From outside, all you can see is the relation between currents and potentials at the ‘terminals’—the little bits of wire that poke out of the black box:

I came close to defining the black box functor, but I didn’t quite make it! This time let’s finish the job.

### The categories in question

The black box functor

$\blacksquare : \mathrm{ResCirc} \to \mathrm{LinRel}$

goes from the category $\mathrm{ResCirc},$ where morphisms are circuits made of resistors, to the category $\mathrm{LinRel},$ where morphisms are linear relations. Let me remind you how these categories work, and introduce a bit of new notation.

Here is the category $\mathrm{ResCirc}:$

• an object is a finite set;

• a morphism from $X$ to $Y$ is an isomorphism class of cospans

in the category of graphs with edges labelled by resistances: numbers in $(0,\infty).$ Here we think of the finite sets $X$ and $Y$ as graphs with no edges. We call $X$ the set of inputs and $Y$ the set of outputs.

• we compose morphisms in $\mathrm{ResCirc}$ by composing isomorphism classes of cospans.

And here is the category $\mathrm{LinRel}:$

• an object is a finite-dimensional real vector space;

• a morphism from $U$ to $V$ is a linear relation $R : U \leadsto V,$ meaning a linear subspace $R \subseteq U \times V;$

• we compose a linear relation $R \subseteq U \times V$ and a linear relation $S \subseteq V \times W$ in the usual way we compose relations, getting:

$SR = \{(u,w) \in U \times W : \; \exists v \in V \; (u,v) \in R \mathrm{\; and \;} (v,w) \in S \}$

In case you’re wondering: I’ve just introduced the wiggly arrow notation

$R : U \leadsto V$

for a linear relation from $U$ to $V,$ because it suggests that a relation is a bit like a function but more general. Indeed, a function is a special case of a relation, and composing functions is a special case of composing relations.

### The black box functor

Now, how do we define the black box functor?

Defining it on objects is easy. An object of $\mathrm{ResCirc}$ is a finite set $S,$ and we define

$\blacksquare{S} = \mathbb{R}^S \times \mathbb{R}^S$

The idea is that $S$ could be a set of inputs or outputs, and then

$(\phi, I) \in \mathbb{R}^S \times \mathbb{R}^S$

is a list of numbers: the potentials and currents at those inputs or outputs.

So, the interesting part is defining the black box functor on morphisms!

For this we start with a morphism in $\mathrm{ResCirc}$:

The labelled graph $\Gamma$ consists of:

• a set $N$ of nodes,

• a set $E$ of edges,

• maps $s, t : E \to N$ sending each edge to its source and target,

• a map $r : E \to (0,\infty)$ sending each edge to its resistance.

The cospan gives maps

$i: X \to N, \qquad o: Y \to N$

These say how the inputs and outputs are interpreted as nodes in the circuit. We’ll call the nodes that come from inputs or outputs ‘terminals’. So, mathematically,

$T = \mathrm{im}(i) \cup \mathrm{im}(o) \subseteq N$

is the set of terminals: the union of the images of $i$ and $o.$

In the simplest case, the maps $i$ and $o$ are one-to-one, with disjoint ranges. Then each terminal either comes from a single input, or a single output, but not both! This is a good picture to keep in mind. But some subtleties arise when we leave this simplest case and consider other cases.

Now, the black box functor is supposed to send our circuit to a linear relation. I’ll call the circuit $\Gamma$ for short, though it’s really the whole cospan

So, our black box functor is supposed to send this circuit to a linear relation

$\blacksquare(\Gamma) : \mathbb{R}^X \times \mathbb{R}^X \leadsto \mathbb{R}^Y \times \mathbb{R}^Y$

This is a relation between the potentials and currents at the input terminals and the potentials and currents at the output terminals! How is it defined?

I’ll start by outlining how this works.

First, our circuit picks out a subspace

$dQ \subseteq \mathbb{R}^T \times \mathbb{R}^T$

This is the subspace of allowed potentials and currents on the terminals. I’ll explain this and why it’s called $dQ$ a bit later. Briefly, it comes from the principle of minimum power, described last time.

Then, the map

$i: X \to T$

gives a linear relation

$S(i) : \mathbb{R}^X \times \mathbb{R}^X \leadsto \mathbb{R}^T \times \mathbb{R}^T$

This says how the potentials and currents at the inputs are related to those at the terminals. Similarly, the map

$o: Y \to T$

gives a linear relation

$S(o) : \mathbb{R}^Y \times \mathbb{R}^Y \leadsto \mathbb{R}^T \times \mathbb{R}^T$

This says how the potentials and currents at the outputs are related to those at the terminals.

Next, we can ‘turn around’ any linear relation

$R : \mathbb{R}^Y \times \mathbb{R}^Y \leadsto \mathbb{R}^T \times \mathbb{R}^T$

to get a relation

$R^\dagger : \mathbb{R}^T \times \mathbb{R}^T \leadsto \mathbb{R}^Y \times \mathbb{R}^Y$

defined by

$R^\dagger = \{(\phi',-I',\phi,-I) : (\phi, I, \phi', I') \in R \}$

Here we are just switching the input and output potentials, but when we switch the currents we also throw in a minus sign. The reason is that we care about the current flowing in to an input, but out of an output.

Finally, one more trick: given a linear subspace

$L \subseteq V$

of a vector space $V$ we get a linear relation

$1|_L : V \leadsto V$

called the identity restricted to $L$, defined like this:

$1|_L = \{ (v, v) :\; v \in L \} \subseteq V \times V$

If $L$ is all of $V$ this relation is actually the identity function on $V.$ Otherwise it’s a partially defined function that’s defined only on $L,$ and is the identity there. (A partially defined function is an example of a relation.) My notation $1|_L$ is probably bad, but I don’t know a better one, so bear with me.

Let’s use all these ideas to define

$\blacksquare(\Gamma) : \mathbb{R}^X \times \mathbb{R}^X \leadsto \mathbb{R}^Y \times \mathbb{R}^Y$

To do this, we compose three linear relations:

$S(i) : \mathbb{R}^X \times \mathbb{R}^X \leadsto \mathbb{R}^T \times \mathbb{R}^T$

2) We compose this with

$1|_{dQ} : \mathbb{R}^T \times \mathbb{R}^T \leadsto \mathbb{R}^T \times \mathbb{R}^T$

3) Then we compose this with

$S(o)^\dagger : \mathbb{R}^T \times \mathbb{R}^T \leadsto \mathbb{R}^Y \times \mathbb{R}^Y$

Note that:

1) says how the potentials and currents at the inputs are related to those at the terminals,

2) picks out which potentials and currents at the terminals are actually allowed, and

3) says how the potentials and currents at the terminals are related to those at the outputs.

So, I hope all makes sense, at least in some rough way. In brief, here’s the formula:

$\blacksquare(\Gamma) = S(o)^\dagger \; 1|_{dQ} \; S(i)$

Now I just need to fill in some details. First, how do we define $S(i)$ and $S(o)?$ They work exactly the same way, by ‘copying potentials and adding currents’, so I’ll just talk about one. Second, how do we define the subspace $dQ?$ This uses the principle of minimum power.

### Duplicating potentials and adding currents

Any function between finite sets

$i: X \to T$

gives a linear map

$i^* : \mathbb{R}^T \to \mathbb{R}^X$

Mathematicians call this linear map the pullback along $i,$ and for any $\phi \in \mathbb{R}^T$ it’s defined by

$i^*(\phi)(x) = \phi(i(x))$

In our application, we think of $\phi$ as a list of potentials at terminals. The function $i$ could map a bunch of inputs to the same terminal, and the above formula says the potential at this terminal gives the potential at all those inputs. So, we are copying potentials.

We also get a linear map going the other way:

$i_* : \mathbb{R}^X \to \mathbb{R}^T$

Mathematicians call this the pushforward along $i,$ and for any $I \in \mathbb{R}^X$ it’s defined by

$\displaystyle{ i_*(I)(t) = \sum_{x \; : \; i(x) = t } I(x) }$

In our application, we think of $I$ as a list of currents entering at some inputs. The function $i$ could map a bunch of inputs to the same terminal, and the above formula says the current at this terminal is the sum of the currents at all those inputs. So, we are adding currents.

Putting these together, our map

$i : X \to T$

gives a linear relation

$S(i) : \mathbb{R}^X \times \mathbb{R}^X \leadsto \mathbb{R}^T \times \mathbb{R}^T$

where the pair $(\phi, I) \in \mathbb{R}^X \times \mathbb{R}^X$ is related to the pair $(\phi', I') \in \mathbb{R}^T \times \mathbb{R}^T$ iff

$\phi = i^*(\phi')$

and

$I' = i_*(I)$

So, here’s the rule of thumb when attaching the points of $X$ to the input terminals of our circuit: copy potentials, but add up currents. More formally:

$\begin{array}{ccl} S(i) &=& \{ (\phi, I, \phi', I') : \; \phi = i^*(\phi') , \; I' = i_*(I) \} \\ \\ &\subseteq& \mathbb{R}^X \times \mathbb{R}^X \times \mathbb{R}^T \times \mathbb{R}^T \end{array}$

### The principle of minimum power

Finally, how does our circuit define a subspace

$dQ \subseteq \mathbb{R}^T \times \mathbb{R}^T$

of allowed potential-current pairs at the terminals? The trick is to use the ideas we discussed last time. If we know the potential at all nodes of our circuit, say $\phi \in \mathbb{R}^N$, we know the power used by the circuit:

$P(\phi) = \displaystyle{ \sum_{e \in E} \frac{1}{r_e} \big(\phi(s(e)) - \phi(t(e))\big)^2 }$

We saw last time that if we fix the potentials at the terminals, the circuit will choose potentials at the other nodes to minimize this power. We can describe the potential at the terminals by

$\psi \in \mathbb{R}^T$

So, the power for a given potential at the terminals is

$Q(\psi) = \displaystyle{ \frac{1}{2} \min_{\phi \in \mathbb{R}^N \; : \; \phi|_T = \psi} \sum_{e \in E} \frac{1}{r_e} \big(\phi(s(e)) - \phi(t(e))\big)^2 }$

Actually this is half the power: I stuck in a factor of 1/2 for some reason we’ll soon see. This $Q$ is a quadratic function

$Q : \mathbb{R}^T \to \mathbb{R}$

so its derivative is linear. And, our work last time showed something interesting: to compute the current $J_x$ flowing into a terminal $x \in T,$ we just differentiate $Q$ with respect to the potential at that terminal:

$\displaystyle{ J_x = \frac{\partial Q(\psi)}{\partial \psi_x} }$

This is the reason for the 1/2: when we take the derivative of $Q,$ we bring down a 2 from differentiating all those squares, and to make that go away we need a 1/2.

The space of allowed potential-current pairs at the terminals is thus the linear subspace

$dQ = \{ (\psi, J) : \; \displaystyle{ J_x = \frac{\partial Q(\psi)}{\partial \psi_x} \} \subseteq \mathbb{R}^T \times \mathbb{R}^T }$

And this completes our precise description of the black box functor!

The hard part is this:

Theorem. $\blacksquare : \mathrm{ResCirc} \to \mathrm{LinRel}$ is a functor.

In other words, we have to prove that it preserves composition:

$\blacksquare(fg) = \blacksquare(f) \blacksquare(g)$

For that, read our paper:

• John Baez and Brendan Fong, A compositional framework for passive linear networks.

## Sensing and Acting Under Information Constraints

30 October, 2014

I’m having a great time at a workshop on Biological and Bio-Inspired Information Theory in Banff, Canada. You can see videos of the talks online. There have been lots of good talks so far, but this one really blew my mind:

• Naftali Tishby, Sensing and acting under information constraints—a principled approach to biology and intelligence, 28 October 2014.

Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was great!

It was about the ‘action-perception loop':

This is the feedback loop in which living organisms—like us—take actions depending on our goals and what we perceive, and perceive things depending on the actions we take and the state of the world.

How do we do this so well? Among other things, we need to balance the cost of storing information about the past against the payoff of achieving our desired goals in the future.

Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.

It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.

### Rate-distortion theory

Rate-distortion theory is a branch of information theory which seeks to find the minimum rate at which bits must be communicated over a noisy channel so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.

For the basic definitions and a statement of the theorem, try this:

• Natasha Devroye, Rate-distortion theory, course notes, University of Chicago, Illinois, Fall 2009.

One of the people organizing this conference is a big expert on rate-distortion theory, and he wrote a book about it.

• Toby Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice–Hall, 1971.

Unfortunately it’s out of print and selling for \$259 used on Amazon! An easier option might be this:

• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Chapter 10: Rate Distortion Theory, Wiley, New York, 2006.

### The Bellman equation

The Bellman equation reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation says what to do at each time step.

If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the Euler–Lagrange equation.

And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.

Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!

Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point $q$ in space at time $t.$ If we call this least possible action $W(t,q),$ the Hamilton–Jacobi equation says

$\displaystyle{ \frac{\partial W(t,q)}{\partial q_i} = p_i }$

$\displaystyle{ \frac{\partial W(t,q)}{\partial t} = -E }$

where $p$ is the particle’s momentum at the endpoint of its path, and $E$ is its energy there.

If we replace derivatives by differences, and talk about maximizing total reward instead of minimizing action, we get Bellman’s equation:

Bellman equation, Wikipedia.

### Markov decision processes

Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.

In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set $S$ of states, and time proceeds in discrete steps. At each time step, the probability to hop from state $s$ to state $s'$ is some fixed number $P(s,s').$

This sort of thing is called a Markov chain, or if you feel the need to be more insistent, a discrete-time Markov chain.

A Markov decision process is a generalization where an outside agent gets to change these probabilities! The agent gets to choose actions from some set $A.$ If at a given time he chooses the action $\alpha \in A,$ the probability of the system hopping from state $s$ to state $s'$ is $P_\alpha(s,s').$ Needless to say, these probabilities have to sum to one for any fixed $s.$

That would already be interesting, but the real fun is that there’s also a reward $R_\alpha(s,s').$ This is a real number saying how much joy or misery the agent experiences if he does action $\alpha$ and the system hops from $s$ to $s'.$

The problem is to choose a policy—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!

If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward $n$ time steps in the future gets multiplied by $\gamma^n,$ where $0 < \gamma \le 1$ is some discount factor. This extra tweak is easily handled, and you can see it all here:

Markov decision process, Wikipedia.

### Partially observable Markov decision processes

There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action $\alpha \in A$ and the system hops from state $s$ to state $s',$ he sees something: a point in some set $O$ of observations. He makes the observation $o \in O$ with probability $\Omega_\alpha(o,s').$

(I don’t know why this probability depends on $s'$ but not $s.$ Maybe it ultimately doesn’t matter much.)

Again, the goal is to choose a policy that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.

Partially observable Markov decision processes are also called POMPDs. If you want to learn about them, try these:

Partially observable Markov decision process, Wikipedia.

• Tony Cassandra, Partially observable Markov decision processes.

The latter includes an introduction without any formulas to POMDPs and how to choose optimal policies. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).

### The action-perception loop

I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:

Nafthali Tishby has a nice picture of it that’s more abstract:

We’re assuming time comes in discrete steps, just to keep things simple.

At each time $t$

• the world has some state $W_t,$ and
• the agent has some state $M_t.$

Why the letter $M$? This stands for memory: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.

At each time

• the agent takes an action $A_t,$ which affects the world’s next state, and

• the world provides a sensation $S_t$ to the agent, which affect’s the agent’s next state.

This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!

We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they

• get a reward depending on the action taken and the resulting state of the world

but

• pay a price for the information stored from sensations.

Tishby formulates this optimization question as something like a partially observed Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!

So, everything I sketched fits together somehow!

I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.

For details, I highly recommend this paper:

• Naftali Tishby and Daniel Polani, Information theory of decisions and actions, in Perception-Reason-Action Cycle: Models, Algorithms and System. Vassilis, Hussain and Taylor, Springer, Berlin, 2010.

I’m going to print this out, put it by my bed, and read it every night until I’ve absorbed it.

## Biodiversity, Entropy and Thermodynamics

27 October, 2014

I’m giving a short 30-minute talk at a workshop on Biological and Bio-Inspired Information Theory at the Banff International Research Institute.

I’ll say more about the workshop later, but here’s my talk, in PDF and video form:

Most of the people at this workshop study neurobiology and cell signalling, not evolutionary game theory or biodiversity. So, the talk is just a quick intro to some things we’ve seen before here. Starting from scratch, I derive the Lotka–Volterra equation describing how the distribution of organisms of different species changes with time. Then I use it to prove a version of the Second Law of Thermodynamics.

This law says that if there is a ‘dominant distribution’—a distribution of species whose mean fitness is at least as great as that of any population it finds itself amidst—then as time passes, the information any population has ‘left to learn’ always decreases!

Of course reality is more complicated, but this result is a good start.

This was proved by Siavash Shahshahani in 1979. For more, see:

• Lou Jost, Entropy and diversity.

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.