This time I explain more about how ‘cospans’ represent gadgets with two ends, an input end and an output end:
I describe how to glue such gadgets together by composing cospans. We compose cospans using a category-theoretic construction called a ‘pushout’, so I also explain pushouts. At the end, I explain how this gives us a category where the morphisms are electrical circuits made of resistors, and sketch what we’ll do next: study the behavior of these circuits.
These lecture notes provide extra details:
• Y. Berezin, Avi Gozolchiani, O. Guez and Shlomo Havlin, Stability of climate networks with time, Scientific Reports 2 (2012).
Some of the authors are the same, and the way they define climate networks is very similar. But their goal here is different: they want to see how stable climate networks are over time. This matters, because the other paper aims to predict El Niños by detecting changes in climate networks.
They divide the world into 9 zones:
For each zone they construct several climate networks. Each one is an array of numbers, one for each year and each pair of grid points in that zone. They call each of these numbers a link strength: it’s a measure of how correlated the weather is at those two grid points during that year.
I’ll say more later about how they compute these link strengths. In Part 3 we explained one method for doing it. This paper uses a similar but subtly different method.
The paper’s first big claim is that the link strengths don’t change much from year to year, “in complete contrast” to the pattern of local daily air temperature and pressure fluctuations. In simple terms: the strength of the correlation between the weather at two different points tends to be quite stable.
Moreover, the definition of link strength involves an adjustable time delay. We can measure the correlation between the weather at one point at any given time and the weather at another point some number of days later. The link strength is computed by taking a maximum over these time delays. Naively speaking, the delay that gives the maximum correlation is “how long it typically takes for weather at the first point to affect weather at the second”. Or the other way around, if the delay is negative.
This is a naive way of explaining the idea, because I’m mixing up correlation with causation. But you get the idea, I hope.
Their second big claim is that when the link strength between two points is big, the delay that gives the maximum correlation doesn’t change much from year to year. In simple terms: if the weather at two locations is strongly correlated, the amount of time it takes for weather at one location to reach the other doesn’t change very much.
How do Berezin et al define their climate network?
They use data obtained from here:
This is not exactly the same data set that Ludescher et al use, namely:
“Reanalysis 2” is a newer attempt to reanalyze and fix up the same pile of data. That’s a very interesting issue, but never mind that now!
Berezin et al use data for:
• the geopotential height for six different pressures
and
• the air temperature at those different heights
The geopotential height for some pressure says roughly how high you have to go for air to have that pressure. Click the link if you want a more precise definition! Here’s the geopotential height field for the pressure of 500 millibars on some particular day of some particular year:
The height is in meters.
Berezin et al use daily values for this data for:
• locations world-wide on a grid with a resolution of 5° × 5°,
during:
• the years from 1948 to 2006.
They divide the globe into 9 zones, and separately study each zone:
So, they’ve got twelve different functions of space and time, where space is a rectangle discretized using a 5° × 5° grid, and time is discretized in days. From each such function they build a ‘climate network’.
How do they do it?
Berezin et al’s method of defining a climate network is similar to Ludescher et al’s, but not identical. Compare Part 3 if you want to think about the difference.
Let be any one of their functions, evaluated at the grid point on day of year .
Let be minus its climatological average. For example, if is June 1st and is 1970, we average the temperature at location over all June 1sts from 1948 to 2006, and subtract that from to get . In other words:
where is the number of years considered.
For any function of time, we also consider its average over all days in a given year. This is different from the ‘running average’ used by Ludescher et al, and I can’t even be 100% sure that Berezin et al mean what I just said: their notation doesn’t settle it.
Let and be two grid points, and any number of days in the interval . Define the cross-covariance function at time by:
I believe Berezin et al mean to consider this quantity, because they mention two grid points. Their notation omits the corresponding subscripts, so it is impossible to be completely sure what they mean! But what I wrote is the reasonable quantity to consider here, so I’ll assume this is what they meant.
They normalize this quantity and take its absolute value, forming:
They then take the maximum value of this quantity over the allowed delays, subtract its mean over delays in this range, and divide by the standard deviation. They write something like this:
and say that the maximum, mean and standard deviation are taken over the (not explicitly written) delay variable.
Each of these numbers is called a link strength. For each year, the matrix of link strengths, with rows and columns indexed by the grid points in our zone, is called a climate network.
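To make this recipe less abstract, here is a toy version of a link-strength computation in code. This is only a sketch of the general idea, not the authors’ exact procedure: the anomaly subtraction, the normalization, and the delay range used by Berezin et al differ in details glossed over here.

```python
import math

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def link_strength(f_i, f_j, max_delay=20):
    """Toy link strength: standardized maximum of the lagged correlation.

    For each delay tau, correlate f_i at time t with f_j at time t + tau,
    take the absolute value, then compute (max - mean) / std, where the
    max, mean and standard deviation are all taken over the delay variable.
    """
    corrs = []
    for tau in range(-max_delay, max_delay + 1):
        if tau > 0:
            c = corr(f_i[:-tau], f_j[tau:])
        elif tau < 0:
            c = corr(f_i[-tau:], f_j[:tau])
        else:
            c = corr(f_i, f_j)
        corrs.append(abs(c))
    n = len(corrs)
    mean = sum(corrs) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in corrs) / n)
    return (max(corrs) - mean) / std
```

A pair of series where one lags the other by a few days gets a large link strength, since one delay stands far above the background; two independent noise series get a small one.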
We can think of a climate network as a weighted complete graph with the grid points as nodes. Remember, an undirected graph is one without arrows on the edges. A complete graph is an undirected graph with one edge between any pair of nodes:
A weighted graph is an undirected graph where each edge is labelled by a number called its weight. But right now we’re also calling the weight the ‘link strength’.
A lot of what’s usually called ‘network theory’ is the study of weighted graphs. You can learn about it here:
• Ernesto Estrada, The Structure of Complex Networks: Theory and Applications, Oxford U. Press, Oxford, 2011.
Suffice it to say that given a weighted graph, there are a lot of quantities you can compute from it, which are believed to tell us interesting things!
I will not delve into the real meat of the paper, namely what they actually do with their climate networks! The paper is free online, so you can read that part yourself.
I will just quote their conclusions and show you a couple of graphs.
The conclusions touch on an issue that’s important for the network-based approach to El Niño prediction. If climate networks are ‘stable’, not changing much in time, why would we use them to predict a time-dependent phenomenon like the El Niño Southern Oscillation?
We have established the stability of the network of connections between the dynamics of climate variables (e.g. temperatures and geopotential heights) in different geographical regions. This stability stands in fierce contrast to the observed instability of the original climatological field pattern. Thus the coupling between different regions is, to a large extent, constant and predictable. The links in the climate network seem to encapsulate information that is missed in analysis of the original field.
The strength of the physical connection, , that each link in this network represents, changes only between 5% to 30% over time. A clear boundary between links that represent real physical dependence and links that emerge due to noise is shown to exist. The distinction is based on both the high link average strength and on the low variability of time delays .
Recent studies indicate that the strength of the links in the climate network changes during the El Niño Southern Oscillation and the North Atlantic Oscillation cycles. These changes are within the standard deviation of the strength of the links found here. Indeed in Fig. 3 it is clearly seen that the coefficient of variation of links in the El Niño basin (zone 9) is larger than other regions such as zone 1. Note that even in the El Niño basin the coefficient of variation is relatively small (less than 30%).
Beside the stability of single links, also the hierarchy of the link strengths in the climate network is preserved to a large extent. We have shown that this hierarchy is partially due to the two dimensional space in which the network is embedded, and partially due to pure physical coupling processes. Moreover the contribution of each of these effects, and the level of noise was explicitly estimated. The spatial effect is typically around 50% of the observed stability, and the noise reduces the stability value by typically 5%–10%.
The network structure was further shown to be consistent across different altitudes, and a monotonic relation between the altitude distance and the correspondence between the network structures is shown to exist. This yields another indication that the observed network structure represents effects of physical coupling.
The stability of the network and the contributions of different effects were summarized in specific relation to different geographical areas, and a clear distinction between equatorial and off–equatorial areas was observed. Generally, the network structure of equatorial regions is less stable and more fluctuative.
The stability and consistence of the network structure during time and across different altitudes stands in contrast to the known unstable variability of the daily anomalies of climate variables. This contrast indicates an analogy between the behavior of nodes in the climate network and the behavior of coupled chaotic oscillators. While the fluctuations of each coupled oscillators are highly erratic and unpredictable, the interactions between the oscillators is stable and can be predicted. The possible outreach of such an analogy lies in the search for known behavior patterns of coupled chaotic oscillators in the climate system. For example, existence of phase slips in coupled chaotic oscillators is one of the fingerprints for their cooperated behavior, which is evident in each of the individual oscillators. Some abrupt changes in climate variables, for example, might be related to phase slips, and can be understood better in this context.
On the basis of our measured coefficient of variation of single links (around 15%), and the significant overall network stability of 20–40%, one may speculatively assess the extent of climate change. However, for this assessment our current available data is too short and does not include enough time from periods before the temperature trends. An assessment of the relation between the network stability and climate change might be possible mainly through launching of global climate model “experiments” realizing other climate conditions, which we indeed intend to perform.
A further future outreach of our work can be a mapping between network features (such as network motifs) and known physical processes. Such a mapping was previously shown to exist between an autonomous cluster in the climate network and El Niño. Further structures without such a climate interpretation might point towards physical coupling processes which were not observed earlier.
(I have expanded some acronyms and deleted some reference numbers.)
Finally, here are two nice graphs showing the average link strength as a function of distance. The first is based on four climate networks for Zone 1, the southern half of South America:
The second is based on four climate networks for Zone 9, a big patch of the Pacific north of the Equator which roughly corresponds to the ‘El Niño basin’:
As we expect, temperatures and geopotential heights get less correlated at points that are farther apart. But the rate at which the correlation drops off conveys interesting information! Graham Jones has made some interesting charts of this for the rectangle of the Pacific that Ludescher et al use for El Niño prediction, and I’ll show you those next time.
• El Niño project (part 1): basic introduction to El Niño and our project here.
• El Niño project (part 2): introduction to the physics of El Niño.
• El Niño project (part 3): summary of the work of Ludescher et al.
• El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.
• El Niño project (part 5): how to download R and use it to get files of climate data.
• El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.
• El Niño project (part 7): the definition of El Niño.
• El Niño project (part 8): Berezin et al on the stability of climate networks.
Cospans may sound scary, but they’re not. A cospan is just a diagram consisting of an object with two morphisms going into it:
We can talk about cospans in any category. A cospan is an abstract way of thinking about a ‘chunk of stuff’ with two ‘ends’. It could be any sort of stuff: a set, a graph, an electrical circuit, a network of any kind, or even a piece of matter (in some mathematical theory of matter).
We call the object at the top the apex of the cospan and call the two morphisms the legs of the cospan. We sometimes call the two objects at the bottom the feet of the cospan. We call one foot the input and the other the output, and say the cospan goes from the input to the output, though the direction is just a convention: we can flip a cospan and get a cospan going the other way!
If you’re wondering about the name ‘cospan’, it’s because a span is a diagram like this:
Since a ‘span’ is another name for a bridge, and this diagram looks like a bridge between its two feet, category theorists called it a span! And category theorists use the prefix ‘co-’ when they turn all the arrows around. Spans came first historically, and we will use those too at times. But now let’s think about how to compose cospans.
Composing cospans is supposed to be like gluing together chunks of stuff by attaching the output of the first to the input of the second. So, we say two cospans are composable if the output of the first equals the input of the second, like this:
We then compose them by forming a new cospan going all the way from the input of the first to the output of the second:
The new object and the new morphisms are built using a process called a ‘pushout’, which I’ll explain in a minute. The result is a cospan from the input of the first to the output of the second, called the composite of the cospans we started with. Here it is:
So how does a pushout work? It’s a general construction that you can define in any category, though it only exists if the category is somewhat nice. (Ours always will be.) You start with a diagram like this:
and you want to get a commuting diamond like this:
which is in some sense ‘the best’ commuting diamond given the diagram we started with. For example, suppose we’re in the category of sets, and the object at the bottom of our diagram is a set included in both of the sets above it. Then we’d like the object at the top of the diamond to be the union of those two sets. There are other choices that would give a commuting diamond, but the union is the best. Something similar is happening when we compose circuits, but instead of the category of sets we’re using the category of labelled graphs we discussed last time.
How do we make precise the idea that this object is ‘the best’? We consider any other potential solution to this problem, that is, some other commuting diamond:
Then our object is ‘the best’ if there exists a unique morphism from it to the ‘competitor’ making the whole combined diagram commute:
This property is called a universal property: instead of saying that our object is the ‘best’, grownups say it is universal.
When an object has this universal property we call it the pushout of the original diagram. Actually we should call the whole diagram
the pushout, or a pushout square, because the morphisms matter too. The universal property is not really a property of one object alone, but of the whole pushout square. But often we’ll be sloppy and call just the top object the pushout.
Puzzle 1. Suppose we have a diagram in the category of sets
where the object at the bottom is the intersection of the two sets, and the maps are the inclusions of this intersection in those sets. Prove that the union is the pushout, or more precisely that the diagram
is a pushout square, where the two remaining maps are the inclusions of the sets in their union.
More generally, a pushout in the category of sets is a way of gluing together two sets, with the ‘overlap’ specified by the maps from a third set into each of them.
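Here is a small computational sketch of a pushout of finite sets, with assumed encodings of my own (tagged pairs for the disjoint union, union–find for the gluing). It quotients the disjoint union of the two sets by the relation identifying the two images of each element of the shared set.

```python
def pushout(A, B, C, f, g):
    """Pushout of B <- A -> C in the category of finite sets.

    f and g are dicts giving the maps A -> B and A -> C. The result is
    the disjoint union of B and C with f(a) glued to g(a) for each a in A.
    Elements are tagged ('B', b) or ('C', c) to keep the union disjoint;
    the answer is returned as a list of frozensets, the glued classes.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for b in B:
        parent[('B', b)] = ('B', b)
    for c in C:
        parent[('C', c)] = ('C', c)
    for a in A:
        parent[find(('B', f[a]))] = find(('C', g[a]))  # glue f(a) ~ g(a)
    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return [frozenset(s) for s in classes.values()]
```

When the shared set is the intersection and the maps are inclusions, as in Puzzle 1, the glued classes correspond exactly to the elements of the union.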
And this works for labelled graphs, too!
Puzzle 2. Suppose we have two circuits of resistors that are composable, like this:
and this:
These give cospans in the category where
(Remember from last time that we’re working in the category of graphs with edges labelled by elements of some fixed set.) Show that if we compose these cospans we get a cospan corresponding to this circuit:
If you’re a mathematician you might find it easier to solve this kind of problem in general, which requires pondering how pushouts work in this category of labelled graphs. Alternatively, you might find it easier to think about this particular example: then you can just check that the answer we want has the universal property of a pushout!
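If you’d like to experiment, here is a toy sketch of composing circuits-as-cospans, with a representation of my own choosing: a circuit is a set of labelled edges together with lists of input and output nodes, and we glue by identifying the outputs of the first circuit with the inputs of the second. This is what the pushout amounts to when the legs are injective, as in these examples.

```python
def compose(c1, c2):
    """Glue the outputs of circuit c1 to the inputs of circuit c2.

    A circuit is a dict with:
      'edges':   list of (node, node, label) triples (label = resistance),
      'inputs':  list of boundary nodes on the input side,
      'outputs': list of boundary nodes on the output side.
    Nodes of c2 are renamed to keep the two node sets disjoint, except that
    c2's i-th input is identified with c1's i-th output: that's the gluing.
    """
    nodes2 = set(c2['inputs']) | set(c2['outputs'])
    for u, v, _ in c2['edges']:
        nodes2 |= {u, v}
    rename = {n: ('glued', n) for n in nodes2}
    for out, inp in zip(c1['outputs'], c2['inputs']):
        rename[inp] = out
    edges2 = [(rename[u], rename[v], r) for u, v, r in c2['edges']]
    return {'edges': list(c1['edges']) + edges2,
            'inputs': list(c1['inputs']),
            'outputs': [rename[n] for n in c2['outputs']]}
```

For two single resistors, one from a to b and one from x to y, the composite has two edges in series meeting at the glued node, as one would hope.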
If this stuff seems complicated, well, just know that category theory is a very general, powerful tool and I’m teaching you just the microscopic fragment of it that we need right now. Category theory ultimately seems very simple: I can’t really think of any math that’s simpler! It only seems complicated when it’s unfamiliar and you have a fragmentary view of it.
So where are we? We know that circuits made of resistors are a special case of cospans. We know how to compose cospans. So, we know how to compose circuits… and in the last puzzle, we saw this does just what we want.
The advantage of this rather highbrow approach is that a huge amount is known about composing cospans! In particular, suppose we have any category where pushouts exist: that is, where we can always complete any diagram like this:
to a pushout square. Then we can form a category where:
• an object is an object of our original category
• a morphism from one object to another is an equivalence class of cospans from the first to the second
• we compose cospans in the manner just described.
Why did I say ‘equivalence class’? It’s because the pushout is not usually unique. It’s unique only up to isomorphism. So, composing cospans would be ill-defined unless we work with some kind of equivalence class of cospans.
To be precise, suppose we have two cospans with the same input and the same output:
Then a map of cospans from one to the other is a commuting diagram like this:
We say that this is an isomorphism of cospans if the morphism between the apexes is an isomorphism.
This gives our equivalence relation on cospans! It’s an old famous theorem in category theory—so famous that it’s hard to find a reference for the proof—that whenever we have a category with pushouts, we get a category where:
• an object is an object of our original category
• a morphism from one object to another is an isomorphism class of cospans from the first to the second
• we compose isomorphism classes of cospans by picking representatives, composing them and then taking the isomorphism class of the result.
This takes some work to prove, but it’s true, so this is how we get our category of circuits!
Next time we’ll do something with this category. Namely, we’ll cook up a category of ‘behaviors’. The behavior of a circuit made of resistors just says which currents and potentials its terminals can have. If we put a circuit in a metaphorical ‘black box’ and refuse to peek inside, all we can see is its behavior.
Then we’ll cook up a functor from the category of circuits to the category of behaviors. We’ll call this the ‘black box functor’. Saying that it’s a functor mainly means that
Here the symbols stand for two circuits that we can compose, and for their composite. The black square is the black box functor, which applied to a circuit gives its behavior. There’s a way to compose behaviors, too, and the equation above says that the behavior of the composite circuit is the composite of the behaviors of its pieces!
This is very important, because it says we can figure out what a big circuit does if we know what its pieces do. And this is one of the grand themes of network theory: understanding big complicated networks by understanding their pieces. We may not always be able to do this, in practice! But it’s something we’re always concerned with.
Check out this video! I start with a quick overview of network theory, and then begin building a category where the morphisms are electrical circuits. These lecture notes provide extra details:
With luck, this video will be the first of a series. I’m giving a seminar on network theory at U.C. Riverside this fall. I’ll start by sketching the results here:
• John Baez and Brendan Fong, A compositional framework for passive linear networks.
But this is a big paper, and I also want to talk about other papers, so I certainly won’t explain everything in here—just enough to help you get started! If you have questions, don’t be shy about asking them.
I thank Blake Pollard for filming this seminar, and Muhammad “Siddiq” Siddiqui-Ali for providing the videocamera and technical support.
We used to talk about reaction networks quite a lot here. When Arjun Jain was visiting the CQT, we made a lot of progress understanding how the master equation reduces to the rate equation in the limit where there are very large numbers of things of each kind. But we never told you the end of the story, and by now it’s been such a long time that you may need a reminder of some basic facts!
So…
The rate equation treats the number of things of each kind as continuous—a nonnegative real number—and says how it changes in a deterministic way.
The master equation treats the number of things of each kind as discrete—a nonnegative integer—and says how it changes in a probabilistic way.
You can think of the master equation as the ‘true’ description, and the rate equation as an approximation that’s good in some limit where there are large numbers of molecules — or more precisely, where the probability distribution of having some number of molecules of each kind is sharply peaked near some large value.
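Here is a minimal numerical illustration of the relationship, using an example reaction of my own choosing, not from the text: for the decay reaction X → ∅ with rate constant k, the master equation is simple enough to evolve directly, and since the rate is linear in the number of items, the mean of the master equation’s distribution tracks the rate equation essentially exactly.

```python
import math

def evolve_master(p, k, dt, steps):
    """Evolve the decay master equation dP(n)/dt = k (n+1) P(n+1) - k n P(n)
    by explicit Euler steps. p is a list with p[n] = probability of n items."""
    p = list(p)
    N = len(p)
    for _ in range(steps):
        new = list(p)
        for n in range(N):
            inflow = k * (n + 1) * p[n + 1] if n + 1 < N else 0.0
            outflow = k * n * p[n]
            new[n] += dt * (inflow - outflow)
        p = new
    return p

def mean(p):
    """Expected number of items under the distribution p."""
    return sum(n * pn for n, pn in enumerate(p))

# start with exactly 50 items, decay rate k = 0.1, evolve for time t = 1
p0 = [0.0] * 61
p0[50] = 1.0
k, t, steps = 0.1, 1.0, 10000
p = evolve_master(p0, k, t / steps, steps)
x = 50 * math.exp(-k * t)   # rate equation solution of dx/dt = -k x
# mean(p) and x agree closely, since this reaction has a linear rate
```

For reactions with nonlinear rates the two descriptions agree only approximately, in the large-number limit: that gap is exactly what the discussion below is about.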
You may remember that in the master equation, the state of a chemical system is described by a vector in a kind of ‘Fock space’, while time evolution is described with the help of an operator on this space, called the ‘Hamiltonian’ :
The Hamiltonian is built from annihilation and creation operators, so all of this looks very much like quantum field theory. The details are here, and we won’t try to review them all:
• John Baez and Jacob Biamonte, Quantum Techniques for Stochastic Mechanics.
The point is this: the ‘large-number limit’ where the master equation reduces to the rate equation smells a lot like the ‘classical limit’ of quantum field theory, where the description of light in terms of photons reduces to the good old Maxwell equations. So, maybe we can understand the large-number limit by borrowing techniques from quantum field theory!
How do we take the classical limit of quantum electromagnetism and get the classical Maxwell equations? For simplicity let’s ignore charged particles and consider the ‘free electromagnetic field’: just photons, described by the quantum version of Maxwell’s equations. When we take the classical limit we let Planck’s constant go to zero: that much is obvious. However, that’s not all! The energy of each photon is proportional to Planck’s constant, so to take the classical limit and get a solution of the classical Maxwell’s equations with nonzero energy we also need to increase the number of photons. We cleverly do this in such a way that the total energy remains constant as Planck’s constant goes to zero.
So, in quantum electromagnetism the classical limit is also a large-number limit!
That’s a good sign. It suggests the same math will also apply to our reaction network problem.
But then we hit an apparent roadblock. What’s the analogue of Planck’s constant in chemical reaction networks? What should go to zero?
We told you the answer to that puzzle a while ago: it’s the reciprocal of Avogadro’s number!
You see, chemists measure numbers of molecules in ‘moles’. There’s a large number of molecules in each mole: Avogadro’s number. If we let the reciprocal of Avogadro’s number go to zero, we are taking a limit where chemistry becomes ‘continuous’ and the discreteness of molecules goes away. Of course this is just a mathematical trick, but it’s a very useful one.
So, we got around that roadblock. And then something nice happened.
When taking the classical limit of quantum electromagnetism, we focus attention on certain quantum states that are the ‘best approximations’ to classical states. These are called ‘coherent states’, and it’s very easy to study how they behave as we simultaneously let Planck’s constant go to zero and let the expected number of photons go to infinity.
And the nice thing is that these coherent states are also important in chemistry! But because chemistry involves probabilities rather than amplitudes, they have a different name: ‘Poisson distributions’. On this blog, Brendan Fong used them to give a new proof of a great result in mathematical chemistry, the Anderson–Craciun–Kurtz theorem.
So, we have most of the concepts and tools in place, and we can tackle the large-number limit using quantum techniques.
You can review the details here:
• The large-number limit for reaction networks (part 1).
• The large-number limit for reaction networks (part 2).
So, after a quick refresher on the notation, we’ll plunge right in.
As you’ll see, we solve the problem except for one important technical detail: passing a derivative through a limit! This means our main result is not a theorem. Rather, it’s an idea for how to prove a theorem. Or if we act like physicists, we can call it a theorem.
The rate equation says
where:
• is a vector describing concentrations of different species at time In chemistry these species could be different kinds of molecules.
• Each is a transition, or in chemistry, a reaction.
• is a vector of natural numbers saying how many items of each species appear as inputs to the reaction. This is called the source of the reaction.
• is a vector of natural numbers saying how many items of each species appear as outputs of the reaction. This is called the target of the reaction. So, the difference of the target and the source says the net change in the number of items of each species due to the reaction.
• The rate at which the reaction occurs is proportional to the rate constant times the number
Here we are raising a vector to a vector power and getting a number, using this rule:
where is any vector of natural numbers and is any vector of nonnegative real numbers. From now on we’ll call a vector of natural numbers a multi-index.
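In code, the rule for raising a vector to a multi-index power is just a product (a small numerical illustration; the names are mine):

```python
def vec_pow(x, m):
    """x raised to the multi-index power m: the product over i of x[i] ** m[i]."""
    result = 1.0
    for xi, mi in zip(x, m):
        result *= xi ** mi
    return result
```

For instance, `vec_pow([2.0, 3.0], [1, 2])` is 2 × 3² = 18.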
In this paper:
• John Baez, Quantum techniques for reaction networks.
it was shown that the master equation implies
Here:
• is the stochastic state saying the probability of having any particular number of items of each species at each time. We won’t review the precise details of how this works; for that, reread the relevant bit of Part 8.
• is the th number operator, defined using annihilation and creation operators as in quantum mechanics:
For the annihilation and creation operators, see Part 8.
• is the expected number of items of the th species at time
• Similarly, is the expected value of a certain product of operators. For any multi-index we define the falling power
and then we define
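Numerically, falling powers and their multi-index versions look like this (here applied to ordinary numbers; the text applies them to number operators):

```python
def falling_power(x, m):
    """The falling power x (x-1) (x-2) ... (x-m+1), for a natural number m."""
    result = 1.0
    for k in range(m):
        result *= x - k
    return result

def multi_falling_power(x, m):
    """Product over i of the falling power of x[i] to the m[i], m a multi-index."""
    result = 1.0
    for xi, mi in zip(x, m):
        result *= falling_power(xi, mi)
    return result
```

For instance, `falling_power(5, 3)` is 5 × 4 × 3 = 60.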
Okay. Even if you don’t understand any of what we just said, you’ll see the master and rate equation look similar. The master equation implies this:
while the rate equation says this:
So, we will try to get from the first to the second with the help of a ‘large-number limit’.
We start with a few definitions. We introduce an adjustable dimensionless quantity which, continuing the analogy, we call ℏ. This is just a positive number, which has nothing to do with quantum theory except that we’re using a mathematical analogy to quantum mechanics to motivate everything we’re doing.
Definition. The rescaled number operators are defined as This can be thought of as a rescaling of the number of objects, so that instead of counting objects individually, we count them in bunches of size
Definition. For any multi-index we define the rescaled falling power of the number operator by:
and also define
for any multi-index
Using these, we get the following equation:
where for any multi-index we set
This suggests a way to rescale the rate constants in the master equation:
Definition. The rescaled rate constants are
From here onwards, we change our viewpoint. We consider the rescaled rate constants to be fixed, instead of the original rate constants. So, as we decrease our new parameter, we are studying situations where the original rate constants change in just the way needed to keep the rescaled rate constants fixed!
So, we switch to working with a rescaled master equation:
Definition. The rescaled master equation is:
This is really a one-parameter family of equations, depending on We write a solution of the rescaled master equation as but it is really one solution for each value of
Following the same procedure as above, we can rescale the rate equation, using the same definition of the rescaled rate constants:
Definition. The rescaled number of objects of the species is defined as where is the original number of objects of the species. Here again, we are counting in bunches of
Using this to rescale the rate equation, we get
Definition. The rescaled rate equation is
where
Therefore, to go from the rescaled master equation to the rescaled rate equation, we require that
as If this holds, we can identify with and get the rate equation from the master equation!
To this end, we introduce the following crucial idea:
Definition. A semiclassical family of states, is defined as a one-parameter family of states depending on such that for some we have
for every as
In particular, this implies
for every index
Intuitively, a semiclassical family is a family of probability distributions that becomes more and more sharply peaked with a larger and larger mean as decreases. We would like to show that in this limit, the rescaled master equation gives the rescaled rate equation.
We make this precise in the following propositions.
Proposition 1. If is a semiclassical family as defined above, then in the limit, we have as well.
Proof. For each index
By the definition of a semiclassical family,
since every term but the first approaches zero. Thus, we have
A similar but more elaborate calculation shows that
or in other words
as █
Proposition 2. If is a semiclassical family of states, then
for any multi-index
Proof. Consider the centered moment of the number operator:
Taking the limit as goes to zero, this becomes
For a general multi-index we can prove the same sort of thing with a more elaborate calculation. First note that
The right-hand side can be expanded as
We can write this more tersely as
where for any multi-index we define
and for any multi-indices we define
Now using the definition of a semiclassical state, we see
But this equals zero, as the last expression, expanded, is
where each individual sum is zero. █
Here is the theorem that would finish the job if we could give a fully rigorous proof:
“Theorem.” If is a solution of the rescaled master equation and also a semiclassical family for the time interval, then is a solution of the rescaled rate equation for
Proof sketch. We sketch a proof that relies on the assumption that we can pass the limit through a time derivative. Of course, to make this rigorous, we would need to justify this. Perhaps it is true only in certain cases.
Assuming that we can pass the limit through the derivative:
and thus
and thus
As expected, we obtain the rescaled rate equation. █
Another question is this: if we start with a semiclassical family of states as our initial data, does it remain semiclassical as we evolve it in time? This will probably be true only in certain cases.
The best-behaved semiclassical states are the coherent states.
Consider the family of coherent states
using the notation developed in the paper mentioned earlier. In that paper it was shown that for any multi-index and any coherent state we have
Using this result for we get
Since equals plus terms of order as we have
showing that our chosen is indeed a semiclassical family.
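The fact underlying this computation, that for a Poisson distribution the expected falling powers equal powers of the mean, is easy to check numerically:

```python
import math

def poisson_pmf(n, c):
    """Probability that a Poisson random variable with mean c equals n."""
    return math.exp(-c) * c ** n / math.factorial(n)

def expected_falling_power(c, m, cutoff=100):
    """E[N (N-1) ... (N-m+1)] for N ~ Poisson(c), summing up to a cutoff.

    For a Poisson distribution this expected falling power is exactly c**m
    (up to the tiny truncation error from the cutoff).
    """
    total = 0.0
    for n in range(cutoff):
        fp = 1.0
        for k in range(m):
            fp *= n - k
        total += fp * poisson_pmf(n, c)
    return total
```

For example, `expected_falling_power(3.0, 2)` is 9.0 up to truncation error, matching c² with c = 3.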
In the last three episodes I explained how electrical circuits made of resistors, inductors and capacitors are a great metaphor for many kinds of complex systems made of interacting parts. And it’s not just a metaphor; there are mathematically rigorous analogies—in other words, isomorphisms—between such electrical circuits and various other kinds of ‘passive linear networks’.
I showed you a chart of these analogies last time:
|                        | displacement | flow             | momentum             | effort             |
|------------------------|--------------|------------------|----------------------|--------------------|
| Mechanics: translation | position     | velocity         | momentum             | force              |
| Mechanics: rotation    | angle        | angular velocity | angular momentum     | torque             |
| Electronics            | charge       | current          | flux linkage         | voltage            |
| Hydraulics             | volume       | flow             | pressure momentum    | pressure           |
| Thermal Physics        | entropy      | entropy flow     | temperature momentum | temperature        |
| Chemistry              | moles        | molar flow       | chemical momentum    | chemical potential |
But what do I mean by a ‘passive linear network’? Let me explain this very roughly at first, since we’ll be painfully precise later on.
Right now by ‘network’ I mean a graph with gizmos called ‘components’ on the edges. For example:
In a network there is some kind of ‘flow’ running along each edge, and also some kind of ‘effort’ across that edge. For example, in electronics the flow is electrical current and the effort is voltage. The chart shows the meaning of flow and effort in other examples.
‘Passivity’ means roughly that none of the components put out energy that didn’t earlier come in. For example, resistors lose energy (which goes into heat, which we’re not counting). Capacitors can store energy and later release it. So, resistors and capacitors are passive—and so are inductors. But batteries and current sources actually put out energy, so we won’t allow them in our networks yet. For now, we’re just studying how passive components respond to a source of flow or effort.
For some subtleties that show up when you try to make the concept of passivity precise, try:
• Passivity (engineering), Wikipedia.
Finally, ‘linearity’ means that the flow along each edge of our network is linearly related to the effort across that edge. Here are the key examples:
• For electrical resistors, linearity is captured by Ohm’s law. If an edge in our network is labelled by a resistor of resistance R, usually drawn like this:
then Ohm’s law says:
V = RI
where V is the voltage across that edge and I is the current along that edge.
• If our edge is labelled by an inductor of inductance L:
we have
V = L dI/dt
Here we need to think of the voltage V and current I as functions of time.
• If our edge is labelled by a capacitor of capacitance C:
we write the equation
I = C dV/dt
where again we think of the voltage and current as functions of time.
Both linearity and passivity are simplifying assumptions that we eventually want to drop. If we include batteries or current sources, we’re dropping passivity. And if we include transistors, we’re dropping linearity. Obviously both of these are important!
However, there is a lot to do even with these simplifying assumptions. And now it’s time to get started!
In what follows, I will not use the terms ‘flow’ and ‘effort’, which are chosen to be neutral and subject-independent. Instead, I’ll use the vocabulary from electronics, e.g. ‘current’ and ‘voltage’. The reason is that we’ve all heard of resistors, capacitors, Ohm’s law and Kirchhoff’s laws, and while these have analogues in every row of the chart, it seems pointless to make up weird new ‘neutral’ terms for all these concepts.
But don’t be fooled by the jargon! We’re not merely studying electrical circuits. We’ll be studying passive linear networks in full generality… with the help of category theory.
To get going, let’s think about circuits made of resistors. We can do this without harm, because we’ll later include capacitors and inductors using a simple effortless trick. Namely, we’ll generalize the ‘resistance’ of a resistor, which is a real number, to something called ‘impedance’, which is an element of some larger field. Everything will be so abstract that replacing resistances with impedances will be as easy as snapping our fingers.
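To make the finger-snapping concrete, here is a small Python sketch of my own (not from the paper) showing how the three component laws above collapse into the single law V = ZI in the frequency domain, with the ‘impedance’ Z playing the role of a generalized resistance:

```python
# Sketch: in the frequency domain, the laws V = RI, V = L dI/dt and
# I = C dV/dt all take the common form V = Z * I, where the impedance Z
# is a complex number depending on the angular frequency omega.

def impedance(component: str, value: float, omega: float) -> complex:
    """Impedance of a single passive component at angular frequency omega."""
    if component == "resistor":    # Ohm's law: V = R I
        return complex(value)
    if component == "inductor":    # V = L dI/dt  ->  Z = i omega L
        return 1j * omega * value
    if component == "capacitor":   # I = C dV/dt  ->  Z = 1/(i omega C)
        return 1 / (1j * omega * value)
    raise ValueError(component)

# Impedances compose in series and in parallel just like resistances:
def series(z1: complex, z2: complex) -> complex:
    return z1 + z2

def parallel(z1: complex, z2: complex) -> complex:
    return z1 * z2 / (z1 + z2)

# Example: a 100-ohm resistor in series with a 1 mH inductor
# at omega = 1000 rad/s gives Z = 100 + 1j.
z = series(impedance("resistor", 100.0, 1000.0),
           impedance("inductor", 1e-3, 1000.0))
```

Replacing ‘resistance’ by ‘impedance’ really is just a matter of letting the labels live in a bigger field.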
Right now I want to define a category where the morphisms are circuits made of resistors. Any morphism will go from some ‘inputs’ to some ‘outputs’, like this:
So a morphism is roughly a graph with edges labelled by numbers called ‘resistances’, with some special nodes called ‘inputs’ and some special nodes called ‘outputs’.
What can we do with morphisms? Compose them! So, suppose we have a second morphism whose inputs match the outputs of the first:
Then we can compose them, attaching the outputs of the first to the inputs of the second. We get this morphism as the result:
So, composing morphisms is a way to build big electrical circuits—or other ‘linear passive networks’—out of little ones.
This seems pretty simple, but let’s try to formalize it and see why we have a category. In fact it takes a bit of thought. To check that we get a category, we need to check that composition is associative:
and that each object has an identity morphism that does what an identity should:
All these equations seem obviously true in our example… until you try to prove them.
You might think an identity morphism should be a bunch of straight pieces of wire—a bunch of edges each with an input node and an output node—but that doesn’t quite work, since sticking an extra edge onto a graph gives a new graph with an extra edge!
Also, we are composing circuits by ‘sticking them together’. This process is formalized in category theory using a pushout, and pushouts are only defined ‘up to canonical isomorphism’. The very simplest example is the disjoint union of two sets. We all know what it means, but if you examine it carefully, you’ll see it’s only defined up to canonical isomorphism, because it involves a choice of how we make the two sets disjoint, and this choice is somewhat arbitrary.
All this means the category we’re after is a bit subtler than you might at first expect; in fact, it’s most naturally thought of as a bicategory, meaning roughly that all the equations above hold only ‘up to canonical isomorphism’.
So, we proceed like this.
First we define a concept of ‘labelled graph’, where (for now) only the edges are labelled. We do this because we want our circuits to have edges labelled by ‘resistances’, which are real numbers. But we do it in greater generality because later we’ll want the edges to be labelled by ‘impedances’, which are elements of some other field. And since we’re studying electrical circuits just as examples of networks, later still we will probably want graphs whose edges are labelled in still other ways.
So:
Definition. A graph consists of a finite set E of edges, a finite set N of nodes, and two functions s, t : E → N.
Thus each edge e will have some node s(e) as its source and some node t(e) as its target:
Definition. Given a set L, we define an L-labelled graph to be a graph together with a function ℓ : E → L. This assigns to each edge e its label ℓ(e). We call L the label set.
We use the letter L because for circuits of resistors we will take the label set to be (0,∞),
the positive real numbers, and ℓ(e) will be the resistance of the edge e. For circuits that also contain inductors and capacitors we will take the label set to be the positive elements of some larger field… but more about that later!
Now we want to give our -labelled graph a set of nodes called ‘inputs’ and a set of nodes called ‘outputs’. You might think the set of inputs should be disjoint from the set of outputs, but that’s a tactical error! It turns out an identity morphism in our category should have the inputs being exactly the same as the outputs… and no edges at all:
To handle this nicely, we need to make a category of -labelled graphs. This works in the obvious way, if you’re used to this stuff. A morphism from one -labelled graph to another sends edges to edges, nodes to nodes, and preserves everything in sight:
Definition. Given L-labelled graphs, a morphism of L-labelled graphs from one to the other is a pair of functions, one sending edges to edges and one sending nodes to nodes,
such that the following diagrams commute:
There is a category where the objects are L-labelled graphs and the morphisms are as we’ve just defined them.
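If you like to see definitions as code, here is a minimal Python sketch of these two definitions; the class and function names are my own invention, not from the paper:

```python
# A labelled graph: finite sets of edges and nodes, source/target maps,
# and a labelling of edges (e.g. by positive real resistances).

from dataclasses import dataclass

@dataclass
class LabelledGraph:
    edges: set
    nodes: set
    source: dict   # edge -> node
    target: dict   # edge -> node
    label: dict    # edge -> label

def is_morphism(f_edges: dict, f_nodes: dict,
                g: LabelledGraph, h: LabelledGraph) -> bool:
    """Check that (f_edges, f_nodes) preserves sources, targets and labels,
    i.e. that the required squares commute."""
    return all(
        f_nodes[g.source[e]] == h.source[f_edges[e]] and
        f_nodes[g.target[e]] == h.target[f_edges[e]] and
        g.label[e] == h.label[f_edges[e]]
        for e in g.edges
    )

# Two resistors in a row:  n0 --3.0--> n1 --5.0--> n2
g = LabelledGraph({"e1", "e2"}, {"n0", "n1", "n2"},
                  {"e1": "n0", "e2": "n1"},
                  {"e1": "n1", "e2": "n2"},
                  {"e1": 3.0, "e2": 5.0})
```

The identity maps on edges and nodes then give the identity morphism from this labelled graph to itself.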
Warning: the morphisms in this category are not the morphisms of the kind we really want, the ones that look like this:
They are just a step along the way. A morphism of the kind we really want is a diagram like this in our category of L-labelled graphs:
where the object in the middle is an L-labelled graph and the two objects at the ends are L-labelled graphs with no edges!
You see, if a labelled graph has no edges, all it has is nodes. We call the nodes of one end the inputs and the nodes of the other end the outputs. The two morphisms in the diagram say how these nodes are included in the middle object, which is our circuit made of resistors.
In general, any diagram shaped like this is called a cospan:
If we turned the arrows around it would be called a span. Cospans are good whenever you have a thing with an ‘input end’ and an ‘output end’, and you want to describe how the ends are included in the thing. So, they’re precisely what we need for describing a circuit made of resistors, like this:
This makes us want to cook up a category where:
• an object is an L-labelled graph with no edges. We can alternatively think of it as a finite set: a set of nodes.
• a morphism from one such object to another is a cospan of L-labelled graphs:
We still need to say how to compose these morphisms. We know it will amount to attaching the outputs of one circuit to the inputs of the next—that’s all there is to it! But we need to make this precise and prove we get a category. And as I’ve hinted, we will actually get something bigger and better: a bicategory! This will come as no surprise if you’re familiar with span and cospan bicategories, but it may induce a heart attack otherwise.
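Here is a rough Python sketch of the gluing operation on the underlying sets of nodes: a pushout of finite sets computed with union-find. It only illustrates the idea (the real construction also handles edges and labels), and the tagging step is exactly the arbitrary choice that makes pushouts defined only ‘up to canonical isomorphism’:

```python
def glue(apex1, out1, apex2, in2, middle):
    """Compose two cospans of finite sets along a shared middle object.
    out1: middle -> apex1 is the output leg of the first cospan;
    in2: middle -> apex2 is the input leg of the second.
    Returns the pushout: equivalence classes of the disjoint union
    apex1 + apex2, glued so that out1(m) ~ in2(m) for each m in middle."""
    # Tag elements to make the two apexes disjoint -- the arbitrary choice
    # that makes the pushout defined only up to canonical isomorphism.
    parent = {(1, a): (1, a) for a in apex1}
    parent.update({(2, a): (2, a) for a in apex2})

    def find(x):                     # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for m in middle:                 # glue output out1(m) to input in2(m)
        union((1, out1[m]), (2, in2[m]))

    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())

# Glue two 'wires': apex1 = {a, b} with output b, apex2 = {c, d} with
# input c.  The nodes b and c get identified, leaving 3 classes.
glued = glue({"a", "b"}, {"m": "b"}, {"c", "d"}, {"m": "c"}, {"m"})
```

Composing circuits then amounts to taking this pushout of node sets and carrying the edges and their resistance labels along.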
This bicategory can then be ‘watered down’ to give our category. And when we take the label set to be the positive real numbers, we’ll get the category where morphisms are circuits made of resistors!
Then I’ll explain what we can do with this category! There’s no end of things we could do with it. But the main thing Brendan does is study the ‘black-boxing’ operation, where we take a circuit, forget its inner details, and only keep track of what it does. This turns out to be quite interesting.
I thank Brendan Fong for drawing some of the pictures of circuits here. For the details of what I’m starting to explain here, read our paper:
• John Baez and Brendan Fong, A compositional framework for passive linear networks.
You can learn more about the underlying ideas here:
• Dean C. Karnopp, Donald L. Margolis and Ronald C. Rosenberg, System Dynamics: a Unified Approach, Wiley, New York, 1990.
• Forbes T. Brown, Engineering System Dynamics: a Unified Graph-Centered Approach, CRC Press, Boca Raton, 2007.
• Francois E. Cellier, Continuous System Modelling, Springer, Berlin, 1991.
You may be wondering, somewhere deep in the back of your mind, what happened to the Network Theory series on this blog. It’s nowhere near done! I plan to revive it, since soon I’ll be teaching a seminar on network theory at U.C. Riverside. It will start in October and go on at least until the end of March.
Here’s a version of the advertisement I sent to the math grad students:
I’ll be running a seminar on Network Theory on Mondays from 3:10 to 4:30 pm in Surge 268 starting on October 6th.
Network theory uses the tools of modern math—categories, operads and more—to study complex systems made of interacting parts. The idea of this seminar is to start from scratch, explain what my grad students have been doing, and outline new projects that they and other students might want to work on. A lot has happened since I left town in January.
I hope to see you there!
If you want more detail, here is a sketch of what’s been happening.
1) Franciscus Rebro has been working on “cospans”, a very general way to treat a physical system with inputs and outputs as a morphism in a category. This underlies all the other projects.
2) Brendan Fong, a student of mine at Oxford, is working on a category where the morphisms are electrical circuits, and composing morphisms is sticking together circuits:
• Brendan Fong, A compositional approach to control theory.
• John Baez and Brendan Fong, A compositional framework for passive linear networks.
The first paper here is an outline of the project; the second one actually carries out the project, or at least a large chunk of it.
3) Blake Pollard, a student in the physics department, has been studying Markov processes. In a Markov process, things randomly hop from one vertex of a graph to another along edges. Blake has created a category where morphisms are ‘open’ Markov processes, in which things flow in and out of certain special vertices called ‘terminals’.
4) Jason Erbele has been working on categories in control theory, the branch of math used to study physical systems that interact with the outside world via inputs and outputs. After finishing this paper:
• John Baez and Brendan Fong, Categories in control.
he’s been taking more concepts from control theory and formalizing them using categories.
5) Oh yeah, and what about me? I gave a series of lectures on network theory at Oxford, and you can see videos of them here:
• John Baez, Network Theory.
Jacob Biamonte and I have also more or less finished a book on chemical reaction networks and Petri nets:
• John Baez and Jacob Biamonte, Quantum Techniques for Stochastic Mechanics.
But I won’t be talking about that; I want to talk about the new work my students are doing!
I have been learning to make animations using R. This is an animation of the profile of the surface air temperature at the equator. So, the x axis here is the longitude, approximately from 120° E to 280° E. I pulled the data from the region that Graham Jones specified in his code on github: it’s the equatorial line in the region that Ludescher et al. used:
For this animation I tried to show the 1997-1998 El Niño. Typically the Pacific is much cooler near South America, due to the upwelling of deep cold water:
(Click for more information.) That part of the Pacific gets even cooler during La Niña:
But it warms up during El Niños:
You can see that in the surface air temperature during the 1997-1998 El Niño, although by summer of 1998 things seem to be getting back to normal:
I want to practice making animations like this. I could make a much prettier and better-labelled animation that ran all the way from 1948 to today, but I wanted to think a little about what exactly is best to plot if we want to use it as an aid to understanding some of this El Niño business.
There’s something about logic that’s both fascinated and terrified me ever since I was a kid: it’s how we can’t fully pin down infinite structures, like the real or complex number systems, using a language with finitely many symbols and a theory with finitely many axioms.
It’s terrifying that we don’t fully know what we’re talking about when we’re talking about numbers! But it’s fascinating that we can understand a lot about the limitations.
There are many different things to say about this, depending on what features of these number systems we want to describe, and what kind of logic we want to use.
Maybe I should start with the natural numbers, since that story is more famous. This can also serve as a lightning review of some basic concepts which I’ll pretend you already vaguely know: first-order versus second-order logic, proofs versus models, and so on. If you don’t know these, you can either fake it or read some of the many links in this article!
When Peano originally described the natural numbers he did so using axioms phrased in second-order logic. In first-order logic we can quantify over variables: for example, we can say
which means that if the predicate holds for all values of the variable, it holds for any particular one. In second-order logic we can also quantify over predicates: for example, we can say
which says that two things are equal if and only if every predicate is true of one precisely when it is true of the other. Leibniz used this principle, called the identity of indiscernibles, to define equality… and this is a nice example of the greater power of second-order logic. In first-order logic we typically include equality as part of the language and add axioms describing its properties, such as reflexivity and the substitution of equals for equals.
In second-order logic we can define equality and prove these properties using our rules for handling predicates.
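In symbols, Leibniz’s identity of indiscernibles might be written like this (my reconstruction of the formula, which was lost in transcription):

```latex
x = y \;\iff\; \forall P\, \big( P(x) \Leftrightarrow P(y) \big)
```

The quantifier over the predicate P is what makes this a second-order statement.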
Anyway, in his axioms for the natural numbers, Peano used second-order logic to formulate the principle of mathematical induction in this sort of way:
This says that if you’ve got any predicate that’s true for 0, and is true for n + 1 whenever it’s true for n, then it’s true for all natural numbers.
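In modern notation, this second-order induction axiom reads something like this (again my reconstruction):

```latex
\forall P\; \Big( P(0) \;\wedge\; \forall n\, \big( P(n) \Rightarrow P(n+1) \big) \;\Rightarrow\; \forall n\, P(n) \Big)
```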
In 1888, Dedekind showed that Peano’s original axioms for the natural numbers are categorical, meaning that all of their models are isomorphic.
The concept of ‘model’ involves set theory. In a model you pick a set for your variables to range over, pick a subset of it for each predicate—namely, the subset where that predicate is true—and so on, in such a way that all the axioms in that theory are satisfied. If two models are isomorphic, they’re the same for all practical purposes.
So, in simple rough terms, a categorical theory is one that gives a full description of the mathematical structure it’s talking about.
This makes Dedekind’s result sound like great news. It sounds like Peano’s original second-order axioms for arithmetic completely describe the natural numbers.
However, there’s an important wrinkle. There are many inherently undetermined things about set theory! So in fact, a categorical theory only gives a full description of the mathematical structure it’s talking about relative to a choice of what sets are like.
So, Dedekind’s result just shoves everything mysterious and undetermined about the natural numbers under the carpet: they become mysterious and undetermined things about set theory. This became clear much later, thanks to Gödel and others. And in the process, it became clear that second-order logic is a bit problematic compared to first-order logic.
You see, first-order logic has a set of deduction rules that are:
• sound: Every provable sentence holds in every model.
• semantically complete: Every sentence that holds in every model is provable.
• effective: There is an algorithm that can correctly decide whether any given sequence of symbols is a proof.
Second-order logic does not! It’s ‘too powerful’ to also have all three of these nice properties.
So, these days people often work with a first-order version of Peano’s axioms for arithmetic. Instead of writing down a single axiom for mathematical induction, we write down an axiom schema—an infinite list of axioms—with one axiom
for each formula that we can actually write down using the language of arithmetic.
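Each instance of the schema looks something like this, one axiom per formula φ (my reconstruction in modern notation):

```latex
\big( \varphi(0) \;\wedge\; \forall n\, ( \varphi(n) \Rightarrow \varphi(n+1) ) \big) \;\Rightarrow\; \forall n\, \varphi(n)
```

The crucial difference from the second-order axiom is that here φ ranges only over formulas we can write down, not over arbitrary predicates.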
This first-order version of Peano arithmetic is not categorical: it has lots of nonisomorphic models. People often pretend there’s one ‘best’ model: they call it the ‘standard’ natural numbers, and call all the others ‘nonstandard’. But there’s something a bit fishy about this.
Indeed, Gödel’s first incompleteness theorem says there are many statements about natural numbers that can neither be proved nor disproved starting from Peano’s axioms. It follows that for any such statement we can find a model of the Peano axioms in which that statement holds, and also a model in which it does not.
Furthermore, this remains true even if we add any list of extra axioms to Peano arithmetic, as long as there’s some algorithm that can list all these axioms.
So, I’d prefer to say there are many different ‘versions’ of the natural numbers, just as there are many different groups.
We can study these different versions, and it’s a fascinating subject:
• Wikipedia, Nonstandard models of arithmetic.
However, I want to talk about the situation for other number systems!
The situation is better for the real numbers—at least if we are willing to think about them in a ‘purely algebraic’ way, leaving most analysis behind.
To do this, we can use the theory of a ‘real closed field’. This is a list of axioms, formulated in first-order logic, which describe how addition and multiplication work for the real numbers. You can think of these axioms as consisting of three parts:
• the field axioms: the usual algebraic identities involving addition and multiplication, together with laws saying that everything has an additive inverse and everything except zero has a multiplicative inverse.
• the formally real field axiom, saying that −1 is not a sum of squares. This implies that we can equip the field with a concept of ≤ that makes it into an ordered field—but not necessarily in a unique way.
• the real closed field axioms, which say that also, for any number x, either x or −x has a square root, and every polynomial of odd degree has a root. Among other things this implies our field can be made into an ordered field in a unique way. To do this, we say x ≤ y if and only if y − x has a square root.
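In symbols, the ordering defined in the last bullet might be written like this (my reconstruction):

```latex
x \le y \;\iff\; \exists z \; \big( z^2 = y - x \big)
```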
Tarski showed this theory is complete: any first-order sentence involving only the field operations and the ordering relation can either be proved or disproved starting from the above axioms.
Nonetheless, the theory of real closed fields is not categorical: besides the real numbers, there are many other models! These models are all elementarily equivalent: any first-order sentence in this language that holds in one model holds in all the rest. But these models are not all isomorphic: we can’t get a bijection between them that preserves the field operations and the ordering.
Indeed, only finite-sized mathematical structures can be ‘nailed down’ up to isomorphism by theories in first-order logic. You see, the Löwenheim–Skolem theorem says that if a first-order theory in a countable language has an infinite model, it has at least one model of each infinite cardinality. So, if we’re trying to use this kind of theory to describe an infinitely big mathematical structure, the most we can hope for is that after we specify its cardinality, the axioms completely determine it.
However, the real closed field axioms aren’t even this good. For starters, they have infinitely many nonisomorphic countable models. Here are a few:
• the algebraic real numbers: these are the real numbers that obey polynomial equations with integer coefficients.
• the computable real numbers: these are the real numbers that can be computed to arbitrary precision by a computer program.
• the arithmetical real numbers: these are the numbers definable in the language of arithmetic. More precisely, a real number is arithmetical if there is a formula in the language of first-order Peano arithmetic, with two free variables, that determines each digit of the number.
Every computable real number is arithmetical, but not vice versa: just because you can define a real number in the above way does not mean you can actually compute it to arbitrary precision!
And indeed, there are other even bigger countable real closed fields, consisting of real numbers that are definable using more powerful methods, like second-order Peano arithmetic.
We can also get countable real closed fields using tricks like this: take the algebraic real numbers and throw in the number π along with just enough other numbers to get a real closed field again. Or, we could throw in both π and e. This probably gives a bigger real closed field—but nobody knows, because for all we know, π could equal e plus some rational number! Everyone believes this is false, but nobody has proved it.
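Just to illustrate how frustrating this open question is, here is a quick numerical look in Python; of course, no amount of computed digits can settle whether π − e is rational:

```python
import math

# pi - e to double precision; the digits look patternless, but nobody
# has proved that this number is even irrational.
diff = math.pi - math.e
print(f"{diff:.15f}")   # 0.423310825130748
```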
There are also lots of nonisomorphic uncountable real closed fields, including ones that include the usual real numbers.
For example, we can take the real numbers and throw in an element ∞ that is bigger than 1, 2, 3, and so on—and then do what it takes to get another real closed field. This involves throwing in elements like ∞ + 1, √∞, 1/∞ and so on. So, we get lots of infinities and infinitesimals.
It gets a bit confusing here, trying to figure out what equals what. But there’s another real closed field containing an infinite element that seems easier to manage. It’s called the field of real Puiseux series. These are series of the form
a_k x^(k/n) + a_(k+1) x^((k+1)/n) + a_(k+2) x^((k+2)/n) + ⋯
where k is any integer, perhaps negative, n is any positive integer, and the coefficients a_i are real.
What’s x? It’s just a formal variable. But the real Puiseux series form a real closed field, and x acts like it’s positive, but smaller than any positive real number.
With considerably more work, we can make up a real closed field that:
• contains the real numbers,
• contains an element bigger than 1, 2, 3, and so on, and
• obeys the transfer principle, which says that a first-order statement phrased in the usual language of set theory holds for the real numbers if and only if it holds for this other number system.
Any real closed field with these properties is called a system of hyperreal numbers. In the 1960s, the logician Abraham Robinson used them to make Leibniz’s old idea of infinitesimals in calculus fully rigorous. The resulting theory is called nonstandard analysis.
So, I hope you see there’s an exciting—or perhaps appalling—diversity of real closed fields. But don’t forget: they’re all elementarily equivalent. If a first-order sentence in the language of real closed fields holds in any one of these fields, it holds in all of them!
You might wonder what second-order logic has to say about this.
Here the situation looks very different. In second-order logic we can do analysis, because we can quantify over predicates, which allows us to talk about subsets of real numbers. And in second-order logic we can write down a theory of real numbers that’s categorical! It’s called the theory of a Dedekind-complete ordered field. Again, we can group the axioms in three bunches:
• the field axioms: the usual algebraic identities involving addition and multiplication, together with laws saying that everything has an additive inverse and everything except zero has a multiplicative inverse.
• the ordered field axiom, saying there is a total ordering ≤ compatible with the field operations: x ≤ y implies x + z ≤ y + z, and 0 ≤ x together with 0 ≤ y implies 0 ≤ xy.
• the Dedekind completeness axiom, which says that every nonempty subset with an upper bound has a least upper bound. But instead of talking about subsets, we talk about the predicates that hold on those subsets, so we say “for all predicates such that…”
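Written out, the Dedekind completeness axiom quantifies over predicates, something like this (my reconstruction):

```latex
\forall P\; \Big( \exists x\, P(x) \;\wedge\; \exists b\, \forall x\, \big( P(x) \Rightarrow x \le b \big)
\;\Rightarrow\; \exists \ell\, \big( \forall x\, ( P(x) \Rightarrow x \le \ell ) \;\wedge\;
\forall b\, ( \forall x\, ( P(x) \Rightarrow x \le b ) \Rightarrow \ell \le b ) \big) \Big)
```

The initial quantifier over P is what pushes this beyond first-order logic.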
Because they’re categorical, people often use these axioms to define the real numbers. But because they’re second-order, the problem of many nonisomorphic models has really just been swept under the rug. If we use second-order logic, we won’t have a concept of ‘proof’ that’s sound, semantically complete and effective. And if we use first-order axioms for set theory to explicitly talk about subsets instead of predicates, then our set theory will have many models! Each model will have a version of the real numbers in it that’s unique up to isomorphism… but the versions in different models will be really different.
In fact, there’s a precise sense in which the ‘standard real numbers’ in one model of set theory can be the ‘hyperreals’ in another. This was first shown by Abraham Robinson.
I mentioned that when we’re studying an infinite mathematical structure using first-order logic, the best we can hope for is to have one model of each size (up to isomorphism). The real numbers are far from being this nice… but the complex numbers come much closer!
More precisely, say κ is some cardinal. A first-order theory describing structure on a single set is called κ-categorical if it has a unique model of cardinality κ. And in 1965, a logician named Michael Morley showed that if a list of axioms is κ-categorical for some uncountable κ, it’s κ-categorical for every uncountable κ. I haven’t worked my way through the proof, which seems to be full of interesting ideas. Such theories are called uncountably categorical.
A great example is the ‘purely algebraic’ theory of the complex numbers. By this I mean we only write down axioms involving addition and multiplication. We don’t include anything about the ordering this time, nor anything about complex conjugation. You see, if we start talking about complex conjugation we can pick out the real numbers inside the complex numbers, and then we’re more or less back to the story we had for real numbers.
This theory is called the theory of an algebraically closed field of characteristic zero. Yet again, the axioms come in three bunches:
• the field axioms.
• the characteristic zero axioms: these are an infinite list of axioms saying that 1 ≠ 0, 1 + 1 ≠ 0, 1 + 1 + 1 ≠ 0, and so on.
• the algebraically closed axioms: these say that every non-constant polynomial has a root.
Pretty much any mathematician worth their salt knows that the complex numbers are a model of these axioms, whose cardinality is that of the continuum. There are lots of different countable models: the algebraic complex numbers, the computable complex numbers, and so on. But because the above theory is uncountably categorical, there is exactly one algebraically closed field of characteristic zero of each uncountable cardinality… up to isomorphism.
This implies some interesting things.
For example, we can take the complex numbers, throw in an extra element, and let it freely generate a bigger algebraically closed field. It’s ‘bigger’ in the sense that it contains the complex numbers as a proper subset, indeed a subfield. But since it has the same cardinality as the complex numbers, it’s isomorphic to the complex numbers!
And then, because this ‘bigger’ field is isomorphic to the complex numbers, we can turn this argument around. We can take the complex numbers, remove a lot of carefully chosen elements, and get a subfield that’s isomorphic to the complex numbers.
Or, if we like, we can take the complex numbers, adjoin a really huge set of extra elements, and let them freely generate an algebraically closed field of characteristic zero. The cardinality of this field can be as big as we want. It will be determined up to isomorphism by its cardinality.
One piece of good news is that thanks to a result of Tarski, the theory of an algebraically closed field of characteristic zero is complete, and thus, all its models are elementarily equivalent. In other words, all the same first-order sentences written in the language of addition and multiplication hold in every model.
But here’s a piece of strange news.
As I already mentioned, the theory of a real closed field is not uncountably categorical. This implies something really weird. Besides the ‘usual’ real numbers we can choose another real closed field of the same cardinality that is not isomorphic to them. We can build the complex numbers using pairs of real numbers. We can use the same trick to build a field using pairs of elements of this other real closed field. But it’s easy to check that this funny field is algebraically closed and of characteristic zero. Since it has the same cardinality as the complex numbers, it must be isomorphic to the complex numbers!
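The ‘pairs’ construction here is the usual one: define addition and multiplication on ordered pairs by

```latex
(a,b) + (c,d) = (a+c,\; b+d), \qquad (a,b)\cdot(c,d) = (ac-bd,\; ad+bc)
```

and nothing in these formulas cares whether the pairs are built from the usual real numbers or from some other real closed field.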
In short, different ‘versions’ of the real numbers can give rise to the same version of the complex numbers!
So, I hope you see that the logical foundations of the real and complex number systems are quite slippery… yet with work, we can understand a lot about this slipperiness.
Besides the references I’ve given, I just want to mention two more. First, here’s a free introductory calculus textbook based on nonstandard analysis:
• H. Jerome Keisler, Elementary Calculus: an Infinitesimal Approach, available as a website or in PDF.
And here’s an expository paper that digs deeper into uncountably categorical theories:
• Nick Ramsey, Morley’s categoricity theorem.
The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, along with writing expository articles. So far we’ve mostly talked about the physics and data of the El Niño, along with looking at one method of actually trying to predict El Niño events. Since there’s going to be more data exploration using methods more typical of machine learning, it’s a good time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models. Here we’ll concentrate on the concepts rather than the fine details and particular techniques.
We also stress there’s not a fundamental distinction between machine learning (ML) and statistical modelling and inference. There are certainly differences in culture, background and terminology, but in terms of the actual algorithms and mathematics used there’s a great commonality. Throughout the rest of the article we’ll talk about ‘machine learning models’, but could equally have used ‘statistical models’.
For our purposes here, a model is any object which provides a systematic procedure for taking some input data and producing a prediction of some output. There’s a spectrum of models, ranging from physically based models at one end to purely data driven models at the other. As a very simple example, suppose you commute by car from your place of work to your home and you want to leave work in order to arrive home at 6:30 pm. You can tackle this by building a model which takes as input the day of the week and gives you back a time to leave.
• There’s the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6:30 pm. You might find that the traffic is lighter on weekend days so you can leave at 6:10 pm, while on weekdays you have to leave at 5:45 pm, except on Wednesdays when you have to leave at 5:30 pm. Since you’ve just crunched the data you have no idea why this works, but it’s a very reliable rule when you use it to predict when you need to leave.
• There’s the physical model approach, where you attempt to infer how many people are doing what on any given day and then figure out what that implies for the traffic levels and thus what time you need to leave. In this case you find out that there’s a mid-week sports game on Wednesday evenings which leads to even higher traffic. This not only predicts that you’ve got to leave at 5:30 pm on Wednesdays but also lets you understand why. (Of course this is just an illustrative example; in climate modelling a physical model would be based upon actual physical laws such as conservation of energy, conservation of momentum, Boyle’s law, etc.)
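The data driven version of this toy model can be sketched as a bare lookup table. Everything here (the day names, the times) comes from the made-up example above:

```python
# A purely data driven "model" is just a lookup table learned from
# recorded trips, with no notion of *why* Wednesdays differ.
# Times are minutes after noon: 330 = 5:30 pm, 345 = 5:45 pm, 370 = 6:10 pm.
recorded_trips = {
    "Mon": 345, "Tue": 345, "Wed": 330, "Thu": 345, "Fri": 345,
    "Sat": 370, "Sun": 370,
}

def leave_time(day):
    """Predict when to leave work, given only the day of the week."""
    return recorded_trips[day]
```

The model predicts perfectly on the days it has seen, but it has nothing to say about why the Wednesday entry is different; a physical model of the traffic would.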
There are trade-offs between the two types of approach. Data driven modelling is a relatively simple process. In contrast, by proceeding from first principles you’ve got a more detailed framework which is equally predictive but at the cost of having to investigate a lot of complicated underlying effects. Physical models have one interesting advantage: nothing in the data driven model prevents it violating physical laws (e.g., not conserving energy, etc) whereas a physically based model obeys the physical laws by design. This is seldom a problem in practice, but worth keeping in mind.
The situation with data driven techniques is analogous to one of those American medication adverts: there’s the big message about how “using a data driven technique can change your life for the better” while the voiceover gabbles out all sorts of small print. The remainder of this post will describe some of the basic principles in that small print.
There’s a popular misconception that machine learning works well when you simply collect some data and throw it into a machine learning algorithm. In practice that kind of approach often yields a model that is quite poor. Almost all successful machine learning applications are preceded by some form of data preprocessing. Sometimes this is simply rescaling so that different variables have similar magnitudes, are zero centred, etc.
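A minimal sketch of that simplest kind of preprocessing, using NumPy (the data matrix here is made up): each variable is rescaled so the columns have similar magnitudes and are zero centred.

```python
import numpy as np

def standardize(X):
    """Zero-centre each column of a data matrix and rescale to unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Two variables on wildly different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Z = standardize(X)
```

After this step no variable dominates the fitting purely because it happens to be measured in bigger units.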
However, there are often steps that are more involved. For example, many machine learning techniques have what are called ‘kernel variants’ which involve (in a way whose details don’t matter here) using a nonlinear mapping from the original data to a new space which is more amenable to the core algorithm. There are various kernels with the right mathematical properties, and frequently the choice of a good kernel is made either by experimentation or knowledge of the physical principles. Here’s an example (from Wikipedia’s entry on the support vector machine) of how a good choice of kernel can convert a not linearly separable dataset into a linearly separable one:
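A hand-rolled sketch of the underlying idea, on synthetic ‘disc versus ring’ data: the nonlinear feature x² + y² (the kind of mapping an RBF-style kernel supplies implicitly) turns a dataset that no straight line can separate into one that a single threshold can. This is an illustration of the principle, not an actual SVM:

```python
import numpy as np

rng = np.random.default_rng(0)

# Class 0 lives inside the unit circle, class 1 in a ring around it:
# not linearly separable in the (x, y) plane.
r0 = rng.uniform(0.0, 0.8, 50)   # radii of the inner class
r1 = rng.uniform(1.2, 2.0, 50)   # radii of the outer class
r = np.concatenate([r0, r1])
theta = rng.uniform(0.0, 2 * np.pi, 100)
x, y = r * np.cos(theta), r * np.sin(theta)

# Map to the nonlinear feature z = x^2 + y^2; the classes are now
# separated by the single threshold z = 1.
z = x**2 + y**2
predicted = (z > 1.0).astype(int)
actual = np.concatenate([np.zeros(50), np.ones(50)]).astype(int)
```

In kernel methods this mapping is never computed explicitly; the kernel function supplies the inner products in the mapped space directly.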
An extreme example of preprocessing is explicitly extracting features from the data. In ML jargon, a feature “boils down” some observed data into a value that is directly useful. For example, in the work by Ludescher et al that we’ve been looking at, they don’t feed all the daily time series values into their classifier but instead take the correlations between different points over a year as the basic features to consider. Since the individual days’ temperatures are incredibly noisy and there are so many of them, extracting features from them gives more useful input data. While these extraction functions could theoretically be learned by the ML algorithm, this is quite a complicated function to learn. By explicitly choosing to represent the data using this feature, the amount the algorithm has to discover is reduced and hence the likelihood of it finding an excellent model is dramatically increased.
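A sketch of this kind of feature extraction on made-up data. Ludescher et al’s actual link-strength computation also involves time delays and normalisation; this only shows the bare idea of boiling a year of noisy daily values at two grid points down to one correlation number:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily temperature anomalies at two grid points over one year:
# a shared weather signal plus independent local noise at each point.
common = rng.normal(size=365)
site_a = common + 0.5 * rng.normal(size=365)
site_b = common + 0.5 * rng.normal(size=365)

# 730 noisy numbers become a single feature: the correlation coefficient.
feature = np.corrcoef(site_a, site_b)[0, 1]
```

The classifier then sees one informative number per pair of points instead of two long, noisy time series.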
Some of the problems that we describe below would vanish if we had unlimited amounts of data to use for model development. However, in real cases we often have a strictly limited amount of data we can obtain. Consequently we need methodologies to address the issues that arise when data is limited.
The most common way to work with collected data is to split it into a training set and a test set. The training set is used in the process of determining the best model parameters, while the test set—which is not used in any way in determining those model parameters—is then used to see how effective the model is likely to be on new, unseen data. (The test and training sets need not be equally sized. There are some fitting techniques which need to further subdivide the training set, so that having more training than test data works out best.) This division of data acts to further reduce the effective amount of data used in determining the model parameters.
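A minimal sketch of such a split (the data and the 75/25 ratio are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 hypothetical observations, here just indices 0..99.
data = np.arange(100)

# Shuffle once, then hold out 25% as a test set that plays no part
# in fitting; the remaining 75% is the training set.
indices = rng.permutation(len(data))
n_test = len(data) // 4
test_set = data[indices[:n_test]]
training_set = data[indices[n_test:]]
```

The shuffle matters: taking the first 25 rows of a file as the test set can silently pick up whatever ordering the data was collected in.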
After we’ve made this split we have to be careful how much of the test data we scrutinise in any detail, since once it has been investigated it can’t meaningfully be used for testing again, although it can still be used for future training. (Examining the test data is often informally known as burning data.) That only applies to detailed inspection however; one common way to develop a model is to look at some training data and then train the model (also known as fitting the model) on that training data. It can then be evaluated on the test data to see how well it does. It’s also then okay to purely mechanically train the model on the test data and evaluate it on the training data to see how “stable” the performance is. (If you get dramatically different scores then your model is probably flaky!) However, once we start to look at precisely why the model failed on the test data—in order to change the form of the model—the test data has now become training data and can’t be used as test data for future variants of that model. (Remember, the real goal is to accurately predict the outputs for new, unseen inputs!)
Suppose we’re modelling a system which has a true probability distribution P. We can’t directly observe this, but we have some set of samples S obtained from observation of the system, and hence drawn from P. Clearly there are problems if we generate this sample in a way that biases the region of the distribution we sample from: it wouldn’t be a good idea to get training data featuring heights in the American population by only handing out surveys in the locker rooms of basketball facilities. But if we take care to avoid sampling bias as much as possible, then we can make various kinds of estimates of the distribution that we think S comes from.
Let’s consider the estimate P̂ implied for P by some particular technique. It would be nice if P̂ = P, wouldn’t it? And indeed many good estimators have the property that as the size of S tends to infinity, P̂ will tend to P. However, for finite sizes of S, and especially for small sizes, P̂ may have some spurious detail that’s not present in P.
As a simple illustration of this, my computer has a pseudorandom number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got
2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.
Note that we’ve got one subset of 4 values (2928, 6552, 1672, 3937) within the interval of length 5012 between 1540 and 6552, and another subset of 3 values (23440, 23979 and 28451) in the interval of length 5011 between 23440 and 28451. For this uniform distribution the expected number of values falling within a given range of that size is about 1.2. Readers will be familiar with how a random quantity estimated from a small sample shows a large amount of variation around its expected value that only reduces as the sample size increases, so this isn’t a surprise. However, it does highlight that even completely unbiased sampling from the true distribution will typically give rise to extra ‘structure’ within the distribution implied by the samples.
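The expectation quoted above is a one-line calculation: 8 uniform draws on 0..32767, and an interval of length 5012.

```python
# Expected number of uniform draws landing in a given interval:
# (number of draws) * (interval length) / (total range size).
n_samples = 8
range_size = 32768      # values 0..32767
interval_length = 5012
expected_count = n_samples * interval_length / range_size  # about 1.22
```

So finding 3 or 4 values bunched into such an interval is well above the expected count, yet it happened twice in a single sample of 8.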
For example, here are the results from one way of estimating the probability density P̂ from the samples:
The green line is the true density while the red curve shows the probability density obtained from the samples, with clearly spurious extra structure.
Almost all modelling techniques, while not necessarily estimating an explicit probability distribution from the training samples, can be seen as building functions that are related to that probability distribution.
For example, a ‘thresholding classifier’ for dividing input into two output classes will place the threshold at the optimal point for the distribution implied by the samples. As a consequence, one important aim in building machine learning models is to estimate the features that are present in the true probability distribution while not learning such fine details that they are likely to be spurious features due to the small sample size. If you think about this, it’s a bit counter-intuitive: you deliberately don’t want to perfectly reflect every single pattern in the training data. Indeed, specialising a model too closely to the training data is given the name over-fitting.
This brings us to generalization. Strictly speaking generalization is the ability of a model to work well upon unseen instances of the problem (which may be difficult for a variety of reasons). In practice however one tries hard to get representative training data so that the main issue in generalization is in preventing overfitting, and the main way to do that is—as discussed above—to split the data into a set for training and a set only used for testing.
One factor that’s often related to generalization is regularization, which is the general term for adding constraints to the model to prevent it being too flexible. One particularly useful kind of regularization is sparsity. Sparsity refers to the degree to which a model has empty elements, typically represented as 0 coefficients. It’s often possible to incorporate a prior into the modelling procedure which will encourage the model to be sparse. (Recall that in Bayesian inference the prior represents our initial ideas of how likely various different parameter values are.) There are some cases where we have various detailed priors about sparsity for problem specific reasons. However the more common case is having a ‘general modelling’ belief, based upon experience in doing modelling, that sparser models have a better generalization performance.
As an example of using sparsity promoting priors, we can look at linear regression. For standard regression with n examples of outputs y_i against d-dimensional input vectors x_i we’re considering the total error

$$\sum_{i=1}^{n} (y_i - \beta \cdot x_i)^2$$

while with the sparsity prior we’ve got

$$\sum_{i=1}^{n} (y_i - \beta \cdot x_i)^2 + \lambda \sum_{j=1}^{d} |\beta_j|$$

where the β_j are the coefficients to be fitted and λ is the prior weight. We can see how the prior weight affects the sparsity of the β_j:
On the x-axis is λ while the y-axis is the coefficient value. Each line represents the value of one particular coefficient as λ increases. You can see that for very small λ – corresponding to a very weak prior – all the weights are non-zero, but as λ increases – corresponding to the prior becoming stronger – more and more of them have a value of 0.
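The same effect can be sketched with a small lasso fit. Everything here is illustrative (synthetic data, arbitrary penalty weights, and a generic iterative soft-thresholding solver), not the procedure used to produce the plot above:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sparse ground truth: only 2 of 10 coefficients matter.
n, d = 50, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[0], beta_true[1] = 3.0, -2.0
y = X @ beta_true + 0.1 * rng.normal(size=n)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise 0.5*||y - X b||^2 + lam*||b||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # safe gradient step size
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = b - step * (X.T @ (X @ b - y))          # gradient step on the error
        b = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # soft threshold
    return b

weak_prior = lasso_ista(X, y, lam=0.1)     # small lambda: few exact zeros
strong_prior = lasso_ista(X, y, lam=50.0)  # large lambda: mostly zeros
```

With the weak prior essentially every coefficient comes out non-zero; with the strong prior most of the irrelevant ones are driven exactly to zero, which is the behaviour the plot shows as λ grows.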
There are a couple of other reasons for wanting sparse models. The obvious one is speed of model evaluation, although this is much less significant with modern computing power. A less obvious reason is that one can often only effectively utilise a sparse model, e.g., if you’re attempting to see how the input factors should be physically modified in order to affect the real system in a particular way. In this case we might want a good sparse model rather than an excellent dense model.
While there are some situations where a model is sought purely to develop knowledge of the universe, in many cases we are interested in models in order to direct actions. For example, having forewarning of El Niño events would enable all sorts of mitigation actions. However, these actions are costly so they shouldn’t be undertaken when there isn’t an upcoming El Niño. When presented with an unseen input the model can either match the actual output (i.e., be right) or differ from the actual output (i.e., be wrong). While it’s impossible to know in advance if a single output will be right or wrong – if we could tell that we’d be better off using that in our model – from the training data it’s generally possible to estimate the fractions of predictions that will be right and will be wrong in a large number of uses. So we want to link these probabilities with the effects of actions taken in response to model predictions.
We can do this using a utility function and a loss function. The utility function maps each possible output to a numerical value proportional to the benefit from taking actions when that output was correctly anticipated. The loss function maps outputs to a number proportional to the loss from the actions when the output was incorrectly predicted by the model. (There is evidence that human beings often have inconsistent utility and loss functions, but that’s a story for another day…)
There are three common ways the utility and loss functions are used:
• Maximising the expected value of the utility (for the fraction where the prediction is correct) minus the expected value of the loss (for the fraction where the prediction is incorrect).

• Minimising the expected loss while ensuring that the expected utility is at least some value.

• Maximising the expected utility while ensuring that the expected loss is at most some value.
Once we’ve chosen which one we want, it’s often possible to actually tune the fitting of the model to optimize with respect to that criterion.
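The first criterion reduces to simple arithmetic once the fractions and values are estimated. The numbers here are purely hypothetical, chosen only to show the calculation:

```python
# Hypothetical figures for acting on an El Niño warning.
p_correct = 0.7   # estimated fraction of model predictions that are right
utility = 100.0   # benefit (arbitrary units) of acting on a correct prediction
loss = 30.0       # cost wasted when acting on an incorrect prediction

# Criterion 1: expected utility on correct predictions minus
# expected loss on incorrect ones.
expected_net = p_correct * utility - (1 - p_correct) * loss
act_on_predictions = expected_net > 0
```

Here the expected net benefit per acted-on prediction is 0.7 × 100 − 0.3 × 30 = 61, so acting on the model’s warnings is worthwhile even though it is wrong 30% of the time.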
Of course sometimes when building a model we don’t know enough details of how it will be used to get accurate utility and loss functions (or indeed know how it will be used at all).
It is certainly possible to take a predictive model obtained by machine learning and use it to figure out a physically based model; this is one way of performing data mining. However in practice there are a couple of reasons why it’s necessary to take some care when doing this:
• The variables in the training set may be related by some non-observed latent variables which may be difficult to reconstruct without knowledge of the physical laws that are in play. (There are machine learning techniques which attempt to reconstruct unknown latent variables, but this is a much more difficult problem than estimating known but unobserved latent variables.)
• Machine learning models have a maddening ability to find variables that are predictive due to the way the data was gathered. For example, in a vision system aimed at finding tanks all the images of tanks were taken during one day on a military base when there was accidentally a speck of grime on the camera lens, while all the images of things that weren’t tanks were taken on other days. A neural net cunningly learned that to decide if it was being shown a tank it should look for the shadow from the grime.
• It’s common to have some groups of very highly correlated input variables. In that case a model will generally learn a function which utilises an arbitrary linear combination of the correlated variables and an equally good model would result from using any other linear combination. (This is an example of the statistical problem of ‘identifiability’.) Certain sparsity encouraging priors have the useful property of encouraging the model to select only one representative from a group of correlated variables. However, even in that case it’s still important not to assign too much significance to the particular division of model parameters in groups of correlated variables.
• One can often come up with good machine learning models even when physically important variables haven’t been collected in the training data. A related issue is that if all the training data is collected from a particular subspace, factors that aren’t important there won’t be found. For example, if in a collision system to be modelled all the data is collected at low speeds, the machine learning model won’t learn about relativistic effects that only have a big effect at a substantial fraction of the speed of light.
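The identifiability point above can be seen in a tiny example with two perfectly correlated inputs: any split of the weight between them fits equally well, and least squares simply returns one representative of that family (the minimum-norm one), so the individual coefficients carry no physical meaning:

```python
import numpy as np

# Two perfectly correlated inputs: x2 is an exact copy of x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, x1])
y = x1.copy()  # the target depends only on the shared signal

# Any weights with w1 + w2 = 1 fit perfectly; lstsq picks the
# minimum-norm solution, splitting the weight evenly.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Reading w as “each input contributes half the effect” would be a mistake: the even split is an artefact of the solver’s tie-breaking rule, not a fact about the system.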
All of the ideas discussed above are really just ways of making sure that work developing statistical/machine learning models for a real problem is producing meaningful results. As Bob Dylan (almost) sang, “to live outside the physical law, you must be honest; I know you always say that you agree”.