Higher-Dimensional Rewriting in Warsaw (Part 2)

26 June, 2015

Today I’m going to this workshop:

Higher-Dimensional Rewriting and Applications, 28-29 June 2015, Warsaw, Poland.

Many of the talks will be interesting to people who are trying to use category theory as a tool for modelling networks!

For example, though they can’t actually attend, Lucius Meredith and my student Mike Stay hope to use Google Hangouts to present their work on Higher category models of the π-calculus. The π-calculus is a way of modelling networks where messages get sent here and there, e.g. the internet. Check out Mike’s blog post about this:

• Mike Stay, A 2-categorical approach to the pi calculus, The n-Category Café, 26 May 2015.

Krzysztof Bar, Aleks Kissinger and Jamie Vicary will be speaking about Globular, a proof assistant for computations in n-categories:

This talk is a progress report on Globular, an online proof assistant for semistrict higher-dimensional rewriting. We aim to produce a tool which can visualize higher-dimensional categorical diagrams, assist in their construction with a point-and-click interface, perform type checking to prevent incorrect composites, and automatically handle the interchanger data at each dimension. Hosted on the web, it will have a low barrier to use, and allow hyperlinking of formalized proofs directly from research papers. We outline the theoretical basis for the tool, and describe the challenges we have overcome in its design.

Eric Finster will be talking about another computer system for dealing with n-categories, based on the ‘opetopic’ formalism that James Dolan and I invented. And Jason Morton is working on a computer system for computation in compact closed categories! I’ve seen it, and it’s cool, but he can’t attend the workshop, so David Spivak will be speaking on his work with Jason on the theoretical foundations of this software:

We consider the linked problems of (1) finding a normal form for morphism expressions in a compact closed category and (2) the word problem, that is deciding if two morphism expressions are equal up to the axioms of a compact closed category. These are important ingredients for a practical monoidal category computer algebra system. Previous approaches to these problems include rewriting and graph-based methods. Our approach is to re-interpret a morphism expression in terms of an operad, and thereby obtain a single composition which is strictly associative and applied according to the abstract syntax tree. This yields the same final operad morphism regardless of the tree representation of the expression or order of execution, and solves the normal form problem up to automorphism.

Recently Eugenia Cheng has been popularizing category theory, touring to promote her book Cakes, Custard and Category Theory. But she’ll be giving two talks in Warsaw, I believe on distributive laws for Lawvere theories.

As for me, I’ll be promoting my dream of using category theory to understand networks in electrical engineering. I’ll be giving a talk on control theory and a talk on electrical circuits: two sides of the same coin, actually.

• John Baez, Jason Erbele and Nick Woods, Categories in control.

If you’ve seen a previous talk of mine with the same title, don’t despair—this one has new stuff! In particular, it talks about a new paper by Nick Woods and Simon Wadsley.

Abstract. Control theory is the branch of engineering that studies dynamical systems with inputs and outputs, and seeks to stabilize these using feedback. Control theory uses “signal-flow diagrams” to describe processes where real-valued functions of time are added, multiplied by scalars, differentiated and integrated, duplicated and deleted. In fact, these are string diagrams for the symmetric monoidal category of finite-dimensional vector spaces, but where the monoidal structure is direct sum rather than the usual tensor product. Jason Erbele has given a presentation for this symmetric monoidal category, which amounts to saying that it is the PROP for bicommutative bimonoids with some extra structure.

A broader class of signal-flow diagrams also includes “caps” and “cups” to model feedback. This amounts to working with a larger symmetric monoidal category where objects are still finite-dimensional vector spaces but the morphisms are linear relations. Erbele also found a presentation for this larger symmetric monoidal category. It is the PROP for a remarkable thing: roughly speaking, an object with two special commutative dagger-Frobenius structures, such that the multiplication and unit of either one and the comultiplication and counit of the other fit together to form a bimonoid.

• John Baez and Brendan Fong, Circuits, categories and rewrite rules.

Abstract. We describe a category where a morphism is an electrical circuit made of resistors, inductors and capacitors, with marked input and output terminals. In this category we compose morphisms by attaching the outputs of one circuit to the inputs of another. There is a functor called the ‘black box functor’ that takes a circuit, forgets its internal structure, and remembers only its external behavior. Two circuits have the same external behavior if and only if they impose the same relation between currents and potentials at their terminals. This is a linear relation, so the black box functor goes from the category of circuits to the category of finite-dimensional vector spaces and linear relations. Constructing this functor makes use of Brendan Fong’s theory of ‘decorated cospans’—and the question of whether two ‘planar’ circuits map to the same relation has an interesting answer in terms of rewrite rules.

The answer to the last question, in the form of a single picture, is this:


How can you change an electrical circuit made out of resistors without changing what it does? Five ways are shown here:

  1. You can remove a loop of wire with a resistor on it. It doesn’t do anything.
  2. You can remove a wire with a resistor on it if one end is unattached. Again, it doesn’t do anything.

  3. You can take two resistors in series—one after the other—and replace them with a single resistor. But this new resistor must have a resistance that’s the sum of the old two.

  4. You can take two resistors in parallel and replace them with a single resistor. But this resistor must have a conductance that’s the sum of the old two. (Conductance is the reciprocal of resistance.)

  5. Finally, the really cool part: the Y-Δ transform. You can replace a Y made of 3 resistors by a triangle of resistors. But their resistances must be related by the equations shown here, which are written out just below.
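
In formulas, rules 3–5 look like this. Here I’m using one common labeling convention for the Y-Δ transform, not necessarily the exact notation in the picture: R_1, R_2, R_3 are the resistances of the Y, attached to terminals 1, 2, 3, and R_{ij} is the resistance of the triangle edge joining terminals i and j.

R_{\mathrm{series}} = R_1 + R_2, \qquad \frac{1}{R_{\mathrm{parallel}}} = \frac{1}{R_1} + \frac{1}{R_2}

R_{12} = \frac{R_1 R_2 + R_2 R_3 + R_3 R_1}{R_3}, \quad R_{23} = \frac{R_1 R_2 + R_2 R_3 + R_3 R_1}{R_1}, \quad R_{31} = \frac{R_1 R_2 + R_2 R_3 + R_3 R_1}{R_2}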

For circuits drawn on the plane, these are all the rules you need! This was proved here:

• Yves Colin de Verdière, Isidoro Gitler and Dirk Vertigan, Réseaux électriques planaires II.

It’s just the beginning of a cool story, which I haven’t completely blended with the categorical approach to circuits. Doing so clearly calls for 2-categories: those double arrows are 2-morphisms! For more, see:

• Joshua Alman, Carl Lian and Brandon Tran, Circular planar electrical networks I: The electrical poset EPn.


World Energy Outlook 2015

15 June, 2015

It’s an exciting and nerve-racking time as global carbon emissions from energy production have begun to drop, at least for a little while:



Global energy-related CO2 emissions 1985-2014

yet keeping warming below 2°C seems ever more difficult:



Global energy-related CO2 emissions in the INDC scenario and remaining carbon budget for a >50% chance of keeping to 2°C (IPCC and IEA data; IEA analysis)

The big international climate negotiations to be concluded in Paris in December 2015 bring these issues to the forefront in a dramatic way. Countries are already saying what they plan to do: you can read their Intended Nationally Determined Contributions online!

But it’s hard to get an overall picture of the situation. Here’s a new report that helps:

• International Energy Agency, World Energy Outlook Special Report 2015: Energy and Climate Change.

Since the International Energy Agency seems intelligent to me, I’ll just quote their executive summary. If you’re too busy for even the executive summary, let me summarize the summary:

Given the actions that countries are now planning, we could have an increase of around 2.6 °C over preindustrial temperature by 2100, and more after that.

Executive summary

A major milestone in efforts to combat climate change is fast approaching. The importance of the 21st Conference of the Parties (COP21) – to be held in Paris in December 2015 – rests not only in its specific achievements by way of new contributions, but also in the direction it sets. There are already some encouraging signs with a historic joint announcement by the United States and China on climate change, and climate pledges for COP21 being submitted by a diverse range of countries and in development in many others. The overall test of success for COP21 will be the conviction it conveys that governments are determined to act to the full extent necessary to achieve the goal they have already set to keep the rise in global average temperatures below 2 degrees Celsius (°C), relative to pre-industrial levels.

Energy will be at the core of the discussion. Energy production and use account for two-thirds of the world’s greenhouse-gas (GHG) emissions, meaning that the pledges made at COP21 must bring deep cuts in these emissions, while yet sustaining the growth of the world economy, boosting energy security around the world and bringing modern energy to the billions who lack it today. The agreement reached at COP21 must be comprehensive geographically, which means it must be equitable, reflecting both national responsibilities and prevailing circumstances. The importance of the energy component is why this World Energy Outlook Special Report presents detailed energy and climate analysis for the sector and recommends four key pillars on which COP21 can build success.

Energy and emissions: moving apart?

The use of low-carbon energy sources is expanding rapidly, and there are signs that growth in the global economy and energy-related emissions may be starting to decouple. The global economy grew by around 3% in 2014 but energy-related carbon dioxide (CO2) emissions stayed flat, the first time in at least 40 years that such an outcome has occurred outside economic crisis.

Renewables accounted for nearly half of all new power generation capacity in 2014, led by growth in China, the United States, Japan and Germany, with investment remaining strong (at $270 billion) and costs continuing to fall. The energy intensity of the global economy dropped by 2.3% in 2014, more than double the average rate of fall over the last decade, a result stemming from improved energy efficiency and structural changes in some economies, such as China.

Around 11% of global energy-related CO2 emissions arise in areas that operate a carbon market (where the average price is $7 per tonne of CO2), while 13% of energy-related CO2 emissions arise in markets with fossil-fuel consumption subsidies (an incentive equivalent to $115 per tonne of CO2, on average). There are some encouraging signs on both fronts, with reform in sight for the European Union’s Emissions Trading Scheme and countries including India, Indonesia, Malaysia and Thailand taking the opportunity of lower oil prices to diminish fossil-fuel subsidies, cutting the incentive for wasteful consumption.

The energy contribution to COP21

Nationally determined pledges are the foundation of COP21. Intended Nationally Determined Contributions (INDCs) submitted by countries in advance of COP21 may vary in scope but will contain, implicitly or explicitly, commitments relating to the energy sector. As of 14 May 2015, countries accounting for 34% of energy-related emissions had submitted their new pledges.

A first assessment of the impact of these INDCs and related policy statements (such as by China) on future energy trends is presented in this report in an “INDC Scenario”. This shows, for example, that the United States’ pledge to cut net greenhouse-gas emissions by 26% to 28% by 2025 (relative to 2005 levels) would deliver a major reduction in emissions while the economy grows by more than one-third over current levels. The European Union’s pledge to cut GHG emissions by at least 40% by 2030 (relative to 1990 levels) would see energy-related CO2 emissions decline at nearly twice the rate achieved since 2000, making it one of the world’s least carbon-intensive energy economies. Russia’s energy-related emissions decline slightly from 2013 to 2030 and it meets its 2030 target comfortably, while implementation of Mexico’s pledge would see its energy-related emissions increase slightly while its economy grows much more rapidly. China has yet to submit its INDC, but has stated an intention to achieve a peak in its CO2 emissions around 2030 (if not earlier), an important change in direction, given the pace at which they have grown on average since 2000.

Growth in global energy-related GHG emissions slows but there is no peak by 2030 in the INDC Scenario. The link between global economic output and energy-related GHG emissions weakens significantly, but is not broken: the economy grows by 88% from 2013 to 2030 and energy-related CO2 emissions by 8% (reaching 34.8 gigatonnes). Renewables become the leading source of electricity by 2030, as average annual investment in nonhydro renewables is 80% higher than levels seen since 2000, but inefficient coal-fired power generation capacity declines only slightly.

With INDCs submitted so far, and the planned energy policies in countries that have yet to submit, the world’s estimated remaining carbon budget consistent with a 50% chance of keeping the rise in temperature below 2 °C is consumed by around 2040—eight months later than is projected in the absence of INDCs. This underlines the need for all countries to submit ambitious INDCs for COP21 and for these INDCs to be recognised as a basis upon which to build stronger future action, including from opportunities for collaborative/co-ordinated action or those enabled by a transfer of resources (such as technology and finance). If stronger action is not forthcoming after 2030, the path in the INDC Scenario would be consistent with an average temperature increase of around 2.6 °C by 2100 and 3.5 °C after 2200.

What does the energy sector need from COP21?

National pledges submitted for COP21 need to form the basis for a “virtuous circle” of rising ambition. From COP21, the energy sector needs to see a projection from political leaders at the highest level of clarity of purpose and certainty of action, creating a clear expectation of global and national low-carbon development. Four pillars can support that achievement:

1. Peak in emissions – set the conditions which will achieve an early peak in global energy-related emissions.

2. Five-year revision – review contributions regularly, to test the scope to lift the level of ambition.

3. Lock in the vision – translate the established climate goal into a collective long-term emissions goal, with shorter-term commitments that are consistent with the long-term vision.

4. Track the transition – establish an effective process for tracking achievements in the energy sector.

Peak in emissions

The IEA proposes a bridging strategy that could deliver a peak in global energy-related emissions by 2020. A commitment to target such a near-term peak would send a clear message of political determination to stay below the 2 °C climate limit. The peak can be achieved relying solely on proven technologies and policies, without changing the economic and development prospects of any region, and is presented in a “Bridge Scenario”. The technologies and policies reflected in the Bridge Scenario are essential to secure the long-term decarbonisation of the energy sector and their near-term adoption can help keep the door to the 2 °C goal open. For countries that have submitted their INDCs, the proposed strategy identifies possible areas for over-achievement. For those that have yet to make a submission, it sets out a pragmatic baseline for ambition.

The Bridge Scenario depends upon five measures:

• Increasing energy efficiency in the industry, buildings and transport sectors.

• Progressively reducing the use of the least-efficient coal-fired power plants and banning their construction.

• Increasing investment in renewable energy technologies in the power sector from $270 billion in 2014 to $400 billion in 2030.

• Gradual phasing out of fossil-fuel subsidies to end-users by 2030.

• Reducing methane emissions in oil and gas production.

These measures have profound implications for the global energy mix, putting a brake on growth in oil and coal use within the next five years and further boosting renewables. In the Bridge Scenario, coal use peaks before 2020 and then declines while oil demand rises to 2020 and then plateaus. Total energy-related GHG emissions peak around 2020. Both the energy intensity of the global economy and the carbon intensity of power generation improve by 40% by 2030. China decouples its economic expansion from emissions growth by around 2020, much earlier than otherwise expected, mainly through improving the energy efficiency of industrial motors and the buildings sector, including through standards for appliances and lighting. In countries where emissions are already in decline today, the decoupling of economic growth and emissions is significantly accelerated; compared with recent years, the pace of this decoupling is almost 30% faster in the European Union (due to improved energy efficiency) and in the United States (where renewables contribute one-third of the achieved emissions savings in 2030). In other regions, the link between economic growth and emissions growth is weakened significantly, but the relative importance of different measures varies. India utilises energy more efficiently, helping it to reach its energy sector targets and moderate emissions growth, while the reduction of methane releases from oil and gas production and reforming fossil-fuel subsidies (while providing targeted support for the poorest) are key measures in the Middle East and Africa, and a portfolio of options helps reduce emissions in Southeast Asia. While universal access to modern energy is not achieved in the Bridge Scenario, the efforts to reduce energy related emissions do go hand-in-hand with delivering access to electricity to 1.7 billion people and access to clean cookstoves to 1.6 billion people by 2030.


Network Theory in Turin

23 May, 2015

Here are the slides of the talk I’m giving on Monday to kick off the Categorical Foundations of Network Theory workshop in Turin:

Network theory.

This is a long talk, starting with the reasons I care about this subject, and working into the details of one particular project: the categorical foundations of networks as applied to electrical engineering and control theory. There are lots of links in blue; click on them for more details!


Network Theory Seminar (Part 3)

21 October, 2014

 

This time we use the principle of minimum power to determine what a circuit made of resistors actually does. Its ‘behavior’ is described by a functor sending circuits to linear relations between the potentials and currents at the input and output terminals. We call this the ‘black box’ functor, since it takes a circuit:

and puts a metaphorical ‘black box’ around it:

hiding the circuit’s internal details and letting us see only how it acts as viewed ‘from outside’.

For more, see the lecture notes here:

Network theory (part 32).



Science, Models and Machine Learning

3 September, 2014

guest post by David Tweed

The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, along with writing expository articles. So far we’ve mostly talked about the physics and data of El Niño, along with looking at one method of actually trying to predict El Niño events. Since there’s going to be more data exploration using methods more typical of machine learning, it’s a good time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models. Here we’ll concentrate on the concepts rather than the fine details and particular techniques.

We also stress there’s not a fundamental distinction between machine learning (ML) and statistical modelling and inference. There are certainly differences in culture, background and terminology, but in terms of the actual algorithms and mathematics used there’s a great commonality. Throughout the rest of the article we’ll talk about ‘machine learning models’, but could equally have used ‘statistical models’.

For our purposes here, a model is any object which provides a systematic procedure for taking some input data and producing a prediction of some output. There’s a spectrum of models, ranging from physically based models at one end to purely data driven models at the other. As a very simple example, suppose you commute by car from your place of work to your home and you want to leave work in order to arrive home at 6:30 pm. You can tackle this by building a model which takes as input the day of the week and gives you back a time to leave.

• There’s the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6:30 pm. You might find that the traffic is lighter on weekend days so you can leave at 6:10 pm, while on weekdays you have to leave at 5:45 pm, except on Wednesdays when you have to leave at 5:30 pm. Since you’ve just crunched the data, you have no idea why this works, but it’s a very reliable rule when you use it to predict when you need to leave.

• There’s the physical model approach, where you attempt to infer how many people are doing what on any given day and then figure out what that implies for the traffic levels and thus what time you need to leave. In this case you find out that there’s a mid-week sports game on Wednesday evenings which leads to even higher traffic. This not only predicts that you’ve got to leave at 5:30 pm on Wednesdays but also lets you understand why. (Of course this is just an illustrative example; in climate modelling a physical model would be based upon actual physical laws such as conservation of energy, conservation of momentum, Boyle’s law, etc.)

There are trade-offs between the two types of approach. Data driven modelling is a relatively simple process. In contrast, by proceeding from first principles you’ve got a more detailed framework which is equally predictive but at the cost of having to investigate a lot of complicated underlying effects. Physical models have one interesting advantage: nothing in the data driven model prevents it violating physical laws (e.g., not conserving energy, etc) whereas a physically based model obeys the physical laws by design. This is seldom a problem in practice, but worth keeping in mind.

The situation with data driven techniques is analogous to one of those American medication adverts: there’s the big message about how “using a data driven technique can change your life for the better” while the voiceover gabbles out all sorts of small print. The remainder of this post will describe some of the basic principles in that small print.

Preprocessing and feature extraction

There’s a popular misconception that machine learning works well when you simply collect some data and throw it into a machine learning algorithm. In practice that kind of approach often yields a model that is quite poor. Almost all successful machine learning applications are preceded by some form of data preprocessing. Sometimes this is simply rescaling so that different variables have similar magnitudes, are zero centred, etc.

However, there are often steps that are more involved. For example, many machine learning techniques have what are called ‘kernel variants’ which involve (in a way whose details don’t matter here) using a nonlinear mapping from the original data to a new space which is more amenable to the core algorithm. There are various kernels with the right mathematical properties, and frequently the choice of a good kernel is made either by experimentation or knowledge of the physical principles. Here’s an example (from Wikipedia’s entry on the support vector machine) of how a good choice of kernel can convert a not linearly separable dataset into a linearly separable one:



An extreme example of preprocessing is explicitly extracting features from the data. In ML jargon, a feature “boils down” some observed data into a directly useful value. For example, in the work by Ludescher et al that we’ve been looking at, they don’t feed all the daily time series values into their classifier but take the correlation between different points over a year as the basic features to consider. Since the individual days’ temperatures are incredibly noisy and there are so many of them, extracting features from them gives more useful input data. While these extraction functions could theoretically be learned by the ML algorithm, this is quite a complicated function to learn. By explicitly choosing to represent the data using this feature, the amount the algorithm has to discover is reduced and hence the likelihood of it finding an excellent model is dramatically increased.
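
To make the idea concrete, here’s a toy JavaScript sketch of this kind of feature extraction: it reduces two daily time series to a single number, their correlation over the period. It only illustrates what a feature is; it’s not Ludescher et al’s exact procedure.

// Toy feature extraction: boil two equal-length daily series down to
// one number, their Pearson correlation (between -1 and 1).
function pearsonCorrelation(a, b) {
  var n = a.length;
  var meanA = a.reduce(function (s, v) { return s + v; }, 0) / n;
  var meanB = b.reduce(function (s, v) { return s + v; }, 0) / n;
  var cov = 0, varA = 0, varB = 0;
  for (var i = 0; i < n; i++) {
    var da = a[i] - meanA, db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}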

Limited amounts of data for model development

Some of the problems that we describe below would vanish if we had unlimited amounts of data to use for model development. However, in real cases we often have a strictly limited amount of data we can obtain. Consequently we need methodologies to address the issues that arise when data is limited.

Training sets and test sets

The most common way to work with collected data is to split it into a training set and a test set. The training set is used in the process of determining the best model parameters, while the test set—which is not used in any way in determining those model parameters—is then used to see how effective the model is likely to be on new, unseen data. (The test and training sets need not be equally sized. There are some fitting techniques which need to further subdivide the training set, so that having more training than test data works out best.) This division of data acts to further reduce the effective amount of data used in determining the model parameters.
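
Here’s a minimal JavaScript sketch of such a split: shuffle the examples once, then hold back a fixed fraction as the test set. The function name and interface are illustrative, not part of any particular library.

// Split an array of examples into a training set and a test set.
function trainTestSplit(examples, testFraction) {
  var shuffled = examples.slice();
  for (var i = shuffled.length - 1; i > 0; i--) {   // Fisher-Yates shuffle
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
  }
  var nTest = Math.round(shuffled.length * testFraction);
  return {
    test: shuffled.slice(0, nTest),    // used only for evaluation
    train: shuffled.slice(nTest)       // used to fit the model parameters
  };
}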

After we’ve made this split we have to be careful how much of the test data we scrutinise in any detail, since once it has been investigated it can’t meaningfully be used for testing again, although it can still be used for future training. (Examining the test data is often informally known as burning data.) That only applies to detailed inspection however; one common way to develop a model is to look at some training data and then train the model (also known as fitting the model) on that training data. It can then be evaluated on the test data to see how well it does. It’s also then okay to purely mechanically train the model on the test data and evaluate it on the training data to see how “stable” the performance is. (If you get dramatically different scores then your model is probably flaky!) However, once we start to look at precisely why the model failed on the test data—in order to change the form of the model—the test data has now become training data and can’t be used as test data for future variants of that model. (Remember, the real goal is to accurately predict the outputs for new, unseen inputs!)

Random patterns in small sample sets

Suppose we’re modelling a system which has a true probability distribution P . We can’t directly observe this, but we have some samples S obtained by observing the system, and hence coming from P . Clearly there are problems if we generate this sample in a way that biases which parts of the distribution we sample from: it wouldn’t be a good idea to get training data featuring heights in the American population by only handing out surveys in the locker rooms of basketball facilities. But if we take care to avoid sampling bias as much as possible, then we can make various kinds of estimates of the distribution that we think S comes from.

Let’s consider the estimate P' implied for S by some particular technique. It would be nice if P = P' , wouldn’t it? And indeed many good estimators have the property that as the size of S tends to infinity P' will tend to P . However, for finite sizes of S , and especially for small sizes, P' may have some spurious detail that’s not present in P .

As a simple illustration of this, my computer has a pseudorandom number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got

2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.

Note that we’ve got one subset of 4 values (2928, 6552, 1672, 3937) within the interval of length 5012 between 1540 and 6552, and another subset of 3 values (23440, 23979 and 28451) within an interval of almost the same length between 23440 and 28451. For this uniform distribution the expected number of values falling within a given range of that size is about 1.2. Readers will be familiar with how a quantity estimated from a small sample shows a large amount of variation around its expectation, and that this variation only reduces as the sample size increases, so this isn’t a surprise. However, it does highlight that even completely unbiased sampling from the true distribution will typically give rise to extra ‘structure’ within the distribution implied by the samples.

For example, here are the results from one way of estimating the probability from the samples:



The green line is the true density while the red curve shows the probability density obtained from the samples, with clearly spurious extra structure.

Generalization

Almost all modelling techniques, while not necessarily estimating an explicit probability distribution from the training samples, can be seen as building functions that are related to that probability distribution.

For example, a ‘thresholding classifier’ for dividing input into two output classes will place the threshold at the optimal point for the distribution implied by the samples. As a consequence, one important aim in building machine learning models is to estimate the features that are present in the true probability distribution while not learning such fine details that they are likely to be spurious features due to the small sample size. If you think about this, it’s a bit counter-intuitive: you deliberately don’t want to perfectly reflect every single pattern in the training data. Indeed, specialising a model too closely to the training data is given the name over-fitting.

This brings us to generalization. Strictly speaking generalization is the ability of a model to work well upon unseen instances of the problem (which may be difficult for a variety of reasons). In practice however one tries hard to get representative training data so that the main issue in generalization is in preventing overfitting, and the main way to do that is—as discussed above—to split the data into a set for training and a set only used for testing.

One factor that’s often related to generalization is regularization, which is the general term for adding constraints to the model to prevent it being too flexible. One particularly useful kind of regularization is sparsity. Sparsity refers to the degree to which a model has empty elements, typically represented as 0 coefficients. It’s often possible to incorporate a prior into the modelling procedure which will encourage the model to be sparse. (Recall that in Bayesian inference the prior represents our initial ideas of how likely various different parameter values are.) There are some cases where we have various detailed priors about sparsity for problem specific reasons. However the more common case is having a ‘general modelling’ belief, based upon experience in doing modelling, that sparser models have a better generalization performance.

As an example of using sparsity promoting priors, we can look at linear regression. For standard regression with E examples of y^{(i)} against P -dimensional vectors x^{(i)} , we’re minimizing the mean squared error

\min_{c_1,\dots, c_P} \frac{1}{E}\sum_{i=1}^E (y^{(i)} - \sum_{j=1}^P c_j x^{(i)}_j)^2

while with the l_1 prior we’ve got

\min_{c_1,\dots, c_P} \frac{1}{E} \sum_{i=1}^E (y^{(i)} - \sum_{j=1}^P c_j x^{(i)}_j)^2 + \lambda \sum_{j=1}^P |c_j|

where the c_j are the coefficients to be fitted and \lambda is the prior weight. We can see how the prior weight affects the sparsity of the c_j ’s:



On the x-axis is \lambda while the y-axis is the coefficient value. Each line represents the value of one particular coefficient as \lambda increases. You can see that for very small \lambda – corresponding to a very weak prior – all the weights are non-zero, but as it increases – corresponding to the prior becoming stronger – more and more of them have a value of 0.
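
For the curious, here’s a rough JavaScript sketch of one standard way to fit the l_1 -penalized objective above: proximal gradient descent (ISTA), which alternates a gradient step on the squared error with a soft-thresholding step for the l_1 term. It only illustrates how the penalty pushes coefficients to exactly zero; it’s not the method used to produce the plot.

// Soft-thresholding: shrink v toward zero by t (the proximal map of t*|v|).
function softThreshold(v, t) {
  if (v > t) return v - t;
  if (v < -t) return v + t;
  return 0;
}

// Fit (1/E) * sum_i (y_i - c.x_i)^2 + lambda * sum_j |c_j| by ISTA.
// xs: array of E feature vectors (length P), ys: array of E targets.
function lassoFit(xs, ys, lambda, stepSize, nIters) {
  var E = xs.length, P = xs[0].length;
  var c = new Array(P).fill(0);                 // start with all coefficients 0
  for (var iter = 0; iter < nIters; iter++) {
    var grad = new Array(P).fill(0);            // gradient of the squared error
    for (var i = 0; i < E; i++) {
      var pred = 0;
      for (var j = 0; j < P; j++) pred += c[j] * xs[i][j];
      var resid = pred - ys[i];
      for (var j = 0; j < P; j++) grad[j] += (2 / E) * resid * xs[i][j];
    }
    for (var j = 0; j < P; j++) {
      // Gradient step on the smooth part, then soft-threshold for the l1 prior.
      c[j] = softThreshold(c[j] - stepSize * grad[j], stepSize * lambda);
    }
  }
  return c;   // larger lambda typically drives more entries of c exactly to 0
}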

There are a couple of other reasons for wanting sparse models. The obvious one is speed of model evaluation, although this is much less significant with modern computing power. A less obvious reason is that one can often only effectively utilise a sparse model, e.g., if you’re attempting to see how the input factors should be physically modified in order to affect the real system in a particular way. In this case we might want a good sparse model rather than an excellent dense model.

Utility functions and decision theory

While there are some situations where a model is sought purely to develop knowledge of the universe, in many cases we are interested in models in order to direct actions. For example, having forewarning of El Niño events would enable all sorts of mitigation actions. However, these actions are costly so they shouldn’t be undertaken when there isn’t an upcoming El Niño. When presented with an unseen input the model can either match the actual output (i.e., be right) or differ from the actual output (i.e., be wrong). While it’s impossible to know in advance if a single output will be right or wrong – if we could tell that we’d be better off using that in our model – from the training data it’s generally possible to estimate the fractions of predictions that will be right and will be wrong in a large number of uses. So we want to link these probabilities with the effects of actions taken in response to model predictions.

We can do this using a utility function and a loss function. The utility maps each possible output to a numerical value proportional to the benefit from taking actions when that output was correctly anticipated. The loss maps outputs to a number proportional to the loss from the actions when the output was incorrectly predicted by the model. (There is evidence that human beings often have inconsistent utility and loss functions, but that’s a story for another day…)

There are three common ways the utility and loss functions are used:

• Maximising the expected value of the utility (for the fraction where the prediction is correct) minus the expected value of the loss (for the fraction where the prediction is incorrect).

• Minimising the expected loss while ensuring that the expected utility is at least some value.

• Maximising the expected utility while ensuring that the expected loss is at most some value.
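
Written schematically for a single binary prediction that is correct with estimated probability p , where U is the utility of acting on a correct prediction and L the loss from acting on an incorrect one, the three criteria are roughly (here u_{\min} and \ell_{\max} are whatever thresholds the application demands):

\max \; p\,U - (1-p)\,L

\min \; (1-p)\,L \; \text{ subject to } \; p\,U \ge u_{\min}

\max \; p\,U \; \text{ subject to } \; (1-p)\,L \le \ell_{\max}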

Once we’ve chosen which one we want, it’s often possible to actually tune the fitting of the model to optimize with respect to that criterion.

Of course sometimes when building a model we don’t know enough details of how it will be used to get accurate utility and loss functions (or indeed know how it will be used at all).

Inferring a physical model from a machine learning model

It is certainly possible to take a predictive model obtained by machine learning and use it to figure out a physically based model; this is one way of performing data mining. However, in practice there are several reasons why it’s necessary to take some care when doing this:

• The variables in the training set may be related by some non-observed latent variables which may be difficult to reconstruct without knowledge of the physical laws that are in play. (There are machine learning techniques which attempt to reconstruct unknown latent variables, but this is a much more difficult problem than estimating known but unobserved latent variables.)

• Machine learning models have a maddening ability to find variables that are predictive due to the way the data was gathered. For example, in a vision system aimed at finding tanks all the images of tanks were taken during one day on a military base when there was accidentally a speck of grime on the camera lens, while all the images of things that weren’t tanks were taken on other days. A neural net cunningly learned that to decide if it was being shown a tank it should look for the shadow from the grime.

• It’s common to have some groups of very highly correlated input variables. In that case a model will generally learn a function which utilises an arbitrary linear combination of the correlated variables and an equally good model would result from using any other linear combination. (This is an example of the statistical problem of ‘identifiability’.) Certain sparsity encouraging priors have the useful property of encouraging the model to select only one representative from a group of correlated variables. However, even in that case it’s still important not to assign too much significance to the particular division of model parameters in groups of correlated variables.

• One can often come up with good machine learning models even when physically important variables haven’t been collected in the training data. A related issue is that if all the training data is collected from a particular subspace, factors that aren’t important there won’t be found. For example, if in a collision system to be modelled all the data is collected at low speeds, the machine learning model won’t learn about relativistic effects that only have a big effect at a substantial fraction of the speed of light.

Conclusions

All of the ideas discussed above are really just ways of making sure that work developing statistical/machine learning models for a real problem is producing meaningful results. As Bob Dylan (almost) sang, “to live outside the physical law, you must be honest; I know you always say that you agree”.


The Stochastic Resonance Program (Part 2)

28 August, 2014

guest post by David Tanzer

Last time we introduced the concept of stochastic resonance. Briefly, it’s a way that noise can amplify a signal, by giving an extra nudge that helps a system receiving that signal make the jump from one state to another. Today we’ll describe a program that demonstrates this concept. But first, check it out:

Stochastic resonance.

No installation required! It runs as a web page which allows you to set the parameters of the model and observe the resulting output signal. It responds quickly, because it runs right in your browser, as JavaScript.

There are sliders for controlling the amounts of sine wave and noise involved in the mix. As explained in the previous article, when we set the wave to a level not quite sufficient to cause the system to oscillate between states, and we add in the right amount of noise, stochastic resonance should kick in:


The program implements a mathematical model that runs in discrete time. It has two stable states, and is driven by a combination of a sine forcing function and a noise source.

The code builds on top of a library called JSXGraph, which supports function plotting, interactive graphics, and data visualization.

Running the program

If you haven’t already, go try the program. On one plot it shows a sine wave, called the forcing signal, and a chaotic time-series, called the output signal.

There are four sliders, which we’ll call Amplitude, Frequency, Noise and Sample-Path.

• The Amplitude and Frequency sliders control the sine wave. Try them out.

• The output signal depends, in a complex way, on the sine wave. Vary Amplitude and Frequency to see how they affect the output signal.

• The amount of randomization involved in the process is controlled by the Noise slider. Verify this.

• Change the Sample-Path slider to alter the sequence of random numbers that are fed to the process. This will cause a different instance of the process to be displayed.

Now try to get stochastic resonance to kick in…

Going to the source

Time to look at the blueprints. It’s easy.

• Open the model web page. The code is now running in your browser.

• While there, run your browser’s view-source function. For Firefox on the Mac, press Command-U. For Firefox on the PC, press Ctrl-U.

• You should see the html file for the web page itself.

• See the “script” directives at the head of this file. Each one refers to a JavaScript program on the internet. When the browser sees such a directive, the program is fetched and loaded into the browser’s internal JavaScript interpreter. Here are the directives:

<script src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default">
</script>

<script src=
"http://cdnjs.cloudflare.com/ajax/libs/jsxgraph/0.93/jsxgraphcore.js">
</script>

<script src="./StochasticResonanceEuler.js"></script>

<script src="./normals.js"></script>

The first one loads MathJax, which is a formula-rendering engine. Next comes JSXGraph, a library that provides support for plotting and interactive graphics. Next, StochasticResonanceEuler.js is the main code for the model, and finally, normals.js provides random numbers.

• In the source window, click on the link for StochasticResonanceEuler.js — and you’ve reached the source!

Anatomy of the program

The program implements a stochastic difference equation, which defines the changes in the output signal as a function of its current value and a random noise value.

It consists of the following components:

1. Interactive controls to set parameters

2. Plot of the forcing signal

3. Plot of the output signal

4. A function that defines a particular SDE

5. A simulation loop, which renders the output signal.

The program contains seven functions. The top-level function is initCharts. It dispatches to initControls, which builds the sliders, and initSrBoard, which builds the curve objects for the forcing function and the output signal (called “position curve” in the program). Each curve object is assigned a function that computes the (t,x) values for the time series, which gets called whenever the input parameters change. The function that is assigned to the forcing curve computes the sine wave, and reads the amplitude and frequency values from the sliders.

The calculation method for the output signal is set to the function mkSrPlot, which performs the simulation. It begins by defining a function for the deterministic part of the derivative:

deriv = Deriv(t,x) = SineCurve(t) + BiStable(x),

Then it constructs a “stepper” function, through the call Euler(deriv, tStep). A stepper function maps the current point (t,x) and a noise sample to the next point (t’,x’). The Euler stepper maps

((t,x), noiseSample)

to

(t + tStep, x + tStep * Deriv(t,x) + noiseSample).
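
In JavaScript, the stepping logic just described looks roughly like this. The function names and the double-well form x - x^3 for BiStable are illustrative assumptions, not necessarily the exact code in StochasticResonanceEuler.js:

// Deterministic part of the derivative: sine forcing plus a bistable term.
function makeDeriv(amplitude, frequency) {
  return function (t, x) {
    var sineCurve = amplitude * Math.sin(2 * Math.PI * frequency * t);
    var biStable = x - x * x * x;    // two stable states near x = -1 and x = +1
    return sineCurve + biStable;
  };
}

// Build an Euler stepper: maps the current point (t, x) and a noise
// sample to the next point (t', x').
function makeEulerStepper(deriv, tStep) {
  return function (t, x, noiseSample) {
    return [t + tStep, x + tStep * deriv(t, x) + noiseSample];
  };
}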

The simulation loop is then performed by the function sdeLoop, which is given:

• The stepper function

• The noise amplitude (“dither”)

• The initial point (t0,x0)

• A randomization offset

• The number of points to generate

The current point is initialized to (t0,x0), and then the stepper is repeatedly applied to the current point and the current noise sample. The output returned is the sequence of (t,x) values.

The noise samples are normally distributed random numbers stored in an array. They get scaled by the noise amplitude when they are used. The array contains more values than are needed. By changing the starting point in the array, different instances of the process are obtained.
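
Putting the pieces together, the simulation loop described above can be sketched like this (again, the names and the offset convention are assumptions, not the program’s exact code):

// Run the stochastic difference equation: start at (t0, x0) and repeatedly
// apply the stepper to the current point and the next scaled noise sample.
function runSde(stepper, dither, t0, x0, noiseArray, offset, nPoints) {
  var t = t0, x = x0;
  var points = [[t, x]];
  for (var k = 0; k < nPoints; k++) {
    var noiseSample = dither * noiseArray[offset + k];  // scale stored normals
    var next = stepper(t, x, noiseSample);
    t = next[0];
    x = next[1];
    points.push([t, x]);
  }
  return points;                                        // the (t, x) series
}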

Making your own version of the program

Now let’s tweak the program to do new things.

First let’s make a local copy of the program on your local machine, and get it to run there. Make a directory, say /Users/macbookpro/stochres. Open the html file in the view source window. Paste it into the file /Users/macbookpro/stochres/stochres.html. Next, in the view source window, click on the link to StochasticResonanceEuler.js. Paste the text into /Users/macbookpro/stochres/StochasticResonanceEuler.js.

Now point your browser to the file, with the URL file:///Users/macbookpro/stochres/stochres.html. To prove that you’re really executing the local copy, make a minor edit to the html text, and check that it shows up when you reload the page. Then make a minor edit to StochasticResonanceEuler.js, say by changing the label text on the slider from “forcing function” to “forcing signal.”

Programming exercises

Now let’s get warmed up with some bite-sized programming exercises.

1. Change the color of the sine wave.

2. Change the exponent in the bistable polynomial to values other than 2, to see how this affects the output.

3. Add an integer-valued slider to control this exponent.

4. Modify the program to perform two runs of the process, and show the output signals in different colors.

5. Modify it to perform ten runs, and change the output signal to display the point-wise average of these ten runs.

6. Add an input slider to control the number of runs.

7. Add another plot, which shows the standard deviation of the output signals, at each point in time.

8. Replace the precomputed array of normally distributed random numbers with a run-time computation that uses a random number generator. Use the Sample-Path slider to seed the random number generator.

9. When the sliders are moved, explain the flow of events that causes the recalculation to take place.

A small research project

What is the impact of the frequency of the forcing signal on its transmission through stochastic resonance?

• Make a hypothesis about the relationship.

• Check your hypothesis by varying the Frequency slider.

• Write a function to measure the strength of the output signal at the forcing frequency. Let sinwave be a discretely sampled sine wave at the forcing frequency, and coswave be a discretely sampled cosine wave. Let sindot = the dot product of sinwave and the output signal, and similarly for cosdot. Then the power measure is sindot² + cosdot². (A sketch of this computation appears after this list.)

• Modify the program to perform N trials at each frequency over some specified range of frequency, and measure the average power over all the N trials. Plot the power as a function of frequency.

• The above plot required you to fix a wave amplitude and noise level. Choose five different noise levels, and plot the five curves in one figure. Choose your noise levels in order to explore the range of qualitative behaviors.

• Produce several versions of this five-curve plot, one for each sine amplitude. Again, choose your amplitudes in order to explore the range of qualitative behaviors.
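
Here is a rough JavaScript sketch of the power measure from the third bullet above. The sampling convention (a uniform time step tStep starting at t = 0) is an assumption:

// Project the output signal onto a sine and cosine at the forcing frequency
// and return the sum of the squares of the two dot products.
function powerAtFrequency(output, frequency, tStep) {
  var sindot = 0, cosdot = 0;
  for (var i = 0; i < output.length; i++) {
    var t = i * tStep;
    sindot += output[i] * Math.sin(2 * Math.PI * frequency * t);
    cosdot += output[i] * Math.cos(2 * Math.PI * frequency * t);
  }
  return sindot * sindot + cosdot * cosdot;
}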

