Category Theory for Better Spreadsheets

Since one goal of Azimuth is to connect mathematicians to projects that can more immediately help the world, I want to pass this on. It’s a press release put out by Jocelyn Paine, who has blogged about applied category theory on the n-Category Café. I think he’s a serious guy, so I hope we can help him out!

Spreadsheet researcher Jocelyn Ireson-Paine has launched an Indiegogo campaign to fund a project to make spreadsheets safer. It will show how to write spreadsheets that are easier to read and less error-prone than when written in Excel. This is important because spreadsheet errors have cost some companies millions of pounds, even causing resignations and share-price crashes. An error in one spreadsheet, an economic model written in 2010 by Harvard economists Carmen Reinhart and Kenneth Rogoff, has even been blamed for tax rises and public-sector cuts. If he gets funding, Jocelyn will re-engineer this spreadsheet. He hopes that, because of its notoriety, this will catch public attention.

Reinhart and Rogoff’s spreadsheet was part of a paper on the association between debt and economic growth. They concluded that in countries where debt exceeds 90% of gross domestic product, growth is notably lower. But in spring 2013, University of Massachusetts student Thomas Herndon found they had omitted data when calculating an average. Because their paper’s conclusion supported governments’ austerity programmes, much criticism followed. They even received hate email blaming them for tax rises and public-sector cuts.

Jocelyn said, “The error probably didn’t change the results much. But better software would have made the nature of the error clearer, as well as the economics calculations, thus averting ill-informed and hurtful media criticism. Indeed, it might have avoided the error altogether.”

Jocelyn’s project will use two ideas. One is “literate programming”. Normally, a programmer writes a program first, then adds comments explaining how it works. But in literate programming, the programmer becomes an essayist. He or she first writes the explanation, then inserts the calculations as if putting equations into a maths essay. In ordinary spreadsheets, you’re lucky to get any documentation at all; in literate spreadsheets, documentation comes first.

The other idea is “modularity”. This means building spreadsheets from self-contained parts which can be developed, tested, and documented independently of one another. This gives the spreadsheet’s author less to think about, making mistakes less likely. It also makes it easier to replace parts that do have mistakes.

Jocelyn has embodied these ideas in a piece of software named Excelsior. He said, “‘Excelsior’ means ‘higher’ in Latin, and ‘upwards!’ in Longfellow’s poem. I think of it as meaning ‘upwards from Excel’. In fact, though, it’s the name of a wonderful Oxford café where I used to work on my ideas.”

Jocelyn also wants to show how advanced maths benefits computing. Some of his inspiration came from a paper he found on a friend’s desk in the Oxford University Department of Computer Science. Written by professor Joseph Goguen, this used a branch of maths called category theory to elucidate what it means for something to be part of a system, and how the behaviour of a system arises from the behaviours of its parts. Jocelyn said, “The ideas in the paper were extremely general, applying to many different areas. And when you think of modules as parts, they even apply to spreadsheets. This shows the value of abstraction.”

For more

Postal: 23 Stratfield Road, Oxford, OX2 7BG, UK.
Tel: 07768 534 091.

Campaign Web site: http://igg.me/at/safe-spreadsheets

Jocelyn’s bio: http://www.j-paine.org/bio.html

Jocelyn’s personal website, for academic and general stuff: http://www.j-paine.org/

Background information: links to all topics mentioned can be found at the end of Paine’s campaign text at

These include literate programming, modularity, the Reinhart-Rogoff spreadsheet, category theory, and many horror stories about the damage caused by spreadsheet errors.

16 Responses to Category Theory for Better Spreadsheets

1. arch1 says:

I wonder whether category theory has something to teach Systems Engineering in general, since “How the behaviour of a system arises from the behaviours of its parts” is central to SE.

• John Baez says:

One of my big goals in life now is to connect category theory to systems engineering and then systems biology. Try the very next blog article for more on this:

Categories in control, Azimuth.

I don’t think category theory has a lot to teach systems engineering right away: first we need people who understand both category theory and systems engineering at a reasonably deep level! However, from what little I’ve learned of systems engineering so far, it seems that category theory is a natural mathematical language for this subject. So, my first project is to lay down bridges between category theory and systems engineering. Once people start walking back and forth along these bridges I feel sure they will become an interesting trade route.

More concretely, my students and I have been working on the mathematics of signal-flow diagrams, and we’ve seen how they’re morphisms in a mathematically very natural category, and how there’s a functor from a category where the morphisms are electrical circuits to this category. For details try that blog article.

• arch1 says:

Thanks John I will. BTW the latest posting on Terry Tao’s blog (concerning a variant of the Navier-Stokes equation) prominently features a “‘circuit diagram’ that is analogous (but not identical) to the circuit diagrams arising in electrical engineering”.

I don’t know if that has any connection to category theory (except that I understand each about equally well:-).

• I have a friend who has been inspired by Robert Rosen’s work on systems biology, and that of Andrée Ehresmann and Jean-Paul Vanbremeersch on memory evolutive systems. (There’s a paper giving their categorical formulation of these here, and I see that David Corfield posted about them in the n-Category Café, attracting a lot of comments.) My friend is using these as a language in which to re-found economics. I’ll point him at these postings and ask him whether he’d like to explain his work.

• John Baez says:

Great!

• Hi Jocelyn,

here we go. The “re-foundation” of economics is a bit too large of a word but I think some interested researchers and me are working at one of the core characteristics of social systems namely that the model interacts with the modelled system, the society, or that modelling takes place in the modelled system. This happens also in the small in personal situations like in “I know that you know that I know and so on” or in self-fulfilling prophecies.

We call the attempt to model these issues reflexive economics. The basic structures that need to be considered are those to get along with the circularities involved. One handy structure are coalgebras that may represent infinite data types, as a computer scientists would call it. Another is the reflexivity that is inherent in lambda calculus where terms can be the operator, what is operated on and the result of an operation. The same meta level structure is inherent in the notion of a universal Turing machine that may simulate and operate on any other Turing machine.

I guess the two basic underlying advantages to use categories is that properties can be defined by embedding, i.e. self-participating universals and that categories are “fractal” in that morphisms can be again objects in other categories. It opens the path to think about a related question of reflexivity that arises in social systems, namely how do these structures that mutually contain each other compose into hole systems, where again as a “vertical” reflexivity subparts think about or model the hole and the hole influences or constrains the subparts.

The reflexive concerns are known in social sciences since long to be at the core of many most important issues if not being the defining feature of social science but it is not clear how to approach that mathematically, and this is what we try to explore. However, we are really at the very beginning of finding a proper mathematical form.

Jocelyn, it’s a small world in that John and me are co-organizing a Dagstuhl workshop where this issue of reflexivity will be one topic. There is another one only on coalgebras in reflexive economics.

Dusko Pavlovic and I are working on a basic framework for reflexivity in social systems (using game theory, the universality of Turing machines, coalgebras and Kleene’s recursion theorem) in his asecolab on cyber attacks. There are a few (preliminary) papers on my website that is something of a beginning of mathematical reflexive economics.

Viktor Winschel

• John Baez says:

Oh, so you’re the friend Jocelyn was talking about! Great—it will be good to hear more about your work in Dagstuhl, and thanks for all the links.

2. Excel is a weapon of mass financial destruction. In Australia at the time of the GFC someone in a quant group in a major bank had developed a pricing model in Excel (despite a policy of such programs being coded in C++ via a specialised platform) and got correlation coefficients outside the interval [-1,1]. This was never picked up by the writer of the spreadsheet nor any user until the actual conditions gave rise to absurd results.

Jocelyn is on to something.

3. westy31 says:

The “problem” I think is that Excel is a universal Turing machine. Therefore, it can do anything, including things that its human masters did not intend it to do.

How can a computer distinguish between a user forgetting something, and a user interested in investigating the effect of leaving that thing out?

Gerard

• John Baez says:

Type-checking is one way for a universal computer to notice when you’re making dumb mistakes and say “warning: dumb mistake!” Type-checking is very much category-theoretic, in that it often amounts to saying you shouldn’t compose arrows $f: x \to y$ and $g: y' \to z$ unless $y = y'.$

But there is probably a whole branch of computer science explicitly devoted to creating languages and software that make it harder to make dumb mistakes while still keeping it easy to do what you want. If there’s not, there should be.

• Robert Smart says:

Well if every sufficiently interesting computer activity can be regarded as programming, and if the type of a value is everything that the compiler can know about that value, then maybe type checking is the way to find all dumb mistakes. It is perhaps worth noting that the experts only recently worked out how to handle IO in a mathematically nice programming language. This used monads and has opened the flood gates to applying category theory to programming. However I think existing programming languages have trouble with the common situation where one type is a special case of another, and most particularly with the common case where we want to say “and these types fit in between those 2 types, but in different ways”.

• “But there is probably a whole branch of computer science explicitly devoted to creating languages and software that make it harder to make dumb mistakes while still keeping it easy to do what you want. If there’s not, there should be.”

There is certainly a huge amount of research into topics such as types and type-checking. The work that Robert Smart mentions below on monads and IO is one example. But sadly, I don’t think this separate branch of computer science exists yet.

As an example, take my Excelsior programming language. I wanted it to be type-checked. We know that compile-time type-checking avoids mistakes. So far so good.

But then I had to choose a notation for my types and type statements. This is where computer science intersects with psychology. Some notations, one presumes, are easier for humans to read or write than others. Depending probably on the reader or author’s previous experience. But is there a “Handbook of Optimum Notation” for me to look up the optimum design of type statements in? No! (*).

And similarly for the other constructs in my language. When it comes to designing notation, I’m not sure that we’ve advanced much beyond Alfred North Whitehead’s famous remark:

It is a profoundly erroneous truism, repeated by all copy-books and by eminent people when they are making speeches, that we should cultivate the habit of thinking of what we are doing. The precise opposite is the case. Civilization advances by extending the number of important operations which we can perform without thinking about them. Operations of thought are like cavalry charges in a battle they are strictly limited in number, they require fresh horses, and must only be made at decisive moments.

(*) Granted, this is a very subtle matter. Perhaps hard-to-read type statements are easier to remember! That’s only a guess, which I make by analogy with theresearch reported here: Hard-to-Read Fonts Promote Better Recall.

4. John, thanks for the intro! Yes, my work is concerned with making spreadsheets easier to read and safer to program. I suspect there aren’t any climate models coded in Excel (*), so I probably don’t have much to do with that particular aspect of saving the planet.

But there certainly are financial and economic models, and the Reinhart-Rogoff business suggests how influential these can be. Imagine for example an economic model, misprogrammed in Excel, that tells politicians they need to save money by making massive cuts in environmental protection (**). Or a misprogrammed geological spreadsheet that tells frackers they can safely solve our oil needs by tearing up the East Midlands.

My work’s connection with category theory came about via Joseph Goguen‘s work on algebraic semantics and the OBJ family of programming languages, and on a categorical description of interacting objects. In a paper Module Expressions for Modularising Spreadsheets and Sharing Code between Them, I’ve explained the relation between spreadsheets and the way the OBJ languages treat modules. Warning: the paper is terse, because of the page limits for the conference concerned, so may not be all that pleasant to read. There’s a description of a graphical interface for modules in a blog posting I wrote for Doctor Dobbs, Spreadsheet Components, Google Spreadsheets, and Code Reuse.

Categories aren’t the only influence on my work. I’ve designed a programming language that maps on to spreadsheets in a very simple fashion, but that is easier to read because it uses named variables rather than A1-style cell names. (It also has compile-time type-checking). The module system mentioned above fits very nicely into this. For a simple example of the language, see my blog posting Kaprekar’s Constant in Excel and Excelsior. This shows how I coded a spreadsheet that calculates one of the most useless integers ever, .6174.

I’ve also experimented with tools that make it easier to understand existing spreadsheets by reverse-engineering them into this language. There’s a short explanation at How to Reveal Implicit Structure in Spreadsheets.

Of course, even with better tools, people will still make errors. Peter Haggstrom comments above on a quant whose Excel pricing model got correlation coefficients outside the interval [-1,1]. That could have happened in any language, and the remedies include proper testing, proper documentation, and making sure that you have the right staff, i.e. decent statisticians!

(*) Google showed me that I’m wrong. I ought to look into this.

(**) That’s on my mind at the moment because of the terrible flood damage happening in Somerset, though I’m not suggesting the lack of preparedness was caused by a faulty spreadsheet.

• westy31 says:

My first comment was a bit skeptical, I wanted to object against the idea of preventing all errors, which I think we agree is not possible. But do I agree there must be good ways of making less mistakes.

I use Excel a lot, the main reason being that most people I work with also know how to use it, making it easy to share things. Reusing the same software is perhaps the most effective way of eliminating mistakes.
Above a certain complexity in the calculations, I switch to VBA.

I will read your articles. With many millions of users, we should not underestimate the impact of this software on society.

Gerard

5. There are earthquake models coded in Excel, in Italy.