## Biology and the Pi-Calculus

I’m going to try posting more short articles here. I often read papers and want to jot down a few notes about them. I’ve avoided doing that here, on the theory that articles here should be beautiful works of art. Now I’ll try being a bit more relaxed. I hope you ask questions, or fill in details I leave out.

My former grad student Mike Stay has become fascinated by the pi-calculus as model of computation. The more famous lambda-calculus is a simple model of functional programming: each expression in this calculus can be seen as a program, or a description of a function. The pi-calculus can do all the same things, but more: it models processes that can compute but also create channels and send messages to each other along these channels.

I’m trying to use network theory to understand the essence of biology in a very simplified way. I’m not convinced that the pi-calculus is the right formalism for this. But some people are trying to apply it to cell biology, so I want to learn what they’re doing.

I’m especially fascinated by this paper, and I’d like to understand it in detail:

• Davide Chiarugi, Pierpaolo Degano and Roberto Marangoni, A computational approach to the functional screening of genomes, PLoS Computational Biology, 28 September 2007.

They used a framework for describing computational networks, the “enhanced pi-calculus”, to help find a minimal set of genes required for the survival of a bacterial cell. If we knew this, it would help us make a guess for the genome of the Last Universal Ancestor. This is the most recent ancestor of all organisms now living on Earth—some single-celled guy, between 3.5 and 3.8 billion years old.

I’m not mainly interested in understanding the Last Universal Ancestor: I’m mainly interested in understanding a very simple form of life. So, I want to find what sort of equations they’re using to describe what happens in a cell… and how they’re converting their pi-calculus model to those equations. Unfortunately this paper doesn’t say, and it doesn’t seem like their code is open-source. But I’ll poke around.

I know how to use Petri nets to model chemical reactions. In their paper they mildly disparage these, and refer to other papers on this general issue. So, I should read that stuff. But I want to see whether Petri nets are merely inconvenient for them, or insufficiently powerful to describe their model.

I especially want to see if their model includes processes involving membranes, like the cell membrane, since Petri nets are not really suited to these. A eukaryotic cell has lots of organelles with membranes, so we will need some kind of ‘membrane calculus’ to understand it, and I’ve been gradually gearing up to understand that. But it’s possible their prokaryotic cell is just treated as a single bag of chemicals all of which can freely react with each other. Petri nets would suffice for this!

They write:

The π-calculus was designed to express, run, and reason about concurrent systems. These are abstract systems composed of processes, i.e., autonomous, independent processing units that run in parallel and eventually communicate, by exchanging messages through channels. A biochemical reaction between two metabolites, catalyzed by an enzyme, can be modeled in π-calculus as a communication. The two metabolites are represented by two processes, and, in our approach, the enzyme is modeled as the channel which permits the communication.

In addition to communications, the π-calculus also allows us to specify silent internal actions, used to model those activities of the cell, the details of which we are not interested in (e.g., the pure presence of a catalyst in a reaction, where it is not actively involved). The calculus has the means to express alternative behavior, when a metabolite can act in different possible manners: the way to follow is chosen according to a given probability distribution.

The main difference between the standard π-calculus and the enhanced version we used in this work is the notion of address. An address is a unique identifier of a process, totally transparent to the user, automatically assigned to all of its child subprocesses. This labeling technique helps in tracking the history of virtual metabolites and reasoning about computations, in a purely mechanical way. In particular, stochastic implementation or causality are kept implicit and are recovered as needed.

I would also like to understand all the chemical reactions that occur in their minimal model of a cell. They started by simulating an already known model, the ‘minimal gene set’ or ‘MGS’:

The MGS-prokaryote has been exhaustively described in the enhanced π-calculus. We represented the 237 genes, their relative products, and the metabolic pathways expressed and regulated by the genes, as the corresponding processes and channels. In particular: the glycolytic pathway, the pentose phosphate pathway, the pathways involved in nucleotide, aminoacids, coenzyme, lipids, and glycerol metabolism. Moreover, MGS genes encode for a set of membrane carriers for metabolite uptake, including the PTS carrier. We placed this virtual cell in an optimal virtual environment, in which all nutrients and water were available, and where no problems were present in eliminating waste substances.

However, they found that a cell described by this minimal gene set would soon die! So, they added more features until they got a cell that could survive.

This seems like a fun way to learn a manageable amount of biochemistry. I’m less interested in knowing all the molecules on a first-name basis than in getting a rough overview of what goes on the simplest organisms.

Many of the processes mentioned above seem adequately described by chemical reaction networks—or in other words, Petri nets. For example, here is the glycolytic pathway:

I didn’t know what the ‘PTS carrier’ is, but Graham Jones helped out, writing:

Wikipedia says the ‘phosphotransferase system or PTS, is a distinct method used by bacteria for sugar uptake where the source of energy is from phosphoenolpyruvate (PEP).’)

and

Pyruvate is then converted to acetate which, being a catabolite, can diffuse out of the cell. A transmembrane reduced-NAD dehydrogenase complex catalyzes the oxidation of reduced-NAD; this reaction is coupled with the synthesis of ATP through the ATP synthase/ATPase transmembrane system. This set of reactions enables the cell to manage its energetic metabolism.

and

The cell “imports” fatty acids, glycerol, and some other metabolites, e.g., choline, and uses them for the synthesis of triglycerides and phospholipids; these are essential components of the plasma membrane. (plasma membrane = cell membrane).

I guess that a Petri net could suffice, with some tricks to represent the environment outside the cell. But I think it would need to be stochastic, not just differential equations, because some processes would involve a single copy, or a tiny number of copies, of a molecule.

### For more

For more references and links to the pi-calculus, other process calculi and their applications to biology, see:

• Azimuth Library, Process calculus.

### 15 Responses to Biology and the Pi-Calculus

1. Walter Blackstock says:

Larry Moran, a Professor of Biochemistry in Toronto, recently discussed in his blog Sandwalk the question, “How many human protein-coding genes are essential for cell survival?” Granted, the examples are for eukaryotic cell lines, not prokaryotic minimal gene set models, but the numbers are interesting. Citing three 2015 papers (one using CRISPR screens) he notes:

“Each group identified between 1500 and 2000 protein-coding genes that are essential in their chosen cell lines”.

He thinks this is too low, supporting his view with the glycolytic pathway and citric acid cycle.

2. John Baez says:

On G+ Marshall Hampton wrote:

I wonder what their “optimal virtual environment” was. The actual environment of the LUCA was probably a hydrothermal vent with some very steep chemical and thermal gradients, and rather unusual chemical components compared to modern oceans.﻿

I think their “optimal virtual environment” was much less challenging than this. They said it was one “in which all nutrients and water were available, and where no problems were present in eliminating waste substances”. I believe the cells were in sugar water, since they say:

A large number of simulations (about 5,000) have been run, differing in the values of the initial parameters. We independently varied the amount of glucose in the extracellular environment (the number of copies was in the range 100–5,000) …

By “copies” I guess they mean molecules.

• Graham says:

The earliest life is a very different thing from the LAST universal common ancestor. A billion years may have separated the first self-replicator from the LUCA, including the RNA world (https://en.wikipedia.org/wiki/RNA_world).

3. John Baez says:

On G+ Vance Morris wrote:

“A biochemical reaction between two metabolites, catalyzed by an enzyme, can be modeled in π-calculus as a communication. The two metabolites are represented by two processes, and, in our approach, the enzyme is modeled as the channel which permits the communication.”

I have no inclination regarding Petri nets, but it seems to me that one could use pi-calculus to describe each side of a plasma membrane and describe a channel for communication and translation (the actual membrane).﻿

That sounds right. I wish they gave details, but they don’t! Not in this paper, at least. Walter Blackstock was complaining about some math papers being too long, but this paper goes to the other extreme and does not provide anywhere near enough detail to verify or even understand what’s been done.

• I like to say that I studied math because I’m too stupid for anything else — above all, biochemistry. Still I can’t help being fascinated this stuff, mainly for the philosophical question: Is Life computable?

1) The question of autopoiesis: Are the enzymes/channels produced by other enzymes/channels?
2) They write: “ViCe has no pathways for amino acid synthesis.”. Do amino acids occur in abiotic nature? I guess not. Then ViCe would be far away from the primordial ancestral cell.

• John Baez says:

Martin wrote:

1) The question of autopoiesis: Are the enzymes/channels produced by other enzymes/channels?

Unfortunately they don’t seem to say! First they simulated their competitor’s cell, the one with the claimed “minimal gene set”. It crashed and died quite soon:

Their own model cell lasted longer. But I don’t think they got it to reproduce! So it could be like simulating a guy but leaving out his testes. There’s another experiment where people claim to have simulated the whole life cycle of a bacterium, including reproduction. But that was a much bigger product:

The simulation of the complete life cycle of the pathogen, Mycoplasma genitalium, was presented on Friday in the journal Cell. The scientists called it a “first draft” but added that the effort was the first time an entire organism had been modeled in such detail — in this case, all of its 525 genes.

The simulation, which runs on a cluster of 128 computers, models the complete life span of the cell at the molecular level, charting the interactions of 28 categories of molecules — including DNA, RNA, proteins and small molecules known as metabolites, which are generated by cell processes.

[…]

For the new computer simulation, the researchers had the advantage of extensive scientific literature on the bacterium. They were able to use data taken from more than 900 scientific papers to validate the accuracy of their software model.

Still, they said, the model of the simplest biological system was pushing the limits of their computers.

“Right now, running a simulation for a single cell to divide only one time takes around 10 hours and generates half a gigabyte of data,” Dr. Covert wrote. “I find this fact completely fascinating, because I don’t know that anyone has ever asked how much data a living thing truly holds. We often think of the DNA as the storage medium, but clearly there is more to it than that.”

Martin wrote:

2) They write: “ViCe has no pathways for amino acid synthesis.” Do amino acids occur in abiotic nature? I guess not. Then ViCe would be far away from the primordial ancestral cell.

There actually are some amino acids in abiotic nature: there’s even glycine in interstellar clouds and comets. But the cell being simulated here is not an attempt to reconstruct one of the first organisms: it’s an attempt to reconstruct the last organism that’s an ancestor to all living today. So, it would be the result of at least hundreds of millions of years of evolution, and it would probably live in an environment where ‘primordial’ supplies of chemicals had already been largely eaten.

4. Mike Stay says:

Luca Cardelli has done a lot of stuff with simulating cells with pi calculus:
http://lucacardelli.name/biocomputing.htm

• John Baez says:

Thanks! I know about some of those papers. I want to get into this stuff more deeply now.

Corrado Priami introduced the stochastic pi-calculus and applied it to biology here:

• C. Priami, Stochastic pi-calculus, The Computer Journal 38 (1995), 578–589.

• C. Priami, A. Regev, E. Shapiro, W. Silverman, Application of a stochastic name-passing calculus to representation and simulation of molecular processes, Information Processing Letters 80 (2001), 25.

Also, COSBI (www.cosbi.eu) developed a set of languages to model biological systems which are related to the pi-calculus (e.g., beta-binders, BlenX, L languages). Most of them are implemented and freely available online (http://www.cosbi.eu/research/prototypes and also http://www.cosbi.eu/research/cosbi-lab). Some relevant papers are here:

• C. Priami and P. Quaglia, Beta binders for biological interactions. Computational Methods in Systems Biology (2005), 20–33.

• L. Dematte, C. Priami, and A. Romanel, The Beta Workbench: a computational tool to study the dynamics of biological systems, Briefings in Bioinformatics 9 (2008), 437–449.

• R. Gostner, B. Baldacci, M.J. Morine and C. Priami, Graphical modelling tools for systems biology, ACM Computing Surveys 47 (2014).

Lot of details on language-based models of biology can be found in this book:

• C. Priami and M.J. Morine, Analysis of Biological Systems, Imperial College Press, 2015.

I need to get this and read it!

5. Marc Harper says:

Regarding minimal sets of genes, it’s crucial to identify the environment. For example, Spiegelman’s monster is an RNA strand selected for minimal length in a solution that contained the proteins necessary for replication, and so those proteins did not need to be coded for in the final/minimal genome, and indeed they were lost after many generations of evolution. [1]

For bacteria, if the environment is e.g. a Petri dish, then the nutrient medium is crucial. If glucose is the only carbon source in the medium then the cell does not need the genes for other utilizing other carbon sources (unless, of course, those genes are critical for other metabolic pathways). We don’t really know what the environment of the last universal ancestor was, but knowing some of its genes would shed some light on its environment — it’s an intringuing question biologically. Of course some information may have been irrevocably lost since there have been major changes to the global enviroment (reductive to oxidative being a big one), but perhaps that information is preserved in extremophiles.

I would guess that a minimal set of genes for say E. coli could be arrived at empirically by knocking out genes in successive generations perhaps pathway by pathway, using known functional association databases to avoid genes that are involved in multiple pathways, and allowing other genes to atrophy by mutation. It would be tedious work but within current technological reach. This could be repeated for other organisms (preferrably distintly related) and the intersection of the minimal genes sets would likely have to been in the LUA (more precisely, genes that are ancestors of the common genes). I’m imagining something much like the inference of extinct ancestor languages in linguistics.

6. domenico says:

I don’t see much difference between a chain of chemical reaction and a human thought, through a chemical action potential.
It can be that the cell signaling and g-proteins can forming a though in a single cell, like a primordial brain; so that it could be possible to write a primordial theory for the chain of chemical reactions to obtain a first conscious program.
The last universal ancestor could be a first self-replicating rna that it is connected with an helical capsid, so that in a right – with complex molecules – environment the evolution can be possible, with a great and stable Helmholtz free energy reservoir (like a hydrothermal vent) that could generate these complex molecules.
It is interesting that some researchers try to simulate some complex biological behaviour using all branches of knowledge.

7. leithaus says:

Regarding Petri nets and expressiveness, there are competing understandings of expressiveness. The older notion was related to functionally complete, aka Turing complete. Can i say anything in this formalism that i can say in lambda calculus or Turing machines? Iirc, Petri nets with inhibitor arcs are functionally complete.

More modern notions of expressiveness stemming from investigations of programming language semantics and the relationship of programming language semantics to category theory are much more nuanced. They get at how well a particular language expresses a particular notion (typically captured in yet another language!). In this domain the moral equivalent of functoriality plays a big role.

Let me illustrate with an example. There are several fragments and variations of the π-calculus. One of them is the asynchronous (no continuations after output), choice-free (no summation construct) π-calculus. This is provably Turing complete (because you can give a rather elementary encoding of the lambda calculus into it). However, if you compare it to another Turing complete fragment, the synchronous (output guards on continuations) π-calculus with mixed choice (summations that involve both input and output guards), then you can see an expressiveness gap.

Catuscia Palamidessi shows that there is no translation from the latter fragment (synchronous with mixed sums) to the former (asynchronous with no sums) that preserves parallel composition. That is, if [| – |] is our translation function, then insisting that

[| P|Q |] = [| P |] | [| Q |]

exposes an expressiveness gap between the two calculi because no such translation exists. We can argue whether this is a reasonable requirement. But it is still a nice way to separate the two calculi.

There’s a gold standard with respect to expressiveness and this is called full abstraction. Suppose we have two calculi, C1 and C2 and a translation between them [| – |] : C1 -> C2. We want the native notion of equivalence in C1 to exactly match the native notion of equivalence in C2. For example, if C1 is the lambda calculus, then we might use contextual or Morris-style equivalence as the native notion of equivalence, or we might use applicative bisimulation. Meanwhile, if C2 is the π-calculus we might use barbed bisimulation as the native notion of equivalence. Writing ∼ for the equivalence in C1 and ≈ for the equivalence in C2 we can write down what we mean by full abstraction:

[| – |] is fully abstract just when P ∼ Q iff [| P |] ≈ [| Q |].

The content of this definition is that whenever P is substitutable for Q in some computation in C1, then [| P |] is substitutable for [| Q |] in C2. This is a demanding correspondence between the formalisms.

Having laid out these more nuanced notions of expressiveness it becomes possible to get a more nuanced view of when to choose certain formalisms over others when modeling certain kinds of phenomena.

For example, if you compare two mobile process calculi that have been widely used in modeling biological phenomena, namely Milner’s π-calculus with Cardelli and Gordon’s ambient calculus, then you can immediately see practical differences that make a difference. First, as Luca is fond of pointing out, no one has found a fully abstract encoding of the ambient calculus into the π-calculus. So, on the face of it, there appears to be an expressiveness gap.

In the biological domain we are often dealing with phenomena related to membranes. The ambient calculus seems tailor-made for modeling such phenomena, while there’s a kind of conceptual gap in the π-calculus for modeling this kind of thing.

In connection with this Mike Stay and i have been working on results that might be summarized via the slogan “logic is a distributive law” that fits very nicely with our higher-category semantics for the π-calculus. Briefly, if you think of the realizability programme as interpreting formulae in terms of the collections of computations that realize them, then you can interpret a logic as weaving together 3 data: a notion of collecting computations (list, set, bag, etc) which is always modeled as a kind of monad; a notion of computation (lambda, π-calculus, ambient calculus, etc) which turns out to also be modeled in terms of a monad (for the free term algebra); a way of distributing term formation over collections into collections of terms (this turns out to be a distributive law).

Thinking about this idea from the perspective of comparing ambients to π-calculus i have come to suspect that they are talking on completely different levels. i believe that a proper encoding of ambient terms is as modal formulae interpreted over π-calculus terms. This stems from the intuition that an ambient, aka a membrane, identifies a particular kind of collection semantics (everything that’s inside the membrane is collected by the membrane), and the fact that containment in an ambient membrane is dynamic is what motivates the modal aspect of the formulae that interpret ambient agents.

So, this suggests both practically and theoretically why these calculi are of different expressiveness.

Notice that this kind of thing goes the other way. Ambients are so expressive that you can write down leader election (a class of algorithms that’s getting renewed attention what with the growing interest in distributed and decentralized computation on Internet-scale applications) in a single line program. This is simply too expressive to allow as an infrastructure on the Internet at this stage because the trust model for the nodes that run ambient calculus programs would be too high. An attacker could inject code of enormous expressivity into a node and it’s halting hard to say that it doesn’t do unexpectedly bad things. The π-calculus, on the other hand, enjoys a nice result that says you can always encode code-shipping computation into channel-exchanging computation. This makes it much more suitable for today’s Internet with today’s trust model considerations.

To complete the picture and return to the topic of the relationship between Petri nets and π-calculus, Petri nets have a static data flow topology. It can’t change as a result of the computation. The π-calculus has a dynamic communication topology more generally. Which computations are connected to which changes as a result of computation. This corresponds very closely with the way computational activities happen on the Internet. First i connect with amazon.com, but very soon the AMZN servers generate a fresh URL that reflects our specific interaction and during the course of that interaction will generate several more. It also corresponds to what happens in biological interactions where cells, and higher level organisms, are in more or less constant motion. A predator runs after some prey who runs toward a group of more appealing prey of a different species. The communication topology changes. A single cell organism with a flagellum will follow a trail of sugar and move away from a region of poison. The communication topology changes.

It’s very hard to model these kinds of phenomena in the Petri nets. Of course it can be done because Petri nets (with inhibitor arcs) are Turing complete. However, by the time you’ve built the model you’ve probably build a π-calculus interpreter into your translation into Petri nets.

Finally, the biggest problem with Petri nets can be illustrated with the following example. Using the Priami/Silverman/Shapiro approach to modeling chemical reactions in the π-calculus you can do the following kind of thing. Suppose you have a beaker A with some reagents in it and you write down the π-calculus model of that as [| A |]. Similarly, for beaker B with it’s reagents, you write down the π-calculus model as [| B |]. To write down the π-calculus model of beaker C into which you simply pour the contents of beaker A and B, you need only write [| A |] | [| B |], the semantics of parallel composition does the rest. Now, to the best of my knowledge, you can’t do that with Petri nets. If you have an approach that does that, then you’ve got a really significant result that would be of interest to several communities.

• John Baez says:

Thanks for the wonderfully detailed comments. Since you note that you’re working with my student Mike Stay, I suppose it’s not giving away a big secret to reveal that “leithaus” is a pseudonym for Greg Meredith.

I wish I knew all the concepts you’re using. I don’t. But luckily I get the general drift.

You hinted at this observation, but perhaps not everyone noticed, so I’ll say it out loud: more expressiveness is not always better. You mentioned one reason: it makes computer security harder:

Ambients are so expressive that you can write down leader election (a class of algorithms that’s getting renewed attention what with the growing interest in distributed and decentralized computation on Internet-scale applications) in a single line program. This is simply too expressive to allow as an infrastructure on the Internet at this stage because the trust model for the nodes that run ambient calculus programs would be too high. An attacker could inject code of enormous expressivity into a node and it’s halting hard to say that it doesn’t do unexpectedly bad things

But the underlying problem here, that you can’t easily predict the behavior of programs written in highly expressive languages, is also a problem for people wanting to understand models of systems, e.g. in biology. In modelling systems, I generally want to use the least expressive language necessary for a given task—so I can understand what’s going on in the most detail.

Petri nets are great for understanding a single well-mixed container of chemicals, and because they’re rather inexpressive we can prove lots of great theorems about them—as long as we don’t include fancy features like ‘inhibitor arcs’.

Supplementing them with extra features is necessary for understanding full-fledged biochemistry, where membranes and ‘containers’ play a big role. I want to slowly start exploring these extra features, a bit at a time. But extra features that I don’t actually need will just be confusing. So you’re making me feel confirmed in what I already thought, that that pi-calculus is not quite right for what I want. The ambient calculus feels closer, but it feels complicated. For its mathematical elegance, I’m really attracted to this:

• Nicolas Ouray and Gordon Plotkin, Multi-level modelling via stochastic multi-level multiset rewriting.

But I’m just getting started…

8. […]I’m trying to understand some biology. Being a mathematician I’m less interested in all the complicated details of life on this particular planet than in something a bit more abstract. I want to know ‘the language of life’: the right way to talk about living systems.

Of course, there’s no way to reach this goal without learning a lot of the complicated details. But I might as well be honest and state my goal, since it’s bound to put a strange spin on how I learn and talk about biology.

For example, when I heard people were using the pi-calculus to model a very simple bacterium, I wasn’t eager to know how close their model is to the Last Universal Ancestor, the primordial bug from which we all descend. Even though it’s a fascinating question, it’s not one I can help solve. Instead, I wanted to know if the pi-calculus is really the best language for this kind of model. […]

9. a minimal set of genes required for the survival of a bacterial cell. If we knew this, it would help us make a guess for the genome of the Last Universal Ancestor.

It probably wouldn’t help since last ancestors are routinely found by phylogenetic methods to be surprisingly complex. (E.g. LUCA, LECA, ancestor metazoan, et cetera.)

I am reviewing the latest Astrobiology meeting, and a popular description of life was as an opportunistic geological process. Gisser Martin discuss environmental production of biochemicals that are produced in modern cells, and besides amino acids it seems gluconeogenesis/glycolysis was environmental in Hadean oceans. (See e.g. Keller et al on that.) And so were inorganic cell compartments (pores in vents).

Taking that to the extreme, Woese’s method seems to be much more informative. (I.e. to study the phylogenetics of the ribosome, which today dig deep before the LUCA to the RNA/protein ancestor lineage; see Petrov et al on that, together with the recent uncovered set of ancestral polypeptides.)

A fun observation is that while LUCA likely was a vent occupant, the evolution of chemiosmosis seems to be possible only on the surface of alkaline hydrothermal vents and the heat stress protein root matches in temperature, the above indicates that the whole lineage evolved there. Simplicity may yet again be a good guess.

Since comments on LUCA age abound, I note that recent finds may also put other theories in tension, LUCA may have evolved at neck breaking pace by repetitive accretion as both the ribosome and ancestral polypeptide finds suggest. A putative > 4.1 billion year old fossil* was recently described after painstaking work on old zircons. The find needs to be repeated and better constrained by context as it is a microfossil. But it fits nicely with the > 4.3 billion year habitable ocean (see Valley’s latest review on zircon data) and the > 4.2 billion year old LUCA phylogenetic date of TimeTree (from 3 independent papers!). The carbon isotope ratios was the typical of photosynthesis, set by RuBisCO of the Calvin cycle. That general type of metabolism evolved in the bacteria clade and not in archaea, i.e. it could be dating after the first recognizable split from the LUCA population.

*Actually two fossils nicely encapsulated in one zircon, if we count as paleontologists do.