## Information and Entropy in Biological Systems (Part 7)

In 1961, Rolf Landauer argued that the least possible amount of energy required to erase one bit of information stored in memory at temperature $T$ is $kT \ln 2,$ where $k$ is Boltzmann’s constant.
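For a sense of scale, the bound is tiny by everyday standards; here is a quick numerical sketch (the temperatures are illustrative choices):

```python
import math

# Landauer limit: minimum energy k T ln 2 to erase one bit at temperature T.
k_B = 1.380649e-23   # Boltzmann's constant in J/K

def landauer_energy(T):
    """Minimum erasure energy per bit, in joules."""
    return k_B * T * math.log(2)

print(landauer_energy(300.0))   # room temperature: about 2.9e-21 J per bit
print(landauer_energy(310.0))   # body temperature, as used later in the thread
```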

This is called the Landauer limit, and it came after many decades of arguments concerning Maxwell’s demon and the relation between information and entropy.

In fact, these arguments are still not finished. For example, here’s an argument that the Landauer limit is not as solid as widely believed:

• John D. Norton, Waiting for Landauer, Studies in History and Philosophy of Modern Physics 42 (2011), 184–198.

But something like the Landauer limit almost surely holds under some conditions! And if it holds, it puts some limits on what organisms can do. That’s what David Wolpert spoke about at our workshop! You can see his slides here:

You can also watch a video:

### 46 Responses to Information and Entropy in Biological Systems (Part 7)

1. jessemckeown says:

The first thing that puzzles me about this suspicion of Landauer’s, is that the cost of switching increases with temperature, whereas from my (rather naïve) expectation, it should be easier to get hot systems to forget themselves than cold systems; but perhaps I should take that as a clue, because to reliably store one bit in a hotter system requires more redundancy, and therefore more underlying hardware, which it will then cost more to switch just for its being heavier.

I’m not going anywhere with this, just musing (encouraged, perhaps, by someone’s having mentioned that there wasn’t much musing right here on these recent posts).

• John Baez says:

Musing is good!

I guess the first thing to say is that the equation “energy is Boltzmann’s constant times temperature times entropy” is dimensionally correct. Whatever alternative law one might suggest should at least be dimensionally correct.

• Graham Jones says:

I am not a physicist. I have gathered from Wikipedia that

Change in entropy = (change in free energy) / temperature

so it makes sense that if you want to do a lot of information processing for a given amount of energy, you should do it as cold as possible. I guess the catch as far as life is concerned is time. (Energy always goes with time in physics doesn’t it?) You can do more energy-efficient processing at low temperature, but a competitor at higher temperature and with more energy would think faster.

• jessemckeown says:

ooh, Graham reminds me how to turn a frequency into an energy: if we do have a clock running at some frequency, $(\nu \hbar)^2/(k T)$ is another decent energy scale, though admittedly more complicated and contrived.

2. Theophanes Raptis says:

There was never any need for real “switching”. This persisted only because of the historical influence of the old Turing “teletype” paradigm, which led to digital design. In his 2000 thesis, Yuzuru Sato showed how one can imitate digital computation with continuous dynamical systems like modified baker’s maps.
http://www.jucs.org/jucs_6_9/nonlinear_computation_with_switching
Recently, La Roy managed to set up a classical analog computer imitating quantum computations via electrical signals.
http://phys.org/news/2015-05-quantum-emulated-classical.html

• jessemckeown says:

I’m not quite sure what you’re driving at, but perhaps there’s an echo in: after browsing through Norton’s paper, I’m puzzled as to why the “memory” cells are so frequently imagined as pistons at pressure 1-or-2 ; another just-as-plausible memory cell could be realized by pistons at pressure 0.8-or-1.2. Maybe (following Graham’s hint) we should expect such a machine to run slower, BUT…

• Theophanes Raptis says:

There are simpler means to perform “digital” computations without the stringent requirement to erase information and thus increase entropy. One can try, like Sato, to “embed” digital computation into analog. I tried something similar in the Fourier domain, but it’s a hard project to figure out all the engineering details:
http://cag.dat.demokritos.gr/publications/AnalogProcessor.ppt

Theophanes Raptis, this is an interesting website you are on:

EU Funded Innovation in H2020
26-27 June 2015 – Building partnerships for research & business innovation.

The Hellenic Forum is under the auspices of the Hellenic Ministry of Foreign Affairs and aims to act as a meeting point for Science, Technology and Innovation building bridges with scientists abroad, in order to enhance scientific and technological transfer collaboration for innovation. Your technology, Your idea, Your invention is accelerated from Lab to Market at the Innovation Exhibition of NCSR Demokritos.

• Robert says:

Well of course you can, since you can modify the usual digital logic gates to be entirely reversible with extra output bits. Is there a reason why one might want to use an analog system model to accomplish this?

In particular I would expect that an analog model that does something interestingly different from a discrete reversible one is going to sneakily rely on the infinitely many bits available in any precisely defined real number, and so the ‘no loss of information’ will fail when it’s truncated to a finite number of decimal places or when (thermodynamically inevitable!) real world errors and inaccuracies are otherwise taken into account.
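The point above about modifying gates to be reversible with extra output bits can be illustrated with a minimal sketch: an AND gate made reversible in the style of the Toffoli gate.

```python
# Sketch of the point above: the irreversible AND gate becomes reversible
# if we keep extra bits, in the style of the Toffoli gate.
def toffoli(a, b, c):
    """Reversible 3-bit gate: flips c iff a and b are both 1."""
    return a, b, c ^ (a & b)

# With c = 0 the third output is (a AND b), and nothing is erased:
# the gate is its own inverse, so applying it twice restores the inputs.
for a in (0, 1):
    for b in (0, 1):
        out = toffoli(a, b, 0)
        assert out[2] == (a & b)           # computes AND
        assert toffoli(*out) == (a, b, 0)  # self-inverse, no lost information
```

Since the map is a bijection on 3-bit strings, no information is destroyed and the Landauer cost of erasure is avoided (until one eventually discards the extra bits).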

• Theophanes Raptis says:

It is not the same as using an A/D – D/A quantizer where noise eats bits. The key is in the encodings of words, but not by adding bits à la Bennett-Toffoli-Margolus etc. Instead one can utilize frequencies to encode words that do not need to be read and written in the ordinary way. If we can do that with practically constant spectral density then we get what we want. It all ends up with something like a permutation machine on a synthi.

3. Blake Stacey says:

I get a “page not found” error when trying to view the slides.

• John Baez says:

I forgot to change a relative link to an absolute one when copying it… the bane of a despatialized existence. David Wolpert’s talk is actually here.

4. John Baez says:

I forgot to mention that a mathematical proof of a very general form of Landauer’s bound is given here:

• Matteo Smerlak, The mathematical origin of irreversibility, 8 October 2012.

He starts with the Detailed Fluctuation Theorem, uses that to prove Clausius’ inequality, and notes that this is a generalized version of the Landauer limit.

Of course, the hard part in these physics-related theorems is clearly stating the assumptions and the result! There could be other, very different, ways of making Landauer’s limit into a theorem (or non-theorem).

• Theophanes Raptis says:

This is fine, for it does not exclude almost isentropic computations. From a practical engineering perspective, if a computing machine outputs heat at a glacially slow rate, that would be more than enough for any practical purpose, fans aside, not to mention data centers immersed in liquid nitrogen or the like. The true limit of course will come after the final collapse of Moore’s law in one or two decades.

5. Mike Stay says:

Any conserved quantity will do, not just energy.

“Landauer argued that the process of erasing the information stored in a memory device incurs an energy cost in the form of a minimum amount of mechanical work. We find, however, that this energy cost can be reduced to zero by paying a cost in angular momentum or any other conserved quantity. Erasing the memory of Maxwell’s demon in this way implies that work can be extracted from a single thermal reservoir at a cost of angular momentum and an increase in total entropy. The implications of this for the second law of thermodynamics are assessed.”

https://arxiv.org/abs/1004.5330

6. John Baez says:

For some reason Nadja Kutz was unable to post this comment, so I will:

John wrote:

I forgot to mention that a mathematical proof of a very general form of Landauer’s bound is given here:

I have trouble understanding this detailed fluctuation theorem, which is partly because I don’t have the time to delve into it. So I thought it might be good to understand the Landauer limit in terms of the master equation via this theorem. For a two-state system with

$\Psi(t)=(\pi_0,\pi_1)^T$

the matrix $H$ seems to look like

$H=g\left( \begin{array}{rr} -1 &1\\1&-1\end{array}\right)$

where $g >0,$ so Matteo’s $\Sigma=0$ and hence $\langle \Sigma \rangle = 0$ for this case, and thus it should hold that $\langle\Delta I \rangle \geq 0$ (the “Clausius inequality”). I tried to check this. That is, due to the form of $H$ one has, as it seems, that

$\pi_0(T)-\pi_1(T)=(\pi_0(0)-\pi_1(0))e^{-2gT}$

and since

$\pi_1(t)=1-\pi_0(t)$

one can thus compute the time evolution explicitly. And thus one can compute $\Delta I$ explicitly which is according to Matteo and to my interpretation of it in terms of the master equation (??):

$\displaystyle{ \Delta I= \ln(\frac{\pi_0(0)}{\pi_1(T)}) }$

As explained before, $\Delta I$ can be explicitly expressed as a function of $a=\pi_0(0),$ $T$ and $g,$ and so one can integrate

$\displaystyle{ \int_0^1 a \Delta I(a,T,g) da }$

which should be

$\langle \Delta I \rangle$

(if I understood Matteo correctly). I started to integrate this (using Bronstein…) but didn’t finish. It’s a bit messy and I don’t have a computer algebra program, and in particular I got the impression that it is not so easy to see that the Clausius inequality holds. In any case I think it would be interesting to calculate this explicitly, because one could see whether at all, and if so in which cases, the Clausius equality holds (i.e. reversibility). Can someone do this? – I am already late for work.

Thanks John for posting. This is, I think, the second time that the reply button didn’t work; I could have restarted my browser, but as I said I was already late. You wrote me in an email that you might persuade your student Blake Pollard to finish the computation. It would be great if he could check the above reasoning. But I think finishing the computation is more of an undergrad task. Finally, all I ask for is to solve the following integral (I hope there is no typing or calculation error):

$\int_0^1 \ln(\frac{a}{0.5-(a-0.5)e^{-2 g T}})da$

You can plug that into a computer algebra program (and plotting the result as a function of $gT$ would give you information about the inequality), but it could also be done by hand: at first glance it seems one can use substitution and then integral 471 in Bronstein-Semendjajew (24th edition, Teubner Leipzig / Nauka Moscow 1979).

If nobody does it and I find the time then I’ll do it myself. I don’t really want to bother Blake with that tedious work.

Some of my undergrad students in Amherst once came up to me and asked me to solve an integral on the spot (Hey let’s see whether you can solve this…). I thought that by solving it on the spot I could eventually significantly reduce the amount of farting noises, paper planes and comments about my German accent around lessons, so I took the risk. I was lucky and was able to solve the integral. That helped though only a bit.
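For what it’s worth, the explicit solution of the two-state master equation used in the comment above can be checked numerically; a minimal sketch ($g$, $T$ and $\pi_0(0)$ are illustrative choices):

```python
import math

# Check of the claimed explicit solution of the two-state master equation
# d/dt (pi0, pi1) = H (pi0, pi1) with H = g * [[-1, 1], [1, -1]].
# Since pi1 = 1 - pi0, this reduces to dp/dt = g*(1 - 2p) for p = pi0.
g = 1.3            # illustrative rate
pi0 = 0.9          # illustrative initial condition, pi1(0) = 1 - pi0
T, steps = 2.0, 20000
dt = T / steps

p = pi0
for _ in range(steps):                 # forward Euler integration
    p += dt * g * (1.0 - 2.0 * p)

# Claimed closed form: pi0(T) = 1/2 + (pi0(0) - 1/2) e^{-2gT}
closed_form = 0.5 + (pi0 - 0.5) * math.exp(-2.0 * g * T)
print(p, closed_form)   # the two values agree closely
```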

• Blake Pollard says:

I think you have to be a little careful about what you mean by $\Delta I$. In Matteo’s post he defines this quantity for a trajectory starting at state $a_0$ and ending at state $a_N$ at time $T$ after $N+1$ transitions. So if the system starts in state $0$, then after some time $t$ it can end up in either state $0$ or state $1$ and similarly for trajectories initially in state $1$. So we really have something like 4 classes of trajectories to consider. You wrote

$\Delta I = \ln \left( \frac{ \pi_0(0) }{ \pi_1(T)} \right)$

which is the change in self-information along a trajectory starting in state $0$ and ending up in state $1$ at time $t=T$.

Matteo says that $\langle \Delta I \rangle$ should be the change in self-information averaged over all realizations of the process. I’m not sure how to implement that averaging. It seems like it should be the probability of each particular realization of a path times the change in self information for that path, summed over all possible paths. Later Matteo writes that

$\langle \Delta I \rangle = \Delta S \geq \langle \Sigma \rangle$

where

$S= -\sum_i \pi_i(t) \ln ( \pi_i(t) )$

is the Shannon entropy and $\Sigma$ is the skewness, which should give us Clausius’ inequality for the two-state example you mentioned.

If we use that formula then we get that

$\langle \Delta I \rangle = \Delta S = \sum_i -\pi_i (t) \ln ( \pi_i(t) ) + \pi_i(0) \ln (\pi_i(0))$

For the example you proposed this gives

$\langle \Delta I \rangle = -\pi_0(t) \ln( \pi_0(t) ) -\pi_1(t) \ln( \pi_1(t) ) + \pi_0(0) \ln( \pi_0(0) ) + \pi_1(0) \ln ( \pi_1(0) )$

No integrating over time required. I’m still understanding this stuff myself, but I think that in general $\Delta I$ is a quantity that, for a fixed duration of your path $T$ depends only on the end points of your path, so finding its average value corresponds to averaging over paths, not integrating over time. Here I plotted $\Delta S$ in Wolfram Alpha for $\pi_0(0) = .25$ and $\pi_1(0) = .75$ and it looks positive.
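Blake’s $\Delta S$ formula can be checked without symbolic integration; a short sketch using the explicit solution $\pi_0(t)=\tfrac12+(\pi_0(0)-\tfrac12)e^{-2gt}$, with $\pi_0(0)=0.25$ as in the comment and $g=1$ an illustrative rate:

```python
import math

def entropy(p0):
    """Shannon entropy of (p0, 1 - p0), with the convention 0 ln 0 = 0."""
    s = 0.0
    for p in (p0, 1.0 - p0):
        if p > 0.0:
            s -= p * math.log(p)
    return s

pi0_init, g = 0.25, 1.0
S0 = entropy(pi0_init)
for t in [0.1 * n for n in range(0, 31)]:
    # explicit solution of the two-state master equation
    pi0_t = 0.5 + (pi0_init - 0.5) * math.exp(-2.0 * g * t)
    dS = entropy(pi0_t) - S0
    assert dS >= -1e-12   # Delta S stays non-negative, matching the plot
```

The entropy rises monotonically toward $\ln 2$ as the distribution relaxes to $(0.5, 0.5)$, so $\Delta S \geq 0$ at all times for this example.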

• Theophanes Raptis says:

Mighty Maple gives “some” answer which produces a “division by zero” error at any attempt to simplify it, with the exponential substituted back or not. The steps were:

restart;
q := int(x*log(2*x/(1 - k*(2*x - 1))), x);
p := subs(x = 1, q) - subs(x = 0, q);
simplify(p);
Error, (in ln) numeric exception: division by zero.

Blake, it is great that you are looking into this. As I said, I might be wrong; after all, I am just trying to make sense of what I read here and I don’t have much time. But I think it is important to look at this example if one would like to get a glimpse of what is going on with the fluctuation theorem.

I haven’t understood yet why $\langle \Delta I \rangle = \Delta S$.

So I first tried to understand the inequality you wrote above which seems to be a direct consequence of the fluctuation theorem, namely:

$\langle \Delta I \rangle \geq \langle \Sigma I \rangle$

For this special case of two states one always has $\Sigma =0$ – if I got the matrix right. So for this case it should hold that $\langle \Delta I \rangle \geq 0$.

so finding its average value corresponds to averaging over paths, not integrating over time

What I meant by computing the time evolution explicitly is that you can express each time evolution of the probabilities $\pi_i(t)$ as a rather simple function of their initial condition $\pi_i(0)$. And since $\pi_0(0)=1- \pi_1(0)$ (if I understood this master equation right), one has only one initial variable. The time evolution is (if I understood you correctly) what you call “a path”, so one path is uniquely determined in this example by $\pi_i(0)$, which I abbreviated as $\pi_i(0)=a$, and so the integral I am talking about here should be the averaging over all paths, i.e. the time $T$ is treated in this calculation as a parameter (see below, $k=e^{-2gT}$).

Since at first glance it looks as if one gets divergences for $\langle \Delta I \rangle$, one would first need to check whether this is true (e.g. by plotting the integral not from 0 but from a small value; see comment below) or whether there is a mistake/misunderstanding somewhere.

Blake, I guess you can see which are typos and which are not, but just to make sure, it should read:
$\langle \Delta I \rangle \geq \langle \Sigma \rangle$
and
$\pi_0(0)=a$

• Blake Pollard says:

I don’t think that $\pi_0(t)$ is a path. It describes the evolution of the probability of finding the system in the state $0$ at time $t$. A path would be more like a sequence of states along with the times at which transitions between states occurred. So I think the quantity you are calculating is not $\langle \Delta I \rangle$. I think there is a way to understand the equality $\langle \Delta I \rangle = \Delta S$, and that is the crux of the issue, because $\Delta S$ appears to be non-negative for all choices of initial distributions.

• John Baez says:

Blake wrote:

Matteo says that $\langle \Delta I \rangle$ should be the change in self-information averaged over all realizations of the process. I’m not sure how to implement that averaging. It seems like it should be the probability of each particular realization of a path times the change in self information for that path, summed over all possible paths.

$\langle \Delta I \rangle$ should be the change in self-information averaged over all paths. Our Markov process defines a probability measure on the set of paths that start at a given state $a.$ We are also given a probability distribution $\pi(0)$ on the set of states. Putting these together we get a probability measure on the set of all paths. This is what we use to define the average over all paths, which Matteo is denoting $\langle \;\rangle$ here.

In other words: to define $\langle \Delta I \rangle,$ I think you should first take the average of $\Delta I$ over all paths that start at the state $a.$ Then you should sum this over all states $a,$ weighted by the probabilities $\pi_a(0).$ The result should be $\langle \Delta I \rangle.$

You should be able to show that this equals

$\begin{array}{ccl} \Delta S &=& S(\pi(T)) - S(\pi(0)) \\ \\ &=& - \sum_a \pi_a(0) \ln \pi_a(0) + \sum_b \pi_b(T) \ln \pi_b(T) \end{array}$

The key step should be this: $\pi_b(T)$ can be obtained by taking the probability that a path goes from $a$ to $b,$ multiplying it by $\pi_a(0),$ and summing over all $a.$

This says: the probability that you wind up in jail at the end of your life is the probability that you get there if you’re born in any particular state (like Florida), times the probability you were born in that state, summed over states.

This is a well-known fact, but we may need to look at the stuff Matteo wrote at the end of his post, and do some calculations, to prove that it’s true.
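The averaging recipe described above can be checked in a small numerical sketch: for the two-state process the matrix exponential of $H$ is known in closed form, so one can average $\Delta I = \ln(\pi_a(0)/\pi_b(T))$ over endpoint pairs weighted by $\pi_a(0)$ times the transition probability, and compare with $\Delta S$. The values of $g$, $T$ and the initial distribution below are illustrative choices.

```python
import math

# Two-state symmetric Markov process with H = g*[[-1, 1], [1, -1]].
# Its transition probabilities over time T are known in closed form:
#   P(stay)   = (1 + e^{-2gT}) / 2
#   P(switch) = (1 - e^{-2gT}) / 2
g, T = 0.7, 1.5                # illustrative rate and duration
pi0 = [0.25, 0.75]             # illustrative initial distribution pi_a(0)

e = math.exp(-2.0 * g * T)
P = [[(1 + e) / 2, (1 - e) / 2],
     [(1 - e) / 2, (1 + e) / 2]]   # P[a][b] = probability of a -> b

# Final distribution: pi_b(T) = sum_a pi_a(0) P[a][b]
piT = [sum(pi0[a] * P[a][b] for a in range(2)) for b in range(2)]

# <Delta I>: average of ln(pi_a(0)/pi_b(T)) over endpoint pairs (a, b),
# weighted by pi_a(0) * P[a][b]
mean_dI = sum(pi0[a] * P[a][b] * math.log(pi0[a] / piT[b])
              for a in range(2) for b in range(2))

S = lambda p: -sum(x * math.log(x) for x in p if x > 0)
dS = S(piT) - S(pi0)
print(mean_dI, dS)   # the two numbers agree
```

The agreement is exact here because the weighted sum over $b$ of $\ln \pi_b(T)$ reproduces $-S(\pi(T))$ and the sum over $a$ of $\ln \pi_a(0)$ reproduces $-S(\pi(0))$, which is precisely the “well-known fact” used above.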

Blake wrote:

I think there is a way to understand the equality $\langle \Delta I \rangle = \Delta S$

I guess there is quite some guessing involved with what is meant with averaging.

In particular, I was indeed also wondering whether Matteo may have meant something like this kind of average:

$\frac{1}{2} \left( \pi_0(0)\ln \frac{\pi_0(0)}{\pi_1(T)} + \pi_1(T) \ln \frac{\pi_0(0)}{\pi_1(T)} + \pi_1(0)\ln \frac{\pi_1(0)}{\pi_0(T)} + \pi_0(T)\ln \frac{\pi_1(0)}{\pi_0(T)} \right)$

i.e. some “weighted mean”, which at first glance would contain $\Delta S$ apart from a factor of 1/2 and maybe a sign; but first of all I don’t see why the other terms should cancel, nor why this should be an “average over all realizations of the process.”

I wrote:

I guess there is quite some guessing involved with what is meant with averaging.

One should probably look into the original literature in order to find out how the average over the cumulative skewness is defined exactly.

I will probably have problems getting the literature anyway, and for this reason alone I have so far just tried to get an understanding of this skewness thing. For example, it could be interesting to see how this definition is related to other definitions of skewness; Wikipedia lists quite a few.

For example, one could compute the skewnesses of the initial and final distributions, look at their difference, and take the average over that. If I didn’t miscalculate, the skewness for this two-state case ($x_0=0$, $x_1=1$) seems to be just $\frac{1}{\pi_1^2}-\frac{1}{\pi_0^2}$. The cumulative skewness looks, however, more like a mixture of the skewnesses of the initial and final states. According to Wikipedia there exists a so-called distance skewness, and instead of $X'$ being an independent identically distributed copy one could possibly take the time-evolved one

$\mathrm{Skew}(X) := 1 - \frac{\mathrm{E}\|X-X'\|}{\mathrm{E}\|X+X'\|} \text{ if } \Pr(X=0)\ne 1$

compute this “mixed skewness” and the average over all initial distributions of this. But maybe this is a bit too wild.

• John Baez says:

If you look at

• Matteo Smerlak, The mathematical origin of irreversibility, 8 October 2012.

you can see Matteo Smerlak’s definition of the skewness $\Sigma$ and the mean skewness $\langle \Sigma \rangle.$

There must be lots of different things that people call ‘skewness’, but right now I’m just interested in what Matteo is talking about. Other people may have other definitions, but Blake and I went through his arguments in the last few days and they all make sense to me. It took a while.

Matteo wrote:

if $\langle\,\cdot\,\rangle$ denotes the average over all realizations of the process

as an explanation of the average. What is “a realization of a process” ?

• John Baez says:

What is “a realization of a process”?

It’s a thing like this:

$\displaystyle{(a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}) }$

where $a_0, \dots, a_N$ are points in a fixed finite set and $\tau_0, \dots, \tau_N$ are times with

$0 \le \tau_0 \le \tau_1 \le \cdots \le \tau_N \le T$

The term “realization of a process” is jargon from stochastic process theory. We have a stochastic process, in particular a Markov process, which gives a probability measure on the set of “realizations”.

Near the top of his article he calls the realization a “history”:

each possible history $\omega=(\omega_t)_{0\leq t\leq T}$ of this process can be characterized by the sequence of occupied states $a_{j}$ and by the times $\tau_{j}$ at which the transitions $a_{j-1}\longrightarrow a_{j}$ occur $(0\leq j\leq N)$:

$\omega=(\omega_{0}=a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}=\omega_{T}).$

If you look near the bottom of his article, where he proves the main results, you’ll see he’s treating his Markov process as a triple $(\Omega,\mathcal{T},p)$. Here $\Omega$ is the space of “histories” or “realizations”, $p$ is a probability distribution on this space, and $\mathcal{T}$ is a $\sigma$-algebra on this space (which you can safely ignore for now).

The cumulative skewness $\Sigma$ is a function on $\Omega.$ The average $\langle \Sigma \rangle$ is defined by

$\langle \Sigma \rangle = \int_\Omega \Sigma(\omega) dp(\omega)$

He gives more detail about $p$ down here, too.
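To make “realization” concrete, one can actually sample such histories. Below is a minimal Monte Carlo sketch for the symmetric two-state process (the rate $g$, the time window $[0,T]$ and the initial distribution are illustrative choices), checking that the endpoints of the sampled histories reproduce $\pi_b(T)$ from the master equation.

```python
import math, random

# Monte Carlo sketch of "realizations" of the two-state Markov process:
# each history is a sequence of states with exponential holding times,
# here with rate g for leaving either state.
random.seed(1)
g, T = 1.0, 1.0
pi0 = [0.25, 0.75]    # illustrative initial distribution

def sample_history():
    """One realization on [0, T]: returns (final state, number of jumps)."""
    a = 0 if random.random() < pi0[0] else 1   # draw the initial state
    t, jumps = 0.0, 0
    while True:
        t += random.expovariate(g)     # holding time in the current state
        if t > T:
            return a, jumps
        a = 1 - a                      # transition a_{j-1} -> a_j at time t
        jumps += 1

N = 200000
final_counts = [0, 0]
for _ in range(N):
    a, _ = sample_history()
    final_counts[a] += 1

# Compare the empirical final distribution with pi_0(T) from the
# master equation: pi_0(T) = 1/2 + (pi_0(0) - 1/2) e^{-2gT}.
exact = 0.5 + (pi0[0] - 0.5) * math.exp(-2.0 * g * T)
print(final_counts[0] / N, exact)   # close to each other
```

Each run of `sample_history` is exactly one point $\omega \in \Omega$, and averaging any function of the history over many runs approximates the integral $\int_\Omega \cdot \; dp(\omega)$.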

Thanks John for trying to explain this to me.

If you look near the bottom of his article, where he proves the main results,

I didn’t look at the bottom of his article, because before trying to understand the proof I would like to understand what the proof is about. If I haven’t totally misunderstood the master equation, then for the master equation one uses a discrete probability distribution, so at least for the moment I prefer to stay away from sigma-algebras.

The term “realization of a process” is jargon from stochastic process theory. We have a stochastic process, in particular a Markov process, which gives a probability measure on the set of “realizations”.

Near the top of his article he calls the realization a “history”

OK. So for stochastic process people a history is the same as a realization of a process (quite different from real life though…).

If I have understood the master equation and Matteo right, then the probability of the state $a_i$ is given by the $i$th entry of $\psi$, i.e. $\psi_i$, or as called above $\pi_i$, or as Matteo calls it $\pi_{a_i}$. Now what are the

“the times $\tau_{j}$ at which the transitions $a_{j-1} \longrightarrow a_{j}$ occur

???
That is, I can only have a distinct transition if I have a Dirac distribution (i.e. one of the $\pi_i$ is 0, the other 1). But the master equation is… as it looks to me… evolving towards the equilibrium solution $(0.5,0.5)^T$. In particular, as I said before, if I didn’t miscalculate, the explicit solution for this two-state master equation is $2 \pi_0(T)-1= (2\pi_0(0)-1) e^{-2gT}$ and $1-2\pi_1(T)=(1-2\pi_1(0))e^{-2gT}$, so if one has $\pi_0(0)=1,\pi_1(0)=0$ then $\pi_0(t)=0.5(1-e^{-2gT}),\pi_1(T)=0.5(1+e^{-2gT})$. In particular I don’t see how you could possibly end up in the other “Dirac state”. So what is meant by a transition $a_{j-1} \longrightarrow a_{j}$ at time $\tau_{j}$?

• John Baez says:

I think I see why you’re confused. Matteo is heavily using this fact: a Markov process on a finite set $S$ together with an initial probability distribution $\pi_a(0)$ on that finite set ($a \in S$) gives rise to a probability measure $p$ on the space of paths

$\omega: [0,T] \to S$

The relevant paths are constant except at finitely many times $\tau_j$; these are “the times $\tau_{j}$ at which the transitions $a_{j-1} \longrightarrow a_{j}$ occur”.

People call such a path a “history” or a “realization” of the Markov process. It’s simply a sequence of states $a_1, \dots, a_n$ together with times $0 \le \tau_1 \le \cdots \le \tau_n \le T.$ Matteo writes it as

$\displaystyle{(a_{0}\overset{\tau_{0}}{\longrightarrow} a_{1} \overset{\tau_{1}}{\longrightarrow}\cdots \overset{\tau_{N}}{\longrightarrow} a_{N}) }$

Just as quantum mechanics has a Hamiltonian description and a path-integral description, so do Markov processes—and that’s what we’re using here. Unlike the path-integral description of quantum mechanics, the path-integral description of Markov processes is often done in a mathematically rigorous way. Since it involves a measure $p$ on a space of paths, we really should know what $\sigma$-algebra on the space of paths it’s defined on. That’s why Matteo is talking about a $\sigma$-algebra… but I think we should ignore such details and get the basic idea straightened out first.

If I could find a nice intro to this approach I’d point you to it. To some extent it’s explained in Matteo’s post, but he sort of assumes one is familiar with it.

typo: the signs inside $\pi_i(T)$ have to be switched; in particular, for time $T=0$ one should get the same probability.

The relevant paths are constant except at finitely many times $\tau_j$; these are “the times $\tau_{j}$ at which the transitions $a_{j-1} \longrightarrow a_{j}$ occur”.

I see something that could remotely be interpreted as transitions at different times in the case of a Markov chain. There, if I understood correctly, you have the conditional probabilities as transition matrix elements. So there a meaningful probability distribution for a path would probably be the product over the matrix elements corresponding to that chain of states. Is that what is taken as a probability distribution for the paths?

But Matteo talks about an infinitesimal stochastic matrix, so this is a continuous-time process, and there things would get a little more messy; but then I haven’t even seen yet where you need the paths for Matteo’s averaging procedure.

By the way, I just noticed that I mistakenly added the self-adjointness property to the infinitesimal stochastic matrix, because I didn’t read the line above the definition. So $H$ could be unsymmetric, not necessarily symmetric as in my comment above. That may be important for the skewness, but it doesn’t change the questions and the arguments in my last comment; in particular, I unfortunately still don’t understand those “jumpy” transitions.

• John Baez says:

Is that what is taken as a probability distributions for the paths?

Not quite; the correct formula is at the bottom of Matteo’s post.

John wrote:

Not quite; the correct formula is at the bottom of Matteo’s post.

OK. right. I wrote:

So there a meaningful probability distribution for the path would be probably the product over the matrix elements corresponding to that chain of states.

I guess I would have liked to write:

So there a meaningful probability distribution for the path would be probably the product over the matrix elements corresponding to that chain of states times the initial probability

Thanks for pointing this line out to me. As I said, I skipped the proof and so I didn’t see it. Now I understand what Matteo means by jumping. I was also a bit misled because Jake called something a master equation where the time dependence was trivial. So I was already wondering a bit why Matteo had a time-dependent $\gamma_{ab}(t)$, but I hadn’t found the time to investigate that.

Yes, and you are right with the “not quite”. That is, apart from my forgetting the initial probability, what Matteo writes with respect to $\phi_{a_{j-1}}(\tau_{j-1},\tau_j)$ at the bottom is conceptually quite similar to what I said above; the $\gamma_{a_{j-1},a_j}$ are giving me headaches, though, since their columns don’t sum to one but to zero. Anyway, I am again late for work.

• John Baez says:

So I was already wondering a bit why Matteo had a time-dependent $\gamma_{ab}(t)$ but I hadn’t found the time to investigate that.

Allowing Markov processes with a time-dependent Hamiltonian—I’ll call $\gamma_{ab}$ the Hamiltonian—allows us to study a wider variety of interesting problems.

For example, we can start by holding $\gamma_{ab}$ constant for some interval of time, starting with an equilibrium probability distribution for this $\gamma_{ab}$, then ‘drive the system out of equilibrium’ by changing $\gamma_{ab}$ for a while, then bring $\gamma_{ab}$ back to its original value, and study how entropy changes as the system goes back toward equilibrium.

However, there are also plenty of interesting questions to ask about time-independent Hamiltonians!
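The driving protocol described above can be sketched numerically for a two-state process. Here the rate $a(t)$ for leaving state 0 and the rate $b$ for leaving state 1 give equilibrium $\pi_0 = b/(a+b)$; all rate values and switching times below are invented for illustration.

```python
import math

# Sketch of the protocol: hold the rates constant, perturb one of them for
# a while, restore it, and watch the distribution relax back to equilibrium.
def a(t):
    return 4.0 if 1.0 <= t < 2.0 else 1.0   # drive for a while, then restore

b = 1.0

def S(p0):
    """Shannon entropy of (p0, 1 - p0)."""
    return -sum(x * math.log(x) for x in (p0, 1.0 - p0) if x > 0)

p, dt = 0.5, 1e-4            # start in the equilibrium for a = b = 1
for n in range(40000):       # forward Euler on dp/dt = -a(t) p + b (1 - p), to t = 4
    t = n * dt
    p += dt * (-a(t) * p + b * (1.0 - p))

print(p, S(p))   # p was driven away from 0.5 and has relaxed back toward it
```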

Without preview it is really hard to see typos:

$\int_0^1 a \ln(\frac{a}{0.5-(a-0.5)e^{-2 g T}})da$

Aha, thanks. Wolfram now has an online integrator. Let’s trust this (I have also found mistakes in Mathematica). Maybe I got a 0.5 wrong, but if I believe my own formula, then the integrand is x*log(2*x/(1 - k*(2*x-1))). The result I got from Wolfram was (modulo typing errors, because somehow the pasting doesn’t work):

$\frac{1}{8k^2}(4 k^2 x^2 \ln(\frac{2x}{-2kx+k+1}) +(k+1)(2kx+(k+1)\ln(2kx-k-1)))$

Unfortunately the term $4 k^2 x^2 \ln(\frac{2x}{-2kx+k+1})$ gives me headaches. That is, one can’t just set $x=0$ (for the one boundary term), and at first glance it looks as if the $\ln$ gives quite a divergence.

I actually think that I asked Matteo about what happens if a probability is zero (which e.g. doesn’t look good for $\Delta I$), and if I remember correctly (unfortunately I can’t find the Azimuth forum entry on this) he said something like that this case is averaged out… so I guess one would need to investigate this problem a little closer, e.g. with l’Hôpital.
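As a numerical aside on the divergence worry: in the integral $\int_0^1 a \ln\left(\frac{a}{0.5-(a-0.5)k}\right) da$ (with $k=e^{-2gT}$) the integrand behaves like $a \ln a$ near $a=0$, which is integrable, so the integral itself is finite even though $\ln$ blows up at the boundary. A minimal midpoint-rule sketch with illustrative values of $k$:

```python
import math

# Midpoint-rule check that I(k) = \int_0^1 a ln( a / (0.5 - (a-0.5) k) ) da
# is finite for 0 < k < 1: near a = 0 the integrand behaves like a ln a.
def integrand(a, k):
    return a * math.log(a / (0.5 - (a - 0.5) * k))

def midpoint(k, n=50000):
    """Midpoint rule on [0, 1]; avoids the endpoints a = 0 and a = 1."""
    h = 1.0 / n
    return sum(integrand((i + 0.5) * h, k) * h for i in range(n))

for k in (0.1, 0.5, 0.9):
    print(k, midpoint(k))   # finite, well-behaved values
```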

• John Baez says:

In entropy calculations we define

$0 \ln 0 = 0$

which makes sense since

$\displaystyle{ \lim_{p \to 0} \; p \ln p = 0 }$

by l’Hôpital (as you suggest).

If we have a list of probabilities $p_a$ summing to 1, and one of them vanishes, then Matteo’s self-information

$i_a = -\ln p_a$

is infinite for that probability, but still the mean self-information, or entropy

$\displaystyle{ S = \sum_a p_a i_a = - \sum_a p_a \ln p_a }$

will be finite, since we set $0 \ln 0 = 0.$

Roughly speaking: infinitely improbable events are infinitely surprising, but still the expected value of our surprise is finite because $0$ is smaller than $-\ln 0$ is big.
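A one-line numerical illustration of this convention (the sample values of $p$ are arbitrary):

```python
import math

# p ln p -> 0 as p -> 0, which justifies the convention 0 ln 0 = 0
# in entropy sums.
for p in (1e-2, 1e-4, 1e-8, 1e-16):
    print(p, p * math.log(p))   # the product shrinks toward 0
```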

John wrote:

In other words: to define

$\langle \Delta I \rangle$

I think you should first take the average of $\Delta I$ over all paths that start at the state a. Then you should sum this over all states a, weighted by the probabilities $\pi_a(0)$. The result should be $\langle \Delta I \rangle.$

I think this is what I did (?) for starting state 0. You now say that one should also do an additional averaging step:

Then you should sum this over all states a, weighted by the probabilities $\pi_a(0)$. The result should be $\langle \Delta I \rangle.$

So if I understand you correctly (?), you say that $\Delta I = \ln \frac{\pi_0(0)}{\pi_1(T)}$ is only the contribution for the state 0 and that one also needs to consider the additional contribution:

$\Delta I = \ln \frac{1-\pi_0(0)}{1-\pi_1(T)}$

and then need to take the average of both. Sounds right to me. So this sounds as if one would need to add the integral (modulo typos):

$\int_0^1 (1- \pi_0(0)) \ln \frac{1-\pi_0(0)}{1-\pi_1(T)} \, d(1-\pi_0(0))$

(where $\pi_1(T)$ is now seen as a function of $1-\pi_0(0)$), and then divide both terms by 2. Is that right?

I am still not sure though whether one would get rid of the divergences.

You wrote:

$\lim_{p \to 0} \; p \ln p = 0$

Yes, and by l’Hôpital one even has

$\lim_{p \to 0} \; p^2 \ln p = 0$

which comes closer to the problematic term:
$4 k^2 x^2 \ln(\frac{2x}{-2kx+k+1})$

but I would need to check l’Hôpital for that term. Maybe you have more confidence that this works out just as above; I don’t. But if you tell me it is so, I will believe you.

8. Berényi Péter says:

With a 25 W power consumption for the human brain, Landauer’s limit gives us 8.4×10²¹ irreversible operations per second at 310 K.

If there were some evolutionary pressure to utilize reversible computing to save power, it would translate into orders of magnitude more logical operations each second.

That’s a million to a billion times more computational capacity than what’s derived from standard neuro-synaptic brain models.

Therefore either Landauer’s limit is irrelevant in biological information processing or there is a layer of extremely powerful high frequency molecular computational architecture below the neuro-synaptic level, completely hidden from observations so far.
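The quoted figure can be reproduced directly; a quick check of the arithmetic (25 W and 310 K are from the comment above):

```python
import math

# 25 W at body temperature divided by the Landauer cost kT ln 2 per
# irreversible bit operation, reproducing the ~8.4e21 ops/s figure.
k_B = 1.380649e-23   # Boltzmann's constant, J/K
P, T = 25.0, 310.0   # power and temperature from the comment
ops_per_second = P / (k_B * T * math.log(2))
print(ops_per_second)   # about 8.4e21
```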

• John Baez says:

I believe Landauer’s limit is largely irrelevant to biological information processing, because organisms don’t come close to reaching this bound. It’s nice to know that this bound exists, and as a question of physics it’s important to clarify exactly what this bound says. But for biology, we need other ideas.

• Theophanes Raptis says:

The true question, though, is: if this is so for biology, why do people in physics try to generalize Landauer as if it were an axiom of nature, and why do people dismiss others’ work using this as an argument, as Dr. Motl did in the post below…
http://motls.blogspot.gr/2012/12/exorcising-maxwells-daemons.html

• John Baez says:

Landauer’s limit is not an “axiom of nature”, but it is a theorem about quite general physical systems, given suitable assumptions. Not all theorems are important in every situation.

I know about membrane computing. It’s a good idea, but I believe we still need more ideas.

• Berényi Péter says:

Okay. In that case, what is slide 14 of David Wolpert’s presentation supposed to mean?

IMPLICATIONS FOR DESIGN OF BRAINS

• P(xt) a dynamic process outside of a brain;
• Natural selection favors brains that:
  • (generate vt’s that) predict future of x accurately; but…
  • not generate heat that needs to be dissipated;
  • not require free energy from environment (need to create all that heat)

Natural selection favors brains that:
1) Accurately predict future (quantified with a fitness function);
2) Using a prediction program with minimal thermo. cost

• John Baez says:

I think it’s fairly evident what this slide means. Clearly natural selection punishes organisms with extremely high heat dissipation (or free energy consumption), but it doesn’t seem to be pushing it down to the Landauer limit. Maybe in some far future regime where free energy is extremely scarce, this will happen. But it doesn’t seem to be happening now.