Information and Entropy in Biological Systems (Part 7)

In 1961, Rolf Landauer argued that the least possible amount of energy required to erase one bit of information stored in memory at temperature T is kT ln 2, where k is Boltzmann’s constant.
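As a quick check of the numbers, here is a minimal Python sketch of the Landauer limit kT ln 2 (the Boltzmann constant and the room-temperature value are standard choices, not taken from the post):

```python
from math import log

k_B = 1.380649e-23  # Boltzmann's constant in J/K

def landauer_limit(T):
    """Minimum energy in joules needed to erase one bit at temperature T (kelvin)."""
    return k_B * T * log(2)

# At room temperature the limit is a few zeptojoules per bit:
E = landauer_limit(300.0)
print(E)  # ~2.87e-21 J
```

Note that the bound scales linearly with T, which is why cooling the memory lowers the cost of erasure.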

This is called the Landauer limit, and it came after many decades of arguments concerning Maxwell’s demon and the relation between information and entropy.

In fact, these arguments are still not finished. For example, here’s an argument that the Landauer limit is not as solid as widely believed:

• John D. Norton, Waiting for Landauer, Studies in History and Philosophy of Modern Physics 42 (2011), 184–198.

But something like the Landauer limit almost surely holds under some conditions! And if it holds, it puts some limits on what organisms can do. That’s what David Wolpert spoke about at our workshop! You can see his slides here:

The first thing that puzzles me about this suspicion of Landauer’s is that the cost of switching increases with temperature, whereas from my (rather naïve) expectation, it should be easier to get hot systems to forget themselves than cold systems; but perhaps I should take that as a clue, because to reliably store one bit in a hotter system requires more redundancy, and therefore more underlying hardware, which it will then cost more to switch just for its being heavier.

I’m not going anywhere with this, just musing (encouraged, perhaps, by someone’s having mentioned that there wasn’t much musing right here on these recent posts).

I guess the first thing to say is that the equation “energy is Boltzmann’s constant times temperature times entropy” is dimensionally correct. Whatever alternative law one might suggest should at least be dimensionally correct.

I am not a physicist. I have gathered from Wikipedia that

Change in entropy = (change in free energy) / temperature

so it makes sense that if you want to do a lot of information processing for a given amount of energy, you should do it as cold as possible. I guess the catch as far as life is concerned is time. (Energy always goes with time in physics, doesn’t it?) You can do more energy-efficient processing at low temperature, but a competitor at a higher temperature and with more energy would think faster.

Ooh, Graham reminds me how to turn a frequency into an energy: if we do have a clock running at some frequency ν, then hν is another decent energy scale, though admittedly more complicated and contrived.

There was never any need for real “switching”. This was only because of the historical persistence of the old Turing “teletype” paradigm, which led to digital design. In his 2000 thesis, Yuzuru Sato showed how one can imitate digital computation with continuous dynamical systems like modified baker’s maps. http://www.jucs.org/jucs_6_9/nonlinear_computation_with_switching
Recently, La Roy managed to set up a classical analog computer imitating quantum computations via electrical signals. http://phys.org/news/2015-05-quantum-emulated-classical.html
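For readers who haven’t seen how a baker’s map can imitate digital computation: on binary expansions the map acts as a reversible shift, so a bit migrates from one coordinate to the other instead of being erased. A toy Python illustration (my own sketch, not Sato’s actual construction):

```python
def baker(x, y):
    """One step of the baker's map on the unit square.
    On binary expansions x = 0.b1 b2 b3..., y = 0.a1 a2..., it acts as a shift:
    the leading bit of x moves to the front of y, reversibly."""
    if x < 0.5:
        return 2 * x, y / 2
    else:
        return 2 * x - 1, (y + 1) / 2

def bits(v, n):
    """First n binary digits of v in [0, 1)."""
    out = []
    for _ in range(n):
        v *= 2
        b = int(v)
        out.append(b)
        v -= b
    return out

x, y = 0.8125, 0.25          # x = 0.1101..., y = 0.010... in binary
x2, y2 = baker(x, y)
print(bits(x, 4), bits(y, 3))    # [1, 1, 0, 1] [0, 1, 0]
print(bits(x2, 3), bits(y2, 3))  # [1, 0, 1] [1, 0, 1]: x's leading bit moved to y
```

No bit is destroyed at any step, which is the point of contrasting this with switching-based logic.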

I’m not quite sure what you’re driving at, but perhaps there’s an echo in: after browsing through Norton’s paper, I’m puzzled as to why the “memory” cells are so frequently imagined as pistons at pressure 1-or-2 ; another just-as-plausible memory cell could be realized by pistons at pressure 0.8-or-1.2. Maybe (following Graham’s hint) we should expect such a machine to run slower, BUT…

There are simpler means to perform “digital” computations without the stringent requirement to erase information and thus increase entropy. One can try like Sato to “embed” digital into analog. I tried something similar in the Fourier domain but it’s a hard project to figure out all the engineering details http://cag.dat.demokritos.gr/publications/AnalogProcessor.ppt

Theophanes Raptis this is an interesting website you are on:

EU Funded Innovation in H2020
26–27 June 2015 – Building partnerships for research & business innovation.

The Hellenic Forum is under the auspices of the Hellenic Ministry of Foreign Affairs and aims to act as a meeting point for Science, Technology and Innovation building bridges with scientists abroad, in order to enhance scientific and technological transfer collaboration for innovation. Your technology, Your idea, Your invention is accelerated from Lab to Market at the Innovation Exhibition of NCSR Demokritos.

Well of course you can, since you can modify the usual digital logic gates to be entirely reversible with extra output bits. Is there a reason why one might want to use an analog system model to accomplish this?

In particular I would expect that an analog model that does something interestingly different from a discrete reversible one is going to sneakily rely on the infinitely many bits available in any precisely defined real number, and so the ‘no loss of information’ will fail when it’s truncated to a finite number of decimal places or when (thermodynamically inevitable!) real world errors and inaccuracies are otherwise taken into account.
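For concreteness, the standard example of such a gate is the Toffoli (controlled-controlled-NOT) gate, whose extra output bits carry the information an ordinary AND gate would destroy. A quick Python sketch:

```python
def toffoli(a, b, c):
    """Flip c exactly when a and b are both 1.
    With c = 0 the third output is AND(a, b); keeping the outputs
    a, b around is what makes the gate reversible."""
    return a, b, c ^ (a & b)

inputs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
outputs = [toffoli(*t) for t in inputs]

# A bijection on 3 bits, and its own inverse:
assert len(set(outputs)) == 8
assert all(toffoli(*toffoli(*t)) == t for t in inputs)

print(toffoli(1, 1, 0))  # (1, 1, 1): the third bit is 1 AND 1
```

So it is erasure of the extra bits, not the computation itself, that carries Landauer’s cost.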

It is not the same as using an A/D–D/A quantizer where noise eats bits. The key is in the encoding of words, but not by adding bits à la Bennett–Toffoli–Margolus etc. Instead one can utilize frequencies to encode words that do not need to be read and written the ordinary way. If we can do that with practically constant spectral density then we get what we want. It all ends up with something like a permutation machine on a synthi.

Smerlak starts with the Detailed Fluctuation Theorem, uses that to prove Clausius’ inequality, and notes that this is a generalized version of the Landauer limit.

Of course, the hard part in these physics-related theorems is clearly stating the assumptions and the result! There could be other, very different, ways of making Landauer’s limit into a theorem (or non-theorem).

This is fine, for it does not exclude almost-isentropic computations. From a practical engineering perspective, if a computing machine outputs heat at a glacially slow rate, that would be more than enough for any practical purpose, fans aside, not to mention data centers immersed in liquid nitrogen or the like. The true limit of course will come after the final collapse of Moore’s law in one or two decades.

“Landauer argued that the process of erasing the information stored in a memory device incurs an energy cost in the form of a minimum amount of mechanical work. We find, however, that this energy cost can be reduced to zero by paying a cost in angular momentum or any other conserved quantity. Erasing the memory of Maxwell’s demon in this way implies that work can be extracted from a single thermal reservoir at a cost of angular momentum and an increase in total entropy. The implications of this for the second law of thermodynamics are assessed.”

For some reason Nadja Kutz was unable to post this comment, so I will:

John wrote:

I forgot to mention that a mathematical proof of a very general form of Landauer’s bound is given here:

I have trouble understanding this detailed fluctuation theorem, which is partly because I don’t have the time to delve into it. So I thought it might be good to understand the Landauer limit in terms of the master equation. For a two-state system with

where , so Matteo’s is for this case, and thus it should hold that (“Clausius inequality”). I tried to check this. That is, due to the form of H one has, as it seems, that

and since

one can thus compute the time evolution explicitly. And thus one can compute explicitly which is according to Matteo and to my interpretation of it in terms of the master equation (??):

As explained before can be explicitly expressed as a function of and so one can integrate

which should be

(if I understood Matteo correctly). I started to integrate this (using Bronstein…) but didn’t finish. It’s a bit messy and I don’t have a computer algebra program, and in particular I got the impression that it is not so easy to see that the Clausius inequality holds. In any case I think it would be interesting to calculate this explicitly, because one could see whether, and if so in which cases, the Clausius equality holds (i.e. reversibility). Can someone do this? – I am already late for work.
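One concrete way to carry out a check like this numerically: assume the symmetric two-state master equation dp0/dt = p1 − p0 with unit rates (an assumption on my part; the H above may differ). Its explicit solution lets one verify that the Shannon entropy never decreases, which is the Clausius-type inequality at issue:

```python
from math import exp, log

def p0(t, c):
    """Explicit solution of dp0/dt = p1 - p0 with p1 = 1 - p0
    (symmetric unit rates, an assumed special case), p0(0) = c."""
    return 0.5 + (c - 0.5) * exp(-2 * t)

def shannon(p):
    """Shannon entropy of the distribution (p, 1 - p), with 0 log 0 = 0."""
    return sum(-q * log(q) for q in (p, 1 - p) if q > 0)

# For every initial condition, the entropy grows monotonically toward log 2:
for c in (0.01, 0.2, 0.5, 0.9):
    S = [shannon(p0(t, c)) for t in (0.0, 0.5, 1.0, 5.0)]
    assert all(S[i] <= S[i + 1] + 1e-12 for i in range(len(S) - 1))

print(shannon(p0(10.0, 0.01)))  # ~ log 2 ~ 0.6931
```

For this symmetric choice the equilibrium is uniform, so Shannon entropy and relative entropy to equilibrium tell the same story; for an asymmetric H one would track the relative entropy instead.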

Thanks John for posting. This is, I think, the second time that the reply button didn’t work; I could have restarted my browser, but as said I was already late. You wrote me in an email that you might persuade your student Blake Pollard to finish the computation. It would be great if he could check the above reasoning. But I think finishing the computation is more of an undergrad task. Finally, all I ask for is to solve the following integral (hope there is no typing or calc error):

You can plug that into a computer algebra program (plotting the result for gT would give you information about the inequality), but it could also be done by hand; if I look at the integral, at first glance it seems one can use substitution and then integral 471 in Bronstein–Semendjajew (24th edition, Teubner Leipzig, Nauka Moscow 1979).

If nobody does it and I find the time then I’ll do it myself. I don’t really want to bother Blake with that tedious work.

Some of my undergrad students in Amherst once came up to me and asked me to solve an integral on the spot (Hey let’s see whether you can solve this…). I thought that by solving it on the spot I could eventually significantly reduce the amount of farting noises, paper planes and comments about my German accent around lessons, so I took the risk. I was lucky and was able to solve the integral. That helped though only a bit.

I think you have to be a little careful about what you mean by . In Matteo’s post he defines this quantity for a trajectory starting at state and ending at state at time after transitions. So if the system starts in state , then after some time it can end up in either state or state and similarly for trajectories initially in state . So we really have something like 4 classes of trajectories to consider. You wrote

which is the change in self-information along a trajectory starting in state and ending up in state at time .

Matteo says that should be the change in self-information averaged over all realizations of the process. I’m not sure how to implement that averaging. It seems like it should be the probability of each particular realization of a path times the change in self information for that path, summed over all possible paths. Later Matteo writes that

where

is the Shannon entropy and is the skewness, which should give us Clausius’ inequality for the two-state example you mentioned.

If we use that formula then we get that

For the example you proposed this gives

No integrating over time required. I’m still understanding this stuff myself, but I think that in general it is a quantity that, for a fixed duration of your path, depends only on the endpoints of your path, so finding its average value corresponds to averaging over paths, not integrating over time. Here I plotted it in Wolfram Alpha for and , and it looks positive.

Blake, it is great that you are looking into this. As I said, I might be wrong; after all, I am trying to make sense of what I read here and don’t have much time. But I think it is important to look at this example if one would like to get a glimpse of what is going on with the fluctuation theorem.

I haven’t understood yet why .

So I first tried to understand the inequality you wrote above which seems to be a direct consequence of the fluctuation theorem, namely:

For this special case of two states one always has – if I got the matrix right. So for this case it should hold that: .

so finding its average value corresponds to averaging over paths, not integrating over time

What I meant by computing the time evolution explicitly is that you can express each time evolution of the probabilities as a rather simple function of their initial condition . And since (if I understood this master equation right) one has only one initial variable. The time evolution is (if I understood you correctly) what you call “a path”, so one path is uniquely determined in this example by , which I abbreviated as , and so the integral I am talking about here should be the averaging over all paths, i.e. the time is treated in this calculation as a parameter (see below).

Since at first glance it looks as if one gets divergences for , one eventually needs to first check whether this is true (say by plotting the integral not at 0 but for a small value, see comment below) or whether there is a mistake/misunderstanding somewhere.

I don’t think that is a path. It describes the evolution of the probability of finding the system in the state at time . A path would be more like a sequence of states along with the times at which transitions between states occurred. So I think the quantity you are calculating is not . I think there is a way to understand the equality, and that is the crux of the issue, because appears to be non-negative for all choices of initial distribution.

Matteo says that should be the change in self-information averaged over all realizations of the process. I’m not sure how to implement that averaging. It seems like it should be the probability of each particular realization of a path times the change in self information for that path, summed over all possible paths.

should be the change in self-information averaged over all paths. Our Markov process defines a probability measure on the set of paths that start at a given state. We are also given a probability distribution on the set of states. Putting these together we get a probability measure on the set of all paths. This is what we use to define the average over all paths, which is what Matteo is denoting here.

In other words: to define it, I think you should first take the average over all paths that start at the state a. Then you should sum this over all states a, weighted by the probabilities . The result should be

You should be able to show that this equals

The key step should be this: can be obtained by taking the probability that a path goes from to , multiplying it by , and summing over all .

This says: the probability that you wind up in jail at the end of your life is the probability that you get there if you’re born in any particular state (like Florida), times the probability you were born in that state, summed over states.

This is a well-known fact, but we may need to look at the stuff Matteo wrote at the end of his post, and do some calculations, to prove that it’s true.
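The averaging recipe can also be sketched by brute force: simulate many realizations of a two-state Markov process (unit jump rates, an assumed toy version of the example discussed above), compute the change in self-information along each path, and compare the path average with the entropy change S(T) − S(0) of the marginals:

```python
import random
from math import exp, log

random.seed(1)

def p(t, c):
    """Marginal probability of state 0 at time t for the symmetric
    two-state process with unit rates (an assumption), given p(0) = c."""
    return 0.5 + (c - 0.5) * exp(-2 * t)

def sample_endpoint(a, T):
    """Gillespie-style simulation: starting in state a, jump at
    exponential times with rate 1 until time T; return the final state."""
    t, s = 0.0, a
    while True:
        t += random.expovariate(1.0)
        if t > T:
            return s
        s = 1 - s

def shannon(q):
    return -q * log(q) - (1 - q) * log(1 - q)

c, T, N = 0.9, 1.0, 200_000
total = 0.0
for _ in range(N):
    a = 0 if random.random() < c else 1        # draw the initial state from p(0)
    b = sample_endpoint(a, T)
    pa0 = c if a == 0 else 1 - c               # probability of a at time 0
    pbT = p(T, c) if b == 0 else 1 - p(T, c)   # probability of b at time T
    total += log(1 / pbT) - log(1 / pa0)       # change in self-information

print(total / N)                      # path average of the self-information change
print(shannon(p(T, c)) - shannon(c))  # S(T) - S(0): the two should agree
```

The agreement is exactly the law of total probability at work: averaging −log p_b(T) over paths reproduces the Shannon entropy of the time-T marginal.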

I guess there is quite some guessing involved with what is meant with averaging.

In particular I was indeed also wondering whether Matteo may have meant something like that kind of average:

i.e. some “weighted mean”, which at first glance would contain this term, apart from a factor 1/2 and maybe a sign; but I don’t see why the other terms should cancel, nor why this should be an “average over all realizations of the process.”

I guess there is quite some guessing involved with what is meant with averaging.

One should probably look into the original literature in order to find out exactly how the average over the cumulative skewness is defined.

I will probably have problems getting the literature anyway, and for this reason alone I have so far just tried to get an understanding of this skewness thing. For instance, it could be interesting to see how this definition is related to other definitions of skewness; Wikipedia lists quite a few.

For instance one could compute the skewnesses of the initial and final distributions, look at their difference, and take the average over that. If I didn’t miscalculate, the skewness for this two-state case (x_0=0, x_1=1) seems to be just . The cumulative skewness, however, looks more like a mixture of the skewnesses of the initial and final states. According to Wikipedia there exists a so-called distance skewness, and instead of X′ being identically distributed one could take the time-evolved one

compute this “mixed skewness” and the average over all initial distributions of this. But maybe this is a bit too wild.

you can see Matteo Smerlak’s definition of the skewness and the mean skewness

There must be lots of different things that people call ‘skewness’, but right now I’m just interested in what Matteo is talking about. Blake and I went through his arguments in the last few days and they all make sense to me. It took a while.

where are points in a fixed finite set and are times with

The term “realization of a process” is jargon from stochastic process theory. We have a stochastic process, in particular a Markov process, which gives a probability measure on the set of “realizations”.

Near the top of his article he calls the realization a “history”:

each possible history of this process can be characterized by the sequence of occupied states and by the times at which the transitions occur :

If you look near the bottom of his article, where he proves the main results, you’ll see he’s treating his Markov process as a triple. Here is the space of “histories” or “realizations”, is a probability distribution on this space, and is a σ-algebra on this space (which you can safely ignore for now).

The cumulative skewness is a function on this space. The average is defined by

If you look near the bottom of his article, where he proves the main results,

I didn’t look at the bottom of his article, because before I try to understand the proof I would like to understand what the proof is about. If I haven’t totally misunderstood the master equation, then for the master equation one uses a discrete probability distribution, so at least for the moment I prefer to stay away from σ-algebras.

The term “realization of a process” is jargon from stochastic process theory. We have a stochastic process, in particular a Markov process, which gives a probability measure on the set of “realizations”.

Near the top of his article he calls the realization a “history”

OK. So for stochastic process people a history is the same as a realization of a process (quite different from real life though…).

If I have understood the master equation and Matteo right, then the probability of the state is given by the i-th entry of , i.e. , or as it was called above, π_i, or as Matteo calls it, . Now what are the

“the times at which the transitions occur

???
That is, I can only have a distinct transition if I have a Dirac distribution (i.e. one of the probabilities is 0, the other 1). But the master equation is, as it looks to me, evolving towards the equilibrium solution . In particular, as I said before, if I didn’t miscalculate, the explicit solution for this two-state master equation is: and , so if one has then .
In particular I don’t see how you could possibly end up in the other “Dirac state” .
So what is meant by a transition from state at time ???

I think I see why you’re confused. Matteo is heavily using this fact: a Markov process on a finite set, together with an initial probability distribution on that finite set, gives rise to a probability measure on the space of paths

The relevant paths are constant except at finitely many times ; these are “the times at which the transitions occur”.

People call such a path a “history” or a “realization” of the Markov process. It’s simply a sequence of states together with times . Matteo writes it as

Just as quantum mechanics has a Hamiltonian description and a path-integral description, so do Markov processes—and that’s what we’re using here. Unlike the path-integral description of quantum mechanics, the path-integral description of Markov processes is often done in a mathematically rigorous way. Since it involves a measure on a space of paths, we really should know what σ-algebra on the space of paths it’s defined on. That’s why Matteo is talking about a σ-algebra… but I think we should ignore such details and get the basic idea straightened out first.

If I could find a nice intro to this approach I’d point you to it. To some extent it’s explained in Matteo’s post, but he sort of assumes one is familiar with it.

The relevant paths are constant except at finitely many times ; these are “the times at which the transitions occur”.

I see something that could remotely be interpreted as transitions at different times in the case of a Markov chain. There you have, if I understood correctly, the conditional probabilities as transition matrix elements. So there a meaningful probability distribution for the path would be probably the product over the matrix elements corresponding to that chain of states. Is that what is taken as the probability distribution for the paths?

But Matteo talks about an infinitesimal stochastic matrix, so this is a continuous-time process, and there things would get a little messier; but then I haven’t even seen yet where you need the paths for Matteo’s averaging procedure.

By the way, I just noticed that I mistakenly added the self-adjointness property to the infinitesimal stochastic matrix property, because I didn’t read the line above the definition. So H could be unsymmetric, rather than necessarily symmetric as in my above comment. That may be important for the skewness, but it doesn’t change the questions and arguments of my last comment; in particular I unfortunately still don’t understand those “jumpy” transitions.

Not quite; the correct formula is at the bottom of Matteo’s post.

OK. right. I wrote:

So there a meaningful probability distribution for the path would be probably the product over the matrix elements corresponding to that chain of states.

I guess I would have liked to write:

So there a meaningful probability distribution for the path would be probably the product over the matrix elements corresponding to that chain of states, times the initial probability.

Thanks for pointing this line out to me. As I said, I skipped the proof and so I didn’t see it. Now I understand what Matteo means by jumping. I was also a bit misled because Jake called something a master equation where the time dependence was trivial. So I was already wondering a bit why Matteo had a time-dependent but I hadn’t found the time to investigate that.

Yes, and you are right with the “not quite”. That is, apart from my forgetting the initial probability, what Matteo writes with respect to at the bottom is conceptually quite similar to what I said above; the are giving me headaches though, since there the columns don’t sum to one but to zero. Anyway, I am again late for work.

So I was already wondering a bit why Matteo had a time-dependent but I hadn’t found the time to investigate that.

Allowing Markov processes with a time-dependent Hamiltonian (I’ll call H the Hamiltonian) lets us study a wider variety of interesting problems.

For example, we can start by holding H constant for some interval of time, starting with an equilibrium probability distribution for this H; then ‘drive the system out of equilibrium’ by changing H for a while; then bring H back to its original value, and study how entropy changes as the system goes back toward equilibrium.

However, there are also plenty of interesting questions to ask about time-independent Hamiltonians!

Aha, thanks. Wolfram now has an online integrator. Let’s trust this (I have also found mistakes in Mathematica). Maybe I got a 0.5 wrong, but if I believe my own formula then under the integral it should read:
x log(2x/(1 – k(2x-1))). The result I got from Wolfram was (modulo typing errors, because somehow the pasting doesn’t work):

Unfortunately the gives me headaches. That is, one can’t just set (for the one boundary term), and at first glance it looks as if the gives quite some divergence.

I actually think that I asked Matteo about what happens if a probability is zero (which e.g. doesn’t look good for ), and if I remember correctly (unfortunately I can’t find the Azimuth forum entry for this) he said something like that this case is averaged out… so I guess one would need to investigate this problem a little closer, like with l’Hôpital.
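One cheap way to probe the divergence worry is numerical. Taking the integrand exactly as transcribed above, x log(2x/(1 − k(2x − 1))), it vanishes like x log x as x → 0, so the lower endpoint itself is harmless; and at k = 0 the integral has the closed form log(2)/2 − 1/4 to check against. A sketch using a homemade Simpson rule (the transcription of the integrand, not its correctness, is the assumption here):

```python
from math import log

def integrand(x, k):
    """x * log(2x / (1 - k(2x - 1))), the integrand as transcribed."""
    return x * log(2 * x / (1 - k * (2 * x - 1)))

def simpson(f, a, b, n=10_000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h) for i in range(1, n))
    return s * h / 3

# Near x = 0 the integrand behaves like x log x -> 0: no divergence there.
print(integrand(1e-9, 0.5))  # tiny

# Sanity check at k = 0 against the closed form log(2)/2 - 1/4:
approx = simpson(lambda x: integrand(x, 0.0), 1e-9, 1.0)
print(approx, log(2) / 2 - 0.25)
```

Any apparent divergence would therefore come from evaluating an antiderivative at the boundary, not from the integral itself.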

If we have a list of probabilities summing to 1, and one of them vanishes, then Matteo’s self-information

is infinite for that probability, but still the mean self-information, or entropy

will be finite, since we set 0 · log(1/0) = 0.

Roughly speaking: infinitely improbable events are infinitely surprising, but still the expected value of our surprise is finite, because p becomes small faster than log(1/p) becomes big.
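In code, the point is that the self-information log(1/p) diverges as p → 0 while its weighted contribution p · log(1/p) to the entropy vanishes:

```python
from math import log

def surprise_weight(p):
    """Contribution p * log(1/p) of an outcome with probability p to the
    entropy, with the convention that it is 0 when p = 0."""
    return 0.0 if p == 0 else p * log(1 / p)

# log(1/p) blows up, but p * log(1/p) goes to zero:
for p in (1e-2, 1e-6, 1e-12):
    print(p, log(1 / p), surprise_weight(p))

# So a distribution with a zero-probability outcome still has finite entropy:
entropy = sum(surprise_weight(p) for p in (0.0, 0.5, 0.5))
print(entropy)  # log 2
```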

I think you should first take the average of over all paths that start at the state a. Then you should sum this over all states a, weighted by the probabilities . The result should be

I think this is what I did (?) for starting state 0. You say now that one should also do an additional averaging step:

Then you should sum this over all states a, weighted by the probabilities . The result should be

So if I understand you correctly (?), you say that is only the cumulative skewness for the state 0 and that one also needs to consider the additional skewness:

and then one needs to take the average of both. Sounds right to me. So this sounds as if one would need to add the integral (modulo typos):

(where is now seen as a function of ) and divide both terms by 2. Is that right?

I am still not sure though whether one would get rid of the divergences.

You wrote:

Yes, and by l’Hôpital one even has

which comes closer to the problematic term:

but I would need to check l’Hôpital for that term. Maybe you have more confidence that this works out just as above; I don’t. But if you tell me it is so, I will believe you.

With a 25 W power consumption for the human brain, Landauer’s limit gives us 8.4×10²¹ irreversible operations per second at 310 K.
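The arithmetic behind this figure (standard constants; the 25 W and 310 K are the comment’s inputs):

```python
from math import log

k_B = 1.380649e-23   # Boltzmann's constant, J/K
T = 310.0            # body temperature, K
P = 25.0             # brain power budget, W

# Landauer bound on irreversible bit erasures per second at power P:
ops_per_second = P / (k_B * T * log(2))
print(ops_per_second)  # ~8.4e21
```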

If there were some evolutionary pressure to utilize reversible computing to save power, that would translate into orders of magnitude more logical operations each second.

That’s a million to a billion times more computational capacity than what’s derived from standard neuro-synaptic brain models.

Therefore either Landauer’s limit is irrelevant to biological information processing, or there is a layer of extremely powerful high-frequency molecular computational architecture below the neuro-synaptic level, completely hidden from observation so far.

I believe Landauer’s limit is largely irrelevant to biological information processing, because organisms don’t come close to reaching this bound. It’s nice to know that this bound exists, and as a question of physics it’s important to clarify exactly what this bound says. But for biology, we need other ideas.

The real question, though, is: if this is so for biology, why do people in physics try to generalize Landauer as if it were an axiom of nature, and why do people dismiss others’ work using it as an argument, as Dr Motl did in the post below… http://motls.blogspot.gr/2012/12/exorcising-maxwells-daemons.html

Landauer’s limit is not an “axiom of nature”, but it is a theorem about quite general physical systems, given suitable assumptions. Not all theorems are important in every situation.

I know about membrane computing. It’s a good idea, but I believe we still need more ideas.

Okay. In that case, what is slide 14 of David Wolpert’s presentation supposed to mean?

IMPLICATIONS FOR DESIGN OF BRAINS

• P(xt): a dynamic process outside of a brain;
• Natural selection favors brains that:
  • (generate vt’s that) predict the future of x accurately; but…
  • do not generate heat that needs to be dissipated;
  • do not require free energy from the environment (needed to create all that heat).

Natural selection favors brains that:
1) accurately predict the future (quantified with a fitness function);
2) use a prediction program with minimal thermodynamic cost.

I think it’s fairly evident what this slide means. Clearly natural selection punishes organisms with extremely high heat dissipation (or free energy consumption), but it doesn’t seem to be pushing dissipation down to the Landauer limit. Maybe in some far-future regime where free energy is extremely scarce, this will happen. But it doesn’t seem to be happening now.


The first thing that puzzles me about this suspicion of Landauer’s is that the cost of switching increases with temperature, whereas from my (rather naïve) expectation, it should be easier to get hot systems to forget themselves than cold systems; but perhaps I should take that as a clue, because to reliably store one bit in a hotter system requires more redundancy, and therefore more underlying hardware, which it will then cost more to switch just for its being heavier.

I’m not going anywhere with this, just musing (encouraged, perhaps, by someone’s having mentioned that there wasn’t much musing right here on these recent posts).

Musing is good!

I guess the first thing to say is that the equation “energy is Boltzmann’s constant times temperature times entropy” is dimensionally correct. Whatever alternative law one might suggest should at least be dimensionally correct.

I am not a physicist. I have gathered from Wikipedia that

Change in entropy = (change in free energy) / temperature

so it makes sense that if you want to do a lot of information processing for a given amount of energy, you should do it as cold as possible. I guess the catch as far as life is concerned is time. (Energy always goes with time in physics doesn’t it?) You can do more energy-efficient processing at low temperature, but a competitor at higher temperature and with more energy would think faster.

ooh, Graham reminds me how to turn a frequency into an energy: if we do have a clock running at some frequency, is another decent energy scale, though admittedly more complicated and contrived.

There was never any need for real “switching”. This was only because of the historical persistence of the old Turing “teletype” paradigm which led to digital design. In his 2000 thesis, Yuzuru Sato showed how one can immitate digital computation with continuous dynamical systems like modified Baker’s maps.

http://www.jucs.org/jucs_6_9/nonlinear_computation_with_switching

Recently, La Roy managed to set up a classical analog computer imitating quantum computations via electrical signals.

http://phys.org/news/2015-05-quantum-emulated-classical.html

I’m not quite sure what you’re driving at, but perhaps there’s an echo in: after browsing through Norton’s paper, I’m puzzled as to why the “memory” cells are so frequently imagined as pistons at pressure 1-or-2 ; another just-as-plausible memory cell could be realized by pistons at pressure 0.8-or-1.2. Maybe (following Graham’s hint) we should expect such a machine to run slower, BUT…

There are simpler means to perform “digital” computations without the stringent requirement to erase information and thus increase entropy. One can try like Sato to “embed” digital into analog. I tried something similar in the Fourier domain but it’s a hard project to figure out all the engineering details

http://cag.dat.demokritos.gr/publications/AnalogProcessor.ppt

Theophanes Raptis this is an interesting website you are on:

EU Funded Innovation in H2020

26-27 Ιουνίου 2015 – Building partnerships for research & business innovation.

The Hellenic Forum is under the auspices of the Hellenic Ministry of Foreign Affairs and aims to act as a meeting point for Science, Technology and Innovation building bridges with scientists abroad, in order to enhance scientific and technological transfer collaboration for innovation. Your technology, Your idea, Your invention is accelerated from Lab to Market at the Innovation Exhibition of NCSR Demokritos.

Well of course you can, since you can modify the usual digital logic gates to be entirely reversible with extra output bits. Is there a reason why one might want to use an analog system model to accomplish this?

In particular I would expect that an analog model that does something interestingly different from a discrete reversible one is going to sneakily rely on the infinitely many bits available in any precisely defined real number, and so the ‘no loss of information’ will fail when it’s truncated to a finite number of decimal places or when (thermodynamically inevitable!) real world errors and inaccuracies are otherwise taken into account.
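The “extra output bits” construction mentioned above is classically realized by the Toffoli gate: three bits in, three bits out, its own inverse, and universal for Boolean logic since it computes NAND when the third input is held at 1. A minimal sketch:

```python
# Toffoli (controlled-controlled-NOT) gate: reversible, self-inverse,
# and universal -- fixing c = 1 turns the third output into NAND(a, b).
def toffoli(a, b, c):
    return a, b, c ^ (a & b)

# Reversibility: applying the gate twice is the identity on all inputs.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert toffoli(*toffoli(a, b, c)) == (a, b, c)

# NAND from Toffoli: set c = 1 and read the third output bit.
print([toffoli(a, b, 1)[2] for a in (0, 1) for b in (0, 1)])  # [1, 1, 1, 0]
```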

It is not the same as using an A/D – D/A quantizer, where noise eats bits. The key is in the encodings of words, but not by adding bits à la Bennett-Toffoli-Margolus etc. Instead one can utilize frequencies to encode words that do not need to be read and written the ordinary way. If we can do that with practically constant spectral density, then we get what we want. It all ends up with something like a permutation machine on a synthi.

I get a “page not found” error when trying to view the slides.

I forgot to change a relative link to an absolute one when copying it… the bane of a despatialized existence. David Wolpert’s talk is actually here.

I forgot to mention that a mathematical proof of a very general form of Landauer’s bound is given here:

• Matteo Smerlak, The mathematical origin of irreversibility, 8 October 2012.

He starts with the Detailed Fluctuation Theorem, uses that to prove Clausius’ inequality, and notes that this is a generalized version of the Landauer limit.

Of course, the hard part in these physics-related theorems is clearly stating the assumptions and the result! There could be other, very different, ways of making Landauer’s limit into a theorem (or non-theorem).
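As a numerical (not rigorous) illustration of this kind of irreversibility statement, here is a sketch checking that, for a made-up two-state master equation, the relative entropy to the stationary distribution never increases along the flow. The rates r01, r10 are invented for the demo and are not from Smerlak’s paper:

```python
import math

# Illustration (not Smerlak's proof): for a master equation with
# stationary distribution pi, the relative entropy D(p(t) || pi) is
# non-increasing in time.  Demo rates: r01 = 2, r10 = 1, so pi = (1/3, 2/3).
r01, r10 = 2.0, 1.0
pi = (r10 / (r01 + r10), r01 / (r01 + r10))

def step(p, dt=1e-3):
    """One Euler step of the two-state master equation."""
    p0, p1 = p
    return (p0 + dt * (r10 * p1 - r01 * p0),
            p1 + dt * (r01 * p0 - r10 * p1))

def D(p):
    """Relative entropy D(p || pi)."""
    return sum(x * math.log(x / q) for x, q in zip(p, pi) if x > 0)

p, last = (0.99, 0.01), float("inf")
for _ in range(5000):
    d = D(p)
    assert d <= last + 1e-12   # monotone decrease along the flow
    last, p = d, step(p)
print(f"final D = {D(p):.2e}")  # essentially zero: relaxed to equilibrium
```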

This is fine, for it does not exclude almost-isentropic computations. From a practical engineering perspective, if a computing machine outputs heat at a glacially slow rate, that would be more than enough for any practical purpose (fans aside), not to mention data centers immersed in liquid nitrogen or the like. The true limit, of course, will come after the final collapse of Moore’s law in one or two decades.

Any conserved quantity will do, not just energy.

“Landauer argued that the process of erasing the information stored in a memory device incurs an energy cost in the form of a minimum amount of mechanical work. We find, however, that this energy cost can be reduced to zero by paying a cost in angular momentum or any other conserved quantity. Erasing the memory of Maxwell’s demon in this way implies that work can be extracted from a single thermal reservoir at a cost of angular momentum and an increase in total entropy. The implications of this for the second law of thermodynamics are assessed.”

https://arxiv.org/abs/1004.5330

For some reason Nadja Kutz was unable to post this comment, so I will:

Thanks John for posting. This is, I think, the second time that the reply button didn’t work; I could have restarted my browser, but as I said I was already late. You wrote me in an email that you might persuade your student Blake Pollard to finish the computation. It would be great if he could check the above reasoning, but I think finishing the computation is more of an undergrad task. In the end all I ask for is to solve the following integral (I hope there is no typing or calculation error):

You can plug that into a computer algebra program (and plotting the result for gT would give you information about the inequality), but it could also be done by hand; at first glance it seems one can use a substitution and then integral 471 in Bronstein–Semendjajew (24th edition, Teubner Leipzig / Nauka Moscow, 1979).

If nobody does it and I find the time then I’ll do it myself. I don’t really want to bother Blake with that tedious work.

Some of my undergrad students in Amherst once came up to me and asked me to solve an integral on the spot (Hey let’s see whether you can solve this…). I thought that by solving it on the spot I could eventually significantly reduce the amount of farting noises, paper planes and comments about my German accent around lessons, so I took the risk. I was lucky and was able to solve the integral. That helped though only a bit.

Hi Nadja,

I think you have to be a little careful about what you mean by . In Matteo’s post he defines this quantity for a trajectory starting at state and ending at state at time after transitions. So if the system starts in state , then after some time it can end up in either state or state and similarly for trajectories initially in state . So we really have something like 4 classes of trajectories to consider. You wrote

which is the change in self-information along a trajectory starting in state and ending up in state at time .

Matteo says that should be the change in self-information averaged over all realizations of the process. I’m not sure how to implement that averaging. It seems like it should be the probability of each particular realization of a path times the change in self information for that path, summed over all possible paths. Later Matteo writes that

where

is the Shannon entropy and is the skewness, which should give us Clausius’ inequality for the two-state example you mentioned.

If we use that formula then we get that

For the example you proposed this gives

No integrating over time required. I’m still understanding this stuff myself, but I think that in general is a quantity that, for a fixed duration of your path, depends only on the end points of your path, so finding its average value corresponds to averaging over paths, not integrating over time. Here I plotted in Wolfram Alpha for and and it looks positive.

Mighty Maple gives “some” answer which produces a “Division by 0” error at any attempt to simplify it, with the exponential back or not. The steps were:

restart;
q := int(x*log(2*x/(1 - k*(2*x-1))), x);
p := subs(x=1, q) - subs(x=0, q);
simplify(p);

Error, (in ln) numeric exception: division by zero.
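For what it’s worth, here is a numerical cross-check, taking the integrand literally as x·log(2x/(1 − k(2x−1))) on [0, 1] (my reading of the Maple input above); the x → 0 endpoint is only an apparent singularity, since x·log(2x) → 0:

```python
import math

# Numerical cross-check of the integral attempted in Maple above, with
# the integrand read as x * log(2x / (1 - k*(2x - 1))) on [0, 1].
def integrand(x, k):
    if x == 0.0:
        return 0.0  # x*log(2x) -> 0 as x -> 0, so define the endpoint as 0
    return x * math.log(2.0 * x / (1.0 - k * (2.0 * x - 1.0)))

def integrate(k, n=100_000):
    """Midpoint rule on [0, 1]; crude, but enough to see the value and sign."""
    h = 1.0 / n
    return h * sum(integrand((i + 0.5) * h, k) for i in range(n))

print(integrate(0.5))  # ~0.21: finite and positive for this k
```

So at least for k = 1/2 the “division by zero” looks like an artifact of the symbolic boundary substitution, not an actual divergence of the integral.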

Blake, it is great that you are looking into this. As I said, I might be wrong; after all, I am trying to make sense of what I read here and I don’t have much time. But I think it is important to look at this example if one would like to get a glimpse of what is going on with the fluctuation theorem.

I haven’t understood yet why .

So I first tried to understand the inequality you wrote above which seems to be a direct consequence of the fluctuation theorem, namely:

For this special case of two states one always has (if I got the matrix right). So for this case it should hold that: .

What I meant by computing the time evolution explicitly is that you can express each time evolution of the probabilities as a rather simple function of their initial condition . And since (if I understood this master equation right) one has only one initial variable. The time evolution is (if I understood you correctly) what you call “a path”, so one path is uniquely determined in this example by , which I abbreviated as , and so the integral I am talking about here should be the averaging over all paths, i.e. the time is treated in this calculation as a parameter (see below).

Since at first glance it looks as if one gets divergences for , one eventually needs to first check whether this is true (e.g. by plotting the integral not at 0 but for a small value, see comment below), or whether there is a mistake/misunderstanding somewhere.

Blake, I guess you see what are typos and which not, but just to make sure, it should read:

and

I don’t think that is a path. It describes the evolution of the probability of finding the system in the state at time . A path would be more like a sequence of states along with the times at which transitions between states occurred. So I think the quantity you are calculating is not . I think there is a way to understand the equality, and that is the crux of the issue, because appears to be non-negative for all choices of initial distributions.

Blake wrote:

should be the change in self-information averaged over

all paths. Our Markov process defines a probability measure on the set of paths that start at a given state . We are also given a probability distribution on the set of states. Putting these together we get a probability measure on the set of all paths. This is what we use to define the average over all paths, which Matteo is denoting here.

In other words: to define , I think you should first take the average of over all paths that start at the state . Then you should sum this over all states, weighted by the probabilities . The result should be

You should be able to show that this equals

The key step should be this: can be obtained by taking the probability that a path goes from to , multiplying it by , and summing over all .

This says: the probability that you wind up in jail at the end of your life is the probability that you get there if you’re born in any particular state (like Florida), times the probability you were born in that state, summed over states.

This is a well-known fact, but we may need to look at the stuff Matteo wrote at the end of his post, and do some calculations, to prove that it’s true.
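The “jail” computation is just the law of total probability; a toy numeric version (the numbers are invented purely for illustration):

```python
# Law of total probability:
#   P(end = jail) = sum_i P(jail | born in state i) * P(born in state i)
# Toy numbers, made up for illustration only.
p_start = {"Florida": 0.3, "Ohio": 0.7}          # P(born in state)
p_jail_given_start = {"Florida": 0.10, "Ohio": 0.02}  # P(jail | state)

p_jail = sum(p_jail_given_start[s] * p_start[s] for s in p_start)
print(round(p_jail, 3))  # 0.3*0.10 + 0.7*0.02 = 0.044
```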

Blake wrote:

I guess there is quite some guessing involved in what is meant by averaging.

In particular I was indeed also wondering whether Matteo may have meant something like this kind of average:

i.e. some “weighted mean”, which at first glance would contain , apart from a factor 1/2 and maybe a sign; but first, I don’t see why the other terms should cancel, nor why this should be an “average over all realizations of the process.”

I wrote:

One should probably look into original literature in order to find out how the average over the cumulative skewness is defined exactly.

I will probably have problems getting the literature anyway, and for this reason alone I have so far just tried to get an understanding of this skewness thing. For instance, it could be interesting to see how this definition is related to other definitions of skewness; Wikipedia lists quite a few.

For instance, one could compute the skewnesses of the initial and final distributions, look at their difference, and take the average over that. If I didn’t miscalculate, the skewness for this two-state case (x_0=0, x_1=1) seems to be just . The cumulative skewness looks, however, more like a mixture of the skewnesses of the initial and final states. According to Wikipedia there exists a so-called distance skewness; instead of X’ being the identically distributed copy, one could eventually take the time-evolved one, compute this “mixed skewness”, and average over all initial distributions. But maybe this is a bit too wild.
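For reference, the ordinary moment skewness of the two-state (Bernoulli) distribution with states 0 and 1 has the closed form (1 − 2p)/√(p(1 − p)). This is the standard statistics notion, which need not coincide with Smerlak’s cumulative skewness:

```python
import math

# Moment skewness of a Bernoulli(p) distribution (states x0 = 0, x1 = 1):
# the third standardized moment, with closed form (1 - 2p) / sqrt(p(1-p)).
# Note: this is the ordinary statistics notion, not necessarily the
# "cumulative skewness" of Smerlak's post.
def bernoulli_skewness(p):
    return (1.0 - 2.0 * p) / math.sqrt(p * (1.0 - p))

print(bernoulli_skewness(0.5))  # 0.0 -- symmetric at p = 1/2
```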

If you look at

• Matteo Smerlak, The mathematical origin of irreversibility, 8 October 2012.

you can see Matteo Smerlak’s definition of the skewness and the mean skewness

There must be lots of different things that people call ‘skewness’, but right now I’m just interested in what Matteo is talking about. Blake and I went through his arguments in the last few days and they all make sense to me. It took a while.

Matteo wrote:

as an explanation of the average. What is “a realization of a process”?

Nad wrote:

It’s a thing like this:

where are points in a fixed finite set and are times with

The term “realization of a process” is jargon from stochastic process theory. We have a stochastic process, in particular a Markov process, which gives a probability measure on the set of “realizations”.

Near the top of his article he calls the realization a “history”:

If you look near the bottom of his article, where he proves the main results, you’ll see he’s treating his Markov process as a triple . Here is the space of “histories” or “realizations”, is a probability distribution on this space, and is a -algebra on this space (which you can safely ignore for now).

The cumulative skewness is a function on . The average is defined by

He gives more detail about down here, too.

Thanks John for trying to explain this to me.

I didn’t look at the bottom of his article, because before I try to understand the proof I would like to understand what the proof is about. If I haven’t totally misunderstood the master equation, then the master equation uses a discrete probability distribution, so at least for the moment I prefer to stay away from sigma-algebras.

OK. So for stochastic-process people a history is the same as a realization of a process (quite different from real life though…).

If I have understood the master equation and Matteo right, then the probability of the state is given by the i’th entry of , i.e. , or as called above $\pi_i$, or as Matteo calls it . Now what are the

???

That is, I can only have a distinct transition if I have a Dirac distribution (i.e. one of the is 0, the other 1). But the master equation is …as it looks to me…evolving towards the equilibrium solution . In particular, as I said before, if I didn’t miscalculate, the explicit solution for this two-state master equation is: and , so if

one has then

In particular I don’t see how you could possibly end up in the other “Dirac state” .

So what is meant by a transition from state at time ???
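For concreteness, here is a sketch of the two-state relaxation being described, under the assumption of symmetric unit rates (the thread never writes H out explicitly): both Dirac initial conditions decay to (1/2, 1/2), and neither ever reaches the other Dirac state.

```python
import math

# Two-state master equation with symmetric unit rates (an assumption --
# the H under discussion is not written out):
#     dp0/dt = p1 - p0,   dp1/dt = p0 - p1
# has the closed-form solution p0(t) = 1/2 + (p0(0) - 1/2) * exp(-2t).
# Both "Dirac" initial states relax monotonically to (1/2, 1/2) and
# never reach the *other* Dirac state -- matching the observation above.
def p0(t, p0_init):
    return 0.5 + (p0_init - 0.5) * math.exp(-2.0 * t)

for t in (0.0, 0.5, 2.0, 10.0):
    print(t, p0(t, 1.0))  # starts at the Dirac state (1, 0), decays to 1/2
```

The resolution sketched in the reply below is that this smooth evolution of probabilities coexists with individual realizations that do jump discontinuously between the two states.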

I think I see why you’re confused. Matteo is heavily using this fact: a Markov process on a finite set together with an initial probability distribution on that finite set () gives rise to a probability measure on the space of paths

The relevant paths are constant except at finitely many times ; these are “the times at which the transitions occur”.

People call such a path a “history” or a “realization” of the Markov process. It’s simply a sequence of states together with times Matteo writes it as

Just as quantum mechanics has a Hamiltonian description and a path-integral description, so do Markov processes—and that’s what we’re using here. Unlike the path-integral description of quantum mechanics, the path-integral description of Markov processes is often done in a mathematically rigorous way. Since it involves a measure on a space of paths, we really should know what -algebra on the space of paths it’s defined on. That’s why Matteo is talking about a -algebra… but I think we should ignore such details and get the basic idea straightened out first.

If I could find a nice intro to this approach I’d point you to it. To some extent it’s explained in Matteo’s post, but he sort of assumes one is familiar with it.
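A minimal sketch of what sampling one such “realization” looks like, via the Gillespie algorithm for a two-state process with unit jump rates (the rates are an assumption of the demo):

```python
import random

# Sample a "realization" (history) of a continuous-time two-state Markov
# process: exponential waiting times, then a jump.  Unit jump rates are
# assumed here; the thread's H is never written out.
def sample_history(t_max, state=0, seed=0):
    rng = random.Random(seed)
    t, history = 0.0, [(0.0, state)]
    while True:
        t += rng.expovariate(1.0)   # waiting time at unit rate
        if t >= t_max:
            break
        state = 1 - state           # only two states: just flip
        history.append((t, state))
    return history

hist = sample_history(5.0)
print(hist)  # e.g. [(0.0, 0), (t1, 1), (t2, 0), ...] -- a piecewise-constant path
```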

typo: the signs inside have to be switched; in particular, for time one should get the same probability.

I see something that could remotely be interpreted as transitions at different times in the case of a Markov chain. There you have, if I understood correctly, the conditional probabilities as transition matrix elements. So a meaningful probability distribution for a path would probably be the product over the matrix elements corresponding to that chain of states. Is that what is taken as the probability distribution on paths?

But Matteo talks about an infinitesimal stochastic matrix, so this is a continuous process, and there things would get a little messier; but then I haven’t even seen yet where you need the paths for Matteo’s averaging procedure.

By the way, I just noticed that I mistakenly added the self-adjointness property to the infinitesimal stochastic matrix property, because I didn’t read the line above the definition. So H could be unsymmetric, not necessarily symmetric as in my above comment. That may be important for the skewness, but it doesn’t change the questions and arguments of my last comment; in particular, I unfortunately still don’t understand those “jumpy” transitions.

Nad wrote:

Not quite; the correct formula is at the bottom of Matteo’s post.

John wrote:

OK, right. I wrote:

I guess I would have liked to write:

Thanks for pointing this line out to me. As I said, I skipped the proof and so I didn’t see it. Now I understand what Matteo means by jumping. I was also a bit misled because Jake called something a master equation where the time dependence was trivial. So I was already wondering a bit why Matteo had a time-dependent , but I hadn’t found the time to investigate that.

Yes, and you are right with the “not quite”. That is, apart from my forgetting of the initial probability, what Matteo writes with respect to at the bottom is conceptually quite similar to what I said above; the are though giving me headaches, since their columns don’t sum to one but to zero. Anyway, I am again late for work.

Nad wrote:

Allowing Markov processes with a time-dependent Hamiltonian—I’ll call the Hamiltonian—allows us to study a wider variety of interesting problems.

For example, we can start with by holding constant for some interval of time, start with an equilibrium probability distribution for this , then ‘drive the system out of equilibrium’ by changing for a while, then bring back to its original value, and study how entropy changes as the system goes back toward equilibrium.

However, there are also plenty of interesting questions to ask about time-independent Hamiltonians!

Without preview it is really hard to see typos:

Mighty Wolfram said:

http://integrals.wolfram.com/index.jsp?expr=xlog%282x%2F%281+-+k%28x-1%29%29%29&random=false

Aha, thanks. Wolfram now has an online integrator. Let’s trust this (I have also found mistakes in Mathematica). Maybe I got a 0.5 wrong, but if I believe in my own formula, then under the integral

it should read:

x log(2x/(1 - k(2x-1))). The result I got from Wolfram was (modulo typing errors, because somehow the pasting doesn’t work):

Unfortunately the gives me headaches. That is, one can’t just set (for the one boundary term), and at first glance it looks as if the gives quite some divergence.

I actually think that I asked Matteo about what happens if a probability is zero (which e.g. doesn’t look good for ), and if I remember correctly (unfortunately I can’t find the Azimuth forum entry on this) he said something like: that case is averaged out… so I guess one would need to investigate this problem a little closer, e.g. with l’Hôpital.

In entropy calculations we define 0 log 0 = 0, which makes sense since p log p → 0 as p → 0⁺, by l’Hôpital (as you suggest).

If we have a list of probabilities summing to 1, and one of them vanishes, then Matteo’s self-information is infinite for that probability, but still the mean self-information, or entropy, will be finite, since we set 0 log 0 = 0.

Roughly speaking: infinitely improbable events are infinitely surprising, but still the expected value of our surprise is finite, because p goes to zero faster than log(1/p) blows up.
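In code, the convention looks like this:

```python
import math

# Shannon entropy with the standard convention 0 * log(0) = 0: the
# self-information -log p blows up as p -> 0, but its weight p shrinks
# faster, so zero-probability terms contribute nothing.
def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

print(entropy([0.5, 0.5, 0.0]))  # ln 2 ~ 0.693, same as entropy([0.5, 0.5])
```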

John wrote:

I think this is what I did (?) for starting state 0. You say now that one should also do an additional averaging step:

So if I understand you correctly (?) you say that is only the cumulative skewness for the state 0 and that one needs also to consider the additional skewness:

and then need to take the average of both. Sounds right to me. So this sounds as if one would need to add the integral (modulo typos):

(where is now seen as a function of ) and divide both terms by 2. Is that right?

I am still not sure though whether one would get rid of the divergences.

You wrote:

Yes, and by l’Hôpital one even has

which comes closer to the problematic term:

but I would need to check l’Hôpital for that term. Maybe you have more confidence that this works out just as above; I don’t. But if you tell me this is so, I would believe you.

With a 25 W power consumption for the human brain, Landauer’s limit gives us 8.4×10²¹ irreversible operations per second at 310 K.

If there were some evolutionary pressure to utilize reversible computing to save power, that would translate into orders of magnitude more logical operations each second.

That’s a million to a billion times more computational capacity than what’s derived from standard neuro-synaptic brain models.

Therefore either Landauer’s limit is irrelevant to biological information processing, or there is a layer of extremely powerful, high-frequency molecular computational architecture below the neuro-synaptic level, completely hidden from observation so far.
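The 8.4×10²¹ figure is easy to reproduce:

```python
import math

# Back-of-envelope check of the figure quoted above: at T = 310 K, a
# 25 W brain could perform at most P / (k_B * T * ln 2) Landauer-limited
# irreversible bit operations per second.
k_B = 1.380649e-23          # Boltzmann constant, J/K
T = 310.0                   # body temperature, K
P = 25.0                    # brain power budget, W

energy_per_bit = k_B * T * math.log(2.0)   # ~3.0e-21 J per erased bit
ops_per_second = P / energy_per_bit
print(f"{ops_per_second:.2e}")  # 8.43e+21, matching the 8.4e21 quoted above
```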

I believe Landauer’s limit is largely irrelevant to biological information processing, because organisms don’t come close to reaching this bound. It’s nice to know that this bound exists, and as a question of physics it’s important to clarify exactly what this bound says. But for biology, we need other ideas.

Such models already exist…

https://en.wikipedia.org/wiki/Membrane_computing

https://en.wikipedia.org/wiki/X-machine#Analog_X_Machine_.28AXM.29

The true question, though, is: if this is so for biology, why do people in physics try to generalize Landauer’s principle as if it were an axiom of nature, and why do they dismiss others’ work using it as an argument, as Dr. Motl did in the post below…

http://motls.blogspot.gr/2012/12/exorcising-maxwells-daemons.html

Landauer’s limit is not an “axiom of nature”, but it is a theorem about quite general physical systems, given suitable assumptions. Not all theorems are important in every situation.

I know about membrane computing. It’s a good idea, but I believe we still need more ideas.

Okay. In that case, what is slide 14 in David Wolpert’s presentation supposed to mean?

I think it’s fairly evident what this slide means. Clearly natural selection punishes organisms with extremely high heat dissipation (or free energy consumption), but it doesn’t seem to be pushing it down to the Landauer limit. Maybe in some far future regime where free energy is extremely scarce, this will happen. But it doesn’t seem to be happening now.