Sensing and Acting Under Information Constraints

30 October, 2014

I’m having a great time at a workshop on Biological and Bio-Inspired Information Theory in Banff, Canada. You can see videos of the talks online. There have been lots of good talks so far, but this one really blew my mind:

• Naftali Tishby, Sensing and acting under information constraints—a principled approach to biology and intelligence, 28 October 2014.

Tishby’s talk wasn’t easy for me to follow—he assumed you already knew rate-distortion theory and the Bellman equation, and I didn’t—but it was great!

It was about the ‘action-perception loop’:


This is the feedback loop in which living organisms—like us—take actions depending on our goals and what we perceive, and perceive things depending on the actions we take and the state of the world.

How do we do this so well? Among other things, we need to balance the cost of storing information about the past against the payoff of achieving our desired goals in the future.

Tishby presented a detailed yet highly general mathematical model of this! And he ended by testing the model on experiments with cats listening to music and rats swimming to land.

It’s beautiful stuff. I want to learn it. I hope to blog about it as I understand more. But for now, let me just dive in and say some basic stuff. I’ll start with the two buzzwords I dropped on you. I hate it when people use terminology without ever explaining it.

Rate-distortion theory

Rate-distortion theory is a branch of information theory which seeks the minimum rate at which bits must be communicated so that the signal can be approximately reconstructed at the other end without exceeding a given distortion. Shannon’s first big result in this theory, the ‘rate-distortion theorem’, gives a formula for this minimum rate. Needless to say, it still requires a lot of extra work to determine and achieve this minimum rate in practice.
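If you want to play with this, the rate-distortion function of a finite source can be computed numerically with the Blahut–Arimoto algorithm. Here is a minimal sketch in Python (numpy assumed; the binary source, Hamming distortion and trade-off parameter beta are my toy choices, not anything from the workshop):

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=200):
    """Approximate one point on the rate-distortion curve.

    p_x  : source distribution over symbols x
    d    : distortion matrix, d[x, xhat]
    beta : trade-off parameter (larger beta = lower distortion, higher rate)
    Returns (rate in bits, expected distortion).
    """
    n_x, n_xhat = d.shape
    q_xhat = np.full(n_xhat, 1.0 / n_xhat)       # marginal over reproductions
    for _ in range(n_iter):
        # optimal test channel q(xhat | x) for the current marginal
        w = q_xhat * np.exp(-beta * d)           # shape (n_x, n_xhat)
        q_cond = w / w.sum(axis=1, keepdims=True)
        # update the marginal over reproductions
        q_xhat = p_x @ q_cond
    rate = np.sum(p_x[:, None] * q_cond * np.log2(q_cond / q_xhat))
    distortion = np.sum(p_x[:, None] * q_cond * d)
    return rate, distortion

# Toy example: biased binary source, Hamming distortion
p_x = np.array([0.7, 0.3])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
for beta in [0.5, 2.0, 8.0]:
    R, D = blahut_arimoto(p_x, d, beta)
    print(f"beta={beta}: rate ~ {R:.3f} bits, distortion ~ {D:.3f}")
```

As beta grows the distortion drops toward zero and the rate climbs toward the source entropy, tracing out the rate-distortion curve.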

For the basic definitions and a statement of the theorem, try this:

• Natasha Devroye, Rate-distortion theory, course notes, University of Illinois at Chicago, Fall 2009.

One of the people organizing this workshop is a big expert on rate-distortion theory, and he wrote a book about it:

• Toby Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice–Hall, 1971.

Unfortunately it’s out of print and selling for $259 used on Amazon! An easier option might be this:

• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Chapter 10: Rate Distortion Theory, Wiley, New York, 2006.

The Bellman equation

The Bellman equation reduces the task of finding an optimal course of action to choosing what to do at each step. So, you’re trying to maximize the ‘total reward’—the sum of rewards at each time step—and Bellman’s equation tells you the optimal action to take at each step.

If you’ve studied physics, this should remind you of how starting from the principle of least action, we can get a differential equation describing the motion of a particle: the Euler–Lagrange equation.

And in fact they’re deeply related. The relation is obscured by two little things. First, Bellman’s equation is usually formulated in a context where time passes in discrete steps, while the Euler–Lagrange equation is usually formulated in continuous time. Second, Bellman’s equation is really the discrete-time version not of the Euler–Lagrange equation but of a more or less equivalent thing: the ‘Hamilton–Jacobi equation’.

Ah, another buzzword to demystify! I was scared of the Hamilton–Jacobi equation for years, until I taught a course on classical mechanics that covered it. Now I think it’s the greatest thing in the world!

Briefly: the Hamilton–Jacobi equation concerns the least possible action to get from a fixed starting point to a point q in space at time t. If we call this least possible action W(t,q), the Hamilton–Jacobi equation says

\displaystyle{ \frac{\partial W(t,q)}{\partial q_i} = p_i  }

\displaystyle{ \frac{\partial W(t,q)}{\partial t} = -E  }

where p is the particle’s momentum at the endpoint of its path, and E is its energy there.
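As a quick sanity check (my example, not from the talk): a free particle of mass m going from the origin at time 0 to position q at time t moves at constant velocity q/t, so its least action is

\displaystyle{ W(t,q) = \frac{m q^2}{2t} }

and indeed \partial W/\partial q = mq/t = p, the momentum at the endpoint, while \partial W/\partial t = -\frac{1}{2}m(q/t)^2 = -E, minus the energy.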

If we replace derivatives by differences, and talk about maximizing total reward instead of minimizing action, we get Bellman’s equation:

• Bellman equation, Wikipedia.

Markov decision processes

Bellman’s equation can be useful whenever you’re trying to figure out an optimal course of action. An important example is a ‘Markov decision process’. To prepare you for Tishby’s talk, I should say what this is.

In a Markov process, something randomly hops around from state to state with fixed probabilities. In the simplest case there’s a finite set S of states, and time proceeds in discrete steps. At each time step, the probability to hop from state s to state s' is some fixed number P(s,s').

This sort of thing is called a Markov chain, or if you feel the need to be more insistent, a discrete-time Markov chain.
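In code, a Markov chain is nothing more than a stochastic matrix and a loop. A tiny sketch, with made-up states and probabilities (each row of P must sum to one):

```python
import random

# transition probabilities P(s, s'): each row sums to 1
P = {
    'sunny': {'sunny': 0.8, 'rainy': 0.2},
    'rainy': {'sunny': 0.4, 'rainy': 0.6},
}

def step(s):
    """Hop from state s to a random next state with the fixed probabilities."""
    states, probs = zip(*P[s].items())
    return random.choices(states, weights=probs)[0]

s = 'sunny'
for t in range(10):
    print(t, s)
    s = step(s)
```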

A Markov decision process is a generalization where an outside agent gets to change these probabilities! The agent gets to choose actions from some set A. If at a given time he chooses the action \alpha \in A, the probability of the system hopping from state s to state s' is P_\alpha(s,s'). Needless to say, these probabilities have to sum to one for any fixed s.

That would already be interesting, but the real fun is that there’s also a reward R_\alpha(s,s'). This is a real number saying how much joy or misery the agent experiences if he does action \alpha and the system hops from s to s'.

The problem is to choose a policy—a function from states to actions—that maximizes the total expected reward over some period of time. This is precisely the kind of thing Bellman’s equation is good for!

If you’re an economist you might also want to ‘discount’ future rewards, saying that a reward n time steps in the future gets multiplied by \gamma^n, where 0 < \gamma \le 1 is some discount factor. This extra tweak is easily handled, and you can see it all here:

• Markov decision process, Wikipedia.
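Here is a minimal value-iteration sketch for a toy Markov decision process, just to show how Bellman’s equation (with the discount factor \gamma) turns into a few lines of code. The two states, two actions and all the numbers are invented for illustration; numpy is assumed:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions.
# P[a, s, s2] = probability of hopping from s to s2 under action a
# R[a, s, s2] = reward for that hop
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.5, 0.0]]])
gamma = 0.9                                # discount factor

V = np.zeros(2)                            # value of each state
for _ in range(500):
    # Bellman update: V(s) = max_a sum_s' P_a(s,s') [R_a(s,s') + gamma V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
    V = Q.max(axis=0)

policy = Q.argmax(axis=0)                  # best action in each state
print("V =", V, "policy =", policy)
```

The policy that comes out is exactly the ‘function from states to actions’ mentioned above.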

Partially observable Markov decision processes

There’s a further generalization where the agent can’t see all the details of the system! Instead, when he takes an action \alpha \in A and the system hops from state s to state s', he sees something: a point in some set O of observations. He makes the observation o \in O with probability \Omega_\alpha(o,s').

(I don’t know why this probability depends on s' but not s. Maybe it ultimately doesn’t matter much.)

Again, the goal is to choose a policy that maximizes the expected total reward. But a policy is a bit different now. The action at any time can only depend on all the observations made thus far.
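One concrete thing the agent can do in this setting is keep a ‘belief’: a probability distribution over the hidden states, updated by Bayes’ rule after each action and observation. A minimal sketch using the notation above, with made-up numbers (numpy assumed):

```python
import numpy as np

def update_belief(b, P_a, Omega_a, o):
    """Bayes update of the belief b over hidden states, after taking an
    action with transition matrix P_a[s, s2] and observing o, where
    Omega_a[o, s2] is the probability of observing o in the new state s2."""
    predicted = b @ P_a                  # distribution over next states
    unnorm = Omega_a[o] * predicted      # weight by likelihood of observation o
    return unnorm / unnorm.sum()

# Toy example: 2 hidden states, 2 observations
b = np.array([0.5, 0.5])
P_a = np.array([[0.7, 0.3],
                [0.1, 0.9]])
Omega_a = np.array([[0.8, 0.2],          # P(o=0 | s2)
                    [0.2, 0.8]])         # P(o=1 | s2)
print(update_belief(b, P_a, Omega_a, o=1))
```

Note that, as in the post, the observation probability depends only on the new state s'.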

Partially observable Markov decision processes are also called POMDPs. If you want to learn about them, try these:

• Partially observable Markov decision process, Wikipedia.

• Tony Cassandra, Partially observable Markov decision processes.

The latter includes an introduction to POMDPs and how to choose optimal policies, without any formulas. I’m not sure who would study this subject and not want to see formulas, but it’s certainly a good exercise to explain things using just words—and it makes certain things easier to understand (though not others, in a way that depends on who is trying to learn the stuff).

The action-perception loop

I already explained the action-perception loop, with the help of this picture from the University of Bielefeld’s Department of Cognitive Neuroscience:


Naftali Tishby has a nice picture of it that’s more abstract:

We’re assuming time comes in discrete steps, just to keep things simple.

At each time t

• the world has some state W_t, and
• the agent has some state M_t.

Why the letter M? This stands for memory: it can be the state of the agent’s memory, but I prefer to think of it as the state of the agent.

At each time

• the agent takes an action A_t, which affects the world’s next state, and

• the world provides a sensation S_t to the agent, which affects the agent’s next state.

This is simplified but very nice. Note that there’s a symmetry interchanging the world and the agent!

We could make it fancier by having lots of agents who all interact, but there are a lot of questions already. The big question Tishby focuses on is optimizing how much the agent should remember about the past if they

• get a reward depending on the action taken and the resulting state of the world

but

• pay a price for the information stored from sensations.

Tishby formulates this optimization question as something like a partially observable Markov decision process, uses rate-distortion theory to analyze how much information needs to be stored to achieve a given reward, and uses Bellman’s equation to solve the optimization problem!
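Tishby’s actual model is more subtle than anything I could write down here, but the wiring of the loop itself is easy to sketch in code. Everything below (the world dynamics, the sensation channel, the crude ‘k bits of memory’ standing in for the information cost) is an invented toy, just to show the four arrows:

```python
import random

# A skeleton of the action-perception loop -- NOT Tishby's model, just the
# wiring: world state W_t, agent state M_t, action A_t, sensation S_t.

def world_step(W, A):
    """The world hops to a new state, depending on the action taken."""
    return (W + A) % 5

def sense(W):
    """The world provides a noisy sensation of its state."""
    return W if random.random() < 0.8 else random.randrange(5)

def memory_update(M, S, bits=2):
    """The agent stores only `bits` bits about the sensation: a lossy memory."""
    return S % (2 ** bits)

def policy(M):
    """Choose an action based on the agent's (limited) memory."""
    return 1 if M < 2 else -1

def reward(W, A):
    """Reward depends on the action and the resulting world state."""
    return 1.0 if world_step(W, A) == 0 else 0.0

W, M, total_reward = 0, 0, 0.0
for t in range(1000):
    A = policy(M)
    total_reward += reward(W, A)
    W = world_step(W, A)
    S = sense(W)
    M = memory_update(M, S)

print("average reward:", total_reward / 1000, "memory cost per step: 2 bits")
```

The trade-off Tishby studies is, roughly, how the achievable reward changes as you turn the ‘bits of memory’ knob up or down.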

So, everything I sketched fits together somehow!

I hope what I’m saying now is roughly right: it will take me more time to get the details straight. If you’re having trouble absorbing all the information I just threw at you, don’t feel bad: so am I. But the math feels really natural and good to me. It involves a lot of my favorite ideas (like generalizations of the principle of least action, and relative entropy), and it seems ripe to be combined with network theory ideas.

For details, I highly recommend this paper:

• Naftali Tishby and Daniel Polani, Information theory of decisions and actions, in Perception-Reason-Action Cycle: Models, Algorithms and Systems, eds. Vassilis Cutsuridis, Amir Hussain and John Taylor, Springer, Berlin, 2010.

I’m going to print this out, put it by my bed, and read it every night until I’ve absorbed it.


Biodiversity, Entropy and Thermodynamics

27 October, 2014

 

I’m giving a short 30-minute talk at a workshop on Biological and Bio-Inspired Information Theory at the Banff International Research Station.

I’ll say more about the workshop later, but here’s my talk, in PDF and video form:

Biodiversity, entropy and thermodynamics.

Most of the people at this workshop study neurobiology and cell signalling, not evolutionary game theory or biodiversity. So, the talk is just a quick intro to some things we’ve seen before here. Starting from scratch, I derive the Lotka–Volterra equation describing how the distribution of organisms of different species changes with time. Then I use it to prove a version of the Second Law of Thermodynamics.

This law says that if there is a ‘dominant distribution’—a distribution of species whose mean fitness is at least as great as that of any population it finds itself amidst—then as time passes, the information any population has ‘left to learn’ always decreases!
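To illustrate the statement with a toy computation (mine, not from the talk): take constant fitnesses, so the distribution concentrated on the fittest species is dominant, and the information the population has ‘left to learn’ is just minus the log of the fittest species’ share. It decreases along the replicator-type dynamics for the species fractions, which is what you get from Lotka–Volterra when you pass to the distribution:

```python
import numpy as np

# Replicator-type dynamics for the species distribution p:
#   dp_i/dt = p_i (f_i - <f>),   <f> = sum_j p_j f_j
# With constant fitnesses, the delta distribution q on the fittest species is
# dominant, and the information 'left to learn', D(q||p) = -log p_fittest,
# should decrease monotonically.

f = np.array([1.0, 0.6, 0.3])          # constant fitnesses (toy numbers)
p = np.array([0.1, 0.3, 0.6])          # initial species distribution
dt = 0.01

for step in range(2001):
    if step % 500 == 0:
        print(f"t={step*dt:5.1f}  p={np.round(p, 3)}  D(q||p)={-np.log(p[0]):.3f}")
    mean_fitness = p @ f
    p = p + dt * p * (f - mean_fitness)
    p = p / p.sum()                    # guard against drift from Euler error
```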

Of course reality is more complicated, but this result is a good start.

This was proved by Siavash Shahshahani in 1979. For more, see:

• Lou Jost, Entropy and diversity.

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.

and more recent papers by Harper.


Entropy and Information in Biological Systems (Part 2)

4 July, 2014

John Harte, Marc Harper and I are running a workshop! Now you can apply here to attend:

Information and entropy in biological systems, National Institute for Mathematical and Biological Synthesis, Knoxville, Tennessee, Wednesday-Friday, 8-10 April 2015.

Click the link, read the stuff and scroll down to “CLICK HERE” to apply. The deadline is 12 November 2014.

Financial support for travel, meals, and lodging is available for workshop attendees who need it. We will choose among the applicants and invite 10-15 of them.

The idea

Information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The aim of this investigative workshop is to synthesize different ways of applying these concepts to help systematize and unify work in biological systems. Early attempts at “grand syntheses” often misfired, but applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful. In ecology, entropy maximization methods have proven successful in predicting the distribution and abundance of species. Entropy is also widely used as a measure of biodiversity. Work on the role of information in game theory has shed new light on evolution. As a population evolves, it can be seen as gaining information about its environment. The principle of maximum entropy production has emerged as a fascinating yet controversial approach to predicting the behavior of biological systems, from individual organisms to whole ecosystems. This investigative workshop will bring together top researchers from these diverse fields to share insights and methods and address some long-standing conceptual problems.

So, here are the goals of our workshop:

• To study the validity of the principle of Maximum Entropy Production (MEP), which states that biological systems – and indeed all open, non-equilibrium systems – act to produce entropy at the maximum rate.

• To familiarize all the participants with applications to ecology of the MaxEnt method: choosing the probabilistic hypothesis with the highest entropy subject to the constraints of our data. We will compare MaxEnt with competing approaches and examine whether MaxEnt provides a sufficient justification for the principle of MEP.

• To clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

• To develop the concept of evolutionary games as “learning” processes in which information is gained over time.

• To study the interplay between information theory and the thermodynamics of individual cells and organelles.

For more details, go here.

If you’ve got colleagues who might be interested in this, please let them know. You can download a PDF suitable for printing and putting on a bulletin board by clicking on this:



Network Theory III

16 March, 2014

 

In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy.

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

• Brendan Fong, Causal Theories: a Categorical Perspective on Bayesian Networks.

I took a slightly different approach, by saying that a causal theory \mathcal{C}_G is the free category with products on certain objects and morphisms coming from a directed acyclic graph G. In his thesis he said \mathcal{C}_G was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!


Relative Entropy (Part 4)

16 February, 2014

In recent posts by Manoj Gopalkrishnan and Marc Harper we’ve seen how not just entropy but relative entropy—the entropy of a probability distribution relative to the equilibrium distribution—is a driving force in chemistry and evolution. Now Tobias Fritz and I have finally finished our paper on this subject:

A Bayesian characterization of relative entropy.

Very roughly, we proved that any reasonable measure of the information you gain when you update your assumptions about the world based on discovering what a system is really doing must be some constant c times the relative entropy.
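Concretely, the relative entropy of a distribution p with respect to a distribution q is \sum_i p_i \log(p_i/q_i). A tiny numerical check with made-up distributions (numpy assumed):

```python
import numpy as np

def relative_entropy(p, q):
    """D(p || q) = sum_i p_i log(p_i / q_i), in nats (assumes p, q > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

print(relative_entropy([0.5, 0.5], [0.9, 0.1]))   # positive: information gained
print(relative_entropy([0.9, 0.1], [0.9, 0.1]))   # zero: nothing learned
```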

I’ve blogged about this here before:

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, \mathrm{FinStat}, and how relative entropy defines a functor on this category.

Relative Entropy (Part 3): statement of our main theorem, which characterizes relative entropy up to a constant multiple as the only functor F : \mathrm{FinStat} \to [0,\infty) with a few nice properties.

Now that the paper is done, we’re having a nice conversation about it on the n-Category Café. Since I don’t want to splinter the conversation, I won’t enable comments here—please go there and join the fun!

But having blogged about this thrice before, what’s new?

One thing is that our conversation is getting more deeply into the category-theoretic aspects. Read the long parenthetical remarks in my post on the n-Café to get up to speed on that aspect.

Another is that by looking at our paper, you can actually see the proof of our result. As I mention on the n-Café:

The proof is surprisingly hard. Or maybe we’re just surprisingly bad at proving things. But the interesting thing is this: the proof is swift and effective in the ‘generic’ case—the case where the support of the probability measures involved is the whole set they’re living on, and the constant c is finite.

It takes some more work to handle the case where the probability measures have smaller support.

But the really hard work starts when we handle the case that, in the end, has c = \infty. Then the proof becomes more like analysis than what you normally expect in category theory. We slowly corner the result, blocking off all avenues of escape. Then we close in, grab its neck, and strangle it, crushing its larynx ever tighter, as it loses the will to fight back and finally expires… still twitching.

We haven’t gotten into discussing this much yet, perhaps because the mathematicians on the n-Café are too dainty and civilized. But someone into analysis might be able to find a more efficient proof.

That would make me a bit sad—since why didn’t we find it?—but mainly happy—since this subject deserves to be clean and elegant. We really need a category-theoretic formulation of the second law of thermodynamics that’s suitable for studying complex networks: that’s the long-term goal here.


Bio-Inspired Information Theory

31 January, 2014

 

There will be a 5-day workshop on Biological and Bio-Inspired Information Theory at BIRS from Sunday the 26th to Friday the 31st of October, 2014. It’s being organized by

Toby Berger (University of Virginia)
Andrew Eckford (York University)
Peter Thomas (Case Western Reserve University)

BIRS is the Banff International Research Station, a conference venue in a rather wild part of Alberta, in the mountains west of Calgary.

Description

Here’s the workshop proposal on the BIRS website:

Currently, research in the community is organized into three streams:

• Information theory and biochemistry (including information theory and intercellular communication);

• Information theory and neuroscience; and

• Information-theoretic analysis of biologically-inspired communication systems (including nano-networking and design of biologically implemented information processing networks).

We propose a BIRS workshop to explore these streams, focusing on mathematical open problems that cut across the streams. The main objectives of this workshop would be: to bring together the most prominent researchers in this field; to discuss and review the current state of mathematical research in this field; to promote cross-pollination among the various streams of research to find common problems; and to collectively identify key future directions or open problems that would bring the greatest impact to this field over the next few years.

Expected impact

A BIRS workshop involving the field’s leading researchers would allow a review of the current state of the art, and would promote cross-pollination among these three streams of research. We expect to have these leading researchers in attendance. For example, Prof. Toby Berger (U. Virginia), a widely recognized pioneer in this field and a recipient of the Shannon Award (the top prize awarded by the IEEE Information Theory Society), is one of the co-organizers of the workshop. Moreover, we have approached many of the field’s most prominent mathematicians and scientists: a complete list is found elsewhere in this proposal, but among the most prominent confirmed participants are: Prof. Tadashi Nakano (Osaka U.), one of the earliest researchers in engineered molecular communication; Dr. Thomas D. Schneider (NIH – National Cancer Institute), inventor of the sequence logo and prominent researcher in genetic information theory; and Profs. William Bialek (Princeton U.) and Naftali Tishby (Hebrew U.), prominent experts on information theory in neural coding.

Although the focus of our workshop is on mathematical fundamentals, our list of expected participants includes a few experimental scientists, e.g. Raymond Cheong and Andre Levchenko (both from Johns Hopkins U.), in addition to mathematical scientists. This is because quantitative application of information theoretic analysis to biological systems typically requires empirical estimation of joint probability distributions for multiple input and output variables, often posing daunting data collection challenges. (Cheong and Levchenko pioneered the use of high-throughput experimental methods to collect large data sets quantifying the input/output relationships for a specific biochemical signaling pathway.) We believe a blended approach, emphasizing mathematics but including experimental perspectives, will enhance the impact of our workshop and increase its usefulness to our participants.

Given that publications in these research areas have achieved prominence in the past few years, the time is right for a general meeting among the key researchers to review the state of the field and develop future directions. Thus, our proposed workshop is timely and would be expected to have a tremendous impact on the field over the next several years.


The Rarest Things in the Universe

27 January, 2014

guest post by Leonard Adleman

About 50 years ago Kolmogorov assigned to each finite binary string \sigma a non-negative integer that measured that string’s ‘descriptive complexity’. Informally, K(\sigma) is the length (in binary) of the shortest (Turing machine) program that with the empty string as input, outputs \sigma and halts. A related measure of descriptive complexity is M(\sigma)=\frac{K(\sigma)}{|\sigma|}, where |\sigma| denotes the length of \sigma. A simple string like:

\sigma_0=\overbrace{111\ldots 111}^{1,000,000}

can be produced by a very short program; hence M(\sigma_0) is near 0. But if \sigma is a ‘random string’ (e.g. obtained by flipping coins), then, with high probability, it cannot be produced by a program significantly shorter than \sigma itself; hence M(\sigma) will be near 1.

If I asked you to produce (using a computer) a thousand strings of length one million with M near 1, it would be easy to do; just flip a lot of coins. If I asked you to produce a thousand strings with M near 0, that would also be easy. For example, you could start with a short random string and repeat it a lot. Actually, if I chose my favorite \alpha\in [0,1] and wanted a thousand strings of length one million with M near \alpha, then a mix of the preceding approaches can be used to produce them. So strings with a desired M are not rare.
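Kolmogorov complexity is uncomputable, but a crude way to see the contrast between these two kinds of strings is to use an off-the-shelf compressor as a rough stand-in for ‘shortest program’. A hedged little experiment (zlib is only a proxy, random.randbytes needs Python 3.9+, and all numbers are illustrative):

```python
import random, zlib

def compression_ratio(data: bytes) -> float:
    """Crude stand-in for M(sigma): compressed length / original length.
    (Kolmogorov complexity is uncomputable; zlib is only a rough proxy.)"""
    return len(zlib.compress(data, 9)) / len(data)

million_ones = b'\xff' * 125_000            # one million 1-bits, like sigma_0
random_bits  = random.randbytes(125_000)    # one million coin flips

print('repetitive string:', compression_ratio(million_ones))  # near 0
print('random string:   ', compression_ratio(random_bits))    # near 1
```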

Now let’s consider ‘deep strings’*. I will be informal here, but the underlying theory can be found in my Time, space and randomness.

For all binary strings \sigma, we assign a value that measures the ‘depth’ of \sigma. D(\sigma) is obtained by considering both the size and the number of steps used by each program that with the empty string as input, outputs \sigma and halts**. D(\sigma) has the following properties:

  • If there is no short program to produce \sigma, then D(\sigma) is small.
  • If there is a short program to produce \sigma and it uses few steps, then D(\sigma) is small.
  • If there is a short program to produce \sigma, but all short programs to produce \sigma use lots of steps, then D(\sigma) is large. Roughly speaking, the more steps small programs use to produce \sigma, the larger D(\sigma) will be.

Informally, we call a string with large D ‘deep’ and one with small D ‘shallow’. A few examples may help.

Consider a string \sigma obtained by flipping coins. With high probability there is no short program to produce \sigma, hence \sigma is shallow.

Now consider \sigma_0 above. Since there is a short program to produce \sigma_0 and that program uses few steps, \sigma_0 is shallow.

Now treat \sigma_0 as a number in binary (i.e. 2^{1,000,000}-1) and consider the prime factorization. The fundamental theorem of arithmetic tells us it exists and will be about one million bits long. But, unless 2^{1,000,000}-1 is somehow special (e.g. a prime times a very smooth number), its prime factorization may be very very deep. A short program can generate the prime factorization (just generate one million 1s with a short program and then give it to a short factoring program). But if it turns out that factoring can’t be done in polynomial time, then perhaps all short programs that generate the prime factorization use a huge number of steps. So the prime factorization would have a very large D. Conceivably, since steps on a computer use time and energy, the prime factorization can never be realized. It is not a long string (only one million bits) but it may exist only in theory and not in reality.

If I asked you to produce even one string of length one million with D near that of the prime factorization of 2^{1,000,000}-1, could you do it? I would not know how and I suspect that as a practical matter it cannot be done. So strings with such a D are rare.

Here is a string that does exist in our universe and that (I suspect) is quite deep:

46817226351072265620777670675006972301618979214252832875068976303839400413682313
921168154465151768472420980044715745858522803980473207943564433*5277396428112339
17558838216073534609312522896254707972010583175760467054896492872702786549764052
643493511382273226052631979775533936351462037464331880467187717179256707148303247

In fact, this string is the prime factorization of 2^{1061}-1. We should not expect it to be as deep as the prime factorization of 2^{1,000,000}-1, but we should still expect it to have considerable depth. There is even some experimental evidence that supports this view. How did I get this prime factorization? I got it from: Factorization of a 1061-bit number by the Special Number Field Sieve, by Greg Childers. So it was easy to get. Well, easy for me anyway; not so easy for Childers. He reports that the factorization took about ‘3 CPU-centuries’ using the Special Number Field Sieve. Other than by factoring 2^{1061}-1, how could you ever come to write this pair of primes? I venture to say that since the Special Number Field Sieve is the fastest known algorithm for factoring numbers of this kind, no method available today could have written these primes (for the first time) using fewer steps (and hence less time/energy).

The situation might be compared to that of atomic physics. I am not a physicist, but I suppose it is possible to theorize about an atomic nucleus with a million protons. But what if I want to create one? It appears that producing transuranic elements takes huge amounts of time/energy and the greater the number of protons, the more time/energy it takes. It is even conceivable (to me at least) that there is not enough time/energy available (at least on earth) to actually produce one. Like the prime factorization of 2^{1,000,000}-1, it may exist in theory but not in reality. On the other hand, physicists from Russia and America, using lots of time/energy, have created an atomic nucleus with 118 protons called Ununoctium. Ununoctium is analogous to Childers’ prime factorization; both exist in reality; both were very costly to create.

But my focus here is neither ontology nor physics; it is money. Recently Bitcoins and Bitcoin-like things have captured the world’s attention. I suspect that this kind of currency will revolutionize the nature of economies, and consequently the nature of governance. Personally, I am agnostic on whether this is a good thing or a bad thing. But, in any case, I am wondering if ‘depth’ might form part of a theoretical basis for understanding new currencies. I must confess to knowing few details about Bitcoins; consequently, I will start from first principles and consider some of the properties of currency:

  • Mass: A currency may have mass or may be massless. U.S. dollars and gold have mass. Bitcoins are massless. The efficiencies of the Internet market require a massless currency. Imagine running eBay or Amazon using gold. U.S. dollars have mass, so we do not actually use them for transactions on the Internet, rather we use credit card numbers to create massless IOUs that promise to deliver (but, in fact, seldom actually deliver) U.S. dollars in the future. These IOUs are essentially a massless currency.
  • Production: Who can produce or destroy the currency? This has been an important political, legal and economic issue for millennia. In the west, coins began to appear at about the time Solon ruled Athens. Solon realized that he could manipulate the value of coins. And so he did. Solon’s lesson has not been lost on governments ever since. In its simplest form: if you owe K dollars, you print K dollars, and voila! no more debt. Creditors don’t like debtors to do that, so they want to be paid in a currency that no one can produce more of. Gold comes close. You can produce more gold but you have to spend a lot of time/energy in a mine to do so. Production also includes counterfeiting. Counterfeiting comes in at least two important forms: de novo-counterfeiting (build a press) and duplication-counterfeiting (get a xerox machine). For now, all massless currencies are in digital form and are vulnerable to duplication-counterfeiting since computers make it cheap and easy to create perfect copies of digital notes (a unit of a currency will be called a ‘note’). As a result, massless currencies will typically be associated with systems that mitigate this threat. Perhaps the elaborate ledgers implemented by the creators of Bitcoin are an attempt to deal with the threat of duplication-counterfeiting.
  • Abundance: As is often said, money can be used to ‘lubricate the economy’. To accomplish this a currency has to be sufficiently abundant. For example, Ununoctium might be attractive to creditors because, compared to gold, it is far more costly in time/energy to produce more. However, it is not desirable for lubricating the economy because it is so costly to produce that fewer than 100 atoms have ever been created.

The transition from mass to masslessness will lead to currencies with uses and properties we do not normally associate with money. For example, using the techniques of secret-sharing, it becomes possible to create digital currencies where a single note can be jointly owned by 1000 individuals; any 501 of whom could cooperate to spend it, while any 500 of whom would be unable to do so.
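That ‘any 501 of 1000’ property is the kind of thing a threshold secret-sharing scheme, such as Shamir’s, provides. Here is a minimal sketch (toy parameters, not a production scheme; Python 3.8+ for the modular inverse; scale n to 1000 and the threshold to 501 for the example above):

```python
import random

# Minimal Shamir threshold sharing: any `threshold` shares recover the secret,
# fewer reveal nothing.  All arithmetic is modulo a fixed prime P.
P = 2**127 - 1                     # a Mersenne prime

def make_shares(secret, n, threshold):
    coeffs = [secret] + [random.randrange(P) for _ in range(threshold - 1)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0, mod P."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + yj * num * pow(den, -1, P)) % P
    return secret

shares = make_shares(secret=123456789, n=5, threshold=3)
print(recover(shares[:3]))   # any 3 of the 5 shares recover 123456789
print(recover(shares[1:4]))  # any other 3 work too
```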

What is the perfect currency? This is probably the wrong question; rather, we should ask what properties a currency may have, in theory and in reality. Let’s consider massless currencies.

Is there a massless currency similar to the U.S. dollar? I think yes. For example, the Government could simply publish a set of numbers and declare the numbers to be currency. Put another way, make the U.S. dollar smaller and smaller to decrease mass until asymptotically all that is left is the serial number. With regard to abundance, like the U.S. dollar, the Government is free to determine the number of notes available. With regard to production, as with the U.S. dollar, the Government can print more by simply publishing new numbers to be added to the set (or destroy some by declaring that some numbers have been removed from the set). With regard to counterfeiting, the U.S. dollar has some advantages. The mass of the U.S. dollar turns counterfeiting from a digital problem to a physical one. This provides the Government with the ability to change the U.S. dollar physically to defeat technology that might arise to produce faithful (either de novo or duplication) counterfeits.

Is there a massless currency similar to gold? I think yes. I think this is what Bitcoin-like currencies are all about. They are ‘deep-currencies’, sets of deep strings. With regard to abundance, they are superior to gold. The total number of deep strings in the set can be chosen at initiation by the creator of the currency. The abundance of gold on the other hand has already been set by nature (and, at least as currently used, gold is not sufficiently abundant to lubricate the world economy). With regard to production, as with gold, making notes requires time/energy. With regard to counterfeiting, gold has an advantage. Counterfeit gold is an absurdity, since all gold (by the ounce, not numismatically), no matter how it arises, is the same atomically and is perceived to have the same value. On the other hand, as stated above, massless currencies are vulnerable to duplication-counterfeiting. Interestingly, deep-currencies may be resistant to de novo-counterfeiting, since the creator of the deep-currency is free to choose the depth of the notes, and consequently the cost of producing new notes.

The value of a massless currency in our information based world is clear. Deep-currencies such as Bitcoin offer an attractive approach. But there is an interesting issue that may soon arise. The issue stems from the fact that the production of each note in currencies like Bitcoin requires a large investment of time/energy, and as with gold, this deprives governments of the prerogative to print money. Creditors may like this, but governments will not. What will governments do? Perhaps they will want to create a ‘Dual-currency’. A Dual-currency should be massless. It should come with a secret key. If you do not possess the secret key, then, like gold, it should be very costly to produce a new note, but if you possess the secret key, then, like the U.S. dollar, it should be inexpensive to produce a new note. Is a Dual-currency possible? Here is an example of an approach I call aurum:

  • Generate a pair of RSA keys: a public key \langle E,N\rangle, and a secret key \langle D,N\rangle. Publish the public key \langle E,N\rangle.
  • Declare that notes are exactly those integers G such that 2\leq G\leq N-1 and (the least positive residue of) G^E\mbox{Mod}(N) is less than or equal to \frac{N}{10^9}.

So, choosing a random integer V such that 2\leq V\leq N-1 and then computing V^E\mbox{Mod}(N) has about one chance in a billion of producing a note. Hence the expected number of modular exponentiations to produce a note is about one billion. On the other hand, those who possess the secret key can choose an integer W such that 2\leq W\leq \frac{N}{10^9} and calculate W^D\mbox{Mod}(N) to produce a note after just one modular exponentiation. There are many bells and whistles that can be accommodated with aurum, but here is what is ‘really’ going on. The depth of a string \sigma is obtained by considering programs running with the empty string as input, but we can consider a more general concept: ‘relative depth’. Given a pair of strings \sigma and \tau, the depth of \sigma relative to \tau is obtained by considering programs running with \tau as the input. Hence depth as we have been discussing it is the same as depth relative to the empty string. In the example above, we have made ‘dual-strings’: strings that are deep relative to the public key, but shallow relative to the secret key.
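Here is a toy version of aurum in code. The RSA parameters are tiny and insecure (two Mersenne primes), the threshold is N/16 rather than N/10^9 so everything runs instantly, and the modular inverse needs Python 3.8+; a real deployment would of course need a proper RSA key:

```python
import random

# Toy aurum: deliberately small, insecure parameters, purely for illustration.
p, q = 2**31 - 1, 2**61 - 1        # both prime (Mersenne primes)
N = p * q
phi = (p - 1) * (q - 1)
E = 65537                           # public key <E, N>
D = pow(E, -1, phi)                 # secret key <D, N>
THRESHOLD = N // 16                 # the post uses N // 10**9

def is_note(G):
    """A note is an integer G with 2 <= G <= N-1 and G^E mod N <= THRESHOLD."""
    return 2 <= G <= N - 1 and pow(G, E, N) <= THRESHOLD

def mine_note():
    """Without the secret key: guess and check.  Expected ~16 exponentiations
    here; ~a billion with the post's N/10^9 threshold."""
    while True:
        V = random.randrange(2, N)
        if pow(V, E, N) <= THRESHOLD:
            return V

def mint_note():
    """With the secret key: a single exponentiation produces a note."""
    W = random.randrange(2, THRESHOLD + 1)
    return pow(W, D, N)

print("minted note valid?", is_note(mint_note()))
print("mined note valid? ", is_note(mine_note()))
```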

One of the interesting phenomena of theoretical computer science is that you can sometimes turn bad news into good news. If factoring is hard, then we are deprived of the ability to do something we (at least number theorists) would like to do. Bad news. But surprisingly, we acquire public-key cryptography and the ability to preserve our privacy. Good news. Similarly, strings that are very hard to produce seem useless, but Bitcoins have revealed that such strings can provide a new and useful form of currency. Now that we are aware that deep strings can have value, I expect that clever people will find many new uses for them.

After thinking about deep strings for many years, I see them everywhere – I invite you to do the same. I will finish with one of my favorite observations. The watchmaker analogy is well known and is frequently used in arguing the existence of a creator. If you stumble upon a watch, you recognize from its complexity that it did not form by chance, and you conclude that there must have been a watchmaker. A human is more complex than a watch, so there must have been a ‘human-maker’ – a creator. An alternative view is that the complexity you recognize is actually ‘depth’ and the conclusion you should reach is that there must have been a computational process sustained through a great many steps. In the case of humans, the process is 3.6 billion years of evolution and the depth can be read in the genome. The watch is deep as well, but much of its depth is acquired from the human genome relative to which it is not nearly so deep.

*It has been brought to my attention that others have considered similar concepts including Charles Bennett, Murray Gell-Mann, and Luis Antunes, Lance Fortnow, Dieter Van Melkebeek, and N. Variyam Vinodchandran. Indeed, the latter group has even used the term ‘deep string’, hence it is to them we owe the name.

**For all \sigma\in\{0,1\}^{*}:

  • T(\sigma) denotes the set of Turing machine programs that with the empty string as input, output \sigma and halt.
  • E(\sigma) denotes \min\{\max\{|P|,\log_{2}(\vec{P})\} : P\in T(\sigma)\}, where |P| denotes the length of P, and \vec{P} denotes the number of steps used by P with the empty string as input.

It may be convenient for the reader to assume that D(\sigma) is approximately \frac{E(\sigma)}{K(\sigma)}; however, the proper theory of deep strings extends the notion of depth to sets of strings and accommodates the use of randomness in computation. Depth in the context of quantum computation may also be of interest.

