The Internal Model Principle

“Every good key must be a model of the lock it opens.”

That sentence states an obvious fact, but perhaps also a profound insight if we interpret it generally enough.

That sentence is also the title of a paper:

• Daniel L. Scholten, Every good key must be a model of the lock it opens (the Conant & Ashby Theorem revisited), 2010.

Scholten gives a lot of examples, including these:

• A key is a model of a lock’s keyhole.

• A city street map is a model of the actual city streets

• A restaurant menu is a model of the food the restaurant prepares and sells.

• Honey bees use a kind of dance to model the location of a source of nectar.

• An understanding of some phenomenon (for example a physicist’s understanding of lightning) is a mental model of the actual phenomenon.

This line of thought has an interesting application to control theory. It suggests that to do the best job of regulating some system, a control apparatus should include a model of that system.

Indeed, much earlier, Conant and Ashby tried to turn this idea into a theorem, the ‘good regulator theorem’:

• Roger C. Conant and W. Ross Ashby, Every good regulator of a system must be a model of that system, International Journal of Systems Science 1 (1970), 89–97.

Scholten’s paper is heavily based on this earlier paper. He summarizes it as follows:

What all of this means, more or less, is that the pursuit of a goal by some dynamic agent (Regulator) in the face of a source of obstacles (System) places at least one particular and unavoidable demand on that agent, which is that the agent’s behaviors must be executed in such a reliable and predictable way that they can serve as a representation (Model) of that source of obstacles.

It’s not clear that this is true, but it’s an appealing thought.

A particularly self-referential example arises when the regulator is some organism and the System is the world it lives in, including itself. In this case, it seems the regulator should include a model of itself! This would lead, ultimately, to self-awareness.

It all sounds great. But Scholten raises an obvious question: if Conant and Ashby’s theorem is so great, why isn’t it more well-known? Scholten puts it quite vividly:

Given the preponderance of control-models that are used by humans (the evidence for this preponderance will be surveyed in the latter part of the paper), and especially given the obvious need to regulate that system, one might guess that the C&A theorem would be at least as famous as, say, the Pythagorean Theorem (a^2 + b^2 = c^2), the Einstein mass-energy equivalence (E = mc^2, which can be seen on T-shirts and bumper stickers), or the DNA double helix (which actually shows up in TV crime dramas and movies about super heroes). And yet, it would appear that relatively few lay-persons have ever even heard of C&A’s important prerequisite to successful regulation.

There could be various explanations. But here’s mine: when I tried to read Conant and Ashby’s paper, I got stuck. They use some very basic mathematical notation in nonstandard ways, and they don’t clearly state the hypotheses and conclusion of their theorem.

Luckily, the paper is short, and the argument, while mysterious, seems simple. So, I immediately felt I should be able to dream up the hypotheses, conclusion, and argument based on the hints given.

Scholten’s paper didn’t help much, since he says:

Throughout the following discussion I will assume that the reader has studied Conant & Ashby’s original paper, possesses the level of technical competence required to understand their proof, and is familiar with the components of the basic model that they used to prove their theorem [….]

However, I have a guess about the essential core of Conant and Ashby’s theorem. So, I’ll state that, and then say more about their setup.

Needless to say, I looked around to see if someone else had already done the work of figuring out what Conant and Ashby were saying. The best thing I found was this:

• B. A. Francis and W. M. Wonham, The internal model principle of control theory, Automatica 12 (1976) 457–465.

This paper works in a more specialized context: linear control theory. They’ve got a linear system or ‘plant’ responding to some input, a regulator or ‘compensator’ that is trying to make the plant behave in a desired way, and a ‘disturbance’ that affects the plant in some unwanted way. They prove that to perfectly correct for the disturbance, the compensator must contain an ‘internal model’ of the disturbance.

I’m probably stating this a bit incorrectly. This paper is much more technical, but it seems to be more careful in stating assumptions and conclusions. In particular, they seem to give a precise definition of an ‘internal model’. And I read elsewhere that the ‘internal model principle’ proved here has become a classic result in control theory!

This paper says that Conant and Ashby’s paper provided “plausibility arguments in favor of the internal model idea”. So, perhaps Conant and Ashby inspired Francis and Wonham, and were then largely forgotten.

My guess

My guess is that Conant and Ashby’s theorem boils down to this:

Theorem. Let R and S be finite sets, and fix a probability distribution p on S. Suppose q is any probability distribution on R \times S such that

\displaystyle{ p(s) = \sum_{r \in R} q(r,s)  \; \textrm{for all} \; s \in S}

Let H(p) be the Shannon entropy of p and let H(q) be the Shannon entropy of q. Then

H(q) \ge H(p)

and equality is achieved if there is a function

h: S \to R

such that

q(r,s) = \left\{\begin{array}{cc} p(s) & \textrm{if} \; r = h(s) \\ 0 & \textrm{otherwise} \end{array} \right. █

Note that this is not an ‘if and only if’.
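Here is a small numerical sanity check of the statement, as a sketch rather than a proof. The sets, the distribution p, and the function h below are made-up toy data, and entropy is measured in bits.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(-x * math.log2(x) for x in dist.values() if x > 0)

def marginal_on_S(q):
    """Sum a joint distribution q(r, s) over r to get a distribution on S."""
    out = {}
    for (r, s), x in q.items():
        out[s] = out.get(s, 0.0) + x
    return out

# Hypothetical toy data: S has three states, R has two.
p = {"s1": 0.5, "s2": 0.25, "s3": 0.25}
R = ["r1", "r2"]

# A joint distribution in which r is a function h of s.
h = {"s1": "r1", "s2": "r2", "s3": "r1"}
q_det = {(h[s], s): p[s] for s in p}

# A joint distribution with the same marginal p, where r is an independent fair coin.
q_coin = {(r, s): p[s] / len(R) for s in p for r in R}

H_p = entropy(p)
H_det = entropy(q_det)    # equals H(p): no extra randomness was added
H_coin = entropy(q_coin)  # exceeds H(p) by the one bit of the coin

print(H_p, H_det, H_coin)
```

Both joint distributions have marginal p on S, but only the one built from h attains the lower bound.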

The proof of this is pretty easy for anyone who knows a bit about probability theory and entropy. I can restate it using a bit of standard jargon, which may make it more obvious to experts. We’ve got an S-valued random variable, say \textbf{s}. We want to extend it to an R \times S-valued random variable (\textbf{r}, \textbf{s}) whose entropy is as small as possible. We can achieve this by choosing a function h: S \to R, and letting \textbf{r} = h(\textbf{s}).

Here’s the point: if we make \textbf{r} be a function of \textbf{s}, we aren’t adding any extra randomness, so the entropy doesn’t go up.
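For readers who want the argument spelled out, one way to write it uses the chain rule for entropy. The shorthand q(r|s) = q(r,s)/p(s), defined when p(s) > 0, is an assumption of this sketch and does not appear in the theorem statement; sums over s run over states with p(s) > 0.

```latex
\begin{aligned}
H(q) &= -\sum_{s \in S} \sum_{r \in R} q(r,s) \log q(r,s) \\
     &= -\sum_{s \in S} \sum_{r \in R} p(s)\, q(r|s) \left[ \log p(s) + \log q(r|s) \right] \\
     &= -\sum_{s \in S} p(s) \log p(s)
        \;+\; \sum_{s \in S} p(s) \left( -\sum_{r \in R} q(r|s) \log q(r|s) \right) \\
     &= H(p) + \sum_{s \in S} p(s)\, H\big( q(\cdot\,|\,s) \big) \;\ge\; H(p)
\end{aligned}
```

The third line uses \sum_{r} q(r|s) = 1. The last line also shows when the bound is attained: every conditional distribution q(\cdot|s) with p(s) > 0 must have zero entropy, which is what happens when q(r|s) = 1 for r = h(s).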

What in the world does this have to do with a good regulator containing a model of the system it’s regulating?

Well, I can’t explain that as well as I’d like—sorry. But the rough idea seems to be this. Suppose that S is a system with a given random behavior, and R is another system, the regulator. If we want the combination of the system and regulator to behave as ‘nonrandomly’ as possible, we can let the state of the regulator be a function of the state of the system.

This theorem is actually a ‘lemma’ in Conant and Ashby’s paper. Let’s look at their setup, and the ‘good regulator theorem’ as they actually state it.

Their setup

Conant and Ashby consider five sets and three functions, which they summarize in a picture.

The sets are these:

• A set Z of possible outcomes.

• A goal: some subset G \subseteq Z of good outcomes.

• A set D of disturbances, which I might prefer to call ‘inputs’.

• A set S of states of some system that is affected by the disturbances.

• A set R of states of some regulator that is also affected by the disturbances.

The functions are these:

• A function \phi : D \to S saying how a disturbance determines a state of the system.

• A function \rho: D \to R saying how a disturbance determines a state of the regulator.

• A function \psi: S \times R \to Z saying how a state of the system and a state of the regulator determines an outcome.

Of course we want some conditions on these maps. What we want, I guess, is for the outcome to be good regardless of the disturbance. I might say that as follows: for every d \in D we have

\psi(\phi(d), \rho(d)) \in G
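The pointwise condition above can be checked mechanically on a finite example. The following is a minimal sketch in which every set, map, and name (a thermostat-like toy) is hypothetical and chosen only for illustration.

```python
# Hypothetical toy instance of the setup; none of these values come from the paper.
D = ["cold", "mild", "hot"]               # disturbances
S = ["low", "ok", "high"]                 # system states
R = ["heat", "idle", "cool"]              # regulator states
Z = ["comfortable", "uncomfortable"]      # outcomes
G = {"comfortable"}                       # good outcomes, a subset of Z

phi = {"cold": "low", "mild": "ok", "hot": "high"}      # phi: D -> S
rho = {"cold": "heat", "mild": "idle", "hot": "cool"}   # rho: D -> R

def psi(s, r):
    """psi: S x R -> Z, comfortable exactly when the regulator counteracts the system."""
    good_pairs = {("low", "heat"), ("ok", "idle"), ("high", "cool")}
    return "comfortable" if (s, r) in good_pairs else "uncomfortable"

def always_good(phi, rho, psi, D, G):
    """True when psi(phi(d), rho(d)) lies in G for every disturbance d."""
    return all(psi(phi[d], rho[d]) in G for d in D)

# A regulator that heats on a hot day violates the condition.
bad_rho = dict(rho, hot="heat")

print(always_good(phi, rho, psi, D, G))
print(always_good(phi, bad_rho, psi, D, G))
```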

Unfortunately Conant and Ashby say they want this:

\rho \subset  [\psi^{-1}(G)]\phi

I can’t parse this: they’re using math notation in ways I don’t recognize. Can you figure out what they mean, and whether it matches my guess above?

Then, after a lot of examples and stuff, they state their theorem:

Theorem. The simplest optimal regulator R of a reguland S produces events R which are related to events S by a mapping h: S \to R.

Clearly I’ve skipped over too much! This barely makes any sense at all.

Unfortunately, looking at the text before the theorem, I don’t see these terms being explained. Furthermore, their ‘proof’ introduces extra assumptions that were not mentioned in the statement of the theorem. It begins:

The sets R, S, and Z and the mapping \psi: R \times S \to Z are presumed given. We will assume that over the set S there exists a probability distribution p(S) which gives the relative frequencies of the events in S. We will further assume that the behaviour of any particular regulator R is specified by a conditional distribution p(R|S) giving, for each event in S, a distribution on the regulatory events in R.

Get it? Now they’re saying the state of the regulator R depends on the state of the system S via a conditional probability distribution p(r|s) where r \in R and s \in S. It’s odd that they didn’t mention this earlier! Their picture made it look like the state of the regulator is determined by the ‘disturbance’ via the function \rho: D \to R. But okay.

They’re also assuming there’s a probability distribution on S. They use this and the above conditional probability distribution to get a probability distribution on R.

In fact, the set D and the functions out of this set seem to play no role in their proof!

It’s unclear to me exactly what we’re given, what we get to choose, and what we’re trying to optimize. They do try to explain this. Here’s what they say:

Now p(S) and p(R|S) jointly determine p(R,S) and hence p(Z) and H(Z), the entropy in the set of outcomes:

\displaystyle{ H(Z) = - \sum_{z \in Z} p(z) \log (p(z)) }

With p(S) fixed, the class of optimal regulators therefore corresponds to the class of optimal distributions p(R|S) for which H(Z) is minimal. We will call this class of optimal distributions \pi.

I could write a little essay on why this makes me unhappy, but never mind. I’m used to the habit of using the same letter p to stand for probability distributions on lots of different sets: folks let the argument of p say which set they have in mind at any moment. So, they’re starting with a probability distribution on S and a conditional probability distribution on r \in R given s \in S. They’re using these to determine a probability distribution on R \times S. Then, presumably using the map \psi: S \times R \to Z, they get a probability distribution on Z. H(Z) is the entropy of the probability distribution on Z, and for some reason they are trying to minimize this.
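The chain from p(S) and p(R|S) to p(Z) and H(Z) can be sketched in a few lines. This is one reading of the construction, not code from the paper; the two-state example and the names `outcome_distribution`, `deterministic` and `coin_flip` are hypothetical.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(-x * math.log2(x) for x in dist.values() if x > 0)

def outcome_distribution(p_S, p_R_given_S, psi):
    """Form p(r, s) = p(s) p(r|s) and push it forward along psi to get p(z)."""
    p_Z = defaultdict(float)
    for s, ps in p_S.items():
        for r, prs in p_R_given_S[s].items():
            p_Z[psi(s, r)] += ps * prs
    return dict(p_Z)

# Hypothetical toy example with S = R = Z = {0, 1}.
p_S = {0: 0.5, 1: 0.5}

def psi(s, r):
    """psi: S x R -> Z, here addition mod 2."""
    return (s + r) % 2

# A regulator that copies the system state, so the outcome is always 0.
deterministic = {0: {0: 1.0}, 1: {1: 1.0}}
# A regulator that ignores the system and flips a fair coin.
coin_flip = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

H_Z_det = entropy(outcome_distribution(p_S, deterministic, psi))
H_Z_coin = entropy(outcome_distribution(p_S, coin_flip, psi))
print(H_Z_det, H_Z_coin)
```

In this toy case the copying regulator makes the outcome constant, so H(Z) vanishes, while the coin-flipping regulator leaves a full bit of uncertainty in Z.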

(Where did the subset G \subseteq Z of ‘good’ outcomes go? Shouldn’t that play a role? Oh well.)

I believe the claim is that when this entropy is minimized, there’s a function h : S \to R such that

p(r|s) = 1 \; \textrm{if} \; r = h(s)

This says that the state of the regulator should be completely determined by the state of the system. And this, I believe, is what they mean by

Every good regulator of a system must be a model of that system.
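To see the claimed conclusion in a small case, one can enumerate every function h: S \to R, compute H(Z) for each, and compare against randomly sampled conditional distributions p(r|s). This is only a sketch under assumptions: the sets, p(S), and \psi below are made up, and the comparison relies on the standard fact that entropy is concave while p(Z) depends linearly on p(R|S), so the minimum over all regulators is attained at a deterministic one.

```python
import itertools
import math
import random
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(-x * math.log2(x) for x in dist.values() if x > 0)

def H_Z(p_S, p_R_given_S, psi):
    """Entropy of the outcome distribution induced by p(s), p(r|s) and psi."""
    p_Z = defaultdict(float)
    for s, ps in p_S.items():
        for r, prs in p_R_given_S[s].items():
            p_Z[psi(s, r)] += ps * prs
    return entropy(p_Z)

# Hypothetical toy data.
S = [0, 1, 2]
R = [0, 1, 2]
p_S = {0: 0.5, 1: 0.3, 2: 0.2}

def psi(s, r):
    """psi: S x R -> Z, here addition mod 3."""
    return (s + r) % 3

# Search over all deterministic regulators h: S -> R.
best_h, best_H = None, float("inf")
for choice in itertools.product(R, repeat=len(S)):
    h = dict(zip(S, choice))
    policy = {s: {h[s]: 1.0} for s in S}
    value = H_Z(p_S, policy, psi)
    if value < best_H:
        best_h, best_H = h, value

# Sample stochastic regulators with a fixed seed for reproducibility.
rng = random.Random(0)

def random_policy():
    """A random conditional distribution p(r|s) for each s."""
    policy = {}
    for s in S:
        weights = [rng.random() for _ in R]
        total = sum(weights)
        policy[s] = {r: w / total for r, w in zip(R, weights)}
    return policy

sampled_min = min(H_Z(p_S, random_policy(), psi) for _ in range(1000))
print(best_h, best_H, sampled_min)
```

In this example the best h makes \psi(s, h(s)) the same for every s, and none of the sampled stochastic regulators achieves a lower H(Z).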

I hope you understand: I’m not worrying about whether the setup is a good one, e.g. sufficiently general for real-world applications. I’m just trying to figure out what the setup actually is, what Conant and Ashby’s theorem actually says, and whether it’s true.

I think I’ve just made a lot of progress. Surely this was no fun to read. But I found it useful to write it.

56 Responses to The Internal Model Principle

  1. Jan Moren says:

    Is it perhaps meant to be \sum_{r \in R}{q(r,s)} in the first equation of the theorem?

  2. Are they saying that R contains a model of S means there is a function h that maps states of S to the states of R? That seems kind of trivial. Are the additional probability/entropy constraints required for being good/optimal or just for being a model?

  3. Do they use R contains a model of S to mean there is a map h from states of S to the states of R? That seems kind of trivial. Or are the probability/entropy constraints part of this definition? I read them as being part of the goodness/optimality definition.

  4. Do they use R contains a model of S to simply mean there is a function h from the states of S to the states of R? That would seem kind of trivial. Or are the probability/entropy conditions a part of this definition as well? I was taking those to be the requirements for goodness/optimality.

    • John Baez says:

      Who is “they”? Conant and Ashby? I tried to figure out what they mean by R contains a model of S, and my answer so far is this blog post.

      (Francis and Wonham seem to have a different idea.)

      • I was trying to separate out being a model and being good/optimal. That separation is not clear to me even in the blog post, but maybe being a model without being good does not make much sense, since anything can be a (very) bad model of anything else.

        • John Baez says:

          This blog post records my struggles to understand a paper that I found very unclear. I believe in their setup they are trying to formulate a concept of “optimal regulator” of a system, and prove that any such thing must “contain a model” of the system. But I found their definitions of both these concepts very tough to understand, mainly because they don’t present their material in the systematic, regimented way that we mathematicians are used to. E.g., some terms in the theorem are only defined en passant in the proof. The actual math involved is all very simple.

    • linasv says:

      Hi Daniel,

      Allow me to do some forensic anthropology: in 1970, the general theory of control systems would still be quite fresh. So, the control theory developed in the 1940’s to prevent the unwanted vibration of fighter aircraft wings would have been mostly about linear algebra, meromorphic functions, resonances, filters, transfer functions, whereas the notion that you could use a computer to perform highly non-linear controls – discontinuous, computable recursive functions – would be both forehead-slappingly obvious and yet also mind-boggling in terms of the theoretical possibilities of this new vista. So, for starters, you need to have a map of this new territory.

      The most abstract way to express a control system would indeed seem to be “for every possible state of the system S, please give me a state of the regulator R, such that R prevents S from halting and catching fire.” Perhaps trivial, but a good place to start, to address the general concept of control.

      • linasv says:

        To be extra-clear: a quick skim of shows that, to this very day, the concept of control is still very much preoccupied with systems that are modeled by PDE’s: the model is a PDE. Which is very very different from a model that is “a key, a map, a menu, a bee’s dance”.

        • John Baez says:

          I think systems of ODEs, actually. I urge anyone who has the requisite academic superpowers to get ahold of this paper:

          • B. A. Francis and W. M. Wonham, The internal model principle of control theory, Automatica 12 (1976) 457–465.

          and see how they formalize the internal model principle for systems of linear ODEs.

        • John Baez says:

          Excellent! I will add this link to the blog post. Thanks. This paper is much more technical than the Conant–Ashby paper, but it also seems to contain the version of the internal model principle that more people actually understand and use. It uses some special tricks with linear ODE—Laplace transforms and complex analysis—to define when one system contains an ‘internal model’ of another. It would be nice to have a definition that worked more generally.

        • linasv says:

          Yes, but … Yes, the “Internal Model Principle” paper seems to have 1500 citations, and is apparently a part of standard curriculum; yet, in many ways it is much less interesting because of its focus on ODE’s.

          By contrast, the good regulator theorem, and in particular, the lemma at the center of it, seems to suggest some sort of practical algorithm that might be useful in machine learning (I seem to glimpse something, but can’t quite grasp it). Consider, for example, the rather technical and abstract ideas of Solomonoff induction and AIXI, which purportedly are so general that they can ‘figure out anything’, yet are explicitly non-computable. The good regulator theorem seems to be saying ‘here’s how you can compute it’; it seems to offer a way forward. I’m unable to put the pieces together, but it seems like they should somehow fit.

        • John Baez says:

          I agree that the basic intuition behind the “internal model principle” extends beyond systems governed by linear (or even nonlinear) ODEs, so it would be nice to have a much more general theorem that captures this intuition. I’m not sure Conant and Ashby’s theorem really captures it. It’s hard to figure out what their theorem actually says, but the proof is very short, with the most technical part being the Lemma I wrote down:

          Theorem. Let R and S be finite sets, and fix a probability distribution p on S. Suppose q is any probability distribution on R \times S such that

          \displaystyle{ p(s) = \sum_{r \in R} q(r,s)  \; \textrm{for all} \; s \in S}

          Let H(p) be the Shannon entropy of p and let H(q) be the Shannon entropy of q. Then

          H(q) \ge H(p)

          and equality is achieved if there is a function

          h: S \to R

          such that

          q(r,s) = \left\{\begin{array}{cc} p(s) & \textrm{if} \; r = h(s) \\ 0 & \textrm{otherwise} \end{array} \right. █

          Furthermore, the proof of this lemma is so easy that I can’t imagine the theorem, whatever it is, is very helpful for actually doing practical things. The lemma may look technical, but it really just says “If you have two random variables, together they must carry at least as much information as the first one. Furthermore, together they will carry exactly as much information as the first random variable if the second variable is a function of the first.”

          I believe the point of this lemma is to say “if you’re optimally regulating a system, your regulator won’t be doing a bunch of extra random stuff that’s not determined by what the system is doing.” Or something like that.

          But this doesn’t help us a whole lot in designing an optimal regulator.

  5. Gabriel Sztorc says:

    I think the weird notation is just relational composition. The part in square brackets is a relation on S and R that connects each state s of the system with the set of states of the regulator that are appropriate responses to s. If you compose that with the function from inputs to states of the system, you get a relation that maps each input d to a set of states of the regulator that are appropriate responses to the system state resulting from d.

    So, it looks equivalent to your guess.
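Under this reading, the inclusion \rho \subset [\psi^{-1}(G)]\phi can be compared directly with the pointwise condition \psi(\phi(d), \rho(d)) \in G on a finite example. The sketch below uses made-up sets and maps, and the helper names `inclusion_holds` and `pointwise_holds` are hypothetical; it checks that the two conditions agree for every possible \rho: D \to R in this toy case.

```python
import itertools

# Hypothetical toy instance: all sets, maps and names below are made up.
D = [0, 1, 2, 3]
S = [0, 1]
R = [0, 1]
G = {"good"}

phi = {d: d % 2 for d in D}   # phi: D -> S

def psi(s, r):
    """psi: S x R -> Z, good exactly when the regulator matches the system."""
    return "good" if r == s else "bad"

# psi^{-1}(G) read as a relation between S and R.
good_pairs = {(s, r) for s in S for r in R if psi(s, r) in G}

# Relational composition with phi: d is related to r when (phi(d), r) is a good pair.
composite = {(d, r) for d in D for r in R if (phi[d], r) in good_pairs}

def inclusion_holds(rho):
    """rho, viewed as its graph, is contained in the composite relation."""
    return {(d, rho[d]) for d in D} <= composite

def pointwise_holds(rho):
    """psi(phi(d), rho(d)) is in G for every d."""
    return all(psi(phi[d], rho[d]) in G for d in D)

# Compare the two conditions on every possible rho: D -> R.
all_rhos = [dict(zip(D, values)) for values in itertools.product(R, repeat=len(D))]
agree = all(inclusion_holds(rho) == pointwise_holds(rho) for rho in all_rhos)
print(agree)
```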

    • John Baez says:

      Oh! I considered that they might be using relational composition, but somehow I couldn’t get that interpretation to parse correctly. Now I see it does. Thanks!

  6. domenico says:

    Scholten’s link doesn’t work.
    I found the paper on the internet as Every_Good_Key_Must_Be_A_Model_Of_The_Lock_It_Opens.pdf.

    It seems a simple assumption: knowledge of the system permits its regulation; but I understand that it could be possible to write the regulation function using probability distributions and entropy minimization.

  7. Jon Awbrey says:

    Ashby’s book was my own first introduction to cybernetics and I recently returned to his discussion of regulation games in connection with some issues in Peirce’s logic of science or “theory of inquiry”.

    In that context it seems like the formula \rho \subset  [\psi^{-1}(G)]\phi would have to be saying that the Regulator’s choices are a subset given by applying that portion of the game matrix with goal values in the body to the Disturber’s input.

  8. Boris Borcic says:

    I am not sure I wholly buy the idea, but the one thing it reminds me of is a shorthand argument to justify predictive power of Lamarckism, which is that the animal’s behavior is a model of the animal’s ecological niche.

  9. Simon Burton says:

    Looking at the Conant Ashby paper, they have a second figure where Z appears as an error term in some kind of negative feedback control circuit. So this could help explain why they want to minimize H(Z).

    In the last few pages of this other paper by Scholten:
    from page 36, “The Simplest Optimal Regulator”, he shows a good example. Looking at the table for \psi, the idea is to always produce the same z outcome no matter what s value the system chooses. I.e., we are looking for a function h:S \to R such that \psi(h(s), s) is constant.

  10. linasv says:

    I believe that the square-bracket notation is sometimes used to indicate a function. This notation seems to sometimes be used in books on lattices and orders (for example, to discuss the Scott topology and Scott continuity). In this case, I parse the formula as follows: \psi^{-1}(G) is a set of pairs (s,r) \in S \times R. Such a set is a “relation”, and can then be re-interpreted as a function by currying: given an element s \in S, it can be mapped to the set \{ r \in R : (s,r) \in \psi^{-1}(G) \}. That is, one has a map, a function s \mapsto \{ r \in R : (s,r) \in \psi^{-1}(G) \}. The notation [ \psi^{-1}(G) ] simply denotes this map. That is, [ \psi^{-1}(G) ] : S \to \mathrm{powerset}(R).

    The rest is downhill: since \phi: D \to S, writing two functions next to each other is function composition: thus there is a function [ \psi^{-1}(G) ] \phi : D \to \mathrm{powerset}(R).

    This function can be viewed as a set of pairs (d, subset of R). It is being compared to \rho : D \to R which can be understood to be a set of pairs (d, singleton subsets of R).

    One reason I think my interpretation is the intended one is the date of the paper, and the Dept of Computer Science affiliation of one of the authors: in 1970, the discussion of functions involving Scott continuity, partial orders, etc. would still have been fresh and common and widespread to anyone working in computer science, and so the square-bracket notation would have been “obvious” to the intended audience.

    • John Baez says:

      Thanks! It’s interesting how notation that seems “incorrect” or at least “highly obscure” (absent further explanation) to a mathematician can make sense in some other community, even one that overlaps with it quite a bit.

    • Jon Awbrey says:

      The square brackets may also suggest a matrix, the kernel of a transform, with \phi interpreted as a column vector.

  11. Jon Awbrey says:

    I notice also that there appears to be a period after the right square bracket, whatever significance that may have, and I don’t think they mean proper inclusion necessarily, simply using \subset where we’d now use \subseteq.

    Sub [Ashby/Ashcroft] twice at the end?

  12. Jaime Santos says:

    Forgive me what seems an obvious observation, but it looks to me that if R and S are discrete sets, R has to be at least the size of S, because otherwise the relation r=h(s) does not determine all the states in S, i.e. the function h(s) is not invertible. Physically, this seems to mean that there are different states in S to which R replies in the same manner. Therefore, the regulator only captures a part of the behaviour of the regulated (the one that requires regulation, I guess). Did I get this correctly?

  13. Quine Atal says:

    I think the regulator has only to map its proximate input into its output, so it “models” only its proximate input including whatever expected or allowed deviations might occur. In a way that is what any function does too. Perhaps part of the difficulty is in failing to note that C&A are here using the word ‘model’ in a metaphorical sense.

  14. None of the math is within my reach, but I think a lot about mental models and other representations, and I would point out that there is a dimensional difference in these Map!=Terrain relations.
    A key emulates the internal surface of the lock. The map emulates the flattened appearance of the terrain (and the concept “the terrain” is a mental model of the surface of a section of the planet).

    Is including that too ambitious?

    • John Baez says:

      There should be a super-general formulation that includes these examples. I’m not sure Conant and Ashby have it. Regarding the map example, a flat map is “good” if for example your main task is getting from one specific latitude and longitude to another, with your altitude being largely irrelevant. Such a map might not be so good if you were trying to walk from Kashmir to Lhasa without having to climb up and down any more than necessary. (At the very least, contour lines would be helpful.)

      Generalizing this, in many cases we can imagine a kind of mapping from the system to the model that “discards irrelevant information”, where what’s irrelevant depends on the task at hand. Mathematicians call this type of mapping a “projection”: it has a technical meaning, but it doubtless originates in humble examples like maps… as does the word “mapping”.

  15. In reinforcement learning, the method of Q learning is able to achieve optimal control of a system without having a model of it. However, Q learning does need to have experimental access to the system (as a black box). The Q learning formulation is in terms of Markov decision processes rather than ODEs. One could argue that this is both more general and less general. Tali Tishby has been developing an information-theoretic account of optimal control. My understanding is that only some information about the system is useful for control and therefore the controller does not need a “complete” model of the system being controlled but only a model of how the relevant aspects of the dynamics relate to the objective function.

    • nad says:

      Prof. Dietterich, I guess you are aware of the meaning of your name in german?

    • seanny1986 says:

      To what extent is the reward function in Q learning a model of the system, though? Also, what do you mean by optimal? Q learning can learn the optimal Q function, but this is different to learning optimal control. You only get useful behavior when the reward function is congruent with the environment and the task, which suggests that it serves the purpose of a model that maps states X and a goal G to actions U.

      As an example, imagine we want the agent to go from A to B as quickly as possible, with some hazard in the environment. We give the agent a negative reward at each timestep to incentivize shorter paths to B, and a positive reward for reaching B. Depending on the relative magnitudes of these two rewards, we’ll get an agent that either goes from A to B, or suicides to prevent further negative reward. The first set of rewards all represent a valid model of behavior given the goal (getting from A to B), and the second doesn’t.

      Keep in mind, I’m assuming that “model” here is being used in a more general sense than it typically is in RL, where it’s mainly used to denote a transition function.

  16. domenico says:

    If DNA is a control mechanism for cells, then does it contain information on the current environment and the past environment (up to 100,000 years ago)? Can the phenotype give us information about the environment of ancient animals (feathering, dimensions, internal organs)?

  17. Richard J Botting says:

    This result looks like an extension of W. Ross Ashby’s Law of Requisite Variety from his “Introduction to Cybernetics” (1956), UK publisher Chapman & Hall. The 1976 reprint was obtainable from Barnes and Noble. A good undergraduate text. Solidly based on set theory but short of rigorous proofs. Sigh.

    Statements about A models B and C is contained in B as applied to dynamic systems make me think of Hartmanis and Stearns’s theory, which associates a monoid with each automaton. The homomorphisms between monoids correspond to modelling (yes, we have a category). They prove that if an automaton is constructed to contain another then there is a homomorphism from the monoid of the containing automaton to the component. This glosses a book I do not have with me. It is at graduate level and largely forgotten, sigh.

  18. Jon Awbrey says:

    There’s a far-ranging discussion that could take off from this point — touching on the links among analogical reasoning, arrows and morphisms, cybernetic images, iconic representations, mental models, systems simulations, etc., and just how categorically or not those functions are necessary to intelligent agency, all of which questions have enjoyed large and overlapping literatures for a long time now —, but I’m not sure how much of that you meant to open up.

    • John Baez says:

      In this post I’m really just interested in understanding the theorem of Conant and Ashby. In my opinion, there’s a lot more work to be done to understand what it’s really saying, and to understand precisely how close it comes toward formalizing the intuition they have.

  19. David Duffy says:

    • Eduardo D. Sontag, Adaptation and regulation with signal detection implies internal model.

    …a simple result showing, under suitable technical assumptions, that if a system Σ adapts to a class of external signals U, in the sense of regulation against disturbances or tracking signals in U, then Σ must necessarily contain a subsystem which is capable of generating all the signals in U. It is not assumed that regulation is robust, nor is there a prior requirement for the system to be partitioned into separate plant and controller components.

    The internal model principle originates in the biological cybernetics literature. But, as with any “principle” in control theory (like dynamic programming, the maximum principle, etc.) and more generally in mathematics, the IMP is not a theorem but rather a “mold” for many possible theorems, each of which will hold under appropriate technical assumptions, and whose conclusions will depend upon the precise meaning of “class of external signals”, “reproducing all functions”, and so on.

  20. Giampy says:

    Also if you are looking for a more general version of the Internal Model Principle, this paper is a classic by now:

    • C. I. Byrnes and A. Isidori, Limit sets, zero dynamics, and internal models in the problem of nonlinear output regulation, Automatic Control, IEEE Trans. 48 (2003), 1712–1723.

    This is a closely related version, by the same authors, on arxiv:

    • C. I. Byrnes and A. Isidori, Nonlinear internal models for output regulation.

    One thing that I always wondered is if and how the IMP is related to the “no free lunch” theorem in optimization. My intuition leads me to believe that they are really the same thing (because you have to have some a-priori knowledge – that is an implicit model – of the function you are trying to optimize if you want to achieve better than average results, or something like that …).

    • John Baez says:

      Thanks! More for me to read and think about! These papers seem harder than Eduardo D. Sontag’s paper. More precisely, it seems harder to get any idea of what the authors are trying to do.

      What does the “no free lunch” theorem say, roughly? Of course I can guess that very roughly it says that there’s no free lunch. But I meant a bit less roughly.

      • An output regulator is a controller that forces the output of a plant to track a trajectory generated by another known external system (exosystem).

        For example i would want the output of my plant (e.g. aircraft altitude) to grow steadily until a certain point and then stay constant for a while.

        This is the most general control problem, and the authors are trying to characterize the structure of the solution for the most general case in which the plant is nonlinear (actually finding and implementing the solution for a given nonlinear plant is another story, but at least we know the structure, roughly speaking).

        Ok, about the No Free Lunch Theorem, quoting this paper:

        nfl-optimization-explanation.pdf

        “the NFLT tells us that if we cannot make any prior assumption about the optimization problem we are trying to solve, no strategy can be expected to perform better than any other. Put it another way, … the only way one strategy can outperform another is if it is specialized to the specific problem under consideration”.

        The paper of course formulates the problem in a formal way, providing a solution and a simple (so the authors say) explanation.
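
        The statement quoted above can be checked exhaustively on a toy problem. The sketch below is an illustration under assumptions that are not in the cited paper: a three-point domain, binary values, and two hypothetical non-repeating search strategies (`fixed_order` and `adaptive`). Counted over all eight possible functions, both strategies see exactly the same distribution of value sequences, so neither can have a better average for any performance measure based on the observed values.

```python
from collections import Counter
from itertools import product

X = (0, 1, 2)  # search points (toy domain, assumed)
Y = (0, 1)     # possible objective values (assumed)


def fixed_order(history):
    """Visit the points in the order 0, 1, 2, ignoring what was observed."""
    return len(history)


def adaptive(history):
    """Start at point 2, then choose the next point from the last observed value."""
    if not history:
        return 2
    visited = [x for x, _ in history]
    unvisited = [x for x in X if x not in visited]
    last_value = history[-1][1]
    return unvisited[0] if last_value == 1 else unvisited[-1]


def observed_values(strategy, f, m):
    """Run `strategy` for m evaluations of f and return the values it saw."""
    history = []
    for _ in range(m):
        x = strategy(history)
        history.append((x, f[x]))
    return tuple(y for _, y in history)


def value_histogram(strategy, m):
    """Count observed value sequences over every function f: X -> Y."""
    all_functions = product(Y, repeat=len(X))  # f encoded as (f(0), f(1), f(2))
    return Counter(observed_values(strategy, f, m) for f in all_functions)


hist_fixed = value_histogram(fixed_order, 2)
hist_adaptive = value_histogram(adaptive, 2)
print(hist_fixed == hist_adaptive)
```

        If the average is instead taken over a restricted set of functions, for example only increasing ones, the two histograms can differ, and that is where prior knowledge about the problem starts to pay off.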

        Now, let me try to better substantiate my earlier claim that NFLT==IMP:

        1) A controller can really be seen as an on-line optimization algorithm (the connection is explicit in optimal control). Therefore if one applies NFLT, the only way a controller can perform better than a random one is if it is specialized to the “problem”, i.e. if it uses a priori knowledge (an internal model) of the plant (IMP).

        2) It is also well known (and used to prove convergence in some cases) that an optimization algorithm can also be seen as a controller, with the plant being the function to minimize. In this case the controller/algorithm is tasked with driving the function output towards its minimum, by finding the best input. Here, applying the IMP says that the algorithm has to have an internal model (some a priori knowledge) of the plant (function) to be controlled (minimized), which really is the NFLT.

        Actually, since optimization is normally applied to an input-output function without dynamics (without states), it might be the case that IMP is more general than NFLT, but anyway, i just wanted to suggest a connection as an idea. Hope you see it.

        • John Baez says:

          Okay, I think I get the connection now. It would be fun to dig deeper into this. I’m starting to see biology as a bunch of systems modelling each other in order to control what’s going on…

  21. David Duffy says:

    If my limited understanding is correct, Byrnes and Isidori and that related literature examine the practical problem of whether particular implementations of an internal model (e.g. relying on “measured output and the first r−1 derivatives with respect to time”) can control an “arbitrary” dynamical system, e.g.:

    …the existence of an output feedback law that asymptotically steers to zero a given controlled variable, while keeping all state variables bounded, for any initial conditions in a fixed compact set. The proposed framework encompasses and extends a number of existing results in the fields of output feedback stabilization and output regulation of nonlinear systems. The main assumption under which the theory is developed is the existence of a state feedback control law able to achieve boundedness of the trajectories of the zero dynamics of the controlled plant.

  22. Jon Awbrey says:

    Re: Giampiero Campa

    It is one question whether a regulator has “knowledge” of the object system and another question whether that knowledge is embodied in the more specific form of a “model”. At this point we encounter a variety of meanings for the word “model”. In my experience they divide into two broad classes, “logical models” and “analogical models”.

    Logical modeling involves a relation between a theory and anything that satisfies the theory, in practice either the original domain of phenomena from which the theory was derived in the first place or a formal object we construct to satisfy the theory.

    Analogical modeling involves a relation between any two things that have similar properties or structures or that satisfy the same theory.

    It is possible that a regulator has knowledge, competence, or a capacity for performance that exists in the form of a theory or other data structures without necessarily having either type of model on hand.

    I would be the last to deny that models of either sort are extremely useful when we can get them, but there are reasons for thinking that the mirror of nature does not go all the way down to the most primitive structures of adaptive functioning.

    • Giampy says:

      While i’m not 100% sure about what you meant, I think i can see what you are saying, and i think i agree.

      I have used the word “model” in a very liberal way, but indeed i didn’t mean to restrict the thoughts to formal, analytic, explicit, mathematical, models.

      I really meant that when some knowledge about the environment is incorporated in your regulator, or algorithm, or system, then it’s somehow easier for your system to steer the environment to its advantage (i think you call this “analogical” modeling).

      This seems to me to be a very general principle or guideline.

      In fact, perhaps in biology evolution is a way for living organisms to acquire and refine such knowledge, in an implicit, iterative, and randomized way.

      In engineering, in the last few decades, we have seen an increase in the use of formal (mathematical and computational) modeling in system design. I think it would be nice if the same happened in social sciences or law, for example.

    • linasv says:

      Hi Jon,

      Some parts of a model need to be “faithful” (or, to borrow a phrase from genetics, “highly conserved”) while other parts can be variable and/or relatively unimportant.

      If the model is formal in the sense of Model Theory, then how do I assign probabilities to the various parts to indicate how faithful or variable they are? Is the result of doing so a “Markov Logic Net” (MLN)? If so, then how does it work?

      Now, practical control problems have either linear or ODE models, so if I walk down the Model-Theory path, this suggests employing something like a Satisfiability Modulo Theories (SMT) framework. But these are not probabilistic, nor do they seem to be good at tuning my ODE (for example, a Kalman filter).

      I am not aware of any texts that discuss model-theory+probability in any enlightened way, although there seem to be an infinite number of special-case, domain-specific treatments that all seem similar but yet all different…

  23. Tom Holroyd says:

    Robert Rosen, in his book Anticipatory Systems, says organisms must contain models of the environment, and in that work and later works expands this to say that formalisms in math are models of [part of] the world, and this is exactly why formalisms break down, because they are necessarily simple, and the world is complex. F=ma for example, cuts the formalism off at second order. It works for a while …

  24. seanny1986 says:

    So, I know this is several years old now, but I found it interesting and wanted to share some thoughts. If I’ve understood correctly, doesn’t Lemma 1 tell us that a regulator has to be a function of the state? Because if it isn’t, it’s essentially acting randomly (or rather, undirected). Following on from this, if the regulator is a function of the state, and we have some governing dynamics s’ = f(s,a(s)), then the optimal action will be some inverse function of the dynamics system a(s) = g(s’, s), where I’ll assume that s’ is the optimal next step to reaching some goal state G. I’m probably being a bit fast and loose with terminology here.
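
    A minimal sketch of the inverse-dynamics idea above, assuming a scalar linear system s' = A*s + B*a with made-up coefficients (`A`, `B`, `f`, and `g` are illustrative, not from Conant and Ashby's paper). What it shows is that computing the action a = g(s', s) requires the regulator to carry the same coefficients as the dynamics f.

```python
A, B = 0.9, 0.5  # assumed dynamics coefficients (illustrative values only)


def f(s, a):
    """Dynamics: next state reached from state s under action a."""
    return A * s + B * a


def g(s_next, s):
    """Inverse dynamics: the action that takes state s to s_next.

    Note that g needs the same A and B as f, i.e. a model of the system.
    """
    return (s_next - A * s) / B


s, target = 2.0, 1.0
a = g(target, s)    # action chosen by the regulator
reached = f(s, a)   # state the system actually reaches
print(a, reached)
```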

    That said, I’d like to tie this back to reinforcement learning, which is an area I’m a bit more familiar with. In RL, we have “model free” methods that maximize a reward signal using gradient ascent. They’re model free in the sense that they don’t use a transition function to map out a trajectory, but I’d argue that the model is actually implicit in the reward function, because it determines the behavior of the agent. You can mis-specify a reward function, in which case you get an agent that does something wildly unexpected (e.g. suiciding), or you can pick the right set of rewards to get the behavior you want. It’s only when the reward function is congruent with the task and environment that you get something useful.

    The interesting thing (to me at least), is that the process of learning in RL is fundamentally tied in with minimizing cross-entropy in some form or another. The policy gradient theorem uses the score function approximator to step function parameters in the direction that maximizes the reward, but this gradient approximator implies that the policy is fitting to a Boltzmann distribution given by the reward function. This turns out to be the case for all policy gradients through the equivalence between the score function gradient estimate and the change-of-variables gradient estimate. Furthermore, you can link all RL algorithms in this way, including Q value-based methods like Q learning, which turn out to be minimizing KL-divergence rather than cross-entropy (not sure how widely known this is — it’s something I only figured out recently, and led me to this page).

    You can actually do a simple test with RL and fit a distribution directly by minimizing reverse mode KL-divergence instead of cross-entropy (this gives you a slightly modified policy gradient estimator that maximizes policy entropy as it minimizes cross-entropy). Say you have data generated by some underlying distribution k(x) ~ N(mu, sigma), and you fit a policy to a reward function that rewards prediction of the data. Your policy will only fit k(x) when the weighting of the reward function matches that of k(x) (i.e. if k(x) = f(x) + sigma * eps, where eps ~ N(0, I), then for a reward function -m * (pi(a|x) - y)^2 you need m = 0.5/sigma^2). That is to say, in this simple case, the reward function has to be a model of the original distribution for the policy to learn the optimal action and entropy (this isn’t the case with the standard PG, which minimizes cross-entropy, and thus fits the mean and becomes increasingly deterministic over time).
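
    The Gaussian case above can be checked without sampling, under simplifying assumptions that are not in the comment itself: the input x is dropped, the policy is a Gaussian N(mu, s^2) over actions, and the reward is r(a) = -m*(a - y)^2. Maximizing expected reward plus policy entropy is the same as minimizing the reverse KL-divergence to the Boltzmann distribution proportional to exp(r), and for this reward the objective has the closed form -m*((mu - y)^2 + s^2) + log(s) + const. The sketch below (the function name `fit_policy` is hypothetical) runs plain gradient ascent on that closed form.

```python
import math


def fit_policy(m, y=1.0, lr=0.05, steps=2000):
    """Gradient ascent on J(mu, s) = -m*((mu - y)**2 + s**2) + log(s).

    J is expected reward plus entropy (up to a constant) for a Gaussian
    policy N(mu, s^2) and the assumed reward r(a) = -m*(a - y)**2.
    The spread is parametrized as log(s) so that s stays positive.
    Returns the fitted mean and standard deviation.
    """
    mu, log_s = 0.0, 0.0
    for _ in range(steps):
        s2 = math.exp(2.0 * log_s)
        grad_mu = -2.0 * m * (mu - y)    # dJ/dmu
        grad_log_s = 1.0 - 2.0 * m * s2  # dJ/dlog(s)
        mu += lr * grad_mu
        log_s += lr * grad_log_s
    return mu, math.exp(log_s)


sigma = 0.5  # spread of the data-generating distribution (assumed value)

# Reward weighted as in the comment: m = 0.5 / sigma^2.
mu_ok, s_ok = fit_policy(m=0.5 / sigma**2)

# Reward weighted four times too strongly.
mu_bad, s_bad = fit_policy(m=2.0 / sigma**2)

print(mu_ok, s_ok)
print(mu_bad, s_bad)
```

    The fitted spread is s = sqrt(1/(2m)), so it matches sigma only when m = 0.5/sigma^2; a reward weighted four times more strongly still finds the right mean but yields a policy half as wide.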

    Just some thoughts. Happy to share some code and figures if you’re interested.

  25. Elias Hasle says:

    Thank you for investigating this! I think you are being too nice with the authors. After all, they are clearly abusing the high-ranking label of “theorem” for a sloppily explained hunch. It doesn’t help if their hunch is (partially) right and has inspired other researchers.

    It is a fascinating topic, though.

    • John Baez says:

      I’m a nice guy, so I preferred to focus on the interesting idea rather than the problems with making it precise. But yes:

      When I tried to read Conant and Ashby’s paper, I got stuck. They use some very basic mathematical notation in nonstandard ways, and they don’t clearly state the hypotheses and conclusion of their theorem.
