Exponential Discounting

25 October, 2020

Most of us seem to agree that the promise of a dollar in the future is worth less to us than a dollar today, even if the promise is certain to be fulfilled. Economists often assume ‘exponential discounting’, which says that a dollar promised at some time s is worth

\exp(-\alpha(s - t))

dollars in hand at time t. The constant \alpha is connected to the ‘interest rate’.

Why are economists so wedded to exponential discounting? The main reason is probably that it’s mathematically simple. But one argument for it goes roughly like this: if your decisions today are to look rational at any future time, you need to use exponential discounting.

In practice, humans, pigeons and rats do not use exponential discounting. So, economists say they are ‘dynamically inconsistent’:

• Wikipedia, Dynamic inconsistency.

In economics, dynamic inconsistency or time inconsistency is a situation in which a decision-maker’s preferences change over time in such a way that a preference can become inconsistent at another point in time. This can be thought of as there being many different “selves” within decision makers, with each “self” representing the decision-maker at a different point in time; the inconsistency occurs when not all preferences are aligned.

I think ‘inconsistent’ could be a misleading term for what’s going on here. It suggests that something bad is happening. That may not be true.

Anyway, some of the early research on this was done by George Ainslie, and here is what he found:

Ainslie’s research showed that a substantial number of subjects reported that they would prefer $50 immediately rather than $100 in six months, but would NOT prefer $50 in 3 months rather than $100 in nine months, even though this was the same choice seen at 3 months’ greater distance. More significantly, those subjects who said they preferred $50 in 3 months to $100 in 9 months said they would NOT prefer $50 in 12 months to $100 in 18 months—again, the same pair of options at a different distance—showing that the preference-reversal effect did not depend on the excitement of getting an immediate reward. Nor does it depend on human culture; the first preference reversal findings were in rats and pigeons.

Let me give a mathematical argument for exponential discounting. Of course it will rely on some assumptions. I’m not claiming these assumptions are true! Far from it. I’m just claiming that if we don’t use exponential discounting, we are violating one or more of these assumptions… or breaking out of the whole framework of my argument. The widespread prevalence of ‘dynamic inconsistency’ suggests that the argument doesn’t apply to real life.

Here’s the argument:

Suppose the value to us at any time t of a dollar given to us at some other time s is V(t,s).

Let us assume:

1) The ratio

\displaystyle{ \frac{V(t,s_2)}{V(t,s_1)} }

is independent of t. E.g., the ratio of the value of “a dollar on Friday” to the value of “a dollar on Thursday” is the same whether you’re computing it on Monday, or on Tuesday, or on Wednesday.

2) The quantity V(t,s) depends only on the difference s - t.

3) The quantity V(t,s) is a continuous function of s and t.

Then we can show

V(t,s) = k \exp(-\alpha(s-t))

for some constants \alpha and k. Typically we assume k = 1 since the value of a dollar given to us right now is 1. But let’s just see how we get this formula for V(t,s) out of assumptions 1), 2) and 3).

The proof goes like this. By 2) we know

V(t,s) = F(s-t)

for some function F. By 1) it follows that

\displaystyle{ \frac{F(s_2 - t)}{F(s_1 - t)} }

is independent of t, so

\displaystyle{ \frac{F(s_2 - t)}{F(s_1 - t)} =  \frac{F(s_2)}{F(s_1)} }

or in other words

F(s_2 - t) F(s_1) = F(s_2) F(s_1 - t)

Ugh! What next? Well, if we take s_1 = t, we get a simpler equation that’s probably still good enough to get the job done:

F(s_2 - t) F(t) = F(s_2) F(0)

Now let’s make up a variable t' = s_2 - t, so that s_2 = t + t'. Then we can rewrite our equation as

F(t') F(t) = F(t+t') F(0)

or

F(t) F(t') = F(t+t') F(0)

This is beautiful except for the constant F(0). Let’s call that k and factor it out by writing

F(t) = k G(t)

Then we get

G(t) G(t') = G(t+t')

A theorem of Cauchy implies that any continuous solution of this equation is of the form

G(t) = \exp(-\alpha t)

So, we get

F(t) = k \exp(-\alpha t)

or

V(t,s) = k \exp(-\alpha(s-t))

as desired!
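
If you like checking such things numerically, here is a quick sanity check in Python. The values of k and \alpha below are illustrative choices of mine; the point is just that F(t) = k \exp(-\alpha t) really does satisfy F(t) F(t') = F(t+t') F(0):

import numpy as np

# Illustrative constants (my own choices); any k and alpha would do.
k, alpha = 1.0, 0.05

def F(t):
    return k * np.exp(-alpha * t)

for t, t2 in [(1.0, 2.0), (0.3, 7.5), (-4.0, 10.0)]:
    print(F(t) * F(t2), F(t + t2) * F(0))   # the two columns agree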

By the way, we don’t need to assume G is continuous: it’s enough to assume G is measurable. You can get bizarre nonmeasurable solutions of G(t) G(t') = G(t+t') using the axiom of choice, but they are not of practical interest.

So, assumption 3) is not the assumption I’d want to attack in trying to argue against exponential discounting. In fact both assumptions 1) and 2) are open to quite a few objections. Can you name some? Here’s one: in real life the interest rate changes with time. There must be some reason for that.

By the way, nothing in the argument I gave shows that \alpha \ge 0. So there could be people who obey assumptions 1)–3) yet believe the promise of a dollar in the future is worth more than a dollar in hand today.

Also, nothing in my argument for the form of V(t,s) assumes that s \ge t. That is, my assumptions as stated also concern the value of a dollar that was promised in the past. So, you might have fun seeing what changes, or does not change, if you restrict the assumptions to say they only apply when s \ge t. The arrow of time seems to be built into economics, after all.

Also, you may enjoy finding the place in my derivation where I might have divided by zero, and figuring out what to do about that.

If you don’t like exponential discounting—for example, because people use it to argue against spending money now to fight climate change—you might prefer hyperbolic discounting:

• Wikipedia, Hyperbolic discounting.
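
To see the sort of preference reversal Ainslie observed, here is a small Python illustration. The discount rates are made-up choices of mine: with exponential discounting the preference between “$50 at delay d” and “$100 at delay d + 6 months” never flips as d grows, but with hyperbolic discounting it can:

import math

def exponential(amount, delay, alpha=0.1):    # delay measured in months; alpha is illustrative
    return amount * math.exp(-alpha * delay)

def hyperbolic(amount, delay, k=0.2):         # k is illustrative
    return amount / (1 + k * delay)

for d in [0, 3, 12]:
    print(f"delay {d:2d} months:",
          "exponential prefers smaller-sooner?", exponential(50, d) > exponential(100, d + 6),
          "| hyperbolic prefers smaller-sooner?", hyperbolic(50, d) > hyperbolic(100, d + 6))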


Compositional Game Theory and Climate Microeconomics

5 October, 2020

guest post by Jules Hedges

Hi all

This is a post I’ve been putting off for a long time until I was sure I was ready. I am the “lead developer” of a thing called compositional game theory (CGT). It’s an approach to game theory based on category theory, but we are now at the point where you don’t need to know that anymore: it’s an approach to game theory that has certain specific benefits over the traditional approach.

I would like to start a conversation about “using my powers for good”. I am hoping particularly that it is possible to model microeconomic aspects of climate science. This seems to be a very small field and I’m not really hopeful that anyone on Azimuth will have the right background, but it’s worth a shot. The kind of thing I’m imagining (possibly completely wrongly) is to create models that will suggest when a technically-feasible solution is not socially feasible. Social dilemmas and tragedies of the commons are at the heart of the climate crisis, and modelling instances of them is in scope.

I have a software tool (https://github.com/jules-hedges/open-games-hs) that is designed to be an assistant for game-theoretic modelling. This I can’t emphasise enough: A human with expertise in game-theoretic modelling is the most important thing, CGT is merely an assistant. (Right now the tool also probably can’t be used without me being in the loop, but that’s not an inherent thing.)

To give an idea of what sort of things CGT can do, my two current research collaborations are: (1) a social science project modelling examples of institution governance, and (2) a cryptoeconomics project modelling an attack against a protocol using bribes. On a technical level the best fit is for Bayesian games, which are finite-horizon, have common knowledge priors, and private knowledge with agents who do Bayesian updating.

A lot of the (believed) practical benefits of CGT come from the fact that the model is code (in a high level language designed specifically for expressing games) and thus the model can be structured according to existing wisdom for structuring code. Really stress-testing this claim is an ongoing research project. My tool does equilibrium-checking for all games (the technical term is “model checker”), and we’ve had some success doing other things by looping an equilibrium check over a parameter space. It makes no attempt to be an equilibrium solver, that is left for the human.

This is not me trying to push my pet project (I do that elsewhere) but me trying to find a niche where I can do some genuine good, even if small. If you are a microeconomist (or a social scientist who uses applied game theory) and share the goals of Azimuth, I would like to hear from you, even if it’s just for some discussion.


Open Games: the Long Road to Practical Applications

13 April, 2020

In the third talk of the ACT@UCR seminar, Jules Hedges spoke about open games!

He gave his talk on Wednesday April 15th. Afterwards we discussed it at the Category Theory Community Server, here:

https://categorytheory.zulipchat.com/

You can view or join the conversation there if you sign in.

You can see his slides here, or download a video here, or watch the video here:

• April 15, Jules Hedges: Open games: the long road to practical applications.

Abstract. I will talk about open games, and the closely related concepts of lenses/optics and open learners. My goal is to report on the successes and failures of an ongoing effort to try to realise the often-claimed benefits of categories and compositionality in actual practice. I will introduce what little theory is needed along the way. Here are some things I plan to talk about:

— Lenses as an abstraction of the chain rule

— Comb diagrams

— Surprising applications of open games: Bayesian inference, value function iteration

— The state of tool support

— Open games in their natural habitat: microeconomics

— Sociological aspects of working with economics


Correlated Equilibria in Game Theory

24 July, 2017

Erica Klarreich is one of the few science journalists who explain interesting things I don’t already know clearly enough that I can understand them. I recommend her latest article:

• Erica Klarreich, In game theory, no clear path to equilibrium, Quanta, 18 July 2017.

Economists like the concept of ‘Nash equilibrium’, but it’s problematic in some ways. This matters for society at large.

In a Nash equilibrium for a multi-player game, no player can improve their payoff by unilaterally changing their strategy. This doesn’t mean everyone is happy: it’s possible to be trapped in a Nash equilibrium where everyone is miserable, because anyone changing their strategy unilaterally would be even more miserable. (Think ‘global warming’.)

The great thing about Nash equilibria is that their meaning is easy to fathom, and they exist. John Nash won a Nobel prize for a paper proving that they exist. His paper was less than one page long. But he proved the existence of Nash equilibria for arbitrary multi-player games using a nonconstructive method: a fixed point theorem that doesn’t actually tell you how to find the equilibrium!

Given this, it’s not surprising that Nash equilibria can be hard to find. Last September a paper came out making this precise, in a strong way:

• Yakov Babichenko and Aviad Rubinstein, Communication complexity of approximate Nash equilibria.

The authors show there’s no guaranteed method for players to find even an approximate Nash equilibrium unless they tell each other almost everything about their preferences. This makes the Nash equilibrium prohibitively difficult to find when there are lots of players… in general. There are particular games where it’s not difficult, and that makes these games important: for example, if you’re trying to run a government well. (A laughable notion these days, but still one can hope.)

Klarreich’s article in Quanta gives a nice readable account of this work and also a more practical alternative to the concept of Nash equilibrium. It’s called a ‘correlated equilibrium’, and it was invented by the mathematician Robert Aumann in 1974. You can see an attempt to define it here:

• Wikipedia, Correlated equilibrium.

The precise mathematical definition near the start of this article is a pretty good example of how you shouldn’t explain something: it contains a big fat equation containing symbols not mentioned previously, and so on. By thinking about it for a while, I was able to fight my way through it. Someday I should improve it—and someday I should explain the idea here! But for now, I’ll just quote this passage, which roughly explains the idea in words:

The idea is that each player chooses their action according to their observation of the value of the same public signal. A strategy assigns an action to every possible observation a player can make. If no player would want to deviate from the recommended strategy (assuming the others don’t deviate), the distribution is called a correlated equilibrium.

According to Erica Klarreich it’s a useful notion. She even makes it sound revolutionary:

This might at first sound like an arcane construct, but in fact we use correlated equilibria all the time—whenever, for example, we let a coin toss decide whether we’ll go out for Chinese or Italian, or allow a traffic light to dictate which of us will go through an intersection first.

In [some] examples, each player knows exactly what advice the “mediator” is giving to the other player, and the mediator’s advice essentially helps the players coordinate which Nash equilibrium they will play. But when the players don’t know exactly what advice the others are getting—only how the different kinds of advice are correlated with each other—Aumann showed that the set of correlated equilibria can contain more than just combinations of Nash equilibria: it can include forms of play that aren’t Nash equilibria at all, but that sometimes result in a more positive societal outcome than any of the Nash equilibria. For example, in some games in which cooperating would yield a higher total payoff for the players than acting selfishly, the mediator can sometimes beguile players into cooperating by withholding just what advice she’s giving the other players. This finding, Myerson said, was “a bolt from the blue.”

(Roger Myerson is an economics professor at the University of Chicago who won a Nobel prize for his work on game theory.)

And even though a mediator can give many different kinds of advice, the set of correlated equilibria of a game, which is represented by a collection of linear equations and inequalities, is more mathematically tractable than the set of Nash equilibria. “This other way of thinking about it, the mathematics is so much more beautiful,” Myerson said.

While Myerson has called Nash’s vision of game theory “one of the outstanding intellectual advances of the 20th century,” he sees correlated equilibrium as perhaps an even more natural concept than Nash equilibrium. He has opined on numerous occasions that “if there is intelligent life on other planets, in a majority of them they would have discovered correlated equilibrium before Nash equilibrium.”

When it comes to repeated rounds of play, many of the most natural ways that players could choose to adapt their strategies converge, in a particular sense, to correlated equilibria. Take, for example, “regret minimization” approaches, in which before each round, players increase the probability of using a given strategy if they regret not having played it more in the past. Regret minimization is a method “which does bear some resemblance to real life — paying attention to what’s worked well in the past, combined with occasionally experimenting a bit,” Roughgarden said.

(Tim Roughgarden is a theoretical computer scientist at Stanford University.)

For many regret-minimizing approaches, researchers have shown that play will rapidly converge to a correlated equilibrium in the following surprising sense: after maybe 100 rounds have been played, the game history will look essentially the same as if a mediator had been advising the players all along. It’s as if “the [correlating] device was somehow implicitly found, through the interaction,” said Constantinos Daskalakis, a theoretical computer scientist at the Massachusetts Institute of Technology.

As play continues, the players won’t necessarily stay at the same correlated equilibrium — after 1,000 rounds, for instance, they may have drifted to a new equilibrium, so that now their 1,000-game history looks as if it had been guided by a different mediator than before. The process is reminiscent of what happens in real life, Roughgarden said, as societal norms about which equilibrium should be played gradually evolve.

In the kinds of complex games for which Nash equilibrium is hard to reach, correlated equilibrium is “the natural leading contender” for a replacement solution concept, Nisan said.

As Klarreich hints, you can find correlated equilibria using a technique called linear programming. That was proved here, I think:

• Christos H. Papadimitriou and Tim Roughgarden, Computing correlated equilibria in multi-player games, J. ACM 55 (2008), 14:1-14:29.
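
Here is a rough sketch in Python of how such a linear program can look, using scipy.optimize.linprog. The game, a version of ‘Chicken’, and its payoffs are illustrative choices of mine, not taken from the paper; we maximize total expected payoff subject to the correlated equilibrium constraints:

import numpy as np
from scipy.optimize import linprog

# Payoffs for a version of 'Chicken' (an illustrative choice, not from the paper).
# Action 0 = Dare, action 1 = Chicken; U1[a, b] is the row player's payoff.
U1 = np.array([[0, 7],
               [2, 6]])
U2 = U1.T        # the game is symmetric
n = 2            # actions per player; p[a, b] is flattened to index a*n + b

A_ub, b_ub = [], []
# Row player: following recommendation a must be at least as good as deviating to d.
for a in range(n):
    for d in range(n):
        if d == a:
            continue
        row = np.zeros(n * n)
        for b in range(n):
            row[a * n + b] = -(U1[a, b] - U1[d, b])    # flip sign to get a <= 0 constraint
        A_ub.append(row)
        b_ub.append(0.0)
# Column player: following recommendation b must be at least as good as deviating to d.
for b in range(n):
    for d in range(n):
        if d == b:
            continue
        row = np.zeros(n * n)
        for a in range(n):
            row[a * n + b] = -(U2[a, b] - U2[a, d])
        A_ub.append(row)
        b_ub.append(0.0)

c = -(U1 + U2).flatten()     # maximize total expected payoff
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=np.ones((1, n * n)), b_eq=[1.0], bounds=(0, 1))
print(res.x.reshape(n, n))   # a correlated equilibrium distribution over action pairs
print(-res.fun)              # its total expected payoff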

Do you know something about correlated equilibria that I should know? If so, please tell me!


Information Geometry (Part 16)

1 February, 2017

This week I’m giving a talk on biology and information:

• John Baez, Biology as information dynamics, talk for Biological Complexity: Can it be Quantified?, a workshop at the Beyond Center, 2 February 2017.

While preparing this talk, I discovered a cool fact. I doubt it’s new, but I haven’t exactly seen it elsewhere. I came up with it while trying to give a precise and general statement of ‘Fisher’s fundamental theorem of natural selection’. I won’t start by explaining that theorem, since my version looks rather different than Fisher’s, and I came up with mine precisely because I had trouble understanding his. I’ll say a bit more about this at the end.

Here’s my version:

The square of the rate at which a population learns information is the variance of its fitness.

This is a nice advertisement for the virtues of diversity: more variance means faster learning. But it requires some explanation!

The setup

Let’s start by assuming we have n different kinds of self-replicating entities with populations P_1, \dots, P_n. As usual, these could be all sorts of things:

• molecules of different chemicals
• organisms belonging to different species
• genes of different alleles
• restaurants belonging to different chains
• people with different beliefs
• game-players with different strategies
• etc.

I’ll call them replicators of different species.

Let’s suppose each population P_i is a function of time that grows at a rate equal to this population times its ‘fitness’. I explained the resulting equation back in Part 9, but it’s pretty simple:

\displaystyle{ \frac{d}{d t} P_i(t) = f_i(P_1(t), \dots, P_n(t)) \, P_i(t)   }

Here f_i is a completely arbitrary smooth function of all the populations! We call it the fitness of the ith species.

This equation is important, so we want a short way to write it. I’ll often write f_i(P_1(t), \dots, P_n(t)) simply as f_i, and P_i(t) simply as P_i. With these abbreviations, which any red-blooded physicist would take for granted, our equation becomes simply this:

\displaystyle{ \frac{dP_i}{d t}  = f_i \, P_i   }

Next, let p_i(t) be the probability that a randomly chosen organism is of the ith species:

\displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)} }

Starting from our equation describing how the populations evolve, we can figure out how these probabilities evolve. The answer is called the replicator equation:

\displaystyle{ \frac{d}{d t} p_i(t)  = ( f_i - \langle f \rangle ) \, p_i(t) }

Here \langle f \rangle is the average fitness of all the replicators, or mean fitness:

\displaystyle{ \langle f \rangle = \sum_j f_j(P_1(t), \dots, P_n(t)) \, p_j(t)  }

In what follows I’ll abbreviate the replicator equation as follows:

\displaystyle{ \frac{dp_i}{d t}  = ( f_i - \langle f \rangle ) \, p_i }

The result

Okay, now let’s figure out how fast the probability distribution

p(t) = (p_1(t), \dots, p_n(t))

changes with time. For this we need to choose a way to measure the length of the vector

\displaystyle{  \frac{dp}{dt} = (\frac{d}{dt} p_1(t), \dots, \frac{d}{dt} p_n(t)) }

And here information geometry comes to the rescue! We can use the Fisher information metric, which is a Riemannian metric on the space of probability distributions.

I’ve talked about the Fisher information metric in many ways in this series. The most important fact is that as a probability distribution p(t) changes with time, its speed

\displaystyle{  \left\| \frac{dp}{dt} \right\|}

as measured using the Fisher information metric can be seen as the rate at which information is learned. I’ll explain that later. Right now I just want a simple formula for the Fisher information metric. Suppose v and w are two tangent vectors to the point p in the space of probability distributions. Then the Fisher information metric is given as follows:

\displaystyle{ \langle v, w \rangle = \sum_i \frac{1}{p_i} \, v_i w_i }

Using this we can calculate the speed at which p(t) moves when it obeys the replicator equation. Actually the square of the speed is simpler:

\begin{array}{ccl}  \displaystyle{ \left\| \frac{dp}{dt}  \right\|^2 } &=& \displaystyle{ \sum_i \frac{1}{p_i} \left( \frac{dp_i}{dt} \right)^2 } \\ \\  &=& \displaystyle{ \sum_i \frac{1}{p_i} \left( ( f_i - \langle f \rangle ) \, p_i \right)^2 } \\ \\  &=& \displaystyle{ \sum_i  ( f_i - \langle f \rangle )^2 p_i }   \end{array}

The answer has a nice meaning, too! It’s just the variance of the fitness: that is, the square of its standard deviation.

So, if you’re willing to buy my claim that the speed \|dp/dt\| is the rate at which our population learns new information, then we’ve seen that the square of the rate at which a population learns information is the variance of its fitness!
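
If you’d like to see this numerically, here is a small Python check. The fitness functions f_i(p) = (Ap)_i with a random matrix A are an illustrative choice of mine; along the resulting replicator trajectory, the Fisher-metric speed squared matches the variance of the fitness at every step:

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))        # illustrative fitness landscape: f_i(p) = (A p)_i

p = np.full(n, 1.0 / n)
dt = 1e-3
for step in range(5001):
    f = A @ p
    mean_f = f @ p
    dp = (f - mean_f) * p          # right-hand side of the replicator equation
    if step % 1000 == 0:
        speed_sq = np.sum(dp**2 / p)               # ||dp/dt||^2 in the Fisher metric
        variance = np.sum((f - mean_f)**2 * p)     # variance of the fitness
        print(step, speed_sq, variance)            # the last two columns agree
    p = np.clip(p + dt * dp, 1e-12, None)
    p /= p.sum()                   # keep p a probability distribution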

Fisher’s fundamental theorem

Now, how is this related to Fisher’s fundamental theorem of natural selection? First of all, what is Fisher’s fundamental theorem? Here’s what Wikipedia says about it:

It uses some mathematical notation but is not a theorem in the mathematical sense.

It states:

“The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”

Or in more modern terminology:

“The rate of increase in the mean fitness of any organism at any time ascribable to natural selection acting through changes in gene frequencies is exactly equal to its genetic variance in fitness at that time”.

Largely as a result of Fisher’s feud with the American geneticist Sewall Wright about adaptive landscapes, the theorem was widely misunderstood to mean that the average fitness of a population would always increase, even though models showed this not to be the case. In 1972, George R. Price showed that Fisher’s theorem was indeed correct (and that Fisher’s proof was also correct, given a typo or two), but did not find it to be of great significance. The sophistication that Price pointed out, and that had made understanding difficult, is that the theorem gives a formula for part of the change in gene frequency, and not for all of it. This is a part that can be said to be due to natural selection.

Price’s paper is here:

• George R. Price, Fisher’s ‘fundamental theorem’ made clear, Annals of Human Genetics 36 (1972), 129–140.

I don’t find it very clear, perhaps because I didn’t spend enough time on it. But I think I get the idea.

My result is a theorem in the mathematical sense, though quite an easy one. I assume a population distribution evolves according to the replicator equation and derive an equation whose right-hand side matches that of Fisher’s original equation: the variance of the fitness.

But my left-hand side is different: it’s the square of the speed of the corresponding probability distribution, where speed is measured using the ‘Fisher information metric’. This metric was discovered by the same guy, Ronald Fisher, but I don’t think he used it in his work on the fundamental theorem!

Something a bit similar to my statement appears as Theorem 2 of this paper:

• Marc Harper, Information geometry and evolutionary game theory.

and for that theorem he cites:

• Josef Hofbauer and Karl Sigmund, Evolutionary Games and Population Dynamics, Cambridge University Press, Cambridge, 1998.

However, his Theorem 2 really concerns the rate of increase of fitness, like Fisher’s fundamental theorem. Moreover, he assumes that the probability distribution p(t) flows along the gradient of a function, and I’m not assuming that. Indeed, my version applies to situations where the probability distribution moves round and round in periodic orbits!

Relative information and the Fisher information metric

The key to generalizing Fisher’s fundamental theorem is thus to focus on the speed at which p(t) moves, rather than the increase in fitness. Why do I call this speed the ‘rate at which the population learns information’? It’s because we’re measuring this speed using the Fisher information metric, which is closely connected to relative information, also known as relative entropy or the Kullback–Leibler divergence.

I explained this back in Part 7, but that explanation seems hopelessly technical to me now, so here’s a faster one, which I created while preparing my talk.

The information of a probability distribution q relative to a probability distribution p is

\displaystyle{     I(q,p) = \sum_{i =1}^n q_i \log\left(\frac{q_i}{p_i}\right) }

It says how much information you learn if you start with a hypothesis p saying that the probability of the ith situation was p_i, and then update this to a new hypothesis q.

Now suppose you have a hypothesis that’s changing with time in a smooth way, given by a time-dependent probability p(t). Then a calculation shows that

\displaystyle{ \left.\frac{d}{dt} I(p(t),p(t_0)) \right|_{t = t_0} = 0 }

for all times t_0. This seems paradoxical at first. I like to jokingly put it this way:

To first order, you’re never learning anything.

However, as long as the velocity \frac{d}{dt}p(t_0) is nonzero, we have

\displaystyle{ \left.\frac{d^2}{dt^2} I(p(t),p(t_0)) \right|_{t = t_0} > 0 }

so we can say

To second order, you’re always learning something… unless your opinions are fixed.

This lets us define a ‘rate of learning’—that is, a ‘speed’ at which the probability distribution p(t) moves. And this is precisely the speed given by the Fisher information metric!

In other words:

\displaystyle{ \left\|\frac{dp}{dt}(t_0)\right\|^2 =  \left.\frac{d^2}{dt^2} I(p(t),p(t_0)) \right|_{t = t_0} }

where the length is given by Fisher information metric. Indeed, this formula can be used to define the Fisher information metric. From this definition we can easily work out the concrete formula I gave earlier.

In summary: as a probability distribution moves around, the relative information between the new probability distribution and the original one grows approximately as the square of time, not linearly. So, to talk about a ‘rate at which information is learned’, we need to use the above formula, involving a second time derivative. This rate is just the speed at which the probability distribution moves, measured using the Fisher information metric. And when we have a probability distribution describing how many replicators are of different species, and it’s evolving according to the replicator equation, this speed is also just the variance of the fitness!
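
Here is a tiny finite-difference check of that last formula in Python, with a probability distribution and tangent vector I made up for illustration:

import numpy as np

p0 = np.array([0.2, 0.3, 0.5])                  # an illustrative distribution
v = np.array([0.05, -0.02, -0.03])              # a tangent vector: components sum to zero

def relative_info(q, p):
    return np.sum(q * np.log(q / p))

eps = 1e-4
second_deriv = (relative_info(p0 + eps*v, p0) - 2*relative_info(p0, p0)
                + relative_info(p0 - eps*v, p0)) / eps**2
print(second_deriv, np.sum(v**2 / p0))          # these agree to high accuracy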


Biology as Information Dynamics (Part 1)

31 January, 2017

This is my talk for the workshop Biological Complexity: Can It Be Quantified?

• John Baez, Biology as information dynamics, 2 February 2017.

Abstract. If biology is the study of self-replicating entities, and we want to understand the role of information, it makes sense to see how information theory is connected to the ‘replicator equation’—a simple model of population dynamics for self-replicating entities. The relevant concept of information turns out to be the information of one probability distribution relative to another, also known as the Kullback–Leibler divergence. Using this we can get a new outlook on free energy, see evolution as a learning process, and give a clean general formulation of Fisher’s fundamental theorem of natural selection.

For more, read:

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.

• Barry Sinervo and Curt M. Lively, The rock-paper-scissors game and the evolution of alternative male strategies, Nature 380 (1996), 240–243.

• John Baez, Diversity, entropy and thermodynamics.

• John Baez, Information geometry.

The last reference contains proofs of the equations shown in red in my slides.
In particular, Part 16 contains a proof of my updated version of Fisher’s fundamental theorem.


The Game of Googol

20 July, 2015

Here’s a puzzle from a recent issue of Quanta, an online science magazine:

Puzzle 1: I write down two different numbers that are completely unknown to you, and hold one in my left hand and one in my right. You have absolutely no idea how I generated these two numbers. Which is larger?

You can point to one of my hands, and I will show you the number in it. Then you can decide to either select the number you have seen or switch to the number you have not seen, held in the other hand. Is there a strategy that will give you a greater than 50% chance of choosing the larger number, no matter which two numbers I write down?

At first it seems the answer is no. Whatever number you see, the other number could be larger or smaller. There’s no way to tell. So obviously you can’t get a better than 50% chance of picking the hand with the largest number—even if you’ve seen one of those numbers!

But “obviously” is not a proof. Sometimes “obvious” things are wrong!

It turns out that, amazingly, the answer to the puzzle is yes! You can find a strategy to do better than 50%. But the strategy uses randomness. So, this puzzle is a great illustration of the power of randomness.

If you want to solve it yourself, stop now or read Quanta magazine for some clues—they offered a small prize for the best answer:

• Pradeep Mutalik, Can information rise from randomness?, Quanta, 7 July 2015.

Greg Egan gave a nice solution in the comments to this magazine article, and I’ll reprint it below along with two followup puzzles. So don’t look down there unless you want a spoiler.

I should add: the most common mistake among educated readers seems to be assuming that the first player, the one who chooses the two numbers, chooses them according to some probability distribution. Don’t assume that. They are simply arbitrary numbers.

The history of this puzzle

I’d seen this puzzle before—do you know who invented it? On G+, Hans Havermann wrote:

I believe the origin of this puzzle goes back to (at least) John Fox and Gerald Marnie’s 1958 betting game ‘Googol’. Martin Gardner mentioned it in his February 1960 column in Scientific American. Wikipedia mentions it under the heading ‘Secretary problem’. Gardner suggested that a variant of the game was proposed by Arthur Cayley in 1875.

Actually the game of Googol is a generalization of the puzzle that we’ve been discussing. Martin Gardner explained it thus:

Ask someone to take as many slips of paper as he pleases, and on each slip write a different positive number. The numbers may range from small fractions of 1 to a number the size of a googol (1 followed by a hundred 0s) or even larger. These slips are turned face down and shuffled over the top of a table. One at a time you turn the slips face up. The aim is to stop turning when you come to the number that you guess to be the largest of the series. You cannot go back and pick a previously turned slip. If you turn over all the slips, then of course you must pick the last one turned.

So, the puzzle I just showed you is the special case when there are just 2 slips of paper. I seem to recall that Gardner incorrectly dismissed this case as trivial!

There’s been a lot of work on Googol. Julien Berestycki writes:

I heard about this puzzle a few years ago from Sasha Gnedin. He has a very nice paper about this:

• Alexander V. Gnedin, A solution to the game of Googol, Annals of Probability (1994), 1588–1595.

One of the many beautiful ideas in this paper is that it asks what is the best strategy for the guy who writes the numbers! It also cites a paper by Gnedin and Berezowskyi (of oligarchic fame). 

Egan’s solution

Okay, here is Greg Egan’s solution, paraphrased a bit:

Pick some function f : \mathbb{R} \to \mathbb{R} such that:

\displaystyle{ \lim_{x \to -\infty} f(x) = 0 }

\displaystyle{ \lim_{x \to +\infty} f(x) = 1 }

f is strictly increasing: if x > y then f(x) > f(y)

There are lots of functions like this, for example

\displaystyle{f(x) = \frac{e^x}{e^x + 1} }

Next, pick one of the first player’s hands at random. If the number you are shown is a, compute f(a). Then generate a uniformly distributed random number z between 0 and 1. If z is less than or equal to f(a), guess that a is the larger number; but if z is greater than f(a), guess that the larger number is in the other hand.

The probability of guessing correctly can be calculated as the probability of seeing the larger number initially and then, correctly, sticking with it, plus the probability of seeing the smaller number initially and then, correctly, choosing the other hand.

Say the larger number is x and the smaller one is y. Then the probability of guessing correctly is

\frac{1}{2} f(x) + \frac{1}{2} (1 - f(y)) =  \frac{1}{2} + \frac{1}{2} (f(x) - f(y))

This is strictly greater than \frac{1}{2} since x > y so f(x) - f(y) > 0.

So, you have a more than 50% chance of winning! But as you play the game, there’s no way to tell how much more than 50%. If the numbers in the first player’s hands are both very large, or both very small, your chance will be just slightly more than 50%.
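
If you’d like some evidence, here is a quick Monte Carlo check of this strategy in Python, using f(a) = 1/(1 + e^{-a}) and a couple of made-up pairs of numbers:

import math, random

def f(a):
    return 1 / (1 + math.exp(-a))    # any strictly increasing function from 0 to 1 would work

def win_rate(x, y, trials=200_000):
    wins = 0
    for _ in range(trials):
        shown, hidden = (x, y) if random.random() < 0.5 else (y, x)
        chosen = shown if random.random() <= f(shown) else hidden
        wins += chosen == max(x, y)
    return wins / trials

print(win_rate(2.0, -1.0))    # clearly above 0.5
print(win_rate(4.0, 3.0))     # still above 0.5, but by less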

Followup puzzles

Here are two more puzzles:

Puzzle 2: Prove that no deterministic strategy can guarantee you have a more than 50% chance of choosing the larger number.

Puzzle 3: There are perfectly specific but ‘algorithmically random’ sequences of bits, which can’t be predicted well by any program. If we use these to generate a uniform algorithmically random number between 0 and 1, and use the strategy Egan describes, will our chance of choosing the larger number be more than 50%, or not?

But watch out—here come Egan’s solutions to those!

Solutions

Egan writes:

Here are my answers to your two puzzles on G+.

Puzzle 2: Prove that no deterministic strategy can guarantee you have a more than 50% chance of choosing the larger number.

Answer: If we adopt a deterministic strategy, that means there is a function S: \mathbb{R} \to \{0,1\} that tells us whether or not we stick with the number x when we see it. If S(x)=1 we stick with it, if S(x)=0 we swap it for the other number.

If the two numbers are x and y, with x > y, then the probability of success will be:

P = 0.5 + 0.5(S(x)-S(y))

This is exactly the same as the formula we obtained when we stuck with x with probability f(x), but we have specialised to functions S valued in \{0,1\}.

We can only guarantee a more than 50% chance of choosing the larger number if S is monotonically increasing everywhere, i.e. S(x) > S(y) whenever x > y. But this is impossible for a function valued in \{0,1\}. To prove this, define x_0 to be any number in [1,2] such that S(x_0)=0; such an x_0 must exist, otherwise S would be constant on [1,2] and hence not monotonically increasing. Similarly define x_1 to be any number in [-2,-1] such that S(x_1) = 1. We then have x_0 > x_1 but S(x_0) < S(x_1).

Puzzle 3: There are perfectly specific but ‘algorithmically random’ sequences of bits, which can’t be predicted well by any program. If we use these to generate a uniform algorithmically random number between 0 and 1, and use the strategy Egan describes, will our chance of choosing the larger number be more than 50%, or not?

Answer: As Philip Gibbs noted, a deterministic pseudo-random number generator is still deterministic. Using a specific sequence of algorithmically random bits

(b_1, b_2, \dots )

to construct a number z between 0 and 1 means z takes on the specific value:

z_0 = \sum_i b_i 2^{-i}

So rather than sticking with x with probability f(x) for our monotonically increasing function f, we end up always sticking with x if z_0 \le f(x), and always swapping if z_0 > f(x). This is just using a function S:\mathbb{R} \to \{0,1\} as in Puzzle 2, with:

S(x) = 0 if x < f^{-1}(z_0)

S(x) = 1 if x \ge f^{-1}(z_0)

So all the same consequences as in Puzzle 2 apply, and we cannot guarantee a more than 50% chance of choosing the larger number.

Puzzle 3 emphasizes the huge gulf between ‘true randomness’, where we only have a probability distribution of numbers z, and the situation where we have a specific number z_0, generated by any means whatsoever.

We could generate z_0 using a pseudorandom number generator, radioactive decay of atoms, an oracle whose randomness is certified by all the Greek gods, or whatever. No matter how randomly z_0 is generated, once we have it, we know there exist choices for the first player that will guarantee our defeat!

This may seem weird at first, but if you think about simple games of luck you’ll see it’s completely ordinary. We can have a more than 50% chance of winning such a game even if for any particular play we make the other player has a move that ensures our defeat. That’s just how randomness works.


Relative Entropy in Evolutionary Dynamics

22 January, 2014

guest post by Marc Harper

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

The replicator equation

First a little refresher. For a population of n replicating types, such as individuals with different eye colors or a gene with n distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let P_i be the population of individuals of the ith type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

P = (P_1, \dots, P_n)

The Lotka–Volterra equation is a very general rule for how these numbers can change with time:

\displaystyle{ \frac{d P_i}{d t} = f_i(P) P_i }

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, f_i(P), is not necessarily constant: it can be any real-valued function of P. This function is called the fitness of the ith type. Taken all together, these functions f_i are called the fitness landscape.

Let p_i be the fraction of individuals who are of the ith type:

\displaystyle{ p_i = \frac{P_i}{\sum_{j =1}^n P_j } }

These numbers p_i are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: p_i is the probability that a randomly chosen individual is of the ith type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

p = (p_1, \dots, p_n)

which we call the population distribution. It turns out that the Lotka–Volterra equation implies the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(P) - \langle f(P) \rangle \right) \, p_i }

where

\displaystyle{ \langle f(P) \rangle = \sum_{i =1}^n  f_i(P)  p_i  }

is the mean fitness of all the individuals. You can see the proof in Part 9 of the information geometry series.

By the way: if each fitness f_i(P) only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }

From now on, when talking about this equation, that’s what I’ll do.

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type changes at a rate proportional to fitness of that type minus the mean fitness.

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

Relative entropy as a Lyapunov function

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions p and q, the information of q relative to p is

I(q,p) = \displaystyle{ \sum_i q_i \ln \left(\frac{q_i}{p_i }\right)}

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution p, and then someone tells you “no, the best hypothesis is q!”

It may seem weird to treat a population distribution as a hypothesis, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions q and p. Suppose q is fixed, while p evolves in time according to the replicator equation. Then

\displaystyle{  \frac{d}{d t} I(q,p)  =  \sum_i f_i(P) (p_i - q_i) }

For the proof, see Part 11 of the information geometry series.

So, the information of q relative to p will decrease as p evolves according to the replicator equation if

\displaystyle{  \sum_i f_i(P) (p_i - q_i) } \le 0

If q makes this true for all p, we say q is an evolutionarily stable state. For some reasons why, see Part 13.

What matters now is that when q is an evolutionarily stable state, I(q,p) says how much information the population has ‘left to learn’—and we’re seeing that this always decreases. Moreover, it turns out that we always have

I(q,p) \ge 0

and I(q,p) = 0 precisely when p = q.

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

\displaystyle{  \frac{d}{d t} x(t) = v(x(t)) }

where x(t) \in \mathbb{R}^n and v is some smooth vector field on \mathbb{R}^n. Then a smooth function

V : \mathbb{R}^n \to \mathbb{R}

is a Lyapunov function if

V(x) \ge 0 for all x

V(x) = 0 iff x is some particular point x_0

and

\displaystyle{ \frac{d}{d t} V(x(t)) \le 0 } for every solution of our differential equation.

In this situation, the point x_0 is a stable equilibrium for our differential equation: this is Lyapunov’s theorem.
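
Here is a small numerical illustration of this in Python. The fitness landscape f_i(p) = -p_i is an illustrative choice of mine; for it, the uniform distribution q is an evolutionarily stable state, and the relative information I(q,p) decreases along replicator trajectories, just as a Lyapunov function should:

import numpy as np

n = 3
q = np.full(n, 1.0 / n)                    # the evolutionarily stable state in this example
p = np.array([0.7, 0.2, 0.1])              # an arbitrary starting distribution
dt = 0.01

def relative_info(q, p):
    return np.sum(q * np.log(q / p))

for step in range(2001):
    f = -p                                 # illustrative fitness landscape f_i(p) = -p_i
    dp = (f - f @ p) * p                   # replicator equation
    if step % 500 == 0:
        print(step, relative_info(q, p))   # decreases toward zero
    p = p + dt * dp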

The replicator equation as a gradient flow equation

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball.

The idea of gradient flow is similar, but different: sometimes things like to roll downhill as efficiently as possible: they move in exactly the best direction to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

Let’s fill in some details. For starters, suppose we have some function

V : \mathbb{R}^n \to \mathbb{R}

Think of V as ‘height’. Then the gradient flow equation says how a point x(t) \in \mathbb{R}^n will move if it’s always trying its very best to go downhill:

\displaystyle{ \frac{d}{d t} x(t) = - \nabla V(x(t)) }

Here \nabla is the usual gradient in Euclidean space:

\displaystyle{ \nabla V = \left(\partial_1 V, \dots, \partial_n V \right)  }

where \partial_i is short for the partial derivative with respect to the ith coordinate.

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically not one where \nabla is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry!

The space of all population distributions is a simplex:

\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i = 1}^n p_i = 1 \} .

For example, it’s an equilateral triangle when n = 3. The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere, and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

\delta_{i j} = \left\{\begin{array}{ccl} 1 & \mathrm{ if } & i = j \\                                       0 &\mathrm{ if } & i \ne j \end{array} \right.

This simply says that two standard basis vectors like (0,1,0,0) and (0,0,1,0) have dot product zero if the 1’s are in different places, and one if they’re in the same place. The Fisher information metric is a bit more complicated:

\displaystyle{ g_{i j} = \frac{\delta_{i j}}{p_i} }

As before, g_{i j} is a formula for the dot product of the ith and jth standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution p and consider the information of another one, say q, relative to this. We get I(q,p). If q = p this is zero:

\displaystyle{ \left. I(q,p)\right|_{q = p} = 0 }

and this point is a local minimum for the relative information. So, the first derivative of I(q,p) as we change q must be zero:

\displaystyle{ \left. \frac{\partial}{\partial q_i} I(q,p) \right|_{q = p} = 0 }

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

\displaystyle{  g_{i j} = \left. \frac{\partial^2}{\partial q_i \partial q_j} I(q,p) \right|_{q = p} }

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distributions that’s based on the concept of relative information.

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function.

So, if we change our way of measuring angles and distances, we get a new concept of gradient! The ith component of this new gradient vector field turns out to be

(\nabla_g V)^i = g^{i j} \partial_j V

where g^{i j} is the inverse of the matrix g_{i j}, and we sum over the repeated index j. As a sanity check, make sure you see why this is the usual Euclidean gradient when g_{i j} = \delta_{i j}.
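
Here is a tiny concrete check in Python, with a point p and partial derivatives of V that I made up for illustration: for the Fisher information metric g_{ij} = \delta_{ij}/p_i, the components of the metric gradient come out as p_i \partial_i V:

import numpy as np

p = np.array([0.2, 0.3, 0.5])              # an illustrative point
dV = np.array([1.0, -2.0, 0.5])            # made-up partial derivatives of V at p

g = np.diag(1.0 / p)                       # Fisher information metric at p
print(np.linalg.inv(g) @ dV)               # the metric gradient g^{ij} (dV)_j ...
print(p * dV)                              # ... is just p_i times the ordinary gradient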

Now suppose the fitness landscape is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness landscape is the Euclidean gradient of some function V:

\displaystyle{ f_i(P) = \frac{\partial V}{\partial P_i} }

Then the Lotka–Volterra equation becomes this:

\displaystyle{ \frac{d P_i}{d t} = \frac{\partial V}{\partial P_i} \, P_i }

This doesn’t look like the gradient flow equation, thanks to that annoying P_i on the right-hand side! It certainly ain’t the gradient flow coming from the function V and the usual Euclidean gradient. However, it is gradient flow coming from V and some other metric on the space

\{ P \in \mathbb{R}^n : \; P_i \ge 0 \}

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

• Marc Harper, Information geometry and evolutionary game theory.

Now let’s turn to the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p)  - \langle f(p) \rangle \right) \, p_i }

Again, if the fitness landscape is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, A New Mathematical Framework for the Study of Linkage and Selection, Memoirs of the AMS 17, 1979.

New directions

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

• Akio Fujiwara and Shun-ichi Amari, Gradient systems in view of information geometry, Physica D: Nonlinear Phenomena 80 (1995), 317–327.

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter q. There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalize to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set q=1 we get back the usual story. See

• Marc Harper, Escort evolutionary game theory.

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

• A. Hernando et al, The workings of the Maximum Entropy Principle in collective human behavior.

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many describe the relative entropy as a distance-like function, and this is often a valid interpretation contextually. On the other hand, the q=0 relative entropy is one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

\displaystyle{ \frac{d p_i}{d t} = f_i(p) - \frac{1}{n} \sum_{j = 1}^n f_j(p) }

This equation is called the projection dynamic.

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar distance minimization property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms x_i f_i(x) in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied models of evolutionary dynamics. For instance if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness landscape as follows:

\displaystyle{ f_i \mapsto \frac{x_i e^{\beta f_i}}{\sum_j{x_j e^{\beta f_j}}} }

where \beta is known as an inverse temperature in statistical thermodynamics and as an intensity of selection in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (There are also other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier in his exploration of the foundations of game dynamics.)

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy is decreasing to zero when the trajectories get near the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states, and this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’—that is, different ways of describing time! We can make up the symbol \mathbb{T} for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

\mathbb{T} = \mathbb{R}

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

\mathbb{T} = \mathbb{Z}

More generally, we can make the steps have duration h, where h is any positive real number:

\mathbb{T} = h\mathbb{Z}

There is a nice way to simultaneously describe the cases \mathbb{T} = \mathbb{R} and \mathbb{T} = h\mathbb{Z} using the time-scale calculus and time-scale derivatives. For the time-scale \mathbb{T} = \mathbb{R} the time-scale derivative is just the ordinary derivative. For the time-scale \mathbb{T} = h\mathbb{Z}, the time-scale derivative is given by the difference quotient from first year calculus:

\displaystyle{ f^{\Delta}(z) = \frac{f(z+h) - f(z)}{h} }

and using this as a substitute for the derivative gives difference equations like a discrete-time version of the replicator equation. There are many other choices of time-scale, such as the quantum time-scale given by \mathbb{T} = q^{\mathbb{Z}}, in which case the time-scale derivative is called the q-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be stated simply by saying they have a negative \mathbb{T} = h\mathbb{Z} time-scale derivative. The continuous case we started with corresponds to \mathbb{T} = \mathbb{R}.

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On top we have the replicator equation, which goes along with the Fisher information metric; on the bottom we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on top, the q = 0 entropy on bottom) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric g (obeying some mild conditions) and an incentive for each population, as well as a time scale (\mathbb{R} or h \mathbb{Z}) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy. Then, if there is an evolutionarily stable state in the interior of the simplex, the time-scale derivative of the sum of the relative entropies for the populations will be negative as the trajectories converge to the stable state!

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:


See this paper for details:

• Marc Harper and Dashiell E. A. Fryer, Stability of evolutionary dynamics on time scales.

Next time we’ll see how to make the main idea work in finite populations, without derivatives or deterministic trajectories!


Game Theory (Part 20)

11 March, 2013

Last time we tackled von Neumann’s minimax theorem:

Theorem. For every zero-sum 2-player normal form game,

\displaystyle{\min_{q'} \max_{p'} \; p' \cdot A q' = \max_{p'} \min_{q'} \; p' \cdot A q'}

where p' ranges over player A’s mixed strategies and q' ranges over player B’s mixed strategies.

We reduced the proof to two geometrical lemmas. Now let’s prove those… and finish up the course!

But first, let me chat a bit about this theorem. Von Neumann first proved it in 1928. He later wrote:

As far as I can see, there could be no theory of games … without that theorem … I thought there was nothing worth publishing until the Minimax Theorem was proved.

Von Neumann gave several proofs of this result:

• Tinne Hoff Kjeldsen, John von Neumann’s conception of the minimax theorem: a journey through different mathematical contexts, Arch. Hist. Exact Sci. 56 (2001) 39–68.

In 1937 he gave a proof which became quite famous, based on an important result in topology: Brouwer’s fixed point theorem. This says that if you have a ball

B = \{ x \in \mathbb{R}^n : \|x\| \le 1 \}

and a continuous function

f: B \to B

then this function has a fixed point, meaning a point x \in B with

f(x) = x

You’ll often see Brouwer’s fixed point theorem in a first course on algebraic topology, though John Milnor came up with a proof using just multivariable calculus and a bit more.
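In one dimension the ball is just the interval [-1,1], and Brouwer’s theorem reduces to the intermediate value theorem applied to f(x) - x. Here is a tiny numerical illustration of that special case (my own toy example, not part of the course), which finds a fixed point by bisection:

import math

def fixed_point(f, a=-1.0, b=1.0, tol=1e-10):
    # If f maps [a, b] into itself, then g(x) = f(x) - x satisfies
    # g(a) >= 0 and g(b) <= 0, so g has a root by the intermediate value theorem.
    g = lambda x: f(x) - x
    while b - a > tol:
        m = 0.5 * (a + b)
        if g(m) >= 0:
            a = m
        else:
            b = m
    return 0.5 * (a + b)

x = fixed_point(lambda x: math.cos(2 * x))   # cos(2x) maps [-1, 1] into itself
print(x, math.cos(2 * x))                    # the two numbers agree: a fixed point

In higher dimensions no such simple argument works, which is why Brouwer’s theorem is a genuinely deeper result.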

After von Neumann proved his minimax theorem using Brouwer’s fixed point theorem, the mathematician Shizuo Kakutani proved another fixed-point theorem in 1941, which let him get the minimax theorem in a different way. This is now called the Kakutani fixed-point theorem.

In 1949, John Nash generalized von Neumann’s result to nonzero-sum games with any number of players: they all have Nash equilibria if we let ourselves use mixed strategies! His proof is just one page long, and it won him the Nobel prize!

Nash’s proof used the Kakutani fixed-point theorem. There is also a proof of Nash’s theorem using Brouwer’s fixed-point theorem; see here for the 2-player case and here for the n-player case.

Apparently when Nash explained his result to von Neumann, the latter said:

That’s trivial, you know. That’s just a fixed point theorem.

Maybe von Neumann was a bit jealous?

I don’t know a proof of Nash’s theorem that doesn’t use a fixed-point theorem. But von Neumann’s original minimax theorem seems to be easier. The proof I showed you last time comes from Andrew Colman’s book Game Theory and its Applications in the Social and Biological Sciences. In it, he writes:

In common with many people, I first encountered game theory in non-mathematical books, and I soon became intrigued by the minimax theorem but frustrated by the way the books tiptoed around it without proving it. It seems reasonable to suppose that I am not the only person who has encountered this problem, but I have not found any source to which mathematically unsophisticated readers can turn for a proper understanding of the theorem, so I have attempted in the pages that follow to provide a simple, self-contained proof with each step spelt out as clearly as possible both in symbols and words.

There are other proofs that avoid fixed-point theorems: for example, there’s one in Ken Binmore’s book Playing for Real. But this one uses transfinite induction, which seems a bit scary and distracting! So far, Colman’s proof seems simplest, but I’ll keep trying to do better.

The lemmas

Now let’s prove the two lemmas from last time. A lemma is an unglamorous result which we use to prove a theorem we’re interested in. The mathematician Paul Taylor has written:

Lemmas do the work in mathematics: theorems, like management, just take the credit.

Let’s remember what we were doing. We had a zero-sum 2-player normal-form game with an m \times n payoff matrix A. The entry A_{ij} of this matrix says A’s payoff when player A makes choice i and player B makes choice j. We defined this set:

C = \{  A q' : \; q' \textrm{ is a mixed strategy for B} \} \subseteq \mathbb{R}^m

For example, if

\displaystyle{ A = \left( \begin{array}{rrr} 2 & 10 &  4 \\-2 & 1 & 6 \end{array} \right) }

then C looks like this:

We assumed that

\displaystyle{ \min_{q'} \max_{p'} \; p' \cdot A q' > 0}

This means that for any mixed strategy q' for player B, there exists p' with

\displaystyle{  p' \cdot A q' > 0}

and this implies that at least one of the numbers (Aq')_i must be positive. So, if we define a set N by

\displaystyle{ N = \{(x_1, \dots, x_m) : x_i \le 0 \textrm{ for all } i\} \subseteq \mathbb{R}^m }

then Aq' can’t be in this set:

\displaystyle{ Aq' \notin N }

In other words, the set C \cap N is empty.

Here’s what C and N look like in our example:

Next, we choose a point in N and a point in C:

• let r be a point in N that’s as close as possible to C,

and

• let s be a point in C that’s as close as possible to r.

These points r and s need to be different, since C \cap N is empty. Here’s what these points and the vector s - r look like in our example:

To finish the job, we need to prove two lemmas:

Lemma 1. r \cdot (s-r) = 0, s_i - r_i \ge 0 for all i, and s_i - r_i > 0 for at least one i.

Proof. Suppose r' is any point in N whose coordinates are all the same as those of r, except perhaps one, namely the ith coordinate for one particular choice of i. By the way we’ve defined s and r, this point r' can’t be closer to s than r is:

\| r' - s \| \ge  \| r - s \|

This means that

\displaystyle{ \sum_{j = 1}^m  (r_j' - s_j)^2 \ge  \sum_{j = 1}^m  (r_j - s_j)^2  }

But since r_j' = r_j except when j = i, this implies

(r_i' - s_i)^2 \ge  (r_i - s_i)^2

Now, if s_i \le 0 we can take r'_i = s_i. In this case we get

0 \ge  (r_i - s_i)^2

so r_i = s_i. On the other hand, if s_i > 0 we can take r'_i = 0 and get

s_i^2 \ge  (r_i - s_i)^2

which simplifies to

2 r_i s_i \ge r_i^2

But r_i \le 0 and s_i > 0, so this can only be true if r_i = 0.

In short, we know that either

r_i = s_i

or

s_i > 0 and r_i = 0.

So, either way we get

(s_i - r_i) r_i = 0

Since i was arbitrary, this implies

\displaystyle{ (s - r) \cdot r = \sum_{i = 1}^m (s_i - r_i) r_i = 0 }

which is the first thing we wanted to show. Also, either way we get

s_i - r_i \ge 0

which is the second thing we wanted. Finally, s_i - r_i \ge 0 but we know s \ne r, so

s_i - r_i > 0

for at least one choice of i. And this is the third thing we wanted!   █

Lemma 2. If Aq' is any point in C, then

(s-r) \cdot Aq' \ge 0

Proof. Let’s write

Aq' = a

for short. For any number t between 0 and 1, the point

ta + (1-t)s

is on the line segment connecting the points a and s. Since both these points are in C, so is the point ta + (1-t)s, because the set C is convex. So, by the way we’ve defined s and r, this point can’t be closer to r than s is:

\| r - (ta + (1-t)s) \| \ge  \| r - s \|

This means that

\displaystyle{  (r - ta - (1-t)s) \cdot  (r - ta - (1-t)s) \ge (r - s) \cdot (r - s) }

With some algebra, this gives

\displaystyle{ 2 (a - s)\cdot (s - r) \ge -t (a - s) \cdot (a - s)  }

Since we can make t as small as we want, this implies that

\displaystyle{  (a - s)\cdot  (s - r) \ge 0  }

or

\displaystyle{ a \cdot (s - r) \ge  s \cdot (s - r)}

or

\displaystyle{ a \cdot (s - r) \ge  (s - r) \cdot (s - r) + r \cdot (s - r)}

By Lemma 1 we have r \cdot (s - r) = 0, and the dot product of any vector with itself is nonnegative, so it follows that

\displaystyle{ a \cdot (s - r) \ge 0}

And this is what we wanted to show!   █

Conclusion

Proving lemmas is hard work, and unglamorous. But if you remember the big picture, you’ll see how great this stuff is.

We started with a very general concept of two-person game. Then we introduced probability theory and the concept of ‘mixed strategy’. Then we realized that the expected payoff of each player could be computed using a dot product! This brings geometry into the subject. Using geometry, we’ve seen that every zero-sum game has at least one ‘Nash equilibrium’, where neither player is motivated to change what they do—at least if they’re rational agents.

And this is how math works: by taking a simple concept and thinking about it very hard, over a long time, we can figure out things that are not at all obvious.

For game theory, the story goes much further than we went in this course. For starters, we should look at nonzero-sum games, and games with more than two players. John Nash showed these more general games still have Nash equilibria!

Then we should think about how to actually find these equilibria. Merely knowing that they exist is not good enough! For zero-sum games, finding the equilibria uses a subject called linear programming. This is a way to maximize a linear function given a bunch of linear constraints. It’s used all over the place—in planning, routing, scheduling, and so on.
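For instance, player A’s problem of maximizing the security level \min_{q'} \; p \cdot A q' can be written as a linear program: maximize v subject to (p^{\top} A)_j \ge v for every column j, p_i \ge 0, and \sum_i p_i = 1. Here is a minimal sketch of that formulation using scipy (my own illustration, applied to the 2 × 3 payoff matrix from the example earlier in this post):

import numpy as np
from scipy.optimize import linprog

# The example payoff matrix for player A used earlier.
A = np.array([[ 2., 10., 4.],
              [-2.,  1., 6.]])
m, n = A.shape

# Variables: (p_1, ..., p_m, v).  Maximize v, i.e. minimize -v.
c = np.concatenate([np.zeros(m), [-1.0]])

# Constraints: v - (p^T A)_j <= 0 for each column j.
A_ub = np.hstack([-A.T, np.ones((n, 1))])
b_ub = np.zeros(n)

# The probabilities p_i sum to 1; v is unrestricted in sign.
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = [1.0]
bounds = [(0, None)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, v = res.x[:m], res.x[m]
print(p, v)   # for this matrix: p = (1, 0) with security level v = 2

Player B’s problem is the dual linear program, and the minimax theorem is closely related to the duality theorem of linear programming.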

Game theory is used a lot by economists, for example in studying competition between firms, and in setting up antitrust regulations. For that, try this book:

• Lynne Pepall, Dan Richards and George Norman, Industrial Organization: Contemporary Theory and Empirical Applications, Blackwell, 2008.

For these applications, we need to think about how people actually play games and make economic decisions. We aren’t always rational agents! So, psychologists, sociologists and economists do experiments to study what people actually do. The book above has a lot of case studies, and you can learn more here:

• Andrew Colman, Game Theory and its Applications in the Social and Biological Sciences, Routledge, London, 1982.

As this book title hints, we should also think about how game theory enters into biology. Evolution can be seen as a game where the winning genes reproduce and the losers don’t. But it’s not all about competition: there’s a lot of cooperation involved. Life is not a zero-sum game! Here’s a good introduction to some of the math:

• William H. Sandholm, Evolutionary game theory, 12 November 2007.

For more on the biology, get ahold of this classic text:

• John Maynard Smith, Evolution and the Theory of Games, Cambridge University Press, 1982.

And so on. We’ve just scratched the surface!


Game Theory (Part 19)

7 March, 2013

Okay, we’re almost done! We’ve been studying Nash equilibria for zero-sum 2-player normal form games. We proved a lot of things about them, but now we’ll wrap up the story by proving this:

Grand Theorem. For every zero-sum 2-player normal-form game, a Nash equilibrium exists. Moreover, a pair of mixed strategies (p,q) for the two players is a Nash equilibrium if and only if each strategy is a maximin strategy.

Review

Let’s remember what we’ve proved in Part 16 and Part 18:

Theorem 1. For any zero-sum 2-player normal form game,

\displaystyle{ \min_{q'} \max_{p'} p' \cdot A q' \ge \max_{p'} \min_{q'} \; p' \cdot A q'}

Theorem 2. Given a zero-sum 2-player normal form game for which a Nash equilibrium exists, we have

\displaystyle{\min_{q'} \max_{p'} \; p' \cdot A q' = \max_{p'} \min_{q'} \; p' \cdot A q'}     ★

Theorem 3. If (p,q) is a Nash equilibrium for a zero-sum 2-player normal-form game, then p is a maximin strategy for player A and q is a maximin strategy for player B.

Theorem 4. Suppose we have a zero-sum 2-player normal form game for which ★ holds. If p is a maximin strategy for player A and q is a maximin strategy for player B, then (p,q) is a Nash equilibrium.

The plan

Today we’ll prove two more results. The first one is easy if you know some topology. The second one is the real heart of the whole subject:

Theorem 5. For every zero-sum 2-player normal-form game, a maximin strategy exists for each player.

Theorem 6. For every zero-sum 2-player normal-form game, ★ holds.

Putting all these results together, it’s easy to get our final result:

Grand Theorem. For every zero-sum 2-player normal-form game, a Nash equilibrium exists. Moreover, a pair of mixed strategies (p,q) for the two players is a Nash equilibrium if and only if each strategy is a maximin strategy.

Proof. By Theorem 6 we know that ★ holds. By Theorem 5 we know that there exist maximin strategies for each player, say p and q. Theorem 4 says that if p and q are maximin strategies and ★ holds, then (p,q) is a Nash equilibrium. So, a Nash equilibrium exists.

Moreover, if (p,q) is any Nash equilibrium, Theorem 3 says p and q are maximin strategies. Conversely, since ★ holds, Theorem 4 says that if p and q are maximin strategies, (p,q) is a Nash equilibrium.   █

Maximin strategies exist

Okay, let’s dive in and get to work:

Theorem 5. For every zero-sum 2-player normal-form game, a maximin strategy exists for each player.

Proof. We’ll prove this only for player A, since the proof for player B is similar. Remember that a maximin strategy for player A is a mixed strategy that maximizes A’s security level, which is a function

\displaystyle{ f(p') = \min_{q'} p' \cdot A q' }

So, we just need to show that this function f really has a maximum. To do this, we note that

f : \{ \textrm{A's mixed strategies} \} \to \mathbb{R}

is a continuous function defined on a compact set. As mentioned at the start of Part 17, this guarantees that f has a maximum.   █

I apologize if this proof is hard to understand. All this stuff is standard if you know some topology, and a huge digression if you don’t, so I won’t go through the details. This is a nice example of how topology can be useful in other subjects!
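If you want to see the function f concretely: since p' \cdot A q' is linear in q', the inner minimum is attained at a pure strategy, so f(p') is just the smallest entry of the row vector p'^{\top} A. Here is a minimal sketch (my own illustration, using the same 2 × 3 payoff matrix that appears in the proof below) that evaluates f on a grid of mixed strategies and picks out an approximate maximizer:

import numpy as np

# The example payoff matrix that shows up again below.
A = np.array([[ 2., 10., 4.],
              [-2.,  1., 6.]])

def security_level(p, A):
    # f(p) = min over mixed strategies q' of p . A q'.
    # The expression is linear in q', so the minimum is attained at a
    # pure strategy: the smallest entry of the row vector p^T A.
    return np.min(p @ A)

# Player A has 2 choices, so a mixed strategy is (t, 1 - t) with 0 <= t <= 1.
ps = [np.array([t, 1 - t]) for t in np.linspace(0, 1, 1001)]
best = max(ps, key=lambda p: security_level(p, A))
print(best, security_level(best, A))   # approximately p = (1, 0), f(p) = 2

A grid search like this is only practical for tiny games, but it makes Theorem 5 feel less abstract: f is a continuous function on a compact set, and we can watch it attain its maximum.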

The key theorem

Now we finally reach the heart of the whole subject: von Neumann’s minimax theorem. Our proof will be a condensed version of the one in Andrew Colman’s 1982 book Game Theory and its Applications in the Social and Biological Sciences.

Theorem 6. For every zero-sum 2-player normal-form game,

\displaystyle{\min_{q'} \max_{p'} \; p' \cdot A q' = \max_{p'} \min_{q'} \; p' \cdot A q'}     ★

holds.

Proof. Let’s write

\displaystyle{  \max_{p'} \min_{q'} \; p' \cdot A q' = V}

and

\displaystyle{ \min_{q'} \max_{p'} \; p' \cdot A q' = W}

Our goal is to prove ★, which says V = W. By Theorem 1 we know

V \le W

So, we just need to prove

V \ge W

Here’s how we will do this. We will prove

\textrm{if } W > 0 \textrm{ then } V \ge 0

Since we’ll prove this for any game of the sort we’re studying, it’ll be true even if we add some real number c to each entry of the payoff matrix A_{ij}. Doing this adds c to the expected payoff p' \cdot A q', since p' and q' are probability distributions, so it adds c to both V and W. So, it will follow that

\textrm{if } W + c > 0 \textrm{ then } V + c \ge 0

for any real number c. Taking c just a bit bigger than -W, we get V \ge W - \epsilon for every \epsilon > 0, and this implies

V \ge W

So, let’s get going.

Assume W > 0. To prove that V \ge 0, remember that

\displaystyle{ V = \max_{p'} \min_{q'} \; p' \cdot A q'}

To show this is greater than or equal to zero, we just need to find some mixed strategy p for player A such that

\displaystyle{ \min_{q'} \; p \cdot A q' \ge 0}

In other words, we need to find p such that

\displaystyle{ p \cdot A q' \ge 0}     ★★

for all mixed strategies q' for player B.

How can we find p for which ★★ is true? The key is to consider the set

C = \{  A q' : \; q' \textrm{ is a mixed strategy for B} \} \subseteq \mathbb{R}^m

For example, if

\displaystyle{ A = \left( \begin{array}{rrr} 2 & 10 &  4 \\-2 & 1 & 6 \end{array} \right) }

then C looks like this:

Since W > 0, for any Aq' \in C we have

\displaystyle{ \max_{p'} \; p' \cdot A q' \ge \min_{q'} \max_{p'} \; p' \cdot A q' = W > 0}

so there must exist p' with

\displaystyle{  p' \cdot A q' \ge W > 0}

Since p' = (p'_1, \dots, p'_m) is a mixed strategy, we have p'_i \ge 0 for all 1 \le i \le m. But since we’ve just seen

\displaystyle{ \sum_{i=1}^m p'_i (Aq')_i = p' \cdot A q' \ge W > 0}

at least one of the numbers (Aq')_i must be positive. In other words, if we define a set N by

\displaystyle{ N = \{(x_1, \dots, x_m) : x_i \le 0 \textrm{ for all } i\} \subseteq \mathbb{R}^m }

then Aq' can’t be in this set:

\displaystyle{ Aq' \notin N }

So, we’ve seen that no point in C can be in N:

C \cap N = \emptyset

Here’s what it looks like in our example:

Now the trick is to:

• let r be a point in N that’s as close as possible to C,

and

• let s be a point in C that’s as close as possible to r.

We need to use a bit of topology to be sure these points exist, since it means finding the minima of certain functions (namely, distances). But let’s not worry about that now! We’ll complete the proof with two lemmas:

Lemma 1. r \cdot (s-r) = 0, s_i - r_i \ge 0 for all i, and s_i - r_i > 0 for at least one i.

Lemma 2. If Aq' is any point in C, then

(s-r) \cdot Aq' \ge 0

Here’s what the points s and r and the vector s - r look like in our example:

Check to see that Lemmas 1 and 2 are true in this example! We’ll prove the lemmas later; right now let’s see how they get the job done.

First, by Lemma 1, the numbers s_i - r_i are nonnegative and at least one is positive. So, we can define a mixed strategy p for player A by defining

\displaystyle{ p_i = \frac{1}{c} (s_i - r_i) }

where c > 0 is a number chosen to make sure \sum_i p_i = 1. (Remember, the probabilities p_i must be \ge 0 and must sum to 1.) In other words,

\displaystyle{ p = \frac{1}{c} (s - r) }

Now, for any mixed strategy q' for player B, we have Aq' \in C and thus by Lemma 2

(s-r) \cdot Aq' \ge 0

Dividing by c, we get

p \cdot Aq' \ge 0

for all q'. But this is ★★, which is what we wanted to prove! So we are done!   █

I will give the proofs of Lemmas 1 and 2 in the next part.
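In the meantime, if you’d like to check Lemmas 1 and 2 in our running example, here is a minimal numerical sketch (my own illustration). It searches over player B’s mixed strategies for the point s of C closest to N, takes r to be the closest point of N to s (just clip the positive coordinates to zero), and then tests the claims of both lemmas:

import numpy as np

A = np.array([[ 2., 10., 4.],
              [-2.,  1., 6.]])

def dist_to_N(x):
    # The closest point of N = {x : all x_i <= 0} to x clips the positive
    # coordinates to zero, so the distance is the norm of the positive part.
    return np.linalg.norm(np.maximum(x, 0.0))

# Crude grid search over B's mixed strategies q' = (q1, q2, 1 - q1 - q2).
s = None
for q1 in np.linspace(0, 1, 201):
    for q2 in np.linspace(0, 1 - q1, 201):
        x = A @ np.array([q1, q2, 1 - q1 - q2])
        if s is None or dist_to_N(x) < dist_to_N(s):
            s = x

r = np.minimum(s, 0.0)            # closest point of N to s
print(s, r, s - r)                # roughly s = (2, -2), r = (0, -2), s - r = (2, 0)
print(np.dot(r, s - r))           # Lemma 1: r . (s - r) is (essentially) zero,
                                  # and s - r is nonnegative with a positive entry

# Lemma 2: (s - r) . Aq' >= 0 for a batch of random mixed strategies q'.
qs = np.random.default_rng(0).dirichlet(np.ones(3), size=1000)
print(np.all(qs @ A.T @ (s - r) >= -1e-9))   # should print True

Here p = (s - r)/c works out to (1, 0): in this game player A should always make their first choice, which guarantees an expected payoff of at least 2 no matter what player B does.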