Gheorghe Craciun is a mathematician at the University of Wisconsin who recently proved the Global Attractor Conjecture, which since 1974 was the most famous conjecture in mathematical chemistry. This week he visited U. C. Riverside and gave a talk on this subject. But he also told me about something else—something quite remarkable.

### The mystery

A peptide is basically a small protein: a chain of made of fewer than 50 amino acids. If you plot the number of peptides of different masses found in various organisms, you see peculiar oscillations:

These oscillations have a frequency of about 14 daltons, where a ‘dalton’ is roughly the mass of a hydrogen atom—or more precisely, 1/12 the mass of a carbon atom.

Biologists had noticed these oscillations in databases of peptide masses. But they didn’t understand them.

Can you figure out what causes these oscillations?

It’s a math puzzle, actually.

Next I’ll give you the answer, so stop looking if you want to think about it first.

### The solution

Almost all peptides are made of 20 different amino acids, which have different masses, which are almost integers. So, to a reasonably good approximation, the puzzle amounts to this: if you have 20 natural numbers how many ways can you write any natural number as a finite ordered sum of these numbers? Call it and graph it. It oscillates! Why?

(We count *ordered* sums because the amino acids are stuck together in a linear way to form a protein.)

There’s a well-known way to write down a formula for . It obeys a linear recurrence:

and we can solve this using the ansatz

Then the recurrence relation will hold if

for all But this is fairly easy to achieve! If is the biggest mass, we just need this polynomial equation to hold:

There will be a bunch of solutions, about of them. (If there are repeated roots things get a bit more subtle, but let’s not worry about.) To get the actual formula for we need to find the right linear combination of functions where ranges over all the roots. That takes some work. Craciun and his collaborator Shane Hubler did that work.

But we can get a pretty good understanding with a lot less work. In particular, the root with the largest magnitude will make grow the fastest.

If you haven’t thought about this sort of recurrence relation it’s good to look at the simplest case, where we just have two masses Then the numbers are the Fibonacci numbers. I hope you know this: the th Fibonacci number is the number of ways to write as the sum of an ordered list of 1’s and 2’s!

1+1, 2

1+1+1, 1+2, 2+1

1+1+1+1, 1+1+2, 1+2+1, 2+1+1, 2+2

If I drew edges between these sums in the right way, forming a ‘family tree’, you’d see the connection to Fibonacci’s original rabbit puzzle.

In this example the recurrence gives the polynomial equation

and the root with largest magnitude is the golden ratio:

The other root is

With a little more work you get an explicit formula for the Fibonacci numbers in terms of the golden ratio:

But right now I’m more interested in the qualitative aspects! In this example both roots are real. The example from biology is different.

**Puzzle 1.** For which lists of natural numbers are all the roots of

real?

I don’t know the answer. But apparently this kind of polynomial equation always one root with the largest possible magnitude, which is real and has multiplicity one. I think it turns out that is asymptotically proportional to where is this root.

But in the case that’s relevant to biology, there’s also a pair of roots with the *second* largest magnitude, which are *not* real: they’re complex conjugates of each other. And these give rise to the oscillations!

For the masses of the 20 amino acids most common in life, the roots look like this:

The aqua root at right has the largest magnitude and gives the dominant contribution to the exponential growth of The red roots have the second largest magnitude. These give the main oscillations in which have period 14.28.

For the full story, read this:

• Shane Hubler and Gheorghe Craciun, Periodic patterns in distributions of peptide masses, *BioSystems* **109** (2012), 179–185.

Most of the pictures here are from this paper.

My main question is this:

**Puzzle 2.** Suppose we take many lists of natural numbers and draw all the roots of the equations

What pattern do we get in the complex plane?

I suspect that this picture is an approximation to the answer you’d get to Puzzle 2:

If you stare carefully at this picture, you’ll see some patterns, and I’m guessing those are hints of something very beautiful.

Earlier on this blog we looked at roots of polynomials whose coefficients are all 1 or -1:

The pattern is very nice, and it repays deep mathematical study. Here it is, drawn by Sam Derbyshire:

But now we’re looking at polynomials where the leading coefficient is 1 and all the rest are -1 or 0. How does that change things? A lot, it seems!

By the way, the 20 amino acids we commonly see in biology have masses ranging between 57 and 186. It’s not really true that all their masses are different. Here are their masses:

I pretended that none of the masses are equal in Puzzle 2, and I left out the fact that only about 1/9th of the coefficients of our polynomial are nonzero. This may affect the picture you get!

There’s a typo in the formula for for the Fibonacci numbers; apart from the lower-case on the RHS, the second root should be .

I meant to say

but perhaps

is clearer. Anyway, the two roots are and

I was puzzled as to why there are so many more than 20 roots plotted if there are 20 amino acids, so I looked at the paper. It talks about adding 41 more masses because of post-translational changes to the proteins, but I think there are more than 61 roots in the plot, so there must be something else going on that adds a few more masses.

Oh, sorry, that was a dumb comment! I got confused between the number of terms in the polynomial and the degree.

I came here to make the same “dumb comment”. Confused minds think alike?

Yes, I made the mistake too at one point.

I hadn’t noticed about the ‘post-translational changes’. I told Craciun about selenocysteine, an unusual amino acid that contains selenium. It’s only used in certain organisms, and it’s coded for by a codon that’s normally one of the stop codons. It would be really cool if the peptides found in one of these other organisms showed oscillations with a notably different periodicity.

Quoth Wikipedia:

Here’s another cool puzzle.

Greg Egan worked out the roots of 200 polynomials, where in each case he randomly chose 20 masses between 50 and 200, and threw out any repeated masses. They look like this:

Compared to the pictures in The Beauty of Roots, this looks pretty chaotic and boring except for the repulsion of the real axis, the strip of positive real roots with large magnitude, and the ‘bands’ or ‘tendrils’ of complex roots with large positive real part and small imaginary part.

But, these few interesting visible features are connected to the peptide puzzle! The roots with largest magnitude contribute most to the asymptotic growth rate of and the complex roots with largest magnitude—the tips of the ‘tendrils’—will give the visible oscillations. So, let’s zoom in on those:

Puzzle 3.Suppose we randomly chose 20 natural numbersand work out the roots of the polynomial

What, on average, is the imaginary of the complex roots with largest real part? And what is its standard deviation? I’m hoping that this quantity, say has a small standard deviation and mean with

If that’s true, the periodicity of the peptide masses is a robust phenomenon, largely independent of the precise masses of the peptides!

Puzzle 4.How do things change as we change the numbers 20, 50 and 200 in the above problem?I think you need to pick the complex root with the largest magnitude, rather than the largest real part; if not, the value is a long way off.

Using 10,000 random samples, chosen in the same way as I used for the plot (pick 20 numbers between 50 and 200 inclusive, and discard any repetition – so there might be less than 20 numbers after that step), I get:

Standard deviation of

If I change the upper limit on the mass from 200 to 500, while keeping the lower limit at 50 and the number of masses at 20, that doesn’t alter the average of the periods very much. For 1,000 random samples, I get:

with a standard deviation of 0.305363, and:

If I go back to the original mass limits of 50 and 200, but double the number of masses from 20 to 40, then for 1,000 random samples, I get:

with a standard deviation of 0.354997, and:

Thanks for the calculations!

The problem is, if the real part of a root is small, that root gives oscillations that don’t grow very fast, or even decay. That wouldn’t explain the dramatic growing oscillations that we see. What we

seeis likely to be dominated by the roots with the largest real parts.More precisely, we have

where are the roots and are some coefficients. The asymptotic behavior of as is dominated by the roots with for which is the largest. So, roots with would throw off my simplified claim above.

OK. So maybe I need to calculate something more elaborate.

Just to be clear, for the specific set of roots given in the diagram from the paper, the two red diamonds that Craciun associates with the period are the non-real roots with the largest magnitude. There are other non-real roots with larger real parts (the two small diamonds with very small phases, around plus or minus 3 degrees, and magnitude around 1.014).

Hmm. I’m finding this whole phase-magnitude way of plotting a bit confusing. What matters in

mymind (and I could be wrong!) are the real and imaginary parts of the roots since these control the growth rate and frequency of the corresponding functions .So, I can imagine asking you to draw a picture in terms of real and imaginary parts. But even more useful might be these: a picture of Craciun’s specific roots overlaid on your picture of roots of randomly chosen polynomials. That would let us see how special the biological example is—or not.

Surely has its growth rate controlled by the real part of , which depends on the magnitude of , rather than the real part of itself?

Now we’re even on dumb comments for the week. Thanks for straightening me out!

Greg drew a phase/magnitude plot of the roots for 1000 random examples (where in each case he randomly chose 20 masses between 50 and 200, and throw out any repeated masses), together with the roots for the polynomial he got from applying the same construction to the 20 amino acid masses (which include some repeated masses), shown in red:

Here’s a closeup of the most important part:

Unfortunately for my dreams, the complex roots that control the dominant oscillations in the peptide mass distribution do not lie on the “first tendril”, so these biological oscillations are not “controlled by math” in a simple way. They seem to lie near the edge of the “fourth tendril”—and the fourth tendril, unlike the first three, is barely visible.

So, the oscillations we see in nature are slower than they you might get with randomly chosen amino acid masses.

(There should be some fairly simple explanation of these tendrils, but I don’t see it.)

I’ve corrected my confusion about real part vs. magnitude and imaginary part vs. phase in the main post and some comments. But here I should really issue a separate, corrected, version of Puzzles 3 and 4. If we’re trying to understand the period of the oscillations in the function

we need to look, not at the imaginary part of the root but its phase, or argument. If

then

so the period of the oscillations in this function is So:

Puzzle 3′.Suppose we randomly chose 20 natural numbersand work out the roots of the polynomial

What, on average, is the phase of the complex roots with largest magnitude? And what is its standard deviation? I’m hoping that this quantity, say has a small standard deviation and mean with

If that’s true, the periodicity of the peptide masses is a robust phenomenon, largely independent of the precise masses of the peptides!

(I’m not at all optimistic that this is true, now that Greg has prepared an image comparing the average situation to the situation that arises with actual peptides. But this at least is the correct way of asking the question!)

When amino acids join to make peptides, a water molecule is lost at each bond, so for peptide made from n amino acids, subtract (n-1)*18.

Hmm, that’s important! I wonder if Hubler and Craciun discuss that in their paper. It’s a big deal, yet easy to deal with. You can just subtract 18 from the mass of each peptide, then do all the math I described to work out a formula for then add 18. The last step would not affect the period of so if you only care about the period, you can skip it.

14 is the rounded atomic mass of a CH2 unit. You’re overthinking this….

The paper mentions that. However, they say the fit of 14.28 is noticeably better, and that’s what they get from their more detailed analysis.

Of course, the two explanations are not necessarily mutually exclusive.

And of course, amino acids involve a lot more than CH2’s so it’s not obvious that CH2’s explain the truly dramatic oscillations that we see! One would need to give a mathematical argument to prove this.

I think the paper’s comments about the structural subunits of the amino acids explain both the fact that the period is

close tobut not exactly 14, and also the fact that the period is distinctly non-generic (i.e. quite different from what random masses in the same range produce):Hmm, I guess I should add that the mere fact that all the amino acid masses can be expressed as sums of elements of the set {12, 14, 15, 16, 32} is actually

completely non-informative, because this is true ofeveryinteger from 50 to 200!So my previous comment was a bit of a stretch. There must be something else about the composition of amino acids that makes the masses non-generic … and of course there are probably fairly simple reasons to do with valences why you can’t stick any old numbers of these units (C, CH2, NH, O, and S) together, let alone stick them together and get something that counts as an amino acid.

About this magic number 14: if it’s really important that CH2 has mass 14, then this might show up in lots of amino acids having masses whose differences are multiples of 14. If I were a biochemist I’d just

knowwhether lots of amino acids differ by one or more CH2’s, but since I’m a mathematician I imagine taking the amino acid masses57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186

computing their differences, and seeing if lots differ by multiples of 14. Let’s see:

71-57 = 14

113-99 = 14

115-101 = 14

129-115 = 14

That seems like a lot. I’m too lazy to look for differences that equal 28 or other multiples of 14.

One thing we could do is take 20 numbers between 50 and 100 with a lot that differ by 14, and generate a polynomial from those numbers in the usual way, and look at its roots. Claim: the complex root with largest magnitude has phase such that

Multiples of 14.28 are:

57.12 71.40 85.68 99.96 114.24 128.52 142.80 157.08 171.36 185.64

That’s a good match for most of the amino acid masses, until you subtract 18 from them.

It’s so weird in how subtle ways combinatorics interact with our lives.

The distribution of the masses of the amino acids might not be well modeled by a flat distribution between 50 and 200. It looks more like a normal distribution centred on about 115 with a small standard deviation (about 20?) Would that make a significant difference?

Good question! This might tend to push the largest-magnitude complex root onto the fourth tendril, which is what we see in nature. But I have no intuition for whether that’s true, because I don’t understand those tendrils.

It might be worth solving the case where the amino acid masses are all evenly separated e.g. 48,56,64,…,200. In that case the polynomial takes a simplified form if multiplied by with just four terms, and it is a polynomial in Perhaps that would give a distribution with clear tendrils.

Actually more like mean = 118, stdev = 31, but the mean should not affect the period right?

The mean can affect the strength of the periodicity, From 48,56,64,…,200 you can only make multiples of 8, but from 51,59,67,… 203, you can make much more.

Let’s look at how having many similarly spaced amino acids affects the mass distribution of peptides. Suppose we have numbers

How many ways are there to write as an ordered sum of these?

I’m too busy to dive in and work that out, so let’s start out with an easy example and hope that helps us with harder ones. Suppose we have numbers

Let be the number of ordered sums of these that add up to .

Then

so if we take as our ansatz

we get

or

But

so our equation becomes

or

What are the solutions like, especially those with the largest magnitude?

Hmm, I’m too busy to even tackle this in detail right now! But I’ll leave this here and maybe return to it later. It could be fun to study problems of this sort.

(For , the solution with greatest magnitude seems to be . For it seems to be So for large this root seems to be approaching 2. But this isn’t the main thing I’m interested in.)

Following these calculations where and are not 1, I get for the characteristic polynomial

The drops out so unless I am mistaken the loss of water molecules in peptide bond formation does not matter when the masses are equally spaced.

The polynomial is zero if . In the case the fourteenth roots of unity are roots of the characteristic polynomial. And one of the fourteenth roots of unity is

Numerically I find that for the root of

with the largest magnitude is nearly 2. The positive fourteenth root of 2 is about 1.05. So

approximates a root of

and it ties for having the maximum magnitude of all the roots.

I think the equation for generic is:

or equivalently:

Of course if is a

bth root of unity, it will still be a solution regardless of .I think does affect the other solutions, though.

You’re right, the equation is

Not all of the ‘s cancel as I thought.

Now that I check more carefully, it’s not safe to multiply through by when is a

bth root of unity, and in fact those roots don’t solve the equation.But my numerical calculations seem to agree with the idea that the largest-magnitude roots are multiples of the

bth roots of unity.When is divisible by (as in the example that I described in another comment), then you get solutions that are some constant real multiple of all the

bth roots of unity. But when is not divisible by , things aren’t so symmetrical.Using and setting I got

It looks like the root with the largest magnitude is

This would make the magnitude of about 1.0233 which looks close to the magnitude of the red diamonds in Hubler and Craciuns’ figure. Perhaps too close given that the amino acid masses are not equally spaced.

I agree that for there’s a set of 14 roots that all have magnitudes of about 1.0233.

For the real data, I calculate a magnitude for the largest complex root of 1.02552. I don’t know to what extent that’s just a coincidence, or whether there’s some correlation here that doesn’t depend so much on the details of the mass set as the fact that the periods are very close.

If you use masses that lie in a perfect arithmetic sequence, I don’t think you get periodic behaviour in .

The first plot in the image above shows what would be the period from the largest-magnitude complex root for the series of masses with , but the next plot shows itself for the series with 5 masses, and like the Fibonacci numbers it doesn’t oscillate.

The third and fourth plots show what you get with a perfect arithmetic sequence with , and what happens if you alternately add and subtract 1 from the masses.

So this seems to suggest that exact spacing gives non-generic (and perhaps always non-oscillatory) results, but adding some variation to an exactly spaced set of masses can give an oscillation that comes close to the original spacing.

For the exact series , the largest magnitude of any non-real root seems to approach 1 asymptotically

from belowas you add more masses, while as JB noted the real root seems to approach 2 from below. So for large N, the exponential behaviour will always dominate.For the exact series , it turns out that there are 14 roots with magnitudes

exactlyequal to the magnitude of the largest real root (which is greater than 1). So that plot where I showed the period associated with the largest complex root seeming to jump back and forth between different values was an artifact of trying to pick just one root, when the “dominant period” in this case is actually multi-valued.And in fact these roots all lie equally spaced around a circle in the complex plane. So one of them does correspond to a period of exactly 14, but there are equal-magnitude roots with periods 14/2=7, 14/3, 14/4=7/2, etc.

I don’t know what actually happens with in this case, because I haven’t been able to obtain enough values by combinatorial methods to solve for the coefficients of the terms associated with all the roots.

However, for the case where you perturb the exact series by adding or subtracting 1 from the masses, I can solve for everything, and you do get clear periodic behaviour.

Thanks! By the way, my original claim that the Fibonacci numbers show no oscillation was based on my silly confusion about versus If is negative then the former oscillates even though the latter does not. The former is what matters for our problem. For the Fibonacci numbers the root

is negative so we

doget an oscillatory effect:However, the oscillations are exponentially damped since The oscillations can only be seen if we subtract them off, leaving numbers

which go like this:

0.723

1.171

1.894

3.065

4.960

8.025

The Fibonacci numbers are alternately larger and smaller than these, so we can say they oscillate, in a sense!

However, these oscillations are vastly milder than the growing oscillations we see when there’s a negative or complex with

(I imagine you know all this stuff, but other people may not have thought about it very much, and I need to correct any lingering confusion I sowed.)

Greg Egan’s analysis above may suggest that smooth distributions don’t capture the right effect, but let’s continue with that for now.

Suppose the amino acid masses have a smooth distribution about some mean value that we take to be an integer. Rewrite the equation for the roots as

where is the number of amino acids.

If is near one the smooth distribution can be used to write this as an integral.

Take with small

So you get an equation to solve involving a Laplace transform.

I want to try this for a normal distribution rather than a flat distribution because the result is more tractable and because I think it models the actual numbers better.

The integral can be done giving

This reduces to a series of quadratic equations to solve for

for integers

In the real case , and

The real roots () are

or

Taking the second dominant root the growth factor is

I think that is quite close.

Taking the dominant root for

This has a larger real part than the real root and an imaginary part that is far too small. So some random element might be needed here too, and that is too hard to do analytically.

Suppose you start with 20 masses in an arithmetic series, and then randomly add or subtract 1, or do nothing, to each mass. The periods implied by the largest complex root, for 200 random trials of that process, are shown in the histograms above. In each case, the arithmetic series starts at , and the trials were done with .

The results for the first two cases are tightly clustered a bit below the original spacing of the masses. But for reasons I don’t understand, things go weird for the third case, where . I’m using very high precision arithmetic (1000 digits), so it’s unlikely that the numerical approximation is misidentifying the correct root, and if there are multiple complex roots that share the same magnitude, I pick the one with the smallest argument, which ought to yield the highest possible period in each case. But for ,

noneof the mass sets yield a period greater than 16.85.Of course, these periods are all based on the assumption that these particular roots will have non-zero coefficients in the final formula for . But generically, you’d expect that to be true.

Anyway, it does seem clear that there are cases where the dominant period in will be set by the spacing between the masses.

These examples don’t readily translate to the amino acid masses, though, as the spacings between consecutive masses for them is: 14,16,10,2,2,2,10,0,1,1,13,0,1,2,6,10,9,7,23.

If you start out with masses in a perfect arithmetic sequence, and if the starting value of the sequence, , is divisible by the increment, , then the characteristic polynomial will have roots that are all of equal magnitude, and are all multiples of

bth roots of unity. That’s what I see numerically, and I think Scott Hotton has shown that analytically.If you perturb the sequence by adding or subtracting 1 to the masses at random, you break that perfect symmetry. What seems to happen at first is that the positive real root becomes the largest, and the second-largest roots are a conjugate pair with phases of approximately .

But as you keep perturbing the masses, the roots that were originally arranged symmetrically on a perfect circle move around, and if you make enough changes, eventually you can “dethrone” the conjugate pair with the smallest phase, and one of the other pairs of roots becomes larger.

In the root plots below, the largest root(s) are shown in red, and the second-largest are shown in blue. Eventually, for both and , the lowest-phase roots get beaten by some other pair of roots, and the phase is abruptly multiplied by some (roughly integer) factor.

I’m still not entirely sure why the randomly perturbed cases for that I plotted in the histogram in an earlier comment

neverhad the lowest-phase roots being the largest, but it might be due to the relationship between and ; you’re only starting from a perfect circle when is divisible by , and how much things are distorted by non-divisibility must depend on the particular numbers.On the basis of what Greg has done, I’m willing to guess that if we have a list of integer masses and a large fraction of them have differences equal to some number and we plot a histogram of how many ways each natural number can be written as an ordered sum of these masses, we’ll usually get oscillations with period close to But sometimes funny things happen!

It might be fun, Greg, to take one of the test cases you examined and actually work out the number of ways can be written as an ordered sum of the masses on one of your lists.

Instead of writing

and trying to figure out the coefficient for each root of the relevant polynomial, a less tiring approach might be to actually generate all the ordered sums up to some number and then count how many of these sums give for each (This could be tiring for your computer, but I hope not for you.)

I’m wondering, for example, if the second-largest roots always manifest themselves as a noticeable oscillatory pattern in Conceivably they might not, if the coefficient of for those roots was very small. Unless that coefficient were

zero, as those terms would eventually beat the terms coming from other complex roots… but by then it might be completely overshadowed by the term coming from the real root.A case where another pair of complex roots dethrones the lowest-phase pair would be fun to see. I don’t get what’s really happening there.

By coincidence yesterday I ran into 3 undergrad math majors at UCR who did a project on numerical semigroups. A

numerical semigroupis just a set of natural numbers that’s closed under addition and has finite complement. (If it doesn’t have finite complement, every number in it must be a multiple of some fixed natural number so we can divide everything in it by and simplify things.)This is relevant because if we take any set of natural numbers

and keep adding them to our hearts content, the numbers we get form a numerical semigroup, at least if all the aren’t divisible by some number

The

Frobenius numberof a numerical semigroup is the largest integer that’s not in it.A numerical semigroup is called

irreducibleif it’s not the intersection of two numerical semigroups that properly contain it.We can ask: how many numerical semigroups are there with Frobenius number ? And this turns out to grow in an interesting way, which has bumps with period 6:

(Click to enlarge.) This picture is from the students’ paper, where they prove the lower and upper bound shown:

• Clarisse Bonnand, Reid Booth, Carina Kaainoa, and Ethan Rooke, Bounds on the number of irreducible semigroups of fixed Frobenius number.

Needless to say, this caught my eye both because the peptide problem is related and because it shows periodic bumps!

Here’s a simple example of period-doubling that I was able to solve completely:

To add a bit more detail about what happens at the level of the roots: the blue curve has a dominant complex root of magnitude 1.10551 and period 5.50192, while the orange curve has a dominant complex root of magnitude 1.08076 and period 2.2945; it also has a root of magnitude 1.05422 and period 5.4668. Both have non-zero coefficients in the solution, and in fact the larger-magnitude root (with the shorter period) also has a larger coefficient. That surprised me, because it seems to take a while to clearly dominate the oscillations, but I guess at low values there are always going to be complicated things going on with all the other roots.

Wow, great! I was confused about ‘period-doubling’ versus ‘frequency-doubling’. but I think I’ve got it straightened out now. Period-doubling is of course a famous thing in dynamical systems, but frequency-doubling, or in other words period-halving, also shows up, e.g. when you blow hard on a flute and the note jumps up an octave.

Of course you might argue it’s all a matter of convention, but the question is whether the ‘unexpected’ situation has period twice or half the ‘expected’ period. And here it’s

half… which I find quite remarkable from one point of view.You take the numbers 6, 11, 16, 21, 26, each of which is 5 more than the last, and you count how many ways you can get as an ordered sum of these. Unsurprisingly, perhaps, the count oscillates with period 5… well, okay, actually 5.50192.

Then you change the number 6 to 7, and repeat the count, and now the count oscillates with period 5/2! Well, okay, actually 2.2945. But the period has

halved, roughly.If you’d asked me before you started doing all these calculations, I might have guessed that the period could

double, because it might take more time for something to come into alignment. (Not a very clear reason.)But of course once you showed that the complex root with the smallest phase could be ‘dethroned’ by one with roughly twice that phase, it became clear that this funny phenomenon should actually cause ‘frequency-doubling’, hence ‘period-halving’.

But I had trouble

believingthat until I saw an example! I still have no intuition for why the counts start wiggling twice as fast when you change the 6 to a 7.Anyway, it’s great.

I also find it cute that while the orange curve goes negative, it has the good sense not to do so at integer values.

And I also find it interesting how the “period 5… well, okay, actually 5.50192” mimics the story in the biological case, where an easy argument suggests that the count should oscillate with period 14, but in fact the period is 14.28.

I knew as soon as I’d posted that I should have said “period-halving” rather than “period-doubling”, but by then it was too late.

It’s okay, I’d been pondering period-doubling versus period-halving in this problem so you gave me an excuse to vent!

I’m still trying to figure out how to pose the question “what’s really going on in this period-halving effect?” in a useful way. I think it’s a wonderful and curious thing.

Here’s one idea. One can imagine wondering not just how

manyways there are to write as an ordered sum of numbers but also more abouthowwe can write it as a sum. Could it be that then the period halves, we get pulses of two differentkinds, in which some average characteristic of the ordered sums flips and flops?For example, does the average fraction of s appearing in the sum stay roughly constant, or does it vary significantly with ? This is just the simplest characteristic I can think of.

(Even if it stays constant, what’s the constant? If I take a number and write it as an ordered sum of numbers in all possible ways, do each of these numbers tend on average to show up in equal proportion, or do smaller ones show up more? If we were talking about unordered sums, I should be able to answer this using standard techniques of statistical mechanics, sort of like how we figure out the energy distribution of atoms in a gas.)

Guesses aside, ultimately this question is about how the expression

‘knows’ how many ways we can write as an ordered sum of numbers or in other words, how the roots and coefficients encode what is going on. Somehow they encode

everythingabout what’s going on. But the meaning of the individual oscillating functions is obscure, since each one is affected byallthe numbers It’s a bit like a Fourier transform or holograph.More concretely, suppose we’re trying to write as an ordered sum of the numbers 7, 11, 16, 21, 26. Then we can write as an ordered sum in a bunch of ways. Many of these ways come from taking an 11 in a sum that gives and replacing it with a 16, or taking a 16 and replacing it with a 21, or taking a 21 and replacing it with a 26. So if tended to have a lot of 11’s in it, maybe will tend to have a lot of 16’s in it, and so on.

I guess it’s again wise to start with the very simple cases, where the are in an arithmetic progression.

I had a look at Hubler and Craciun’s paper, and at a few other things. I learned:

The masses 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186 are for residues, ie, without a H2O. So the 18s have already been subtracted.

The first graph in the blog post is not “the number of peptides of different masses found in various organisms”. Instead, they are the result of splitting up (digesting) a single protein (Lys-C or Trypsin) into peptides. Both Lys-C and Trypsin are proteases, i.e., proteins which digest proteins. They split at certain residues, which means that each peptide they make will begin (or end) with one of a few residues.

I think (from the shape of the graph) that these semi-digested proteins contain few very short peptides, and this is important for periodicity. If short ones (say <6 residues) are very rare, peptides with masses in the range 300-500 will be dominated by the residues with small mass like 57, 71, 87, 97, 99, 101, 103.

I am dubious about F(N) being used as a model. There is no process that I know of in the cell, or in the in vitro digestion, that cares about mass. On the other hand, I reckon that if you get a lot of Trypsin and wait for it to self-destruct, there will be a particular distribution of peptide lengths. And so I think a better approach (if anyone cares about the biochemistry) would look at “what masses can you make with r residues?” for various r. Convolutions, Fourier transforms, etc.

Thanks!

Yes, Craciun mentioned this to me in email, saying:

I was waiting for him to post a comment on the blog.

About the biology:

Interesting! But I don’t quite see what mathematical model you might be proposing here.

While the biology is exciting, unfortunately the math problems raised by are so fundamental and so gripping that I’m perfectly happy thinking about those, for a while at least. Of course there’s nothing mathematically profound about how many ways you can write as an ordered sum of the numbers 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, and 186. But the

generalizedquestion, where we consider arbitrary collections of numbers, feels important, and fun. So it’s probably been studied already. So I should poke into the literature.By the way, these problems are

alsobasically problems of convolutions and Fourier transforms, even though we’re not using those words a lot. is obtained by iterated convolution—that’s what the recurrence relation says—and that lets express it as a linear combination of exponentials, which is a kind of Fourier transform adapted to exponentially growing functions.If there is a distribution p(r) on the lengths of the peptides, and w is the distribution of residue masses, and is the r-fold convolution of w, then I mean to replace F with .

Okay, now I get your idea. Just to compare using the same notation, our is given by

We didn’t write it that way, but that’s what it is. So, you’re suggesting modifying it by weighting the sum by a probability distribution :

As long as is constant for will be proportional to for since 57 is the residue mass of the lightest amino acid.

Thus, I don’t feel very bad about using our original model—at least until someone proposes a specific choice of

Also, unless oscillates, I have trouble seeing how it could create dramatic oscillations in if there weren’t already such oscillations in the original Since the original puzzle was to explain the observed oscillations in empirical data, I think it’s good that we can see them, and understand their appearance, in the original

In short, to the extent to which your model is the ‘real thing’, I think the original simplified model is a useful first step toward the real thing.

I think your is , assuming we are using 20 masses.

Perhaps I wasn’t clear. I meant by “w is the distribution of residue masses” to imply that , and similarly for p. is supposed to be a probability distribution.

Okay, in my comment I used to mean the function on the integers given by the sum of Kronecker deltas supported at the residue masses of the amino acids. So, it was 20 times your

I’m just skimming so this might have been answered in earlier comments: Is it clear yet what underlies the apparent structure in John’s initial phase/magnitude scatterplot?

I think I see in that plot some right-facing circular or elliptical arcs, and maybe some left-facing parabolic or hyperbolic ones.

Of course one possible answer is ‘the viewer’s brain’ but I guess there are statistical tests that could help w/ that.

I think it is interesting to see what happens when all the masses are divided by 14.28

57/14.28 = 3.99159

71/14.28 = 4.97198

87/14.28 = 6.09243

97/14.28 = 6.79271

99/14.28 = 6.93277

101/14.28 = 7.07282

103/14.28 = 7.21288

113/14.28 = 7.91316

114/14.28 = 7.98319

115/14.28 = 8.05322

128/14.28 = 8.96358

129/14.28 = 9.03361

131/14.28 = 9.17366

137/14,28 = 9.59383

147/14.28 = 10.29411

156/14.28 = 10.92436

163/14.28 = 11.41456

186/14.28 = 13.02531

What you notice is that most of them are close to integers with a few exceptions at the heavier end. In particular the lightest two are very close to integers. This is what is responsible for the oscillations. You can see from the recurrence relation why that would be the case. Greg Egan’s experiments with equally spaced masses also shows it. In a growing sequence the lightest masses control the dominant terms in the recurrence relation.

Is there a biochemical reason why the masses should be close to multiples of 14.28? I can see that CH_2 and N both have masses of 14 and may dominate the chemistry, O is 16 while P and S are 31 and 32 which are a little over twice the period, pushing the average a little higher. An extra H on the end of a chain has a similar effect.

I wish these oscillations would turn out to have some significance for life. That would be very profound. However as someone said, masses don’t have a big effect on chemistry.

Do you mean this picture by Hubler and Craciun?

I don’t think we know explanations of the patterns you’re alluding to, or even whether those patterns are ‘real things’ (i.e.,

haveinteresting explanations).Yes, that is the diagram I asked about. I agree that the explanation may well be “it’s all in your head”. Also that (as Philip observes) they are less impressive without the reflection symmetry.

I think there is still some possibility that the arcs are a real effect. If the masses were all exact multiples of some integer then we would get some straight vertical lines. So if they are mostly close to multiples of some number there is a possibility that the lines could be distorted into arcs. I just can’t see that idea quite matching up with what we see. Perhaps it is more complicated.

To give a little better idea of what I mean, here is an analysis of how you could get some arcs under the right conditions

Suppose we have a case where all the masses are near multiples of some number I.e. where the multipliers are positive integers and the remainders are small, e.g. between -0.5 and 0.5. should also be quite large, e.g. greater than , but it does not have to be an integer.

Suppose a root of the polynomial equation (not necessarily a real root.) Are there other routes near for integers ? Try for a small complex correction Substituting into the polynomial gives

Some multiples of came out of the exponent making it small, so the exponential can now be expanded as a power series. Coefficients of powers can be matched up to find a solution. The first power gives

I wont write the rest out but you can expand the exponential to second order and set the coefficient of to zero to get an expression for and so on. Provided is big and are small this series will converge rapidly, so the result describes a series of roots following a smooth curve in the complex plane. I don’t know if this fully accounts for what is being seen. It might be useful to setup a cleaner case where the remainders are made smaller to see how it looks.

What does the plot look like if the masses are adjusted to

57, 71, 86, 100, 100, 100, 100, 114,114, 114, 114, 128, 128, 128, 128, 143, 143, 157, 157, 186 ?

These numbers are nearer multiples of 14.28. Does this make the arcs more visible?

I hope Greg (whose ‘red dot’ code could answer your Q without modifiation) sees your post. I am starting to see linear features as well and want to put this to rest before I start seeing horsies and doggies.

Philip wrote:

That sounds like an interesting question! I’d gotten busy with some funeral arrangements in the last week, but I haven’t lost interest in these puzzles. I hope someone with better computational skills than I can draw this plot. Maybe I’ll try to drum up interest on G+.

I’m working on an interactive page that will let people plot both F(N) and the roots for any set of integers they like, but it’ll be a couple of days before it’s up.

Wow! I should have guessed you were up to something, not slacking off like me. That sounds fun.

I see those arcs too. They are very convincing, but if I cover up the bottom half I am not sure they are so convincing. Is it the reflection symmetry that creates the illusion?

To continue with my other point which I somehow got mixed up in this thread, the chemistry seems to be based on chains of CH_2 (mass 14) NH (mass 15) CO (mass 28) CNH_3 (mass 29) S (mass 32) N_2C_3H_2 (mass 66) benzene (mass 76). What oscillations do you get if you use these 7 masses as your starting point?

In general, for the sake of some possible applications, I always thought it may be more interesting if the the period of the oscillations

doessomehow depend on the peptide masses — i.e., if the period captures some property of the peptide masses, which may allow us to say something about the probability that the masses present in some sample set are (or are not) compatible with a given set of “building block” masses.In case anybody may be interested, in a different paper Shane Hubler and I have also looked at some general mathematical properties of this type of recurrence relations: “Mass distributions of linear chain polymers”, Journal of Mathematical Chemistry, 2012 http://www.math.wisc.edu/~craciun/PAPERS_NEW/Hubler_Craciun_JMathChem_2012.pdf

Thanks, I’ll look at that!

Shane Hubler and I have pointed out in our paper that 70% of the amino acid masses are within 1.4Da of an integer multiple of 14.28 — see Table 2 and the second paragraph in section 5 “Why14.28Da?” in the paper at

http://www.math.wisc.edu/~craciun/PAPERS_NEW/Hubler_Craciun_BioSystems_FINAL_2012.pdf .

I think it would be interesting to check numerically the following: if we make many random selections of 20 integers (all between 50 and 200) such that most of them are very close to an integer multiple of 14.28, is it true that for a large percentage of those choices the oscillation period is close to 14.28? (For an idea of how large the “noise” could be, you can have a look at the leftmost column labelled “Remainder” in Table 2 in our paper.)

Yes, this is the correct explanation. Furthermore it is the low mass components that most control the oscillation and more of these are within your limits for the remainder. The contribution from each mass in the recurrence relation is suppressed according to the growth factor. E.g. with a growth factor of 1.02 per dalton the mass at 57 makes a larger contribution than the mass at 103 by a factor of The fact that the two lowest mass components are particularly close to multiples of 14.28 is most significant.

I don’t know much about biochemistry but from the Wikipedia page on amino acids it looks like the lightest components in the chains that they are formed from are CH_2 (mass 14) NH (mass 15) CO (mass 28) CNH_3 (mass 29) S (mass 32) . More complex components such as the benzene ring are much heavier so their influence is suppressed by the growth factor. Being close to twice the period (i.e. 28.56) is just as good as being close to the period itself. The frequency with which these occur as components in the amino acid chains will also be relevant.

This explanation is a little handwaving but it might be possible to do a more detailed analysis of how this works out.

Here is a more detailed analysis.

The overall growth factor is the largest real number root of

This can be computed as

Next try to find a second complex root assuming that the masses are all close to integer multiples of some fixed size i.e. define

for integers such that the values are all close to 14. Look for a second root of the form where will be the period of the oscillation and is real.

So solve

Subtract a multiple of in the exponents to make them small and simplify.

Since the exponents are now small use the first order approximation to the exponential

The real part of the equation gives (first order approximation) and the imaginary part gives

In other words the period is just the weighted mean of the values with weights

The approximation for the growth rate of the oscillations could be improved by using a second order approximation to the exponential to show that it grows just slightly slower than the growth factor by an amount determined by the size of the remainders.

Nice analysis! It gets the period fairly close for 6,11,16,21,26; it gives 5.73981 vs 5.50192.

For 7, 11, 16, 21, 26 it’s not as good; it gives 6.16646 vs the (lowest-phase) period of 5.4668.

But for 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186, it’s extremely close: 14.2533 vs 14.2767.

I calculated this weighted sum on a spreadsheet and got 14.2775

What values for did you use Greg? Mine were

4 5 6 7 7 7 7 8 8 8 8 9 9 9 10 10 11 11 13

For a few it was not obvious. I could have chosen a different number and the result would be different

I used the same numbers except for 12 rather than 11 for the second-last (mass 163). I rounded the masses divided by 14 to get the ; dividing by 14.28 instead would change that 12 to the 11 you used and leave everything else the same.

I guess there are all sorts of refinements possible in making these choices, but that’s not really the point — we can compute the roots

exactlyif we really want to know them, exactly. But your analysis gives a nice intuitive explanation of why the period is roughly what it is in relation to the .You can double all the values of giving a new root at half the period, but the growth rate will go down. For the case 7, 11, 16, 21, 26 you could then change from 2 to 3 to give a better match.

Great stuff!

I made a plot of the roots of many random polynomials with coefficients 1, 0 or 1, 0 or 1,…., 0 or 1, -1. In the complex plane they look quite like the Littlewood ones, but a C not an O. Here’s a link to an image in Mod/Arg space, and the R code. (Goodness knows what WordPress will make of this.)

Cool stuff, Graham! But I’m confused by this:

Maybe there’s a typo there, but even turning on my automatic typo-correction facilities I can’t tell what class of polynomials you’re plotting roots of.

No typo that I can see, but I suppose it is a bit terse and cryptic – and influenced by my R code. I mean polynomials of the type described in Puzzle 2. I rewrote them like , where and the one between are either 0 or 1.

First n is sampled randomly between 3 and 42 (with a bias towards big n), then the coefficients filled in at random.

I see! You meant

but for some reason I read it as

which makes no sense at all.

Nobody said anything about puzzle 1 yet. Here is one approach that might work.

If a polynomial has all real roots then all its derivatives have all real roots too. Differentiate the polynomial until you are left with a low degree polynomial, e.g. a cubic. Its coefficients are non-negative. In which cases can this have all real roots?

I have a sketchy proof that the polynomials in the puzzle only have all roots real when m=1 or 2.

Put , where each is 0 or 1, and . Put .

The polynomial is then f(x) = g(x) – h(x). I get bounds on h(x) via various geometric sums.

First, f has exactly one positive root in [1,2]. h(x) is bigger than g(x) for . So f has at least one root in [1,2], and no positive ones outside. If y is the smallest root in [1,2], show all derivatives of f at y are positive. So there aren’t any more.

Next look at negative roots. f(0)=-1, and the quickest possibility for f(x) to get up to 0 as -x gets bigger is for when i is odd, otherwise 0. It follows that all negative roots lie in [-2,-.618]. Suppose f has m real roots. Then it has m-1 roots in this range, so f’ has at least m-2 roots in this range, f” has at least m-3, and so on, until you get to the (m-2)’th derivative, which has one root in the range. But this is a quadratic, and both its roots lie too close to 0 for . m=3 can be checked manually.

Let’s try the third paragraph again.

First, f has exactly one positive root in [1,2]. h(x) is bigger than g(x) for , and h(x) is smaller than g(x) for . So f has at least one root in [1,2], and no positive ones outside. If y is the smallest root in [1,2], show all derivatives of f at y are positive. So there aren’t any more.

If we do

notassume that masses are distinct, i.e., if we look at polynomial equations of the form where each isnonnegative(instead of being restricted to 0 or 1), then there are many possible choices for the numbers for which all the roots are real. For example, if we consider the polynomial with m-1 roots being exactly -1, and the mth root being a positive number M, then if M is large enough we obtain a (monic) polynomial of the type considered above.(This is easy to check: we look at the product and note that this is product between and a polynomial with positive coefficients, so it will have the desired coefficient signs for large enough M.)

More about John’s “Puzzle 1”:

On the other hand, note that the polynomials we obtain this way (i.e., by using the product for large M) correspond to a case where

allpositive integers less than M are masses ofsome“amino acids”, which is a very restrictive case.More generally, consider the equation with nonnegative coefficients (and with ). If we assume that we have two or more “missing masses”, i.e., two or more coefficients are zero (but still ), then I think we can use Descartes’s rule of signs https://en.wikipedia.org/wiki/Descartes%27_rule_of_signs to argue that

notall roots can be real (because in this case Descartes’s rule of signs implies that there is exactly one positive root, and also implies there are at most m-2 negative roots, so all together there are at most m-1 real roots, since zero is not a root).If there is exactly one “missing mass”, i.e., exactly one coefficient is zero (and still ), then again I think we can use Descartes’s rule of signs to argue that, if all roots are real, then the zero coefficient must be (because, like before, there is exactly one positive root, so we have to have m-1 negative roots; but this means that, according to Descartes’s rule of signs criterion for negative roots, we must have a sign change from to the next monomial after it. But, if , we fail to get this sign change; therefore ).

By the way, there do exist polynomials of this type for which all the roots are real. For example, all roots of are real (another example is ). But, this is

nottrue forallsuch polynomials, for example not all roots of are real.To summarize, I think that if all roots of are real (and all are nonnegative, and ) then must also be strictly positive.

Also, if we now restrict the question only to the equations of the form with and (i.e., the way Puzzle 1 was originally formulated by John), then, unless I made some mistake in the comment above, it follows that the only equations of this form that have only real roots must be of the type , or of the type .

But, if we multiply these equations by , we have a lot of cancellations, and we are left with an equation of the form fewnomial = 0, where “fewnomial” is a sparse polynomial, i.e., a polynomial with few nonzero coefficients (in this case, after multiplying by and many cancellations, we are left with at most 4 nonzero coefficients, for any m).

Then, after using Descartes’s rule of signs for the “fewnomial”, (and maybe also after checking a couple cases of degree 3 and 4 by hand) I think we can conclude that the only polynomials of the two types mentioned above for which all the roots are real are , , and .

I try to obtain a demonstration that some polynomial have only complex solution, but I obtain nothing.

I try to evaluate the gradient of the f(N) in the real field in the form:

so that if the infinitesimal displacement is in the complex field, then all the solution are complex.

The two solution for the displacements are:

so that if the discriminant is ever negative, then the solutions are all in the complex plane (only oscillations):

and this seem a dead end.

Forgive my ignorance, most of the math here goes way over my head but i am fairly good with the simpler stuff, not to mention basic chemistry.

I have noticed a couple of points that may open new areas of thought.

Point 1. In the first graph of peptide mass numbers the peaks are approx 14 daltons apart ( which i take means there are a greater occurrence of peptides with the peak mass numbers?) 14 daltons is the equivalent mass of a carbon atom link in a peptide chain as most such single links have 2 hydrogen atoms connected to the carbon: total mass = 12 + 1 + 1 = 14. adding a single such link to a peptide to form a new stable peptide would most likely ocur when one such CH2 connection is added or subtracted.

Point 2 14.28(57) recurring is exactly 100 divided by 7.

The reason i visited the site was to look for information concerning the Fibonacci sequence. Specifically why the number 5 (and 11) is so important to the sequence? In the above post you give a fibonacci formula involving the inverse square root of this number and i believe i have discovered (and not read somewhere yet) a puzzling 5 connection, namely that for all fibonacci numbers above number 6 in the sequence (5) multiplying the number by 11 and adding the number 5 places previously results in the fibonacci number 5 places subsequent to the original! Starting with 5( x 11) + 0 = 55 (place numbers 6, 1 and 11 respectively), then 8 (x 11) + 1 = 89, 13 (x 11) + 1 = 144, ….

love.

It’s interesting to think about what would happen if hydrogen came mostly in the form of deuterium so it had a mass number of 2 instead of 1. List the mass and valency numbers for the elements that make up most proteins

Deuterium: m = 2, v = 1

Carbon: m = 12, v = 4

Nitrogen: m = 14, v = 3

Oxygen: m = 16, v = 2

Sulphur: m = 32, v = 2

In every case it happens that m + 2v = 4 (mod 16)

What can be said about the mass of any molecule that has a tree structure, i.e. no circuits or double bonds? The number of atoms A is then equal to the number of bonds B plus one. A = B + 1

Adding up the congruence over all atoms gives M + 2V = 4A (mod16)

where M is the total mass, V is the total valency, but V = 2B

So M = 4A – 4B = 4 (mod 16)

In other words, if all hydrogen were deuterium then all tree-like molecules made from these elements would have a mass number that is 4 more than a multiple of 16.

In the real world hydrogen is not mostly deuterium and amino acids do have loops and double bonds so the situation is more complex. The difference between masses are reduced to being often approximately multiples of 14.28 instead of always exactly 16.

Wow, that’s an amazing alternate reality there!