Periodic Patterns in Peptide Masses

Gheorghe Craciun is a mathematician at the University of Wisconsin who recently proved the Global Attractor Conjecture, which had been the most famous conjecture in mathematical chemistry since 1974. This week he visited U. C. Riverside and gave a talk on this subject. But he also told me about something else—something quite remarkable.

The mystery

A peptide is basically a small protein: a chain made of fewer than 50 amino acids. If you plot the number of peptides of different masses found in various organisms, you see peculiar oscillations:

These oscillations have a period of about 14 daltons, where a ‘dalton’ is roughly the mass of a hydrogen atom—or more precisely, 1/12 the mass of a carbon atom.

Biologists had noticed these oscillations in databases of peptide masses. But they didn’t understand them.

Can you figure out what causes these oscillations?

It’s a math puzzle, actually.

Next I’ll give you the answer, so stop looking if you want to think about it first.

The solution

Almost all peptides are made of 20 different amino acids, which have different masses, which are almost integers. So, to a reasonably good approximation, the puzzle amounts to this: if you have 20 natural numbers m_1, ... , m_{20}, how many ways can you write any natural number N as a finite ordered sum of these numbers? Call it F(N) and graph it. It oscillates! Why?

(We count ordered sums because the amino acids are stuck together in a linear way to form a protein.)
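Before doing any theory, you can compute F(N) directly. Here is a minimal Python sketch (my own illustration, not anything from the paper), which counts the ordered sums by dynamic programming; the mass list at the bottom is just the toy example discussed below, not the real amino acid masses:

```python
def count_ordered_sums(masses, n_max):
    """F[N] = number of ways to write N as an ordered sum of the given masses."""
    F = [0] * (n_max + 1)
    F[0] = 1  # the empty sum
    for N in range(1, n_max + 1):
        F[N] = sum(F[N - m] for m in masses if m <= N)
    return F

# Toy example: masses 1 and 2 give the Fibonacci numbers (see below).
print(count_ordered_sums([1, 2], 6))   # [1, 1, 2, 3, 5, 8, 13]
```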

There’s a well-known way to write down a formula for F(N). It obeys a linear recurrence:

F(N) = F(N - m_1) + \cdots + F(N - m_{20})

and we can solve this using the ansatz

F(N) = x^N

Then the recurrence relation will hold if

x^N = x^{N - m_1} + x^{N - m_2} + \dots + x^{N - m_{20}}

for all N. But this is fairly easy to achieve! If m_{20} is the biggest mass, we just need this polynomial equation to hold:

x^{m_{20}} = x^{m_{20} - m_1} + x^{m_{20} - m_2} + \dots + 1

There will be a bunch of solutions, about m_{20} of them. (If there are repeated roots things get a bit more subtle, but let’s not worry about that.) To get the actual formula for F(N) we need to find the right linear combination of functions x^N where x ranges over all the roots. That takes some work. Craciun and his collaborator Shane Hubler did that work.

But we can get a pretty good understanding with a lot less work. In particular, the root x with the largest magnitude will make x^N grow the fastest.

If you haven’t thought about this sort of recurrence relation it’s good to look at the simplest case, where we just have two masses m_1 = 1, m_2 = 2. Then the numbers F(N) are the Fibonacci numbers. I hope you know this: the Nth Fibonacci number is the number of ways to write N as the sum of an ordered list of 1’s and 2’s!

1

1+1,   2

1+1+1,   1+2,   2+1

1+1+1+1,   1+1+2,   1+2+1,   2+1+1,   2+2

If I drew edges between these sums in the right way, forming a ‘family tree’, you’d see the connection to Fibonacci’s original rabbit puzzle.

In this example the recurrence gives the polynomial equation

x^2 = x + 1

and the root with largest magnitude is the golden ratio:

\Phi = 1.6180339...

The other root is

1 - \Phi = -0.6180339...

With a little more work you get an explicit formula for the Fibonacci numbers in terms of the golden ratio:

\displaystyle{ F(N) = \frac{1}{\sqrt{5}} \left( \Phi^{N+1} - (1-\Phi)^{N+1} \right) }
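As a quick sanity check (assuming the convention F(0) = 1 for the empty sum), this formula does reproduce the composition counts, e.g. F(4) = 5, matching the five sums listed above:

```python
from math import sqrt

Phi = (1 + sqrt(5)) / 2

def F_closed(N):
    # F(N) = (Phi^(N+1) - (1-Phi)^(N+1)) / sqrt(5), rounded to the nearest integer
    return round((Phi ** (N + 1) - (1 - Phi) ** (N + 1)) / sqrt(5))

print([F_closed(N) for N in range(7)])   # [1, 1, 2, 3, 5, 8, 13]
```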

But right now I’m more interested in the qualitative aspects! In this example both roots are real. The example from biology is different.

Puzzle 1. For which lists of natural numbers m_1 < \cdots < m_k are all the roots of

x^{m_k} = x^{m_k - m_1} + x^{m_k - m_2} + \cdots + 1

real?

I don’t know the answer. But apparently this kind of polynomial equation always has one root with the largest possible magnitude, which is real and has multiplicity one. I think it turns out that F(N) is asymptotically proportional to x^N where x is this root.

But in the case that’s relevant to biology, there’s also a pair of roots with the second largest magnitude, which are not real: they’re complex conjugates of each other. And these give rise to the oscillations!

For the masses of the 20 amino acids most common in life, the roots look like this:

The aqua root at right has the largest magnitude and gives the dominant contribution to the exponential growth of F(N). The red roots have the second largest magnitude. These give the main oscillations in F(N), which have period 14.28.

For the full story, read this:

• Shane Hubler and Gheorghe Craciun, Periodic patterns in distributions of peptide masses, BioSystems 109 (2012), 179–185.

Most of the pictures here are from this paper.

My main question is this:

Puzzle 2. Suppose we take many lists of natural numbers m_1 < \cdots < m_k and draw all the roots of the equations

x^{m_k} = x^{m_k - m_1} + x^{m_k - m_2} + \cdots + 1

What pattern do we get in the complex plane?

I suspect that this picture is an approximation to the answer you’d get to Puzzle 2:

If you stare carefully at this picture, you’ll see some patterns, and I’m guessing those are hints of something very beautiful.

Earlier on this blog we looked at roots of polynomials whose coefficients are all 1 or -1:

The beauty of roots.

The pattern is very nice, and it repays deep mathematical study. Here it is, drawn by Sam Derbyshire:


But now we’re looking at polynomials where the leading coefficient is 1 and all the rest are -1 or 0. How does that change things? A lot, it seems!

By the way, the 20 amino acids we commonly see in biology have masses ranging between 57 and 186. It’s not really true that all their masses are different. Here are their masses:

57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186

I pretended that none of the masses m_i are equal in Puzzle 2, and I left out the fact that only about 1/9th of the coefficients of our polynomial are nonzero. This may affect the picture you get!
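If you want to check the root picture yourself, here is a sketch (mine, not the code Hubler and Craciun used) that builds the characteristic polynomial from the residue masses above, with the repeated masses 113 and 128 contributing a coefficient of 2, and finds the dominant roots numerically. The polynomial has degree 186, so the small roots from numpy's companion-matrix solver may be a little noisy, but the dominant ones should be reliable; the growth factor should come out near 1.028 and the largest-magnitude complex pair near 1.0255, with period about 14.28, matching the numbers quoted in the paper and in the comments below:

```python
import numpy as np

masses = [57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115,
          128, 128, 129, 131, 137, 147, 156, 163, 186]

deg = max(masses)
# p(x) = x^deg - sum_i x^(deg - m_i); coefficient of x^(deg - m) sits at index m.
coeffs = np.zeros(deg + 1)
coeffs[0] = 1.0
for m in masses:
    coeffs[m] -= 1.0          # repeated masses give coefficient -2

roots = np.roots(coeffs)
real_roots = [r.real for r in roots if abs(r.imag) < 1e-6]
complex_roots = [r for r in roots if r.imag > 1e-6]
growth = max(real_roots)
top = max(complex_roots, key=abs)

print("dominant real root:", growth)                  # should be about 1.028
print("largest complex root: |x| =", abs(top),
      " period =", 2 * np.pi / abs(np.angle(top)))    # about 1.0255 and 14.28
```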

101 Responses to Periodic Patterns in Peptide Masses

  1. Greg Egan says:

    There’s a typo in the formula for F(N) for the Fibonacci numbers; apart from the lower-case n on the RHS, the second root should be 1-\phi.

  2. Greg Egan says:

    I was puzzled as to why there are so many more than 20 roots plotted if there are 20 amino acids, so I looked at the paper. It talks about adding 41 more masses because of post-translational changes to the proteins, but I think there are more than 61 roots in the plot, so there must be something else going on that adds a few more masses.

    • Greg Egan says:

      Oh, sorry, that was a dumb comment! I got confused between the number of terms in the polynomial and the degree.

    • John Baez says:

      Yes, I made that mistake too at one point.

      I hadn’t noticed about the ‘post-translational changes’. I told Craciun about selenocysteine, an unusual amino acid that contains selenium. It’s only used in certain organisms, and it’s coded for by a codon that’s normally one of the stop codons. It would be really cool if the peptides found in one of these other organisms showed oscillations with a notably different periodicity.

      Quoth Wikipedia:

      Although it is found in the three domains of life, it is not universal in all organisms. Unlike other amino acids present in biological proteins, selenocysteine is not coded for directly in the genetic code. Instead, it is encoded in a special way by a UGA codon, which is normally a stop codon. Such a mechanism is called translational recoding and its efficiency depends on the selenoprotein being synthesized and on translation initiation factors. When cells are grown in the absence of selenium, translation of selenoproteins terminates at the UGA codon, resulting in a truncated, nonfunctional enzyme. The UGA codon is made to encode selenocysteine by the presence of a selenocysteine insertion sequence (SECIS) in the mRNA. The SECIS element is defined by characteristic nucleotide sequences and secondary structure base-pairing patterns. In bacteria, the SECIS element is typically located immediately following the UGA codon within the reading frame for the selenoprotein. In Archaea and in eukaryotes, the SECIS element is in the 3′ untranslated region (3′ UTR) of the mRNA, and can direct multiple UGA codons to encode selenocysteine residues.

  3. John Baez says:

    Here’s another cool puzzle.

    Greg Egan worked out the roots of 200 polynomials, where in each case he randomly chose 20 masses between 50 and 200, and threw out any repeated masses. They look like this:

    Compared to the pictures in The Beauty of Roots, this looks pretty chaotic and boring except for the repulsion of the real axis, the strip of positive real roots with large magnitude, and the ‘bands’ or ‘tendrils’ of complex roots with large positive real part and small imaginary part.

    But, these few interesting visible features are connected to the peptide puzzle! The roots with largest magnitude contribute most to the asymptotic growth rate of F(N), and the complex roots with largest magnitude—the tips of the ‘tendrils’—will give the visible oscillations. So, let’s zoom in on those:

    Puzzle 3. Suppose we randomly chose 20 natural numbers

    50 < m_1 < \cdots < m_{20} < 200

    and work out the roots of the polynomial

    x^{m_{20}} = x^{m_{20} - m_1} + x^{m_{20} - m_2} + \cdots + 1

    What, on average, is the imaginary part of the complex roots with largest real part? And what is its standard deviation? I’m hoping that this quantity, say d, has a small standard deviation and mean \langle d \rangle with

    2\pi/\langle d \rangle \; \approx \; 14.28

    If that’s true, the periodicity of the peptide masses is a robust phenomenon, largely independent of the precise masses of the peptides!

    Puzzle 4. How do things change as we change the numbers 20, 50 and 200 in the above problem?

    • Greg Egan says:

      I think you need to pick the complex root with the largest magnitude, rather than the largest real part; if not, the value is a long way off.

      Using 10,000 random samples, chosen in the same way as I used for the plot (pick 20 numbers between 50 and 200 inclusive, and discard any repetition – so there might be less than 20 numbers after that step), I get:

      \langle d \rangle \; \approx \; 0.630314

      Standard deviation of d \; \approx \; 0.32496

      2\pi/\langle d \rangle \; \approx \; 9.96835

    • Greg Egan says:

      If I change the upper limit on the mass from 200 to 500, while keeping the lower limit at 50 and the number of masses at 20, that doesn’t alter the average of the periods very much. For 1,000 random samples, I get:

      \langle d \rangle \; \approx \; 0.651247

      with a standard deviation of 0.305363, and:

      2\pi/\langle d \rangle \; \approx \; 9.64792

      If I go back to the original mass limits of 50 and 200, but double the number of masses from 20 to 40, then for 1,000 random samples, I get:

      \langle d \rangle \; \approx \; 0.566247

      with a standard deviation of 0.354997, and:

      2\pi/\langle d \rangle \; \approx \; 11.0962

    • John Baez says:

      Thanks for the calculations!

      I think you need to pick the complex root with the largest magnitude, rather than the largest real part; if not, the value is a long way off.

      The problem is, if the real part of a root is small, that root gives oscillations that don’t grow very fast, or even decay. That wouldn’t explain the dramatic growing oscillations that we see. What we see is likely to be dominated by the roots with the largest real parts.

      More precisely, we have

      F(N) = a_1 x_1^N + \cdots + a_k x_k^N

      where x_1, \dots, x_k are the roots and a_1, \dots, a_k are some coefficients. The asymptotic behavior of F(N) as N \to \infty is dominated by the roots x_i with a_i \ne 0 for which \mathrm{Re}(x_i) is the largest. So, roots with a_i = 0 would throw off my simplified claim above.

    • Greg Egan says:

      OK. So maybe I need to calculate something more elaborate.

      Just to be clear, for the specific set of roots given in the diagram from the paper, the two red diamonds that Craciun associates with the period are the non-real roots with the largest magnitude. There are other non-real roots with larger real parts (the two small diamonds with very small phases, around plus or minus 3 degrees, and magnitude around 1.014).

      • John Baez says:

        Hmm. I’m finding this whole phase-magnitude way of plotting a bit confusing. What matters in my mind (and I could be wrong!) are the real and imaginary parts of the roots x_i, since these control the growth rate and frequency of the corresponding functions x_i^N.

        So, I can imagine asking you to draw a picture in terms of real and imaginary parts. But even more useful might be these: a picture of Craciun’s specific roots overlaid on your picture of roots of randomly chosen polynomials. That would let us see how special the biological example is—or not.

      • Greg Egan says:

        Surely x^N = \exp(N \log(x)) has its growth rate controlled by the real part of \log(x), which depends on the magnitude of x, rather than the real part of x itself?

      • John Baez says:

        Now we’re even on dumb comments for the week. Thanks for straightening me out!

    • John Baez says:

      Greg drew a phase/magnitude plot of the roots for 1000 random examples (where in each case he randomly chose 20 masses between 50 and 200, and threw out any repeated masses), together with the roots for the polynomial he got from applying the same construction to the 20 amino acid masses (which include some repeated masses), shown in red:

      Here’s a closeup of the most important part:

      Unfortunately for my dreams, the complex roots that control the dominant oscillations in the peptide mass distribution do not lie on the “first tendril”, so these biological oscillations are not “controlled by math” in a simple way. They seem to lie near the edge of the “fourth tendril”—and the fourth tendril, unlike the first three, is barely visible.

      So, the oscillations we see in nature are slower than you might get with randomly chosen amino acid masses.

      (There should be some fairly simple explanation of these tendrils, but I don’t see it.)

    • John Baez says:

      I’ve corrected my confusion about real part vs. magnitude and imaginary part vs. phase in the main post and some comments. But here I should really issue a separate, corrected, version of Puzzles 3 and 4. If we’re trying to understand the period of the oscillations in the function

      x^N

      we need to look, not at the imaginary part of the root x, but its phase, or argument. If

      x = |x| e^{i \theta}

      then

      x^N = |x|^N e^{i N \theta}

      so the period of the oscillations in this function is 2\pi / \theta. So:

      Puzzle 3′. Suppose we randomly chose 20 natural numbers

      50 < m_1 < \cdots < m_{20} < 200

      and work out the roots of the polynomial

      x^{m_{20}} = x^{m_{20} - m_1} + x^{m_{20} - m_2} + \cdots + 1

      What, on average, is the phase of the complex roots with largest magnitude? And what is its standard deviation? I’m hoping that this quantity, say \theta, has a small standard deviation and mean \langle \theta \rangle with

      2\pi/\langle \theta \rangle \; \approx \; 14.28

      If that’s true, the periodicity of the peptide masses is a robust phenomenon, largely independent of the precise masses of the peptides!

      (I’m not at all optimistic that this is true, now that Greg has prepared an image comparing the average situation to the situation that arises with actual peptides. But this at least is the correct way of asking the question!)
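      Here is a Monte Carlo sketch of Puzzle 3′, assuming the same sampling Greg used (20 masses drawn from 50 to 200, repeats discarded). Going by his runs above, I would expect 2\pi/\langle \theta \rangle to come out somewhere near 10 rather than 14.28:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      def dominant_phase(masses):
          """Phase of the largest-magnitude non-real root of the characteristic polynomial."""
          deg = max(masses)
          coeffs = np.zeros(deg + 1)
          coeffs[0] = 1.0
          for m in masses:
              coeffs[m] -= 1.0
          roots = np.roots(coeffs)
          complex_roots = roots[roots.imag > 1e-9]
          top = complex_roots[np.argmax(np.abs(complex_roots))]
          return abs(np.angle(top))

      phases = []
      for _ in range(200):   # 200 trials; increase for better statistics
          masses = sorted(set(rng.integers(50, 201, size=20).tolist()))
          phases.append(dominant_phase(masses))

      phases = np.array(phases)
      print("mean phase:", phases.mean(), " std:", phases.std())
      print("2*pi / mean phase:", 2 * np.pi / phases.mean())
      ```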

  4. Graham Jones says:

    By the way, the 20 amino acids we commonly see in biology have masses ranging between 57 and 186. It’s not really true that all their masses are different. Here are their masses:
    57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186

    When amino acids join to make peptides, a water molecule is lost at each bond, so for a peptide made from n amino acids, subtract (n-1)*18.

    • John Baez says:

      Hmm, that’s important! I wonder if Hubler and Craciun discuss that in their paper. It’s a big deal, yet easy to deal with. You can just subtract 18 from the mass of each peptide, then do all the math I described to work out a formula for F(N), then add 18. The last step would not affect the period of F(N), so if you only care about the period, you can skip it.

  5. 14 is the rounded atomic mass of a CH2 unit. You’re overthinking this….

    • John Baez says:

      The paper mentions that. However, they say the fit of 14.28 is noticeably better, and that’s what they get from their more detailed analysis.

      Of course, the two explanations are not necessarily mutually exclusive.

      And of course, amino acids involve a lot more than CH2’s so it’s not obvious that CH2’s explain the truly dramatic oscillations that we see! One would need to give a mathematical argument to prove this.

    • Greg Egan says:

      I think the paper’s comments about the structural subunits of the amino acids explain both the fact that the period is close to but not exactly 14, and also the fact that the period is distinctly non-generic (i.e. quite different from what random masses in the same range produce):

      We should also point out that a period around 14 Da is reasonable since the amino acid residues can all be constructed (with rearrangement) from five groupings of elements, a kind of elemental basis: C, CH2, NH, O, and S, with nominal masses of 12, 14, 15, 16, and 32, respectively. Since S is only in two of the twenty amino acids, it is reasonable to suppose that the other four constituent components would have an effect in a periodic pattern arising from objects made up of this basis.

    • Greg Egan says:

      Hmm, I guess I should add that the mere fact that all the amino acid masses can be expressed as sums of elements of the set {12, 14, 15, 16, 32} is actually completely non-informative, because this is true of every integer from 50 to 200!

      So my previous comment was a bit of a stretch. There must be something else about the composition of amino acids that makes the masses non-generic … and of course there are probably fairly simple reasons to do with valences why you can’t stick any old numbers of these units (C, CH2, NH, O, and S) together, let alone stick them together and get something that counts as an amino acid.

      • John Baez says:

        About this magic number 14: if it’s really important that CH2 has mass 14, then this might show up in lots of amino acids having masses whose differences are multiples of 14. If I were a biochemist I’d just know whether lots of amino acids differ by one or more CH2’s, but since I’m a mathematician I imagine taking the amino acid masses

        57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186

        computing their differences, and seeing if lots differ by multiples of 14. Let’s see:

        71-57 = 14
        113-99 = 14
        115-101 = 14
        129-115 = 14

        That seems like a lot. I’m too lazy to look for differences that equal 28 or other multiples of 14.

        One thing we could do is take 20 numbers between 50 and 100 with a lot that differ by 14, and generate a polynomial from those numbers in the usual way, and look at its roots. Claim: the complex root with largest magnitude has phase \theta such that

        2 \pi / \theta \approx 14
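        Here is a quick sketch of that difference check; it just lists every pairwise difference of the distinct residue masses that happens to be a multiple of 14, so it catches the 28s and larger multiples too:

        ```python
        masses = sorted({57, 71, 87, 97, 99, 101, 103, 113, 114, 115,
                         128, 129, 131, 137, 147, 156, 163, 186})

        for i, a in enumerate(masses):
            for b in masses[i + 1:]:
                d = b - a
                if d % 14 == 0:
                    print(f"{b} - {a} = {d}  ({d // 14} x 14)")
        ```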

        • Graham Jones says:

          Multiples of 14.28 are:
          57.12 71.40 85.68 99.96 114.24 128.52 142.80 157.08 171.36 185.64
          That’s a good match for most of the amino acid masses, until you subtract 18 from them.

  6. It’s so weird in how subtle ways combinatorics interact with our lives.

  7. Philip Gibbs says:

    The distribution of the masses of the amino acids might not be well modeled by a flat distribution between 50 and 200. It looks more like a normal distribution centred on about 115 with a small standard deviation (about 20?) Would that make a significant difference?

    • John Baez says:

      Good question! This might tend to push the largest-magnitude complex root onto the fourth tendril, which is what we see in nature. But I have no intuition for whether that’s true, because I don’t understand those tendrils.

      • Philip Gibbs says:

        It might be worth solving the case where the amino acid masses are all evenly separated e.g. 48,56,64,…,200. In that case the polynomial takes a simplified form if multiplied by x^8-1 with just four terms, and it is a polynomial in x^8. Perhaps that would give a distribution with clear tendrils.

    • Philip Gibbs says:

      Actually more like mean = 118, stdev = 31, but the mean should not affect the period right?

      • Graham Jones says:

        The mean can affect the strength of the periodicity. From 48, 56, 64, …, 200 you can only make multiples of 8, but from 51, 59, 67, …, 203 you can make many more numbers.

    • John Baez says:

      Let’s look at how having many similarly spaced amino acids affects the mass distribution of peptides. Suppose we have numbers

      a, \; a + b, \; a + 2b, \; \dots, \; a + k b

      How many ways are there to write N as an ordered sum of these?

      I’m too busy to dive in and work that out, so let’s start out with an easy example and hope that helps us with harder ones. Suppose we have numbers

      1, 2, \dots, k

      Let F(N) be the number of ordered sums of these that add up to N.

      Then

      F(N) = F(N-1) + \cdots + F(N-k)

      so if we take as our ansatz

      F(N) = x^N

      we get

      x^N = x^{N-1} + \cdots + x^{N - k}

      or

      x^k = x^{k-1} + \cdots + x + 1

      But

      \displaystyle{ x^{k-1} + \cdots + x + 1 = \frac{x^k - 1}{x - 1} }

      so our equation becomes

      \displaystyle{ x^k = \frac{x^k - 1}{x - 1} }

      or

      x^k (x - 1) = x^k - 1

      What are the solutions like, especially those with the largest magnitude?

      Hmm, I’m too busy to even tackle this in detail right now! But I’ll leave this here and maybe return to it later. It could be fun to study problems of this sort.

      (For k = 5, the solution with greatest magnitude seems to be x \approx 1.96. For k = 10, it seems to be x \approx 1.999. So for large k this root seems to be approaching 2. But this isn’t the main thing I’m interested in.)
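      Here is a small numerical check of those two values. The equation x^k(x-1) = x^k - 1 rearranges to x^{k+1} - 2x^k + 1 = 0, which always has the uninteresting root x = 1; the largest-magnitude root is the one approaching 2:

      ```python
      import numpy as np

      for k in (5, 10):
          # x^(k+1) - 2 x^k + 1 = 0, coefficients from the highest power down
          coeffs = np.zeros(k + 2)
          coeffs[0] = 1.0
          coeffs[1] = -2.0
          coeffs[-1] = 1.0
          top = max(np.roots(coeffs), key=abs)
          print(k, top)   # roughly 1.966 for k = 5 and 1.9990 for k = 10
      ```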

      • Scott Hotton says:

        Following these calculations where a and b are not 1, I get for the characteristic polynomial

        (x^b)^k (x^b -1 ) - ((x^b)^k - 1)

        The a drops out so unless I am mistaken the loss of water molecules in peptide bond formation does not matter when the masses are equally spaced.

        The polynomial is zero if x^b = 1. In the b=14 case the fourteenth roots of unity are roots of the characteristic polynomial. And one of the fourteenth roots of unity is

        \exp(i 2\pi/14)

        Numerically I find that for k=19 the root of

        x^k (x -1 ) - (x^k - 1)

        with the largest magnitude is nearly 2. The positive fourteenth root of 2 is about 1.05. So

        1.05 \exp(i 2\pi/14)

        approximates a root of

        (x^{14})^{19} (x^{14} -1 ) - ((x^{14})^{19} - 1)

        and it ties for having the maximum magnitude of all the roots.

      • Greg Egan says:

        I think the equation for generic a,b is:

        \displaystyle{ x^{a+(k-1)b} = \frac{x^{k b} - 1}{x^b - 1}}

        or equivalently:

        \displaystyle{ x^{a+(k-1)b} (x^b - 1) - (x^{k b} - 1) = 0}

        Of course if x is a bth root of unity, it will still be a solution regardless of a.

        I think a does affect the other solutions, though.

      • Greg Egan says:

        Now that I check more carefully, it’s not safe to multiply through by x^b - 1 when x is a bth root of unity, and in fact those roots don’t solve the equation.

        But my numerical calculations seem to agree with the idea that the largest-magnitude roots are multiples of the bth roots of unity.

      • Greg Egan says:

        When a is divisible by b (as in the example a=56, b=14 that I described in another comment), then you get b solutions that are some constant real multiple of all the bth roots of unity. But when a is not divisible by b, things aren’t so symmetrical.

        • Scott Hotton says:

          Using a = 56, b=14, k=20 and setting y = x^b I got

          y^{23} (y - 1) - (y^{20}-1) = 0

          It looks like the root with the largest magnitude is

          y \approx 1.37995

          This would make the magnitude of x about 1.0233 which looks close to the magnitude of the red diamonds in Hubler and Craciun’s figure. Perhaps too close given that the amino acid masses are not equally spaced.

        • Greg Egan says:

          I agree that for a=56, b=14, k=20 there’s a set of 14 roots that all have magnitudes of about 1.0233.

          For the real data, I calculate a magnitude for the largest complex root of 1.02552. I don’t know to what extent that’s just a coincidence, or whether there’s some correlation here that doesn’t depend so much on the details of the mass set as the fact that the periods are very close.
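          Here is a small sketch checking the a = 56, b = 14, k = 20 case numerically via the substitution y = x^{14}; it should reproduce the numbers quoted above (largest real root y near 1.37995, hence |x| near 1.0233):

          ```python
          import numpy as np

          # y^23 (y - 1) - (y^20 - 1) = y^24 - y^23 - y^20 + 1 = 0
          coeffs = np.zeros(25)        # coefficients from y^24 down to the constant term
          coeffs[0] = 1.0              # y^24
          coeffs[1] = -1.0             # -y^23
          coeffs[4] = -1.0             # -y^20
          coeffs[24] = 1.0             # +1
          y_roots = np.roots(coeffs)
          y_max = max(r.real for r in y_roots if abs(r.imag) < 1e-9 and r.real > 0)
          print("largest real y:", y_max)              # about 1.37995
          print("|x| = y^(1/14):", y_max ** (1 / 14))  # about 1.0233
          ```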

  8. Greg Egan says:

    If you use masses that lie in a perfect arithmetic sequence, I don’t think you get periodic behaviour in F(N).

    The first plot in the image above shows what would be the period from the largest-magnitude complex root for the series of masses with a=1, b=1, but the next plot shows F(N) itself for the series with 5 masses, and like the Fibonacci numbers it doesn’t oscillate.

    The third and fourth plots show what you get with a perfect arithmetic sequence with a=56, b=14, and what happens if you alternately add and subtract 1 from the masses.

    So this seems to suggest that exact spacing gives non-generic (and perhaps always non-oscillatory) results, but adding some variation to an exactly spaced set of masses can give an oscillation that comes close to the original spacing.

    • Greg Egan says:

      For the exact series a=1, b=1, the largest magnitude of any non-real root seems to approach 1 asymptotically from below as you add more masses, while as JB noted the real root seems to approach 2 from below. So for large N, the exponential behaviour will always dominate.

      For the exact series a=56, b=14, it turns out that there are 14 roots with magnitudes exactly equal to the magnitude of the largest real root (which is greater than 1). So that plot where I showed the period associated with the largest complex root seeming to jump back and forth between different values was an artifact of trying to pick just one root, when the “dominant period” in this case is actually multi-valued.

      And in fact these roots all lie equally spaced around a circle in the complex plane. So one of them does correspond to a period of exactly 14, but there are equal-magnitude roots with periods 14/2=7, 14/3, 14/4=7/2, etc.

      I don’t know what actually happens with F(N) in this case, because I haven’t been able to obtain enough values by combinatorial methods to solve for the coefficients of the terms associated with all the roots.

      However, for the case where you perturb the exact series by adding or subtracting 1 from the masses, I can solve for everything, and you do get clear periodic behaviour.

    • John Baez says:

      Thanks! By the way, my original claim that the Fibonacci numbers show no oscillation was based on my silly confusion about x^N versus \exp(N x). If x is negative then the former oscillates even though the latter does not. The former is what matters for our problem. For the Fibonacci numbers the root

      x = 1 - \Phi \approx -0.6180339

      is negative so we do get an oscillatory effect:

      \displaystyle{ F(N) = \frac{1}{\sqrt{5}} \left( \Phi^{N+1} - (1-\Phi)^{N+1} \right) }

      However, the oscillations are exponentially damped since |1 - \Phi| < 1. The oscillations can only be seen if we subtract them off, leaving numbers

      \displaystyle{\frac{1}{\sqrt{5}} \Phi^{N+1} }

      which go like this:

      0.723
      1.171
      1.894
      3.065
      4.960
      8.025

      The Fibonacci numbers are alternately larger and smaller than these, so we can say they oscillate, in a sense!

      However, these oscillations are vastly milder than the growing oscillations we see when there’s a negative or complex x with |x| > 1.

      (I imagine you know all this stuff, but other people may not have thought about it very much, and I need to correct any lingering confusion I sowed.)

  9. Philip Gibbs says:

    Greg Egan’s analysis above may suggest that smooth distributions don’t capture the right effect, but let’s continue with that for now.

    Suppose the amino acid masses m have a smooth distribution \rho(m-\mu) about some mean value \mu that we take to be an integer. Rewrite the equation for the roots as

    x^{\mu} = x^{\mu - m_1} + \cdots + x^{\mu - m_A}

    where A = 20 is the number of amino acids.

    If x is near one the smooth distribution can be used to write this as an integral.

    x^{\mu} = A \int \rho(m-\mu) x^{\mu-m} dm

    x^{\mu} = A \int \rho(m') x^{-m'} dm'

    Take x = \exp{\Delta} with \Delta small

    \exp{\Delta \mu} = A \int \rho(m') \exp{-\Delta m'} dm'

    So you get an equation to solve involving a Laplace transform.

    • Philip Gibbs says:

      I want to try this for a normal distribution rather than a flat distribution because the result is more tractable and because I think it models the actual numbers better.

      \rho(m) = \frac{1}{\sqrt{2\pi} \sigma} \exp(-\frac{m^2}{2\sigma^2} )

      \exp(\mu \Delta) = \frac{A}{\sqrt{2\pi} \sigma} \int \exp(-\frac{m^2}{2\sigma^2} - m \Delta) dm

      The integral can be done giving

      \exp(\mu \Delta) = A \exp(-\frac{\sigma^2 \Delta^2}{2})

      This reduces to a series of quadratic equations to solve for \Delta:

      \frac{\sigma^2 \Delta^2}{2} + \mu \Delta + 2\pi n i - \ln(A) = 0

      for integers n

      In the real case A = 20 , \mu = 118 and \sigma = 31

    • Philip Gibbs says:

      The real roots (n = 0 ) are
      \Delta = -0.26877 or \Delta = 0.023196

      Taking the second dominant root the growth factor is

      x = \exp(\Delta) = 1.0234671

      I think that is quite close.

    • Philip Gibbs says:

      Taking the dominant root for n = 1

      \Delta = 0.029388 - 0.042964 i

      This has a larger real part than the real root and an imaginary part that is far too small. So some random element might be needed here too, and that is too hard to do analytically.
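      For anyone who wants to reproduce these numbers, here is a sketch that just solves the quadratic above for n = 0 and n = 1, with A = 20, \mu = 118, \sigma = 31; it should give back the roots quoted in the last two comments:

      ```python
      import numpy as np

      A, mu, sigma = 20, 118.0, 31.0
      for n in (0, 1):
          # (sigma^2/2) Delta^2 + mu Delta + (2 pi n i - ln A) = 0
          deltas = np.roots([sigma**2 / 2, mu, 2j * np.pi * n - np.log(A)])
          print("n =", n, " Delta =", deltas, " growth factors:", np.exp(deltas))
          # n = 0 should give -0.26877 and 0.023196; n = 1 should include 0.029388 - 0.042964i
      ```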

  10. Greg Egan says:

    Suppose you start with 20 masses in an arithmetic series, and then randomly add or subtract 1, or do nothing, to each mass. The periods implied by the largest complex root, for 200 random trials of that process, are shown in the histograms above. In each case, the arithmetic series starts at a=150, and the trials were done with b=14, 22, 34.

    The results for the first two cases are tightly clustered a bit below the original spacing of the masses. But for reasons I don’t understand, things go weird for the third case, where b=34. I’m using very high precision arithmetic (1000 digits), so it’s unlikely that the numerical approximation is misidentifying the correct root, and if there are multiple complex roots that share the same magnitude, I pick the one with the smallest argument, which ought to yield the highest possible period in each case. But for b=34, none of the mass sets yield a period greater than 16.85.

    Of course, these periods are all based on the assumption that these particular roots will have non-zero coefficients in the final formula for F(N). But generically, you’d expect that to be true.

    Anyway, it does seem clear that there are cases where the dominant period in F(N) will be set by the spacing between the masses.

    These examples don’t readily translate to the amino acid masses, though, as the spacings between consecutive masses for them are: 14, 16, 10, 2, 2, 2, 10, 0, 1, 1, 13, 0, 1, 2, 6, 10, 9, 7, 23.

  11. Greg Egan says:

    If you start out with masses in a perfect arithmetic sequence, and if the starting value of the sequence, a, is divisible by the increment, b, then the characteristic polynomial will have b roots that are all of equal magnitude, and are all multiples of bth roots of unity. That’s what I see numerically, and I think Scott Hotton has shown that analytically.

    If you perturb the sequence by adding or subtracting 1 to the masses at random, you break that perfect symmetry. What seems to happen at first is that the positive real root becomes the largest, and the second-largest roots are a conjugate pair with phases of approximately \pm 2\pi / b.

    But as you keep perturbing the masses, the b roots that were originally arranged symmetrically on a perfect circle move around, and if you make enough changes, eventually you can “dethrone” the conjugate pair with the smallest phase, and one of the other pairs of roots becomes larger.

    In the root plots below, the largest root(s) are shown in red, and the second-largest are shown in blue. Eventually, for both b=14 and b=34, the lowest-phase roots get beaten by some other pair of roots, and the phase is abruptly multiplied by some (roughly integer) factor.

    I’m still not entirely sure why the randomly perturbed cases for b=34 that I plotted in the histogram in an earlier comment never had the lowest-phase roots being the largest, but it might be due to the relationship between a and b; you’re only starting from a perfect circle when a is divisible by b, and how much things are distorted by non-divisibility must depend on the particular numbers.

  12. John Baez says:

    On the basis of what Greg has done, I’m willing to guess that if we have a list of integer masses 0 < m_1 \le \cdots \le m_k and a large fraction of them have differences equal to some number \Delta, and we plot a histogram of how many ways each natural number N can be written as an ordered sum of these masses, we’ll usually get oscillations with period close to \Delta. But sometimes funny things happen!

    It might be fun, Greg, to take one of the test cases you examined and actually work out F(N), the number of ways N can be written as an ordered sum of the masses on one of your lists.

    Instead of writing

    F(N) = \sum_i a_i x_i^N

    and trying to figure out the coefficient a_i for each root x_i of the relevant polynomial, a less tiring approach might be to actually generate all the ordered sums up to some number N_{\textrm{max}} and then count how many of these sums give N for each N \le N_{\mathrm{max}}. (This could be tiring for your computer, but I hope not for you.)

    I’m wondering, for example, if the second-largest roots always manifest themselves as a noticeable oscillatory pattern in F(N). Conceivably they might not, if the coefficient a_i of x_i^N for those roots x_i was very small. Unless that coefficient were zero, as N \to \infty those terms a_i x_i^N would eventually beat the terms coming from other complex roots… but by then they might be completely overshadowed by the term coming from the real root.

    A case where another pair of complex roots dethrones the lowest-phase pair would be fun to see. I don’t get what’s really happening there.

  13. John Baez says:

    By coincidence yesterday I ran into 3 undergrad math majors at UCR who did a project on numerical semigroups. A numerical semigroup is just a set of natural numbers that’s closed under addition and has finite complement. (If it doesn’t have finite complement, every number in it must be a multiple of some fixed natural number k > 1, so we can divide everything in it by k and simplify things.)

    This is relevant because if we take any set of natural numbers

    m_1, \dots, m_k

    and keep adding them to our hearts’ content, the numbers we get form a numerical semigroup, at least if all the m_i aren’t divisible by some number k > 1.

    The Frobenius number of a numerical semigroup is the largest integer that’s not in it.

    A numerical semigroup is called irreducible if it’s not the intersection of two numerical semigroups that properly contain it.

    We can ask: how many numerical semigroups are there with Frobenius number n? And this turns out to grow in an interesting way, which has bumps with period 6:

    This picture is from the students’ paper, where they prove the lower and upper bound shown:

    • Clarisse Bonnand, Reid Booth, Carina Kaainoa, and Ethan Rooke, Bounds on the number of irreducible semigroups of fixed Frobenius number.

    Needless to say, this caught my eye both because the peptide problem is related and because it shows periodic bumps!

  14. Greg Egan says:

    Here’s a simple example of period-doubling that I was able to solve completely:

    • Greg Egan says:

      To add a bit more detail about what happens at the level of the roots: the blue curve has a dominant complex root of magnitude 1.10551 and period 5.50192, while the orange curve has a dominant complex root of magnitude 1.08076 and period 2.2945; it also has a root of magnitude 1.05422 and period 5.4668. Both have non-zero coefficients in the solution, and in fact the larger-magnitude root (with the shorter period) also has a larger coefficient. That surprised me, because it seems to take a while to clearly dominate the oscillations, but I guess at low values there are always going to be complicated things going on with all the other roots.
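      Here is a sketch of that comparison, computing the largest-magnitude complex root for both mass lists; it should reproduce the magnitudes and periods quoted above:

      ```python
      import numpy as np

      def dominant_complex_root(masses):
          deg = max(masses)
          coeffs = np.zeros(deg + 1)
          coeffs[0] = 1.0
          for m in masses:
              coeffs[m] -= 1.0
          roots = np.roots(coeffs)
          complex_roots = roots[roots.imag > 1e-9]
          return complex_roots[np.argmax(np.abs(complex_roots))]

      for masses in ([6, 11, 16, 21, 26], [7, 11, 16, 21, 26]):
          r = dominant_complex_root(masses)
          print(masses, " |x| =", abs(r), " period =", 2 * np.pi / abs(np.angle(r)))
          # expected: about 1.10551 / 5.50192 and 1.08076 / 2.2945 respectively
      ```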

    • John Baez says:

      Wow, great! I was confused about ‘period-doubling’ versus ‘frequency-doubling’, but I think I’ve got it straightened out now. Period-doubling is of course a famous thing in dynamical systems, but frequency-doubling, or in other words period-halving, also shows up, e.g. when you blow hard on a flute and the note jumps up an octave.

      Of course you might argue it’s all a matter of convention, but the question is whether the ‘unexpected’ situation has period twice or half the ‘expected’ period. And here it’s half… which I find quite remarkable from one point of view.

      You take the numbers 6, 11, 16, 21, 26, each of which is 5 more than the last, and you count how many ways you can get N as an ordered sum of these. Unsurprisingly, perhaps, the count oscillates with period 5… well, okay, actually 5.50192.

      Then you change the number 6 to 7, and repeat the count, and now the count oscillates with period 5/2! Well, okay, actually 2.2945. But the period has halved, roughly.

      If you’d asked me before you started doing all these calculations, I might have guessed that the period could double, because it might take more time for something to come into alignment. (Not a very clear reason.)

      But of course once you showed that the complex root with the smallest phase could be ‘dethroned’ by one with roughly twice that phase, it became clear that this funny phenomenon should actually cause ‘frequency-doubling’, hence ‘period-halving’.

      But I had trouble believing that until I saw an example! I still have no intuition for why the counts start wiggling twice as fast when you change the 6 to a 7.

      Anyway, it’s great.

      I also find it cute that while the orange curve goes negative, it has the good sense not to do so at integer values.

      And I also find it interesting how the “period 5… well, okay, actually 5.50192” mimics the story in the biological case, where an easy argument suggests that the count should oscillate with period 14, but in fact the period is 14.28.

    • Greg Egan says:

      I knew as soon as I’d posted that I should have said “period-halving” rather than “period-doubling”, but by then it was too late.

      • John Baez says:

        It’s okay, I’d been pondering period-doubling versus period-halving in this problem so you gave me an excuse to vent!

        I’m still trying to figure out how to pose the question “what’s really going on in this period-halving effect?” in a useful way. I think it’s a wonderful and curious thing.

      • John Baez says:

        Here’s one idea. One can imagine wondering not just how many ways there are to write N as an ordered sum of numbers m_1, \dots, m_k, but also more about how we can write it as a sum. Could it be that when the period halves, we get pulses of two different kinds, in which some average characteristic of the ordered sums flips and flops?

        For example, does the average fraction of m_i’s appearing in the sum stay roughly constant, or does it vary significantly with N? This is just the simplest characteristic I can think of.

        (Even if it stays constant, what’s the constant? If I take a number N and write it as an ordered sum of numbers m_1, \dots, m_k in all possible ways, do each of these numbers tend on average to show up in equal proportion, or do smaller ones show up more? If we were talking about unordered sums, I should be able to answer this using standard techniques of statistical mechanics, sort of like how we figure out the energy distribution of atoms in a gas.)

        Guesses aside, ultimately this question is about how the expression

        F(N) = \sum_j a_j x_j^N

        ‘knows’ how many ways we can write N as an ordered sum of numbers m_1, \dots, m_k, or in other words, how the roots x_j and coefficients a_j encode what is going on. Somehow they encode everything about what’s going on. But the meaning of the individual oscillating functions x_j^N is obscure, since each one is affected by all the numbers m_i. It’s a bit like a Fourier transform or hologram.

        More concretely, suppose we’re trying to write N as an ordered sum of the numbers 7, 11, 16, 21, 26. Then we can write N+5 as an ordered sum in a bunch of ways. Many of these ways come from taking an 11 in a sum that gives N and replacing it with a 16, or taking a 16 and replacing it with a 21, or taking a 21 and replacing it with a 26. So if N tended to have a lot of 11’s in it, maybe N + 5 will tend to have a lot of 16’s in it, and so on.

        I guess it’s again wise to start with the very simple cases, where the m_i are in an arithmetic progression.

  15. Graham Jones says:

    I had a look at Hubler and Craciun’s paper, and at a few other things. I learned:

    The masses 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186 are for residues, i.e., without an H2O. So the 18s have already been subtracted.

    The first graph in the blog post is not “the number of peptides of different masses found in various organisms”. Instead, they are the result of splitting up (digesting) a single protein (Lys-C or Trypsin) into peptides. Both Lys-C and Trypsin are proteases, i.e., proteins which digest proteins. They split at certain residues, which means that each peptide they make will begin (or end) with one of a few residues.

    I think (from the shape of the graph) that these semi-digested proteins contain few very short peptides, and this is important for periodicity. If short ones (say <6 residues) are very rare, peptides with masses in the range 300-500 will be dominated by the residues with small mass like 57, 71, 87, 97, 99, 101, 103.

    I am dubious about F(N) being used as a model. There is no process that I know of in the cell, or in the in vitro digestion, that cares about mass. On the other hand, I reckon that if you get a lot of Trypsin and wait for it to self-destruct, there will be a particular distribution of peptide lengths. And so I think a better approach (if anyone cares about the biochemistry) would look at “what masses can you make with r residues?” for various r. Convolutions, Fourier transforms, etc.

    • John Baez says:

      Thanks!

      The masses 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186 are for residues, i.e, without a H2O. So the 18s have already been subtracted.

      Yes, Craciun mentioned this to me in email, saying:

      This is exactly what we do (and is what everybody does — it’s the standard approach in mass spectrometry).

      The amino acid masses after you subtract 18 from their “molecular mass” are called “(mono-isotopic) residue masses”. For example, when you look at the mass of the smallest amino acid, Glycine, in various databases you see 57Da (which is its residue mass), and not 75Da (which is its molecular mass).

      For example, at http://www.matrixscience.com/help/aa_help.html they list the masses of amino acid residues, i.e., without the H at the N-terminus and OH at the C-terminus, since these are not part of the “building block” structure of the amino acid.

      When we add together such amino acid masses we obtain a total residue mass for each peptide, but, as you pointed out, if you want you can just add 18 to it to obtain the molecular mass of that peptide (like they do here http://www.peptidesynthetics.co.uk/tools/ ).

      I was waiting for him to post a comment on the blog.

      About the biology:

      I am dubious about F(N) being used as a model. There is no process that I know of in the cell, or in the in vitro digestion, that cares about mass. On the other hand, I reckon that if you get a lot of Trypsin and wait for it to self-destruct, there will be a particular distribution of peptide lengths. And so I think a better approach (if anyone cares about the biochemistry) would look at “what masses can you make with r residues?” for various r. Convolutions, Fourier transforms, etc.

      Interesting! But I don’t quite see what mathematical model you might be proposing here.

      While the biology is exciting, unfortunately the math problems raised by F(N) are so fundamental and so gripping that I’m perfectly happy thinking about those, for a while at least. Of course there’s nothing mathematically profound about how many ways you can write N as an ordered sum of the numbers 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, and 186. But the generalized question, where we consider arbitrary collections of numbers, feels important, and fun. So it’s probably been studied already. So I should poke into the literature.

      By the way, these problems are also basically problems of convolutions and Fourier transforms, even though we’re not using those words a lot. F(N) is obtained by iterated convolution—that’s what the recurrence relation says—and that lets us express it as a linear combination of exponentials, which is a kind of Fourier transform adapted to exponentially growing functions.

      • Graham Jones says:

        Interesting! But I don’t quite see what mathematical model you might be proposing here.

        If there is a distribution p(r) on the lengths of the peptides, and w is the distribution of residue masses, and w^{*r} is the r-fold convolution of w, then I mean to replace F with \sum_r p(r) w^{*r}.

        • John Baez says:

          Okay, now I get your idea. Just to compare using the same notation, our F is given by

          \displaystyle{ F(n) = \sum_{r = 1}^\infty w^{* r}(n) }

          We didn’t write it that way, but that’s what it is. So, you’re suggesting modifying it by weighting the sum by a probability distribution p(r):

          \displaystyle{ F_{\textrm{Jones}}(n) = \sum_{r = 1}^\infty p(r) w^{* r}(n) }

          As long as p(r) is constant for r \le R, F_{\textrm{Jones}}(n) will be proportional to F(n) for n \le 57R, since 57 is the residue mass of the lightest amino acid.

          Thus, I don’t feel very bad about using our original model—at least until someone proposes a specific choice of p.

          Also, unless p oscillates, I have trouble seeing how it could create dramatic oscillations in F_{\textrm{Jones}}(n) if there weren’t already such oscillations in the original F(n). Since the original puzzle was to explain the observed oscillations in empirical data, I think it’s good that we can see them, and understand their appearance, in the original F(n).

          In short, to the extent to which your model is the ‘real thing’, I think the original simplified model is a useful first step toward the real thing.

        • Graham Jones says:

          I think your F is F(n) = \sum_{r=0}^\infty (20^r) w^{*r} (n), assuming we are using 20 masses.

          Perhaps I wasn’t clear. I meant by “w is the distribution of residue masses” to imply that \sum_n w(n) = 1, and similarly for p. F_{Jones} is supposed to be a probability distribution.

        • John Baez says:

          Okay, in my comment I used w to mean the function on the integers given by the sum of Kronecker deltas supported at the residue masses of the amino acids. So, it was 20 times your w.

  16. arch1 says:

    I’m just skimming so this might have been answered in earlier comments: Is it clear yet what underlies the apparent structure in John’s initial phase/magnitude scatterplot?

    I think I see in that plot some right-facing circular or elliptical arcs, and maybe some left-facing parabolic or hyperbolic ones.

    Of course one possible answer is ‘the viewer’s brain’ but I guess there are statistical tests that could help w/ that.

    • Philip Gibbs says:

      I think it is interesting to see what happens when all the masses are divided by 14.28

      57/14.28 = 3.99159
      71/14.28 = 4.97198
      87/14.28 = 6.09243
      97/14.28 = 6.79271
      99/14.28 = 6.93277
      101/14.28 = 7.07282
      103/14.28 = 7.21288
      113/14.28 = 7.91316
      114/14.28 = 7.98319
      115/14.28 = 8.05322
      128/14.28 = 8.96358
      129/14.28 = 9.03361
      131/14.28 = 9.17366
      137/14.28 = 9.59383
      147/14.28 = 10.29411
      156/14.28 = 10.92436
      163/14.28 = 11.41456
      186/14.28 = 13.02531

      What you notice is that most of them are close to integers with a few exceptions at the heavier end. In particular the lightest two are very close to integers. This is what is responsible for the oscillations. You can see from the recurrence relation why that would be the case. Greg Egan’s experiments with equally spaced masses also show it. In a growing sequence the lightest masses control the dominant terms in the recurrence relation.

      Is there a biochemical reason why the masses should be close to multiples of 14.28? I can see that CH_2 and N both have masses of 14 and may dominate the chemistry, O is 16 while P and S are 31 and 32 which are a little over twice the period, pushing the average a little higher. An extra H on the end of a chain has a similar effect.

      I wish these oscillations would turn out to have some significance for life. That would be very profound. However as someone said, masses don’t have a big effect on chemistry.

    • John Baez says:

      Do you mean this picture by Hubler and Craciun?

      I don’t think we know explanations of the patterns you’re alluding to, or even whether those patterns are ‘real things’ (i.e., have interesting explanations).

      • arch1 says:

        Yes, that is the diagram I asked about. I agree that the explanation may well be “it’s all in your head”. Also that (as Philip observes) they are less impressive without the reflection symmetry.

      • Philip Gibbs says:

        I think there is still some possibility that the arcs are a real effect. If the masses were all exact multiples of some integer then we would get some straight vertical lines. So if they are mostly close to multiples of some number there is a possibility that the lines could be distorted into arcs. I just can’t see that idea quite matching up with what we see. Perhaps it is more complicated.

      • Philip Gibbs says:

        To give a little better idea of what I mean, here is an analysis of how you could get some arcs under the right conditions

        Suppose we have a case where all the masses are near multiples of some number P, i.e. m_i = n_i P + r_i where the multipliers n_i are positive integers and the remainders r_i are small, e.g. between -0.5 and 0.5. P should also be quite large, e.g. greater than 2\pi, but it does not have to be an integer.

        Suppose x_0 is a root of the polynomial equation 1 = \sum_i x^{-m_i} (not necessarily a real root). Are there other roots near x_0 e^{\frac{2\pi k i}{P}} for integers k = ... -1, 0, 1, ... ? Try x_1 = x_0 e^{\frac{2\pi k i}{P} + \delta_1 k + \delta_2 k^2 + ...} for a small complex correction \delta_1, \delta_2, ... Substituting into the polynomial gives

        \sum_i x_0^{-m_i}e^{-\frac{2\pi k r_i i}{P} - \delta_1 k m_i - \delta_2 k^2 m_i - ...} = 1

        Some multiples of 2\pi i came out of the exponent making it small, so the exponential can now be expanded as a power series. Coefficients of powers k can be matched up to find a solution. The first power gives

        \sum_i x_0^{-m_i} (\frac{2\pi r_i i}{P} + \delta_1 m_i) = 0

        \delta_1 = -\frac{2\pi i}{P} \frac{\sum_i x_0^{-m_i} r_i}{\sum_i x_0^{-m_i} m_i}

        I won’t write the rest out but you can expand the exponential to second order and set the coefficient of k^2 to zero to get an expression for \delta_2 and so on. Provided P is big and r_i are small this series will converge rapidly, so the result describes a series of roots following a smooth curve in the complex plane. I don’t know if this fully accounts for what is being seen. It might be useful to set up a cleaner case where the remainders r_i are made smaller to see how it looks.

      • Philip Gibbs says:

        What does the plot look like if the masses are adjusted to

        57, 71, 86, 100, 100, 100, 100, 114, 114, 114, 114, 128, 128, 128, 128, 143, 143, 157, 157, 186 ?

        These numbers are nearer multiples of 14.28. Does this make the arcs more visible?

        • arch1 says:

          I hope Greg (whose ‘red dot’ code could answer your Q without modification) sees your post. I am starting to see linear features as well and want to put this to rest before I start seeing horsies and doggies.

        • John Baez says:

          Philip wrote:

          What does the plot look like if the masses are adjusted to 57, 71, 86, 100, 100, 100, 100, 114, 114, 114, 114, 128, 128, 128, 128, 143, 143, 157, 157, 186 ?

          That sounds like an interesting question! I’d gotten busy with some funeral arrangements in the last week, but I haven’t lost interest in these puzzles. I hope someone with better computational skills than I can draw this plot. Maybe I’ll try to drum up interest on G+.

        • Greg Egan says:

          I’m working on an interactive page that will let people plot both F(N) and the roots for any set of integers they like, but it’ll be a couple of days before it’s up.

        • John Baez says:

          Wow! I should have guessed you were up to something, not slacking off like me. That sounds fun.

    • Philip Gibbs says:

      I see those arcs too. They are very convincing, but if I cover up the bottom half I am not sure they are so convincing. Is it the reflection symmetry that creates the illusion?

    • Philip Gibbs says:

      To continue with my other point which I somehow got mixed up in this thread, the chemistry seems to be based on chains of CH_2 (mass 14), NH (mass 15), CO (mass 28), CNH_3 (mass 29), S (mass 32), N_2C_3H_2 (mass 66), and benzene (mass 76). What oscillations do you get if you use these 7 masses as your starting point?

  17. Gheorghe Craciun says:

    In general, for the sake of some possible applications, I always thought it may be more interesting if the period of the oscillations does somehow depend on the peptide masses — i.e., if the period captures some property of the peptide masses, which may allow us to say something about the probability that the masses present in some sample set are (or are not) compatible with a given set of “building block” masses.

  18. Gheorghe Craciun says:

    In case anybody may be interested, in a different paper Shane Hubler and I have also looked at some general mathematical properties of this type of recurrence relations: “Mass distributions of linear chain polymers”, Journal of Mathematical Chemistry, 2012 http://www.math.wisc.edu/~craciun/PAPERS_NEW/Hubler_Craciun_JMathChem_2012.pdf

  19. Gheorghe Craciun says:

    Shane Hubler and I have pointed out in our paper that 70% of the amino acid masses are within 1.4 Da of an integer multiple of 14.28 — see Table 2 and the second paragraph in section 5, “Why 14.28 Da?”, in the paper at

    http://www.math.wisc.edu/~craciun/PAPERS_NEW/Hubler_Craciun_BioSystems_FINAL_2012.pdf .

    I think it would be interesting to check numerically the following: if we make many random selections of 20 integers (all between 50 and 200) such that most of them are very close to an integer multiple of 14.28, is it true that for a large percentage of those choices the oscillation period is close to 14.28? (For an idea of how large the “noise” could be, you can have a look at the leftmost column labelled “Remainder” in Table 2 in our paper.)
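
    For anyone who wants to try this, here is a rough sketch of the experiment (my own code, with an assumed noise model of ±1 Da around the multiples of 14.28): draw 20 such integers, find the non-real root of largest modulus of the corresponding polynomial, and read off the period it implies.

    period_of <- function(masses) {
      M <- max(masses)
      coefs <- numeric(M + 1)          # coefficients in increasing order for polyroot()
      coefs[M + 1] <- 1                # leading term x^M
      for (m in masses) coefs[M - m + 1] <- coefs[M - m + 1] - 1   # subtract x^(M - m_i)
      rs <- polyroot(coefs)
      rs <- rs[Im(rs) > 1e-8]          # keep one root from each conjugate pair
      2 * pi / Arg(rs[which.max(Mod(rs))])   # period implied by the dominant oscillatory root
    }

    set.seed(1)
    periods <- replicate(100, {
      n <- sample(4:13, 20, replace = TRUE)
      period_of(round(14.28 * n) + sample(-1:1, 20, replace = TRUE))
    })
    hist(periods)                      # how often is the period close to 14.28?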

    • Philip Gibbs says:

      Yes, this is the correct explanation. Furthermore, it is the low-mass components that most control the oscillation, and more of these are within your limits for the remainder. The contribution from each mass in the recurrence relation is suppressed according to the growth factor. E.g. with a growth factor of 1.02 per dalton, the mass at 57 makes a larger contribution than the mass at 103 by a factor of 1.02^{103-57} = 2.5. The fact that the two lowest mass components are particularly close to multiples of 14.28 is most significant.

      I don’t know much about biochemistry, but from the Wikipedia page on amino acids it looks like the lightest components in the chains they are formed from are CH_2 (mass 14), NH (mass 15), CO (mass 28), CNH_3 (mass 29) and S (mass 32). More complex components such as the benzene ring are much heavier, so their influence is suppressed by the growth factor. Being close to twice the period (i.e. 28.56) is just as good as being close to the period itself. The frequency with which these occur as components in the amino acid chains will also be relevant.

      This explanation is a little hand-wavy, but it might be possible to do a more detailed analysis of how this works out.

    • Philip Gibbs says:

      Here is a more detailed analysis.

      The overall growth factor x_0 is the largest real root of

      \displaystyle{ x^{-m_1} + \cdots + x^{-m_{20}} = 1 }

      This can be computed as

      x_0 = 1.0282107

      Next try to find a second complex root assuming that the masses are all close to integer multiples of some fixed size i.e. define

      \displaystyle{ T_i = \frac{m_i}{n_i}}

      for integers n_i such that the values T_i are all close to 14. Look for a second root of the form x_1 = r e^{\frac{2\pi i}{P}} where P will be the period of the oscillation and r is real.

      So solve

      \displaystyle{ \sum_i r^{-m_i} e^{-\frac{2\pi i m_i}{P}} = 1 }

      Subtract a multiple of 2\pi i in the exponents to make them small and simplify.

      \displaystyle{ \sum_i r^{-m_i} e^{-2\pi n_i i (\frac{T_i}{P} - 1) } = 1 }

      Since the exponents are now small use the first order approximation to the exponential

      \displaystyle{ \sum_i r^{-m_i} \left(1 - 2\pi n_i i \left(\frac{T_i}{P} - 1\right)\right) = 1 }

      The real part of the equation gives r = x_0 (first order approximation) and the imaginary part gives

      \displaystyle{ \sum_i x_0^{-m_i} (n_i T_i) = P \sum_i x_0^{-m_i} n_i }

      In other words, the period P is just the weighted mean of the values T_i with weights x_0^{-m_i} n_i.

      The approximation for the growth rate of the oscillations r could be improved by using a second order approximation to the exponential to show that it grows just slightly slower than the growth factor x_0 by an amount determined by the size of the remainders.
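
      Here is a minimal sketch of that calculation in R (my own code), using the rounded amino-acid masses quoted elsewhere in this thread and the choice n_i = round(m_i / 14.28); it gives a growth factor of about 1.028 and a period close to 14.28.

      m  <- c(57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129,
              131, 137, 147, 156, 163, 186)
      x0 <- uniroot(function(x) sum(x^(-m)) - 1, c(1.001, 1.2), tol = 1e-12)$root
      ni <- round(m / 14.28)                   # nearest integer multiples of 14.28
      Ti <- m / ni                             # the values T_i
      wi <- x0^(-m)                            # weights x_0^(-m_i)
      P  <- sum(wi * ni * Ti) / sum(wi * ni)   # weighted mean of the T_i
      c(x0 = x0, period = P)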

      • Greg Egan says:

        Nice analysis! It gets the period fairly close for 6, 11, 16, 21, 26; it gives 5.73981 vs 5.50192.

        For 7, 11, 16, 21, 26 it’s not as good; it gives 6.16646 vs the (lowest-phase) period of 5.4668.

        But for 57, 71, 87, 97, 99, 101, 103, 113, 113, 114, 115, 128, 128, 129, 131, 137, 147, 156, 163, 186, it’s extremely close: 14.2533 vs 14.2767.

      • Philip Gibbs says:

        I calculated this weighted sum on a spreadsheet and got 14.2775.

      • Philip Gibbs says:

        What values for n_i did you use, Greg? Mine were
        4 5 6 7 7 7 7 8 8 8 8 9 9 9 10 10 11 11 13
        For a few it was not obvious; I could have chosen a different number and the result would be different.

        • Greg Egan says:

          I used the same numbers except for 12 rather than 11 for the second-last (mass 163). I rounded the masses divided by 14 to get the n_i; dividing by 14.28 instead would change that 12 to the 11 you used and leave everything else the same.

          I guess there are all sorts of refinements possible in making these choices, but that’s not really the point — we can compute the roots exactly if we really want to know them. But your analysis gives a nice intuitive explanation of why the period is roughly what it is in relation to the T_i.

      • Philip Gibbs says:

        You can double all the values of n_i giving a new root at half the period, but the growth rate will go down. For the case 7, 11, 16, 21, 26 you could then change n_1 from 2 to 3 to give a better match.

  20. Graham Jones says:

    I made a plot of the roots of many random polynomials with coefficients 1, 0 or 1, 0 or 1,…., 0 or 1, -1. In the complex plane they look quite like the Littlewood ones, but a C not an O. Here’s a link to an image in Mod/Arg space, and the R code. (Goodness knows what WordPress will make of this.)

    # Empty plot in modulus/argument coordinates
    plot(NULL, NULL, xlim=c(.5,2), ylim=c(-3.4,3.4), xlab="Mod", ylab="Arg")
    for(i in 1:100000) {
        # random number of middle coefficients, biased towards longer polynomials
        n <- sample(1:40, 1, prob=1.2^(1:40))
        # coefficients in increasing order: constant 1, middle terms 0 or 1, leading -1
        rs <- polyroot(c(1, sample(0:1, n, TRUE), -1))
        points(Mod(rs), Arg(rs), pch='.', col="#00000010")
    }
    
    • John Baez says:

      Cool stuff, Graham! But I’m confused by this:

      I made a plot of the roots of many random polynomials with coefficients 1, 0 or 1, 0 or 1,…., 0 or 1, -1.

      Maybe there’s a typo there, but even turning on my automatic typo-correction facilities I can’t tell what class of polynomials you’re plotting roots of.

      • Graham Jones says:

        No typo that I can see, but I suppose it is a bit terse and cryptic – and influenced by my R code. I mean polynomials of the type described in Puzzle 2. I rewrote them as \sum_{i=0}^{n} c_i x^i, where c_0=1, c_n=-1 and the ones in between are either 0 or 1.

        First n is sampled randomly between 3 and 42 (with a bias towards big n), then the coefficients are filled in at random.

      • John Baez says:

        I see! You meant

        coefficients 1, (0 or 1), (0 or 1),…., (0 or 1), -1.

        but for some reason I read it as

        coefficients (1, 0) or (1, 0) or (1,…., 0) or (1, -1).

        which makes no sense at all.

  21. Philip Gibbs says:

    Nobody said anything about puzzle 1 yet. Here is one approach that might work.

    If a polynomial has all real roots then all its derivatives have all real roots too. Differentiate the polynomial until you are left with a low degree polynomial, e.g. a cubic. Its coefficients are non-negative. In which cases can this have all real roots?

    • Graham Jones says:

      I have a sketchy proof that the polynomials in the puzzle only have all roots real when m=1 or 2.

      Put h(x) = \sum_{i=0}^{m-1} c_i x^i, where each c_i is 0 or 1, and c_0=1. Put g(x) = x^m.
      The polynomial is then f(x) = g(x) - h(x). I get bounds on h(x) via various geometric sums.

      First, f has exactly one positive root in [1,2]. h(x) is bigger than g(x) for 0<=x=2. So f has at least one root in [1,2], and no positive ones outside. If y is the smallest root in [1,2], show all derivatives of f at y are positive. So there aren’t any more.

      Next look at negative roots. f(0)=-1, and the quickest possibility for f(x) to get up to 0 as -x gets bigger is for c_i=1 when i is odd, otherwise 0. It follows that all negative roots lie in [-2,-.618]. Suppose f has m real roots. Then it has m-1 roots in this range, so f’ has at least m-2 roots in this range, f” has at least m-3, and so on, until you get to the (m-2)’th derivative, which has one root in the range. But this is a quadratic, and both its roots lie too close to 0 for m >= 4. m=3 can be checked manually.

      • Graham Jones says:

        Let’s try the third paragraph again.

        First, f has exactly one positive root in [1,2]. h(x) is bigger than g(x) for 0 \leq x \leq 1, and h(x) is smaller than g(x) for x \geq 2. So f has at least one root in [1,2], and no positive ones outside. If y is the smallest root in [1,2], show all derivatives of f at y are positive. So there aren’t any more.

        • Gheorghe Craciun says:

          If we do not assume that masses are distinct, i.e., if we look at polynomial equations of the form x^m - \sum_{i=0}^{m-1} c_i x^i = 0 where each c_i is nonnegative (instead of being restricted to 0 or 1), then there are many possible choices for the numbers c_i for which all the roots are real. For example, if we consider the polynomial with m-1 roots being exactly -1, and the mth root being a positive number M, then if M is large enough we obtain a (monic) polynomial of the type considered above.
          (This is easy to check: we look at the product (x-M)(x+1)^{m-1} and note that this is the product of (x-M) and a polynomial with positive coefficients, so it will have the desired coefficient signs for large enough M.)

        • Gheorghe Craciun says:

          More about John’s “Puzzle 1”:

          On the other hand, note that the polynomials we obtain this way (i.e., by using the product (x-M)(x+1)^{m-1} for large M) correspond to a case where all positive integers less than M are masses of some “amino acids”, which is a very restrictive case.

          More generally, consider the equation x^m - \sum_{i=0}^{m-1} c_i x^i = 0 with nonnegative coefficients c_i (and with c_0>0). If we assume that we have two or more “missing masses”, i.e., two or more coefficients c_i are zero (but still c_0>0), then I think we can use Descartes’s rule of signs https://en.wikipedia.org/wiki/Descartes%27_rule_of_signs to argue that not all roots can be real (because in this case Descartes’s rule of signs implies that there is exactly one positive root, and also implies there are at most m-2 negative roots, so altogether there are at most m-1 real roots, since zero is not a root).

          If there is exactly one “missing mass”, i.e., exactly one coefficient c_i is zero (and still c_0>0), then again I think we can use Descartes’s rule of signs to argue that, if all roots are real, then the zero coefficient must be c_{m-1} (because, like before, there is exactly one positive root, so we have to have m-1 negative roots; but this means that, according to Descartes’s rule of signs criterion for negative roots, we must have a sign change from (-x)^m to the next monomial after it. But, if c_{m-1} > 0, we fail to get this sign change; therefore c_{m-1} = 0).

          By the way, there do exist polynomials of this type for which all the roots are real. For example, all roots of x^4 - 4x^2 - 4x -1 are real (another example is x^2 - 1). But, this is not true for all such polynomials, for example not all roots of x^4 - x^2 - x -1 are real.

          To summarize, I think that if all roots of x^m - \sum_{i=0}^{m-1} c_i x^i = 0 are real (and all c_i are nonnegative, and c_0>0) then c_1, c_2, ..., c_{m-2} must also be strictly positive.

        • Gheorghe Craciun says:

          Also, if we now restrict the question only to the equations of the form x^m - \sum_{i=0}^{m-1} c_i x^i = 0 with c_i \in \{0,1\} and c_0 = 1 (i.e., the way Puzzle 1 was originally formulated by John), then, unless I made some mistake in the comment above, it follows that the only equations of this form that have only real roots must be of the type x^m - \sum_{i=0}^{m-1} x^i = 0, or of the type x^m - \sum_{i=0}^{m-2} x^i = 0.
          But, if we multiply these equations by (x-1), we have a lot of cancellations, and we are left with an equation of the form fewnomial = 0, where “fewnomial” is a sparse polynomial, i.e., a polynomial with few nonzero coefficients (in this case, after multiplying by (x-1) and many cancellations, we are left with at most 4 nonzero coefficients, for any m).
          Then, after using Descartes’s rule of signs for the “fewnomial”, (and maybe also after checking a couple cases of degree 3 and 4 by hand) I think we can conclude that the only polynomials of the two types mentioned above for which all the roots are real are x-1, x^2-x-1, and x^2-1.
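
          For what it’s worth, a brute-force check of this conclusion for small m is easy to run (a sketch of my own, using polyroot and a small tolerance on the imaginary parts of the roots). It should report 1 for m = 1 (just x - 1), 2 for m = 2 (x^2 - 1 and x^2 - x - 1), and 0 for larger m, consistent with the conclusion above.

          count_all_real <- function(m, tol = 1e-7) {
            count <- 0
            for (bits in 0:(2^(m - 1) - 1)) {
              # c_0 = 1 always; c_1, ..., c_{m-1} are read off the bits of the counter
              cs <- c(1, as.integer(intToBits(bits))[seq_len(m - 1)])
              coefs <- c(-cs, 1)       # -c_0 - c_1 x - ... - c_{m-1} x^(m-1) + x^m
              if (all(abs(Im(polyroot(coefs))) < tol)) count <- count + 1
            }
            count
          }
          for (m in 1:8) cat("m =", m, ":", count_all_real(m), "cases with all roots real\n")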

  22. domenico says:

    I tried to obtain a proof that some of these polynomials have only complex solutions, but I obtained nothing.
    I tried to evaluate the gradient of f in the real field, in the form:
    f(x) = 1-\sum_k x^{-m_k}
    f(x_1)=f(x_0+\epsilon) = f(x_0)+\epsilon f'(x_0)+\frac{\epsilon^2}{2}f''(x_0)+\cdots \simeq 0
    so that if the infinitesimal displacement is in the complex field, then all the solutions are complex.
    The two solutions for the displacement are:
    \epsilon=\frac{-\partial_x f_0 \pm \sqrt{(\partial_x f_0)^2-2 f_0 \partial^2_x f_0} }{\partial^2_x f_0}
    so that if the discriminant is always negative, then the solutions are all in the complex plane (only oscillations):
    \Delta = (\partial_x f_0)^2-2 f_0 \partial^2_x f_0 = -\sum_{ks} m_k m_s [1+ 2 m_s] x^{-m_k-1} x^{-m_s-1}
    and this seems to be a dead end.

  23. lwbut says:

    Forgive my ignorance; most of the math here goes way over my head, but I am fairly good with the simpler stuff, not to mention basic chemistry.

    I have noticed a couple of points that may open new areas of thought.
    Point 1. In the first graph of peptide mass numbers the peaks are approximately 14 daltons apart (which I take to mean there is a greater occurrence of peptides with the peak mass numbers?). 14 daltons is the equivalent mass of a carbon atom link in a peptide chain, as most such single links have 2 hydrogen atoms connected to the carbon: total mass = 12 + 1 + 1 = 14. Adding a single such link to a peptide to form a new stable peptide would most likely occur when one such CH2 connection is added or subtracted.

    Point 2. 14.2857… (recurring) is exactly 100 divided by 7.

    The reason I visited the site was to look for information concerning the Fibonacci sequence, specifically why the number 5 (and 11) is so important to the sequence. In the above post you give a Fibonacci formula involving the inverse square root of this number, and I believe I have discovered (and not yet read anywhere) a puzzling 5 connection: for every Fibonacci number from the 6th place in the sequence (5) onward, multiplying the number by 11 and adding the number 5 places earlier gives the Fibonacci number 5 places later! Starting with 5 (x 11) + 0 = 55 (place numbers 6, 1 and 11 respectively), then 8 (x 11) + 1 = 89, 13 (x 11) + 1 = 144, …
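
    A quick numeric check of this pattern (my own sketch, not part of the original comment), writing it as F(n+5) = 11 F(n) + F(n-5) with F(0) = 0 and F(1) = 1:

    fib <- numeric(30); fib[2] <- 1                    # fib[k] stores F(k-1), so F(0) = 0, F(1) = 1
    for (k in 3:30) fib[k] <- fib[k - 1] + fib[k - 2]
    all(fib[11:30] == 11 * fib[6:25] + fib[1:20])      # TRUE: F(n+5) = 11 F(n) + F(n-5) for n = 5..24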

    love.

  24. Philip Gibbs says:

    It’s interesting to think about what would happen if hydrogen came mostly in the form of deuterium, so that it had a mass number of 2 instead of 1. List the mass and valency numbers for the elements that make up most proteins:

    Deuterium: m = 2, v = 1
    Carbon: m = 12, v = 4
    Nitrogen: m = 14, v = 3
    Oxygen: m = 16, v = 2
    Sulphur: m = 32, v = 2

    In every case it happens that m + 2v = 4 (mod 16).
    What can be said about the mass of any molecule that has a tree structure, i.e. no circuits or double bonds? The number of atoms A is then equal to the number of bonds B plus one: A = B + 1.

    Adding up the congruence over all atoms gives M + 2V = 4A (mod 16),
    where M is the total mass and V is the total valency. But V = 2B,
    so M = 4A - 4B = 4 (mod 16).

    In other words, if all hydrogen were deuterium then all tree-like molecules made from these elements would have a mass number that is 4 more than a multiple of 16.

    In the real world hydrogen is not mostly deuterium, and amino acids do have loops and double bonds, so the situation is more complex. The differences between masses are then often only approximately multiples of 14.28, instead of always being exact multiples of 16.
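
    A one-line check of the congruence used above (a sketch):

    m <- c(2, 12, 14, 16, 32)      # deuterium, carbon, nitrogen, oxygen, sulphur
    v <- c(1,  4,  3,  2,  2)      # valencies
    (m + 2 * v) %% 16              # all equal to 4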
