Coalescence

From Genetics Wiki
Revision as of 19:12, 19 September 2018 by Floyd (talk | contribs) (The coalescence of more than two lineages)

The coalescence of two lineages

Two lineages coalesce in the previous generation (that is, both pick the same gene copy as their parent) with probability 1/(2N), because a diploid population of N individuals has 2N gene copies to choose from.

The probability of coalescence per generation is 1/(2N), so the waiting time is geometrically distributed with a mean of 2N generations.

On average two lineages are expected to coalesce to a common ancestor 2N generations in the past.
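
The 2N expectation can be checked with a small simulation. The sketch below is illustrative only: the function names, the population size N = 50, the number of replicates, and the seed are all assumptions chosen for the example, and it models nothing beyond two lineages each picking a parent copy at random every generation.

```python
import random

def generations_until_coalescence(N, rng):
    """Count generations back until two lineages pick the same parent copy.

    Each generation the two lineages coalesce with probability 1/(2N).
    """
    p = 1.0 / (2 * N)
    t = 1
    while rng.random() >= p:  # no coalescence this generation, go back one more
        t += 1
    return t

def mean_coalescence_time(N, replicates=5000, seed=1):
    """Average the waiting time over many independent replicates."""
    rng = random.Random(seed)
    total = sum(generations_until_coalescence(N, rng) for _ in range(replicates))
    return total / replicates

N = 50  # hypothetical diploid population size
print("simulated mean:", mean_coalescence_time(N), "expected:", 2 * N)
```

The simulated mean should land close to 2N, but individual replicates vary widely, since the standard deviation of a geometric waiting time with a small per-generation probability is about as large as its mean.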

The coalescence of more than two lineages

There are three possible ways for three lineages to coalesce: A with B, B with C, and A with C. So the rate of the first coalescence among three lineages is three times higher, 3/(2N). The expected time until the first coalescence of three lineages is 2N/3 generations, followed by another 2N generations on average for the coalescence of the remaining two lineages.

There are six ways for four lineages to coalesce: A with B, B with C, C with D, A with D, A with C, and B with D. So the rate of the first coalescence is 6/(2N). The expected time until all four lineages have coalesced is [math]\frac{2N}{6} + \frac{2N}{3} + \frac{2N}{1} = 3N[/math] generations.
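
The sums above can be reproduced for any sample size by adding 2N divided by the number of pairs at each step. This is a minimal sketch; `expected_time_to_mrca` is a hypothetical helper name, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction
from math import comb

def expected_time_to_mrca(n, N):
    """Expected generations until n lineages reach one common ancestor.

    Adds 2N / C(k, 2) for every step with k = n, n-1, ..., 2 lineages.
    """
    return sum(Fraction(2 * N, comb(k, 2)) for k in range(2, n + 1))

# Times in units of N generations (N = 1) for two, three and four lineages.
for n in (2, 3, 4):
    print(n, "lineages:", expected_time_to_mrca(n, N=1))
```

With N = 1 the results read directly as multiples of N, so the values for two, three and four lineages can be compared with 2N, 2N/3 + 2N and 2N/6 + 2N/3 + 2N.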

For each lineage we add to our sample, the number of possible pairs that could be the first to coalesce grows quickly. It follows the triangular number series, the number of pairwise comparisons among n lineages, which is [math]{n\choose2}[/math] or [math]\frac{n(n-1)}{2}[/math]. For n = 2, 3, 4, 5, 6, 7, ... this gives 1, 3, 6, 10, 15, 21, ...

Why is it [math]{n\choose2} = \frac{n(n-1)}{2}[/math]? This is a binomial coefficient: out of n lineages, how many ways are there to choose two (to coalesce)?


[math]{n\choose2} = \frac{n!}{2!(n-2)!} = \frac{n \times (n-1) \times (n-2) \times (n-3) \times (n-4) \times \cdots}{2\times1 \times (n-2) \times (n-3) \times (n-4) \times \cdots} = \frac{n(n-1)}{2}[/math]
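
The count can also be confirmed by listing the pairs explicitly. The sketch below uses Python's `itertools.combinations`; the helper name `coalescing_pairs` and the lineage labels are assumptions made for illustration.

```python
from itertools import combinations

def coalescing_pairs(lineages):
    """List every unordered pair of lineages that could coalesce first."""
    return list(combinations(lineages, 2))

pairs = coalescing_pairs("ABCD")
print(pairs)  # the pairs among lineages A, B, C and D

# The number of listed pairs matches n(n-1)/2 for several sample sizes.
for n in range(2, 8):
    assert len(coalescing_pairs(range(n))) == n * (n - 1) // 2
```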

The coalescence of an infinite number of lineages

Of course an infinite number of lineages never coalesces; a species has a finite number of individuals. Still, it is useful to understand the upper limit in coalescence time that is approached with very large samples or with an entire population. Keep in mind that this is still only an expectation and there is a large variance associated with these expectations.

To find the limit we sum the infinite series formed by the expected coalescence times as the number of sampled lineages increases.

With i lineages in the current step, the rate of coalescence is the triangular number i(i-1)/2 (these are 1, 3, 6, 10, 15, 21, ...) divided by 2N generations: [math]\frac{\frac{i(i-1)}{2}}{2N}[/math]. The expected time that each step adds to the sum is the inverse of the rate, or [math]\frac{2N}{\frac{i(i-1)}{2}}[/math].

Sum of the infinite series

[math]\sum_{i=2}^\infty\frac{2N}{\frac{i(i-1)}{2}}=\sum_{i=2}^\infty\frac{4N}{i(i-1)}=4N\sum_{i=2}^\infty\frac{1}{i(i-1)}[/math]

Note that the index is shifted down by one in the next line, so the sum starts at i=1 instead of i=2. This makes the calculation more convenient.

[math]4N\sum_{i=2}^\infty\frac{1}{i(i-1)}=4N\sum_{i=1}^\infty\frac{1}{i(i+1)}=4N\sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right)[/math]


Why is

[math]\frac{1}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}[/math]?

First in reverse: Multiply each term by one, written as (i+1)/(i+1) and i/i, to give both fractions the same denominator, then combine.

[math]\frac{1}{i}-\frac{1}{i+1}=\frac{i+1}{i+1}\frac{1}{i}-\frac{i}{i}\frac{1}{i+1}=\frac{i+1-i}{i(i+1)}=\frac{1}{i(i+1)}[/math]

Then forward: Add zero (i-i) to the numerator, split the fraction into two parts, then simplify.

[math]\frac{1}{i(i+1)}=\frac{i+1-i}{i(i+1)}=\frac{i+1}{i(i+1)}-\frac{i}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}[/math]

Plug in the first few numbers of the sum to see the pattern.

[math]\sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right) = \frac{1}{1} - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \frac{1}{4} + \frac{1}{4} - \frac{1}{5} + \cdots[/math]

After the first term the fractions cancel in pairs: -1/2 +1/2, -1/3 +1/3, -1/4 +1/4, ... This pattern continues indefinitely, so the sum of the first m terms is 1 - 1/(m+1), which deviates from one by a smaller and smaller fraction as m grows. So,

[math]\sum_{i=2}^\infty\frac{1}{i(i-1)} = \sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right) = 1[/math]

[math]4N\sum_{i=2}^\infty\frac{1}{i(i-1)} = 4N[/math]
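
The telescoping behaviour can be seen numerically by computing partial sums of the series. This is a sketch only; `partial_sum` is a hypothetical helper, and exact fractions are used so that the cancellation is visible without rounding error.

```python
from fractions import Fraction

def partial_sum(m):
    """Sum of 1/(i(i+1)) for i = 1..m, computed exactly."""
    return sum(Fraction(1, i * (i + 1)) for i in range(1, m + 1))

# Each partial sum equals m/(m+1), that is 1 - 1/(m+1), and creeps toward 1.
for m in (1, 2, 3, 10, 100, 1000):
    s = partial_sum(m)
    print(m, s, float(s))
```

Multiplying any of these partial sums by 4N gives the expected time for m + 1 lineages to coalesce, which is why the limit is 4N.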

Summary

So on average we expect all of the lineages within a panmictic species to coalesce, with the last coalescence event about 4N generations in the past. From above we can also see that the expected time to coalescence of the final two lineages is 2N generations. On average, then, all but the last coalescence event occur in the most recent 4N - 2N = 2N generations, and the genealogy then exists as two lineages until the last coalescence event 2N + 2N = 4N generations in the past.

Beyond a sample size of 10 or so (20 gene copies in a diploid) we quickly arrive at diminishing returns in capturing old lineages. Most of the contribution of additional samples is in adding small branches to the tips of the tree. It is very likely that the oldest "2N" lineages have already been captured within a small sample.
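
The diminishing returns can be put in numbers by comparing the expected coalescence time of a finite sample with the 4N limit. In the sketch below, `fraction_of_limit` is a hypothetical helper and the sample sizes are arbitrary illustrative choices; n counts gene copies, so 20 copies correspond to 10 diploid individuals.

```python
def fraction_of_limit(n):
    """Expected time to the common ancestor of n gene copies, divided by 4N."""
    # Each step with k lineages lasts 2N / (k(k-1)/2) generations on average,
    # which is 2 / (k(k-1)) in units of 2N.
    total_in_units_of_2N = sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))
    return total_in_units_of_2N / 2.0  # convert units of 2N to units of 4N

for n in (2, 5, 10, 20, 100):
    print(n, "gene copies:", round(fraction_of_limit(n), 3))
```

Because the sum telescopes, the fraction equals 1 - 1/n, so under these assumptions a sample of 20 gene copies is already expected to reach back 95% of the way to the 4N limit.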