Coalescence
The coalescence of two lineages
In each generation, two lineages coalesce (pick the same gene copy in the previous generation) with probability 1/(2N), because there are 2N gene copies in a diploid population to choose from.
The rate per generation is 1/(2N), so the average waiting time until this occurs is 2N generations.
On average two lineages are expected to coalesce to a common ancestor 2N generations in the past.
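The 2N expectation can be checked with a small simulation. The sketch below is illustrative only: it assumes each lineage picks its parental gene copy uniformly at random, and the function names, replicate count, and seed are hypothetical choices rather than part of the original text.

```python
import random

def generations_until_coalescence(n_diploid, rng):
    """Count generations back until two lineages pick the same parental copy."""
    copies = 2 * n_diploid  # 2N gene copies in a diploid population
    generations = 0
    while True:
        generations += 1
        # Each lineage independently picks one of the 2N copies in the previous generation.
        if rng.randrange(copies) == rng.randrange(copies):
            return generations

def mean_coalescence_time(n_diploid, replicates=10000, seed=1):
    """Average waiting time over many independent replicates."""
    rng = random.Random(seed)
    total = sum(generations_until_coalescence(n_diploid, rng) for _ in range(replicates))
    return total / replicates

# With N = 25 the simulated mean should be close to 2N = 50 generations.
print(mean_coalescence_time(25))
```

Individual replicates vary widely around the mean, which is worth keeping in mind when reading the expectations below.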
The coalescence of more than two lineages
There are three possible ways for the first coalescence among three lineages to occur: A with B, B with C, or A with C. So the rate of coalescence among three lineages is three times higher, 3/(2N). The time until the first coalescence of three lineages is expected to be 2N/3 generations, followed by another 2N generations for the coalescence of the remaining two lineages.
There are six ways for the first coalescence among four lineages to occur: A with B, B with C, C with D, A with D, A with C, or B with D. So the rate of coalescence is 6/(2N). We expect the total time until all four lineages have coalesced to be [math]\frac{2N}{6} + \frac{2N}{3} + \frac{2N}{1}[/math] generations.
With each lineage added to the sample, the number of possible ways for the first coalescence to occur goes up quickly. It follows the triangular number series, the number of pairwise comparisons, which is [math]{n\choose2}[/math] or [math]\frac{n(n-1)}{2}[/math]. For n = 2, 3, 4, ... this corresponds to 1, 3, 6, 10, 15, 21, ...
Why is it [math]{n\choose2} = \frac{n(n-1)}{2}[/math]? This is a binomial coefficient question: out of n lineages, how many ways are there to choose two (to coalesce)?
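The sums for three and four lineages follow one pattern: add 2N divided by the number of pairs at each step. A minimal sketch, working in units of 2N generations with exact fractions (the function name is a hypothetical choice):

```python
from fractions import Fraction
from math import comb

def expected_tmrca_in_units_of_2N(n):
    """Expected time until n lineages have all coalesced, in units of 2N generations.

    Adds 1 / C(k, 2) for each step with k lineages remaining, k = n down to 2.
    """
    return sum(Fraction(1, comb(k, 2)) for k in range(2, n + 1))

for n in (2, 3, 4):
    print(n, expected_tmrca_in_units_of_2N(n))
# prints:
# 2 1
# 3 4/3
# 4 3/2
```

For four lineages the result 3/2 corresponds to 2N/6 + 2N/3 + 2N = 3N generations.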
[math]{n\choose2} = \frac{n!}{2!(n-2)!} = \frac{n \times (n-1) \times (n-2) \times (n-3) \times (n-4) \times \cdots}{2\times1 \times (n-2) \times (n-3) \times (n-4) \times \cdots} = \frac{n(n-1)}{2}[/math]
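As a quick check of the formula, the pairs can be enumerated directly and counted. This is only an illustration; the lineage labels and the helper name are assumptions made for the example.

```python
from itertools import combinations
from math import comb

def lineage_pairs(n):
    """List every unordered pair among n lineages labelled A, B, C, ... (n <= 26)."""
    labels = [chr(ord("A") + k) for k in range(n)]
    return list(combinations(labels, 2))

for n in range(2, 8):
    pairs = lineage_pairs(n)
    # Enumeration, the binomial coefficient, and n(n-1)/2 must all agree.
    assert len(pairs) == comb(n, 2) == n * (n - 1) // 2
    print(n, len(pairs))
# pair counts for n = 2..7: 1, 3, 6, 10, 15, 21
```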
The coalescence of an infinite number of lineages
Of course an infinite number of lineages never coalesce; populations are finite in size. Still, it is useful to understand the upper limit in coalescence time that is approached with very large samples or with an entire population. Keep in mind that this is still only an expectation, and there is a large variance associated with these expectations.
To find the limit, we sum an infinite series whose terms are the expected coalescence times added at each step as the number of sampled lineages increases.
With i lineages present at a given step, the rate of coalescence is the corresponding triangular number (i(i-1)/2; these are 1, 3, 6, 10, 15, 21, ...) divided by 2N: [math]\frac{\frac{i(i-1)}{2}}{2N}[/math]. The time added to the sum at that step is the inverse of the rate, [math]\frac{2N}{\frac{i(i-1)}{2}}[/math].
Sum of the infinite series
[math]\sum_{i=2}^\infty\frac{2N}{\frac{i(i-1)}{2}}=\sum_{i=2}^\infty\frac{4N}{i(i-1)}=4N\sum_{i=2}^\infty\frac{1}{i(i-1)}[/math]
Note that in the next line the index is shifted down by one, so the sum starts at i=1 instead of i=2. This makes the calculation more convenient.
[math]4N\sum_{i=2}^\infty\frac{1}{i(i-1)}=4N\sum_{i=1}^\infty\frac{1}{i(i+1)}=4N\sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right)[/math]
Why is
[math]\frac{1}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}[/math]?
First in reverse: multiply each term by one, written as (i+1)/(i+1) and i/i, to give both terms the same denominator, then combine.
[math]\frac{1}{i}-\frac{1}{i+1}=\frac{i+1}{i+1}\frac{1}{i}-\frac{i}{i}\frac{1}{i+1}=\frac{i+1-i}{i(i+1)}=\frac{1}{i(i+1)}[/math]
Then forward: Add zero (i-i) to the numerator, split the fraction into two parts, then simplify.
[math]\frac{1}{i(i+1)}=\frac{i+1-i}{i(i+1)}=\frac{i+1}{i(i+1)}-\frac{i}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}[/math]
Plug in the first few numbers of the sum to see the pattern.
[math]\sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right) = \frac{1}{1} - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \frac{1}{4} + \frac{1}{4} - \frac{1}{5} + \cdots[/math]
After the first term the pairs of fractions cancel out: -1/2 +1/2, -1/3 +1/3, -1/4 +1/4, ... This pattern continues to infinity, with the partial sums falling short of one by smaller and smaller fractions. So,
[math]\sum_{i=2}^\infty\frac{1}{i(i-1)} = \sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right) = 1[/math]
[math]4N\sum_{i=2}^\infty\frac{1}{i(i-1)} = 4N[/math]
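The convergence can also be checked numerically by truncating the series at a finite number of terms. In this sketch the value N = 1000 and the cut-off points are arbitrary, hypothetical choices.

```python
def partial_sum(m):
    """Sum of 1 / (i (i - 1)) for i = 2..m, using floating point."""
    return sum(1.0 / (i * (i - 1)) for i in range(2, m + 1))

N = 1000  # hypothetical diploid population size, for illustration only

for m in (2, 10, 100, 10000):
    s = partial_sum(m)
    # The second column approaches 1 and the third approaches 4N = 4000 generations.
    print(m, round(s, 6), round(4 * N * s, 1))
```

The partial sums stay below one and close most of the gap within the first few terms.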
Summary
So, on average, we expect all the lineages within a panmictic species to have coalesced by the last coalescence event, 4N generations in the past. From above we can also see that the expected time to coalescence of the final two lineages is 2N generations. This predicts that, on average, all but the last coalescence event occur in the most recent 4N - 2N = 2N generations, after which the genealogy exists as two lineages until the last coalescence event, 2N + 2N = 4N generations in the past.
Beyond a sample size of 10 or so (20 gene copies in a diploid) we quickly arrive at diminishing returns in capturing old lineages. Most of the contribution of additional samples is in adding small branches to the tips of the tree. A small sample is very likely to have already captured the oldest "2N" lineages.
Probability of Oldest Lineage Sampling
The oldest part of a coalescent tree is the point where the last two lineages coalesce. What is the chance that a sample of finite size contains this point? The sum of coalescence times gives the mistaken impression that each new lineage adds time to the tips of the tree (this is what happens mathematically in the calculation of the expectation, but it is not the best way to visualize what is going on). The time until coalescence for two lineages is expected, on average, to be 2N generations; however, it could easily be shorter or longer than this. This expectation is half of the expected total time of coalescence of all lineages, 4N. Can this be interpreted as a one-half chance of not containing the oldest lineage in a comparison of two sequences?
Say we have three lineages that contain the oldest point in the total coalescent. Lineages A and B coalesce with each other before they coalesce with C at the oldest point. If we randomly select two of these lineages we have a 2/3 chance of including the oldest point in the genealogy (A and C or B and C, versus A and B).
Following this logic, let's write the expected time until coalescence of a finite sample of n lineages as a fraction of the theoretical limit with infinite sampling.
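The 2/3 figure can be confirmed by listing the three possible pairs. The sketch hard-codes the genealogy described above (A and B join first, then C joins at the root); the helper name is a hypothetical choice.

```python
from fractions import Fraction
from itertools import combinations

def spans_root(pair):
    """In the assumed tree ((A, B), C), a pair reaches the root only if it includes C."""
    return "C" in pair

pairs = list(combinations("ABC", 2))          # (A, B), (A, C), (B, C)
hits = [p for p in pairs if spans_root(p)]    # pairs whose common ancestor is the oldest point
prob = Fraction(len(hits), len(pairs))
print(prob)  # 2/3
```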
[math]\frac{\sum\limits_{i=2}^n\frac{2N}{\frac{i(i-1)}{2}}}{4N} = \frac{4N\sum\limits_{i=2}^n\frac{1}{i(i-1)}}{4N}[/math]
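The finite sum telescopes in the same way as the infinite one, which suggests a closed form. This is a worked step added for illustration, using [math]\frac{1}{i(i-1)}=\frac{1}{i-1}-\frac{1}{i}[/math]:

[math]\frac{4N\sum\limits_{i=2}^n\frac{1}{i(i-1)}}{4N} = \sum\limits_{i=2}^n\left(\frac{1}{i-1}-\frac{1}{i}\right) = 1 - \frac{1}{n}[/math]

Under this expression a sample of two lineages is expected to span 1/2 of the limiting time, a sample of ten spans 9/10, and a sample of twenty spans 19/20.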
Elaborations
- The results above assume a constant population size. Population size can change over time, and this affects the times of coalescence (by leaving fewer or more potential ancestors to choose among).
- Migration among discrete populations.
- Isolation by distance.