Difference between revisions of "Coalescence"
Latest revision as of 03:07, 26 October 2018
The coalescence of two lineages
Two lineages have a probability of coalescing (picking the same gene copy in the previous generation) of 1/(2N) because there are 2N total copies (in a diploid) to choose from.
The rate per generation is 1/(2N) so the average number of generations until this occurs is 2N generations.
On average two lineages are expected to coalesce to a common ancestor 2N generations in the past.
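The 2N expectation can be checked with a small simulation. This is only a sketch under the assumptions stated above (each generation the two lineages independently have a 1/(2N) chance of picking the same parental copy); the function names, the seed, and the replicate count are illustrative choices, not part of the original argument.

```python
import random


def generations_until_coalescence(N, rng):
    """Count generations back until two lineages pick the same parental copy."""
    p = 1.0 / (2 * N)          # per-generation probability of coalescing
    generations = 1
    while rng.random() >= p:   # no coalescence in this generation, go back one more
        generations += 1
    return generations


def mean_coalescence_time(N, replicates=5000, seed=1):
    """Average waiting time over many independent pairs of lineages."""
    rng = random.Random(seed)
    total = sum(generations_until_coalescence(N, rng) for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    N = 50
    print("simulated mean:", mean_coalescence_time(N), "expected:", 2 * N)
```

Individual replicates vary widely around the mean, which is the point made later about the large variance of coalescence times.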
The coalescence of more than two lineages
There are three possible ways for three lineages to coalesce: A with B, B with C, and A with C. So, the rate of coalescence of three lineages is three times faster on average, 3/(2N). The time until the first coalescence of three lineages is expected to be 2N/3 generations, followed by another 2N generations for the coalescence of the remaining two lineages.
There are six ways for four lineages to coalesce: A with B, B with C, C with D, A with D, A with C, and B with D. So the rate of coalescence is 6/(2N). We expect the total time until four lineages coalesce to a single ancestor to be [math]\frac{2N}{6} + \frac{2N}{3} + \frac{2N}{1}[/math] generations.
With each lineage we add to our sample, the number of possible ways for the first coalescence to occur goes up quickly. It follows the triangular number series, the number of pairwise comparisons, which is [math]{n\choose2}[/math] or [math]\frac{n(n-1)}{2}[/math]. This corresponds to 1, 3, 6, 10, 15, 21, ... for n = 2, 3, 4, 5, 6, 7, ...
Why is it [math]{n\choose2} = \frac{n(n-1)}{2}[/math]? This is a binomial coefficient question: out of n lineages, how many opportunities are there to choose two (to coalesce)?
[math]{n\choose2} = \frac{n!}{2!(n-2)!} = \frac{n \times (n-1) \times (n-2) \times (n-3) \times (n-4) \times \cdots}{2\times1 \times (n-2) \times (n-3) \times (n-4) \times \cdots} = \frac{n(n-1)}{2}[/math]
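The pair counts and the resulting expected times can be reproduced in a few lines. This is a minimal sketch; `coalescing_pairs` and `expected_tmrca` are hypothetical helper names, and exact fractions are used so that no rounding hides the pattern.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def coalescing_pairs(lineages):
    """List every pair of lineages that could be the next to coalesce."""
    return list(combinations(lineages, 2))


def expected_tmrca(n, N):
    """Expected generations until n lineages reach one ancestor.

    Sums the expected waiting time 2N / C(i, 2) for i = n, n-1, ..., 2 lineages.
    """
    return sum(Fraction(2 * N, comb(i, 2)) for i in range(2, n + 1))


if __name__ == "__main__":
    print(coalescing_pairs("ABCD"))                # six pairs for four lineages
    print([comb(n, 2) for n in range(2, 8)])       # 1, 3, 6, 10, 15, 21
    print(expected_tmrca(4, 1))                    # 2N/6 + 2N/3 + 2N with N = 1
```

With N = 1 the four-lineage sum is 1/3 + 2/3 + 2 = 3, that is, 3N generations.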
The coalescence of an infinite number of lineages
Of course there is never an infinite number of lineages that coalesce; species are finite in number. Still, it is useful to understand the upper limit in coalescence time that is approached with very large samples or in an entire population. Keep in mind that this is still only an expectation and there is a large variance associated with these expectations.
To solve the limit we have to find the sum of an infinite series that is made up of the pattern of the sum of coalescence times as the number of sampled lineages increases.
With i lineages present at a given step, the rate of coalescence is the corresponding triangular number (i(i-1)/2; these are 1, 3, 6, 10, 15, 21, ...) divided by 2N generations: [math]\frac{\frac{i(i-1)}{2}}{2N}[/math]. The expected time that each step adds to the sum is the inverse of this rate, or [math]\frac{2N}{\frac{i(i-1)}{2}}[/math].
Sum of the infinite series
[math]\sum_{i=2}^\infty\frac{2N}{\frac{i(i-1)}{2}}=\sum_{i=2}^\infty\frac{4N}{i(i-1)}=4N\sum_{i=2}^\infty\frac{1}{i(i-1)}[/math]
Note that the index is shifted down by one in the next line (the sum starts at i=1 instead of i=2), so that i(i-1) becomes i(i+1). This makes the calculation more convenient.
[math]4N\sum_{i=2}^\infty\frac{1}{i(i-1)}=4N\sum_{i=1}^\infty\frac{1}{i(i+1)}=4N\sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right)[/math]
Why is
[math]\frac{1}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}[/math]?
First in reverse: multiply each fraction by one, written as (i+1)/(i+1) or i/i, to equalize the denominators, then combine.
[math]\frac{1}{i}-\frac{1}{i+1}=\frac{i+1}{i+1}\frac{1}{i}-\frac{i}{i}\frac{1}{i+1}=\frac{i+1-i}{i(i+1)}=\frac{1}{i(i+1)}[/math]
Then forward: Add zero (i-i) to the numerator, split the fraction into two parts, then simplify.
[math]\frac{1}{i(i+1)}=\frac{i+1-i}{i(i+1)}=\frac{i+1}{i(i+1)}-\frac{i}{i(i+1)}=\frac{1}{i}-\frac{1}{i+1}[/math]
Plug in the first few numbers of the sum to see the pattern.
[math]\sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right) = \frac{1}{1} - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} + \frac{1}{3} - \frac{1}{4} + \frac{1}{4} - \frac{1}{5} + \cdots[/math]
After the first term the fractions cancel in pairs: -1/2 + 1/2, -1/3 + 1/3, -1/4 + 1/4, ... This pattern continues to infinity, with the partial sums falling short of one by smaller and smaller fractions. So,
[math]\sum_{i=2}^\infty\frac{1}{i(i-1)} = \sum_{i=1}^\infty\left(\frac{1}{i}-\frac{1}{i+1}\right) = 1[/math]
[math]4N\sum_{i=2}^\infty\frac{1}{i(i-1)} = 4N[/math]
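To see both the approach to 4N and the large variance mentioned earlier, the whole tree depth can be simulated. The sketch below assumes the usual continuous-time approximation, in which the waiting time with k lineages is exponentially distributed with rate k(k-1)/2 divided by 2N; that approximation, the function names, the seed, and the replicate count are assumptions made for illustration and are not taken from the text above.

```python
import random
from math import comb


def simulate_tmrca(n, N, rng):
    """Draw one time (in generations) until n lineages reach a single ancestor."""
    t = 0.0
    for k in range(n, 1, -1):            # k lineages remain: n, n-1, ..., 2
        rate = comb(k, 2) / (2 * N)      # coalescence rate per generation
        t += rng.expovariate(rate)       # waiting time until the next coalescence
    return t


def summarize_tmrca(n, N, replicates=10000, seed=1):
    """Return the sample mean and standard deviation of simulated tree depths."""
    rng = random.Random(seed)
    times = [simulate_tmrca(n, N, rng) for _ in range(replicates)]
    mean = sum(times) / replicates
    var = sum((t - mean) ** 2 for t in times) / (replicates - 1)
    return mean, var ** 0.5


if __name__ == "__main__":
    N = 100
    for n in (2, 5, 20):
        mean, sd = summarize_tmrca(n, N)
        print(n, round(mean, 1), round(sd, 1), "limit:", 4 * N)
```

The mean creeps toward 4N as n grows, while the standard deviation stays on the order of 2N because the final two-lineage interval dominates the spread.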
Summary
So, on average, we expect the lineages within a panmictic species to all coalesce, with the last coalescence event 4N generations in the past. From above we can also see that the expected time to coalescence of two lineages is 2N generations. This predicts that all but the last of the coalescence events will occur in the most recent 4N - 2N = 2N generations, and then the system will exist as two lineages until the last coalescence event 2N + 2N = 4N generations in the past.
Beyond a sample size of 10 or so (20 gene copies in a diploid) we quickly arrive at diminishing returns in capturing old lineages. Most of the contribution of additional samples is in adding small branches to the tips of the tree. It is very likely that the oldest "2N" lineages have already been captured within a small sample.
Finite Sampling Compared to Infinite Sampling
The oldest part of a coalescent tree is the point where the last two lineages coalesce. What is the chance of containing this in a sample of finite size? The sum of coalescent events gives a mistaken perception that each new lineage is adding time to the tips of the tree (this is what happens mathematically with calculation of the expectation, but is not the best way to visualize what is going on). The time until coalescence for two lineages is expected, on average, to be 2N generations; however, it could easily be shorter or longer than this.
Fraction of the Total Age
Exploring this logic, let's write the time until coalescence of a finite sample as a fraction of the theoretical limit with infinite sampling.
[math]\frac{\sum\limits_{i=2}^n\frac{2N}{\frac{i(i-1)}{2}}}{4N} = \frac{4N\sum\limits_{i=2}^n\frac{1}{i(i-1)}}{4N} = \sum\limits_{i=2}^n\frac{1}{i(i-1)}[/math]
For 2 to 6 lineages this gives fractions of
- 2: 1/2
- 3: 1/2 + 1/6 = 2/3
- 4: 1/2 + 1/6 + 1/12 = 3/4
- 5: 1/2 + 1/6 + 1/12 + 1/20 = 4/5
- 6: 1/2 + 1/6 + 1/12 + 1/20 + 1/30 = 5/6
The pattern becomes clear. For a sample of n lineages the expected fraction out of the total with infinite sampling is,
[math]\frac{n-1}{n}\mbox{.}[/math]
So, a sample of 10 lineages, from 5 diploid individuals, is expected to cover on average 90% of the total depth of the entire coalescent tree.
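The closed form can be checked exactly, and inverted to ask how many lineages are needed to reach a given share of the total depth. This is a small sketch; `depth_fraction` and `lineages_needed` are hypothetical helper names, and the target is passed as an exact fraction to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction


def depth_fraction(n):
    """Expected fraction of the total tree depth covered by n lineages.

    Computed directly as the sum of 1 / (i(i-1)) for i = 2..n.
    """
    return sum(Fraction(1, i * (i - 1)) for i in range(2, n + 1))


def lineages_needed(target):
    """Smallest n whose expected fraction (n-1)/n reaches the target (0 <= target < 1)."""
    n = 1
    while Fraction(n - 1, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, depth_fraction(n))                 # 1/2, 2/3, 3/4, 4/5, 5/6
    print(lineages_needed(Fraction(9, 10)))         # 10 lineages for 90%
```

The telescoping argument used for the infinite series explains the pattern: the partial sum up to n is 1 - 1/n.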
This is the expected fraction of time out of the total tree, but the probability of including the oldest lineage is slightly different.
Probability of Containing the Oldest Lineage
The last coalescence event divides the daughter lineages into two groups. With an infinite number of lineages, the proportion falling on one side can be anywhere from zero to one, with all values equally likely (uniform). Call the proportion on one side κ and the other side 1-κ. In order to not be connected through the oldest lineage (the last coalescence of two lineages), all n sampled lineages need to come from one side or the other. So, the chance that they are connected through the oldest part of the tree is [math]1-\kappa^n-(1-\kappa)^n[/math]. We integrate this over κ to find the total probability for all partitions.
[math]\int\left(1-\kappa^n-\left(1-\kappa\right)^n\right)\mathrm{d}\kappa = \frac{(1-\kappa)^n+n\kappa +\kappa -\kappa^{n+1}-\kappa(1-\kappa)^n}{n+1} + C[/math]
Substitute in one for κ, the upper limit of the integral.
[math]\int_0^1\left(1-\kappa^n-\left(1-\kappa\right)^n\right)\mathrm{d}\kappa = \frac{n}{n+1} + C[/math]
Here C stands for the part that is still missing: the contribution of the lower limit, κ = 0, which has to be subtracted. The question is now what C is.
A few examples can be solved directly to find the pattern. For example, n = 2:
[math]\int \left(1- \kappa^2 - (1-\kappa)^2\right)\mathrm{d}\kappa =\int \left(1- \kappa^2 - 1 + 2\kappa - \kappa^2\right)\mathrm{d}\kappa = \int \left(2\kappa - 2\kappa^2\right)\mathrm{d}\kappa = \kappa^2 -\frac{2}{3}\kappa^3[/math].
Substitute κ = 1 (this antiderivative is zero at κ = 0, so nothing needs to be subtracted).
[math]\kappa^2 -\frac{2}{3}\kappa^3 = 1-\frac{2}{3} = \frac{1}{3}[/math].
A few more examples show that for n = 3 the probability is 2/4; n = 4 gives 3/5, and n = 5 gives 4/6, etc.
Using n = 2, set the general expression equal to this result to solve for C.
[math]\frac{2}{2+1} + C = \frac{1}{3}[/math]
[math]C = \frac{1}{3} - \frac{2}{3} = -\frac{1}{3}[/math]
A few more examples show that C is -1/(n+1). This is exactly minus the general antiderivative evaluated at κ = 0, which is (1-0)^n/(n+1) = 1/(n+1). So the solution for the probability that a sample of n gene copies captures the oldest point in their coalescent (ignoring recombination in an ideal random mating population of constant size) is
[math]\frac{n}{n+1}-\frac{1}{n+1} = \frac{n-1}{n+1}[/math]
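The (n-1)/(n+1) result can also be checked by simulating the argument directly: draw the split proportion κ uniformly, place each of the n sampled lineages on the first side with probability κ, and count how often both sides are represented. This is only an illustrative sketch; the function names, seed, and number of trials are assumptions.

```python
import random
from fractions import Fraction


def capture_probability(n):
    """Closed-form probability that n lineages span the oldest coalescence."""
    return Fraction(n - 1, n + 1)


def simulate_capture(n, trials=100000, seed=1):
    """Estimate the same probability by Monte Carlo sampling of the partition."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        kappa = rng.random()                               # uniform split proportion
        on_first_side = sum(rng.random() < kappa for _ in range(n))
        if 0 < on_first_side < n:                          # both sides are sampled
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, float(capture_probability(n)), simulate_capture(n))
```

The simulated frequencies should agree with the closed form to within sampling error, which shrinks as the number of trials increases.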
Summary Table
Sample size (n) | Expected fraction of total age, (n-1)/n | Probability of containing the oldest lineage, (n-1)/(n+1) |
---|---|---|
1 | [math]0[/math] | [math]0[/math] |
2 | [math]1/2 = 0.5[/math] | [math]1/3 = 0.\overline{3}[/math] |
3 | [math]2/3=0.\overline{6}[/math] | [math]1/2=0.5[/math] |
4 | [math]3/4=0.75[/math] | [math]3/5=0.6[/math] |
5 | [math]4/5=0.8[/math] | [math]2/3=0.\overline{6}[/math] |
6 | [math]5/6=0.8\overline{3}[/math] | [math]5/7=0.\overline{714285}[/math] |
7 | [math]6/7=0.\overline{857142}[/math] | [math]3/4=0.75[/math] |
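8 | [math]7/8=0.875[/math] | [math]7/9=0.\overline{7}[/math] |
9 | [math]8/9=0.\overline{8}[/math] | [math]4/5=0.8[/math] |
10 | [math]9/10=0.9[/math] | [math]9/11=0.\overline{81}[/math] |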
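Under an ideal population model, small sample sizes can capture much of the deep coalescent history of a species: a sample of 10 gene copies is expected to span 90% of the total depth of the tree and has a 9/11 (about 82%) chance of including the oldest coalescence.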
Elaborations
- This assumes population sizes are constant. Population size can change over time and this will affect the times of coalescence (by having fewer or more potential ancestors to choose among).
- Migration among discrete populations.
- Isolation by distance.