Probability of fixation

From Genetics Wiki
Jump to: navigation, search

This was derived in Kimura 1962.

[math]u(p)=\frac{1-e^{-4N_esp}}{1-e^{-4N_es}}[/math]

If we are considering the initial frequency of a single new mutation in the population p=1/(2Ne),

[math]u(p)_1=\frac{1-e^{-4N_es\frac{1}{2N_e}}}{1-e^{-4N_es}}=\frac{1-e^{-2s}}{1-e^{-4N_es}}[/math].

And if 4Nes is large

[math]u(p)_2\approx\frac{1-e^{-2s}}{1}=1-e^{-2s}[/math].

[math]e^{2s}\approx 1+2s[/math]

[math]u(p)_2 \approx 1-e^{-2s} \approx 1-1+2s = 2s[/math].

This agrees with the results of Fisher 1930 (pp. 215--218) and Wright 1931 (pp. 129--133).

It may be surprising at first the the probability of fixation of a new allele that confers a fitness advantage is only approximately 2s. So if it gives a 3% fitness advantage the probability of fixation is only about 6%. In other words there is a 94% chance the new adaptive allele will be lost due to genetic drift. This implies that adaptive evolution of a species is very inefficient and that adaptive alleles have to occur repeatedly by mutation, to be lost by drift, before they eventually fix.

Why is this process so inefficient? When an allele is rare, such as a single copy as a new mutation, the forces of drift are typically much larger than the forces of selection. As an example work out the probability of sampling zero copies of an allele starting at a count of one from one generation to the next with a Poisson distribution and a mean of λ = 1.

[math]P(k)=\frac{\lambda^k e^{-\lambda}}{k!}[/math]

There is a 0.3679 probability of loss in the next generation. The probability in the next generation of one copy is also 0.3679, two copies is 0.1839, three copies 0.0613, four copies 0.0153, five copies 0.00307, etc. With a greater than a third chance of loss in one generation due to neutral drift this can more than outweigh the small advantage (say on the order of single percents) added by selection.

Even if the allele survives into the next generation it is most likely at a count of one and still has a greater than 1/3 loss in the following generation. If it manages to make it to two copies there is a 13.5% chance of loss in the next generation, a 27.1% chance of reducing back down to a count of one again, etc. Multiplying all of the paths from one in the first generation to zero in the third generation up to a count of five in the second, 1->0->0, 1->1->0, 1->2->0, 1->3->0, 1->4->0, 1->5->0, using a Poisson distribution gives, 0.3679, 0.1353, 0.02489, 0.003053, 0.0002807, 0.00002066 respectively for a total probability of 0.5315. So there is a greater than 1/3 loss in the second generation and a greater than 1/2 loss in the third generation.

Calculating probabilities beyond this point should be done with a binomial transition matrix to properly account for all of the possible paths (See notes below).

I set programmed this (using a transition matrix) and with an N = 20, s = -0.01, and g = 150 only 0.00254 of the trajectories are still segregating, 0.982 have been lost and 0.0154 have fixed at 100% frequency (compare this to 0.0165 using the standard equation above). Notice that the allele has a fitness cost; despite this it fixed in the population approximately 1.5% of the time with drift overpowering selection. If N and g are increased 10X to N = 200 and g = 1500 the proportion fixed drops to 6.55x10-6 (compare this to 6.78x10-6 using the standard equation above). This underscores the interaction of population size and the relative efficiency of selection. Small populations can "accidentally" fix deleterious alleles by drift and the average fitness can decay over time.

Notes

Working from the binomial

The deterministic change in frequency due to selection acting directly on individual alleles is

[math]p_{t+1} = \frac{w p}{w p + (1-p)}[/math]

where the fitness of the allele at frequency p is w and the fitness of the alternative allele is one, see Selection.

The probability of transitioning from the current number of "A" alleles to all possible numbers of "A" alleles in the next generation (0--2N) due to random sampling (genetic drift) is given by the Binomial Distribution.

[math]P\left(k_{t+1}\right)={2N \choose k} p^k \left(1-p\right)^\left(2N-k\right)[/math]

Here p is the current allele frequency and k is the number of alleles of the specific type in the next generation.

We can combine selection and drift by substituting pt+1 for p in the binomial and w = 1 + s.

[math]P\left(k_{t+1}\right)={2N \choose k} \left(\frac{(1+s) p}{(1+s) p + \left(1 - p\right)}\right)^k \left(1-\frac{(1+s) p}{(1+s) p + \left(1 - p\right)}\right)^\left(2N-k\right)[/math]

Use a transition matrix based on this formula and iterate a starting allele frequency a large number of generations to compare the outcome to the predictions from the equation for the fixation probability... (to be continued)

Have an initial vector, S, representing the starting state of one copy of an allele in the population, multiply it by a drift transition matrix, D, raised to a large number of generations, g.

[math]\mathbb{S} \mathbb{D}_{i \to j}^g = \begin{bmatrix}0, & 1, & 0, & 0, & \cdots & 0\end{bmatrix} \mathbb{D}_{i \to j}^g[/math]

D contains the binomial probabilities of transitioning from i to j number of alleles each generation.

Here is the code in R. (It needs to be cleaned up.)

# R code to generate Wright-Fisher probability plots of genetic drift
N=20 # diploid population size

n=2*N # number of gene copies

a=2*N + 1 # number of states from loss to fixation

g=150 # generations

s=-0.01 # selection

# create an a-by-a matrix from an empty list
# have to tell it to treat as numbers with 'as.numeric()'
# or R will think they are characters and we can't
# do math operations on the cell entries # can this just be done directly with numeric() instead of list()?
trans<-matrix(as.numeric(list()),nrow=a,ncol=a)

# create an a-by-g matrix for recording results
drifter<-matrix(as.numeric(list()),nrow=g,ncol=a)

# calculate binomial probability of each cell
for(j in 1:a){
  for(i in 1:a){
    p=(i-1)/n
    prob=(p*(1+s))/(p*(1+s) + (1 - p))
    if(p>1){p=1}
    if(p<0){p=0}
    trans[i,j]=dbinom(j-1, n, prob)
  }
}

# look at result
# print(trans)

# vector of initial population states
#vec=as.numeric(list(0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)) # this needs to change if you increase pop size
vec=numeric(a) # iniitalizes with all zeros
#vec[10]=vec[11]=vec[12]=1/3 # set p = 1/2N
vec[2]=1 # set p = 1/2N

for(i in 1:g){ # loop over generations
  drifter[i,]<-vec # use this if you want to plot the initial state
  
  # apply current state to the transitions
  nextgen<-vec*trans
  
  # print(nextgen)
  
  # need sum of columns for new state vector
  vec<-colSums(nextgen)
  
  #drifter[i,]<-vec # use this if you do not want to plot the initial state
}

#print(drifter)

if(0){ # turn off 3D plot
  # make 3d plot with persp()
  # http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/persp.html
  xposition = c(1:g) # labels for x axis tick marks
  yposition = c(1:a) # labels for y axis tick marks
  persp(xposition,
      yposition,
      drifter, # matrix of z values
      theta = 50, phi = 10, # adjust to rotate the view
      xlab = "generations", ylab = "counts", zlab = "probability", # axis labels
      shade=0.3, # adjust shading of surface
      ticktype="detailed", # show axis tick values
      col='orange',
      ltheta = 90, lphi = 45 # adjust the 'light' position
  )
}

print(paste("loss:", drifter[g,1]))
print(paste("segregating:", 1-drifter[g,1]-drifter[g,a]))
print(paste("fixed:", drifter[g,a]))
proportions=c(drifter[g,1], 1-drifter[g,1]-drifter[g,a], drifter[g,a])
if(1){
  barplot(proportions, names.arg=c("Lost", "Segregating", "Fixed"))
}

Kimura's derivation

This is derived from

[math]u(p) = \frac{\int_0^p G(x)\, \mbox{d} x}{\int_0^1 G(x)\, \mbox{d} x}[/math],

equation 3 of Kimura 1962.

[math]u(p,t)[/math] is the probability of fixation of an allele at frequency p within t generations.

The change in allele frequency ([math]\delta p[/math]) over short periods of time ([math]\delta t[/math]) is

[math]u(p, t+\delta t) = \int f(p, p+\delta p; \delta t) u(p+ \delta p, t) \, \mbox{d} (\delta p)[/math],

integrating over all values of changes in allele frequency ([math]\delta p[/math]).

A mean and variance of the change in allele frequency (p) per generation are defined as

[math]M_{\delta p}=\lim_{\delta t \to 0} \frac{1}{\delta t} \int (\delta p) f(p, p+\delta p; \delta t) \, \mbox{d} (\delta p)[/math]

[math]V_{\delta p}=\lim_{\delta t \to 0} \frac{1}{\delta t} \int (\delta p)^2 f(p, p+\delta p; \delta t) \, \mbox{d} (\delta p)[/math]

The probability of fixation given sufficient time for fixation to occur is

[math]u(p)=\lim_{t \to \infty} u(p,t)[/math]

[math]G(x) = e^{-\int \frac{2M_{\delta x}}{V_{\delta x}} \, \mbox{d} x}[/math]

(to be continued ... I need to work through this and my calculus is rusty.)

A different approach

[math]u(p)=\frac{1-e^{-4N_esp}}{1-e^{-4N_es}}[/math]

The numerator is the probability of not zero (loss) in a Poisson distribution with a mean of 4Nesp. 2Np is the number of copies of the allele in the population. This is multiplied by 2s. [math]2Np \times 2s = 4Nsp[/math]

Why is s multiplied by two here?

The denominator is the same thing but with p=1 (the largest value possible). This rescales the numerator to be a fraction out of one(?).

I suspect there is a more intuitive approach to understanding this by exploring this line of reasoning but I am not quite seeing it yet.

[math]1-e^{-4N_esp}\approx 1- 1 - 4Nsp = -4Nsp[/math] ?

[math]u(p)=\frac{1-e^{-4N_esp}}{1-e^{-4N_es}}\approx\frac{-4Nsp}{-4Ns}=p[/math] ?

[math]1-e^{-4N_esp}\approx 1- (1 - 2s)^{2Np}[/math] ?

There is more to go through in Fisher 1930 (pp. 215--218) and Wright 1931 (pp. 129--133). I need to start there and work forward.