Genetic Drift

From Genetics Wiki
Jump to: navigation, search

Alleles are sampled from a population to make up the next generation. Anytime a random sample is taken it is likely to deviate slightly from the original population. Genetic drift is evolutionary sampling error. These deviations are not biased in any single direction towards an increase or decrease in frequency of a particular allele. However, these small deviations will accumulate over several generations and ultimately alleles will be lost until only a single allele remains in the population (assuming it is not reintroduced by mutation or from another population). In the simulation example below alleles can increase or decrease randomly until arriving at a frequency of zero or one but the average of all of the allele frequencies stays near the starting value. DriftN100.png

The equations and results given on this page are based on the assumptions of selective neutrality (there is no relative fitness advantage or disadvantage between the alleles), each allele in the following generation has an equal chance of being copied from any allele in the previous generation (which implies each individual allele has an equal chance of reproducing a copy of itself and the number of offspring are Poisson distributed), and a Wright-Fisher population model where generations do not overlap and each generation is of a constant size. Each of these assumptions can be altered for alternative models.

Summary Properties

The change in allele frequency due to genetic drift is without direction---it is just as likely to increase as to decrease---and is memory-less. Each generation is independent; the direction of frequency change in previous generations has no effect on the change in future generations.

The rate of change in allele frequency due to genetic drift is inversely proportional to the population size.

Complete loss (a frequency of zero) or fixation (a frequency of one) is an absorbing boundary. Once the allele frequency reaches this state it cannot return (due to genetic drift).

Ultimately an allele will reach loss or fixation within a population (if genetic drift is the only force acting).

On average half of the genetic variation present in a population will be removed by genetic drift in 1.39N generations, where N is the size of a diploid population.

The probability of ultimate fixation of an allele is its current allele frequency. The probability of loss is one minus the allele's current frequency. (All allele copies in the current generation have an equal chance (1/2N) of fixing, so the probability of a particular allele fixing is the number of copies it has out of the total.)

Most genetic variation, affected solely by drift, is expected to be "young" in an evolutionary sense, less than 2.77N generations, where N is the size of a diploid population.

If an allele is still segregating (polymorphic), after 2N generations its frequency is essentially independent of its starting frequency 2N generations ago; it can be at any value with almost an equal probability.

Exact Binomial Probabilities

Genetic drift in a population can be represented as a transition matrix based on binomial probabilities. This equation gives the probability of going from i copies of an allele to j copies of an allele in a single generation.

[math]T_{i \to j} = {2N \choose j} \left( \frac{i}{2N} \right)^j \left( 1 - \frac{i}{2N} \right)^{2N-j}[/math]

For a tiny population of 2N = 4 this corresponds to:

[math]T_{i \to j} = \left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ 0.316 & 0.422 & 0.211 & 0.047 & 0.004 \\ 0.062 & 0.25 & 0.375 & 0.25 & 0.062 \\ 0.004 & 0.047 & 0.211 & 0.422 & 0.316 \\ 0 & 0 & 0 & 0 & 1 \end{array} \right) [/math]

So, the probability of going from one copy of an allele to three copies of the allele in a single generation is 4.7% while the probability of going from one copy to zero copies is 31.6%.

This can be iterated over several generations from a starting state to see how the probabilities of allele frequencies change. In the example below a starting frequency of 1/3 for 9, 10, and 11 allele copies in a population of 2N=16 is tracked over 15 generations.

Drifttransition.png

Obviously, for large populations over many generations these calculations can get tedious.

Variance

The variance in the change of allele frequency, p, between generations due to genetic drift is

[math]\sigma^2_p = \frac{p(1-p)}{2N}[/math]

This is derived from a binomial variance ([math]\sigma^2 = n p(1-p)[/math]) where the number of samples drawn from the population is [math]n = 2N[/math]. Since we are talking about an allele frequency between zero and one we rescale this variance by [math](2N)^2[/math] (the standard deviation, σ, is on a scale from zero to 2N) to put it on an interval from zero to one.

[math]\sigma^2_p = \frac{2N p(1-p)}{(2N)^2}[/math]

[math]\sigma^2_p = \frac{2N p(1-p)}{4 N^2}[/math]

[math]\sigma^2_p = \frac{p(1-p)}{2 N}[/math]

It is clear that genetic drift is strongest when the allele is at intermediate frequencies, p(1-p), and when the population size, N, is small.

The variance in allele frequencies between populations can be used as a way to define Effective Population Size, [math]N_e[/math].

Average Loss of Heterozygosity

If a small population becomes isolated so that there is no introduction of new alleles (for example in a captive breeding program or as a small number of founders on a new island) and genetic drift is a predominant force, the average loss of heterozygosity in the genome over multiple generations can be estimated.

[math]H_t = H_0 (1-\frac{1}{2N})^t[/math]

[math]H_t[/math] is the remaining heterozygosity at generation [math]t[/math].

[math]H_0[/math] is the initial heterozygosity at generation zero.

[math]t[/math] is the time in generations.

[math]N[/math] is the (diploid) population size.

The derivation of this stems from the probability of two allele copies being identical by descent over a period of time (and thus losing any potential starting difference between them). There are 2N total allele copies in the population. The probability of two alleles arising from the exact same copy in the previous generation is [math]1/(2N)[/math] (this is the rate of loss of variation per generation---the fraction of existing heterozygosity, [math]H_0[/math], is exposed to this rate of loss). The probability that this did not occur is [math]1 - 1/(2N)[/math] per generation (this fraction of heterozygosity is preserved each generation). There is an independent chance of this each of t generations so these probabilities are multiplied: [math](1 - 1/2N)^t[/math]. (If two randomly selected allele copies arose from the same copy in the previous generation then they cannot be heterozygous; thus, this is a way to keep track of the change in average heterozygosity in the population.) Wright formulated this as the rate of increase of identity-by-descent in terms of a "fixation" index F (variation is lost when a single allele reaches fxation, 100%, in the population). The increase in homozygosity (fixation) is one minus the decrease in heterozygosity. [math]F_t = 1-H_t[/math]. So, the amount of fixation over time t is [math]F_t = 1 - (1-\frac{1}{2N})^t[/math].

This can be simplified by using an exponential function approximation in continuous time.

[math](1-\frac{1}{2N})^t \approx e^{- t/ 2N}[/math]

[math]H_t \approx H_0 e^{- t/ 2N}[/math]

The time until only half of the original genetic variation remains can be directly estimated as a rule-of-thumb.

[math]\frac{1}{2} = \frac{H_t}{H_0} \approx e^{- t/ 2N}[/math]

[math]\log{\frac{1}{2}} \approx \log{e^{- t/ 2N}}[/math]

[math]-0.6931 \approx - t/ 2N[/math]

[math]1.386 N \approx t [/math]

So, on average half of the genetic variation will be lost due to drift over a period of 1.386 N generations, where N is the diploid population size.

Driftheterozygosity.png

In the simulation example above (N = 100) heterozygosity declined on average (thick black line) from an initial 40% to approximately 20% after 139 generations.

As an example, the 'alalā (Hawaiian crow, Corvus hawaiiensis) became extinct in the wild in 2002 and the species only exists as about 100 individuals maintained in captivity. This predicts that it will take 139 generations for the genetic variation (heterozygosity) of the species to decline by half (if the number of captive individuals remains constant each generation).

In another example, the resident human population of Pitcairn Island is about 50 individuals. If they became completely isolated from the rest of the world and maintained the same population size the majority of their current genetic variation is predicted to still be present in the population 2,000 years from now (assuming a generation time of 30 years).

Of course these examples assume an ideal random-mating population of constant size where each individual has an equal chance of reproducing. There are many ways this scenario can be altered to increase or decrease the rate of loss of genetic variation.

Diffusion Approximation

Genetic drift can also be modeled by a diffusion approximation (Kimura 1955[1]), which has allowed many mathematical solutions to be derived.

The simplest model of the distribution of allele frequency probabilities, x, at time t in a population of constant size is given by:

[math]\phi(x,t)=\sum_{i=1}^{\infty} p (1-p) i (i+1) (2 i +1) \,F\!(1-i, i+2, 2, p) \,F\!(1-i, i+2, 2, x) \, e^{-i(i+1)t / 4N} [/math]

In the equation above F is the ordinary Gaussian hypergeometric function [math]{}_2F_1[/math]

This was used to generate the following frequency distributions (image from Kimura 1955[2]).

Kimura1955.png

Using the diffusion approximation Kimura and Ohta (1969[3]) determined the average time until fixation of an allele (given that it will fix) which is

[math]t = - \frac{1}{p} 4N (1-p) \log_e (1-p)[/math]

The average time until loss of an allele is

[math]t = - \frac{1}{1-p} 4N p \log_e (p)[/math]

The average time until either loss or fixation (i.e., the "waiting time" of a polymorphism) of an allele is

[math]t = - 4N \left( p \log_e (p) + (1-p) \log_e (1-p) \right)[/math]

This is maximized at p=1/2 or 2.77N generations.

These functions are plotted below.

Waitingtime.png

Role in Evolution

There is debate regarding how important genetic drift is in natural populations[4] versus other forces such as selection that in some ways can mimic genetic drift and it may be difficult to distinguish between the two.

Simulations and Programming Code

Forward Simulation

A forward simulation that keeps track of all allele copies in a population each generation is easy to understand but can run very slowly for large populations. The following R code was used to make this image. DriftN100.png

# a simple forward simulation of genetic drift in R

# uncomment to automatically clear last plot if any
# dev.off()

# start the timer
start <- proc.time()

# population size in number of organisms
N=100
# allele frequency
initp=0.5
# generations
gen=100
# number of replicant populations to plot
reps=30

# make an array of alleles to represent the population
pop<-list()
# population size in number of alleles
popsize=2*N
# make a list to record frequencies
freq<-list()

# a list to keep track of the average frequency
avefreq<-list()
for(g in 1:gen){
  avefreq[g]<-as.numeric(0.0)
}

# list of random plotting colors
cl <- sample(rainbow(1000))

# loop through replicate lines to plot
for(r in 1:reps){
  
  # reset frequency
  p=initp

  # assign first generation of alleles
  for(i in 1:popsize){
    if(i<=p*popsize){
      pop[i]<-1
    }else{
      pop[i]<-0
   }
  }

  # loop through generations
  for(g in 1:gen){
  
    # calculate allele frequency
    sum=0
     for(i in 1:popsize){
       if(pop[i]==1){sum=sum+1}
    }
     p=sum/popsize
     print(paste(g,p))
    freq[g]<-p # record frequency
    
    # not sure why so many steps are needed to get
    # the average to work
    temp1=p/reps
    temp2=as.numeric(avefreq[g])
    temp3=temp1+temp2
    avefreq[g]=temp3
  
   # make a copy of the array
   pop.copy<-pop
  
   # randomly select values to replace
   for(i in 1:popsize){
     j=sample(1:popsize, 1)
     pop[i]=pop.copy[j]
    }
  
  }
  
  x<-c(1:gen)
  par(new=TRUE)
  plot(x,freq,type='l',ylim=c(0,1),col = cl[r],xlab="generations",ylab="allele frequency")
  
}

# add the average line
par(new=TRUE)
plot(x,avefreq,type='l',ylim=c(0,1),col = "black",lwd=2,xlab="generations",ylab="allele frequency")

# show the run time
total.time<- proc.time() - start
print(total.time)

R code for plotting the change in heterozygosity over time due to genetic drift with a forward simulation.

Driftheterozygosity.png

# a simple forward simulation of genetic drift in R
# plots heterozygosity

# uncomment to automatically clear last plot if any
# dev.off()

# start the timer
start <- proc.time()

# population size in number of organisms
N=100
# allele frequency
initp=0.276
# generations
gen=139
# number of replicant populations to plot
reps=100

# make an array of alleles to represent the population
pop<-list()
# population size in number of alleles
popsize=2*N
# make a list to record frequencies
freq<-list()

# a list to keep track of the average frequency
avefreq<-list()
for(g in 1:gen){
  avefreq[g]<-as.numeric(0.0)
}

# list of random plotting colors
cl <- sample(rainbow(1000))

# loop through replicate lines to plot
for(r in 1:reps){
  
  # reset frequency
  p=initp

  # assign first generation of alleles
  for(i in 1:popsize){
    if(i<=p*popsize){
      pop[i]<-1
    }else{
      pop[i]<-0
   }
  }

  # loop through generations
  for(g in 1:gen){
  
    # calculate allele frequency
    sum=0
     for(i in 1:popsize){
       if(pop[i]==1){sum=sum+1}
    }
     p=sum/popsize
     print(paste(g,p))
    freq[g]<-2*p*(1-p) # record frequency (actually heterozygosity)
    
    # not sure why so many steps are needed to get
    # the average to work
    temp1=2*p*(1-p)/reps
    temp2=as.numeric(avefreq[g])
    temp3=temp1+temp2
    avefreq[g]=temp3
  
   # make a copy of the array
   pop.copy<-pop
  
   # randomly select values to replace
   for(i in 1:popsize){
     j=sample(1:popsize, 1)
     pop[i]=pop.copy[j]
    }
  
  }
  
  x<-c(1:gen)
  par(new=TRUE)
  plot(x,freq,type='l',ylim=c(0,0.5),col = cl[r],xlab="generations",ylab="heterozygosity")
  
}

# add the average line
par(new=TRUE)
plot(x,avefreq,type='l',ylim=c(0,0.5),col = "black",lwd=2,xlab="generations",ylab="heterozygosity")

# show the run time
total.time<- proc.time() - start
print(total.time)

Pseudosampling Method

Kimura (1980)[5] tested a method that is much more computationally efficient than forward simulation. You don't have to keep track of the entire population; in this approach the allele frequency p is adjusted randomly according to the variance expected under genetic drift given the allele frequency and population size. This was improved on by Kimura & Takahata (1983)[6] to keep track of small allele frequencies near the edges of 0 and 2N where the original method can lead to deviations.

The method is based on matching the variance of a uniformly distributed random variable ([math]\sigma^2=\frac{(b-a)^2}{12}=\frac{1}{3}[/math] where a and b are the values of the edges of the uniform distribution, here 1 and -1) to the variance expected under genetic drift [math]\sigma^2=\frac{p (1-p)}{2 N}[/math]. A "standard deviation" draw from this distribution is made each generation to adjust the allele frequency.

[math]p_g = p_{g-1} + U_{(-1,1)} \sqrt{\frac{3 p (1-p)}{2 N}}[/math], where [math]U_{(-1,1)}[/math] is the uniformly distributed random variable.

This approximation can arrive at a point just outside of the range of possible allele frequencies. The only remaining issue is to take care of the edges where the number of allele copies are near zero or 2N. In the simplest implementation set p equal to zero if p becomes negative and set p equal to one if p becomes greater than one. A better approach is to use Poisson distributed allele counts near the edges of the distribution. If 2Np is less than three then [math]p_g = \frac{P_{(2Np_{g-1})}}{2N}[/math] where [math]P_{(2Np_{g-1})}[/math] is a Poisson distributed random variable with mean [math]2Np_{g-1}[/math]. A similar adjustment is made at the upper end of the distribution where [math]2 N - 2 N (1-p) \lt 3[/math].

Binomial Probabilities

The R code below was used to generate the curve of allele frequency probabilities using a binomial transition matrix. It generated the following image: Drifttransition.png

# R code to generate Wright-Fisher probability plots of genetic drift
N=8 # diploid population size

n=2*N # number of gene copies

a=2*N + 1 # number of states from loss to fixation

g=15 # generations

# create an a-by-a matrix from an empty list
# have to tell it to treat as numbers with 'as.numeric()'
# or R will think they are characters and we can't
# do math operations on the cell entries
trans<-matrix(as.numeric(list()),nrow=a,ncol=a)

# create an a-by-g matrix for recording results
drifter<-matrix(as.numeric(list()),nrow=g,ncol=a)

# calculate binomial probability of each cell
for(j in 1:a){
  for(i in 1:a){
    trans[i,j]=dbinom(j-1, n, (i-1)/n)
  }
}

# look at result
# print(trans)

# vector of initial population states
vec=as.numeric(list(0,0,0,0,0,0,0,0,0,1/3,1/3,1/3,0,0,0,0,0)) # this needs to change if you increase pop size

for(i in 1:g){ # loop over generations
  
  drifter[i,]<-vec # use this if you want to plot the initial state
  
  # apply current state to the transitions
  nextgen<-vec*trans
  
  # print(nextgen)
  
  # need sum of columns for new state vector
  vec<-colSums(nextgen)
  
  #drifter[i,]<-vec # use this if you do not want to plot the initial state
  
}

print(drifter)

# make 3d plot with persp()
# http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/persp.html
xposition = c(1:g) # labels for x axis tick marks
yposition = c(1:a) # labels for y axis tick marks
persp(xposition,
      yposition,
      drifter, # matrix of z values
      theta = 50, phi = 10, # adjust to rotate the view
      xlab = "generations", ylab = "counts", zlab = "probability", # axis labels
      shade=0.3, # adjust shading of surface
      ticktype="detailed", # show axis tick values
      col='orange',
      ltheta = 90, lphi = 45 # adjust the 'light' position
)

Diffusion Approximation

Here is Mathematica code for the diffusion approximation.

Diffusion25.png

t = 10;
n = 1000;
p = 0.5;
m = 100;
(*the frequency distribution (probability) of the polymorphic \
fraction*)
poly =
Plot[Sum[p*(1 - p)*i*(i + 1)*(2*i + 1)*
Hypergeometric2F1[1 - i, i + 2, 2, x]*
Hypergeometric2F1[1 - i, i + 2, 2, p]*E^(-t*i*(i + 1)/(4*n)), {i,
1, m}], {x, 0, 1}, PlotRange -> {-1, 4}, Filling -> Axis,
PlotStyle -> Blue,
AxesLabel -> {"allele frequency", "probability density"}]

References and Footnotes

  1. Kimura, M. (1955). Solution of a process of random genetic drift with a continuous model. Proceedings of the National Academy of Sciences of the United States of America, 41(3), 144-150. http://www.pnas.org/content/41/3/144.full.pdf
  2. Kimura, M. (1955). Solution of a process of random genetic drift with a continuous model. Proceedings of the National Academy of Sciences of the United States of America, 41(3), 144-150. http://www.pnas.org/content/41/3/144.full.pdf
  3. Kimura, M., & Ohta, T. (1969). The average number of generations until fixation of a mutant gene in a finite population. Genetics, 61(3), 763. http://genetics.org/content/genetics/61/3/763.full.pdf
  4. Masel, J. (2011). Genetic drift. Current Biology, 21(20), R837-R838. http://www.sciencedirect.com/science/article/pii/S0960982211008827
  5. Kimura, M. (1980). Average time until fixation of a mutant allele in a finite population under continued mutation pressure: studies by analytical, numerical, and pseudo-sampling methods. Proceedings of the National Academy of Sciences, 77(1), 522-526. http://www.pnas.org/content/77/1/522.full.pdf
  6. Kimura, M., & Takahata, N. (1983). Selective constraint in protein polymorphism: study of the effectively neutral mutation model by using an improved pseudosampling method. Proceedings of the National Academy of Sciences, 80(4), 1048-1052. http://www.pnas.org/content/80/4/1048.full.pdf