Monthly Archives: January 2017

F_{ST} and the multinomial

The binomial distribution that many of us are more familiar with is a special case of the multinomial distribution, which can have more than two outcomes to a "test". F_{ST} is a common summary of how similar populations are on a genetic level and it is based on the concept of "missing heterozygosity". I was thinking of a way to illustrate the multinomial with the probabilities of genotypes under Hardy-Weinberg assumptions, an allele frequency, p, and the degree of missing heterozygosity, F.

Here is a small dataset to work with, counts of three genotypes of a SNP with two allelic states in a sample of 30 individuals.

  • AA  4
  • AT  10
  • TT  16

We can go back to the binomial for a moment and estimate the allele frequency. If we ignore genotype (the association between alleles), there are 18 A's and 42 T's.

  • A  18
  • T  42

So 30% of the alleles are A alleles. With the binomial, the probability of the data x given an allele frequency p is

P(x|p) = \frac{(a+t)!}{a! t!} p^a (1-p)^t

where a is the number of A alleles and t is the number of T alleles.

Plot this in R to visualize the probability of the data given the range of possible allele frequencies.

a=18
t=42
n=a+t
p <- seq(0,1,length=1000)
probx <- dbinom(a,n,p)
plot(p, probx, type='l',xlab="allele frequency", ylab="probability",
     main="Binomial Probability")

We can find the maximum likelihood point by finding the partial derivative (slope) with respect to p, setting it equal to zero, and solving for p. First of all the binomial coefficient in the front is a constant so we can remove that for simplicity.

P(x|p) \propto p^a (1-p)^t

Let's define n as the sum of a and t.

\frac{\partial \left( p^a (1-p)^{n-a} \right)}{\partial p}=a(1-p)^{n-a}p^{a-1}-n-a(1-p)^{n-a-1}p^a=p^{a-1}(1-p)^{n-a-1}(a-np)

p^{a-1}(1-p)^{n-a-1}(a-np)=0

p=\frac{a}{n}=\frac{18}{60}=0.3

This makes intuitive sense. The best guess for the allele frequency is the number of times it is observed out of the total.

We can also integrate to get the area under the curve. The indefinite integral is

\int p^a (1 - p)^{n - a} dp =\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}

\,_2F_1 is the ordinary hypergeometric function. First this has to be evaluated at p=1 to put it on a scale where the area under the curve sums to one. For a=18 and n=60 this comes out to c=1.7722\times10^{-17}.

Evaluating

\frac{1}{c}\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}|_{p=0.3}=0.4703

at the maximum likelihood point, p=0.3, gives a cumulative probability of 0.4703, which is not far from 0.5, the median of the distribution. You can evaluate it online with this link to Wolfram Alpha:

https://www.wolframalpha.com/input/?i=(0.3%5E19*hypergeometric2f1(19,-42,20,0.3))%2F(19*1.7722*10%5E(-17))

Evaluating

\frac{1}{c}\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}|_{p=0.5}=0.9991

at p=0.5 for example gives a cumulative probability of 0.9991; so, more than 99% of the curve is below this point and we can say that p is significantly less than 0.5 given the data.

Okay, that was a warm up; now let's do this with the genotypes before adding in F.

Going back to the original genotype dataset of 30 individuals.

  • AA  4
  • AT  10
  • TT  16

The multinomial probability is

P(x|p) = \frac{(q_{aa}+q_{at}+q_{tt})!}{q_{aa}! q_{at}! q_{tt}!} \left[p^2\right]^{q_{aa}} \left[2 p (1-p)\right]^{q_{at}} \left[(1-p)^2\right]^{q_{tt}}

where q_{xx} are the counts of each genotype in the sample and p is still the frequency of allele "A."

We can plot this along with the binomial curve for comparison with the following Mathematica code.

aa = 4;
at = 10;
tt = 16;
a = 2*aa + at;
t = 2*tt + at;
Plot[{Multinomial[aa, at, tt] (p^2)^aa (2 p (1 - p))^at ((1 - p)^2)^
    tt, Binomial[(a + t), a] p^a (1 - p)^t}, {p, 0, 1} , 
 PlotRange -> Full, PlotLegends -> {"Multinomial", "Binomial"}, 
 AxesLabel -> {"p", "P(x)"}]

The curves are similar, they peak in the same place, but overall the multinomial curve has a lower probability than the binomial. This is because there is more information in the multinomial. Each set of the data for the binomial (allele counts) can correspond to a range of ways to divide the data into genotype counts. We can always reduce genotypes to number of alleles but we cannot do the reverse. So, a particular genotype outcome is less likely (a smaller part of all possible outcomes) than the corresponding allele count outcome.

Okay, now it gets a little more fun. We have a probability over a dimension of allele frequency of p from zero to one. Let's add another dimension of F but first let's think about what F means. F is the proportion of expected (according to Hardy-Weinberg) heterozygosity that is missing. The heterozygous genotypes don't just disappear. The alleles get reorganized into homozygous genotypes. The frequency of each allele in the heterozygotes is 1/2; so, each homozygous genotype is increased by an equal proportion, which is half of the heterozygotes that are missing. Thus, the frequency of each genotype expected given p and F is

f(AA) = p^2 + F 2 p (1-p)/2 = p^2 + F p (1-p)
f(AT) = 2 p (1-p) - F 2 p (1-p) = 2 p (1-p) (1-F)
f(TT) = (1-p)^2 + F 2 p (1-p)/2 = (1-p)^2 + F p (1-p)

This can be incorporated into the individual genotype probability components of the multinomial.

P(x|p,F) = \frac{(q_{aa}+q_{at}+q_{tt})!}{q_{aa}! q_{at}! q_{tt}!} \left[p^2+F p (1-p)\right]^{q_{aa}} \left[2 p (1-p) (1-F)\right]^{q_{at}} \left[(1-p)^2+ F p (1-p)\right]^{q_{tt}}

We can simply the frequency of each genotype terms by symbolizing then with corresponding f_{XX} terms.

P(x|p,F) = \frac{(q_{aa}+q_{at}+q_{tt})!}{q_{aa}! q_{at}! q_{tt}!} f_{AA}^{q_{aa}} f_{AT}^{q_{at}} f_{TT}^{q_{tt}}

This is what the likelihood surface looks like over the range of p from zero to one and F from zero to one.

Here is the mathematica code used to make the plot.

a = 4;
h = 10;
t = 16;
Plot3D[Multinomial[a, h, t] (p^2 + f p (1 - p))^
   a ((1 - f) 2 p (1 - p))^h ((1 - p)^2 + f p (1 - p))^t, {p, 0, 
  1}, {f, 0, 1}, PlotRange -> Full, AxesLabel -> {"p", "F", ""}, 
 PlotPoints -> 100]

The curve at F=0 corresponds to the multinomial curve above (in the plot comparing it to the binomial). We can also see that allowing positive values for F can result in higher likelihoods of the data. What do we actually expect F to be? The number of heterozygotes are 10 out of a sample of 30 genotypes, so the fraction of heterozygous genotypes is 0.333. The allele frequency is 0.3, so we expect 2 p (1-p) = 2 \times 0.3 \times 0.7 = 0.42 of the sample to be heterozygous. There is a 0.42-0.333 = 0.087 proportion of heterozygotes that are missing. This is 0.087 / 0.42 = 0.207 of the heterozygosity expected. So our point estimate of F is 0.207, or about 21% of the heterozygosity is missing. This appears to correspond to the top of the likelihood peak. Finally, notice that the peak is narrower in the p dimension than in the F dimension. There is more information in the dataset about the allele frequency than there is about the genotype frequencies. There is a larger sample, 60, of alleles than there are of genotypes, 30. So, the estimate of p is more refined than the estimate of F.

The range of zero to one makes sense for p. An allele frequency cannot be less than zero or greater than 1. Also, F cannot be more than one (all of the expected heterozygosity is missing), but it can be less than zero (negative F means there is an excess of heterozygosity). It looks like a large volume of the likely values of p and F given the data correspond to a negative F. So, F is not significantly greater than zero (Hardy-Weinberg expectations) for this dataset. Let's extend the plot to encompass a range of negative values for F.

...and something weird happened. There is a second likelihood peak with different values of p and F that actually has an even higher likelihood at its peak, but the parameter values are unreasonable (given what we can see in the dataset about the allele frequency and level of heterozygosity). The minimum possible F is negative one. If the entire sample were made up of heterozygotes then the allele frequency is one half, the proportion of heterozygotes expected would also be one half. This leads to a negative one half divided by one half or negative one. Extending the plot out to the full range looks even more pathological (I limited the scale and truncated the peak because it was so large).

Long story short it turns out that this is an artifact... If you closely inspect what is going on by spot checking a few points in the second peak you find that the expected number of the rarer homozygote is negative, which is impossible, and in the likelihood calculation this is raised to the fourth power, which makes it positive, etc. This is a good example of how unanticipated, and initially cryptic, errors can creep into a computational script. This is fixed by setting up an if statement to only evaluate the likelihood if the expected number of homozygotes are positive; otherwise the likelihood is zero, which is true. Here is the updated script and a plot.

a = 4;
h = 10;
t = 16;
Plot3D[Multinomial[a, h, t] If[
   p^2 + f p (1 - p) > 
    0, (p^2 + f p (1 - p))^a ((1 - f) 2 p (1 - p))^
     h ((1 - p)^2 + f p (1 - p))^t, 0], {p, 0, 1}, {f, -1, 1}, 
 PlotRange -> Full, AxesLabel -> {"p", "F", ""}, PlotPoints -> 100]

Because F and p appear uncorrelated we can feel good about taking a slice of the likelihood surface through the maximum-likelihood value of p as a reasonable approximation of the (contour) likelihood surface of F in order to evaluate the range of possible parameter values.

This is not yet F_{ST}. We have been evaluating the missing heterozygosity of a single population sample. Heterozygosity can be reduced within a population by inbreeding effects where individuals do not mate randomly but tend to mate more often with closer relatives (selfing in plants is an extreme example). What was calculated above is F_{IS}.

Let's say we had another sample from a different population. So now we have our original population one

  • AA 4
  • AT 10
  • TT 16

and population two.

  • AA 10
  • AT 14
  • TT 6

Plotting the likelihood surfaces together.

a1 = 4;
h1 = 10;
t1 = 16;
a2 = 10;
h2 = 14;
t2 = 6;
Plot3D[{If[p^2 + f p (1 - p) > 0 && (1 - p)^2 + f p (1 - p) > 0, 
   Multinomial[a1, h1, t1] (p^2 + f p (1 - p))^
     a1 ((1 - f) 2 p (1 - p))^h1 ((1 - p)^2 + f p (1 - p))^t1, 0], 
  If[p^2 + f p (1 - p) > 0 && (1 - p)^2 + f p (1 - p) > 0, 
   Multinomial[a2, h2, t2] (p^2 + f p (1 - p))^
     a2 ((1 - f) 2 p (1 - p))^h2 ((1 - p)^2 + f p (1 - p))^t2, 0], 
  0.0005}, {p, 0, 1}, {f, -1, 1}, PlotRange -> Full, 
 AxesLabel -> {"p", "F", ""}, PlotPoints -> 100]

The second sample has a higher allele frequency and a value of F_{IS} closer to zero (however, the two ranges of likely F values overlap and might not be considered significantly different). The second allele frequency is p_2 = (10+14/2)/30 = 0.567. With the two allele frequencies we can now calculate a point estimate of F_{ST} between the population samples. The expected heterozygosity in sample 2 is H_{S,2} = 2 \times 0.567 \times (1-0.567) = 0.491. We already showed above that p_1 = 0.3 and H_{S,1} = 0.42. The average allele frequency over both populations is \bar{p} = 0.434 and the average subpopulation expected heterozygosity is \bar{H}_S = 0.456. In a combined total sample, if there were population structure, the expected heterozygosity is H_T = 2 \bar{p} (1-\bar{p}) = 0.491. Therefore,

F_{ST} = \frac{H_T - \bar{H_S}}{H_T} = \frac{0.491-0.456}{0.491} = 0.0713

or about 7% of the expected heterozygosity is missing because of differences in allele frequencies between the populations. Why does this work? Any time there is a difference in allele frequency the average subpopulation expected heterozygosity will be lower than the combined expected heterozygosity. Why? Because the curve of heterozygosity over allele frequency space is a concave function; the average of two points will always be lower than the value of the function evaluated at that point.

F_{ST} is the gap between H_T and H_S, which is 7% of H_T.

Before bringing this back together I need to introduce one more F-statistic. This one is F_{IT}. It is the total missing heterozygosity due to both population structure and inbreeding effects. If we combine the two samples we get

  • AA 14
  • AT 24
  • TT 22

The allele frequency p_T = 0.433 (the same as the average between the two samples) and the fraction of heterozygotes in the combined sample is 24/60= 40%. We expect a heterozygosity of  2 p_T (1-p_T) = 0.491. So, F_{IT} = (0.491-0.4)/0.491 = 0.185.

F_{ST} between populations is related to F_{IT} and F_{IS}. If we know the latter two we can estimate the former.

F_{ST} = \frac{F_{IS}-F_{IT}}{F_{IS}-1}

Using an F_{IT}=0.185 and let's say an F_{IS}=0.125 we get

F_{ST} = \frac{0.125-0.185}{0.125-1} = \frac{0.06}{0.875} = 0.0686.

We can estimate F_{IT} and F_{IS} using the multinomial, which means we can also estimate F_{ST}. This is starting to get more complicated with lots of parameters to estimate from the data, two allele frequencies and two F statistics (if we have a single F_{IS} over both populations, which seems biologically reasonable) or four parameters in total. A good way to explore higher dimensional parameter space is to use a Markov Chain in a Monte Carlo fashion (MCMC) with Metropolis-Hastings algorithm. I save the details of this method for a later post. For now let's write a "hill climbing" Monte Carlo script (Monte Carlo refers to guessing parameter values and hill climbing means that it moves to values that produce higher probabilities of the data). Here it is in R.

# data
pop1=c(4,10,16)
pop2=c(10,14,6)
popt = pop1 + pop2

# intialize variables
probbest = 0
p1best=0.5
p2best=0.5
fisbest=0
fitbest=0

# loop through some steps
for (i in 1:100000) {

  tune = 1/i # narrows in as the run goes
  
  # pick new parameter values and check to make 
  # sure they are not out of bounds
  p1 = p1best + runif(1, min = -tune, max = tune)
  if (p1<0) {p1=0} else if (p1>1) {p1=1}
  p2 = p2best + runif(1, min = -tune, max = tune)
  if (p2<0) {p2=0} else if (p2>1) {p2=1}
  fis = fisbest + runif(1, min = -tune, max = tune)
  if (fis< -1) {fis= -1} else if (fis>1) {fis=1}
  fit = fitbest + runif(1, min = -tune, max = tune)
  if (fit< -1) {fit= -1} else if (fit>1) {fit=1}
  pt = (p1+p2)/2

  # genotype probabilities
  gp1=c(p1^2+fis*p1*(1-p1),2*p1*(1-p1)*(1-fis),(1-p1)^2+fis*p1*(1-p1))
  gp2=c(p2^2+fis*p2*(1-p2),2*p2*(1-p2)*(1-fis),(1-p2)^2+fis*p2*(1-p2))
  gpt=c(pt^2+fit*pt*(1-pt),2*pt*(1-pt)*(1-fit),(1-pt)^2+fit*pt*(1-pt))

  # sample probabilities
  if(p1^2+fis*p1*(1-p1)>0 & (1-p1)^2+fis*p1*(1-p1)>0) {
    prob1<-dmultinom(pop1, , gp1)
  } else {
    prob1 = 0
  }
  if(p2^2+fis*p2*(1-p2)>0 & (1-p2)^2+fis*p2*(1-p2)>0) {
    prob2<-dmultinom(pop2, , gp2)
  } else {
    prob2 = 0
  }
  if(pt^2+fit*pt*(1-pt)>0 & (1-pt)^2+fit*pt*(1-pt)>0) {
    probt<-dmultinom(popt, , gpt)
  } else {
    probt = 0
  }
  # combined probabilities
  prod = prob1 * prob2 * probt

  # keep the best ones
  if(prod>probbest){
    p1best=p1
    p2best=p2
    fisbest=fis
    fitbest=fit
    probbest=prod
  } else {
    p1 = p1best
    p2=p2best
    fis = fisbest
    fit = fitbest
  }
  pt = (p1+p2)/2

  # calculate fst
  fst = (fis-fit)/(fis-1)

  # report results
  print(c(p1,p2,fis,fit,fst))
}

Running this for 100,000 steps I got estimates of p_1 = 0.299, p_2 = 0.567, F_{IS} = 0.125, F_{IT} = 0.185, and F_{ST} = 0.0687; not bad.

Okay, that's going to have to be it for this post. This could be extended to cases where the sample sizes are unequal, multiple loci (F is shared across loci but p is not), more than two alleles, more than two populations and hierarchical population groupings, and an MCMC script to estimate the probability distribution of F_{ST}, but I'll have to save that for later.