The binomial distribution that many of us are more familiar with is a special case of the multinomial distribution, which can have more than two outcomes to a "test". is a common summary of how similar populations are on a genetic level and it is based on the concept of "missing heterozygosity". I was thinking of a way to illustrate the multinomial with the probabilities of genotypes under Hardy-Weinberg assumptions, an allele frequency, , and the degree of missing heterozygosity, .
Here is a small dataset to work with, counts of three genotypes of a SNP with two allelic states in a sample of 30 individuals.
- AA 4
- AT 10
- TT 16
We can go back to the binomial for a moment and estimate the allele frequency. If we ignore genotype (the association between alleles), there are 18 A's and 42 T's.
- A 18
- T 42
So 30% of the alleles are A alleles. With the binomial, the probability of the data given an allele frequency is
where is the number of A alleles and is the number of T alleles.
Plot this in R to visualize the probability of the data given the range of possible allele frequencies.
plot(p, probx, type='l',xlab="allele frequency", ylab="probability",
We can find the maximum likelihood point by finding the partial derivative (slope) with respect to , setting it equal to zero, and solving for . First of all the binomial coefficient in the front is a constant so we can remove that for simplicity.
Let's define as the sum of and .
This makes intuitive sense. The best guess for the allele frequency is the number of times it is observed out of the total.
We can also integrate to get the area under the curve. The indefinite integral is
is the ordinary hypergeometric function. First this has to be evaluated at to put it on a scale where the area under the curve sums to one. For and this comes out to .
at the maximum likelihood point, , gives a cumulative probability of 0.4703, which is not far from 0.5, the median of the distribution. You can evaluate it online with this link to Wolfram Alpha:
at for example gives a cumulative probability of 0.9991; so, more than 99% of the curve is below this point and we can say that is significantly less than given the data.
Okay, that was a warm up; now let's do this with the genotypes before adding in .
(more soon. I haven't updated this blog in a while and wanted to get the ball rolling but I keep getting pulled away to do other things.)