# $F_{ST}$ and the multinomial

The binomial distribution that many of us are more familiar with is a special case of the multinomial distribution, which can have more than two outcomes to a "test". $F_{ST}$ is a common summary of how similar populations are on a genetic level and it is based on the concept of "missing heterozygosity". I was thinking of a way to illustrate the multinomial with the probabilities of genotypes under Hardy-Weinberg assumptions, an allele frequency, $p$, and the degree of missing heterozygosity, $F$.

Here is a small dataset to work with, counts of three genotypes of a SNP with two allelic states in a sample of 30 individuals.

• AA  4
• AT  10
• TT  16

We can go back to the binomial for a moment and estimate the allele frequency. If we ignore genotype (the association between alleles), there are 18 A's and 42 T's.

• A  18
• T  42

So 30% of the alleles are A alleles. With the binomial, the probability of the data $x$ given an allele frequency $p$ is

$P(x|p) = \frac{(a+t)!}{a! t!} p^a (1-p)^t$

where $a$ is the number of A alleles and $t$ is the number of T alleles.

Plot this in R to visualize the probability of the data given the range of possible allele frequencies.

 a=18 t=42 n=a+t

p probx

plot(p, probx, type='l',xlab="allele frequency", ylab="probability",
main="Binomial Probability")

We can find the maximum likelihood point by finding the partial derivative (slope) with respect to $p$, setting it equal to zero, and solving for $p$. First of all the binomial coefficient in the front is a constant so we can remove that for simplicity.

$P(x|p) \propto p^a (1-p)^t$

Let's define $n$ as the sum of $a$ and $t$.

$\frac{\partial \left( p^a (1-p)^{n-a} \right)}{\partial p}=a(1-p)^{n-a}p^{a-1}-n-a(1-p)^{n-a-1}p^a=p^{a-1}(1-p)^{n-a-1}(a-np)$

$p^{a-1}(1-p)^{n-a-1}(a-np)=0$

$p=\frac{a}{n}=\frac{18}{60}=0.3$

This makes intuitive sense. The best guess for the allele frequency is the number of times it is observed out of the total.

We can also integrate to get the area under the curve. The indefinite integral is

$\int p^a (1 - p)^{n - a} dp =\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}$

$\,_2F_1$ is the ordinary hypergeometric function. First this has to be evaluated at $p=1$ to put it on a scale where the area under the curve sums to one. For $a=18$ and $n=60$ this comes out to $c=1.7722\times10^{-17}$.

Evaluating

$\frac{1}{c}\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}|_{p=0.3}=0.4703$

at the maximum likelihood point, $p=0.3$, gives a cumulative probability of 0.4703, which is not far from 0.5, the median of the distribution. You can evaluate it online with this link to Wolfram Alpha:

https://www.wolframalpha.com/input/?i=(0.3%5E19*hypergeometric2f1(19,-42,20,0.3))%2F(19*1.7722*10%5E(-17))

Evaluating

$\frac{1}{c}\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}|_{p=0.5}=0.9991$

at $p=0.5$ for example gives a cumulative probability of 0.9991; so, more than 99% of the curve is below this point and we can say that $p$ is significantly less than $0.5$ given the data.

Okay, that was a warm up; now let's do this with the genotypes before adding in $F$.

(more soon. I haven't updated this blog in a while and wanted to get the ball rolling but I keep getting pulled away to do other things.)

# Mosquito Busters!

This weekend I teamed up with Matt Medeiros to collect mosquitoes. We were looking for adult Culex and Aedes. Culex larvae are easy to find in large numbers but the adults are difficult to locate.  We did find a lot of Aedes albopictus and brought them back to the lab for dissection and DNA isolation.

Matt had a backpack vacuum to catch the adults. In the image above we are looking for mosquitoes resting on the walls in a tunnel system under Honolulu (the image is fuzzy because of the low light levels and being hand held). It is impossible to resist Ghostbusters coming to mind.

In other, not unrelated, news we seem to have been too heavy handed with the tetracycline to clear Wolbachia from the mosquitoes. Most of the mosquitoes in the tetracycline cage died (presumably from the toxic effects of the antibiotic) and we did not recover any eggs.  We are trying again with a shorter term dose.

# Wolbachia PCR and Sequencing

We need a source of Wolbachia for injection into the mosquitoes. Using primers from Simōes et al. (2011) Áki and Maria were able to get these PCR products.

The gel is shown sideways with shorter fragments to the left. Our white[1] stock of Drosophila melanogaster is positive for Wolbachia infection. The Oregon-R stock appears to be uninfected. The Culex mosquitoes collected locally here on Oʻahu also appear to be infected and give a double band. We submitted them for sequencing and the Culex sequence came back very messy---consistent with possible super infection of multiple strains. However, the Drosophila w[1] sequence was clear enough to get a basepair sequence.

Here is the full sequence we recovered.

>Dmelwol_w[1]
TAATACGGAGAGRGCTAGCGTTATTCGGAATTATTGGGCGTAAAGGGCGCGTAGGCGRATTAGTAAGTTAAAAGTGAAATCCCAAGGCTCAACCTTGGAATTGCTTTTAAAACTGCTAATCTAGAGATTGAAAGAGGATAGAGGAATTCCTAGTGTAGAGGTGAAATTCGTAAATATTAGGAGGAACACCAGTGGCGAAGGCGTCTATCTGGTTCAAATCTGACGCTGAGGCGCGAAGSKTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGAATGTTAAATATGGGAAGTTTTACTTTCTGTATTACAGCTAACGCGTTAAACATTCCGCCTGGGGACTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACSCGAAAAACCTTACCACTCCTTGACATGGAAATTATACCTATTCGAAGGGATAGGGTCGGTTCGTCCGGGTTTCACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATCCTTAGTTACCATCAGTTAATGCTGGGGACTTTAAGGAAACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGATGTCAAGTCATCATGGCTATGAGTGCTACACACGTGCTACAATGGTGGCTACAATGGGCTGCAAAGTCGCGAGGCTAAGCTAATCCCTTAAAGGCCATCTCAGTCCGGATTGTACTCTGCAACTCGAGTGCATGAGTTGATCGCTAGTAATCGTGGATCAGCACGCCACGGTGATAGCTTTCTCGGGTCTG

This appears to be a Group A Wolbachia and is consistent with wMel.

Simōes, P. M., Mialdea, G., Reiss, D., Sagot, M. F., & Charlat, S. (2011). Wolbachia detection: an assessment of standard PCR protocols. Molecular Ecology Resources, 11(3), 567–572.

# SEM sponge imaging

Michael Wallstrom and Áki Láruson made some scanning electron microscope images of a new species of marine sponge we are working on (it is associated with invasive algal mats here in Hawai'i). I can't resist sharing a few of them here but I am saving the best for the publication we are working on.

Above is the surface of the sponge. If you look closely you can see the tiny ostia pores in the surface.

A closeup on the ostia, one is in cross section to the interior of the sponge.

Above, you can see two types of cells inside the sponge. The choanocytes use flagella (the threads) to move water through the sponge and filter food particles out of the seawater; amoebocytes crawl around and transport nutrients to other cells (among other functions).

In the above image, at the highest magnification for these images, you can see bacteria that are living in the sponge. The spiral objects are spirochaetes; some of these cause diseases in humans like Lyme disease, syphilis, relapsing fever, and leptospirosis.

# Marine virus isolation

Another undergrad, Stacy Paulino, is doing a project developing methods to discover phages that infect a marine bacteria that in turn cause disease in coral---as a kind of phage therapy for coral diseases. She recently obtained some nice plaques (clear areas where the virus has killed the bacteria growing on the plates).

An undergrad, Angelina Holcomb, is working on a telomere-aging genetics project in the lab. While setting up crosses she noticed an odd eye phenotype. The two flies at the top of the image have smaller eyes and disrupted head development, along with some other more subtle phenotypes, compared to their sibling controls at the bottom of the image.

# EM Algorithm Haikus

In a graduate "Ecology & Evolution" class I am co-teaching this semester I, on a lark, asked the students to write a Haiku on the Expectation-Maximization (EM) algorithm.

I introduced it in class to talk about finding maximum-likelihood answers to some complex problems using a simple approach. Earlier I gave them some homework problems to find answers to by writing EM algorithms in R. We had a midterm exam coming up and there is not enough time in-class for the students to program a new algorithm in R for an exam question. Rather, I asked them to write a Haiku about it to see what they would write and if this gave me any insight into their understanding of the method. I like the results so much I am posting them here.

Data and a guess
Update guess with each cycle
Find the true value.

then plug it in again and again
and again and again, yay!

Just a few short steps
Almost silly how simple

Find Expectation:
Calculate the Maximum.
Re-evaluate.

Looping, elegant
Throw your guess into the loop
And see what comes out

We must use a guess
As the change becomes smaller
We approach the truth

# Associate editor for Journal of Heredity

I've been asked to serve as an Associate Editor on the editorial board for the Journal of Heredity (established in 1903 with a 2015 impact factor of 2.075), to help handle manuscript submissions and reviews. I looked today and saw that I was just added to the front matter (it is a long list so I magnified it).