Author Archives: Floyd A. Reed

CRISPR/Cas9 Commentary in Genetics

Late last year I was asked to write a commentary for an article that was coming out in Genetics. The article is Unckless et al. (2017) “Evolution of resistance against CRISPR/Cas9 gene drive.” The authors addressed the high mutation rate of CRISPR/Cas9 gene drive and its potential to generate alleles in the population that are resistant to the drive effects (which I have mentioned briefly before on this blog). Long story short, the evolution of resistance is essentially inevitable. The article inspired the cover image for that month (click on the image for a link and information about the artist Kent Smith).

This could be viewed as a positive thing, depending on the application, and there are already strategies being proposed to address the mutation issue. I tried to put this into perspective and suggest some alternatives in my article: Reed (2017) "CRISPR/Cas9 Gene Drive: Growing Pains for a New Technology."

DLNR and USFWS funding!

The State of Hawaiʻi, Department of Land and Natural Resources (DLNR) and the U.S. Fish and Wildlife Service are supporting our labs' (Reed at UH Mānoa and Sutton at UH Hilo) Wolbachia project to generate incompatible mosquito mating types. This started a little while back but I didn't want to preempt the press release that DLNR was planning.

It is a huge help to get this support. We currently have a colony of Culex mosquitoes that are cleared of Wolbachia with tetracycline antibiotic. We are planning to begin microinjection of wMel Wolbachia from Drosophila fruit flies very soon. Spring break is coming up and it might be very good timing for getting a lot of work done in the lab.

Here are some links related to the press release:

F_{ST} and the multinomial

The binomial distribution that many of us are more familiar with is a special case of the multinomial distribution, which allows more than two outcomes per "trial." F_{ST} is a common summary of how similar populations are at a genetic level, and it is based on the concept of "missing heterozygosity." I was thinking of a way to illustrate the multinomial with the probabilities of genotypes under Hardy-Weinberg assumptions, an allele frequency, p, and the degree of missing heterozygosity, F.

Here is a small dataset to work with, counts of three genotypes of a SNP with two allelic states in a sample of 30 individuals.

  • AA  4
  • AT  10
  • TT  16

We can go back to the binomial for a moment and estimate the allele frequency. If we ignore genotype (the association between alleles), there are 18 A's and 42 T's.

  • A  18
  • T  42

So 30% of the alleles are A alleles. With the binomial, the probability of the data x given an allele frequency p is

P(x|p) = \frac{(a+t)!}{a! t!} p^a (1-p)^t

where a is the number of A alleles and t is the number of T alleles.
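As a quick numerical spot check before plotting (a sketch in Python alongside the post's R and Mathematica; the function name is mine), the binomial probability of the data can be evaluated directly at, say, p = 0.3:

```python
from math import comb

def binom_prob(a, n, p):
    """Binomial probability of a A alleles out of n at allele frequency p."""
    return comb(n, a) * p**a * (1 - p)**(n - a)

# 18 A alleles out of 60 total, evaluated at p = 0.3
print(binom_prob(18, 60, 0.3))  # ~0.11
```
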

Plot this in R to visualize the probability of the data given the range of possible allele frequencies.

a <- 18 # number of A alleles
n <- 60 # total number of alleles
p <- seq(0, 1, length = 1000)
probx <- dbinom(a, n, p)
plot(p, probx, type = 'l', xlab = "allele frequency", ylab = "probability",
     main = "Binomial Probability")

We can find the maximum likelihood point by finding the partial derivative (slope) with respect to p, setting it equal to zero, and solving for p. First of all the binomial coefficient in the front is a constant so we can remove that for simplicity.

P(x|p) \propto p^a (1-p)^t

Let's define n as the sum of a and t.

\frac{\partial \left( p^a (1-p)^{n-a} \right)}{\partial p} = a p^{a-1} (1-p)^{n-a} - (n-a) p^a (1-p)^{n-a-1} = p^{a-1}(1-p)^{n-a-1}(a-np)

Setting this equal to zero, a - np = 0, gives \hat{p} = a/n = 18/60 = 0.3.

This makes intuitive sense. The best guess for the allele frequency is the number of times it is observed out of the total.
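To double-check the calculus, a quick grid search (a Python sketch; the names are mine) over the likelihood kernel should put the maximum at a/n = 0.3:

```python
def binom_kernel(p, a=18, n=60):
    # likelihood kernel with the constant binomial coefficient dropped
    return p**a * (1 - p)**(n - a)

# evaluate on a fine grid of allele frequencies and take the argmax
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=binom_kernel)
print(p_hat)  # 0.3
```
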

We can also integrate to get the area under the curve. The indefinite integral is

\int p^a (1 - p)^{n - a} dp =\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}

\,_2F_1 is the ordinary hypergeometric function. First this has to be evaluated at p=1 to find the normalizing constant that puts the area under the curve at one. For a=18 and n=60 this comes out to c=1.7722\times10^{-17}.


\frac{1}{c}\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}|_{p=0.3}=0.4703

at the maximum likelihood point, p=0.3, gives a cumulative probability of 0.4703, which is not far from 0.5, the median of the distribution. (This expression can be evaluated online with Wolfram Alpha.)


\frac{1}{c}\frac{ p^{a+1} \,_2F_1(a+1, a-n, 2+a, p)}{a+1}|_{p=0.5}=0.9991

at p=0.5 for example gives a cumulative probability of 0.9991; so, more than 99% of the curve is below this point and we can say that p is significantly less than 0.5 given the data.
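These cumulative probabilities can also be checked without the hypergeometric function. The normalized integral is the regularized incomplete beta function I_p(a+1, n-a+1), which for integer arguments equals a binomial tail probability: I_p(k, m) = P(X \geq k) for X \sim Binomial(k+m-1, p). A Python sketch using only the standard library (the function name is mine):

```python
from math import comb

def beta_cdf(p, a, b):
    """Regularized incomplete beta I_p(a, b) for positive integers a, b,
    via the identity I_p(a, b) = P(X >= a) with X ~ Binomial(a+b-1, p)."""
    n = a + b - 1
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(a, n + 1))

# normalized area under p^18 (1-p)^42, i.e. Beta(19, 43)
print(beta_cdf(0.3, 19, 43))  # ~0.4703
print(beta_cdf(0.5, 19, 43))  # ~0.999
```
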

Okay, that was a warm up; now let's do this with the genotypes before adding in F.

Going back to the original genotype dataset of 30 individuals.

  • AA  4
  • AT  10
  • TT  16

The multinomial probability is

P(x|p) = \frac{(q_{aa}+q_{at}+q_{tt})!}{q_{aa}! q_{at}! q_{tt}!} \left[p^2\right]^{q_{aa}} \left[2 p (1-p)\right]^{q_{at}} \left[(1-p)^2\right]^{q_{tt}}

where q_{xx} are the counts of each genotype in the sample and p is still the frequency of allele "A."

We can plot this along with the binomial curve for comparison with the following Mathematica code.

aa = 4;
at = 10;
tt = 16;
a = 2*aa + at;
t = 2*tt + at;
Plot[{Multinomial[aa, at, tt] (p^2)^aa (2 p (1 - p))^at ((1 - p)^2)^
    tt, Binomial[(a + t), a] p^a (1 - p)^t}, {p, 0, 1} , 
 PlotRange -> Full, PlotLegends -> {"Multinomial", "Binomial"}, 
 AxesLabel -> {"p", "P(x)"}]

The curves are similar and peak in the same place, but overall the multinomial curve has a lower probability than the binomial. This is because there is more information in the multinomial: each dataset for the binomial (allele counts) can correspond to a range of ways of dividing the data into genotype counts. We can always reduce genotypes to allele counts, but we cannot do the reverse. So a particular genotype outcome is less likely (a smaller part of all possible outcomes) than the corresponding allele-count outcome.
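This can be verified numerically (again a Python sketch of the comparison; the function names are mine): at the shared peak, p = 0.3, the multinomial probability of the genotype counts is smaller than the binomial probability of the allele counts.

```python
from math import comb, factorial

def binom(a, n, p):
    # binomial probability of a A alleles out of n
    return comb(n, a) * p**a * (1 - p)**(n - a)

def multinom_hwe(counts, p):
    """Multinomial probability of genotype counts (AA, AT, TT)
    using Hardy-Weinberg genotype frequencies."""
    naa, nat, ntt = counts
    n = naa + nat + ntt
    coef = factorial(n) // (factorial(naa) * factorial(nat) * factorial(ntt))
    return coef * (p**2)**naa * (2*p*(1 - p))**nat * ((1 - p)**2)**ntt

p = 0.3
print(multinom_hwe((4, 10, 16), p))  # genotype data, ~0.018
print(binom(18, 60, p))              # allele data; larger
```
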

Okay, now it gets a little more fun. We have a probability over a dimension of allele frequency of p from zero to one. Let's add another dimension of F but first let's think about what F means. F is the proportion of expected (according to Hardy-Weinberg) heterozygosity that is missing. The heterozygous genotypes don't just disappear. The alleles get reorganized into homozygous genotypes. The frequency of each allele in the heterozygotes is 1/2; so, each homozygous genotype is increased by an equal proportion, which is half of the heterozygotes that are missing. Thus, the frequency of each genotype expected given p and F is

f(AA) = p^2 + F 2 p (1-p)/2 = p^2 + F p (1-p)
f(AT) = 2 p (1-p) - F 2 p (1-p) = 2 p (1-p) (1-F)
f(TT) = (1-p)^2 + F 2 p (1-p)/2 = (1-p)^2 + F p (1-p)
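A small sanity check on these expressions (a Python sketch; the function and variable names are mine): the three genotype frequencies should sum to one for any F, and should reduce to Hardy-Weinberg proportions at F = 0.

```python
def genotype_freqs(p, F):
    """Expected genotype frequencies (AA, AT, TT) given allele
    frequency p and missing-heterozygosity coefficient F."""
    f_aa = p**2 + F * p * (1 - p)
    f_at = 2 * p * (1 - p) * (1 - F)
    f_tt = (1 - p)**2 + F * p * (1 - p)
    return f_aa, f_at, f_tt

freqs = genotype_freqs(0.3, 0.2)
print(freqs, sum(freqs))  # the three frequencies sum to 1
```
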

This can be incorporated into the individual genotype probability components of the multinomial.

P(x|p,F) = \frac{(q_{aa}+q_{at}+q_{tt})!}{q_{aa}! q_{at}! q_{tt}!} \left[p^2+F p (1-p)\right]^{q_{aa}} \left[2 p (1-p) (1-F)\right]^{q_{at}} \left[(1-p)^2+ F p (1-p)\right]^{q_{tt}}

We can simplify the frequency-of-each-genotype terms by symbolizing them with corresponding f_{XX} terms.

P(x|p,F) = \frac{(q_{aa}+q_{at}+q_{tt})!}{q_{aa}! q_{at}! q_{tt}!} f_{AA}^{q_{aa}} f_{AT}^{q_{at}} f_{TT}^{q_{tt}}

This is what the likelihood surface looks like over the range of p from zero to one and F from zero to one.

Here is the Mathematica code used to make the plot.

a = 4;
h = 10;
t = 16;
Plot3D[Multinomial[a, h, t] (p^2 + f p (1 - p))^
   a ((1 - f) 2 p (1 - p))^h ((1 - p)^2 + f p (1 - p))^t, {p, 0, 
  1}, {f, 0, 1}, PlotRange -> Full, AxesLabel -> {"p", "F", ""}, 
 PlotPoints -> 100]

The curve at F=0 corresponds to the multinomial curve above (in the plot comparing it to the binomial). We can also see that allowing positive values of F can result in higher likelihoods of the data. What do we actually expect F to be? The number of heterozygotes is 10 out of a sample of 30 genotypes, so the fraction of heterozygous genotypes is 0.333. The allele frequency is 0.3, so we expect 2 p (1-p) = 2 \times 0.3 \times 0.7 = 0.42 of the sample to be heterozygous. A 0.42 - 0.333 = 0.087 proportion of heterozygotes is missing, which is 0.087 / 0.42 = 0.207 of the expected heterozygosity. So our point estimate of F is 0.207; about 21% of the heterozygosity is missing. This appears to correspond to the top of the likelihood peak. Finally, notice that the peak is narrower in the p dimension than in the F dimension. There is more information in the dataset about the allele frequency than about the genotype frequencies: there is a larger sample of alleles, 60, than of genotypes, 30. So the estimate of p is more refined than the estimate of F.
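The arithmetic of this point estimate can be sketched in Python (variable names are mine):

```python
# genotype counts: AA, AT, TT
counts = {"AA": 4, "AT": 10, "TT": 16}
n = sum(counts.values())

p = (2 * counts["AA"] + counts["AT"]) / (2 * n)  # allele frequency of A
h_obs = counts["AT"] / n                          # observed heterozygosity
h_exp = 2 * p * (1 - p)                           # expected under Hardy-Weinberg

F = (h_exp - h_obs) / h_exp   # proportion of expected heterozygosity missing
print(p, F)                   # p = 0.3, F ~ 0.21
```
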

The range of zero to one makes sense for p. An allele frequency cannot be less than zero or greater than 1. Also, F cannot be more than one (all of the expected heterozygosity is missing), but it can be less than zero (negative F means there is an excess of heterozygosity). It looks like a large volume of the likely values of p and F given the data correspond to a negative F. So, F is not significantly greater than zero (Hardy-Weinberg expectations) for this dataset. Let's extend the plot to encompass a range of negative values for F.

...and something weird happened. There is a second likelihood peak with different values of p and F that actually has an even higher likelihood at its peak, but the parameter values are unreasonable (given what we can see in the dataset about the allele frequency and level of heterozygosity). The minimum possible F is negative one: if the entire sample were made up of heterozygotes, the allele frequency would be one half and the expected proportion of heterozygotes would also be one half, giving F = (0.5 - 1)/0.5 = -1. Extending the plot out to the full range looks even more pathological (I limited the scale and truncated the peak because it was so large).

Long story short, it turns out that this is an artifact... If you closely inspect what is going on by spot-checking a few points in the second peak, you find that the expected frequency of the rarer homozygote is negative, which is impossible, and in the likelihood calculation this is raised to the fourth power, which makes it positive, etc. This is a good example of how unanticipated, and initially cryptic, errors can creep into a computational script. The fix is an if statement that only evaluates the likelihood when the expected homozygote frequencies are positive; otherwise the likelihood is zero, which is true. Here is the updated script and a plot.

a = 4;
h = 10;
t = 16;
Plot3D[Multinomial[a, h, t] If[
   p^2 + f p (1 - p) > 0 && (1 - p)^2 + f p (1 - p) > 0,
   (p^2 + f p (1 - p))^a ((1 - f) 2 p (1 - p))^h
    ((1 - p)^2 + f p (1 - p))^t, 0], {p, 0, 1}, {f, -1, 1},
 PlotRange -> Full, AxesLabel -> {"p", "F", ""}, PlotPoints -> 100]
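The artifact itself is easy to reproduce (a Python sketch; this particular p and F pair is my own illustration): for a low allele frequency and a strongly negative F, the "expected" homozygote frequency goes negative, but an even power turns it into a large positive likelihood contribution unless it is guarded against.

```python
def f_aa(p, F):
    # expected AA frequency given allele frequency p and coefficient F
    return p**2 + F * p * (1 - p)

p, F = 0.2, -0.9
freq = f_aa(p, F)
print(freq)     # negative: impossible as a frequency
print(freq**4)  # but positive once raised to an even power
```
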

Because F and p appear uncorrelated we can feel good about taking a slice of the likelihood surface through the maximum-likelihood value of p as a reasonable approximation of the (contour) likelihood surface of F in order to evaluate the range of possible parameter values.

This is not yet F_{ST}. We have been evaluating the missing heterozygosity of a single population sample. Heterozygosity can be reduced within a population by inbreeding effects where individuals do not mate randomly but tend to mate more often with closer relatives (selfing in plants is an extreme example). What was calculated above is F_{IS}.

Let's say we had another sample from a different population. So now we have our original population one

  • AA 4
  • AT 10
  • TT 16

and population two.

  • AA 10
  • AT 14
  • TT 6

Plotting the likelihood surfaces together.

a1 = 4;
h1 = 10;
t1 = 16;
a2 = 10;
h2 = 14;
t2 = 6;
Plot3D[{If[p^2 + f p (1 - p) > 0 && (1 - p)^2 + f p (1 - p) > 0, 
   Multinomial[a1, h1, t1] (p^2 + f p (1 - p))^
     a1 ((1 - f) 2 p (1 - p))^h1 ((1 - p)^2 + f p (1 - p))^t1, 0], 
  If[p^2 + f p (1 - p) > 0 && (1 - p)^2 + f p (1 - p) > 0, 
   Multinomial[a2, h2, t2] (p^2 + f p (1 - p))^
     a2 ((1 - f) 2 p (1 - p))^h2 ((1 - p)^2 + f p (1 - p))^t2, 0], 
  0.0005}, {p, 0, 1}, {f, -1, 1}, PlotRange -> Full, 
 AxesLabel -> {"p", "F", ""}, PlotPoints -> 100]

The second sample has a higher allele frequency and a value of F_{IS} closer to zero (however, the two ranges of likely F values overlap and might not be considered significantly different). The second allele frequency is p_2 = (2 \times 10+14)/60 = 0.567. With the two allele frequencies we can now calculate a point estimate of F_{ST} between the population samples. The expected heterozygosity in sample 2 is H_{S,2} = 2 \times 0.567 \times (1-0.567) = 0.491. We already showed above that p_1 = 0.3 and H_{S,1} = 0.42. The average allele frequency over both populations is \bar{p} = 0.434 and the average subpopulation expected heterozygosity is \bar{H}_S = 0.456. In a combined total sample, if there were no population structure, the expected heterozygosity would be H_T = 2 \bar{p} (1-\bar{p}) = 0.491. Therefore,

F_{ST} = \frac{H_T - \bar{H_S}}{H_T} = \frac{0.491-0.456}{0.491} = 0.0713

or about 7% of the expected heterozygosity is missing because of differences in allele frequencies between the populations. Why does this work? Any time there is a difference in allele frequency, the average subpopulation expected heterozygosity will be lower than the combined expected heterozygosity. Why? Because heterozygosity is a concave function of allele frequency; the average of the function at two points is always lower than the function evaluated at the average of those points.

F_{ST} is the gap between H_T and H_S, which is 7% of H_T.
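Putting the F_{ST} arithmetic into code (a Python sketch; the names are mine, and exact fractions are used for the allele frequencies rather than the rounded intermediate values above):

```python
def het(p):
    # expected heterozygosity at allele frequency p
    return 2 * p * (1 - p)

p1, p2 = 9/30, 17/30          # allele frequencies in the two samples
h_s = (het(p1) + het(p2)) / 2  # average subpopulation heterozygosity
p_bar = (p1 + p2) / 2
h_t = het(p_bar)               # heterozygosity if the total sample were panmictic

fst = (h_t - h_s) / h_t
print(fst)  # ~0.07, close to the 0.0713 computed above with rounded values
```
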

Before bringing this back together I need to introduce one more F-statistic. This one is F_{IT}. It is the total missing heterozygosity due to both population structure and inbreeding effects. If we combine the two samples we get

  • AA 14
  • AT 24
  • TT 22

The allele frequency is p_T = 0.433 (the same as the average of the two samples) and the fraction of heterozygotes in the combined sample is 24/60 = 0.4. We expect a heterozygosity of 2 p_T (1-p_T) = 0.491. So, F_{IT} = (0.491-0.4)/0.491 = 0.185.
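The same check for F_{IT} on the combined sample (a Python sketch, my own variable names):

```python
# combined genotype counts: AA, AT, TT
aa, at, tt = 14, 24, 22
n = aa + at + tt

p_t = (2 * aa + at) / (2 * n)  # combined allele frequency
h_obs = at / n                  # observed heterozygosity
h_exp = 2 * p_t * (1 - p_t)     # expected heterozygosity

fit = (h_exp - h_obs) / h_exp
print(p_t, fit)  # ~0.433 and ~0.186
```
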

F_{ST} between populations is related to F_{IT} and F_{IS}. If we know the latter two we can estimate the former.

F_{ST} = \frac{F_{IS}-F_{IT}}{F_{IS}-1}

Using an F_{IT}=0.185 and let's say an F_{IS}=0.125 we get

F_{ST} = \frac{0.125-0.185}{0.125-1} = \frac{-0.06}{-0.875} = 0.0686.
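The relation can be checked directly (a Python one-liner; the function name is mine):

```python
def fst_from_f(fis, fit):
    # F_ST from F_IS and F_IT via (1 - F_IT) = (1 - F_IS)(1 - F_ST)
    return (fis - fit) / (fis - 1)

print(round(fst_from_f(0.125, 0.185), 4))  # 0.0686
```
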

We can estimate F_{IT} and F_{IS} using the multinomial, which means we can also estimate F_{ST}. This is starting to get more complicated, with lots of parameters to estimate from the data: two allele frequencies and two F statistics (if we assume a single F_{IS} over both populations, which seems biologically reasonable), or four parameters in total. A good way to explore a higher dimensional parameter space is Markov chain Monte Carlo (MCMC) with the Metropolis-Hastings algorithm; I'll save the details of that method for a later post. For now let's write a "hill climbing" Monte Carlo script (Monte Carlo refers to guessing parameter values, and hill climbing means that it moves to values that produce higher probabilities of the data). Here it is in R.

# data: genotype counts of AA, AT, and TT
pop1 <- c(4, 10, 16)
pop2 <- c(10, 14, 6)
popt <- pop1 + pop2

# initialize variables
p1best <- 0.5
p2best <- 0.5
fisbest <- 0
fitbest <- 0
probbest <- 0

# loop through some steps
for (i in 1:100000) {

  tune <- 1/i # narrows in as the run goes
  # pick new parameter values and check to make
  # sure they are not out of bounds
  p1 <- p1best + runif(1, min = -tune, max = tune)
  if (p1 < 0) {p1 <- 0} else if (p1 > 1) {p1 <- 1}
  p2 <- p2best + runif(1, min = -tune, max = tune)
  if (p2 < 0) {p2 <- 0} else if (p2 > 1) {p2 <- 1}
  fis <- fisbest + runif(1, min = -tune, max = tune)
  if (fis < -1) {fis <- -1} else if (fis > 1) {fis <- 1}
  fit <- fitbest + runif(1, min = -tune, max = tune)
  if (fit < -1) {fit <- -1} else if (fit > 1) {fit <- 1}
  pt <- (p1 + p2)/2

  # genotype probabilities given p and F
  gp1 <- c(p1^2 + fis*p1*(1-p1), 2*p1*(1-p1)*(1-fis), (1-p1)^2 + fis*p1*(1-p1))
  gp2 <- c(p2^2 + fis*p2*(1-p2), 2*p2*(1-p2)*(1-fis), (1-p2)^2 + fis*p2*(1-p2))
  gpt <- c(pt^2 + fit*pt*(1-pt), 2*pt*(1-pt)*(1-fit), (1-pt)^2 + fit*pt*(1-pt))

  # sample probabilities; zero if any expected
  # genotype frequency is negative
  prob1 <- if (all(gp1 >= 0)) dmultinom(pop1, , gp1) else 0
  prob2 <- if (all(gp2 >= 0)) dmultinom(pop2, , gp2) else 0
  probt <- if (all(gpt >= 0)) dmultinom(popt, , gpt) else 0

  # combined probability of all three samples
  prod <- prob1 * prob2 * probt

  # keep the best parameter values found so far
  if (prod > probbest) {
    probbest <- prod
    p1best <- p1
    p2best <- p2
    fisbest <- fis
    fitbest <- fit
  }
}

# calculate fst from the best fis and fit
fst <- (fisbest - fitbest)/(fisbest - 1)

# report results
cat(p1best, p2best, fisbest, fitbest, fst, "\n")
Running this for 100,000 steps I got estimates of p_1 = 0.299, p_2 = 0.567, F_{IS} = 0.125, F_{IT} = 0.185, and F_{ST} = 0.0687; not bad.

Okay, that's going to have to be it for this post. This could be extended to cases where the sample sizes are unequal, multiple loci (F is shared across loci but p is not), more than two alleles, more than two populations and hierarchical population groupings, and an MCMC script to estimate the probability distribution of F_{ST}, but I'll have to save that for later.

Mosquito Busters!

This weekend I teamed up with Matt Medeiros to collect mosquitoes. We were looking for adult Culex and Aedes. Culex larvae are easy to find in large numbers but the adults are difficult to locate.  We did find a lot of Aedes albopictus and brought them back to the lab for dissection and DNA isolation.


Matt had a backpack vacuum to catch the adults. In the image above we are looking for mosquitoes resting on the walls in a tunnel system under Honolulu (the image is fuzzy because of the low light levels and being hand held). It is impossible to resist Ghostbusters coming to mind.

In other, not unrelated, news we seem to have been too heavy-handed with the tetracycline to clear Wolbachia from the mosquitoes. Most of the mosquitoes in the tetracycline cage died (presumably from the toxic effects of the antibiotic) and we did not recover any eggs. We are trying again with a shorter-term dose.

Wolbachia PCR and Sequencing

We need a source of Wolbachia for injection into the mosquitoes. Using primers from Simões et al. (2011), Áki and Maria were able to get these PCR products.


The gel is shown sideways with shorter fragments to the left. Our white[1] stock of Drosophila melanogaster is positive for Wolbachia infection. The Oregon-R stock appears to be uninfected. The Culex mosquitoes collected locally here on Oʻahu also appear to be infected and give a double band. We submitted them for sequencing and the Culex sequence came back very messy---consistent with possible superinfection by multiple strains. However, the Drosophila w[1] sequence was clear enough to get a base-pair sequence.


Here is the full sequence we recovered.


This appears to be a Group A Wolbachia and is consistent with wMel.

Simões, P. M., Mialdea, G., Reiss, D., Sagot, M. F., & Charlat, S. (2011). Wolbachia detection: an assessment of standard PCR protocols. Molecular Ecology Resources, 11(3), 567–572.

SEM sponge imaging

Michael Wallstrom and Áki Láruson made some scanning electron microscope images of a new species of marine sponge we are working on (it is associated with invasive algal mats here in Hawai'i). I can't resist sharing a few of them here but I am saving the best for the publication we are working on.


Above is the surface of the sponge. If you look closely you can see the tiny ostia pores in the surface.


A closeup on the ostia, one is in cross section to the interior of the sponge.


Above, you can see two types of cells inside the sponge. The choanocytes use flagella (the threads) to move water through the sponge and filter food particles out of the seawater; amoebocytes crawl around and transport nutrients to other cells (among other functions).


In the above image, at the highest magnification for these images, you can see bacteria that are living in the sponge. The spiral objects are spirochaetes; some of these cause diseases in humans like Lyme disease, syphilis, relapsing fever, and leptospirosis.

New fly head phenotype


An undergrad, Angelina Holcomb, is working on a telomere-aging genetics project in the lab. While setting up crosses she noticed an odd eye phenotype. The two flies at the top of the image have smaller eyes and disrupted head development, along with some other more subtle phenotypes, compared to their sibling controls at the bottom of the image.

EM Algorithm Haikus

In a graduate "Ecology & Evolution" class I am co-teaching this semester, I asked the students, on a lark, to write a haiku on the Expectation-Maximization (EM) algorithm.

I introduced it in class to talk about finding maximum-likelihood answers to some complex problems using a simple approach. Earlier I gave them some homework problems to solve by writing EM algorithms in R. We had a midterm exam coming up, and there is not enough time in class for the students to program a new algorithm in R for an exam question. Instead, I asked them to write a haiku about it to see what they would write and whether it gave me any insight into their understanding of the method. I liked the results so much that I am posting them here.

Data and a guess
Update guess with each cycle
Find the true value.

Hey, start with a guess
then plug it in again and again
and again and again, yay!

Just a few short steps
Almost silly how simple
We find the answer

Find Expectation:
Calculate the Maximum.

Looping, elegant
Throw your guess into the loop
And see what comes out

We must use a guess
As the change becomes smaller
We approach the truth