Jukes-Cantor Probabilities

In an earlier post I went through a derivation of the Jukes Cantor model (based on a Poisson distribution of mutation events) and got the following type of relationship.

d = \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right),

where d is the fraction of divergent sites between two DNA sequences and t is the time the sequences diverged from a common ancestor in units of g \mu where g is the number of generations and \mu is the per generation per nucleotide mutation rate.

This can be rearranged to

t = -\frac{3}{8} \log_e \left( 1 - \frac{4}{3} d \right)

to convert an observed difference between two sequences into a corrected estimate of the divergence between two species.

However, this is only a point estimate and cannot handle cases where the fraction of sites that differ happen to be above 75% (this is possible by chance but becomes a log of a negative number and undefined).

We can rewrite this in a different way, as a binomial probability, and evaluate the probability of a data set over a range of divergence and place intervals around our point estimate.

As above, the probability that two nucleotides are different is

\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right).

Assuming each nucleotide is independent in a comparison of two sequences we can multiply by the number of sites that are different, D.

P(D|t) = \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D

On the other hand two nucleotides can be in the same state if no mutations occurred, which is a probability of e^{- \frac{8}{3}t} (this comes from the Poisson distribution the model is based on, look back at the earlier post if this isn't clear) or mutations could have occurred along the lineages but they happened to end up as the same nucleotide (an "invisible" mutation). This second outcome has a probability of

\frac{1}{4}\left(1 - e^{- \frac{8}{3}t}\right).

Combined the probability that we see S sites that are the same in two sequences descended from a common ancestor is

P(S|t) = \left[e^{- \frac{8}{3}t}+\frac{1}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S.

Since there are only two possibilities, the sites are the same or different, and the probability of all possible outcomes must be 100% we can rewrite this in a slightly simpler way.

P(S|t) = \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S.

Then multiply everything together to get the probability of the data including sites that differ and sites that are the same.

P(data|D,S) \propto \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S

There is also a binomial coefficient that should be multiplied to get the exact probability (because there are different ways to get the same data: the first and second site can be different, or the first and third, or the second and fifth, etc.). However, for a given data set this is a constant and we're going to drop it for the moment.

Say we align two sequences of ten base pairs from the same gene in two different species.

grasshopper: AGCTACAACT
cricket:     AGATACGACT

The sequence differs at two sites. We could rewrite the comparison as a 1 if the base is the same and a 0 if they are different.

DNA: 1101110111

Making a plot of the two parts of the equation we can see that the probability that two sites are the same goes down as the time of divergence increases and the probability that they are different increases.

Importantly, there is a middle ground that can best accommodate both signals from the data, the differences and the similarities, near but lower than the crossover point. Combining these together we get the curve of probability of all the data with a peak just above 0.1.

This can be integrated to find upper and/or lower confidence intervals (which is messy and uses the hypergeometric function); however, it is clear from looking at the curve that a wide range of values are hard to rule out with a data set this small.

If the data set were larger we can narrow down the confidence interval.

So, how does this compare to the point estimate we get from the traditional Jukes-Cantor distance? A difference of 0.2 (two sites out of 10) can be plugged in for d,

t = -\frac{3}{8} \log_e \left( 1 - \frac{4}{3} d \right),

and gives t \approx 0.1163.

When we take the derivative of

P(data|D,S) \propto \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S

set it equal to zero and solve for t we get

t = \frac{3}{8} \log_e\left(- \frac{3(D+S)}{D-3S} \right) .*

Plugging in D=2 and S=8 gives t \approx 0.1163, exactly the same as with the other method. The original Jukes-Cantor method is a maximum likelihood point estimate (the parameter value of the model that maximizes the likelihood of the data). You can also see it is equivalent in other ways. The D-3S in the denominator makes the log evaluate a negative number if the fraction of sites that vary are greater than 3/4 of the total, which is undefined. (Thinking about this intuitively it is most consistent with positive infinity; however, I suspect the curve gets infinitely flat and almost any large number is essentially just as likely.) However, we can still work with the probabilities which allows us to rule out small values of t as in the example below where the sites that differ are 80% of the total.


* This is only true for real solutions. The more general solution is

t = \frac{3}{8} \left(\log_e\left(- \frac{3(D+S)}{D-3S} \right) + 2 i \pi n \right),

where n is an element of the integers. This is fun because it combines a base e log with i and \pi, but this isn't really relevant to the discussion (time of divergence doesn't rotate into a second dimension).

Toxorhynchites

We found a total of five Toxorhynchites larvae in a pool containing Culex larvae near Manoa stream (Jan 18, 2018, 21.299882, -157.813210). I haven't seen this genus before; they are possibly T. splendens. These are unusual mosquitoes in that they do not blood feed. The larvae eat aquatic invertebrates including other mosquito larvae and several species were introduced into Hawaiʻi as an attempt at biocontrol of mosquitoes. They are also very large mosquitoes and the genus includes the largest mosquito species.  Thanks to Matt Medeiros and Jessica Mavica for helping me ID them.

Transcriptional effects of a positive feedback circuit in Drosophila melanogaster

A new paper is out at BMC Genomics; it was published a few days before the new year.

Transcriptional effects of a positive feedback circuit in Drosophila melanogaster

This is work that was started many years ago. The tTAV gene expression system was developed into a genetic pest management tool by various groups (e.g., Horn C, Wimmer EA. A transgene-based, embryo-specific lethality system for insect pest management. Nat Biotechnol. 2002;21:64–70; Alphey, Luke. "Expression system for insect pest control." U.S. Patent 9,121,036, issued September 1, 2015). A positive feedback form of the tTAV system was being marketed by Oxitec as "RIDL."  In this system a transgene promotes its own expression by binding to DNA. If left unchecked this causes a runaway feedback loop and is toxic to the organism. However, the protein produced also binds to tetracycline and if tetracycline is added to the insect's diet then the lethal effect is masked (tetracycline bound protein does not bind to DNA and drive gene expression).

In a nutshell, in applications large amounts of insects can be reared with tetracycline. Then released into the wild to mate with wild individuals. The toxic effect is dominant and kills off the offspring of the wild individuals that inherit it (who presumably do not have tetracycline in their diet). If enough individuals can be released relative to the wild population this can cause a reduction in wild population numbers over the following generations. This is similar to classical SIT (Sterile Insect Technique) for population suppression.

We were curious about testing this system in various ways. In the first exploration we added tetracycline to food for fruit flies to see if there was an effect. Tetracycline can disrupt normal mitochondrial function and kill naturally occurring bacteria including endosymbionts like Wolbachia. This was the basis of a student's research project: Müller, H., Krefft, M., Reeves, G., & Reed, F. A. 2010. Trans-generational influence of tetracycline on Drosophila melanogaster (Bachelor Thesis, Fachhochschule Bingen). He found a reduction in egg laying and a delay in development in response to tetracycline. Interestingly there was a transgenerational effect in both the paternal and maternal sides (maternal could be due to mitochondrial or Wolbachia effects, but these do not explain the paternal effect).

Next we engineered the tTAV system in Drosophila melanogaster and performed an EMS (ethyl methanesulfonate) mutation screen to see if any mutations could result in tolerance of the feedback loop and survival off of dietary tetracycline. This was the basis of a grant application: Deutshe Forschungsgemeinschaft (DFG , German National Research Foundation). Die Entstehung von Resistenzen gegen genetisch induzierte Sterilität bei Insekten. (The evolution of resistance to genetically induced sterility in insects.) 2010 RE-3062/2-1. Long story short we generated ~400,000 embryos from flies exposed to EMS and recovered five individuals that had the tTAV insert but could survive without tetracycline.  All five were male. We tried mating them to females to recover offspring---and all five appeared to be sterile which essentially ended that experiment.

In this paper we wanted to determine the genome-wide transcriptional response to the runaway tTAV effect positive feedback loop.

The tTAV insert works as expected in D. melanogaster. The image above shows the curve of lethality as the concentration of two antibiotics, doxycycline and tertracycline, declines. At the upper end of the curve approximately half of the offspring of a heterozygous parent carry the genetic modification, as expected. As the dietary concentration diminishes the curve approaches zero---none of the offspring that carry the tTAV insert survive.

The cause of this lethality is unknown. One can imagine that a positive feedback loop of gene expression and protein production may over-use transcriptional and translational resources and/or overload protein degradation pathways, or disrupt off target gene expression. These hypotheses are not mutually exclusive. Looking at the shift in expression of other genes in the genome in response to tTAV runaway expression could be informative and give us clues as to what process results in lethality.

While we did see thousands of genes across the Drosophila genome that significantly changed expression, there was no clear pattern that emerged. The tTAV system was inserted at different points in the genome and these different lines gave different groups of genes with altered expression. A relative handful of 31 genes with differences in expression were shared across all lines. However, 27 of these 31 genes shifted in expression in different directions, some with increased expression and some with decreased expression across the lines.

This left just four genes with consistent changes in expression when the tTAV system was activated: crok (CG17218), Cyp6a17 (CG10241), olf186-F (CG11430), and Pex23 (CG32226). There is no clear relationship between these genes nor any obvious hypothesis that presents itself in terms of understanding a response to shifts in tTAV activity.

We backup up and looked at gene ontology enrichment. Were there categories of genes involved in certain processes or pathways that tended to be represented in the shift of gene expressions associated with tTAV---again no clear pattern presented itself.

Were there regional changes in expression, perhaps related to different insert sites or nuclear architecture interactions among chromosomes? No, we looked at groups of genes in sliding windows along the genome and there was no clear pattern of clustering by region.

Keep in mind however, there were widespread and significant changes in gene expression. It was not that there was the lack of an effect; just that there is not a clear pattern to the effect.

Gene expression levels between pairwise comparisons of the different insert lines and the control all showed statistically significant correlations. However, the direction of the correlation was not consistent. (This is shown in the figure above.)

I was quite surprised by this result---that the insertion site of a highly expressed transgene has such a profoundly idiosyncratic effect on genome-wide transcription patterns. This was a bit frustrating to be honest but it does suggest some cautions in these kinds of studies and next steps to explore. First of all a a caution, if we had only studied the transcriptional response to a single insert we would have found a significant response, with many genes affected, and may have been misled by the resulting pattern (if for example a number of these genes were in a certain pathway), overgeneralizing the particular signal seen. In a next step, it would be interesting to see if this widespread idiosyncratic effect extends to other genetic modifications or is specific to the tTAV system. For example, what is the genome-wide translational response, if any, to actin5c driven expression of EGFP? Importantly, does any effect seen vary widely by insertion site?

Gene Drive Questions

I received an inquiry about gene drive technology by email and thought that it might be useful to post part of it along with my reply here.

Here is the question:

"Since a few months, gene drive is a very prominent topic. But what I am missing a recent review about the state of the art in building gene drive constructs and testing them. I would like to ask you if you could provide some advice about literature, which would cover an overview about this theme.  At the level where I try to follow the discussions, I have the feeling that things are randomly mixed (like e.g. gene drive = synthetic biology). It would be really great to have an overview about what is possible by now, what has been done, which systems work and which do not work. And perhaps where is gene drive applied."

And my response:

Here are some general reviews:

 
This is rapidly developing.  A lot of the recent focus is on CRISPR-Cas9 based gene drive and there is some confusion in the media that this is the only form of gene drive.  I've tried to categorize the different forms in Appendix A of this article:

https://arxiv.org/abs/1706.01710

One major problem with CRISPR-Cas9 based gene drive is the rapid accumulation of mutation and naturally occurring resistance in the wild. See:

http://www.genetics.org/content/205/3/1037.full
http://www.genetics.org/content/205/2/827
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006796
https://www.biorxiv.org/content/biorxiv/early/2017/06/14/150276.full.pdf

Work is also progressing on underdominance-type approaches that may get around some of the problems with the CRISPR system.

The Genetic Engineering and Society Center at NCSU has a lot of resources and knowledgeable people in this area.

https://research.ncsu.edu/ges/

Here is a series of recent publications that were recently posted---I haven't had a chance to go through all of them yet.

https://research.ncsu.edu/ges/publications/faculty-publications/?table_filter=JRISpecialIssue

There are also some relevant publications that will come out over the next few months. I can update you as these are published and become publicly available.

To my knowledge there is no direct application as of this moment, but this is likely only a matter of time.

The US military is also very interested in gene drive issues. DARPA is funding some of the research on gene drive and IARPA is working on ways to detect if genetic modifications have occurred.

https://www.darpa.mil/program/safe-genes
https://www.iarpa.gov/index.php/working-with-iarpa/requests-for-information/detection-of-genome-editing

Let me know if I can help with additional information or if you want more detail about something.

Virus imaging with negative staining

First of all there has been a long lag in posting any updates here about the lab.  There are many things going on that are a higher priority at the moment; however, I also want to try to keep a certain tempo of posts and updates going here if possible.  So, I set aside a few minutes today to make a post.

A couple of undergraduates in the lab have an interest in viruses, specifically bacteriophages. They have worked out methods to isolate and propagate the phages for various experiments. However, we wanted to get a look at the actual viral particle instead of inferring it indirectly from plaques on a plate of bacteria. I talked to Tina Carvalho at our Biological Electron Microscope Facility and she was enthusiastic about imaging the viruses.

Negative staining is a relatively quick and easy method to get viral mugshots. The virus particles are put onto a thin membrane held in a metal grid and electrons are beamed through both the virus and the grid to get an image. The problem is that this doesn't really work that well.  The virus is very transparent to electrons and they will pass right through almost like it wasn't there, i.e., the image contrast defining the virus will be low and look faint.

A stronger image can be made by adding a thin fluid of uranium salt (Uranyl acetate (UO2(CH3COO)2·2H2O)). Uranium is very dense and good at blocking electrons. If all goes well (and Tina pointed out that there can be quite a bit of luck involved with the color of you socks that day possibly playing a role) the fluid will dry a bit and form droplets around the virus particles by adhesion. The electrons pass through the virus, which is making a path of sorts, through the uranium drop, so the outside shell of the viral surface can be better resolved. Here is a figure to try to illustrate.

First we looked at a Vibriophage that Stacy Paulino isolated from seawater from Maunalua Bay. This phage infects a coral pathogen, Vibrio coralliilyticus, and Stacy imaged it below.

The next image is an unknown coliphage (infects Escherichia coli).

Notice the scale in the image indicating 20 nanometers; that is the size of a ribosome. Double stranded DNA is about 2 nm wide of 1/10 of the scale bar. The graininess of the tube is from individual tail tube proteins, single polypeptide molecules, which are typically assembled from ~250 individual peptides.

Maya Shaulsky (in a collaboration with Bob Thomson) is working with T7 (below) and made these coliphage images. Originally we thought we were working with T7 coliphage but we kept getting frustrating results that didn't make sense. When we actually saw the virus (the image above) it confirmed that we were not working with T7 so we ordered actual T7 from a different stock center.

And here is an E. coli bacterium with T7 particles replicating inside the cell!

If you are working with something small I encourage you not to be intimidated with the idea of electron microscopy imaging. It is really something to actually see what you are working with. The protocol to prepare the specimens is not that long or difficult (some people may disagree with this, perhaps we have been lucky) and it is not really that expensive to pay for the training and machine time. It is worth trying at least once.

An alternative GAL4-based genetic sterile insect technique?

Here is a thought that came to me on a walk. Perhaps this could also be applied to mosquitoes to suppress wild populations.

GAL4, UAS, and GAL80 are genetic tools from yeast that are commonly used in Drosophila.

GAL4 is a protein that drives expression of genes with UAS (Upstream Activation Sequence) as a gene expression enhancer. GAL80 inhibits GAL4.

So, place the gene encoding the GAL4 protein under UAS expression. This could create a runaway positive feedback loop (GAL4 drives its own expression of more GAL4) and result in lethality.  However, set this up in Drosophila that also express GAL80 to inhibit the GAL4 from binding to UAS.

If GAL80 were expressed from a Y-chromosome only male offspring would survive to mate with the remaining females. Female-specific lethality is a powerful way to suppress a population.

This could be set up in a stable lab Drosophila stock using compound X chromosomes for X^X Y females and regular X Y males. All the Drosophila stocks, sequences, and insertion sites exist to quickly try this out.

http://flystocks.bio.indiana.edu/Browse/aberration/compound_x_overview.htm

http://flystocks.bio.indiana.edu/Browse/chr_y/special_y.php

https://www.addgene.org/vector-database/5556/

https://www.thebestgene.com/

Sea urchin mitochondrial genome evolution

A lab publication is out (link)!

Áki J. Láruson generated the mitochondrial genome of Tripneustes gratilla and compared it to all other existing sea urchin genomes. He found that the Toxopneustidae are a sister family to the Strongylocentrotidae splitting from each other around a time of major climate change (link) and that patterns of DNA substitutions are generally consistent with neutrality for the sea urchin mitochodria. Stay tuned for more Tripneustes gratilla genomics publications!

NIH capping fund distributions to address problems in science in the US

In an article recently posted at NIH a plan to limit individual grant support has been announced.

Implementing Limits on Grant Support to Strengthen the Biomedical Research Workforce

NIH is facing the fact that (1) there is not enough grant funding to go around, (2) the funding is being concentrated into fewer labs (3) at fewer institutions, and as a consequence (4) science is loosing young researchers because the field appears to have no future for people starting out. Without a next generation, science in the US is approach a crash in its workforce. In response NIH plans to spread out research funding among more researchers by placing a cap on individual funds received.

"And we will monitor, on a trans-agency basis, investigators’ Grant Support Index, with the idea that over time and in close consultation with the extramural research community, we will phase in a resetting of expectation for total support provided to any one investigator. We plan to implement a Grant Support Index cap of 21 points, essentially the equivalent of 3 single-PI R01 grants. Over the next few weeks to months, we will meet with NIH Advisory Councils and other stakeholder groups to explore how best to phase in and implement this cap – so that formal assessment of grant support can be used to best inform, on a trans-NIH basis, our funding decisions." link

Why stop at three? I would be overjoyed with just one or two R01 grants to allow me to do my job.

What is science?

We can read the steps of the scientific method and know about conventions such as the peer review process but at its core what is science? I have been thinking about this question more and more recently. In trying to pin it down the definition seems to diffuse into a overly abstract statement that resembles something I dislike as a shortcut to thinking and refer to as "bumper sticker philosophy." It seems odd to phrase it this way but science is, in a way, about what works and what doesn't work, and understanding the reasons why what works and what doesn't work in a very rigorous way. Science, ideally, is a careful balance of keeping an open mind, embracing curiosity, taking details seriously, and reasoned agreement and disagreement with others. Then there is the real world human aspect of how science is or is not practiced. Statements such as "believing" in evolution or "believing" in climate change do not sit well in a scientific perspective. In current American English this implies faith in something... However, pointing this out can be misinterpreted and hijacked by people with non-scientific motivations.

At the risk of sounding like a kōan, I thought two famous historical case examples, one from physics and one from biology, that have interested me for different reasons, might provide more of an illustration of what science is then trying to nail down a precise verbal/text definition. However, this comes at the risk of people not being patient enough to read them. Perhaps therein lies part of the modern problem. If it doesn't fit into a snappy short phrase it never reaches most people. I've linked PDFs of the two examples below.