# GWAS of blue mussel stress response

A new collaborative publication is out in print: http://dx.doi.org/10.1111/jeb.13224

## Abstract

A key component to understanding the evolutionary response to a changing climate is linking underlying genetic variation to phenotypic variation in stress response. Here, we use a genome-wide association approach (GWAS) to understand the genetic architecture of calcification rates under simulated climate stress. We take advantage of the genomic gradient across the blue mussel hybrid zone (Mytilus edulis and Mytilus trossulus) in the Gulf of Maine (GOM) to link genetic variation with variance in calcification rates in response to simulated climate change. Falling calcium carbonate saturation states are predicted to negatively impact many marine organisms that build calcium carbonate shells – like blue mussels. We sampled wild mussels and measured net calcification phenotypes after exposing mussels to a ‘climate change’ common garden, where we raised temperature by 3°C, decreased pH by 0.2 units and limited food supply by filtering out planktonic particles >5 μm, compared to ambient GOM conditions in the summer. This climate change exposure greatly increased phenotypic variation in net calcification rates compared to ambient conditions. We then used regression models to link the phenotypic variation with over 170 000 single nucleotide polymorphism loci (SNPs) generated by genotype by sequencing to identify genomic locations associated with calcification phenotype, and estimate heritability and architecture of the trait. We identified at least one of potentially 2–10 genomic regions responsible for 30% of the phenotypic variation in calcification rates that are potential targets of natural selection by climate change. Our simulations suggest a power of 13.7% with our study's average effective sample size of 118 individuals and rare alleles, but a power of >90% when effective sample size is 900.

My primary role was in running simulations and the power analysis of the genotype-phenotype associations. A few SNPs were found that were in significant association with calcification rates under stressful conditions. However, the power to detect these loci was low, less than 15%, which on the surface suggests that there are several more loci in the genome that also have an effect.

The discovered SNPs were also at very low frequency, which suggests a slight ascertainment bias in their discovery (they happened to be included in the sample by chance, whereas similar rare SNPs that were not sampled remain undiscovered). So, a way to correct for ascertainment bias was explored and incorporated into the power analysis.

Suppose the number of rare alleles included in a sample has a Poisson distribution with a mean of 1.13.

There is a significant part of the distribution at a sample of zero (in blue). We can only estimate allele frequency from the sampled alleles (in red). This shifts the weight of the distribution upward. In this example the sampled mean is 1.67, an increase of about 48%.

We can work backwards to the underlying Poisson mean from the observed counts by keeping track of the difference between the full Poisson distribution and the part above zero.

The distribution mean, $\lambda_a$, with ascertainment bias is related to the underlying mean, $\lambda$, by

$\lambda_a = \frac{\sum^\infty_{i=1}i\frac{\lambda^ie^{-\lambda}}{i!}}{\sum^\infty_{j=1}\frac{\lambda^je^{-\lambda}}{j!}}$

This is equal to

$\lambda_a =\lambda\left(1+\frac{1}{e^\lambda-1}\right)$

This is correct but it is in the wrong direction. We need to solve for $\lambda$. To do this we have to use the Lambert-W function, $W\left(x\right)$.

$\lambda = \lambda_a + W\left(-e^{-\lambda_a}\lambda_a\right)$

Honestly, this is not that useful because $W$ cannot be expressed in terms of elementary functions, but you can plug in values from a truncated Poisson distribution, numerically evaluate $\lambda$, and see that it works.
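As a rough check of that claim, here is a minimal sketch in base R. The closed form above follows because the numerator of the ratio is just the full Poisson mean, $\lambda$ (the $i=0$ term contributes nothing), and the denominator is $1-e^{-\lambda}$. The function names `truncated_mean` and `underlying_mean` are made up for this illustration, and `uniroot` is used as a numerical stand-in for the Lambert-W function, which base R does not provide.

```r
# Mean of a zero-truncated Poisson, given the underlying mean lambda
# (expm1 keeps 1 - exp(-lambda) accurate for small lambda)
truncated_mean <- function(lambda) lambda / -expm1(-lambda)

# Hypothetical helper: invert the relationship by root finding instead of
# evaluating Lambert-W. Assumes lambda_a > 1, since a truncated mean is
# never below 1.
underlying_mean <- function(lambda_a) {
  uniroot(function(l) truncated_mean(l) - lambda_a,
          lower = 1e-8, upper = lambda_a, tol = 1e-12)$root
}

lambda_a <- truncated_mean(1.13)
lambda_back <- underlying_mean(lambda_a)
print(lambda_a)     # observed (ascertained) mean, about 1.67
print(lambda_back)  # recovers a value close to 1.13
```

Root finding is used here only because it needs no extra packages; a package that implements $W$ could be substituted to evaluate the formula directly.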

After correcting for ascertainment bias power can be evaluated with repeated simulations of multinomial draws for a given sample size and effect size.

In a 2x2 table of cases and controls against two different genotypes

| genotype | AA | aa |
|---|---|---|
| case | a | b |
| control | c | d |

The effect size can be quantified by

$\phi = \frac{a d - b c}{\sqrt{(a+c)(b+d)(a+b)(c+d)}}$

Dividing the cases and controls evenly, setting the corrected frequency of aa to 1.13 out of 100 (0.0113), and setting the effect size to 10% ($\phi = 0.1$) gives the following cell probabilities.

| genotype | AA | aa | total |
|---|---|---|---|
| case | 0.4996 | 0.000365 | 0.5 |
| control | 0.4891 | 0.0109 | 0.5 |
| total | 0.9887 | 0.0113 | 1 |
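Here is a minimal sketch of how cell probabilities like these can be reconstructed, assuming fixed margins (half cases, an aa frequency of 0.0113) and a target $\phi$ of 0.1. With the margins fixed, $ad - bc$ reduces to (case fraction × aa frequency) − $b$, so $b$ can be solved for directly. The function names `cell_probabilities` and `phi_coefficient` are hypothetical and only for illustration.

```r
# Build 2x2 cell probabilities (rows: case, control; columns: AA, aa)
# from the aa frequency, the target effect size phi, and the case fraction.
cell_probabilities <- function(freq_aa, phi, frac_case = 0.5) {
  # With fixed margins, a*d - b*c = frac_case * freq_aa - b, so solve for b
  scale <- sqrt(frac_case * (1 - frac_case) * freq_aa * (1 - freq_aa))
  case_aa <- frac_case * freq_aa - phi * scale
  case_AA <- frac_case - case_aa
  control_aa <- freq_aa - case_aa
  control_AA <- (1 - frac_case) - control_aa
  # matrix() fills column by column: AA column first, then aa column
  matrix(c(case_AA, control_AA, case_aa, control_aa), nrow = 2,
         dimnames = list(c("case", "control"), c("AA", "aa")))
}

# Effect size of a 2x2 table, following the formula for phi above
phi_coefficient <- function(m) {
  (m[1, 1] * m[2, 2] - m[1, 2] * m[2, 1]) /
    sqrt(prod(rowSums(m)) * prod(colSums(m)))
}

probs <- cell_probabilities(freq_aa = 0.0113, phi = 0.10)
print(probs)
print(phi_coefficient(probs))
```

Read column by column, the matrix is in the order case AA, control AA, case aa, control aa.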

These are used as multinomial probabilities and tested with a chi-square with a 1% cutoff in the following R script.

```r
# cell probabilities in column order: case AA, control AA, case aa, control aa
cell_probs <- c(0.499635, 0.489065, 0.000365, 0.010935)

# estimate power for one sample size by repeated simulation
# (defined before the loop that calls it; the name avoids masking stats::power)
estimate_power <- function(samplesize, reps = 1000, alpha = 0.01) {
  count <- 0
  for (i in 1:reps) {
    # generate multinomial random draw
    draw <- rmultinom(1, size = samplesize, prob = cell_probs)
    # save in 2x2 matrix form (rows: case, control; columns: AA, aa)
    twobytwo <- matrix(draw, nrow = 2, ncol = 2)
    # run a chi-square test (small expected counts trigger warnings, silenced here)
    result <- suppressWarnings(chisq.test(twobytwo))
    # extract p-value and record if below threshold
    pval <- result$p.value
    if (is.na(pval)) pval <- 1
    if (pval < alpha) count <- count + 1
  }
  count / reps
}

# sample sizes from 100 to 10,000 in steps of 100
sam <- numeric(length = 100)
pow <- numeric(length = 100)
for (j in 1:100) {
  n <- j * 100
  sam[j] <- n
  pow[j] <- estimate_power(n)
  print(c(n, pow[j]))
}

plot(sam, pow, log = "x", type = "l", xlab = "sample size", ylab = "power")
```

This gives the following plot.

This shows a dramatic increase in power from a sample size of 600 to 1,200. Therefore, you would want to use an actual sample size of at least 1,000 under this scenario to have a reasonable chance of detecting a genotype-phenotype association of this size and frequency if it exists. If your sample size is below 400 (in this scenario) then you are just wasting time and resources.

The values and results above are just illustrative. In the blue mussel work the effect size was larger and tested in a different way. However, we found that we were at the lower edge of power in this kind of study (red circle over the solid blue line in the plot below). We had a power of only 14% and found three sites in significant association, which implies there are more to be found. The other colored lines show increasing significance cutoff stringency when correcting for multiple testing; the dashed lines are with the effect size halved and the dotted lines with the effect size quartered. If the study were done with a sample size of 1,000, all similar loci should be discovered and perhaps some of weaker effect.

There are a couple of other points to bring up related to this work. Briefly,

1. This was done in a zone of hybridization between two mussel species, which increases the power to detect associations because of longer range linkage disequilibrium (and this comes along with some confounding issues).
2. Stressing the mussels exaggerates the phenotypic range, which, if there is underlying genetic variance influencing the phenotype, can uncover otherwise cryptic genetic influences on the trait being studied.
In some ways this is similar to the often misunderstood phenomenon of genetic assimilation where altering the environment to exaggerate a phenotype can allow for more efficient selection to act upon otherwise subtle or completely hidden genetic influences upon a phenotype.
3. The variants associated with increased calcification under stressful conditions are at very low frequencies. If climate change results in strong selection for these variants it could still take centuries for them to increase to sufficient frequency to rescue mussel populations.

# Evolving a definition of life

This is derived from a class I am involved in this semester with Justin Walguarnery, ZOOL 490 "Origin and Future of Life." We started off by going over some attempts to define life. I wondered if a "living document" could participate in defining life. The original text is below.

## Can a living thing help define life?

February 14, 2018

### Abstract

What is life? The field of biology does not have a generally accepted definition of life. This document proposes a working definition of life and applies it to several examples and counter examples. Finally, a new experiment is proposed to attempt to harness a living system to help humans define life.

### Introduction

The line separating living versus non-living is hard to define despite many attempts (e.g., Schrödinger 1944; Korzeniewski 2001; Ruiz-Mirazo et al. 2004; Macklem & Seely 2010). Definitions do not exist on their own; they are a tool used by humans. Definitions help shape human thought, and useful definitions help us to think about the essential relationships between things. Good definitions promote additional insights that may not have been apparent before. These definitions do not need to be mutually exclusive and should be used according to their utility towards a thought or idea.

Defining a whale as a fish highlights its shape (fins, tail) and environment (marine, primarily underwater).
A whale as a mammal highlights its evolutionary relationship with other placental mammals and traits such as air breathing, warm blooded, live birth, etc. However, we can transcend to another level and call a whale (and all mammals) a fish to illustrate their evolutionary relationship and realize that we ourselves are actually highly modified fish (Shubin 2008). Viewing a bird as a dinosaur (Chiappe 2009), or a tree as a carbon crystal which has grown out of an atmospheric solution of carbon dioxide, leads to some poetic, surprising, and sometimes insightful perspectives that suggest alternative lines of thought to explore.

Proposed definition:

1) Living things are capable of using resources from their environment to produce multiple copies of themselves with a level of complexity that allows for infinite possible heritable alterations.
2) These heritable alterations can result in unexpected emergent properties and can affect relative reproductive success among their copies.
3) The plans or instructions used to make a new organism are not separate from, but are linked to and share, the reproductive fate of the new organism.

### Examples of life

Cells—cells are considered the basic unit of life on Earth and there is no argument among biologists that cells are examples of living things. Cells capture energy and materials from their environment and reproduce. Biology is full of the presumably ultimately infinite complexity of species that cells can give rise to. The problem with cells is that they have become equated with life in some definitions—if it is not a cell it is not alive—and this has likely profoundly biased our concept of what life is and is not.

The definition used here does not make a distinction between autotrophs, cells that can produce all they need from non-living material, and heterotrophs which depend on other cells for energy and materials.
This lack of a distinction is very important both in this definition of life and for many of the examples to follow. Other cells are a part of many cells’ environments, which they make use of in order to reproduce. The definition of life used here recursively includes organisms that depend on other organisms.

Viruses—viruses also use their environment, namely certain types of cells depending on the virus, to make copies of themselves. Simply because a virus depends on another cell for reproduction does not mean that it is not living. The same could also be said for humans. We, along with the vast majority of life on Earth, ultimately depend on other cells to survive—and we certainly consider ourselves to be living. If humans are living, viruses are living. Viruses are also capable of extremely rapid evolution into a diversity of forms that affect their reproduction in an ever-changing immune response from cells (e.g., Gong et al. 2013).

Computational life—humans, which depend on other cells to survive, have built computers, which currently depend on humans to operate. These computers can run simulations in which organisms compete with each other within a simulated environment to reproduce. If these simulations are sufficiently complex, so that there are an infinite number of possible “mutations” that affect survival and reproduction, with emergent unexpected properties, then these organisms are living according to the definition here (e.g., Yaeger 1994). It is not relevant that they exist within electronic states of a computer’s memory or that the course of evolution can be exactly recreated by starting from a saved earlier state.

Self replicating machines—humans use tools, which are simple machines, to build other machines. The last century has seen an increasing level of automation within factories, some of which build the very machines used in the robotic automation of factories.
It would take a considerable amount of planning and design but it is entirely conceivable that a complex set of interacting machines could be built to collect their own energy in order to locate, mine, and refine raw materials. They could use these materials to produce components from a set of instructions, for building additional machines to collect energy, to locate more raw materials, and to repair and maintain existing tools and parts. It would be very complex, but in principle this could be accomplished. The idea has been proposed to build self-replicating factories on the moon, without humans present, in order to generate materials for space exploration (Chirikjian et al. 2002). If heritable “mutations” in the instructions used to make parts were possible, and it is hard to imagine how this could be completely prevented, then this system could be capable of limitless evolution and optimization as multiple factories began competing for resources. It is also easy to imagine specializations as some factories evolve to “steal” parts from other factories, etc. This type of system would be alive in every sense. The fact that it was initially designed by humans is not relevant to the definition used here. In fact, the initial factories would be autotrophs and even more alive than humans according to definitions of life that object to a reliance upon other cells.

Similar to self-replicating moon factories, self-replicating spacecraft have been proposed (Tipler 1981). The machines would build copies of themselves from resources, such as comets, asteroids, and solar power, in space. This would also be technically challenging but there is no theoretical reason why this could not ultimately be accomplished. By constantly renewing and multiplying themselves, they could explore a greater region of space.
All that is necessary to become living is for heritable changes to the designs, that affect the function of the probe, to be inherited with the machine so that the evolutionary trajectory is open ended and unpredictable—and incidentally potentially dangerous to us humans.

Social insects—insect colonies with a sterile “caste” of individuals are another challenging example. These workers are produced by a queen and only queens and drones can reproduce. The queen is dependent on the workers for long-term survival and reproduction. We humans, ironically, are comfortable with the idea of a multi-cellular organism, but have difficulty embracing the idea of a multi-bodied organism. (Bacterial scientists have a similar problem in thinking of the collection of all of the cells in our body as a single living unit⸮) Worker ants or bees are just extensions of the soma, the cells of an organism that are not in the germ-line which are passed on to the next generation. The entire colony is the unit of life. Extending this idea another step; some species of ants have domesticated, in every sense of the word, other species such as fungi and are now mutually reliant. Fungal farming is coupled with specific species and is copied from colony to new colony (e.g., Mueller et al. 1998). Here perhaps the unit of life is even larger and multi-species?

Transposable elements—these are a class of DNA sequences that reproduce themselves within a host’s genome. Much like a virus, they cannot function outside of a cell, yet they use their environment to reproduce and they evolve in potentially complex ways (e.g., Feschotte et al. 2003; Chuong et al. 2017).

### Counter examples of life

Fire—in a sense fires can reproduce and multiply. However, there is no heritable functional complexity that can optimize reproduction. Therefore fire is not a living system.

Crystals—crystals grow and, if broken and distributed, perhaps by mechanical action such as waves, they can seed new crystal growth.
However, like fire there is not an open-ended heritable complexity.

Prions—prions are proteins folded into a three dimensional configuration that induce similar proteins to also fold into the new configuration. In mammals prions are most often associated with disease; however, in fungi prions are used as a type of molecular memory to record past states and environmental conditions (Shorter & Lindquist 2005). Prions use their environment to make copies of themselves, but they do not pass on malleable heritable information that is capable of complex forms.

Robots—currently no robotic systems are alive. They are capable of complex behaviors and interactions. Robots can also be used to build other robots. However, if the plans used to build robots are static (e.g., the same physical copy is used), and not also copied along with the daughter robots and capable of change (and linked to the survival of the resulting copy), there is not a heritable system capable of unpredictable change, which is a requirement for this definition of life.

Evolved hardware—hardware exists that has been designed by an evolutionary process (e.g., Thompson 1996; Lohn et al. 2005). How they work can be very hard to understand (by humans) and they can often outperform human-made designs. However, there is again not an open-ended heritable complexity passing from one design to the next.

RepRap—the RepRap project is an effort to make 3D printers out of parts largely made by 3D printers (Jones et al. 2011). This is a fun project and the printers are evolving to a certain extent, based on human-designed changes. However, the plans used to make the printers are not copied along with the printers, with mutation and alterations that are heritable. The fate of the plans is not linked to the fate of the product. Thus, there is again a lack of heritable complexity that can lead to unpredictable unbounded evolution.
Ultimately this could evolve into a living system but it is not there yet according to the definition used here.

Artificial Intelligence—the concept of highly developed AI is often confused with what might qualify as a living system. AI is or will soon be capable of intelligent communication, self awareness, and detailed knowledge about the world (Ferrucci et al. 2013; MacDonald 2015; Warwick & Shah 2016). However, what is missing is evolution and selection with heritable components. Often speculation about AI is confounded with assumptions of the need for self-preservation and even eventual hostility towards humans (e.g., Holley 2015); why? If AI was not shaped by evolution and competition for resources then there is no reason to suspect that it would have a motive of self-preservation. (How people might use AI is however another matter; Helbing et al. 2017.) This is an anthropomorphic projection of human behaviors onto a nonliving system. If AI were designed to make copies of itself with heritable changes then it could become a living system, but perhaps we had better not do that.

### Discussion

Often there is a confounding of complexity, intelligence, and replication with life. These are not the same. Life can be fairly simple (e.g., viruses and transposable elements, although one may hesitate to call these simple) if an appropriately complex environment, an operating system in the literal sense, is available. A key concept of this definition of life is being comfortable with various levels of dependence on other living things. There are layers of living things that depend upon each other, enabling additional forms of life, just as the operating system of a computer allows computer viruses to flourish.

Living things are both able to replicate and have a sufficient degree of heritable complexity so that their evolutionary trajectory is unpredictable. Living objects are capable of evolving into an infinite number of states.
Sufficient complexity is meant in the sense of Turing complete systems that are ultimately capable of simulating any computation (in an informational computation perspective, Turing 1937, sans the infinite memory requirement). Unpredictable is meant in the sense of chaotic systems, where approximate current states (with any degree of measurement uncertainty) cannot predict future states (cf., Werndl 2009).

There is also a confounding of the question of the origin of cellular life on Earth, with a focus on nucleotides and early metabolic processes, with definitions of life. Here it is argued that the definition of life and the question of the origin of a specific form of life are distinctly different things.

Life does not exist on a scale from more living to less alive. This is a tempting elaboration to deal with viruses, heterotrophs, and social insects in alternative definitions of life. However, there is a discrete phase shift between systems that do not contain all the aspects of living systems (lacking in reproduction, heritable evolvability, complexity, etc.) and living systems that can use their environment to ultimately evolve endless possible forms. Some things might exist very close to this boundary—the edge between living and non-living. However, a key component is that the plans used to make multiple copies are themselves also copied and can change, and that their fate is linked to the fate of the copy.

Life is also consistent with design and extinction. Life under this definition is not limited to natural (non-human made) occurrences. Also, living things can ultimately go extinct—in fact the majority of cellular based species on Earth have gone extinct; yet they were no less living.

Many definitions of life discriminate between existence in physical reality versus … what‽ Can something in this universe ultimately exist outside of physical laws? Life, such as computer simulations, is not necessarily embedded within objective physical reality.
But does this objective reality actually exist? Beginning with realizations that the Earth is not flat and the sun does not orbit the earth, even though it casually appears to us that this is so, we have uncovered more and more about the natural world that is not obvious to casual Earth-bound human observation, such as microorganisms, molecules, atoms, elementary particles, and the theoretical idea of quarks. Observations and ideas like particle entanglement in quantum mechanics, time-space trade-offs in relativity, and energy-matter equivalence challenge what we think of as objective reality and hint at deeper unfamiliar layers that underlie the world we see around us. Is it impossible to suppose that there are more fundamental physical worlds from which we would be viewed as, at best, a superficial simulation?

Some attempts at defining life utilize a thermodynamic perspective (e.g., Schrödinger 1944; Schneider & Kay 1994). However, nothing is free from physical laws, including thermodynamics, so we can entertain the idea that it is ultimately not useful to define life in thermodynamic terms to attempt to separate it from non-living systems. It is also a mistake to act as a reductionist of biology to the chemical and physical worlds. The more interesting perspective is a focus on the emergent properties of living systems in their own right, which are not simply deducible from physical and chemical laws (Dobzhansky 1964).

One final note, some of us humans are generally very focused on individualism, which again is ironic because we are multi-cellular and highly dependent upon each other and other species. This might profoundly bias our thinking about life and what constitutes living systems. We tend to think of physically discrete organisms as the unit of life (see the example of social insects above). However, we should work to try to relax that assumption. We should try to keep an open mind in order to recognize life around us and in the universe.
This might take form in units that are physically distinct yet cooperate to reproduce and evolve (cf., Vaidya et al. 2012).

### The experiment

There is another example of life that is perhaps unusual and surprising; yet, it fits the definition used here. Chain letters have existed for centuries, most recently in electronic form. They use their environment (humans) to make copies of themselves, i.e., reproduce. Usually this is in the form of exploiting human magical thinking in terms of good and/or bad luck related to copying and distribution of the text. This has been recognized and described by some in evolutionary and biological terms (Goodenough & Dawkins 1994; Bennett et al. 2003; VanArsdale 2016). It is also impossible to ignore similarities between chain letters and some religious texts and to speculate on the role of chain-letter-type dynamics both in text and in oral tradition in the evolution of religions (e.g., pp. 208, 221-222, 230-231, Budge 1904; pp. 96, 275, Goddard 1938; p. 172, Mizuno 1982; as pointed out by VanArsdale 2016; and possibly other examples, Psalms 96:3; Mark 16:15; Sahih al-Bukhari 6:61:510).

Occasionally differences appear in chain letters that are copied to their offspring and influence their ability to reproduce (some of these differences are mistakes and some are purposefully made by humans). There is no obvious limit to the ways they can psychologically manipulate humans in order to reproduce (see the cookie recipe for a luck-free revenge-motivated example, Mikkelson 2016). This is not unlike some parasites that modify the behavior of their host organism to promote their own reproduction (Poulin 2010).

Can we utilize chain-letter-type dynamics for science? You can help make this document an example of life. Edit this manuscript with some changes and send it to some of your colleagues. This should be done without copyright by releasing it into the public domain (in current legal terms).
Thereby, this document can use its environment (humans) to make copies of itself and reproduce. The most successful versions will be more likely to reproduce. What will be the result? Perhaps a refined definition of life can evolve from the evolution of this document (definitions are for human use and this document needs to utilize humans to reproduce), or it might likely take a trajectory towards attention-seeking text unrelated to defining life. At the very least it is unpredictable and could be an interesting experiment in challenging ideas of what a living thing is and how we might recognize life, if it works at all.

Please send a copy of the original version of the document you received and the modified copy you sent out to this address, life.defining.life@gmail.com, so that the evolution of this document can be monitored over time and the results made public.

### References

Bennett, C. H., Li, M., & Ma, B. (2003). Chain letters & evolutionary histories. Scientific American, 288(6), 76-81.
Budge, E. A. W. (1904). The Gods of the Egyptians. Dover (1969), Vol. I & II.
Chiappe, L. M. (2009). Downsized Dinosaurs: The Evolutionary Transition to Modern Birds. Evolution: Education and Outreach, 2(2), 248–256.
Chirikjian, G. S., Zhou, Y., & Suthakorn, J. (2002). Self-replicating robots for lunar development. IEEE/ASME Transactions on Mechatronics, 7(4), 462-472.
Chuong, E. B., Elde, N. C., & Feschotte, C. (2017). Regulatory activities of transposable elements: from conflicts to benefits. Nature Reviews Genetics, 18(2), 71.
Dobzhansky, T. (1964). Biology, Molecular and Organismic. American Zoologist, 4, 443-452.
Ferrucci, D., Levas, A., Bagchi, S., Gondek, D., & Mueller, E. T. (2013). Watson: beyond Jeopardy!. Artificial Intelligence, 199, 93-105.
Feschotte, C., Swamy, L., & Wessler, S. R. (2003). Genome-wide analysis of mariner-like transposable elements in rice reveals complex relationships with stowaway miniature inverted repeat transposable elements (MITEs). Genetics, 163(2), 747-758.
Goddard, D. (1938). A Buddhist Bible. Boston: Beacon Press.
Gong, L. I., Suchard, M. A., & Bloom, J. D. (2013). Stability-mediated epistasis constrains the evolution of an influenza protein. eLife, 2, e00631.
Goodenough, O. R., & Dawkins, R. (1994). The 'St Jude' mind virus. Nature, 371(6492), 23.
Helbing, D., Frey, B. S., Gigerenzer, G., Hafen, E., Hagner, M., Hofstetter, Y., ... & Zwitter, A. (2017). Will democracy survive big data and artificial intelligence. Scientific American, 25.
Holley, P. (2015). Bill Gates on dangers of artificial intelligence: ‘I don’t understand why some people are not concerned’. Washington Post, 29. https://www.washingtonpost.com/news/the-switch/wp/2015/01/28/bill-gates-on-dangers-of-artificial-intelligence-dont-understand-why-some-people-are-not-concerned/?utm_term=.7bff3b352aba
Jones, R., Haufe, P., Sells, E., Iravani, P., Olliver, V., Palmer, C., & Bowyer, A. (2011). RepRap–the replicating rapid prototyper. Robotica, 29(1), 177-191.
Korzeniewski, B. (2001). Cybernetic formulation of the definition of life. Journal of Theoretical Biology, 209(3), 275-286.
Lohn, J. D., Hornby, G. S., & Linden, D. S. (2005). An evolved antenna for deployment on NASA’s Space Technology 5 mission. In Genetic Programming Theory and Practice II. pp. 301-315. Springer US.
MacDonald, F. (2015). A robot has just passed a classic self-awareness test for the first time. Science Alert, 17. https://www.sciencealert.com/a-robot-has-just-passed-a-classic-self-awareness-test-for-the-first-time
Macklem, P. T., & Seely, A. (2010). Towards a definition of life. Perspectives in Biology and Medicine, 53(3), 330-340.
Mikkelson, B. (2016). Neiman Marcus \$250 Cookie Recipe. https://www.snopes.com/business/consumer/cookie.asp
Mizuno, K. (1982). Buddhist Sutras: Origin, Development, Transmission. Tokyo: Kosei Publishing Co.
Mueller, U. G., Rehner, S. A., & Schultz, T. R. (1998). The evolution of agriculture in ants. Science, 281(5385), 2034-2038.
Poulin, R. (2010). Parasite manipulation of host behavior: an update and frequently asked questions. In Advances in the Study of Behavior (Vol. 41, pp. 151-186). Academic Press.
Ruiz-Mirazo, K., Peretó, J., & Moreno, A. (2004). A universal definition of life: autonomy and open-ended evolution. Origins of Life and Evolution of the Biosphere, 34(3), 323-346.
Schneider, E. D., & Kay, J. J. (1994). Life as a manifestation of the second law of thermodynamics. Mathematical and Computer Modeling, 19(6-8), 25-48.
Schrödinger, E. (1944). What Is Life? Cambridge University Press, Cambridge.
Shorter, J., & Lindquist, S. (2005). Prions as adaptive conduits of memory and inheritance. Nature Reviews Genetics, 6(6), 435.
Shubin, N. (2008). Your inner fish: a journey into the 3.5-billion-year history of the human body. Vintage.
Thompson, A. (1996, July). Silicon evolution. In Proceedings of the 1st Annual Conference on Genetic Programming (pp. 444-452). MIT press.
Tipler, F. J. (1981). Extraterrestrial Intelligent Beings do not Exist. Quarterly Journal of the Royal Astronomical Society, 21, 267-281.
Turing, A. M. (1937). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society. 2. 42: 230–65.
Vaidya, N., Manapat, M. L., Chen, I. A., Xulvi-Brunet, R., Hayden, E. J., & Lehman, N. (2012). Spontaneous network formation among cooperative RNA replicators. Nature, 491(7422), 72.
VanArsdale, D. W. (2016). Chain Letter Evolution. http://www.silcom.com/~barnowl/chain-letter/evolution.html
Warwick, K., & Shah, H. (2016). Can machines think? A report on Turing test experiments at the Royal Society. Journal of Experimental & Theoretical Artificial Intelligence, 28(6), 989-1007.
Werndl, C. (2009). What are the new implications of chaos for unpredictability? The British Journal for the Philosophy of Science, 60(1), 195-220.
Yaeger, L. (1994). Computational genetics, physiology, metabolism, neural systems, learning, vision, and behavior or Poly World: Life in a new context. In Santa Fe Institute Studies in the Sciences of Complexity-Proceedings Vol. 17, pp. 263-263. Addison-Wesley Publishing Co.

# Jukes-Cantor Probabilities

In an earlier post I went through a derivation of the Jukes-Cantor model (based on a Poisson distribution of mutation events) and arrived at the following relationship.

$d = \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)$,

where $d$ is the fraction of divergent sites between two DNA sequences and $t$ is the time since the sequences diverged from a common ancestor, in units of $g \mu$, where $g$ is the number of generations and $\mu$ is the per-generation, per-nucleotide mutation rate.

This can be rearranged to

$t = -\frac{3}{8} \log_e \left( 1 - \frac{4}{3} d \right)$

to convert an observed difference between two sequences into a corrected estimate of the divergence between two species.
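As a minimal sketch of the two formulas above, the conversion and its inverse can be written in a few lines of Python. The function names `jc_time` and `jc_fraction` are my own labels for illustration and do not come from any library.

```python
import math

def jc_fraction(t):
    """Expected fraction of differing sites d after time t (units of g*mu)."""
    return 0.75 * (1.0 - math.exp(-8.0 / 3.0 * t))

def jc_time(d):
    """Convert an observed fraction of differing sites d into t.

    The log is only defined for d < 3/4, so other values are rejected.
    """
    if not 0.0 <= d < 0.75:
        raise ValueError("d must be in [0, 0.75) for the log to be defined")
    return -3.0 / 8.0 * math.log(1.0 - 4.0 / 3.0 * d)

# The correction grows quickly as d approaches the 3/4 saturation point.
for d in (0.1, 0.3, 0.5, 0.7):
    print(d, jc_time(d))
```

The round trip `jc_fraction(jc_time(d))` returns `d` (up to floating point error), which is a quick way to check that the rearrangement is right.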

However, this is only a point estimate and cannot handle cases where the fraction of sites that differ happens to be 75% or more (this is possible by chance, but the formula then takes the log of zero or of a negative number, which is undefined).

We can rewrite this in a different way, as a binomial probability, and evaluate the probability of a data set over a range of divergence and place intervals around our point estimate.

As above, the probability that two nucleotides are different is

$\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)$.

Assuming each nucleotide is independent in a comparison of two sequences, we can raise this probability to the power of the number of sites that are different, $D$.

$P(D|t) = \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D$

On the other hand two nucleotides can be in the same state if no mutations occurred, which is a probability of $e^{- \frac{8}{3}t}$ (this comes from the Poisson distribution the model is based on, look back at the earlier post if this isn't clear) or mutations could have occurred along the lineages but they happened to end up as the same nucleotide (an "invisible" mutation). This second outcome has a probability of

$\frac{1}{4}\left(1 - e^{- \frac{8}{3}t}\right)$.

Combined, the probability that we see $S$ sites that are the same in two sequences descended from a common ancestor is

$P(S|t) = \left[e^{- \frac{8}{3}t}+\frac{1}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S$.

Since there are only two possibilities, the sites are the same or different, and the probability of all possible outcomes must be 100% we can rewrite this in a slightly simpler way.

$P(S|t) = \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S$.

Then multiply everything together to get the probability of the data including sites that differ and sites that are the same.

$P(D,S|t) \propto \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S$

There is also a binomial coefficient that should be multiplied to get the exact probability (because there are different ways to get the same data: the first and second site can be different, or the first and third, or the second and fifth, etc.). However, for a given data set this is a constant and we're going to drop it for the moment.

Say we align two sequences of ten base pairs from the same gene in two different species.

grasshopper: AGCTACAACT
cricket:     AGATACGACT


The sequence differs at two sites. We could rewrite the comparison as a 1 if the base is the same and a 0 if they are different.

DNA: 1101110111
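The comparison above is easy to automate. Here is a small sketch that builds the same 1/0 string and counts the differing ($D$) and matching ($S$) sites; the helper name `compare` is just an illustrative choice, and it assumes the two sequences are already aligned and of equal length.

```python
grasshopper = "AGCTACAACT"
cricket = "AGATACGACT"

def compare(seq_a, seq_b):
    """Return (mask, D, S) for two aligned sequences of equal length.

    mask has a 1 where the bases match and a 0 where they differ.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mask = "".join("1" if a == b else "0" for a, b in zip(seq_a, seq_b))
    return mask, mask.count("0"), mask.count("1")

mask, D, S = compare(grasshopper, cricket)
print(mask, D, S)  # 1101110111 2 8
```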

Making a plot of the two parts of the equation we can see that the probability that two sites are the same goes down as the time of divergence increases and the probability that they are different increases.

Importantly, there is a middle ground that can best accommodate both signals from the data, the differences and the similarities, near but lower than the crossover point. Combining these together we get the curve of probability of all the data with a peak just above 0.1.

This can be integrated to find upper and/or lower confidence intervals (which is messy and uses the hypergeometric function); however, it is clear from looking at the curve that a wide range of values are hard to rule out with a data set this small.
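As a rough numerical alternative to the integration, here is a sketch that evaluates the log-likelihood on a grid of $t$ and reports the range of values within 1.92 log-likelihood units of the peak. Note that this is a likelihood-ratio (support) interval, a different technique from integrating the curve; 1.92 is half of the 95% point of a chi-square distribution with one degree of freedom, and with only ten sites that approximation is crude. The grid limits, step count, and function names are assumptions made for illustration.

```python
import math

def p_diff(t):
    """Probability that a site differs after time t under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-8.0 / 3.0 * t))

def log_likelihood(t, D, S):
    """Log of the likelihood, dropping the constant binomial coefficient."""
    p = p_diff(t)
    return D * math.log(p) + S * math.log(1.0 - p)

def support_interval(D, S, t_max=2.0, steps=20000, drop=1.92):
    """Grid search for the peak and the t values within `drop` of it."""
    grid = [t_max * (i + 1) / steps for i in range(steps)]  # skips t = 0
    ll = [log_likelihood(t, D, S) for t in grid]
    best = max(ll)
    inside = [t for t, v in zip(grid, ll) if v >= best - drop]
    return grid[ll.index(best)], inside[0], inside[-1]

# Ten sites (2 differ) versus a hypothetical 100 sites with the same fraction.
small = support_interval(2, 8)
large = support_interval(20, 80)
print("10 sites:  peak, lower, upper =", small)
print("100 sites: peak, lower, upper =", large)
```

Both data sets peak at about the same $t$, but the interval for the larger (hypothetical) data set is much narrower.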

If the data set were larger we could narrow down the confidence interval.

So, how does this compare to the point estimate we get from the traditional Jukes-Cantor distance? A difference of 0.2 (two sites out of 10) can be plugged in for $d$,

$t = -\frac{3}{8} \log_e \left( 1 - \frac{4}{3} d \right)$,

and gives $t \approx 0.1163$.

When we take the derivative of

$P(D,S|t) \propto \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S$

set it equal to zero and solve for $t$ we get

$t = \frac{3}{8} \log_e\left(- \frac{3(D+S)}{D-3S} \right)$.*

Plugging in $D=2$ and $S=8$ gives $t \approx 0.1163$, exactly the same as with the other method. The original Jukes-Cantor method is a maximum likelihood point estimate (the parameter value of the model that maximizes the likelihood of the data). You can also see that it is equivalent in other ways. The $D-3S$ in the denominator makes the log evaluate a negative number, which is undefined, if the fraction of sites that vary is greater than 3/4 of the total. (Thinking about this intuitively, it is most consistent with positive infinity; however, I suspect the curve gets infinitely flat and almost any large number is essentially just as likely.) However, we can still work with the probabilities, which allows us to rule out small values of $t$, as in the example below where the sites that differ are 80% of the total.

* This is only true for real solutions. The more general solution is

$t = \frac{3}{8} \left(\log_e\left(- \frac{3(D+S)}{D-3S} \right) + 2 i \pi n \right)$,

where $n$ is an element of the integers. This is fun because it combines a base $e$ log with $i$ and $\pi$, but this isn't really relevant to the discussion (time of divergence doesn't rotate into a second dimension).
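The equivalence of the two point estimates, and the behavior when more than 3/4 of the sites differ, can be checked numerically. This is a sketch with hypothetical function names. Returning infinity when $D \geq 3S$ is my own convention: in that case the likelihood keeps rising with $t$ and never reaches a finite peak.

```python
import math

def p_diff(t):
    """Probability that a site differs after time t under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-8.0 / 3.0 * t))

def likelihood(t, D, S):
    """Likelihood of D differing and S matching sites (constant dropped)."""
    p = p_diff(t)
    return p ** D * (1.0 - p) ** S

def mle_time(D, S):
    """Real-valued maximum likelihood estimate of t from the closed form."""
    if D >= 3 * S:
        return math.inf  # no finite maximum: likelihood increases with t
    return 3.0 / 8.0 * math.log(-3.0 * (D + S) / (D - 3.0 * S))

# Same answer as the traditional distance with d = D / (D + S) = 0.2.
print(mle_time(2, 8), -3.0 / 8.0 * math.log(1.0 - 4.0 / 3.0 * 0.2))

# With 80% of sites differing, small t is far less likely than large t.
for t in (0.1, 0.5, 1.0, 10.0):
    print(t, likelihood(t, 8, 2) / likelihood(10.0, 8, 2))
```

The last loop prints each likelihood relative to the value at a large $t$, which shows how strongly small divergence times are ruled out even though no point estimate exists.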

# Toxorhynchites

We found a total of five Toxorhynchites larvae in a pool containing Culex larvae near Manoa stream (Jan 18, 2018, 21.299882, -157.813210). I haven't seen this genus before; they are possibly T. splendens. These are unusual mosquitoes in that they do not blood feed. The larvae eat aquatic invertebrates including other mosquito larvae and several species were introduced into Hawaiʻi as an attempt at biocontrol of mosquitoes. They are also very large mosquitoes and the genus includes the largest mosquito species.  Thanks to Matt Medeiros and Jessica Mavica for helping me ID them.

# Jessica Mavica, Pirbright Institute

Jessica Mavica visited our lab last week and gave a presentation in our department seminar series. She works at the Pirbright Institute (https://www.pirbright.ac.uk/) and took some Culex mosquito larvae back to the institute for research projects there.

# Transcriptional effects of a positive feedback circuit in Drosophila melanogaster

A new paper is out at BMC Genomics; it was published a few days before the new year.

Transcriptional effects of a positive feedback circuit in Drosophila melanogaster

This is work that was started many years ago. The tTAV gene expression system was developed into a genetic pest management tool by various groups (e.g., Horn C, Wimmer EA. A transgene-based, embryo-specific lethality system for insect pest management. Nat Biotechnol. 2002;21:64–70; Alphey, Luke. "Expression system for insect pest control." U.S. Patent 9,121,036, issued September 1, 2015). A positive feedback form of the tTAV system was being marketed by Oxitec as "RIDL."  In this system a transgene promotes its own expression by binding to DNA. If left unchecked this causes a runaway feedback loop and is toxic to the organism. However, the protein produced also binds to tetracycline and if tetracycline is added to the insect's diet then the lethal effect is masked (tetracycline bound protein does not bind to DNA and drive gene expression).

In a nutshell, in applications large numbers of insects can be reared with tetracycline and then released into the wild to mate with wild individuals. The toxic effect is dominant and kills off the offspring of the wild individuals that inherit it (who presumably do not have tetracycline in their diet). If enough individuals can be released relative to the wild population this can cause a reduction in wild population numbers over the following generations. This is similar to classical SIT (Sterile Insect Technique) for population suppression.

We were curious about testing this system in various ways. In the first exploration we added tetracycline to food for fruit flies to see if there was an effect. Tetracycline can disrupt normal mitochondrial function and kill naturally occurring bacteria including endosymbionts like Wolbachia. This was the basis of a student's research project: Müller, H., Krefft, M., Reeves, G., & Reed, F. A. 2010. Trans-generational influence of tetracycline on Drosophila melanogaster (Bachelor Thesis, Fachhochschule Bingen). He found a reduction in egg laying and a delay in development in response to tetracycline. Interestingly there was a transgenerational effect in both the paternal and maternal sides (maternal could be due to mitochondrial or Wolbachia effects, but these do not explain the paternal effect).

Next we engineered the tTAV system in Drosophila melanogaster and performed an EMS (ethyl methanesulfonate) mutation screen to see if any mutations could result in tolerance of the feedback loop and survival without dietary tetracycline. This was the basis of a grant application: Deutsche Forschungsgemeinschaft (DFG, German National Research Foundation). Die Entstehung von Resistenzen gegen genetisch induzierte Sterilität bei Insekten. (The evolution of resistance to genetically induced sterility in insects.) 2010 RE-3062/2-1. Long story short, we generated ~400,000 embryos from flies exposed to EMS and recovered five individuals that had the tTAV insert but could survive without tetracycline. All five were male. We tried mating them to females to recover offspring, and all five appeared to be sterile, which essentially ended that experiment.

In this paper we wanted to determine the genome-wide transcriptional response to the runaway tTAV positive feedback loop.

The tTAV insert works as expected in D. melanogaster. The image above shows the curve of lethality as the concentration of two antibiotics, doxycycline and tetracycline, declines. At the upper end of the curve approximately half of the offspring of a heterozygous parent carry the genetic modification, as expected. As the dietary concentration diminishes the curve approaches zero: none of the offspring that carry the tTAV insert survive.

The cause of this lethality is unknown. One can imagine that a positive feedback loop of gene expression and protein production may over-use transcriptional and translational resources and/or overload protein degradation pathways, or disrupt off target gene expression. These hypotheses are not mutually exclusive. Looking at the shift in expression of other genes in the genome in response to tTAV runaway expression could be informative and give us clues as to what process results in lethality.

While we did see thousands of genes across the Drosophila genome that significantly changed expression, there was no clear pattern that emerged. The tTAV system was inserted at different points in the genome and these different lines gave different groups of genes with altered expression. A relative handful of 31 genes with differences in expression were shared across all lines. However, 27 of these 31 genes shifted in expression in different directions, some with increased expression and some with decreased expression across the lines.

This left just four genes with consistent changes in expression when the tTAV system was activated: crok (CG17218), Cyp6a17 (CG10241), olf186-F (CG11430), and Pex23 (CG32226). There is no clear relationship between these genes nor any obvious hypothesis that presents itself in terms of understanding a response to shifts in tTAV activity.

We backed up and looked at gene ontology enrichment. Were there categories of genes involved in certain processes or pathways that tended to be represented in the shifts in gene expression associated with tTAV? Again, no clear pattern presented itself.

Were there regional changes in expression, perhaps related to different insert sites or nuclear architecture interactions among chromosomes? No, we looked at groups of genes in sliding windows along the genome and there was no clear pattern of clustering by region.

Keep in mind, however, that there were widespread and significant changes in gene expression. It was not that there was no effect, just that there was no clear pattern to the effect.

Gene expression levels between pairwise comparisons of the different insert lines and the control all showed statistically significant correlations. However, the direction of the correlation was not consistent. (This is shown in the figure above.)

I was quite surprised by this result, namely that the insertion site of a highly expressed transgene has such a profoundly idiosyncratic effect on genome-wide transcription patterns. This was a bit frustrating, to be honest, but it does suggest some cautions for these kinds of studies and next steps to explore. First, a caution: if we had only studied the transcriptional response to a single insert, we would have found a significant response, with many genes affected, and may have been misled by the resulting pattern (if, for example, a number of these genes were in a certain pathway), overgeneralizing the particular signal seen. As a next step, it would be interesting to see if this widespread idiosyncratic effect extends to other genetic modifications or is specific to the tTAV system. For example, what is the genome-wide transcriptional response, if any, to actin5c-driven expression of EGFP? Importantly, does any effect seen vary widely by insertion site?

# Gene Drive Questions

I received an inquiry about gene drive technology by email and thought that it might be useful to post part of it along with my reply here.

Here is the question:

"Since a few months, gene drive is a very prominent topic. But what I am missing a recent review about the state of the art in building gene drive constructs and testing them. I would like to ask you if you could provide some advice about literature, which would cover an overview about this theme.  At the level where I try to follow the discussions, I have the feeling that things are randomly mixed (like e.g. gene drive = synthetic biology). It would be really great to have an overview about what is possible by now, what has been done, which systems work and which do not work. And perhaps where is gene drive applied."

And my response:

Here are some general reviews:

This is rapidly developing.  A lot of the recent focus is on CRISPR-Cas9 based gene drive and there is some confusion in the media that this is the only form of gene drive.  I've tried to categorize the different forms in Appendix A of this article:

https://arxiv.org/abs/1706.01710

One major problem with CRISPR-Cas9 based gene drive is the rapid accumulation of mutation and naturally occurring resistance in the wild. See:

Work is also progressing on underdominance-type approaches that may get around some of the problems with the CRISPR system.

The Genetic Engineering and Society Center at NCSU has a lot of resources and knowledgeable people in this area.

https://research.ncsu.edu/ges/

Here is a series of publications that were recently posted; I haven't had a chance to go through all of them yet.

https://research.ncsu.edu/ges/publications/faculty-publications/?table_filter=JRISpecialIssue

There are also some relevant publications that will come out over the next few months. I can update you as these are published and become publicly available.

To my knowledge there is no direct application as of this moment, but this is likely only a matter of time.

The US military is also very interested in gene drive issues. DARPA is funding some of the research on gene drive and IARPA is working on ways to detect if genetic modifications have occurred.

Let me know if I can help with additional information or if you want more detail about something.

# Virus imaging with negative staining

First of all there has been a long lag in posting any updates here about the lab.  There are many things going on that are a higher priority at the moment; however, I also want to try to keep a certain tempo of posts and updates going here if possible.  So, I set aside a few minutes today to make a post.

A couple of undergraduates in the lab have an interest in viruses, specifically bacteriophages. They have worked out methods to isolate and propagate the phages for various experiments. However, we wanted to get a look at the actual viral particle instead of inferring it indirectly from plaques on a plate of bacteria. I talked to Tina Carvalho at our Biological Electron Microscope Facility and she was enthusiastic about imaging the viruses.

Negative staining is a relatively quick and easy method to get viral mugshots. The virus particles are put onto a thin membrane held in a metal grid and electrons are beamed through both the virus and the grid to get an image. The problem is that this doesn't really work that well.  The virus is very transparent to electrons and they will pass right through almost like it wasn't there, i.e., the image contrast defining the virus will be low and look faint.

A stronger image can be made by adding a thin fluid of uranium salt (uranyl acetate, UO2(CH3COO)2·2H2O). Uranium is very dense and good at blocking electrons. If all goes well (and Tina pointed out that there can be quite a bit of luck involved, with the color of your socks that day possibly playing a role) the fluid will dry a bit and form droplets around the virus particles by adhesion. The electrons pass through the virus, which makes a path of sorts through the uranium drop, so the outside shell of the viral surface can be better resolved. Here is a figure to try to illustrate.

First we looked at a Vibriophage that Stacy Paulino isolated from seawater from Maunalua Bay. This phage infects a coral pathogen, Vibrio coralliilyticus, and Stacy imaged it below.

The next image is an unknown coliphage (infects Escherichia coli).

Notice the scale bar in the image indicating 20 nanometers; that is roughly the size of a ribosome. Double-stranded DNA is about 2 nm wide, or 1/10 of the scale bar. The graininess of the tube is from individual tail tube proteins, single polypeptide molecules, which are typically assembled from ~250 individual amino acids.

Maya Shaulsky (in a collaboration with Bob Thomson) is working with T7 (below) and made these coliphage images. Originally we thought we were working with T7 coliphage but we kept getting frustrating results that didn't make sense. When we actually saw the virus (the image above) it confirmed that we were not working with T7 so we ordered actual T7 from a different stock center.

And here is an E. coli bacterium with T7 particles replicating inside the cell!

If you are working with something small I encourage you not to be intimidated by the idea of electron microscopy imaging. It is really something to actually see what you are working with. The protocol to prepare the specimens is not that long or difficult (some people may disagree with this; perhaps we have been lucky) and it is not really that expensive to pay for the training and machine time. It is worth trying at least once.

# An alternative GAL4-based genetic sterile insect technique?

Here is a thought that came to me on a walk. Perhaps this could also be applied to mosquitoes to suppress wild populations.

GAL4, UAS, and GAL80 are genetic tools from yeast that are commonly used in Drosophila.

GAL4 is a protein that drives expression of genes with UAS (Upstream Activation Sequence) as a gene expression enhancer. GAL80 inhibits GAL4.

So, place the gene encoding the GAL4 protein under UAS expression. This could create a runaway positive feedback loop (GAL4 drives its own expression of more GAL4) and result in lethality.  However, set this up in Drosophila that also express GAL80 to inhibit the GAL4 from binding to UAS.

If GAL80 were expressed from a Y chromosome, only male offspring would survive to mate with the remaining females. Female-specific lethality is a powerful way to suppress a population.

This could be set up in a stable lab Drosophila stock using compound X chromosomes for X^X Y females and regular X Y males. All the Drosophila stocks, sequences, and insertion sites exist to quickly try this out.

http://flystocks.bio.indiana.edu/Browse/aberration/compound_x_overview.htm

http://flystocks.bio.indiana.edu/Browse/chr_y/special_y.php
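To make the logic of the two crosses concrete, here is a small sketch that enumerates offspring by sex chromosomes. It rests on several simplifying assumptions that are mine and not tested results: every offspring inherits the UAS-GAL4 construct (for example from a parent homozygous for an autosomal insert), unchecked UAS-GAL4 is fully lethal, every Y chromosome from the stock carries GAL80, and triplo-X and Y/Y offspring are treated as inviable.

```python
# Sex and baseline viability for each sex-chromosome combination.
# "X^X" is a compound (attached) X that is inherited as a single unit.
KARYOTYPES = {
    ("X", "X"): ("female", True),
    ("X", "Y"): ("male", True),
    ("X^X", "Y"): ("female", True),
    ("X", "X^X"): ("female", False),  # triplo-X, treated as inviable here
    ("Y", "Y"): ("none", False),      # no X chromosome
}

def cross(mother_gametes, father_gametes):
    """Enumerate offspring as (karyotype, sex, survives) tuples.

    Assumes every offspring carries UAS-GAL4 and that GAL80 is present
    only when a Y chromosome is inherited.
    """
    offspring = []
    for m in mother_gametes:
        for f in father_gametes:
            karyotype = tuple(sorted((m, f)))
            sex, viable = KARYOTYPES[karyotype]
            has_gal80 = "Y" in karyotype
            offspring.append((karyotype, sex, viable and has_gal80))
    return offspring

# Lab stock: X^X Y females crossed to X Y males.
stock = cross(("X^X", "Y"), ("X", "Y"))
# Release: wild-type X X females crossed to released X Y males.
release = cross(("X", "X"), ("X", "Y"))

print("stock survivors:  ", sorted({sex for _, sex, ok in stock if ok}))
print("release survivors:", sorted({sex for _, sex, ok in release if ok}))
```

Under these assumptions both sexes survive in the compound-X stock, because daughters inherit the father's Y and sons inherit the mother's Y, while only sons survive when released males mate with wild females.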