Summer Update

It has been a busy summer! We are using the break in classes to catch up on everything else.

We just had a manuscript accepted for publication! More about that later.

I am leaving in a few hours for the AGA Symposium on the Big Island.

And, I just got back from an HHMI SEA-PHAGES "Phage Discovery Workshop" 11B.

The Phage workshop was extremely well organized and intensely packed with information. In one week we went from soil samples we collected to EM images of our isolated phage, with techniques like spot tests, titer calculations, DNA extraction, and restriction digests along the way. Plus a lot about basic phage biology and teaching methods.

There are two things I've been thinking about since. 1) Temperate phages can "hide out" in the host bacterium's genome, replicating along with the bacteria and only becoming active and lysing the cells when the bacteria are stressed. One way to cause this stress is with UV light. Could agricultural activities like plowing and turning over soil release more nutrients for plants by exposing soil bacteria to UV from the sun, inducing them to lyse and release nutrients back into the soil? 2) RNA genome viruses might be largely overlooked. The methods we went over would only recover DNA genome viruses. There might be ways to screen for RNA genome viruses, such as extracting nucleic acids from filtered environmental samples, treating them with DNase (to destroy the DNA), inactivating the DNase, then treating with reverse transcriptase to convert the RNA into DNA, and using this DNA to transform bacterial cells by heat shock or electroporation so that the RNA virus can be assembled inside the cell, then plating these out on top agar.


New discovery-based research curriculum for biological sciences launching at UH Mānoa

I (Floyd Reed) along with Megan Porter, Becky Chong, and Stuart Donachie are involved with an HHMI sponsored SEA-PHAGES program of undergraduate virus discovery and characterization in the classroom. This is something I have been interested in for several years now and it is exciting that we are finally able to bring it to the biology program here at UH.

(wall)Strom Wars

Michael Wallstrom gave the final EECB presentation of the semester on May 4th. He talked about some work he is planning to do to study near shore species diversity. Since it was on May the Fourth he did the presentation with a Star Wars theme complete with the initial sloped text animation moving off into a starfield (I wish I had a picture of that). In the photo he is on a slide where he talks about the economics of the Coastal Empire—the rebel base is here in Hawaiʻi.

Update: He gave his talk again and I snapped a shot of the introductory text animation.

Áki's defense

The big news in the lab recently is that Áki Láruson defended his Ph.D. dissertation "Genomics of the globally distributed echinoid genus Tripneustes".

From left to right in the lab photo taken right after his defense are me (Floyd Reed), Michael Wallstrom, Áki Láruson, and Maria Costantini. People outside of Hawaiʻi might be surprised by all the things around Áki's neck. Here it is customary to give a lei to someone after a big event, and if several people give you leis they can pile up as in this example. Note that at least two of the leis have a sea urchin theme.

Áki's graduation ceremony is on May 12th. He plans to stay in Hawaiʻi over the summer finishing up analysis of exon capture data. He then has an NSF funded postdoc at Northeastern's Marine Science Center in Boston starting this fall.


GWAS of blue mussel stress response

A new collaborative publication is out in print:


A key component to understanding the evolutionary response to a changing climate is linking underlying genetic variation to phenotypic variation in stress response. Here, we use a genome-wide association approach (GWAS) to understand the genetic architecture of calcification rates under simulated climate stress. We take advantage of the genomic gradient across the blue mussel hybrid zone (Mytilus edulis and Mytilus trossulus) in the Gulf of Maine (GOM) to link genetic variation with variance in calcification rates in response to simulated climate change. Falling calcium carbonate saturation states are predicted to negatively impact many marine organisms that build calcium carbonate shells – like blue mussels. We sampled wild mussels and measured net calcification phenotypes after exposing mussels to a ‘climate change’ common garden, where we raised temperature by 3°C, decreased pH by 0.2 units and limited food supply by filtering out planktonic particles >5 μm, compared to ambient GOM conditions in the summer. This climate change exposure greatly increased phenotypic variation in net calcification rates compared to ambient conditions. We then used regression models to link the phenotypic variation with over 170 000 single nucleotide polymorphism loci (SNPs) generated by genotype by sequencing to identify genomic locations associated with calcification phenotype, and estimate heritability and architecture of the trait. We identified at least one of potentially 2–10 genomic regions responsible for 30% of the phenotypic variation in calcification rates that are potential targets of natural selection by climate change. Our simulations suggest a power of 13.7% with our study's average effective sample size of 118 individuals and rare alleles, but a power of >90% when effective sample size is 900.

My primary role was in running simulations and the power analysis of the genotype-phenotype associations. A few SNPs were found that were in significant association with calcification rates under stressful conditions. However, the power to detect these loci was low, less than 15%, which on the surface suggests that there are several more loci in the genome that also have an effect.

The discovered SNPs were also at very low frequency, which also suggests a slight ascertainment bias associated with their discovery (they happened to be included in the sample by chance compared to possible similar undiscovered rare SNPs that were not sampled). So, a way to correct for ascertainment bias was explored and incorporated into the power analysis.

Suppose the number of rare alleles included in a sample has a Poisson distribution with a mean of 1.13.

There is a significant part of the distribution at a sample of zero (in blue). We can only estimate allele frequency from the sampled alleles (in red). This shifts the weight of the distribution upward. In this example the sampled mean is 1.67, an increase of 47%.
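The size of this shift can be checked numerically. The following is a minimal sketch in base R (the variable names are mine and are not taken from the published analysis). It computes the mean of the counts that are actually observed, first from the Poisson probabilities and then from simulated counts with the zeros dropped.

```r
# underlying Poisson mean of rare-allele counts (value from the text)
lambda <- 1.13

# probability that the allele is missed entirely (a count of zero)
p_zero <- dpois(0, lambda)

# mean of the observed counts (zero-truncated Poisson);
# the sums stop at 100, where the Poisson tail is negligible
k <- 1:100
lambda_a <- sum(k * dpois(k, lambda)) / sum(dpois(k, lambda))

# cross-check by simulation: draw counts and drop the zeros
set.seed(1)
counts <- rpois(1e6, lambda)
sim_mean <- mean(counts[counts > 0])

print(c(p_zero = p_zero, lambda_a = lambda_a, simulated = sim_mean))
```

Roughly a third of the probability mass sits at zero, and both ways of computing the observed mean should land close to 1.67.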

We can work backwards to the underlying Poisson mean from the observed counts by keeping track of the difference between the full Poisson distribution and the part above zero.

The distribution mean, \lambda_a, with ascertainment bias is related to the underlying mean, \lambda, by

\lambda_a = \frac{\sum^\infty_{i=1}i\frac{\lambda^ie^{-\lambda}}{i!}}{\sum^\infty_{j=1}\frac{\lambda^je^{-\lambda}}{j!}}

This is equal to

\lambda_a =\lambda\left(1+\frac{1}{e^\lambda-1}\right)

This is correct but it is in the wrong direction. We need to solve for \lambda. To do this we have to use the Lambert-W function, W\left(x\right).

\lambda = \lambda_a + W\left(-e^{-\lambda_a}\lambda_a\right)

Honestly, this is not that useful because W cannot be expressed in terms of elementary functions but you can plug in values from a truncated Poisson distribution, numerically evaluate \lambda, and see that it works.
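Here is one way to do that numerical check without a Lambert-W implementation. This is only a sketch: `truncated_mean` and `underlying_mean` are hypothetical helper names, and the inversion uses base R's `uniroot` root finder instead of evaluating W directly. The last line confirms that w = \lambda - \lambda_a satisfies w e^w = -\lambda_a e^{-\lambda_a}, which is the defining property of W in the equation above.

```r
# mean of a zero-truncated Poisson given the underlying mean lambda
truncated_mean <- function(lambda) lambda * (1 + 1 / expm1(lambda))

# invert numerically: find lambda whose truncated mean equals lambda_a
# (hypothetical helper; requires lambda_a > 1)
underlying_mean <- function(lambda_a) {
  uniroot(function(l) truncated_mean(l) - lambda_a,
          lower = 1e-8, upper = lambda_a, tol = 1e-12)$root
}

lambda_a <- truncated_mean(1.13)         # forward direction
lambda_hat <- underlying_mean(lambda_a)  # backward direction

# Lambert-W check: w = lambda - lambda_a should satisfy w * exp(w) = x
w <- lambda_hat - lambda_a
x <- -lambda_a * exp(-lambda_a)

print(c(lambda_a = lambda_a, lambda_hat = lambda_hat))
print(c(w_exp_w = w * exp(w), x = x))
```

The recovered value should return to the starting mean of 1.13, and the two numbers on the last line should agree.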

After correcting for ascertainment bias, power can be evaluated with repeated simulations of multinomial draws for a given sample size and effect size.

In a 2x2 table of cases and controls with two different genotypes

genotype  AA  aa
case      a   b
control   c   d

The effect size can be quantified by

\phi = \frac{a d - b c}{\sqrt{(a+c)(b+d)(a+b)(c+d)}}

Dividing the cases and controls evenly, setting the corrected frequency of the rare genotype to 1.13 out of 100 (0.0113), and using an effect size of \phi = 0.1 (10%) gives the following cell probabilities.

genotype  AA      aa        total
case      0.4996  0.000365  0.5
control   0.4891  0.0109    0.5
total     0.9887  0.0113    1
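These cell probabilities can be reproduced from the frequency and the effect size. The sketch below is my own reconstruction, not the original derivation: `phi_coef` and `cell_probs` are hypothetical helper names, and the function assumes an even case/control split with the rare genotype enriched in controls, as in the table. With equal row totals the numerator of \phi reduces to (d - b)/2, so d - b = \phi\sqrt{q(1-q)}, where q is the total frequency of the rare genotype.

```r
# phi coefficient for a 2x2 table laid out as in the text:
# rows are case/control, columns are genotypes AA/aa
phi_coef <- function(tab) {
  a <- tab[1, 1]; b <- tab[1, 2]
  cc <- tab[2, 1]; d <- tab[2, 2]   # cc is the cell called c in the text
  (a * d - b * cc) / sqrt((a + cc) * (b + d) * (a + b) * (cc + d))
}

# cell probabilities for an even case/control split, a total rare-genotype
# frequency q, and a target effect size phi (hypothetical helper)
cell_probs <- function(q, phi) {
  shift <- phi * sqrt(q * (1 - q))
  b <- (q - shift) / 2   # case, aa
  d <- (q + shift) / 2   # control, aa
  matrix(c(0.5 - b, b,
           0.5 - d, d),
         nrow = 2, byrow = TRUE,
         dimnames = list(c("case", "control"), c("AA", "aa")))
}

tab <- cell_probs(q = 0.0113, phi = 0.1)
print(round(tab, 6))
print(phi_coef(tab))
```

The printed table should match the probabilities above to the digits shown, and the phi coefficient recovered from it should be 0.1.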

These are used as multinomial probabilities and tested with a chi-square with a 1% cutoff in the following R script.

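# number of sample sizes to test
nsizes <- 100
# replicates per sample size
reps <- 1000
# significance cutoff (1%)
alpha <- 0.01
# cell probabilities in the order case AA, control AA, case aa, control aa
probs <- c(0.499635,0.489065,0.000365,0.010935)

sam <- numeric(length=nsizes)
pow <- numeric(length=nsizes)
for(j in 1:nsizes){
   # sample size for this step (assumed grid of 20, 40, ..., 2000;
   # the grid is not given in the text)
   samplesize <- 20*j
   count <- 0
   for(i in 1:reps){
      # generate multinomial random draw
      draw<-rmultinom(1, size = samplesize, prob = probs)
      # save in 2X2 matrix form
      twobytwo<-matrix(draw, nrow=2, ncol=2)
      # run a chi square test (warnings about small expected counts are suppressed)
      p<-suppressWarnings(chisq.test(twobytwo)$p.value)
      # extract p-value and record if below threshold
      # (an undefined p-value, e.g. no aa genotypes drawn, counts as not significant)
      if(isTRUE(p < alpha)) count <- count + 1
   }
   sam[j] <- samplesize
   pow[j] <- count/reps
}

plot(sam, pow,log="x",type = "l", xlab="sample size", ylab="power")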

This gives the following plot.

This shows a dramatic increase in power from a sample size of 600 to 1,200. Therefore, you would want to use an actual sample size of at least 1,000 under this scenario to have a reasonable chance of detecting a genotype-phenotype association of this size and frequency if it exists. If your sample size is below 400 (in this scenario) then you are just wasting time and resources.
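As a rough cross-check on the simulated curve, the power of a one degree of freedom chi-square test can be approximated with a noncentral chi-square distribution whose noncentrality parameter is the sample size times \phi^2. This is a standard large-sample approximation and not part of the original analysis; `approx_power` is a hypothetical helper name. It ignores the continuity correction that `chisq.test` applies to 2x2 tables by default, and it is poor when expected cell counts are small, as they are here for the rare genotype, so it should only be expected to show the same general trend.

```r
# approximate power of a 1-df chi-square test from the noncentral
# chi-square distribution with noncentrality n * phi^2
approx_power <- function(n, phi = 0.1, alpha = 0.01) {
  crit <- qchisq(1 - alpha, df = 1)
  pchisq(crit, df = 1, ncp = n * phi^2, lower.tail = FALSE)
}

n <- c(200, 400, 600, 1000, 1200, 2000)
print(data.frame(n = n, power = round(approx_power(n), 2)))
```

Under this approximation power climbs steeply between a few hundred and a little over a thousand samples, in line with the conclusion drawn from the simulation.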

The values and results above are just illustrative. In the blue mussel work the effect size was larger and tested in a different way. However, we found that we were at the lower edge of power in this kind of study (red circle over the solid blue line in the plot below).

We had a power of only 14% and found three sites in significant association, which implies there are more to be found. The other colored lines show increasingly stringent significance cutoffs when correcting for multiple testing; the dashed lines are with the effect size halved and the dotted lines with the effect size quartered. If the study were done with a sample size of 1,000, all similar loci should be discovered and perhaps some of weaker effect.

There are a couple of other points to bring up related to this work. Briefly,

  1. This was done in a zone of hybridization between two mussel species which increases the power to detect associations because of longer range linkage disequilibrium (and this comes along with some confounding issues).
  2. Stressing the mussels exaggerates the phenotypic range, which, if there is underlying genetic variance influencing the phenotype, can uncover otherwise cryptic genetic influences on the trait being studied. In some ways this is similar to the often misunderstood phenomenon of genetic assimilation where altering the environment to exaggerate a phenotype can allow for more efficient selection to act upon otherwise subtle or completely hidden genetic influences upon a phenotype.
  3. The variants associated with increased calcification under stressful conditions are at very low frequencies. If climate change results in strong selection for these variants it could still take centuries for them to increase to sufficient frequency to rescue mussel populations.

Evolving a definition of life

This is derived from a class I am involved in this semester with Justin Walguarnery, ZOOL 490 "Origin and Future of Life." We started off by going over some attempts to define life. I wondered if a "living document" could participate in defining life. The original text is below.

Can a living thing help define life?

February 14, 2018

What is life? The field of biology does not have a generally accepted definition of life. This document proposes a working definition of life and applies it to several examples and counter examples. Finally, a new experiment is proposed to attempt to harness a living system to help humans define life.

The line separating living versus non-living is hard to define despite many attempts (e.g., Schrödinger 1944; Korzeniewski 2001; Ruiz-Mirazo et al. 2004; Macklem & Seely 2010). Definitions do not exist on their own; they are a tool used by humans. Definitions help shape human thought, and useful definitions help us to think about the essential relationships between things. Good definitions promote additional insights that may not have been apparent before. These definitions do not need to be mutually exclusive and should be used according to their utility towards a thought or idea. Defining a whale as a fish highlights its shape (fins, tail) and environment (marine, primarily underwater). A whale as a mammal highlights its evolutionary relationship with other placental mammals and traits such as air breathing, warm blooded, live birth, etc. However, we can transcend to another level and call a whale (and all mammals) a fish to illustrate their evolutionary relationship and realize that we ourselves are actually highly modified fish (Shubin 2008). Viewing a bird as a dinosaur (Chiappe 2009), or a tree as a carbon crystal which has grown out of an atmospheric solution of carbon dioxide, leads to some poetic, surprising, and sometimes insightful perspectives that suggest alternative lines of thought to explore.

Proposed definition:

1) Living things are capable of using resources from their environment to produce multiple copies of themselves with a level of complexity that allows for infinite possible heritable alterations.

2) These heritable alterations can result in unexpected emergent properties and can affect relative reproductive success among their copies.

3) The plans or instructions used to make a new organism are not separate from, but are linked to and share, the reproductive fate of the new organism.

Examples of life
Cells—cells are considered the basic unit of life on Earth and there is no argument among biologists that cells are examples of living things. Cells capture energy and materials from their environment and reproduce. Biology is full of the presumably ultimately infinite complexity of species that cells can give rise to. The problem with cells is that they have become equated with life in some definitions—if it is not a cell it is not alive—and this has likely profoundly biased our concept of what life is and is not. The definition used here does not make a distinction between autotrophs, cells that can produce all they need from non-living material, and heterotrophs, which depend on other cells for energy and materials. This lack of a distinction is very important both in this definition of life and for many of the examples to follow. Other cells are a part of many cells’ environments, which they make use of in order to reproduce. The definition of life used here recursively includes organisms that depend on other organisms.
Viruses—viruses also use their environment, namely certain types of cells depending on the virus, to make copies of themselves. Simply because a virus depends on another cell for reproduction does not mean that it is not living. The same could also be said for humans. We, along with the vast majority of life on Earth, ultimately depend on other cells to survive—and we certainly consider ourselves to be living. If humans are living, viruses are living. Viruses are also capable of extremely rapid evolution into a diversity of forms that affect their reproduction in an ever-changing immune response from cells (e.g., Gong et al. 2013).
Computational life—humans, which depend on other cells to survive, have built computers, which currently depend on humans to operate. These computers can run simulations in which organisms compete with each other within a simulated environment to reproduce. If these simulations are sufficiently complex, so that there are an infinite number of possible “mutations” that affect survival and reproduction, with emergent unexpected properties, then these organisms are living according to the definition here (e.g., Yaeger 1994). It is not relevant that they exist within electronic states of a computer’s memory or that the course of evolution can be exactly recreated by starting from a saved earlier state.
Self replicating machines—humans use tools, which are simple machines, to build other machines. The last century has seen an increasing level of automation within factories, some of which build the very machines used in the robotic automation of factories. It would take a considerable amount of planning and design but it is entirely conceivable that a complex set of interacting machines could be built to collect their own energy in order to locate, mine, and refine raw materials. They could use these materials to produce components from a set of instructions, for building additional machines to collect energy, to locate more raw materials, and to repair and maintain existing tools and parts. It would be very complex, but in principle this could be accomplished. The idea has been proposed to build self-replicating factories on the moon, without humans present, in order to generate materials for space exploration (Chirikjian et al. 2002). If heritable “mutations” in the instructions used to make parts were possible, and it is hard to imagine how this could be completely prevented, then this system could be capable of limitless evolution and optimization as multiple factories began competing for resources. It is also easy to imagine specializations as some factories evolve to “steal” parts from other factories, etc. This type of system would be alive in every sense. The fact that it was initially designed by humans is not relevant to the definition used here. In fact, the initial factories would be autotrophs and even more alive than humans according to definitions of life that object to a reliance upon other cells. Similar to self-replicating moon factories, self-replicating spacecraft have been proposed (Tipler 1981). The machines would build copies of themselves from resources, such as comets, asteroids, and solar power, in space. This would also be technically challenging but there is no theoretical reason why this could not ultimately be accomplished. 
By constantly renewing and multiplying themselves, a greater region of space can be explored. All that is necessary to become living is for heritable changes to the designs, which affect the function of the probe, to be inherited with the machine so that the evolutionary trajectory is open ended and unpredictable—and incidentally potentially dangerous to us humans.
Social insects—insect colonies with a sterile “caste” of individuals are another challenging example. These workers are produced by a queen and only queens and drones can reproduce. The queen is dependent on the workers for long-term survival and reproduction. We humans, ironically, are comfortable with the idea of a multi-cellular organism, but have difficulty embracing the idea of a multi-bodied organism. (Bacterial scientists have a similar problem in thinking of the collection of all of the cells in our body as a single living unit⸮) Worker ants or bees are just extensions of the soma, the cells of an organism that are not in the germ-line which are passed on to the next generation. The entire colony is the unit of life. Extending this idea another step; some species of ants have domesticated, in every sense of the word, other species such as fungi and are now mutually reliant. Fungal farming is coupled with specific species and is copied from colony to new colony (e.g., Mueller et al. 1998). Here perhaps the unit of life is even larger and multi-species?
Transposable elements—these are a class of DNA sequences that reproduce themselves within a host’s genome. Much like a virus, they cannot function outside of a cell, yet they use their environment to reproduce and they evolve in potentially complex ways (e.g., Feschotte et al. 2003; Chuong et al. 2017).

Counter examples of life
Fire—in a sense fires can reproduce and multiply. However, there is no heritable functional complexity that can optimize reproduction. Therefore fire is not a living system.
Crystals—crystals grow and, if broken and distributed, perhaps by mechanical action such as waves, they can seed new crystal growth. However, like fire there is not an open-ended heritable complexity.
Prions—prions are proteins folded into a three dimensional configuration that induce similar proteins to also fold into the new configuration. In mammals prions are most often associated with disease; however, in fungi prions are used as a type of molecular memory to record past states and environmental conditions (Shorter & Lindquist 2005). Prions use their environment to make copies of themselves, but they do not pass on malleable heritable information that is capable of complex forms.
Robots—currently no robotic systems are alive. They are capable of complex behaviors and interactions. Robots can also be used to build other robots. However, if the plans used to build robots are static (e.g., the same physical copy is used), and not also copied along with the daughter robots and capable of change (and linked to the survival of the resulting copy), there is not a heritable system capable of unpredictable change, which is a requirement for this definition of life.
Evolved hardware—hardware exists that has been designed by an evolutionary process (e.g., Thompson 1996; Lohn 2005). How they work can be very hard to understand (by humans) and they can often outperform human-made designs. However, there is again not an open-ended heritable complexity passing from one design to the next.
RepRap—the RepRap project is an effort to make 3D printers out of parts largely made by 3D printers (Jones et al. 2011). This is a fun project and the printers are evolving to a certain extent, based on human-designed changes. However, the plans used to make the printers are not copied along with the printers, with mutations and alterations that are heritable. The fate of the plans is not linked to the fate of the product. Thus, there is again a lack of heritable complexity that can lead to unpredictable unbounded evolution. Ultimately this could evolve into a living system but it is not there yet according to the definition used here.
Artificial Intelligence—the concept of highly developed AI is often confused with what might qualify as a living system. AI is or will soon be capable of intelligent communication, self awareness, and detailed knowledge about the world (Ferrucci et al. 2013; MacDonald 2015; Warwick & Shah 2016). However, what is missing is evolution and selection with heritable components. Often speculation about AI is confounded with assumptions of the need for self-preservation and even eventual hostility towards humans (e.g., Holley 2015); why? If AI was not shaped by evolution and competition for resources then there is no reason to suspect that it would have a motive of self-preservation. (How people might use AI is however another matter; Helbing 2017.) This is an anthropomorphic projection of human behaviors onto a nonliving system. If AI were designed to make copies of itself with heritable changes then it could become a living system, but perhaps we had better not do that.

Often there is a confounding of complexity, intelligence, and replication with life. These are not the same. Life can be fairly simple (e.g., viruses and transposable elements, although one may hesitate to call these simple) if an appropriately complex environment, an operating system in the literal sense, is available. A key concept of this definition of life is being comfortable with various levels of dependence on other living things. There are layers of living things that depend upon each other, enabling additional forms of life. Just as the operating system of a computer allows computer viruses to flourish.
Living things are both able to replicate and have a sufficient degree of heritable complexity so that their evolutionary trajectory is unpredictable. Living objects are capable of evolving into an infinite number of states. Sufficient complexity is meant in the sense of Turing complete systems that are ultimately capable of simulating any computation (in an informational computation perspective, Turing 1937, sans the infinite memory requirement). Unpredictable is meant in the sense of chaotic systems, where approximate current states (with any degree of measurement uncertainty) cannot predict future states (cf., Werndl 2009). There is also a confounding of the question of the origin of cellular life on Earth, with a focus on nucleotides and early metabolic processes, with definitions of life. Here it is argued that the definition of life and the question of the origin of a specific form of life are distinctly different things.
Life does not exist on a scale from more living to less alive. This is a tempting elaboration to deal with viruses, heterotrophs, and social insects in alternative definitions of life. However, there is a discrete phase shift between systems that do not contain all the aspects of living systems (lacking in reproduction, heritable evolvability, complexity, etc.) and living systems that can use their environment to ultimately evolve endless possible forms. Some things might exist very close to this boundary—the edge between living and non-living. However, a key component is that the plans used to make multiple copies are themselves also copied and can change, and that their fate is linked to the fate of the copy.
Life is also consistent with design and extinction. Life under this definition is not limited to natural (non-human made) occurrences. Also, living things can ultimately go extinct—in fact the majority of cellular based species on Earth have gone extinct; yet, they were no less living.
Many definitions of life discriminate between existence in physical reality versus … what‽ Can something in this universe ultimately exist outside of physical laws? Life, such as computer simulations, is not necessarily embedded within objective physical reality. But does this objective reality actually exist? Beginning with realizations that the Earth is not flat and the sun does not orbit the earth, even though it casually appears to us that this is so, we have uncovered more and more about the natural world that is not obvious to casual Earth-bound human observation, such as microorganisms, molecules, atoms, elementary particles, and the theoretical idea of quarks. Observations and ideas like particle entanglement in quantum mechanics, time-space trade-offs in relativity, energy-matter equivalence, challenge what we think of as objective reality and hint at deeper unfamiliar layers that underlie the world we see around us. Is it impossible to suppose that there are more fundamental physical worlds from which we would be viewed as, at best, a superficial simulation?
Some attempts at defining life utilize a thermodynamic perspective (e.g., Schrödinger 1944; Schneider & Kay 1994). However, nothing is free from physical laws, including thermodynamics, so we can entertain the idea that it is ultimately not useful to define life in thermodynamic terms to attempt to separate it from non-living systems. It is also a mistake to act as a reductionist of biology to the chemical and physical worlds. What is the more interesting perspective is a focus on the emergent properties of living systems in their own right, which is not simply deducible from physical and chemical laws (Dobzhansky 1964).
One final note, some of us humans are generally very focused on individualism, which again is ironic because we are multi-cellular and highly dependent upon each other and other species. This might profoundly bias our thinking about life and what constitutes living systems. We tend to think of physically discrete organisms as the unit of life (see the example of social insects above). However, we should work to try to relax that assumption. We should try to keep an open mind in order to recognize life around us and in the universe. This might take form in units that are physically distinct yet cooperate to reproduce and evolve (cf., Vaidya et al. 2012).

The experiment
There is another example of life that is perhaps unusual and surprising; yet, it fits the definition used here. Chain letters have existed for centuries, most recently in electronic form. They use their environment (humans) to make copies of themselves, i.e., reproduce. Usually this is in the form of exploiting human magical thinking in terms of good and/or bad luck related to copying and distribution of the text. This has been recognized and described by some in evolutionary and biological terms (Goodenough & Dawkins 1994; Bennett et al. 2003; VanArsdale 2016). It is also impossible to ignore similarities between chain letters and some religious texts and to speculate on the role of chain-letter-type dynamics both in text and in oral tradition in the evolution of religions (e.g., pp. 208, 221-222, 230-231, Budge 1904; pp. 96, 275, Goddard 1938; p. 172, Mizuno 1982; as pointed out by VanArsdale 2016; and possibly other examples, Psalms 96:3; Mark 16:15; Sahih al-Bukhari 6:61:510). Occasionally differences appear in chain letters that are copied to their offspring and influence their ability to reproduce (some of these differences are erroneous mistakes and some are purposefully made by humans). There is no obvious limit to the ways they can psychologically manipulate humans in order to reproduce (see the cookie recipe for a luck-free revenge-motivated example, Mikkelson 2016). This is not unlike some parasites that modify the behavior of their host organism to promote their own reproduction (Poulin 2010).
Can we utilize chain-letter-type dynamics for science? You can help make this document an example of life. Edit this manuscript with some changes and send it to some of your colleagues. This should be done without copyright by releasing it into the public domain (in current legal terms). Thereby, this document can use its environment (humans) to make copies of itself and reproduce. The most successful versions will be more likely to reproduce. What will be the result? Perhaps a refined definition of life can evolve from the evolution of this document (definitions are for human use and this document needs to utilize humans to reproduce), or it might likely take a trajectory towards attention seeking text unrelated to defining life. At the very least it is unpredictable and could be an interesting experiment in challenging ideas of what a living thing is and how we might recognize life, if it works at all. Please send a copy of the original version of the document you received and the modified copy you sent out to this address, , so that the evolution of this document can be monitored over time and the results made public.

Bennett, C. H., Li, M., & Ma, B. (2003). Chain letters & evolutionary histories. Scientific American, 288(6), 76-81.
Budge, E. A. W. (1904). The Gods of the Egyptians. Dover (1969), Vol. I & II.
Chiappe, L. M. (2009). Downsized dinosaurs: the evolutionary transition to modern birds. Evolution: Education and Outreach, 2(2), 248-256.
Chirikjian, G. S., Zhou, Y., & Suthakorn, J. (2002). Self-replicating robots for lunar development. IEEE/ASME Transactions on Mechatronics, 7(4), 462-472.
Chuong, E. B., Elde, N. C., & Feschotte, C. (2017). Regulatory activities of transposable elements: from conflicts to benefits. Nature Reviews Genetics, 18(2), 71.
Dobzhansky, T. (1964). Biology, molecular and organismic. American Zoologist, 4, 443-452.
Ferrucci, D., Levas, A., Bagchi, S., Gondek, D., & Mueller, E. T. (2013). Watson: beyond Jeopardy!. Artificial Intelligence, 199, 93-105.
Feschotte, C., Swamy, L., & Wessler, S. R. (2003). Genome-wide analysis of mariner-like transposable elements in rice reveals complex relationships with stowaway miniature inverted repeat transposable elements (MITEs). Genetics, 163(2), 747-758.
Goddard, D. (1938). A Buddhist Bible. Boston: Beacon Press.
Gong, L. I., Suchard, M. A., & Bloom, J. D. (2013). Stability-mediated epistasis constrains the evolution of an influenza protein. eLife, 2, e00631.
Goodenough, O. R., & Dawkins, R. (1994). The 'St Jude' mind virus. Nature, 371(6492), 23.
Helbing, D., Frey, B. S., Gigerenzer, G., Hafen, E., Hagner, M., Hofstetter, Y., ... & Zwitter, A. (2017). Will democracy survive big data and artificial intelligence? Scientific American, 25.
Holley, P. (2015). Bill Gates on dangers of artificial intelligence: ‘I don’t understand why some people are not concerned’. Washington Post, 29.
Jones, R., Haufe, P., Sells, E., Iravani, P., Olliver, V., Palmer, C., & Bowyer, A. (2011). RepRap–the replicating rapid prototyper. Robotica, 29(1), 177-191.
Korzeniewski, B. (2001). Cybernetic formulation of the definition of life. Journal of Theoretical Biology, 209(3), 275-286.
Lohn, J. D., Hornby, G. S., & Linden, D. S. (2005). An evolved antenna for deployment on NASA’s Space Technology 5 mission. In Genetic Programming Theory and Practice II. pp. 301-315. Springer US.
MacDonald, F. (2015). A robot has just passed a classic self-awareness test for the first time. Science Alert, 17.
Macklem, P. T., & Seely, A. (2010). Towards a definition of life. Perspectives in Biology and Medicine, 53(3), 330-340.
Mikkelson, B. (2016). Neiman Marcus $250 Cookie Recipe.
Mizuno, K. (1982). Buddhist Sutras: Origin, Development, Transmission. Tokyo: Kosei Publishing Co.
Mueller, U. G., Rehner, S. A., & Schultz, T. R. (1998). The evolution of agriculture in ants. Science, 281(5385), 2034-2038.
Poulin, R. (2010). Parasite manipulation of host behavior: an update and frequently asked questions. In Advances in the Study of Behavior (Vol. 41, pp. 151-186). Academic Press.
Ruiz-Mirazo, K., Peretó, J., & Moreno, A. (2004). A universal definition of life: autonomy and open-ended evolution. Origins of Life and Evolution of the Biosphere, 34(3), 323-346.
Schneider, E. D., & Kay, J. J. (1994). Life as a manifestation of the second law of thermodynamics. Mathematical and Computer Modeling, 19(6-8), 25-48.
Schrödinger, E. (1944). What Is Life? Cambridge University Press, Cambridge.
Shorter, J., & Lindquist, S. (2005). Prions as adaptive conduits of memory and inheritance. Nature Reviews Genetics, 6(6), 435.
Shubin, N. (2008). Your inner fish: a journey into the 3.5-billion-year history of the human body. Vintage.
Thompson, A. (1996, July). Silicon evolution. In Proceedings of the 1st Annual Conference on Genetic Programming (pp. 444-452). MIT press.
Tipler, F. J. (1981). Extraterrestrial Intelligent Beings do not Exist. Quarterly Journal of the Royal Astronomical Society, 21, 267-281.
Turing, A. M. (1937). On Computable Numbers, with an Application to the Entscheidungsproblem. Proceedings of the London Mathematical Society. 2. 42: 230–65.
Vaidya, N., Manapat, M. L., Chen, I. A., Xulvi-Brunet, R., Hayden, E. J., & Lehman, N. (2012). Spontaneous network formation among cooperative RNA replicators. Nature, 491(7422), 72.
VanArsdale, D. W. (2016). Chain Letter Evolution.
Warwick, K., & Shah, H. (2016). Can machines think? A report on Turing test experiments at the Royal Society. Journal of Experimental & Theoretical Artificial Intelligence, 28(6), 989-1007.
Werndl, C. (2009). What are the New Implications of Chaos for Unpredictability? The British Journal for the Philosophy of Science, 60(1), 195-220.
Yaeger, L. (1994). Computational genetics, physiology, metabolism, neural systems, learning, vision, and behavior or Poly World: Life in a new context. In Santa Fe Institute Studies in the Sciences of Complexity-Proceedings Vol. 17, pp. 263-263. Addison-Wesley Publishing Co.

Jukes-Cantor Probabilities

In an earlier post I went through a derivation of the Jukes-Cantor model (based on a Poisson distribution of mutation events) and got the following relationship.

d = \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right),

where d is the fraction of divergent sites between two DNA sequences and t is the time since the sequences diverged from a common ancestor, in units of g \mu, where g is the number of generations and \mu is the per generation per nucleotide mutation rate.

This can be rearranged to

t = -\frac{3}{8} \log_e \left( 1 - \frac{4}{3} d \right)

to convert an observed difference between two sequences into a corrected estimate of the divergence between two species.

However, this is only a point estimate, and it cannot handle cases where the fraction of sites that differ happens to be 75% or more (this is possible by chance, but at exactly 75% it becomes the log of zero and above 75% the log of a negative number, both undefined).
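As a small illustration, here is a minimal Python sketch of the conversion in both directions. The function names jc_time and jc_fraction are made up for this example, and returning infinity when d is 0.75 or more is a convention chosen here, not something the formula itself provides.

```python
import math

def jc_fraction(t):
    """Expected fraction of differing sites after time t: d = 3/4 (1 - e^(-8t/3))."""
    return 0.75 * (1.0 - math.exp(-8.0 * t / 3.0))

def jc_time(d):
    """Point estimate of t from an observed fraction of differing sites d."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must be a fraction between 0 and 1")
    if d >= 0.75:
        # The log argument is zero or negative here, so there is no finite
        # real estimate. Returning infinity is an assumption of this sketch.
        return math.inf
    return -3.0 / 8.0 * math.log(1.0 - 4.0 * d / 3.0)

print(jc_time(0.2))               # two differences out of ten sites
print(jc_fraction(jc_time(0.2)))  # round trip back to the observed fraction
print(jc_time(0.8))               # saturated case
```

The round trip through jc_fraction is just a sanity check that the rearrangement is consistent with the original relationship.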

We can rewrite this in a different way, as a binomial probability, and evaluate the probability of a data set over a range of divergence and place intervals around our point estimate.

As above, the probability that two nucleotides are different is

\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right).

Assuming each nucleotide is independent in a comparison of two sequences, we can multiply this probability together once for each site that is different, that is, raise it to the power of the number of differing sites, D.

P(D|t) = \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D

On the other hand, two nucleotides can be in the same state if no mutations occurred, which has a probability of e^{- \frac{8}{3}t} (this comes from the Poisson distribution the model is based on; look back at the earlier post if this isn't clear), or mutations could have occurred along the lineages but happened to end up as the same nucleotide (an "invisible" mutation). This second outcome has a probability of

\frac{1}{4}\left(1 - e^{- \frac{8}{3}t}\right).

Combined, the probability that we see S sites that are the same in two sequences descended from a common ancestor is

P(S|t) = \left[e^{- \frac{8}{3}t}+\frac{1}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S.

Since there are only two possibilities (the sites are the same or different) and the probabilities of all possible outcomes must sum to 100%, we can rewrite this in a slightly simpler way.

P(S|t) = \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S.
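The same simplification can be checked directly by algebra on the term inside the brackets (a short worked step, nothing beyond the expressions already given):

```latex
e^{-\frac{8}{3}t} + \frac{1}{4}\left(1 - e^{-\frac{8}{3}t}\right)
  = \frac{1}{4} + \frac{3}{4} e^{-\frac{8}{3}t}
  = 1 - \frac{3}{4}\left(1 - e^{-\frac{8}{3}t}\right)
```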

Then multiply everything together to get the probability of the data including sites that differ and sites that are the same.

P(D,S|t) \propto \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S

There is also a binomial coefficient that should be multiplied to get the exact probability (because there are different ways to get the same data: the first and second site can be different, or the first and third, or the second and fifth, etc.). However, for a given data set this is a constant and we're going to drop it for the moment.
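To make the role of the binomial coefficient concrete, here is a minimal Python sketch. The names p_diff and likelihood, the exact flag, and the example counts (2 differing and 8 matching sites) are illustrative assumptions. It shows that the coefficient rescales the probability by the same constant at every t, so comparisons between values of t are unaffected by dropping it.

```python
import math

def p_diff(t):
    """Probability that a single site differs after time t."""
    return 0.75 * (1.0 - math.exp(-8.0 * t / 3.0))

def likelihood(t, D, S, exact=False):
    """Probability of D differing and S matching sites at divergence t.

    With exact=True the binomial coefficient C(D+S, D) is included."""
    p = p_diff(t)
    value = p ** D * (1.0 - p) ** S
    if exact:
        value *= math.comb(D + S, D)
    return value

# The coefficient cancels in any ratio between two values of t.
ratio_dropped = likelihood(0.1, 2, 8) / likelihood(0.3, 2, 8)
ratio_exact = likelihood(0.1, 2, 8, exact=True) / likelihood(0.3, 2, 8, exact=True)
print(ratio_dropped, ratio_exact)

# With the coefficient, the probabilities of every possible D (0 to 10) sum to 1.
total = sum(likelihood(0.1, D, 10 - D, exact=True) for D in range(11))
print(total)
```

math.comb needs Python 3.8 or later.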

Say we align two sequences of ten base pairs from the same gene in two different species.

grasshopper: AGCTACAACT
cricket:     AGATACGACT

The sequences differ at two sites. We could rewrite the comparison with a 1 at each site where the bases are the same and a 0 where they are different.

DNA: 1101110111
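The comparison can be sketched in a few lines of Python. The function name compare is made up for this example, and it assumes the two sequences are already aligned and of equal length.

```python
grasshopper = "AGCTACAACT"
cricket = "AGATACGACT"

def compare(seq_a, seq_b):
    """Return the 1/0 match string plus the counts of differing (D) and same (S) sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = "".join("1" if a == b else "0" for a, b in zip(seq_a, seq_b))
    return matches, matches.count("0"), matches.count("1")

matches, D, S = compare(grasshopper, cricket)
print(matches, D, S)
```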

Making a plot of the two parts of the equation we can see that the probability that two sites are the same goes down as the time of divergence increases and the probability that they are different increases.

Importantly, there is a middle ground that can best accommodate both signals from the data, the differences and the similarities, near but lower than the crossover point. Combining these together we get the curve of probability of all the data with a peak just above 0.1.

This can be integrated to find upper and/or lower confidence intervals (which is messy and uses the hypergeometric function); however, it is clear from looking at the curve that a wide range of values is hard to rule out with a data set this small.

If the data set were larger, we could narrow down the confidence interval.
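One way to put rough numbers on this without the integration is a likelihood-ratio interval, sketched below in Python. This swaps in a different technique from the integral mentioned above: it keeps every t whose log-likelihood is within 1.92 units of the peak (half of 3.84, the usual 95% chi-square cutoff with one degree of freedom), which is only an asymptotic approximation and is crude for ten sites. The names log_likelihood, bisect, and lr_interval, the search limit t_max, and the larger example data set (20 differences in 100 sites) are all assumptions of this sketch, and it requires at least one differing site and fewer than 75% differing sites.

```python
import math

def log_likelihood(t, D, S):
    """Log of the probability of the data at divergence t (constant dropped)."""
    p = -0.75 * math.expm1(-8.0 * t / 3.0)  # same as 3/4 (1 - e^(-8t/3))
    return D * math.log(p) + S * math.log(1.0 - p)

def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f between lo and hi, assuming f changes sign there."""
    lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lr_interval(D, S, drop=1.92, t_max=50.0):
    """Approximate interval: all t within `drop` log-likelihood units of the peak."""
    d = D / (D + S)
    t_hat = -3.0 / 8.0 * math.log(1.0 - 4.0 * d / 3.0)
    target = log_likelihood(t_hat, D, S) - drop
    g = lambda t: log_likelihood(t, D, S) - target
    lower = bisect(g, 1e-12, t_hat)
    # If the curve never drops far enough, there is no finite upper bound.
    upper = bisect(g, t_hat, t_max) if g(t_max) < 0 else math.inf
    return lower, t_hat, upper

small = lr_interval(2, 8)     # the ten-site example
large = lr_interval(20, 80)   # same proportion, ten times the data
print(small)
print(large)
```

The larger data set has the same peak but a much narrower interval around it.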

So, how does this compare to the point estimate we get from the traditional Jukes-Cantor distance? A difference of 0.2 (two sites out of 10) can be plugged in for d,

t = -\frac{3}{8} \log_e \left( 1 - \frac{4}{3} d \right),

and gives t \approx 0.1163.

When we take the derivative of

P(D,S|t) \propto \left[\frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^D \left[1 - \frac{3}{4}\left(1 - e^{- \frac{8}{3}t}\right)\right]^S

with respect to t, set it equal to zero, and solve for t, we get

t = \frac{3}{8} \log_e\left(- \frac{3(D+S)}{D-3S} \right) .*
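The intermediate steps can be sketched as follows. This version works with the log of the probability, written here as \log_e L(t), which is an assumed shortcut: the log has its maximum at the same t, and p is shorthand for the probability that a site differs.

```latex
p = \frac{3}{4}\left(1 - e^{-\frac{8}{3}t}\right), \qquad
\log_e L(t) = D \log_e p + S \log_e (1 - p) + \text{constant}

\frac{d}{dt} \log_e L(t) = \left(\frac{D}{p} - \frac{S}{1 - p}\right)\frac{dp}{dt} = 0,
\qquad \frac{dp}{dt} = 2 e^{-\frac{8}{3}t} > 0
\;\Rightarrow\; p = \frac{D}{D + S}

e^{-\frac{8}{3}t} = 1 - \frac{4D}{3(D + S)} = \frac{3S - D}{3(D + S)}
\;\Rightarrow\;
t = \frac{3}{8} \log_e\left(\frac{3(D + S)}{3S - D}\right)
  = \frac{3}{8} \log_e\left(-\frac{3(D + S)}{D - 3S}\right)
```

In words, the maximum sits where the model's probability of a difference equals the observed fraction of differing sites, D/(D+S).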

Plugging in D=2 and S=8 gives t \approx 0.1163, exactly the same as with the other method. The original Jukes-Cantor method is a maximum likelihood point estimate (the parameter value of the model that maximizes the likelihood of the data). You can also see the equivalence in another way: the D-3S in the denominator makes the argument of the log negative, and so the log undefined, when the fraction of sites that differ is greater than 3/4 of the total. (Thinking about this intuitively, the data are then most consistent with positive infinity; however, I suspect the curve gets infinitely flat and almost any large number is essentially just as likely.) However, we can still work with the probabilities, which allows us to rule out small values of t, as in the example below where the sites that differ are 80% of the total.
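Both claims can be checked numerically with a short Python sketch. The function names, the grid of t values, and the specific 80% example (8 differences in 10 sites) are assumptions made for illustration. The first part compares the closed-form solution with the Jukes-Cantor distance and with a brute-force search for the peak; the second part looks at how the log-likelihood behaves when 80% of sites differ.

```python
import math

def log_likelihood(t, D, S):
    """Log of the probability of the data at divergence t (constant dropped)."""
    p = -0.75 * math.expm1(-8.0 * t / 3.0)  # same as 3/4 (1 - e^(-8t/3))
    return D * math.log(p) + S * math.log(1.0 - p)

def ml_time(D, S):
    """Closed-form maximum from setting the derivative to zero (real solution only)."""
    return 3.0 / 8.0 * math.log(-3.0 * (D + S) / (D - 3.0 * S))

def jc_time(d):
    """Traditional Jukes-Cantor point estimate from the fraction of differing sites."""
    return -3.0 / 8.0 * math.log(1.0 - 4.0 * d / 3.0)

# Brute-force check: scan t from 0.0001 to 2.0 and keep the best value.
grid = [i / 10000 for i in range(1, 20001)]
best = max(grid, key=lambda t: log_likelihood(t, 2, 8))
print(ml_time(2, 8), jc_time(0.2), best)

# 80% of sites differ: there is no interior peak, but small t is much less likely.
values = [log_likelihood(t, 8, 2) for t in (0.1, 0.5, 1.0, 2.0, 5.0)]
increasing = all(a < b for a, b in zip(values, values[1:]))
print(values, increasing)
```

In the 80% case the log-likelihood keeps rising with t but flattens out quickly, so large values are nearly indistinguishable from each other while small values are clearly worse.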

* This is only true for real solutions. The more general solution is

t = \frac{3}{8} \left(\log_e\left(- \frac{3(D+S)}{D-3S} \right) + 2 i \pi n \right),

where n is an element of the integers. This is fun because it combines a base e log with i and \pi, but this isn't really relevant to the discussion (time of divergence doesn't rotate into a second dimension).


We found a total of five Toxorhynchites larvae in a pool containing Culex larvae near Manoa stream (Jan 18, 2018, 21.299882, -157.813210). I haven't seen this genus before; they are possibly T. splendens. These are unusual mosquitoes in that they do not blood feed. The larvae eat aquatic invertebrates, including other mosquito larvae, and several species were introduced into Hawaiʻi as an attempt at biocontrol of mosquitoes. They are also very large mosquitoes, and the genus includes the largest mosquito species. Thanks to Matt Medeiros and Jessica Mavica for helping me ID them.