Monthly Archives: April 2013

pVIB transformation

The latest plasmid I tried out for bacterial transformation works well and has a cool end result.  The plasmid, pVIB, was constructed with genes from the marine bacterium Aliivibrio fischeri by Engebrecht et al. (1983).  This bacterium is bioluminescent and lives in symbiosis with some fish and squid species, allowing them to glow in the dark.  The enzyme that produces the light is called a luciferase.  The plasmid also contains a gene for ampicillin resistance to allow transformed E. coli cells to be selected.  So I did a heat shock transformation with pVIB and competent cells, like I blogged about earlier (here and here).


In the image above the upper right and lower left diagonal plates are E. coli spread on Luria broth (LB) plates without any selective agent.  This is both with (lower left) and without (upper right) a plasmid added to the mix.  It might be hard to see but it is just bacterial growth all over, coating the surface.  This is referred to as a "lawn" of bacteria.  In the upper left an antibiotic (AMP, ampicillin) has been added to the media and there is no growth.  The cells have been completely killed.  In the lower right the plasmid (pVIB, containing a gene for AMP resistance) has been added and only cells that have taken up the plasmid can grow on the AMP plates.

This plasmid contains the genes to produce bacterial luciferase, a dimer (two proteins bound together) with α-luciferase and β-luciferase subunits.  The structure was determined by Fisher et al. 1996 (link to structure at protein data bank (PDB) and the PDB molecule of the month).  Actually there is an operon (a set of genes expressed together) of five genes including the two luciferase subunits.  The operon is symbolized by luxCDABE, which gives the order of the genes.  luxA and luxB are the alpha and beta subunits.  The other genes convert long chain fatty acids into aldehydes with ATP.

From the PDB site you can visualize and rotate the 3D structure of the enzyme.


The image above shows the alpha subunit on the left (which contains the active catalytic site) and the beta subunit on the right.  These two proteins bind together and create the enzyme's quaternary structure.  Below is the same view but the components are color coded by secondary structure (alpha helices in red and beta sheets in yellow).


You can see that there is a similarity between the alpha and beta units.  It is hypothesized that the beta unit originated as a duplication of the alpha unit and it enhances thermal stability of the enzyme (a fusion protein of the alpha and beta units is sensitive to high temperatures, Escher et al. 1989).

Also, here is a webpage (and here) that contains a great deal of information about bacterial luciferase (there are lots of types of luciferase, like the one in the firefly, but they have different structures, etc.).


In this image above the active site that catalyzes the reaction is indicated in yellow, within the alpha-subunit (blue).

The enzyme takes luciferins, in this case a reduced flavin mononucleotide and a long chain aldehyde, and oxidizes them with molecular oxygen to produce flavin mononucleotide, water and a fatty acid.  There is an excess of energy in the reaction that is released as light with a peak wavelength at around 490 nm (nanometers), which is a blue-green color.


The additional genes in the operon convert the fatty acid back into an aldehyde.  luxC is a reductase, luxD a transferase, and luxE a synthetase; these assemble together into a fatty acid reductase enzyme complex.

Below is the larger reaction showing the fatty acid being converted back into the aldehyde for the luciferase reaction (and the flavin mononucleotide being reduced to recycle back in).


Below are a series of pictures showing the transformed cell clones in progressively lower light levels.  They look like normal E. coli cells in bright light.





The last picture is an exposure of several seconds.  The glow is visible by eye, but you need to be in a dark room and let your eyes adjust for a minute.

The glow doesn't last forever.  It was strong 24-48 hours after the transformation.  Producing the luciferins requires a lot of the cells' energy in the form of ATP for the reductase, and the effect is temperature sensitive.  I put the plates at 4°C for 24 hours to preserve the bacteria, then took them out and the glow was gone.  However, I warmed them up to 30°C for an hour and it was back almost as strong as before.  I stored them at 4°C again for over a week, then warmed them up again and the glow was completely gone.

NSF Funding Rates

I found a nice blog post (link) discussing the falling rates of successful funding from the National Science Foundation (specifically DEB).  I've copied a graph from the post below.


The number of proposals is growing each year, which is good.  However, the funding is not growing.  So the success rate is dropping and is now in the single digits (currently at the ~7% level) and falling.  People with new positions, like myself, are getting hit especially hard.  I have submitted an NSF preproposal each year and have not been successful yet in getting funding from them.  (I have however obtained funding from the Hawai'i Community Foundation, which I am very grateful for.)  Because getting grants (and publishing) early in your career is necessary for getting tenure, this is pushing many new faculty to explore alternative ways to get funding, as I did with the HCF.  Follow the link at the beginning to see the original article.

Not increasing research funding is not investing in the future of our country.  To put this in perspective, the NSF budget is $7 billion (2012).  The graph below is the US budget (2012):

NSF falls under the discretionary category.  Here is a breakdown below (2013):

As you can see, the National Science Foundation is not even on the radar and falls below the lowest listing of $9 billion for the EPA.

When people see "NSF" what do we want to come to mind?  "National Science Foundation" or "Non-Sufficient Funds"?

Reversible Mutations

In the previous post about mutation predictions I only considered mutations in one direction, for example from functional to non-functional gene sequences (non-functional to non-functional mutations were ignored because the outcome is the same).  However, some mutations can be thought of as reversible.  We could think of a nucleotide position mutating from an "A" to any other base-pair state ("C," "G," or "T") and then back again to an "A."  Or, we could think about changes in a codon that alternate between two different amino acids.  For example "CAT" and "CAC" both code for histidine while "CAA" and "CAG" both code for glutamine in the corresponding polypeptide (protein).  The only change is at the third position, so a "T" or "C" is one amino acid and an "A" or "G" is the other.  As the sequence mutates and evolves between these four bases at this position we could think about this as reversible between two amino acid states.


Another type of mutation is even closer to the sense of reversible.  Transposable elements are small stretches of DNA that can insert into a gene sequence, sometimes inactivating the gene, and then later excise out of the sequence, which might restore the original function.  In fact, transposable elements (or "TE's") are quite common in the genome of many organisms.  In the image below is an example from corn (in which TE's were first discovered).  Starting with an individual that has a TE inserted into a gene that produces the purple pigment (so no pigment is produced and the kernels are white) the TE can excise in certain cells (giving purple spots) and may even excise in the germ-line cells so that the next generation has completely restored pigment production.


We can write down the expected frequency in the next generation with reversible mutations as a recursion.  To be an allele at frequency $p_{g+1}$ in generation $g+1$ you are either already a $p$ type allele in the previous generation and did not mutate away with a mutation rate of $\mu$, or you were the alternative allele at a frequency of $1-p_g$ and mutated at a rate of $\nu$.  (Incidentally, the "or" in the sentence implies that we add these two outcomes together rather than multiply because they are mutually exclusive; the allele either mutated or it did not; while the two "and"s in the sentence imply multiplying, the mutation rate and the allele frequency are independent events that have to co-occur to have the effect we are focusing on.)

$$p_{g+1} = p_g(1-\mu) + (1-p_g)\nu$$

We could also write this from the alternative allele's point of view, with $q_g = 1-p_g$.

$$q_{g+1} = q_g(1-\nu) + (1-q_g)\mu$$

Either way works the same in the end, but for the rest of this we are using the $p$ version.

At equilibrium $p_{g+1} = p_g = p$, so setting the allele frequencies between generations equal to each other gives:

$$p = p(1-\mu) + (1-p)\nu$$

Subtract $p$ from both sides to get the terms together.

$$0 = p(1-\mu) + (1-p)\nu - p$$

Multiply everything out.

$$0 = p - p\mu + \nu - p\nu - p$$

$p$ cancels out on the right

$$0 = -p\mu + \nu - p\nu$$

subtract $\nu$ from both sides

$$-\nu = -p\mu - p\nu$$

multiply both sides by negative one

$$\nu = p\mu + p\nu$$

solve for $p$

$$p = \frac{\nu}{\mu+\nu}$$

This gives us the equilibrium allele frequency as determined by the forward and backward mutation rates.  Equilibrium values are often designated with a "hat" symbol like this:

$$\hat{p} = \frac{\nu}{\mu+\nu}$$

From looking at this equation you can see that the equilibrium frequency of an allele is given by the mutation rate to the allele state as a fraction out of the total of the mutation rates, $\mu+\nu$.  So for example, if $\nu$ is one half of the total rate of mutations (in other words if $\mu = \nu$) then $\hat{p} = 1/2$.  This seems to make intuitive sense: if different alleles are mutating into each other at the same rate, then over enough time the population will be made up of a 50/50 ratio of the alleles.
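As a quick numerical check of the equilibrium, here is a minimal Python sketch that simply iterates the recursion.  The names `p`, `mu` (rate away from the allele) and `nu` (rate toward it) are my own labels, and the rates are made-up values chosen far larger than real mutation rates so that the loop converges quickly.

```python
def next_frequency(p, mu, nu):
    """One generation of reversible mutation: p' = p(1 - mu) + (1 - p) nu."""
    return p * (1.0 - mu) + (1.0 - p) * nu


def iterate(p0, mu, nu, generations):
    """Apply the recursion repeatedly and return the final allele frequency."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, mu, nu)
    return p


# Illustrative (hypothetical) rates, not measured values.
mu, nu = 0.02, 0.01
p_hat = nu / (mu + nu)  # predicted equilibrium frequency

for p0 in (0.0, 0.5, 1.0):
    # Each starting point should end up at essentially the same frequency.
    print(p0, round(iterate(p0, mu, nu, 1000), 6), round(p_hat, 6))
```

Whatever the starting frequency, the iterated value should settle on the predicted equilibrium, which is the point of the derivation above.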

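Going back to the original recursion, $p_{g+1} = p_g(1-\mu) + (1-p_g)\nu$ (with $\mu$ the mutation rate away from the allele and $\nu$ the rate toward it), and multiplying everything out to rearrange the right hand side:

$$p_{g+1} = p_g - p_g\mu + \nu - p_g\nu = p_g(1-\mu-\nu) + \nu$$

Is this helpful?  Substituting back in gives:

$$p_{g+2} = \left(p_g(1-\mu-\nu) + \nu\right)(1-\mu-\nu) + \nu$$

which is

$$p_{g+2} = p_g(1-\mu-\nu)^2 + \nu(1-\mu-\nu) + \nu$$

which can be written in a summation series as

$$p_g = p_0(1-\mu-\nu)^g + \sum_{i=0}^{g-1}\nu(1-\mu-\nu)^i$$

or (pulling $\nu$ out of the sum)

$$p_g = p_0(1-\mu-\nu)^g + \nu\sum_{i=0}^{g-1}(1-\mu-\nu)^i$$

I'm not sure this is very helpful.  The sum on the right is hard to work with.  It would probably be easier to plot the change over time by simply iterating the original recursion equation.  However, one interesting thing about the equation above is that the term $(1-\mu-\nu)^g$ goes to zero as $g$ becomes large because a large number of numbers less than one are being multiplied together.  This makes sense because as $g$ becomes large the equilibrium is approached and the initial condition matters less and less.  In fact, using this logic, and letting $g$ go to infinity, $g \to \infty$, we could write down:

$$\hat{p} = \frac{\nu}{\mu+\nu} = \nu\sum_{i=0}^{\infty}(1-\mu-\nu)^i$$

dividing by $\nu$ gives:

$$\frac{1}{\mu+\nu} = \sum_{i=0}^{\infty}(1-\mu-\nu)^i$$

which makes sense in the sense that the equilibrium allele frequency is only a function of the mutation rates (again the starting point should disappear because after an infinite number of steps toward a single equilibrium the result will be the same no matter where you started).

Is it known in general that this infinite series reduction pattern is true?  If we set $x = \mu+\nu$ then

$$\frac{1}{x} = \sum_{i=0}^{\infty}(1-x)^i$$

Then setting $r = 1-x$ gives:

$$\frac{1}{1-r} = \sum_{i=0}^{\infty}r^i$$

This is a classic result of an infinite sum of a geometric series (valid when $|r| < 1$).  (It is called geometric because raising $r$ to the $i$ is like adding dimensions in geometry.  $r^0$ is a point; $r^1$ is a line of length $r$; $r^2$ is a square with sides of length $r$; $r^3$ is a cube with edges of length $r$; etc.)

Also, looking at $p_g = p_0(1-\mu-\nu)^g + \nu\sum_{i=0}^{g-1}(1-\mu-\nu)^i$ again.  If we have a total mutation rate $\mu+\nu$ then the first part of the equation, $p_0(1-\mu-\nu)^g$, is basically the same as the simpler model of irreversible mutation, $p_g = p_0(1-\mu)^g$.  So we can interpret the first part of this equation, $p_0(1-\mu-\nu)^g$, as the fraction of alleles that have not mutated yet.  Of course this will disappear as the equilibrium is approached.

Backing up to

$$p_g = p_0(1-\mu-\nu)^g + \nu\sum_{i=0}^{g-1}(1-\mu-\nu)^i$$

and looking at the part on the right.  Now that we realize this is a geometric series we might be able to reduce the finite series.

A finite geometric series can be reduced by

$$\sum_{i=0}^{g-1}r^i = \frac{1-r^g}{1-r}$$

As above, substituting $r = 1-\mu-\nu$ and realizing $1-r = \mu+\nu$

$$\sum_{i=0}^{g-1}(1-\mu-\nu)^i = \frac{1-(1-\mu-\nu)^g}{\mu+\nu}$$

Again, as $g$ goes to infinity $(1-\mu-\nu)^g$ goes to zero and $\frac{1-(1-\mu-\nu)^g}{\mu+\nu}$ becomes $\frac{1}{\mu+\nu}$.

So, to be able to directly calculate the allele frequency at any point along the way as reversible mutations drive the system from a starting point toward equilibrium the equation is:

$$p_g = p_0(1-\mu-\nu)^g + \nu\,\frac{1-(1-\mu-\nu)^g}{\mu+\nu}$$

which can be simplified to:

$$p_g = p_0(1-\mu-\nu)^g + \frac{\nu}{\mu+\nu} - \frac{\nu}{\mu+\nu}(1-\mu-\nu)^g$$

This can be divided into an equilibrium component and a component quantifying the deviation from equilibrium due to the starting point $p_0$ and the time since starting $g$:

$$p_g = \frac{\nu}{\mu+\nu} + \left(p_0 - \frac{\nu}{\mu+\nu}\right)(1-\mu-\nu)^g$$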
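To convince myself that a closed form of this kind really matches the step-by-step recursion, here is a small Python sketch comparing the two.  It assumes the closed form is an equilibrium term plus a decaying deviation term, with `mu` as the rate away from the allele and `nu` as the rate toward it; these names and the example rates are my own choices for illustration.

```python
def iterate(p0, mu, nu, g):
    """Allele frequency after g generations, by brute-force iteration."""
    p = p0
    for _ in range(g):
        p = p * (1.0 - mu) + (1.0 - p) * nu
    return p


def closed_form(p0, mu, nu, g):
    """Equilibrium component plus the deviation that decays as (1 - mu - nu)^g."""
    p_hat = nu / (mu + nu)
    return p_hat + (p0 - p_hat) * (1.0 - mu - nu) ** g


def finite_geometric_sum(r, g):
    """Sum of r^0 + r^1 + ... + r^(g-1), using the reduction (1 - r^g) / (1 - r)."""
    return (1.0 - r ** g) / (1.0 - r)


# Hypothetical rates where one is a fifth of the other.
mu, nu = 1e-4, 2e-5
for g in (0, 100, 5000):
    print(g, iterate(1.0, mu, nu, g), closed_form(1.0, mu, nu, g))
```

The two columns should agree to many decimal places, and the closed form has the advantage of not needing a loop over every generation.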


Here is an example plot showing the predicted decline in allele frequency from 100% (blue) and the rise from 0% (red) compared to the equilibrium (yellow) where one mutation rate is a fifth of the other: and .  (Generations is on the x-axis and allele frequency is on the y-axis.)


So, by 50,000 generations, which again is not that long compared to geologic timescales, the allele frequency is predicted to essentially arrive at an equilibrium.  In this case we expect it to be

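Exactly how long does it take to converge?  The allele frequency trajectories asymptotically approach equilibrium so they will never be exactly equal.  We could however ask when the difference is below a certain threshold.  If we start one at $p_0 = 1$ and the other at $p_0 = 0$ this is the largest difference, $d$, we can begin with.  Writing the trajectory as $p_g = p_0(1-\mu-\nu)^g + \frac{\nu}{\mu+\nu} - \frac{\nu}{\mu+\nu}(1-\mu-\nu)^g$, where $\mu$ is the mutation rate away from the allele and $\nu$ is the rate toward it, the difference between the two trajectories is (with 1 and zero substituted for $p_0$):

$$d = \left[1\cdot(1-\mu-\nu)^g + \frac{\nu}{\mu+\nu} - \frac{\nu}{\mu+\nu}(1-\mu-\nu)^g\right] - \left[0\cdot(1-\mu-\nu)^g + \frac{\nu}{\mu+\nu} - \frac{\nu}{\mu+\nu}(1-\mu-\nu)^g\right]$$

We can immediately get rid of the difference in the equilibrium components, $\frac{\nu}{\mu+\nu} - \frac{\nu}{\mu+\nu}$, which is zero.  Also multiplying by one and zero simplifies the equation a little more.  Long story short, a lot cancels out and we end up with something familiar:

$$d = (1-\mu-\nu)^g$$

which makes a lot of sense.  We realized above that this was the fraction that had not yet mutated, which is the reason the curve deviates from equilibrium (i.e. starting points are different).

Solving for $g$ to get the number of generations gives us:

$$g = \frac{\ln d}{\ln(1-\mu-\nu)}$$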


Plugging in the mutation rates from the example above we find that the difference in highest and lowest trajectories drops to less than 1% after 38,375 generations.
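The convergence time can be sketched in a few lines of Python.  This assumes the gap between the highest and lowest trajectories shrinks as (1 - mu - nu)^g, and it uses rates of 1e-4 and 2e-5 purely as an assumption: they are one choice where one rate is a fifth of the other and that reproduces the generation count quoted above.  Only the sum of the two rates matters here, so which rate is which makes no difference.

```python
import math


def generations_to_threshold(mu, nu, d):
    """First whole generation at which the gap (1 - mu - nu)^g is at or below d."""
    return math.ceil(math.log(d) / math.log(1.0 - mu - nu))


# Assumed example rates (one is a fifth of the other).
mu, nu = 1e-4, 2e-5
print(generations_to_threshold(mu, nu, 0.01))  # 38375
```

Tightening the threshold only adds generations logarithmically: going from a 1% gap to a 0.01% gap doubles the waiting time rather than multiplying it by a hundred.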

The probability that a transposable element inserts into a specific gene sequence is very small.  Once it is inserted (and sticking with simplistic assumptions here) the probability it will excise is likely much higher than the insertion in the first place.  Say we had data from huge fields of purple kernel color corn that had been maintained for many generations.  In, let's say, 3% of these we find a mutation due to a transposable element that prevents the pigment from being produced.  Assuming the population is at or very near to equilibrium what can we say about the insertion versus excision relative mutation rates?

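The equilibrium equation, $\hat{p} = \frac{\nu}{\mu+\nu}$, where $\hat{p}$ is the equilibrium frequency of the allele, $\nu$ is the mutation rate toward it and $\mu$ is the rate away from it, can be rearranged to

$$\frac{\mu}{\nu} = \frac{1-\hat{p}}{\hat{p}}$$

Setting $\hat{p} = 0.03$ gives

$$\frac{\mu}{\nu} = \frac{0.97}{0.03} \approx 32.3$$

So in this example, the excision rate is approximately 32 times higher than the insertion rate.  (This defines $\mu$ as the excision rate and $\nu$ as the insertion rate.  We can alternatively set $\hat{p} = 0.97$ and switch the symbols for the insertion and excision rates but the result is the same.)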
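The same arithmetic as a tiny Python sketch, in case you want to try other observed frequencies.  The function name is my own, and it assumes the population is at the mutation equilibrium with the inserted (mutant) allele at frequency `p_hat`.

```python
def excision_to_insertion_ratio(p_hat):
    """Ratio of the excision rate to the insertion rate implied by an
    equilibrium frequency p_hat of the insertion allele."""
    return (1.0 - p_hat) / p_hat


print(round(excision_to_insertion_ratio(0.03), 2))  # 32.33
```

A rarer insertion allele implies a more lopsided ratio; for example an equilibrium frequency of 1% would imply excision is 99 times faster than insertion under the same assumptions.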

Backing up to a more basic level, why is there an equilibrium at all?  We might also intuitively think that if one mutation rate is higher than the other that we would eventually end up with all of one type (allele) in the population.  (Actually this can happen when we start talking about drift in finite population but we are still ignoring that at the moment and pretending populations are so large that they are essentially infinite and even tiny fractions will be present.)  The trick to thinking about this is that as one allele becomes rarer it is a smaller target for mutations in the population.  As it becomes more common and in more copies there are more opportunities for mutations to occur to change the allele into a different form.  This frequency effect "buffers" the alleles towards intermediate frequencies, so an allele is never quite lost or fixed and we end up with an equilibrium.

Department Hike

Last Saturday our department organized a hike up Kuliouou Ridge Trail (link and link).  It was aimed to bring faculty, staff, grad students and undergrads together in a social activity.  Some family members also showed up.  It is a very scenic hike but also very steep toward the end.  (There were also a couple scary drop offs right next to the trail.)  At the top we could look over to Kailua on the other side of the ridge.  Afterward we had lunch under tents at Kuliouou Beach Park.

Tester Memorial Symposium

I am still working out what to use this blog for.  In the most important sense I want to provide a way for the general public to see some of what goes on in the daily work of scientists and university professors.  It is also a way for people to see a little of what I am up to.  This may also function as a log of activities; however, I do not intend for it to be a full and complete record in any way.  Rather this is an informal and possibly eclectic compilation of topics.

This week our department is having the 38th Annual Albert L. Tester Memorial Symposium in the Keoni Auditorium of the East-West Center, and I volunteered to chair the first session yesterday morning.  The symposium is a series of presentations of student research from any department.

First, however, Dr. Tim Tricas gave a short introduction on who Albert L. Tester was, what topics he worked on (he was a fisheries biologist), and some memories shared with him by people who knew Dr. Tester.

Then Jaclyn Mueller from Oceanography gave a talk about "Efficient extraction of nucleic acids from microbial plankton (viruses, bacteria, and protists) collected on aluminum oxide filters."  I talked with her briefly before the session and she is working on marine RNA viruses.  The concentration of viruses in ocean water is amazing.

This was followed by Carolyn Parcheta from Geology and Geophysics on "Volcanic fissure conduits: the first quantification of shallow subsurface geometry."  She made maps on the centimeter scale of fissures to better understand outgassing dynamics.  One detail from this talk was the use of lidar to 3D map deep into volcanic vents beyond what is visible on the surface.  I wonder if rovers could be used to crawl deeper into the fissures to map them some more?

Then Tyler Hee Wai from Mechanical Engineering talked on "Investigating Dusk and Dawn Shifts in Snapping Shrimp Sounds."  There is an applied angle to this work.  The baseline of snapping shrimp sounds can be used to remotely detect boat motors.  It is completely passive and undetectable so activities like fishing in marine refuges could be detected.

The final talk of the first morning session was by Chelsea Marvos from Nursing on "Emotional Intelligence and Clinical Performance/ Retention of Nursing Students."  I also talked to her before the session; nursing programs have a problem with retention and she is investigating how predictive the performance on an emotional intelligence test is for student retention rates.  Afterward there were questions from the audience about what the test is like.  She said parts of it can be strange, like asking to interpret the feeling of a picture of a pile of rocks, etc.  However, during her talk she said emotional intelligence is something that can be trained and brought up the possible value of incorporating this into nursing training programs.

Eight-Species Plant DNA Alignment

I've added a few more species to our plant chloroplast rbcL DNA sequence collection.


The first two new ones, above and below, I bought at the local store: sweet corn (Zea mays) and papaya (Carica papaya).


There is also a type of flower blooming all around campus this time of year (March-April).  They range from white to dark purple.  I am guessing that they might be a good example of incomplete dominance (in genetics classes we often use white/pink/red snapdragons as an example, but these may be a nice local example that students actually see outside of class).  I looked them up and they are Chinese Violets (Asystasia gangetica).


Above, purple on the right, to light purple in the middle, to white on the left.

The plant below (it looks like orange threads) is very interesting.  It is called western field dodder (Cuscuta campestris) and is a parasite that attaches to and takes nutrients from other plants.  Most plants use their chloroplasts to produce energy, but the dodder has come up with an alternative strategy.  So, what does its chloroplast sequence look like?


Some places online say that the dodder does not have chloroplasts or produce chlorophyll.  However, this is not true.  Funk et al. (2007) sequenced the chloroplast genome of two dodder species (C. reflexa and C. gronovii) and found that they had reduced and rearranged genomes.  McNeal et al. (2007) also found some gene loss and rearrangements, and an increase in the nucleotide substitution rate, in two other dodders (C. exaltata and C. obtusiflora).  Furthermore, Berg et al. (2004) found that C. gronovii and C. subinclusa have lost the plastid RNA polymerase (necessary for gene expression by generating a messenger RNA, i.e. transcription) and that rbcL has had to evolve to be transcribed by RNA polymerase from the nuclear genome.

Below is the DNA sequence alignment from all eight species (including the four from the earlier post).  There are nine entries because I included two Chinese violets, a white form and a purple form, and unsurprisingly they have identical sequences.


I changed the settings so that any DNA position that is variable is highlighted and positions that are the same in all species are represented by a ".", except in the consensus sequence given along the top.  Overall, you may notice a 2-1-2-1-2-1 pattern in the spacing of variable sites.  Below is a close up of a stretch where this is particularly strong.


In this part of the rbcL gene every three nucleotides code for a particular amino acid.  Below each DNA sequence I have listed the corresponding amino acid code.  Across all eight species the amino acid sequence is identical (in this region; there are some variations in some other parts of the gene) corresponding to "TSIVGNVFGFKALRALRLEDLRIP."  In other words, none of the DNA changes affect the sequence of the protein enzyme that is produced by the gene (enzymes catalyze biochemical reactions; that is, they cause the reactions to occur).  The sets of three nucleotides that code for an amino acid are known as codons.  If you look at a codon table that translates between nucleotides and amino acids you can see that any change to the second position causes a change in the corresponding amino acid.  However, often changes in the third position have no effect on the amino acid (i.e. they are "silent").  For example, GCA, GCC, GCG, and GCT all code for alanine (A).  Occasionally changes to the first position do not change the amino acid; so both TTA and CTA code for leucine (L).  This explains the spacing of the DNA sequence changes between these species.  Mutations are expected to be happening across all of the sites.  However, ones that change the protein sequence may affect the function of the enzyme and are removed by selection.  So, over time, the only changes that are accumulating between species are ones that are not affected by selection (i.e. are selectively "neutral").
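The codon position pattern can be checked directly against the standard genetic code.  Below is a minimal Python sketch; the function names are my own, and the table is the standard code written out in the usual T, C, A, G ordering (an assumption that the standard code is the relevant one for this gene).  It translates the example codons and then counts, for each codon position, how many single-base changes leave the amino acid unchanged.

```python
BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG order for positions 1, 2, 3.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate_codon(codon):
    """Return the one-letter amino acid ('*' for stop) for a DNA codon."""
    i, j, k = (BASES.index(base) for base in codon.upper())
    return AMINO_ACIDS[16 * i + 4 * j + k]


def silent_changes_by_position():
    """Count single-base changes from sense codons that keep the amino acid."""
    counts = [0, 0, 0]
    for a in BASES:
        for b in BASES:
            for c in BASES:
                codon = a + b + c
                amino_acid = translate_codon(codon)
                if amino_acid == "*":
                    continue  # skip stop codons as starting points
                for position in range(3):
                    for new_base in BASES:
                        if new_base == codon[position]:
                            continue
                        mutant = codon[:position] + new_base + codon[position + 1:]
                        if translate_codon(mutant) == amino_acid:
                            counts[position] += 1
    return counts


print([translate_codon(c) for c in ("GCA", "GCC", "GCG", "GCT")])  # all alanine
print(translate_codon("TTA"), translate_codon("CTA"))               # both leucine
print(silent_changes_by_position())  # counts for positions 1, 2 and 3
```

The counts should show no silent changes at the second position, a few at the first, and most at the third, which is the pattern behind the spacing of variable sites in the alignment.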

Below is a phylogenetic tree that can be obtained from these sequences to represent their inferred evolutionary history (rooted by the fern sequence as an outgroup).


Corn and bamboo are both types of grasses so it makes sense that they group together.  The dodder has a long branch, representing many DNA changes, which fits with what we expect given the extensive changes and higher rates of substitutions in the dodder chloroplast genome.  The two Chinese violets cluster together identically, which makes sense because they are the same species.  There could be a small amount of genetic variation within the species between the two samples, but it is also not surprising that there is none in this particular sequence.  The remaining branches are the lehua, hibiscus and papaya, which are all considered Rosids within the Eudicots (true dicots), while the dodder and Chinese violet are considered Asterids within the Eudicots.

So how much confidence do we have in the branching pattern in this tree?  One way to address this is by "bootstrapping."  This is a process where a large number of fake samples are generated by randomly sampling from the original dataset (randomly picking nucleotide sites, with replacement) and the percent of time a particular branch of the tree is found (contains the same set of descendants) is used as a measure of confidence in that part of the tree.  However, if many different branching orders are found, and different groups of descendants are contained within the branch, from different sets of the data then we have lower confidence in that particular feature.  (See Felsenstein (1985) and Efron et al. (1996) for more information.)  In the tree below I have applied bootstrapping and only shown the branches that are supported in a majority of the replicates.  If there is no majority (e.g. if three different branching orders are found at equal 1/3 frequency) that part of the tree is collapsed together.
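Here is a toy Python sketch of the resampling idea.  To keep it short it does not build trees at all: as a stand-in, it asks how often two sequences are each other's closest match (by simple count of differing sites) across bootstrap replicates.  That shortcut, the function names, and the four made-up sequences are all assumptions for illustration and are not the method or data used for the tree in this post.

```python
import random


def hamming(a, b):
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to build one bootstrap replicate."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def pair_is_closest(alignment, a, b):
    """True if a and b are strictly closer to each other than to any other taxon."""
    d_ab = hamming(alignment[a], alignment[b])
    others = [name for name in alignment if name not in (a, b)]
    return all(d_ab < hamming(alignment[a], o_seq) and d_ab < hamming(alignment[b], o_seq)
               for o_seq in (alignment[o] for o in others))


def bootstrap_support(alignment, a, b, replicates=1000, seed=1):
    """Percent of bootstrap replicates in which a and b pair together."""
    rng = random.Random(seed)
    hits = sum(pair_is_closest(resample_columns(alignment, rng), a, b)
               for _ in range(replicates))
    return 100.0 * hits / replicates


# Made-up toy alignment (hypothetical sequences, not real rbcL data).
toy = {
    "grass1":  "ACGTACGTACGTACGTACGT",
    "grass2":  "ACGTACGTACGAACGTACGT",
    "eudicot": "ACGAACCTACGTTCGTACTT",
    "fern":    "TCGAACCTGCGTTCGAACTA",
}
print(bootstrap_support(toy, "grass1", "grass2"))
```

Real phylogenetic software rebuilds a full tree for every replicate and tallies each branch, but the column resampling step is the same idea.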


Several parts of the tree have high support from the data and are recovered >95% of the time (in fact the grass grouping was found 100% of the time).  But the papaya-hibiscus grouping had low support (64.4%), and the branching order at the base of the eudicots (asterids and rosids) has little support in this dataset, so at the moment it should be taken with a grain of salt (i.e. without confidence).

More Transformations, pBLU and pGreen


Above are three plates of media with E. coli bacteria growing on them at 37°C overnight.  This is a different strain of E. coli called DH5alpha (I used MM294 earlier) but neither one has antibiotic resistance.  In the left plate, without antibiotic, I "streaked" out cells by running a sterile wire loop back and forth through the cells repeatedly, making a dilution until I could pick a single clone colony of rapidly growing cells.  I then transformed the cells using heat shock.  The plate on the right is expressing GFP (as part of a fusion protein with lacZ) that gives them a green color that I mentioned in an earlier post.  The plate in the middle shows colonies that are dark blue; they are expressing a fully functional lacZ gene and have X-gal added to their media.  Chemically X-gal "looks" like a disaccharide (a "double" sugar molecule made of two simple sugars).  lacZ normally cuts disaccharides to make monosaccharides for the cell to use; however, when X-gal is cut a molecule containing bromine is formed that spontaneously forms a new molecule with itself and gives the cells the dark blue color.


In the close up image above you can see smaller "satellite" colonies around the GFP transformed cells.  In these plates ampicillin is used to select for only the cells that have taken up the plasmid, which contains an amp-resistance gene.  However, the enzyme produced destroys ampicillin in the media around the cells, which allows cells that were not transformed to start growing.  You have to be careful not to pick the nearby untransformed colonies when attempting to clone a DNA sequence in a plasmid.

I tried looking at the GFP expressing cells under UV light in a dark room and I could not see them glowing.  So I put the plates in a transilluminator we use to take pictures of gels.  It uses UV light and the camera can be set for a long exposure.


Above is the image from the UV transilluminator.  You can see bright spots where the colonies are growing, but is this GFP?

I put some other plates in for comparison.


On the bottom is the GFP expressing plate, on the upper right are the lacZ colonies, and on the upper left are untransformed colonies I streaked out.  The lacZ expressing cells are darker, but this could be because of the blue dye from X-gal.  The regular cells seem bright, perhaps from auto-fluorescence, which does not make the GFP cells very convincing in terms of fluorescence.

I am going to try some more variations.  Part of the reason I am doing this is to get a bacterial transformation/cloning system up and running in the lab; another part is to find a nice system, or set of systems, for teaching a genetics lab in the fall.

Fruit Fly Images

A camera that fits into the eyepiece of our microscope arrived this afternoon (MiniVID USB by LW Scientific) and I couldn't wait to try it out.  Here are the first pictures I captured with it.  These are "raw" without any balance adjustments to color, brightness, etc.


In the image above six fruit flies are knocked out with carbon dioxide.  We move them around with paint brushes.  I've arranged three females along the top and three males below.  The females tend to be slightly larger in body size and the males have darkly pigmented ends of their abdomens.  The relative size and pigmentation can vary between strains, however, so the best way to tell the difference is at the end point of the abdomen.  Simply put, males are rough and bumpy and females have a sharp point.


And here (above) is a close up of a female.  You can see the bristles on the body and wing veins.  Right behind the base of the wing is something like a "ball on a stick" called the haltere.  I've circled it in the copy below.


The halteres vibrate in a plane when the fly is flying and act like a type of gyroscope to maintain orientation.  (Also, incidentally, this effect is not unrelated to Foucault's pendulum which remains swinging in a plane as the earth rotates beneath it.)


Above is a female that is starting to wake up and has stood up on her legs.  The big difference about this fly from the ones above is that she is a mutant and has white eyes (they look kind of reddish/yellow in these images on some displays but in real life they are indeed white).  Below is a male, also with white eyes.


You can also just see the "sex combs" on the front legs.  They are a dark patch on the front of the leg midway up.  Only males have these but they can be hard to see when sorting large numbers of flies.  I've outlined them in the picture below.


The white eye color is due to a mutation at a gene called white (genes are named after mutant phenotypes) on the fruit fly's X-chromosome.  The symbol for white is w and this particular mutant is the first allele (mutant variation) found at white by Thomas Hunt Morgan and is written as w1.  These flies have been maintained by various labs over time and are direct descendants of the white mutants discovered by Morgan over a century ago, which he used to establish that genes were located on chromosomes by X-linked inheritance.

This also brings up another point.  The names of genes seem easy and obvious at first but at some point along the way you realize they are counterintuitive.  In classical genetics a gene was discovered when a mutation occurred.  More often than not mutations inactivate a gene's normal function to some degree.  So genes are named for the opposite of what they normally do.  When the white gene is functioning the flies have red eyes; when it is inactivated they have white eyes.  Using the car analogy again, if we named parts of cars in the same way the brake pedal would be called something like stopless and the gas pedal would be unmoved.  So in a normally functioning car you would activate unmoved to go faster and stopless to slow down, which seems intuitively backwards based on the names.

Mutation Predictions

This semester I have been working through some basic population genetics background to prepare for a class I plan to teach next spring.  One place to start is on the predicted effects of mutations.  The simplest model to begin with is one of irreversible, one-way, mutations.  Imagine a functional gene sequence and that mutations can occur to disrupt the gene function, effectively turning it "off."  I like to mention working on cars as an analogy.  If you made small random changes to car parts, the most likely outcome, if anything happens at all, is that you break the function of the part and render it useless (rather than gaining a new and different function or, even more rarely, improving its function).

So say the mutation rate is really high, like 10% per generation, and you start off with 100% functional gene copies.  Then after one generation only 90% of the copies are functional because of the 10% that mutated (1 - 0.1 = 0.9).  In the next generation 90% of the 90%, or 81%, are still functional; in the third generation 90% of the 90% of the 90%, or 72.9%, remain functional.  (There are also mutations in the already mutated alleles but these do not change the phenotype (it is still an inactive gene function) so these are ignored and lumped together; only the mutations in the remaining unmutated alleles are kept track of.)  So it is easy to see that after $g$ generations with a mutation rate of $\mu$ the fraction of unmutated alleles is $(1-\mu)^g$.  Our example gives the following graph over the first 20 generations:


This curve follows a (discrete) geometric distribution and over long periods of time can be closely approximated by a (continuous) exponential distribution.   This is an example of exponential decay like the classic curve of radioactive decay and the idea of radioisotope half-lives.
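The values behind a curve like this can be tabulated in a few lines of Python. This is only a minimal sketch using the 10% example rate; the function name `fraction_functional` and the variable `mu` (the mutation rate) are labels chosen here for illustration. The last column is the continuous approximation $e^{-\mu g}$, which tracks the exact values more closely as the mutation rate gets smaller.

```python
import math

def fraction_functional(mu, generations, p0=1.0):
    """Fraction of unmutated alleles after 0..generations rounds of one-way mutation."""
    return [p0 * (1.0 - mu) ** g for g in range(generations + 1)]

mu = 0.10  # the deliberately high example rate from the text
curve = fraction_functional(mu, 20)

# generation, exact geometric decay, continuous exponential approximation
for g, p in enumerate(curve):
    print(f"{g:2d}  {p:.4f}  {math.exp(-mu * g):.4f}")
```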

The same type of curve and equation applies even if we do not start off at a 100% frequency of one allele.  Say the functional allele is at 50% frequency, $p_0 = 0.5$, then one generation later 10% of the 50% mutate leaving 90% of the 50% unmutated, $p_1 = 0.5 \times 0.9 = 0.45$.  Generally, if the starting frequency at time zero is $p_0$ then the frequency after $g$ generations, $p_g$, can be calculated as $p_g = p_0 (1-\mu)^g$, where $\mu$ is the mutation rate per generation.

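Another way to look at this is $p_g = p_{g-1}(1-\mu)$ (the frequency at time g is equal to the frequency in the generation before, g-1, multiplied by the fraction that did not mutate, $1-\mu$).  However, $p_{g-1} = p_{g-2}(1-\mu)$ and $p_{g-2} = p_{g-3}(1-\mu)$, etc.  Substituting in the reverse order, $p_g = p_{g-2}(1-\mu)^2$ and $p_g = p_{g-3}(1-\mu)^3$, etc.  Quickly we see that from a beginning point $p_0$ the equation becomes $p_g = p_0 (1-\mu)^g$ because we are multiplying by $(1-\mu)$ g times.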

One reader did not like the previous paragraph and found it hard to understand; I'll try to present it again here in just equation form.

$p_g = p_{g-1}(1-\mu) = p_{g-2}(1-\mu)^2 = \cdots = p_0 (1-\mu)^g$, if $g$ is defined as the number of generations after $p_0$ exists.
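The same substitution argument can be checked numerically. The sketch below applies the one-generation step repeatedly and compares the result with multiplying the starting frequency by the unmutated fraction g times in one go. The names `step`, `p0`, `mu`, and `g` are illustrative labels, and the starting values are just the 50% and 10% figures from the example.

```python
def step(p_prev, mu):
    """One generation: the previous frequency times the fraction that did not mutate."""
    return p_prev * (1.0 - mu)

p0, mu, g = 0.5, 0.10, 10

# Apply the one-generation rule g times in a row.
p = p0
for _ in range(g):
    p = step(p, mu)

# Closed form: the starting frequency times (1 - mu) raised to the g-th power.
closed_form = p0 * (1.0 - mu) ** g

print(p, closed_form)
```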

Actual mutation rates vary widely over several orders of magnitude but in general are much lower than the example of 10% per generation I used above.  Often, the mutation rate affecting the function of a gene is orders of magnitude lower than that per generation, and the mutation rate at a single nucleotide site in a DNA sequence is lower still.  It can be tricky to measure mutation rates directly.  Often mutations are recessive or are not completely visible as a phenotype.  However, in humans there have been several studies focusing on achondroplasia (a form of dwarfism) to measure mutation rates.  The mutant allele is dominant, so a single copy results in the dwarf phenotype.  The phenotype is fully penetrant (if you have the allele you have the phenotype; in contrast, many human traits are incompletely penetrant).  Finally, the phenotype is unambiguous.  These factors make measuring the rate of appearance of achondroplasic individuals from birth records ideal for directly measuring mutation rates.  Results from different studies vary but rates on the order of 1 in 25,000, or $4 \times 10^{-5}$, have been found.

Obviously, selection and genetic drift are important factors affecting allele frequencies in real populations.  Mutations in the FGFR3 gene that result in achondroplasia are removed from a population by selection and never attain high frequencies.  However, for the moment I am keeping things simple by only looking at the predicted effects of mutation rates.  Imagine a species that moves into a cave system and then the population is cut off from the outside world (so called troglofauna, or cave animals).  If genes are no longer needed in the cave environment, like ones involved in eye development or pigmentation patterns, how long would we expect functional alleles (different forms of a gene) to remain in the population?  In other words, mutant non-functional alleles are no longer removed by selection.

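Using a little algebra

$$p_g = p_0 (1-\mu)^g$$

can be rearranged to

$$\frac{p_g}{p_0} = (1-\mu)^g$$

by dividing by $p_0$.  Then take the logarithm of both sides

$$\ln\left(\frac{p_g}{p_0}\right) = g \ln(1-\mu)$$

and divide again to solve for the number of generations

$$g = \frac{\ln(p_g / p_0)}{\ln(1-\mu)}$$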

Using $\mu = 4 \times 10^{-5}$ (1 in 25,000) from above and setting $p_g / p_0 = 0.5$ (so that the frequency at time $g$ is half the starting frequency at time zero), we get a half-life of the functional allele of 17,328 generations.

Setting $p_g / p_0 = 0.01$ we find that after 115,127 generations 99% of the population's alleles have mutated to the non-functional form.  The curve for the first 120,000 generations looks like this:


showing the initial steep drop and then leveling off of the change in frequency due to mutations.
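The two figures above can be reproduced with a short calculation. This is a sketch, assuming the mutation rate is the 1 in 25,000 achondroplasia estimate; the function name `generations_to_fraction` is a label made up for this example.

```python
import math

def generations_to_fraction(fraction_remaining, mu):
    """Generations until the functional allele falls to the given fraction of its start."""
    return math.log(fraction_remaining) / math.log(1.0 - mu)

mu = 1 / 25_000  # 4e-5 per generation

half_life = generations_to_fraction(0.5, mu)     # half of the functional alleles remain
ninety_nine = generations_to_fraction(0.01, mu)  # 99% have mutated

print(round(half_life), round(ninety_nine))  # 17328 115127
```

Because the mutation rate is so small, the half-life is very nearly $\ln 2 / \mu$, the familiar form from radioactive decay.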

If we consider a generation to be about a year long for many species, then after about 120,000 years (which is not really that long) any genes that do not have functions that are selected for are expected to be inactivated and functionally lost from the genome.  This can easily explain the pigment-less, eyeless cave fish found in the southeastern US.

Turning this around, if a gene is found to be functional and preserved in the genome and at a high frequency (>90%) in the population, it follows that it has been maintained by selection in the recent past.  Humans along with many other primates (and independently guinea pigs) have lost the ability to synthesize vitamin C because of mutations in the GULO gene (the remnants of which still remain on our 8th chromosome). This suggests our distant ancestors had plenty of vitamin C in their diet for thousands of generations and that mutations in GULO were not removed by selection.

This also suggests a way to measure mutation rates by changes in allele frequency in the absence of selection, if we know when the selection pressure was removed.  Imagine bacteria that carry a plasmid with resistance to two different types of antibiotics.  Initially they are kept on media containing both antibiotics, but then are transferred to a plate containing only one (to maintain the plasmid).  We know that bacteria, under optimal conditions, can divide every 20 minutes or so.  We could periodically take out a sample and assay what proportion still maintain resistance to the missing antibiotic.  The equation above can be rearranged again so that the fraction resistant, and the time on the new media, can be plugged in to estimate the mutation rate.
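As a rough sketch of that rearrangement: solving the decay equation for the mutation rate means dividing the observed fraction by the starting fraction, taking the g-th root, and subtracting from one. The function name and all of the numbers below are hypothetical. The "observed" fraction is generated from a made-up rate of 1 in 10,000 purely to confirm that the algebra recovers it; it is not experimental data.

```python
def estimate_mutation_rate(fraction_resistant, generations, starting_fraction=1.0):
    """Solve p_g = p_0 * (1 - mu)**g for mu, given p_g, p_0 and g."""
    return 1.0 - (fraction_resistant / starting_fraction) ** (1.0 / generations)

# Hypothetical setup: a division every 20 minutes is 72 generations per day.
generations = 72 * 7                        # one week on the new media
true_mu = 1e-4                              # made-up rate used to fake an observation
observed = (1.0 - true_mu) ** generations   # fraction still resistant after a week

estimate = estimate_mutation_rate(observed, generations)
print(observed, estimate)
```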

However, this does ignore any possible fitness cost to the bacteria to maintain antibiotic resistance, and selection could also drive the inactivating mutations to high frequency in the population (this could also be occurring in cave species; an energetic cost to producing pigments, or the presence of eyes providing a source of infections, could result in selection inactivating the genes even faster than predicted by mutation).

In fact, growing bacteria continuously in a chemostat and periodically checking for mutations in resistance to infection by bacteriophages (viruses that infect bacteria) is a classical method to assay mutation rates.  Different compounds can be added to the chemostat to test if they raise or lower mutation rates.  The assumptions used in measuring these mutation rates are essentially the same as presented here in this post, with one exception.  When the mutant frequency is very low the curve is nearly linear.  So

$$p_g = p_0 (1-\mu)^g$$

is almost equal to

$$p_g = p_0 - g\mu$$

when $p_0$ is near 100%.  This is because almost all of the alleles are unmutated, so essentially any potential mutations that can occur do occur on unmutated copies.  In terms of the mutant frequency $q = 1 - p$, if the mutation rate is 1% and at first no mutants are present, in the first generation the mutant fraction is $q_1 = 0.01$.  In the second generation $q_2 = 0.01 + 0.99 \times 0.01 = 0.0199$, which is almost 0.02.  In the third generation $q_3 = 0.0199 + 0.9801 \times 0.01 = 0.029701$, which is almost 0.03, etc.  The low frequency approximation can be rewritten as $q_g = q_0 + g\mu$ (the mutant fraction each time step adds the same quantity to the initial number of mutants).  This is a linear equation of the form $y = mx + b$, where b is the y-intercept and m is the slope of the line.  So in this case the slope of the increase in mutant frequency is equal to the mutation rate.  If the slope increases when chemicals are added to the broth the bacteria are growing in then they are potential mutagens.
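To see how good the straight-line approximation is, the exact mutant fraction can be compared with the linear one, and a slope can be fitted to the exact values. This is a sketch using the 1% example rate over ten generations; the helper `least_squares_slope` is an ordinary least-squares slope written out by hand, not a description of how chemostat data are actually analyzed.

```python
mu = 0.01
gens = list(range(11))

exact = [1.0 - (1.0 - mu) ** g for g in gens]   # mutant fraction with no mutants at start
linear = [mu * g for g in gens]                 # low-frequency approximation

def least_squares_slope(xs, ys):
    """Slope of the best-fit line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

slope = least_squares_slope(gens, exact)
for g in gens:
    print(f"{g:2d}  {exact[g]:.6f}  {linear[g]:.6f}")
print("fitted slope:", slope)
```

The fitted slope comes out slightly below the true rate because the exact curve bends away from the line as mutants accumulate; the gap shrinks as the mutation rate and the number of generations get smaller.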

At any rate, the phenotypes of cave species and the predicted rapid inactivation of gene function by mutation, in the absence of selection maintaining the function, are a nice, simple, easy to understand example of inferred evolution.  The mutations that can occur in bacteria are also a nice observable example of evolution in action.

OK, that's enough for now.  In future posts I will discuss some more complicated mutation models.

A Four-Species Plant DNA Alignment

There are several interesting plants growing around the building where I work.  One of these is the native ʻŌhiʻa lehua (Metrosideros polymorpha).


There are plants with red flowers (with darker leaves and stems) and ones with yellow flowers (with lighter leaves and stems).



One of the students here told me that her grandmother said if you pick the flowers from a lehua it will rain.

I took a small leaf sample for extracting DNA and some samples from three other plants for comparison.  I planned to amplify and sequence a small segment from the chloroplast genome.  Like the mitochondria, the chloroplast has a small genome that is a loop of DNA.  It is much larger than the mitochondrial genome but much smaller than the nuclear genome (which contains the linear chromosomes we are used to thinking of) of plants and animals.  Of course the chloroplast is well known as the site of photosynthesis in the plant but it also carries out other functions.  Here is a representation of the chloroplast genome from the cotton plant and then zooming in on a section of the rbcL gene (in the lower right) for sequencing.



The purple "BLAST Hit" is the section sequenced.  If you look back a few posts at the mitochondrial genome you can see that the chloroplast genome has many more genes and is quite a bit larger.

A close comparison that happened to be around was a Chinese Hibiscus (Hibiscus rosa-sinensis).


The lehua and the hibiscus are both dicot plants.  Moving further out is the golden bamboo (Bambusa vulgaris) which is a monocot but still an angiosperm (flowering plant).


For an even more distant relative of lehua I took a sample from the wart ferns (Microsorum scolopendria) growing around their base.


Like many plants, M. scolopendria ferns were introduced to Hawai'i and they have now established themselves in the wild here.  They are called "wart" ferns because of the spore clusters on the leaves.

So here is what the alignment of the chloroplast DNA sequences from these four species of plants looks like.


Most DNA positions that vary between the samples are highlighted (but if you look closely not all are; I used Geneious software to make this plot, so this might be a bug or, more likely, I have a highlighting setting wrong).  One thing that is quickly apparent is the large number of unique nucleotides in the wart fern sequence.  This can be explained because the fern is so divergent from the other species.  You can also see that when a DNA difference is shared between only two species (like position 11 near the beginning), it tends to be shared between the lehua and hibiscus, the two most closely related species among these four.  Overall the unique sites and shared sites give this type of "tree" representation.  (Shared derived (derived means different from a common ancestor) states are useful for reconstructing ancestral relationships between divergent species and are called synapomorphies.)


The numbers along each branch indicate the unique DNA basepair differences highlighted for each species (there are actually some more upon closer inspection that were not highlighted, as I mentioned above, but I left it as it is for now).  The "11" in the segment separating the lehua and hibiscus from the other plants represents the 11 differences (synapomorphies) only shared by these two species.

However, there are some sites that do not share this pattern.  For example, site 404 places lehua with bamboo and site 281 places lehua with the fern.  Here is the alignment with the 11 shared sites mentioned above outlined in black, and the sites that are incongruent with the actual relationship of these species indicated in red or green (for the two different incongruent patterns present).


And below the incongruent sites are indicated on the tree with dashed arrows.


Overall the data suggests the grouping that we believed to be true a priori (based on observable physical characteristics of the plants), that the lehua and hibiscus are more closely related.  But one DNA position suggests a closer relationship between the lehua and bamboo and three positions push the lehua closer to the ferns.  This gives some level of support for the following two alternative tree topologies.



However, a far more likely explanation is that these DNA positions experienced more than one mutation event that resulted in parallel mutations to the same state in different lineages (or even a "back" mutation to an ancestral state after a lineage separation).  This sharing of states from different events is termed homoplasy.  Because of the long evolutionary distances between these species it is reasonable to suspect a few sites have had multiple "hits"; also, if you look closely, a few sites like positions 20 and 149 have had multiple mutation events to more than two states (but these have not been highlighted by the software I used to make the plot).
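The sorting of sites into unique, shared, and incongruent patterns can be sketched in code. The example below is an illustration only: the six-base sequences are invented toy data (not the real rbcL alignment), and the function name `classify_site` and its labels are made up here. Each alignment column is classified by which species share a state.

```python
from collections import Counter

TAXA = ("lehua", "hibiscus", "bamboo", "fern")

def classify_site(column):
    """Classify one alignment column (four bases, given in TAXA order)."""
    states = set(column)
    if len(states) == 1:
        return "constant"
    if len(states) > 2:
        return "multi-state"          # more than two states, e.g. multiple hits
    counts = Counter(column)
    if min(counts.values()) == 1:     # one species differs from the other three
        odd_base = [base for base, n in counts.items() if n == 1][0]
        return "unique to " + TAXA[column.index(odd_base)]
    # Otherwise it is a two-two split: report which species shares lehua's state.
    partner = next(TAXA[i] for i in (1, 2, 3) if column[i] == column[0])
    return "lehua+" + partner

# Invented toy alignment, one made-up sequence per species.
toy = {
    "lehua":    "ATACTA",
    "hibiscus": "ATATCG",
    "bamboo":   "ACGCCA",
    "fern":     "ACATTT",
}

patterns = [classify_site(col) for col in zip(*(toy[t] for t in TAXA))]
for position, pattern in enumerate(patterns, start=1):
    print(position, pattern)
```

In this toy data the "lehua+hibiscus" column supports the expected grouping, while "lehua+bamboo" and "lehua+fern" are the two incongruent patterns.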

If we believe the wart fern is the most divergent, we can use this assumption to "root" the tree by using the fern as an "outgroup."  In this simple representation time moves from left to right; the older part of the tree is to the left and the younger is to the right.  (This plot is just to represent the relative order of events, the branch lengths here are not proportional to the amount of change or inferred time period.)


In the image below a past mutation event from an "A" to a "G" in the bamboo lineage (for example position 32 in the alignment) is mapped onto the tree.  This is a single (not shared) derived state (an autapomorphy) so it is not useful for inferring ancestral relationships, but it gets the ball rolling with thinking about mapping mutations onto a tree.


The state of the DNA position is given at the tips of the tree in front of each species name (resulting in a "G" in the bamboo sequence) and at two positions inside the tree where we can infer the sequence of a common ancestor.  Now for the synapomorphy pattern:


Above, the history of position 11 in the alignment is indicated.  A C to T mutation occurred before the common ancestor of the hibiscus and lehua but after the species split from the lineage leading to the bamboo (indicated in the black box as in the alignment).  And now I'll plot a homoplasy pattern that gives incongruent results.


Here is an example from position 281 in the alignment.  The lehua and fern share one state and the hibiscus and bamboo share the other.  In this case we suppose that more than one mutation event has occurred in the history of this position, but based on only this data we cannot tell when/where it occurred, what direction it was in (a T to C or a C to T), or what the ancestral states were likely to be.
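One way to make "more than one mutation event" concrete is to count the minimum number of changes a site pattern needs on the rooted tree. The sketch below uses Fitch's parsimony counting, which is a standard technique brought in here for illustration and not something used to make the plots in this post. The tree is the fern-rooted one described above. The bases for position 281 are an assumed assignment (which state belongs to which pair cannot be settled from this data alone); positions 11 and 32 follow the descriptions in the text.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes) for one site.

    tree is either a species name or a pair (left, right) of subtrees;
    states maps each species name to its base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                                   # the subtrees can agree: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Rooted with the fern as the outgroup.
TREE = ("fern", ("bamboo", ("hibiscus", "lehua")))

site_11 = {"fern": "C", "bamboo": "C", "hibiscus": "T", "lehua": "T"}   # synapomorphy
site_32 = {"fern": "A", "bamboo": "G", "hibiscus": "A", "lehua": "A"}   # autapomorphy
site_281 = {"fern": "T", "bamboo": "C", "hibiscus": "C", "lehua": "T"}  # assumed bases

for name, site in (("11", site_11), ("32", site_32), ("281", site_281)):
    ancestral, changes = fitch(TREE, site)
    print(name, sorted(ancestral), changes)
```

Positions 11 and 32 each need only one change on this tree, while the position 281 pattern needs at least two, and the root state is left ambiguous, which matches the uncertainty described above.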

And here is yet another possibility where a back-mutation occurred along a lineage to restore an ancestral state.


This example helps illustrate that often in biology we deal with some level of uncertainty, but this does not necessarily prevent us from being able to make inferences.  Also, assumptions are important tools to use to work through a problem, so long as we are clear about what those assumptions are.  The four species used in this example cover a long range of evolutionary time.  There is plenty of opportunity for multiple mutations to occur.  When looking at more closely related species there is often far less uncertainty (of the kind discussed here, homoplasy, for this type of gene; there are other types of issues with closely related datasets).  This data also provides a "backbone" tree to place other plants onto as I collect more samples.