
The Bernoulli distribution mean and variance

598px-Ståhlberg_flipping_coin2

One of the simplest probability distributions to begin with is the discrete Bernoulli distribution, named after Jacob Bernoulli (1655-1705).  It can be thought of as a coin toss where the coin either comes up heads (with a value of 1) or tails (with a value of 0).  Other distributions like the binomial can be built up from multiple Bernoulli trials (and the multinomial distribution generalizes the binomial to trials with more than two outcomes).

So, if $X$ is the outcome of a coin toss (a Bernoulli trial) and "heads" gives a value of $X = 1$, we can say the probability of $X = 1$ is $p$:

\[ P(X = 1) = p. \]

All of the outcomes must sum to one (100%) so the probability of tails, $X = 0$, is the probability the outcome is not heads:

\[ P(X = 0) = 1 - p. \]

For a fair coin $p = 1/2$.  However, this formulation is general so we could have a trick coin with $p \neq 1/2$, for example.

So what is the expected value of a Bernoulli trial (a coin toss)?  This is pretty intuitive but it doesn't hurt to write it out in systematic "bookkeeping" fashion.  If we flipped the same coin many times and kept track of the outcomes, a fraction $p$ of the time $X = 1$.  The rest of the time, $1 - p$, the outcome is $X = 0$.  So $p$ is the amount of time we have a value of $1$ and $1 - p$ is the amount of the time $X = 0$.  Weighting these outcomes by the frequency we expect to see them gives us an average result of:

\[ \mu = p \times 1 + (1 - p) \times 0. \]

The zero cancels out so we are left with:

\[ \mu = p. \]

So, if we have a fair coin with $p = 1/2$ we expect an average outcome of value $1/2$. This seems to make sense.

The average ($\mu$) is another way of saying the expected value of a trial or the "expectation."

\[ E[X] = \mu = p. \]

Now for the variance, or degree of spread around the average.  Variance is often symbolized by a lower case sigma-squared, $\sigma^2$.  The standard deviation, $\sigma$, is the square root of the variance.

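Variance is calculated as the average squared difference of individual outcomes, $X$, from the mean.

\[ \sigma^2 = E\left[(X - \mu)^2\right] \]

We already know from the mean derivation above that $\mu = p$.  So we can substitute this in:

\[ \sigma^2 = E\left[(X - p)^2\right] \]

We also know that $p$ amount of the time $X = 1$, which gives us one side of the two mutually exclusive (add the probabilities) outcomes:

\[ \sigma^2 = p(1 - p)^2 + \ldots \]

Also, $1 - p$ amount of the time $X = 0$ so the full equation becomes:

\[ \sigma^2 = p(1 - p)^2 + (1 - p)(0 - p)^2. \]

Multiplying this out and simplifying gives:

\[ \sigma^2 = p(1 - p)\left[(1 - p) + p\right] = p(1 - p). \]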

Like heterozygosity in Hardy-Weinberg genotype frequencies, the variance is at its greatest values at intermediate frequencies near $p = 1/2$ and declines to zero at $p = 0$ and $p = 1$.

bern-var

If $p$ is indeed zero or one then the outcome is always identical to the mean and therefore variance is zero--there is no deviation in individual outcomes from the average.  On the other hand if $p = 1/2$ every outcome will be a value with a distance of 0.5 from the mean (either $0$ or $1$ with $\mu = 1/2$) and this distance squared is 1/4, $\sigma^2 = 1/4$.
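As a quick numerical check of the bookkeeping above, here is a minimal Python sketch.  It is only an illustration: the function names, the number of trials, and the random seed are my own assumptions.  It computes the mean and variance from the weighted outcomes and compares them to a long run of simulated coin tosses.

```python
import random


def bernoulli_mean_var(p):
    """Mean and variance from the weighted-outcome bookkeeping."""
    # Weight each outcome (1 or 0) by how often we expect to see it.
    mean = p * 1 + (1 - p) * 0
    # Weight each squared distance from the mean the same way.
    var = p * (1 - mean) ** 2 + (1 - p) * (0 - mean) ** 2
    return mean, var


def simulate(p, n_trials=100_000, seed=1):
    """Estimate the mean and variance from simulated coin tosses."""
    rng = random.Random(seed)
    tosses = [1 if rng.random() < p else 0 for _ in range(n_trials)]
    mean = sum(tosses) / n_trials
    var = sum((x - mean) ** 2 for x in tosses) / n_trials
    return mean, var


for p in (0.0, 0.25, 0.5, 1.0):
    print(p, bernoulli_mean_var(p), simulate(p))
```

The simulated values should sit close to $p$ and $p(1 - p)$, with the variance largest for the fair coin and zero at the two extremes.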

Accidental site directed mutagenesis in PCR product sequence

The last post about my COI "species barcode" sequence has been bugging me.  I wouldn't really expect to find a unique sequence in a small region by chance in a mitochondrial gene.  Many, many mitochondria in humans have been sequenced and the general levels of genetic variation are well understood.  I checked closer and it appeared to be an amino acid altering mutation to a different class of amino acid--not expected at all.

GtoAtaq_error

It was close to one edge of the sequence, so I looked at the other edge and found some more unique changes.  In the sequence snippet above each A that is underlined by an orange box is a G in standard human COI sequences.  I made an assembly with the human reference sequence and added my primers to the alignment.  Then it became obvious that I had made a rookie mistake.  The "mutations" were located in the primer.  These primers were not designed to only work with humans, but to work across a wide range of animals.  They do not match the human sequence exactly.  As the PCR progressed, making more and more copies starting from the annealed primers, the primer sequence was incorporated into the total sequence.

coi-primer-mut-1

The image above shows the reference sequence at the bottom, the primer position is in light green and my sequence is above that.  The changes match the primer (5'-GGTCAACAAATCATAAAGATATTGG-3', the reverse complement of which is 5'-CCAATATCTTTATGATTTGTTGACC-3' in the sequence above).  This is actually a method to engineer specific changes to a DNA sequence, a form of site-directed mutagenesis.
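The relationship between the two primer strings quoted above is easy to check by hand or with a few lines of code.  Here is a small Python sketch (the function name is just my own choice) that complements each base and reverses the result so that it reads 5' to 3':

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand of a DNA sequence, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


forward_primer = "GGTCAACAAATCATAAAGATATTGG"
print(reverse_complement(forward_primer))  # CCAATATCTTTATGATTTGTTGACC
```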

Going back to the first side and taking another look:

coi-primer-mut-2

The A is incorporated from the primer (5'-TAAACTTCAGGGTGACCAAAAAATCA-3').  There is also a missing A just inside the sequence (in the 3' direction); taking a closer look at the trace I can agree that an extra A may be at the position (the arrow below points to a "shoulder" on the red A trace that may be the missing nucleotide):

trace-error

There is also a "T" just outside (5' to) the primer sequence.

The Taq polymerase enzyme used in PCR is known to be error prone and sometimes adds A's on to the 3' end of a PCR product (in fact this is exploited in one method, TA cloning, to clone PCR products). By convention sequences are written 5' to 3' so a 3' A will appear as a complementary T in this instance.  Also, note in the previous figure with the first primer an extra A was present on the 3' end (but this also agrees with the reference sequence).

When I trimmed out the edges and carefully curated the sequence I got the alignment below.

me_COI-alignment

It is 100% identical to many human sequences on GenBank.

So, just for the record, my corrected COI DNA sequence, in FASTA format is below:

>me_COI
TAGGTGTTGGTATAGAATGGGGTCTCCTCCTCCGGCGGGGTCGAAGAAGGTGGTGTTGAGG
TTGCGGTCTGTTAGTAGTATAGTGATGCCAGCAGCTAGGACTGGGAGAGATAGGAGAAGTA
GGACTGCTGTGATTAGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGG
GGGTTTTATATTGATAATTGTTGTGATGAAATTGATGGCCCCTAAGATAGAGGAGACACCT
GCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTGGGAGTAGTTCCCTG
CTAAGGGAGGGTAGACTGTTCAACCTGTTCCTGCTCCGGCCTCCACTATAGCAGATGCGAG
CAGGAGTAGGAGAGAGGGAGGTAAGAGTCAGAAGCTTATGTTGTTTATGCGGGGAAACGCC
ATATCGGGGGCACCGATTATTAGGGGAACTAGTCAGTTGCCAAAGCCTCCGATTATGATGG
GTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGTAGATGTG
GTCGTTACCTAGAAGGTTGCCTGGCTGGCCCAGCTCGGCTCGAATAAGGAGGCTTAGAGCT
GTGCCTAGGACTCCAGCTCATGCGCCGAATAATAGGTATAGTG

Incidentally, this section is now an exact match to the Cambridge reference sequence, NC_012920.

National Academies Summer Institutes - Almost Done

We are near the end of the week long teaching workshop.  Yesterday we all presented brief (in 25 minutes) teaching units that our six individual groups came up with during the week--designed to use higher level cognitive skills and active learning.  I can say that I saw a few good ideas to use.  I am also impressed with how much we all came up with in such a short time.  I was in the "Evolution" group and we came up with a "Tree Thinking" project where a phylogeny is selected that best represents a set of amino acid sequences.  This is done first on an individual level, then on a sub-group level with an increase in confidence in their individual result.  Then the results are compared for the entire "class" and it is shown that the class strongly disagrees (which I think is a valuable scientific lesson in itself).  The subgroups then talk to each other and discover that two different sequences are used and generate two different trees.  Then they brainstorm why this might happen and the function of the proteins is revealed--one is alpha hemoglobin involved in oxygen transport in the blood and the other is prestin which affects high frequency sound perception in the inner ear.  The prestin tree groups echolocating mammals together into a clade...  Then the class votes on a final clicker question with two hypotheses to explain the disagreement--in essence they learn an aspect of tree thinking by creating their own tree and realize echolocation evolved twice, as evidenced by the globin sequence, and is responsible for convergent evolution in the prestin amino acid sequence.  We got lots of positive feedback on our unit.

As I said, there are lots of examples and topics that will likely prove useful to me, but one turnoff aspect of this is we are required to come up with a similar workshop-like activity for our home institution after this workshop.  One of the activities this week was to group by institution and come up with a plan for this.  We had to "agree" to this, and to use a teachable unit from the workshop in our classes in the next year, to sign up for the workshop in the first place (some/many of us had to sign up and didn't really have a choice in the matter).  This feels like a proselytizing pyramid scheme; we have to agree to spread an idea before we even know or are informed of what it is really all about.  I understand they want to get the word out to as many educators as possible to magnify the return for their efforts, but as I said this arm twisting is a real turnoff to me.  I am trying to place that to the side and keep an open mind about the usefulness of the methods they are presenting.  However, in the former vein there are a lot of tricks used in the workshop that remind me of tricks used with kids to encourage compliance (in schools or summer camps) like guided choices, revised terminology to re-frame and reinforce topics, buy-in and identification with effort and activities, and group cohesion mediated acceptance of indoctrination.

Two things that I brought up with the instructors: the group learning activities can be a problem in the classroom for people with hearing disabilities.  The multiple group discussions in the same room create a lot of background noise which makes it hard for a certain minority to participate.  I was told that not everyone could be accommodated by the new classroom format.  (The ADA might have something to say about this?  One way around this is to let groups go to different rooms to discuss then regroup together at a later time.)  I also brought up that I and others would like to see more detail of studies that support the effectiveness of the teaching techniques.  After all, we are scientists and we (should) be naturally inquisitive, questioning and require evidence to make decisions.  This is presented as "scientific teaching" and the results from some studies have been presented that support some of their claims, (and I can intuitively sense some advantages to these methods) but many of us still do not feel satisfied that enough well controlled supporting evidence has been presented.  Also, simply showing the number of studies that are supportive versus the number that are not can be misleading, for example there could be a publication bias in "reformative" findings versus "negative" results and a bias in the results "looked for" in the people doing this kind of educational study in the first place.  (There is also the real possibility of bias from the instructor "trying harder" in classes where new formats and activities are being added.)  What I want is a T. H. Morgan that set out to disprove genes were on chromosomes and ended up proving to the world that they were and accepting the results of his experiments despite his initial beliefs.  In theory I could compare my own results from classes on different years, but as I said in the last post I use many of these methods already and do not want to get rid of this to do a classic lecture only format.

Anyway, I do not want to come across as completely negative.  This workshop has been useful and has also been an opportunity to network with biologists here in Hawai'i, Alaska and the West Coast.

On a final note, it has been a bit fun to be a student again in a classroom-like format after being the educator for so many years.  One thing I noticed is that faculty can make horrible students.  At times I had the people next to me keep whispering to me during presentations or copy my answers to questions.  Several people were looking at their laptops, tablets, phones rather than the presentation.  Some people spoke up and interrupted seemingly for no other reason than to get themselves heard, often with disruptive "joke" comments.  The presenters did a good job of herding the cats and deflecting attempts of derailment to unrelated topics.  I also found myself second guessing some of the questions "there must be a twist so the obvious answer is not right" and watching for cues from the presenter like telegraphing the answer with their hand movements.  For example, they would show an empty graph with two axes and ask us to draw what we thought the relationship would look like, but when they pointed to the graph and made a drawing motion while telling us the instructions they unintentionally drew the correct answer in the air.  These unintentional subconscious actions tend to occur when people are under stress and overloaded with distractions--like when giving presentations in front of a group.  This also gives me things to think about in the context of my own presentations.

National Academies Summer Institutes

This week I am attending the National Academies Summer Institute(s) on Undergraduate Education.  It is a series of all day workshops with people attending from colleges and universities from all over the West Coast, Alaska and Hawai'i.  The focus is on improving undergraduate biology education through active learning, assessment, using methods that improve learning, ... and the motivation is both that many people, including future policy makers, community leaders, voters, etc. have a very poor understanding of biology (especially genetics and evolution) and that we are not attracting and retaining enough students into STEM fields like modern biology to supply the future workforce needs of the US, so we need to use every opportunity in undergraduate education as effectively as possible.

On the first day we brought an example of one of our class exams and assessed questions as a group in terms of cognitive level in Bloom's taxonomy.  Most student assessment in most classes is at lower cognitive levels like recalling information (memorization) and explaining concepts (comprehension).  Very little moves to the middle ground of applying knowledge in new ways or inference at higher levels of using your knowledge to synthesize new hypotheses, experimental designs or evaluate/appraise/critique concepts based on new and prior knowledge.  There were also presentations on methods of testing/question writing that are more effective in assessing student knowledge, ways to get feedback during lecture (like iclickers), and backward design of classes from learning goals to assessment to activities, how to get students to monitor their own knowledge level, etc.  Then we met as smaller groups to start on a group project to design a teachable unit that will be presented to the entire workshop with a smaller activity for the workshop to participate in.

To be honest, I am not sure how much I will really get out of this workshop (much of what we are doing so far seems like, at least on the surface, pointless semantics and I am naturally suspicious of anything that appears to contain a lot of buzzwords even if I am not familiar with them, but I am trying to keep an open mind).  I already use iclickers for class feedback during lecture, some active learning techniques and inverted classroom techniques, aspects of backward design and assess some higher level cognition with my exam questions (where students are expected to synthesize prior knowledge/experience to answer new questions in new situations they have not seen before and/or evaluate competing hypotheses)--not that there isn't room for a lot of improvement.  There is also the practical issue of time and effort that goes into some of these approaches--we have a lot of ground to cover in my undergraduate class and at some points we have to move through topics quickly and frankly do not have time for many detailed activities in lecture.  ...  At the very least however I am already getting some good examples that I can use in my class this fall.

My COI Sequence

I have been trying different methods to isolate DNA from cheek cells.  I tried a "BuccalAmp" kit on a recommendation and it works really well.  It is quick and easy.  I tried the COI barcode PCR on my own DNA sample and compared the results to the earlier DNA extractions and PCR from three species of insects.  First here is part of the aligned traces from forward and reverse sequencing of the amplified product.

me-coi

And here is an alignment with the fruitfly, mantis and mosquito sequences (the basepairs that are different are highlighted).

coi-four-species

You can probably see from the alignment that a lot of the changes are with my sequence and that the insects are more closely related to each other.  There are a few incongruent homoplasies however.  Here is one at position 127 in the alignment.

human-mosquito-homoplasy

Humans and mosquitoes share an "A" at this position while fruit-flies and mantises share a "G."  Recall that mosquitoes are more closely related to fruit-flies than to mantises.  This is just an example of mutations that have occurred more than once at the same position.  Making a "tree" to represent the mutations that are shared and different gives this pattern:

nj-tree-four-species-coi

There is a long distance between me and the insects, as expected.  Additionally, the mosquito (C. quinquefasciatus) groups closer to the fruit-fly (D. melanogaster) than to the mantis (T. sinensis).  The number in the middle is the bootstrap support value for the tree.  In bootstrapping, the positions (columns) of the DNA alignment are randomly sampled with replacement to make new sequences the same length as the original (so by chance some basepairs are left out while others can be included more than once).  A tree is built from each resampled data set, and the fraction of the time the grouping of species below that node in the tree is found is indicated.  So, mosquitoes and fruit-flies are grouped together 96.9% of the time with random resampling of their sequences (rather than one of these being closer to the mantis), which indicates a high amount of support from the data set that this is the actual evolutionary pattern (with assumptions...).
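The resampling step can be sketched in a few lines of Python.  To be clear about the assumptions: the three short sequences below are invented for illustration and are not my COI data, all of the names are hypothetical, and simply counting pairwise differences is a crude stand-in for building a tree from each replicate (it is not the method used to make the tree above).

```python
import random

# Toy alignment: invented sequences for illustration only, NOT the COI data.
TOY_ALIGNMENT = {
    "fly":      "ACGTACGTACGTACGTACGT",
    "mosquito": "CAGTACGTACGTACGTACGT",
    "mantis":   "ACGTACGTACTGCATGCAGT",
}


def resample_columns(alignment, rng):
    """Sample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pair_support(alignment, pair, third, replicates=1000, seed=1):
    """Fraction of replicates in which `pair` are closer to each other
    than either is to `third` (a simplified stand-in for tree building)."""
    rng = random.Random(seed)
    a, b = pair
    hits = 0
    for _ in range(replicates):
        rep = resample_columns(alignment, rng)
        d_pair = count_differences(rep[a], rep[b])
        d_other = min(count_differences(rep[a], rep[third]),
                      count_differences(rep[b], rep[third]))
        if d_pair < d_other:
            hits += 1
    return hits / replicates


print(pair_support(TOY_ALIGNMENT, ("fly", "mosquito"), "mantis"))
```

The printed fraction plays the same role as the bootstrap value on the tree: the more often the grouping survives resampling, the more of the alignment supports it.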

This mitochondrial sequence is inherited from my mother, and her mother, on back in a maternal lineage, changing only occasionally by mutations.  My brother and sister share the same sequence as myself but I will not pass it on to my children, because mitochondria are not transmitted from the father, only from the mother.  I compared my sequences to others on GenBank and found something funny.

my-coi-difference

Near the beginning of the sequence I have an "A" at position 25 (above) that begins a string of six A's.  However none of the other human mitochondrial sequences on GenBank share this.  Below is a comparison of 25 of the most similar human mitochondrial sequences from GenBank, with my sequence at the end; they are all identical to mine except at this position (at position 18 in the alignment below):

me-coi-trim-alignment

They all have a G followed by five A's.  Since this seems to be rare, it might be interesting to see if anyone out there has the same sequence.  I can not go back very far, only about four generations, along my maternal lineage.  My matrilineal Great Grandmother's maiden name was Sarah J. Carlisle born in 1873 in North Carolina; her mother was Dicey (which might be short for Leodicia) Owenby (or Owensby), born about 1834 in North Carolina, and her mother might have been Sarah Hunter born around 1815 in South Carolina--just in case anyone that might be related reads this.

Highlights from the Meeting

I had a nice time at the Evolution meeting in Utah--this was my first Evolution meeting.  There are some presentations that I attended that have stuck in my mind and I thought I should mention them here.

I attended a phylogenetics workshop on the first day.  Joe Felsenstein (U. of Washington) gave a talk that segued from a historical overview to some current problems in phylogenetics.  He gives entertaining talks that sometimes border on the cynical but are humorous.  I especially liked his funding sources in the acknowledgements that included the Felsenstein foundation whose slogan is "instead of painting the house."  On the more serious side he brought up the issue of inference of horizontal gene flow versus lineage sorting in bacteria and archaea.  There are several instances of horizontal gene transfer where DNA sequence has moved from one species to another in bacteria, but how many of these are actually "incongruent" lineage sorting in bacteria with very large ancestral species population sizes?  On the other hand, rates of recombination in bacterial genomes are actually quite low, so if selective sweeps are frequent in bacteria (a later talk by R. Lenski is very relevant to this), perhaps effective population sizes are indeed small and these are true horizontal transfers?  He also brought up the possibility of looking at QTLs/heritability across multiple species to have power to get at what is under selection in correlated traits.

Brant Faircloth (UCLA) talked about using ultraconserved elements (UCEs) to reconstruct various phylogenies including helping to resolve the placement of turtles among birds and reptiles (which turns out to be an unresolved problem).  There is a temptation to sequence and compare whole genomes of a range of species, especially since sequencing has become so cheap, but in the end a lot of the data is thrown out because it is hard to align, etc.  There are various ways to focus down on smaller parts of the genome for comparison such as genome-wide exon (exome) sequencing from mRNA (exons are typically more conserved across species), DNA enrichment by sequence capture (seqcap) of exome DNA with tethered oligonucleotide probes, and finding variants around restriction endonuclease sites across the genome with RADseq and RAD tags.  Another approach is to focus on regions that are highly conserved across a wide range of species (UCEs).  There is a website dedicated to using these (http://ultraconserved.org/); these UCEs seem to contain a lot of binding sites for transcription factors that are deeply shared across metazoans (Ryu et al. 2012).

Later in the same workshop, Stacey Smith (University of Nebraska-Lincoln) gave one of the clearest, best organized, and most informative (to the novice to phylogenetic inference) talks I have seen.  (Many presentations I have seen on this topic have been cryptic and hard to follow, which often reflects a lack of detailed understanding by the people presenting.)  She has also made her slides available online (link: http://www.iochroma.info/links) and said that anyone can use her graphics in their own presentations if they want!  She went over parsimony, maximum likelihood, and Bayesian inference for discrete and continuous traits, the inference of relative rates of change and ancestral states and ways to visualize and interpret the results.  Again, she gave an excellent presentation!

In a special address, Richard Lenski (Michigan S.U.) talked about some new results from his long term (25 year) artificial evolution experiment in E. coli.  The experiment is now up to 50,000 generations and is still going.  He has kept samples of the bacteria over the generations and can compare them to each other.  One overall result is that fitness has been increasing over time from new mutations.  Early in the experiment he thought the curve of fitness increase was hyperbolic, which asymptotically approaches a limit, but with more generations he showed that the change in fitness fits a power law curve significantly better, which does not have a limit.  The implication is that fitness will continue increasing forever and will never reach a peak.

The image below compares an example hyperbolic curve, tanh(x), to a power law curve, (x/1.5)^0.7.  The two curves may be similar early on, but diverge by a greater amount as time becomes large.
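To put numbers on the difference between the two example curves, here is a short Python sketch that tabulates both functions at a few arbitrary values of x (the function names and the x values are just my own choices for illustration, and these are the toy curves from the figure, not Lenski's fitted models):

```python
import math


def saturating_curve(x):
    """Example curve that approaches a limit of 1: tanh(x)."""
    return math.tanh(x)


def power_law_curve(x):
    """Example power law curve with no upper limit: (x / 1.5) ** 0.7."""
    return (x / 1.5) ** 0.7


for x in (0.5, 1, 2, 5, 10, 100):
    print(f"{x:>6}: tanh = {saturating_curve(x):.3f}, "
          f"power law = {power_law_curve(x):.3f}")
```

The tanh column never exceeds 1 while the power law column keeps growing, which is the qualitative distinction between a fitness curve with a ceiling and one without.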

power-vs-hyperbolic-curve

He also talked about the implications of one of his replicates that evolved citric acid metabolism.  They screened a huge number of cells and did not see this occur again independently.  But in the line where it did occur, if they went back to the generations just before this appeared, it did evolve again repeatedly (if I understood correctly what he was saying).  There was also a great deal of evolutionary fine-tuning of the novel citric acid metabolism after it appeared.  This demonstrated the steps of potentiation (setting the stage for a particular adaptation to be able to occur), actualization (the final mutant step that results in the new function), and refinement (additional changes to improve the system).  They are working on nailing down the precise mutational steps that have happened in this lineage and a gene duplication (in citrate transport) is involved.  He also discussed several lines of reasoning that this population of E. coli can be considered to have evolved into a new bacterial species--one of the defining features of E. coli is that it does not use citric acid as a carbon source, and intermediates have reduced fitness.

In other talks, Mingzi Xu talked about honest signals of fitness in male dragonflies.  The dragonflies she works on use Mie scattering of light (which also makes clouds white) from fat deposits on their wings to give them white bands that females are attracted to.  The males don't seem to be able to cheat and produce a large white band if they are smaller and have fewer resources because the fat residue uses a lot of energy to create.  She is going on to estimate heritability of the trait among offspring, etc.

Carl Bergstrom talked about "Timing of antimicrobial use influences the evolution of antimicrobial resistance during disease epidemics." Missing antibiotic doses extends the time it takes to get rid of infections--perhaps in a very predictable way--and missing doses early in a program is far worse than missing them later in a program in terms of the possibility of antibiotic resistance arising. I think this is because there are many more microbe cells present early in the program that could mutate to resistant varieties.

William Soto talked about "Adaptive Radiation of Vibrio fischeri During the Free-living Phase and Subsequent Consequences for Squid Host Colonization." Vibrio genus bacteria in Hawai'i are something I have become more interested in lately so I attended this talk to get some more background information. Bioluminescent Vibrio fischeri are both free living in seawater and symbiotic with nocturnal bobtail squid; however, there may be different morphologies that are differentially adaptive in colonizing versus free living stages.

Benjamin Parker talked about immune response in aphids.  They can generate winged and unwinged aphids that are genetically identical but develop differently and test how well they fight off infections.  There is a cost to producing wings and the winged aphids also do not stop infections as well as the unwinged aphids; the immune response genes are not expressed as highly, etc.  There are some parallels with costs of immunity in bumble bees.

Tyler Hether showed that the effects of mutations depend on the ways that genes interact (epistasis).  The pattern depends on the type of regulation, positive or negative.  This distorts the state space that can be traveled by a sequence via mutations so that change in some directions is faster than others as a function of the type of interaction among genes--the gene interaction network has an effect on the direction of evolvability.  In general networks constrain adaptation; the space to traverse by mutations to get between states was larger with epistasis than the null model of no interaction between genes.  This possibly relates to issues of phenotypic robustness despite mutational change in the underlying genes.

Benjamin Liebeskind gave a great presentation titled "Sodium Channels and the Origin(s) of Animal Behavior."  This is a very difficult question to approach: what are the origins of animal behavior?  Animals use sodium channels (similar to calcium channels), which also exist in protists, to create charge potentials and propagate signals along neurons.  By looking at the gene sequences among species he found that these sodium channels predate animals, which inherited them from common ancestors with modern protists.  The channels started off as an "EEEE" holoenzyme, which evolved to a heterogeneous "DEEA" then a "DEKA" sodium channel holoenzyme.  Interestingly there has been convergent evolution in cnidarians and bilaterians in sodium channel evolution and possibly origins of complex behavior?

Then I gave a presentation on engineering underdominance to transform wild populations (link to PDF of the slides).  It was much more applied than the other talks but I still got some nice feedback from people that saw my presentation.

M. Slatkin (UCB) talked about inferring the number of genes underlying a complex trait (like stature that is influenced by many genes and the environment) from a very subtle signal in the covariance of the trait among parents and offspring.  It requires huge sample sizes to work, on the order of thousands; however, these kinds of sample sizes exist for many traits.  For example, each year I have my genetics class report height for themselves and their parents, if they can, to illustrate quantitative genetics (regression/heritability); this is a sample of about 200 each year so by the end of this fall I will have a total of about 600 parent-offspring trios for a single trait.  Perhaps the sample will be large enough in the coming years to use his approach with the students' data to illustrate estimating the number of genes affecting the trait to the class?

Rebecca Chong (CSU) talked about rates of evolution in rearranged mitochondrial genomes.  If I understood correctly, there is an acceleration of substitutions in genomes that have been rearranged, but this cannot be explained by changes in mutation rate due to their position (single stranded DNA is more susceptible to mutation and the process of replication of the mitochondrial genome makes some areas single stranded longer than others) nor can it be explained by simple relaxation of selection (based on dN/dS ratios).  So the question is, what is driving the changes in these genomes?

There was also a presentation followed by a question and answer session by program officers from NSF to explain more about applying for funding from NSF and some issues with the new application procedure.  I appreciated them doing this and I learned some things, but honestly I did not get a lot out of it because I already read all the instructions, and follow them, for the application.  Long story short--and they did not say this--we need to do a better job of petitioning congress (mailed letters, not emails, are the best way to do this) to provide funding to NSF.  (I also received some comments back on my last application, but I will write about these in a later post.)

One talk that I missed that I really wanted to see was "The convivial origins of life on the Earth: A cooperative network of RNA replicators" by Niles Lehman (PSU).  I have had a side interest in RNA evolution for a long time, but I couldn't be in two places at once...  There were also a couple presentations about evolving robots that might have been fun to see, but the scheduling was bad for me.

Finally, there was a talk by Joan Strassmann (WU) on the discovery of bacterial farming in social amoebas.  I liked the talk, but toward the end of the meeting things were starting to blur together.  In a similar vein, I saw a presentation about strategies to minimize infections (undershooting versus overshooting a minimal infection threshold) during flu outbreaks--there is an interesting social (public versus private interest) conflict dynamic involved--but I can't remember now who gave the talk, sorry.

There were also poster sessions and one that jumped out at me was a poster by David Gokhman (HUJ) "Reconstructing the DNA Methylation Map of Extinct Human Species."  Simplistically put, DNA methylation can "turn" genes "on" or "off."  These are interesting because this process can explain a lot of things in genetics including trans-generational environmental effects on gene expression.  There are ancient genomes that are being sequenced and Gokhman et al. used a mutation bias between methylated and unmethylated C's in the DNA sequence.  (Methylated C's are deaminated to T's, unmethylated C's are deaminated to U's and lost from the sequence.)  The Denisovan genome was studied (Denisovans lived until 30,000 years ago in Siberia and likely S. to S.E. Asia).  Ancient methylated DNA sequences are, according to this mutation bias, expected to have higher C-to-T ratios and this can be compared to known methylation patterns in modern humans--the two are correlated. However they found 41 genes (e.g., EBP, HOXD9, UPF3B, NBEA, MAB21L1, miR-291-2, etc.) that are inferred to be differently methylated between the Denisovan and modern humans--these are genes affecting things like limb development and psychiatric disorders when they are disrupted.

There were also some displays related to teaching resources...(listed briefly here):

There were no presentations Monday afternoon, so I went on an outing to the top of a nearby mountain (11,000 ft elevation) and took this picture of myself at arm's length with my phone's camera (yes, that is snow in the background, in June).  In the valley in the distance are the Sandy and West Jordan suburbs of Salt Lake City.

Photo0257

chemfig: Chemistry in WordPress

I just learned about adding chemical graphics in wordpress using the QuickLaTeX plugin and the chemfig package (links: QuickLaTeX, chemfig and chemfig).

Here is my first attempt, starting with something simple: molecular hydrogen.

 \begin{verbatim} \chemfig{H-H} \end{verbatim}

This code gives:
  \chemfig{H-H}

And now for a double bond: molecular oxygen.

 \begin{verbatim} \chemfig{O=O} \end{verbatim}

  \chemfig{O=O}

Putting these together we get water.

 \begin{verbatim} \chemfig{H-O-H} \end{verbatim}

  \chemfig{H-O-H}

However, a water molecule actually has a bond angle of about 104.5 degrees. The angle below is raised 75.5 degrees because it would normally be 180 degrees between the hydrogens, $180 - 104.5 = 75.5$.

 \begin{verbatim} \chemfig{H-O-[:+75.5]H} \end{verbatim}

  \chemfig{H-O-[:+75.5]H}

The oxygen has a higher affinity for electrons (is highly electronegative) and develops a partial negative charge, while the hydrogens, with electrons pulled away from them, are left with partial positive charges from their proton nuclei.

 \begin{verbatim} \chemfig{H^+-O^{-}-[:+75.5]H^+} \end{verbatim}

  \chemfig{H^+-O^{-}-[:+75.5]H^+}

These charges result in hydrogen bonds between water molecules, which allows water to be a solid and a liquid at higher temperatures than is usual for a molecule of its size.  This also is responsible for the expansion of ice compared to the liquid form (the molecules take up more space to arrange themselves with optimal hydrogen bonding).

 \begin{verbatim} \chemfig{H^+-O^{-}-[:+75.5]H^+-[::+0,,,,dash pattern=on 2pt off 2pt]O^{-}(-[:+104.5]H^+)-H^+} \end{verbatim}

  \chemfig{H^+-O^{-}-[:+75.5]H^+-[::+0,,,,dash pattern=on 2pt off 2pt]O^{-}(-[:+104.5]H^+)-H^+}

And here is an actual reaction (the bond angles in the water come from $75.5 / 2 = 37.75$):

 \begin{verbatim} \chemname{\chemfig{O=O}}{Oxygen} \chemsign{+} 2 \chemname{\chemfig{H-H}}{Hydrogen} \chemrel{->} 2 \chemname{\chemfig{H^+-[:-37.75]O^{-}-[:+37.75]H^+}}{Water} \end{verbatim}

  \chemname{\chemfig{O=O}}{Oxygen} \chemsign{+} 2 \chemname{\chemfig{H-H}}{Hydrogen} \chemrel{->} 2 \chemname{\chemfig{H^+-[:-37.75]O^{-}-[:+37.75]H^+}}{Water}

Switching gears with carbon, here is methane.

 \begin{verbatim} \chemfig{C(-[:0]H)(-[:90]H)(-[:180]H)(-[:270]H)} \end{verbatim}

  \chemfig{C(-[:0]H)(-[:90]H)(-[:180]H)(-[:270]H)}

Actually methane is organized into a tetrahedron with approximately 109.5 degree angles between the hydrogen atoms ($90 + 109.5 = 199.5$, $90 - 109.5 = -19.5$, $-19.5 \pm 9.75$). Projecting above and below the plane of the image can be indicated with Cram-style bonds.

 \begin{verbatim} \chemfig{C(-[:90]H)(<:[:-9.75]H)(<[:-29.25]H)(-[:199.5]H)} \end{verbatim}

  \chemfig{C(-[:90]H)(<:[:-9.75]H)(<[:-29.25]H)(-[:199.5]H)}
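The angles in the Cram-style methane drawing can be checked with a few lines of arithmetic. This is only a sketch in Python (not part of the chemfig workflow). It assumes the first bond points straight up at 90 degrees and that the two Cram bonds are fanned symmetrically around the third direction. The 9.75 degree half-spread is read off the code above rather than derived.

```python
import math

# Ideal tetrahedral H-C-H angle: arccos(-1/3), about 109.47 degrees.
tetrahedral = math.degrees(math.acos(-1.0 / 3.0))

# Rounded value used in the post.
bond_angle = 109.5

# Assumption: the first bond points straight up (90 degrees, measured
# counter-clockwise from the horizontal, as chemfig absolute angles are).
up = 90.0
left = up + bond_angle           # bond drawn toward the lower left
right_center = up - bond_angle   # direction shared by the two Cram bonds

# Assumption: the two Cram bonds are spread symmetrically about that
# direction; the half-spread is taken from the chemfig code, not derived.
half_spread = 9.75
cram_angles = (right_center + half_spread, right_center - half_spread)

print(round(tetrahedral, 2))
print(left, right_center, cram_angles)
```

The rounded 109.5 degrees is close to the ideal value, and the three computed angles match the ones passed to `\chemfig`.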

Now for cyclic hydrocarbons! Benzene is:

 \begin{verbatim} \chemfig{C*6((-H)-C(-H)=C(-H)-C(-H)=C(-H)-C(-H)=)} \end{verbatim}

  \chemfig{C*6((-H)-C(-H)=C(-H)-C(-H)=C(-H)-C(-H)=)}

And as is typically done for hydrocarbons, we can leave out the hydrogen atoms and just show the carbon backbone:

 \begin{verbatim} \chemfig{*6(=-=-=-)} \end{verbatim}

  \chemfig{*6(=-=-=-)}

Finally, the delocalized pi-bonds can be illustrated as a ring which better reflects the aromatic p molecular orbitals (these give benzene and similar aromatic hydrocarbons enhanced molecular stability):

 \begin{verbatim} \chemfig{**6(------)} \end{verbatim}

  \chemfig{**6(------)}

Glucose! (the most common form at least) I actually had help on this one; I started to put it together then found it already assembled in Section 11.3.4 P. 37 of the ChemFig Documentation.  I just modified it slightly:

 \begin{verbatim} \setcrambond{2pt}{}{} \chemname{\chemfig{HO-[2,0.5,2]?<[7,0.7](-[2,0.5]OH)-[,,,, line width=4pt](-[6,0.5]OH)>[1,0.7](-[6,0.5]OH)-[3,0.7]O-[4]?(-[2,0.3]-[3,0.5]OH)}}{$\alpha$-D-Glucopyranose} \end{verbatim}

  \setcrambond{2pt}{}{} \chemname{\chemfig{HO-[2,0.5,2]?<[7,0.7](-[2,0.5]OH)-[,,,, line width=4pt](-[6,0.5]OH)>[1,0.7](-[6,0.5]OH)-[3,0.7]O-[4]?(-[2,0.3]-[3,0.5]OH)}}{$\alpha$-D-Glucopyranose}

Finally, let me have a shot (double entendre) at putting together caffeine.  First the hydrocarbon shorthand:

 \begin{verbatim} \chemname{\chemfig{*6(-N(-)-(=O)-N(-)-(=O)-(*5(-N(-)-=N-=)))}}{Caffeine} \end{verbatim}

  \chemname{\chemfig{*6(-N(-)-(=O)-N(-)-(=O)-(*5(-N(-)-=N-=)))}}{Caffeine}

Then the full version with all the atoms labeled:

 \begin{verbatim} \chemname{\chemfig{*6(C-N(-C(<:[:-9.75]H)(<[:-29.25]H)(-[:199.5]H))-C(=O)-N(-C(<:[:+129.75]H)(<[:100]H)(-[:-40.5]H))-C(=O)-C(*5(-N(-C(<:[:+180]H)(<[:+170]H)(-[:-307.5]H))-C(-H)=N-C=C)))}}{Caffeine} \end{verbatim}

  \chemname{\chemfig{*6(C-N(-C(<:[:-9.75]H)(<[:-29.25]H)(-[:199.5]H))-C(=O)-N(-C(<:[:+129.75]H)(<[:100]H)(-[:-40.5]H))-C(=O)-C(*5(-N(-C(<:[:+180]H)(<[:+170]H)(-[:-307.5]H))-C(-H)=N-C=C)))}}{Caffeine}

GMO Forensics with PCR

The Polymerase Chain Reaction (PCR) is a very powerful technology invented in 1983 by Kary Mullis.  We use it routinely today in molecular genetics labs and it is easy to take it for granted.  What makes this reaction particularly significant is not only that it uses an enzyme catalyst that remains unchanged while driving the reaction, but also that it results in an exponential amplification of a defined DNA sequence.  Human intuition tends to underestimate the power of exponential amplification.  Imagine that you get paid a penny on the first day of a job and that your pay doubles each day (two pennies on the second day, four pennies on the third day).  It seems like it would take a long time to make much money--at the end of the week you have earned only $1.27 in total.  However, you would have well over 10 million dollars (about $2^{30}$ pennies) by the end of the month!  Kary Mullis was only given a $10,000 bonus for inventing PCR by Cetus, the company he worked for; but in 1993 he was finally awarded the Nobel Prize in Chemistry for his invention.
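The penny example is easy to check by brute force. Here is a minimal Python sketch, assuming the first payday is day 1 and a 30-day month; the function names are made up for illustration. The running total after n days works out to one penny short of 2^n.

```python
def pay_on_day(day):
    """Pay in pennies on a given day (day 1 pays one penny, then doubles)."""
    return 2 ** (day - 1)


def total_after(days):
    """Cumulative pay in pennies after the given number of days."""
    # Summing the doubling series 1 + 2 + 4 + ... gives 2**days - 1.
    return sum(pay_on_day(d) for d in range(1, days + 1))


for days in (7, 30):
    pennies = total_after(days)
    print(f"after {days} days: {pennies} pennies = ${pennies / 100:,.2f}")
```

The same doubling, applied to DNA molecules instead of pennies, is what makes PCR so sensitive.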

PCR works like this on DNA.  A stretch of DNA is designated with a set of primers (short single stranded DNA segments that can be synthesized).  A DNA polymerase enzyme attaches to one end of the primers and synthesizes a new DNA strand in one direction matching the original template DNA.  There are cycles of heating ("melting" the DNA by breaking hydrogen bonds between the strands), annealing of the primers to single stranded DNA, and extension of the new strands by the polymerase.  Importantly, the newly synthesized DNA strand can be used as a template in the next round, so, in theory, the amount of DNA between the primers is doubled each round.  If you do this for 30, 35, or even 40 cycles, you can amplify enough DNA to work with and sequence from even a single starting molecule.
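To put rough numbers on "doubled each round," here is a small Python sketch of the expected copy number after a given number of cycles. The `efficiency` parameter is my own addition for illustration: 1.0 means perfect doubling, and the 0.9 value is purely hypothetical rather than measured.

```python
def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Expected template copies after a number of PCR cycles.

    efficiency=1.0 is perfect doubling each cycle; lower values are a
    hypothetical way to model cycles where not every template is copied.
    """
    return start_copies * (1.0 + efficiency) ** cycles


for cycles in (30, 35, 40):
    ideal = copies_after(cycles)
    lossy = copies_after(cycles, efficiency=0.9)
    print(f"{cycles} cycles: {ideal:.3g} copies ideal, {lossy:.3g} at 90% efficiency")
```

Even with imperfect doubling, a single starting molecule becomes an enormous number of copies, which is also why contamination between samples is such a concern.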

Here in Hawai'i there is currently a GMO food labeling debate.  Unlike the EU, crops that are genetically modified do not have to be indicated on food labels in the US.  This fits into the larger GMO crop debate and is something that the students in my class are very interested in learning about.  In fact, just after I arrived here in August 2011 there were newspaper headlines about acres of transgenic crops destroyed on the big island and rewards for information (link).  This led me to thinking about a lab activity that would capture their interest--detecting GMO food ingredients by PCR.

Almost all GMO crops use the same strong constitutive promoter (highly expressed and always on because of a high affinity for RNA polymerase) from the cauliflower mosaic virus (the 35S promoter, usually abbreviated CaMV 35S; written CMV35S or CMV in this post) to drive expression of the inserted gene and a "nos" (from nopaline synthase) terminator from the Agrobacterium tumefaciens Ti plasmid to stop transcription of the gene on the opposite side.  Primers to detect these parts of the genetic insert are already published so I ordered some to try out.

The image below is an overview of a 9+ kbp (over nine thousand base pairs) genetic modification inserted into transgenic papayas to combat the papaya ringspot virus (the virus devastated the papaya industry here in Hawai'i over the last 50 years).  The type of insert varies for each application, but as you will see later on, there is a reason I picked the GM papaya to illustrate.

papaya-insert-zoomout2

And below I've indicated the positions where three sets of primers are predicted to anneal: one set in the nos terminator and two sets in the CMV promoter indicated by the green triangles.

papaya-insert-zoomout

Below I have zoomed in a bit to see more detail; the sequence is so large it has to be wrapped on the screen so moving off to the right comes back on the next row down on the left.  In light green promoters are indicated, protein coding genes are in yellow and terminator sequences in orange.

papaya-insert

Below is a closer picture of part of the CMV promoter with "forward" and "reverse" primers. This promoter is used to drive expression of a coat protein from the ringspot virus, to prime the papaya's immune response to the virus--this particular genetic modification is often credited with saving the papaya industry here in Hawai'i.  So far I have tried the nos primer pair and one of the two CMV promoter pairs; I have not tried the second CMV pair yet.

CMV-promoter

Now it's time to go out and get some samples to try them out on.  In the US the big three are soybeans, corn and cotton (the image below is from here).

adoption_of_genetically_engineered_crops_in_the_u.s

I couldn't think of a source of cotton with enough DNA to easily test, so I focused on corn, soybeans and--here in Hawai'i--papaya.

At the local grocery store I found "unlabeled" corn from the US:

IMG_0025

"Green Giant Nibblers" also grown in the US:

IMG_0027

And "GloriAnn Sweet Corn" a product of Mexico:

D18

The only soybeans I found were all products of China.  "Shirakiku Edamame":

IMG_0030

"WelPac Shelled Edamame":

IMG_0032

And "Safeway Kitchens Shelled Edamame":

IMG_0034

For Papayas I found "Diamond Head Papaya":

D17

Anonymous "Papaya" from a different store:

IMG_0037

And "Solo" variety papaya from a farmer's market:

IMG_0026

So we have three of each of corn, soybean and papaya samples.  After extracting DNA samples, running the PCR and running a gel to view the results, this is what I found:

IMG_0039

To the left are the CMV primers and to the right the NOS primer pairs.  Three lanes in each showed up positive for amplification of the corresponding DNA.  The positive bands are faint so I indicated them with arrows in the image below.

IMG_0039-arrows

These primers amplify a short DNA sequence, which does not take up much dye and so is hard to see on the gel.  The CMV promoter primers worked "better," i.e. their bands are easier to see.  If you look closely you can also see a fainter band just below the positive samples that is "primer dimer".  This happens when the primers anneal to each other and amplify an even shorter sequence from just the primers.  Primers are designed in pairs to minimize this effect, however.

In the gel below I tried different combinations of the CMV and NOS primers together in the same PCR reaction to try to amplify the larger sequence of DNA between them.  In this case these primers were not designed to work together and the primer dimer is much stronger at the bottom of the lanes.  There are three lanes that gave a large (did not move as far down the gel) band in the upper right (CMVf, NOSr).  The PCR in the second lane in from the top left didn't seem to work at all; this could be due to a PCR inhibitor that was co-extracted with the DNA.

IMG_0040
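Why a forward primer from the promoter and a reverse primer from the terminator give a longer product can be sketched as a toy "in-silico PCR." Everything in this Python example is invented for illustration: the sequences are not the real primers or the papaya insert, and exact string matching stands in for annealing.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def amplicon(template, forward, reverse):
    """Return the predicted product for a primer pair, or None.

    The forward primer must match the top strand exactly, and the reverse
    primer must match the bottom strand somewhere downstream of it.
    """
    start = template.find(forward)
    if start == -1:
        return None
    site = reverse_complement(reverse)  # reverse primer site on the top strand
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(site)]


# Toy sequences, made up for this sketch (not the real insert or primers).
promoter = "ATGCCGTA" + "TTGACC" + "GGCTAACT"
gene = "CATCATCATCATCATCAT"
terminator = "CCGGTTAA" + "AGTC" + "TGCAAGCT"
template = promoter + gene + terminator

cmv_f, cmv_r = "ATGCCGTA", reverse_complement("GGCTAACT")
nos_f, nos_r = "CCGGTTAA", reverse_complement("TGCAAGCT")

within_promoter = amplicon(template, cmv_f, cmv_r)
within_terminator = amplicon(template, nos_f, nos_r)
across_insert = amplicon(template, cmv_f, nos_r)
print(len(within_promoter), len(within_terminator), len(across_insert))
```

The mixed pair spans everything between the two sites, so its product is longer and runs higher on the gel than either of the matched pairs.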

So what were the three positive samples?  They were the three papaya samples, two from the store and one from the farmer's market.  To nail this down a bit more I sequenced the DNA that was amplified to make sure it is what I suspected.

sequence-matches

Above, the short sequences from the three samples (gray bands in the lower part) are mapped to the papaya insert as a reference sequence.  As expected they map to the CMV promoter.  Below I zoomed in some more on a section so you can see the DNA sequence; it is a match to the promoter sequence.

sequence-matches-2

Above, two matches to the "right" promoter and below one match to the "left" promoter.

sequence-matches-3

Below is the longer sequence I recovered from the CMV-NOS primer pairs.

sequence-matches-long1

Zooming in more below you can see (hopefully) it is a match to the uidA gene in the genetic modification insert.  (The colors in the trace file correspond to the different colored bases in the reference sequence at the top.)

sequence-matches-long2

uidA is a gene from E. coli and produces the enzyme beta-D-Glucuronidase.  When plant tissues are treated with a bromine-containing compound (X-Gluc), the compound is cleaved by beta-D-Glucuronidase and stains the tissues blue.  This provides visual confirmation that the gene in the insert is being expressed (similar to the X-Gal reporter system in E. coli) and is called the GUS reporter system.  The point here is that this is not a gene sequence (with components originating from a virus, CMV, and two different bacteria, NOS and uidA) that would be present in non-genetically modified papaya, but is exactly what we expect in ringspot virus resistant genetically modified papaya.

Let me back up a moment at this point and put a disclaimer in here.  I did this informally in the lab to test for use as a teaching tool in my genetics class in the fall.  There is a chance of a false positive here from DNA contamination between samples.  As I said at the beginning, PCR is so powerful it can amplify a single DNA molecule so this is something to always keep in mind.  There are ways to address this with multiple independent extractions in different labs, amplification from different parts of the insert between different samples, looking for unique genetic differences between the samples (if they exist), etc. but I am not going to go into that detail here.

Also, while we are on a cautionary topic, the CMV promoter is from the cauliflower mosaic virus.  It is possible that cauliflower and related brassicas that are not genetically modified may still show up positive because of natural CMV infection.  Agrobacterium infects certain plants naturally and even E. coli can contaminate fruits and vegetables from contaminated water sources (which can be a serious health problem).  But you would not expect to find the CMV - uidA combination above.  That is proof that these papaya (or at least one if we are being hyper-cautious) are transgenic, i.e. genetically modified.

I was surprised that the "solo" variety from the farmer's market showed up positive for the ringspot virus insert.  This is not supposed to be a genetically modified variety.  However, if you do a search for the terms "papaya gmo seed contaminated" you can find that it is quite common for papaya varieties, solo included, to contain the genetic modification here in Hawai'i, unknown to the growers, sellers and buyers using them.

U.S. Supreme Court unanimous in prohibiting patents on naturally occurring human genes

(link to Reuters article) (link to ACLU article)

A victory in the courts!  Finally the law is catching up with reality on gene patents.

A basic fact about patenting is that naturally occurring substances cannot be patented.  I cannot patent water, gold or air and charge you a royalty if you use them.  There has to be an "invention" involved where you have created something new as a prerequisite for patenting.

At the center of this current debate is Myriad Genetics and their patents of BRCA1 and BRCA2.  Having mutant alleles at BRCA1 or 2 elevates a woman's lifetime risk of breast cancer.  The patent allowed Myriad to legally prevent people from testing if they had these alleles unless they bought a $3,000+ testing kit from Myriad.  These are alleles that are a biological part of the people seeking to be tested.  Myriad does not own these people, so how can they have a patent on part of a genome that these people naturally inherited from their parents?

I am fine with patenting genetic constructs that are novel and involve creativity in their design--constructs that do not already exist in nature.  I am also involved in a gene patent application based on our work to engineer underdominance via haploinsufficiency (M 10021/RN Max-Planck-Society for the Advancement of Science).  These kinds of patents can be potentially valuable to the institutes that own them and help drive research into new areas.

However, patents on naturally occurring substances can seriously inhibit medicine and research (see Paradise et al. 2005 and references therein).  Also, not having humanitarian and research exemptions can be a problem.  Imagine that you are developing a new line of research in the lab and then the university grants an exclusive license to a company from a patent based on your research.  Suddenly you are not allowed to continue your own research without permission from the company...  (see Andrews et al. 2006).

Incidentally, I am listed as an "inventor" on a human gene patent application (USPTO App. #20080220429) that stemmed from my postdoc work on lactase persistence in East Africa. I did not initiate patenting of the alleles; I was unaware a patent application had been drafted when the work was published; and transferring ownership of the patent to the University of Maryland was a term of my previous employment.  However, there is often confusion by the public that research scientists are responsible for gene patents rather than the companies and people that employ them.  After it was published I found out about an article in South Africa (Jordan 2009) criticizing me personally for this patent application.  (It also contained a number of other misconceptions, for example, I have never been affiliated with the University of Pennsylvania and I have no record of the author ever attempting to contact me for comment.)

When I first heard about the ACLU's challenge to human gene patents in early 2010 I contacted them and offered to help in any way I could--if my unique position as an "inventor" of gene patents and personal opposition to the granting of gene patents was useful.  They wrote back to me and we had a brief correspondence but in the end they did not take me up on my offer.  However, clearly they did not need any help and in the end were successful with today's court ruling.