Monthly Archives: July 2013

The Bernoulli distribution mean and variance

598px-Ståhlberg_flipping_coin2

One of the simplest probability distributions to begin with is the discrete Bernoulli distribution, named after Jacob Bernoulli (1655-1705).  It can be thought of as a coin toss where the coin either comes up heads (with a value of 1) or tails (with a value of 0).  Other distributions like the binomial can be built up as multiple Bernoulli trials (or the multinomial distribution, which generalizes the binomial to trials with more than two outcomes).

So, if x is the outcome of a coin toss (a Bernoulli trial) and "heads" gives a value of 1, we can say the probability of x=1 is p;

Probability(heads) = P(x=1) = p.

All of the outcomes must sum to one (100%) so the probability of tails, x=0, is the probability the outcome is not heads:

Probability(tails) = P(x=0) = 1 - P(x = 1) = 1-p.

For a fair coin p=1/2.  However, this formulation is general so we could have a trick coin with P(x=1) = p = 1/10, for example.

So what is the expected value of a Bernoulli trial (a coin toss)?  This is pretty intuitive but it doesn't hurt to write it out in systematic "bookkeeping" fashion.  If we flipped the same coin many times and kept track of the outcomes, p fraction of the time x=1.  The rest of the time, 1-p, the outcome is x=0.  So p amount of time we have a value of 1 and 1-p is the amount of the time x=0.  Weighting these outcomes by the frequency we expect to see them gives us an average result of:

\bar{x} = 1 p + 0 (1-p)

The zero cancels out (1-p) so we are left with:

\bar{x} = 1 p = p.

So, if we have a fair coin with p=1/2 we expect an average outcome of \bar{x} = P(x=1) = 1/2. This seems to make sense.

The average (\bar{x}) is another way of saying the expected value of a trial or the "expectation."

\bar{x} = Expectation(x) = E(x) = p.
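The bookkeeping above can be written out as a small Python sketch.  The function name `bernoulli_mean` is made up for this illustration, and exact fractions are used only so that no rounding gets in the way:

```python
from fractions import Fraction

def bernoulli_mean(p):
    """Expected value of one Bernoulli trial: each outcome weighted by its probability."""
    outcomes = {1: p, 0: 1 - p}  # value of x -> probability of that value
    return sum(value * prob for value, prob in outcomes.items())

print(bernoulli_mean(Fraction(1, 2)))   # fair coin, prints 1/2
print(bernoulli_mean(Fraction(1, 10)))  # trick coin, prints 1/10
```

The term for x=0 contributes nothing to the sum, which is the same cancellation seen in the equations.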

Now for the variance, or degree of spread around the average.  Variance is often symbolized by a lower case sigma-squared, \sigma^2.  The standard deviation, \sigma, is the square root of the variance.

Variance is calculated as the average squared difference of individual outcomes, x_i, from the mean.

\sigma^2 = E[(\bar{x} - x_i)^2]

We already know from the mean derivation above that \bar{x} = P(x=1) = E(x) = p.  So we can substitute this in:

\sigma^2 = E[(p - x_i)^2]

We also know that {\color{red} p } amount of the time x={\color{blue} 1 }, which gives us one side of the two mutually exclusive (add the probabilities) outcomes:

\sigma^2 = {\color{red} p } (p - {\color{blue} 1 } )^2 + ...

Also, {\color{red}{1-p}} amount of the time x={\color{blue}{0}} so the full equation becomes:

\sigma^2 = p (p - 1)^2 + {\color{red} (1-p) } (p - {\color{blue} 0 } )^2

Multiplying this out and simplifying gives:

\sigma^2 = p (p - 1)^2 + (1-p) (p - 0 )^2

\sigma^2 = p (p^2 -2p + 1) + (1-p) p^2

\sigma^2 = p^3 - 2p^2 + p + p^2 - p^3

\sigma^2 = - p^2 + p

\sigma^2 = p - p^2

\sigma^2 = p (1 - p)
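One way to sanity check the result is to flip a simulated coin many times and compare the observed mean and variance with p and p(1-p).  This is only a rough sketch: the function name, the number of trials, and the fixed seed are arbitrary choices for the illustration, and the simulated values will only be close to the formulas, not equal to them.

```python
import random

def simulate_bernoulli(p, n_trials=100_000, seed=1):
    """Flip a coin with P(x=1) = p many times; return (mean, variance) of the outcomes."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so the run is repeatable
    xs = [1 if rng.random() < p else 0 for _ in range(n_trials)]
    mean = sum(xs) / n_trials
    # average squared difference of the individual outcomes from the mean
    variance = sum((mean - x) ** 2 for x in xs) / n_trials
    return mean, variance

for p in (0.5, 0.1):
    mean, variance = simulate_bernoulli(p)
    print(f"p={p}: mean ~ {mean:.3f} (expect {p}), "
          f"variance ~ {variance:.3f} (expect {p * (1 - p):.3f})")
```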

Like heterozygosity in Hardy-Weinberg genotype frequencies, the variance is at its greatest values at intermediate frequencies near p=1/2 and declines to zero at p=1 and p=0.

bern-var

If p is indeed zero or one then the outcome is always identical to the mean and therefore the variance is zero--there is no deviation in individual outcomes from the average.  On the other hand, if p=0.5 every outcome will be a value with a distance of 0.5 from the mean (either x_i=1 or x_i=0 with \bar{x}=1/2) and this distance squared is 1/4, so \sigma^2 = 0.25.
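The shape of the curve can also be tabulated directly from \sigma^2 = p(1-p).  The grid of eleven evenly spaced values of p below is an arbitrary choice for the sketch:

```python
from fractions import Fraction

def variance(p):
    """Bernoulli variance from the closed form p(1 - p)."""
    return p * (1 - p)

# p = 0, 1/10, 2/10, ..., 1 as exact fractions
grid = [Fraction(k, 10) for k in range(11)]
for p in grid:
    print(f"p = {float(p):.1f}  variance = {float(variance(p)):.2f}")

best = max(grid, key=variance)  # the p on the grid with the largest variance
print("largest variance at p =", best)
```

On this grid the largest value falls at p=1/2 and the two ends are exactly zero, matching the plot.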

Accidental site directed mutagenesis in PCR product sequence

The last post about my COI "species barcode" sequence has been bugging me.  I wouldn't really expect to find a unique sequence in a small region by chance in a mitochondrial gene.  Many, many mitochondria in humans have been sequenced and the general levels of genetic variation are well understood.  I checked closer and it appeared to be an amino acid altering mutation to a different class of amino acid--not expected at all.

GtoAtaq_error

It was close to one edge of the sequence, so I looked at the other edge and found some more unique changes.  In the sequence snippet above each A that is underlined by an orange box is a G in standard human COI sequences.  I made an assembly with the human reference sequence and added my primers to the alignment.  Then it became obvious that I had made a rookie mistake.  The "mutations" were located in the primer.  These primers were not designed to work only with humans, but to work across a wide range of animals.  They do not match the human sequence exactly.  As the PCR progressed, making more and more copies starting from the annealed primers, the primer sequence was incorporated into the total sequence.

coi-primer-mut-1

The image above shows the reference sequence at the bottom, the primer position is in light green and my sequence is above that.  The changes match the primer (5'-GGTCAACAAATCATAAAGATATTGG-3', the reverse complement of which is 5'-CCAATATCTTTATGATTTGTTGACC-3' in the sequence above).  This is actually a method to engineer specific changes to a DNA sequence, a form of site directed mutagenesis.
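The relationship between the two primer strings quoted above can be checked with a few lines of Python.  This is a minimal sketch that assumes only the four unambiguous bases A, C, G and T; the helper name `reverse_complement` is made up for the example:

```python
# Translation table that swaps each base for its Watson-Crick partner
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse so the result also reads 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

primer = "GGTCAACAAATCATAAAGATATTGG"
print(reverse_complement(primer))  # prints CCAATATCTTTATGATTTGTTGACC
```

Degenerate bases (such as N or R) are not handled by this table and would pass through unchanged.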

Going back to the first side and taking another look:

coi-primer-mut-2

The A is incorporated from the primer (5'-TAAACTTCAGGGTGACCAAAAAATCA-3').  There is also a missing A just inside the sequence (in the 3' direction); taking a closer look at the trace I can agree that an extra A may be at the position (the arrow below points to a "shoulder" on the red A trace that may be the missing nucleotide):

trace-error

There is also a "T" just outside (5' to) the primer sequence.

The Taq polymerase enzyme used in PCR is known to be error prone and sometimes adds A's on to the 3' end of a PCR product (in fact this is exploited in one method, TA cloning, to clone PCR products). By convention sequences are written 5' to 3' so a 3' A will appear as a complementary T in this instance.  Also, note in the previous figure with the first primer an extra A was present on the 3' end (but this also agrees with the reference sequence).

When I trimmed out the edges and carefully curated the sequence I got the alignment below.

me_COI-alignment

100% identical to many human sequences on GenBank.

So, just for the record, my corrected COI DNA sequence, in FASTA format is below:

>me_COI
TAGGTGTTGGTATAGAATGGGGTCTCCTCCTCCGGCGGGGTCGAAGAAGGTGGTGTTGAGG
TTGCGGTCTGTTAGTAGTATAGTGATGCCAGCAGCTAGGACTGGGAGAGATAGGAGAAGTA
GGACTGCTGTGATTAGGACGGATCAGACGAAGAGGGGCGTTTGGTATTGGGTTATGGCAGG
GGGTTTTATATTGATAATTGTTGTGATGAAATTGATGGCCCCTAAGATAGAGGAGACACCT
GCTAGGTGTAAGGAGAAGATGGTTAGGTCTACGGAGGCTCCAGGGTGGGAGTAGTTCCCTG
CTAAGGGAGGGTAGACTGTTCAACCTGTTCCTGCTCCGGCCTCCACTATAGCAGATGCGAG
CAGGAGTAGGAGAGAGGGAGGTAAGAGTCAGAAGCTTATGTTGTTTATGCGGGGAAACGCC
ATATCGGGGGCACCGATTATTAGGGGAACTAGTCAGTTGCCAAAGCCTCCGATTATGATGG
GTATTACTATGAAGAAGATTATTACAAATGCATGGGCTGTGACGATAACGTTGTAGATGTG
GTCGTTACCTAGAAGGTTGCCTGGCTGGCCCAGCTCGGCTCGAATAAGGAGGCTTAGAGCT
GTGCCTAGGACTCCAGCTCATGCGCCGAATAATAGGTATAGTG

Incidentally, this section is now an exact match to the Cambridge reference sequence, NC_012920.

National Academies Summer Institutes - Almost Done

We are near the end of the week-long teaching workshop.  Yesterday we all presented brief (25 minute) teaching units that our six individual groups came up with during the week--designed to use higher level cognitive skills and active learning.  I can say that I saw a few good ideas to use.  I am also impressed with how much we all came up with in such a short time.  I was in the "Evolution" group and we came up with a "Tree Thinking" project where a phylogeny is selected that best represents a set of amino acid sequences.  This is done first on an individual level and then on a sub-group level, with an increase in confidence in their individual result.  Then the results are compared for the entire "class" and it is shown that the class strongly disagrees (which I think is a valuable scientific lesson in itself).  The subgroups then talk to each other and discover that two different sequences are used and generate two different trees.  Then they brainstorm why this might happen and the function of the proteins is revealed--one is alpha hemoglobin, involved in oxygen transport in the blood, and the other is prestin, which affects high frequency sound perception in the inner ear.  The prestin tree groups echolocating mammals together into a clade...  Then the class votes on a final clicker question with two hypotheses to explain the disagreement--in essence they learn an aspect of tree thinking by creating their own tree and realize echolocation evolved twice, as evidenced by the globin sequence, and is responsible for convergent evolution in the prestin amino acid sequence.  We got lots of positive feedback on our unit.

As I said, there are lots of examples and topics that will likely prove useful to me, but one turnoff is that we are required to come up with a similar workshop-like activity for our home institution after this workshop.  One of the activities this week was to group by institution and come up with a plan for this.  We had to "agree" to this, and to use a teachable unit from the workshop in our classes in the next year, to sign up for the workshop in the first place (some/many of us had to sign up and didn't really have a choice in the matter).  This feels like a proselytizing pyramid scheme; we have to agree to spread an idea before we even know or are informed of what it is really all about.  I understand they want to get the word out to as many educators as possible to magnify the return for their efforts, but as I said this arm twisting is a real turnoff to me.  I am trying to place that to the side and keep an open mind about the usefulness of the methods they are presenting.  However, in the former vein there are a lot of tricks used in the workshop that remind me of tricks used with kids to encourage compliance (in schools or summer camps) like guided choices, revised terminology to re-frame and reinforce topics, buy-in and identification with effort and activities, and group cohesion mediated acceptance of indoctrination.

I brought up two things with the instructors.  First, the group learning activities can be a problem in the classroom for people with hearing disabilities.  The multiple group discussions in the same room create a lot of background noise which makes it hard for a certain minority to participate.  I was told that not everyone could be accommodated by the new classroom format.  (The ADA might have something to say about this?  One way around this is to let groups go to different rooms to discuss then regroup together at a later time.)  Second, I and others would like to see more detail of studies that support the effectiveness of the teaching techniques.  After all, we are scientists and we are (or should be) naturally inquisitive, questioning and require evidence to make decisions.  This is presented as "scientific teaching" and the results from some studies have been presented that support some of their claims (and I can intuitively sense some advantages to these methods), but many of us still do not feel satisfied that enough well controlled supporting evidence has been presented.  Also, simply showing the number of studies that are supportive versus the number that are not can be misleading; for example, there could be a publication bias in "reformative" findings versus "negative" results and a bias in the results "looked for" in the people doing this kind of educational study in the first place.  (There is also the real possibility of bias from the instructor "trying harder" in classes where new formats and activities are being added.)  What I want is a T. H. Morgan, who set out to disprove that genes were on chromosomes, ended up proving to the world that they were, and accepted the results of his experiments despite his initial beliefs.  In theory I could compare my own results from classes in different years, but as I said in the last post I use many of these methods already and do not want to get rid of them to do a classic lecture only format.

Anyway, I do not want to come across as completely negative.  This workshop has been useful and has also been an opportunity to network with biologists here in Hawai'i, Alaska and the West Coast.

On a final note, it has been a bit fun to be a student again in a classroom-like format after being the educator for so many years.  One thing I noticed is that faculty can make horrible students.  At times I had the people next to me keep whispering to me during presentations or copy my answers to questions.  Several people were looking at their laptops, tablets, phones rather than the presentation.  Some people spoke up and interrupted seemingly for no other reason than to get themselves heard, often with disruptive "joke" comments.  The presenters did a good job of herding the cats and deflecting attempts of derailment to unrelated topics.  I also found myself second guessing some of the questions "there must be a twist so the obvious answer is not right" and watching for cues from the presenter like telegraphing the answer with their hand movements.  For example, they would show an empty graph with two axes and ask us to draw what we thought the relationship would look like, but when they pointed to the graph and made a drawing motion while telling us the instructions they unintentionally drew the correct answer in the air.  These unintentional subconscious actions tend to occur when people are under stress and overloaded with distractions--like when giving presentations in front of a group.  This also gives me things to think about in the context of my own presentations.

National Academies Summer Institutes

This week I am attending the National Academies Summer Institute(s) on Undergraduate Education.  It is a series of all day workshops with people attending from colleges and universities from all over the West Coast, Alaska and Hawai'i.  The focus is on improving undergraduate biology education through active learning, assessment, using methods that improve learning, ... and the motivation is both that many people, including future policy makers, community leaders, voters, etc. have a very poor understanding of biology (especially genetics and evolution) and that we are not attracting and retaining enough students into STEM fields like modern biology to supply the future workforce needs of the US, so we need to use every opportunity in undergraduate education as effectively as possible.

On the first day we brought an example of one of our class exams and assessed questions as a group in terms of cognitive level in Bloom's taxonomy.  Most student assessment in most classes is at lower cognitive levels like recalling information (memorization) and explaining concepts (comprehension).  Very little moves to the middle ground of applying knowledge in new ways or inference at higher levels of using your knowledge to synthesize new hypotheses, experimental designs or evaluate/appraise/critique concepts based on new and prior knowledge.  There were also presentations on methods of testing/question writing that are more effective in assessing student knowledge, ways to get feedback during lecture (like iclickers), backward design of classes from learning goals to assessment to activities, how to get students to monitor their own knowledge level, etc.  Then we met as smaller groups to start on a group project to design a teachable unit that will be presented to the entire workshop with a smaller activity for the workshop to participate in.

To be honest, I am not sure how much I will really get out of this workshop (much of what we are doing so far seems like, at least on the surface, pointless semantics and I am naturally suspicious of anything that appears to contain a lot of buzzwords even if I am not familiar with them, but I am trying to keep an open mind).  I already use iclickers for class feedback during lecture, some active learning techniques and inverted classroom techniques, aspects of backward design and assess some higher level cognition with my exam questions (where students are expected to synthesize prior knowledge/experience to answer new questions in new situations they have not seen before and/or evaluate competing hypotheses)--not that there isn't room for a lot of improvement.  There is also the practical issue of time and effort that goes into some of these approaches--we have a lot of ground to cover in my undergraduate class and at some points we have to move through topics quickly and frankly do not have time for many detailed activities in lecture.  ...  At the very least however I am already getting some good examples that I can use in my class this fall.

My COI Sequence

I have been trying different methods to isolate DNA from cheek cells.  I tried a "BuccalAmp" kit on a recommendation and it works really well.  It is quick and easy.  I tried the COI barcode PCR on my own DNA sample and compared the results to the earlier DNA extractions and PCR from three species of insects.  First here is part of the aligned traces from forward and reverse sequencing of the amplified product.

me-coi

And here is an alignment with the fruitfly, mantis and mosquito sequences (the basepairs that are different are highlighted).

coi-four-species

You can probably see from the alignment that a lot of the changes are with my sequence and that the insects are more closely related to each other.  There are a few incongruent homoplasies however.  Here is one at position 127 in the alignment.

human-mosquito-homoplasy

Humans and mosquitoes share an "A" at this position while fruit-flies and mantises share a "G."  Recall that mosquitoes are more closely related to fruit-flies than to mantises.  This is just an example of mutations that have occurred more than once at the same position.  Making a "tree" to represent the mutations that are shared and different gives this pattern:

nj-tree-four-species-coi

There is a long distance between me and the insects, as expected.  Additionally, the mosquito (C. quinquefasciatus) groups closer to the fruit-fly (D. melanogaster) than to the mantis (T. sinensis).  The number in the middle is the bootstrap support value for the tree.  In bootstrapping, the aligned positions (columns of basepairs) are randomly sampled with replacement to make new alignments the same length as the original, so by chance some positions are left out while others are included more than once.  The percentage of these resampled data sets in which the grouping of species below that node in the tree is found is then indicated.  So, mosquitoes and fruit-flies are grouped together 96.9% of the time with random resampling of their sequences (rather than one of these being closer to the mantis), which indicates a high amount of support from the data-set that this is the actual evolutionary pattern (with assumptions...).
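The resampling idea can be sketched in Python.  To be clear about the assumptions: the three short sequences below are invented toy data and not my real alignment, the function names are made up, and instead of rebuilding a full tree for every replicate this simply asks whether two taxa are closer to each other (by a count of mismatches) than either is to a third.  It illustrates the bootstrap step only and is not the tree building method used for the figure.

```python
import random

def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def bootstrap_support(alignment, pair, other, n_reps=1000, seed=1):
    """Fraction of bootstrap replicates in which the two taxa in `pair`
    are strictly closer to each other than either is to `other`."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for repeatability
    length = len(next(iter(alignment.values())))
    a, b = pair
    hits = 0
    for _ in range(n_reps):
        # sample alignment columns with replacement, same length as the original
        cols = [rng.randrange(length) for _ in range(length)]
        rep = {name: [seq[i] for i in cols] for name, seq in alignment.items()}
        d_pair = hamming(rep[a], rep[b])
        if d_pair < hamming(rep[a], rep[other]) and d_pair < hamming(rep[b], rep[other]):
            hits += 1
    return hits / n_reps

# Hypothetical toy alignment, for illustration only
toy = {
    "fruitfly": "ACGTACGTACGTACGTACGT",
    "mosquito": "ACGTACGAACGTACGTACTT",
    "mantis":   "ACCTATGTACGAACGGACGA",
}
support = bootstrap_support(toy, ("fruitfly", "mosquito"), "mantis")
print(f"fruitfly + mosquito grouped in {100 * support:.1f}% of replicates")
```

Because whole columns are resampled, every replicate keeps the same position lined up across all of the species, which is what lets the distances be recomputed each time.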

This mitochondrial sequence is inherited from my mother, and her mother, on back in a maternal lineage, changing only occasionally by mutations.  My brother and sister share the same sequence as myself but I will not pass it on to my children, because mitochondria are not transmitted from the father, only from the mother.  I compared my sequences to others on GenBank and found something funny.

my-coi-difference

Near the beginning of the sequence I have an "A" at position 25 (above) that begins a string of six A's.  However, none of the other human mitochondrial sequences on GenBank share this.  Below is a comparison of 25 of the most similar human mitochondrial sequences from GenBank, with my sequence at the end.  They are all identical to mine except at this position (position 18 in the alignment below):

me-coi-trim-alignment

They all have a G followed by five A's.  Since this seems to be rare, it might be interesting to see if anyone out there has the same sequence.  I cannot go back very far, only about four generations, along my maternal lineage.  My matrilineal great-grandmother was Sarah J. Carlisle (her maiden name), born in 1873 in North Carolina; her mother was Dicey (which might be short for Leodicia) Owenby (or Owensby), born about 1834 in North Carolina, and her mother might have been Sarah Hunter, born around 1815 in South Carolina--just in case anyone who might be related reads this.