Wednesday, May 28, 2014

How does bowtie2 assign MAPQ scores?



Bowtie2 is an ultra-fast program, written by Ben Langmead, for aligning next generation sequencing reads to large genomes, and it happens to be my aligner of choice. RTFM here. Download it here, or learn how to get it on Mac OS X with "homebrew" from one of my previous posts, "The Craft of Homebrew-ing for Mac OS".

Recently, I discussed "The slow death of the term 'uniquely mappable'". There I talked about the older definition of "uniquely mappable", which describes a read that maps to only a single site in the genome, and how it has become a less-than-reliable way to extract only high quality mappings. I also talked about the newer definition of uniquely mappable, which describes any read for which the alignment score of the best alignment is better than that of the 2nd best and all other alignments. Reads that meet the old definition are a subset of the reads in the newer definition since their alignment score is by default better than the scores in the empty set of other possible alignments. To disambiguate the term "uniquely mappable read" or "uniread", I refer to any read that meets the newer definition as a "maxiread" because it has a single position with the maximum alignment score. To reiterate in the newer language, the traditionally defined "unireads" make up a proper subset of the set of maxireads. Even though the newer definition encompasses more possible high quality mapped reads, it does not by itself eliminate mapped reads whose best or only alignment score is poor (and whose 2nd best is really poor); it just takes the best of a totally horrible situation. Thus, the value of the alignment score itself also needs to be taken into consideration, not just whether it is the only score or better than the next best one. Mapping quality (MAPQ or MQ) scores are used by aligners such as bowtie2 and bwa to achieve this goal. Briefly, MAPQ is supposed to represent a transformation of p, the probability a read is mapped wrong, by:

                       MAPQ = -10 * log10(p)


                       

Read more background about MAPQ scores, unireads, multireads, and maxireads in my previous post, "The slow death of the term 'uniquely mappable' in deep sequencing studies and resisting the "conservative" urge to toss out data". In practice, MAPQ ranges from 0 to 37, 40, or 42, depending on which aligner you are using. My experiences below refer to bowtie2 MAPQ scores. These MAPQ scores are not calculated with the equation above; in fact, as far as I know, no program does that. Each program has its own way to estimate or approximate the MAPQ that it then assigns to a read mapping. The goal of this post is to talk about how bowtie2 sees mismatches, alignment scores, and MAPQ scores, then to translate into English, through tiny experiments and looking into the code, what each MAPQ score from bowtie2 means. I also provide python code to calculate bowtie2-style MAPQs yourself. The code itself might be better than English for some people (assuming anyone other than spam bots reads this).
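
To see what that transformation implies in practice, here is a tiny Python illustration of the textbook relationship. This is just the equation above turned into code; as discussed below, it is not what bowtie2 actually computes.

import math

def theoretical_mapq(p_wrong):
    '''Textbook MAPQ: -10*log10(p), where p is the probability the mapping is wrong.'''
    return -10 * math.log10(p_wrong)

# A read assigned uniformly at random to one of 2 equally good sites has a
# 0.5 chance of being placed wrong, so its theoretical MAPQ is ~3.
print(theoretical_mapq(0.5))     # ~3.01
print(theoretical_mapq(0.01))    # 20.0
print(theoretical_mapq(0.0001))  # 40.0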

When I started learning bioinformatics to analyze NGS datasets, I knew I should filter out all reads with a MAPQ lower than X, but it was not clear what X should be. One trusted colleague who works on Drosophila told me to keep reads with MAPQ >= 30. However, in practice this seemed too conservative when working with reads (which can have errors) from a human cancer cell line, which has SNPs, microdeletions, microinsertions, breakpoints, etc. - i.e. things that make the alignment score lower even at the "correct" site. Searching the forums (e.g. seqanswers and biostars) did not lead to a concrete answer. I read one thread that concluded any read with a MAPQ >= 10 is fine, and another that testified that any MAPQ >= 1 is fine. Both claimed that after 1 or 10, all of the reads are "unique" (usually it is not clear which definition someone means). Then someone else said a MAPQ of 255 is for reads with a unique mapping (presumably referring to the older definition). That, however, is contradicted by the SAM format specifications, which say that 255 is to be used for mapped reads where the MAPQ is not available. Another thread provided the seeming gem that a MAPQ of 0 is for reads with 10+ mapping locations, a MAPQ of 1 is for reads with 4-9 locations, a MAPQ of 2 is for reads with 3 locations, a MAPQ of 3 is for reads with 2 locations, and presumably anything with a MAPQ of 4 or better mapped to 1 location with increasingly good alignment scores. If you use the equation above, you will find that this set of MAPQ descriptions essentially came from that equation rather than from an actual breakdown of what bowtie2, for example, is doing. Yet other people said that multireads in bowtie2 are only given a 0 or a 1 and uniquely mappable reads get scores of 2 or higher (with no indication of what their definition of "uniquely mappable" was). Contrary to MAPQs of 0 and 1 being reserved for multireads, I had uniquely mapped (single site, old definition) reads with MAPQ scores of 0 to explain. Thus, I set out on a series of tiny experiments and looked for snippets of the code to tell me exactly what bowtie2 does.

First, here is a little bit to know about bowtie2 end-to-end mode:


- The maximum alignment score possible is 0. From there penalties are subtracted.

- MX is the maximum penalty and MN is the minimum penalty. These are tune-able parameters. The defaults are MN=2 and MX=6. 

- Recall that base call qualities range from 0 (lowest) to 40 (highest) and are referred to here by the variable Q (or BQ) to distinguish them from MAPQ.

- When a mismatch is encountered, a penalty between MN and MX is subtracted.
                penalty = MN + floor( (MX-MN)*(MIN(Q, 40.0)/40.0) )
                   Let's say: Diff = MX-MN
                         And note: MN + Diff = MX
                   Let's say: fraction = Q/40
                          That is, "fraction" is the base quality, Q, divided by the best possible base quality (40)
                   And let's ignore "floor".
                     Then, 
                                           penalty = MN + Diff*fraction

--> So if the base quality is perfect (40), then 
                                                    fraction = 40/40 = 1
                                                          and,
                                                     penalty = MN + Diff  = MX = 6 (default)

--> And the lower quality the base call, the more we believe it is an error so the less of a penalty we give the mismatch. For example if Q = 0, then 
                                                    fraction = 0/40 = 0
                                                          and,
                                                     penalty = MN + 0 = MN = 2 (default)
                              Or if Q = 20, then 
                                                    fraction = 20/40 = 0.5
                                                          and,
                                                    penalty = MN + 0.5*Diff = 4 (default)

- In summary, MN is subtracted when the base call quality is 0. This is because when the base call quality is low, we assume the mismatch is more likely due to a base call error than to a real mismatch. MX is subtracted when the base call quality is 40. This is because when the base quality is high, we assume the mismatch is more likely to be real. It could be due to a real SNP though, which gives rise to the unfortunate case where we penalize unjustly when aligning the read to the "correct" spot on the reference. All other mismatches, over bases with quality Q=1-39, get penalties between MN and MX. If you choose to ignore quality scores, MX is always subtracted, because you are telling bowtie2 to always assume the mismatch is real rather than potentially caused by a sequencing error.

- Other penalties and defaults:
     N penalty: default -1 for every position with an N (whether the N is in the read, the reference, or both)
     gap open penalty: default -5 for both read and ref
     gap extend penalty: default -3 for both read and ref



- Bowtie2 considers an alignment valid if its score is greater than or equal to the minimum alignment score allowed

- Bowtie2 uses a function to determine the minimum alignment score, which is dependent on the length of the read:
                               score minimum   =   f(readLength)   =   b + m * readLength

- Bowtie2 allows you to set the "b" and "m" in the minimum alignment score function with the "--score-min" flag

- The default values are b = -0.6 and m = -0.6:
                            default score minimum   =   f(readLength)   =   -0.6 + -0.6 * readLength

- For a 50 bp single end (SE) read, this translates to a minimum alignment score of -30.6:
                          default score minimum   =   f(50)   =   -0.6 + -0.6 * 50 = -30.6

- To put this in terms of mismatches, the default cutoff will accept up to 5 mismatches over high quality (Q=40) bases (though MAPQ goes to 0 after 3), up to 7 mismatches over 'medium' quality (Q=20) base calls (though MAPQ goes to 0 after 5), and up to 15 mismatches over the lowest quality (Q=0) bases (though MAPQ goes to 0 after 10 mismatches). The reason MAPQ hits 0 at a different number of mismatches for each quality is that MAPQ only cares about the alignment score; those are simply the mismatch counts at which the alignment score becomes bad enough for the MAPQ to drop to 0 (see the small Python sketch below).
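
To make that arithmetic concrete, here is a small Python sketch of the default end-to-end scoring described above (MN=2, MX=6, score minimum of -0.6 + -0.6*readLength). The function names are mine, not bowtie2's.

import math

def mismatch_penalty(Q, MN=2, MX=6):
    '''Default bowtie2 end-to-end penalty for a mismatch over a base with quality Q.'''
    return int(MN + math.floor((MX - MN) * (min(Q, 40.0) / 40.0)))

def score_min(readLength, b=-0.6, m=-0.6):
    '''Default minimum valid end-to-end alignment score: b + m * readLength.'''
    return b + m * readLength

def max_mismatches(readLength, Q):
    '''Most mismatches over quality-Q bases that a valid end-to-end alignment can tolerate.'''
    return int(abs(score_min(readLength)) // mismatch_penalty(Q))

print(score_min(50))                                 # -30.6
print([mismatch_penalty(q) for q in (0, 20, 40)])    # [2, 4, 6]
print([max_mismatches(50, q) for q in (40, 20, 0)])  # [5, 7, 15]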


How bowtie2 describes its alignments: 

Bowtie2 output is in SAM format. For more information on SAM format files, see the SAM format specifications. For now, the summary here will suffice.

Fields 1-11 are fixed.
Fields 12+ are not. They may or may not be present.

Field    What You Find There
1          readname
2          bitflag -- for SE bt2 mapping, only 0 (pos strand), 4 (unmapped), 16 (neg strand) are relevant
3          chrName/refname
4          1-based start on fwd strand (this is the end if on neg strand)
5          MAPQ
6          CIGAR string rep of alignment (most are 50M which means 50 match+mismatch)
7          chrName/refName of mate if PE; * if SE
8          1-based start of mate; 0 if SE
9          Inferred fragment length if PE; 0 if SE
10        the read's sequence
11        the read's base call qualities (Qs)

Optional field formats:
All optional fields follow the TAG:TYPE:VALUE format where:
TAG is a two-character string
TYPEs: 
     A (printable character)
     i (integer)
     f (float)
     Z (printable string including space)
     H (byte array in Hex format)
     B (integer or numeric array)
*only types i and Z apply to bt2 optional fields

Optional fields 12+:
AS:i:<N> aln score
XS:i:<N> aln score for next best aln (only reported if >1 aln found)
YS:i:<N> aln score for mate (PE only)
XN:i:<N> number of ambiguous bases in the reference covering the aln (#Ns)
XM:i:<N> number of mismatches in aln
XO:i:<N> number of gap opens
XG:i:<N> number of gap extensions
NM:i:<N> the edit distance (whereas XM is more like a Hamming distance; the two are equal only if no gaps were present)
Note: if XM == NM, then XO and XG == 0
YF:Z:<S> reports why bt2 ignored this read (filtered out) if it did -- see codes in bt2 manual
YT:Z:<S> reports on pair information
all SE reads are given: YT:Z:UU -- meaning they are not part of a pair
MD:Z:<S> A string representation of the mismatched reference bases in the alignment. See SAM specs.
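
If you would rather pull these tags out in Python than with cut and grep, here is a minimal sketch; the function name and the toy example line are mine.

def parse_sam_line(line):
    '''Return (read name, MAPQ, AS, XS) from a single SAM alignment line.
    AS or XS is None if that optional tag is absent (e.g. no XS for unireads).'''
    fields = line.rstrip('\n').split('\t')
    name, mapq = fields[0], int(fields[4])
    tags = {}
    for tag in fields[11:]:
        TAG, TYPE, VALUE = tag.split(':', 2)
        tags[TAG] = int(VALUE) if TYPE == 'i' else VALUE
    return name, mapq, tags.get('AS'), tags.get('XS')

# toy bowtie2-style SE line (tab-separated fields):
line = 'read2\t0\tchr2\t1\t1\t4M\t*\t0\t0\tACGT\tIIII\tAS:i:0\tXS:i:0\tYT:Z:UU'
print(parse_sam_line(line))   # ('read2', 1, 0, 0)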


Methods:

Making reads:
I randomly generated 50 bp sequences in python using this code:

import random

def genSeq(seqLen, numSeq, probs=True):
    '''Return numSeq random DNA sequences, each seqLen bases long.
    Depends on the random module. (probs is accepted but unused here.)'''
    bases = 'ACGT'
    seqs = []
    for i in range(numSeq):
        seq = ''
        for j in range(seqLen):
            index = random.randint(0, 3)  # pick one of A, C, G, T uniformly
            seq += bases[index]
        seqs += [seq]
    return seqs


Since I only made a few, reads were manually turned into fasta or fastq format. Note that when mapping with bowtie2, fasta reads are assumed to all be of the highest quality base calls (base quality scores of 40). In fastq files, I either made all bases in a read have base call qualities of 20 or 0.

Example reads:
                       Fasta Read:
                                 >read1
                                 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA


                       Fastq Read Q20:
                                 @read1_Q20
                                 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
                                 +
                                 55555555555555555555555555555555555555555555555555


                       Fastq Read with 6 mismatches (Q20):
                                 @read1_6mm_Q20
                                 TAACCCTAACCCTAACCCTAACGCTAACCGTAACCCAAACCCAAACCCAT
                                 +
                                 55555555555555555555555555555555555555555555555555
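
The quality strings above are just the Phred+33 encoding of a uniform Q: chr(Q + 33) is '5' for Q=20 and '!' for Q=0. Here is a sketch of how such records could be generated (I actually made mine by hand; the function name is mine).

def fastq_record(name, seq, Q):
    '''Format one fastq record where every base gets the same Phred quality Q.'''
    qual = chr(Q + 33) * len(seq)   # Phred+33 encoding: Q20 -> '5', Q0 -> '!'
    return '@{0}\n{1}\n+\n{2}\n'.format(name, seq, qual)

print(fastq_record('read1_Q20', 'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA', 20))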



Making chromosomes to map the reads to:
To start, I made eleven 50 bp fasta reads named read1 - read11. Each read was made into a "chromosome" by repeating the read sequence to give as many copies as the integer in its name. For example, read1 was kept as is and renamed chr1, read2 was lengthened by repeating it once for a copy number of 2 and renamed chr2, and read3 was lengthened by repeating it 2 times and renamed chr3. The randomly generated 50 bp reads throughout these experiments have pairwise Hamming distances between 27 and 47 (median = 38). These parameters hold true even when generating thousands of pairwise comparisons this way. Thus, these random reads only ever align to their corresponding chromosome, even when mismatches are engineered into them. For example, read1 should align only to chr1 (once), read2 should align only to chr2 (twice), read3 should align only to chr3 ('thrice'), etc...

Example "chromosomes":
>chr1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA

>chr2
ACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAAACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAA

>chr3
GTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGGGTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGGGTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGG
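
Building the chromosome fasta can be scripted the same way: repeat each read's sequence as many times as its copy number and name the record chrN. A sketch follows; the function name, file name, and example copy numbers are just illustrative.

def write_chromosomes(reads, copy_numbers, path='chr.fa'):
    '''Write each read repeated copy_number times as a fasta "chromosome" named chrN.'''
    with open(path, 'w') as out:
        for seq, n in zip(reads, copy_numbers):
            out.write('>chr{0}\n{1}\n'.format(n, seq * n))

# e.g. write_chromosomes(genSeq(50, 3) + genSeq(50, 1), [1, 2, 3, 100])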

Indexing the fasta file of chromosomes:
bowtie2-build -f chr.fa chr

Randomly adding mutations to reads (or chromosomes):
To randomly mutate a read, I just manually changed a base to another base since these were such tiny experiments. Reads with mismatches were given names that indicated the read they derived from, how many mismatches they carried, and, when base call quality mattered (fastq files), the Q value given to all bases (as described above, every base in a read was assigned the same base call quality score).

Example read names:
read1_1mm
read1_1mm_Q20



Mapping reads:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa 
-f used when fasta provided for reads
-a used when I wanted to count all possible alignments for a read
As the above command illustrates, the defaults for the minimum alignment score cutoff and for the mismatch, gap open, and gap extend penalties were always used.
Only end to end alignments were performed in these experiments.

Counting number of times a chromosome was mapped to:
In general the following command was used to find all alignments for each read:
   bowtie2 --very-sensitive -N 1 -x chr -U reads.fa -a 
-f used when fasta provided for reads

Since bowtie2 does not have a SAM flag that reports how many times a read aligned, I used various command line tricks to count how many times each read mapped such as:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa -f -a | samtools view -F 4 -S - | cut -f 1 | uniq -c


Extracting alignment scores for the best (AS:i:<N>) and second best (XS:i:<N>) alignments:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa | grep -v ^@ | cut -f 1,3,5,12,13
-f used when fasta provided for reads
Note that if there is not a second best alignment, then the above command extracts XN:i:<N>



Experiments, Results, and Conclusions:

For each experiment, note that reads only ever aligned to their respective chromosomes (e.g. read1 to chr1; see methods). Moving forward, keep in mind that for bowtie2, AS is the alignment score of the best alignment and XS is the alignment score of the second best alignment. By design, in these experiments AS always equals XS whenever an XS exists. So these are "true multireads", as opposed to "maxireads", which have a single best alignment score but other mappings with poorer scores. 

Experiment 1 - true multireads with "perfect" base qualities only receive MAPQs of 0 or 1:
I made reads 1-11 and read100 (fasta format) and their corresponding chromosomes chr1-chr11 and chr100 with copy numbers of 1-11 and 100, respectively (see methods). I also made 6 more derivative reads with 1-6 mismatches, respectively, from each of read1, read2, read3, and read100. Since all base calls are considered perfect for fasta reads, and because of the bowtie2 defaults, the maximum number of mismatches allowed for a valid alignment should be 5. The read with 6 mismatches should not map. One of the hypotheses that I tested was that bowtie2 MAPQ scores follow the rules in one forum stating 0 means the read mapped 10+ times, 1 means 4-9 times, 2 means 3 times, and 3 means 2 times.

Results:
Read Name  NumAligns    AS             XS         MAPQ (for best position)
read1         1         AS:i:0         *              42
read1_1mm     1         AS:i:-6        *              40
read1_2mm     1         AS:i:-12       *              23
read1_3mm     1         AS:i:-18       *              3
read1_4mm     1         AS:i:-24       *              0
read1_5mm     1         AS:i:-30       *              0
read1_6mm     0         *              *              * 
read2         2         AS:i:0         XS:i:0         1
read2_1mm     2         AS:i:-6        XS:i:-6        1
read2_2mm     2         AS:i:-12       XS:i:-12       0
read2_3mm     2         AS:i:-18       XS:i:-18       0
read2_4mm     2         AS:i:-24       XS:i:-24       0
read2_5mm     2         AS:i:-30       XS:i:-30       0
read2_6mm     0         *              *              * 
read3         3         AS:i:0         XS:i:0         1
read3_1mm     3         AS:i:-6        XS:i:-6        1
read3_2mm     3         AS:i:-12       XS:i:-12       0
read3_3mm     3         AS:i:-18       XS:i:-18       0
read3_4mm     3         AS:i:-24       XS:i:-24       0
read3_5mm     3         AS:i:-30       XS:i:-30       0
read3_6mm     0         *              *              * 
read4         4         AS:i:0         XS:i:0         1
read5         5         AS:i:0         XS:i:0         1
read6         6         AS:i:0         XS:i:0         1
read7         7         AS:i:0         XS:i:0         1
read8         8         AS:i:0         XS:i:0         1
read9         9         AS:i:0         XS:i:0         1
read10        10        AS:i:0         XS:i:0         1
read11        11        AS:i:0         XS:i:0         1
read100       100       AS:i:0         XS:i:0         1
read100_1mm   100       AS:i:-6        XS:i:-6        1
read100_2mm   100       AS:i:-12       XS:i:-12       0
read100_3mm   100       AS:i:-18       XS:i:-18       0
read100_4mm   100       AS:i:-24       XS:i:-24       0
read100_5mm   100       AS:i:-30       XS:i:-30       0 
read100_6mm   0         *              *              * 


All AS and XS were as expected for 0-5 mismatches: 0, -6, -12, -18, -24, and -30 respectively. The MAPQ for read1 was 42 because there was only a single perfect alignment. The MAPQs for read1 with 1-5 mismatches were 40, 23, 3, 0, and 0 respectively. The MAPQ for all other reads with 0 mismatches was 1 regardless of whether it had 2-100 alignments. The MAPQ for all reads with 1 mismatch was 1 regardless of whether it had 2-100 alignments.  The MAPQ for all reads with 2-5 mismatches was 0 regardless of whether it had 2-100 alignments.

Conclusions:
The hypothesis stated above is false. In bowtie2, the best a true multiread (AS=XS) can get is MAPQ=1, regardless of how low or high its multiplicity. This occurs when there are 0 or 1 mismatches over perfect base calls in the read, i.e. when AS=XS is between 0 and -6. When there are 2-5 mismatches over perfect base calls (or AS=XS <= -12, i.e. -12 to -30.6), the MAPQ becomes 0. 

If someone wanted to exclude "true multireads" from their data set, using MAPQ >= 2 would work. This would also exclude any uniquely mapping reads with >=4 mismatches over high quality bases. In terms of high quality bases and unireads, MAPQ >= 3 allows up to 3 mismatches, MAPQ >= 23 allows up to 2 mismatches, MAPQ >= 40 allows up to 1 mismatch, and MAPQ >= 42 allows 0 mismatches. There will also be other "maxireads" in most or all of these sets.
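
For example, a MAPQ >= 2 filter can be applied directly to the SAM text; samtools view -q 2 on a SAM/BAM does the same thing. Here is a minimal Python sketch (the function and script names are mine).

import sys

def filter_by_mapq(sam_lines, min_mapq=2):
    '''Yield header lines plus alignments with MAPQ >= min_mapq
    (MAPQ >= 2 drops true multireads under the default end-to-end scoring).'''
    for line in sam_lines:
        if line.startswith('@') or int(line.split('\t')[4]) >= min_mapq:
            yield line

# usage: python filter_by_mapq.py < aligned.sam > filtered.sam
if __name__ == '__main__':
    for line in filter_by_mapq(sys.stdin, min_mapq=2):
        sys.stdout.write(line)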




Experiment 2 - true multireads with "medium" and "poor" base qualities only receive MAPQs of 0 or 1:
For this mini-experiment, I converted the first 6 fasta format reads above into fastq format reads to have control over base call quality values. Each read was given either all Q=20 or all Q=0 base qualities. Note that in experiment 1 all base calls were treated as Q=40 since the reads were from a fasta file. Here I wanted to see how poorer base qualities affect the MAPQs, for example, of multireads. Moreover, I wanted to see if they had any effect on "perfect matches". Note that with these lower quality bases, more mismatches are allowed to accrue.

Note that Q0 and Q20 entries are interleaved in the table below.
Read Name      NumAligns    AS             XS         MAPQ (for best position)
read1_Q20         1         AS:i:0         *              42
read1_1mm_Q20     1         AS:i:-4        *              42
read1_2mm_Q20     1         AS:i:-8        *              40
read1_3mm_Q20     1         AS:i:-12       *              23
read1_4mm_Q20     1         AS:i:-16       *              8
read1_5mm_Q20     1         AS:i:-20       *              3
read1_6mm_Q20     1         AS:i:-24       *              0
read1_1mm_Q0      1         AS:i:-2        *              42
read1_2mm_Q0      1         AS:i:-4        *              42
read1_3mm_Q0      1         AS:i:-6        *              40
read1_4mm_Q0      1         AS:i:-8        *              40
read1_5mm_Q0      1         AS:i:-12       *              23
read1_6mm_Q0      1         AS:i:-12       *              23
read1_7mm_Q0      1         AS:i:-14       *              23
read1_8mm_Q0      1         AS:i:-16       *              8
read1_9mm_Q0      1         AS:i:-18       *              3
read1_10mm_Q0     1         AS:i:-20       *              3
read1_11mm_Q0     1         AS:i:-22       *              0
read2_Q20         2         AS:i:0         XS:i:0         1
read2_1mm_Q20     2         AS:i:-4        XS:i:-4        1
read2_2mm_Q20     2         AS:i:-8        XS:i:-8        1
read2_3mm_Q20     2         AS:i:-12       XS:i:-12       0
read2_3mm_Q0      2         AS:i:-6        XS:i:-6        1
read2_4mm_Q0      2         AS:i:-8        XS:i:-8        1
read2_5mm_Q0      2         AS:i:-10       XS:i:-10       0
read3_Q20         3         AS:i:0         XS:i:0         1
read3_1mm_Q20     3         AS:i:-4        XS:i:-4        1
read3_2mm_Q20     3         AS:i:-8        XS:i:-8        1
read3_3mm_Q20     3         AS:i:-12       XS:i:-12       0
read4_Q20         4         AS:i:0         XS:i:0         1
read4_1mm_Q20     4         AS:i:-4        XS:i:-4        1
read4_2mm_Q20     4         AS:i:-8        XS:i:-8        1
read4_3mm_Q20     4         AS:i:-12       XS:i:-12       0
read5_Q20         5         AS:i:0         XS:i:0         1
read5_2mm_Q20     5         AS:i:-8        XS:i:-8        1
read5_3mm_Q20     5         AS:i:-12       XS:i:-12       0
read6_Q20         6         AS:i:0         XS:i:0         1
read6_2mm_Q20     6         AS:i:-8        XS:i:-8        1
read6_2mm_Q0      6         AS:i:-4        XS:i:-4        1
read6_4mm_Q0      6         AS:i:-8        XS:i:-8        1
read6_5mm_Q0      6         AS:i:-10       XS:i:-10       0
read6_6mm_Q0      6         AS:i:-12       XS:i:-12       0

When read1 (which aligns to a single position) had either 0 or 1 mismatch over a Q20 base, it received a perfect MAPQ of 42. When it had 1-2 mismatches over Q0 bases, it also still had a perfect MAPQ. MAPQs remain non-zero for up to 5 Q20 mismatches, up to 10 Q0 mismatches, and, from expt 1, up to 3 mismatches over Q40 bases. What those 3 cutoffs have in common are alignment scores >= -20 (i.e. between 0 and -20), while the one additional mismatch in each case that caused a MAPQ of 0 produced alignment scores <= -22 (i.e. -22 to -30.6). Thus, the AS cutoff for having a non-zero MAPQ is somewhere between -20 and -22. All true multireads (AS=XS), no matter what the copy number, still could only have a MAPQ of 0 or 1. The MAPQ was 1 when the AS was >= -8 (i.e. 0 to -8) and 0 when the AS was <= -10 (i.e. -10 to -30.6). So, a true multiread (AS=XS) gets a MAPQ of 1 if its AS is "good" and a 0 if it is "bad", and the cutoff between good and bad is somewhere between -8 and -10. 




An English translation of how bowtie2 assigns MAPQs from the code and experiments

Here I use the experiments above and the actual code to provide a useful table on what select MAPQ cutoffs mean.  The code I looked at was actually from the C code version by Devon Ryan here.

Bowtie2 does not use the number of times a read mapped in the calculation for MAPQ. Instead, it just uses the AS and the XS, which is either >= the minimum score or below it. If AS == XS, the read is considered a true multiread and can only get a score of 0 or 1. If XS is below the minimum score, the read is considered a true uniread under the given scoring scheme. True unireads can get scores of 0, 3, 8, 23, 24, 40 and 42. A score of 0 is reserved for the true unireads with an AS in the bottom 30% of allowable scores. The scores 3, 8, 23, 24, 40, and 42 are unique to true unireads. Therefore, if someone is hell bent on taking only "unireads" with decent alignment scores, the way to do it would be to take only reads with those scores.

awk '$5 == 3 || $5 == 8 || $5 == 23 || $5 == 24 || $5 == 40 || $5 == 42' file.sam

The other possibility is to grep out reads that do not have an XS:i: field:

grep -v XS:i: file.sam


Note that it is important to grep for "XS:i:" and not just "XS", which would also match reads whose base quality strings happen to contain "XS". If you use this approach, you will also get the low quality MAPQ=0 true unireads. Though it is possible to extract only "true unireads" that way, note that since mismatches and gaps were allowed during these alignments, this may be far more conservative than you want to be. It may also be just kind of wrong to do. Read my previous post about why.

Other maxireads can get scores between 0-39. These are the exact scores that can be assigned: 0-7, 11-12, 14-22, 25-27, 30-39. It would seem that the scores 9, 10, 13, 28, 29, and 41 can not be assigned in bowtie2.  Though maybe someone can prove me wrong by looking at their bt2 outputs.


Recall, "true multireads" are those where AS=XS, "true unireads" are reads that have only a single mapping position in the scoring scheme, and "maxireads" include true unireads and reads where AS > XS. The following are MAPQ cutoffs and how they will affect each of these types of reads.  Note, however, that other "maxireads" will be in each set. These experiments just do not elucidate which ones. This table puts MAPQs in terms of maximum number of mismatches possible over a certain quality base, but combinations of base qualities are more likely. Also remember gaps can occur as well. 

MAPQ >= X   #MM Q40   #MM Q20      #MM Q0    Description
0           5         7            15        All mappable reads
1           3         5            10        True multi w/ "good" AS, maxi of MAPQ >= 1
2           3         5            10        No true multi, maxi of MAPQ >= 2
3           3         5            10        No true multi,  maxi of MAPQ >= 3
8           2         4            8         No true multi, maxi of MAPQ >= 8
23          2         3            7         No true multi, maxi of MAPQ >= 23
30          1         2            4         No true multi, maxi of MAPQ >= 30
39          1         2            4         No true multi, maxi of MAPQ == 39*
40          1         2            4         No true multi, only true uni-reads
42          0         1            2         Only "perfect" true unireads
                       

The next table puts it in terms of AS, XS, and minimum score. Maxireads refer to any read where AS > XS, but for now not when there is no XS. "classic unireads" will refer to when there is no XS. "Multireads" refer to AS=XS true multireads. I will use the term percentile to mean that percentile of all possible alignment scores. For example, AS in the 80th percentile means the AS is >= 80% of all possible alignment scores given the minimum score allowed -- here the column will just say 80th to mean all reads with AS >= 80% of all possible AS.  "dAS" will refer to the absolute value of AS-XS and how this compares to the absolute value of the minimum score. Thus, dAS_80th would mean that the difference between the best alignment score and the 2nd best is equivalent to an AS in the 80th percentile.

##TODO :-P -- read the python code
MAPQ
>= X  multi   maxi       maxi        
0     all     all        all        
1     68th    30th       dAS<=30% & AS 68th OR dAS in 30th*
2     None    30th       dAS<=30% & AS 68th OR dAS in 30th*
3     None    30th       dAS<=30% & AS 68th OR dAS in 30th*
4     None
5     None
6     None
7     None
8     None
11    None
12    None
14    None
15    None
16    None
17    None
18    None
21    None
22    None
23    None
24    None
25    None
26    None
27    None
30    None
31    None
32    None
33    None
34    None
35    None
36    None
37    None
38    None
39    None
40    None
42    None           None          80th   Only high AS classic unireads

*not strictly followed in the case of MAPQ=2, check out the code

Python code for calculating bt2-style MAPQs

Here is my python code translated from Devon Ryan's C code version here. I made some changes to the values (e.g. 0.61 instead of 0.60) to make the MAPQ calculations 100% consistent with the MAPQ values I obtained in the above experiments with the same AS and XS.


def bt2_mapq_end2end(AS, XS=None, scMin=-30.6):
    '''scMin = minScore'''
    if XS == None:
        XS = scMin-1
    if XS > AS:
        return None
    diff = abs(scMin)                   # span of possible valid scores (0 down to scMin)
    bestOver = AS - scMin               # how far the best score sits above the minimum
    bestdiff = abs(abs(AS) - abs(XS))   # gap between the best and 2nd best scores
    if XS < scMin:                      # no valid 2nd best alignment: a "true uniread"
        if bestOver >= diff*0.8:
            return 42
        elif bestOver >= diff*0.7:
            return 40
        elif bestOver >= diff*0.61:
            return 24
        elif bestOver >= diff*0.5:
            return 23
        elif bestOver >= diff*0.42:
            return 8
        elif bestOver >= diff*0.3:
            return 3
        else:
            return 0
    else:
        if bestdiff >= diff*0.9:
            if bestOver == diff:
                return 39
            else:
                return 33
        elif bestdiff >= diff*0.8:
            if bestOver == diff:
                return 38
            else:
                return 27
        elif bestdiff >= diff*0.7:
            if bestOver == diff:
                return 37
            else:
                return 26
        elif bestdiff >= diff*0.6:
            if bestOver == diff:
                return 36
            else:
                return 22
        elif bestdiff >= diff*0.5:
            if bestOver == diff:
                return 35
            elif bestOver >= diff*0.84:
                return 25
            elif bestOver >= diff*0.68:
                return 16
            else:
                return 5
        elif bestdiff >= diff*0.4:
            if bestOver == diff:
                return 34
            elif bestOver >= diff*0.84:
                return 21
            elif bestOver >= diff*0.68:
                return 14
            else:
                return 4
        elif bestdiff >= diff*0.3:
            if bestOver == diff:
                return 32
            elif bestOver >= diff*0.88:
                return 18
            elif bestOver >= diff*0.67:
                return 15
            else:
                return 3
        elif bestdiff >= diff*0.2:
            if bestOver == diff:
                return 31
            elif bestOver >= diff*0.88:
                return 17
            elif bestOver >= diff*0.67:
                return 11
            else:
                return 0
        elif bestdiff >= diff*0.1:
            if bestOver == diff:
                return 30
            elif bestOver >= diff*0.88:
                return 12
            elif bestOver >= diff*0.67:
                return 7
            else:
                return 0
        elif bestdiff > 0:
            if bestOver >= diff*0.67:
                return 6
            else:
                return 2
        else: 
            if bestOver >= diff*0.68: 
                return 1
            else:
                return 0
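
As a quick sanity check, here is what this Python translation returns for a few of the AS/XS combinations from the experiments above (using the default scMin of -30.6):

print(bt2_mapq_end2end(0))           # 42 : single perfect alignment (read1)
print(bt2_mapq_end2end(-18))         # 3  : single alignment, 3 mismatches over Q40 bases (read1_3mm)
print(bt2_mapq_end2end(0, 0))        # 1  : perfect true multiread (read2)
print(bt2_mapq_end2end(-12, -12))    # 0  : "bad" true multiread (read2_2mm)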


The End.

The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data





Here starts the little series of illustrations showing possible outcomes starting with requiring a perfect match (0 mismatches) and gradually allowing more complex alignments (e.g. allowing mismatches). This is not meant to show things in a proportionally correct manner (i.e. not in proportion to all of your reads). Above, 6 reads are shown that do not map to the human genome for various reasons: from a region not in the current assembly (NIA), has a SNP, has a base call error (BCE) with a low base quality score, or is from a contaminating E. coli source (Con). There are 4 examples of reads that map uniquely: 2 map to the correct positions and 2 map to incorrect positions since they had a SNP and/or BCE. Finally, there is an example of a multiread that maps to >1 position equally well.
Unireads are reads that align to a single, unique position in the genome.
Multireads are reads that align to multiple positions in the genome.


Original Publish Date: 
Wednesday, May 28, 2014

The slow death of the term "uniquely mappable"
Most genomics studies using next generation sequencing (e.g. ChIP-seq) use what people consider to be standard quality control steps such as using "uniquely mappable" reads. Even the ENCODE website says one of their quality metrics is using uniquely mappable reads. But what are "uniquely mappable reads"? Why or how are they useful? and do they always ensure high quality data? If you answer those questions too quickly, you will miss a lot of nuance. Anyone new to doing their own NGS bioinformatic analyses who thought through those questions just a moment too long can attest to how much they have haunted them since. SeqAnswers.com, BioStars, and other forums are filled with questions about what defines a uniquely mappable read and how to extract them. However, the answers on different threads are often different. Why?


"Uniquely mappable" is a concept that started when short reads first hit the scene and there was no sophisticated way to deal with them other than to see if a short read matched 1 site exactly (uniquely mappable), if it matched >1 site exactly, or if it did not match any site exactly. However, it was always kept in mind that SNPs and base call errors would require the allowance of some mismatches once the dust settled from all of the initial excitement. Once mismatches were allowed, "uniquely mappable" became a very obscured concept. The number of uniquely mappable reads decreases with an increasing number of mismatches allowed in the alignment. Another way of looking at it: the number of multireads increases with the number of mismatches allowed since it allows formerly "unique" reads to potentially match more sites. So with allowing more mismatches, more formerly uniquely mapped reads are tossed out when keeping only "unireads". Thus, in some ways using only uniquely mapped reads, especially after allowing, for example, 5 mismatches, is absurdly conservative and perhaps even plain wrong due to how much data is thrown out. Moreover, unique reads include increasing amounts of false positives with increasing numbers of mismatches allowed. 

Another way to envisage why uniquely mapped reads become multi-reads when mismatches are allowed is to think of a genome composed of all 4^50 50-mers (all 50-mers composed of A,C,G,T), each as its own chromosome so there are no overlapping 50-mers to think about. If you take one 50 bp read (no errors) from this genome and map it back requiring a perfect match, then there is only 1 place it can align -- to itself, where it came from. Once you allow 1 mismatch, it could map to 151 different 50-mers. Want to allow 2 mismatches? Now it can map to over 11 thousand 50-mers. If you allow up to 5 mismatches, it can map to over 534 million positions in that genome! In each of these cases, if you were keeping tabs on the number of mismatches somehow, then despite the fact that it mapped to over 534 million positions, you could pick the highest confidence position - the one with 0 mismatches in this case. OR you could throw it away because it is a "multiread". In real life, a genome is only composed of a tiny fraction of all those 50-mers and inside that tiny fraction each 50-mer is present in different copy numbers, not just 1. This is why even when allowing 0 mismatches, some reads map to multiple sites. In the real life case, when allowing, for example, 5 mismatches, yes there are theoretically >534 million 50-mers it could match, but it is possible none of them occur in the genome since 534 million 50-mers is <<<<1% of all possible 50-mers. However, it is also possible that one or a couple do exist in the genome and in this case, the read, which was unique when requiring a perfect match, is now considered a multi-read and would be tossed out in studies that keep only "unireads" rather than choosing the best position. Thus, there becomes an absurdity to keeping only uniquely mappable reads when more and more mismatches are allowed. Moreover, in real life, this is also why contaminating (false positive) reads begin to align more when more mismatches allowed. Again, if you allow 5 mismatches in the 50 bp read, then the contaminating read can map to >534 million 50-mers, 1 of which may be in the reference genome of interest and in this case, such a uniquely mappable read would be kept. However, if one was more nuanced in his/her judgement and thought the alignment score was too poor, they would toss it despite that it was uniquely mapped.
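
Those counts fall straight out of the binomial expansion: the number of 50-mers within d mismatches of a given 50-mer is the sum over k = 0..d of C(50,k) * 3^k. A quick Python check of the numbers quoted above (the helper names are mine):

from math import factorial

def choose(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def kmers_within(d, L=50):
    '''Number of distinct L-mers with at most d mismatches to a given L-mer.'''
    return sum(choose(L, k) * 3**k for k in range(d + 1))

print(kmers_within(1))   # 151
print(kmers_within(2))   # 11176 ("over 11 thousand")
print(kmers_within(5))   # 534053356 ("over 534 million")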

After allowing 1 MM, one of the unique reads is now able to map to >1 position and so would no longer be considered unique if you extract only the reads that map to one place. The other correctly mapped uniread stays unaffected. The next read, originally a multiread, increases in multiplicity though this new position has a lower alignment score than the first 2 and could be ignored by more sophisticated algorithms. The read with the SNP is now able to align to the correct position (an intended result for allowing mismatches). The read with the base call error that could not map now becomes a multiread - it can map to the correct position and a position that agreed with the BCE but needed 1 mismatch allowance elsewhere in the read for alignment. Note that in bowtie2, the correct alignment would have a higher alignment score since the mismatch occurred over the BCE (which in this example has a low base quality score) whereas the mismatch in the incorrect position occurs over a high quality base call. Thus, it could choose the correct position, but people who take only unireads would toss it. The read with both a BCE and a SNP still only aligns to the incorrect place but uniquely! The next read with a BCE now has become a multiread by now being able to align to the correct position. In bowtie2, the incorrect mapping would have the better alignment score since it did not require a mismatch and matches over low quality base calls are not penalized. Thus, this is an example of a read for which the best alignment is not the correct alignment (unless a different scoring scheme that penalized low quality base calls in general was adopted). The contaminating reads still do not map.



Now 2 mismatches are allowed. The read with both a SNP and a BCE that originally aligned to the incorrect position is now able to align to the correct position. However, the correct position has a lower alignment score since it requires 2 mismatches (and since, at least in bowtie2, low quality base calls are not penalized over matches). The only other new event here is that one of the contaminating reads is now able to align uniquely.



The problem of "uniquely mappable" does not end with mismatch tolerance. Once gaps are also allowed, "uniquely mappable" becomes almost impossible to define. Note that most current aligners allow gaps. Think about it -- it is possible to map reads that were considered "unique" by the perfect match definition in ways such that none end up being "uniquely mappable" when allowing gaps and mismatches (without constraints). What becomes important is the scoring scheme, the alignment score, and cutoffs on alignment scores to be considered a "valid alignment". Then it is important to form some definition of a well-mapped or reliably mapped position for a given read from the alignment score of its best alignment as well as those of its other alignments. This also means that the status of a read being "uniquely mappable" becomes directly dependent on the scoring scheme and cutoffs - which are tune-able parameters, thus making the term even more elastic. 

"Uniquely mappable reads" have even less of a practical relevance when end to end alignment is not enforced and, instead, positions of the best local alignments (where the alignment to the reference can start and end anywhere inside the read) are searched for. For many reads, the best local alignment may still be the position that was formerly "uniquely mappable" in an end to end alignment requiring a perfect match. However, there are likely to be many other possible local alignments (think BLAST results). What again becomes most important is the scoring scheme, cutoffs for valid alignments, and sophisticated ways to define "confidently mapped" and "best" positions rather than using the binary criterion of "uniquely mapped" as a proxy. At this point, very few reads will have been uniquely mapped and many that have been uniquely mapped can be false positives. Taking all "uniquely mapped" reads could blindly include those false positives whereas an alignment score that represents mismatches, big gaps, and the use of only small region of the read would represent the lack of confidence in the mapping at which point we can choose to ignore it.

Now gaps are allowed in addition to 2 MM. The multiread increases in multiplicity again, but again this new alignment has an even lower alignment score so could be ignored. The read with a SNP that was unique when allowing 1 or 2 mismatches, now has become a multiread. The 2 incorrect alignments would have lower alignment scores and could be ignored. When keeping only reads that align to a single position in the genome, though, it would be tossed. The next 2 reads, one with a BCE and the next with both a SNP and BCE, increase in multiplicity (again lower scores though which could be ignored). Finally, another contaminating read is now able to uniquely align.
The first  change in this picture is that the read from a region not in the assembly is now able to map, but to an incorrect position -- though uniquely! The 3rd read from the left, which was a uniread for all the conditions until now has finally lost its uniread status. It can now align to 2 positions, though the incorrect position would have a lower alignment score and could be ignored. If keeping only reads that aligned to a single position, it would be tossed. Now the remaining contaminating read is also able to align (uniquely) and one of the others has now become a multiread. All in all, this series of illustrations is not proportionally accurate -- it is just meant to show the absurdity of ignoring alignment scores and judging a read that maps uniquely as good and multiply as bad when in many cases an alignment score comparison could tell you where a multiread maps and that a uniread maps horribly.







So why is there lack of agreement about "uniquely mappable" in the forums?
Practically speaking, "uniquely mappable" has actually become an outdated and not-so-useful term of sorts. Technically speaking, the "uniquely mappable" status of a read can change with the algorithm used, with its heuristics for finding alignments in super reasonable periods of time and its scoring schemes and cutoffs. Culturally speaking, "uniquely mappable" has become a term associated with being "rigorous" with your data. The term persists and even as it persists it exists simultaneously in two states: some people latch on to the old definition and take it to mean that the read only has a single site in the genome that it maps to (but this changes with how the read is mapped as discussed above!) while others have adopted a newer definition and take it to mean that the read either maps to a single place OR the alignment score of the best alignment position is better than the score of the next best alignment position (e.g. bowtie2). The latter definition is more sophisticated and takes into consideration the fact that more and more reads will be able to map to more and more places once you start allowing mismatches, gaps, and local alignments, all of which may be necessary because of SNPs, base call errors, microdeletions, microinsertions, reads split across break points … in other words, the many issues that arise when mapping reads (often with errors) from a cancer genome back to the reference. The view is that even if it can map to more than one place, the better the best score is over the 2nd best score, the more likely that position is the correct position to align it to. However, it is not guaranteed that the best alignment score belongs to the correct mapping position, though this might be the case most of the time. 

The goal should not be to include only uniquely mappable reads. The goal should be to include any read that is mapped to a position with high confidence. This would include some "multireads" and exclude some "unireads":
Since, on some level, all reads can become multireads (some sooner than others), "uniquely mappable" is not a useful concept anymore. Truth is, it was always just a quick-and-dirty, black-and-white, binary hueristic to classify a mapped read. There always needed to be a more dynamic way to quantify a "well-mapped" read; a spectrum from lowest to highest in belief or confidence that the mapping is correct. Mapping Quality (MAPQ or MQ) scores, which typically range from 0-40 or 42, theoretically approach this ideal. MAPQ is supposed to be -10log10(p) where p is the probability the read is mapped wrong. This way, the smaller the probability it is mapped wrong, the higher the score it gets. Thus, if there are 2 places a read can go and it is assigned to 1 uniformly at random, the a priori probability that it is mapped wrong is 0.5 and the MAPQ = -10log10(0.5) ~= 3. Similarly, a MAPQ of 10, 20, 30 and 40 should mean the probability it mapped wrong is ~0.1, 0.01, 0.001, and 0.0001. I do not think most current aligners implement probabilistic models that first estimate the probability a read is mapped wrong and then convert it to a MAPQ though. For example, Bowtie2 seems to follow a rule based approach to assigning MAPQ values (which I discuss in another post). In Bowtie2, a read that aligns to more than 1 site equally well is never given higher than a MAPQ of 1 (even if it aligns to only 2 sites equally well as discussed above).

Though MAPQ is related to uniquely mappable (mainly the newer definition), it compounds it with other factors. MAPQ is just a transformation of the probability a read is mapped wrong, which should take into consideration, at least implicitly, the following: the alignment score, the number of times the read mapped, and the alignment scores of the other mappings (Bowtie2 looks only at the minimum alignment score allowed and the alignment scores of the best and second best alignments to estimate the MAPQ). Since the estimated probability that a read is mapped wrong depends on whether the read mapped more than once, it is also heavily influenced by the allowance of mismatches and gaps, scoring schemes, and cutoffs. Moreover, different algorithms approximate MAPQ scores in different ways. So the MAPQ of a mapped read from bowtie2 can be different than that of bwa given that everything else (e.g. finding and scoring alignments) is the same (which it isn't). MAPQ may not be completely standardized yet, but with current read mapping it is a more nuanced, more practical, more accurate way to quantify our belief that a read is correctly placed than is classifying it as "uniquely mapped" or not.

From here on forward let us refer to "multireads" as only those reads for which their best alignment score is the same as their second best alignment score. Reads for which their best alignment score is better than their second best, formerly multireads, will now be called "maxireads" to represent they were placed at the position of maximum score. This term, "maxiread" also encompasses reads that would previously be called "uniquely mappable" as they also by definition were placed at the position with maximum alignment score. So the two terms moving forward are "maxireads", those with single maximum alignment scores, and "multireads" for which there is no single maximum alignment score (i.e. the maximum is shared by >= 2 positions). 

The frontier is really in deciding how to find the best positions for all multi-reads. Multireads are still essentially just filtered out. Thinking about mapping all mappable reads like a puzzle, the problem is, we have been assuming we have 2 options:
   1. toss out the puzzle pieces that have shapes that fit in multiple places equally well 
   2. randomly choose one of the multiple places that a puzzle piece's shape fits equally well
Most of us have largely ignored the 3rd option: 
  3. use the picture we can already see formed by the puzzle pieces that fit into 1 spot best to inform us where the puzzle pieces with shapes that fit multiple places go

In other words, use the background already put together by the higher confidence maxiread data to form posterior probabilities of where a multi-read goes. Posterior MAPQs could be assigned and the reads could still be filtered out if they did not quite meet someone's MAPQ cutoff. Some posterior MAPQs would still be 0 if the maxi-read background was uninformative (i.e. balanced across all sites). Other posterior MAPQs could become quite high if the high quality maxi-read background was concentrated at just 1 of the optional sites. 

For example, imagine a read that maps to 10 sites equally well. A priori, the probability it maps to any of those sites is 10%. Thus, you have a 90% chance of placing it at the wrong site if you assign it uniformly at random. The prior MAPQ would therefore be -10log10(0.9) = ~0.46, which would be rounded to 0. However, if you look at the background distribution of "maxireads" and at one of those 10 sites there are 91 maxi-reads while each of the others has 1, for a total of 100 maxi-reads, the posterior probability might be 91% for the first site and 1% for each of the 9 others. Now, for a read mapped to the first site (91% chance it came from there), the posterior probability that it is mapped wrong, given it is a perfect match, is 9%. Let's call it 10% to make the math easier. The posterior MAPQ would now be -10*log10(0.1) = 10 instead of 0. However, it is difficult for me to say how a read randomly assigned to one of the other 9 sites would be scored. Since reads mapped to the other 9 sites were assigned to them under the same posterior probability distribution as the reads mapped to the first site (at a rate of 1% to each site), it seems like they should be scored as equally correct as the read mapped to the first site -- not necessarily -10log10(0.99) = ~0.04 ~= 0 -- though that would be one way to score them. Moreover, other factors need to be taken into consideration, such as the mappability of each of those sites. Some regions of the genome might be deserts in terms of places maxi-reads could align while others might be chock full of them. There needs to be a way to normalize the number of maxi-reads at a site by the propensity of maxi-reads to occur at that site. If one could do that, then one could intelligently place multi-reads to reconstruct the most probabilistically accurate puzzle picture rather than to leave puzzle pieces out of it or to fit them in uniformly at random.
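
As a toy sketch of that idea (this is not an implementation of any published method; the function name, the uniform treatment of sites, and the cap are mine):

import math

def posterior_mapq(maxiread_counts, site):
    '''Toy posterior MAPQ for a multiread assigned to candidate site `site`,
    using the maxiread counts at its candidate sites as the posterior weights.'''
    total = float(sum(maxiread_counts))
    p_wrong = 1.0 - maxiread_counts[site] / total
    if p_wrong <= 0:
        return 42   # avoid log10(0); arbitrary cap at bowtie2's top score
    return -10 * math.log10(p_wrong)

# the example from the text: 10 candidate sites, 91 maxireads at one, 1 at each other
counts = [91] + [1] * 9
print(posterior_mapq(counts, 0))   # ~10.5 (the text rounds p_wrong to 0.1 -> MAPQ 10)
print(posterior_mapq(counts, 1))   # ~0.04, i.e. still ~0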

Before mapping all of the high quality "maxireads" (see text for definition), we have a completely uninformed, uniform distribution on how to assign this multiread.

After mapping all of the high quality "maxireads" (see text for definition), i.e. after looking at the data, we have a better informed posterior distribution on how to assign this multiread.

I have found that some groups have begun trying to implement ideas like this. This needs to catch on!


Slightly related -- others have come up with ways to study repetitive regions (from where many multireads arise) in general:
- TEA: http://compbio.med.harvard.edu/Tea/

----
Note: of course these "goals" change according to the application. However, my hope is to open a discussion on all the data we toss out and the general assumption that keeping only uniquely mapped reads is "rigorous". It could be called "conservative", but it is not rigorous. It is more closely related to intellectual laziness than to intellectual rigor. In another post, I will happily discuss better ways to deal with "redundant reads" as well, instead of just removing them all or all but 1, which of course sets a very low ceiling on how many reads you can align to the genome in the era of increasing throughput.




Tuesday, February 18, 2014

Why it burns when you P.

The smaller the p-value, the better the result. True or false?




If you answered "True", then that is why it burns when you use and interpret p-values. It hurts us all. Stop it.






The p-value seems to have become the judge, jury, and executioner of evaluating scientific results. As I go further into my PhD, I see just how in love we are with p-values as well as all the signs of an abusive relationship. No matter how many horrors the p-value wreaks upon us, we just know it won't do it again. Plus, no matter how much we misuse and manipulate the P-value, it just won't leave us. It's here to stay and that's love  right?! Love or hate, it is pretty hard to imagine a world without p-values, but such a world did exist: p-values have only been around for less than a century. And before that: nothing was statistically significant! Yet scientists were able to somehow test hypotheses, come to conclusions, and develop theories anyway including Darwin, Mendel, Newton, Einstein…



Google Ngram snapshot Feb 17, 2014. 
Side note: It is not strictly true that the world was completely absent of significance before the P value. For example, John Arbuthnot seems to have stumbled upon a significant result in 1710 and the term "statistically significant" seems to have come up around the year 1600 according to google ngram (above). However, "P values", "statistical significance", and "statistically significant results" in their modern incarnations are a product of the 1920s.



So who invented the P-value, what was it originally used for, and how is it used in modern data analysis? Could science and data analysis exist without p-values once again? Should it? Many of you (i.e. my readers, i.e. referer spam bots) might have felt a sense of shock thinking that the almighty p-value ever had an inventor. What mere mortal could have created the mighty metric for "significance" itself?  Rather than tell you, I encourage you, Spam Bot, to read, "The Lady Tasting Tea" by David Salsburg. One result of inventing the p-value that is almost certainly statistically significant is the number of scientific papers that have included the p-value since. More interesting is that the p-value has become so omnipresent, so omnipotent, so mythological and magical that few if anyone cites the mere mortal inventor when they use it. If everyone did cite R. A. Fisher, he would quite possibly be the most cited scientist ever by no small margin (okay I told you, Spam Bot ...but I still think you should read that book).

The p-value has been a source of controversy ever since its invention. Unfortunately, the controversy has mostly gone on in the coolest statistics circles, which it is statistically unlikely that you or any of your ancestors were ever in. Good news! Due to increasingly bad science -- perhaps a consequence of the rise, abuse, misuse, and generally poor understanding of p-values -- the controversy has finally reached the foreground.

There are already plenty of people who have provided prose preaching about the promiscuous rise and predicted plummet of the p-value's popularity, and who promote posterior probabilities and other alternatives such as reporting effect sizes and confidence intervals. So I do not want to do my own version of that. See the end for recommended books and articles covering these issues. Rather, I want to ask you a few questions (feel free to answer in the comments section). This is not a comprehensive set of questions, but they are questions that I think any scientist or data analyst should have their own thoughts on. In fact, it would be amazing if the average Joe Citizen had thoughts on these questions about the almighty p-value and related topics such as the False Discovery Rate, reproducibility, and validity. All of us are blasted in the face with statistically significant results on a daily basis. Popular views on whether eggs and coffee are good or bad for your health change all the time because of statistical significance. So, which is it: good or bad?!?! Can statistical significance ever answer those health questions definitively?






Take a moment to think about each of the following questions. There are some starting points to explore these topics afterward, but mostly I expect that you know how to use Google. For the uninitiated, remember that p-values get smaller with "higher significance" -- hence the tempting, but horribly wrong conclusion of "the smaller the p-value, the better the result":

1. What is a p-value? 

2. What does a p-value mean to you?

3. What does a p-value mean to your colleagues?

4. Are results better when the p-value is smaller (i.e. "more significant")?

5. What factors make a p-value tiny anyway?

6. When you are scanning a paper or report and see "highly significant" (tiny) p-values, how does that affect how you perceive the results?

7. When should you or anyone else report a p-value? 

8. Is it always necessary to include a p-value?

9. What is an effect size?

10. Does a tiny p-value guarantee a large effect size?

11. Is it possible to get a highly significant (tiny) p-value with a tiny effect size?

12. How does "n" (number of data points) affect the p-value?

13. What is a null hypothesis? or what is the null distribution?

14. Is there more than one possible null distribution to use for a given test? 

15. How do you pick a null distribution?

16. Is there a correct null distribution? or multiple correct null distributions ("nulls")? or mostly incorrect nulls? or do nulls have no inherent correctness at all?

17. What does the null distribution you use say about your data? 

18. If your p-value is "highly significant" and lets you conclude that your data almost certainly did not come from your null distribution, a negative assertion, does that give you any positive assertions to make about your data?

19. If your p-value is "not significant" and so you fail to reject the null hypothesis, does that mean the null hypothesis is therefore true?

20. If your p-value is "not significant" and so you fail to reject the null hypothesis, does that imply that your alternative hypothesis is therefore false?

21. If your p-value is "significant", would it come out statistically significant again if you repeated the experiment? What is the probability it would be significant again? Is that knowable? 

22. Is it possible to get a highly significant (tiny) p-value when your data have indeed come from the null distribution?

23. When you do 100 tests with a cut-off for significant p-values at 0.05, and all the tests are in fact from the null distribution, how many do you expect to be counted as significant anyway? -- i.e. how many false positive tests will there be out of the 100 when the significance cutoff is 0.05? If you do 2.5 million tests with a significance cutoff of 0.00001, how many false positives would you expect to come through? (A small simulation sketch follows this question list.)

24. When doing multiple tests in general, how can you limit the number of false positives that come through? How will that affect the number of true positives that come through?

25. What is the Bonferroni correction?

26. What is the False Discovery Rate (FDR)? …and what are q-values?

27. Does the False Discovery Rate control all types of false discoveries?

28. What types of false positives are not accounted for when controlling the FDR?

29. When reading a paper or report, if the FDR is tiny, how does that affect how you perceive the results?

30. What is Bayes' Rule?

31. What is a prior probability? a likelihood? a posterior probability?

32. What is a prior distribution?

33. What is a posterior distribution?

34. Compare null distributions to prior distributions. 

35. Compare p-values and posterior probabilities.

36. Compare confidence intervals to credible intervals.

37. What is reproducibility? reliability? accuracy? validity?

38. What is the Irreproducible Discovery Rate (IDR)?

39. If a result is highly reproducible, does this ensure it is valid? 


40. If a result is not reproducible, does this guarantee it is not valid?

41. If results are both reproducible and valid, does this guarantee the interpretation is correct?

42. Are the non-reproducible elements across replicates always "noise"? Or could they be signal?

43. Is it possible that the most reproducible elements across replicates are in fact the noise? 
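
To make question 23 concrete, here is a minimal simulation sketch in python (the function name and the numbers are purely illustrative). Under the null hypothesis a p-value is uniformly distributed between 0 and 1, so a truly-null test slips under a cutoff alpha with probability alpha, and the expected number of false positives is simply alpha times the number of tests.

import random

def count_false_positives(n_tests, alpha, seed=0):
    # Simulate n_tests truly-null tests: under the null each p-value is
    # uniform on [0, 1], so it falls below alpha with probability alpha.
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))

# 100 null tests at a 0.05 cutoff: expect about 0.05 * 100 = 5 false positives.
print(count_false_positives(100, 0.05))

# 2.5 million null tests at a 0.00001 cutoff: expect about 0.00001 * 2500000 = 25.
print(count_false_positives(2500000, 0.00001))

# Bonferroni-style correction (question 25): divide the cutoff by the number of
# tests so that the chance of even one false positive in the whole family stays small.
print(count_false_positives(2500000, 0.00001 / 2500000))

The same expectation argument is behind the trade-off in question 24: shrinking the cutoff to keep false positives out also keeps out true positives with modest effect sizes.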


Books:

"The Lady Tasting Tea" -- David Salsburg
-- This was a good read: a brief history of statistics, the birth and rise of statistical ideas, and the characters involved.

-- James Schwartz
-- This is also a great history of statistics, though not strictly relevant to the p-value topic. Think of it as the pre-P history. I mainly included it here because when I was done reading this, I read "The Lady Tasting Tea" and thought that in some ways the latter picked up where this book left off. Many of the great statisticians were interested in genetics -- that is still true even today.

-- Wackerly, Mendenhall, Scheaffer

Articles (chronicles) concerning the p-value battles:

Significance Questing -- Rothman, 1986


Other articles (some of which concern p-value battles -- e.g. motivation for Bayesian Inference):


Wiki pages:


Other blogs, classes, etc. that get into the topic of p-values -- for and against:

Hypothesis Testing -- slides by Vladimir Janis

The Statistical Significance Testing Controversy: A Critical Analysis <-- R. Chris Fraley's syllabus for his class of that title. At bare minimum, this is a very extensive set of references on these topics, organized into points such as why to abandon significance tests, why to keep them, effect size, statistical power, etc.


IDR -- a very new statistical idea:

Tuesday, February 11, 2014

The Oil-and-Water of Website Design

Hear ye, hear ye: I have a meagre-but-workable knowledge of designing and maintaining small websites. 

Over the years I've figured out some helpful website design/management practices that are probably well known and obvious to real website managers, but that I think are useful to anyone who likes to dabble. The underlying theme is layering and separation. Modularity. For example, the aesthetics of the website---the logo, the masthead---should be developed independently of the coding (and preferably beforehand, to help guide the coding). The coding itself should naturally separate into content, on-screen actions, and layout --- that is, your HTML, your javascript, and your CSS.

I'm going to skip a bunch of personal history and go straight to a case study: my lab's website. A few years ago I thought it would be good for my lab to have a web presence, so I thought: (1) We need a cool name, (2) We need a cool logo, (3) We need a cool website. 

Turns out "cool" is defined such that "Terrestrial Lab" is a cool name and that this is a cool logo:


The definition is also flexible enough such that www.terrestrial-lab.org is a cool website. At the least, these will be our starting assumptions!

Before I made this website, my advisor often referred to our lab as the NJIT Middle and Upper Atmospheric Group. While that name was descriptive, I thought something like "Terrestrial Lab" was catchier. We're not studying rocks, oceans, volcanoes, or the plethora of other "terrestrial" features (hell, most of science fits under the "terrestrial" umbrella), so the name might seem like a misnomer. But we do study space weather activity with a focus on the Earth's lower, middle, and upper atmosphere, and our work does make up the "terrestrial" component of our broader group, the "Center for Solar-Terrestrial Research." So 1+1=2, and it's a great name!

With a hip new name, we could also have an easy-to-remember website URL. Our old website was TSMTM.net --- pretty dang easy to remember.....that is, if you're familiar with the -ospheres: troposphere, stratosphere, mesosphere, thermosphere, magnetosphere. But if you're anything other than an atmospheric scientist, I wouldn't feel bad if it just looks like a random string of letters. And, anyway, if you're anything like me, all you see at a glance is Teenage Mutant Ninja Turtles.

What's a hip new name without an eye-catching logo? I wanted something identifiable. Something simple and obvious from afar. My idea was to make a logo that someone could mindlessly doodle in class. Something that would look good as a sticker, or on a T-shirt. Most importantly, it had to do all this while making us look like a serious group of scientists (we're actually just a bunch of clowns, so this is important!).

I started with a triangle: simple, serious, pointy. I played around with it on a piece of paper, adding bits of text until I got things looking how I wanted. Then I went digital with Adobe Illustrator after the idea was already down. I had to play around with fonts for a while, but I finally nailed it --- with printed T-shirts and all:

 


Great, but a logo by itself isn't enough for a website: one also needs a masthead of sorts. Some kind of graphic. Something that is nice looking, professional, and somehow relevant. I had always thought the periodograms I produced with magnetometer data from Antarctica had an aesthetic appeal to them, so I started there. Overall, this should be a fun part. Just play around. This was my final product:



Ok, so we have a name, a logo, a main page masthead design. Now it's time to get your hands dirty and code. I'm going to assume you know a lot of basics (some HTML, a sprinkle of javascript, a dash of CSS) and just list some things that have helped me. For starters, my website looks like this:


It's important to realize that the TL website looked like this even when I had much less efficient coding practices. I did not yet fully appreciate the oil-and-water of website design --- the natural separation of content, on-screen actions, and layout! The focus of my pointers below is on efficiency, which has nothing to do with the final product, but everything to do with maintaining the final product without going insane. 

1. Internal vs External Cascading Style Sheets: Formerly, I used internal cascading style sheets as my primary CSS with manual overrides in-line if necessary. This was fine when I first created the site as a proof-of-concept to show my advisor, but it quickly became awkward to maintain the site and develop new pages. If I wanted to change the CSS of one page, I'd have to manually go around and do it to the other pages as well---or just say "to heck with consistency!" 

Solution: I updated my site's primary CSS to an external style sheet (.css file) and now can do manual overrides both (locally) in-line, like I formerly could, and (page-globally) in the header using the <style> tag. (Look at the code below and/or Google these things if you want to learn more.)


2. Internal vs External Javascript: 
Same story. I used to have it internal to each page on my site using the <script> tag. Although I rarely modify the javascript on the site, the individual pages look so much cleaner without it, which allows me to focus on a page's content. Each page now calls on an external .js file, which I only need to edit once if the need arises, not laboriously for each page.

3. The Navigation Bar: The meat of the TL website is the navigation bar at the top. It needs to remain constant as a user peruses the pages, which means it somehow has to exist within each page's HTML file. One terrible way to do it is to copy and paste the nav bar's HTML into each web page's HTML file. That's fine until you want to modify the nav bar --- to add new sections, delete old links, or edit the associated javascript or CSS. One small change to the nav bar requires manually going into each file and...oh man! How time-consuming. I used to actually do this. It was tiresome and oppressive, and it simply stamped out creativity and made website maintenance a chore.

Solution: Again, nothing new conceptually: separation of shared content from page-specific content. It's possible to keep the nav bar in an external HTML file and simply have each webpage of the TL site refer to that file. This dramatically changes the web design and management experience. How to do it? The answer is Server-Side Includes (SSI), which allows one HTML file to call upon an external HTML file. The only caveat for using SSI is that, on a typical server configuration, you must change your webpage extensions from .html (or .htm) to .shtml.

To learn basic SSI (which is likely all you need), the wikipedia article is good enough. 

The code pasted below shows more than meets the eye. These few lines in an HTML file used to be tens of lines! Maybe more. I used to have both JS scripts and a huge chunk of CSS in the head of every page. Now each page simply points to the external .js and .css files instead, and the SSI directive tells the server to paste the nav bar's HTML into the page before it ever reaches the user's browser. That's it. The hard work happens elsewhere, and each page's file stays short and focused on its own content.

<head>
<title>CSTR - Terrestrial Laboratory</title>
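<!-- external javascript: both .js files are shared site-wide; edit them once and every page that links them picks up the change -->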
<script type="text/javascript" src="TL_script1.js"></script>
<script type="text/javascript" src="TL_script2.js"></script>
<link rel="stylesheet" type="text/css" href="TL_in_style.css">
</head>


<body>
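<!-- Server-Side Include: the server pastes the contents of topLinkBar.html (the shared nav bar) here before serving the page -->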
<!--#include virtual="topLinkBar.html"-->

<!-- all other HTML code for the page -->

</body>

And, so the secret to sanity is the separation of seemingly singular suppositions! (And, of course, the mastery of alliteration.)