Biofinysics: 2014

Tuesday, November 25, 2014

When a MacBook goes bad: the Case of the Calamitous Clamshell

My MacBook Pro [MBP] has a problem.

Closing a MacBook's lid is supposed to every-so-gently ease the MacBook into a light sleep. Instead, this very act bludgeons my poor workhorse into hibernation.

On a healthy MBP, after its restful nap, opening the lid is tantamount to a gentle, but purposeful caress while whispering, "Rise and shine, Sugar." The MBP glows to life almost immediately, happy to be of service!

For my bludgeoned MBP, however, opening the lid is on par with securely tying its limbs to the bedposts before arousing it from unconsciousness with smelling salts and lighting the room on fire. The MBP awakes, but it doesn't know what happened. It knows it has to get moving, but it's tied down by ropes. The heat is rising, the fire at its toes --- there's no time! The MBP manages to break free and get to safety. But it's sluggish. It feels drunk. There are all these open apps, but why? "What was I doing last night? How did I get here?"

Some Details
(1) This problem only happens on battery power: when the laptop is plugged in to an outlet, there is no problem.
(2) This issue does not have to do with sleep itself: only closing the lid when unplugged induces the issue (e.g., I can type "pmset sleepnow" from the command line, have the computer go to sleep, and arouse it with no issue.)

I've reset the PRAM and the SMC . I've looked at the power management settings and logs in the Terminal (e.g., pmset -g custom, pmset -g pslog), and have messed with various parameters as suggested by some posts on the Mac Forums (e.g., hibernatemode, standbydelay, autopoweroff). I've logged in from the Guest Account, deleted my login items, run the computer in Safe Mode, verified and repaired disk permissions. Still the problem persists!

I've run the Apple Hardware Test (or, for an early 2011 MBP) and have tried recalibrating the battery. Nothing. I checked my battery's health. To my quantum superposition of chagrin and triumph, the battery was healthy.

Hours of googling and finagling later, I simply ran out of ideas. That is, besides those my brother provided concerning a faulty cable and/or taking my MBP to Mac Genius.

To the Apple Store!

The Mac Genius immediately knew what to do: upon learning that I had backed up my MBP before coming in, hearing about all that I'd already tried, and re-trying a thing or two himself to no avail, the Mac Genius re-installed the operating system on my computer. The idea was that the MBP might behave properly now that it had a clean slate --- that maybe there was some piece of offending software causing the problem that we now hopefully got rid of.

Plot twist: It didn't work.

To ease my worried mind, the Genius kindly offered to let me pay a $300 flat fee to send my computer away so that---maybe---someone who actually knew wtf they were doing could figure something out. "It is very likely the sleep indicator cable," he said, "But just in case it's not, the fee covers whatever they have to do." To be clear, this was the same as saying, "If you have $280 to pay for your own laziness, leave the computer here."

Inconclusive Clamshell Conclusion: I declined the offer and, as of 25-Nov-2014, have not figured it out yet. There is a cable or two I've been told to check, which I will when I get time. Any advice? Please leave a comment.

Software Issues
When I got home that day, I re-installed everything via my latest Time Machine back-up. I then attempted to work on my dissertation proposal, which is a set of LaTeX documents I edit using MacVim (with the help of TeX-9), and do some data analysis in IDL. I sensed something was awry --- probably because neither MacVim or IDL would respond to my call.

"Uh-oh," my analytical brain assessed as I continued seeking out how doomed I might be. Almost anything I'd ever brew-installed didn't work, e.g., FFMPEG and ImageMagick.

In all fairness, this might not have been the fault of the Mac Genius who wiped my hard drive clean: indeed, it could have been my fault. At the Apple Store, when we re-installed the operating system, I whimsically decided to change the user name from a previous user's name to my own... Was this the cause of all my headaches? (Google that for me.)

On the next episode of "When a MacBook goes bad": Kevin bumbles through solving his software issues like a blindfolded man in a forest.

Thursday, May 29, 2014

Where does bowtie2 assign true multireads (AS=XS)?

by JohnUrbanGenome

Bowtie2 is an ultra-fast program for aligning next generation sequence reads to large genomes written by Ben Langmead and which happens to be my aligner of choice. RTFM here. Download it here or learn how to get it on your Mac OS X with "homebrew" from one of my previous posts, "The Craft of Homebrew-ing for Mac OS". Learn about how it assigns MAPQ scores and related ideas in my previous posts, "How does bowtie2 assign MAPQ scores?" and "The slow death of the term 'uniquely mappable'...". In this post, through mini experiments, I discuss whether or not bowtie2 randomly chooses a site for reads with multiple equal alignment positions. I chose to do this to support or put to rest various concerns raised in forums (such as seqanswers and biostars) that it always puts reads at the same spot or with a heavy bias for 1 or a couple spots.

Bowtie2 searches for as many alignments as it can (with given time and memory constraints) for a given read against the given reference (e.g. human genome) that have alignment scores above the minimum score cutoff. The output contains information on each read's best alignment score (AS) and, if a second alignment existed with a score above the cutoff, it's second best alignment score (XS). If an XS does not exist, this tautologically means there was no second alignment with a score above the minimum cutoff. Though this is a seemingly vacuous interpretation, it is the most robust one being a tautology and all. Some take "no XS" to mean its AS alignment was the only alignment or that it is a "uniread" while simultaneously not considering other reads with AS > XS as "unireads". Instead, this latter group of reads are lumped into the 'multiread' category, which are reads that map to more than one site. Others acknowledge that having no XS is just another way of saying AS > XS and that all reads with AS > XS are "unireads". This latter view acknowledges that having an XS is in part dependent on that minimum score cutoff as well as penalty sizes, whether or not mismatches and/or gaps are allowed, and whether or not you use end-to-end or local mode. Multireads, in this view, are only reads where AS = XS. To be clear, I try to refer to reads with AS = XS as "true multireads" since there is more than one interpretation of "multiread". However, when I do say 'multiread' I do still mean reads with AS = XS. Moreover, since the terms 'uniquely mappable' and 'uniread' has more than one meaning now, I try to refer to reads with AS > XS (including reads with no XS) as "maxireads" since they have a single maximum AS that is greater than all other XS. For practical reasons, some times I refer to reads with no XS as "true unireads", but this is a misnomer and I should stop it. I do it only out of the need to refer to that group of reads with no XS since some people are interested in only them. Perhaps I will just call them "classic unireads". Find out more about the fickleness of the state of being unique in my previous post, "The slow death of the term 'uniquely mappable'…".

This post is interested in "true multireads" where AS = XS. If you are keeping true multireads in your data set, it could be an issue if they all get assigned to the same spot. Of course, this effect would be mitigated if you removed all "redundant reads" (reads that map to the same exact position) except one before moving on to downstream analyses. There also seemed to be some debate out there as to what bowtie2 does with multireads. So I designed a couple very simple experiments to show what it does.

The take away: these experiments support exactly what the bowtie2 manual says about what it does with randomly assigning places for multireads. If you do not know what the manual says than RTFM or read on.

Methods:

Randomly generated (multi)reads used:
>read2
ACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAA
>read3
GTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGG
>read4
GACACGTGAGTGAACCATACACACAAGTTCCGCACAACTTAGCTTGCGTA
>read5
ATCAAATCAGTTCGGCGCTCTGGTGTCTACAAGAATGTCACCGGTAGTAG
>read10
AGGAAGATAATGAAGCCCGGTTGTCTTGCGGATAAGAACACGCCGCGGCA
>read100
AAGCACTCCTTCAGGTCTATAACCGCGACCGCCAGCTCCTGTTTTTTTGG

"Chromosomes" were made from the reads:
There is a chromosome with the same sequence as every read (e.g. chr2 for read2) made as tail-to-head repeats of the read in the copy number associated with the read (e.g. chr2 has 2 copies of read 2). This is simply done in python taking advantage of its string multiplication (e.g. string*2 = stringstring). Therefore, every read can map to its respective chromosome >= 2 times with AS = XS and they are all true multireads. The chromosomes were indexed with "bowtie2-build -f chr.fa chr".

Using many copies of the same multireads:
One thousand copies of each mutiread was placed in the fasta file, reads.fa, to study the behavior of mapping the same multiread over multiple trials.

Experiments, results, and conclusions:

First I executed the following bowtie2 command and pipes:
bowtie2 --very-sensitive -N 1 -x chr -U multireads.fa -f | grep -v ^@ | cut -f 1,3,4 | sort | uniq -c

In bowtie2, the output is SAM format and fields 1, 3, and 4 are read name, chr name, and 1-based start position respectively. Thus, this pipeline ends in counting the number of times "read chr pos" were identical. Here are the results:

read_name chr_name pos count
read2 chr2 1 1000
read3 chr3 101 1000
read4 chr4 151 1000
read5 chr5 51 1000
read10 chr10 251 1000
read100 chr100 2901 1000

All one thousand copies of each multiread mapped to the same place. If you RTFM of bowtie2, it explains this behavior. Bowtie2 uses a pseudo-random number generator (for those not familiar with computer science, this is how randomness is simulated in general) to randomize where a read will go given more than 1 choice. So, why do they all go to the same place? The answer is because bowtie2 uses the read name, nucleotide string, and quality string to determine the seed to use for the pseudo-random number generator. Briefly, if you use the same seed every time, then you get the same string of pseudo-random numbers every time. Here all of my multireads have the same name, the same sequence, and since they are fasta files, the same string of quality scores (bowtie2 assumes they are all 40). Thus, this is the expected behavior. In real life, you will not encounter (or at least not often) a situation where you have multiple reads with the same name. Reads can have the same sequence, but it is less likely they will share the same exact quality string. Nonetheless, a different seed would be generated given 2 identical sequences because the read names (and probably quality strings) would differ.

However, for now I am dealing with totally identical reads. If you want fully simulated randomness with identical reads, you need to use the --non-deterministic flag. This turns off the deterministic behavior of picking a seed from the fixed read parameters and instead just picks a seed using the current time (each time it encounters one of the identical reads, it is a different time, so a different seed is used). So, I tested if this randomly distributes the reads with:

bowtie2 --non-deterministic --very-sensitive -N 1 -x chr -U multireads.fa -f | grep -v ^@ | cut -f 1,3,4 | sort | uniq -c

read_name chr_name pos   count expectedCountIfuniform
read2 chr2 1 496 500
read2 chr2 51 504 500
read3 chr3 1 338 333.333
read3 chr3 101 326   333.333
read3 chr3 51 336   333.333
read4 chr4 1 249 250
read4 chr4 101 235   250
read4 chr4 151 254   250
read4 chr4 51 262   250
read5 chr5 1 168 200
read5 chr5 101 205   200
read5 chr5 151 201   200
read5 chr5 201 223   200
read5 chr5 51 203   200

It really does not take statistical tests to show that bowtie2 is doing a good job uniformly distributing the reads across all possible sites for a read. The table for all of the read100 datapoints would be too large to show, so instead, here is a plot (made in R):

The line was generated by making a linear model from the data using the lm() function in R. If it is random, the y-int should reflect the expected number of reads when 1000 reads are uniformly distributed across 10 sites. That would be 10 reads per site and the y-intercept was 9.961. The slope should be close to 0 to show no bias over the positions. The observed slope was ~0.000016. All in all, bowtie2 indeed seems to be approximating a uniform distribution over the sites.

The only thing left to do is to change the read names to see if that is sufficient for randomized results as well. To do this, I just added a number to the read name such as read2_1. The number added was just the line number as it occurred in the file. For these, since the names were all different and since each read can only map to its respective chromosome, I only counted how many times a read aligned to "chr, pos" with:
bowtie2 --very-sensitive -N 1 -x chr -U diffNameMultiReads.fa -f | grep -v ^@ | cut -f 3,4 | sort | uniq -c

chr_name pos   count expectedCountIfuniform
chr2 1 503 500
chr2 51 497     500
chr3 1 362 333.3333
chr3 101 316   333.3333
chr3 51 322   333.3333
chr4 1 277 250
chr4 101 246     250
chr4 151 251     250
chr4 51 226     250
chr5 1 200 200
chr5 101 192 200
chr5 151 216   200
chr5 201 192   200
chr5 51 200   200

Again, it is clear from just the table alone that bowtie2 is doing well at uniformly distributing the multireads as it claims to do.

Bowtie2 definitely assigns 'true multireads' uniformly at random.
A long time ago, I took a very high multiplicity (~12,000) multiread from one of our experiments and tried mapping it with bowtie2 again.. and again… and again. It mapped to the same place every time rather than one of the other 11,999 places. Back then, I was momentarily horrified until I RTFM. This behavior is the expected behavior in deterministic mode, where bowtie2 uses the read name, sequence, and quality string to seed the pseudo-random number generator. The solution would be to use the --non-deterministic flag. The real experiment would have been to change the read's name each time. In our real NGS experiments, none of the reads share a name. So, given the results above, it is safe to say that if 2 true multireads with different names but the same sequence are encountered, they will likely be assigned to different sites uniformly at random. If reads with the same sequence are filtered beforehand to keep only one or if there is just naturally a single instance of a multiread, you can trust that the site it is assigned to is picked at random. Even if it was not, most people filter multireads out. To filter all true multireads out (and some maxireads including classic unireads with poor AS), use MAPQ >= 2. See my previous post for more information on how bowtie2 assigns MAPQ scores. Even if someone does not filter out the multireads, most people for many applications remove all redundant reads except 1 per site (redundant reads are those that map to the same exact site). Thus, even if bowtie2 was not random, the pileup effect would be smashed at that step. Finally, since most people only care about reads of AS > XS for most applications, random assignment does not matter. Those go to the site corresponding to AS.

As a closing remark, if you want to work with multireads, then the a priori uniform distribution over all sites is not the best you can do. One could create a posterior distribution over those sites after looking at the distribution of the high quality maxiread (AS > XS) data. I discuss this more here.

Wednesday, May 28, 2014

How does bowtie2 assign MAPQ scores?

by JohnUrbanGenome

Outline:
I. Introduction

II. A little bit to know about bowtie2
A. How bowtie2 scores a mismatch
B. How bowtie2 decides if an alignment is valid
C. How bowtie2 describes its alignments
III. Experiments to look at how Bt2 assigns MAPQ scores
A. Methods
B. Experiments, Results, and Conslusions
IV. An English translation of how bowtie2 assigns MAPQs from the code and experiments
V. Python code for calculating bt2-style MAPQs

Introduction:

Recently, I discussed "The slow death of the term 'uniquely mappable'". There I talked about the older definition of "uniquely mappable", which describes a read that maps to only a single site in the genome, and how it has has become a less-than-reliable way to extract only high quality mappings. I also talked about the newer definition of uniquely mappable, which describes any read for which the alignment score of the best alignment is better than that of the 2nd best and all other alignments. Reads that meet the old definition are a subset of the reads in the newer definition since their alignment score is by default better than scores in the empty set of other possible alignments. To disambiguate the term "uniquely mappable read" or "uniread", I refer to any read that meets the newer definition as a "maxiread" because it has a single position with the maximum alignment score. To reiterate in the newer language, the traditionally defined "unireads" make a proper subset of the set of maxireads. Even though the newer definition encompasses more possible high quality mapped reads, it does not by itself eliminate mapped reads when the best or only alignment score is poor (and the 2nd best is really poor). It just takes the best of the totally horrible situation. Thus, the value of the alignment score itself also needs to be taken into consideration, not just whether or not it is the only score or better than the next best one. Mapping quality (MAPQ or MQ) scores are used by aligners such as bowtie2 and bwa to achieve this goal. Briefly, MAPQ is supposed to represent a transformation of p, the probability a read is mapped wrong by:

Read more background about MAPQ scores, unireads, multireads, and maxireads in my previous post, "The slow death of the term 'uniquely mappable' in deep sequencing studies and resisting the "conservative" urge to toss out data". In practice, MAPQ ranges from 0 to 37, 40, or 42, depending on what aligner you are using. My experiences below are referring to bowtie2 MAPQ scores. These MAPQ scores are not calculated with that equation above. In fact, as far as I know, no program does that. Each program has ways to estimate or approximate the MAPQ for a read mapping that it then assigns it. The goal of this post is to talk about how bowtie2 sees mismatches, alignment scores, and MAPQ scores, then to translate into English, through tiny experiments and looking into the code, what each MAPQ score from bowtie2 means. I also provide python code to calculate bowtie2 style MAPQs yourself. The code itself might be better than English for some people (assuming anyone other than spam bots read this).

When I started learning bioinformatics to analyze NGS datasets, I knew I should filter out all reads with a MAPQ lower than X, but it was not clear what X should be. One trusted colleague who works on Drosophila told me to keep reads with MAPQ >= 30. However, this in practice seemed too conservative when working with reads (that can have errors) from a human cancer cell line, which has SNPs, microdeletion, microinsertions, breakpoints, etc - i.e. things that make the alignment score lower even at "correct" site. Searching the forums (e.g. seqanswers and biostars) did not lead to a concrete answer. I read one thread that concluded any read with a MAPQ >= 10 is fine, and another that testified that any MAPQ >= 1 is fine. Both claimed that after 1 or 10, all of the reads are "unique" (usually it is not clear by which definition someone means). Then someone else said a MAPQ of 255 is for reads with a unique mapping (presumably referring to the older definition). That, however, is contradicted by the SAM format specifications, which says that 255 is to be used for mapped reads where the MAPQ is not available. Another thread provided the seeming gem that a MAPQ of 0 is for reads that with 10+ mapping locations, a MAPQ of 1 is for reads with 4-9 locations, a MAPQ of 2 is for reads with 3 locations, a MAPQ of 3 is for reads with 2 locations, and presumably anything with a MAPQ of 4 or better mapped to 1 location with increasingly good alignment scores. If you use the equation above, you will find that this set of MAPQ descriptions essentially came from that equation rather than an actual breakdown of what bowtie2 is doing, for example. Yet other people said that multireads in bowtie2 are only given a 0 or a 1 and uniquely mappable reads had scores of 2 or higher (with no indication on what their definition of "uniquely mappable" was). Contrary to MAPQs of 0 and 1 being reserved for multireads, I had uniquely mapped (single site, old definition) reads with MAPQ scores of 0 to explain. Thus, I set out on a series of tiny experiments and looking for snippets of the code to tell me exactly what bowtie2 does.

First, here is a little bit to know about bowtie2 end-to-end mode:

How bowtie2 scores a mismatch:

- The maximum alignment score possible is 0. From there penalties are subtracted.

- MX is the maximum penatly and MN is minimum penalty. These are tune-able parameters. The defaults are MN=2 and MX=6.

- Recall base call qualities can be between 0-40, lowest to highest, and are referred to by the variable, Q or BQ to distinguish it from MAPQ.

- When a mismatch is encountered, a penalty between MN-MX is subtracted.

penalty = MN + floor( (MX-MN)*(MIN(Q, 40.0)/40.0) )

Lets say: Diff = MX-MN

And note: MN + Diff = MX

Lets say: fraction = Q/40

That is, "fraction" is the base quality, Q, divided by the best possible base quality (40)

And lets ignore "floor".

Then,

penalty = MN + Diff*fraction

--> So if the base quality is perfect (40), then

fraction = 40/40 = 1

and,

penatly = MN + Diff = MX = 6 (default)

--> And the lower quality the base call, the more we believe it is an error so the less of a penalty we give the mismatch. For example if Q = 0, then

fraction = 0/40 = 0

and,

penatly = MN + 0 = MN = 2 (default)

Or if Q = 20, then

fraction = 20/40 = 0.5

and,

penatly = MN + 0.5*Diff = 4 (default)

- In summary, MN is subtracted when the base call quality is 0. This is because when the base call quality is low, we assume the mismatch is more likely due to a base call error than to a real mismatch. MX is subtracted when the base call quality is 40. This is because when the base quality is high, we assume the mismatch is more likely to be real. It could be due to a real SNP though, which gives rise to the unfortunate case where we penalize unjustly when aligning it to the "correct" spot on the reference. All other mismatches over bases with quality Q=1-39 have penalties between MN and MX. If you choose to ignore quality scores, MX is always subtracted. This is because you are telling it to always assume the mismatch is a real mismatch and are not distinguishing between mismatches potentially caused by a sequencing error.

- Other penalties and defaults:

N penalty: default -1 for every position with an N (whether the N is either in the read, ref, or both)

gap open penalty: default -5 for both read and ref

gap extend penalty: default -3 for both read and ref

How bowtie2 decides if an alignment is valid:

- Bowtie2 considers an alignment valid if it is higher than minimum Alignment Score allowed

- Bowtie2 uses a function to determine the minimum alignment score, which is dependent on the length of the read:

score minimum = f(readLength) = b + m * readLength

- Bowtie2 allows you to set the "b" and "m" in the minimum alignment score function with the "--score-min" flag

- The default values are b = -0.6 and m = -0.6:

default score minimum = f(readLength) = -0.6 + -0.6 * readLength

- For a 50 bp single end (SE) read, this translates to a minimum alignment score of -30.6:

default score minimum = f(50) = -0.6 + -0.6 * 50 = -30.6

- To put this in terms of mismatches, this will accept 5 mismatches over high quality (Q=40) bases (though MAPQ goes to 0 after 3), up to 7 mismatches over 'medium' quality (Q=20) base calls (though MAPQ goes to 0 after 5), and up to 15 mismatches over the lowest quality (Q=0) bases (though MAPQ goes to 0 after 10 mismatches). Why and how MAPQ goes to 0 for different numbers of bases given the quality is because it only cares about the alignment score and those are the numbers before the alignment score gets bad enough for MAPQ to go to 0.

How bowtie2 describes its alignments:

Bowtie2 output is in SAM format. For more information on SAM format files, see the SAM format specifications. For now suffice it to say just what I say here.

Fields 1-11 are fixed.

Fields 12+ are not. They may or may not be present.

Field What You Find There

1 readname

2 bitflag -- for SE bt2 mapping, only 0 (pos strand), 4 (unmapped), 16 (neg strand) are relevant

3 chrName/refname

4 1-based start on fwd strand (this is the end if on neg strand)

5 MAPQ

6 CIGAR string rep of alignment (most are 50M which means 50 match+mismatch)

7 chrName/refName of mate if PE; * if SE

8 1 based start of mate; * if SE

9 Inferred fragment length if PE; * if SE

10 the read's sequence

11 the reads base call qualities (Qs)

Optional field formats:

All optional fields follow the TAG:TYPE:VALUE format where:

TAG is a two-character string

TYPEs:

A (printable character)

i (integer)

f (float)

Z (printable string including space)

H (byte array in Hex format)

B (integer or numeric array)

*only types i and Z apply to bt2 optional fields

Optional fields 12+:

AS:i:<N> aln score

XS:i:<N> aln score for next best aln (only reported if >1 aln found)

YS:i:<N> aln score for mate (PE only)

XN:i:<N> number of ambiguous bases in the reference covering the aln (#Ns)

XM:i:<N> number of mismatches in aln

XO:i:<N> number of gap opens

XG:i:<N> number of gap extensions

NM:i:<N> the edit distance (whereas XM more like hamming -- only if no gaps were present)

Note: if XM == NM, then XO and XG == 0

YF:Z:<S> reports why bt2 ignored this read (filtered out) if it did -- see codes in bt2 manual

YT:Z:<S> reports on pair information

all SE reads are given: YT:Z:UU -- meaning they are not part of a pair

MD:Z:<S> A string representation of the mismatched reference bases in the alignment. See SAM specs.

Experiments to look at how Bt2 assigns MAPQ scores:

Methods:

Making reads:

I randomly generated 50 bp sequences in python using this code:

def genSeq(seqLen, numSeq, probs=True):
'''Depends on random module
bases = 'ACGT'
seqs = []
for i in range(numSeq):
seq = ''
for j in range(seqLen):
index = random.randint(0,3)
seq += bases[index]
seqs += [seq]
return seqs

Since I only made a few, reads were manually turned into fasta or fastq format. Note that when mapping with bowtie2, fasta reads are assumed to all be of the highest quality base calls (base quality scores of 40). In fastq files, I either made all bases in a read have base call qualities of 20 or 0.

Example reads:

Fasta Read:

>read1

TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA

Fastq Read Q20:

@read1_Q20

TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA

55555555555555555555555555555555555555555555555555

Fastq Read Q0:

@read1_6mm_Q20

TAACCCTAACCCTAACCCTAACGCTAACCGTAACCCAAACCCAAACCCAT

55555555555555555555555555555555555555555555555555

Making chromosomes to map the reads to:

To start, I made eleven 50bp fasta reads named read1 - read11. Each read was made into a "chromosome" by having the read sequence repeat itself to have as many copies as the integer in its name. For example, read1 was kept as is and renamed chr1, read2 was lengthened by repeating it once for a copy number of 2 and renamed chr2, read3 was lengthened by repeating it 2 times and renamed chr3. The randomly generated 50 bp reads throughout the experiments have a range of hamming distances between 27-47 (median = 38). These parameters hold true even when generating thousands of pairwise comparisons this way. Thus, these random reads only ever align to their corresponding chromosome even when mismatches are engineered into them. For example, read1 should align only to chr1 (once), reads2 should align only to chr2 (twice), read3 should align only to chr3 ('thrice'), etc...

Example "chromosomes":

>chr1

TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA

>chr2

ACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAAACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAA

>chr3

GTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGGGTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGGGTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGG

Indexing the fasta file of chromosomes:

bowtie2-build -f chr.fa chr

Randomly adding mutations to reads (or chromosomes):

To randomly mutate a read I just manually changed a base to another base in the read since these were such tiny experiments. Reads with mismatches were given names that told the read they derived from, how many mismatches, and when the base call quality was important (fastq files) what the Q value was for all bases (As described above every base in the read was assigned the same base call quality score).

Example read names:
read1_1mm
read1_1mm_Q20

Mapping reads:

bowtie2 --very-sensitive -N 1 -x chr -U reads.fa

-f used when fasta provided for reads

-a used when I wanted to count all possible alignments for a read
As the above command illustrates, defaults for the minimum alignment score cutoff and for the penalties for mismatches, gap opens and gap extends always used.
Only end to end alignments were performed in these experiments.

Counting number of times a chromosome was mapped to:

In general the following command was used to find all alignments for each read:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa -a

-f used when fasta provided for reads

Since bowtie2 does not have a SAM flag that reports how many times a read aligned, I used various command line tricks to count how many times each read mapped such as:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa -f -a | samtools view -F 4 -S - | cut -f 1 | uniq -c

Extracting alignment scores for the best (AS:i:<N>) and second best (XS:i:<N>) alignments:

bowtie2 --very-sensitive -N 1 -x chr -U reads.fa | grep -v ^@ | cut -f 1,3,5,12,13

-f used when fasta provided for reads
Note that if there is not a second best alignment, then the above command extracts XN:i:<N>

Experiments, Results, and Conclusions:

For each experiment, note that reads only ever aligned to their respective chromosomes (e.g. read1 to chr1, see methods). Moving forward, keep in mind that for bowtie2, AS is alignment score of the best alignment and XS is the alignment score for the second best alignment. In these experiments, AS always equals XS if an XS exists by design. So these are "true multireads" as compared to "maxireads", reads with a single best alignment score, but other mappings with poorer scores.

Experiment 1 - true multireads with "perfect" base qualities only receive MAPQs of 0 or 1:

I made reads 1-11 and 100 (fasta format) and their corresponding chromosomes chr1-11 and chr 100 with copy numbers of 1-11 and 100 respectively (see methods). I also made 6 more derivative reads with 1-6 mismatches respectively from each read1, read2, read3, and read100. Since all base calls are considered perfect for fasta reads and because of the defaults of bowtie2, the maximum number of mismatches allowed should be 5 to be considered a valid alignment. The read with 6 mismatches should not map. One of the hypotheses that I tested was that bowtie2 MAPQ scores follow the rules in one forum stating 0 means the read mapped 10+ times, 1 means 4-9 times, 2 means 3 times, and 3 means 2 times.

Results:

Read Name NumAligns AS XS MAPQ (for best position)

read1 1 AS:i:0 * 42

read1_1mm 1 AS:i:-6 * 40

read1_2mm 1 AS:i:-12 * 23

read1_3mm 1 AS:i:-18 * 3

read1_4mm 1 AS:i:-24 * 0

read1_5mm 1 AS:i:-30 * 0

read1_6mm 0 * * *

read2 2 AS:i:0 XS:i:0 1

read2_1mm 2 AS:i:-6 XS:i:-6 1

read2_2mm 2 AS:i:-12 XS:i:-12 0

read2_3mm 2 AS:i:-18 XS:i:-18 0

read2_4mm 2 AS:i:-24 XS:i:-24 0

read2_5mm 2 AS:i:-30 XS:i:-30 0

read2_6mm 0 * * *

read3 3 AS:i:0 XS:i:0 1

read3_1mm 3 AS:i:-6 XS:i:-6 1

read3_2mm 3 AS:i:-12 XS:i:-12 0

read3_3mm 3 AS:i:-18 XS:i:-18 0

read3_4mm 3 AS:i:-24 XS:i:-24 0

read3_5mm 3 AS:i:-30 XS:i:-30 0

read3_6mm 0 * * *

read4 4 AS:i:0 XS:i:0 1

read5 5 AS:i:0 XS:i:0 1

read6 6 AS:i:0 XS:i:0 1

read7 7 AS:i:0 XS:i:0 1

read8 8 AS:i:0 XS:i:0 1

read9 9 AS:i:0 XS:i:0 1

read10 10 AS:i:0 XS:i:0 1

read11 11 AS:i:0 XS:i:0 1

read100 100 AS:i:0 XS:i:0 1

read100_1mm 100 AS:i:-6 XS:i:-6 1

read100_2mm 100 AS:i:-12 XS:i:-12 0

read100_3mm 100 AS:i:-18 XS:i:-18 0

read100_4mm 100 AS:i:-24 XS:i:-24 0

read100_5mm 100 AS:i:-30 XS:i:-30 0

read100_6mm 0 * * *

All AS and XS were as expected for 0-5 mismatches: 0, -6, -12, -18, -24, and -30 respectively. The MAPQ for read1 was 42 because there was only a single perfect alignment. The MAPQs for read1 with 1-5 mismatches were 40, 23, 3, 0, and 0 respectively. The MAPQ for all other reads with 0 mismatches was 1 regardless of whether it had 2-100 alignments. The MAPQ for all reads with 1 mismatch was 1 regardless of whether it had 2-100 alignments. The MAPQ for all reads with 2-5 mismatches was 0 regardless of whether it had 2-100 alignments.

Conclusions:

The hypothesis stated above is false. In bowtie2, the best a true multiread (AS=XS) can get is MAPQ=1 regardless of how low or high its multiplicity. This occurs when there are 0 or 1 mismatches over perfect base calls in the read, or when AS=XS goes down to -6. When there are 2-5 mismatches over perfect base calls (or the AS=XS <= -12 ---- i.e. -12 to -30.6), the MAPQ becomes 0.

If someone wanted to exclude "true multireads" from their data set, using MAPQ >= 2 would work. This would also exclude any uniquely mapping reads with >=4 mismatches over high quality bases. In terms of high quality bases and unireads, MAPQ >= 3 allows up to 3 mismatches, MAPQ >= 23 allows up to 2 mismatches, MAPQ >= 40 allows up to 1 mismatch, and MAPQ >= 42 allows 0 mismatches. There will also be other "maxireads" in most or all of these sets.

Experiment 2 - true multireads with "medium" and "poor" base qualities only receive MAPQs of 0 or 1:

For this mini-experiment, I converted the first 6 fasta format reads above into fastq format reads to be able to have control over base call quality values. Each read was given either all Q=20 or Q=0 base qualities. Note in experiment one all were considered Q=40 since the reads were from a fasta file. Here I wanted to see how poorer base qualities affect MAPQs, for example, of multireads. Moreover, I wanted to see if they had any affect on "perfect matches". Note that with these lower quality bases, more mismatches are allowed to accrue.

Q0 entries in blue to make them easier to find.

Read Name NumAligns AS XS MAPQ (for best position)

read1_Q20 1 AS:i:0 * 42
read1_1mm_Q20 1 AS:i:-4 * 42
read1_2mm_Q20 1 AS:i:-8 * 40
read1_3mm_Q20 1 AS:i:-12   * 23
read1_4mm_Q20 1 AS:i:-16   * 8
read1_5mm_Q20 1 AS:i:-20   * 3
read1_6mm_Q20 1 AS:i:-24   * 0
read1_1mm_Q0 1 AS:i:-2 * 42
read1_2mm_Q0 1 AS:i:-4 * 42
read1_3mm_Q0 1 AS:i:-6 * 40
read1_4mm_Q0 1 AS:i:-8 * 40
read1_5mm_Q0 1 AS:i:-12   * 23
read1_6mm_Q0 1 AS:i:-12   * 23
read1_7mm_Q0 1 AS:i:-14   * 23
read1_8mm_Q0 1 AS:i:-16   * 8
read1_9mm_Q0 1 AS:i:-18   * 3
read1_10mm_Q0 1 AS:i:-20   * 3
read1_11mm_Q0 1 AS:i:-22   * 0
read2_Q20 2 AS:i:0 XS:i:0 1
read2_1mm_Q20 2 AS:i:-4 XS:i:-4 1
read2_2mm_Q20 2 AS:i:-8 XS:i:-8 1
read2_3mm_Q20 2 AS:i:-12 XS:i:-12 0
read2_3mm_Q0 2 AS:i:-6 XS:i:-6 1
read2_4mm_Q0 2 AS:i:-8 XS:i:-8 1
read2_5mm_Q0 2 AS:i:-10 XS:i:-10 0
read3_Q20 3 AS:i:0 XS:i:0 1
read3_1mm_Q20 3 AS:i:-4 XS:i:-4 1
read3_2mm_Q20 3 AS:i:-8 XS:i:-8 1
read3_3mm_Q20 3 AS:i:-12 XS:i:-12 0
read4_Q20 4 AS:i:0 XS:i:0 1
read4_1mm_Q20 4 AS:i:-4 XS:i:-4 1
read4_2mm_Q20 4 AS:i:-8 XS:i:-8 1
read4_3mm_Q20 4 AS:i:-12 XS:i:-12 0
read5_Q20 5 AS:i:0 XS:i:0 1
read5_2mm_Q20 5 AS:i:-8 XS:i:-8 1
read5_3mm_Q20 5 AS:i:-12 XS:i:-12 0
read6_Q20 6 AS:i:0 XS:i:0 1
read6_2mm_Q20 6 AS:i:-8 XS:i:-8 1
read6_2mm_Q0 6 AS:i:-4 XS:i:-4 1
read6_4mm_Q0 6 AS:i:-8 XS:i:-8 1
read6_5mm_Q0 6 AS:i:-10 XS:i:-10 0
read6_6mm_Q0 6 AS:i:-12 XS:i:-12 0

When read1 (which aligns to a single position) had either 0 or 1 mismatch over a Q20 base, it received a perfect MAPQ of 42. When it had between 1-2 mismatches over Q0 bases, it also still had a perfect MAPQ. MAPQs remain non-zero for up to 5 Q20 mistmatches, up to 10 Q0 mismatches, and from expt 1, up to 3 mismatches over Q40 bases. What those 3 cutoffs have in common are alignment scores >= -20 (i.e. between 0 to -20) while the one more mismatch in each case that caused a MAPQ of 0 had alignment scores <= -22 (i.e. -22 to -30.6). Thus, the AS cutoff is somewhere between -20 and -22 for having a non-zero MAPQ. All true multireads (AS=XS), no matter what the copy number, still could only have a MAPQ of 0 or 1. The MAPQ was 1 when the AS was >= -8 (i.e. 0 to -8) and was 0 when the AS was <= -10 (i.e. -10 to -30.6). So, a true multiread (AS=XS) gets a MAPQ of 1if its AS is "good" and a 0 if its "bad" and the cutoff between good and bad is somewhere between -8 and -10.

An English translation of how bowtie2 assigns MAPQs from the code and experiments

Here I use the experiments above and the actual code to provide a useful table on what select MAPQ cutoffs mean. The code I looked at was actually from the C code version by Devon Ryan here.

Bowtie2 does not use the number of times a read mapped in the calculation for MAPQ. Instead, it just uses the AS and the XS, which is either >= the minimum score or below it. If AS == XS, the read is considered a true multiread and can only get a score of 0 or 1. If XS is below the minimum score, the read is considered a true uniread under the given scoring scheme. True unireads can get scores of 0, 3, 8, 23, 24, 40 and 42. A score of 0 is reserved for the true unireads with an AS in the bottom 30% of allowable scores. The scores 3, 8, 23, 24, 40, and 42 are unique to true unireads. Therefore, if someone is hell bent on taking only "unireads" with decent alignment scores, the way to do it would be to take only reads with those scores.

awk '$5 == 3 || $5 == 8 || $5 == 23 || $5 == 24 || $5 == 40 || $5 == 42' file.sam

The other possibility is to grep for reads that do not have XS:i::

grep -v XS:i: file.sam

Note that it is important to grep for "XS:i:" and not "XS" which will also take reads with base quality encodings that have "XS". If you use this approach, you will also get the low quality MAPQ=0 true unireads.Though it is possible to extract only "true unireads" that way, you should note that since mismatches and gaps were allowed during these alignments, that this action may be way more conservative than you want to be. It may also be just kind of wrong to do. Read my previous posting about why.

Other maxireads can get scores between 0-39. These are the exact scores that can be assigned: 0-7, 11-12, 14-22, 25-27, 30-39. It would seem that the scores 9, 10, 13, 28, 29, and 41 can not be assigned in bowtie2. Though maybe someone can prove me wrong by looking at their bt2 outputs.

Recall, "true multireads" are those where AS=XS, "true unireads" are reads that have only a single mapping position in the scoring scheme, and "maxireads" include true unireads and reads where AS > XS. The following are MAPQ cutoffs and how they will affect each of these types of reads. Note, however, that other "maxireads" will be in each set. These experiments just do not elucidate which ones. This table puts MAPQs in terms of maximum number of mismatches possible over a certain quality base, but combinations of base qualities are more likely. Also remember gaps can occur as well.

MAPQ >= X #MM Q40 #MM Q20 #MM Q0 Description

0 5 7 15 All mappable reads

1 3 5 10 True multi w/ "good" AS, maxi of MAPQ >= 1

2 3 5 10 No true multi, maxi of MAPQ >= 2

3 3 5 10 No true multi, maxi of MAPQ >= 3

8 2 4 8 No true multi, maxi of MAPQ >= 8

23 2 3 7 No true multi, maxi of MAPQ >= 23

30 1 2 4 No true multi, maxi of MAPQ >= 30

39 1 2 4 No true multi, maxi of MAPQ == 39*

40 1 2 4 No true multi, only true uni-reads

42 0 1 2 Only "perfect" true unireads

The next table puts it in terms of AS, XS, and minimum score. Maxireads refer to any read where AS > XS, but for now not when there is no XS. "classic unireads" will refer to when there is no XS. "Multireads" refer to AS=XS true multireads. I will use the term percentile to mean that percentile of all possible alignment scores. For example, AS in the 80th percentile means the AS is >= 80% of all possible alignment scores given the minimum score allowed -- here the column will just say 80th to mean all reads with AS >= 80% of all possible AS. "dAS" will refer to the absolute value of AS-XS and how this compares to the absolute value of the minimum score. Thus, dAS_80th would mean that the difference between the best alignment score and the 2nd best is equivalent to an AS in the 80th percentile.

##TODO :-P -- read the python code

MAPQ
>= X multi maxi maxi

0 all all all
1 68th 30th dAS<=30% & AS 68th OR dAS in 30th*
2 None   30th dAS<=30% & AS 68th OR dAS in 30th*
3 None   30th dAS<=30% & AS 68th OR dAS in 30th*
4   None
5   None
6   None
7   None
8   None
11   None
12   None
14   None
15   None
16   None
17   None
18   None
21   None
22   None
23   None
24   None
25   None
26   None
27   None
30   None
31   None
32   None
33   None
34   None
35   None
36   None
37   None
38   None
39   None
40   None
42 None None 80th Only high AS classic unireads

*not strictly followed in the case of MAPQ=2, check out the code

Python code for calculating bt2-style MAPQs

Here is my python code translated from Devon Ryan's C code version here. I made some changes to the values (e.g. 0.61 in stead of 0.60) to make the MAPQ calculations 100% consistent with the MAPQ values I obtained in the above experiments with the same AS and XS.

def bt2_mapq_end2end(AS, XS=None, scMin=-30.6):

'''scMin = minScore'''

if XS == None:

XS = scMin-1

if XS > AS:

return None

diff = abs(scMin)

bestOver = AS-scMin

bestdiff = abs(abs(AS)-abs(XS))

if XS < scMin:

if bestOver >= diff*0.8:

return 42

elif bestOver >= diff*0.7:

return 40

elif bestOver >= diff*0.61:

return 24

elif bestOver >= diff*0.5:

return 23

elif bestOver >= diff*0.42:

return 8

elif bestOver >= diff*0.3:

return 3

else:

return 0

else:

if bestdiff >= diff*0.9:

if bestOver == diff:

return 39

else:

return 33

elif bestdiff >= diff*0.8:

if bestOver == diff:

return 38

else:

return 27

elif bestdiff >= diff*0.97:

if bestOver == diff:

return 37

else:

return 26

elif bestdiff >= diff*0.6:

if bestOver == diff:

return 36

else:

return 22

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.84:

return 25

elif bestOver >= diff*0.68:

return 16

elif bestOver >= diff*0.68:

return 5

elif bestdiff >= diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.84:

return 21

elif bestOver >= diff*0.68:

return 14

else:

return 4

elif bestdiff >= diff*0.3:

if bestOver == diff:

return 32

elif bestOver >= diff*0.88:

return 18

elif bestOver >= diff*0.67:

return 15

else:

return 3

elif bestdiff >= diff*0.2:

if bestOver == diff:

return 31

elif bestOver >= diff*0.88:

return 17

elif bestOver >= diff*0.67:

return 11

else:

return 0

elif bestdiff >= diff*0.1:

if bestOver == diff:

return 30

elif bestOver >= diff*0.88:

return 12

elif bestOver >= diff*0.67:

return 7

else:

return 0

elif bestdiff > 0:

if bestOver >= diff*0.67:

return 6

else:

return 2

else:

if bestOver >= diff*0.68:

return 1

else:

return 0

The End.