Wednesday, May 28, 2014

How does bowtie2 assign MAPQ scores?



Bowtie2 is an ultra-fast program for aligning next generation sequence reads to large genomes written by Ben Langmead and which happens to be my aligner of choice. RTFM here. Download it here or learn how to get it on your Mac OS X with "homebrew" from one of my previous posts, "The Craft of Homebrew-ing for Mac OS".

Recently, I discussed "The slow death of the term 'uniquely mappable'". There I talked about the older definition of "uniquely mappable", which describes a read that maps to only a single site in the genome, and how it has has become a less-than-reliable way to extract only high quality mappings. I also talked about the newer definition of uniquely mappable, which describes any read for which the alignment score of the best alignment is better than that of the 2nd best and all other alignments. Reads that meet the old definition are a subset of the reads in the newer definition since their alignment score is by default better than scores in the empty set of other possible alignments. To disambiguate the term "uniquely mappable read" or "uniread", I refer to any read that meets the newer definition as a "maxiread" because it has a single position with the maximum alignment score. To reiterate in the newer language, the traditionally defined "unireads" make a proper subset of the set of maxireads.  Even though the newer definition encompasses more possible high quality mapped reads, it does not by itself eliminate mapped reads when the best or only alignment score is poor (and the 2nd best is really poor). It just takes the best of the totally horrible situation. Thus, the value of the alignment score itself also needs to be taken into consideration, not just whether or not it is the only score or better than the next best one. Mapping quality (MAPQ or MQ) scores are used by aligners such as bowtie2 and bwa to achieve this goal. Briefly, MAPQ is supposed to represent a transformation of p, the probability a read is mapped wrong by:


                       

Read more background about MAPQ scores, unireads, multireads, and maxireads in my previous post, "The slow death of the term 'uniquely mappable' in deep sequencing studies and resisting the "conservative" urge to toss out data". In practice, MAPQ ranges from 0 to 37, 40, or 42, depending on what aligner you are using. My experiences below are referring to bowtie2 MAPQ scores. These MAPQ scores are not calculated with that equation above. In fact, as far as I know, no program does that. Each program has ways to estimate or approximate the MAPQ for a read mapping that it then assigns it. The goal of this post is to talk about how bowtie2 sees mismatches, alignment scores, and MAPQ scores, then to translate into English, through tiny experiments and looking into the code, what each MAPQ score from bowtie2 means. I also provide python code to calculate bowtie2 style MAPQs yourself. The code itself might be better than English for some people (assuming anyone other than spam bots read this).

When I started learning bioinformatics to analyze NGS datasets, I knew I should filter out all reads with a MAPQ lower than X, but it was not clear what X should be. One trusted colleague who works on Drosophila told me to keep reads with MAPQ >= 30. However, this in practice seemed too conservative when working with reads (that can have errors) from a human cancer cell line, which has SNPs, microdeletion, microinsertions, breakpoints, etc - i.e. things that make the alignment score lower even at "correct" site. Searching the forums (e.g. seqanswers and biostars) did not lead to a concrete answer. I read one thread that concluded any read with a MAPQ >= 10 is fine, and another that testified that any MAPQ >= 1 is fine. Both claimed that after 1 or 10, all of the reads are "unique" (usually it is not clear by which definition someone means). Then someone else said a MAPQ of 255 is for reads with a unique mapping (presumably referring to the older definition). That, however, is contradicted by the SAM format specifications, which says that 255 is to be used for mapped reads where the MAPQ is not available. Another thread provided the seeming gem that a MAPQ of 0 is for reads that with 10+ mapping locations, a MAPQ of 1 is for reads with 4-9 locations, a MAPQ of 2 is for reads with 3 locations, a MAPQ of 3 is for reads with 2 locations, and presumably anything with a MAPQ of 4 or better mapped to 1 location with increasingly good alignment scores. If you use the equation above, you will find that this set of MAPQ descriptions essentially came from that equation rather than an actual breakdown of what bowtie2 is doing, for example. Yet other people said that multireads in bowtie2 are only given a 0 or a 1 and uniquely mappable reads had scores of 2 or higher (with no indication on what their definition of "uniquely mappable" was). Contrary to MAPQs of 0 and 1 being reserved for multireads, I had uniquely mapped (single site, old definition) reads with MAPQ scores of 0 to explain. Thus, I set out on a series of tiny experiments and looking for snippets of the code to tell me exactly what bowtie2 does.

First, here is a little bit to know about bowtie2 end-to-end mode:


- The maximum alignment score possible is 0. From there penalties are subtracted.

- MX is the maximum penatly and MN is minimum penalty. These are tune-able parameters. The defaults are MN=2 and MX=6. 

- Recall base call qualities can be between 0-40, lowest to highest, and are referred to by the variable, Q or BQ to distinguish it from MAPQ.

- When a mismatch is encountered, a penalty between MN-MX is subtracted.
                penalty = MN + floor( (MX-MN)*(MIN(Q, 40.0)/40.0) )
                   Lets say: Diff = MX-MN
                         And note: MN + Diff = MX
                   Lets say: fraction = Q/40
                          That is, "fraction" is the base quality, Q, divided by the best possible base quality (40)
                   And lets ignore "floor".
                     Then, 
                                           penalty = MN + Diff*fraction

--> So if the base quality is perfect (40), then 
                                                    fraction = 40/40 = 1
                                                          and,
                                                     penatly = MN + Diff  = MX = 6 (default)

--> And the lower quality the base call, the more we believe it is an error so the less of a penalty we give the mismatch. For example if Q = 0, then 
                                                    fraction = 0/40 = 0
                                                          and,
                                                     penatly = MN + 0 = MN = 2 (default)
                              Or if Q = 20, then 
                                                    fraction = 20/40 = 0.5
                                                          and,
                                                    penatly = MN + 0.5*Diff = 4 (default)

- In summary, MN is subtracted when the base call quality is 0. This is because when the base call quality is low, we assume the mismatch is more likely due to a base call error than to a real mismatch. MX is subtracted when the base call quality is 40. This is because when the base quality is high, we assume the mismatch is more likely to be real. It could be due to a real SNP though, which gives rise to the unfortunate case where we penalize unjustly when aligning it to the "correct" spot on the reference. All other mismatches over bases with quality Q=1-39 have penalties between MN and MX. If you choose to ignore quality scores, MX is always subtracted. This is because you are telling it to always assume the mismatch is a real mismatch and are not distinguishing between mismatches potentially caused by a sequencing error.

- Other penalties and defaults:
     N penalty: default -1 for every position with an N (whether the N is either in the read, ref, or both)
     gap open penalty: default -5 for both read and ref
     gap extend penalty: default -3 for both read and ref



- Bowtie2 considers an alignment valid if it is higher than minimum Alignment Score allowed

- Bowtie2 uses a function to determine the minimum alignment score, which is dependent on the length of the read:
                               score minimum   =   f(readLength)   =   b + m * readLength

- Bowtie2 allows you to set the "b" and "m" in the minimum alignment score function with the "--score-min" flag

- The default values are b = -0.6 and m = -0.6:
                            default score minimum   =   f(readLength)   =   -0.6 + -0.6 * readLength

- For a 50 bp single end (SE) read, this translates to a minimum alignment score of -30.6:
                          default score minimum   =   f(50)   =   -0.6 + -0.6 * 50 = -30.6

- To put this in terms of mismatches, this will accept 5 mismatches over high quality (Q=40) bases (though MAPQ goes to 0 after 3), up to 7 mismatches over 'medium' quality (Q=20) base calls (though MAPQ goes to 0 after 5), and up to 15 mismatches over the lowest quality (Q=0) bases (though MAPQ goes to 0 after 10 mismatches). Why and how MAPQ goes to 0 for different numbers of bases given the quality is because it only cares about the alignment score and those are the numbers before the alignment score gets bad enough for MAPQ to go to 0.


How bowtie2 describes its alignments: 

Bowtie2 output is in SAM format. For more information on SAM format files, see the SAM format specifications. For now suffice it to say just what I say here.

Fields 1-11 are fixed.
Fields 12+ are not. They may or may not be present.

Field    What You Find There
1          readname
2          bitflag -- for SE bt2 mapping, only 0 (pos strand), 4 (unmapped), 16 (neg strand) are relevant
3          chrName/refname
4          1-based start on fwd strand (this is the end if on neg strand)
5          MAPQ
6          CIGAR string rep of alignment (most are 50M which means 50 match+mismatch)
7          chrName/refName of mate if PE; * if SE
8          1 based start of mate; * if SE
9          Inferred fragment length if PE; * if SE
10        the read's sequence
11        the reads base call qualities (Qs)

Optional field formats:
All optional fields follow the TAG:TYPE:VALUE format where:
TAG is a two-character string
TYPEs: 
     A (printable character)
     i (integer)
     f (float)
     Z (printable string including space)
     H (byte array in Hex format)
     B (integer or numeric array)
*only types i and Z apply to bt2 optional fields

Optional fields 12+:
AS:i:<N> aln score
XS:i:<N> aln score for next best aln (only reported if >1 aln found)
YS:i:<N> aln score for mate (PE only)
XN:i:<N> number of ambiguous bases in the reference covering the aln (#Ns)
XM:i:<N> number of mismatches in aln
XO:i:<N> number of gap opens
XG:i:<N> number of gap extensions
NM:i:<N> the edit distance (whereas XM more like hamming -- only if no gaps were present)
Note: if XM == NM, then XO and XG == 0
YF:Z:<S> reports why bt2 ignored this read (filtered out) if it did -- see codes in bt2 manual
YT:Z:<S> reports on pair information
all SE reads are given: YT:Z:UU -- meaning they are not part of a pair
MD:Z:<S> A string representation of the mismatched reference bases in the alignment. See SAM specs.


Methods:

Making reads:
I randomly generated 50 bp sequences in python using this code:

def genSeq(seqLen, numSeq, probs=True):
    '''Depends on random module
    bases = 'ACGT'
    seqs = []
    for i in range(numSeq):
        seq = ''
        for j in range(seqLen):
            index = random.randint(0,3)
            seq += bases[index]
        seqs += [seq]
    return seqs


Since I only made a few, reads were manually turned into fasta or fastq format. Note that when mapping with bowtie2, fasta reads are assumed to all be of the highest quality base calls (base quality scores of 40). In fastq files, I either made all bases in a read have base call qualities of 20 or 0.

Example reads:
                       Fasta Read:
                                 >read1
                                 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA


                       Fastq Read Q20:
                                 @read1_Q20
                                 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
                                 +
                                 55555555555555555555555555555555555555555555555555


                       Fastq Read Q0:
                                 @read1_6mm_Q20
                                 TAACCCTAACCCTAACCCTAACGCTAACCGTAACCCAAACCCAAACCCAT
                                 +
                                 55555555555555555555555555555555555555555555555555



Making chromosomes to map the reads to:
To start, I made eleven 50bp fasta reads named read1 - read11. Each read was made into a "chromosome" by having the read sequence repeat itself to have as many copies as the integer in its name. For example, read1 was kept as is and renamed chr1, read2 was lengthened by repeating it once for a copy number of 2 and renamed chr2, read3 was lengthened by repeating it 2 times and renamed chr3. The randomly generated 50 bp reads throughout the experiments have a range of hamming distances between 27-47 (median = 38). These parameters hold true even when generating thousands of pairwise comparisons this way. Thus, these random reads only ever align to their corresponding chromosome even when mismatches are engineered into them. For example,  read1 should align only to chr1 (once), reads2 should align only to chr2 (twice), read3 should align only to chr3 ('thrice'), etc...

Example "chromosomes":
>chr1
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA

>chr2
ACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAAACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAA

>chr3
GTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGGGTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGGGTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGG

Indexing the fasta file of chromosomes:
bowtie2-build -f chr.fa chr

Randomly adding mutations to reads (or chromosomes):
To randomly mutate a read I just manually changed a base to another base in the read since these were such tiny experiments. Reads with mismatches were given names that told the read they derived from, how many mismatches, and when the base call quality was important (fastq files) what the Q value was for all bases (As described above every base in the read was assigned the same base call quality score).

Example read names:
read1_1mm
read1_1mm_Q20



Mapping reads:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa 
-f used when fasta provided for reads
-a used when I wanted to count all possible alignments for a read
As the above command illustrates, defaults for the minimum alignment score cutoff and for the penalties for mismatches, gap opens and gap extends always used.
Only end to end alignments were performed in these experiments.

Counting number of times a chromosome was mapped to:
In general the following command was used to find all alignments for each read:
   bowtie2 --very-sensitive -N 1 -x chr -U reads.fa -a 
-f used when fasta provided for reads

Since bowtie2 does not have a SAM flag that reports how many times a read aligned, I used various command line tricks to count how many times each read mapped such as:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa -f -a | samtools view -F 4 -S - | cut -f 1 | uniq -c


Extracting alignment scores for the best (AS:i:<N>) and second best (XS:i:<N>) alignments:
bowtie2 --very-sensitive -N 1 -x chr -U reads.fa | grep -v ^@ | cut -f 1,3,5,12,13
-f used when fasta provided for reads
Note that if there is not a second best alignment, then the above command extracts XN:i:<N>



Experiments, Results, and Conclusions:

For each experiment, note that reads only ever aligned to their respective chromosomes (e.g. read1 to chr1, see methods). Moving forward, keep in mind that for bowtie2, AS is alignment score of the best alignment and XS is the alignment score for the second best alignment. In these experiments, AS always equals XS if an XS exists by design. So these are "true multireads" as compared to "maxireads", reads with a single best alignment score, but other mappings with poorer scores. 

Experiment 1 - true multireads with "perfect" base qualities only receive MAPQs of 0 or 1:
I made reads 1-11 and 100 (fasta format) and their corresponding chromosomes chr1-11 and chr 100 with copy numbers of 1-11 and 100 respectively (see methods). I also made 6 more derivative reads with 1-6 mismatches respectively from each read1, read2, read3, and read100. Since all base calls are considered perfect for fasta reads and because of the defaults of bowtie2, the maximum number of mismatches allowed should be 5 to be considered a valid alignment. The read with 6 mismatches should not map. One of the hypotheses that I tested was that bowtie2 MAPQ scores follow the rules in one forum stating 0 means the read mapped 10+ times, 1 means 4-9 times, 2 means 3 times, and 3 means 2 times.

Results:
Read Name  NumAligns    AS             XS         MAPQ (for best position)
read1         1         AS:i:0         *              42
read1_1mm     1         AS:i:-6        *              40
read1_2mm     1         AS:i:-12       *              23
read1_3mm     1         AS:i:-18       *              3
read1_4mm     1         AS:i:-24       *              0
read1_5mm     1         AS:i:-30       *              0
read1_6mm     0         *              *              * 
read2         2         AS:i:0         XS:i:0         1
read2_1mm     2         AS:i:-6        XS:i:-6        1
read2_2mm     2         AS:i:-12       XS:i:-12       0
read2_3mm     2         AS:i:-18       XS:i:-18       0
read2_4mm     2         AS:i:-24       XS:i:-24       0
read2_5mm     2         AS:i:-30       XS:i:-30       0
read2_6mm     0         *              *              * 
read3         3         AS:i:0         XS:i:0         1
read3_1mm     3         AS:i:-6        XS:i:-6        1
read3_2mm     3         AS:i:-12       XS:i:-12       0
read3_3mm     3         AS:i:-18       XS:i:-18       0
read3_4mm     3         AS:i:-24       XS:i:-24       0
read3_5mm     3         AS:i:-30       XS:i:-30       0
read3_6mm     0         *              *              * 
read4         4         AS:i:0         XS:i:0         1
read5         5         AS:i:0         XS:i:0         1
read6         6         AS:i:0         XS:i:0         1
read7         7         AS:i:0         XS:i:0         1
read8         8         AS:i:0         XS:i:0         1
read9         9         AS:i:0         XS:i:0         1
read10        10        AS:i:0         XS:i:0         1
read11        11        AS:i:0         XS:i:0         1
read100       100       AS:i:0         XS:i:0         1
read100_1mm   100       AS:i:-6        XS:i:-6        1
read100_2mm   100       AS:i:-12       XS:i:-12       0
read100_3mm   100       AS:i:-18       XS:i:-18       0
read100_4mm   100       AS:i:-24       XS:i:-24       0
read100_5mm   100       AS:i:-30       XS:i:-30       0 
read100_6mm   0         *              *              * 


All AS and XS were as expected for 0-5 mismatches: 0, -6, -12, -18, -24, and -30 respectively. The MAPQ for read1 was 42 because there was only a single perfect alignment. The MAPQs for read1 with 1-5 mismatches were 40, 23, 3, 0, and 0 respectively. The MAPQ for all other reads with 0 mismatches was 1 regardless of whether it had 2-100 alignments. The MAPQ for all reads with 1 mismatch was 1 regardless of whether it had 2-100 alignments.  The MAPQ for all reads with 2-5 mismatches was 0 regardless of whether it had 2-100 alignments.

Conclusions:
The hypothesis stated above is false. In bowtie2, the best a true multiread (AS=XS) can get is MAPQ=1 regardless of how low or high its multiplicity. This occurs when there are 0 or 1 mismatches over perfect base calls in the read, or when AS=XS goes down to -6. When there are 2-5 mismatches over perfect base calls (or the AS=XS <= -12 ---- i.e. -12 to -30.6), the MAPQ becomes 0. 

If someone wanted to exclude "true multireads" from their data set, using MAPQ >= 2 would work. This would also exclude any uniquely mapping reads with >=4 mismatches over high quality bases. In terms of high quality bases and unireads, MAPQ >= 3 allows up to 3 mismatches, MAPQ >= 23 allows up to 2 mismatches, MAPQ >= 40 allows up to 1 mismatch, and MAPQ >= 42 allows 0 mismatches. There will also be other "maxireads" in most or all of these sets.




Experiment 2 - true multireads with "medium" and "poor" base qualities only receive MAPQs of 0 or 1:
For this mini-experiment, I converted the first 6 fasta format reads above into fastq format reads to be able to have control over base call quality values. Each read was given either all Q=20 or Q=0 base qualities. Note in experiment one all were considered Q=40 since the reads were from a fasta file. Here I wanted to see how poorer base qualities affect MAPQs, for example, of multireads. Moreover, I wanted to see if they had any affect on "perfect matches". Note that with these lower quality bases, more mismatches are allowed to accrue.

Q0 entries in blue to make them easier to find.
Read Name      NumAligns    AS             XS         MAPQ (for best position)
read1_Q20         1         AS:i:0         *              42
read1_1mm_Q20     1         AS:i:-4        *              42
read1_2mm_Q20     1         AS:i:-8        *              40
read1_3mm_Q20     1         AS:i:-12       *              23
read1_4mm_Q20     1         AS:i:-16       *              8
read1_5mm_Q20     1         AS:i:-20       *              3
read1_6mm_Q20     1         AS:i:-24       *              0
read1_1mm_Q0      1         AS:i:-2        *              42
read1_2mm_Q0      1         AS:i:-4        *              42
read1_3mm_Q0      1         AS:i:-6        *              40
read1_4mm_Q0      1         AS:i:-8        *              40
read1_5mm_Q0      1         AS:i:-12       *              23
read1_6mm_Q0      1         AS:i:-12       *              23
read1_7mm_Q0      1         AS:i:-14       *              23
read1_8mm_Q0      1         AS:i:-16       *              8
read1_9mm_Q0      1         AS:i:-18       *              3
read1_10mm_Q0     1         AS:i:-20       *              3
read1_11mm_Q0     1         AS:i:-22       *              0
read2_Q20         2         AS:i:0         XS:i:0         1
read2_1mm_Q20     2         AS:i:-4        XS:i:-4        1
read2_2mm_Q20     2         AS:i:-8        XS:i:-8        1
read2_3mm_Q20     2         AS:i:-12       XS:i:-12       0
read2_3mm_Q0      2         AS:i:-6        XS:i:-6        1
read2_4mm_Q0      2         AS:i:-8        XS:i:-8        1
read2_5mm_Q0      2         AS:i:-10       XS:i:-10       0
read3_Q20         3         AS:i:0         XS:i:0         1
read3_1mm_Q20     3         AS:i:-4        XS:i:-4        1
read3_2mm_Q20     3         AS:i:-8        XS:i:-8        1
read3_3mm_Q20     3         AS:i:-12       XS:i:-12       0
read4_Q20         4         AS:i:0         XS:i:0         1
read4_1mm_Q20     4         AS:i:-4        XS:i:-4        1
read4_2mm_Q20     4         AS:i:-8        XS:i:-8        1
read4_3mm_Q20     4         AS:i:-12       XS:i:-12       0
read5_Q20         5         AS:i:0         XS:i:0         1
read5_2mm_Q20     5         AS:i:-8        XS:i:-8        1
read5_3mm_Q20     5         AS:i:-12       XS:i:-12       0
read6_Q20         6         AS:i:0         XS:i:0         1
read6_2mm_Q20     6         AS:i:-8        XS:i:-8        1
read6_2mm_Q0      6         AS:i:-4        XS:i:-4        1
read6_4mm_Q0      6         AS:i:-8        XS:i:-8        1
read6_5mm_Q0      6         AS:i:-10       XS:i:-10       0
read6_6mm_Q0      6         AS:i:-12       XS:i:-12       0

When read1 (which aligns to a single position) had either 0 or 1 mismatch over a Q20 base, it received a perfect MAPQ of 42. When it had between 1-2 mismatches over Q0 bases, it also still had a perfect MAPQ.  MAPQs remain non-zero for up to 5 Q20 mistmatches, up to 10 Q0 mismatches, and from expt 1, up to 3 mismatches over Q40 bases.  What those 3 cutoffs have in common are alignment scores >= -20 (i.e. between 0 to -20) while the one more mismatch in each case that caused a MAPQ of 0 had alignment scores <= -22 (i.e. -22 to -30.6). Thus, the AS cutoff is somewhere between -20 and -22 for having a non-zero MAPQ. All true multireads (AS=XS), no matter what the copy number, still could only have a MAPQ of 0 or 1. The MAPQ was 1 when the AS was >= -8 (i.e. 0 to -8) and was 0 when the AS was <= -10 (i.e. -10 to -30.6). So, a true multiread (AS=XS) gets a MAPQ of 1if its AS is "good" and a 0 if its "bad" and the cutoff between good and bad is somewhere between -8 and -10. 




An English translation of how bowtie2 assigns MAPQs from the code and experiments

Here I use the experiments above and the actual code to provide a useful table on what select MAPQ cutoffs mean.  The code I looked at was actually from the C code version by Devon Ryan here.

Bowtie2 does not use the number of times a read mapped in the calculation for MAPQ. Instead, it just uses the AS and the XS, which is either >= the minimum score or below it. If AS == XS, the read is considered a true multiread and can only get a score of 0 or 1. If XS is below the minimum score, the read is considered a true uniread under the given scoring scheme. True unireads can get scores of 0, 3, 8, 23, 24, 40 and 42. A score of 0 is reserved for the true unireads with an AS in the bottom 30% of allowable scores. The scores 3, 8, 23, 24, 40, and 42 are unique to true unireads. Therefore, if someone is hell bent on taking only "unireads" with decent alignment scores, the way to do it would be to take only reads with those scores.

awk '$5 == 3 || $5 == 8 || $5 == 23 || $5 == 24 || $5 == 40 || $5 == 42' file.sam

The other possibility is to grep for reads that do not have XS:i::

grep -v XS:i: file.sam


Note that it is important to grep for "XS:i:" and not "XS" which will also take reads with base quality encodings that have "XS". If you use this approach, you will also get the low quality MAPQ=0 true unireads.Though it is possible to extract only "true unireads" that way, you should note that since mismatches and gaps were allowed during these alignments, that this action may be way more conservative than you want to be. It may also be just kind of wrong to do. Read my previous posting about why.

Other maxireads can get scores between 0-39. These are the exact scores that can be assigned: 0-7, 11-12, 14-22, 25-27, 30-39. It would seem that the scores 9, 10, 13, 28, 29, and 41 can not be assigned in bowtie2.  Though maybe someone can prove me wrong by looking at their bt2 outputs.


Recall, "true multireads" are those where AS=XS, "true unireads" are reads that have only a single mapping position in the scoring scheme, and "maxireads" include true unireads and reads where AS > XS. The following are MAPQ cutoffs and how they will affect each of these types of reads.  Note, however, that other "maxireads" will be in each set. These experiments just do not elucidate which ones. This table puts MAPQs in terms of maximum number of mismatches possible over a certain quality base, but combinations of base qualities are more likely. Also remember gaps can occur as well. 

MAPQ >= X   #MM Q40   #MM Q20      #MM Q0    Description
0           5         7            15        All mappable reads
1           3         5            10        True multi w/ "good" AS, maxi of MAPQ >= 1
2           3         5            10        No true multi, maxi of MAPQ >= 2
3           3         5            10        No true multi,  maxi of MAPQ >= 3
8           2         4            8         No true multi, maxi of MAPQ >= 8
23          2         3            7         No true multi, maxi of MAPQ >= 23
30          1         2            4         No true multi, maxi of MAPQ >= 30
39          1         2            4         No true multi, maxi of MAPQ == 39*
40          1         2            4         No true multi, only true uni-reads
42          0         1            2         Only "perfect" true unireads
                       

The next table puts it in terms of AS, XS, and minimum score. Maxireads refer to any read where AS > XS, but for now not when there is no XS. "classic unireads" will refer to when there is no XS. "Multireads" refer to AS=XS true multireads. I will use the term percentile to mean that percentile of all possible alignment scores. For example, AS in the 80th percentile means the AS is >= 80% of all possible alignment scores given the minimum score allowed -- here the column will just say 80th to mean all reads with AS >= 80% of all possible AS.  "dAS" will refer to the absolute value of AS-XS and how this compares to the absolute value of the minimum score. Thus, dAS_80th would mean that the difference between the best alignment score and the 2nd best is equivalent to an AS in the 80th percentile.

##TODO :-P -- read the python code
MAPQ
>= X  multi   maxi       maxi        
0     all     all        all        
1     68th    30th       dAS<=30% & AS 68th OR dAS in 30th*
2     None    30th       dAS<=30% & AS 68th OR dAS in 30th*
3     None    30th       dAS<=30% & AS 68th OR dAS in 30th*
4     None
5     None
6     None
7     None
8     None
11    None
12    None
14    None
15    None
16    None
17    None
18    None
21    None
22    None
23    None
24    None
25    None
26    None
27    None
30    None
31    None
32    None
33    None
34    None
35    None
36    None
37    None
38    None
39    None
40    None
42    None           None          80th   Only high AS classic unireads

*not strictly followed in the case of MAPQ=2, check out the code

Python code for calculating bt2-style MAPQs

Here is my python code translated from Devon Ryan's C code version here. I made some changes to the values (e.g. 0.61 in stead of 0.60) to make the MAPQ calculations 100% consistent with the MAPQ values I obtained in the above experiments with the same AS and XS.


def bt2_mapq_end2end(AS, XS=None, scMin=-30.6):
    '''scMin = minScore'''
    if XS == None:
        XS = scMin-1
    if XS > AS:
        return None
    diff = abs(scMin) 
    bestOver = AS-scMin 
    bestdiff = abs(abs(AS)-abs(XS))
    if XS < scMin: 
        if bestOver >= diff*0.8:
            return 42
        elif bestOver >= diff*0.7:
            return 40
        elif bestOver >= diff*0.61:
            return 24
        elif bestOver >= diff*0.5:
            return 23
        elif bestOver >= diff*0.42:
            return 8
        elif bestOver >= diff*0.3:
            return 3
        else:
            return 0
    else:
        if bestdiff >= diff*0.9:
            if bestOver == diff:
                return 39
            else:
                return 33
        elif bestdiff >= diff*0.8:
            if bestOver == diff:
                return 38
            else:
                return 27
        elif bestdiff >= diff*0.97:
            if bestOver == diff:
                return 37
            else:
                return 26
        elif bestdiff >= diff*0.6:
            if bestOver == diff:
                return 36
            else:
                return 22
        elif bestdiff >= diff*0.5:
            if bestOver == diff:
                return 35
            elif bestOver >= diff*0.84:
                return 25
            elif bestOver >= diff*0.68:
                return 16
            elif bestOver >= diff*0.68:
                return 5
        elif bestdiff >= diff*0.4:
            if bestOver == diff:
                return 34
            elif bestOver >= diff*0.84:
                return 21
            elif bestOver >= diff*0.68:
                return 14
            else:
                return 4
        elif bestdiff >= diff*0.3:
            if bestOver == diff:
                return 32
            elif bestOver >= diff*0.88:
                return 18
            elif bestOver >= diff*0.67:
                return 15
            else:
                return 3
        elif bestdiff >= diff*0.2:
            if bestOver == diff:
                return 31
            elif bestOver >= diff*0.88:
                return 17
            elif bestOver >= diff*0.67:
                return 11
            else:
                return 0
        elif bestdiff >= diff*0.1:
            if bestOver == diff:
                return 30
            elif bestOver >= diff*0.88:
                return 12
            elif bestOver >= diff*0.67:
                return 7
            else:
                return 0
        elif bestdiff > 0:
            if bestOver >= diff*0.67:
                return 6
            else:
                return 2
        else: 
            if bestOver >= diff*0.68: 
                return 1
            else:
                return 0


The End.

1 comment: