Sunday, July 18, 2021

How does Bowtie2 assign MAPQ (in local mode)?

 

REVIEW:
For a longer discussion on this question, see my aging posts from 2014:


Also see my more recent post on MAPQ for paired reads:

Finally, don't forget to check out my python code for computing Bowtie2 MAPQ scores:
- https://github.com/JohnUrban/MAPQcalculators

WITHOUT FURTHER ADO:
When I originally wrote about how bowtie2 assigns MAPQ scores years ago, I was thinking about it for experiments with single-end reads that I aligned in end-to-end mode. More recently, I became curious about how MAPQ was computed for paired reads, and how it is computed in local mode. The answer to the first question (on paired MAPQ) is essentially, but perhaps not totally, that bowtie2 sums the alignment scores for the mates and doubles the effective minimum score to calculate a paired MAPQ, as if computing a MAPQ for a read twice as long as the input reads (assuming a constant read length here). There was a little bit of fudging around with XS (the second best alignments) in my calculations that may be bypassed when bowtie2 is calculating the paired MAPQ with potentially more information than it outputs in the SAM file. Specifically, Bowtie2 likely compares the best AS sum to the second best AS sum whereas with what we are provided in the SAM file (at least as far as the AS, XS, and YS flags go) only allows us to compare the best AS sum to an estimation of the second best AS sum. The estimation I've concocted uses the XS values provided in the SAM entries for the individual mates, and when there is no second best (XS) for one of the mates, it uses the AS value for that mate. In a test of 49059 paired-end reads aligned in local mode, my python code gave the same paired MAPQ for all but 27 pairs. That is an error rate of 0.055%, or a success rate of 99.945%.

In this blog post, I will briefly detail the bowtie2 code for computing MAPQ, highlighting the local mode calculations, and showing the python code for it. 

The short take-home messages are:
1. The MAPQ calculations for local mode differ from end-to-end mode.

2. The code structure for calculating MAPQ in both modes looks very similar.

3. The main differences is what MAPQ is assigned for the different if/elif rules.

4. One highlighted difference is that, whereas the max MAPQ for end-to-end reads is 42, the max for locally aligned reads is 44.

5. Locally aligned reads meeting similar "rule thresholds" generally get higher MAPQs than their end-to-end aligned counterparts, likely b/c the nature of local mode makes it less likely to cross a given quality threshold.



Bowtie2 code for calculating MAPQ:
The bowtie2 code can be found in the "unique.h" file in the bowtie2 package:
https://github.com/BenLangmead/bowtie2/blob/master/unique.h

The snippet of interest begins around line 168-170 in the current version. Look. for"V2 of the MAPQ calculator". I will paste it at the end of this blog to make life easier. It has the code for both end-to-end and local mode. Below I will focus on local mode, and show the python code for it.


Shared concepts between end-to-end and local mode:

Both modes depend on calculating a minimum score to define a valid alignment. This minimum score is defined by the parameters given to the --score-min option. See my recent post summarizing this step.

Both modes depend on calculating what would be the "perfect score" given the alignment parameters. This is essentially the read length multiplied by what is set for the "match bonus" (--ma option to bowtie2). In end-to-end mode, this defaults to 0. In local mode, it defaults to 2. Thus, by default, 0 is the perfect score for any read length in end-to-end mode, and 2*read_length is the perfect score in local mode. For example, the perfect score for a 50 bp single-end read = 2*50 = 100.

Both modes depend on comparing the best alignment score (AS) with the second best alignment score (XS). If there was no valid second best alignment score, then bowtie2 arbitrarily makes XS equal the minimum score minus one (XS = minscore - 1). Under the alignment score scheme, this is the most conservative thing to assume about the next best, meaning it will help give a worst-case MAPQ.

Both modes calculate the maximum difference between the perfect score and the minimum score.

Both modes track the best alignment score for a read. This is typically AS, but in paired-end mode where mates get assigned to the place with the highest sum of alignment scores for both mates, one of the mates might not be in the location of its highest alignment score. In these cases, XS > AS for that mate, and XS is the best alignment score for that mate.

Both modes track the best difference over the minimum score. This is like above when the maximum difference was found between the perfect score and the minimum score. However, here it is tracking the difference between the best alignment score and the minimum score.

Both keep track of the difference between the best alignment score and the second best. When AS >= XS, this is just the absolute difference between AS and XS. However, in the situations described above for paired-end reads where XS>AS for one of the mates, the difference here ought to be negative. In my code, I have found setting this to 1 when XS>AS (i.e. not AS >= XS) helps reproduce bowtie2 MAPQ results.

The above is encapsulated in the python function below (code here):

def bt2_mapq(AS, XS=None, alnmode=None, scMin=None, scPer=None):
    '''
    scMin = minScore
    scPer = perfect score
    '''
    
    ## Did the read have a second-best alignment?
    if XS == None: ## make it less than scMin
        ## scMin = score of a just barely valid match
        XS = scMin-1
 
    ## Difference between the perfect score and the minimum score
    diff = max(1, abs(scPer-scMin)) 

    ## Best alignment score found
    best = max(AS, XS)
    
    ## Difference between best alignment score seen and score minumum
    bestOver = best - scMin 


    ## Absolute difference between the best and second best alignment scores (usually AS>XS)
    if AS >= XS:
        bestdiff = abs(abs(AS)-abs(XS)) 
    else:
        bestdiff = 1 ## seems like 0 or negative would be better, but so far this has minimized discrepancies

    
    ## Was alignment mode in end-to-end or local?
    if alnmode == 'endtoend':
        return endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)
    elif alnmode == 'local':
        return local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)




Where the modes depart in the MAPQ score calculations:
Both share a very similar code structure here, but different MAPQs are awarded for meeting different quality thresholds. Most easily noticeable is the max MAPQ is 44 in local mode as compared to 42 in end-to-end mode. 

The end-to-end code was featured in my original post on computing MAPQ scores (How does bowtie2 assign MAPQ scores?), but is reproduced below. 


Note - I will color in the python code below as above at a later point.

I leave you with the code, which is the best way to see how MAPQs are being assigned. I translated it into Python from the original Bowtie2 code as it is simpler to read (and I was trying to recapitulate the original code anyway).

Be you bioinformatics Padawan or Jedi, I hope I have been able to help you answer a question. 
--John


End-to-end mode:

def endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):
    #print(AS, XS, scMin, diff, bestOver, bestdiff)
    ## If does not have second best
    if XS < scMin: ## if only single alignment (no XS)
        if bestOver >= diff*0.8:
            return 42 #42 when 
        elif bestOver >= diff*0.7:
            return 40
        elif bestOver >= diff*0.61: ## originally >= 0.60 but this made it agree with my bt2 tests better
            return 24
        elif bestOver >= diff*0.5:
            return 23
        elif bestOver >= diff*0.4: 
            return 8
        elif bestOver >= diff*0.3:
            return 3
        else:
            return 0
    else:   ## it does have second best
        if bestdiff >= diff*0.9:
            if bestOver == diff:
                return 39
            else:
                return 33
        elif bestdiff >= diff*0.8:
            if bestOver == diff:
                return 38
            else:
                return 27
        elif bestdiff >= diff*0.7:
            if bestOver == diff:
                return 37
            else:
                return 26
        elif bestdiff >= diff*0.6:
            if bestOver == diff:
                return 36
            else:
                return 22
        elif bestdiff >= diff*0.5:
            if bestOver == diff:
                return 35
            elif bestOver >= diff*0.84:
                return 25
            elif bestOver >= diff*0.68:
                return 16
            else:
            #elif bestOver >= diff*0.68:
                return 5
        elif bestdiff >= diff*0.4:
            if bestOver == diff:
                return 34
            elif bestOver >= diff*0.84:
                return 21
            elif bestOver >= diff*0.68:
                return 14
            else:
                return 4
        elif bestdiff >= diff*0.3:
            if bestOver == diff:
                return 32
            elif bestOver >= diff*0.88:
                return 18
            elif bestOver >= diff*0.67:
                return 15
            else:
                return 3
        elif bestdiff >= diff*0.2:
            if bestOver == diff:
                return 31
            elif bestOver >= diff*0.88:
                return 17
            elif bestOver >= diff*0.67:
                return 11
            else:
                return 0
        elif bestdiff >= diff*0.1:
            if bestOver == diff:
                return 30
            elif bestOver >= diff*0.88:
                return 12
            elif bestOver >= diff*0.67:
                return 7
            else:
                return 0
        elif bestdiff > 0:
            if bestOver >= diff*0.67:
                return 6
            else:
                return 2
        else: ## best and 2nd best AS are the same (multireads where AS=XS)
            if bestOver >= diff*0.68: ## originally >= 0.67 but this made it agree with my bt2 tests better
                return 1
            else:
                return 0








Local mode:

def local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):
    ## BELOW - I have changed ">=" in original bowtie2 code to ">" where it gave more results consistent w/ bt2 output
    ##      Future testing may lead me to change it back, but it fixed hundreds of discrepancies in original testing of paired MAPQ calculations
    ## If does not have second best
    if XS < scMin: ## if only single alignment (AS and no XS)
        if bestOver > diff*0.8:
            return 44 # 44 is best in local mode, but only 42 in endtoend mode...
        elif bestOver >= diff*0.7:
            return 42
        elif bestOver > diff*0.6: 
            return 41
        elif bestOver >= diff*0.5:
            return 36
        elif bestOver >= diff*0.4: 
            return 28
        elif bestOver >= diff*0.3:
            return 24
        else:
            return 22
    else:   ## it does have second best
        if bestdiff >= diff*0.9:
            return 40
        elif bestdiff > diff*0.8:
            return 39
        elif bestdiff >= diff*0.7:
            return 38
        elif bestdiff > diff*0.6: 
            return 37
        elif bestdiff >= diff*0.5:
            if bestOver == diff:
                return 35
            elif bestOver >= diff*0.5:
                return 25
            else:
                return 20
        elif bestdiff > diff*0.4: 
            if bestOver == diff:
                return 34
            elif bestOver >= diff*0.5:
                return 21
            else:
                return 19
        elif bestdiff > diff*0.3:
            if bestOver == diff:
                return 33
            elif bestOver >= diff*0.5:
                return 18
            else:
                return 16
        elif bestdiff > diff*0.2:
            if bestOver == diff:
                return 32
            elif bestOver >= diff*0.5:
                return 17
            else:
                return 12
        elif bestdiff > diff*0.1:
            if bestOver == diff:
                return 31
            elif bestOver >= diff*0.5:
                return 14
            else:
                return 9
        elif bestdiff > 0:
            if bestOver >= diff*0.5:
                return 11
            else:
                return 2
        else: ## best and 2nd best AS are the same (multireads where AS=XS)
            if bestOver >= diff*0.5: 
                return 1
            else:
                return 0







Bowtie2 code snippet - V2 of MAPQ calculator
==================================
/**
 * V2 of the MAPQ calculator
 */
class BowtieMapq2 : public Mapq {

public:

BowtieMapq2(
const SimpleFunc& scoreMin,
const Scoring& sc) :
scoreMin_(scoreMin),
sc_(sc)
{ }

virtual ~BowtieMapq2() { }

/**
* Given an AlnSetSumm, return a mapping quality calculated.
*/
virtual TMapq mapq(
const AlnSetSumm& s,
const AlnFlags&   flags,
bool              mate1,
size_t            rdlen,
size_t            ordlen,
char             *inps)   // put string representation of inputs here
const
{
// Did the read have a second-best alignment?
bool hasSecbest = s.paired() ?
VALID_AL_SCORE(s.bestUnchosenCScore()) :
VALID_AL_SCORE(s.bestUnchosenScore(mate1));
// This corresponds to a scenario where we found one and only one
// alignment but didn't really look for a second one
if(!flags.isPrimary() ||
   (!flags.canMax() && !s.exhausted(mate1) && !hasSecbest)) {
return 255;
}
// scPer = score of a perfect match
TAlScore scPer = (TAlScore)sc_.perfectScore(rdlen);
if(s.paired()) {
scPer += (TAlScore)sc_.perfectScore(ordlen);
}
// scMin = score of a just barely valid match
TAlScore scMin = scoreMin_.f<TAlScore>((float)rdlen);
if(s.paired()) {
scMin += scoreMin_.f<TAlScore>((float)ordlen);
}
TAlScore secbest = scMin-1;
                TAlScore diff = std::max<TAlScore>(1, scPer - scMin); // scores can vary by up to this much
                TMapq ret = 0;
TAlScore best = s.paired() ?
s.bestCScore().score() : s.bestScore(mate1).score();
// best score but normalized so that 0 = worst valid score
TAlScore bestOver = best - scMin;
if(sc_.monotone) {
// End-to-end alignment
if(!hasSecbest) {
if     (bestOver >= diff * (double)0.8f) ret = 42;
else if(bestOver >= diff * (double)0.7f) ret = 40;
else if(bestOver >= diff * (double)0.6f) ret = 24;
else if(bestOver >= diff * (double)0.5f) ret = 23;
else if(bestOver >= diff * (double)0.4f) ret = 8;
else if(bestOver >= diff * (double)0.3f) ret = 3;
else                                     ret = 0;
} else {
secbest = s.paired() ?
s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();
TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));
if(bestdiff >= diff * (double)0.9f) {
if(bestOver == diff) {
ret = 39;
} else {
ret = 33;
}
} else if(bestdiff >= diff * (double)0.8f) {
if(bestOver == diff) {
ret = 38;
} else {
ret = 27;
}
} else if(bestdiff >= diff * (double)0.7f) {
if(bestOver == diff) {
ret = 37;
} else {
ret = 26;
}
} else if(bestdiff >= diff * (double)0.6f) {
if(bestOver == diff) {
ret = 36;
} else {
ret = 22;
}
} else if(bestdiff >= diff * (double)0.5f) {
// Top third is still pretty good
if       (bestOver == diff) {
ret = 35;
} else if(bestOver >= diff * (double)0.84f) {
ret = 25;
} else if(bestOver >= diff * (double)0.68f) {
ret = 16;
} else {
ret = 5;
}
} else if(bestdiff >= diff * (double)0.4f) {
// Top third is still pretty good
if       (bestOver == diff) {
ret = 34;
} else if(bestOver >= diff * (double)0.84f) {
ret = 21;
} else if(bestOver >= diff * (double)0.68f) {
ret = 14;
} else {
ret = 4;
}
} else if(bestdiff >= diff * (double)0.3f) {
// Top third is still pretty good
if       (bestOver == diff) {
ret = 32;
} else if(bestOver >= diff * (double)0.88f) {
ret = 18;
} else if(bestOver >= diff * (double)0.67f) {
ret = 15;
} else {
ret = 3;
}
} else if(bestdiff >= diff * (double)0.2f) {
// Top third is still pretty good
if       (bestOver == diff) {
ret = 31;
} else if(bestOver >= diff * (double)0.88f) {
ret = 17;
} else if(bestOver >= diff * (double)0.67f) {
ret = 11;
} else {
ret = 0;
}
} else if(bestdiff >= diff * (double)0.1f) {
// Top third is still pretty good
if       (bestOver == diff) {
ret = 30;
} else if(bestOver >= diff * (double)0.88f) {
ret = 12;
} else if(bestOver >= diff * (double)0.67f) {
ret = 7;
} else {
ret = 0;
}
} else if(bestdiff > 0) {
// Top third is still pretty good
if(bestOver >= diff * (double)0.67f) {
ret = 6;
} else {
ret = 2;
}
} else {
assert_eq(bestdiff, 0);
// Top third is still pretty good
if(bestOver >= diff * (double)0.67f) {
ret = 1;
} else {
ret = 0;
}
}
}
} else {
// Local alignment
if(!hasSecbest) {
if     (bestOver >= diff * (double)0.8f) ret = 44;
else if(bestOver >= diff * (double)0.7f) ret = 42;
else if(bestOver >= diff * (double)0.6f) ret = 41;
else if(bestOver >= diff * (double)0.5f) ret = 36;
else if(bestOver >= diff * (double)0.4f) ret = 28;
else if(bestOver >= diff * (double)0.3f) ret = 24;
else                                     ret = 22;
} else {
secbest = s.paired() ?
s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();
TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));
if     (bestdiff >= diff * (double)0.9f) ret = 40;
else if(bestdiff >= diff * (double)0.8f) ret = 39;
else if(bestdiff >= diff * (double)0.7f) ret = 38;
else if(bestdiff >= diff * (double)0.6f) ret = 37;
else if(bestdiff >= diff * (double)0.5f) {
if     (bestOver == diff)                 ret = 35;
else if(bestOver >= diff * (double)0.50f) ret = 25;
else                                      ret = 20;
} else if(bestdiff >= diff * (double)0.4f) {
if     (bestOver == diff)                 ret = 34;
else if(bestOver >= diff * (double)0.50f) ret = 21;
else                                      ret = 19;
} else if(bestdiff >= diff * (double)0.3f) {
if     (bestOver == diff)                 ret = 33;
else if(bestOver >= diff * (double)0.5f)  ret = 18;
else                                      ret = 16;
} else if(bestdiff >= diff * (double)0.2f) {
if     (bestOver == diff)                 ret = 32;
else if(bestOver >= diff * (double)0.5f)  ret = 17;
else                                      ret = 12;
} else if(bestdiff >= diff * (double)0.1f) {
if     (bestOver == diff)                 ret = 31;
else if(bestOver >= diff * (double)0.5f)  ret = 14;
else                                      ret = 9;
} else if(bestdiff > 0) {
if(bestOver >= diff * (double)0.5f)       ret = 11;
else                                      ret = 2;
} else {
assert_eq(bestdiff, 0);
if(bestOver >= diff * (double)0.5f)       ret = 1;
else                                      ret = 0;
}
}
}
// Note: modifications to inps must be synchronized
//if(inps != NULL) {
// inps = itoa10<TAlScore>(best, inps);
// *inps++ = ',';
// inps = itoa10<TAlScore>(secbest, inps);
// *inps++ = ',';
// inps = itoa10<TMapq>(ret, inps);
//}
return ret;
}

protected:

SimpleFunc      scoreMin_;
const Scoring&  sc_;
};

==================================
















 

No comments:

Post a Comment