Biofinysics: How does Bowtie2 assign MAPQ (in local mode)?

REVIEW:

For a longer discussion on this question, see my aging posts from 2014:

2. Where does bowtie2 assign true multireads (AS=XS)?

3. The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data

Also see my more recent post on MAPQ for paired reads:

How does Bowtie2 assign MAPQ to each mate in paired-end reads?

Finally, don't forget to check out my python code for computing Bowtie2 MAPQ scores:

- https://github.com/JohnUrban/MAPQcalculators

WITHOUT FURTHER ADO:

When I originally wrote about how bowtie2 assigns MAPQ scores years ago, I was thinking about it for experiments with single-end reads that I aligned in end-to-end mode. More recently, I became curious about how MAPQ was computed for paired reads, and how it is computed in local mode. The answer to the first question (on paired MAPQ) is essentially, but perhaps not totally, that bowtie2 sums the alignment scores for the mates and doubles the effective minimum score to calculate a paired MAPQ, as if computing a MAPQ for a read twice as long as the input reads (assuming a constant read length here). There was a little bit of fudging around with XS (the second best alignments) in my calculations that may be bypassed when bowtie2 is calculating the paired MAPQ with potentially more information than it outputs in the SAM file. Specifically, Bowtie2 likely compares the best AS sum to the second best AS sum whereas with what we are provided in the SAM file (at least as far as the AS, XS, and YS flags go) only allows us to compare the best AS sum to an estimation of the second best AS sum. The estimation I've concocted uses the XS values provided in the SAM entries for the individual mates, and when there is no second best (XS) for one of the mates, it uses the AS value for that mate. In a test of 49059 paired-end reads aligned in local mode, my python code gave the same paired MAPQ for all but 27 pairs. That is an error rate of 0.055%, or a success rate of 99.945%.

In this blog post, I will briefly detail the bowtie2 code for computing MAPQ, highlighting the local mode calculations, and showing the python code for it.

The short take-home messages are:

1. The MAPQ calculations for local mode differ from end-to-end mode.

2. The code structure for calculating MAPQ in both modes looks very similar.

3. The main differences is what MAPQ is assigned for the different if/elif rules.

4. One highlighted difference is that, whereas the max MAPQ for end-to-end reads is 42, the max for locally aligned reads is 44.

5. Locally aligned reads meeting similar "rule thresholds" generally get higher MAPQs than their end-to-end aligned counterparts, likely b/c the nature of local mode makes it less likely to cross a given quality threshold.

Bowtie2 code for calculating MAPQ:

The bowtie2 code can be found in the "unique.h" file in the bowtie2 package:

https://github.com/BenLangmead/bowtie2/blob/master/unique.h

The snippet of interest begins around line 168-170 in the current version. Look. for"V2 of the MAPQ calculator". I will paste it at the end of this blog to make life easier. It has the code for both end-to-end and local mode. Below I will focus on local mode, and show the python code for it.

Shared concepts between end-to-end and local mode:

Both modes depend on calculating a minimum score to define a valid alignment. This minimum score is defined by the parameters given to the --score-min option. See my recent post summarizing this step.

Both modes depend on calculating what would be the "perfect score" given the alignment parameters. This is essentially the read length multiplied by what is set for the "match bonus" (--ma option to bowtie2). In end-to-end mode, this defaults to 0. In local mode, it defaults to 2. Thus, by default, 0 is the perfect score for any read length in end-to-end mode, and 2*read_length is the perfect score in local mode. For example, the perfect score for a 50 bp single-end read = 2*50 = 100.

Both modes depend on comparing the best alignment score (AS) with the second best alignment score (XS). If there was no valid second best alignment score, then bowtie2 arbitrarily makes XS equal the minimum score minus one (XS = minscore - 1). Under the alignment score scheme, this is the most conservative thing to assume about the next best, meaning it will help give a worst-case MAPQ.

Both modes calculate the maximum difference between the perfect score and the minimum score.

Both modes track the best alignment score for a read. This is typically AS, but in paired-end mode where mates get assigned to the place with the highest sum of alignment scores for both mates, one of the mates might not be in the location of its highest alignment score. In these cases, XS > AS for that mate, and XS is the best alignment score for that mate.

Both modes track the best difference over the minimum score. This is like above when the maximum difference was found between the perfect score and the minimum score. However, here it is tracking the difference between the best alignment score and the minimum score.

Both keep track of the difference between the best alignment score and the second best. When AS >= XS, this is just the absolute difference between AS and XS. However, in the situations described above for paired-end reads where XS>AS for one of the mates, the difference here ought to be negative. In my code, I have found setting this to 1 when XS>AS (i.e. not AS >= XS) helps reproduce bowtie2 MAPQ results.

The above is encapsulated in the python function below (code here):

def bt2_mapq(AS, XS=None, alnmode=None, scMin=None, scPer=None):

'''

scMin = minScore

scPer = perfect score

'''

## Did the read have a second-best alignment?

if XS == None: ## make it less than scMin

## scMin = score of a just barely valid match

XS = scMin-1

## Difference between the perfect score and the minimum score

diff = max(1, abs(scPer-scMin))

## Best alignment score found

best = max(AS, XS)

## Difference between best alignment score seen and score minumum

bestOver = best - scMin

## Absolute difference between the best and second best alignment scores (usually AS>XS)

if AS >= XS:

bestdiff = abs(abs(AS)-abs(XS))

else:

bestdiff = 1 ## seems like 0 or negative would be better, but so far this has minimized discrepancies

## Was alignment mode in end-to-end or local?

if alnmode == 'endtoend':

return endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)

elif alnmode == 'local':

return local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)

Where the modes depart in the MAPQ score calculations:

Both share a very similar code structure here, but different MAPQs are awarded for meeting different quality thresholds. Most easily noticeable is the max MAPQ is 44 in local mode as compared to 42 in end-to-end mode.

The end-to-end code was featured in my original post on computing MAPQ scores (How does bowtie2 assign MAPQ scores?), but is reproduced below.

Note - I will color in the python code below as above at a later point.

I leave you with the code, which is the best way to see how MAPQs are being assigned. I translated it into Python from the original Bowtie2 code as it is simpler to read (and I was trying to recapitulate the original code anyway).

Be you bioinformatics Padawan or Jedi, I hope I have been able to help you answer a question.

--John

End-to-end mode:

def endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):

#print(AS, XS, scMin, diff, bestOver, bestdiff)

## If does not have second best

if XS < scMin: ## if only single alignment (no XS)

if bestOver >= diff*0.8:

return 42 #42 when

elif bestOver >= diff*0.7:

return 40

elif bestOver >= diff*0.61: ## originally >= 0.60 but this made it agree with my bt2 tests better

return 24

elif bestOver >= diff*0.5:

return 23

elif bestOver >= diff*0.4:

return 8

elif bestOver >= diff*0.3:

return 3

else:

return 0

else: ## it does have second best

if bestdiff >= diff*0.9:

if bestOver == diff:

return 39

else:

return 33

elif bestdiff >= diff*0.8:

if bestOver == diff:

return 38

else:

return 27

elif bestdiff >= diff*0.7:

if bestOver == diff:

return 37

else:

return 26

elif bestdiff >= diff*0.6:

if bestOver == diff:

return 36

else:

return 22

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.84:

return 25

elif bestOver >= diff*0.68:

return 16

else:

#elif bestOver >= diff*0.68:

return 5

elif bestdiff >= diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.84:

return 21

elif bestOver >= diff*0.68:

return 14

else:

return 4

elif bestdiff >= diff*0.3:

if bestOver == diff:

return 32

elif bestOver >= diff*0.88:

return 18

elif bestOver >= diff*0.67:

return 15

else:

return 3

elif bestdiff >= diff*0.2:

if bestOver == diff:

return 31

elif bestOver >= diff*0.88:

return 17

elif bestOver >= diff*0.67:

return 11

else:

return 0

elif bestdiff >= diff*0.1:

if bestOver == diff:

return 30

elif bestOver >= diff*0.88:

return 12

elif bestOver >= diff*0.67:

return 7

else:

return 0

elif bestdiff > 0:

if bestOver >= diff*0.67:

return 6

else:

return 2

else: ## best and 2nd best AS are the same (multireads where AS=XS)

if bestOver >= diff*0.68: ## originally >= 0.67 but this made it agree with my bt2 tests better

return 1

else:

return 0

Local mode:

def local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):

## BELOW - I have changed ">=" in original bowtie2 code to ">" where it gave more results consistent w/ bt2 output

## Future testing may lead me to change it back, but it fixed hundreds of discrepancies in original testing of paired MAPQ calculations

## If does not have second best

if XS < scMin: ## if only single alignment (AS and no XS)

if bestOver > diff*0.8:

return 44 # 44 is best in local mode, but only 42 in endtoend mode...

elif bestOver >= diff*0.7:

return 42

elif bestOver > diff*0.6:

return 41

elif bestOver >= diff*0.5:

return 36

elif bestOver >= diff*0.4:

return 28

elif bestOver >= diff*0.3:

return 24

else:

return 22

else: ## it does have second best

if bestdiff >= diff*0.9:

return 40

elif bestdiff > diff*0.8:

return 39

elif bestdiff >= diff*0.7:

return 38

elif bestdiff > diff*0.6:

return 37

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.5:

return 25

else:

return 20

elif bestdiff > diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.5:

return 21

else:

return 19

elif bestdiff > diff*0.3:

if bestOver == diff:

return 33

elif bestOver >= diff*0.5:

return 18

else:

return 16

elif bestdiff > diff*0.2:

if bestOver == diff:

return 32

elif bestOver >= diff*0.5:

return 17

else:

return 12

elif bestdiff > diff*0.1:

if bestOver == diff:

return 31

elif bestOver >= diff*0.5:

return 14

else:

return 9

elif bestdiff > 0:

if bestOver >= diff*0.5:

return 11

else:

return 2

else: ## best and 2nd best AS are the same (multireads where AS=XS)

if bestOver >= diff*0.5:

return 1

else:

return 0

Bowtie2 code snippet - V2 of MAPQ calculator

==================================

/**

* V2 of the MAPQ calculator

class BowtieMapq2 : public Mapq {

public:

BowtieMapq2(

const SimpleFunc& scoreMin,

const Scoring& sc) :

scoreMin_(scoreMin),

sc_(sc)

{ }

virtual ~BowtieMapq2() { }

/**

* Given an AlnSetSumm, return a mapping quality calculated.

virtual TMapq mapq(

const AlnSetSumm& s,

const AlnFlags& flags,

bool mate1,

size_t rdlen,

size_t ordlen,

char *inps) // put string representation of inputs here

const

{

// Did the read have a second-best alignment?

bool hasSecbest = s.paired() ?

VALID_AL_SCORE(s.bestUnchosenCScore()) :

VALID_AL_SCORE(s.bestUnchosenScore(mate1));

// This corresponds to a scenario where we found one and only one

// alignment but didn't really look for a second one

if(!flags.isPrimary() ||

(!flags.canMax() && !s.exhausted(mate1) && !hasSecbest)) {

return 255;

}

// scPer = score of a perfect match

TAlScore scPer = (TAlScore)sc_.perfectScore(rdlen);

if(s.paired()) {

scPer += (TAlScore)sc_.perfectScore(ordlen);

}

// scMin = score of a just barely valid match

TAlScore scMin = scoreMin_.f<TAlScore>((float)rdlen);

if(s.paired()) {

scMin += scoreMin_.f<TAlScore>((float)ordlen);

}

TAlScore secbest = scMin-1;

TAlScore diff = std::max<TAlScore>(1, scPer - scMin); // scores can vary by up to this much

TMapq ret = 0;

TAlScore best = s.paired() ?

s.bestCScore().score() : s.bestScore(mate1).score();

// best score but normalized so that 0 = worst valid score

TAlScore bestOver = best - scMin;

if(sc_.monotone) {

// End-to-end alignment

if(!hasSecbest) {

if (bestOver >= diff * (double)0.8f) ret = 42;

else if(bestOver >= diff * (double)0.7f) ret = 40;

else if(bestOver >= diff * (double)0.6f) ret = 24;

else if(bestOver >= diff * (double)0.5f) ret = 23;

else if(bestOver >= diff * (double)0.4f) ret = 8;

else if(bestOver >= diff * (double)0.3f) ret = 3;

else ret = 0;

} else {

secbest = s.paired() ?

s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();

TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));

if(bestdiff >= diff * (double)0.9f) {

if(bestOver == diff) {

ret = 39;

} else {

ret = 33;

}

} else if(bestdiff >= diff * (double)0.8f) {

if(bestOver == diff) {

ret = 38;

} else {

ret = 27;

}

} else if(bestdiff >= diff * (double)0.7f) {

if(bestOver == diff) {

ret = 37;

} else {

ret = 26;

}

} else if(bestdiff >= diff * (double)0.6f) {

if(bestOver == diff) {

ret = 36;

} else {

ret = 22;

}

} else if(bestdiff >= diff * (double)0.5f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 35;

} else if(bestOver >= diff * (double)0.84f) {

ret = 25;

} else if(bestOver >= diff * (double)0.68f) {

ret = 16;

} else {

ret = 5;

}

} else if(bestdiff >= diff * (double)0.4f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 34;

} else if(bestOver >= diff * (double)0.84f) {

ret = 21;

} else if(bestOver >= diff * (double)0.68f) {

ret = 14;

} else {

ret = 4;

}

} else if(bestdiff >= diff * (double)0.3f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 32;

} else if(bestOver >= diff * (double)0.88f) {

ret = 18;

} else if(bestOver >= diff * (double)0.67f) {

ret = 15;

} else {

ret = 3;

}

} else if(bestdiff >= diff * (double)0.2f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 31;

} else if(bestOver >= diff * (double)0.88f) {

ret = 17;

} else if(bestOver >= diff * (double)0.67f) {

ret = 11;

} else {

ret = 0;

}

} else if(bestdiff >= diff * (double)0.1f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 30;

} else if(bestOver >= diff * (double)0.88f) {

ret = 12;

} else if(bestOver >= diff * (double)0.67f) {

ret = 7;

} else {

ret = 0;

}

} else if(bestdiff > 0) {

// Top third is still pretty good

if(bestOver >= diff * (double)0.67f) {

ret = 6;

} else {

ret = 2;

}

} else {

assert_eq(bestdiff, 0);

// Top third is still pretty good

if(bestOver >= diff * (double)0.67f) {

ret = 1;

} else {

ret = 0;

}

} else {

// Local alignment

if(!hasSecbest) {

if (bestOver >= diff * (double)0.8f) ret = 44;

else if(bestOver >= diff * (double)0.7f) ret = 42;

else if(bestOver >= diff * (double)0.6f) ret = 41;

else if(bestOver >= diff * (double)0.5f) ret = 36;

else if(bestOver >= diff * (double)0.4f) ret = 28;

else if(bestOver >= diff * (double)0.3f) ret = 24;

else ret = 22;

} else {

secbest = s.paired() ?

s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();

TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));

if (bestdiff >= diff * (double)0.9f) ret = 40;

else if(bestdiff >= diff * (double)0.8f) ret = 39;

else if(bestdiff >= diff * (double)0.7f) ret = 38;

else if(bestdiff >= diff * (double)0.6f) ret = 37;

else if(bestdiff >= diff * (double)0.5f) {

if (bestOver == diff) ret = 35;

else if(bestOver >= diff * (double)0.50f) ret = 25;

else ret = 20;

} else if(bestdiff >= diff * (double)0.4f) {

if (bestOver == diff) ret = 34;

else if(bestOver >= diff * (double)0.50f) ret = 21;

else ret = 19;

} else if(bestdiff >= diff * (double)0.3f) {

if (bestOver == diff) ret = 33;

else if(bestOver >= diff * (double)0.5f) ret = 18;

else ret = 16;

} else if(bestdiff >= diff * (double)0.2f) {

if (bestOver == diff) ret = 32;

else if(bestOver >= diff * (double)0.5f) ret = 17;

else ret = 12;

} else if(bestdiff >= diff * (double)0.1f) {

if (bestOver == diff) ret = 31;

else if(bestOver >= diff * (double)0.5f) ret = 14;

else ret = 9;

} else if(bestdiff > 0) {

if(bestOver >= diff * (double)0.5f) ret = 11;

else ret = 2;

} else {

assert_eq(bestdiff, 0);

if(bestOver >= diff * (double)0.5f) ret = 1;

else ret = 0;

}

// Note: modifications to inps must be synchronized

//if(inps != NULL) {

// inps = itoa10<TAlScore>(best, inps);

// *inps++ = ',';

// inps = itoa10<TAlScore>(secbest, inps);

// *inps++ = ',';

// inps = itoa10<TMapq>(ret, inps);

//}

return ret;

}

protected:

SimpleFunc scoreMin_;

const Scoring& sc_;

};

==================================

1 comment:

christin bibarMay 2, 2025 at 1:10 AM
The way Bowtie2 assigns MAPQ in local mode is quite technical, but it’s fascinating how it reflects alignment confidence. Understanding this helps a lot in interpreting sequencing data more accurately. If you’re into performance-focused machine learning, llama.cppis a powerful tool that runs large language models efficiently even on local devices—super handy for developers and researchers alike!

Sunday, July 18, 2021

How does Bowtie2 assign MAPQ (in local mode)?

1 comment: