Thursday, July 15, 2021

How does Bowtie2 assign MAPQ to each mate in paired-end reads?

 


WARNING:
This post is actually still in development. It turns out my solution may not be as successful as I thought. It still gets the majority correct, but something like >95% vs the >99% I was reporting before. I will erase this message and update on a future date. 


Review:

For a longer discussion on this question, see my aging posts from 2014:

Also see my more recent post on MAPQ for local mode:

Finally, don't forget to check out my python code for computing Bowtie2 MAPQ scores for end-to-end and local modes, for single- and paired-end reads, for any scoring scheme given to Bowtie2:


Intro:

I am just writing an addendum to "How does bowtie2 assign MAPQ scores?" here, as I couldn't find a simple answer on the internet. That post was mainly concerned about single-end reads. What about pairs? Does each mate get a separate MAPQ score? Are the mates assigned the same score? What score are they given: is it computed as a probability based on both alignments? Is it simply using the min, max, or mean of the MAPQs of the mates?

All I found in the bowtie2 manual was how alignment scores are calculated for paired-end reads in -k and -a modes:
The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. 

If this is what is done in the default alignment mode as well, then it would predict that where the pair aligns is a function of having the highest sum of alignment scores for the mates, and MAPQ would be computed based on alignment score sums, and that both mates would therefore be assigned the same MAPQ.


Question set 1: To be or not to be (the same)?

For now, I will just provide the no nonsense answer to the first two questions:
- Does each mate get a separate MAPQ score? 
- Are the mates assigned the same score?



Answer 1: To be (the same)

Both mates are given the same score.

How did I come to this answer? 
Simple. I aligned a bunch of paired-end reads. Then I counted how many times both mates had the same score. It was 100% of the time.

For some reason, I was under the impression that they were assigned different MAPQs. Perhaps because BEDtools bamToBed -bedpe uses the "minimum MAPQ" between the mates. That made me assume there were different MAPQs assigned to each mate -- and I specifically made that erroneous assumption about bowtie2 in default paired-end alignment mode(s). There would certainly be different MAPQs assigned to mates in non-default strategies for mapping the read pairs (e.g. independently of each other), and possibly from other aligners more generally. But in a standard mapping with bowtie2, pairs are given the same MAPQ.

The implication for samtools view -q filtering for MAPQ means both mates will always be kept or retained, not orphaned.


Question set 2: What is it?

Okay - so both mates get the same MAPQ score... but....
What MAPQ score are the mates given: 
- is it computed as a probability based on both alignments? 
- Is it simply using the min, max, mean, or sum of the MAPQs of the mates?

The bowtie2 manual states for -a/-k modes that the alignment score (AS) for the pair is the sum of the AS's from each mate in the pair. 

In default mode, I see 4 SAM fields that matter:
1. MAPQ (field 5) - the MAPQ score given to the pair or read.
2. AS (typically field 12) - the Alignment Score for the mate/read described in that line.
3. XS (variable field number, and isn't always present) - the next best alignment score for the mate/read in question, if there were any other "valid" alignment scores (over the --score-min)
4. YS (variable field number) - the AS of the mate.

The following AWK line will extract the readnames and those 4 fields for you:
awk 'OFS="\t" {match($0, /AS:i:[0-9]*/, AS); match($0, /YS:i:[0-9]*/, YS); match($0, /XS:i:[0-9]*/, XS); if (XS[0]==""){XS[0]="XS:i:-1"}; print $1,$5,AS[0], YS[0], XS[0]}'  reads.sam

Or pipe the SAM fields from a BAM file:
samtools view reads.bam | awk 'OFS="\t" {match($0, /AS:i:[0-9]*/, AS); match($0, /YS:i:[0-9]*/, YS); match($0, /XS:i:[0-9]*/, XS); if (XS[0]==""){XS[0]="XS:i:-1"}; print $1,$5,AS[0], YS[0], XS[0]}' 

- I have a script that collects all MAPQ, AS, YS, and XS from both mates to put on single line.
- This allowed me to figure out how to compute the "pair MAPQ".

Answer 2: MAPQ for the pair is computed as a function of the sum of alignment scores

Both mates in a pair get the same MAPQ. How? Essentially, but perhaps not totally, bowtie2 sums the alignment scores for the mates and doubles the effective minimum score to calculate a paired MAPQ. It is similar to computing the MAPQ for a read twice as long as the input reads (assuming a constant read length here). There was a little bit of fudging around with XS (the second best alignments) in my calculations that may be bypassed when bowtie2 is calculating the paired MAPQ with potentially more information than it outputs in the SAM file. Specifically, Bowtie2 likely compares the best AS sum to the second best AS sum whereas with what we are provided in the SAM file (at least as far as the AS, XS, and YS flags go) only allows us to compare the best AS sum to an estimation of the second best AS sum. The estimation I've concocted uses the XS values provided in the SAM entries for the individual mates, and when there is no second best (XS) for one of the mates, it uses the AS value for that mate. In a test of 49059 paired-end reads aligned in local mode, my python code gave the same paired MAPQ for all but 27 pairs. That is an error rate of 0.055%, or a success rate of 99.945%.

See my explanation using a code snippet below.


Code snippet (from here) - Approach #1:

To calculate the pair MAPQ, first sum the alignment scores.

ASJ = AS + AS2

Then determine a suitable second best alignment score sum. What I do may not be what Bowtie2 does, as it likely uses the alignment score sums from specific paired-end alignments found. I try to estimate that with the following code and the AS/XS/YS flags given in the SAM file. It is correct >= 99.945% of the time. (((On later code tested, this was only >95% correct.)))

 if XS is not None and XS2 is None:
     XSJ = XS + AS2
 elif XS is None and XS2 is not None:
     XSJ = AS + XS2
 elif XS is None and XS2 is None:
     XSJ = (scoremin-1) * 2 
 else: ## neither none
     XSJ = XS + XS2     

After calculating ASJ and XSJ, I also found that ensuring ASJ>=XSJ helped improve consistency with BT2 MAPQ, but just a tiny bit. Often it offers no extra benefit, possibly because XSJ>ASJ is a rare condition.

    if XSJ > ASJ:
        XSJ = ASJ

Then multiply both the "minimum score" and the "perfect score" by 2. Call that scMinJ and scPerJ. For more information on the "minimum score" and the "perfect score", see the section titled, "Shared concepts between end-to-end and local mode", in the accompanying post on computing MAPQ in end-to-end vs local mode.

Lastly, use the new "combined" or "joint" values (ASJ, XSJ, scMinJ, and scPerJ) in the normal MAPQ calculator for the mode it was aligned in.


NOTE: I am currently improving this further. One condition I found to always underestimate MAPQ in is when 1 mate has a perfect AS (or really good) and no second best (no XS). That essentially means the place the mate with uncertainty is mapped nearby would have less uncertainty, i.e. a better "posterior MAPQ" than was calculated for it independently. Anyway -- in areas of discrepancy, the pair MAPQ given by Bowtie2 is always higher than mine. To be continued.....


Conclusions:

1. Mates are given the same MAPQ in standard bowtie2 mapping.

2. MAPQ for the pairs is calculated in the same way as it is done for single-end reads (in either alignment mode) but using the alignment score sum of the mates in the best paired alignment and the second best paired alignment, and the minimum alignment and perfect alignment scores multiplied by 2.



1 comment:

  1. This article dives deep into the mechanics of Bowtie2 and MAPQ assignment—truly fascinating! Just as flipperzerounleashed enhances your strategies for success, understanding these details can significantly improve your data analysis in genomic research!

    ReplyDelete