Biofinysics: 2021

Sunday, July 18, 2021

How does Bowtie2 calculate a minimum alignment score?

The Bowtie2 manual does a fine job at explaining what the minimum score is and how it is computed. So consider this blog post a supplement to the manual, if you wish to be generous. Otherwise, denounce it as redundant.

For relevant parts of the manual see the sections on (i) scoring, and particularly the sub-section therein on valid alignments; (ii) the section on setting function options ; and (iii) the descriptions of scoring options in the options section.

It may also be useful to review the following subsections in my aging blog on how Bowtie2 computes MAPQ scores (for single-end reads in end-to-end mode): (i) how bowtie2 scores a mismatch, and (ii) how bowtie2 decides if an alignment is valid. For more information on computing MAPQ scores, see my more recent posts on paired-end reads and local mode.

The short and simple is this:
Bowtie2 calculates a minimum score to define a valid alignment. This minimum score is defined by the 3 comma-separated parameters given to the --score-min option.

The rest is this:
Below I discuss those 3 comma-separated parameters in more detail, and give examples on (i) the defaults in end-to-end and local modes, (ii) defining a function where the minimum score is purely dependent on read length, (iii) defining a function where the minimum score is independent of read length.

--score-min Parameter Description:

- --score-min F,B,M :: F=transform function (C=constant, G=log, L=linear, S=sqrt); B=b in mx+b; M=m in mx+b

- computed min_score(read_length) = B + M*F(read_length)

- Minimum score = absolute minimum + length-dependent addition

- If the "absolute minimum" is set to 0, then it is purely dependent on read length.

- Otherwise, it requires first meeting a constant minimum like BWA, followed by becoming more stringent with read length.

DEFAULT IN END-TO-END MODE:

- Default for end-to-end mode = L,-0.6,-0.6

- L,-0.6,-0.6(read_length) = -0.6 + -0.6*read_length

- Read_length -> MinScore

- 20 -> -12.6

- 30 -> -18.6

- 50 -> -30.6

- 75 -> -45.6

- 100 -> -60.6

- 150 -> -90.6

DEFAULT IN LOCAL MODE:

- Default for local mode = G,20,8

- G,20,8(read_length) = 20 + 8*log(read_length)

- Read_length -> MinScore

- 20 -> 43.9659

- 30 -> 47.2096

- 50 -> 51.2962

- 75 -> 54.5399

- 100 -> 56.8414

- 150 -> 60.0851

DEFINING A FUNCTION PURELY DEPENDENT ON READ LENGTH:

Purely dependent on read length:

- End-to-end mode - similar to default without absolute minimum:

- L,0,-0.6

- Local mode - similar to default without absolute minimum:

- G,0,8

DEFINING A FUNCTION INDEPENDENT ON READ LENGTH:

- Only requires passing absolute minimum / independent of read length.

- This is similar to the default in another popular short read aligner, BWA, where the alignment score is calculated independent of read length, and the default minimum score is 30 (-T 30).

- If you wanted Bowtie2 to have a similar approach, the following codes work by analogy (Bowtie2 uses +2 for matches by default, BWA +1):

- -T 20 ==> C,40,0

- C,40,0(read_length) = 40 + 0*read_length = 40

- -T 30 ==> C,60,0

- C,60,0(read_length) = 60 + 0*read_length = 60

- ETC

- If you wanted it even more similar, the following should do the trick in theory, but differences are found anyway:

- -T 20 ==> --ma 1 --mp 4,4 --rdg 6,1 --rfg 6,1 --np 4 --score-min C,20,0

- -T 30 ==> "--ma 1 --mp 4,4 --rdg 6,1 --rfg 6,1 --np 4 --score-min C,30,0

- ETC

THE END:

Well, that is all folks. Best wishes to you.

How does Bowtie2 assign MAPQ (in local mode)?

by JohnUrbanGenome

REVIEW:

For a longer discussion on this question, see my aging posts from 2014:

1. How does bowtie2 assign MAPQ scores?

2. Where does bowtie2 assign true multireads (AS=XS)?

3. The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data

Also see my more recent post on MAPQ for paired reads:

How does Bowtie2 assign MAPQ to each mate in paired-end reads?

Finally, don't forget to check out my python code for computing Bowtie2 MAPQ scores:

- https://github.com/JohnUrban/MAPQcalculators

WITHOUT FURTHER ADO:

When I originally wrote about how bowtie2 assigns MAPQ scores years ago, I was thinking about it for experiments with single-end reads that I aligned in end-to-end mode. More recently, I became curious about how MAPQ was computed for paired reads, and how it is computed in local mode. The answer to the first question (on paired MAPQ) is essentially, but perhaps not totally, that bowtie2 sums the alignment scores for the mates and doubles the effective minimum score to calculate a paired MAPQ, as if computing a MAPQ for a read twice as long as the input reads (assuming a constant read length here). There was a little bit of fudging around with XS (the second best alignments) in my calculations that may be bypassed when bowtie2 is calculating the paired MAPQ with potentially more information than it outputs in the SAM file. Specifically, Bowtie2 likely compares the best AS sum to the second best AS sum whereas with what we are provided in the SAM file (at least as far as the AS, XS, and YS flags go) only allows us to compare the best AS sum to an estimation of the second best AS sum. The estimation I've concocted uses the XS values provided in the SAM entries for the individual mates, and when there is no second best (XS) for one of the mates, it uses the AS value for that mate. In a test of 49059 paired-end reads aligned in local mode, my python code gave the same paired MAPQ for all but 27 pairs. That is an error rate of 0.055%, or a success rate of 99.945%.

In this blog post, I will briefly detail the bowtie2 code for computing MAPQ, highlighting the local mode calculations, and showing the python code for it.

The short take-home messages are:

1. The MAPQ calculations for local mode differ from end-to-end mode.

2. The code structure for calculating MAPQ in both modes looks very similar.

3. The main differences is what MAPQ is assigned for the different if/elif rules.

4. One highlighted difference is that, whereas the max MAPQ for end-to-end reads is 42, the max for locally aligned reads is 44.

5. Locally aligned reads meeting similar "rule thresholds" generally get higher MAPQs than their end-to-end aligned counterparts, likely b/c the nature of local mode makes it less likely to cross a given quality threshold.

Bowtie2 code for calculating MAPQ:

The bowtie2 code can be found in the "unique.h" file in the bowtie2 package:

https://github.com/BenLangmead/bowtie2/blob/master/unique.h

The snippet of interest begins around line 168-170 in the current version. Look. for"V2 of the MAPQ calculator". I will paste it at the end of this blog to make life easier. It has the code for both end-to-end and local mode. Below I will focus on local mode, and show the python code for it.

Shared concepts between end-to-end and local mode:

Both modes depend on calculating a minimum score to define a valid alignment. This minimum score is defined by the parameters given to the --score-min option. See my recent post summarizing this step.

Both modes depend on calculating what would be the "perfect score" given the alignment parameters. This is essentially the read length multiplied by what is set for the "match bonus" (--ma option to bowtie2). In end-to-end mode, this defaults to 0. In local mode, it defaults to 2. Thus, by default, 0 is the perfect score for any read length in end-to-end mode, and 2*read_length is the perfect score in local mode. For example, the perfect score for a 50 bp single-end read = 2*50 = 100.

Both modes depend on comparing the best alignment score (AS) with the second best alignment score (XS). If there was no valid second best alignment score, then bowtie2 arbitrarily makes XS equal the minimum score minus one (XS = minscore - 1). Under the alignment score scheme, this is the most conservative thing to assume about the next best, meaning it will help give a worst-case MAPQ.

Both modes calculate the maximum difference between the perfect score and the minimum score.

Both modes track the best alignment score for a read. This is typically AS, but in paired-end mode where mates get assigned to the place with the highest sum of alignment scores for both mates, one of the mates might not be in the location of its highest alignment score. In these cases, XS > AS for that mate, and XS is the best alignment score for that mate.

Both modes track the best difference over the minimum score. This is like above when the maximum difference was found between the perfect score and the minimum score. However, here it is tracking the difference between the best alignment score and the minimum score.

Both keep track of the difference between the best alignment score and the second best. When AS >= XS, this is just the absolute difference between AS and XS. However, in the situations described above for paired-end reads where XS>AS for one of the mates, the difference here ought to be negative. In my code, I have found setting this to 1 when XS>AS (i.e. not AS >= XS) helps reproduce bowtie2 MAPQ results.

The above is encapsulated in the python function below (code here):

def bt2_mapq(AS, XS=None, alnmode=None, scMin=None, scPer=None):

'''

scMin = minScore

scPer = perfect score

'''

## Did the read have a second-best alignment?

if XS == None: ## make it less than scMin

## scMin = score of a just barely valid match

XS = scMin-1

## Difference between the perfect score and the minimum score

diff = max(1, abs(scPer-scMin))

## Best alignment score found

best = max(AS, XS)

## Difference between best alignment score seen and score minumum

bestOver = best - scMin

## Absolute difference between the best and second best alignment scores (usually AS>XS)

if AS >= XS:

bestdiff = abs(abs(AS)-abs(XS))

else:

bestdiff = 1 ## seems like 0 or negative would be better, but so far this has minimized discrepancies

## Was alignment mode in end-to-end or local?

if alnmode == 'endtoend':

return endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)

elif alnmode == 'local':

return local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)

Where the modes depart in the MAPQ score calculations:

Both share a very similar code structure here, but different MAPQs are awarded for meeting different quality thresholds. Most easily noticeable is the max MAPQ is 44 in local mode as compared to 42 in end-to-end mode.

The end-to-end code was featured in my original post on computing MAPQ scores (How does bowtie2 assign MAPQ scores?), but is reproduced below.

Note - I will color in the python code below as above at a later point.

I leave you with the code, which is the best way to see how MAPQs are being assigned. I translated it into Python from the original Bowtie2 code as it is simpler to read (and I was trying to recapitulate the original code anyway).

Be you bioinformatics Padawan or Jedi, I hope I have been able to help you answer a question.

--John

End-to-end mode:

def endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):

#print(AS, XS, scMin, diff, bestOver, bestdiff)

## If does not have second best

if XS < scMin: ## if only single alignment (no XS)

if bestOver >= diff*0.8:

return 42 #42 when

elif bestOver >= diff*0.7:

return 40

elif bestOver >= diff*0.61: ## originally >= 0.60 but this made it agree with my bt2 tests better

return 24

elif bestOver >= diff*0.5:

return 23

elif bestOver >= diff*0.4:

return 8

elif bestOver >= diff*0.3:

return 3

else:

return 0

else: ## it does have second best

if bestdiff >= diff*0.9:

if bestOver == diff:

return 39

else:

return 33

elif bestdiff >= diff*0.8:

if bestOver == diff:

return 38

else:

return 27

elif bestdiff >= diff*0.7:

if bestOver == diff:

return 37

else:

return 26

elif bestdiff >= diff*0.6:

if bestOver == diff:

return 36

else:

return 22

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.84:

return 25

elif bestOver >= diff*0.68:

return 16

else:

#elif bestOver >= diff*0.68:

return 5

elif bestdiff >= diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.84:

return 21

elif bestOver >= diff*0.68:

return 14

else:

return 4

elif bestdiff >= diff*0.3:

if bestOver == diff:

return 32

elif bestOver >= diff*0.88:

return 18

elif bestOver >= diff*0.67:

return 15

else:

return 3

elif bestdiff >= diff*0.2:

if bestOver == diff:

return 31

elif bestOver >= diff*0.88:

return 17

elif bestOver >= diff*0.67:

return 11

else:

return 0

elif bestdiff >= diff*0.1:

if bestOver == diff:

return 30

elif bestOver >= diff*0.88:

return 12

elif bestOver >= diff*0.67:

return 7

else:

return 0

elif bestdiff > 0:

if bestOver >= diff*0.67:

return 6

else:

return 2

else: ## best and 2nd best AS are the same (multireads where AS=XS)

if bestOver >= diff*0.68: ## originally >= 0.67 but this made it agree with my bt2 tests better

return 1

else:

return 0

Local mode:

def local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):

## BELOW - I have changed ">=" in original bowtie2 code to ">" where it gave more results consistent w/ bt2 output

## Future testing may lead me to change it back, but it fixed hundreds of discrepancies in original testing of paired MAPQ calculations

## If does not have second best

if XS < scMin: ## if only single alignment (AS and no XS)

if bestOver > diff*0.8:

return 44 # 44 is best in local mode, but only 42 in endtoend mode...

elif bestOver >= diff*0.7:

return 42

elif bestOver > diff*0.6:

return 41

elif bestOver >= diff*0.5:

return 36

elif bestOver >= diff*0.4:

return 28

elif bestOver >= diff*0.3:

return 24

else:

return 22

else: ## it does have second best

if bestdiff >= diff*0.9:

return 40

elif bestdiff > diff*0.8:

return 39

elif bestdiff >= diff*0.7:

return 38

elif bestdiff > diff*0.6:

return 37

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.5:

return 25

else:

return 20

elif bestdiff > diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.5:

return 21

else:

return 19

elif bestdiff > diff*0.3:

if bestOver == diff:

return 33

elif bestOver >= diff*0.5:

return 18

else:

return 16

elif bestdiff > diff*0.2:

if bestOver == diff:

return 32

elif bestOver >= diff*0.5:

return 17

else:

return 12

elif bestdiff > diff*0.1:

if bestOver == diff:

return 31

elif bestOver >= diff*0.5:

return 14

else:

return 9

elif bestdiff > 0:

if bestOver >= diff*0.5:

return 11

else:

return 2

else: ## best and 2nd best AS are the same (multireads where AS=XS)

if bestOver >= diff*0.5:

return 1

else:

return 0

Bowtie2 code snippet - V2 of MAPQ calculator

==================================

/**

* V2 of the MAPQ calculator

class BowtieMapq2 : public Mapq {

public:

BowtieMapq2(

const SimpleFunc& scoreMin,

const Scoring& sc) :

scoreMin_(scoreMin),

sc_(sc)

{ }

virtual ~BowtieMapq2() { }

/**

* Given an AlnSetSumm, return a mapping quality calculated.

virtual TMapq mapq(

const AlnSetSumm& s,

const AlnFlags& flags,

bool mate1,

size_t rdlen,

size_t ordlen,

char *inps) // put string representation of inputs here

const

{

// Did the read have a second-best alignment?

bool hasSecbest = s.paired() ?

VALID_AL_SCORE(s.bestUnchosenCScore()) :

VALID_AL_SCORE(s.bestUnchosenScore(mate1));

// This corresponds to a scenario where we found one and only one

// alignment but didn't really look for a second one

if(!flags.isPrimary() ||

(!flags.canMax() && !s.exhausted(mate1) && !hasSecbest)) {

return 255;

}

// scPer = score of a perfect match

TAlScore scPer = (TAlScore)sc_.perfectScore(rdlen);

if(s.paired()) {

scPer += (TAlScore)sc_.perfectScore(ordlen);

}

// scMin = score of a just barely valid match

TAlScore scMin = scoreMin_.f<TAlScore>((float)rdlen);

if(s.paired()) {

scMin += scoreMin_.f<TAlScore>((float)ordlen);

}

TAlScore secbest = scMin-1;

TAlScore diff = std::max<TAlScore>(1, scPer - scMin); // scores can vary by up to this much

TMapq ret = 0;

TAlScore best = s.paired() ?

s.bestCScore().score() : s.bestScore(mate1).score();

// best score but normalized so that 0 = worst valid score

TAlScore bestOver = best - scMin;

if(sc_.monotone) {

// End-to-end alignment

if(!hasSecbest) {

if (bestOver >= diff * (double)0.8f) ret = 42;

else if(bestOver >= diff * (double)0.7f) ret = 40;

else if(bestOver >= diff * (double)0.6f) ret = 24;

else if(bestOver >= diff * (double)0.5f) ret = 23;

else if(bestOver >= diff * (double)0.4f) ret = 8;

else if(bestOver >= diff * (double)0.3f) ret = 3;

else ret = 0;

} else {

secbest = s.paired() ?

s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();

TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));

if(bestdiff >= diff * (double)0.9f) {

if(bestOver == diff) {

ret = 39;

} else {

ret = 33;

}

} else if(bestdiff >= diff * (double)0.8f) {

if(bestOver == diff) {

ret = 38;

} else {

ret = 27;

}

} else if(bestdiff >= diff * (double)0.7f) {

if(bestOver == diff) {

ret = 37;

} else {

ret = 26;

}

} else if(bestdiff >= diff * (double)0.6f) {

if(bestOver == diff) {

ret = 36;

} else {

ret = 22;

}

} else if(bestdiff >= diff * (double)0.5f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 35;

} else if(bestOver >= diff * (double)0.84f) {

ret = 25;

} else if(bestOver >= diff * (double)0.68f) {

ret = 16;

} else {

ret = 5;

}

} else if(bestdiff >= diff * (double)0.4f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 34;

} else if(bestOver >= diff * (double)0.84f) {

ret = 21;

} else if(bestOver >= diff * (double)0.68f) {

ret = 14;

} else {

ret = 4;

}

} else if(bestdiff >= diff * (double)0.3f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 32;

} else if(bestOver >= diff * (double)0.88f) {

ret = 18;

} else if(bestOver >= diff * (double)0.67f) {

ret = 15;

} else {

ret = 3;

}

} else if(bestdiff >= diff * (double)0.2f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 31;

} else if(bestOver >= diff * (double)0.88f) {

ret = 17;

} else if(bestOver >= diff * (double)0.67f) {

ret = 11;

} else {

ret = 0;

}

} else if(bestdiff >= diff * (double)0.1f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 30;

} else if(bestOver >= diff * (double)0.88f) {

ret = 12;

} else if(bestOver >= diff * (double)0.67f) {

ret = 7;

} else {

ret = 0;

}

} else if(bestdiff > 0) {

// Top third is still pretty good

if(bestOver >= diff * (double)0.67f) {

ret = 6;

} else {

ret = 2;

}

} else {

assert_eq(bestdiff, 0);

// Top third is still pretty good

if(bestOver >= diff * (double)0.67f) {

ret = 1;

} else {

ret = 0;

}

} else {

// Local alignment

if(!hasSecbest) {

if (bestOver >= diff * (double)0.8f) ret = 44;

else if(bestOver >= diff * (double)0.7f) ret = 42;

else if(bestOver >= diff * (double)0.6f) ret = 41;

else if(bestOver >= diff * (double)0.5f) ret = 36;

else if(bestOver >= diff * (double)0.4f) ret = 28;

else if(bestOver >= diff * (double)0.3f) ret = 24;

else ret = 22;

} else {

secbest = s.paired() ?

s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();

TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));

if (bestdiff >= diff * (double)0.9f) ret = 40;

else if(bestdiff >= diff * (double)0.8f) ret = 39;

else if(bestdiff >= diff * (double)0.7f) ret = 38;

else if(bestdiff >= diff * (double)0.6f) ret = 37;

else if(bestdiff >= diff * (double)0.5f) {

if (bestOver == diff) ret = 35;

else if(bestOver >= diff * (double)0.50f) ret = 25;

else ret = 20;

} else if(bestdiff >= diff * (double)0.4f) {

if (bestOver == diff) ret = 34;

else if(bestOver >= diff * (double)0.50f) ret = 21;

else ret = 19;

} else if(bestdiff >= diff * (double)0.3f) {

if (bestOver == diff) ret = 33;

else if(bestOver >= diff * (double)0.5f) ret = 18;

else ret = 16;

} else if(bestdiff >= diff * (double)0.2f) {

if (bestOver == diff) ret = 32;

else if(bestOver >= diff * (double)0.5f) ret = 17;

else ret = 12;

} else if(bestdiff >= diff * (double)0.1f) {

if (bestOver == diff) ret = 31;

else if(bestOver >= diff * (double)0.5f) ret = 14;

else ret = 9;

} else if(bestdiff > 0) {

if(bestOver >= diff * (double)0.5f) ret = 11;

else ret = 2;

} else {

assert_eq(bestdiff, 0);

if(bestOver >= diff * (double)0.5f) ret = 1;

else ret = 0;

}

// Note: modifications to inps must be synchronized

//if(inps != NULL) {

// inps = itoa10<TAlScore>(best, inps);

// *inps++ = ',';

// inps = itoa10<TAlScore>(secbest, inps);

// *inps++ = ',';

// inps = itoa10<TMapq>(ret, inps);

//}

return ret;

}

protected:

SimpleFunc scoreMin_;

const Scoring& sc_;

};

==================================

Thursday, July 15, 2021

How does Bowtie2 assign MAPQ to each mate in paired-end reads?

by JohnUrbanGenome

WARNING:

This post is actually still in development. It turns out my solution may not be as successful as I thought. It still gets the majority correct, but something like >95% vs the >99% I was reporting before. I will erase this message and update on a future date.

Review:

For a longer discussion on this question, see my aging posts from 2014:

1. How does bowtie2 assign MAPQ scores?

2. Where does bowtie2 assign true multireads (AS=XS)?

3. The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data

Also see my more recent post on MAPQ for local mode:

How does Bowtie2 assign MAPQ (in local mode)?

Finally, don't forget to check out my python code for computing Bowtie2 MAPQ scores for end-to-end and local modes, for single- and paired-end reads, for any scoring scheme given to Bowtie2:

- https://github.com/JohnUrban/MAPQcalculators

Intro:

I am just writing an addendum to "How does bowtie2 assign MAPQ scores?" here, as I couldn't find a simple answer on the internet. That post was mainly concerned about single-end reads. What about pairs? Does each mate get a separate MAPQ score? Are the mates assigned the same score? What score are they given: is it computed as a probability based on both alignments? Is it simply using the min, max, or mean of the MAPQs of the mates?

All I found in the bowtie2 manual was how alignment scores are calculated for paired-end reads in -k and -a modes:

The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates.

If this is what is done in the default alignment mode as well, then it would predict that where the pair aligns is a function of having the highest sum of alignment scores for the mates, and MAPQ would be computed based on alignment score sums, and that both mates would therefore be assigned the same MAPQ.

Question set 1: To be or not to be (the same)?

For now, I will just provide the no nonsense answer to the first two questions:

- Does each mate get a separate MAPQ score?

- Are the mates assigned the same score?

Answer 1: To be (the same)

Both mates are given the same score.

How did I come to this answer?

Simple. I aligned a bunch of paired-end reads. Then I counted how many times both mates had the same score. It was 100% of the time.

For some reason, I was under the impression that they were assigned different MAPQs. Perhaps because BEDtools bamToBed -bedpe uses the "minimum MAPQ" between the mates. That made me assume there were different MAPQs assigned to each mate -- and I specifically made that erroneous assumption about bowtie2 in default paired-end alignment mode(s). There would certainly be different MAPQs assigned to mates in non-default strategies for mapping the read pairs (e.g. independently of each other), and possibly from other aligners more generally. But in a standard mapping with bowtie2, pairs are given the same MAPQ.

The implication for samtools view -q filtering for MAPQ means both mates will always be kept or retained, not orphaned.

Question set 2: What is it?

Okay - so both mates get the same MAPQ score... but....

What MAPQ score are the mates given:

- is it computed as a probability based on both alignments?

- Is it simply using the min, max, mean, or sum of the MAPQs of the mates?

The bowtie2 manual states for -a/-k modes that the alignment score (AS) for the pair is the sum of the AS's from each mate in the pair.

In default mode, I see 4 SAM fields that matter:

1. MAPQ (field 5) - the MAPQ score given to the pair or read.

2. AS (typically field 12) - the Alignment Score for the mate/read described in that line.

3. XS (variable field number, and isn't always present) - the next best alignment score for the mate/read in question, if there were any other "valid" alignment scores (over the --score-min)

4. YS (variable field number) - the AS of the mate.

The following AWK line will extract the readnames and those 4 fields for you:

awk 'OFS="\t" {match($0, /AS:i:[0-9]*/, AS); match($0, /YS:i:[0-9]*/, YS); match($0, /XS:i:[0-9]*/, XS); if (XS[0]==""){XS[0]="XS:i:-1"}; print $1,$5,AS[0], YS[0], XS[0]}' reads.sam

Or pipe the SAM fields from a BAM file:

samtools view reads.bam | awk 'OFS="\t" {match($0, /AS:i:[0-9]*/, AS); match($0, /YS:i:[0-9]*/, YS); match($0, /XS:i:[0-9]*/, XS); if (XS[0]==""){XS[0]="XS:i:-1"}; print $1,$5,AS[0], YS[0], XS[0]}'

- I have a script that collects all MAPQ, AS, YS, and XS from both mates to put on single line.

- This allowed me to figure out how to compute the "pair MAPQ".

Answer 2: MAPQ for the pair is computed as a function of the sum of alignment scores

Both mates in a pair get the same MAPQ. How? Essentially, but perhaps not totally, bowtie2 sums the alignment scores for the mates and doubles the effective minimum score to calculate a paired MAPQ. It is similar to computing the MAPQ for a read twice as long as the input reads (assuming a constant read length here). There was a little bit of fudging around with XS (the second best alignments) in my calculations that may be bypassed when bowtie2 is calculating the paired MAPQ with potentially more information than it outputs in the SAM file. Specifically, Bowtie2 likely compares the best AS sum to the second best AS sum whereas with what we are provided in the SAM file (at least as far as the AS, XS, and YS flags go) only allows us to compare the best AS sum to an estimation of the second best AS sum. The estimation I've concocted uses the XS values provided in the SAM entries for the individual mates, and when there is no second best (XS) for one of the mates, it uses the AS value for that mate. In a test of 49059 paired-end reads aligned in local mode, my python code gave the same paired MAPQ for all but 27 pairs. That is an error rate of 0.055%, or a success rate of 99.945%.

See my explanation using a code snippet below.

Code snippet (from here) - Approach #1:

To calculate the pair MAPQ, first sum the alignment scores.

ASJ = AS + AS2

Then determine a suitable second best alignment score sum. What I do may not be what Bowtie2 does, as it likely uses the alignment score sums from specific paired-end alignments found. I try to estimate that with the following code and the AS/XS/YS flags given in the SAM file. It is correct >= 99.945% of the time. (((On later code tested, this was only >95% correct.)))

if XS is not None and XS2 is None:
XSJ = XS + AS2
elif XS is None and XS2 is not None:
XSJ = AS + XS2
elif XS is None and XS2 is None:
XSJ = (scoremin-1) * 2
else: ## neither none
XSJ = XS + XS2

After calculating ASJ and XSJ, I also found that ensuring ASJ>=XSJ helped improve consistency with BT2 MAPQ, but just a tiny bit. Often it offers no extra benefit, possibly because XSJ>ASJ is a rare condition.

if XSJ > ASJ:

XSJ = ASJ

Then multiply both the "minimum score" and the "perfect score" by 2. Call that scMinJ and scPerJ. For more information on the "minimum score" and the "perfect score", see the section titled, "Shared concepts between end-to-end and local mode", in the accompanying post on computing MAPQ in end-to-end vs local mode.

Lastly, use the new "combined" or "joint" values (ASJ, XSJ, scMinJ, and scPerJ) in the normal MAPQ calculator for the mode it was aligned in.

NOTE: I am currently improving this further. One condition I found to always underestimate MAPQ in is when 1 mate has a perfect AS (or really good) and no second best (no XS). That essentially means the place the mate with uncertainty is mapped nearby would have less uncertainty, i.e. a better "posterior MAPQ" than was calculated for it independently. Anyway -- in areas of discrepancy, the pair MAPQ given by Bowtie2 is always higher than mine. To be continued.....

Conclusions:

1. Mates are given the same MAPQ in standard bowtie2 mapping.

2. MAPQ for the pairs is calculated in the same way as it is done for single-end reads (in either alignment mode) but using the alignment score sum of the mates in the best paired alignment and the second best paired alignment, and the minimum alignment and perfect alignment scores multiplied by 2.

Wednesday, May 12, 2021

An extremely brief introduction to coloring data points in scatter plots by cluster in R.

by JohnUrbanGenome

Title: An extremely brief introduction to coloring data points in scatter plots by cluster in R.

Topics: PCA, K-means clustering, hierarchical clustering, scatter plots, heat maps.

Created: 2021-05-12

By: John Urban

For: Anyone interested (Spradling Lab, Carnegie Emb, or otherwise)

You're here because you want to learn to how to color data points in scatterplots by the cluster to which they belong. Heck, you want to learn how to cluster columns or rows in a data frame. Maybe you even want to learn to plot data points by their first and second principle components. You might accidentally learn any of these things in this tutorial.

Maybe you're here because you have been making PCA plots, and circling or coloring the clusters yourself by eye for powerpoint presentations. The world is a crazy place. It happens.

Perhaps you want to make the world just a little less crazy by coloring the data points using the output of k-means or hierarchical clustering.

I'm here to give you the briefest quick start possible. The rest is up to you.

Let's pretend we are in R / RStudio.

Let's pretend we did RNA-seq on 3 samples, and only care about 4 genes.

#1. Make a toy dataframe that has 3 replicates each of 3 different sample types (A, B, and C) and 4 genes. Here we are just simulating technical replicates using the rnorm function.

> gdf <- data.frame(sample=c("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"), g1=c(rnorm(3, 10, 1), rnorm(3, 20,2), rnorm(3, 30,2)), g2=c(rnorm(3,30,3), rnorm(3, 25,3), rnorm(3,20,3)), g3=c(rnorm(3, 15,1), rnorm(3,45,5), rnorm(3,60,5)), g4=c(rnorm(3,5,1), rnorm(3,10,2), rnorm(3,15,1)))

#2. Go ahead. Take a look at it. You know you want to. Heck, I want to watch you do it.

> gdf

#3. Also look at it this way!

> t(gdf[,2:5])

#4. Now run PCA on it, and take a look at the output.

> p <- prcomp(t(gdf[,2:5]))

> p

#5. Go ahead and plot that.

> plot(p$rotation[,1], p$rotation[,2], pch=19, cex=2)

#6. If you don't yet know how to color by clusters found with algorithms, use your biological neural network (BNN) to cluster by eye:

#7. Your BNN may have been extremely powerful, but let's try some clustering algorithms. First is k-means clustering. Remember to take a peak at the result.

## Note: K-means is not necessarily deterministic, and the end-result depends on a few things, such as the initial starting points, the number of iterations, and the number of times you restart the algorithm to see if you can do better. Nonetheless, it usually does quite well. Just provide it the number of clusters, k, that you are expecting. If you don't know "k", then a whole other topic is on ways to select "k". Here we know we have 3 samples, so we should have 3 clusters.

> k <- kmeans(x = gdf[,2:5], centers = 3)

> k

#8. Now use the "clustering vector" to color your data points in the PCA plot. One way to do it is this:

> plot(p$rotation[,1], p$rotation[,2], pch=19, cex=2, xlab="PC1", ylab="PC2", main="Colored by K-means clustering.", col=k$cluster)

#9. Another clustering approach is "hierarchical". There are 3 steps here instead of just 1: creating a distance matrix, clustering based on the distance matrix, and cutting the "tree" to give you k=3 clusters.

## Note: hierarchical clustering is deterministic.

## Step 1: creating a distance matrix

> d <- dist(gdf[,2:5])

> d

## Step 2: clustering based on the distance matrix

> h <- hclust(d)

> h

## Step 3: cutting the tree to give you k=3 clusters

> hc <- cutree(h, k=3)

> hc

#10. Now use the hc "cluster vector" to color your data points in the PCA plot. One way to do it is this:

> plot(p$rotation[,1], p$rotation[,2], pch=19, cex=2, xlab="PC1", ylab="PC2", main="Colored by hierarchical clustering.", col=hc)

#11. Note: it is also possible to do hierarchical clustering and visualization of the dataframe as a heatmap in a single step. One way to do that is shown below.

## First just plot the dataframe as a heatmap with no clustering

> heatmap(t(gdf[,2:5]), Rowv = NA, Colv = NA, labCol = gdf$sample)

# For making it easier on the eye to track the technical replicates from each sample, here I am also showing how to use the "known clusters" (the technical replicates of a given sample type) to make "color bars" at the top.

> heatmap(t(gdf[,2:5]), Rowv = NA, Colv = NA, labCol = gdf$sample, ColSideColors = c("red","green","blue")[c(1,1,1,2,2,2,3,3,3)])

# Next, use clustering on the columns to show how the samples relate to each other.

> heatmap(t(gdf[,2:5]), Rowv = NA, labCol = gdf$sample, ColSideColors = c("red","green","blue")[c(1,1,1,2,2,2,3,3,3)])

# Finally, you can optionally cluster on the rows (genes) too

> heatmap(t(gdf[,2:5]), labCol = gdf$sample, ColSideColors = c("red","green","blue")[c(1,1,1,2,2,2,3,3,3)])

=================================================================

That concludes this very brief introduction.

This is only a quick start.

Have fun googling further for more advanced guides!