Biofinysics

Thursday, May 26, 2022

A Newb's Exploration of Protein Structure Prediction and Analysis Web Servers

..this is a living document that I will update as I go....
...I do not guarantee it is finished now or ever will be...
...it is safe to assume I don't know what I'm talking about and am willing to learn...
..this is version 1.13; 2022-05-28
..previous versions 1.0-1.7; 2022-05-26; 1.8-1.12 2022-05-27
... I would like to give a special shout out to my colleague, Karina Gutierrez-Garcia, for helping me to get started with Robetta, AlphaFold, SAVES, and some other software, links, and advice I've still not had time to explore.

A little song and dance about the old days of protein structures:

I confess: I've been a warm body in the audience of many structural biology presentations, but my mind was elsewhere. I get it. I get it. That atom was a tricky one to place. That protein fold was unexpected, sure I guess. And that was something how the protein looked like a ball of yarn before you mutated the alanine at position 369, and still looked like a ball of yarn after.

There were some very good protein structure presentations I've been to as well. Usually about ribosomes. When a structural biologist is also a great presenter (rare in science generally), then it becomes fascinating. Look, it's just a little machine with gears, and ratcheting mechanisms, and scissors! This part here - look how it works like a socket wrench, and this part like a circuit switch. Look how this forms the hardest material in the biological world, and this other part that makes it flexible. See the lattice work!

I've always been glad there were structural biologists crystallizing proteins, doing NMR, cryo-EM, etc; but I was content to not do it myself. It is one of those branches of science that you understand is necessary and foundation-building; that has already yielded big rewards, and will continue to do so slowly but surely; but that seems somewhat sectioned off from what most researchers are doing. The average biologist may have thought, "Sure, it'd be amazing to have the structure to all my favorite proteins, whether to help with designing mutational strategies, to search for 'structural homology', or to just see the little nuts, and bolts, and gears doing that thing it does. But what is the likelihood of ever getting a structure for my favorite proteins?" The likelihood has been practically 0 for a long time, especially if you're not working with a model organism. It wasn't really an option for most people, so one just needed to think about other tools, other aspects of biology, other experiments, other mechanisms, other models, and so on.

...but then predicting the 3D structure of proteins leveled up with deep learning, from programs like AlphaFold. Now it seems there is a whole new world of structure-function questions and applications for any regular, middle-of-the-road, non-structural biologist. It feels like the deep sequencing revolution from 10-15 years ago. There is more and more attention to the predicted structures from computational biologists looking to make programs that do even better, and from molecular and cellular and developmental biologists who can make use of these predictions. These programs are going through all the protein sequence databases, predicting what-appear-to-be accurate or accurate-enough structures for thousands or hundreds of thousands of proteins. It's reminiscent of the open fire-hydrant-like deluge of sequencing data that began pouring out with Solexa/Illumina technology. It seems a clubhouse I was never cool enough to be allowed in, nor were you, has suddenly opened its doors for everyone...

So... I decided to walk in and have a peep around.

This is what I've found so far.

Pretending to know Predicting a protein structure:

First of all, if you're working with a model organism, your protein-of-interest may already have an AlphaFold structure here: https://alphafold.ebi.ac.uk/ . These structures are already hyperlinked across the internet ; featured in common databases like UniProt. For example, UniProt features both the experimentally-derived and AlphaFold-predicted structures of the human MCM2 protein: https://www.uniprot.org/uniprot/P49736#structure .

If you are working with a non-model organism, or want to predict the 'de novo' structure of a mutated form of a protein, then you can use AlphaFold or other programs yourself. I'll be honest -- it feels like I'm "supposed to be using AlphaFold", but I haven't been. I found two options when setting out to do this:
- AlphaFold
- Robetta (roseTTAFold and other programs)

I just found Robetta so easy to use that I've spent most of my time there: go ahead, sign up, copy/paste a protein sequence, and press go. That's all there is to it.

Things to do with the PDB structure files:

Search structure databases for similar protein structures.

My aim was to use the predicted structures to look for ‘structural homology’ in PDB, AF, and other databases for proteins that are lowly conserved at the sequence level. Ideally, one would just be able to use a "structure BLAST" interface at NCBI. I did not find that, but I found three programs to do structural database searches. Here I will use Drosophila MCM2 from AlphaFold as an example. In some I use a Robetta model for MCM2. At the moment, I have reviewed FoldSeek, DALI, FATCAT, and 3D-surfer. Overall, I am bullish on FoldSeek, DALI, andFATCAT; but I am scratching my head a little bit at 3D-surfer.

FoldSeek:

I’ve found that FoldSeek is fast (in "local mode" : 3Di/AA), and seems to be quite good at known true positive PDB files I give it (e.g. Human or Drosophila protein structures); in my experience, the top hit was exactly the protein used, and all other top hits were actual homologs in other species. I tried to run "TM-align" (global) mode, but tens of minutes later it was still not done and, after being spoiled on the seconds time-scale for "local mode", I threw a tantrum and closed out the tab.

Search Parameters:

...mere seconds later....

Easily interpretable BLAST-like alignment visualization

Easily interpretable BLAST-like alignment statistics table:

An issue with the results as presented is that it doesn't tell you the molecule names of the hits, just the model name and the species. You have to click on the model names. From above, it may seem like AlphaFold ("AF") models are all in the top hits, not real experimentally-derived structures from PDB... but that is just an illusion. They group the results by database. See below.

Other database alignment visualizations:

PDB alignment statistics:

These are MCM2 structures or larger complex structures with MCM2 in them.

Having shown that though, the top scoring AF hits have scores in the 5000-7000 range, with e-values of 0, whereas the top hits in the PDB database have scores in the 2000-4000 range with e-values < 1e-60. This may reflect a bias of using an "AF" structure as the query, or that experimentally-derived structures rarely include the full-length protein; sometimes with N- or C- terminal bits trimmed off; or disordered parts trimmed off; or only featuring a certain domain; etc.

UPDATE:

After getting Robetta Structures back for Dmel MCM2, I revisited FoldSeek. The "3Di/AA" mode gave very similar (maybe identical) results for the Robetta MCM2 Model_1 to results I saw for the AlphaFold2 MCM2 model -- I recall reading somewhere that 3Di/AA mode takes the protein sequence into account (thus the "AA" part of 3Di/AA), so this may not be surprising.

I also gave TM-align another shot, and yes it took longer, but not so long this time that I abandoned it. Not very long at all really -- on the scale of minutes. As a reminder, FoldSeek can look at AF and PDB databases. For TM-align mode with the Robetta MCM2 structure (Model 1), the AF database yielded many structures predicted for MCM2. The main difference between the TM-align and 3Di/AA results, was from which organisms the predicted MCM2 structures came. For TM-align, the top hit was not the predicted Drosophila MCM2, for example: it was from "Dracunculus medinensis". The PDB structures also were all MCM2-containing DNA replication complexes. So I can vouch for both search modes (3Di/AA and TM-align) for FoldSeek now. I will need to read more into the differences between these modes to say any more though.

The statistics tables also differ a little bit between 3Di/AA and TM-align modes. The both share the first 4 columns for a given Database (Target, Scientific Name, Seq. Id., Score) and the last 3 (Query Pos., Target Pos., Alignment). However, (i) the "Score" column represents different scoring strategies and scales, and (ii) where 3Di/AA shows an E-value, TM-align shows a TM-score. The E-value is readily familiar from BLAST searches: smaller (closer to 0) is better. It appears that the TM-score is between 0 and 1: higher is better. It appears that the Score for TM-align is just FLOOR(TM-score*100) -- i.e. the TM-score multiplied by 100 and rounded down to the nearest integer; or more simply, just the first two decimal places of the TM-score multiplied by 100.

For the Robetta Model1 of Drosophila MCM2, the top 10 hits in the given database had the following ranges:

3Di/AA AFDB:

- seq id 51-100%, and higher ranked hits had higher seq id.

- score 4274-5869

- E-value 5.255e-96 to 0.

3Di/AA PDB:

- seq id 53.4-95.9%, not necessarily correlated with rank

- score 3203-4012

- E-value 1.198e-71 to 3.524e-90

TM-align AFDB:

- seq id 37.2-67.6%, not in same order as rank

- score 76-79,

- TM-score 0.7615-0.7905

TM-align PDB:

- seq id 24.5-63.4%, not in same order as rank

- score 60-66,

- TM-score 0.6003-0.6688

Note that the FoldSeek authors consider good TM-scores to be >0.5, and bad ones to be <0.5, although it appears that true positive hits can easily have TM-scores <0.5 (e.g. distant homologs).

UPDATE: I highly recommend reading the FoldSeek preprint as a primer not only on FoldSeek, but the 3d-alignment/database search field generally.

A peak at FoldSeek: In short, exhaustive structural searches in huge databases could take eons, so FoldSeek offers a strategy, similar to how sequence searches use heuristics like k-mers, to drastically pre-filter, reduce the search space, and gain orders of magnitude in speed. This is done by converting the structural information of a protein into a sequence as if it were a protein or DNA sequence, then using sequence-based approaches (like BLAST) to pre-filter. Older approaches used "structural alphabets" focused on the AA backbone, whereas FoldSeek created a new structural alphabet focusing on 3D interactions (hence "3Di"). The pre-filtered high-scoring hits are either aligned locally by combining structural and Amino Acid substitution information (3Di/AA) or globally (TM-align). The authors claim that Foldseek is much more sensitive than previous strutural-alphabet-based tools (e.g. 3D-BLAST), has lower sensitivity than Dali, but higher than the structural aligner CE, and similar to TMalign and TMalign-fast; all while being up to >100,000x faster than previous tools.

More info on TM-align here.

DALI:

I’ve also messed around with DALI (distance matrix alignment method): more info here and here; all refs here.

The DALI server was overloaded, and my jobs were queued. It took a day to get the results back for the structural search of a known protein against PDB. Nonetheless, the top hit was the expected protein, and other top hits were homologs in other species.

At the moment, with my limited clicking around, these results do not seem to link you to the AF or PDB pages of structures they aligned to....

In addition to matches and match stats, the results also offered other "buttons" to click and explore. The one I found most interesting for the moment was the "Pfam" button where, if I'm not mistaken, they shown Pfam domains on your protein that were structural matches (as opposed to normal sequence matching of Pfam domains).

FATCAT:

FATCAT stands for "Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists" and is discussed more in the "Compare protein structures to each other" section below. More info here, here, and here; simple explanation here.

The FATCAT server has a few modes:

- Pairwise Alignment: align two structures with FATCAT. Will be discussed more below.

- Structural Comparison of Close Homologs: search for proteins with high sequence similarity to a query, then align it to structures for those proteins with FATCAT. I used Model 1 for Drosophila MCM2 from Robetta as an example. It gave the following hits:

It requires high sequence similarity to a protein in the PDB database before trying to align, so it is no wonder that it pulls up the MCM2 structures. If you click on "view" in the "align" column, it brings you to a page like this:

It has the downloads you might want, and also an interactive viewer to see the structural alignment:

- Search for Similar Protein Structures: Search a protein structure database for similar structures with FATCAT.

Since the above "close homolog" mode keeps only the very small subset of candidates in the PDB website with highly similar protein sequences, it runs relatively fast. This mode doesn't have that pruning process. Instead, I assume, it is attempting to perform alignments of your query to ALL of the structures in the database, and thus it takes far far longer to finish. FoldSeek is much much faster in both 3Di/AA and TM-align modes, but I am unable to comment on why -- could be purely algorithmic, could be based on a pruning process, both...

UPDATE:

Ultimately it took < 2 hours. The top hit was an MCM2 structure, so I can also vouch for FATCAT structure database searches. There were 86 hits. It appears that these hits were filtered to have a p-value < 0.05. It is unclear how many tests were done, and FDR/q-values is not given. At the top of the page it has an interactive plot that allows you to see some summary statistics on the hits

P-value vs. Length

Length vs Alignment Length

Length vs Sequence Identity

Lower on the page, it gives the structural alignment table:

You can click on the structures: for example, the top hit brings you to a crystal structure of the MCM hexamer. I clicked the top 10 structures: all contained MCM. Clicking on "view" under "align" brings you to a page to download files or view more plots:

3D-Surfer:

3D-Surfer offers database searches as well, but the results on knowns were confusing to me… None of the "hits" it showed were MCM2 -- well at least none that I clicked on. Even when I had them show me the results table for 1000 hits, searching only the AlphaFold database, none of the hits were the AlphaFold MCM2 model I used... So, I can’t vouch for 3D-surfer per se (like I can for FoldSeek and Dali), but 3D-surfer does at least give me hits to consider for all my query structures predicted by Robetta for my proteins-of-interest — whereas FoldSeek found no hits for most. So I don't know what to make of the hits it is giving me in that scenario... but I can see that the protein structures returned do look a bit like the queries when it is simple enough to conclude that "by eye".

Example 3D-Surfer Results:

Example 3D-Surfer results table:

Compare protein structures to each other:

In addition to searching databases for 'structure hits', I also want to be able to align different structures together. For example, if there are 2 predicted structure models, I want to align them to see the areas of agreement or disagreement. And I want to be able to align known structures to predicted structures. And so on. There are bound to be many tools that do this type of stuff. I know none of them at the moment, although the programs above likely offer this type of service. For example, I know DALI allows you to do pairwise alignments on all your own stuff, not just databases. And FoldSeek should too... though I might have to download it and do that at the command-line. A colleague uses PyMol. For the moment, that's all I have.

RSCB PDB 3D-View and Pairwise Structure Alignment:

The 3D-View Tool allows you to import multiple PDBs. By default the structures are drawn in different parts of the 3D space. This allows you to compare them by eye, but they are not structurally aligned. While there seem to be buttons concerning aligning the structures, I couldn't figure out how to do that. Meanwhile, the Pairwise Structure Alignment Tool does do structural alignments, allowing a few different ways to do that. I gave all of them a try using the first two models predicted for Drosophila MCM2 by Robetta (which seems to predict 5 by default). Next I will name the alignment modes, and describe how I interpret the results for these two structures based on a nearly absent (but not absent) understanding of what each mode is doing. For more info on what the modes do, see here.

Overall, it appears from a visual point-of-view that the alignment modes fall into roughly 3 categories.

Category 1 : Rigid Optimization

Most modes (4 of 6) fit into what I'd term a "rigid optimization" category: jFATCAT (rigid), jCE, TM-Align, and Smith-Waterman 3D. The all give similar alignments (the SW-3D mode to a lesser extendt). They seem to optimize the amount of the two MCM structure models matching in 3D-space, but do not make changes (or at least not large changes) to the structure or assume it can take on other conformations. The result is that the MCM2 models almost fully align in the ball-of-yarn globular domain(s) of the protein but the N-terminal disordered arms (on bottom left of image) go off in different directions (and are not force aligned in 3D-space). Smith-Waterman 3D, by the looks of it, does the least amount of structure-changing/conforming (or none at all): so while you can see similar structures between models nearby in 3D space, they are much less overlapping than the other 3 modes. Note that FATCAT stands for "Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists". CE stands for "Combinatorial Extension".

- jFATCAT (Rigid) : Gives similar results to jCE and TM-align, and to a lesser-extent, Smith-Waterman 3D. This mode is explained by RCSB as allowing "for flexible protein structure comparison", and, "The rigid flavor of the algorithm is used to run a rigid-body superposition that only considers alignments with matching sequence order. For most structures the performance of this structure alignment is similar to that of CE." Also see: https://fatcat.godziklab.org/fatcat/fatcat_pair.html

- jCE : Gives similar results to jFATCAT (rigid) and TM-align, and to a lesser-extent, Smith-Waterman 3D. RSCB says it "works by identifying segments of the two structures with similar local structure, and then combining those regions to align the maximum number of residues in order to keep the root mean squared deviations (rmsd) between the pair of structures low. This Java port of the original CE uses a rigid-body alignment algorithm. Relative orientations of atoms in the structures being compared are kept fixed during superposition. It assumes that aligned residues occur in the same order in both proteins (i.e., the alignment is sequence-order dependent)."

- TM-align : Gives similar results to jFATCAT (rigid) and jCE, and to a lesser-extent, Smith-Waterman 3D. RCSB explains TM-align as, "Sequence-independent protein structure comparison" that is "sensitive to global topology". Note that TM-align can also be performed at this website that explains TM-Align as "an algorithm for sequence-independent protein structure comparisons. For two protein structures of unknown equivalence, TM-align first generates optimized residue-to-residue alignment based on structural similarity using heuristic dynamic programming iterations."

- Smith-Waterman 3D : Similar to above 3, but more separation between clearly-matching structures. RSCB explains that it "aligns similar sequence segments using Blosum65 scoring matrix ... and aligns two structures based on the sequence alignment." They give the following advice/warnings: "Note that this method works well for structures with significant sequence similarity and is faster than the structure-based methods. However, any errors in locating gaps, or a small number of badly aligned residues can lead to high RMSD in the resulting superposition."

Category 2 : Flexible Optimization

- jFATCAT (flexible) : This seems to further optimize the amount of the two MCM models that match in 3D space by assuming more flexibility/conformational freedom. The net result is basically that it is able to align those disordered N-terminal arms as well. This may be reasonable and the two models may simply reflect that the N-terminal arm is flexible and may take on 1 of multiple conformations. RSCB explains the flexible option in the following way: "The flexible flavor of FATCAT introduces twists (hinges) between different parts of the superposed proteins so that these parts are aligned independently. This makes it possible to effectively compare protein structures that undergo conformational changes in specific parts of the molecule such that global (rigid body) superposition cannot capture the underlying similarity between domains. For example, when the two polymers being compared are in different functional forms (e.g., bound to partner proteins/ligands), were crystallized under different conditions, or have mutations. The downside of this approach is that it can lead to false positive matches in unrelated structures, requiring that results be carefully reviewed." The take-home is that it increases sensitivity at the cost of losing specificity. Also see: https://fatcat.godziklab.org/fatcat/fatcat_pair.html

Category 3 : WTF

- jCE-CP : I'm actually not entirely sure how to make sense out of what happened here.... more colors than I expected for one.... RSCB explains it: "Some protein pairs are related by a circular permutation, i.e., the N-terminal part of one protein is related to the C-terminal part of the other or vice versa, or the topology of loops connecting secondary structural elements in a domain are different. Combinatorial Extension with Circular Permutations allows the structural comparison of such circularly permuted proteins." The take-home for this analysis though is that it does not appear to be an appropriate choice.

FATCAT :

- Pairwise Alignment: align two structures with FATCAT.

As mentioned twice above, there is a server dedicated to FATCAT that allows pairwise alignments and database searches (the latter discussed above).

Here is an example of aligning Model_1 and Model_2 together (as above). First it shows you a summary page like this:

The interactive viewer allows you to see the alignment:

Note that it offers both the "rigid" and "flexible" modes. The above was run in "flexible" mode. Compare it to the image shown for FATCAT (flexible) run on the RSCB server.

POSA - Partial Order Structure Alignment:

From the makers of FATCAT, is POSA, a way to align >2 structures and visualize in 3D space. My immediate use application for this is simple: Robetta output 5 models for Drosophila MCM2, let's align them.

There were two sets of results: results with flexibilities incorporated, and results without. I'm going to just use the FATCAT language here and call those two "flexible" and "rigid", respectively. First the rigid results. You can look at all the structures side-by-side or overlapping. When overlapping you can show all parts of each molecule, but you can also highlight or only show the "common core regions" shared by the molecules.

Rigid: side-by-side structures

Rigid: Show All

Rigid: Highlight common core regions

Rigid: Show only the common core regions

Above, the only region not in the "common core" is the N-terminal disordered arm. So may be reasonable to conclude that the 5 models are very similar except for that arm, although I can definitely see 'disagreeable' areas in the other part as well. We can also see the flexible mode results to see if the arm and those other regions were considered flexible enough to align more closely together:

Flexible: Show All

Flexible; Highlight common core regions

Flexible: show only the common core regions

DALI All-by-All pairwise analysis of 5 Robetta Models of a given protein sequence:

DALI database searches were discussed above. It also offers pairwise analyses. I didn't find a viewer for the structures aligned in 3D space per se, but it offers metrics on structure similarity as shown below.

DALI: Structural similarity tree

DALI: structural similarity matrix

DALI: Correspondence Analysis (conceptually similar to PCA)

Evaluating protein structures.

AlphaFold and Robetta may offer scores regarding how confident they are in the protein structures they predict. For example, Robetta offers a confidence score between 0-1. However, one may also be interested in third party programs to run evaluations of experimental- and predicted- structures from various programs. A colleague told me to try "SAVES":
- https://www.doe-mbi.ucla.edu/saves/
- https://saves.mbi.ucla.edu/

It offers a few programs, all you need is the PDB file. Here, again, I will use Drosophila MCM2 from AlphaFold:
- ERRAT: statistics of non-bonded interactions between different atom types ; compares with statistics from highly refined structures.
- Verify3D: Determines the compatibility of an atomic model (3D) with its own amino acid sequence (1D); compares the results to good structures.
- PROVE: Calculates the volumes of atoms in macromolecules; calculates a statistical Z-score deviation for the model from highly refined structures.
- WHATCHECK: extensive checking of stereochemical parameters of the residues in the model.
- PROCHECK: stereochemical quality of residue-by-residue geometry and overall structure geometry.
- CRYST: searches the Protein Data Bank for entries that have a unit cell similar to your input file.

Example Results:

Clicking on the "Results" button for each gives more information and plots on each analysis. Overall, ERRAT liked this predicted structure. VERIFY and PROVE didn't. PROCHECK was mixed. WHATCHECK was a lot of information that triggered me into a TLDR state. The colleague who pointed me here said she only uses "ERRAT" - so if that's the case, then great job AlphaFold! Otherwise, I have questions.

Sunday, July 18, 2021

How does Bowtie2 calculate a minimum alignment score?

by JohnUrbanGenome

The Bowtie2 manual does a fine job at explaining what the minimum score is and how it is computed. So consider this blog post a supplement to the manual, if you wish to be generous. Otherwise, denounce it as redundant.

For relevant parts of the manual see the sections on (i) scoring, and particularly the sub-section therein on valid alignments; (ii) the section on setting function options ; and (iii) the descriptions of scoring options in the options section.

It may also be useful to review the following subsections in my aging blog on how Bowtie2 computes MAPQ scores (for single-end reads in end-to-end mode): (i) how bowtie2 scores a mismatch, and (ii) how bowtie2 decides if an alignment is valid. For more information on computing MAPQ scores, see my more recent posts on paired-end reads and local mode.

The short and simple is this:
Bowtie2 calculates a minimum score to define a valid alignment. This minimum score is defined by the 3 comma-separated parameters given to the --score-min option.

The rest is this:
Below I discuss those 3 comma-separated parameters in more detail, and give examples on (i) the defaults in end-to-end and local modes, (ii) defining a function where the minimum score is purely dependent on read length, (iii) defining a function where the minimum score is independent of read length.

--score-min Parameter Description:

- --score-min F,B,M :: F=transform function (C=constant, G=log, L=linear, S=sqrt); B=b in mx+b; M=m in mx+b

- computed min_score(read_length) = B + M*F(read_length)

- Minimum score = absolute minimum + length-dependent addition

- If the "absolute minimum" is set to 0, then it is purely dependent on read length.

- Otherwise, it requires first meeting a constant minimum like BWA, followed by becoming more stringent with read length.

DEFAULT IN END-TO-END MODE:

- Default for end-to-end mode = L,-0.6,-0.6

- L,-0.6,-0.6(read_length) = -0.6 + -0.6*read_length

- Read_length -> MinScore

- 20 -> -12.6

- 30 -> -18.6

- 50 -> -30.6

- 75 -> -45.6

- 100 -> -60.6

- 150 -> -90.6

DEFAULT IN LOCAL MODE:

- Default for local mode = G,20,8

- G,20,8(read_length) = 20 + 8*log(read_length)

- Read_length -> MinScore

- 20 -> 43.9659

- 30 -> 47.2096

- 50 -> 51.2962

- 75 -> 54.5399

- 100 -> 56.8414

- 150 -> 60.0851

DEFINING A FUNCTION PURELY DEPENDENT ON READ LENGTH:

Purely dependent on read length:

- End-to-end mode - similar to default without absolute minimum:

- L,0,-0.6

- Local mode - similar to default without absolute minimum:

- G,0,8

DEFINING A FUNCTION INDEPENDENT ON READ LENGTH:

- Only requires passing absolute minimum / independent of read length.

- This is similar to the default in another popular short read aligner, BWA, where the alignment score is calculated independent of read length, and the default minimum score is 30 (-T 30).

- If you wanted Bowtie2 to have a similar approach, the following codes work by analogy (Bowtie2 uses +2 for matches by default, BWA +1):

- -T 20 ==> C,40,0

- C,40,0(read_length) = 40 + 0*read_length = 40

- -T 30 ==> C,60,0

- C,60,0(read_length) = 60 + 0*read_length = 60

- ETC

- If you wanted it even more similar, the following should do the trick in theory, but differences are found anyway:

- -T 20 ==> --ma 1 --mp 4,4 --rdg 6,1 --rfg 6,1 --np 4 --score-min C,20,0

- -T 30 ==> "--ma 1 --mp 4,4 --rdg 6,1 --rfg 6,1 --np 4 --score-min C,30,0

- ETC

THE END:

Well, that is all folks. Best wishes to you.

How does Bowtie2 assign MAPQ (in local mode)?

by JohnUrbanGenome

REVIEW:

For a longer discussion on this question, see my aging posts from 2014:

1. How does bowtie2 assign MAPQ scores?

2. Where does bowtie2 assign true multireads (AS=XS)?

3. The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data

Also see my more recent post on MAPQ for paired reads:

How does Bowtie2 assign MAPQ to each mate in paired-end reads?

Finally, don't forget to check out my python code for computing Bowtie2 MAPQ scores:

- https://github.com/JohnUrban/MAPQcalculators

WITHOUT FURTHER ADO:

When I originally wrote about how bowtie2 assigns MAPQ scores years ago, I was thinking about it for experiments with single-end reads that I aligned in end-to-end mode. More recently, I became curious about how MAPQ was computed for paired reads, and how it is computed in local mode. The answer to the first question (on paired MAPQ) is essentially, but perhaps not totally, that bowtie2 sums the alignment scores for the mates and doubles the effective minimum score to calculate a paired MAPQ, as if computing a MAPQ for a read twice as long as the input reads (assuming a constant read length here). There was a little bit of fudging around with XS (the second best alignments) in my calculations that may be bypassed when bowtie2 is calculating the paired MAPQ with potentially more information than it outputs in the SAM file. Specifically, Bowtie2 likely compares the best AS sum to the second best AS sum whereas with what we are provided in the SAM file (at least as far as the AS, XS, and YS flags go) only allows us to compare the best AS sum to an estimation of the second best AS sum. The estimation I've concocted uses the XS values provided in the SAM entries for the individual mates, and when there is no second best (XS) for one of the mates, it uses the AS value for that mate. In a test of 49059 paired-end reads aligned in local mode, my python code gave the same paired MAPQ for all but 27 pairs. That is an error rate of 0.055%, or a success rate of 99.945%.

In this blog post, I will briefly detail the bowtie2 code for computing MAPQ, highlighting the local mode calculations, and showing the python code for it.

The short take-home messages are:

1. The MAPQ calculations for local mode differ from end-to-end mode.

2. The code structure for calculating MAPQ in both modes looks very similar.

3. The main differences is what MAPQ is assigned for the different if/elif rules.

4. One highlighted difference is that, whereas the max MAPQ for end-to-end reads is 42, the max for locally aligned reads is 44.

5. Locally aligned reads meeting similar "rule thresholds" generally get higher MAPQs than their end-to-end aligned counterparts, likely b/c the nature of local mode makes it less likely to cross a given quality threshold.

Bowtie2 code for calculating MAPQ:

The bowtie2 code can be found in the "unique.h" file in the bowtie2 package:

https://github.com/BenLangmead/bowtie2/blob/master/unique.h

The snippet of interest begins around line 168-170 in the current version. Look. for"V2 of the MAPQ calculator". I will paste it at the end of this blog to make life easier. It has the code for both end-to-end and local mode. Below I will focus on local mode, and show the python code for it.

Shared concepts between end-to-end and local mode:

Both modes depend on calculating a minimum score to define a valid alignment. This minimum score is defined by the parameters given to the --score-min option. See my recent post summarizing this step.

Both modes depend on calculating what would be the "perfect score" given the alignment parameters. This is essentially the read length multiplied by what is set for the "match bonus" (--ma option to bowtie2). In end-to-end mode, this defaults to 0. In local mode, it defaults to 2. Thus, by default, 0 is the perfect score for any read length in end-to-end mode, and 2*read_length is the perfect score in local mode. For example, the perfect score for a 50 bp single-end read = 2*50 = 100.

Both modes depend on comparing the best alignment score (AS) with the second best alignment score (XS). If there was no valid second best alignment score, then bowtie2 arbitrarily makes XS equal the minimum score minus one (XS = minscore - 1). Under the alignment score scheme, this is the most conservative thing to assume about the next best, meaning it will help give a worst-case MAPQ.

Both modes calculate the maximum difference between the perfect score and the minimum score.

Both modes track the best alignment score for a read. This is typically AS, but in paired-end mode where mates get assigned to the place with the highest sum of alignment scores for both mates, one of the mates might not be in the location of its highest alignment score. In these cases, XS > AS for that mate, and XS is the best alignment score for that mate.

Both modes track the best difference over the minimum score. This is like above when the maximum difference was found between the perfect score and the minimum score. However, here it is tracking the difference between the best alignment score and the minimum score.

Both keep track of the difference between the best alignment score and the second best. When AS >= XS, this is just the absolute difference between AS and XS. However, in the situations described above for paired-end reads where XS>AS for one of the mates, the difference here ought to be negative. In my code, I have found setting this to 1 when XS>AS (i.e. not AS >= XS) helps reproduce bowtie2 MAPQ results.

The above is encapsulated in the python function below (code here):

def bt2_mapq(AS, XS=None, alnmode=None, scMin=None, scPer=None):

'''

scMin = minScore

scPer = perfect score

'''

## Did the read have a second-best alignment?

if XS == None: ## make it less than scMin

## scMin = score of a just barely valid match

XS = scMin-1

## Difference between the perfect score and the minimum score

diff = max(1, abs(scPer-scMin))

## Best alignment score found

best = max(AS, XS)

## Difference between best alignment score seen and score minumum

bestOver = best - scMin

## Absolute difference between the best and second best alignment scores (usually AS>XS)

if AS >= XS:

bestdiff = abs(abs(AS)-abs(XS))

else:

bestdiff = 1 ## seems like 0 or negative would be better, but so far this has minimized discrepancies

## Was alignment mode in end-to-end or local?

if alnmode == 'endtoend':

return endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)

elif alnmode == 'local':

return local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff)

Where the modes depart in the MAPQ score calculations:

Both share a very similar code structure here, but different MAPQs are awarded for meeting different quality thresholds. Most easily noticeable is the max MAPQ is 44 in local mode as compared to 42 in end-to-end mode.

The end-to-end code was featured in my original post on computing MAPQ scores (How does bowtie2 assign MAPQ scores?), but is reproduced below.

Note - I will color in the python code below as above at a later point.

I leave you with the code, which is the best way to see how MAPQs are being assigned. I translated it into Python from the original Bowtie2 code as it is simpler to read (and I was trying to recapitulate the original code anyway).

Be you bioinformatics Padawan or Jedi, I hope I have been able to help you answer a question.

--John

End-to-end mode:

def endtoend_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):

#print(AS, XS, scMin, diff, bestOver, bestdiff)

## If does not have second best

if XS < scMin: ## if only single alignment (no XS)

if bestOver >= diff*0.8:

return 42 #42 when

elif bestOver >= diff*0.7:

return 40

elif bestOver >= diff*0.61: ## originally >= 0.60 but this made it agree with my bt2 tests better

return 24

elif bestOver >= diff*0.5:

return 23

elif bestOver >= diff*0.4:

return 8

elif bestOver >= diff*0.3:

return 3

else:

return 0

else: ## it does have second best

if bestdiff >= diff*0.9:

if bestOver == diff:

return 39

else:

return 33

elif bestdiff >= diff*0.8:

if bestOver == diff:

return 38

else:

return 27

elif bestdiff >= diff*0.7:

if bestOver == diff:

return 37

else:

return 26

elif bestdiff >= diff*0.6:

if bestOver == diff:

return 36

else:

return 22

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.84:

return 25

elif bestOver >= diff*0.68:

return 16

else:

#elif bestOver >= diff*0.68:

return 5

elif bestdiff >= diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.84:

return 21

elif bestOver >= diff*0.68:

return 14

else:

return 4

elif bestdiff >= diff*0.3:

if bestOver == diff:

return 32

elif bestOver >= diff*0.88:

return 18

elif bestOver >= diff*0.67:

return 15

else:

return 3

elif bestdiff >= diff*0.2:

if bestOver == diff:

return 31

elif bestOver >= diff*0.88:

return 17

elif bestOver >= diff*0.67:

return 11

else:

return 0

elif bestdiff >= diff*0.1:

if bestOver == diff:

return 30

elif bestOver >= diff*0.88:

return 12

elif bestOver >= diff*0.67:

return 7

else:

return 0

elif bestdiff > 0:

if bestOver >= diff*0.67:

return 6

else:

return 2

else: ## best and 2nd best AS are the same (multireads where AS=XS)

if bestOver >= diff*0.68: ## originally >= 0.67 but this made it agree with my bt2 tests better

return 1

else:

return 0

Local mode:

def local_MAPQ(AS, XS, scMin, diff, bestOver, bestdiff):

## BELOW - I have changed ">=" in original bowtie2 code to ">" where it gave more results consistent w/ bt2 output

## Future testing may lead me to change it back, but it fixed hundreds of discrepancies in original testing of paired MAPQ calculations

## If does not have second best

if XS < scMin: ## if only single alignment (AS and no XS)

if bestOver > diff*0.8:

return 44 # 44 is best in local mode, but only 42 in endtoend mode...

elif bestOver >= diff*0.7:

return 42

elif bestOver > diff*0.6:

return 41

elif bestOver >= diff*0.5:

return 36

elif bestOver >= diff*0.4:

return 28

elif bestOver >= diff*0.3:

return 24

else:

return 22

else: ## it does have second best

if bestdiff >= diff*0.9:

return 40

elif bestdiff > diff*0.8:

return 39

elif bestdiff >= diff*0.7:

return 38

elif bestdiff > diff*0.6:

return 37

elif bestdiff >= diff*0.5:

if bestOver == diff:

return 35

elif bestOver >= diff*0.5:

return 25

else:

return 20

elif bestdiff > diff*0.4:

if bestOver == diff:

return 34

elif bestOver >= diff*0.5:

return 21

else:

return 19

elif bestdiff > diff*0.3:

if bestOver == diff:

return 33

elif bestOver >= diff*0.5:

return 18

else:

return 16

elif bestdiff > diff*0.2:

if bestOver == diff:

return 32

elif bestOver >= diff*0.5:

return 17

else:

return 12

elif bestdiff > diff*0.1:

if bestOver == diff:

return 31

elif bestOver >= diff*0.5:

return 14

else:

return 9

elif bestdiff > 0:

if bestOver >= diff*0.5:

return 11

else:

return 2

else: ## best and 2nd best AS are the same (multireads where AS=XS)

if bestOver >= diff*0.5:

return 1

else:

return 0

Bowtie2 code snippet - V2 of MAPQ calculator

==================================

/**

* V2 of the MAPQ calculator

class BowtieMapq2 : public Mapq {

public:

BowtieMapq2(

const SimpleFunc& scoreMin,

const Scoring& sc) :

scoreMin_(scoreMin),

sc_(sc)

{ }

virtual ~BowtieMapq2() { }

/**

* Given an AlnSetSumm, return a mapping quality calculated.

virtual TMapq mapq(

const AlnSetSumm& s,

const AlnFlags& flags,

bool mate1,

size_t rdlen,

size_t ordlen,

char *inps) // put string representation of inputs here

const

{

// Did the read have a second-best alignment?

bool hasSecbest = s.paired() ?

VALID_AL_SCORE(s.bestUnchosenCScore()) :

VALID_AL_SCORE(s.bestUnchosenScore(mate1));

// This corresponds to a scenario where we found one and only one

// alignment but didn't really look for a second one

if(!flags.isPrimary() ||

(!flags.canMax() && !s.exhausted(mate1) && !hasSecbest)) {

return 255;

}

// scPer = score of a perfect match

TAlScore scPer = (TAlScore)sc_.perfectScore(rdlen);

if(s.paired()) {

scPer += (TAlScore)sc_.perfectScore(ordlen);

}

// scMin = score of a just barely valid match

TAlScore scMin = scoreMin_.f<TAlScore>((float)rdlen);

if(s.paired()) {

scMin += scoreMin_.f<TAlScore>((float)ordlen);

}

TAlScore secbest = scMin-1;

TAlScore diff = std::max<TAlScore>(1, scPer - scMin); // scores can vary by up to this much

TMapq ret = 0;

TAlScore best = s.paired() ?

s.bestCScore().score() : s.bestScore(mate1).score();

// best score but normalized so that 0 = worst valid score

TAlScore bestOver = best - scMin;

if(sc_.monotone) {

// End-to-end alignment

if(!hasSecbest) {

if (bestOver >= diff * (double)0.8f) ret = 42;

else if(bestOver >= diff * (double)0.7f) ret = 40;

else if(bestOver >= diff * (double)0.6f) ret = 24;

else if(bestOver >= diff * (double)0.5f) ret = 23;

else if(bestOver >= diff * (double)0.4f) ret = 8;

else if(bestOver >= diff * (double)0.3f) ret = 3;

else ret = 0;

} else {

secbest = s.paired() ?

s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();

TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));

if(bestdiff >= diff * (double)0.9f) {

if(bestOver == diff) {

ret = 39;

} else {

ret = 33;

}

} else if(bestdiff >= diff * (double)0.8f) {

if(bestOver == diff) {

ret = 38;

} else {

ret = 27;

}

} else if(bestdiff >= diff * (double)0.7f) {

if(bestOver == diff) {

ret = 37;

} else {

ret = 26;

}

} else if(bestdiff >= diff * (double)0.6f) {

if(bestOver == diff) {

ret = 36;

} else {

ret = 22;

}

} else if(bestdiff >= diff * (double)0.5f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 35;

} else if(bestOver >= diff * (double)0.84f) {

ret = 25;

} else if(bestOver >= diff * (double)0.68f) {

ret = 16;

} else {

ret = 5;

}

} else if(bestdiff >= diff * (double)0.4f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 34;

} else if(bestOver >= diff * (double)0.84f) {

ret = 21;

} else if(bestOver >= diff * (double)0.68f) {

ret = 14;

} else {

ret = 4;

}

} else if(bestdiff >= diff * (double)0.3f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 32;

} else if(bestOver >= diff * (double)0.88f) {

ret = 18;

} else if(bestOver >= diff * (double)0.67f) {

ret = 15;

} else {

ret = 3;

}

} else if(bestdiff >= diff * (double)0.2f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 31;

} else if(bestOver >= diff * (double)0.88f) {

ret = 17;

} else if(bestOver >= diff * (double)0.67f) {

ret = 11;

} else {

ret = 0;

}

} else if(bestdiff >= diff * (double)0.1f) {

// Top third is still pretty good

if (bestOver == diff) {

ret = 30;

} else if(bestOver >= diff * (double)0.88f) {

ret = 12;

} else if(bestOver >= diff * (double)0.67f) {

ret = 7;

} else {

ret = 0;

}

} else if(bestdiff > 0) {

// Top third is still pretty good

if(bestOver >= diff * (double)0.67f) {

ret = 6;

} else {

ret = 2;

}

} else {

assert_eq(bestdiff, 0);

// Top third is still pretty good

if(bestOver >= diff * (double)0.67f) {

ret = 1;

} else {

ret = 0;

}

} else {

// Local alignment

if(!hasSecbest) {

if (bestOver >= diff * (double)0.8f) ret = 44;

else if(bestOver >= diff * (double)0.7f) ret = 42;

else if(bestOver >= diff * (double)0.6f) ret = 41;

else if(bestOver >= diff * (double)0.5f) ret = 36;

else if(bestOver >= diff * (double)0.4f) ret = 28;

else if(bestOver >= diff * (double)0.3f) ret = 24;

else ret = 22;

} else {

secbest = s.paired() ?

s.bestUnchosenCScore().score() : s.bestUnchosenScore(mate1).score();

TAlScore bestdiff = abs(abs(static_cast<long>(best))-abs(static_cast<long>(secbest)));

if (bestdiff >= diff * (double)0.9f) ret = 40;

else if(bestdiff >= diff * (double)0.8f) ret = 39;

else if(bestdiff >= diff * (double)0.7f) ret = 38;

else if(bestdiff >= diff * (double)0.6f) ret = 37;

else if(bestdiff >= diff * (double)0.5f) {

if (bestOver == diff) ret = 35;

else if(bestOver >= diff * (double)0.50f) ret = 25;

else ret = 20;

} else if(bestdiff >= diff * (double)0.4f) {

if (bestOver == diff) ret = 34;

else if(bestOver >= diff * (double)0.50f) ret = 21;

else ret = 19;

} else if(bestdiff >= diff * (double)0.3f) {

if (bestOver == diff) ret = 33;

else if(bestOver >= diff * (double)0.5f) ret = 18;

else ret = 16;

} else if(bestdiff >= diff * (double)0.2f) {

if (bestOver == diff) ret = 32;

else if(bestOver >= diff * (double)0.5f) ret = 17;

else ret = 12;

} else if(bestdiff >= diff * (double)0.1f) {

if (bestOver == diff) ret = 31;

else if(bestOver >= diff * (double)0.5f) ret = 14;

else ret = 9;

} else if(bestdiff > 0) {

if(bestOver >= diff * (double)0.5f) ret = 11;

else ret = 2;

} else {

assert_eq(bestdiff, 0);

if(bestOver >= diff * (double)0.5f) ret = 1;

else ret = 0;

}

// Note: modifications to inps must be synchronized

//if(inps != NULL) {

// inps = itoa10<TAlScore>(best, inps);

// *inps++ = ',';

// inps = itoa10<TAlScore>(secbest, inps);

// *inps++ = ',';

// inps = itoa10<TMapq>(ret, inps);

//}

return ret;

}

protected:

SimpleFunc scoreMin_;

const Scoring& sc_;

};

==================================