Thursday, May 26, 2022

A Newb's Exploration of Protein Structure Prediction and Analysis Web Servers

 
..this is a living document that I will update as I go....
...I do not guarantee it is finished now or ever will be...
...it is safe to assume I don't know what I'm talking about and am willing to learn...
..this is version 1.13; 2022-05-28
..previous versions 1.0-1.7; 2022-05-26; 1.8-1.12 2022-05-27
... I would like to give a special shout out to my colleague, 
Karina Gutierrez-Garcia, for helping me to get started with Robetta, AlphaFold, SAVES, and some other software, links, and advice I've still not had time to explore.

A little song and dance about the old days of protein structures:

I confess: I've been a warm body in the audience of many structural biology presentations, but my mind was elsewhere. I get it. I get it. That atom was a tricky one to place. That protein fold was unexpected, sure I guess. And that was something how the protein looked like a ball of yarn before you mutated the alanine at position 369, and still looked like a ball of yarn after. 

There were some very good protein structure presentations I've been to as well. Usually about ribosomes. When a structural biologist is also a great presenter (rare in science generally), then it becomes fascinating. Look, it's just a little machine with gears, and ratcheting mechanisms, and scissors! This part here - look how it works like a socket wrench, and this part like a circuit switch. Look how this forms the hardest material in the biological world, and this other part that makes it flexible. See the lattice work!

I've always been glad there were structural biologists crystallizing proteins, doing NMR, cryo-EM, etc; but I was content to not do it myself. It is one of those branches of science that you understand is necessary and foundation-building; that has already yielded big rewards, and will continue to do so slowly but surely; but that seems somewhat sectioned off from what most researchers are doing. The average biologist may have thought, "Sure, it'd be amazing to have the structure to all my favorite proteins, whether to help with designing mutational strategies, to search for 'structural homology', or to just see the little nuts, and bolts, and gears doing that thing it does. But what is the likelihood of ever getting a structure for my favorite proteins?" The likelihood has been practically 0 for a long time, especially if you're not working with a model organism. It wasn't really an option for most people, so one just needed to think about other tools, other aspects of biology, other experiments, other mechanisms, other models, and so on. 

...but then predicting the 3D structure of proteins leveled up with deep learning, from programs like AlphaFold. Now it seems there is a whole new world of structure-function questions and applications for any regular, middle-of-the-road, non-structural biologist. It feels like the deep sequencing revolution from 10-15 years ago. There is more and more attention to the predicted structures from computational biologists looking to make programs that do even better, and from molecular and cellular and developmental biologists who can make use of these predictions. These programs are going through all the protein sequence databases, predicting what-appear-to-be accurate or accurate-enough structures for thousands or hundreds of thousands of proteins. It's reminiscent of the open fire-hydrant-like deluge of sequencing data that began pouring out with Solexa/Illumina technology. It seems a clubhouse I was never cool enough to be allowed in, nor were you, has suddenly opened its doors for everyone... 

So... I decided to walk in and have a peep around.

This is what I've found so far.


Pretending to know Predicting a protein structure:

First of all, if you're working with a model organism, your protein-of-interest may already have an AlphaFold structure here: https://alphafold.ebi.ac.uk/ . These structures are already hyperlinked across the internet ; featured in common databases like UniProt. For example, UniProt features both the experimentally-derived and AlphaFold-predicted structures of the human MCM2 protein: https://www.uniprot.org/uniprot/P49736#structure .

If you are working with a non-model organism, or want to predict the 'de novo' structure of a mutated form of a protein, then you can use AlphaFold or other programs yourself. I'll be honest -- it feels like I'm "supposed to be using AlphaFold", but I haven't been. I found two options when setting out to do this:
- AlphaFold
- Robetta (roseTTAFold and other programs)


I just found Robetta so easy to use that I've spent most of my time there: go ahead, sign up, copy/paste a protein sequence, and press go. That's all there is to it.


Things to do with the PDB structure files:

Search structure databases for similar protein structures.

My aim was to use the predicted structures to look for ‘structural homology’ in PDB, AF, and other databases for proteins that are lowly conserved at the sequence level. Ideally, one would just be able to use a "structure BLAST" interface at NCBI. I did not find that, but I found three programs to do structural database searches. Here I will use Drosophila MCM2 from AlphaFold as an example. In some I use a Robetta model for MCM2. At the moment, I have reviewed FoldSeek, DALI, FATCAT, and 3D-surfer. Overall, I am bullish on FoldSeek, DALI, andFATCAT; but I am scratching my head a little bit at 3D-surfer.

FoldSeek: 

I’ve found that FoldSeek is fast (in "local mode" : 3Di/AA), and seems to be quite good at known true positive PDB files I give it (e.g. Human or Drosophila protein structures); in my experience, the top hit was exactly the protein used, and all other top hits were actual homologs in other species. I tried to run "TM-align" (global) mode, but tens of minutes later it was still not done and, after being spoiled on the seconds time-scale for "local mode", I threw a tantrum and closed out the tab.

Search Parameters:


...mere seconds later....


Easily interpretable BLAST-like alignment visualization
    

Easily interpretable BLAST-like alignment statistics table:

An issue with the results as presented is that it doesn't tell you the molecule names of the hits, just the model name and the species. You have to click on the model names. From above, it may seem like AlphaFold ("AF") models are all in the top hits, not real experimentally-derived structures from PDB... but that is just an illusion. They group the results by database. See below.

Other database alignment visualizations:

PDB alignment statistics:
These are MCM2 structures or larger complex structures with MCM2 in them.

Having shown that though, the top scoring AF hits have scores in the 5000-7000 range, with e-values of 0, whereas the top hits in the PDB database have scores in the 2000-4000 range with e-values < 1e-60. This may reflect a bias of using an "AF" structure as the query, or that experimentally-derived structures rarely include the full-length protein; sometimes with N- or C- terminal bits trimmed off; or disordered parts trimmed off; or only featuring a certain domain; etc.

UPDATE: 
After getting Robetta Structures back for Dmel MCM2, I revisited FoldSeek. The "3Di/AA" mode gave very similar (maybe identical) results for the Robetta MCM2 Model_1 to results I saw for the AlphaFold2 MCM2 model -- I recall reading somewhere that 3Di/AA mode takes the protein sequence into account (thus the "AA" part of 3Di/AA), so this may not be surprising.

I also gave TM-align another shot, and yes it took longer, but not so long this time that I abandoned it. Not very long at all really -- on the scale of minutes. As a reminder, FoldSeek can look at AF and PDB databases. For TM-align mode with the Robetta MCM2 structure (Model 1), the AF database yielded many structures predicted for MCM2. The main difference between the TM-align and 3Di/AA results, was from which organisms the predicted MCM2 structures came. For TM-align, the top hit was not the predicted Drosophila MCM2, for example: it was from "Dracunculus medinensis". The PDB structures also were all MCM2-containing DNA replication complexes. So I can vouch for both search modes (3Di/AA and TM-align) for FoldSeek now. I will need to read more into the differences between these modes to say any more though.

The statistics tables also differ a little bit between 3Di/AA and TM-align modes. The both share the first 4 columns for a given Database (Target,  Scientific Name, Seq. Id., Score) and the last 3 (Query Pos., Target Pos., Alignment). However, (i) the "Score" column represents different scoring strategies and scales, and (ii) where 3Di/AA shows an E-value, TM-align shows a TM-score. The E-value is readily familiar from BLAST searches: smaller (closer to 0) is better. It appears that the TM-score is between 0 and 1: higher is better. It appears that the Score for TM-align is just FLOOR(TM-score*100) -- i.e. the TM-score multiplied by 100 and rounded down to the nearest integer; or more simply, just the first two decimal places of the TM-score multiplied by 100.

For the Robetta Model1 of Drosophila MCM2, the top 10 hits in the given database had the following ranges:

3Di/AA AFDB: 
- seq id 51-100%, and higher ranked hits had higher seq id.
- score 4274-5869
- E-value 5.255e-96 to 0.

3Di/AA PDB: 
- seq id 53.4-95.9%, not necessarily correlated with rank
- score 3203-4012
- E-value 1.198e-71 to 3.524e-90

TM-align AFDB:
- seq id 37.2-67.6%, not in same order as rank
- score 76-79, 
- TM-score 0.7615-0.7905

TM-align PDB:
- seq id 24.5-63.4%, not in same order as rank
- score 60-66, 
- TM-score 0.6003-0.6688

Note that the FoldSeek authors consider good TM-scores to be >0.5, and bad ones to be <0.5, although it appears that true positive hits can easily have TM-scores <0.5 (e.g. distant homologs). 


UPDATE: I highly recommend reading the FoldSeek preprint as a primer not only on FoldSeek, but the 3d-alignment/database search field generally. 

A peak at FoldSeek: In short, exhaustive structural searches in huge databases could take eons, so FoldSeek offers a strategy, similar to how sequence searches use heuristics like k-mers, to drastically pre-filter, reduce the search space, and gain orders of magnitude in speed. This is done by converting the structural information of a protein into a sequence as if it were a protein or DNA sequence, then using sequence-based approaches (like BLAST) to pre-filter. Older approaches used "structural alphabets" focused on the AA backbone, whereas FoldSeek created a new structural alphabet focusing on 3D interactions (hence "3Di"). The pre-filtered high-scoring hits are either aligned locally by combining structural and Amino Acid substitution information (3Di/AA) or globally (TM-align). The authors claim that Foldseek is much more sensitive than previous strutural-alphabet-based tools (e.g. 3D-BLAST), has lower sensitivity than Dali, but higher than the structural aligner CE, and similar to TMalign and TMalign-fast; all while being up to >100,000x faster than previous tools. 

More info on TM-align here.
More on TM-score here.
More on TM-align is also below in the pairwise alignment section.


DALI

I’ve also messed around with DALI (distance matrix alignment method): more info here and here; all refs here

The DALI server was overloaded, and my jobs were queued. It took a day to get the results back for the structural search of a known protein against PDB. Nonetheless, the top hit was the expected protein, and other top hits were homologs in other species. 

At the moment, with my limited clicking around, these results do not seem to link you to the AF or PDB pages of structures they aligned to....

In addition to matches and match stats, the results also offered other "buttons" to click and explore. The one I found most interesting for the moment was the "Pfam" button where, if I'm not mistaken, they shown Pfam domains on your protein that were structural matches (as opposed to normal sequence matching of Pfam domains).  



FATCAT:  

FATCAT stands for "Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists" and is discussed more in the "Compare protein structures to each other" section below. More info here, here, and here; simple explanation here.

The FATCAT server has a few modes:
- Pairwise Alignment: align two structures with FATCAT. Will be discussed more below.

- Structural Comparison of Close Homologs: search for proteins with high sequence similarity to a query, then align it to structures for those proteins with FATCAT. I used Model 1 for Drosophila MCM2 from Robetta as an example. It gave the following hits:

It requires high sequence similarity to a protein in the PDB database before trying to align, so it is no wonder that it pulls up the MCM2 structures. If you click on "view" in the "align" column, it brings you to a page like this:


It has the downloads you might want, and also an interactive viewer to see the structural alignment:




Search for Similar Protein Structures: Search a protein structure database for similar structures with FATCAT.

Since the above "close homolog" mode keeps only the very small subset of candidates in the PDB website with highly similar protein sequences, it runs relatively fast. This mode doesn't have that pruning process. Instead, I assume, it is attempting to perform alignments of your query to ALL of the structures in the database, and thus it takes far far longer to finish. FoldSeek is much much faster in both 3Di/AA and TM-align modes, but I am unable to comment on why -- could be purely algorithmic, could be based on a pruning process, both... 

UPDATE:
Ultimately it took < 2 hours. The top hit was an MCM2 structure, so I can also vouch for FATCAT structure database searches. There were 86 hits. It appears that these hits were filtered to have a p-value < 0.05. It is unclear how many tests were done, and FDR/q-values is not given. At the top of the page it has an interactive plot that allows you to see some summary statistics on the hits
P-value vs. Length

Length vs Alignment Length

Length vs Sequence Identity


Lower on the page, it gives the structural alignment table:

You can click on the structures: for example, the top hit brings you to a crystal structure of the MCM hexamer. I clicked the top 10 structures: all contained MCM.  Clicking on "view" under "align" brings you to a page to download files or view more plots:






3D-Surfer:  

3D-Surfer offers database searches as well, but the results on knowns were confusing to me… None of the "hits" it showed were MCM2 -- well at least none that I clicked on. Even when I had them show me the results table for 1000 hits, searching only the AlphaFold database, none of the hits were the AlphaFold MCM2 model I used... So, I can’t vouch for 3D-surfer per se (like I can for FoldSeek and Dali), but 3D-surfer does at least give me hits to consider for all my query structures predicted by Robetta for my proteins-of-interest — whereas FoldSeek found no hits for most. So I don't know what to make of the hits it is giving me in that scenario... but I can see that the protein structures returned do look a bit like the queries when it is simple enough to conclude that "by eye". 

Example 3D-Surfer Results: 


Example 3D-Surfer results table:




Compare protein structures to each other:

In addition to searching databases for 'structure hits', I also want to be able to align different structures together. For example, if there are 2 predicted structure models, I want to align them to see the areas of agreement or disagreement. And I want to be able to align known structures to predicted structures. And so on. There are bound to be many tools that do this type of stuff. I know none of them at the moment, although the programs above likely offer this type of service. For example, I know DALI allows you to do pairwise alignments on all your own stuff, not just databases. And FoldSeek should too... though I might have to download it and do that at the command-line. A colleague uses PyMol. For the moment, that's all I have.

RSCB PDB 3D-View and Pairwise Structure Alignment:

The 3D-View Tool allows you to import multiple PDBs. By default the structures are drawn in different parts of the 3D space. This allows you to compare them by eye, but they are not structurally aligned. While there seem to be buttons concerning aligning the structures, I couldn't figure out how to do that. Meanwhile, the Pairwise Structure Alignment Tool does do structural alignments, allowing a few different ways to do that. I gave all of them a try using the first two models predicted for Drosophila MCM2 by Robetta (which seems to predict 5 by default). Next I will name the alignment modes, and describe how I interpret the results for these two structures based on a nearly absent (but not absent) understanding of what each mode is doing. For more info on what the modes do, see here.

Overall, it appears from a visual point-of-view that the alignment modes fall into roughly 3 categories.

Category 1 :  Rigid Optimization

Most modes (4 of 6) fit into what I'd term a "rigid optimization" category: jFATCAT (rigid), jCE, TM-Align, and Smith-Waterman 3D. The all give similar alignments (the SW-3D mode to a lesser extendt). They seem to optimize the amount of the two MCM structure models matching in 3D-space, but do not make changes (or at least not large changes) to the structure or assume it can take on other conformations. The result is that the MCM2 models almost fully align in the ball-of-yarn globular domain(s) of the protein but the N-terminal disordered arms (on bottom left of image) go off in different directions (and are not force aligned in 3D-space). Smith-Waterman 3D, by the looks of it, does the least amount of structure-changing/conforming (or none at all): so while you can see similar structures between models nearby in 3D space, they are much less overlapping than the other 3 modes. Note that FATCAT stands for "Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists". CE stands for "Combinatorial Extension". 

- jFATCAT (Rigid) : Gives similar results to jCE and TM-align, and to a lesser-extent, Smith-Waterman 3D. This mode is explained by RCSB as allowing "for flexible protein structure comparison", and, "The rigid flavor of the algorithm is used to run a rigid-body superposition that only considers alignments with matching sequence order. For most structures the performance of this structure alignment is similar to that of CE." Also see: https://fatcat.godziklab.org/fatcat/fatcat_pair.html


- jCE : Gives similar results to jFATCAT (rigid) and TM-align, and to a lesser-extent, Smith-Waterman 3D. RSCB says it "works by identifying segments of the two structures with similar local structure, and then combining those regions to align the maximum number of residues in order to keep the root mean squared deviations (rmsd) between the pair of structures low. This Java port of the original CE uses a rigid-body alignment algorithm. Relative orientations of atoms in the structures being compared are kept fixed during superposition. It assumes that aligned residues occur in the same order in both proteins (i.e., the alignment is sequence-order dependent)."


- TM-align : Gives similar results to jFATCAT (rigid) and jCE, and to a lesser-extent, Smith-Waterman 3D. RCSB explains TM-align as, "Sequence-independent protein structure comparison" that is "sensitive to global topology". Note that TM-align can also be performed at this website that explains TM-Align as "an algorithm for sequence-independent protein structure comparisons. For two protein structures of unknown equivalence, TM-align first generates optimized residue-to-residue alignment based on structural similarity using heuristic dynamic programming iterations."



- Smith-Waterman 3D : Similar to above 3, but more separation between clearly-matching structures. RSCB explains that it "aligns similar sequence segments using Blosum65 scoring matrix ... and aligns two structures based on the sequence alignment." They give the following advice/warnings: "Note that this method works well for structures with significant sequence similarity and is faster than the structure-based methods. However, any errors in locating gaps, or a small number of badly aligned residues can lead to high RMSD in the resulting superposition.



Category 2 : Flexible Optimization

- jFATCAT (flexible) : This seems to further optimize the amount of the two MCM models that match in 3D space by assuming more flexibility/conformational freedom. The net result is basically that it is able to align those disordered N-terminal arms as well. This may be reasonable and the two models may simply reflect that the N-terminal arm is flexible and may take on 1 of multiple conformations. RSCB explains the flexible option in the following way: "The flexible flavor of FATCAT introduces twists (hinges) between different parts of the superposed proteins so that these parts are aligned independently. This makes it possible to effectively compare protein structures that undergo conformational changes in specific parts of the molecule such that global (rigid body) superposition cannot capture the underlying similarity between domains. For example, when the two polymers being compared are in different functional forms (e.g., bound to partner proteins/ligands), were crystallized under different conditions, or have mutations. The downside of this approach is that it can lead to false positive matches in unrelated structures, requiring that results be carefully reviewed." The take-home is that it increases sensitivity at the cost of losing specificity. Also see: https://fatcat.godziklab.org/fatcat/fatcat_pair.html


Category 3 : WTF

- jCE-CP : I'm actually not entirely sure how to make sense out of what happened here.... more colors than I expected for one.... RSCB explains it: "Some protein pairs are related by a circular permutation, i.e., the N-terminal part of one protein is related to the C-terminal part of the other or vice versa, or the topology of loops connecting secondary structural elements in a domain are different. Combinatorial Extension with Circular Permutations allows the structural comparison of such circularly permuted proteins." The take-home for this analysis though is that it does not appear to be an appropriate choice.



FATCAT :
Pairwise Alignment: align two structures with FATCAT. 

As mentioned twice above, there is a server dedicated to FATCAT that allows pairwise alignments and database searches (the latter discussed above).

Here is an example of aligning Model_1 and Model_2 together (as above). First it shows you a summary page like this:

The interactive viewer allows you to see the alignment:


Note that it offers both the "rigid" and "flexible" modes. The above was run in "flexible" mode. Compare it to the image shown for FATCAT (flexible) run on the RSCB server.




POSA - Partial Order Structure Alignment:
From the makers of FATCAT, is POSA, a way to align >2 structures and visualize in 3D space. My immediate use application for this is simple: Robetta output 5 models for Drosophila MCM2, let's align them.

There were two sets of results: results with flexibilities incorporated, and results without. I'm going to just use the FATCAT language here and call those two "flexible" and "rigid", respectively. First the rigid results. You can look at all the structures side-by-side or overlapping. When overlapping you can show all parts of each molecule, but you can also highlight or only show the "common core regions" shared by the molecules.

Rigid: side-by-side structures

Rigid: Show All

Rigid: Highlight common core regions

Rigid: Show only the common core regions



Above, the only region not in the "common core" is the N-terminal disordered arm. So may be reasonable to conclude that the 5 models are very similar except for that arm, although I can definitely see 'disagreeable' areas in the other part as well. We can also see the flexible mode results to see if the arm and those other regions were considered flexible enough to align more closely together:

Flexible: Show All

Flexible; Highlight common core regions

Flexible: show only the common core regions




DALI All-by-All pairwise analysis of 5 Robetta Models of a given protein sequence:
DALI database searches were discussed above. It also offers pairwise analyses. I didn't find a viewer for the structures aligned in 3D space per se, but it offers metrics on structure similarity as shown below.

DALI: Structural similarity tree

DALI: structural similarity matrix


DALI: Correspondence Analysis (conceptually similar to PCA)




Evaluating protein structures.

AlphaFold and Robetta may offer scores regarding how confident they are in the protein structures they predict. For example, Robetta offers a confidence score between 0-1. However, one may also be interested in third party programs to run evaluations of experimental- and predicted- structures from various programs. A colleague told me to try "SAVES":
https://www.doe-mbi.ucla.edu/saves/
https://saves.mbi.ucla.edu/

It offers a few programs, all you need is the PDB file. Here, again, I will use Drosophila MCM2 from AlphaFold:
- ERRAT: statistics of non-bonded interactions between different atom types ; compares with statistics from highly refined structures.
- Verify3D: Determines the compatibility of an atomic model (3D) with its own amino acid sequence (1D); compares the results to good structures.
- PROVE: Calculates the volumes of atoms in macromolecules; calculates a statistical Z-score deviation for the model from highly refined structures.
- WHATCHECK: extensive checking of stereochemical parameters of the residues in the model.
- PROCHECK: stereochemical quality of residue-by-residue geometry and overall structure geometry.
- CRYST: searches the Protein Data Bank for entries that have a unit cell similar to your input file. 

Example Results:
Clicking on the "Results" button for each gives more information and plots on each analysis. Overall, ERRAT liked this predicted structure. VERIFY and PROVE didn't. PROCHECK was mixed. WHATCHECK was a lot of information that triggered me into a TLDR state. The colleague who pointed me here said she only uses "ERRAT" - so if that's the case, then great job AlphaFold! Otherwise, I have questions.