Wednesday, May 28, 2014

The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data

Here starts the little series of illustrations showing possible outcomes starting with requiring a perfect match (0 mismatches) and gradually allowing more complex alignments (e.g. allowing mismatches). This is not meant to show things in a proportionally correct manner (i.e. not in proportion to all of your reads). Above, 6 reads are shown that do not map to the human genome for various reasons: from a region not in the current assembly (NIA), has a SNP, has a base call error (BCE) with a low base quality score, or is from a contaminating E. coli source (Con). There are 4 examples of reads that map uniquely: 2 map to the correct positions and 2 map to incorrect positions since they had a SNP and/or BCE. Finally, there is an example of a multiread that maps to >1 position equally well.
Unireads are reads that align to a single, unique position in the genome.
Multireads are reads that align to multiple positions in the genome.

Original Publish Date: 
Wednesday, May 28, 2014

The slow death of the term "uniquely mappable"
Most genomics studies using next generation sequencing (e.g. ChIP-seq) use what people consider to be standard quality control steps such as using "uniquely mappable" reads. Even the ENCODE website says one of their quality metrics is using uniquely mappable reads. But what are "uniquely mappable reads"? Why or how are they useful? and do they always ensure high quality data? If you answer those questions too quickly, you will miss a lot of nuance. Anyone new to doing their own NGS bioinformatic analyses who thought through those questions just a moment too long can attest to how much they have haunted them since., BioStars, and other forums are filled with questions about what defines a uniquely mappable read and how to extract them. However, the answers on different threads are often different. Why?

"Uniquely mappable" is a concept that started when short reads first hit the scene and there was no sophisticated way to deal with them other than to see if a short read matched 1 site exactly (uniquely mappable), if it matched >1 site exactly, or if it did not match any site exactly. However, it was always kept in mind that SNPs and base call errors would require the allowance of some mismatches once the dust settled from all of the initial excitement. Once mismatches were allowed, "uniquely mappable" became a very obscured concept. The number of uniquely mappable reads decreases with an increasing number of mismatches allowed in the alignment. Another way of looking at it: the number of multireads increases with the number of mismatches allowed since it allows formerly "unique" reads to potentially match more sites. So with allowing more mismatches, more formerly uniquely mapped reads are tossed out when keeping only "unireads". Thus, in some ways using only uniquely mapped reads, especially after allowing, for example, 5 mismatches, is absurdly conservative and perhaps even plain wrong due to how much data is thrown out. Moreover, unique reads include increasing amounts of false positives with increasing numbers of mismatches allowed. 

Another way to envisage why uniquely mapped reads become multi-reads when mismatches are allowed is to think of a genome composed of all 4^50 50-mers (all 50-mers composed of A,C,G,T), each as its own chromosome so there are no overlapping 50-mers to think about. If you take one 50 bp read (no errors) from this genome and map it back requiring a perfect match, then there is only 1 place it can align -- to itself, where it came from. Once you allow 1 mismatch, it could map to 151 different 50-mers. Want to allow 2 mismatches? Now it can map to over 11 thousand 50-mers. If you allow up to 5 mismatches, it can map to over 534 million positions in that genome! In each of these cases, if you were keeping tabs on the number of mismatches somehow, then despite the fact that it mapped to over 534 million positions, you could pick the highest confidence position - the one with 0 mismatches in this case. OR you could throw it away because it is a "multiread". In real life, a genome is only composed of a tiny fraction of all those 50-mers and inside that tiny fraction each 50-mer is present in different copy numbers, not just 1. This is why even when allowing 0 mismatches, some reads map to multiple sites. In the real life case, when allowing, for example, 5 mismatches, yes there are theoretically >534 million 50-mers it could match, but it is possible none of them occur in the genome since 534 million 50-mers is <<<<1% of all possible 50-mers. However, it is also possible that one or a couple do exist in the genome and in this case, the read, which was unique when requiring a perfect match, is now considered a multi-read and would be tossed out in studies that keep only "unireads" rather than choosing the best position. Thus, there becomes an absurdity to keeping only uniquely mappable reads when more and more mismatches are allowed. Moreover, in real life, this is also why contaminating (false positive) reads begin to align more when more mismatches allowed. Again, if you allow 5 mismatches in the 50 bp read, then the contaminating read can map to >534 million 50-mers, 1 of which may be in the reference genome of interest and in this case, such a uniquely mappable read would be kept. However, if one was more nuanced in his/her judgement and thought the alignment score was too poor, they would toss it despite that it was uniquely mapped.

After allowing 1 MM, one of the unique reads is now able to map to >1 position and so would no longer be considered unique if you extract only the reads that map to one place. The other correctly mapped uniread stays unaffected. The next read, originally a multiread, increases in multiplicity though this new position has a lower alignment score than the first 2 and could be ignored by more sophisticated algorithms. The read with the SNP is now able to align to the correct position (an intended result for allowing mismatches). The read with the base call error that could not map now becomes a multiread - it can map to the correct position and a position that agreed with the BCE but needed 1 mismatch allowance elsewhere in the read for alignment. Note that in bowtie2, the correct alignment would have a higher alignment score since the mismatch occurred over the BCE (which in this example has a low base quality score) whereas the mismatch in the incorrect position occurs over a high quality base call. Thus, it could choose the correct position, but people who take only unireads would toss it. The read with both a BCE and a SNP still only aligns to the incorrect place but uniquely! The next read with a BCE now has become a multiread by now being able to align to the correct position. In bowtie2, the incorrect mapping would have the better alignment score since it did not require a mismatch and matches over low quality base calls are not penalized. Thus, this is an example of a read for which the best alignment is not the correct alignment (unless a different scoring scheme that penalized low quality base calls in general was adopted). The contaminating reads still do not map.

Now 2 mismatches are allowed. The read with both a SNP and a BCE that originally aligned to the incorrect position is now able to align to the correct position. However, the correct position has a lower alignment score since it requires 2 mismatches (and since, at least in bowtie2, low quality base calls are not penalized over matches). The only other new event here is that one of the contaminating reads is now able to align uniquely.

The problem of "uniquely mappable" does not end with mismatch tolerance. Once gaps are also allowed, "uniquely mappable" becomes almost impossible to define. Note that most current aligners allow gaps. Think about it -- it is possible to map reads that were considered "unique" by the perfect match definition in ways such that none end up being "uniquely mappable" when allowing gaps and mismatches (without constraints). What becomes important is the scoring scheme, the alignment score, and cutoffs on alignment scores to be considered a "valid alignment". Then it is important to form some definition of a well-mapped or reliably mapped position for a given read from the alignment score of its best alignment as well as those of its other alignments. This also means that the status of a read being "uniquely mappable" becomes directly dependent on the scoring scheme and cutoffs - which are tune-able parameters, thus making the term even more elastic. 

"Uniquely mappable reads" have even less of a practical relevance when end to end alignment is not enforced and, instead, positions of the best local alignments (where the alignment to the reference can start and end anywhere inside the read) are searched for. For many reads, the best local alignment may still be the position that was formerly "uniquely mappable" in an end to end alignment requiring a perfect match. However, there are likely to be many other possible local alignments (think BLAST results). What again becomes most important is the scoring scheme, cutoffs for valid alignments, and sophisticated ways to define "confidently mapped" and "best" positions rather than using the binary criterion of "uniquely mapped" as a proxy. At this point, very few reads will have been uniquely mapped and many that have been uniquely mapped can be false positives. Taking all "uniquely mapped" reads could blindly include those false positives whereas an alignment score that represents mismatches, big gaps, and the use of only small region of the read would represent the lack of confidence in the mapping at which point we can choose to ignore it.

Now gaps are allowed in addition to 2 MM. The multiread increases in multiplicity again, but again this new alignment has an even lower alignment score so could be ignored. The read with a SNP that was unique when allowing 1 or 2 mismatches, now has become a multiread. The 2 incorrect alignments would have lower alignment scores and could be ignored. When keeping only reads that align to a single position in the genome, though, it would be tossed. The next 2 reads, one with a BCE and the next with both a SNP and BCE, increase in multiplicity (again lower scores though which could be ignored). Finally, another contaminating read is now able to uniquely align.
The first  change in this picture is that the read from a region not in the assembly is now able to map, but to an incorrect position -- though uniquely! The 3rd read from the left, which was a uniread for all the conditions until now has finally lost its uniread status. It can now align to 2 positions, though the incorrect position would have a lower alignment score and could be ignored. If keeping only reads that aligned to a single position, it would be tossed. Now the remaining contaminating read is also able to align (uniquely) and one of the others has now become a multiread. All in all, this series of illustrations is not proportionally accurate -- it is just meant to show the absurdity of ignoring alignment scores and judging a read that maps uniquely as good and multiply as bad when in many cases an alignment score comparison could tell you where a multiread maps and that a uniread maps horribly.

So why is there lack of agreement about "uniquely mappable" in the forums?
Practically speaking, "uniquely mappable" has actually become an outdated and not-so-useful term of sorts. Technically speaking, the "uniquely mappable" status of a read can change with the algorithm used, with its heuristics for finding alignments in super reasonable periods of time and its scoring schemes and cutoffs. Culturally speaking, "uniquely mappable" has become a term associated with being "rigorous" with your data. The term persists and even as it persists it exists simultaneously in two states: some people latch on to the old definition and take it to mean that the read only has a single site in the genome that it maps to (but this changes with how the read is mapped as discussed above!) while others have adopted a newer definition and take it to mean that the read either maps to a single place OR the alignment score of the best alignment position is better than the score of the next best alignment position (e.g. bowtie2). The latter definition is more sophisticated and takes into consideration the fact that more and more reads will be able to map to more and more places once you start allowing mismatches, gaps, and local alignments, all of which may be necessary because of SNPs, base call errors, microdeletions, microinsertions, reads split across break points … in other words, the many issues that arise when mapping reads (often with errors) from a cancer genome back to the reference. The view is that even if it can map to more than one place, the better the best score is over the 2nd best score, the more likely that position is the correct position to align it to. However, it is not guaranteed that the best alignment score belongs to the correct mapping position, though this might be the case most of the time. 

The goal should not be to include only uniquely mappable reads. The goal should be to include any read that is mapped to a position with high confidence. This would include some "multireads" and exclude some "unireads":
Since, on some level, all reads can become multireads (some sooner than others), "uniquely mappable" is not a useful concept anymore. Truth is, it was always just a quick-and-dirty, black-and-white, binary hueristic to classify a mapped read. There always needed to be a more dynamic way to quantify a "well-mapped" read; a spectrum from lowest to highest in belief or confidence that the mapping is correct. Mapping Quality (MAPQ or MQ) scores, which typically range from 0-40 or 42, theoretically approach this ideal. MAPQ is supposed to be -10log10(p) where p is the probability the read is mapped wrong. This way, the smaller the probability it is mapped wrong, the higher the score it gets. Thus, if there are 2 places a read can go and it is assigned to 1 uniformly at random, the a priori probability that it is mapped wrong is 0.5 and the MAPQ = -10log10(0.5) ~= 3. Similarly, a MAPQ of 10, 20, 30 and 40 should mean the probability it mapped wrong is ~0.1, 0.01, 0.001, and 0.0001. I do not think most current aligners implement probabilistic models that first estimate the probability a read is mapped wrong and then convert it to a MAPQ though. For example, Bowtie2 seems to follow a rule based approach to assigning MAPQ values (which I discuss in another post). In Bowtie2, a read that aligns to more than 1 site equally well is never given higher than a MAPQ of 1 (even if it aligns to only 2 sites equally well as discussed above).

Though MAPQ is related to uniquely mappable (mainly the newer definition), it compounds it with other factors. MAPQ is just a transformation of the probability a read is mapped wrong, which should take into consideration, at least implicitly, the following: the alignment score, the number of times the read mapped, and the alignment scores of the other mappings (Bowtie2 looks only at the minimum alignment score allowed and the alignment scores of the best and second best alignments to estimate the MAPQ). Since the estimated probability that a read is mapped wrong depends on whether the read mapped more than once, it is also heavily influenced by the allowance of mismatches and gaps, scoring schemes, and cutoffs. Moreover, different algorithms approximate MAPQ scores in different ways. So the MAPQ of a mapped read from bowtie2 can be different than that of bwa given that everything else (e.g. finding and scoring alignments) is the same (which it isn't). MAPQ may not be completely standardized yet, but with current read mapping it is a more nuanced, more practical, more accurate way to quantify our belief that a read is correctly placed than is classifying it as "uniquely mapped" or not.

From here on forward let us refer to "multireads" as only those reads for which their best alignment score is the same as their second best alignment score. Reads for which their best alignment score is better than their second best, formerly multireads, will now be called "maxireads" to represent they were placed at the position of maximum score. This term, "maxiread" also encompasses reads that would previously be called "uniquely mappable" as they also by definition were placed at the position with maximum alignment score. So the two terms moving forward are "maxireads", those with single maximum alignment scores, and "multireads" for which there is no single maximum alignment score (i.e. the maximum is shared by >= 2 positions). 

The frontier is really in deciding how to find the best positions for all multi-reads. Multireads are still essentially just filtered out. Thinking about mapping all mappable reads like a puzzle, the problem is, we have been assuming we have 2 options:
   1. toss out the puzzle pieces that have shapes that fit in multiple places equally well 
   2. randomly choose one of the multiple places that a puzzle piece's shape fits equally well
Most of us have largely ignored the 3rd option: 
  3. use the picture we can already see formed by the puzzle pieces that fit into 1 spot best to inform us where the puzzle pieces with shapes that fit multiple places go

In other words, use the background already put together by the higher confidence maxiread data to form posterior probabilities of where a multi-read goes. Posterior MAPQs could be assigned and the reads could still be filtered out if they did not quite meet someone's MAPQ cutoff. Some posterior MAPQs would still be 0 if the maxi-read background was uninformative (i.e. balanced across all sites). Other posterior MAPQs could become quite high if the high quality maxi-read background was concentrated at just 1 of the optional sites. 

For example, imagine a read that maps to 10 sites equally well. A priori, the probability it maps to any of those sites is 10%. Thus, you have a 90% chance of placing it at the wrong site if you assign it uniformly at random. The prior MAPQ would therefore be -10log10(0.9) = ~0.46, which would be rounded to 0. However, if you look at the background distribution of "maxireads" and at one of those 10 sites there are 91 maxi-reads and each of the others has 1 for a total of 100 maxi-reads, the posterior probability might be 91% to the first site and 1% to each of the 9 others. Now, for a read mapped to the first site (91% chance it came from there), the posterior probability that a it is mapped wrong given it is a perfect match is 9%. Let's call it 10% to make the math easier. The posterior MAPQ would now be -10*log10(0.1) = 10 instead of 0. However, it is difficult for me to say how a read randomly assigned to one of the other 9 sites would be scored. Since reads mapped to the other 9 sites were assigned to them under the same posterior probability distribution as the reads mapped to the first site (at a rate of 1% to each site), it seems like they should be scored as equally correct as the read mapped to the first site -- not necessarily -10log10(0.99) = 0.436 ~= 0 -- though that would be one way to score them. Moreover, other factors need to be taken into consideration such as the mappability of each of those sites. Some regions of the genome might be deserts in terms of places maxi-reads could align while others might be chock full of them. There needs to be a way to normalize the number of maxi-reads at a site by the propensity of maxi-reads to occur at that site. If one could do that, then one could intelligently place multi-reads to reconstruct the most probabilistically accurate puzzle picture rather than to leave puzzle pieces out of it or to fit them in uniformly at random.

Before mapping all of the high quality "maxireads" (see text for definition), we have a completely uninformed, uniform distribution on how to assign this multiread.

After mapping all of the high quality "maxireads" (see text for definition), i.e. after looking at the data, we have better informed posterior distribution on how to assign this multiread.

I have found that some groups have began trying to implement ideas like this. This needs to catch on!

Slightly related -- others have come up with ways to study repetitive regions (from where many multireads arise) in general:
- TEA:

Note: of course these "goals" change according to the application. However, my hope is to open a discussion on all the data we toss out and the general assumption that keeping only uniquely mapped reads is "rigorous". It could be called "conservative", but it is not rigorous. Its more closely related intellectual laziness than to intellectual rigor. In another post, I will happily discuss more better ways to deal with "redundant reads" as well, instead of just removing them all or all but 1, which of course sets a very low ceiling on how many reads you can align to the genome in the era of increasing throughput.

No comments:

Post a Comment