Thursday, May 29, 2014

Where does bowtie2 assign true multireads (AS=XS)?


Bowtie2 is an ultra-fast program for aligning next generation sequence reads to large genomes written by Ben Langmead and which happens to be my aligner of choice. RTFM here. Download it here or learn how to get it on your Mac OS X with "homebrew" from one of my previous posts, "The Craft of Homebrew-ing for Mac OS". Learn about how it assigns MAPQ scores and related ideas in my previous posts, "How does bowtie2 assign MAPQ scores?" and "The slow death of the term 'uniquely mappable'...". In this post, through mini experiments, I discuss whether or not bowtie2 randomly chooses a site for reads with multiple equal alignment positions. I chose to do this to support or put to rest various concerns raised in forums (such as seqanswers and biostars) that it always puts reads at the same spot or with a heavy bias for 1 or a couple spots.



Bowtie2 searches for as many alignments as it can (with given time and memory constraints) for a given read against the given reference (e.g. human genome) that have alignment scores above the minimum score cutoff. The output contains information on each read's best alignment score (AS) and, if a second alignment existed with a score above the cutoff, it's second best alignment score (XS). If an XS does not exist, this tautologically means there was no second alignment with a score above the minimum cutoff. Though this is a seemingly vacuous interpretation, it is the most robust one being a tautology and all. Some take "no XS" to mean its AS alignment was the only alignment or that it is a "uniread" while simultaneously not considering other reads with AS > XS as "unireads". Instead, this latter group of reads are lumped into the 'multiread' category, which are reads that map to more than one site. Others acknowledge that having no XS is just another way of saying AS > XS and that all reads with AS > XS are "unireads". This latter view acknowledges that having an XS is in part dependent on that minimum score cutoff as well as penalty sizes, whether or not mismatches and/or gaps are allowed, and whether or not you use end-to-end or local mode. Multireads, in this view, are only reads where AS = XS. To be clear, I try to refer to reads with AS = XS as "true multireads" since there is more than one interpretation of "multiread". However, when I do say 'multiread' I do still mean reads with AS = XS. Moreover, since the terms 'uniquely mappable' and 'uniread' has more than one meaning now, I try to refer to reads with AS > XS (including reads with no XS) as "maxireads" since they have a single maximum AS that is greater than all other XS.  For practical reasons, some times I refer to reads with no XS as "true unireads", but this is a misnomer and I should stop it. I do it only out of the need to refer to that group of reads with no XS since some people are interested in only them. Perhaps I will just call them "classic unireads". Find out more about the fickleness of the state of being unique in my previous post, "The slow death of the term 'uniquely mappable'…".

This post is interested in "true multireads" where AS = XS. If you are keeping true multireads in your data set, it could be an issue if they all get assigned to the same spot. Of course, this effect would be mitigated if you removed all "redundant reads" (reads that map to the same exact position) except one before moving on to downstream analyses. There also seemed to be some debate out there as to what bowtie2 does with multireads. So I designed a couple very simple experiments to show what it does.

The take away: these experiments support exactly what the bowtie2 manual says about what it does with randomly assigning places for multireads. If you do not know what the manual says than RTFM or read on.


Methods:


Randomly generated (multi)reads used:
>read2
ACAACACGATGTTAAGCCCCAGCCCGTTTATCCTTACAGTTACCCTAGAA
>read3
GTTCCATTAAATATTTAGGTAACTCCAAACGCTGCCTTGTCCACCCCGGG
>read4
GACACGTGAGTGAACCATACACACAAGTTCCGCACAACTTAGCTTGCGTA
>read5
ATCAAATCAGTTCGGCGCTCTGGTGTCTACAAGAATGTCACCGGTAGTAG
>read10
AGGAAGATAATGAAGCCCGGTTGTCTTGCGGATAAGAACACGCCGCGGCA
>read100
AAGCACTCCTTCAGGTCTATAACCGCGACCGCCAGCTCCTGTTTTTTTGG


"Chromosomes" were made from the reads:
There is a chromosome with the same sequence as every read (e.g. chr2 for read2) made as tail-to-head repeats of the read in the copy number associated with the read (e.g. chr2 has 2 copies of read 2). This is simply done in python taking advantage of its string multiplication (e.g. string*2 = stringstring). Therefore, every read can map to its respective chromosome >= 2 times with AS = XS and they are all true multireads. The chromosomes were indexed with "bowtie2-build -f chr.fa chr".

Using many copies of the same multireads:
One thousand copies of each mutiread was placed in the fasta file, reads.fa, to study the behavior of mapping the same multiread over multiple trials.



Experiments, results, and conclusions:


First I executed the following bowtie2 command and pipes:
bowtie2 --very-sensitive -N 1 -x chr -U multireads.fa -f | grep -v ^@ | cut -f 1,3,4 | sort | uniq -c

In bowtie2, the output is SAM format and fields 1, 3, and 4 are read name, chr name, and 1-based start position respectively. Thus, this pipeline ends in counting the number of times "read chr pos" were identical. Here are the results:

read_name   chr_name     pos      count
read2       chr2         1        1000 
read3       chr3         101      1000
read4       chr4         151      1000 
read5       chr5         51       1000
read10      chr10        251      1000 
read100     chr100       2901     1000

All one thousand copies of each multiread mapped to the same place. If you RTFM of bowtie2, it explains this behavior. Bowtie2 uses a pseudo-random number generator (for those not familiar with computer science, this is how randomness is simulated in general) to randomize where a read will go given more than 1 choice. So, why do they all go to the same place? The answer is because bowtie2 uses the read name, nucleotide string, and quality string to determine the seed to use for the pseudo-random number generator. Briefly, if you use the same seed every time, then you get the same string of pseudo-random numbers every time. Here all of my multireads have the same name, the same sequence, and since they are fasta files, the same string of quality scores (bowtie2 assumes they are all 40). Thus, this is the expected behavior. In real life, you will not encounter (or at least not often) a situation where you have multiple reads with the same name. Reads can have the same sequence, but it is less likely they will share the same exact quality string. Nonetheless, a different seed would be generated given 2 identical sequences because the read names (and probably quality strings) would differ.

However, for now I am dealing with totally identical reads. If you want fully simulated randomness with identical reads, you need to use the --non-deterministic flag. This turns off the deterministic behavior of picking a seed from the fixed read parameters and instead just picks a seed using the current time (each time it encounters one of the identical reads, it is a different time, so a different seed is used). So, I tested if this randomly distributes the reads with:

bowtie2 --non-deterministic --very-sensitive -N 1 -x chr -U multireads.fa -f | grep -v ^@ | cut -f 1,3,4 | sort | uniq -c

read_name   chr_name   pos      count   expectedCountIfuniform
read2       chr2       1        496     500                   
read2       chr2       51       504     500                   
read3       chr3       1        338     333.333
read3       chr3       101      326     333.333
read3       chr3       51       336     333.333
read4       chr4       1        249     250                   
read4       chr4       101      235     250                   
read4       chr4       151      254     250                   
read4       chr4       51       262     250                   
read5       chr5       1        168     200
read5       chr5       101      205     200
read5       chr5       151      201     200
read5       chr5       201      223     200
read5       chr5       51       203     200


It really does not take statistical tests to show that bowtie2 is doing a good job uniformly distributing the reads across all possible sites for a read. The table for all of the read100 datapoints would be too large to show, so instead, here is a plot (made in R):

The line was generated by making a linear model from the data using the lm() function in R. If it is random, the y-int should reflect the expected number of reads when 1000 reads are uniformly distributed across 10 sites. That would be 10 reads per site and the y-intercept was 9.961. The slope should be close to 0 to show no bias over the positions. The observed slope was ~0.000016. All in all, bowtie2 indeed seems to be approximating a uniform distribution over the sites.


The only thing left to do is to change the read names to see if that is sufficient for randomized results as well. To do this, I just added a number to the read name such as read2_1. The number added was just the line number as it occurred in the file. For these, since the names were all different and since each read can only map to its respective chromosome, I only counted how many times a read aligned to "chr, pos" with:
bowtie2 --very-sensitive -N 1 -x chr -U diffNameMultiReads.fa -f | grep -v ^@ | cut -f 3,4 | sort | uniq -c

chr_name   pos      count   expectedCountIfuniform
chr2       1        503     500                   
chr2       51       497     500                   
chr3       1        362     333.3333
chr3       101      316     333.3333
chr3       51       322     333.3333
chr4       1        277     250                   
chr4       101      246     250                   
chr4       151      251     250                   
chr4       51       226     250                   
chr5       1        200     200
chr5       101      192     200
chr5       151      216     200
chr5       201      192     200
chr5       51       200     200

Again, it is clear from just the table alone that bowtie2 is doing well at uniformly distributing the multireads as it claims to do.



Bowtie2 definitely assigns 'true multireads' uniformly at random.
A long time ago, I took a very high multiplicity (~12,000) multiread from one of our experiments and tried mapping it with bowtie2 again.. and again… and again. It mapped to the same place every time rather than one of the other 11,999 places. Back then, I was momentarily horrified until I RTFM. This behavior is the expected behavior in deterministic mode, where bowtie2 uses the read name, sequence, and quality string to seed the pseudo-random number generator. The solution would be to use the --non-deterministic flag. The real experiment would have been to change the read's name each time. In our real NGS experiments, none of the reads share a name. So, given the results above, it is safe to say that if 2 true multireads with different names but the same sequence are encountered, they will likely be assigned to different sites uniformly at random. If reads with the same sequence are filtered beforehand to keep only one or if there is just naturally a single instance of a multiread, you can trust that the site it is assigned to is picked at random. Even if it was not, most people filter multireads out. To filter all true multireads out (and some maxireads including classic unireads with poor AS), use MAPQ >= 2. See my previous post for more information on how bowtie2 assigns MAPQ scores. Even if someone does not filter out the multireads, most people for many applications remove all redundant reads except 1 per site (redundant reads are those that map to the same exact site). Thus, even if bowtie2 was not random, the pileup effect would be smashed at that step. Finally, since most people only care about reads of AS > XS for most applications, random assignment does not matter. Those go to the site corresponding to AS.

As a closing remark, if you want to work with multireads, then the a priori uniform distribution over all sites is not the best you can do. One could create a posterior distribution over those sites after looking at the distribution of the high quality maxiread (AS > XS) data. I discuss this more here.




1 comment: