Friday, March 6, 2020

Coverage Over Single Repeat (e.g. a single rDNA unit)

Title: Coverage over single rDNA unit
Created: 2019-05-01
By: John Urban
For: Anyone interested (Spradling Lab, Carnegie Emb, or otherwise)

Originally posted as a Google Doc found here:

This example uses Drosophila and a single rDNA unit (the sequence for it can be found in the Google Doc).





May need to download dm6, find here:
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/dm6/bigZips/ .


If you want to look at the signal of rDNA, do the following.
1. Get dm6 FASTA. If you download it from above, you will need to combine all the chromosomes into one file called dm6.fasta or something similar.
2. Use the attached rDNA sequence to mask it with RepeatMasker (download and use at commandline: http://www.repeatmasker.org/):
mkdir masked
RepeatMasker dm6.fasta -lib rDNA.fasta -dir ./masked -x -pa 8
3. Add the attached rDNA sequence to the masked dm6 file. Now there is only 1 copy of rDNA in the genome sequence. Example command:
cat rDNA.fasta masked-dm6.fasta > masked-dm6-and-rDNA.fasta
4. Index it with Bowtie2:
bowtie2-build masked-dm6-and-rDNA.fasta masked-dm6-and-rDNA
5. Map reads to the masked genome + rDNA BT2 index with Bowtie2 and sort with SAMtools
bowtie2 …..
6. To get depth at each position use BEDTools:
bedtools genomecov -split -depth -ibam ${BAM} > depth.txt
7. To get the sum of all coverage, use Awk:
awk '{sum+=$3}END{ print sum } depth.txt > sum.txt
SUM=$( cat sum.txt )
8. Normalize the depth at each position to the sum of all coverage and multiply by 1million to get the signal per million bases covered: 
awk -v "sum=${SUM}" {OFS="\t"; print $1,$2,1e6*$3/sum}' depth.txt > normdepth.txt
9. Use browser or R to look at the coverage over rDNA. To bring just rDNA use awk (assuming the rDNA sequence was named “rDNA”, change if needed):
awk ‘$1==”rDNA”’ normdepth.txt > normdepth.rDNA.txt
10. With R or Awk, can do A/B or A-B or log2(A/B), etc.
11. If using IGV, open it up and go to create genome file, and point it towards masked-dm6-and-rDNA.fasta.

Normalization Note:
Note that it may be good to also try normalizing depth to the overall median depth value or to the median depth value from regions determined to be background. 

To get an overall sense of the depth for each contig (including the rDNA sequence) as well as the entire genome, do:
bedtools genomecov -split -ibam ${BAM} > coverage.hist.txt
Take into R to visualize and do various computations on.

BEDTools NOTE:
Do not use “-split” if you want to do fragment coverage rather than read coverage. For fragment coverage, it adds account to all bases between paired reads, for example. For some analyses this is needed. For others, it introduces unnecessary error and assumption into the analysis. Generally, use “-split” if fragment coverage is not essential to the analysis.


Literature:
I did this exact type of analysis in a 2015 Genome Research paper mapping replication origins:
There are some other suggestions therein on how to do this analysis, but it should be similar.





No comments:

Post a Comment