Wednesday, March 18, 2020

Simple RNA-seq quantification and differential expression pipelines.



Title: Simple RNA-seq quantification and differential expression pipelines.
Created: 2020-03-18
By: John Urban
For: Anyone interested (Spradling Lab, Carnegie Emb, or otherwise)


Looking at basics of the following programs:
- STAR+RSEM
- Salmon
- HiSat2+StringTie
- EdgeR
- DEseq2

Although this was written up with RNA-seq in mind, much of this can be used for other *-seq and *-ome-wide experiments with or without slight modifications -- e.g. ChIP-seq, ATAC-seq, ribosome profiling, etc.

Below, I describe how to install and use the pertinent software. Importantly, to emphasize that EdgeR and DEseq2 can do much more than independent pairwise analyses between a bunch of samples, I go through doing the DE analysis on an experiment with three time points (T1, T2, T3) with respect to the first time point, T1, or all pairwise comparisons (T1-v-T2, T1-v-T3, T2-v-T3). In my experience, results are more robust when all samples and conditions are analyzed together.

Installation:


Salmon:

wget https://github.com/COMBINE-lab/salmon/releases/download/v1.1.0/salmon-1.1.0_linux_x86_64.tar.gz
tar -xzf salmon-1.1.0_linux_x86_64.tar.gz

Star:

git clone https://github.com/alexdobin/STAR.git
cd STAR/source
make STAR

RSEM:

## Change next line for yourself
export PATH=${PATH}:/path/to/STAR/bin/Linux_x86_64
git clone https://github.com/deweylab/RSEM.git
cd RSEM
make
make ebseq
## Change next line for yourself
make install DESTDIR=/mnt/sequence/jurban prefix=/software

HiSat2:

wget http://ccb.jhu.edu/software/hisat2/dl/hisat2-2.1.0-Linux_x86_64.zip
unzip hisat2-2.1.0-Linux_x86_64.zip


StringTie:

wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.1.1.Linux_x86_64.tar.gz
tar -xzf stringtie-2.1.1.Linux_x86_64.tar.gz


EdgeR and DEseq2 (in RStudio):

## I am also adding in some dependencies and/or extensions (mostly for DEseq2 uses)
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("pasilla")
BiocManager::install("apeglm")
BiocManager::install("IHW")
BiocManager::install("ashr")
BiocManager::install("vsn")
BiocManager::install("pheatmap")
BiocManager::install("pwr")
BiocManager::install("DESeq2")
BiocManager::install("edgeR")

Basic Quantification Pipeline Commands:

Salmon:

https://combine-lab.github.io/salmon/


Indexing: 

https://combine-lab.github.io/salmon/getting_started/#indexing-txome

salmon index -i name -t annotation.fasta 


Quantification: 

https://combine-lab.github.io/salmon/getting_started/#quantifying-samples

## I used this for paired-end, strand-specific RNA-seq (which was correctly auto-detected by Salmon)

mkdir -p quants
salmon quant -i ${SIDX} -l A \
         -1 ${R1} \
         -2 ${R2} \
         -p 1 --validateMappings -o quants/${PRE}_quant





STAR+RSEM:

http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html

Indexing:


GFF=/path/to/annotation.gff
RSEM=/path/to/RSEM/rsem-prepare-reference
STAR=/path/to/STAR/bin/Linux_x86_64/ 
FASTA=/path/to/genome.fasta
${RSEM} --star --star-path ${STAR} --gff3 ${GFF} ${FASTA} star_idx


Quantification:


## This is for paired-end, strand-specific RNA-seq
IDX=/path/to/star_idx
STAR=/path/to/STAR/bin/Linux_x86_64
RSEM=/path/to/RSEM/rsem-calculate-expression

${RSEM} --star --star-path ${STAR} --star-gzipped-read-file --star-output-genome-bam --output-genome-bam --strandedness reverse -p ${P} --paired-end ${R1} ${R2} ${IDX} ${PREFIX} > out.err 2>&1





HiSat2+StringTie:

Indexing:


FASTA=/path/to/genome.fasta
hisat2-build ${FASTA} hisatidx

Quantification:

## This is for paired-end, strand-specific RNA-seq
HIDX=/path/to/hisatidx
STRANDEDNESS=RF
STRINGTIE=/path/to/stringtie-2.1.1.Linux_x86_64/stringtie 
GFF=/path/to/annotation.gff
P=1 ## or whatever integer number of threads you have
R1=/path/to/R1.fastq
R2=/path/to/R2.fastq

hisat2 --dta -p ${P} -x ${HIDX} -1 $R1 -2 $R2 --rna-strandness $STRANDEDNESS | samtools sort > reads.bam

mkdir -p ./for_ballgown
${STRINGTIE} --rf -G ${GFF} -e -b ./for_ballgown reads.bam -A ./for_ballgown/gene-level-counts.txt > ./for_ballgown/stringtie.out.gff

# The following command can help make a useful file, but also see output in for_ballgown/:
awk '$1 !~ /^#/ && $3=="transcript" {gsub(/"|;/,""); OFS="\t"; print $12,".",$1,$7,$4,$5,$14,$16,$18,$2}' ./for_ballgown/stringtie.out.gff > transcript-level-counts.txt





Basic Differential Expression Commands:


Getting started:


Both EdgeR and DEseq2 analyses can be done in RStudio.


Get your tab-separated counts or expected counts table (counts.txt) set up as such:

index c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3
Gene_1_isoform_1 118 220 240 207 156 66
Gene_2_isoform_1 8 17 2 2 1 0
Gene_3_isoform_1 4.2 0 0 0 0 0
Gene_4_isoform_1 370.8 839 583 488 316 89
Gene_5_isoform_1 84 218 102 102 68 33


...where c1 and c2 are two different conditions, tissues, or stages.
...and r1, r2, r3 are the replicates for each condition.

To extend upon 2 conditions, though, below I will pretend we have datasets from a time course of 3 time points: T1, T2, T3; with 3 replicates each.

Load needed libraries:

library("edgeR")
library("DESeq2")
library("IHW")
library("vsn")
library("pheatmap")
library("RColorBrewer")

EdgeR:


Depending on your experimental design, there is more than one way to perform DE analysis with EdgeR. Many experimental designs may be the simple A vs B format: treatment vs control, A vs B, time point 1 vs 2. Of course, though, you may want to pursue experimental designs with more than a simple single pairwise comparison: A vs B vs C; T1 vs T2 vs T3; or A,B,C at time points T1,T2,T3. For those, you have at least 3 choices:
1. Analyze each in independent pair-wise fashion (not recommended).
2. Use all of the datasets in the data preparation phase, then analyze each in pairwise fashion, or just the pairwise tests that you're concerned with. (more robust in my opinion/experience).
3. Use all datasets in ANOVA analysis to identify genes that differs between any 2 samples.

Here I will spell out approach #2 as I believe it is most similar to the accompanying DEseq2 analysis, and is what I have found most robust and useful for my purposes. In other words, if you feel overwhelmed at this point by all the options, then just use Approach#2 for now and move on with your life. For Approaches #1 and #3 above, see the end of this blog/document (beyond the DEseq2 section). Note that when you have just the simple pairwise experimental design (A vs B), Approach#2 = Approach#1.


## READ IN COUNTS TABLE
## round() can be used on x as it is below for DEseq2
## BUT it is not necessary, so I don't

x <- read.delim('counts.txt', row.names='index')
x <- x[complete.cases(x),]


## DEFINE GROUPS
group <- factor(c(rep("T1", 3), rep("T2", 3), rep("T3", 3)), 
                  levels = c("T1","T2", T3))

## PREPARE DATA
y <- DGEList(counts=x, group=group)
keep <- filterByExpr(y, min.count = 1, min.total.count=10, min.prop=0.25)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y); 
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmQLFit(y, design, robust=TRUE)


## CONTRASTS AND DE TESTS
con12 <- makeContrasts(T2 - T1, levels=design)
qlf12 <- glmQLFTest(fit, contrast=con12)
  
con13 <- makeContrasts(T3 - T1, levels=design)
qlf13 <- glmQLFTest(fit, contrast=con13)

con23 <- makeContrasts(T3 - T2, levels=design)
qlf23 <- glmQLFTest(fit, contrast=con23)


## FINALIZING
## I add Benjamini-Hochberg adjusted p-values, and the -log10 versions of each before writing out to a text file.
## E.g. for QLF in  [qlf12, qlf13, qlf23]; do this:
## More and more, I have been doing all follow-up analyses in Jupyter Notebooks with Python3 rather than RStudio

## GET BH FDR
fdr <- p.adjust(QLF$PValue, method="BH")

## WRITE OUT
write.table(
    x = data.frame(names=rownames(QLF), 
                   QLF, 
                   fdr = fdr,
                   log10p = -log10(df$PValue),
                   log10fdr = -log10(fdr)), 
    file = out, 
    quote = FALSE, 
    row.names = FALSE, 
    sep = "\t")




DEseq2:

DEseq2 has 4 ways to look at the Log2-fold-change estimates. These differences are not part of the DE tests, they are used to have potentially-more-useful visualizations and rankings based on log2FC (that can often be more extreme and error-prone for lower-abundance transcripts).

The 4 ways are:
1. Default log2FC as seen before any shrinkage
2. The original DEseq2 shrinkage from their 2014 paper ("normal" option in lfcShrink())
3. The ashr approach from a 2016-2017 paper ('ashr')
4. The apeglm approach from their 2018-2019 paper ('apeglm')

The latter two (ashr and apeglm), and specifically the last one (apeglm), are recommended by the DEseq2 authors for visualization and ranking. In other words: too many options? confused? Just use apeglm for ranking (i.e. to select top 50 genes) and visualization (e.g in MA plots, Volcano plots) and move on with your life.

The latter 2 approaches (ashr and apeglm) also gives "S values".

From the lfcShrink help:
"The s-values returned in combination with apeglm provide the probability of FSOS events, "false sign or small", among the tests with equal or smaller s-value than a given gene's s-value, where "small" is specified by lfcThreshold."
and
"s-values provide the probability of false signs among the tests with equal or smaller s-value than a given given's s-value. See Stephens (2016) reference on s-values."

I will show how to. create a dataframe and output table that includes all 4 logFC estimates, the p-value and FDR from the original statistical tests, and the additional S values from ashr and apeglm. One can then mess around with exploratory analyses comparing these approaches anyway one sees fit.

This approach will only give the comparisons analogous to EdgeR's "T2 - T1" and
"T3 - T1" above.


## READ IN COUNTS TABLE
## Expected counts from RSEM or Salmon can be non-integer (e.g. 4000.245)
## DEseq2 wants integers though.
## round() is used here to convert to the nearest integer value.

x <- read.delim('counts.txt', row.names='index')
x <- round( x[complete.cases(x),] )

## FACTORS
group <- factor(c(rep("T1", 3), rep("T2", 3), rep("T3", 3)), 
                  levels = c("T1","T2", T3))


coldata <- data.frame(condition=group, row.names = colnames(cts))

## DEseq OBJ INITIATE
dds <- DESeqDataSetFromMatrix(countData = cts,
                                colData = coldata,
                                design = ~ condition)

## FILTER similar to EdgeR (change as you see fit)
keep <- rowSums(counts(dds)) >= 10 & (rowSums(counts(dds) >= 1)/rowSums(counts(dds) >= 0) >=0.4)

dds <- dds[keep,]

## DIFF EXP ANALYSIS
dds <- DESeq(dds)

  
## OUTPUTS:
system("mkdir -p deseqtables")
pfx <- "deseqtables/"
pwgrps <- resultsNames(dds)
ngrps <- length(pwgrps) ## idx1 is Intercept
for (i in 2:ngrps){
  grp <- pwgrps[i]
  print(grp)
  splt <- strsplit(grp, split = "_")[[1]]
  
  res <- results(dds,contrast=c(splt[1], splt[2], splt[4]))
  resLFC.deseq22014 <- lfcShrink(dds, coef=grp, type="normal")[,2:3]
  resLFC.ashr2016 <- lfcShrink(dds, coef=grp, type="ashr", svalue=TRUE)[,2:4]
  resLFC.apeglm2018 <- lfcShrink(dds, coef=grp, type="apeglm",svalue=TRUE)[,2:4]
    
  a12 <- merge(as.data.frame(res), as.data.frame(resLFC.deseq22014), by=0, suffix=c("",".shrink"))
  a12 <- as.data.frame(a12[,2:dim(a12)[2]],row.names = a12[,1])
    
  a34 <- merge(as.data.frame(resLFC.ashr2016), as.data.frame(resLFC.apeglm2018), by=0, suffix=c(".ashr",".apeglm"))
  a34 <- as.data.frame(a34[,2:dim(a34)[2]],row.names = a34[,1])
    
  a1234 <- merge(a12, a34, by=0)
  a1234 <- as.data.frame(a1234[,2:dim(a1234)[2]],row.names = a1234[,1])
    
  sfx <- paste0("-", splt[2], "_", splt[4], ".txt")
  writeDEseq(df = a1234, 
               out = paste0(pfx, outpre, sfx))
}
















EdgeR ALTERNATIVE APPROACHES:


ANOVA APPROACH:


## DEFINE GROUPS
group <- factor(c(rep("T1", 3), rep("T2", 3), rep("T3", 3)), 
                  levels = c("T1","T2", T3))

## PREPARE DATA
y <- DGEList(counts=x, group=group)
keep <- filterByExpr(y, min.count = 1, min.total.count=10, min.prop=0.25)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y); 
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmQLFit(y, design, robust=TRUE)


## SINGLE CONTRAST AND DE TEST LINE
  conAnova <- makeContrasts(
    T1T2 = T2 - T1,
    T1T2 = T3 - T1,
    T2T3 =  T3 - T2,

    levels=design)

  anov <- glmQLFTest(fit, contrast=conAnova)
  

  FDR_anova <- p.adjust(anov$table$PValue, method="BH"); 




INDEPENDENT PAIRWISE APPROACH:

## This can be simply automated in a function and/or for loop, but I spell out everything below in case it is simpler to follow.

## Read in counts or expected counts table

  f <- '/path/to/counts.txt'
  x <- read.delim(f, row.names='index')
  x <- x[complete.cases(x),]

  x12 <- x[,c(1:3, 4:6)]; 
  x13 <- x[,c(1:3, 7:9)]; 
  x23 <- x[,c(4:6, 7:9)]; 

  ## Define groups
  group12 <- factor(c(rep("T1", 3), rep("T2", 3), 
                  levels = c("T1","T2"))
  group13 <- factor(c(rep("T1", 3), rep("T3", 3), 
                  levels = c("T1","T3"))
  group23 <- factor(c(rep("T2", 3), rep("T3", 3), 
                  levels = c("T2","T3"))

  y12 <- DGEList(counts=x12, group=group12)
  y13 <- DGEList(counts=x13, group=group13)
  y23 <- DGEList(counts=x23, group=group23)


  keep12 <- filterByExpr(y12, min.count = 1, min.total.count=10, min.prop=0.25)
  keep13 <- filterByExpr(y13, min.count = 1, min.total.count=10, min.prop=0.25)
  keep23 <- filterByExpr(y23, min.count = 1, min.total.count=10, min.prop=0.25)

  y12 <- y12[keep12, , keep.lib.sizes=FALSE]
  y13 <- y13[keep13, , keep.lib.sizes=FALSE]
  y23 <- y23[keep23, , keep.lib.sizes=FALSE]

  y12 <- calcNormFactors(y12)
  y13 <- calcNormFactors(y13)
  y23 <- calcNormFactors(y23)

  design12 <- model.matrix(~ 0 + group12)
  design13 <- model.matrix(~ 0 + group13)
  design23 <- model.matrix(~ 0 + group23)

  colnames(design12) <- levels(group12)
  colnames(design13) <- levels(group13)
  colnames(design23) <- levels(group23)

  y12 <- estimateDisp(y12, design12, robust=TRUE)
  y13 <- estimateDisp(y13, design13, robust=TRUE)
  y23 <- estimateDisp(y23, design23, robust=TRUE)

  fit12 <- glmQLFit(y12, design12, robust=TRUE)
  fit13 <- glmQLFit(y13, design13, robust=TRUE)
  fit23 <- glmQLFit(y23, design23, robust=TRUE)

  con12.pw <- makeContrasts(T2 - T1, levels=design12)
  con13.pw <- makeContrasts(T3 - T1, levels=design13)
  con23.pw <- makeContrasts(T3 - T2, levels=design23)

  qlf12.pw <- glmQLFTest(fit12, contrast=con12.pw)
  qlf13.pw <- glmQLFTest(fit13, contrast=con13.pw)
  qlf23.pw <- glmQLFTest(fit23, contrast=con23.pw)

  FDR_12.pw <- p.adjust(qlf12.pw$table$PValue, method="BH")
  FDR_13.pw <- p.adjust(qlf13.pw$table$PValue, method="BH")
  FDR_23.pw <- p.adjust(qlf23.pw$table$PValue, method="BH")




Some additional thoughts:

RNA-seq

Overall, for most people and most RNA-seq analyses I recommend:
1. Salmon or Kallisto:
Salmon is just so easy to use, so fast and light weight. It is my understanding that the same can be said for Kallisto, and that these 2 programs produce essentially the same results.
See this blog by the PI behind Kallisto:
It made me feel slightly bad for arbitrarily choosing to use Salmon first, but c'est la vie.

2. STAR+RSEM:
This may still slightly outperform Salmon and Kallisto, although I wouldn't be surprised if I was wrong on that.

Differential abundance/enrichment:

I don't have an opinion yet on whether EdgeR or DEseq2 is better. Perhaps neither is "better".
It appears to me that they have a lot in common, and in my hands give similar results.
Each is often considered the "best" in DE bake-off papers -- and often they are tied.

Both can handle some really complex experimental designs.
- e.g. time courses
- e.g. time courses with different drug treatments
- e.g. experiments where some samples were single-end, others were paired-end
- e.g. analyses using datasets from different labs
In the more complex cases above, these programs can identify batch effects and effects due to differences like SE vs PE reads.
As a result, it can remove those systematic biases and tell you the genes/transcripts/bins that differ due only to biological effects.

EdgeR has an ANOVA option for complex experiments as well.

DEseq2 has some very cool log2FC shrinkage options for visualization and ranking.

It might be worth your time to get familiar with each.


For ChIP-seq, ATAC-seq, and other DNA approaches:

1. Bowtie2 + RSEM
2. I am very interested in trying Salmon or Kallisto on genomic bins (e.g. bin sizes of 50 bp, 100 bp, 500 bp, 1 kb or X... depending on sequencing depth).

For tried-and-true, stick with #1.

For exploratory purposes, mess around with #2.

I foresee #1 being better than #2, though, b/c the EM steps in #2 would be treating genomic bins independently. This hypothetically could be an issue since neighboring bins clearly would have pertinent information for the EM ML estimates. In contrast, RSEM is likely not confined to discrete fixed bins. It is more likely that RSEM can gather information in a bin around each given read -- i.e. the bins are not fixed with reads potentially mapping within them, the read positions are fixed with bins formed around them. Perhaps a kind or hyper-critical reader will comment on that. Nevertheless, perhaps this hypothetical issue is not a big issue in practice, or perhaps it could simply finding the "best" kmer-size for Salmon/Kallisto would overcome this in your particular genomic anakysis. Or perhaps one could use overlapping windows (e.g. bin-size 100, step-size 20) to obtain the ML estimates, followed by taking the average or median within intersecting regions. That could incorporate information from neighboring bins.

I definitely recommend using EdgeR or DEseq2 to do between-sample comparisons for DNA-seq experiments, at least to get the appropriate normalized counts between treatment and controls. In practice, often times one can get away with a simple MACS2 analysis, but EdgeR and DEseq2 seem a bit more principled to me.



References and Resources:

Click images to go to PubMed page for given paper.


Alignment:

HiSat2:




STAR:







Quantification:

RSEM:

Salmon:



Kallisto:




SailFish:




StringTie:





Differential Expression Analysis:

Differential Naming Analysis:

- Differential abundance analysis
- Differential counts analysis
- Differential frequency analysis
- Differential level analysis
- Differential enrichment analysis
- Etc


EdgeR, including the pre-cursor, primary, and follow-up/update papers and resources:







The User's Guide explains the inner workings well, and has all other pertinent references (image below, but no links):










DEseq2, including the pre-cursor, primary, and follow-up/update papers and resources:




The 'ashr' adaptive shrinkage estimator:

The 'apeglm' adaptive shrinkage estimator:








sleuth (penalties incurred for capitalizing the "s"):


From sleuth paper:
1. "....sleuth displayed higher sensitivity than Cuffdiff 2, DESeq, DESeq2, EBSeq, edgeR, voom...."
2. "Our results show that by accounting for uncertainty in quantifications, sleuth is more accurate than previous approaches at both the gene and isoform levels. Crucially, the estimated FDRs reported by sleuth are well controlled and reflect the true FDRs, making the predictions of sleuth reliable and useful in practice."







Friday, March 6, 2020

Coverage Over Single Repeat (e.g. a single rDNA unit)

Title: Coverage over single rDNA unit
Created: 2019-05-01
By: John Urban
For: Anyone interested (Spradling Lab, Carnegie Emb, or otherwise)

Originally posted as a Google Doc found here:

This example uses Drosophila and a single rDNA unit (the sequence for it can be found in the Google Doc).





May need to download dm6, find here:
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/dm6/bigZips/ .


If you want to look at the signal of rDNA, do the following.
1. Get dm6 FASTA. If you download it from above, you will need to combine all the chromosomes into one file called dm6.fasta or something similar.
2. Use the attached rDNA sequence to mask it with RepeatMasker (download and use at commandline: http://www.repeatmasker.org/):
mkdir masked
RepeatMasker dm6.fasta -lib rDNA.fasta -dir ./masked -x -pa 8
3. Add the attached rDNA sequence to the masked dm6 file. Now there is only 1 copy of rDNA in the genome sequence. Example command:
cat rDNA.fasta masked-dm6.fasta > masked-dm6-and-rDNA.fasta
4. Index it with Bowtie2:
bowtie2-build masked-dm6-and-rDNA.fasta masked-dm6-and-rDNA
5. Map reads to the masked genome + rDNA BT2 index with Bowtie2 and sort with SAMtools
bowtie2 …..
6. To get depth at each position use BEDTools:
bedtools genomecov -split -depth -ibam ${BAM} > depth.txt
7. To get the sum of all coverage, use Awk:
awk '{sum+=$3}END{ print sum } depth.txt > sum.txt
SUM=$( cat sum.txt )
8. Normalize the depth at each position to the sum of all coverage and multiply by 1million to get the signal per million bases covered: 
awk -v "sum=${SUM}" {OFS="\t"; print $1,$2,1e6*$3/sum}' depth.txt > normdepth.txt
9. Use browser or R to look at the coverage over rDNA. To bring just rDNA use awk (assuming the rDNA sequence was named “rDNA”, change if needed):
awk ‘$1==”rDNA”’ normdepth.txt > normdepth.rDNA.txt
10. With R or Awk, can do A/B or A-B or log2(A/B), etc.
11. If using IGV, open it up and go to create genome file, and point it towards masked-dm6-and-rDNA.fasta.

Normalization Note:
Note that it may be good to also try normalizing depth to the overall median depth value or to the median depth value from regions determined to be background. 

To get an overall sense of the depth for each contig (including the rDNA sequence) as well as the entire genome, do:
bedtools genomecov -split -ibam ${BAM} > coverage.hist.txt
Take into R to visualize and do various computations on.

BEDTools NOTE:
Do not use “-split” if you want to do fragment coverage rather than read coverage. For fragment coverage, it adds account to all bases between paired reads, for example. For some analyses this is needed. For others, it introduces unnecessary error and assumption into the analysis. Generally, use “-split” if fragment coverage is not essential to the analysis.


Literature:
I did this exact type of analysis in a 2015 Genome Research paper mapping replication origins:
There are some other suggestions therein on how to do this analysis, but it should be similar.