Runhang Shu

RNA-seq analysis

Runhang / 2019-12-31


  今天是二零一九年的最后一天,留学时光匆匆而过,研究生生涯已经过去四分之一。在这里,我把这个学期学了的转录组测序方法总结了一下,放在这里供同行参考讨论。

Introduction

Quality Control

After sequencing, we will have fastq file. The first thing we will do is check the quality of the sequence results.

Tool: fastqc (0.11.7)
Script: (Run the bash file in hipergator)

#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --time=01:00:00
#SBATCH --mail-user=r.shu@ufl.edu
#SBATCH --mail-type=ALL
#SBATCH --output=fastqc.log
#SBATCH --error=fastqc.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=20gb.   
#(Note: The command above is typically what we need include in a bash file, 
#in which you tell how many memories, how much time you request)

module load fastqc
fastqc ENCFF000ESM.fastq
fastqc ENCFF000ESS.fastq

Results
Once completed, we will get html files containing basic info of the fastq file such as total reads, GC content, sequence length, and most importantly the per base sequence quality (see the figure below)

An acceptable sequencing result should have at least 20 quality score (0.01 error rate) across all bases. Score 30 stand for 0.001 error rate. When it is near the end of the reads, the quality score is below 20, so the following Trim and Filter steps are required.

Trim & Filter
Tool: fastxtoolkit
cutadapt (process both pairs at the same time, which is better for paired-end data)
Script:
module load fastx
toolkit/0.0.14
fastqqualityfilter -i ENCFF000ESM.fastq -q 20 -p 50 -o ENCFF000ESM.filtered.fastq
fastqqualityfilter -i ENCFF000ESS.fastq -q 20 -p 50 -o ENCFF000ESS.filtered.fastq
fastqc ENCFF000ESM.filtered2.fastq
fastqc ENCFF000ESS.filtered2.fastq #get rid of reads for which 50% of the bases have quality lower that 20 fastqqualitytrimmer -i ENCFF000ESM.filtered.fastq -t 25 -l 50 -o ENCFFOOOESM.f.t.fastq fastqqualitytrimmer -i ENCFF000ESS.filtered.fastq -t 25 -l 50 -o ENCFFOOOESS.f.t.fastq #trim from the 3’ end until the quality score reach 25 and then remove reads that are shorter than 50 after trimming
Then we run fastqc check the quality again


An alternative of fastx_toolkit is cutadapt
module load cutadapt
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-AAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT - -quality-base=64 -q 20 -m 50 -o ENCFF000ESM.cutadapt.fastq -p ENCFF000ESS.cutadapt.fastq ENCFF000ESM.fastq ENCFF000ESS.fastq
fastqc ENCFF000ESM.cutadapt.fastq
fastqc ENCFF000ESS.cutadapt.fastq

Mapping
After sequence preprocessing, we can align sequencing reads to a reference (if we have reference genome). There are 2 options in term of mapping:


Tools: Bowtie2 and TopHat
Script:
module load bowtie2

build the spliced index of human genome hg38.fa file in bowtie2


bowtie2-build /ufrc/mcb4325/share/Week12/hg38.fa hg38 use the index built by bowtie2 for mapping
module load tophat
tophat2 –no-coverage-search hg38 ENCFFOOOESM.f.t.fastq ENCFFOOOESS.f.t.fastq
Results:
We will get a “Tophatresults” fold with acceptedhits.bam file in it.
BAM file is a compression format of SAM file - Binary file; Save disk (80% compression); Indexed for efficient data access; - Easy to interconvert using SAMtools - BAM to SAM
samtools view -h name.bam > name.sam - Sort BAM file by chromosomal coordinates
samtools sort name.bam name.sorted

Quantification


Method 1
As we already have the bam file, there is no need to assemble the reads for quantification. Again, we mapped the reads based on reference genome. Now, we quantify the reads based on reference genome (Homosapiens.GRCh28.86.OK.gtf file).

Tool: htseq; samtools
Scripts:
module load samtools
module load htseq
samtools sort -n ../Tophat
results/Embryo1/acceptedhits.bam -o sortedE1.bam
htseq-count -f bam -s no -i gene
id sortedE1.bam ./Homosapiens.GRCh38.86.OK.gtf > htseqcount_E1.txt

Results:
The result file is a txt file with Ensembl gene_id (such as ENSG0000.) and the number of counts.

Method 2 (Assembly and then quantify)
We can assemble the read based on reference genome with cufflinks and then we get both genes and isoform(transcripts) FPKM file.
FPKM stands for Fragments Per Kilobase of transcript per Million mapped reads, which normalize the basis of gene length because the longer a read is the more count it will have. So, I think this method is better than read count.

Assembly
Tool: cufflinks
Script:
module load cufflinks
cufflinks –library-type fr-firststrand -o /home/r.shu/norefassembly /home/r.shu/tophatout/acceptedhits.bam
#tophat mapped bam file was assembled without reference
OR…
cufflinks -g ../hg38.gtf –library-type fr-firststrand -o /home/r.shu/gassembly /home/r.shu/tophatout/acceptedhits.bam
#tophat mapped bam file was assmebled by GFT reference guide
#hg38 gtf file is reference genome file
OR…
cufflinks -G ./hg38.gtf –library-type fr-firststrand -o /home/r.shu/G
assembly /home/r.shu/tophatout/acceptedhits.bam
#tophat mapped bam file was assembled based on transcripts existing in GFT file

Results: We can get gene.fpkmtracking and isoforms.fpkmtracking files, and importantly, we get transcripts.gtf file which can be visualized in IGV software.
The tacking.gtf files contain quantification info that we can read in R.

De novo Assembly, then mapping and quantifying
For study organisms that are not well annotated genome, we may want to use some algorithm to assemble the raw reads together without any reference.
Tool: trinity; gmap
Script:
module load trinity
Trinity –seqType fq –maxmemory 5G –left /ufrc/mcb4325/share/Week12/UHLEN1T110R1.fastq –right /ufrc/mcb4325/share/Week12/UHLEN1T110R2.fastq –CPU 6
# we get Trinity.fasta file from this process
# If we have reference genome, we can map the trinity assembly result using GMAP
module load gmap
gmap
build -D . -d hg38.GMAP /ufrc/mcb4325/share/Week12/hg38.fa
# -D means the directory, -d means the name of the file
cd trinityoutdir
gmap -n 1 -t 8 -f 2 -D ../ -d hg38.GMAP Trinity.fasta > Trinity.gff
# -f 2 indicate output format and produce .gff file
awk ‘{if ($3 == “exon”) print $0}’ Trinity.gff > Trinity.exon.gff


Note: De nova transcriptome assembly has higher sensitivity to discover novel transcripts and trans-spliced transcripts BUT not as sensitive as reference-based assembly to discover low-abundance transcripts.

Differential expression analysis
We then move to R for statistics analysis. By now, we have quantified the number of each exon or gene. Also, get the gene length and GC content txt file from Ensembl to check the bias
mycounts_t = read.delim(“CountData.txt”)

dim(mycounts)
head(mycounts)

mylengtht = read.delim(“genetranscript_length.txt” ,as.is = TRUE)
dim(mylength)
head(mylength)

myGCt = read.delim(“geneGC_content.txt”, as.is = TRUE)
dim(myGC)
head(myGC)

Experimental design

myfactorst = data.frame(“treat” = substr(colnames(mycountst), start = 1, stop = 1), “time” = substr(colnames(mycountst), start = 3, stop = 3), “cond” = substr(colnames(mycountst), start = 1, stop = 3))
rownames(myfactorst) = colnames(mycounts)
head(myfactors
t)
myfactors_t

Loading R libraries to be used

if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”)
BiocManager::install(“EDASeq”, version = “3.8”)
BiocManager::install(“NOISeq”, version = “3.8”)
library(NOISeq)
library(EDASeq)

*Note: As the read length and GC content bias is common in Illumina sequencing, where longer read get more counts (FPKM) and low GC content get lower counts. *

BIAS ANALYSIS with NOISeq ########################

Preparing data for NOISeq package


mydata = NOISeq::readData(data = mycounts, factors = myfactors, gc = myGC, length = mylength)

Length bias


mylengthbias = dat(mydata, factor = “cond”, norm = FALSE, type = “lengthbias”)
par(mfrow = c(1,2))
explo.plot(mylengthbias, samples = 1)
explo.plot(mylengthbias, samples = 1:8)

GC content bias

myGCbias = dat(mydata, factor = “cond”, norm = FALSE, type = “GCbias”)
par(mfrow = c(1,2))
explo.plot(myGCbias, samples = 1)
explo.plot(myGCbias, samples = 1:8)

RNA composition

mycomp = dat(mydata, norm = FALSE, type = “cd”) par(mfrow = c(1,2)) explo.plot(mycomp, samples = 1:12) explo.plot(mycomp, samples = 13:24)

NORMALIZATION

First we remove genes with 0 counts in all samples

mycountst = mycountst[rowSums(mycounts_t) > 0, ]

Discarding genes without GC content information

genesWithGCt = intersect(rownames(mycountst), myGCt[,“gene”]) myGCt = myGCt[myGCt[,“gene”] %in% genesWithGCt,] rownames(myGCt) = myGCt[,“gene”] mycountst = mycountst[myGCt[,“gene”],]

head(mycountst) head(myfactorst)

Preparing data for EDASeq package

edadata = newSeqExpressionSet(counts=as.matrix(mycountst), featureData=myGCt[,2, drop = FALSE], phenoData=myfactors_t[,“cond”, drop = FALSE])

Correcting GC content bias with EDASeq package: loess method

edadata <- withinLaneNormalization(edadata,“GCcontent”, which=“full”) mynormdata = edadata@assayData$normalizedCounts

Applying TMM normalization (between-samples) with NOISeq package

mynormdata = tmm(mynormdata) head(mynormdata)

Checking the normalization results

More uniform distributions by TMM

par(mfrow = c(1,2)) boxplot(log(as.matrix(mycounts+1)) ~ col(mycounts), main = “Before normalization”) boxplot(log(mynormdata+1) ~ col(mynormdata), main = “After normalization”)

GC bias memoved

mydata.norm = NOISeq::readData(data = mynormdata, factors = myfactors, gc = myGC, length = mylength) myGCbias = dat(mydata.norm, factor = “cond”, norm = TRUE, type = “GCbias”) par(mfrow = c(1,2)) explo.plot(myGCbias, samples = 1) explo.plot(myGCbias, samples = 1:8)

Loading R libraries to be used

if (!requireNamespace(“BiocManager”, quietly = TRUE)) install.packages(“BiocManager”) BiocManager::install(“DESeq2”) BiocManager::install(“limma”) BiocManager::install(“maSigPro”) BiocManager::install(“statmod”) BiocManager::install(“edgeR”)

library(NOISeq) library(edgeR) library(DESeq2) library(limma) library(maSigPro) library(“edgeR”)

edgeR to profile the differentially expressed genes

############################

cond2compare = colnames(mycounts)[c(grep(“B0”, colnames(mycounts)), grep(“B6”, colnames(mycounts)))] cond2compare mygroups = rep(c(“B0”, “B6”), each = 3) mygroups

edgeR

myedgeR = DGEList(counts = mycounts[,cond2compare], group = mygroups)
myedgeR = calcNormFactors(myedgeR)

myedgeR = estimateCommonDisp(myedgeR)
myedgeR = estimateTagwiseDisp(myedgeR, trend = “movingave”)
myedgeR = exactTest(myedgeR)
names(myedgeR)
head(myedgeR$table)
topTags(myedgeR)
myedgeRdecide = decideTestsDGE(myedgeR, adjust.method = “BH”, p.value = 0.05, lfc = 0)
summary(myedgeR
decide) #Apart from edgeR, we can also use DESeq2 and NOIseq.

Reduce the dimension by PCA plot


Though there are thousands of variables(gene expression) in each treatments, with Principal coordinate analysis plot, we can appreciate the similarity between two or more samples.

Script
PCA<-princomp(normal_data,cor = TRUE)

create my PCA plot function

myPCAplot<- function(x, main,col=rep(c(1:6),each=3)) {
t.var<-cumsum(x$sdev^2^/sum(x$sdev^2^))
var.PC1<-round(t.var[1]100)
var.PC2<-round((t.var[2]-t.var[1])
100)
plot(P[,1],P[,2],
main = main,
col=col,
cex=2, lwd=2,
xlim=c(min(P[,1])-0.1,max(P[,1])+0.1) ,
ylim=range(P[,2])*1.3,
xlab= paste(“PC1:“,var.PC1,”% expel.var”, sep = “”),
ylab=paste(“PC2:“,var.PC2,”% expel.var”, sep = “”) )
legend(x=0.55,y=0.6,legend=c(“Embryo”,“Oocyte”),
col =c(1:2),cex=1,bty=“o”,pch=1,text.font = 2,text.col = c(1:4),
pt.cex = 1,pt.lwd = 2,xjust = 0.5,yjust = 0.5) }
apply the function I created and input the PCA dataset
myPCAplot(PCA, main= (“Gene expression of different cells”))

positive values on x axis indicate the data is not centered


mydata.c<-t(apply(normaldata, 1, scale, scale=FALSE))
colnames(mydata.c)<-colnames(normal
data)
PCA.c<-princomp(mydata.c,cor=TRUE)
myPCAplot(PCA.c, main=“Drug doses effect on genes expression of mice”)


Top10<- sort(abs(PCA.c$scores[,1]), decreasing= TRUE)[1:10]

Use scores ranking in PCA analysis as determinants for the top 10 differentially expressed genes.

Enrichment and Pathway Analysis
Tool: PaintOmics3

source(“http://bioconductor.org/biocLite.R")
biocLite(“NOISeq”)
biocLite(“goseq”)
biocLite(“goglm”)
library(NOISeq)
library(goseq)

setwd(“/Users/darson/Desktop/Genomics in R/Week 16/DataWeek16”)

NOISeq analysis

data <- read.delim(“Differentiation.txt”, as.is = TRUE, header = TRUE, row.names = 1)

normalized data

mygroups = rep(c(“P”, “D”), each = 3)

data.f = filtered.data(data, factor = mygroups, norm = TRUE, cv.cutoff = 200, cpm = 5) #filter low count genes mynoiseqbio = readData(data.f, factors = data.frame(“condi”=mygroups)) mynoiseqbio = noiseqbio(mynoiseqbio, norm = “n”, k = NULL, factor = “condi”, r = 30) mynoiseqbio.deg = degenes(mynoiseqbio, q = 0.99)

mynoiseqbio.deg <- mynoiseqbio.deg[abs(mynoiseqbio.deg[“log2FC”])> 0.3,]

Enrichment analysis

GOannot <- read.delim(“mmuGO.txt”, as.is = TRUE, header = TRUE)

Obtain gene length

mmuLength <- read.delim(“mmuLength.txt”, as.is = TRUE, header = TRUE)

mmuLength <- tapply(mmuLength[,3],mmuLength[,1], median)

Preparing data for GOseq analysis

mygenes = rownames(data.f) mmuLength = mmuLength[mygenes] NOISeq.DE = as.integer(mygenes %in% rownames(mynoiseqbio.deg)) # logical vector of DE names(NOISeq.DE) = mygenes head (NOISeq.DE,10)

GOseq

mipwf <- nullp(NOISeq.DE, bias.data = mmuLength)
walle <- goseq(pwf = mipwf, gene2cat = GOannot, method = “Wallenius”, repcnt = 2000)
enriched <- walle$category[walle$overrepresentedpvalue< 0.01]
GO.enriched <- GOannot[match(enriched, GOannot[,2]),3] # obtain GO names
head (GO.enriched, 30) # GO results
tail (GO.enriched, 30) # GO results

Prepare data for Paintomics

mynoiseqbio.all = degenes(mynoiseqbio, q = 0)
expr.data <- cbind (name = rownames(mynoiseqbio.all), FC = mynoiseqbio.all[,“log2FC”])
write.table( expr.data, “expr.data.txt”, row.names = F, col.names = T, quote = F, sep = “\t” )
write.table(rownames(mynoiseqbio.deg), “DE.data.txt”, row.names = F, col.names = F, quote = F, sep = “\t” )

Upload expr.data and DE.data.txt files in PaintOmics