The first steps in most RNA-Seq analyses involves aligning reads to a reference genome or transcriptome. Because these alignments are used as the starting point for assigning reads to genes, it is very important that the alignment procedure is done properly. This practical will cover the mapping steps involved in the GenPipes RNA-Seq pipeline.
Aligning or mapping reads to a reference is usually the first step in most genomics analyses. In the case of RNA-seq data, there are several different alignment strategies available, each with their advantages and disadvantages. Each strategy involves a completely different set of bioinformatics tools and therefore, it is important to consider all options before commiting to a strategy.
(Click to see more information)
We are assuming there is an annotated reference genome available for the organism of interest. If that is not the case, the only option is to assemble a reference de novo and align to that. This is usually the case for studies involving non-model organisms. A complete de novo assembly is outside the scope of this workshop, although GenPipes offers a version of the RNA-Seq pipeline for this purpose (rnaseq_denovo
) that uses the Trinity assembler to produce a reference transcriptome.
In the majority of cases, one or more reference genomes and transcriptomes are available for mapping. It is important to compare available assemblies and choose the most appropriate one for the experiment in question. Usually, more recent assemblies have better gene models and are therefore prefered to older references. If an experiment involves downstream analyses with legacy tools that do not have updated references, or if the purpose is to compare the results to previous experiments, it might be preferable to align to an older reference. Alternatively, for commonly sequenced model organisms such as Homo sapiens and Mus musculus, not only are there different assemblies available, but different annotation formats maintained by different institutions as well. These assemblies are mostly similar, but differ in subtle and yet important ways.
In terms of the actual alignment, the main difference between DNA and RNA alignments has to do with the presence of splicing sites in RNA alignments. Contrary to genomic data, where no gaps or only small ones are expected when aligning to the reference genome, RNA reads can span large introns. If the alignment software is not splice-aware, these gaps would penalize the alignment to the point that few reads would actually align back to the genome. For that reason, RNA reads are usually aligned using a splice-aware aligner. These aligners use an algorithm that does not penalize gaps to the same extent, so that RNA reads can be successfully aligned to a reference genome.
Before we start, make sure that you have the data for the workshop.
Steps:
## connect to server:
ssh my_cc_account@mp2.ccs.usherbrooke.ca
## download data IF YOU DONT HAVE A COPY:
#wget http://www.computationalgenomics.ca/tutorial/c3g_analysis_workshop/C3GAW_RNA_TestData_Aug2018.zip
#unzip C3GAW_RNA_TestData_Aug2018.zip
#cd C3GAW_RNA_TestData_Aug2018
Verify that you have the following directories and files:
- raw_data
,
- readset.rnaseq.txt
,
- design.rnaseq.txt
.
In this first exercise, you will prepare the input files required for aligning RNA-Seq data to a reference genome using the splice-aware aligner, STAR.
.fastq
filesThe inputs for the alignment step are .fastq
files, which include the reads outputted by the sequencer and quality metrics for each base. However, using the raw output of the sequencer can be problematic as some reads might include adapter sequences from the library preparation steps as well as low-quality bases that can affect the outcome of the analysis. Therefore, a trimming procedure to remove these problematic reads is required as a first step when preparing the input for an RNA alignment.
Think: How could an adapter sequence affect the outcome of an RNA-Seq alignment and why is it important to remove them?
readset.rnaseq.txt
file and check that the columns labeled Adapter1
and Adapter2
have the adapter sequences used to prepare your RNA-Seq libraries for sequencing. These sequences will be identified by the pipeline and removed from the reads that contain them, so it is important to verify that they are correct. Solution (click here)
Command:
less readset.rnaseq.txt
To know which adapter sequences to use, check with the provider of your RNA-Seq data. Additionally, if you know which method was used to prepare the library, you can check the manufactuer’s website (for Illumina libraries, check this webstie)
Solution (click here)
Command:
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-j slurm \
-o . \
-s 2-3 > rnaseq_steps2-3.sh
less rnaseq_steps2-3.sh
Explanation: The solution to the exercise above will create a series of Trimmommatic jobs. For a more detailed explanation of what the code is doing, click here.
This command will generate the Trimmommatic jobs. To understand what these jobs are doing, we can look at a standard Trimmommatic command:
java -XX:ParallelGCThreads=1 -Xmx30G -jar $TRIMMOMATIC_JAR PE \
-threads 30 \
-phred33 \
raw_data/rnaseq_H1ESC_chr19_Rep2_R1.fastq.gz \
raw_data/rnaseq_H1ESC_chr19_Rep2_R2.fastq.gz \
trim/H1ESC_Rep2/H1ESC_Rep2.trim.pair1.fastq.gz \
trim/H1ESC_Rep2/H1ESC_Rep2.trim.single1.fastq.gz \
trim/H1ESC_Rep2/H1ESC_Rep2.trim.pair2.fastq.gz \
trim/H1ESC_Rep2/H1ESC_Rep2.trim.single2.fastq.gz \
ILLUMINACLIP:trim/H1ESC_Rep2/H1ESC_Rep2.trim.adapters.fa:2:30:15:8:true \
TRAILING:30 \
MINLEN:32 \
2> trim/H1ESC_Rep2/H1ESC_Rep2.trim.log
The command can be broken down into the following components:
java -XX:ParallelGCThreads=1 -Xmx30G -jar $TRIMMOMATIC_JAR
: Instructions to run the java interpreter for TrimmommaticPE
: Paired end moderaw_data/...fastq.gz
: input filestrim/...trim...fastq.gz
: name of trimmed output filesILLUMINACLIP...
: instructions to trim adaptersTRAILING
: threshold quality to cut off bases at end of readMINLEN
: threshold length2> ...log
: save output into logYou can modify parameters for Trimmommatic in the .ini
files:
trailing_min_quality=30
: cuts bases off the end of a read, if below the specified threshold quality.min_length=32
: drops the read if it is below the specified lengthTry re-running these steps at home using different parameters to see how this affects the output.
Aligners require indices to efficiently map reads to a reference. Indices are usually generated before any alignment takes place. Each assembly requires its own indices and different alignment programs use different types of indices. Unless you trust the source of an index, it is usually better to generate indices on your own, that way you can ensure that the index is appropriate for the species and genome version.
The data for this tutorial consists of paired-end RNA-seq libraries from the Homo sapiens cell lines H1ESC and GM12878 (read length: 75 bp). The genome version that will be used is the NCBI GRCh38. The data will only focus on comparing expression data from chromosome 19, so a special index has been generated with the appropriate data for that chromosome.
Homo_sapiens.GRCh38.chr19/genome
). Explore the contents of the directory. Where can you find a copy of the full sequence? Solution (click here)
The sequence that is used as reference is found as a FASTA file in the following path: $MUGQIC_INSTALL_HOME/genomes/C3G_workshop/Homo_sapiens.GRCh38.chr19/genome/Homo_sapiens.GRCh38.chr19.fa
. To view the contents of this file, use the folowing command:
cd $MUGQIC_INSTALL_HOME/genomes/C3G_workshop/Homo_sapiens.GRCh38.chr19/genome
less Homo_sapiens.GRCh38.chr19.fa
Solution (click here)
The index files are found in the directory: Homo_sapiens.GRCh38.chr19/genome/star_index/Ensembl87.sjdbOverhang99
. These files are binary, so they can’t be read using the less
command.
.ini
file with the genome information. Where can you find this .ini
file? Solution (click here)
The ini
file can be found in the main directory you downloaded, and it is called: workshop.ini
. To view the contents of this file, use the following command:
less workshop.ini
The important information, with regards to the reference genome, is contained in the following lines:
[DEFAULT]
assembly_dir = $MUGQIC_INSTALL_HOME/genomes/C3G_workshop/%(scientific_name)s.%(assembly)s
scientific_name=Homo_sapiens
assembly=GRCh38.chr19
assembly_synonyms=hg38
dbsnp_version=149
source=Ensembl
version=87
STAR (Spliced Transcripts Alignment to a Reference) is a fast, splice-aware aligner that is used in several RNA-seq analysis pipelines. It is very verstile because it has several adjustable options that permit the user to fine-tune the mapping based on their needs. These options include thresholds on novel and canonical splice junctions, reporting or limiting multi-mapping reads, and options for gene fusions or chimeric alignments. This exercise will focus on running STAR within the GenPipes framework, which simplifies the process using sensible, pre-defined options that are adequate for most RNA-seq analyses.
The GenPipes RNA-Seq pipeline uses the two-pass STAR alignment procedure. For more information about this method, click here.
This option allows for the mapping to novel splice junctions across all samples, in addition to the canonical junctions available in the annotated reference. As the name indicates, the two-pass mapping procedures runs the alignment twice:
This procedure does not increase the total amount of novel junctions detected. However, it increases sensitivity of those novel junctions, because a read can align to a junction discovered in a different sample.
Think: How is it possible that a read maps to a junction that was not discovered in the first pass?
The inputs to run STAR are the trimmed .fastq
files generated in the previous trimming steps (2-3). The actual mapping consists of a single step (4) within the GenPipes RNA-seq pipeline.
.ini
file has been included in the command. Solution (click here)
Command:
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-j slurm \
-o . \
-s 4 > rnaseq_steps4.sh
less rnaseq_steps4.sh
Explanation: This command will generate the STAR jobs. To understand what these jobs are doing, click here.
First, we can look at the names of the jobs created:
The first four jobs are the “first-pass” mapping steps:
star_align.1.GM12878_Rep1
star_align.1.GM12878_Rep2
star_align.1.H1ESC_Rep1
star_align.1.H1ESC_Rep2
The next step is the step that merges the novel junctions and adds them to the index.
star_index.AllSamples
The next four steps are the “second-pass” mapping steps. Notice how in these mapping steps the original reference genome has been replaced for --genomeDir reference.Merged
.
star_align.2.GM12878_Rep1
star_align.2.GM12878_Rep2
star_align.2.H1ESC_Rep1
star_align.2.H1ESC_Rep2
What other differences do you notice between the first and second-pass mapping steps? How do you think these differences affect the mapping?
A final step is used to generate useful metrics from the alignment.
star_report
Once the STAR mapping steps have finished running, a sorted BAM file will be created for each sample. This file is the basis for several of the downstream analysis steps, so it is important to know where it is saved and how it can be accessed.
Solution (click here)
The outputs of the first and second pass are contained under the folders alignment_1stPass
and alignment
, respectively. Because, the results that are of most interest for downstream analysis are the results of the second pass, from now on, we will only focus on the contents of the alignment
directory.
To check the content of each folder:
ls alignment_1stPass
ls alignment
GM12878_Rep1
? How are the alignments sorted? Solution (click here)
The main output for the alignment step is found in the following path: alignment/GM12878_Rep1/GM12878_Rep1.sorted.bam
. The aligned reads in this file are sorted by coordinate.
To check:
ls alignment/GM12878_Rep1/
## to check sorting:
module load mugqic/samtools/1.4.1
samtools view -H alignment/GM12878_Rep1/GM12878_Rep1.sorted.bam
## look for "SO:" on the first line
Depending on their purpose and algorithm, different tools require alignment files to be sorted in different ways. By default, the output of the STAR alignment sorts reads by the coordinate (position) of the alignment. However, some tools, including HT-Seq Count, require that reads are sorted by their name instead of their position. Similarly, some tools require duplicate reads to be marked so that they can be filtered out easily. These different requirements are accounted for by GenPipes, and steps 5-12 of the pipeline consist of tools that produce separate versions of the alignment (bam
) file, using different methods to sort, filter and clip the reads. Most of these steps are carried out using the multi-purpose Picard tool.
Solution (click here)
Command:
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-j slurm \
-o . \
-s 5-12 > rnaseq_steps5-12.sh
less rnaseq_steps5-12.sh
The names of each step are:
picard_merger_sam_files
: if a sample has more than 1 readset, all the alignment files are merged to form only one file for each sample.picard_sort_sam
: sorts the sam/bam files by read name instead of position.picard_mark_duplicates
: duplicate reads will be marked.picard_rna_metrics
: quality control metrics are computed for all samples.estimate_ribosomal_rna
: ribosomal RNA is estimated using alignments to an rRNA reference.bam_hard_clip
: a hard-clipped version of the alignments is generated (where non-aligned bases are completely removed instead of just masked).rnaseqc
: additional QC metrics are computed and saved under metrics/rnaseqRep
wiggle
: wiggle tracks are generated so that alignments can be quickly visualized in genome browsersOne of the most important uses of alignment files is to generate raw counts for genes so that they can be compared to determine whether or not there is evidence of differential expression. GenPipes uses the counting tool HT-Seq to generate the raw counts matrix that will serve as input for the rest of the downstream analyses. This tool will count reads differently depending on the mode.
Click here to see a table summarizing the differences with how HT-Seq counts a read, depending on the mode.
Step 13 of the RNA-Seq pipeline uses HT-Seq count and the name-sorted alignment files to calculate the raw counts of every gene in every sample. By default, the mode is set to intersection-nonempty
, but it can be changed if a different behavior is desired. However, before changing the mode, it is very important to understand how this could impact downstream analyses, especially differential expression.
intersection-nonempty
. Solution (click here)
Command:
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-j slurm \
-o . \
-s 13 > rnaseq_steps13.sh
less rnaseq_steps13.sh
If HT-Seq is set up properly, the command for H1ESC_Rep1
should read as follows:
mkdir -p raw_counts && \
samtools view -F 4 \
alignment/H1ESC_Rep1/H1ESC_Rep1.QueryNameSorted.bam | \
htseq-count - \
-m intersection-nonempty \
--stranded=reverse \
--format=sam \
/lb/project/mugqic/projects/rdali/C3G/projects/TestDatasets/Homo_sapiens.GRCh38.chr19/annotations/Homo_sapiens.GRCh38.chr19.Ensembl87.gtf \
> raw_counts/H1ESC_Rep1.readcounts.csv
Notice how -m intersection-nonempty
is defined.
Solution (click here)
Command:
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-j slurm \
-o . \
-s 14 > rnaseq_steps14.sh
less rnaseq_steps14.sh
This step will create four jobs:
metrics.matrix
metrics.wigzip
rpkm_saturation
raw_count_metrics_report
The merged raw counts matrix is found under DGE/rawCountMatrix.csv
, you can check it using the following command:
less -S DGE/rawCountMatrix.csv