Introduction

The first steps in most RNA-Seq analyses involves aligning reads to a reference genome or transcriptome. Because these alignments are used as the starting point for assigning reads to genes, it is very important that the alignment procedure is done properly. This practical will cover the mapping steps involved in the GenPipes RNA-Seq pipeline.

Learning objectives:

  1. Understand the input files, indices, and outputs of mapping steps.
  2. Understand the commands and parameters used to align RNA-Seq data.
  3. Understand how the alignments are used to assign reads to genes and produce raw counts.

Theoretical Background: RNA-Seq alignment

Aligning or mapping reads to a reference is usually the first step in most genomics analyses. In the case of RNA-seq data, there are several different alignment strategies available, each with their advantages and disadvantages. Each strategy involves a completely different set of bioinformatics tools and therefore, it is important to consider all options before commiting to a strategy.

(Click to see more information)

We are assuming there is an annotated reference genome available for the organism of interest. If that is not the case, the only option is to assemble a reference de novo and align to that. This is usually the case for studies involving non-model organisms. A complete de novo assembly is outside the scope of this workshop, although GenPipes offers a version of the RNA-Seq pipeline for this purpose (rnaseq_denovo) that uses the Trinity assembler to produce a reference transcriptome.

In the majority of cases, one or more reference genomes and transcriptomes are available for mapping. It is important to compare available assemblies and choose the most appropriate one for the experiment in question. Usually, more recent assemblies have better gene models and are therefore prefered to older references. If an experiment involves downstream analyses with legacy tools that do not have updated references, or if the purpose is to compare the results to previous experiments, it might be preferable to align to an older reference. Alternatively, for commonly sequenced model organisms such as Homo sapiens and Mus musculus, not only are there different assemblies available, but different annotation formats maintained by different institutions as well. These assemblies are mostly similar, but differ in subtle and yet important ways.

In terms of the actual alignment, the main difference between DNA and RNA alignments has to do with the presence of splicing sites in RNA alignments. Contrary to genomic data, where no gaps or only small ones are expected when aligning to the reference genome, RNA reads can span large introns. If the alignment software is not splice-aware, these gaps would penalize the alignment to the point that few reads would actually align back to the genome. For that reason, RNA reads are usually aligned using a splice-aware aligner. These aligners use an algorithm that does not penalize gaps to the same extent, so that RNA reads can be successfully aligned to a reference genome.

Connect to server and Download data

Before we start, make sure that you have the data for the workshop.

Steps:

## connect to server:
ssh my_cc_account@mp2.ccs.usherbrooke.ca

## download data IF YOU DONT HAVE A COPY:
#wget http://www.computationalgenomics.ca/tutorial/c3g_analysis_workshop/C3GAW_RNA_TestData_Aug2018.zip
#unzip C3GAW_RNA_TestData_Aug2018.zip
#cd C3GAW_RNA_TestData_Aug2018


Verify that you have the following directories and files:
- raw_data,
- readset.rnaseq.txt,
- design.rnaseq.txt.

Exercise 1: Setting up an RNA-Seq alignment

In this first exercise, you will prepare the input files required for aligning RNA-Seq data to a reference genome using the splice-aware aligner, STAR.

Step A: Prepare input .fastq files

The inputs for the alignment step are .fastq files, which include the reads outputted by the sequencer and quality metrics for each base. However, using the raw output of the sequencer can be problematic as some reads might include adapter sequences from the library preparation steps as well as low-quality bases that can affect the outcome of the analysis. Therefore, a trimming procedure to remove these problematic reads is required as a first step when preparing the input for an RNA alignment.

Think: How could an adapter sequence affect the outcome of an RNA-Seq alignment and why is it important to remove them?

Question 1

  1. Open the readset.rnaseq.txt file and check that the columns labeled Adapter1 and Adapter2 have the adapter sequences used to prepare your RNA-Seq libraries for sequencing. These sequences will be identified by the pipeline and removed from the reads that contain them, so it is important to verify that they are correct.

Solution (click here)

Command:

less readset.rnaseq.txt

To know which adapter sequences to use, check with the provider of your RNA-Seq data. Additionally, if you know which method was used to prepare the library, you can check the manufactuer’s website (for Illumina libraries, check this webstie)

Question 2

  1. Run steps 2-3 of the GenPipes RNA-seq pipeline to use Trimmomatic tool, which will remove low quality reads and adaptor sequences.

Solution (click here) Command:

rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
    -r readset.rnaseq.txt \
    -d design.rnaseq.txt \
    -j slurm \
    -o . \
    -s 2-3 > rnaseq_steps2-3.sh
    
less rnaseq_steps2-3.sh

Explanation: The solution to the exercise above will create a series of Trimmommatic jobs. For a more detailed explanation of what the code is doing, click here.

This command will generate the Trimmommatic jobs. To understand what these jobs are doing, we can look at a standard Trimmommatic command:

java -XX:ParallelGCThreads=1 -Xmx30G -jar $TRIMMOMATIC_JAR PE \
  -threads 30 \
  -phred33 \
  raw_data/rnaseq_H1ESC_chr19_Rep2_R1.fastq.gz \
  raw_data/rnaseq_H1ESC_chr19_Rep2_R2.fastq.gz \
  trim/H1ESC_Rep2/H1ESC_Rep2.trim.pair1.fastq.gz \
  trim/H1ESC_Rep2/H1ESC_Rep2.trim.single1.fastq.gz \
  trim/H1ESC_Rep2/H1ESC_Rep2.trim.pair2.fastq.gz \
  trim/H1ESC_Rep2/H1ESC_Rep2.trim.single2.fastq.gz \
  ILLUMINACLIP:trim/H1ESC_Rep2/H1ESC_Rep2.trim.adapters.fa:2:30:15:8:true \
  TRAILING:30 \
  MINLEN:32 \
  2> trim/H1ESC_Rep2/H1ESC_Rep2.trim.log

The command can be broken down into the following components:

  • java -XX:ParallelGCThreads=1 -Xmx30G -jar $TRIMMOMATIC_JAR : Instructions to run the java interpreter for Trimmommatic
  • PE: Paired end mode
  • raw_data/...fastq.gz: input files
  • trim/...trim...fastq.gz: name of trimmed output files
  • ILLUMINACLIP...: instructions to trim adapters
  • TRAILING: threshold quality to cut off bases at end of read
  • MINLEN: threshold length
  • 2> ...log: save output into log

You can modify parameters for Trimmommatic in the .ini files:

  • trailing_min_quality=30 : cuts bases off the end of a read, if below the specified threshold quality.
  • min_length=32 : drops the read if it is below the specified length

Try re-running these steps at home using different parameters to see how this affects the output.

Step B: Verify you are using the right genome version and indices

Aligners require indices to efficiently map reads to a reference. Indices are usually generated before any alignment takes place. Each assembly requires its own indices and different alignment programs use different types of indices. Unless you trust the source of an index, it is usually better to generate indices on your own, that way you can ensure that the index is appropriate for the species and genome version.

The data for this tutorial consists of paired-end RNA-seq libraries from the Homo sapiens cell lines H1ESC and GM12878 (read length: 75 bp). The genome version that will be used is the NCBI GRCh38. The data will only focus on comparing expression data from chromosome 19, so a special index has been generated with the appropriate data for that chromosome.

Question 1

  1. Change directory to where the reference genome is stored (Homo_sapiens.GRCh38.chr19/genome). Explore the contents of the directory. Where can you find a copy of the full sequence?

Solution (click here) The sequence that is used as reference is found as a FASTA file in the following path: $MUGQIC_INSTALL_HOME/genomes/C3G_workshop/Homo_sapiens.GRCh38.chr19/genome/Homo_sapiens.GRCh38.chr19.fa. To view the contents of this file, use the folowing command:

cd $MUGQIC_INSTALL_HOME/genomes/C3G_workshop/Homo_sapiens.GRCh38.chr19/genome

less Homo_sapiens.GRCh38.chr19.fa

Question 2

  1. To perform an alignment using STAR, you will require a STAR index. Where can you find the appropriate STAR index for this dataset?

Solution (click here) The index files are found in the directory: Homo_sapiens.GRCh38.chr19/genome/star_index/Ensembl87.sjdbOverhang99. These files are binary, so they can’t be read using the less command.

Question 3

  1. To use this reference genome with GenPipes, you will require an appopriate .ini file with the genome information. Where can you find this .ini file?

Solution (click here) The ini file can be found in the main directory you downloaded, and it is called: workshop.ini. To view the contents of this file, use the following command:

less workshop.ini

The important information, with regards to the reference genome, is contained in the following lines:

[DEFAULT]
assembly_dir = $MUGQIC_INSTALL_HOME/genomes/C3G_workshop/%(scientific_name)s.%(assembly)s
scientific_name=Homo_sapiens
assembly=GRCh38.chr19
assembly_synonyms=hg38
dbsnp_version=149
source=Ensembl
version=87

Exercise 2: Running an RNA-Seq alignment using STAR

STAR (Spliced Transcripts Alignment to a Reference) is a fast, splice-aware aligner that is used in several RNA-seq analysis pipelines. It is very verstile because it has several adjustable options that permit the user to fine-tune the mapping based on their needs. These options include thresholds on novel and canonical splice junctions, reporting or limiting multi-mapping reads, and options for gene fusions or chimeric alignments. This exercise will focus on running STAR within the GenPipes framework, which simplifies the process using sensible, pre-defined options that are adequate for most RNA-seq analyses.

Step A: Understanding the two-pass alignment procedure

The GenPipes RNA-Seq pipeline uses the two-pass STAR alignment procedure. For more information about this method, click here.

This option allows for the mapping to novel splice junctions across all samples, in addition to the canonical junctions available in the annotated reference. As the name indicates, the two-pass mapping procedures runs the alignment twice:

  • During the first mapping pass, the alignment is run with the usual parameters. Novel junctions detected in this step are saved and collected for all samples.
  • All the junctions from every sample are collected, merged, and added to the genome index.
  • A second mapping pass is run using all the junctions collected in the first pass, allowing for reads to map to novel junctions found in other samples.

This procedure does not increase the total amount of novel junctions detected. However, it increases sensitivity of those novel junctions, because a read can align to a junction discovered in a different sample.

Think: How is it possible that a read maps to a junction that was not discovered in the first pass?

Step B: Running STAR

The inputs to run STAR are the trimmed .fastq files generated in the previous trimming steps (2-3). The actual mapping consists of a single step (4) within the GenPipes RNA-seq pipeline.

Question 1

  1. Run step 4 of the GenPipes RNA-seq pipeline to use STAR to carry out the two-pass mapping. Make sure that the appropriate genome .ini file has been included in the command.

Solution (click here) Command:

rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
    -r readset.rnaseq.txt \
    -d design.rnaseq.txt \
    -j slurm \
    -o . \
    -s 4 > rnaseq_steps4.sh
    
less rnaseq_steps4.sh

Explanation: This command will generate the STAR jobs. To understand what these jobs are doing, click here.

First, we can look at the names of the jobs created:

  • The first four jobs are the “first-pass” mapping steps:

    star_align.1.GM12878_Rep1
    star_align.1.GM12878_Rep2
    star_align.1.H1ESC_Rep1
    star_align.1.H1ESC_Rep2
  • The next step is the step that merges the novel junctions and adds them to the index.

    star_index.AllSamples
  • The next four steps are the “second-pass” mapping steps. Notice how in these mapping steps the original reference genome has been replaced for --genomeDir reference.Merged.

    star_align.2.GM12878_Rep1
    star_align.2.GM12878_Rep2
    star_align.2.H1ESC_Rep1
    star_align.2.H1ESC_Rep2

    What other differences do you notice between the first and second-pass mapping steps? How do you think these differences affect the mapping?

  • A final step is used to generate useful metrics from the alignment.

    star_report

Step C: Verifying the outputs of STAR

Once the STAR mapping steps have finished running, a sorted BAM file will be created for each sample. This file is the basis for several of the downstream analysis steps, so it is important to know where it is saved and how it can be accessed.

Question 1

  1. Go to the main output folder and list the contents. Which directories contain the results of the STAR mapping?

Solution (click here)

The outputs of the first and second pass are contained under the folders alignment_1stPass and alignment, respectively. Because, the results that are of most interest for downstream analysis are the results of the second pass, from now on, we will only focus on the contents of the alignment directory.

To check the content of each folder:

ls alignment_1stPass
ls alignment

Question 2

  1. Which file is the main alignment output for sample GM12878_Rep1? How are the alignments sorted?

Solution (click here)

The main output for the alignment step is found in the following path: alignment/GM12878_Rep1/GM12878_Rep1.sorted.bam. The aligned reads in this file are sorted by coordinate.

To check:

ls alignment/GM12878_Rep1/
## to check sorting:
module load mugqic/samtools/1.4.1
samtools view -H alignment/GM12878_Rep1/GM12878_Rep1.sorted.bam
## look for "SO:" on the first line

Step D: Making alignments easier to handle

Depending on their purpose and algorithm, different tools require alignment files to be sorted in different ways. By default, the output of the STAR alignment sorts reads by the coordinate (position) of the alignment. However, some tools, including HT-Seq Count, require that reads are sorted by their name instead of their position. Similarly, some tools require duplicate reads to be marked so that they can be filtered out easily. These different requirements are accounted for by GenPipes, and steps 5-12 of the pipeline consist of tools that produce separate versions of the alignment (bam) file, using different methods to sort, filter and clip the reads. Most of these steps are carried out using the multi-purpose Picard tool.

Question 1

  1. Run steps 5-12 of the GenPipes RNA-seq pipeline. What are the names of these steps and what is each step doing?

Solution (click here) Command:

rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
    -r readset.rnaseq.txt \
    -d design.rnaseq.txt \
    -j slurm \
    -o . \
    -s 5-12 > rnaseq_steps5-12.sh
    
less rnaseq_steps5-12.sh

The names of each step are:

  1. picard_merger_sam_files: if a sample has more than 1 readset, all the alignment files are merged to form only one file for each sample.
  2. picard_sort_sam: sorts the sam/bam files by read name instead of position.
  3. picard_mark_duplicates: duplicate reads will be marked.
  4. picard_rna_metrics: quality control metrics are computed for all samples.
  5. estimate_ribosomal_rna: ribosomal RNA is estimated using alignments to an rRNA reference.
  6. bam_hard_clip: a hard-clipped version of the alignments is generated (where non-aligned bases are completely removed instead of just masked).
  7. rnaseqc: additional QC metrics are computed and saved under metrics/rnaseqRep
  8. wiggle: wiggle tracks are generated so that alignments can be quickly visualized in genome browsers

Exercise 3: Using the alignment files to assign reads to genes

One of the most important uses of alignment files is to generate raw counts for genes so that they can be compared to determine whether or not there is evidence of differential expression. GenPipes uses the counting tool HT-Seq to generate the raw counts matrix that will serve as input for the rest of the downstream analyses. This tool will count reads differently depending on the mode.

Click here to see a table summarizing the differences with how HT-Seq counts a read, depending on the mode.

HT-Seq Count Modes

Step A: Use HTSeq-Count to generate the raw counts file

Step 13 of the RNA-Seq pipeline uses HT-Seq count and the name-sorted alignment files to calculate the raw counts of every gene in every sample. By default, the mode is set to intersection-nonempty, but it can be changed if a different behavior is desired. However, before changing the mode, it is very important to understand how this could impact downstream analyses, especially differential expression.

Question 1

  1. Run step 13 of the pipeline to compute the raw counts using HT-Seq. Verify that the mode is set to intersection-nonempty.

Solution (click here) Command:

rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
    -r readset.rnaseq.txt \
    -d design.rnaseq.txt \
    -j slurm \
    -o . \
    -s 13 > rnaseq_steps13.sh
    
less rnaseq_steps13.sh

If HT-Seq is set up properly, the command for H1ESC_Rep1 should read as follows:

mkdir -p raw_counts && \
samtools view -F 4 \
  alignment/H1ESC_Rep1/H1ESC_Rep1.QueryNameSorted.bam | \
htseq-count - \
  -m intersection-nonempty \
  --stranded=reverse \
  --format=sam \
  /lb/project/mugqic/projects/rdali/C3G/projects/TestDatasets/Homo_sapiens.GRCh38.chr19/annotations/Homo_sapiens.GRCh38.chr19.Ensembl87.gtf \
  > raw_counts/H1ESC_Rep1.readcounts.csv

Notice how -m intersection-nonempty is defined.

Question 2

  1. Run step 14 of the pipeline to merge all the raw counts for all samples into a single matrix which can then be exported to different tools for differential analysis.

Solution (click here) Command:

rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
    -r readset.rnaseq.txt \
    -d design.rnaseq.txt \
    -j slurm \
    -o . \
    -s 14 > rnaseq_steps14.sh
    
less rnaseq_steps14.sh

This step will create four jobs:

  • metrics.matrix
  • metrics.wigzip
  • rpkm_saturation
  • raw_count_metrics_report

The merged raw counts matrix is found under DGE/rawCountMatrix.csv, you can check it using the following command:

less -S DGE/rawCountMatrix.csv