Visualization and summarization are the last steps of an RNA-seq analysis, but in many respects they are the most important, because they enable the interpretation of results to answer scientific questions. GenPipes generates an HTML report that facilitates the understanding and exploration of results. Additionally, users can load track files or alignment files directly into a genome browser to visualize their alignments.
**Note:** We will generate the GenPipes report on the server and then move it to our laptop for visualization.
Each scientific project is unique. Experiments consist of different samples and different designs, and they ultimately try to answer different questions. Therefore, the downstream analysis and interpretation of RNA-seq results can vary greatly between projects. This is why it is important to understand the outputs of any standard RNA-seq analysis, and to tie those results to the respective research objective. GenPipes tries to facilitate this process by generating a report in which results are grouped by step and summarized using tables and figures. This HTML report is a first step to understanding the output of an analysis, and can help guide the direction of additional downstream analyses by providing convenient links and descriptions.
GenPipes does not generate a report every time the pipeline is run, to avoid reporting incomplete or intermediate results. To generate a report, a user must manually run the pipeline script with the --report flag, which instructs GenPipes to search through the output files and produce a report. Afterwards, the report can be downloaded and opened with an internet browser to explore it interactively. This exercise will guide users through the process of generating and opening a report using GenPipes.
Reports can be generated for one or more steps using the same command used to launch a GenPipes analysis, with the addition of the --report flag.
Solution (click here)
Command:
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-j slurm \
-o . \
--report \
-s 1-23 > report_steps1-23.sh
bash report_steps1-23.sh
Check that the report was generated by going into the report directory and verifying that index.html exists.
Solution (click here)
Command:
cd report
ls
Once index.html can be found inside the directory, download the full report folder to your computer so you can open it with your browser.
Solution (click here)
Use CyberDuck to download the results to your local computer. Alternatively, you can download them from the server using Rsync:
Command:
# From the terminal on your laptop, run the command below AFTER replacing <my_cc_account> with your user name (it appears twice).
# General form: rsync -rltv <USERNAME>@<SERVER_ADDRESS>:<PATH/TO/REPORT> LOCAL/PATH
rsync -rltv <my_cc_account>@mp2.ccs.usherbrooke.ca:/home/<my_cc_account>/RNAseq_workshop/RNAseq_TestData/report .
Make sure you are downloading the full report directory, and not just index.html, otherwise the report will not display properly.
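One way to confirm that the whole directory arrived, rather than index.html alone, is a small check run on your laptop. The sketch below is only an illustration: the function name report_is_complete is hypothetical, and it assumes that a complete report contains at least one file or folder next to index.html.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the directory holds a non-empty index.html
# plus at least one other entry (assumed to be the report's supporting files).
report_is_complete() {
    local report_dir="$1"
    # index.html must exist and be non-empty
    [ -s "$report_dir/index.html" ] || return 1
    # look for any other file or folder directly inside the report directory
    local others
    others=$(find "$report_dir" -mindepth 1 -maxdepth 1 ! -name index.html | head -n 1)
    [ -n "$others" ]
}

# Usage, from the directory where you ran rsync:
#   report_is_complete ./report && echo "report looks complete"
```

If the check fails, repeat the download and make sure the source path points at the report directory itself.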
Once you have saved the report on your computer, make sure you can open it with your internet browser.
Double-click the index.html file and wait for it to open in your internet browser. Once it has opened, read through the titles of each section and try to understand what information each section contains.
Solution (click here)
If the report was properly generated, it will have the following section titles:
Read the descriptions of each section to make sure you understand what it contains. Try to download the full tables for each section by clicking on the link labelled: download full table. Open the table using a spreadsheet tool (such as Excel) or using a text editor.
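If you prefer the terminal to a spreadsheet tool, a downloaded table can also be inspected from the command line. The following is a minimal sketch that assumes the table is tab-separated text with a single header line; the function name table_shape and the file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the size of a tab-separated table.
# Assumes one header line and tab-delimited columns.
table_shape() {
    local table_file="$1"
    awk -F'\t' '
        NR == 1 { cols = NF }    # column count taken from the header line
        END { printf "%d data rows, %d columns\n", (NR > 0 ? NR - 1 : 0), cols }
    ' "$table_file"
}

# Usage (the file name is an example only):
#   table_shape my_table.tsv
#   head -n 5 my_table.tsv    # print the header and the first four rows
```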
GenPipes generates several plots for data exploration and quality assurance. It is important to understand what these plots contain and what they look like.
Open the index.html file with your internet browser and go to the section labelled Exploratory analysis. What figures are contained in this section? Open each figure and make sure you understand what it depicts.
Hierarchical clustering based on the correlation distance, log2(CPM) (click here)
This figure shows how the samples cluster based on their gene expression levels (in log2CPM). We expect to see samples from the same experimental group clustering next to each other, instead of clustering with samples from different groups.
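To make the log2CPM unit concrete, the sketch below converts a small table of raw counts into log2(CPM + 1), where CPM is the count divided by the sample's total count and multiplied by one million. This is an illustration only: the input layout (tab-separated, one header line, gene names in the first column), the pseudo-count of 1, and the function name log2cpm are assumptions, and GenPipes may compute its values differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn raw counts into log2(CPM + 1), per sample column.
# Assumes a tab-separated table with one header line and gene names in column 1,
# and that every sample has a non-zero total count.
log2cpm() {
    local counts_file="$1"
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                   # first pass: sum each sample column
            if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i
            next
        }
        FNR == 1 { print; next }      # second pass: keep the header unchanged
        {
            for (i = 2; i <= NF; i++)
                $i = sprintf("%.2f", log($i / total[i] * 1e6 + 1) / log(2))
            print
        }
    ' "$counts_file" "$counts_file"   # the file is read twice, once per pass
}

# Usage (the file name is an example only):
#   log2cpm raw_counts.tsv > log2cpm.tsv
```

Scaling by the library size makes samples with different sequencing depths comparable, and the log transform keeps a few highly expressed genes from dominating the distances used for clustering.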
PCA (first two components) of the gene log2CPM values (click here)
A PCA plot shows the projection of the samples into a two-dimensional space defined by their first and second principal components. These two components are usually enough to differentiate groups of samples, based on the experimental design.
Heat map of most varying genes and most varying transcripts (click here)
To get a rough overview of the genes and transcripts that are contributing the most to the samples’ clustering pattern, heatmaps of the most variable genes and transcripts are created. Please note that these genes have not been selected to reflect any particular gene group or pathway, and therefore the biological relevance of these heatmaps is limited. They are mostly used to make sure that there are no biases underlying the final results.
All figures (click here)
This zip file contains all the figures discussed above, as well as additional figures that are mostly variations of the ones previously mentioned. They are not always useful, but if a surprising result is detected in the previous 4 plots, it is useful to open all the additional figures to gain additional insight.
For this exercise, you will require the IGV Genome Browser software. If you have not done so, please download and install it now before continuing.
The Integrative Genomics Viewer (IGV) is a desktop genome browser with many powerful tools that help visualize and analyze genomics data. It comes with a lot of pre-loaded data, including several reference genomes for model organisms, as well as pre-defined parameters for common SAM/BAM flags. Because of its graphical user interface, the best way to learn IGV is to explore data interactively and search through the menus and options.
Solution (click here)
From the alignment folder, choose a sample. Use CyberDuck to download the alignment to your local computer desktop. Remember to download both the alignment (sorted.mdup.bam) and index (.bai) files for that sample. Repeat for as many samples as you want to visualize.
You can also use Rsync to do the same thing, using the following command as an example (run it from your local computer):
cd ~/Desktop
rsync -rltv <USERNAME>@<SERVER_ADDRESS>:<PATH/TO/PROJECT>/alignment/GM12878_Rep1/GM12878_Rep1.sorted.mdup.ba* .
Change the reference genome to GRCh38 and open the alignment file you just downloaded to your desktop.
Solution (click here)
To change the reference genome, select “Human (hg38)” from the drop-down menu in the top left corner of the toolbar. Then, from the second menu to the right, select chr19 to visualize only chromosome 19 (see the figure below).
Once the appropriate reference has been selected, open the alignment file by clicking File > Load from File... and then selecting GM12878_Rep1.sorted.mdup.bam from the desktop.
Remember that in order to open an alignment file in IGV, an index file (.bai) must always be available in the same directory.
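Before opening IGV, it can help to confirm that every downloaded alignment has its index next to it. The sketch below is an illustration under stated assumptions: the function name missing_bam_indexes is hypothetical, and it accepts an index named either sample.bai or sample.bam.bai.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list BAM files in a directory that have no index beside them.
# Accepts either naming style: sample.bai or sample.bam.bai.
missing_bam_indexes() {
    local dir="$1" bam
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue    # the glob matched nothing: no BAM files here
        if [ ! -e "${bam%.bam}.bai" ] && [ ! -e "$bam.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}

# Usage: print any alignment on the desktop that is missing its index
#   missing_bam_indexes ~/Desktop
```

An empty output means every alignment in the directory has an index and is ready to load.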
Solution (click here)
The figure below shows an example of how the data is displayed once the alignment files are open:
The following are useful exercises to practice using IGV to visualize alignment data:
The UCSC Genome Browser is a popular web-based genome browser that is useful for exploring tracks of data and comparing them with one or several curated annotations. Its main strength is that it comes pre-loaded with many tracks, containing information that ranges from genes and transcripts to epigenomic marks, which makes it easy to see overlaps with the transcriptomic data. Unfortunately, it also requires you to host your own files in order to load them into the browser. This falls outside the scope of this workshop, but should you have hosting space available and be interested in using this browser, the instructions on how to load GenPipes results can be found below:
Instructions to open BigWig files with UCSC Genome Browser (click here)
Follow this link to the Genome Browser Gateway: https://genome.ucsc.edu/cgi-bin/hgGateway. The page that opens should look as follows:
From the drop-down menu at the top, select the human assembly labeled Dec. 2013 (GRCh38/hg38), then press Go. The Genome Browser should open and produce a view similar to this one:
Use CyberDuck to download tracks from the main results directory as a zipped file called tracks.zip.
Decompress the tracks.zip file and open the directory inside called BigWig. Make sure you have two .bw files (one forward and one reverse) for each sample.
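With many samples, counting the files by eye is error-prone, so a quick tally can help. The sketch below assumes that the strand appears in each file name as the word forward or reverse; that naming pattern and the function name count_bigwig_strands are assumptions, so adjust the patterns to match your files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count forward and reverse BigWig files in a directory.
# Assumes the file names contain the word "forward" or "reverse".
count_bigwig_strands() {
    local dir="$1" fwd rev
    # $(( ... )) strips the padding that some versions of wc add to their output
    fwd=$(( $(find "$dir" -maxdepth 1 -name '*forward*.bw' | wc -l) ))
    rev=$(( $(find "$dir" -maxdepth 1 -name '*reverse*.bw' | wc -l) ))
    printf 'forward=%d reverse=%d\n' "$fwd" "$rev"
    # succeed only when both strands have the same number of files
    [ "$fwd" -eq "$rev" ]
}

# Usage: count_bigwig_strands tracks/BigWig
```

Both counts should equal the number of samples in your project.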
To use .bw files with the UCSC Genome Browser, you need to host them somewhere that is accessible via URL. This is outside the scope of this workshop, so follow the instructions provided by your hosting service.
From the Genome Main View, click on the add custom tracks button. This will open a window that looks like this:
In that window, paste the URL links to the .bw files you just uploaded. Make sure to include all the files and then click Submit. Wait until the files finish loading, and then return to the Genome Browser to visualize your data.
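Instead of pasting bare URLs, the custom track box also accepts one track line per file, which lets you give each track a readable name. The sketch below generates such lines using the UCSC track type=bigWig and bigDataUrl settings. The function name make_track_lines and the example.org address are hypothetical; replace the address with the location provided by your hosting service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one UCSC custom track line per BigWig file.
# First argument: base URL where the files are hosted (an assumption, set your own).
# Remaining arguments: the .bw files, given as local paths or plain file names.
make_track_lines() {
    local base_url="${1%/}"    # drop a trailing slash, if any
    shift
    local bw name
    for bw in "$@"; do
        name=$(basename "$bw" .bw)    # track label: the file name without .bw
        printf 'track type=bigWig name="%s" bigDataUrl=%s/%s\n' \
            "$name" "$base_url" "$(basename "$bw")"
    done
}

# Usage (the URL is a placeholder):
#   make_track_lines https://example.org/tracks BigWig/*.bw
```

Copy the printed lines into the custom track box and click Submit as described above.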