Before we start, make sure that you have the data for the workshop.
Verify that you have the following:
-the bam files in the raw_data
folder,
-the readset file readset.rnaseq.txt
,
-the design file design.rnaseq.txt
.
You should have generated the BAM files with STAR already.
Connect to mammouth with your credential.
Connect
ssh <my_cc_account>@mp2.ccs.usherbrooke.ca
This section covers the gene and transcript level differential expression analysis with GenPipes for RNA-seq data.
Remember, you can always refer to the GenPipes help page at https://bitbucket.org/mugqic/genpipes/src/master/pipelines/rnaseq/ for how to use the pipeline.
As a reminder, the gene level analysis is performed by the following steps:
13- raw_counts
14- raw_counts_metrics
22- differential_expression
In this practical we will focus mostly on step 22.
The transcript level analysis is performed by cufflinks, which is broken down into several steps:
15- cufflinks
16- cuffmerge
17- cuffquant
18- cuffdiff
19- cuffnorm
This practical will focus mostly on gene level analysis and further downstream analyses. We will briefly go over the transcript level analysis, if we have time. You can explore the bonus part at home when you have time.
At the end of this practical, you should:
You should already be familiar with GenPipes by now. In this step, we are going to generate the bash commands for the gene and transcript level analysis. We will have a look at them in a moment.
Generate the commands for the gene and transcript level analysis (but do not send it to server).
Solution (click here)
rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-s 13-19,22 \
-o . \
-j slurm > run_step13-19,22.sh
#bash run_step13-14,22.sh (don't run it right now)
Step:
13- raw_counts
14- raw_counts_metrics
This step is a reminder of what you’ve seen earlier.
GenPipes relies on HTSeq-counts to assign reads to genes. It takes the aligned BAM files as input and counts how many reads fall into genomic regions defined in the GTF file.
You can read the documentation at http://htseq.readthedocs.io/en/master/count.html. You can also access the help menu of HTSeq at any time by typing htseq --help
.
Example of HTSeq-count command (click here)
samtools view -F 4 \
alignment/GM12878_Rep1/GM12878_Rep1.QueryNameSorted.bam | \
htseq-count - \
-m intersection-nonempty \
--stranded=reverse \
--format=sam \
Ref_hg19_chr19/annotations/Homo_sapiens.hg19.chr19.UCSC2009-03-08.gtf \
> raw_counts/GM12878_Rep1.readcounts.csv
The raw_counts_metrics
step, among other things, aggregates the individual raw count files into a single, multi-column file. This file is located in DGE/rawCountMatrix.csv
. This is the file that will be given to EdgeR and DESeq for the differential expression analysis.
Step:
22- differential_expression
The differential expression at the gene level is performed by two R packages: edgeR and DESeq. They mostly do the same analysis but use different algorithms, leading to slightly different results. Some users have a preference for one of the two packages.
EdgeR and DESeq require a matrix of raw counts as input. In our case, we will give them the rawCountMatrix.csv
file generated by HTSeq-count. Do not use normalized counts (such as FPKM)!
The normalization is taken care of by the tools themselves. Both perform library size and composition bias normalization internally and are reported in the output files.
EdgeR and DESeq will perform the differential expression analysis for each contrast (i.e. comparison) specified in the design file. Only pairwise comparisons are supported by GenPipes though edgeR and DESeq allow for more complex designs.
Each tool will output a file containing the results of the differential expression analysis for each contrast. The files contain the pvalues and the FDRs. Both files are then merged together and the raw and normalized expression (from edgeR) added.
The output files are located in the DGE folder. Each subfolder corresponds to a contrast from the design file. The combined file (with edgeR and DESeq results) is called dge_results.csv
. This is the primary result file.
Explore the content of the
dge_results.csv
file.
Solution - dge_results.csv (click here)
less H1ESC_GM12787/dge_results.csv
Looking at the results of differential expression analysis can be a bit overwhelming at first. Thankfully, there are ways to reduce the results to a more manageable size.
This can be done in Excel or Google Spreadsheet for instance. You can also use R if you are familiar with it (not covered here).
You would need to download the file on your local machine. They are several options to transfer the data to your local computer. Here, we are going to use CyberDuck. You can also use the rsync
command that allows us to transfer files between the server and our local computer.
rsync (click here)
To download data from the server to your local machine:
rsync -rltv USERNAME@mp2.ccs.usherbrooke.ca:PATH/TO/FILE LOCAL/PATH/TO/FILE
USERNAME@SERVER_ADDRESS
is what you used to connect to the server (with ssh
).
To send data from your computer to the server:
rsync -rltv LOCAL/PATH/TO/FILE USERNAME@mp2.ccs.usherbrooke.ca:PATH/ON/SERVER
Download the results file of the gene level differential expression analysis on your computer.
rsync (click here)
Type this from your laptop terminal:
rsync -rtlv USERNAME@mp2.ccs.usherbrooke.ca:PATH_TO_YOUR_PROJECT/dge_results.csv .
How to select differentially expressed genes is not well defined. The convention is to define DGEs as genes having a logFC>1.5 and a FDR<0.05. But this is arbitrary and depends of your experiment.
You might want to relax the logFC if you don’t expect to see a large change in your data for instance. If the experiment involve a specific process, it could help to look where those genes rank to try to capture as many as possible without adding too much noise.
The goal is to obtain a workable list of DGEs without being overly stringent or overly relaxed. Ideally, the thresholds should be adjusted to obtain a reasonable gene list.
Here we are going to define DGE as having an absolute logFC>2 and an FDR<5%.
Load your data in Excel or Google Spreadsheet.
Filter the genes with FDR<5%.
PS: if you have issues with the formating in Excel, click on ‘Data’, then ‘Text to columns’ and press ‘next’ then ‘finish’.
Solution - FDR<0.05 (click here)
This is fairly simple:
click on ‘Data’ in the menu, then click on ‘Filter’ (‘Create a filter’ in Google Spreadsheet).
This will add a filter option in each column. This is symbolized by the arrow next to the column name.
You can now add a filter to the column of interest. In our case, we would like to filter based on the FDR value. This corresponds to the deseq.padj
(or edger.padj
, your choice).
Click on the arrow next to the column name then choose ‘filter by condition’ for Google Spreadsheet, then ‘less than or equal to’ and enter 0.05.
Add an extra filter for genes with an absolute logFC>2.
Solution - abs(logFC)>2 (click here)
This is similar to the previous step.
In Google Spreadsheet, you can directly exclude values that are within a given range.
Click on the arrow next to the logFC column then choose ‘filter by condition’, then ‘is not between’ option. Enter the values -2 and 2.
It’s a bit different in Excel as we need to specify two filters for the logFC column.
As before, click on the arrow next to the logFC column. Choose ‘Greater than or equal to’ and enter 2. Once you press Enter, you should see a option to add another filter. Click on the ‘or’ button. Add a filter ‘Less than or equal to’ -2.
How many genes have been selected?
You can also filter on the logCPM if you want.
Keep this file open for the next step.
There are several tools to perform gene ontology analysis. In this practical, we are going to learn how to use GOrilla.
GOrilla is a tool for identifying and visualizing enriched GO terms in ranked lists of genes. It takes a list of genes as input and returns a table of enriched GO categories, as well as a GO relationship graph.
For RNAseq data, we will use the “two unranked lists of genes” mode. This compares a list of DGE to the background (i.e. the whole list of human genes) in order to find enriched GO terms.
In our case, the list of DGE (FDR<5%) is quite large but still manageable. Usually, try to keep it under 1000 genes.
Keep the dge_results.csv
file open with the DGE that passed the thresholds you defined in the last step, we will need it soon.
You will also need the list of human gene. This file is located in the Extra
folder and is called Homo_sapiens.GRCh38.genes.txt
.
Open your web browser and go to http://cbl-gorilla.cs.technion.ac.il/. You should have a page like this:
Select the ‘Two unranked lists of genes’ mode.
Copy and paste the DGE genes from thedge_results.csv
file into the target set.
Select theHomo_sapiens.GRCh38.genes.txt
file for the background set.
Ensure that the ontology selected is ‘process’.
In the advanced parameters, set a p-value threshold of 10^-5.
Click ‘Search for enriched terms’ to submit the analysis.
GOrilla results (click here)
If you have time, you can try similar tools such as:
-DAVID: https://david.ncifcrf.gov/
-PANTHER: http://www.pantherdb.org/
-Enrichr: http://amp.pharm.mssm.edu/Enrichr
For you to explore at home when you have the time.
In this part, we will have a look at the analysis at the transcript level with cufflinks.
Cufflinks is a suite that performs:
-transcriptome assembly with cufflinks,
-the generation of a meta-transcriptome with cuffmerge,
-the transcript quantification with cuffquant,
-some extra normalization for follow-up analyses with cuffnorm.
Cufflinks is the main step of the pipeline. Its main role it to perform the transcriptome assembly i.e. identifying the different isoforms (known and novel) present in the data.
For this, cufflinks relies on the gtf file to identify known isoforms. It is also capable of finding novel transcripts. It is worth noting that this step also performs the transcript quantification.
Cufflinks takes the gtf and the aligned bam files as input and produces:
-the transcriptome assembly
-a first pass of the transcript quantification. The results are the isoforms.fpkm_tracking
and transcripts.gtf
in the cufflinks
folder.
You can have a look at the
isoforms.fpkm_tracking
(warning: this is a very large file. Use theless
command and do not try to open it in Excel). Notice the tracking_id, the reference_id (i.e. nearest gene) and the coverage.
Explore thetranscripts.gtf
file. Each row corresponds to a transcript. Known isoforms have a gene_id associated to them (e.g. ENSG00000281379) while novel isoforms are assigned to an arbitrary id (e.g. CUFF.1).
The next step, cuffmerge takes all the individual gtfs in order to generate a high quality transcriptome. This helps to recovers some low expressed transcripts that might not have been found in every sample.
This merged transcriptome merged.gtf
can be found in the cufflinks
folder, in the AllSamples
subfolder.
Finally, cuffquant uses the merged.gtf
file to perform a second pass of transcript quantification. Since the same gtf is used for every samples, it makes it fairly easy to compare the results afterwards. The results are the abundances.cxb
files in the cufflinks
folder. Unfortunately, they are not text files and are to be used exclusively with cuffdiff.
While not part of the analysis, cuffnorm can be useful nonetheless. What it does is normalize the transcript expression to be used for downstream analysis. For instance, you could obtain the FPKM from the cuffnorm/isoforms.fpkm_table
file. This file doesn’t contain the transcript or associated gene name so you would need to look for the tracking id in the isoforms.attr_table
file.
Finally, cuffdiff performs the differential expression analysis. It does so by comparing samples in two different groups as defined in the design file. Similarly to the DEA at the gene level, cuffdiff performs a test for each contrast, creating a sub folder in the cuffdiff
directory.
Cuffdiff takes the abundances.cxb
files produced by cuffquant and the design file as input. Cuffdiff is a complex tool that generates a lot of results. The one you are most likely interested in is the isoform_exp.diff
file in the cuffdiff
folder. The other .diff
files contain the results for other types of test (e.g. codons).
Each row of this file corresponds to a transcript. Once again, you would need to refer to the isoforms.fpkm_tracking
in the same directory to identify the transcript names and obtain more information about the transcripts.
The last 3 columns of the isoform_exp.diff
correspond to the p-value obtained from the differential expression analysis, the q-value (another name for FDR) and if the statistical test is significant or not (FDR<0.05).
Explore the
isoform_exp.diff
withless
.