Overview:

As mentioned in the lecture, GenPipes needs 3 files in order to run:

1- The Configuration, or ini file, which is provided with GenPipes and contains information about computational/tool resources.
2- The Readset file which contains information about the samples. You need to create it for your samples.
3- The Design file which contains information about sample comparisons. You need to create it for your samples.

In this practical, we will create the readset and design files and validate them. We will also launch GenPipes to analyze our samples.

Sample overview:

We will be working with test datasets for 2 human cell lines from ENCODE:
GM12878, a B-Lymphocyte cell line, and H1 hESC, embryonic stem cells.
Both cell lines have normal karyotypes.
We will try to identify stem cell associated genes by contrasting gene expression from the two cell lines.
The dataset is limited to chromosome 19 and will be aligned to the GRCh38 human genome build, to allow it to run quickly so that we can look at the full process.

Resources Needed: If you follow the instructions below to run the analysis on chr19 of the GRCh38 genome, the test run will need 7GB of total storage space and will run serially in about 11 hours (will run shorter on the worker nodes) and a maximum RAM of 30 GB.

Creating the Readset file:

The readset file is a tab separated file that contains the following information:

Sample	Readset	Library	RunType	Run Lane	Adapter1	Adapter2	QualityOffset	BED	FASTQ1	FASTQ2	BAM

Sample/Reaset: The “Sample” and “Readset” are the names of your sample. The difference is that “Readset” is the name of the technical replicate while the “Sample” is the name of the biological replicate. The Readsets with the same Sample name will be merged together into a single Bam file. If there are no replicates to be merged, then the Sample and the Readset names can be the same.
Library: is the library identifier. optional
RunType: is whether your library is paired or not. Must be: PAIRED_END or SINGLE_END.
Run: is the run number on the sequencer. optional
Lane: is the lane number on the sequencer. optional
Adapter1: is the sequence of the forward trimming adapter.
Adapter2: is the sequence of the reverse trimming adapter.
QualityOffset: quality score offset integer used for trimming. optional
BED: relative or absolute path to BED file; useful for capture techniques. optional
FASTQ1: relative or absolute path to first FASTQ file for paired-end readset or single FASTQ file for single-end readset; mandatory if BAM value is missing.
FASTQ2: relative or absolute path to second FASTQ file for paired-end readset; mandatory if RunType value is “PAIRED_END” and BAM is missing.
BAM: relative or absolute path to BAM file which will be converted into FASTQ files if they are not available; mandatory if FASTQ1 value is missing, ignored otherwise.

Note on Sample Vs Readset (click here)

As mentioned above, Sample name and Readset name allow GenPipes to merge replicates into a single sample. All Readsets with the same Sample name will be merged.

In the example below, we have 4 different readsets which will result in 3 samples: GM12878_Rep1, GM12878_Rep2 and H1ESC.

Sample	Readset	Library	RunType	Run	Lane	Adapter1	Adapter2	QualityOffset	FASTQ1	FASTQ2
GM12878_Rep1	GM12878_Rep1	myLibrary1	PAIRED_END	1	1	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_GM12878_chr19_Rep1_R1.fastq.gz	raw_data/rnaseq_GM12878_chr19_Rep1_R2.fastq.gz
GM12878_Rep2	GM12878_Rep2	myLibrary2	PAIRED_END	1	1	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_GM12878_chr19_Rep2_R1.fastq.gz	raw_data/rnaseq_GM12878_chr19_Rep2_R2.fastq.gz
H1ESC	H1ESC_Rep1	myLibrary3	PAIRED_END	1	1	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_H1ESC_chr19_Rep1_R1.fastq.gz	raw_data/rnaseq_H1ESC_chr19_Rep1_R2.fastq.gz
H1ESC	H1ESC_Rep2	myLibrary4	PAIRED_END	1	1	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_H1ESC_chr19_Rep2_R1.fastq.gz	raw_data/rnaseq_H1ESC_chr19_Rep2_R2.fastq.gz

Note on Adapter Sequences (click here)

Adapter sequences are part of the kit used to prepare the library that was sequenced. This is usually found in the kit manual. Some of the most popular used kits can be found on this page.

If you are not 100% sure what the Adapter sequences are, don’t panic. You can use the TruSeq sequences:

Adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
Adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Trimmomatic does a good job looking for adapters.

This is the information provided about the two samples we have. We have 2 replicates per cell line.

Name	Alias	Library Barcode	Run Type	Library Type	Type of Sequencing	Run Concentration (pM)	Concentration Method	Multiplex Key(s)	Adaptor	Adaptor Read 1	Adaptor Read 2	Run	Quality Offset	Region	Total Region	Genomic Database
GM12878_Rep1	GM12878_Rep1	myLibrary	PAIRED_END	RNA-Seq	Illumina HiSeq 4000 SR50	200	qPCR	Index_6	Illumina TruSeq LT	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	4462	33	4	8	Homo_sapiens
GM12878_Rep2	GM12878_Rep2	myLibrary	PAIRED_END	RNA-Seq	Illumina HiSeq 4000 SR50	200	qPCR	Index_2	Illumina TruSeq LT	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	4462	33	4	8	Homo_sapiens
H1ESC_Rep1	H1ESC_Rep1	myLibrary2	PAIRED_END	RNA-Seq	Illumina HiSeq 4000 SR51	200	qPCR	Index_6	Illumina TruSeq LT	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	4462	33	4	8	Homo_sapiens
H1ESC_Rep2	H1ESC_Rep2	myLibrary2	PAIRED_END	RNA-Seq	Illumina HiSeq 4000 SR52	200	qPCR	Index_2	Illumina TruSeq LT	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	4462	33	4	8	Homo_sapiens

Above is an example of the meta data provided by Genome Quebec in a CSV file through NANUQ.

Assume you have 4 samples, each with 2 fastq files, R1 and R2, that are in the folder “raw_data”, with the following file names:

rnaseq_GM12878_chr19_Rep1_R1.fastq.gz
rnaseq_GM12878_chr19_Rep1_R2.fastq.gz
rnaseq_GM12878_chr19_Rep2_R1.fastq.gz
rnaseq_GM12878_chr19_Rep2_R2.fastq.gz

rnaseq_H1ESC_chr19_Rep1_R1.fastq.gz
rnaseq_H1ESC_chr19_Rep1_R2.fastq.gz
rnaseq_H1ESC_chr19_Rep2_R1.fastq.gz
rnaseq_H1ESC_chr19_Rep2_R2.fastq.gz

How would you create a Readset file for these samples?

Creating Readset: Manually (click here)

Method 1: Manual in Excel

We will create the Readset file in Excel on our laptops and then move it to the server.

1- Open an Excel sheet.
2- Add the following column headers on the first row:

Sample	Readset	Library	RunType	Run	Lane	Adapter1	Adapter2	QualityOffset	BED	FASTQ1	FASTQ2

3- Fill in the information for the 4 samples based on the data provided about the samples in the table above.
4- Save the file as a “TAB DELIMITED TEXT”.

Creating Readset: csvToreadset.R (click here)

Method 2: csvToreadset.R

Note: To use this method, your data must have been sequenced at the McGill genome center and available in nanuq.

1- Download the csv file that contains the meta data of your experiment from nanuq if your data was sequenced at the Genome Centre.

csv From Nanuq

Then use the csvToreadset.R script to create the Readset file, using the following command in the terminal:

Rscript csvToreadset.R nanuq.csv ReadsetOutputName [fastq|bam] data_path

For example:

Rscript csvToreadset.R nanuq.csv myReadset.tsv fastq raw_data

Where [fastq|bam] is the format of the data you downloaded. data_path is the path you will store the raw data for the analysis.

There is a nanuq_example.csv file in your Extra folder, which you can experiment with.

Rscript Extra/csvToreadset.R Extra/nanuq_example.csv myReadset.tsv fastq raw_data

Creating Readset: Solution (click here)

The file should look something like this:

Sample	Readset	Library	RunType	Run	Lane	Adapter1	Adapter2	QualityOffset	FASTQ1	FASTQ2
GM12878_Rep1	GM12878_Rep1	myLibrary	PAIRED_END	4462	4	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_GM12878_chr19_Rep1_R1.fastq.gz	raw_data/rnaseq_GM12878_chr19_Rep1_R2.fastq.gz
GM12878_Rep2	GM12878_Rep2	myLibrary	PAIRED_END	4462	4	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_GM12878_chr19_Rep2_R1.fastq.gz	raw_data/rnaseq_GM12878_chr19_Rep2_R2.fastq.gz
H1ESC_Rep1	H1ESC_Rep1	myLibrary2	PAIRED_END	4462	4	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_H1ESC_chr19_Rep1_R1.fastq.gz	raw_data/rnaseq_H1ESC_chr19_Rep1_R2.fastq.gz
H1ESC_Rep2	H1ESC_Rep2	myLibrary2	PAIRED_END	4462	4	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	33	raw_data/rnaseq_H1ESC_chr19_Rep2_R1.fastq.gz	raw_data/rnaseq_H1ESC_chr19_Rep2_R2.fastq.gz

How would you validate that the Readset file has no errors?

Readset Validation (click here)

You can use GenPipes’ file validator,mugqicValidator.py, to do so.

# command syntax:
python mugqicValidator.py -r myReadset.tsv

If there are errors, mugqicValidator.py will tell you more about the error. Otherwise, it will print something like this:

        ---------------------------------------------------------------------------

                                Your file has passed the check!                       
        ---------------------------------------------------------------------------

To test mugqicValidator on your computer:

python Extra/mugqicValidator.py -r myReadset.txt

If you are using mugqicValidator.py on the server, you can use:

python $MUGQIC_PIPELINES_HOME/utils/mugqicValidator.py -r myReadset.txt

Note: Remember that if you created your readset file in Excel or on windows to run dos2unix myReadset.txt
Note: Remember to move the readset file to the server using CyberDuck or the commands below.

How would you move your Readset file to the server?

Moving File to Server (click here)

You can use rsync or scp to move the file from your computer to the server, or a graphic sftp client like Cyberduck, FileZilla, Fugu, Putty or many others.

# rsync -avPL File account@server:path
rsync -avPL myReadset.tsv my_cc_account@mp2.ccs.usherbrooke.ca:~/RNAseq_workshop

#OR
#scp File account@server:path
scp myReadset.txt my_cc_account@mp2.ccs.usherbrooke.ca:~/RNAseq_workshop

Common Readset file errors?

Common Errors (click here)

1- Be careful not to add trailing/extra spaces at the end of text. It can cause parsing issues sometimes.
2- Be careful not to type in the wrong header. The header must be exactly as seen above. Capital letters, spaces or hidden characters can cause errors.
3- New line characters:
Windows, MacOS and Linux use different hidden characters to denote a newline.
This can cause errors. If you create your Readset/Design file in Excel (Windows software) and need to run it on a server (Linux), you need to transform the newlines, as follows:

dos2unix myfile

or in vim, open the file:

vim myReadset.tsv

Press escape and type in the command below and press enter.

:%s/\r/\r/g

Then exit using :wq

For more ways to change Windows newlines to Unix, check out this page.

Creating the Design file:

The Design file has infomation about sample comparisons. Here, we will run one comparison between H1 hESC cell line and GM12878.

The Design file has the following format:

Sample	Contrast1
Sample1	1
Sample2	2

This means that Sample2 is the treatment, while Sample1 is the control.

Note: the sample names should be written exactly as in the “Sample” column of the Readset file.

How would you create the Design file?

Creating the Design File: Manually (click here)

We will create the Design file in Excel on our laptops, as well then move it to the server.

1- Open an Excel sheet.
2- Add the following column headers on the first row: Sample, GM12787_H1ESC
3- Under Sample, add the 4 samples we are analyzing: H1ESC_Rep1, H1ESC_Rep2, GM12878_Rep1, GM12878_Rep1
4- Under GM12787_H1ESC, add 1 if the sample should be used as a Control (H1ESC) and 2 if it is a Treatment (GM12878).
5- Save the Design file as a “TAB DELIMITED FILE”.

Creating the Design File: Solution (click here)

It will look something like this:

Sample	GM12787_H1ESC
H1ESC_Rep1	1
H1ESC_Rep2	1
GM12878_Rep1	2
GM12878_Rep2	2

Note: You can call “GM12787_H1ESC” anything you would like.

The Sample names have to match the Sample column in the Readset file.

1 = Control

2 = Treatment

0 = Not part of comparison

How would you validate the Design file?

Design File Validation (click here)

Similar to the Readset file, you can use GenPipes’ file validator,mugqicValidator.py, to do so.

python mugqicValidator.py -d myDesign.txt

If validating the file on the server, you can use:

python $MUGQIC_PIPELINES_HOME/utils/mugqicValidator.py -d myDesign.txt

Note: Remember that if you created your design file in Excel or on windows to run dos2unix myDesignFile.txt
Note: Remember to move the designFile file to the server using CyberDuck or the commands below.

How would you move the Design file to a server?

Moving File to Server (click here)

Same as the readset; see above

Learning How to Run GenPipes:

Now that we have created the Readset and Design files, moved them to the server and validated them, we can launch GenPipes.

GenPipes has a series of pipelines, named <pipeline>.py. We are working with RNASeq data, so we will need GenPipes’ rnaseq.py script.

1- Config/ini files
We will need the ini files, located at: $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/

We need the base ini that contains all the paramters: $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini We also need the ini for the particular cluster we are using.

Today, we will use:

$MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini
$MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini
workshop.ini

2- Readset/Design files

We made our files in the steps above, but we will use the files provided with the test dataset to avoid errors.

3- Steps

We will run the rnaseq.py’s first 24 steps. To check you can run rnaseq.py -h

4- Scheduler

In order to send jobs to the working nodes, we need a scheduler. Different servers have different schedulers; Guillimin has a PBS scheduler, while Cedar, Graham, MP2B and Beluga have a SLURM scheduler.

For SLURM, you will need to add -j slurm.

You can get more information by using: rnaseq.py -h; see below.

Building the GenPipes command

More Information on GenPipes Options:

GenPipes Parameters (click here)

By typing rnaseq.py -h, you get:

rnaseq.py

Can you build the GenPipes RNASeq command on your own?

GenPipes Command (click here)

rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini workshop.ini -r myReadset.txt -d myDesign.txt -s 1-24 -j slurm > commands.txt

To run the commands:
bash commands.txt

You will only see the command prompt hang for a little bit, but the jobs are being sent. Don’t type the bash command more than once! It will send your commands to the queue every time you type it.

How do you check the jobs in the queue?

Check Queue (click here)

The command depends on the scheduler installed on the server you are using, and might change from server to server. For SLURM use:

squeue -u $USER

For a useful comparison of SLURM and PBS schedulers and related commands, you can refer to this page.

How do you cancel the jobs in the queue?

Canceling Jobs (click here)

The command depends on the scheduler used on the server you are using, and might change from server to server. For SLURM use:

scancel -u $USER

To cancel a specific job, you need the Job ID:

scancel <JOBID>

You can find the JOBID by using squeue -u $USER and finding the unique number of the job you want to cancel.

Running GenPipes:

Let us now run the commands to analyze the data from our test set.

1- Log into the server:

Open a terminal window and log into the server:

ssh my_cc_account@mp2.ccs.usherbrooke.ca

2- Download data:

## make a folder for the analysis:
mkdir -p RNAseq_workshop

## move into the folder:
cd RNAseq_workshop

## download test data:
## Normally, you can download the data off our website:
#wget http://www.computationalgenomics.ca/tutorial/c3g_analysis_workshop/C3GAW_RNA_TestData_Aug2018.zip
#unzip C3GAW_RNA_TestData_Aug2018.zip
#cd C3GAW_RNA_TestData_Aug2018

## For faster access, we have put a copy of the data on Mammouth:

cp -r /project/6019104/workshop/pub/C3GAW_08_2018/RNAseq_TestData .

## move to the data folder
cd RNAseq_TestData/

3- Generate the GenPipes’ commands:


rnaseq.py -c $MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.base.ini \
$MUGQIC_PIPELINES_HOME/pipelines/rnaseq/rnaseq.mp2b.ini \
workshop.ini \
-r readset.rnaseq.txt \
-d design.rnaseq.txt \
-s 1-24 \
-j slurm > commands.txt

4- Submit the commands:

bash commands.txt

Type only once!

5- Check the jobs in the queue:

squeue -u $USER

Congratulations! You just launched your RNASeq analysis!

Extras:

How would you Download your data from Nanuq?

Downloading Data from Nanuq (click here)

Nanuq Download

C3GAW - GenPipes File creation & validation

Rola Dali

2019-01-21