Illumina/MGI DNAseq WGS SOP
================================================================================

Graphical summary 
________________________________________________________________________________

.. image:: GenPipes_dnaseq_mugqic.png
  :scale: 20%
  :align: center

Description
________________________________________________________________________________

The Illumina DNASeq WGS analysis pipeline is based on *GATK* best practices with important modifications to improve performance.


Steps 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Trim sequencing adaptors with quality and size selection ``skewer`` 


1.1 Trim reads to remove low quality tail and apply size selection to remove small fragments

::

    skewer --threads 7 --min 25 -q 25 --compress -f sanger \
      -x trim/dnaseq_GM12878_chr19/adapter.tsv \
      data/${SAMPLE}/${SAMPLE}.pair1.fastq.gz \
      data/${SAMPLE}/${SAMPLE}.pair2.fastq.gz \
      -o trim/${SAMPLE}/${SAMPLE}

2. Align trimmed reads with ``bwa mem``, and apply refinement steps to fix the heuristic of the aligner

2.1 Trimmed reads are aligned with ``bwa mem`` and sorted by position with ``sambamba`` for downstream refinement steps

::
    
    bwa mem \
      -M -t 15 \
      -R '@RG\tID:${SAMPLE}tSM:${SAMPLE}\tLB:${LIBRARY}\tPU:run1_1\tCN:McGill University and Genome Quebec Innovation Centre\tPL:Illumina' \
      Homo_sapiens.GRCh38/genome/bwa_index/Homo_sapiens.GRCh38.fa \
      data/${SAMPLE}/${SAMPLE}.pair1.trim.fastq.gz \
      data/${SAMPLE}/${SAMPLE}.pair2.trim.fastq.gz | \
    sambamba view -S -f bam \
      /dev/stdin | \
    sambamba sort  \
      /dev/stdin \
      --out alignment/${SAMPLE}/${SAMPLE}.sorted.bam && \
    sambamba index  \
     alignment/${SAMPLE}/${SAMPLE}.sorted.bam \
     alignment/${SAMPLE}/${SAMPLE}.sorted.bam.bai


2.2 Corrections to variant flanking indels is improved by doing local reassembly of regions surrounding indels

::
    
    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx6G -jar $GATK_JAR \
        --analysis_type RealignerTargetCreator -nct 1 -nt 3 \
        --reference_sequence Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --input_file alignment/${SAMPLE}/${SAMPLE}.sorted.bam \
        --known Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.Mills_and_1000G_gold_standard.indels.vcf.gz \
        --out alignment/${SAMPLE}/realign/all.intervals && \
    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx12G -jar $GATK_JAR \
        --analysis_type IndelRealigner -nt 1 -nct 1 \
        --reference_sequence Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --input_file alignment/${SAMPLE}/${SAMPLE}.sorted.bam \
        --targetIntervals alignment/${SAMPLE}/realign/all.intervals \
        --knownAlleles Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.Mills_and_1000G_gold_standard.indels.vcf.gz \
        --out alignment/${SAMPLE}/realign/all.bam \
        --maxReadsInMemory 750000 && \
    ln -s -f \
        alignment/${SAMPLE}/realign/all.bam \
        alignment/${SAMPLE}/${SAMPLE}.sorted.realigned.bam

2.3 Marking of duplicate with ``picard``

::

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=2 -Dsamjdk.buffer_size=4194304 -Xmx16G -jar $PICARD_HOME/picard.jar \
        MarkDuplicates \
        REMOVE_DUPLICATES=false VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true \
        INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.realigned.bam \
        OUTPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
        METRICS_FILE=alignment/${SAMPLE}/${SAMPLE}.sorted.dup.metrics \
        MAX_RECORDS_IN_RAM=1000000
      
2.4 Base recalibration (BSQR) to detects and adjust base qualities of systematic errors made by the sequencing machine when it estimates the accuracy of each base call

::

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx55G -jar $GATK_JAR \
        --analysis_type BaseRecalibrator --bqsrBAQGapOpenPenalty 30 \
        -nt 1 --num_cpu_threads_per_data_thread 12 \
        --input_file alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
        --reference_sequence Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa  \
        --knownSites Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.dbSNP142.vcf.gz \
        --knownSites Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.gnomad.exomes.r2.0.2.sites.no-VEP.nohist.tidy.vcf.gz \
        --knownSites Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.Mills_and_1000G_gold_standard.indels.vcf.gz \
        --out alignment/${SAMPLE}/${SAMPLE}.sorted.dup.recalibration_report.grp
      
    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Dsamjdk.buffer_size=4194304 -Xmx30G -jar $GATK_JAR \
        --analysis_type PrintReads --generate_md5 \
        -nt 1 --num_cpu_threads_per_data_thread 5 \
        --input_file alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --BQSR alignment/${SAMPLE}/${SAMPLE}.sorted.dup.recalibration_report.grp \
        --out alignment/${SAMPLE}/${SAMPLE}.sorted.dup.recal.bam

3. Germline variant calling using ``GATK Haplotype caller``

3.1 Variant calling is performed on individual samples in padded vcf mode (gVCF). Typically run by chromosome or groups of chromosomes depending on coverage of sequenced sample whereby higher coverage approaches one chromosome per job.

::

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx30G -jar $GATK_JAR \
        --analysis_type HaplotypeCaller --useNewAFCalculator --emitRefConfidence GVCF -dt none -nct 1 -G StandardAnnotation -G StandardHCAnnotation \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --input_file alignment/${SAMPLE}/${SAMPLE}.sorted.dup.recal.bam \
        --out alignment/${SAMPLE}/rawHaplotypeCaller/${SAMPLE}.hc.g.vcf.gz
        
    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx30G -jar $GATK_JAR \
        --analysis_type GenotypeGVCFs --useNewAFCalculator -G StandardAnnotation -G StandardHCAnnotation \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --variant alignment/${SAMPLE}/rawHaplotypeCaller/${SAMPLE}.hc.g.vcf.gz \
        --out alignment/${SAMPLE}/${SAMPLE}.hc.vcf.gz
        
3.2 Joint calling of multiple N samples is preformed by ``GATK combineGVCF`` and ``GATK GenotypeGVCFs``. Typically preformed on a run/flowcell level then merged again accross all runs/flowcells

::

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=4 -Dsamjdk.buffer_size=4194304 -Xmx24G -jar $GATK_JAR \
        --analysis_type CombineGVCFs -G StandardAnnotation -G StandardHCAnnotation \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --variant alignment/${SAMPLE1}/${SAMPLE1}.hc.g.vcf.gz \
        --variant alignment/${SAMPLE2}/${SAMPLE2}.hc.g.vcf.gz \
        --variant alignment/${SAMPLE_N}/${SAMPLE_N}.hc.g.vcf.gz \
        --out variants/allSamples_run1.hc.g.vcf.gz

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx30G -jar $GATK_JAR \
        --analysis_type GenotypeGVCFs --useNewAFCalculator -G StandardAnnotation -G StandardHCAnnotation \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        --variant variants/allSamples_run1.hc.g.vcf.gz \
        --out variants/allSamples_run1.hc.vcf.gz


3.3 Jointed called variants from N samples are filtered using a variant recalibration step ``GATK Variant Quality Score Recalibration (VQSR)` which uses known, highly validated variant resources (omni, 1000 Genomes, hapmap).  SNPs and INDELs are processed independently.

3.3.1 Generate tranches
    
SNP variant recalibration

::  

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx24G -jar $GATK_JAR \
        --analysis_type VariantRecalibrator -nt 11 \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        -input variants/allSamples_run1.hc.vcf.gz \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 \ $MUGQIC_INSTALL_HOME_DEV/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.hapmap_3.3.vcf.gz \
        -resource:omni,known=false,training=true,truth=false,prior=12.0 \ $MUGQIC_INSTALL_HOME_DEV/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.1000G_omni2.5.vcf.gz \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 \ $MUGQIC_INSTALL_HOME_DEV/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.1000G_phase1.snps.high_confidence.vcf.gz \
        -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 \ $MUGQIC_INSTALL_HOME_DEV/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.dbSNP142.vcf.gz \
        -an QD -an FS -an DP -an SOR -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -mode SNP \
        --recal_file variants/allSamples.hc.snps.recal \
        --tranches_file variants/allSamples.hc.snps.tranches  \
        --rscript_file variants/allSamples.hc.snps.R 
    
INDEL variant recalibration

::
    
    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx24G -jar $GATK_JAR \
        --analysis_type VariantRecalibrator -nt 11 \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        -input variants/allSamples_run1.hc.vcf.gz \
        -resource:mills,known=false,training=true,truth=true,prior=12.0 \ $MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.Mills_and_1000G_gold_standard.indels.vcf.gz \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 \ $MUGQIC_INSTALL_HOME_DEV/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.dbSNP142.vcf.gz -an QD -an DP -an FS -an SOR \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 -mode INDEL \
        --recal_file variants/allSamples.hc.indels.recal \
        --tranches_file variants/allSamples.hc.indels.tranches  \
        --rscript_file variants/allSamples.hc.indels.R

3.3.2 Apply tranches

SNPs

::

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx24G -jar $GATK_JAR \
        --analysis_type ApplyRecalibration -nt 11 \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        -input variants/allSamples_run1.hc.vcf.gz \
        --ts_filter_level 99.5 -mode SNP \
        --tranches_file variants/allSamples.hc.snps.tranches \
        --recal_file variants/allSamples.hc.snps.recal \
        --out variants/allSamples.hc.snps_raw_indels.vqsr.vcf.gz
        
INDELs

::

    java -Djava.io.tmpdir=${SLURM_TMPDIR} -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=1048576 -Xmx24G -jar $GATK_JAR \
        --analysis_type ApplyRecalibration -nt 11 \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --reference_sequence /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
        -input variants/allSamples.hc.snps_raw_indels.vqsr.vcf.gz \
        --ts_filter_level 99.0 -mode INDEL \
        --tranches_file variants/allSamples.hc.indels.tranches \
        --recal_file variants/allSamples.hc.indels.recal \
        --out variants/allSamples.hc.vqsr.vcf.gz

4. Compute quality control metrics  

A series of scripts compute several metrics derived from the output of the analysis above. Here are the full set
of commands used to generate these metrics.

:: 

    #alignment and insert size metrics
    gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
      CollectMultipleMetrics \
      --PROGRAM=CollectAlignmentSummaryMetrics \
      --PROGRAM=CollectInsertSizeMetrics \
      --VALIDATION_STRINGENCY=SILENT \
      --REFERENCE_SEQUENCE=/cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
      --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
      --OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.all.metrics \
      --MAX_RECORDS_IN_RAM=1000000
    
    #OxoG metrics
    gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
      CollectOxoGMetrics \
      --VALIDATION_STRINGENCY=SILENT  \
      --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
      --OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.oxog_metrics.txt \
      --REFERENCE_SEQUENCE=/cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
      --MAX_RECORDS_IN_RAM=4000000
    
    #GC biais metrics
    gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
      CollectGcBiasMetrics \
      --VALIDATION_STRINGENCY=SILENT \
      --ALSO_IGNORE_DUPLICATES=TRUE \
      --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
      --OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.qcbias_metrics.txt \
      --CHART=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.qcbias_metrics.pdf \
      --SUMMARY_OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.qcbias_summary_metrics.txt \
      --REFERENCE_SEQUENCE=/cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa \
      --MAX_RECORDS_IN_RAM=4000000
    
    #qualimap metrics
    qualimap bamqc -nt 2 \
      -bam alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam -outdir metrics/dna/${SAMPLE}/qualimap \
      --java-mem-size=55G
    
    #bedGraph generation on mark duplication
    bedtools genomecov -bga \
      -ibam alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam > alignment/${SAMPLE}/${SAMPLE}.sorted.dup.BedGraph    
    
    #generate tdf file
    java -Xmx8G  -Djava.awt.headless=true -jar $IGVTOOLS_JAR count -f min,max,mean -w 25 \
      alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
      alignment/${SAMPLE}/${SAMPLE}.sorted.dup.tdf \
      /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/genome/Homo_sapiens.GRCh38.fa.fai
      
    #verifyBAMid to estimate contamination
    verifyBamID \
        --vcf /cvmfs/soft.mugqic/CentOS6/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.1000G_phase1.snps.high_confidence.EUR.vcf \
        --bam alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
        --out metrics/dna/${SAMPLE}/verifyBamId/${SAMPLE} \
        --verbose --ignoreRG --noPhoneHome
        
    #vcftools 
    vcftools --missing-indv \
        --gzvcf variants/allSamples.hc.vcf.gz \
        --out variants/allSamples
        
    #vcftools
    vcftools --depth \
        --gzvcf variants/allSamples.hc.vcf.gz \
        --out variants/allSamples
        
    #GATK4 CrosscheckFingerprints for sample mixup detection    
    gatk --java-options "-Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 \ 
        -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Xmx16G" \
        CrosscheckFingerprints --NUM_THREADS 2 --EXIT_CODE_WHEN_MISMATCH 0 \
        --VALIDATION_STRINGENCY SILENT \
        --TMP_DIR ${SLURM_TMPDIR} \
        --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.dup.bam \
        --HAPLOTYPE_MAP $MUGQIC_INSTALL_HOME/genomes/species/Homo_sapiens.GRCh38/annotations/Homo_sapiens.GRCh38.fingerprint.map \
        --LOD_THRESHOLD 3.0 \
        --OUTPUT=metrics/dna/sample.fingerprint
  
    gatk --java-options "-Djava.io.tmpdir=${SLURM_TMPDIR} -XX:+UseParallelGC -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 \
    -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Xmx16G" \
        ClusterCrosscheckMetrics  \
        --VALIDATION_STRINGENCY SILENT \
        --TMP_DIR ${SLURM_TMPDIR} \
        --INPUT metrics/dna/sample.fingerprint \
        --LOD_THRESHOLD 5.0 \
        --OUTPUT=metrics/dna/cluster.fingerprints
        
        
Reference Genome and Software Versions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Reference Genome:** Genome Reference Consortium Human Build 38 patch release 13 (GenBank GCA_000001405.28)

**Software versions**
::

    skewer 0.2.2
    bwa 0.7.17
    GATK 4.1.8.1 and 3.8
    sambamba 0.7.0
    igvtools 2.3.67
    java 1.8.0_72
    picard 2.23.3
    python 2.7.14
    R 3.5.1_3.7
    samtools 1.9
    bcftools 1.9
    htslib 1.9
    snpeff 4.3
    qualimap 2.2.2dev
    fastqc 0.11.5
    verifyBAMid 1.1.3
    bedtools 2.29.2
    bvatools 1.6
    vcftools 0.1.14