Nanopore ARTIC-Nanopolish SOP
================================================================================

Graphical summary
________________________________________________________________________________

.. image:: ONT_ArticPipelineDigaram.png
   :scale: 20%
   :align: center

Description
________________________________________________________________________________

The SOP for Nanopore data is based on the ARTIC SARS-CoV-2 protocol using
nanopolish. The full ARTIC documentation can be found `here`_. The protocol was
followed closely; the majority of changes are technical adaptations needed to
run in a High Performance Computing (HPC) environment where the use of Conda is
not advisable.

.. _here: https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html

Steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. **Base calling with Guppy**

   Raw reads are base-called with Guppy using the High Accuracy model
   (flip-flop) and a minimum quality threshold of ``7``::

    SAMPLE="run_name"
    INPUT="/path/to/fast5/"
    OUTPUT="/path/to/fastq/"
    PROTOCOL="dna_r9.4.1_450bps_hac.cfg"
    MINQ=7
    GPUPARAMS=""
    RDSPERFILE=4000
    OTHER_OPTIONS=""

    guppy_basecaller --input_path ${INPUT} \
        --recursive \
        --device auto \
        --save_path ${OUTPUT} \
        --config ${PROTOCOL} \
        --qscore_filtering --min_qscore ${MINQ} \
        --compress_fastq ${GPUPARAMS} \
        --records_per_fastq ${RDSPERFILE} \
        --disable_pings ${OTHER_OPTIONS} \
        --verbose |& tee -a ${OUTPUT}/${SAMPLE}.guppy.log

2. **Demultiplexing with Guppy**

   Reads are demultiplexed with Guppy, with the option
   ``--require_barcodes_both_ends`` enabled. Since regular Nanopore barcodes
   were used, the demultiplexing command points to the native barcode
   configuration files::

    INPUT="/path/to/fastq/"
    OUTPUT="/path/to/demultiplex/"

    guppy_barcoder --require_barcodes_both_ends \
        --input_path ${INPUT} \
        --save_path ${OUTPUT} \
        --arrangements_files "barcode_arrs_nb12.cfg barcode_arrs_nb24.cfg"

3. **ARTIC-Nanopolish pipeline**

   Some HPC environments, such as Compute Canada, do not support Python
   package managers like Conda, so the Python environment needed by the ARTIC
   pipeline must be defined before running it. Compute Canada maintains
   several Python package wheels internally, so many dependencies are already
   met; the remaining ones had to be downloaded and compiled manually. In
   this case, they were all deposited in a folder referenced by an
   environment variable called ``$PYTHON_WHL``, as sketched below.
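   The exact build steps vary per package; the following is a minimal,
   illustrative sketch (using ``PyVCF`` as the example) of how a missing
   wheel can be pre-built on a login node with internet access and deposited
   under ``$PYTHON_WHL`` for later offline installation::

    # Illustrative only: pre-build a wheel so it can later be installed
    # offline with "pip install --no-deps" (requires the "wheel" package).
    export PYTHON_WHL="/path/to/python/wheels/"
    pip download --no-deps --no-binary :all: -d ${PYTHON_WHL} PyVCF==0.6.8
    cd ${PYTHON_WHL} && tar xzf PyVCF-0.6.8.tar.gz && cd PyVCF-0.6.8
    python setup.py bdist_wheel   # the wheel is written to ./dist/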
   Each ``SLURM`` job then uses the following code to set up the Python
   environment and activate it::

    export ENVDIR="${SLURM_TMPDIR}/env"
    export PYTHON_WHL="/path/to/python/wheels/"
    export ARTIC_WHL="/path/to/artic.whl"

    virtualenv --no-download ${ENVDIR}
    source ${ENVDIR}/bin/activate
    pip install --no-index --upgrade pip
    pip install --no-index biopython
    pip install --no-deps ${PYTHON_WHL}/clint-0.5.1/dist/clint-0.5.1-py3-none-any.whl
    pip install --no-index pandas
    pip install --no-index pysam
    pip install --no-index pytest
    pip install --no-deps ${PYTHON_WHL}/PyVCF-0.6.8/dist/PyVCF-0.6.8-py3-none-any.whl
    pip install --no-index requests
    pip install --no-index tqdm
    pip install --no-deps ${PYTHON_WHL}/whatshap-0.18/dist/whatshap-0.18-cp36-cp36m-linux_x86_64.whl
    pip install --no-deps ${ARTIC_WHL}

   3.1. **Read size filtering**

   Using the ARTIC ``guppyplex`` tool, reads that do not fit the expected
   amplicon lengths are filtered out::

    MIN_LENGTH="400"
    MAX_LENGTH="700"
    FASTQ_DIR="/path/to/demultiplex/fastq"
    POOL="pool_name"

    artic guppyplex \
        --min-length ${MIN_LENGTH} \
        --max-length ${MAX_LENGTH} \
        --directory ${FASTQ_DIR} \
        --prefix ${POOL}

   3.2. **Run the ARTIC Nanopolish pipeline**

   Run the ARTIC nanopolish pipeline using the following command::

    NORMALISE="800"
    THREADS="16"
    PRIMERS_DIR="/path/to/primers/dir/"
    POOL="pool_name"
    BARCODE="barcodeXX"
    FAST5_DIR="/path/to/fast5/"
    SUMMARY="/path/to/sequencing_summary.txt"
    PRIMERS_VER="nCoV-2019/V3"
    SAMPLE="sample_name"

    artic minion \
        --normalise ${NORMALISE} \
        --threads ${THREADS} \
        --scheme-directory ${PRIMERS_DIR} \
        --read-file ${POOL}_${BARCODE}.fastq \
        --fast5-directory ${FAST5_DIR} \
        --sequencing-summary ${SUMMARY} \
        ${PRIMERS_VER} \
        ${SAMPLE} |& tee -a ${SAMPLE}_nanopolish.log

   The above command actually runs the following tools::

    # index the raw signal data for nanopolish
    nanopolish index -s ${SUMMARY} -d ${FAST5_DIR} ${POOL}_${BARCODE}.fastq

    # align reads to the reference, keep mapped reads, sort and index
    minimap2 -a -x map-ont -t 16 primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${POOL}_${BARCODE}.fastq | samtools view -bS -F 4 - | samtools sort -o ${SAMPLE}.sorted.bam -
    samtools index ${SAMPLE}.sorted.bam

    # trim primers and normalise coverage
    align_trim --start --normalise 800 primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed --report ${SAMPLE}.alignreport.txt < ${SAMPLE}.sorted.bam 2> ${SAMPLE}.alignreport.er | samtools sort -T ${SAMPLE} - -o ${SAMPLE}.trimmed.rg.sorted.bam
    align_trim --normalise 800 primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed --remove-incorrect-pairs --report ${SAMPLE}.alignreport.txt < ${SAMPLE}.sorted.bam 2> ${SAMPLE}.alignreport.er | samtools sort -T ${SAMPLE} - -o ${SAMPLE}.primertrimmed.rg.sorted.bam
    samtools index ${SAMPLE}.trimmed.rg.sorted.bam
    samtools index ${SAMPLE}.primertrimmed.rg.sorted.bam

    # call variants with nanopolish, one run per primer pool
    nanopolish variants --min-flanking-sequence 10 -x 1000000 --progress -t 16 --reads ${POOL}_${BARCODE}.fastq -o ${SAMPLE}.nCoV-2019_2.vcf -b ${SAMPLE}.trimmed.rg.sorted.bam -g primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta
    nanopolish variants --min-flanking-sequence 10 -x 1000000 --progress -t 16 --reads ${POOL}_${BARCODE}.fastq -o ${SAMPLE}.nCoV-2019_1.vcf -b ${SAMPLE}.trimmed.rg.sorted.bam -g primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

    # merge, filter and apply the variants; mask low-coverage regions
    artic_vcf_merge ${SAMPLE} primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed nCoV-2019_2:${SAMPLE}.nCoV-2019_2.vcf nCoV-2019_1:${SAMPLE}.nCoV-2019_1.vcf
    artic_vcf_filter --nanopolish ${SAMPLE}.merged.vcf ${SAMPLE}.pass.vcf ${SAMPLE}.fail.vcf
    artic_make_depth_mask primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${SAMPLE}.primertrimmed.rg.sorted.bam ${SAMPLE}.coverage_mask.txt
    bgzip -f ${SAMPLE}.pass.vcf
    tabix -p vcf ${SAMPLE}.pass.vcf.gz
    artic_mask primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${SAMPLE}.coverage_mask.txt ${SAMPLE}.fail.vcf ${SAMPLE}.preconsensus.fasta
    bcftools consensus -f ${SAMPLE}.preconsensus.fasta ${SAMPLE}.pass.vcf.gz -m ${SAMPLE}.coverage_mask.txt -o ${SAMPLE}.consensus.fasta

    # rename the consensus and align it against the reference
    artic_fasta_header ${SAMPLE}.consensus.fasta "${SAMPLE}/ARTIC/nanopolish"
    cat ${SAMPLE}.consensus.fasta primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta > ${SAMPLE}.muscle.in.fasta
    muscle -in ${SAMPLE}.muscle.in.fasta -out ${SAMPLE}.muscle.out.fasta
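   As an optional sanity check on the final consensus (not part of the ARTIC
   pipeline itself), the number of masked bases can be counted; a large ``N``
   fraction points to low-coverage regions that were masked out::

    # Illustrative only: count masked (N) bases in the consensus sequence.
    grep -v "^>" ${SAMPLE}.consensus.fasta | tr -cd 'N' | wc -c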
4. **Create metrics**

   For metrics generation, the following tools are run:

   - ``samtools`` for general alignment stats
   - ``bedtools`` for coverage metrics
   - ``wub`` for additional alignment metrics (see this `repository`_)

   .. _repository: https://github.com/nanoporetech/wub

   ::

    REFERENCE="/path/to/SARS-CoV2/reference.fasta"
    SAMPLE="sample_name"
    SORTED_BAM="/path/to/${SAMPLE}.sorted.bam"
    WUB="/path/to/wub/"

    # drop secondary (0x100) and supplementary (0x800) alignments
    samtools view -@ 4 -F 0x900 \
        --output-fmt BAM \
        --reference ${REFERENCE} \
        -o ${SAMPLE}.nosup.bam \
        ${SORTED_BAM}

    samtools sort -@ 4 --reference ${REFERENCE} \
        --output-fmt BAM \
        -o ${SAMPLE}.nosup.sorted.bam \
        ${SAMPLE}.nosup.bam

    samtools index -@ 5 -b ${SAMPLE}.nosup.sorted.bam

    # per-base (bedGraph) and histogram coverage metrics
    bedtools genomecov -bga -ibam ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bedGraph
    bedtools genomecov -ibam ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.histogram

    # general alignment statistics and per-position depth
    samtools stats ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bam.stats
    samtools depth ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bam.depth

    # additional alignment QC with wub
    python ${WUB}/scripts/bam_alignment_qc.py -f ${REFERENCE} ${SAMPLE}.nosup.sorted.bam

   Plotting and reporting is done with a combination of R scripts that parse
   the BAM, VCF and metrics files.

Reference Genome and Software Versions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Reference Genome:** Severe acute respiratory syndrome coronavirus 2 isolate
Wuhan-Hu-1 (GenBank MN908947.3)

**Software versions**

::

    guppy-GPU   v3.4.4
    minimap2    v2.17
    samtools    v1.9
    bcftools    v1.9
    bedtools    v2.27.0
    python      v3.6
    nanopolish  v0.13.1
    muscle      v3.8.31
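To document the versions actually used by a given run, a short provenance
snippet along the following lines can be appended to each job script. This is
an illustrative sketch, assuming the tools are on the ``PATH``; it is not part
of the SOP itself::

    # Illustrative only: record tool versions used in this run.
    {
        minimap2 --version
        samtools --version | head -n 1
        bcftools --version | head -n 1
        bedtools --version
        python --version
        nanopolish --version | head -n 1
        muscle -version
    } > versions.txt 2>&1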