Nanopore ARTIC-Nanopolish SOP
================================================================================

Graphical summary
________________________________________________________________________________

.. image:: ONT_ArticPipelineDigaram.png
   :scale: 20%
   :align: center

Description
________________________________________________________________________________

The SOP for Nanopore data is based on the ARTIC SARS-CoV-2 protocol using
nanopolish. The full ARTIC documentation can be found `here`_. The protocol was
followed closely; the majority of changes are technical adaptations needed to
run in a High Performance Computing (HPC) environment where the use of Conda is
not advisable.

.. _here: https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html

Steps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. **Base calling with Guppy**

   Raw reads are base-called with Guppy using the High Accuracy model
   (flip-flop) and a minimum quality threshold of ``7``::

    SAMPLE="run_name"
    INPUT="/path/to/fast5/"
    OUTPUT="/path/to/fastq/"
    PROTOCOL="dna_r9.4.1_450bps_hac.cfg"
    MINQ=7
    GPUPARAMS=""
    RDSPERFILE=4000
    OTHER_OPTIONS=""

    guppy_basecaller --input_path ${INPUT} \
        --recursive \
        --device auto \
        --save_path ${OUTPUT} \
        --config ${PROTOCOL} \
        --qscore_filtering --min_qscore ${MINQ} \
        --compress_fastq ${GPUPARAMS} \
        --records_per_fastq ${RDSPERFILE} \
        --disable_pings ${OTHER_OPTIONS} \
        --verbose |& tee -a ${OUTPUT}/${SAMPLE}.guppy.log

2. **Demultiplexing with Guppy**

   Reads are demultiplexed with Guppy, with the option
   ``--require_barcodes_both_ends`` enabled. Since regular Nanopore barcodes
   were used, the demultiplexing command points to the native barcode
   configuration files::

    INPUT="/path/to/fastq/"
    OUTPUT="/path/to/demultiplex/"

    guppy_barcoder --require_barcodes_both_ends \
        --input_path ${INPUT} \
        --save_path ${OUTPUT} \
        --arrangements_files "barcode_arrs_nb12.cfg barcode_arrs_nb24.cfg"

3. **ARTIC-Nanopolish pipeline**

   Some HPC environments, such as Compute Canada, do not support Python
   package managers like Conda, so the Python environment needed by the ARTIC
   pipeline must be defined before running it. Compute Canada maintains
   several Python package wheels internally, so many dependencies are already
   met; the remaining ones had to be downloaded and compiled manually. In
   this case, they were all deposited in a folder referenced by an
   environment variable called ``$PYTHON_WHL``, as sketched below.
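   The exact build steps vary per package; the following is a minimal,
   illustrative sketch (using ``PyVCF`` as the example) of how a missing
   wheel can be pre-built on a login node with internet access and deposited
   under ``$PYTHON_WHL`` for later offline installation::

    # Illustrative only: pre-build a wheel so it can later be installed
    # offline with "pip install --no-deps" (requires the "wheel" package).
    export PYTHON_WHL="/path/to/python/wheels/"
    pip download --no-deps --no-binary :all: -d ${PYTHON_WHL} PyVCF==0.6.8
    cd ${PYTHON_WHL} && tar xzf PyVCF-0.6.8.tar.gz && cd PyVCF-0.6.8
    python setup.py bdist_wheel   # the wheel is written to ./dist/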
   Each ``SLURM`` job then uses the following code to set up the Python
   environment and activate it::

    export ENVDIR="${SLURM_TMPDIR}/env"
    export PYTHON_WHL="/path/to/python/wheels/"
    export ARTIC_WHL="/path/to/artic.whl"

    virtualenv --no-download ${ENVDIR}
    source ${ENVDIR}/bin/activate
    pip install --no-index --upgrade pip
    pip install --no-index biopython
    pip install --no-deps ${PYTHON_WHL}/clint-0.5.1/dist/clint-0.5.1-py3-none-any.whl
    pip install --no-index pandas
    pip install --no-index pysam
    pip install --no-index pytest
    pip install --no-deps ${PYTHON_WHL}/PyVCF-0.6.8/dist/PyVCF-0.6.8-py3-none-any.whl
    pip install --no-index requests
    pip install --no-index tqdm
    pip install --no-deps ${PYTHON_WHL}/whatshap-0.18/dist/whatshap-0.18-cp36-cp36m-linux_x86_64.whl
    pip install --no-deps ${ARTIC_WHL}

   3.1. **Read size filtering**

   Using the ARTIC ``guppyplex`` tool, reads that do not fit the expected
   amplicon lengths are filtered out::

    MIN_LENGTH="400"
    MAX_LENGTH="700"
    FASTQ_DIR="/path/to/demultiplex/fastq"
    POOL="pool_name"

    artic guppyplex \
        --min-length ${MIN_LENGTH} \
        --max-length ${MAX_LENGTH} \
        --directory ${FASTQ_DIR} \
        --prefix ${POOL}

   3.2. **Run the ARTIC Nanopolish pipeline**

   Run the ARTIC nanopolish pipeline using the following command::

    NORMALISE="800"
    THREADS="16"
    PRIMERS_DIR="/path/to/primers/dir/"
    POOL="pool_name"
    BARCODE="barcodeXX"
    FAST5_DIR="/path/to/fast5/"
    SUMMARY="/path/to/sequencing_summary.txt"
    PRIMERS_VER="nCoV-2019/V3"
    SAMPLE="sample_name"

    artic minion \
        --normalise ${NORMALISE} \
        --threads ${THREADS} \
        --scheme-directory ${PRIMERS_DIR} \
        --read-file ${POOL}_${BARCODE}.fastq \
        --fast5-directory ${FAST5_DIR} \
        --sequencing-summary ${SUMMARY} \
        ${PRIMERS_VER} \
        ${SAMPLE} |& tee -a ${SAMPLE}_nanopolish.log

   The above command actually runs the following tools::

    # index the raw signal data for nanopolish
    nanopolish index -s ${SUMMARY} -d ${FAST5_DIR} ${POOL}_${BARCODE}.fastq

    # align reads to the reference, keep mapped reads, sort and index
    minimap2 -a -x map-ont -t 16 primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${POOL}_${BARCODE}.fastq | samtools view -bS -F 4 - | samtools sort -o ${SAMPLE}.sorted.bam -
    samtools index ${SAMPLE}.sorted.bam

    # trim primers and normalise coverage
    align_trim --start --normalise 800 primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed --report ${SAMPLE}.alignreport.txt < ${SAMPLE}.sorted.bam 2> ${SAMPLE}.alignreport.er | samtools sort -T ${SAMPLE} - -o ${SAMPLE}.trimmed.rg.sorted.bam
    align_trim --normalise 800 primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed --remove-incorrect-pairs --report ${SAMPLE}.alignreport.txt < ${SAMPLE}.sorted.bam 2> ${SAMPLE}.alignreport.er | samtools sort -T ${SAMPLE} - -o ${SAMPLE}.primertrimmed.rg.sorted.bam
    samtools index ${SAMPLE}.trimmed.rg.sorted.bam
    samtools index ${SAMPLE}.primertrimmed.rg.sorted.bam

    # call variants with nanopolish, one run per primer pool
    nanopolish variants --min-flanking-sequence 10 -x 1000000 --progress -t 16 --reads ${POOL}_${BARCODE}.fastq -o ${SAMPLE}.nCoV-2019_2.vcf -b ${SAMPLE}.trimmed.rg.sorted.bam -g primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta
    nanopolish variants --min-flanking-sequence 10 -x 1000000 --progress -t 16 --reads ${POOL}_${BARCODE}.fastq -o ${SAMPLE}.nCoV-2019_1.vcf -b ${SAMPLE}.trimmed.rg.sorted.bam -g primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta

    # merge, filter and apply the variants; mask low-coverage regions
    artic_vcf_merge ${SAMPLE} primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed nCoV-2019_2:${SAMPLE}.nCoV-2019_2.vcf nCoV-2019_1:${SAMPLE}.nCoV-2019_1.vcf
    artic_vcf_filter --nanopolish ${SAMPLE}.merged.vcf ${SAMPLE}.pass.vcf ${SAMPLE}.fail.vcf
    artic_make_depth_mask primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${SAMPLE}.primertrimmed.rg.sorted.bam ${SAMPLE}.coverage_mask.txt
    bgzip -f ${SAMPLE}.pass.vcf
    tabix -p vcf ${SAMPLE}.pass.vcf.gz
    artic_mask primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${SAMPLE}.coverage_mask.txt ${SAMPLE}.fail.vcf ${SAMPLE}.preconsensus.fasta
    bcftools consensus -f ${SAMPLE}.preconsensus.fasta ${SAMPLE}.pass.vcf.gz -m ${SAMPLE}.coverage_mask.txt -o ${SAMPLE}.consensus.fasta

    # rename the consensus and align it against the reference
    artic_fasta_header ${SAMPLE}.consensus.fasta "${SAMPLE}/ARTIC/nanopolish"
    cat ${SAMPLE}.consensus.fasta primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta > ${SAMPLE}.muscle.in.fasta
    muscle -in ${SAMPLE}.muscle.in.fasta -out ${SAMPLE}.muscle.out.fasta
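   As an optional sanity check on the final consensus (not part of the ARTIC
   pipeline itself), the number of masked bases can be counted; a large ``N``
   fraction points to low-coverage regions that were masked out::

    # Illustrative only: count masked (N) bases in the consensus sequence.
    grep -v "^>" ${SAMPLE}.consensus.fasta | tr -cd 'N' | wc -c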
4. **Create metrics**

   For metrics generation, the following tools are run:

   - ``samtools`` for general alignment stats
   - ``bedtools`` for coverage metrics
   - ``wub`` for additional alignment metrics (see this `repository`_)

   .. _repository: https://github.com/nanoporetech/wub

   ::

    REFERENCE="/path/to/SARS-CoV2/reference.fasta"
    SAMPLE="sample_name"
    SORTED_BAM="/path/to/${SAMPLE}.sorted.bam"
    WUB="/path/to/wub/"

    # drop secondary (0x100) and supplementary (0x800) alignments
    samtools view -@ 4 -F 0x900 \
        --output-fmt BAM \
        --reference ${REFERENCE} \
        -o ${SAMPLE}.nosup.bam \
        ${SORTED_BAM}

    samtools sort -@ 4 --reference ${REFERENCE} \
        --output-fmt BAM \
        -o ${SAMPLE}.nosup.sorted.bam \
        ${SAMPLE}.nosup.bam

    samtools index -@ 5 -b ${SAMPLE}.nosup.sorted.bam

    # per-base (bedGraph) and histogram coverage metrics
    bedtools genomecov -bga -ibam ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bedGraph
    bedtools genomecov -ibam ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.histogram

    # general alignment statistics and per-position depth
    samtools stats ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bam.stats
    samtools depth ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bam.depth

    # additional alignment QC with wub
    python ${WUB}/scripts/bam_alignment_qc.py -f ${REFERENCE} ${SAMPLE}.nosup.sorted.bam

   Plotting and reporting is done with a combination of R scripts that parse
   the BAM, VCF and metrics files.

Reference Genome and Software Versions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Reference Genome:** Severe acute respiratory syndrome coronavirus 2 isolate
Wuhan-Hu-1 (GenBank MN908947.3)

**Software versions**

::

    guppy-GPU   v3.4.4
    minimap2    v2.17
    samtools    v1.9
    bcftools    v1.9
    bedtools    v2.27.0
    python      v3.6
    nanopolish  v0.13.1
    muscle      v3.8.31
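To document the versions actually used by a given run, a short provenance
snippet along the following lines can be appended to each job script. This is
an illustrative sketch, assuming the tools are on the ``PATH``; it is not part
of the SOP itself::

    # Illustrative only: record tool versions used in this run.
    {
        minimap2 --version
        samtools --version | head -n 1
        bcftools --version | head -n 1
        bedtools --version
        python --version
        nanopolish --version | head -n 1
        muscle -version
    } > versions.txt 2>&1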