Nanopore ARTIC-Nanopolish SOP
Graphical summary
Description
The SOP for Nanopore data is based on the ARTIC SARS-CoV-2 protocol using nanopolish; the full documentation is available on the ARTIC network website.
The protocol was followed closely; most changes are technical adaptations required to run in a High Performance Computing (HPC) environment where the use of Conda is not advisable.
Steps
Base calling with Guppy
Raw reads are base-called with Guppy using the High Accuracy model (flip-flop) and a minimum quality threshold of 7:
SAMPLE="run_name"
INPUT="/path/to/fast5/"
OUTPUT="/path/to/fastq/"
PROTOCOL="dna_r9.4.1_450bps_hac.cfg"
MINQ=7
GPUPARAMS=""
RDSPERFILE=4000
OTHER_OPTIONS=""
guppy_basecaller --input_path ${INPUT} \
--recursive \
--device auto \
--save_path ${OUTPUT} \
--config ${PROTOCOL} \
--qscore_filtering --min_qscore ${MINQ} \
--compress_fastq ${GPUPARAMS} \
--records_per_fastq ${RDSPERFILE} \
--disable_pings ${OTHER_OPTIONS} \
--verbose |& tee -a ${OUTPUT}/${SAMPLE}.guppy.log
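On a GPU node, the empty GPUPARAMS variable above can carry device-tuning options. A minimal sketch; the flags are standard Guppy options, but the values are placeholders to be tuned for the available card:
# Illustrative GPU tuning for guppy_basecaller; values are placeholders
GPUPARAMS="--gpu_runners_per_device 4 --chunks_per_runner 512 --num_callers 4"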
Demultiplexing with Guppy
Reads are demultiplexed with Guppy, with the --require_barcodes_both_ends option activated. Since regular Nanopore barcodes were used, the demultiplexing command points to the native barcode config files:
INPUT="/path/to/fastq/"
OUTPUT="/path/to/demultiplex/"
guppy_barcoder --require_barcodes_both_ends \
--input_path ${INPUT} \
--save_path ${OUTPUT} \
--arrangements_files "barcode_arrs_nb12.cfg barcode_arrs_nb24.cfg"
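To sanity-check the demultiplexing before proceeding, the read counts per barcode can be inspected. A minimal sketch, assuming guppy_barcoder wrote compressed FASTQ files into one sub-directory per barcode:
# Report the number of reads assigned to each barcode (4 FASTQ lines per read)
for BC_DIR in ${OUTPUT}/barcode*/; do
    NLINES=$(zcat ${BC_DIR}*.fastq.gz | wc -l)
    echo "$(basename ${BC_DIR}): $((NLINES / 4)) reads"
done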
ARTIC-Nanopolish pipeline
Some HPC environments, such as Compute Canada, don't support Python package managers like Conda, so the Python environment needed to run a tool like the ARTIC pipeline has to be defined up front, before running the pipeline. Compute Canada maintains several Python package wheels internally, so many dependencies are already met; the remaining ones had to be downloaded and compiled manually. In this case, they were all deposited in a folder saved as an environment variable called $PYTHON_WHL. Each SLURM job then uses the following code to set up the Python environment and activate it:
export ENVDIR="${SLURM_TMPDIR}/env"
export PYTHON_WHL="/path/to/python/wheels/"
export ARTIC_WHL="/path/to/artic.whl"
virtualenv --no-download ${ENVDIR}
source ${ENVDIR}/bin/activate
pip install --no-index --upgrade pip
pip install --no-index biopython
pip install --no-deps ${PYTHON_WHL}/clint-0.5.1/dist/clint-0.5.1-py3-none-any.whl
pip install --no-index pandas
pip install --no-index pysam
pip install --no-index pytest
pip install --no-deps ${PYTHON_WHL}/PyVCF-0.6.8/dist/PyVCF-0.6.8-py3-none-any.whl
pip install --no-index requests
pip install --no-index tqdm
pip install --no-deps ${PYTHON_WHL}/whatshap-0.18/dist/whatshap-0.18-cp36-cp36m-linux_x86_64.whl
pip install --no-deps ${ARTIC_WHL}
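For reference, this setup is intended to run inside a SLURM job script. A minimal skeleton, assuming a Compute Canada-style module system; the module name and resource values are placeholders to adjust per site:
#!/bin/bash
#SBATCH --time=03:00:00        # placeholder walltime
#SBATCH --cpus-per-task=16     # matches the THREADS value used below
#SBATCH --mem=64G              # placeholder memory request
module load python/3.6         # assumed module name; site-specific
# ...environment setup from above, followed by the ARTIC commands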
3.1 Read size filtering
Using the ARTIC guppyplex tool, we filter out reads whose lengths fall outside the expected amplicon size range:
MIN_LENGTH="400"
MAX_LENGTH="700"
FASTQ_DIR="/path/to/demultiplex/fastq"
POOL="pool_name"
artic guppyplex \
--min-length ${MIN_LENGTH} \
--max-length ${MAX_LENGTH} \
--directory ${FASTQ_DIR} \
--prefix ${POOL}
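When a run contains several barcodes, the same filter can be applied in a loop over the barcode directories; a minimal sketch, assuming one sub-directory per barcode from the demultiplexing step, which should produce the ${POOL}_barcodeXX.fastq files consumed in the next step:
for BC_DIR in /path/to/demultiplex/barcode*/; do
    artic guppyplex \
        --min-length ${MIN_LENGTH} \
        --max-length ${MAX_LENGTH} \
        --directory ${BC_DIR} \
        --prefix ${POOL}
done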
3.2 Run ARTIC Nanopolish pipeline
Run the ARTIC nanopolish pipeline using the following command:
NORMALISE="800"
THREADS="16"
PRIMERS_DIR="/path/to/primers/dir/"
POOL="pool_name"
BARCODE="barcodeXX"
FAST5_DIR="/path/to/fast5/"
SUMMARY="/path/to/sequencing_summary.txt"
PRIMERS_VER="nCoV-2019/V3"
SAMPLE="sample_name"
artic minion \
--normalise ${NORMALISE} \
--threads ${THREADS} \
--scheme-directory ${PRIMERS_DIR} \
--read-file ${POOL}_${BARCODE}.fastq \
--fast5-directory ${FAST5_DIR} \
--sequencing-summary ${SUMMARY} \
${PRIMERS_VER} \
${SAMPLE} |& tee -a ${SAMPLE}_nanopolish.log
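To process a whole run, the command can be wrapped in a loop over barcodes; a minimal sketch in which the SAMPLES mapping is a hypothetical stand-in for the run's sample sheet:
# Hypothetical barcode-to-sample mapping; replace with the real sample sheet
declare -A SAMPLES=( [barcode01]="sample_01" [barcode02]="sample_02" )
for BARCODE in "${!SAMPLES[@]}"; do
    SAMPLE=${SAMPLES[${BARCODE}]}
    artic minion \
        --normalise ${NORMALISE} \
        --threads ${THREADS} \
        --scheme-directory ${PRIMERS_DIR} \
        --read-file ${POOL}_${BARCODE}.fastq \
        --fast5-directory ${FAST5_DIR} \
        --sequencing-summary ${SUMMARY} \
        ${PRIMERS_VER} ${SAMPLE} |& tee -a ${SAMPLE}_nanopolish.log
done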
Internally, the artic minion command above runs the following commands:
nanopolish index -s ${SUMMARY} -d ${FAST5_DIR} ${POOL}_${BARCODE}.fastq
minimap2 -a -x map-ont -t 16 primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${POOL}_${BARCODE}.fastq | samtools view -bS -F 4 - | samtools sort -o ${SAMPLE}.sorted.bam
samtools index ${SAMPLE}.sorted.bam
align_trim --start --normalise 800 primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed --report ${SAMPLE}.alignreport.txt < ${SAMPLE}.sorted.bam 2> ${SAMPLE}.alignreport.er | samtools sort -T ${SAMPLE} - -o ${SAMPLE}.trimmed.rg.sorted.bam
align_trim --normalise 800 primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed --remove-incorrect-pairs --report ${SAMPLE}.alignreport.txt < ${SAMPLE}.sorted.bam 2> ${SAMPLE}.alignreport.er | samtools sort -T ${SAMPLE} - -o ${SAMPLE}.primertrimmed.rg.sorted.bam
samtools index ${SAMPLE}.trimmed.rg.sorted.bam
samtools index ${SAMPLE}.primertrimmed.rg.sorted.bam
nanopolish variants --min-flanking-sequence 10 -x 1000000 --progress -t 16 --reads ${POOL}_${BARCODE}.fastq -o ${SAMPLE}.nCoV-2019_2.vcf -b ${SAMPLE}.trimmed.rg.sorted.bam -g /lustre03/project/6007512/C3G/projects/Moreira_COVID19_Genotyping/artic
nanopolish variants --min-flanking-sequence 10 -x 1000000 --progress -t 16 --reads ${POOL}_${BARCODE}.fastq -o ${SAMPLE}.nCoV-2019_1.vcf -b ${SAMPLE}.trimmed.rg.sorted.bam -g /lustre03/project/6007512/C3G/projects/Moreira_COVID19_Genotyping/artic
artic_vcf_merge ${SAMPLE} primer_schemes/nCoV-2019/V3/nCoV-2019.scheme.bed nCoV-2019_2:${SAMPLE}.nCoV-2019_2.vcf nCoV-2019_1:${SAMPLE}.nCoV-2019_1.vcf
artic_vcf_filter --nanopolish ${SAMPLE}.merged.vcf ${SAMPLE}.pass.vcf ${SAMPLE}.fail.vcf
artic_make_depth_mask primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${SAMPLE}.primertrimmed.rg.sorted.bam ${SAMPLE}.coverage_mask.txt
bgzip -f ${SAMPLE}.pass.vcf
tabix -p vcf ${SAMPLE}.pass.vcf.gz
artic_mask primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ${SAMPLE}.coverage_mask.txt ${SAMPLE}.fail.vcf ${SAMPLE}.preconsensus.fasta
bcftools consensus -f ${SAMPLE}.preconsensus.fasta ${SAMPLE}.pass.vcf.gz -m ${SAMPLE}.coverage_mask.txt -o ${SAMPLE}.consensus.fasta
artic_fasta_header ${SAMPLE}.consensus.fasta "${SAMPLE}/ARTIC/nanopolish"
cat ${SAMPLE}.consensus.fasta primer_schemes/nCoV-2019/V3/nCoV-2019.reference.fasta > ${SAMPLE}.muscle.in.fasta
muscle -in ${SAMPLE}.muscle.in.fasta -out ${SAMPLE}.muscle.out.fasta
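A quick check on the resulting consensus is to count masked bases, since positions failing the depth mask or variant filter are written as N; a minimal sketch using standard tools:
# Count N bases in the consensus (header line excluded)
grep -v "^>" ${SAMPLE}.consensus.fasta | tr -cd 'Nn' | wc -c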
Create metrics
For metrics generation, the following tools are run:
samtools for general alignment stats
bedtools for coverage metrics
WUB for additional alignment metrics (see the nanoporetech/wub repository)
REFERENCE="/path/to/SARS-CoV2/reference.fasta"
SAMPLE="sample_name"
SORTED_BAM="/path/to/${SAMPLE}.sorted.bam"
WUB="/path/to/wub/"
samtools view -@ 4 -F 0x900 \
--output-fmt BAM \
--reference ${REFERENCE} \
-o ${SAMPLE}.nosup.bam \
${SORTED_BAM}
samtools sort -@ 4 --reference ${REFERENCE} \
--output-fmt BAM \
-o ${SAMPLE}.nosup.sorted.bam \
${SAMPLE}.nosup.bam
samtools index -@ 5 -b ${SAMPLE}.nosup.sorted.bam
bedtools genomecov -bga -ibam ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bedGraph
bedtools genomecov -ibam ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.histogram
samtools stats ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bam.stats
samtools depth ${SAMPLE}.nosup.bam > ${SAMPLE}.nosup.bam.depth
python ${WUB}/scripts/bam_alignment_qc.py -f ${REFERENCE} ${SAMPLE}.nosup.sorted.bam
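From the bedGraph, the breadth of coverage at a given depth can be computed directly; a minimal sketch with awk, using a hypothetical 10x threshold:
MIN_DEPTH=10   # hypothetical depth threshold
awk -v min=${MIN_DEPTH} '$4 >= min { covered += $3 - $2 } END { print covered, "bases covered at >=" min "x" }' ${SAMPLE}.nosup.bedGraph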
Plotting and reporting are done using a combination of R scripts that parse the BAM, VCF and metrics files.
Reference Genome and Software Versions
Reference Genome: Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 (GenBank MN908947.3)
Software versions
guppy-GPU v3.4.4
minimap2 v2.17
samtools v1.9
bcftools v1.9
bedtools v2.27.0
python v3.6
nanopolish v0.13.1
muscle v3.8.31