MGI CleanPlex SOP

Graphical summary

../_images/MGI_CleanplexPipeline.png

Description

The MGI analysis pipeline is based on the steps suggested by Paragon Genomics, the manufacturer of the CleanPlex amplicon panel, with important modifications to improve performance.

Steps

  1. Align read with bwa mem , filter out human reads on the bam file and convert to fastq

1.1 Raw reads are aligned with bwa mem and sorted by position with sambamba for merics

bwa mem  \
  -K 100000000 -v 3 -t 8 -Y \
  -R '@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tLB:${LIBRARY}\tPU:run1_2\tCN:McGill University and Genome Quebec Innovation Centre\tPL:MGI' \
  Coronavirinae.SARS-CoV-2/genome/bwa_index/Hybrid.SARS-CoV-2.GRCh38.fa \
  data/${SAMPLE}/${SAMPLE}.pair1.fastq.gz \
  data/${SAMPLE}/${SAMPLE}.pair2.fastq.gz | \
sambamba view -S -f bam \
  /dev/stdin | \
sambamba sort  \
  /dev/stdin \
  --out host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam

1.2 Human mapped reads with a mapping quality > 0 are filtered out and then sorted by name with sambamba

sambamba view \
  -f bam -F "not (ref_name =~ /chr*/ and mapping_quality >= 0)" \
  host_removal/${SAMPLE}/${SAMPLE}.hybrid.sorted.bam | \
sambamba sort \
  -N \
  /dev/stdin \
  --out host_removal/${SAMPLE}/${SAMPLE}.host_removed.nsorted.bam

1.3 Conversion of the bam filtered out and sorted by name into fastqs with samtools

samtools bam2fq \
  -@ 10 \
  -n \
  -1 host_removal/${SAMPLE}/${SAMPLE}.host_removed.pair1.fastq.gz \
  -2 host_removal/${SAMPLE}/${SAMPLE}.host_removed.pair2.fastq.gz \
  -0 host_removal/${SAMPLE}/${SAMPLE}.host_removed.other.fastq.gz \
  -s host_removal/${SAMPLE}/${SAMPLE}.host_removed.single.fastq.gz \
  host_removal/${SAMPLE}/${SAMPLE}.host_removed.nsorted.bam
  1. Trim sequencing adaptor with cutadapt

cutadapt -g GAACGACATGGCTACGATCCGACTT \
  -G GACCGCTTGGCCTCCGACTT \
  -a AAGTCGGAGGCCAAGCGGTC \
  -A AAGTCGGATCGTAGCCATGTCGTTC \
  -j 5 \
  -e 0.1 -O 9 -m 20 -n 2 --quality-cutoff 33 \
  -o cleaned_raw_reads/${SAMPLE}/${SAMPLE}.trim.pair1.fastq.gz \
  -p cleaned_raw_reads/${SAMPLE}/${SAMPLE}.trim.pair2.fastq.gz  \
  host_removal/${SAMPLE}/${SAMPLE}.host_removed.pair1.fastq.gz  \
  host_removal/${SAMPLE}/${SAMPLE}.host_removed.pair2.fastq.gz
  1. Align read with bwa mem and filter the bam file

3.1 Multimapping of the reads with bwa mem and sort with sambamba

bwa mem  \
  -K 100000000 -v 3 -t 8 -Y \
  -R '@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tLB:${LIBRARY}\tPU:run1_2\tCN:McGill University and Genome Quebec Innovation Centre\tPL:MGI' \
  Coronavirinae.SARS-CoV-2/genome/bwa_index/Coronavirinae.SARS-CoV-2.fa \
  trim/${SAMPLE}/${SAMPLE}.trim.pair1.fastq.gz \
  trim/${SAMPLE}/${SAMPLE}.trim.pair2.fastq.gz | \
sambamba view -S -f bam \
  /dev/stdin | \
sambamba sort  \
  /dev/stdin \
  --out alignment/${SAMPLE}/${SAMPLE}.sorted.bam && \
sambamba index  \
  alignment/${SAMPLE}/${SAMPLE}.sorted.bam \
  alignment/${SAMPLE}/${SAMPLE}.sorted.bam.bai

3.2 Mapped reads are filtered with sambamba and awk

sambamba view -h \
  alignment/${SAMPLE}/${SAMPLE}.sorted.bam | \
awk 'substr($0,1,1)=="@" || ($9 >= 60 && $9 <= 300) || ($9 <= -60 && $9 >= -300)' | \
sambamba view -S -f bam -F "not failed_quality_control and not unmapped and not secondary_alignment and mapping_quality > 0 and not supplementary and not (((mate_is_reverse_strand and reverse_strand) or (not reverse_strand and not mate_is_reverse_strand)))" \
  /dev/stdin \
  -o alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.bam
  1. Trim Amplicon primers with ivar and samtools

ivar trim -i alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.bam \
  -p alignment/${SAMPLE}/${SAMPLE}.primerTrim \
  -b Coronavirinae.SARS-CoV-2/Coronavirinae.SARS-CoV-2.MGI_ARTIC_primers.bed \
  -f Coronavirinae.SARS-CoV-2/Coronavirinae.SARS-CoV-2.MGI_ARTIC_primerpair.tsv \
  -m 30 -q 20 -s 4 -e && \
sambamba sort  \
  alignment/${SAMPLE}/${SAMPLE}.primerTrim.bam \
  --out alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.bam
  1. Call variants with ivar and samtools

samtools mpileup -A -d 1000000 -B -Q 0 \
  -l $MUGQIC_INSTALL_HOME_DEV/genomes/species/Coronavirinae.SARS-CoV-2/artic_amplicon.bed \
  --reference Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  -f Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.bam | \
ivar variants -p alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim \
  -r Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  -m 10 -q 20 -t 0.25 && \
ivar_variants_to_vcf.py alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.tsv alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.vcf && \
bgzip -cf \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.vcf > \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.vcf.gz && \
tabix -pvcf alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.vcf.gz
  1. Annotate the variants with SnpEff

java -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G \
  -jar $SNPEFF_HOME/snpEff.jar ann \
  -v -c $SNPEFF_HOME/snpEff.config \
  -s metrics/dna/${SAMPLE}/snpeff_metrics/${SAMPLE}.snpEff.html \
  MN908947.3 \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.vcf.gz > \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.annotate.vcf && \
bgzip -cf \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.annotate.vcf > \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.annotate.vcf.gz && \
tabix -pvcf alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.annotate.vcf.gz
  1. Generate consensus FASTA and rename header with ivar and samtools

7.1 Consensus generation with** ivar and samtools

samtools mpileup -aa -A -d 600000 -Q 0 \
  -f Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.bam | \
ivar consensus -p alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.consensus -q 20 -t 0.75 -m 10

7.2 Renaming of header for GISAID submission and consensus file for flagging outlayers (too many Ns, frameshifts and/or low coverage)

cons_len=`grep -oP "Total length \(>= 0 bp\)\t\K.*?(?=$)" metrics/dna/${SAMPLE}/quast_metrics/report.tsv`
N_count=`grep -oP "# N's\",\"quality\":\"Less is better\",\"values\":\[\K.*?(?=])" metrics/dna/${SAMPLE}/quast_metrics/report.html`
cons_perc_N=`echo "scale=2; 100*$N_count/$cons_len" | bc -l`
frameshift=`if grep -q "frameshift_variant" variant/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.annotate.vcf; then echo "FLAG"; fi`
genome_size=`awk '{print $2}' Coronavirinae.SARS-CoV-2/Coronavirinae.SARS-CoV-2.fa.fai`
bam_cov50X=`awk '{if ($4 > 50) {count = count + $3-$2}} END {if (count) {print count} else {print 0}}' alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.BedGraph`
bam_cov50X=`echo "scale=2; 100*$bam_cov50X/$genome_size" | bc -l`
STATUS=`awk -v bam_cov50X=$bam_cov50X -v frameshift=$frameshift -v cons_perc_N=$cons_perc_N 'BEGIN { if (cons_perc_N < 1 && frameshift != "FLAG" && bam_cov50X >= 90) {print "pass"} else if ((cons_perc_N >= 1 && cons_perc_N <= 5) || frameshift == "FLAG" || bam_cov50X < 90) {print "flag"} else if (cons_perc_N > 5) {print "rej"} }'`
export STATUS && \
awk '/^>/{print ">Canada/Qc-${SAMPLE}/2020 seq_method:MGI_NexteraFlex|assemb_method:ivar|snv_call_method:ivar"; next}{print}' < consensus/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.consensus.fa > consensus/${SAMPLE}/${SAMPLE}.consensus.MGI.${STATUS}.fasta
  1. Compute metrics

A series of scripts compute several metrics derived from the output of the analysis above. Those metrics are calculated at different steps in the pipeline and some are required for renaming the consensus file. Here are the full set of commands used to generate these metrics.:

#kraken2 metrics right after dehosting
kraken2 \
  --quick \
  --threads 5 \
  --db kraken2-2.1.0/db \
  --paired \
  --unclassified-out metrics/dna/${SAMPLE}/kraken_metrics/${SAMPLE}.unclassified_sequences#.fastq \
  --classified-out metrics/dna/${SAMPLE}/kraken_metrics/${SAMPLE}.classified_sequences#.fastq \
  --output metrics/dna/${SAMPLE}/kraken_metrics/${SAMPLE}.kraken2_output \
  --report metrics/dna/${SAMPLE}/kraken_metrics/${SAMPLE}.kraken2_report \
  host_removal/${SAMPLE}/${SAMPLE}.host_removed.pair1.fastq.gz \
  host_removal/${SAMPLE}/${SAMPLE}.host_removed.pair2.fastq.gz

#alignment and insert size metrics
gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
  CollectMultipleMetrics \
  --PROGRAM=CollectAlignmentSummaryMetrics \
  --PROGRAM=CollectInsertSizeMetrics \
  --VALIDATION_STRINGENCY=SILENT \
  --REFERENCE_SEQUENCE=Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  --OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.all.metrics \
  --MAX_RECORDS_IN_RAM=1000000

#OxoG metrics
gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
  CollectOxoGMetrics \
  --VALIDATION_STRINGENCY=SILENT  \
  --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  --OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.oxog_metrics.txt \
  --REFERENCE_SEQUENCE=Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  --MAX_RECORDS_IN_RAM=4000000

#GC biais metrics
gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
  CollectGcBiasMetrics \
  --VALIDATION_STRINGENCY=SILENT \
  --ALSO_IGNORE_DUPLICATES=TRUE \
  --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  --OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.qcbias_metrics.txt \
  --CHART=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.qcbias_metrics.pdf \
  --SUMMARY_OUTPUT=metrics/dna/${SAMPLE}/picard_metrics/${SAMPLE}.qcbias_summary_metrics.txt \
  --REFERENCE_SEQUENCE=Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  --MAX_RECORDS_IN_RAM=4000000

#qualimap metrics
qualimap bamqc -nt 10 \
  -bam alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam -outdir metrics/dna/${SAMPLE}/qualimap \
  --java-mem-size=55G

#flagstats metrics primerTrimed
sambamba flagstat  \
  alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  > metrics/dna/${SAMPLE}/flagstat/${SAMPLE}.sorted.primerTrim.flagstat

#bedGraph generation primerTrimmed
bedtools genomecov -bga \
  -ibam alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam > alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.BedGraph

#hs metrics
gatk --java-options "-XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx20G" \
  CollectHsMetrics \
  --INPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  --OUTPUT=alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.onTarget.tsv \
  --BAIT_INTERVALS=Coronavirinae.SARS-CoV-2/SARSCoV2.ampInsert.interval_list \
  --TARGET_INTERVALS=SARSCoV2.ampInsert.interval_list \
  --REFERENCE_SEQUENCE=Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa

#depth of coverage
java -XX:ParallelGCThreads=2 -Xmx8000M -jar $GATK_JAR \
  --analysis_type DepthOfCoverage --omitDepthOutputAtEachBase --logging_level ERROR \
  --reference_sequence Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  --input_file alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  --out alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.all.coverage \
  --intervals Coronavirinae.SARS-CoV-2/artic_amplicon.bed \
  --summaryCoverageThreshold 10 \
  --summaryCoverageThreshold 25 \
  --summaryCoverageThreshold 50 \
  --summaryCoverageThreshold 75 \
  --summaryCoverageThreshold 100 \
  --summaryCoverageThreshold 500 \
  --start 1 --stop 500 \
  --nBins 499 \
  --downsampling_type NONE

java -XX:ParallelGCThreads=1 -Dsamjdk.buffer_size=4194304 -Xmx31G -jar $BVATOOLS_JAR \
  depthofcoverage --gc --maxDepth 1001 --summaryCoverageThresholds 10,25,50,75,100,500,1000 --minMappingQuality 15 --minBaseQuality 15 --ommitN \
  --threads 8 \
  --ref Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  --intervals Coronavirinae.SARS-CoV-2/artic_amplicon.bed \
  --bam alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  > alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.coverage.tsv

#generate tdf file
java -Xmx8G  -Djava.awt.headless=true -jar $IGVTOOLS_JAR count -f min,max,mean -w 25 \
  alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.bam \
  alignment/${SAMPLE}/${SAMPLE}.sorted.primerTrim.tdf \
  Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa.fai

#generate quast metrics
quast.py -r Coronavirinae.SARS-CoV-2/genome/Coronavirinae.SARS-CoV-2.fa \
  --output-dir metrics/dna/${SAMPLE}/quast_metrics \
  --threads 5 \
  alignment/${SAMPLE}/${SAMPLE}.sorted.filtered.primerTrim.consensus.gisaid_renamed.fa

Reference Genome and Software Versions

Reference Genome: Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 (GenBank MN908947.3)

Software versions

bedtools v2.29.2
bvatools v1.6
bwa v0.7.17
cutadapt v2.10
GATK v4.1.2 and v3.8
igvtools v2.3.67
ivar v1.3
kraken2 v2.1.0
qualimap v2.2.1
quast v5.0.2
sambamba v0.7.0
samtools v1.11
snpEff v4.5covid19