Short-read alignment and small variant calling

Nextflow and nf-core/sarek course.
Part 1: execute the test pipeline
Log on to the genobioinfo server:

```shell
ssh -X <username>@genobioinfo.toulouse.inrae.fr
```

Create (first time only) the Nextflow directories and links with the script `/usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh`:

```shell
sh /usr/local/bioinfo/src/NextflowWorkflows/create_nfx_dirs.sh
```

Create a `nextflow-sarek` directory in your working directory, then create a `LOGS` directory inside it.

Solution:

```shell
cd ~/work
mkdir nextflow-sarek
cd nextflow-sarek
mkdir LOGS
```

Search for the latest nf-core module version installed on genobioinfo and load it.

Solution:

```shell
search_module nfcore
module load bioinfo/NextflowWorkflows/nfcore-Nextflow-v23.04.3
```

Move to a compute node (with 4G of memory) and execute `nextflow`. How many sub-commands are available?
Solution: 15 sub-commands are available.

```shell
srun --mem 4G --pty bash
nextflow
```

```
Usage: nextflow [options] COMMAND [arg...]

Options:
  -C                        Use the specified configuration file(s) overriding any defaults
  -D                        Set JVM properties
  -bg                       Execute nextflow in background
  -c, -config               Add the specified file to configuration set
  -config-ignore-includes   Disable the parsing of config includes
  -d, -dockerize            Launch nextflow via Docker (experimental)
  -h                        Print this help
  -log                      Set nextflow log file path
  -q, -quiet                Do not print information messages
  -syslog                   Send logs to syslog server (eg. localhost:514)
  -trace                    Enable trace level logging for the specified package name -
                            multiple packages can be provided separating them with a
                            comma e.g. '-trace nextflow,io.seqera'
  -v, -version              Print the program version

Commands:
  clean         Clean up project cache and work directories
  clone         Clone a project into a folder
  config        Print a project configuration
  console       Launch Nextflow interactive console
  drop          Delete the local copy of a project
  help          Print the usage help for a command
  info          Print project and system runtime information
  kuberun       Execute a workflow in a Kubernetes cluster (experimental)
  list          List all downloaded projects
  log           Print executions log and runtime info
  pull          Download or update a project
  run           Execute a pipeline project
  secrets       Manage pipeline secrets (preview)
  self-update   Update nextflow runtime to the latest available version
  view          View project script file(s)
```

Execute the following command to ensure that the sarek pipeline version 3.3.2 can run properly on our cluster:
```shell
nextflow run nf-core/sarek -r 3.3.2 -profile genotoul,test --outdir test -work-dir work_test
```

- Has the pipeline been successfully completed?
- Does nextflow use singularity?
- Which variant calling software was used?
- In which directory did the job `CREATE_INTERVALS_BED` run?
Solution:

```shell
nextflow run nf-core/sarek -r 3.3.2 -profile genotoul,test --outdir test -work-dir work_test
```

```
nf-core/sarek v3.3.2-gf034b73
------------------------------------------------------
Core Nextflow options
  revision        : 3.3.2
  runName         : stoic_snyder
  containerEngine : singularity
  ...
Main options
  split_fastq     : 0
  intervals       : https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/genome.interval_list
  tools           : strelka
  ...
[28/646f54] process > NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_KNOWN_INDELS (mills_and_1000G.indels.vcf) [100%] 1 of 1 ✔
[-        ] process > NFCORE_SAREK:SAREK:PREPARE_GENOME:TABIX_PON -
[63/93aad8] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:CREATE_INTERVALS_BED (genome.interval_list) [100%] 1 of 1 ✔
[61/2910ad] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:GATK4_INTERVALLISTTOBED (genome) [100%] 1 of 1 ✔
[1a/79797b] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:TABIX_BGZIPTABIX_INTERVAL_SPLIT (chr22_1-40001) [100%] 1 of 1 ✔
...
-[nf-core/sarek] Pipeline completed successfully-
Completed at: 06-Nov-2023 15:09:16
Duration    : 2m 37s
CPU hours   : (a few seconds)
...
```

- Has the pipeline been successfully completed? Yes, it has been successfully completed, as stated at the end of the execution log.
- Does nextflow use singularity? Yes, the genotoul profile defines singularity as the container engine; this is indicated at the beginning of the execution log.
- Which variant calling software was used? The caller is listed in the `tools` entry of the sarek main options: strelka.
- In which directory did the job `CREATE_INTERVALS_BED` run? Look at the line `[63/93aad8] process > NFCORE_SAREK:SAREK:PREPARE_INTERVALS:CREATE_INTERVALS_BED`. The job ran in the `work/63/93aad855ad899aa4f7ebae39c89351` directory.
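Note that the bracketed hash printed by Nextflow, e.g. `[63/93aad8]`, is only a prefix of the task's work sub-directory name, so a shell glob expands it to the full path. A self-contained sketch (the `mkdir` merely recreates the layout of the run above so the command has something to match; in a real run the directory is created by Nextflow):

```shell
# [63/93aad8] in the log is a prefix of the task work sub-directory
# work/63/93aad8...; expand it with a glob to get the full path.
# The mkdir below only recreates the layout for this standalone demo.
mkdir -p work/63/93aad855ad899aa4f7ebae39c89351
ls -d work/63/93aad8*
```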
Part 2: configure the pipeline with your data
Leave the compute node using the `exit` command and create a `sample.csv` file with one line per fastq pair, see example.

Solution: the `sample.csv` file should contain the following lines:

```shell
cat > sample.csv
patient,sample,lane,fastq_1,fastq_2
chicken,SRR7062654,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062654_R2.fastq.gz
chicken,SRR7062655,1,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R1.fastq.gz,/save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/SRR7062655_R2.fastq.gz
^C
```

Create a `nextflow.config` file containing the genome information:

```shell
cat > nextflow.config
params {
    genomes {
        'Gallus_gallus-5.0_25-26' {
            fasta        = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa"
            species      = 'Gallus_gallus'
            known_indels = "${params.genomes_base}/Gallus_gallus_incl_consequences_chr25-26.vcf.gz"
        }
    }
}
^C
```
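Before launching the pipeline, it can be worth sanity-checking the sample sheet, since a malformed row makes sarek fail early. A minimal sketch (the `awk` check is ours, not part of sarek; a demo file stands in for your real `sample.csv`):

```shell
# Every data row of a sarek sample sheet must have the five columns
# patient,sample,lane,fastq_1,fastq_2. The demo file below stands in
# for the real sample.csv; point the awk command at your own file.
cat > sample_demo.csv <<'EOF'
patient,sample,lane,fastq_1,fastq_2
chicken,SRR7062654,1,reads_R1.fastq.gz,reads_R2.fastq.gz
EOF
awk -F',' 'NF != 5 { print "bad line " NR ": " $0; bad = 1 } END { exit bad }' sample_demo.csv \
  && echo "sample sheet OK"
```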
Part 3: run the pipeline with the chicken data
Run sarek with the following options:

```
-r 3.3.2 -profile genotoul --genomes_base /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/ --genome Gallus_gallus-5.0_25-26 --tools haplotypecaller,freebayes,deepvariant --joint_germline --input sample.csv --outdir results_chicken
```

Solution: write a `sarek_chicken.sh` script as follows:

```shell
cat > sarek_chicken.sh
#!/bin/bash
#SBATCH -J sarek_chicken
#SBATCH -p workq
#SBATCH --export=ALL
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH -e LOGS/%x_%j.err
#SBATCH -o LOGS/%x_%j.out

module purge
module load bioinfo/NextflowWorkflows/nfcore-Nextflow-v23.04.3

nextflow run nf-core/sarek \
    -r 3.3.2 \
    -profile genotoul \
    --genome 'Gallus_gallus-5.0_25-26' \
    --genomes_base /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/ \
    --tools haplotypecaller,freebayes,deepvariant \
    --joint_germline \
    --input sample.csv \
    --outdir results_chicken
^C
```

And run it using `sbatch`:

```shell
sbatch sarek_chicken.sh
```

Has the pipeline been successfully completed?
Solution: yes, it has been successfully completed.

Two ways to check:

- read the file `LOGS/sarek_chicken*.out` written by slurm and check the end message,
- use the `nextflow log` command and check the status of the last executed nextflow command.

💡 The nextflow command can provide a great deal of information about previous pipeline executions. Type `nextflow log -l` to list all the fields that can be included in the output of this command. For example, the following command shows the slurm job_id assigned to each of the jobs run by the last executed pipeline:

```shell
nextflow log $(nextflow log -q | tail -1) -f name,status,workdir,native_id
```

Notes:

If the pipeline failed, you can perform an error recovery using the nextflow `-resume` option. If you have to analyse new samples with the same reference genome, you can reuse indexes generated during a previous execution:
- you can create symbolic links to the `fasta` and `vcf.gz` files into the directory `results_chicken/reference` to group all files in the same directory,
- and edit your `nextflow.config` file to add all required indexes.
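The symbolic links mentioned above can be created along these lines; the source paths are the training-data locations used earlier in this course and should be adapted to your own reference layout:

```shell
# Gather the reference fasta and known-indels vcf in results_chicken/reference
# so that a single --genomes_base directory can serve future runs.
# Source paths are the training-data locations used above; adapt as needed.
mkdir -p results_chicken/reference
ln -s /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel_chr25-26.fa \
      results_chicken/reference/
ln -s /save/user/formation/public_html/16_SGS-SNP/Data_Chr25-26/Gallus_gallus_incl_consequences_chr25-26.vcf.gz \
      results_chicken/reference/
```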
Here is a `nextflow.config` file example:

```
params {
    genomes {
        'Gallus_gallus-5.0' {
            fasta            = "${params.genomes_base}/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa"
            fai              = "${params.genomes_base}/fai/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.fai"
            dict             = "${params.genomes_base}/dict/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.dict"
            species          = 'Gallus_gallus'
            bwa              = "${params.genomes_base}/bwa/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.{amb,ann,bwt,pac,sa}"
            known_indels     = "${params.genomes_base}/Gallus_gallus_incl_consequences.vcf.gz"
            known_indels_tbi = "${params.genomes_base}/Gallus_gallus_incl_consequences.vcf.gz.tbi"
        }
    }
}
```

During the next execution, the retrieved indexes will not be rebuilt but directly reused.
Now we need to make our HTML reports viewable.
- Copy the directory `results_chicken/multiqc` into your `public_html`, and explore the `multiqc_report.html` with your browser.
- Copy the directory `results_chicken/pipeline_info` into your `public_html` and check the `execution_report_*.html`, `execution_timeline_*.html` and `software_versions.yml` files.

Explore the results directory `results_chicken`.
Let's make a rough comparison of the output of the three callers used. Only HaplotypeCaller provides us with a merged vcf of the two samples (`results_chicken/variant_calling/haplotypecaller/joint_variant_calling/joint_germline.vcf.gz`). For Freebayes and Deepvariant, we have to do the merging by hand. Merge the vcf of each sample for these two callers using `bcftools merge`.

Solution:

```shell
search_module bcftools
module load bioinfo/Bcftools/1.17
for caller in deepvariant freebayes; do
    echo "bcftools merge results_chicken/variant_calling/$caller/SRR706265?/SRR706265?.$caller.vcf.gz -o $caller.vcf.gz"
done > bcftools_merge.jobs
sarray -J bcftools_merge -e LOGS/%x_%j.err -o LOGS/%x_%j.out bcftools_merge.jobs
```

In the current directory, create a symbolic link to the HaplotypeCaller merged vcf with the name `haplotypecaller.vcf.gz`.
Solution:

```shell
ln -s results_chicken/variant_calling/haplotypecaller/joint_variant_calling/joint_germline.vcf.gz haplotypecaller.vcf.gz
```

For each merged vcf, extract the chromosome/position columns to a tsv file and use jvenn to compare the data.

Solution:

```shell
for caller in deepvariant freebayes haplotypecaller; do
    zcat $caller.vcf.gz | grep -v ^# | cut -f 1,2 > $caller.tsv
done
```

What are your thoughts on the results of this comparison?
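As a command-line complement to jvenn, pairwise overlaps between the tsv files can be counted with `comm`, which needs sorted input. A self-contained sketch, with inline demo data standing in for two of the caller tsv files produced above:

```shell
# Count chromosome/position pairs shared between two callers with comm.
# The two printf lines create demo tsv files; with the real data, use e.g.
#   comm -12 <(sort deepvariant.tsv) <(sort freebayes.tsv) | wc -l
printf 'chr25\t100\nchr25\t200\nchr26\t50\n' | sort > a.tsv
printf 'chr25\t100\nchr26\t50\nchr26\t75\n'  | sort > b.tsv
echo "shared : $(comm -12 a.tsv b.tsv | wc -l)"   # positions called by both
echo "a only : $(comm -23 a.tsv b.tsv | wc -l)"   # positions unique to a
echo "b only : $(comm -13 a.tsv b.tsv | wc -l)"   # positions unique to b
```

These counts give the same numbers as the regions of a two-set Venn diagram, without leaving the terminal.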
Please fill in the following satisfaction survey: https://sondages.inrae.fr/index.php/84236?lang=fr