Releases: theiagen/public_health_bioinformatics
v2.2.1
Public Health Bioinformatics v2.2.1 Patch Release Notes
🩹 This patch release fixes the output names for the NCBI-Scrub standalone workflows.
Our documentation has also been migrated to GitHub for easier maintenance.
Full release notes can be found here!
Find our documentation here!
What's Changed
- [Documentation] Transfer all PHB documentation to GitHub by @sage-wright in #605
- [NCBI Scrub Standalone Workflows] Correct output declarations for the number of spots removed by @cimendes in #610
- [v2.2.1] update version tag by @sage-wright in #622
Full Changelog: v2.2.0...v2.2.1
v2.2.0
Public Health Bioinformatics v2.2.0 Minor Release Notes
This minor release adds two new workflows, Create_Terra_Table_PHB and Snippy_Streamline_FASTA_PHB, and makes significant improvements to the TheiaProk, TheiaCoV, TheiaMeta, and Freyja workflow series. Additionally, several bug fixes have been made.
Full release notes can be found here!
Find our documentation here!
🆕 New workflows:
- Create_Terra_Table_PHB
  - The manual creation of Terra tables can be tedious and error-prone. This workflow automatically creates your Terra data table when provided with the location of the files. It can import assembly, paired-end (Illumina) and single-end (Illumina and Oxford Nanopore) data.
  - Import the workflow from Dockstore.
- Snippy_Streamline_FASTA_PHB
  - Since Snippy_Variants_PHB is now compatible with assembled sequences as input in FASTA format, we have developed Snippy_Streamline_FASTA, an all-in-one approach to generating a reference-based phylogeny using the Snippy tools, mirroring the Snippy_Streamline_PHB workflow. By default, it runs Snippy_Variants and Snippy_Tree, but will optionally run Assembly_Fetch if a reference genome is not provided.
  - Import the workflow from Dockstore.
🚀 Changes to existing workflows:
- All TheiaProk Workflows
  - Genomic characterization with emmtyper is now enabled for Streptococcus pyogenes. (Thanks, @sam-baird!)
  - When call_ani is true, failures will no longer occur if multiple hits have the same score.
  - Support for Vibrio parahaemolyticus, Vibrio vulnificus and Enterobacter asburiae was added to the AMRFinderPlus task.
  - VirulenceFinder now runs on Shigella sonnei samples.
  - The Docker containers for AMRFinderPlus, tbp-parser and mlst have been updated:
    - AMRFinderPlus: 3.12.8-2024-07-22.1
    - tbp-parser: tbp-parser:1.6.0
    - mlst: 2.23.0-2024-08-01
  - Genomic characterization can now be skipped by setting the new optional input perform_characterization to false.
  - The GAMBIT prokaryotic database has been updated to v2.0.0-20240628.
  - Optional inputs are now available for all tasks within the merlin_magic subworkflow.
- All TheiaCoV Workflows
  - GenoFLU has been added for H5N1 influenza typing.
  - Additional VADR output files have been exposed:
    - File? vadr_feature_tbl_pass
    - File? vadr_feature_tbl_fail
    - File? vadr_classification_summary_file
    - File? vadr_all_outputs_tar_gz
  - Aligned FASTQs no longer contain supplementary/secondary alignments.
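The exclusion of secondary and supplementary alignments can be illustrated with SAM flag bits. This is a minimal sketch, not the workflow's actual implementation: the function name is hypothetical, and only the two flag values (0x100 for secondary, 0x800 for supplementary) come from the SAM format specification.

```python
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def is_primary_alignment(flag: int) -> bool:
    """Return True when neither the secondary nor the supplementary bit is set."""
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0


# Hypothetical flag values; only the first three describe primary alignments.
flags = [0, 99, 147, 256, 2048, 2064]
kept = [flag for flag in flags if is_primary_alignment(flag)]
print(kept)
```

Filtering on both bits at once is equivalent to excluding any record whose flag overlaps 0x900.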
- TheiaCoV_Illumina_PE_PHB and TheiaCoV_ONT_PHB
  - The workflow will no longer fail if an assembly cannot be produced. The assembly_fasta column will say "Assembly could not be generated".
- TheiaEuk_Illumina_PE_PHB
  - TheiaEuk no longer abruptly fails if an organism outside of the expected list of taxa is detected by GAMBIT.
  - All optional inputs and docker containers for taxa-specific sub-modules have been exposed.
- All ONT workflows (TheiaProk and TheiaCoV)
  - KMC is no longer used for genome-size prediction. Instead, for TheiaProk, the expected genome length is now set to 5 Mb, which is around 0.7 Mb larger than the average bacterial genome length. For TheiaCoV, species have default genome lengths associated with their organism tag.
- TheiaCoV and TheiaMeta workflows
  - The human read removal tool (HRRT) has been updated to v2.2.1. For paired-end data, reads are first interleaved to guarantee that no mates are orphaned by this tool.
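Interleaving means writing each forward read immediately followed by its mate, so that a tool processing the stream sees both reads of a pair together. The sketch below shows the idea on in-memory records; the function name and the record representation are assumptions for illustration and do not reflect how the HRRT task is written.

```python
def interleave(r1_records, r2_records):
    """Alternate R1 and R2 records so each read is followed by its mate."""
    if len(r1_records) != len(r2_records):
        raise ValueError("R1 and R2 must contain the same number of records")
    interleaved = []
    for forward, reverse in zip(r1_records, r2_records):
        interleaved.extend([forward, reverse])
    return interleaved


# Hypothetical read names standing in for full FASTQ records.
r1 = ["@read1/1", "@read2/1"]
r2 = ["@read1/2", "@read2/2"]
paired = interleave(r1, r2)
print(paired)
```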
- All Freyja Workflows
  - Freyja has been updated for all workflows to version 1.5.1.
  - The SARS-CoV-2 UShER barcodes file is now a .feather file.
  - Freyja_FASTQ_PHB is now compatible with Illumina paired-end, Illumina single-end and Oxford Nanopore data. A new input ont has been added to control workflow behavior.
  - The UShER barcodes and lineage files used are now exposed as outputs in Freyja_FASTQ_PHB.
- Snippy_Variants_PHB
  - In addition to paired-end and single-end reads, assemblies are now accepted as input. To use Illumina sequencing data, pass the forward reads through the read1 optional input and, for paired-end data, the reverse reads through the read2 optional input. To use assembled genomes, provide the assembly_fasta input and omit read1 and read2.
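The input rules for Snippy_Variants_PHB can be summarised as a small decision function. This is a sketch under assumptions: the function and the mode labels are hypothetical, and raising an error when reads and an assembly are supplied together is an illustrative choice, since the release notes only say to omit read1 and read2 when using assembly_fasta.

```python
from typing import Optional


def resolve_input_mode(
    read1: Optional[str] = None,
    read2: Optional[str] = None,
    assembly_fasta: Optional[str] = None,
) -> str:
    """Decide which kind of input was supplied (hypothetical helper)."""
    if assembly_fasta is not None:
        # Assumption: mixing reads with an assembly is treated as an error.
        if read1 is not None or read2 is not None:
            raise ValueError("provide either assembly_fasta or reads, not both")
        return "assembly"
    if read1 is None:
        raise ValueError("read1 or assembly_fasta is required")
    return "paired-end" if read2 is not None else "single-end"


print(resolve_input_mode(read1="s_R1.fastq.gz", read2="s_R2.fastq.gz"))
print(resolve_input_mode(read1="s_R1.fastq.gz"))
print(resolve_input_mode(assembly_fasta="s.fasta"))
```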
- SRA_Fetch_PHB
  - Low-quality files in SRA-Lite format are now detected.
- Augur_PHB
  - mpox mutation context has been added to the auspice_input_json output, which displays the fraction of mutations that are G->A or C->T.
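As an illustration of the quantity described above, the fraction of G->A or C->T mutations can be computed from a list of (reference, alternate) base pairs. This is a minimal sketch with a hypothetical function name and made-up input; it does not reproduce how Augur derives the value.

```python
def ga_ct_fraction(mutations):
    """Fraction of substitutions that are G->A or C->T.

    `mutations` is a list of (reference_base, alternate_base) tuples.
    """
    if not mutations:
        return 0.0
    targets = {("G", "A"), ("C", "T")}
    hits = sum(1 for ref, alt in mutations if (ref.upper(), alt.upper()) in targets)
    return hits / len(mutations)


# Hypothetical substitutions: two of the four match the target context.
example = [("G", "A"), ("C", "T"), ("A", "G"), ("T", "C")]
print(ga_ct_fraction(example))
```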
- GAMBIT_Query_PHB
  - The GAMBIT prokaryotic database has been updated to v2.0.0-20240628.
- Mercury_Prep_N_Batch_PHB
  - Mercury has been moved to its own repository at https://github.com/theiagen/mercury.
  - Mercury now processes BioSample & SRA metadata for flu.
What's Changed
- [TheiaProk] Add emmtyper task for Streptococcus pyogenes by @sam-baird in #524
- [SRA-Fetch] Detect SRA-Lite when it's low quality file by @cimendes in #512
- Adding the Create_Terra_Table_PHB workflow by @sage-wright in #533
- [Create_Terra_Table] recognize fastq files that end in .fq by @sage-wright in #535
- [TheiaProk - ANI] prevent failures when multiple top hits have the same score by @sage-wright in #532
- [TheiaCoV] Flu: Prevent workflow failures when assembly cannot be produced; generate NanoPlot outputs regardless of assembly success by @sage-wright in #530
- [theiaprok] amrfinderplus: add support for Vibrio parahaemolyticus, Vibrio vulnificus, Enterobacter asburiae. Fix C diff bug by @kapsakcj in #542
- [TheiaCoV] Add GenoFLU for flu whole-genome genotyping by @sage-wright in #540
- [TheiaProk] Merlin_magic subwf bugfix: run virulencefinder on Shigella sonnei by @kapsakcj in #543
- [TheiaCoV and TheiaMeta] Update hrrt (ncbi-scrub) to version 2.2.1 and optimise task by @cimendes in #527
- [TheiaCoV and TheiaMeta - HRRT] Patch bug by removing unneeded awk verification by @cimendes in #550
- Create CODEOWNERS by @AndrewLangvt in #554
- [TheiaProk] Add additional input enabling characterization by @sage-wright in #547
- Updating templates & broken links in the readme by @sage-wright in #555
- [TheiaEuk] Fix bug where String outputs were being passed as File for Snippy_variants by @cimendes in #574
- [TheiaProk] update tbp-parser to latest version by @sage-wright in #576
- [Create_Terra_Table] fix bug, and enable ability for users to provide their own file ending suffixes by @sage-wright in #575
- [theiacov] Add additional vadr output files & tarball; upgrade VADR docker by @kapsakcj in #556
- [ONT] Remove KMC by @sage-wright in #578
- [Create_Terra_Table] fix sample name i...
v2.1.0
Public Health Bioinformatics v2.1.0 Minor Release Notes
This minor release improves the utility and usability of several Oxford Nanopore Technologies’ dedicated workflows for viral and bacterial genomic characterization (TheiaCoV and TheiaProk). Additionally, support for new organisms has been added to several workflows.
Full release notes can be found here!
Find our documentation here!
🚀 Changes to existing workflows:
- All TheiaProk Workflows
  - General Abricate is now available through the call_abricate and abricate_db optional inputs.
  - Abricate specifically for Vibrio cholerae is now available. It launches automatically if the gambit_predicted_taxon or expected_taxon is Vibrio cholerae.
  - A new optional parameter separate_betalactam_genes is now available that splits AMRFinderPlus beta-lactam hits into new columns.
  - The call_midas optional input is now set to false by default.
- TheiaProk_Illumina_PE
  - New read quality-control outputs have been added: r1_mean_q_clean, r2_mean_q_clean, r1_mean_readlength_clean and r2_mean_readlength_clean.
- TheiaProk_ONT
  - New read quality-control outputs have been added: nanoplot_r1_median_readlength_raw, nanoplot_r1_stdev_readlength_raw, nanoplot_r1_n50_raw, nanoplot_r1_median_q_raw, nanoplot_r1_est_coverage_raw, nanoplot_r1_median_readlength_clean, nanoplot_r1_stdev_readlength_clean, nanoplot_r1_n50_clean, nanoplot_r1_median_q_clean and nanoplot_r1_est_coverage_clean.
  - Kraken2 is now available through the call_kraken and kraken_db optional inputs.
  - A maximum genome size of 10Mbp is set to prevent excessive runtimes.
All TheiaCoV Workflows
- RSV-A and RSV-B are now able to be analyzed with the TheiaCoV workflows. Nextclade characterization and Kraken taxonomic analysis will now be run on RSV samples.
- The following default organisms now have the following Nextclade dataset tags:
Organism New default Nextclade dataset tag SARS-CoV-2 "2024-06-13--23-42-47Z" mpox "2024-04-19--07-50-39Z" Flu H1N1 HA "2024-04-19--07-50-39Z" Flu H1N1 NA "2024-04-19--07-50-39Z" Flu H3N2 HA "2024-04-19--07-50-39Z" Flu H3N2 NA "2024-04-19--07-50-39Z" Flu Victoria HA "2024-04-19--07-50-39Z" Flu Victoria NA "2024-04-19--07-50-39Z"
- TheiaCoV_ONT
  - New read quality-control outputs have been added: nanoplot_r1_median_readlength_raw, nanoplot_r1_stdev_readlength_raw, nanoplot_r1_n50_raw, nanoplot_r1_median_q_raw, nanoplot_r1_est_coverage_raw, nanoplot_r1_median_readlength_clean, nanoplot_r1_stdev_readlength_clean, nanoplot_r1_n50_clean, nanoplot_r1_median_q_clean and nanoplot_r1_est_coverage_clean.
- TheiaCoV Flu Track
  - All of the flu-specific tasks now live in their own sub-workflow, flu_track. This has no effect on the end-user.
  - In TheiaCoV_ONT, flu samples will now have both the HA and NA segments' assembly mean coverage appear in the assembly_mean_coverage output variable. This reflects the behaviour already present in TheiaCoV_Illumina_PE.
  - The all-segments FASTA header lines now include samplename.
  - The new output irma_subtype_notes now indicates if IRMA was able to determine the flu subtype.
  - All workflows now use abricate_flu_subtype (instead of irma_subtype) for selecting the appropriate nextclade_dataset_tag.
  - Nextclade output columns for flu now explicitly state either HA or NA.
  - Padded assemblies, in which any "-" or "." present in the final assembly file is removed or replaced by N (respectively), are now provided to MAFFT and VADR to prevent task failures.
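The padding rule for flu assemblies (remove "-", replace "." with N) can be sketched as follows. The function names are hypothetical and the snippet is only an illustration of the rule as stated in these notes, not the code used in the flu_track sub-workflow; it leaves FASTA header lines untouched, which is an assumption.

```python
def pad_sequence(sequence: str) -> str:
    """Remove '-' characters and replace '.' characters with 'N'."""
    return sequence.replace("-", "").replace(".", "N")


def pad_fasta(fasta_text: str) -> str:
    """Apply the padding rule to sequence lines only, keeping headers as-is."""
    padded_lines = []
    for line in fasta_text.splitlines():
        padded_lines.append(line if line.startswith(">") else pad_sequence(line))
    return "\n".join(padded_lines)


# Hypothetical record: the header keeps its '-' and '.', the sequence does not.
record = ">sample-1.HA\nAC-GT..A"
print(pad_fasta(record))
```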
- Terra_2_NCBI
  - Skipping BioSample submission via the skip_biosample optional input now also removes the requirement to have BioSample metadata in your data table.
- Augur_Prep_PHB and Augur_PHB
  - RSV-A and RSV-B can now be analyzed with the Augur workflows.
  - Metadata is no longer required to run Augur. Only a distance tree will be created if metadata is not provided.
- kSNP3 and other phylogenetic inference workflows
  - Outputs from phylogenetic workflows (SNP matrices) and the summarize_data task will now have a properly toggleable Phandango coloring suffix.
  - The phandango_coloring optional input is now off by default.
Docker container updates:
- IRMA has been updated to version v1.1.5
- AMRFinderPlus has been updated to version v3.12.8-2024-05-02.2
- ts_mlst database has been updated as of 2024-06-01
- Pangolin database has been updated to pdata v1.27
🐛 Bug fixes and small improvements:
- TheiaProk_ONT and TheiaProk_FASTA: Hicap was being run in TheiaProk_ONT but the outputs were never appearing in the data table! This has been fixed.
- All TheiaCoV workflows: Unsupported organisms will no longer cause workflow failures.
- Terra_2_NCBI: Fixed a typo when using the Wastewater Biosample package that was causing an error.
- Freyja_Dashboard: The freyja_dasbhoard output variable now correctly says freyja_dashboard.
- Workflows that accept String inputs that are used to name things: Several input variables such as cluster_name now accept Strings with whitespace.
- All workflows: Runtime parameters have been adjusted for several tasks.
- TheiaCoV Flu Track: A bug has been fixed for IRMA running out of disk space. Additionally, another bug affecting Flu B samples was fixed related to empty HA segment FASTA files.
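The whitespace handling for naming inputs such as cluster_name is described in the change list as converting spaces to dashes. A minimal sketch of that conversion is shown below; the function name is hypothetical and the workflows may implement it differently.

```python
def spaces_to_dashes(value: str) -> str:
    """Replace every space with a dash so the value is safe to use in names."""
    return value.replace(" ", "-")


# Hypothetical cluster_name value containing whitespace.
print(spaces_to_dashes("my cluster 2024"))
```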
What's Changed
- TheiaCoV wf support for RSV - run nextclade by default and small optimizations (kraken_target_organism, genome_length) by @kapsakcj in #436
- [New workflow - internal] Gambitcore for assembly quality assessment with GAMBIT by @cimendes in #466
- [TheiaProk_ONT and TheiaCoV_ONT] Expose additional QC metrics from nanoplot for both raw and clean reads by @cimendes in #452
- Exposing r1 and r2 mean_q_clean and mean_readlength_clean by @jrotieno in #455
- [TheiaProk_ONT] add patch fix to kmc estimated genome size to not go over 10Mbp by @cimendes in #459
- Add abricate as optional module by @jrotieno in #431
- [TheiaProk_ONT] Add Kraken2 as part of read_qc by @cimendes in #438
- [Flu] Assembly mean coverage & read screen clean-up by @sage-wright in #469
- [Freyja_Dashboard] fix typo in freyja_dashboard output File variable name by @AndrewLangvt in #482
- [Terra_2_NCBI] remove metadata requirements with skip_biosample == true by @sage-wright in #475
- Augur Updates for RSV-A and RSV-B by @jrotieno in #478
- [kSNP3] fix behaviour when phandango colouring is set to false by @cimendes in #496
- [Internal] Updating runtime parameters by @sage-wright in #494
- Automatically convert spaces to dashes in workflows that accept strings by @AndrewLangvt in #498
- [TheiaCoV] Enable user to run TheiaCoV with an unsupported organism by @sage-wright in #501
- [AMRFinderPlus] parse BETA-LACTAM genes and subclasses into individual output columns by @sage-wright in #505
- IRMA bug fixes & improvements; theiacov_illumina_pe wf updates for Flu by @kapsakcj in #468
- Augur_PHB: Set sample_metadata_tsvs input to optional by @jrotieno in #503
- [Internal - Gambitcore] Downgrade database to stable 1.3.0 version by @cimendes in #473
- [TheiaCoV_Illumina_PE & _ONT] Create sub-workflow for flu-specific modules by @sage-wright in #502
- [TheiaProk] Add abricate module for vibrio characterization by @cimendes in #429
- [TheiaProk] expose hicap outputs in theiaprok_fasta and theiaprok_ont by @cimendes in #508
- Fix typo in Terra_2_NCBI Wastewater metadata by @michellescribner in #519
- [TheiaProk] Update amrfinderplus to v3.12.8; DB: v2024-05-02.2; reduce compute resources by @kapsakcj in #514
- [TheiaProk] upgrade mlst docker image to 2024-06-01 staphb build; reduced runtime parameters; enable preemptible by @kapsakcj in #516
- update default...
v2.0.1
Public Health Bioinformatics v2.0.1 Patch Release Notes
🩹 This patch release updates the default midas_db location
Full release notes can be found here!
Find our documentation here!
What's Changed
- update default midas_db location to requester pays bucket by @kapsakcj in #446
- Update version to v2.0.1 by @sage-wright in #448
Full Changelog: v2.0.0...v2.0.1
v2.0.0
Public Health Bioinformatics v2.0.0 Release Notes
This major release simplifies the usage of the TheiaCoV workflows and does major restructuring on all inputs and outputs on several workflows, including TheiaCoV, TheiaProk, TheiaEuk, and TheiaMeta. Additionally, it introduces three new workflows, improves on several workflows, and resolves various bugs.
Full release notes can be found here.
All inputs and outputs have been standardized across all of PHB. More information can be found here.
Find our documentation here!
🆕 New workflows:
- Kraken2_ONT_PHB
  - You can now analyze ONT data through the Kraken2 software.
  - Import the workflow from Dockstore
- TBProfiler_tNGS_PHB
  - This workflow is still in a beta state; development is currently ongoing.
  - It is used to process targeted next-generation sequencing (tNGS) Mycobacterium tuberculosis data for antimicrobial resistance (AMR) characterization with TBProfiler and tbp-parser. It includes quality assessment and control with Trimmomatic.
  - Import the workflow from Dockstore
- Find_Shared_Variants_PHB
  - Find_Shared_Variants_PHB is a workflow for concatenating the variant results produced by the Snippy_Variants_PHB workflow across multiple samples and reshaping the data to illustrate variants that are shared among multiple samples.
  - Import this workflow from Dockstore
🚀 Changes to existing workflows:
- TheiaCoV, TheiaProk, TheiaEuk and TheiaMeta workflows
  - All inputs and outputs have been standardized across all workflow series
- TheiaCoV Workflow Series
  - The workflow_parameters sub-workflow now controls all taxa-specific optional inputs in TheiaCoV. The default value for the organism input is still set to "sars-cov-2".
  - VADR is now enabled for flu, rsv-a and rsv-b.
  - Nextclade has been updated to v3. Dataset tags older than the ones provided by default are not compatible with the current version. The table below lists the updated nextclade_dataset_tag values.
  - Nextclade dataset names & their default values in TheiaCoV workflows have also changed. For example, "sars-cov-2" is now called "nextstrain/sars-cov-2/wuhan-hu-1/orfs". The name "sars-cov-2" still works as an alias, but we recommend using the full name because it is more descriptive and clearer, and will be supported by Nextclade for the foreseeable future.

| Organism | Old Dataset Name | New Dataset Name | New Dataset Tag |
| --- | --- | --- | --- |
| SARS-CoV-2 | "sars-cov-2" | "nextstrain/sars-cov-2/wuhan-hu-1/orfs" | 2024-04-15--15-08-22Z |
| Mpox (specifically, Mpox lineage B.1 dataset) | "hMPXV_B1" | "nextstrain/mpox/lineage-b.1" | 2024-01-16--20-31-02Z |
| Influenza A H1N1 HA | "flu_h1n1pdm_ha" | "nextstrain/flu/h1n1pdm/ha/MW626062" | 2024-01-16--20-31-02Z |
| Influenza A H3N2 HA | "flu_h3n2_ha" | "nextstrain/flu/h3n2/ha/EPI1857216" | 2024-02-22--16-12-03Z |
| Influenza B Victoria HA | "flu_vic_ha" | "nextstrain/flu/vic/ha/KX058884" | 2024-01-16--20-31-02Z |
| Influenza B Yamagata HA | "flu_yam_ha" | "nextstrain/flu/yam/ha/JN993010" | 2024-01-30--16-34-55Z |
| Influenza A H1N1 NA | "flu_h1n1pdm_na" | "nextstrain/flu/h1n1pdm/na/MW626056" | 2024-01-16--20-31-02Z |
| Influenza A H3N2 NA | "flu_h3n2_na" | "nextstrain/flu/h3n2/na/EPI1857215" | 2024-01-16--20-31-02Z |
| Influenza B Victoria NA | "flu_vic_na" | "nextstrain/flu/vic/na/CY073894" | 2024-01-16--20-31-02Z |
| RSV-A | "rsv_a" | "nextstrain/rsv/a/EPI_ISL_412866" | 2024-01-29--10-29-43Z |
| RSV-B | "rsv_b" | "nextstrain/rsv/b/EPI_ISL_1653999" | 2024-01-29--10-29-43Z |
- TheiaCoV Flu Track
  - For the flu track:
    - Tamiflu-resistance determination has been removed in favor of the oseltamivir nomenclature. Additionally, amantadine and rimantadine were added.
    - We now check for antiviral resistance mutations against the following antiviral drugs: A_315675, amantadine, compound_367, favipiravir, fludase, L_742_001, laninamivir, peramivir, pimodivir, rimantadine, oseltamivir, xofluza, zanamivir.
    - For TheiaCoV_Illumina_PE, assembly coverage is now computed for both HA and NA segments
    - Nextclade outputs are now computed for the NA fragment as well as HA
- TheiaProk Workflow Series
  - Plasmidfinder can now be toggled off through the call_plasmidfinder optional input
  - Trimmomatic encoding is now set to 33 by default to avoid failures when processing SRA-Lite formatted FASTQ files
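An encoding of 33 refers to the Phred+33 convention, where each quality character's ASCII code minus 33 gives the base quality score. The sketch below decodes a quality string under that convention; the function name is hypothetical and the example string is made up for illustration.

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Decode a FASTQ quality string that uses the Phred+33 offset."""
    return [ord(character) - 33 for character in quality]


# '!' has ASCII code 33, '5' has 53 and 'I' has 73.
scores = phred33_to_scores("!5I")
print(scores)
```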
- TheiaMeta
  - Automated binning has been integrated into TheiaMeta when a reference file is not provided. Binning is performed with SemiBin2
  - The assembly module optional inputs have been exposed, allowing the user to control the behavior of metaSPAdes and Pilon
- SRA_Fetch
  - A new warning column has now been implemented indicating if the downloaded file is suspected to be in SRA-Lite format
Docker container updates:
- Augur has been updated to commit hash cec4fa0ecd8612e4363d40662060a5a9c712d67e, from 2024-02-01
- BUSCO has been updated to version v5.7.1. Due to memory issues when running eukaryotic assemblies, TheiaEuk was excluded from this update and still runs on version v5.3.2
- pasty has been updated to version v1.3.0
- tbp-parser has been updated to version v1.4.2
- theiavalidate has been updated to version v0.1.0
- ts_mlst database has been updated as of April 2024
- VADR has been updated to version v1.6.3
🐛 Bug fixes and small improvements:
- All workflows: Fastq_Scan outputs have been renamed (now prefixed with fastq_scan_*) to differentiate them from fastQC. Several outputs for FastP and fastQC are now exposed such as the respective report HTMLs.
- TheiaCoV (all workflows): Edge-case bugs in QC_check and Pangolin have been resolved. The percent gene coverage task has been modularized.
- TheiaCoV Illumina PE: read1_aligned, read1_unaligned, read2_aligned, read2_unaligned, sorted_bam_aligned, sorted_bam_aligned_bai, sorted_bam_unaligned, and sorted_bam_unaligned_bai are now output by the workflow.
- TheiaProk (all workflows): midas_secondary_genus_coverage (the secondary genus absolute coverage) is now output.
- TheiaEuk: Several outputs from the snippy_variants task have been exposed: snippy_variants_num_reads_aligned, snippy_variants_num_variants, snippy_variants_coverage_tsv, and snippy_variants_percent_ref_coverage.
- BaseSpace_Fetch: A fix has been implemented that greatly speeds up the download of data from BaseSpace when using BaseSpace "Projects" to organize sequencing runs.
- Snippy_Streamline: snippy_concatenated_variants and snippy_shared_variants are now exposed as Snippy_Streamline outputs. The snippy_snp_matix output has been deprecated in favor of snippy_wg_snp_matrix and snippy_cg_snp_matrix.
- kSNP3: ksnp3_number_snps, ksnp3_number_core_snps and ksnp3_core_snp_table have been added to the collection of outputs.
- Kraken2 Standalone (all workflows): Uncompressed read files can now be processed by all Kraken2 workflows
- Freyja_FASTQ: A new optional input depth_cutoff has been added, giving the user the option to exclude sites with coverage depth below the provided value (by default no cutoff is performed). New outputs added: freyja_coverage and freyja_barcode_file
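The effect of a depth cutoff can be pictured as a filter over per-site coverage. The following is a sketch only: the function name and the dictionary representation are assumptions, a default of 0 is used here to stand in for "no cutoff", and it does not show how Freyja itself applies the option.

```python
def apply_depth_cutoff(site_depths, depth_cutoff=0):
    """Keep only sites whose coverage depth is at least `depth_cutoff`.

    `site_depths` maps a genome position to its coverage depth.
    """
    return {
        position: depth
        for position, depth in site_depths.items()
        if depth >= depth_cutoff
    }


# Hypothetical per-site depths.
depths = {100: 5, 101: 12, 102: 30}
print(apply_depth_cutoff(depths))
print(apply_depth_cutoff(depths, depth_cutoff=10))
```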
What's Changed
- Adding assembly_mean_coverage metrics for flu in TheiaCoV_Illumina_PE_PHB by @jrotieno in #314
- pangolin TMPDIR add and CI updates & improvements by @kapsakcj in #327
- expose optional input parameter disk_size for kraken2 standalone wfs by @kapsakcj in #316
- TheiaValidate: Compare file contents (#264) by @sage-wright in #335
- Added Freyja coverage output to Terra table by @emmadoughty in #317
- [TheiaMeta] Binning with SemiBin2 by @cimendes in https://github.com/theiagen/public_health_bioinformatics/...
v1.3.0
Public Health Bioinformatics v1.3.0 Release Notes
This minor release introduces two new workflows, improves on several workflows, and resolves various bugs
Full release notes can be found here.
🆕 New workflows:
- TheiaCoV_FASTA_Batch_PHB
  - This workflow implements TheiaCoV_FASTA for many SARS-CoV-2 samples at once.
  - This is a set-level workflow that populates the results to a sample-level data table in Terra.bio
  - Currently, this workflow only runs Pangolin4 and Nextclade
  - Import the workflow from Dockstore
- Rename_FASTQ_PHB
  - This workflow is a utility to quickly and easily rename a set of FASTQ files, either paired-end or single-end.
  - Import the workflow from Dockstore
🚀 Changes to existing workflows:
- TheiaCoV_ONT_PHB
  - Influenza is now supported. Use "flu" for the organism optional input String parameter. "sars-cov-2" and "HIV" tracks are unchanged.
- TheiaProk Workflow Series
  - If the user-input taxon (expected_taxon) or the taxon predicted by Gambit belongs to the Shigella genus, the Extensively Drug-Resistant phenotype is predicted using the new resfinder pointfinder database.
  - If the user-input taxon (expected_taxon) or the taxon predicted by Gambit is the Mycobacterium tuberculosis species, bcftools indexes and merges all potential VCF files created by TbProfiler (both .bcf and .gz files).
  - Kraken2 has been added as an optional module (except for TheiaProk_ONT_PHB). If call_kraken is true, a database must be provided through kraken_db.
  - Two new optional inputs were added to control ANIm behaviour: ani_threshold (default 85.00) and percent_bases_aligned_threshold (default 70.00).
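The two ANIm thresholds act as a joint filter on a hit. The sketch below shows one way to read them, using the default values stated above; the function name is hypothetical, and treating a value equal to the threshold as passing is an assumption rather than documented behaviour.

```python
def passes_ani_thresholds(
    ani: float,
    percent_bases_aligned: float,
    ani_threshold: float = 85.00,
    percent_bases_aligned_threshold: float = 70.00,
) -> bool:
    """Return True when a hit meets both thresholds (inclusive by assumption)."""
    return (
        ani >= ani_threshold
        and percent_bases_aligned >= percent_bases_aligned_threshold
    )


# Hypothetical hits: only the first satisfies both defaults.
print(passes_ani_thresholds(98.2, 91.5))
print(passes_ani_thresholds(98.2, 40.0))
print(passes_ani_thresholds(80.0, 91.5))
```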
- TheiaCoV_FASTA_PHB
  - The list of allowed values for the organism input now includes "sars-cov-2" (default), "rsv_a", "rsv_b", "WNV", "MPXV" and "flu".
- TheiaCoV_Illumina_PE_PHB
  - If organism is set as "flu", the workflow searches for antiviral mutations in the HA, NA, PA, PB1 and PB2 assembly segments, targeting the following 10 antivirals: A_315675, compound_367, Favipiravir, Fludase, L_742_001, Laninamivir, Peramivir, Pimodivir, Xofluza and Zanamivir.
- All Illumina SE and PE Workflows
  - A new optional input, read_qc, allows the user to choose between fastq_scan and fastqc for the evaluation of read quality. The affected workflows are: TheiaCoV_Illumina_PE_PHB, TheiaCoV_Illumina_SE_PHB, TheiaProk_Illumina_SE_PHB, TheiaProk_Illumina_PE_PHB, TheiaMeta_Illumina_PE_PHB and Freyja_FASTQ_PHB.
- CZGenEpi_Prep_PHB
  - Instead of extracting the sample_is_private_column_name and the gisaid_id_column_name columns, these columns are now generated by the program using already-provided inputs and the new is_private Boolean variable, which is used to set the value for all samples in the set. The field "GISAID ID (Public ID) - Optional" will now reflect the GISAID syntax for Virus Name.
Docker container updates:
- AMRFinderPlus has been updated to version v3.11.20 and database 2023-09-26.1
- tbp-parser has been updated to version 1.2.0
- Freyja has been updated to version 1.4.8
- ts_mlst database has been updated as of January 2024
- Gambit has been updated to version 1.3.0, including its database files
- Pangolin4 has been updated to version 4.3.1-pdata-1.23.1
- IRMA has been updated to version 1.1.3
Tag updates:
- SARS-CoV-2 Nextclade Dataset Tag has been updated to 2023-12-03T12:00:00Z
🐛 Bug fixes and small improvements:
- kSNP3_PHB: The ksnp3_core_vcf output has been renamed to ksnp3_vcf_ref_genome for readability. Additionally, two new outputs are provided: ksnp3_vcf_snps_not_in_ref and ksnp3_vcf_ref_samplename.
- TheiaProk Workflow Series: The MIDAS task was adjusted to reduce logging, and therefore the size of the log file, aiding debugging & reducing storage costs.
- TheiaMeta_Illumina_PE_PHB: A new task Krona was added for the visualization of the Kraken2 reports.
- Mercury_Prep_N_Batch: The excluded_samples.tsv is now printed to the execution log file, aiding debugging.
- TheiaCoV Workflow Series: The nextclade_lineage output now populates correctly for SARS-CoV-2. Additionally, the nextclade_qc field is now exposed as an output.
- Augur_PHB: The AUGUR refine input clock_filter_iqd has been reverted to the previous default value of 4.
- Kraken Standalone Workflows: A new task Krona was added for the visualization of the Kraken2 reports.
- TheiaValidate_PHB: TheiaValidate now outputs a table with validation-criteria failures only. Additionally, a new input was added that can translate different column names between tables to enable comparison.
- TheiaCoV_ONT_PHB: If a sample fails quality check with read screening, this will no longer cause the workflow to fail. Instead, it will finish with an appropriate message.
- Samples_To_Ref_Tree_PHB: The organism input has been renamed to nextclade_dataset_name for better clarity.
- Various workflows: Call caching was disabled in the following workflows: BaseSpace_Fetch_PHB, Transfer_Column_Content_PHB, Assembly_Fetch_PHB, Snippy_Streamline_PHB and TheiaValidate_PHB.
What's Changed
- updated VCF output file renaming in kSNP3 task by @kapsakcj in #207
- reduce unnecessary logging in MIDAS task by @kapsakcj in #210
- update default amrfinderplus docker image to v3.11.20 and db 2023-09-26.1 by @kapsakcj in #229
- TheiaCoV_ONT_PHB Influenza Track by @jrotieno in #233
- TheiaCoV_FASTA_Batch: TheiaCoV_FASTA, for many samples at once by @sage-wright in #238
- Add krona task to TheiaMeta_Illumina_PE by @cimendes in #213
- added 2 QC thresholds to ANI task to reduce false positives by @kapsakcj in #168
- Resfinder improvements, added support for Shigella spp., added XDR Shigella prediction by @kapsakcj in #159
- disable call caching for various workflows by @kapsakcj in #251
- Mercury_Prep_N_Batch: print the excluded_samples.tsv and update Docker to avoid Google SDK warning by @sage-wright in #220
- Nextclade Output Added by @DOH-HNH0303 in #239
- TheiaCoV_FASTA: Adding five new organisms by @jrotieno in #194
- Update task_augur_refine iqd back to 4 by @jrotieno in #268
- TheiaCoV Illumina PE: Identify Influenza Antiviral Resistance Mutations in Assemblies by @jrotieno in #252
- [New Utility] Workflow to rename FASTQ files (non-destructive) by @cimendes in #267
- [TheiaCoV_Fasta_Batch] Substitute FASTA concatenating task to ensure proper sample_id propagation by @cimendes in #274
- Kraken2 Standalone: add krona visualisation by @cimendes in #225
- TheiaValidate_PHB: new features and new Docker image from TheiaValidate repository by @sage-wright in #255
- TheiaProk TB: new VCF output and modification to the coverage report by @sage-wright in #245
- TheiaCoV_ONT: prevent failure by coercing files into strings by @sage-wright in #288
- update default freyja docker image to 1.4.8 for multiple tasks by @kapsakcj in #289
- FastQC added as an optional module in all Illumina_PE and Illumina_SE workflows by @sage-wright in #260
- update docker to version tag 2.23.0-2024-01 by @cimendes in #293
- [TheiaProk Workflows] Add Kraken2 as optional module by @cimendes in #286
- CZG...
v1.2.1
Public Health Bioinformatics v1.2.1 Release Notes
This patch release resolves various bugs and updates workflow defaults.
🐛 Bug Fixes
🦑 Kraken2_PE
- A bug was fixed in the Kraken2_PE_PHB standalone workflow where the workflow was expecting required outputs from the Kraken2_standalone task that are now optional. This resolves the failures encountered when importing the workflow.
Impacted Workflows/Tasks:
- Kraken2_PE_PHB
The following workflows use the Kraken2_standalone task but have not been affected as they do not require the affected outputs:
- TheiaMeta_Illumina_PE_PHB
- Kraken2_SE_PHB
The following workflows use a different Kraken2 task and have not been affected:
- TheiaCoV_Illumina_PE_PHB
- TheiaCoV_Illumina_SE_PHB
🌲 Augur
- The requirement to provide genes and colors input files was causing run failures for non-MPXV tree builds. These files are no longer required.
Users reported issues with optional Augur_PHB inputs, specifically colors_tsv, with the following error messages:
- Error_1: "Failed to evaluate 'colors_tsv' (reason 1 of 1): Evaluating select_first([colors, mpxv_defaults.colors]) failed: select_first was called with 2 empty values. We needed at least one to be filled."
- Error_2: "Failed to evaluate 'genes' (reason 1 of 1): Evaluating select_first([genes, mpxv_defaults.genes]) failed: select_first was called with 2 empty values. We needed at least one to be filled."
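Both errors come from the behaviour of the WDL select_first function, which returns the first defined value in an array and fails when none is defined. The Python analogue below mirrors that behaviour for illustration; it is not WDL and not code from Augur_PHB, and the file name used in the example is made up.

```python
def select_first(values):
    """Return the first value that is not None, mirroring WDL's select_first."""
    for value in values:
        if value is not None:
            return value
    raise ValueError(
        f"select_first was called with {len(values)} empty values. "
        "We needed at least one to be filled."
    )


# With a user-supplied file the call succeeds.
print(select_first([None, "colors.tsv"]))

# With neither a user input nor an organism default, the call fails,
# which is the situation reported for non-MPXV builds.
try:
    select_first([None, None])
except ValueError as error:
    print(error)
```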
📚 Read Screen
- The read screen task is designed to assess the quantity and quality of reads used as the input to the workflow, and halt the workflow if it is determined that the reads are insufficient. One of the qualities of the reads that is checked is the proportion of reads found in the R1 and R2 files.
- The former implementation did not calculate the proportion of reads correctly, and the reported error message did not reflect the defined parameter correctly.
- The math has been updated such that the ratio cannot be unbalanced beyond a 60/40 split.
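As an illustration of the 60/40 rule described above, here is a minimal Python sketch. It assumes the check is made on read counts from the R1 and R2 files; the function name, the max_share parameter, and the handling of empty inputs are hypothetical and are not taken from the read_screen task.

```python
def reads_are_balanced(r1_reads: int, r2_reads: int, max_share: float = 0.60) -> bool:
    """Return True when neither file holds more than max_share of all reads."""
    total = r1_reads + r2_reads
    if total == 0:
        # No reads at all: treat as failing the screen (an assumption).
        return False
    larger_share = max(r1_reads, r2_reads) / total
    return larger_share <= max_share


balanced = reads_are_balanced(550_000, 450_000)    # 55/45 split
unbalanced = reads_are_balanced(700_000, 300_000)  # 70/30 split
```

Checking the larger of the two shares against a single threshold keeps the test symmetric, so it does not matter which file is over-represented.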
🔧 Workflows Updates
Workflows
🔬 TheiaCoV Workflows
- The default nextclade_dataset_tag for SARS-CoV-2 was updated to "2023-09-21T12:00:00Z" (as of 2023-10-10) across all 5 TheiaCoV workflows:
- TheiaCoV_Illumina_PE_PHB, TheiaCoV_Illumina_SE_PHB, TheiaCoV_ClearLabs_PHB, TheiaCoV_ONT_PHB, TheiaCoV_FASTA_PHB
🦠 TheiaProk Workflows
- KmerFinder was added to the TheiaProk suite of workflows to find the best match (species identification) of a fasta file in a (kmer) database (downloaded on 2023-09-11).
New Outputs
- kmerfinder_docker
- kmerfinder_results_tsv
- kmerfinder_top_hit
- kmerfinder_query_coverage
- kmerfinder_template_coverage
- kmerfinder_database
Task Files
🎙️ UShER
- The runtime environment for the UShER task has been allocated additional compute resources to allow for larger input sets.
- The following defaults for the UShER task were changed:
- CPU 4 -> 8
- Memory 8 -> 32
- Impacted Workflows/Tasks
- UShER_PHB is the only affected workflow.
- The UShER task is used in the UShER workflow.
🔎 Pilon
- The runtime environment for the Pilon task has been allocated additional compute resources to allow for larger input sets.
- The following defaults for the Pilon task were changed:
- CPU 4 -> 8
- Memory 8 -> 32
- Impacted Workflows/Tasks
- TheiaMeta_Illumina_PE_PHB is the only affected workflow.
- The Pilon task is used in the metaspades_assembly sub-workflow.
🏭 What's Changed
- KmerFinder to TheiaProk by @cimendes in #188
- Remove the genes and colors input files by @sage-wright in #212
- update default nextclade dataset tag to "2023-09-21T12:00:00Z" for all TheiaCov wfs by @kapsakcj in #208
- update template and update PHB version by @sage-wright in #217
- Update tbp-parser Docker and new output by @sage-wright in #214
- Fix bugs in read proportion calculation - read_screen task by @cimendes in #209
- Fix bug on Kraken2_PE standalone workflow by @cimendes in #219
- Add additional section to the PR template by @sage-wright in #221
- Update compute resource defaults in task_usher.wdl by @frankambrosio3 in #222
- TheiaMeta - Update Pilon defaults (cpu and memory) by @cimendes in #223
Full Changelog: v1.2.0...v1.2.1
Please see the full documentation for the PHB repository v1.2.1 release.
v1.2.0
Public Health Bioinformatics v1.2.0 Release Notes
This minor release introduces three new workflows and resolves various bugs.
New workflows:
- TheiaMeta_Illumina_PE_PHB
  This workflow offers a versatile approach to de novo metagenomic assembly, providing the option to use either reference-based or reference-independent metagenomic assembly. Taxonomic characterization is also performed with Kraken2.
- CZGenEpi_Prep_PHB
  The CZGenEpi_Prep workflow formats metadata and assembly files for seamless integration with the Chan Zuckerberg GEN EPI platform.
- Samples_to_Ref_Tree_PHB
  In this workflow, Nextclade is used to rapidly place new samples onto an existing reference phylogenetic tree. Phylogenetic placement is done by comparing the mutations of the query sequence (relative to the reference) with the mutations of every node and tip in the reference tree, and finding the node which has the most similar set of mutations. This operation is repeated for each query sequence, until all of them are placed onto the tree.
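The placement step described for Samples_to_Ref_Tree_PHB can be sketched in a few lines of Python. This is a toy illustration, not Nextclade's implementation: it assumes "most similar" means the smallest symmetric difference between mutation sets, and the tree, node names, and sample mutations are made-up example data.

```python
from typing import Dict, Set


def place_query(query_mutations: Set[str], tree_nodes: Dict[str, Set[str]]) -> str:
    """Return the node whose mutation set differs least from the query's."""
    return min(tree_nodes, key=lambda node: len(query_mutations ^ tree_nodes[node]))


# Mutations of each node and tip, relative to the reference (toy data).
reference_tree = {
    "root": set(),
    "node_A": {"C241T", "A23403G"},
    "tip_A1": {"C241T", "A23403G", "G28881A"},
    "tip_B1": {"T445C"},
}

# Mutations of each query sequence, relative to the same reference.
queries = {
    "sample_1": {"C241T", "A23403G", "G28881A", "C10029T"},
    "sample_2": {"T445C"},
}

# Repeat the comparison for every query sequence.
placements = {name: place_query(muts, reference_tree) for name, muts in queries.items()}
```

Here sample_1 lands on tip_A1 (one mutation apart) and sample_2 on tip_B1 (identical mutation set).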
Changes in existing workflows
- Kraken2_SE_PHB
  Kraken2 output files were not being correctly identified by the single-end standalone workflow, causing it to fail unexpectedly. Output files should now populate the Terra data table correctly.
- KMC
  The output type of est_genome_size is now an int so data can be sorted numerically in a Terra data table when running TheiaProk_ONT. Additionally, this task no longer runs unnecessarily for the TheiaCoV_ONT workflow.
- TS_MLST
  The database has been updated as of August 2023.
  New outputs:
  - ts_mlst_docker
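The est_genome_size type change matters because text and numbers sort differently. The short Python sketch below shows the general effect with made-up genome sizes; it illustrates lexicographic versus numeric ordering and makes no claim about how Terra sorts columns internally.

```python
# Hypothetical estimated genome sizes, stored as text.
genome_sizes_as_strings = ["4800000", "12100000", "950000"]

# Text values are compared character by character, so "12100000" sorts first.
sorted_as_text = sorted(genome_sizes_as_strings)

# Integer values are compared by magnitude.
sorted_as_numbers = sorted(int(size) for size in genome_sizes_as_strings)
```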
Mycobacterium tuberculosis changes
- TBProfiler
  The default variant caller has been adjusted to FreeBayes to accurately identify resistance-conferring deletions and multi-nucleotide variants (MNVs).
- tbp-parser
  A TBProfiler parsing module has been added to apply variant interpretation logic based on recommendations by the WHO, CDC and CDPH to produce antitubercular drug resistance calls. Additionally, a set of machine- and human-interpretable files is produced to facilitate data sharing and interpretation. Find the source code here.
  New inputs:
  - tbprofiler_output_seq_method_type (default="WGS")
  - tbprofiler_operator (default="")
  - tbp_parser_min_depth (default=10)
  - tbp_parser_coverage_threshold (default=100)
  - tbp_parser_debug (default=false)
  - tbp_parser_docker_image (default="us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.0.1")
  New outputs:
  - tbprofiler_lims_report_csv
  - tbprofiler_looker_csv
  - tbprofiler_laboratorian_report_csv
  - tbprofiler_resistance_genes_percent_coverage
  - tbp_parser_genome_percent_coverage
  - tbp_parser_version
  - tbp_parser_docker
- Clockwork
  The clockwork module has been added to decontaminate read files of sequencing data that may come from a nontuberculous mycobacteria (NTM) or human genome.
  New outputs:
  - clockwork_decontaminated_read1
  - clockwork_decontaminated_read2
- TBDB
  The TBProfiler module uses a database called TBDB. We have modified the code to allow for custom databases to be used in place of the default TBDB. Additionally, we have created a custom database including mutations from TBDB, the WHO catalog, and a list of mutations included in the CDC's MTB pipeline Varpipe.
  By default, TBProfiler runs with the default database. If the Boolean input tbprofiler_run_custom_db is set to true and no database is provided by the user, a database containing both TBProfiler's TBDB and CDC Varpipe's collection of resistance-conferring mutations will be used by TBProfiler. In this database, the duplicate entries have been manually curated by removing the TBDB entry in favor of Varpipe's mutation annotation.
  New inputs:
  - tbprofiler_run_custom_db (default=false)
  - tbprofiler_custom_db (default="gs://theiagen-public-files/terra/theiaprok-files/tbdb_varpipe_combined.tar.gz")
Bug Fixes
- In the KMC task, the -n flag has been added to the echo command to avoid newline characters.
- An optional snippy_core_bed file input has been added to the Snippy_Tree workflow to enable site masking; this optional input is also exposed in the Snippy_Streamline workflow.
- The memory input for quast has been adjusted to match the style guide in the TheiaEuk_Illumina_PE_PHB workflow.
- The version_capture task now uses a Docker image hosted on Theiagen's Google Artifact Registry (GAR) instead of DockerHub; we also exposed docker as an optional input for this task.
- The plasmidfinder output parsing was overambitious when removing duplicates and removed every instance of a duplicate, instead of just one. This has been resolved.
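The plasmidfinder duplicate issue is a common de-duplication pitfall. The Python sketch below contrasts the two behaviours under stated assumptions: it is not the task's parsing code, the function names are hypothetical, and the replicon names are example values only.

```python
from typing import List


def drop_all_duplicated(items: List[str]) -> List[str]:
    """Overzealous behaviour: any value seen more than once vanishes entirely."""
    return [item for item in items if items.count(item) == 1]


def keep_first_occurrence(items: List[str]) -> List[str]:
    """Intended behaviour: keep one copy of each value, in original order."""
    return list(dict.fromkeys(items))


hits = ["IncFIB", "IncFII", "IncFIB", "Col156"]

overzealous = drop_all_duplicated(hits)
deduplicated = keep_first_occurrence(hits)
```

The first function loses IncFIB altogether, while the second keeps a single IncFIB entry alongside the other hits.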
What's Changed
- Create issue templates by @sage-wright in #175
- Add preemptibles, shorter version string by @aofarrel in #185
- Fix kraken2_standalone for SE data by @cimendes in #178
- Patch theiaprok ont - change est_genome_size to Int by @cimendes in #179
- plasmidfinder task bugfix and updates by @kapsakcj in #191
- TheiaMeta: Viral Metagenomics workflow by @cimendes in #64
- adding bed file input by @jrotieno in #190
- Jro mpxv global tree by @jrotieno in #160
- Adding tbp_parser and clockwork to TheiaProk by @frankambrosio3 in #192
- KMC on TheiaProk_ONT and TheiaCoV_ONT by @cimendes in #193
- CZGenEpi_Prep_PHB workflow by @sage-wright in #161
- update ts mlst docker (August 2023) by @cimendes in #195
- TBDB with varpipe by @cimendes in #197
- Smw tbprofiler continuing dev by @sage-wright in #199
- adjusted call block for quast in theiaeuk_illumina_pe_PHB workflow: m… by @kapsakcj in #200
- add -n to echo command in kmc to avoid new line by @frankambrosio3 in #201
- switch default docker image for version_capture to GAR-hosted image; CI change to micromamba by @kapsakcj in #198
- update version by @sage-wright in #204
- revert ncbi scrub changes to commid id 4e0fa54 by @cimendes in #205
Full Changelog: v1.1.0...v1.2.0
v1.1.0
Public Health Bioinformatics v1.1.0 Release Notes
This minor release introduces two new workflows, changes the outputs for the ONT workflows, and resolves various bugs.
New workflows:
- Terra_2_GISAID
  This workflow will submit concatenated metadata and assembly files to GISAID directly from Terra. The user must obtain a GISAID client-id before they can use this workflow.
- Usher_PHB
  This workflow will place your samples onto the most up-to-date versions of the UCSC's UShER phylogenetic trees and return subtree(s) that the user can visualize.
Major output changes in TheiaCoV_ONT and TheiaProk_ONT workflows
We identified an issue when using cg_pipeline in our ONT workflows that led to inaccurate QC metrics. We have corrected this issue by deprecating the use of cg_pipeline in all ONT workflows. QC metrics are now calculated using nanoplot, which is a tool geared specifically for ONT data. In addition, since fastq-scan is now redundant in these workflows, it has been removed.
Also, the maximum read length in TheiaProk_ONT was previously set to 10,000 base pairs. We have increased this to 100,000 base pairs by default.
- TheiaProk_ONT New Outputs
  The following columns are new:
  - nanoplot_num_reads_clean1
  - nanoplot_num_reads_raw1
  - nanoplot_r1_mean_q_clean
  - nanoplot_r1_mean_q_raw
  - nanoplot_r1_mean_readlength_clean
  - nanoplot_r1_mean_readlength_raw
  - nanoplot_tsv_clean
  - nanoplot_tsv_raw
  - nanoplot_version
  - nanoplot_docker
  - nanoplot_html_clean
  - nanoplot_html_raw
  The following variables are now generated using nanoplot:
  - est_coverage_raw
  - est_coverage_clean
  The following variables have been removed:
  - num_reads_clean1
  - num_reads_raw1
  - r1_mean_q_raw
  - r1_mean_readlength_raw
  - fastq_scan_version
- TheiaCoV_ONT New Outputs
  The following columns are new:
  - nanoplot_tsv_clean
  - nanoplot_tsv_raw
  - nanoplot_version
  - nanoplot_docker
  - nanoplot_html_clean
  - nanoplot_html_raw
  - est_coverage_raw
  - est_coverage_clean
  - r1_mean_readlength_clean
  - r1_mean_readlength_raw
  - r1_mean_q_clean
  - r1_mean_q_raw
  The following variables are now generated using nanoplot:
  - num_reads_clean1
  - num_reads_raw1
  The following variables have been removed:
  - fastq_scan_version
Bug Fixes
- Corrected an inaccurate file extension in the augur workflow.
- Adjusted several files to meet the style guide
- Adjusted the default value for the core_genome input in Snippy_Tree to be true.
- Fixed a bug in the summarize_data task
- Fixed a bug and added new outputs in the SRA_Fetch workflow
- Enabled the skipping of extra header columns in the Concatenate_Column_Content workflow
- Added the .gfa file from Dragonflye as output
- Updated default docker images and dataset tags for the Pangolin and Nextclade tasks.
- Updated the GAMBIT database to v1.1.0
- The GAMBIT docker image has been updated to use the latest GAMBIT version
- Fixed a bug in file name parsing in the Lyve_Set_PHB workflow
- Skipped the genome size estimation in the read_screen task for all ONT workflows.
What's Changed
- update default docker for busco to GAR docker image by @kapsakcj in #132
- change file extension by @sage-wright in #134
- minor mashtree improvements by @kapsakcj in #142
- [TheiaProk] expose kleborate_virulence_score and kleborate_resistance_score by @cimendes in #146
- Explode workflows by @sage-wright in #135
- Usher_PHB by @sage-wright in #149
- Snippy_Tree core_genome default value by @sage-wright in #144
- summarize_data task bug fix: -z bash conditional by @kapsakcj in #153
- SRA_fetch workflow & fastq-dl task improvements by @kapsakcj in #150
- Terra_2_GISAID by @sage-wright in #148
- Skip extra headers in Concatenate_Column_Content by @sage-wright in #162
- Deprecate the use of cg_pipeline for nanoplot stats by @cimendes in #164
- Update defaults by @sage-wright in #171
- update default gambit docker by @sage-wright in #173
- lyveset fastq file parsing bugfix and other improvements by @kapsakcj in #156
- update lyveSET FASTQ parsing by @kapsakcj in #177
Full Changelog: v1.0.1...v1.1.0
v1.0.1
Public Health Bioinformatics v1.0.1 Release Notes
This patch release resolves various bugs and updates workflows to use Theiagen-hosted Docker images.
Nextclade
- Updated the default Nextclade dataset tags and docker images
- Adjusted output parsing to ensure continuity between versions
- Corrected a variable name (gene_annotations_json is now gene_annotations_gff, as described by the Nextclade documentation)
AMRFinderPlus
- Enabled organism-specific AMR gene detection for new organisms: K. pneumoniae, K. oxytoca, K. aerogenes, S. pseudintermedius, Streptococcus pyogenes, V. cholerae, Burkholderia cepacia, Burkholderia pseudomallei, C. coli, C. jejuni, Citrobacter freundii, Clostridioides difficile, Enterobacter cloacae, Enterococcus faecalis, Enterococcus hirae, Enterococcus faecium, Serratia marcescens
- Added an optional String input, expected_taxon, to allow the user to override gambit_predicted_taxon as input for the amrfinderplus task
Shigella characterization
- Added option to change the Shigatyper Docker image
- Updated the default Shigeifinder Docker image
kSNP3
- Added an option to the kSNP3 workflow that allows users to add samples to an existing tree
Freyja
- Updated default docker images
- Changed result when database fails to update in Freyja_FASTQ (now fails instead of succeeding -- thanks @HNHalstead!)
TheiaValidate
- Fixed bug when the validation_criteria_tsv optional input variable was not provided
- Fixed bug when "NA" strings were incorrectly parsed as actual NAs
BaseSpace_Fetch
- Fixed bug when read files contained dots
Snippy/TheiaEuk
- Fixed bug in Snippy gene query task that previously caused product names that include commas to be truncated
What's Changed
- update theiacov GHA tests for v1.0.0 by @rpetit3 in #101
- Updates to nextclade flu dataset tags and image by @jrotieno in #105
- New metadata output on summarize_data task for phylogenetic workflows by @sage-wright in #103
- shigeifinder 1.3.5 updates by @kapsakcj in #109
- TheiaValidate fixes by @sage-wright in #104
- Update task_freyja_one_sample.wdl by @HNHalstead in #116
- BaseSpace_Fetch: convert periods to dashes by @sage-wright in #118
- Update Nextclade Parsing by @sage-wright in #112
- amrfinder & gambit updates by @kapsakcj in #74
- GHA for TheiaCoV_Illumina_PE/_SE by @sage-wright in #119
- Snippy Gene Query fixes by @sage-wright in #102
- Add option to append current kSNP3 run to an existing tree by @cimendes in #122
- Rename variable in nextclade by @sage-wright in #123
- Use docker images stored in Google Artifact Register by @andrewjpage in #120
- small shigella updates, take 2 by @kapsakcj in #125
- update readme by @sage-wright in #128
New Contributors
- @HNHalstead made their first contribution in #116
- @andrewjpage made their first contribution in #120
Full Changelog: v1.0.0...v1.0.1