Snakemake workflow to identify novel microbial species from a set of genomes.
Genomes are first quality-filtered based on their CheckM statistics, then compared against a genome database using Mash and MUMmer. Unknown hits are extracted, clustered at the species level using dRep, and further quality-controlled with GUNC.
git clone https://github.com/alexmsalmeida/magscreen.git
- Edit the config.yml file to point to the input, output and databases directories. The input directory should contain the .fa assemblies to analyse and a .csv file with CheckM completeness and contamination scores. The databases folder should contain the GUNC diamond database and a custom Mash database (.msh) with the genomes you want to screen against.
- (option 1) Run the pipeline locally (adjust -j based on the number of available cores)
snakemake --use-conda -k -j 4
- (option 2) Run the pipeline on a cluster (e.g., LSF)
snakemake --use-conda -k -j 100 --cluster-config cluster.yml --cluster 'bsub -n {cluster.nCPU} -M {cluster.mem} -o {cluster.output}'
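Before launching either option, it can help to confirm that the input and databases directories match the layout described in the setup step. The following is a minimal sketch, not part of the workflow itself: the function name check_layout is hypothetical, and the .dmnd extension for the GUNC diamond database is an assumption that this README does not state.

```python
from pathlib import Path


def check_layout(input_dir, db_dir):
    """Return a list of problems found; an empty list means the layout looks fine."""
    input_dir, db_dir = Path(input_dir), Path(db_dir)
    problems = []
    # The input directory needs the assemblies and the CheckM table.
    if not any(input_dir.glob("*.fa")):
        problems.append(f"no .fa assemblies in {input_dir}")
    if not any(input_dir.glob("*.csv")):
        problems.append(f"no CheckM .csv file in {input_dir}")
    # The databases directory needs the custom Mash sketch.
    if not any(db_dir.glob("*.msh")):
        problems.append(f"no Mash .msh database in {db_dir}")
    # Assumption: the GUNC diamond database is stored with a .dmnd extension.
    if not any(db_dir.glob("*.dmnd")):
        problems.append(f"no GUNC .dmnd database in {db_dir}")
    return problems
```

Calling check_layout with the same two paths set in config.yml and printing the returned list gives a quick summary of anything missing. The check only looks at file extensions, so it does not validate the contents of the CheckM table or of either database.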
The main output is located in the directory new_species/, which contains the best-quality representative genomes (.fa files) of each new species. New species matching all of the following criteria are filtered out:
- Flagged by GUNC: clade_separation_score >0.45; contamination_portion >0.05; reference_representation_score >0.5
- Are singletons (dRep clusters with only one member)
- Are <90% complete based on CheckM
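The filtering rule above can be expressed as a small predicate. This is an illustrative sketch rather than the workflow's actual code: the Candidate class, its field names, and both function names are hypothetical, and it assumes that the three GUNC thresholds must all be exceeded for a genome to count as flagged.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one species representative."""
    name: str
    clade_separation_score: float
    contamination_portion: float
    reference_representation_score: float
    cluster_size: int    # number of genomes in the dRep cluster
    completeness: float  # CheckM completeness, in percent


def gunc_flagged(c):
    # Assumption: all three GUNC thresholds must be exceeded together.
    return (c.clade_separation_score > 0.45
            and c.contamination_portion > 0.05
            and c.reference_representation_score > 0.5)


def is_filtered_out(c):
    # A candidate is dropped only when every criterion holds:
    # flagged by GUNC, a singleton cluster, and under 90% complete.
    return gunc_flagged(c) and c.cluster_size == 1 and c.completeness < 90
```

Under this reading, a GUNC-flagged singleton that is at least 90% complete is kept, as is a flagged genome under 90% complete whose dRep cluster has more than one member.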