Fast microbial species identification from genome assemblies, using a 16S rRNA gene-based approach. BACTspeciesID relies on ABRicate for gene screening on contigs, and it can also check whole genome assemblies for potential contamination.
- ABRicate v1.0.1 (https://github.com/tseemann/abricate) with all its dependencies, such as BLAST+ v2.2.30 and any2fasta
- Barrnap v0.7 (https://github.com/tseemann/barrnap)
- Bedtools
- Samtools
- SILVA 16S database, which can be downloaded from here, with FASTA records in this format:
>HG530238.1.1461 Paucibacter toxinivorans
TCAGATTGAACGCTGGCGGCATGCCTTACACATGCAAGTCGAACGGCAGCACGGG
Please refer to this section in the ABRicate repository to set up the database. If you use SILVA, name the database SILVA-16S, which is the default for -d; alternatively, use any name you like and pass it with -d.
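Before registering a database it can help to confirm that its headers follow the format shown above. The following is a minimal sketch, not part of BACTspeciesID: `count_bad_headers` is a hypothetical helper that counts FASTA headers lacking the "accession, space, description" layout. The commented registration steps reflect my reading of the ABRicate README, and the paths in them are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with BACTspeciesID): count FASTA headers
# that do NOT look like ">ACCESSION Genus species".
count_bad_headers() {
    local fasta="$1"
    # A well-formed header is ">", a non-blank ID, whitespace, then a description.
    awk '/^>/ && !/^>[^ \t]+[ \t]+[^ \t]/ { n++ } END { print n + 0 }' "$fasta"
}

# Registering the database with ABRicate, as I understand its README
# (placeholder paths, adjust to your installation):
#   mkdir /path/to/abricate/db/SILVA-16S
#   cp silva_16S.fasta /path/to/abricate/db/SILVA-16S/sequences
#   abricate --setupdb
#   abricate --list      # SILVA-16S should now be listed
```

Running `count_bad_headers silva_16S.fasta` should print 0 if every header carries a species description.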
$ bactspeciesID.sh -h
bactspeciesID identifies bacterial species/potential contaminations using whole genome assemblies
Usage: ./bactspeciesID2.sh [options] FASTA
Options:
-i BLASTn identity (default:99)
-d ABRicate database (default:SILVA-16S)
-c BLASTn coverage (default:50)
-m contamination check TRUE/FALSE (default:FALSE)
-r removal of intermediary files TRUE/FALSE (default:TRUE)
-h print usage and exit
-a print author and exit
-v print version and exit
Version 1.2 (2020)
Author: Raymond Kiu [email protected]
The input is a multi-FASTA genome assembly, one assembly per run. It can be from any bacterial species.
You can specify the BLASTn identity (-i) and the ABRicate database (-d). Both are optional; if they are not specified, the default parameters are used. The recommended 16S rRNA species boundary is 98.6% identity, so the default of 99% plays it safe. The SILVA database has more than 100K manually curated sequences, so it is the recommended database to use.
I have tested BACTspeciesID on >70 samples from multiple species, e.g. Bifidobacterium breve, Bifidobacterium longum, Staphylococcus spp., E. coli and Citrobacter spp., and achieved 100% accuracy as supported by ANI (>95%) against type strains. It can be used as a quick preliminary analysis.
Importantly, with the -m option bactspeciesID extracts all 16S sequences and tells you whether the genome is contaminated, based on a comparison of those 16S gene sequences. If the 16S sequences originate from more than one species, the genome is deemed contaminated.
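The contamination rule described above (16S hits from more than one species means contaminated) can be sketched as follows. This is only an illustration of the rule, not the code in bactspeciesID.sh; `contamination_call` is a hypothetical helper that assumes one species name per line on stdin, one line per 16S hit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one species name per line (one per 16S hit)
# and report whether more than one distinct species is present.
contamination_call() {
    local n
    # Drop blank lines, collapse duplicate names, count what remains.
    n=$(awk 'NF' | sort -u | wc -l | tr -d '[:space:]')
    if [ "$n" -gt 1 ]; then
        echo "contaminated ($n species)"
    elif [ "$n" -eq 1 ]; then
        echo "not contaminated (1 species)"
    else
        echo "no 16S hits"
    fi
}
```

For example, `printf 'Escherichia coli\nStaphylococcus aureus\n' | contamination_call` reports a contaminated genome, while several hits that all name the same species do not.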
$ bactspeciesID.sh -i 99 -d SILVA-16S -m TRUE CA-1111.fna
------------------------------------------------------------------------------------
BactspeciesID will identify with the following parameters:
BLAST identity 99% and coverage 50% with database SILVA-16S
Contamination option is set to TRUE
Intermediary file removal option is set to TRUE
BactspeciesID will start identifying 16S rRNA genes from genome assembly CA-1111.fna
------------------------------------------------------------------------------------
1 16S sequence(s) found, continue...
index file CA-1111.fna.fai not found, generating...
1 sequence(s) extracted...
[Identifying species using SILVA-16S database at identity >99%...]
Species identity is now stored in CA-1111.fna.species
[Intermediary files have been removed]
-----------------------------------------------------
The species identified for genome assembly CA-1111.fna :
Escherichia coli
----------------------------------------------------------------------
[contamination check: this genome is NOT known to be contaminated :) ]
Thank you for using bactspeciesID!
The result is printed to stdout and is also saved automatically to a file named FASTA.fna.species, where FASTA is your genome assembly's name (CA-1111.fna.species in the example above).
This script has been tested on Linux and should run smoothly if the dependencies are properly installed. Please report any issues on the issues page.
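Since each run handles one assembly, you may end up with many .species files. A minimal sketch for gathering them into one table is shown here; `summarise_species` is a hypothetical helper, and it assumes the first line of each .species file holds the species name, which you should check against your own output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "assembly<TAB>species" for every *.species file
# in a directory (default: current directory).
# Assumption: the first line of each file is the species name.
summarise_species() {
    local dir="${1:-.}" f
    for f in "$dir"/*.species; do
        [ -e "$f" ] || continue   # no matches: the glob is left unexpanded
        printf '%s\t%s\n' "$(basename "$f" .species)" "$(head -n 1 "$f")"
    done
}
```

Calling `summarise_species results/ > species_table.tsv` would write one tab-separated line per assembly.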
If you use BACTspeciesID for results in your publication, please cite:
- Kiu R, BACTspeciesID: identify microbial species and genome contamination using 16S rRNA gene approach, GitHub
https://github.com/raymondkiu/bactspeciesID
Raymond Kiu | [email protected] | @raymond_kiu