BACTspeciesID

BACTspeciesID identifies microbial species quickly from genome assemblies using a 16S rRNA gene-based approach. It uses ABRicate to screen contigs for 16S rRNA genes, and it can also check whole-genome assemblies for potential contamination.

Dependencies - can be installed using Conda

16S rRNA Database

>HG530238.1.1461 Paucibacter toxinivorans
TCAGATTGAACGCTGGCGGCATGCCTTACACATGCAAGTCGAACGGCAGCACGGG

To set up 16S sequence database

Please refer to the relevant section of the ABRicate repository for how to set up a custom database. If you use the SILVA database, name the database SILVA-16S (the default for -d); otherwise, use any name you like and pass it with -d.
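As a rough illustration of that setup, the sketch below copies a 16S FASTA file into an ABRicate database folder and asks ABRicate to index it. The function name setup_16s_db, its arguments and the example paths are hypothetical. It assumes ABRicate's usual layout of one folder per database holding a file called sequences, and that your ABRicate version accepts --datadir together with --setupdb and --list (check abricate --help). ABRicate's own documentation remains the authority on the FASTA header format it expects.

```shell
#!/usr/bin/env bash
# Sketch only: register a 16S FASTA file as a custom ABRicate database.
# setup_16s_db and its arguments are hypothetical, not part of BACTspeciesID.

setup_16s_db() {
    local fasta="$1"               # 16S sequences in FASTA format
    local datadir="$2"             # ABRicate database folder (see `abricate --help`)
    local name="${3:-SILVA-16S}"   # database name, matching the -d default

    [ -s "$fasta" ] || { echo "missing or empty FASTA: $fasta" >&2; return 1; }

    # Assumed layout: <datadir>/<name>/sequences
    mkdir -p "$datadir/$name" || return 1
    cp "$fasta" "$datadir/$name/sequences" || return 1

    # Build the BLAST indices, then list the databases to confirm registration.
    abricate --setupdb --datadir "$datadir" || return 1
    abricate --list --datadir "$datadir"
}
```

A call might look like `setup_16s_db silva_16s.fasta /path/to/abricate/db SILVA-16S`, where both paths are placeholders.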

Usage

Options

$ bactspeciesID.sh -h
bactspeciesID identifies bacterial species/potential contaminations using whole genome assemblies

Usage: ./bactspeciesID2.sh [options] FASTA

Options:
 -i BLASTn identity (default:99)
 -d ABRicate database (default:SILVA-16S)
 -c BLASTn coverage (default:50)
 -m contamination check TRUE/FALSE (default:FALSE)
 -r removal of intermediary files TRUE/FALSE (default:TRUE)
 -h print usage and exit
 -a print author and exit
 -v print version and exit

Version 1.2 (2020)
Author: Raymond Kiu [email protected]

Input

A multi-FASTA genome assembly, one assembly per run. The assembly can be from any bacterial species.
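Because the script takes one assembly per run, several genomes need a loop. The following is a minimal sketch, assuming the assemblies sit in one folder with a .fna extension and that bactspeciesID.sh is on your PATH. The names run_batch and BACTSPECIESID are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: run bactspeciesID.sh on every *.fna assembly in a folder.
# run_batch and BACTSPECIESID are hypothetical names, not part of the tool.

BACTSPECIESID="${BACTSPECIESID:-bactspeciesID.sh}"

run_batch() {
    local dir="$1"
    (
        # Work inside the folder so each run sees a bare file name,
        # as in the README's own example (CA-1111.fna).
        cd "$dir" || exit 1
        for fasta in *.fna; do
            [ -e "$fasta" ] || continue   # no matches: the glob stays literal
            "$BACTSPECIESID" -m TRUE "$fasta" || echo "failed: $fasta" >&2
        done
    )
}
```

For example, `run_batch assemblies` would process every .fna file in a folder called assemblies, continuing past any assembly that fails.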

Run the software

You can specify the BLASTn identity (-i) and the ABRicate database (-d) if you like. Both are optional; if omitted, the defaults are used. The recommended 16S rRNA species boundary is 98.6%, so the default of 99% plays it safe. The SILVA database contains more than 100K manually curated sequences, so it is the recommended database.

I have tested BACTspeciesID on more than 70 samples from multiple species (e.g. Bifidobacterium breve, Bifidobacterium longum, Staphylococcus spp., E. coli, Citrobacter spp.) and achieved 100% accuracy, as supported by ANI (>95%) comparison with type strains. It can be used as a quick preliminary analysis.

Importantly, with -m TRUE, BACTspeciesID extracts all 16S sequences and then reports whether the genome is contaminated, based on comparison of those 16S gene sequences. If the 16S sequences originate from more than one species, the genome is deemed contaminated.

$ bactspeciesID.sh -i 99 -d SILVA-16S -m TRUE CA-1111.fna

------------------------------------------------------------------------------------
BactspeciesID will identify with the following parameters: 
BLAST identity 99% and coverage 50% with database SILVA-16S 
Contamination option is set to TRUE 
Intermediary file removal option is set to TRUE 
BactspeciesID will start identifying 16S rRNA genes from genome assembly CA-1111.fna 
------------------------------------------------------------------------------------
1 16S sequence(s) found, continue...
index file CA-1111.fna.fai not found, generating...
1 sequence(s) extracted...
[Identifying species using SILVA-16S database at identity >99%...]
Species identity is now stored in CA-1111.fna.species
[Intermediary files have been removed]

-----------------------------------------------------
The species identified for genome assembly CA-1111.fna : 
Escherichia coli
----------------------------------------------------------------------
[contamination check: this genome is NOT known to be contaminated :) ]
Thank you for using bactspeciesID!
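The contamination rule itself (16S sequences from more than one species means a contaminated genome) can be sketched in a few lines. This is an illustration of the rule only, not the script's actual implementation. check_contamination is a hypothetical helper that reads one species name per line, one line per 16S sequence found.

```shell
#!/usr/bin/env bash
# Sketch only: apply the ">1 species among the 16S hits" rule.
# check_contamination is hypothetical, not part of BACTspeciesID.

check_contamination() {
    # stdin: one species name per line; blank lines are ignored
    local n
    n=$(awk 'NF && !seen[$0]++ { n++ } END { print n + 0 }')
    if [ "$n" -gt 1 ]; then
        echo "contaminated ($n species)"
    elif [ "$n" -eq 1 ]; then
        echo "no contamination detected"
    else
        echo "no 16S hits"
    fi
}
```

Two hits that both name Escherichia coli count as a single species, so such a genome would not be flagged.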

Output

The result is printed to stdout and is also saved automatically to a file named after the input assembly with a .species suffix (e.g. CA-1111.fna.species for CA-1111.fna).
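After several runs, the .species files can be gathered into one table. The sketch below assumes each .species file is plain text with one species name per line, which the README does not spell out, so check a real output file first. summarise_species is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: collect *.species files into a tab-separated table.
# Assumes plain-text files with one species name per line (unconfirmed).

summarise_species() {
    local dir="${1:-.}"
    local f
    printf 'assembly\tspecies\n'
    for f in "$dir"/*.species; do
        [ -e "$f" ] || continue   # no matches: the glob stays literal
        # Join multiple species lines with ";" so each assembly stays on one row.
        printf '%s\t%s\n' "$(basename "$f" .species)" "$(paste -sd ';' "$f")"
    done
}
```

Running `summarise_species . > species_summary.tsv` would write one row per assembly found in the current folder.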

Issues

This script has been tested on Linux. It should run smoothly if the dependencies are properly installed. Please report any problems on the issues page.

Citation

If you use BACTspeciesID for results in your publication, please cite:

  • Kiu R, BACTspeciesID: identify microbial species and genome contamination using 16S rRNA gene approach, GitHub https://github.com/raymondkiu/bactspeciesID

License

GPLv3

Author

Raymond Kiu | [email protected] | @raymond_kiu