Add new taxa to fDOG
To add a new taxon to fDOG, you need to follow its naming schema ([Species acronym]@[NCBI ID]@[Proteome version]) and place the necessary files in the correct folders:
- searchTaxa_dir (Contains sub-directories for proteome fasta files for each species)
- coreTaxa_dir (Contains sub-directories for BLAST databases made with makeblastdb out of your proteomes)
- annotation_dir (Contains feature annotation files for each proteome)
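The naming schema above can be checked with a few lines of Python before any files are copied. This is only a sketch for illustration: `TaxonName` and `parse_taxon_name` are hypothetical helpers, not part of fDOG, and the assumption that the NCBI ID is purely numeric comes from the examples on this page.

```python
from typing import NamedTuple


class TaxonName(NamedTuple):
    """The three parts of a name like HOMSA@9606@3 (hypothetical helper type)."""
    acronym: str
    ncbi_id: int
    version: str


def parse_taxon_name(name: str) -> TaxonName:
    """Split '[Species acronym]@[NCBI ID]@[Proteome version]' into its parts."""
    parts = name.split("@")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected acronym@ncbi_id@version, got {name!r}")
    acronym, ncbi_id, version = parts
    # Assumption: NCBI taxonomy IDs are plain decimal numbers.
    if not ncbi_id.isdecimal():
        raise ValueError(f"NCBI ID must be numeric, got {ncbi_id!r}")
    return TaxonName(acronym, int(ncbi_id), version)
```

For instance, `parse_taxon_name("HOMSA@9606@3")` yields the acronym `HOMSA`, the ID `9606` and the version `3`, while a name with a missing part raises a `ValueError`.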
We simplify this process by providing 2 functions: fdog.addTaxon and fdog.addTaxa.
Note: before using, please read the More section.
To add a single taxon, you can use the fdog.addTaxon function:
fdog.addTaxon -f your_genome.fa -i tax_id -c [-o /output/directory] [-n abbr_tax_name]
If the abbreviated taxon name is not given with the option -n abbr_tax_name, it will be suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbreviated taxon name for Homo sapiens will be HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbreviated taxon name will be UNK+taxid (e.g. UNK12345678).
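The fallback rule can be illustrated with a small sketch. Only the `UNK+taxid` fallback and the `Homo sapiens` to `HOMSA` example come from this page. The general rule used here (first three letters of the genus plus first two letters of the species) is an assumption inferred from that single example, and `suggest_abbr_name` is a hypothetical helper, not the fDOG implementation.

```python
from typing import Optional


def suggest_abbr_name(tax_id: int, ncbi_name: Optional[str] = None) -> str:
    """Suggest an abbreviated taxon name (illustrative sketch only)."""
    # Documented fallback: the ID is unknown to the NCBI taxonomy database.
    if not ncbi_name or not ncbi_name.split():
        return f"UNK{tax_id}"
    words = ncbi_name.split()
    if len(words) == 1:
        # Assumption: a single-word name is simply truncated to 5 letters.
        return words[0][:5].upper()
    # Assumption: 3 letters of the genus + 2 letters of the species.
    return (words[0][:3] + words[1][:2]).upper()
```

The real function may build abbreviations differently, so treat the suggested name as something to verify in the tool's output.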
The script will add a new folder named abbr_tax_name@tax_id@date, with the corresponding content, to searchTaxa_dir and coreTaxa_dir, as well as an annotation file abbr_tax_name@tax_id@date.json to annotation_dir. These 3 folders will be saved in /output/directory. If it is not specified, the new taxon will be added to the same directory as the pre-calculated data.
The header of each new FASTA sequence, i.e. the sequence ID, will be the first word of the original FASTA header. Everything after the first whitespace will be removed. If the first word is duplicated between different sequences, an increasing index will be added to make sure that the sequence IDs of the new FASTA file are unique.
For example, a FASTA file before processing:
>EXR66326.1 biofilm-associated domain protein, partial [Acinetobacter baumannii 339786]
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1 hypothetical protein J700_4015, partial [Acinetobacter baumannii 339786]
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...
and after (this is what your new sequence data will look like):
>EXR66326.1
MTGEGPVAIHAEAVDAQGNVDVADADVTLTIDTTPQDLITAITVPEDLNGDGILNAAELGTDGSFNAQVALGPDAVDGTV
>EXR66351.1
NRRLLITTQPTATDSNYKTPIYINAPNGELYFANQDETSVSSVVFKRVIGATAANAPYVASDSWTKKIRKWNTYNHEVSK
...
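The header rewriting shown above can be sketched as follows. This is a minimal illustration, not the fDOG code: `simplify_headers` is a hypothetical name, and the `_1`, `_2` suffix format for duplicated IDs is an assumption, since this page only says that "an increasing index" is added.

```python
from typing import Iterable, Iterator


def simplify_headers(lines: Iterable[str]) -> Iterator[str]:
    """Keep only the first word of each FASTA header and make IDs unique."""
    used = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith(">"):
            yield line  # sequence lines pass through unchanged
            continue
        fields = line[1:].split()
        # Assumption: an empty header gets the placeholder ID "seq".
        base = fields[0] if fields else "seq"
        seq_id, index = base, 0
        # Assumption: duplicates get a numeric suffix (_1, _2, ...).
        while seq_id in used:
            index += 1
            seq_id = f"{base}_{index}"
        used.add(seq_id)
        yield f">{seq_id}"
```

Applied to a header such as `>EXR66326.1 biofilm-associated domain protein, partial [Acinetobacter baumannii 339786]`, the sketch emits `>EXR66326.1`. A second sequence with the same first word would become `>EXR66326.1_1` under the suffix assumption.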
In most cases, you will need to add more than one taxon to fDOG. For this purpose, the fdog.addTaxa function can be used:
fdog.addTaxa -i /path/to/taxa/fasta -m mapping_file -c [-o /output/directory]
/path/to/taxa/fasta is a folder where the FASTA files of all new taxa can be found. mapping_file is a tab-delimited text file in which you provide the taxonomy IDs that correspond to the FASTA files:
#filename tax_id abbr_tax_name version
filename.faa 9606
filename1.fa 12345678
filename2.fasta 4932 my_fungi
...
The header line (starting with #) is mandatory. The values of the last 2 columns (abbreviated taxon name and genome version) are, however, optional. If you want to specify a new version for a genome, you also need to define the abbreviated taxon name, so that the genome version is always in the 4th column of the mapping file.
If the abbreviated taxon name is not given, it will be suggested automatically from the NCBI taxon name of the corresponding ID (e.g. the abbreviated taxon name for Homo sapiens will be HOMSA). If the given ID does not exist in the NCBI taxonomy database, the abbreviated taxon name will be UNK+taxid (e.g. UNK12345678).
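Before running fdog.addTaxa, it can help to sanity-check the mapping file against the rules above. The following sketch assumes the four-column, tab-delimited layout described here. `read_mapping` is a hypothetical helper rather than an fDOG function, and the requirement that tax_id be numeric is an assumption based on the examples.

```python
import csv
from typing import Dict, List, Optional, TextIO


def read_mapping(handle: TextIO) -> List[Dict[str, Optional[str]]]:
    """Parse a mapping file: filename, tax_id, [abbr_tax_name], [version]."""
    rows = list(csv.reader(handle, delimiter="\t"))
    # The header line starting with '#' is mandatory.
    if not rows or not rows[0] or not rows[0][0].startswith("#"):
        raise ValueError("mapping file must start with a '#' header line")
    records = []
    for number, row in enumerate(rows[1:], start=2):
        fields = [field.strip() for field in row]
        if not any(fields):
            continue  # skip blank lines
        fields += [""] * (4 - len(fields))  # pad the optional columns
        filename, tax_id, abbr, version = fields[:4]
        # Assumption: tax_id is a plain decimal NCBI taxonomy ID.
        if not filename or not tax_id.isdecimal():
            raise ValueError(f"line {number}: need a filename and a numeric tax_id")
        records.append({
            "filename": filename,
            "tax_id": tax_id,
            "abbr_tax_name": abbr or None,  # None: let the tool suggest a name
            "version": version or None,
        })
    return records
```

Open the file with `open(path, newline="")` and pass the handle to `read_mapping`. Rows that omit the last two columns come back with `None` in those fields.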
The script will check whether the combination abbr_tax_name@tax_id@version already exists in /output/directory/searchTaxa_dir. If it does, the script will give an error message, and the conflict needs to be resolved before continuing.
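The same check can be done by hand before starting a long run. This is a sketch under the assumption that each taxon is stored as a sub-directory of searchTaxa_dir named after the full combination, as described earlier on this page. `taxon_exists` is a hypothetical helper, not part of fDOG.

```python
from pathlib import Path
from typing import Union


def taxon_exists(output_dir: Union[str, Path], abbr_tax_name: str,
                 tax_id: Union[int, str], version: Union[int, str]) -> bool:
    """Return True if abbr_tax_name@tax_id@version is already in searchTaxa_dir."""
    name = f"{abbr_tax_name}@{tax_id}@{version}"
    # Assumption: one sub-directory per taxon inside searchTaxa_dir.
    return (Path(output_dir) / "searchTaxa_dir" / name).is_dir()
```

If the function returns `True`, choosing a different version or abbreviated name in the mapping file is one way to avoid the clash.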
These functions require makeblastdb to create BLAST databases for the input gene sets. Please install ncbi-blast+ if it is missing.
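Whether makeblastdb is reachable can be verified up front with the Python standard library. The sketch below only looks the executable up on the PATH. `find_makeblastdb` is a hypothetical helper name, and how fDOG itself locates the tool is not covered here.

```python
import shutil
from typing import Optional


def find_makeblastdb() -> Optional[str]:
    """Return the full path of makeblastdb, or None if it is not on the PATH."""
    return shutil.which("makeblastdb")


if __name__ == "__main__":
    path = find_makeblastdb()
    if path is None:
        print("makeblastdb not found; please install ncbi-blast+")
    else:
        print(f"makeblastdb found at {path}")
```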
For more information about the 2 Python functions, please read their help menus:
fdog.addTaxon -h
or
fdog.addTaxa -h