How fDOG works

Vinh Tran edited this page May 15, 2023 · 4 revisions

For a detailed description of the fDOG algorithm, please refer to the fDOG paper (in preparation).

fDOG workflow

In general, fDOG consists of three main steps: (1) core group compilation (steps in black), (2) ortholog search (steps in green), and (3) FAS score calculation (step in red).

Core group compilation

First, fDOG searches for orthologs of the seed sequence in all taxa within the coreTaxa_dir folder (--corepath). The reference species of the seed sequence must also be present in this core taxon list. Depending on the user-specified settings, fDOG tries to compile a core ortholog group for the seed with n-1 sequences (n is defined by the option --coreSize; the default is 6) while maximizing the taxonomic diversity of the core group within the range between the specified minimum and maximum ranks (options --minDist and --maxDist; the defaults are genus and kingdom, respectively). The resulting core ortholog group is saved in the core_orthologs folder (--hmmpath).

In this step, fDOG also uses FAS scores to choose the best candidate to add to the core group. This FAS score evaluation is skipped if the user sets the option --fasoff.
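
To make the idea of "maximize taxonomic diversity, then prefer the best FAS score" more concrete, here is a toy Python sketch. It is not fDOG's actual selection procedure: the rank list, the candidate dictionaries (with the hypothetical keys `taxon`, `rank` and `fas`) and the simple greedy rule are all assumptions made for illustration only.

```python
# Assumed rank order from closest to most distant; not taken from fDOG.
RANKS = ["species", "genus", "family", "order",
         "class", "phylum", "kingdom", "superkingdom"]


def compile_core_group(candidates, core_size=6, min_rank="genus",
                       max_rank="kingdom", use_fas=True):
    """Greedily pick up to core_size - 1 candidates (toy illustration).

    Each candidate is a dict with the hypothetical keys:
      taxon: name of the core taxon
      rank:  lowest rank it shares with the reference species
      fas:   FAS score of its ortholog candidate against the seed
    """
    lo, hi = RANKS.index(min_rank), RANKS.index(max_rank)
    # Keep only candidates whose distance lies between min_rank and max_rank.
    remaining = [c for c in candidates if lo <= RANKS.index(c["rank"]) <= hi]
    chosen, used_ranks = [], set()
    while remaining and len(chosen) < core_size - 1:
        # Prefer a rank not yet represented (diversity), then the higher FAS
        # score. With use_fas=False (mimicking --fasoff) the score is ignored.
        best = max(
            remaining,
            key=lambda c: (c["rank"] not in used_ranks,
                           c["fas"] if use_fas else 0.0),
        )
        chosen.append(best)
        used_ranks.add(best["rank"])
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    demo = [
        {"taxon": "A", "rank": "genus", "fas": 0.9},
        {"taxon": "B", "rank": "genus", "fas": 0.8},
        {"taxon": "C", "rank": "family", "fas": 0.7},
        {"taxon": "D", "rank": "species", "fas": 1.0},
        {"taxon": "E", "rank": "kingdom", "fas": 0.5},
    ]
    print([c["taxon"] for c in compile_core_group(demo, core_size=4)])
```

In this sketch, candidates closer than the minimum rank or more distant than the maximum rank are never considered, no matter how high their FAS score is.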

Ortholog search

Once the core ortholog group of the seed gene has been compiled, fDOG uses its profile HMM to find orthologs in the search taxa, which are all taxa in the searchTaxa_dir folder (--searchpath). The main output of this step is a multi-FASTA file (jobName.extended.fa), in which the seed sequence comes first, followed by all ortholog sequences found.

If the option --fasoff is used, the last step (FAS score calculation) is skipped, and fDOG creates an additional output file called jobName.phyloprofile, which can be used as input for the PhyloProfile tool for further phylogenetic analysis.

FAS score calculation

If --fasoff is not set, fDOG performs the FAS score calculation based on the jobName.extended.fa file. The fdogFAS function of the FAS tool is applied to compare the feature architecture of the seed protein against all other sequences in jobName.extended.fa. The outputs of this step are jobName.phyloprofile, jobName_forward.domains and jobName_reverse.domains.
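
The output files named on this page can be summarized in a small Python helper, for example to check that a run has finished. This is only a sketch: the function names are hypothetical, and it assumes that all output files of a job are written to a single output directory, which this page does not state.

```python
from pathlib import Path


def expected_outputs(job_name, fas_off=False):
    """List the output file names described on this page for one job."""
    files = [f"{job_name}.extended.fa", f"{job_name}.phyloprofile"]
    if not fas_off:
        # The domain files are only produced by the FAS score calculation.
        files += [f"{job_name}_forward.domains", f"{job_name}_reverse.domains"]
    return files


def missing_outputs(out_dir, job_name, fas_off=False):
    """Return the expected files that are absent from out_dir.

    Assumption: all outputs of a job sit directly in out_dir.
    """
    out_dir = Path(out_dir)
    return [name for name in expected_outputs(job_name, fas_off)
            if not (out_dir / name).is_file()]
```

For a job called `test`, `missing_outputs("some_output_dir", "test")` returns an empty list when all four files are present.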

Because fdogFAS takes the first sequence in the jobName.extended.fa file as the seed protein, check that this file is as expected if you encounter any strange FAS results.
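
One quick way to do this check is to look at the first header of the file. The following Python sketch assumes only that the file is plain FASTA and that your seed identifier appears somewhere in the seed's header line. The exact header layout written by fDOG is not described on this page, so the helper does a simple substring match, and both function names are hypothetical.

```python
def first_fasta_header(path):
    """Return the first header line of a FASTA file without '>', or None."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                return line[1:].strip()
    return None


def seed_is_first(path, seed_id):
    """Check whether seed_id occurs in the first FASTA header of the file."""
    header = first_fasta_header(path)
    return header is not None and seed_id in header
```

If `seed_is_first("jobName.extended.fa", "your_seed_id")` returns `False`, the first record is probably not the seed, which would explain unexpected FAS results.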