The AGNOSTOS workflow DB-creation and DB-update modules were developed and tested on an HPC cluster with at least 4 nodes of 28 cores and 252 GB of memory each, using SLURM as the batch scheduler. To run either of these two modules, follow the steps below:
- Clone the repository:

```bash
git clone https://github.com/functional-dark-side/agnostos-wf
cd agnostos-wf/
```
- Install the packages that are not available via Conda. Check the installation script `installation_script.sh` and, if any of the listed programs are missing, install them using the commands from the script:

```bash
sh installation_script.sh
```
- Note on MMseqs2: the program can be installed via Conda; however, the latest workflow version was tested with commit "9cc89aa594131293b8bc2e7a121e2ed412f0b931", and newer releases could affect the workflow's performance and results.
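If you want to pin MMseqs2 to that exact commit, below is a minimal sketch of a source build (assuming `git` and `cmake` are available; the install location is just an example):

```bash
# Build MMseqs2 at the commit the workflow was tested with (illustrative)
git clone https://github.com/soedinglab/MMseqs2.git
cd MMseqs2
git checkout 9cc89aa594131293b8bc2e7a121e2ed412f0b931
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 4 && make install
# make the pinned binary visible on the PATH
export PATH="$(pwd)/bin:$PATH"
```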
- Check that you have the required external DBs listed in the config.yaml file (under "Databases"). If any of them are missing, you can find the download instructions in the script download_DBs.sh. To download all the needed databases at once, run:

```bash
sh download_DBs.sh
```

(Please note that this requires about 300 GB of disk space and may take a while.) Alternatively, the DBs can be downloaded separately in the various steps and removed after use by setting `db_mode: "memory"` in the config.yaml file.
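For reference, the corresponding entry in config.yaml would look like this (a minimal snippet; surrounding keys omitted):

```yaml
# Download DBs on the fly in each step and remove them after use,
# instead of keeping ~300 GB of databases on disk
db_mode: "memory"
```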
- Check the configuration files (.yaml) in the config/ folder. To change the program and output paths to your designated folders, you can use the following commands:

```bash
# cd into the workflow directory
cd workflow
sed -i 's|/vol/cloud/agnostos-wf/workflow|/your/workflow/path|g' config/config.yaml
sed -i 's|/vol/cloud/agnostos-wf/workflow|/your/workflow/path|g' config/config_communities.yaml
# your data directory
sed -i 's|/vol/cloud/agnostos_test/db_update_data|/your/data/path|g' config/config.yaml
# your results directory
sed -i 's|/vol/cloud/agnostos_test/db_update|/your/results/path|g' config/config.yaml
sed -i 's|/vol/cloud/agnostos_test/db_update|/your/results/path|g' config/config_communities.yaml
# the directory of the existing GC database
sed -i 's|/vol/cloud/agnostosDB|/your/GC_DB/path|g' config/config.yaml
# the directory of the external databases
sed -i 's|/vol/cloud/agnostos-wf/databases|/your/external_database/path|g' config/config.yaml
sed -i 's|/vol/cloud/agnostos-wf/databases|/your/external_database/path|g' config/config_communities.yaml
# OPTIONAL: the directory of the binaries needed by the workflow,
# by default in the workflow folder under the directory bin/
sed -i 's|/vol/cloud/agnostos-wf/bin/|/your/binaries/path|g' config/config.yaml
```
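After the substitutions, a quick, optional check can confirm that no template paths were left behind (illustrative; any match points to a sed pattern that did not apply):

```bash
# Lines still containing the template prefix /vol/cloud need manual editing
grep -n "/vol/cloud" config/config.yaml config/config_communities.yaml
```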
Additionally, you will have to specify whether your data consists of contigs (`data_stage: "contigs"`), self-predicted gene sequences (`data_stage: "genes"`), or gene predictions retrieved with anvi'o (`data_stage: "anvio_genes"`), and provide the name of the input files in the config.yaml file in the following entries:
```yaml
# Gene or contig file
data: "/your/data/path/your_genes.fasta"
# Specify the stage of your data: "contigs", "genes" or "anvio_genes"
data_stage: "genes"
# If you already have the gene predictions, please provide the path to the gene completeness information
## In case your data comes from an anvi'o contigDB, please specify here the anvi'o gene_calls.tsv file,
## retrieved via "anvi-export-gene-calls -c CONTIGS.db -o anvio_gene_calls.tsv"
data_partial: "/vol/cloud/agnostos_test/db_update_data/new_genes_partial_info.tsv"
```
NB: If you have separate files for each contig or gene prediction, please concatenate them into a single multi-FASTA file, as shown below.
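For example (a sketch; the per-sample file names are hypothetical):

```bash
# Combine per-sample FASTA files into the single input file referenced in config.yaml
cat /your/data/path/sample_*.fasta > /your/data/path/your_genes.fasta
```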
When everything is set, you can run AGNOSTOS as follows (example for the DB-update module):

```bash
cd workflow/
snakemake -s Snakefile --use-conda -j 100 --config module="update" \
  --cluster-config config/cluster.yaml \
  --cluster "sbatch --export=ALL -t {cluster.time} -c {threads} --ntasks-per-node {cluster.ntasks_per_node} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus_per_task} --job-name {rulename}.{jobid} --partition {cluster.partition}" \
  -R --until update_workflow_report
```
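Before submitting jobs to the cluster, a dry run can help catch configuration problems (an optional, illustrative check; `-n` makes Snakemake list the planned jobs without executing anything):

```bash
snakemake -s Snakefile -n -j 100 --config module="update"
```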
The profile-search module does not require an HPC environment and can be run on a local computer by following the steps below (install MMseqs2 first if you don't have it yet):
```bash
# download the AGNOSTOS seed database gene cluster profiles
wget https://figshare.com/ndownloader/files/30998305 -O mmseqs-profiles.tar.gz
tar -xzvf mmseqs-profiles.tar.gz

# download the AGNOSTOS seed database gene cluster categories
wget https://ndownloader.figshare.com/files/23067140 -O cluster_ids_categ.tsv.gz
gunzip cluster_ids_categ.tsv.gz

# run the sequence-profile search
Profile_search/profile_search.sh --query your-genes.fasta \
  --clu_hmm mmseqs-profiles/clu_hmm_db \
  --clu_cat cluster_ids_categ.tsv \
  --mmseqs /path/to/mmseqs \
  --mpi FALSE --threads 8
```
NOTE: On macOS you will probably need to install gnu-getopt, which supports long options (`--`). You can install it with `conda install -c bioconda gnu-getopt` or with Homebrew via `brew install gnu-getopt`.
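With Homebrew, gnu-getopt is installed keg-only, so it may need to be added to your PATH before running the script (a sketch; the prefix depends on your Homebrew installation):

```bash
# Put Homebrew's gnu-getopt ahead of the BSD getopt shipped with macOS
export PATH="$(brew --prefix gnu-getopt)/bin:$PATH"
```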