From 830bc245b4e27322608e8df39e3b3c2b38fdf04a Mon Sep 17 00:00:00 2001 From: Anshu Gupta Date: Thu, 19 Sep 2024 09:43:19 -0700 Subject: [PATCH] extensively updated documentation, added contribution and troubleshooting steps --- README.md | 22 ++- docs/cite.md | 11 ++ docs/contribution.md | 72 ++++++++++ docs/index.md | 307 ---------------------------------------- docs/install.md | 112 +++++++++++++++ docs/quickstart.md | 61 ++++++++ docs/troubleshooting.md | 111 +++++++++++++++ docs/usage.md | 118 +++++++++++++++ mkdocs.yml | 12 ++ 9 files changed, 512 insertions(+), 314 deletions(-) create mode 100644 docs/cite.md create mode 100644 docs/contribution.md create mode 100644 docs/install.md create mode 100644 docs/quickstart.md create mode 100644 docs/troubleshooting.md create mode 100644 docs/usage.md diff --git a/README.md b/README.md index fe1f9009..4d151099 100644 --- a/README.md +++ b/README.md @@ -7,10 +7,11 @@ [![License][license-badge]][license-link] [![Build Status](https://github.com/TurakhiaLab/ROADIES/actions/workflows/ci.yml/badge.svg)](https://github.com/TurakhiaLab/ROADIES/actions) -[](https://snakemake.readthedocs.io/en/v7.19.1/index.html) +[](https://snakemake.readthedocs.io/en/v7.19.1/index.html) +[](http://bioconda.github.io/recipes/roadies/README.html) [](https://hub.docker.com/r/ang037/roadies) [](https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1) -[](https://doi.org/10.5061/dryad.tht76hf73) +[](https://doi.org/10.5061/dryad.tht76hf73) [](https://youtu.be/1sR741TvZnM?si=vVNAnonvzNEzrLKq)
@@ -59,7 +60,9 @@ Welcome to the official repository of ROADIES, a novel pipeline designed for phy To run ROADIES using Bioconda package, follow these steps: -**Note:** You need to have conda installed in your system. To install and use conda in Ubuntu machine, execute the set of commands below: +**Note:** You need to have conda installed in your system. Also make sure you have updated version of glibc in your system (`GLIBC >= 2.29`). + +To install and use conda in Ubuntu machine, execute the set of commands below: ``` wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh @@ -89,7 +92,7 @@ conda activate myenv conda install roadies ``` -All files of ROADIES along with dependencies will be found in `/miniconda3/envs/new_env/ROADIES`. +All files of ROADIES along with dependencies will be found in `/miniconda3/envs/myenv/ROADIES`. ### Using DockerHub @@ -143,13 +146,14 @@ This will install and build all tools and dependencies. Once the setup is comple #### Required dependencies To run this script, ensure the following dependencies are installed: -- Java Runtime Environment (version 1.7 or higher) -- Python (version 3.9 or higher) +- Java Runtime Environment (Version 1.7 or higher) +- Python (Version 3.9 or higher) - `wget` and `unzip` commands -- GCC (version 11.4 or higher) +- GCC (Version 11.4 or higher) - cmake (Download here: https://cmake.org/download/) - Boost library (Download here: https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/) - zlib (Download here: http://www.zlib.net/) +- GLIBC (Version 2.29 or higher) For Ubuntu, you can install these dependencies with: @@ -217,6 +221,10 @@ python run_roadies.py --cores 16 --mode fast
+### For troubleshooting and contribution details, refer to [Wiki](https://turakhialab.github.io/ROADIES/) + +
+ ## Citing ROADIES If you use ROADIES in your research or publications, please cite the following paper: diff --git a/docs/cite.md b/docs/cite.md new file mode 100644 index 00000000..d93756c5 --- /dev/null +++ b/docs/cite.md @@ -0,0 +1,11 @@ +# Cite ROADIES + +If you use ROADIES in your research or publications, please cite the following paper: + +Gupta A, Mirarab S, Turakhia Y, (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. _bioRxiv_. [https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1](https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1). + +## Accessing ROADIES output files + +The output files with the gene trees and species trees generated by ROADIES in the manuscript are deposited to [Dryad](https://datadryad.org/stash). To access it, please refer to the following: + +Gupta, Anshu; Mirarab, Siavash; Turakhia, Yatish (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES [Dataset]. Dryad. [https://doi.org/10.5061/dryad.tht76hf73](https://doi.org/10.5061/dryad.tht76hf73). \ No newline at end of file diff --git a/docs/contribution.md b/docs/contribution.md new file mode 100644 index 00000000..91c8ab7d --- /dev/null +++ b/docs/contribution.md @@ -0,0 +1,72 @@ +# Contributions + +Thank you for considering contributing to **ROADIES**! We value your input and want to make it as easy as possible to get involved. Whether you're reporting a bug, adding a feature, or improving documentation, this guide will help you understand how to contribute effectively. 
+ +## Table of Contents +- [How to Contribute](#how-to-contribute) + - [Reporting Bugs](#reporting-bugs) + - [Suggesting Features](#suggesting-features) + - [Documentation Update](#documentation-update) + - [Submitting Changes](#submitting-changes) +- [Pull Request Guidelines](#pull-request-guidelines) +- [License](#license) +- [Contact Information](#contact-information) + +## How to Contribute + +### Reporting Bugs +If you find a bug, please [open an issue](https://github.com/TurakhiaLab/ROADIES/issues) and include: + +- A clear description of the issue, including steps to reproduce it. +- Screenshots or logs if applicable. + +### Suggesting Features +We welcome feature requests! Please [open a new feature request issue](https://github.com/TurakhiaLab/ROADIES/issues) and include: + +- A clear description of the feature. +- Use a label for the issue you created - `new feature request`. +- Why it would be useful and how it fits with the project’s existing features. +- Any alternatives you've considered. + +### Documentation Update +If you want to suggest updates or modifications to our documentation, please [open a new documentation issue](https://github.com/TurakhiaLab/ROADIES/issues) and include: + +- Clear description of the modifications or suggested additions. +- Use a label for the issue you created - `documentation`. + + +### Submitting Changes +1. Fork the repository. +2. Create a new branch (`git checkout -b feature/your-feature`). +3. Make your changes. +4. Ensure your changes pass all tests. To check that, run the following command: + +```bash +cd ROADIES +python run_roadies.py --cores 16 +``` +If this command runs to completion and produces the final species tree, your changes have passed the test and are ready to be pushed to your forked repository. + +5. Commit your changes (`git commit -m 'Add new feature'`). +6. Push your changes to your fork (`git push origin feature/your-feature`). +7. Create a pull request. 
+ +Please keep your pull requests small and focused. Large pull requests are harder to review and can delay the process. + +## Pull Request Guidelines +- Ensure the title is concise but descriptive. +- Include a detailed description of what the pull request does. +- Follow the [commit message guidelines](https://github.com/TurakhiaLab/ROADIES/blob/main/.github/pull_request_template.md). +- Link to any related issues or pull requests. + +## License + +By contributing, you agree that your contributions will be licensed under the [MIT License](https://github.com/TurakhiaLab/ROADIES/blob/main/LICENSE). + +## Contact Information + +For general inquiries and support, reach out to our team. + +Anshu Gupta - ang037 [at] ucsd [dot] edu + +Yatish Turakhia - yturakhia [at] ucsd [dot] edu \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 321efd66..a956e35a 100644 --- a/docs/index.md +++ b/docs/index.md @@ -63,310 +63,3 @@ The initial count of the genes is crucial to get the accurate species tree at th -## Ways to install ROADIES - -### Using ROADIES Bioconda package - -To run ROADIES using Bioconda package, follow these steps: - -**Note:** You need to have conda installed in your system. To install and use conda in Ubuntu machine, execute the set of commands below: - -``` -wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -chmod +x Miniconda3-latest-Linux-x86_64.sh -./Miniconda3-latest-Linux-x86_64.sh - -export PATH="$HOME/miniconda3/bin:$PATH" -source ~/.bashrc - -conda config --add channels defaults -conda config --add channels bioconda -conda config --add channels conda-forge -``` - -After this, try running `conda` in your terminal to check if conda is properly installed. Once it is installed, follow the steps below: - -1. Create and activate custom conda environment with Python version 3.9 - -``` -conda create -n myenv python=3.9 -conda activate myenv -``` - -2. 
Install ROADIES bioconda package - -``` -conda install roadies -``` - -All files of ROADIES along with dependencies will be found in `/miniconda3/envs/new_env/ROADIES`. - -### Using DockerHub - -To run ROADIES using DockerHub, follow these steps: - -1. Pull the ROADIES Docker image from DockerHub: - -``` -docker pull ang037/roadies:latest -``` -2. Run the Docker container: - -``` -docker run -it ang037/roadies:latest -``` - -### Using Docker locally - -First, clone the repository (requires `git` to be installed in the system): - -``` -git clone https://github.com/TurakhiaLab/ROADIES.git -cd ROADIES -``` - -Then build and run the Docker container: - -``` -docker build -t roadies_image . -docker run -it roadies_image -``` - -### Using installation script (requires sudo access) - -First clone the repository: - -``` -git clone https://github.com/TurakhiaLab/ROADIES.git -cd ROADIES -``` - -Then, execute the installation script: - -``` -chmod +x roadies_env.sh -source roadies_env.sh -``` - -This will install and build all tools and dependencies. Once the setup is complete, it will print `Setup complete` in the terminal and activate the `roadies_env` environment with all Conda packages installed. - -!!! Note - ROADIES is built on [Snakemake (workflow parallelization tool)](https://snakemake.readthedocs.io/en/stable/). It also requires various tools (PASTA, LASTZ, RAxML-NG, MashTree, FastTree, ASTRAL-Pro2) to be installed before performing the analysis. To ease the process, instead of individually installing the tools, we provide `roadies_env.sh` script to automatically download all dependencies into the user system. 
- -#### Required dependencies - -To run this script, ensure the following dependencies are installed: -- Java Runtime Environment (version 1.7 or higher) -- Python (version 3 or higher) -- `wget` and `unzip` commands -- GCC (version 11.4 or higher) -- cmake (Download here: https://cmake.org/download/) -- Boost library (Download here: https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/) -- zlib (Download here: http://www.zlib.net/) - -For Ubuntu, you can install these dependencies with: - -``` -sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptools git default-jre libgomp1 libboost-all-dev cmake -``` - -!!! Warning - If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`. - - -## Quick start (with provided test dataset) - -Once setup is done, you can run the ROADIES pipeline using the provided test dataset. Follow these steps for a 16-core machine: - -1. Go to ROADIES repository directory if not there: - -``` -cd ROADIES -``` - -2. Create a directory for the test data and download the test datasets (using the following one line command): - -``` -mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}' -``` -3. Run the pipeline with the following command (from ROADIES directory): - -``` -python run_roadies.py --cores 16 -``` - -The second command will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. The third command will run ROADIES for those 11 Drosophila genomes and save the final newick tree as `roadies.nwk` in a separate `output_files` folder upon completion. 
- -**Running ROADIES with different modes of operation**: To run ROADIES in various other modes of operation (fast, balanced, accurate) (description of these modes are mentioned in [Modes of operation](index.md#modes-of-operation) section), try the following commands: - -``` -python run_roadies.py --cores 16 --mode accurate -``` - -``` -python run_roadies.py --cores 16 --mode balanced -``` - -``` -python run_roadies.py --cores 16 --mode fast -``` -!!! Note - Accurate mode is the default mode of operation. If you don't specify any particular mode using `--mode` argument, default mode will run. - -For each modes, the output species tree will be saved as `roadies.nwk` in a separate `output_files` folder. - -**Running ROADIES in converge mode**: To run ROADIES with converge mode (details mentioned in [convergence mechanism](index.md#convergence-mechanism) section), run the following command (notice the addition of `--converge` argument): - -``` -python run_roadies.py --cores 16 --converge -``` - -Try following commands for other modes: - -``` -python run_roadies.py --cores 16 --mode balanced --converge -``` -``` -python run_roadies.py --cores 16 --mode fast --converge -``` - -The output files for all iterations will be saved in a separate `converge_files` folder. `output_files` will save the results of the last iteration. Species tree for all iterations will be saved in `converge_files` folder with the nomenclature `iteration_.nwk`. - -## Detailed Usage - -This section provides detailed instructions on how to configure the ROADIES pipeline further for various user requirements with your own genomic dataset. Once the required environment setup process is complete, follow the steps below. - -### Step 1: Specify input genomic dataset - -After installing the environment, you need to get input genomic sequences for creating the species tree. 
To run ROADIES with your own dataset, update the `config.yaml` file (found in the ROADIES directory - `config` folder) to include the path to your input datasets under the `GENOMES` parameter. - -!!! Note - All input genome assemblies in the path mentioned in `GENOMES` should be in `.fa` or `.fa.gz` format. The genome assembly files should be named according to the species' names (for example, Aardvark's genome assembly is to be named `Aardvark.fa`). Each file should contain the genome assembly of one unique species. If a file contains multiple species, split it into individual genome files (fasplit can be used for this: `faSplit byname `). Moreover, the file name should not have any special characters like `.` (apart from `_`) - for example, if the file name is `Aardvark.1.fa`, rename it to `Aardvark_1.fa`. - -### Step 2: Modify Other Configuration Paramters - -Adjust other parameters listed in `config.yaml` as per specific user requirements. Details of the parameters are mentioned below. - -!!! Note - ROADIES has default values for some of the parameters that give the best results and are recommended in general. However, users can optionally modify the values specific to their needs. - -| Parameters | Description | Default value | -| --- | --- | --- | -| **GENOMES** | Specify the path to your input files which includes raw genome assemblies of the species. | | -| **REFERENCE** (optional) | Specify the path for the reference tree (state-of-the-art) in Newick format to compare ROADIES' results with a state-of-the-art approach. If you don't want to specify any reference tree, set it to `NULL`. | `NULL` | -| **LENGTH** | Configure the lengths of each of the randomly sampled subsequences or genes. | 500 | -| **GENE_COUNT** | Configure the number of genes to be sampled across all input genome assemblies. In normal mode, this will be the count of the genes to be sampled. 
In `--converge` mode, this will be the initial count of the number of genes for the first iteration and this value will be doubled iteratively. | 250 | -| **UPPER_CASE** | Configure the lower limit threshold of upper cases for valid sampling. ROADIES samples the genes only if the percentage of upper cases in each gene is more than this value. | 0.9 (Recommended) | -| **OUT_DIR** | Specify the path for ROADIES output files (this saves the current iteration results in converge mode). | | -| **ALL_OUT_DIR** | Specify the path for ROADIES output files for all iterations in converge mode. | | -| **MIN_ALIGN** | Specify the minimum number of allowed species to exist in gene fasta files after LASTZ. This parameter is used for filtering gene fasta files which has very less species representation. It is recommended to set the value greater than or equal to 4 since ASTRAL-Pro follows a quartet-based topology for species tree inference. For larger evolutionary timescales, we recommended setting it to a much higher value. In such cases, 15 to 20 would be a good start. | 4 | -| **COVERAGE** | Set the percentage of input sequence included in the alignment for LASTZ. | 85 | -| **CONTINUITY** | Define the allowable percentage of non-gappy alignment columns for LASTZ. | 85 | -| **IDENTITY** | Set the percentage of the aligned base pairs (matches/mismatches) for LASTZ. For larger evolutionary timescales, consider lowering the identity values than default for more homologous hits to be encountered. | 65 | -| **MAX_DUP** | Specify maximum number of allowed gene copies from one input genome in an alignment. | 10| -| **STEPS** |Specify the number of steps in the LASTZ sampling (increasing number speeds up alignment but decreases LASTZ accuracy).|1 | -| **FILTERFRAGMENTS** | Specify the portion so that sites with less than the specified portion of non-gap characters in PASTA alignments will be masked out. 
If it is set to 0.5, then sites with less than 50% of non-gap characters will be masked out. | 0.5 | -| **MASKSITES** | Specify the portion so that sequences with less than the specified portion of non-gap sequences will be removed in PASTA alignment. If it is set to 0.05, then sequences having less than 5% of non-gap characters (i.e., more than 95% gaps) will be masked out.| 0.02 | -| **SUPPORT_THRESHOLD** | Specify the threshold so that support values with equal to or higher than this threshold is considered as highly supported node. Such highly supported nodes crossing this threshold will be counted at every iteration to check the confidence of the tree (works in `--converge` mode). | 0.95 | -| **NUM_INSTANCES** | Specify the number of instances for PASTA, LASTZ, MashTree and RAxML-NG to run in parallel. It is recommended to set the number of instances equal to (`--cores`/4) for optimal runtime. | 4 | - -### Step 3: Run the ROADIES pipeline - -Once the required installations are completed and the parameters are configured in `config.yaml` file, execute the following command (from ROADIES repo home directory): - -``` -python run_roadies.py --cores -``` - -This will let ROADIES run in accurate mode by default with specified number of cores. After the completion of the execution, the output species tree in Newick format will be saved as `roadies.nwk` in a separate `output_files` folder. - -### Command line arguments - -There are multiple command line arguments through which user can change the mode of operation, specify the custom config file path, etc. - -| Argument | Description | -| --- | --- | -| `--cores` | Specify the number of cores | -| `--mode` | Specify [modes of operation](index.md#modes-of-operation) (`accurate`, `balanced` or `fast`).`accurate` mode is the default mode. 
| -| `--converge` | Run ROADIES in [converge](index.md#convergence-mechanism) mode if you do not know the optimal gene count to start with | -| `--config` | Provide optional custom YAML files (in the same format as `config.yaml` provided with this repository). If not given, by default `config/config.yaml` file will be considered.| - -For example: - -``` -python run_roadies.py --cores 16 --mode balanced --converge --config config/config.yaml -``` - -Use `--help` to get the list of command line arguments. - -### Step 4: Analyze output files - -#### Without convergence - -After the pipeline finishes running, the final species tree estimated by ROADIES will be saved as `roadies.nwk` inside a separate folder mentioned in the `--OUT_DIR` parameter in the `config/config.yaml` file. - -ROADIES also provides a number of intermediate output files for extensive debugging by the user. These files will be saved in `--OUT_DIR`, containing the following subfolders: - -1. `alignments` - this folder contains the LASTZ alignment output of all individual input genomes aligned with randomly sampled gene sequences. -2. `benchmarks` - this folder contains the runtime value of each of the individual jobs for each of the stages in the pipeline. These files will only be used if you want to estimate and compare the stagewise runtime of various pipeline stages and will not be used in final tree estimation. -3. `genes` - this folder contains the output files of multiple sequence alignment and tree-building stages (run by PASTA, IQTREE/FastTree, MashTree) of the pipeline. -4. `genetrees` - this folder contains two files as follows: - - `gene_tree_merged.nwk` - this file lists all gene trees together generated by IQTREE/FastTree/MashTree. It is used by ASTRAL-Pro to estimate the final species tree from this list of gene trees. - - `original_list.txt` - this file lists all gene trees together corresponding to their gene IDs. Some lines will have only gene IDs but no associated gene trees. 
This is because some genes will be filtered out from tree building and MSA step if it has less than four species. Hence this file also lists those gene IDs with missing gene trees for further debugging. -5. `plots` - this folder contains four following plots: - - `gene_dup.png` - this histogram plot represents the count of the number of gene duplicates on the Y-axis vs. the number of genes having duplication on the X-axis. - - `homologues.png` - this histogram plot represents the count of the number of genes on the Y-axis vs. the number of homologous species on the X-axis. - - `num_genes.png` - this plot represents how many genes out of `--GENE_COUNT` parameter have been aligned to each of the input genomes after the LASTZ step. The X-axis represents different genomes, and the Y-axis represents the number of genes. - - `sampling.png` - the plot shows how many genes have been sampled from each of the input genomes after the random sampling step. The X-axis represents different genomes, and the Y-axis represents the number of genes. -6. `samples` - this folder contains the list of randomly sampled genes from individual input genomes. - - `_temp.fa` - these files contain genes sampled from the particular input genome. - - `out.fa` - this file contains all sampled subsequences (genes) from individual genomes combined, which is given to the the LASTZ step. -7. `statistics` - this folder contains CSV data for the plots shown in the `plots` directory mentioned above. - - `gene_to_species.csv` - this is an additional CSV file (corresponding plots to be added in future) which provides the information about which genes are aligned to what species after LASTZ step (`num_genes.csv` only gives the total count of the genes per species, `gene_to_species.csv` also gives the ID number of those aligned genes). Along with each gene ID number, it also provides the [score, line number in .maf file, position] of all the homologs of that particular gene. 
Score, position and line number information is collected from the corresponding species' .maf file (generated by LASTZ), saved in `results/alignments` folder. -8. `roadies_stats.nwk`- this is the final estimated species tree (same as `roadies.nwk`), along with the support branch values in the Newick tree. -9. `roadies.nwk`- this is the final estimated species tree in Newick format. -10. `roadies_rerooted.nwk` (optional) - this is the final estimated species tree, re-rooted corresponding to the outgroup node from the given reference tree (provided as `REFERENCE` in `config.yaml`). -11. `time_stamps.csv` - this file contains the start time, number of gene trees required for estimating species tree, end time, and total runtime (in seconds), respectively. -12. `ref_dist.csv` - this file provides the number of gene trees and the Normalized Robinson-Foulds distance between the final estimated species tree (i.e., `roadies.nwk`) and the reference tree (i.e., REFERENCE parameter in `config.yaml`). - -#### With convergence - -If converge option is enabled, the results of all iterations (along with the corresponding species tree in the name `iteration_.nwk`) will be saved in a separate folder mentioned in the `--ALL_OUT_DIR` parameter in the `config/config.yaml` file. - -!!! Note - With `--converge` option, `--OUT_DIR` saves the results of the current ongoing iteration (if pipeline execution is finished, then the last iteration), whereas `--ALL_OUT_DIR` saves the results of all iterations executed. - -For extensive debugging, other intermediate output files for each stage of the pipeline for each iterations are saved in `--ALL_OUT_DIR` as follows: - -1. Folder with `iteration_` - this folder contains results from the specific iteration corresponding to the iteration number in the folder name. - - Folder with name in `--OUT_DIR` - this contains the results of all stages of the pipeline (as described above in non convergence section). 
- - `gene_tree_merged.nwk` - this file lists all gene trees together generated by IQTREE/FastTree/MashTree in that particular iteration. It is concatenated with master list of gene trees from all past iterations before providing to ASTRAL-Pro to estimate the final converged species tree. - - `iteration_.log` - this file contains the log information of the corresponding iteration execution. - - `mapping.txt` - This file maps all gene names in the gene trees with the corresponding species name from where it originates. It is required by ASTRAL-Pro, along with the master list of gene trees from all iterations, to infer species tree. -2. `iteration__stats.nwk` - this is the final estimated species tree for the corresponding iteration (same as `iteration_.nwk`), along with the support branch values in the Newick tree. -3. `iteration_.nwk` - this is the final estimated species tree for the corresponding iteration -4. `iteration_.rerooted.nwk` - (optional) - this is the final estimated species tree for the corresponding iteration, re-rooted to the outgroup node from the given reference tree (provided as `REFERENCE` in `config.yaml`). -5. `master_gt.nwk` - this is the concatenated list of all gene trees from all iterations together. -6. `master_map.txt` - this is the concatenated list of all mapping files from all iterations together. This `master_gt.nwk` and `master_map.txt` is provided to ASTRAL-Pro after every iteration to get the converged species tree. -7. `ref_dist.csv` - this file provides the iteration number, number of gene trees and the Normalized Robinson-Foulds distance between the final estimated species tree (i.e., `roadies.nwk`) and the reference tree (i.e., REFERENCE parameter in `config.yaml`), for all iterations. -8. `time_stamps.csv`- this file contains the start time in first line, iteration number, number of gene trees required for estimating species tree, end time, and total runtime (in seconds), respectively, for all iterations in subsequent lines. 
- -## Contributions - -We welcome contributions from the community. If you encounter any issues or have suggestions for improvement, please open an issue on GitHub. For general inquiries and support, reach out to our team. - -Anshu Gupta - ang037 [at] ucsd [dot] edu - -Yatish Turakhia - yturakhia [at] ucsd [dot] edu - -## Citing ROADIES - -If you use ROADIES in your research or publications, please cite the following paper: - -Gupta A, Mirarab S, Turakhia Y, (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. _bioRxiv_. [https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1](https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1). - -### Accessing ROADIES output files - -The output files with the gene trees and species trees generated by ROADIES in the manuscript are deposited to [Dryad](https://datadryad.org/stash). To access it, please refer to the following: - -Gupta, Anshu; Mirarab, Siavash; Turakhia, Yatish (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES [Dataset]. Dryad. [https://doi.org/10.5061/dryad.tht76hf73](https://doi.org/10.5061/dryad.tht76hf73). diff --git a/docs/install.md b/docs/install.md new file mode 100644 index 00000000..07f9f23a --- /dev/null +++ b/docs/install.md @@ -0,0 +1,112 @@ +# Installation Methods + +## Using ROADIES Bioconda package + +To run ROADIES using Bioconda package, follow these steps: + +**Note:** You need to have conda installed in your system. Also make sure you have updated version of glibc in your system (`GLIBC >= 2.29`). 
+ +To install and use conda on an Ubuntu machine, execute the set of commands below: + +```bash +wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh +chmod +x Miniconda3-latest-Linux-x86_64.sh +./Miniconda3-latest-Linux-x86_64.sh + +export PATH="$HOME/miniconda3/bin:$PATH" +source ~/.bashrc + +conda config --add channels defaults +conda config --add channels bioconda +conda config --add channels conda-forge +``` + +After this, try running `conda` in your terminal to check if conda is properly installed. Once it is installed, follow the steps below: + +1. Create and activate a custom conda environment with Python version 3.9 + +```bash +conda create -n myenv python=3.9 +conda activate myenv +``` + +2. Install ROADIES bioconda package + +``` +conda install roadies +``` + +All files of ROADIES along with dependencies will be found in `/miniconda3/envs/myenv/ROADIES`. + +## Using DockerHub + +To run ROADIES using DockerHub, follow these steps: + +1. Pull the ROADIES Docker image from DockerHub: + +```bash +docker pull ang037/roadies:latest +``` +2. Run the Docker container: + +```bash +docker run -it ang037/roadies:latest +``` + +## Using Docker locally + +First, clone the repository (requires `git` to be installed in the system): + +```bash +git clone https://github.com/TurakhiaLab/ROADIES.git +cd ROADIES +``` + +Then build and run the Docker container: + +```bash +docker build -t roadies_image . +docker run -it roadies_image +``` + +## Using installation script (requires sudo access) + +First clone the repository: + +```bash +git clone https://github.com/TurakhiaLab/ROADIES.git +cd ROADIES +``` + +Then, execute the installation script: + +```bash +chmod +x roadies_env.sh +source roadies_env.sh +``` + +This will install and build all tools and dependencies. Once the setup is complete, it will print `Setup complete` in the terminal and activate the `roadies_env` environment with all Conda packages installed. + +!!! 
Note + ROADIES is built on [Snakemake (workflow parallelization tool)](https://snakemake.readthedocs.io/en/stable/). It also requires various tools (PASTA, LASTZ, RAxML-NG, MashTree, FastTree, ASTRAL-Pro2) to be installed before performing the analysis. To ease the process, instead of individually installing the tools, we provide the `roadies_env.sh` script to automatically download all dependencies onto the user's system. + +### Required dependencies + +To run this script, ensure the following dependencies are installed: +- Java Runtime Environment (version 1.7 or higher) +- Python (version 3.9 or higher) +- `wget` and `unzip` commands +- GCC (version 11.4 or higher) +- cmake (Download here: https://cmake.org/download/) +- Boost library (Download here: https://boostorg.jfrog.io/artifactory/main/release/1.82.0/source/) +- zlib (Download here: http://www.zlib.net/) +- GLIBC (version 2.29 or higher) + +For Ubuntu, you can install these dependencies with: + +```bash +sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptools git default-jre libgomp1 libboost-all-dev cmake +``` + +!!! Warning + If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`. diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 00000000..190f78d0 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,61 @@ +# Quick start (with provided test dataset) + +Once setup is done, you can run the ROADIES pipeline using the provided test dataset. 
Follow these steps for a 16-core machine:
+
+**Step 1:** Go to the ROADIES repository directory if you are not already there:
+
+```bash
+cd ROADIES
+```
+
+**Step 2:** Create a directory for the test data and download the test datasets (using the following one-line command):
+
+```bash
+mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
+```
+**Step 3:** Run the pipeline with the following command (from the ROADIES directory):
+
+```bash
+python run_roadies.py --cores 16
+```
+
+The second command will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. The third command will run ROADIES for those 11 Drosophila genomes and save the final Newick tree as `roadies.nwk` in a separate `output_files` folder upon completion.
+
+## Running ROADIES with different modes of operation
+
+To run ROADIES in the other modes of operation (fast, balanced, accurate), which are described in the [Modes of operation](index.md#modes-of-operation) section, try the following commands:
+
+```bash
+python run_roadies.py --cores 16 --mode accurate
+```
+
+```bash
+python run_roadies.py --cores 16 --mode balanced
+```
+
+```bash
+python run_roadies.py --cores 16 --mode fast
+```
+!!! Note
+    Accurate mode is the default mode of operation. If you don't specify a mode using the `--mode` argument, the default mode will run.
+
+For each mode, the output species tree will be saved as `roadies.nwk` in a separate `output_files` folder.
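As a quick sanity check after a run, the leaves of the output tree can be counted and compared with the number of input genomes (11 for the test dataset). The following is only a minimal sketch, not part of ROADIES: the helper names `leaf_names` and `check_tree` are hypothetical, and it assumes the Newick labels are unquoted and contain no whitespace or Newick punctuation.

```python
import re
from pathlib import Path

# In Newick, leaf labels directly follow "(" or ","; internal labels and
# support values follow ")". Assumption: labels are unquoted.
_LEAF = re.compile(r"[(,]\s*([^():,;\s]+)")


def leaf_names(newick: str) -> list[str]:
    """Return the leaf labels found in an unquoted Newick string."""
    return _LEAF.findall(newick)


def check_tree(path: str, expected_leaves: int) -> bool:
    """Return True if the tree stored at `path` has `expected_leaves` leaves."""
    return len(leaf_names(Path(path).read_text())) == expected_leaves


# Toy tree with made-up labels (not real ROADIES output).
print(leaf_names("((A:0.1,B:0.2)0.98:0.3,C:0.4);"))  # → ['A', 'B', 'C']
```

For the quick start run, a call such as `check_tree("output_files/roadies.nwk", 11)` would be the corresponding check, assuming the default output folder described above.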
+
+## Running ROADIES in converge mode
+
+To run ROADIES in converge mode (details in the [convergence mechanism](index.md#convergence-mechanism) section), run the following command (notice the addition of the `--converge` argument):
+
+```bash
+python run_roadies.py --cores 16 --converge
+```
+
+Try the following commands for other modes:
+
+```bash
+python run_roadies.py --cores 16 --mode balanced --converge
+```
+```bash
+python run_roadies.py --cores 16 --mode fast --converge
+```
+
+The output files for all iterations will be saved in a separate `converge_files` folder. `output_files` will save the results of the last iteration. The species trees for all iterations will be saved in the `converge_files` folder with the nomenclature `iteration_.nwk`.
\ No newline at end of file
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
new file mode 100644
index 00000000..6d51cc48
--- /dev/null
+++ b/docs/troubleshooting.md
@@ -0,0 +1,111 @@
+# Troubleshooting Steps
+
+## 1. Mamba not found in the shell
+
+When running the following command:
+```bash
+$ python ROADIES-main/run_roadies.py --cores 1
+```
+You may encounter this error:
+
+```bash
+rm: cannot remove 'output_files': No such file or directory
+Unlocking working directory.
+snakemake --cores 1 --config mode=accurate config_path=config/config.yaml num_threads=0 --use-conda --rerun-incomplete
+Config file config/config.yaml is extended by additional config specified via the command line.
+Building DAG of jobs...
+CreateCondaEnvironmentException:
+The 'mamba' command is not available in the shell /usr/bin/bash that will be used by Snakemake. You have to ensure that it is in your PATH, e.g., first activating the conda base environment with `conda activate base`.The mamba package manager (https://github.com/mamba-org/mamba) is a fast and robust conda replacement. It is the recommended way of using Snakemake's conda integration. It can be installed with `conda install -n base -c conda-forge mamba`. 
If you still prefer to use conda, you can enforce that by setting `--conda-frontend conda`.
+```
+
+### Cause
+
+The `mamba` package manager is missing or not available in the environment.
+
+### Solution
+
+Install mamba:
+
+```
+conda install -n base -c conda-forge mamba
+```
+
+If you prefer using `conda`, you can enforce it by adding the `--conda-frontend conda` argument.
+
+**Step 1:** In the downloaded ROADIES repository, open the file `noconverge.py` inside the `workflow` folder (`ROADIES/workflow/noconverge.py`).
+
+**Step 2:** At line 31, add the argument `--conda-frontend conda` to the `cmd` command, as shown below:
+
+```python
+cmd = [
+    "snakemake",
+    "--cores",
+    str(cores),
+    "--config",
+    "mode=" + str(mode),
+    "config_path=" + str(config_path),
+    "num_threads=" + str(num_threads),
+    "--use-conda",
+    "--rerun-incomplete",
+    "--conda-frontend", "conda"  # added: use conda instead of mamba
+]
+```
+**Step 3:** Rerun the pipeline as follows:
+
+```
+python run_roadies.py --cores 16
+```
+
+## 2. Conda not recognized
+
+### Cause
+
+Conda is not added to your system's PATH.
+
+### Solution
+
+Ensure conda is added to the PATH by running the following commands:
+
+```bash
+export PATH="$HOME/miniconda3/bin:$PATH"
+source ~/.bashrc
+```
+
+## 3. Singularity issues
+
+### Cause
+
+Problems arise when trying to run the pipeline with Singularity.
+
+### Solution
+
+We recommend using Docker instead of Singularity. Ensure Docker is installed and running on your system. We have also provided Bioconda support for users who face issues with Singularity.
+
+## 4. Handling dependencies (glibc)
+
+Ensure that the glibc version on your system is updated to 2.29 or higher. Update your system libraries if necessary. Otherwise you may encounter this error:
+
+```bash
+workflow/scripts/lastz_32: /lib64/libm.so.6: version 'GLIBC_2.29' not found
+```
+
+## 5. PASTA fails with insufficient core count
+
+### Cause
+
+PASTA fails when the number of cores is insufficient for the number of instances.
+
+The pipeline provides `NUM_INSTANCES` as a configuration parameter in `config.yaml` to run multiple instances in parallel. Each instance can also be parallelized using threads. The number of threads per instance is calculated as:
+
+```makefile
+num_threads = number_of_cores / num_instances
+```
+This is an integer division, so if `num_instances > number_of_cores`, then `num_threads` will be 0 and the process (e.g., `pasta`) will fail.
+
+### Solution
+
+Ensure that the number of cores is greater than or equal to the number of instances. By default, `NUM_INSTANCES` is set to 4, so the number of cores (the `--cores` command line argument) must be at least 4. To run the pipeline with fewer cores, modify the `NUM_INSTANCES` parameter in the config file:
+
+```bash
+python run_roadies.py --cores --config_path config/config.yaml
+```
\ No newline at end of file
diff --git a/docs/usage.md b/docs/usage.md
new file mode 100644
index 00000000..7a32ec17
--- /dev/null
+++ b/docs/usage.md
@@ -0,0 +1,118 @@
+# Detailed Usage
+
+This section provides detailed instructions on how to configure the ROADIES pipeline further for various user requirements with your own genomic dataset. Once the required environment setup process is complete, follow the steps below.
+
+## Step 1: Specify input genomic dataset
+
+After installing the environment, you need to get input genomic sequences for creating the species tree. To run ROADIES with your own dataset, update the `config.yaml` file (found in the ROADIES directory - `config` folder) to include the path to your input datasets under the `GENOMES` parameter.
+
+!!! Note
+    All input genome assemblies in the path mentioned in `GENOMES` should be in `.fa` or `.fa.gz` format. The genome assembly files should be named according to the species' names (for example, Aardvark's genome assembly is to be named `Aardvark.fa`). Each file should contain the genome assembly of one unique species. 
If a file contains multiple species, split it into individual genome files (`faSplit` can be used for this: `faSplit byname `). Moreover, the file name should not contain any special characters such as `.` (only `_` is allowed) - for example, if the file name is `Aardvark.1.fa`, rename it to `Aardvark_1.fa`.
+
+## Step 2: Modify other configuration parameters
+
+Adjust other parameters listed in `config.yaml` as per specific user requirements. Details of the parameters are mentioned below.
+
+!!! Note
+    ROADIES has default values for some of the parameters that give the best results and are recommended in general. However, users can optionally modify the values specific to their needs.
+
+| Parameters | Description | Default value |
+| --- | --- | --- |
+| **GENOMES** | Specify the path to your input files which includes raw genome assemblies of the species. | |
+| **REFERENCE** (optional) | Specify the path for the reference tree (state-of-the-art) in Newick format to compare ROADIES' results with a state-of-the-art approach. If you don't want to specify any reference tree, set it to `NULL`. | `NULL` |
+| **LENGTH** | Configure the lengths of each of the randomly sampled subsequences or genes. | 500 |
+| **GENE_COUNT** | Configure the number of genes to be sampled across all input genome assemblies. In normal mode, this will be the count of the genes to be sampled. In `--converge` mode, this will be the initial count of the number of genes for the first iteration and this value will be doubled iteratively. | 250 |
+| **UPPER_CASE** | Configure the lower limit threshold of upper cases for valid sampling. ROADIES samples the genes only if the percentage of upper cases in each gene is more than this value. | 0.9 (Recommended) |
+| **OUT_DIR** | Specify the path for ROADIES output files (this saves the current iteration results in converge mode). | |
+| **ALL_OUT_DIR** | Specify the path for ROADIES output files for all iterations in converge mode. 
| |
+| **MIN_ALIGN** | Specify the minimum number of allowed species to exist in gene fasta files after LASTZ. This parameter is used for filtering out gene fasta files that have very low species representation. It is recommended to set the value greater than or equal to 4 since ASTRAL-Pro follows a quartet-based topology for species tree inference. For larger evolutionary timescales, we recommend setting it to a much higher value. In such cases, 15 to 20 would be a good start. | 4 |
+| **COVERAGE** | Set the percentage of input sequence included in the alignment for LASTZ. | 85 |
+| **CONTINUITY** | Define the allowable percentage of non-gappy alignment columns for LASTZ. | 85 |
+| **IDENTITY** | Set the percentage of the aligned base pairs (matches/mismatches) for LASTZ. For larger evolutionary timescales, consider lowering the identity value below the default so that more homologous hits are found. | 65 |
+| **MAX_DUP** | Specify maximum number of allowed gene copies from one input genome in an alignment. | 10 |
+| **STEPS** | Specify the number of steps in the LASTZ sampling (increasing number speeds up alignment but decreases LASTZ accuracy). | 1 |
+| **FILTERFRAGMENTS** | Specify the portion so that sites with less than the specified portion of non-gap characters in PASTA alignments will be masked out. If it is set to 0.5, then sites with less than 50% of non-gap characters will be masked out. | 0.5 |
+| **MASKSITES** | Specify the portion so that sequences with less than the specified portion of non-gap sequences will be removed in PASTA alignment. If it is set to 0.05, then sequences having less than 5% of non-gap characters (i.e., more than 95% gaps) will be masked out. | 0.02 |
+| **SUPPORT_THRESHOLD** | Specify the threshold so that nodes with support values equal to or higher than this threshold are considered highly supported. 
Such highly supported nodes crossing this threshold will be counted at every iteration to check the confidence of the tree (works in `--converge` mode). | 0.95 |
+| **NUM_INSTANCES** | Specify the number of instances for PASTA, LASTZ, MashTree and RAxML-NG to run in parallel. It is recommended to set the number of instances equal to (`--cores`/4) for optimal runtime. | 4 |
+
+## Step 3: Run the ROADIES pipeline
+
+Once the required installations are completed and the parameters are configured in the `config.yaml` file, execute the following command (from the ROADIES repo home directory):
+
+```bash
+python run_roadies.py --cores
+```
+
+This will let ROADIES run in accurate mode by default with the specified number of cores. After the execution completes, the output species tree in Newick format will be saved as `roadies.nwk` in a separate `output_files` folder.
+
+## Command line arguments
+
+There are multiple command line arguments through which the user can change the mode of operation, specify a custom config file path, etc.
+
+| Argument | Description |
+| --- | --- |
+| `--cores` | Specify the number of cores |
+| `--mode` | Specify [modes of operation](index.md#modes-of-operation) (`accurate`, `balanced` or `fast`). `accurate` mode is the default mode. |
+| `--converge` | Run ROADIES in [converge](index.md#convergence-mechanism) mode if you do not know the optimal gene count to start with |
+| `--config` | Provide optional custom YAML files (in the same format as `config.yaml` provided with this repository). If not given, by default `config/config.yaml` file will be considered. |
+
+For example:
+
+```
+python run_roadies.py --cores 16 --mode balanced --converge --config config/config.yaml
+```
+
+Use `--help` to get the list of command line arguments.
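The relationship between `--cores` and `NUM_INSTANCES` can be checked before launching a run. The sketch below is illustrative only and is not ROADIES code: the function names are hypothetical, and it assumes the integer division described in the troubleshooting notes (`num_threads = number_of_cores / num_instances`) together with the `--cores`/4 recommendation from the parameter table.

```python
def threads_per_instance(cores: int, num_instances: int) -> int:
    """Threads given to each instance, using integer division."""
    if cores < 1 or num_instances < 1:
        raise ValueError("cores and num_instances must be positive integers")
    return cores // num_instances


def recommended_instances(cores: int) -> int:
    """NUM_INSTANCES suggested as cores / 4, but never fewer than one."""
    return max(1, cores // 4)


# 2 cores with the default of 4 instances leaves 0 threads per instance,
# which is the failing case described for PASTA.
print(threads_per_instance(2, 4))

for cores in (2, 4, 16):
    n = recommended_instances(cores)
    print(f"cores={cores} NUM_INSTANCES={n} threads={threads_per_instance(cores, n)}")
```

A result of 0 from `threads_per_instance` signals that either `--cores` should be raised or `NUM_INSTANCES` lowered in `config.yaml`.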
+
+## Step 4: Analyze output files
+
+### Without convergence
+
+After the pipeline finishes running, the final species tree estimated by ROADIES will be saved as `roadies.nwk` inside the folder specified by the `OUT_DIR` parameter in the `config/config.yaml` file.
+
+ROADIES also provides a number of intermediate output files for extensive debugging by the user. These files will be saved in `OUT_DIR`, containing the following subfolders:
+
+1. `alignments` - this folder contains the LASTZ alignment output of all individual input genomes aligned with randomly sampled gene sequences.
+2. `benchmarks` - this folder contains the runtime value of each of the individual jobs for each of the stages in the pipeline. These files will only be used if you want to estimate and compare the stagewise runtime of various pipeline stages and will not be used in final tree estimation.
+3. `genes` - this folder contains the output files of multiple sequence alignment and tree-building stages (run by PASTA, IQTREE/FastTree, MashTree) of the pipeline.
+4. `genetrees` - this folder contains two files as follows:
+    - `gene_tree_merged.nwk` - this file lists all gene trees together generated by IQTREE/FastTree/MashTree. It is used by ASTRAL-Pro to estimate the final species tree from this list of gene trees.
+    - `original_list.txt` - this file lists all gene trees together corresponding to their gene IDs. Some lines will have only gene IDs but no associated gene trees. This is because genes with fewer than four species are filtered out of the MSA and tree-building steps. Hence this file also lists those gene IDs with missing gene trees for further debugging.
+5. `plots` - this folder contains the following four plots:
+    - `gene_dup.png` - this histogram plot represents the count of the number of gene duplicates on the Y-axis vs. the number of genes having duplication on the X-axis.
+    - `homologues.png` - this histogram plot represents the count of the number of genes on the Y-axis vs. the number of homologous species on the X-axis.
+    - `num_genes.png` - this plot represents how many genes out of the `GENE_COUNT` parameter have been aligned to each of the input genomes after the LASTZ step. The X-axis represents different genomes, and the Y-axis represents the number of genes.
+    - `sampling.png` - the plot shows how many genes have been sampled from each of the input genomes after the random sampling step. The X-axis represents different genomes, and the Y-axis represents the number of genes.
+6. `samples` - this folder contains the list of randomly sampled genes from individual input genomes.
+    - `_temp.fa` - these files contain genes sampled from the particular input genome.
+    - `out.fa` - this file contains all sampled subsequences (genes) from individual genomes combined, which is given to the LASTZ step.
+7. `statistics` - this folder contains CSV data for the plots shown in the `plots` directory mentioned above.
+    - `gene_to_species.csv` - this is an additional CSV file (corresponding plots to be added in future) which provides the information about which genes are aligned to what species after the LASTZ step (`num_genes.csv` only gives the total count of the genes per species, `gene_to_species.csv` also gives the ID number of those aligned genes). Along with each gene ID number, it also provides the [score, line number in .maf file, position] of all the homologs of that particular gene. Score, position and line number information is collected from the corresponding species' .maf file (generated by LASTZ), saved in the `results/alignments` folder.
+8. `roadies_stats.nwk` - this is the final estimated species tree (same as `roadies.nwk`), along with the support branch values in the Newick tree.
+9. `roadies.nwk` - this is the final estimated species tree in Newick format.
+10. 
`roadies_rerooted.nwk` (optional) - this is the final estimated species tree, re-rooted corresponding to the outgroup node from the given reference tree (provided as `REFERENCE` in `config.yaml`).
+11. `time_stamps.csv` - this file contains the start time, number of gene trees required for estimating species tree, end time, and total runtime (in seconds), respectively.
+12. `ref_dist.csv` - this file provides the number of gene trees and the Normalized Robinson-Foulds distance between the final estimated species tree (i.e., `roadies.nwk`) and the reference tree (i.e., REFERENCE parameter in `config.yaml`).
+
+### With convergence
+
+If the converge option is enabled, the results of all iterations (along with the corresponding species tree in the name `iteration_.nwk`) will be saved in the folder specified by the `ALL_OUT_DIR` parameter in the `config/config.yaml` file.
+
+!!! Note
+    With the `--converge` option, `OUT_DIR` saves the results of the current ongoing iteration (if pipeline execution is finished, then the last iteration), whereas `ALL_OUT_DIR` saves the results of all iterations executed.
+
+For extensive debugging, other intermediate output files for each stage of the pipeline for each iteration are saved in `ALL_OUT_DIR` as follows:
+
+1. Folder with `iteration_` - this folder contains results from the specific iteration corresponding to the iteration number in the folder name.
+    - Folder with the name given in `OUT_DIR` - this contains the results of all stages of the pipeline (as described above in the Without convergence section).
+    - `gene_tree_merged.nwk` - this file lists all gene trees together generated by IQTREE/FastTree/MashTree in that particular iteration. It is concatenated with the master list of gene trees from all past iterations before being provided to ASTRAL-Pro to estimate the final converged species tree.
+    - `iteration_.log` - this file contains the log information of the corresponding iteration execution.
+    - `mapping.txt` - this file maps all gene names in the gene trees to the species names they originate from. It is required by ASTRAL-Pro, along with the master list of gene trees from all iterations, to infer the species tree.
+2. `iteration__stats.nwk` - this is the final estimated species tree for the corresponding iteration (same as `iteration_.nwk`), along with the support branch values in the Newick tree.
+3. `iteration_.nwk` - this is the final estimated species tree for the corresponding iteration.
+4. `iteration_.rerooted.nwk` (optional) - this is the final estimated species tree for the corresponding iteration, re-rooted to the outgroup node from the given reference tree (provided as `REFERENCE` in `config.yaml`).
+5. `master_gt.nwk` - this is the concatenated list of all gene trees from all iterations together.
+6. `master_map.txt` - this is the concatenated list of all mapping files from all iterations together. Both `master_gt.nwk` and `master_map.txt` are provided to ASTRAL-Pro after every iteration to get the converged species tree.
+7. `ref_dist.csv` - this file provides the iteration number, number of gene trees and the Normalized Robinson-Foulds distance between the final estimated species tree (i.e., `roadies.nwk`) and the reference tree (i.e., REFERENCE parameter in `config.yaml`), for all iterations.
+8. `time_stamps.csv` - this file contains the start time in the first line, followed by the iteration number, number of gene trees required for estimating the species tree, end time, and total runtime (in seconds), respectively, for all iterations in subsequent lines.
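When debugging a converge run, it can help to cross-check the number of gene trees reported in `ref_dist.csv` or `time_stamps.csv` against the merged gene tree files. The following is a minimal sketch under a stated assumption: each gene tree occupies one line and ends with `;`, which is the usual layout of ASTRAL input files but is not confirmed here for ROADIES output. The helper names are hypothetical.

```python
from pathlib import Path


def count_trees(text: str) -> int:
    """Count lines that look like a complete Newick tree (ending in ';')."""
    return sum(1 for line in text.splitlines() if line.strip().endswith(";"))


def count_trees_in_file(path: str) -> int:
    """Count gene trees in a file such as master_gt.nwk (one tree per line assumed)."""
    return count_trees(Path(path).read_text())


# Toy input with made-up trees (not real ROADIES output); blank lines are ignored.
print(count_trees("(A,B,C);\n\n((A,B),C);\n"))  # → 2
```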
\ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index a98463a1..2690173e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -47,6 +47,12 @@ extra: nav: - Home: index.md + - Install: install.md + - Quick Start: quickstart.md + - User Guide: usage.md + - Contribution: contribution.md + - Troubleshooting: troubleshooting.md + - Cite ROADIES: cite.md markdown_extensions: - pymdownx.highlight: @@ -61,6 +67,9 @@ markdown_extensions: - pymdownx.superfences - pymdownx.mark - attr_list + - def_list + - pymdownx.tasklist: + custom_checkbox: true - pymdownx.emoji: emoji_index: !!python/name:material.extensions.emoji.twemoji emoji_generator: !!python/name:materialx.emoji.to_svg @@ -68,3 +77,6 @@ markdown_extensions: repo_url: https://github.com/TurakhiaLab/ROADIES repo_name: TurakhiaLab/ROADIES + +copyright: | + © 2024 Turakhia Lab