Commit

updated dryad links and added extra badges
ang037 committed Jul 23, 2024
1 parent 77239bb commit 723f604
Showing 2 changed files with 57 additions and 22 deletions.
37 changes: 25 additions & 12 deletions README.md
@@ -10,6 +10,8 @@
[<img src="https://img.shields.io/badge/Made with-Snakemake-brightgreen.svg?logo=snakemake">](https://snakemake.readthedocs.io/en/v7.19.1/index.html)
[<img src="https://img.shields.io/badge/Install with-DockerHub-informational.svg?logo=Docker">](https://hub.docker.com/r/ang037/roadies)
[<img src="https://img.shields.io/badge/Submitted to-bioRxiv-critical.svg?logo=LOGO">](https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1)
[<img src="https://img.shields.io/badge/DOI-10.5061/dryad.tht76hf73-brightgreen.svg?logo=LOGO">](https://doi.org/10.5061/dryad.tht76hf73)
[<img src="https://img.shields.io/badge/Watch it on-Youtube-FF0000.svg?logo=YouTube">](https://youtu.be/1sR741TvZnM?si=vVNAnonvzNEzrLKq)

<div align="center">

@@ -27,9 +29,10 @@
- [Using Installation Script](#script)
- [Quick Start](#start)
- [Run ROADIES with your own datasets](#runpipeline)
- [Contributions and Support](#support)
- [Citing ROADIES](#citation)

<br>

## <a name="overview"></a> Introduction

Welcome to the official repository of ROADIES, a novel pipeline designed for phylogenetic tree inference of the species directly from their raw genomic assemblies. ROADIES offers a fully automated, easy-to-use, scalable solution, eliminating any error-prone manual steps and providing unique flexibility in adjusting the tradeoff between accuracy and runtime.
@@ -47,7 +50,7 @@ Welcome to the official repository of ROADIES, a novel pipeline designed for phy

</div>


<br>

## <a name="usage"></a> Quick Install

@@ -82,7 +85,7 @@ docker build -t roadies_image .
docker run -it roadies_image
```

### <a name="script"></a> Using installation script
### <a name="script"></a> Using installation script (requires sudo access)

First clone the repository:

@@ -119,28 +122,38 @@ sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptoo

**Note:** If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`.

<br>

## <a name="start"></a> Quick Start

Once setup is done, you can run the ROADIES pipeline using the provided test dataset. Follow these steps for a 16-core machine:

1. Create a directory for the test data and download the test datasets:
1. Go to the ROADIES repository directory, if you are not already there:

```
cd ROADIES
```

2. Create a directory for the test data and download the test datasets with the following one-line command:

```
mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
```
2. Run the ROADIES pipeline:
3. Run the pipeline with the following command (from the ROADIES directory):

```
python run_roadies.py --cores 16
```

The first command will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. The second command will run ROADIES for those 11 Drosophila genomes and save the final newick tree as `roadies.nwk` in a separate `ROADIES/output_files` folder upon completion.
The second command will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. The third command will run the ROADIES pipeline on those 11 Drosophila genomes and, upon completion, save the final Newick tree as `roadies.nwk` in a separate `output_files` folder.
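
For readers who prefer not to rely on `xargs` and `wget`, the download step can be sketched in Python. This is a minimal sketch, not part of ROADIES: the function names `destination_for` and `download_all` are hypothetical, and it assumes `test/input_genome_links.txt` holds one URL per line. Unlike `basename`, it ignores any query string in the URL.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def destination_for(url: str, out_dir: Path) -> Path:
    """Return out_dir/<last path component of the URL> (hypothetical helper)."""
    return out_dir / Path(urlparse(url).path).name


def download_all(links_file: Path, out_dir: Path) -> list:
    """Download every non-empty line of links_file into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)  # same role as `mkdir -p`
    saved = []
    for line in links_file.read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        target = destination_for(url, out_dir)
        urllib.request.urlretrieve(url, target)  # same role as `wget -O`
        saved.append(target)
    return saved


# Example call, run from the ROADIES directory:
# download_all(Path("test/input_genome_links.txt"), Path("test/test_data"))
```

The example call is left commented out so that importing the snippet does not trigger any network traffic.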

<br>

## <a name="runpipeline"></a> Run ROADIES with your own datasets

To run ROADIES with your own datasets, follow these steps:

1. **Specify Input Genomic Dataset**: Update the `config.yaml` file to include the path to your input datasets under the `GENOMES` parameter. Ensure all input genomic assemblies are in `.fa` or `.fa.gz` format and named according to the species' name (e.g., `Aardvark.fa`).
1. **Specify Input Genomic Dataset**: Update the `config.yaml` file (found in the `config` folder of the ROADIES directory) to include the path to your input datasets under the `GENOMES` parameter. Ensure all input genomic assemblies are in `.fa` or `.fa.gz` format and named according to the species' name (e.g., `Aardvark.fa`).

**Note**: Each file should contain the genome assembly of one unique species. If a file contains multiple species, split it into individual genome files (`faSplit` can be used: `faSplit byname <input_dir> <output_dir>`).

@@ -165,18 +178,18 @@ python run_roadies.py --cores 16 --mode balanced
python run_roadies.py --cores 16 --mode fast
```

## <a name="support"></a> Contributions and Support

We welcome contributions from the community. If you encounter any issues or have suggestions for improvement, please open an issue on GitHub. For general inquiries and support, reach out to our team.
<br>

## <a name="citation"></a> Citing ROADIES

If you use ROADIES in your research or publications, please cite the following paper:

Gupta A, Mirarab S, Turakhia Y, (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. _bioRxiv_. https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1
Gupta A, Mirarab S, Turakhia Y (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. _bioRxiv_. [https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1](https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1).

### Accessing ROADIES output files

The output files with the gene trees and species trees generated by ROADIES are deposited to [Dryad](https://datadryad.org/stash). To access it, please refer to [this](https://datadryad.org/stash/share/Pbbmp5I6AEmJmOHRvNld7FBT2ext-DEemyajkqUQfX0) link (Note: the dataset submission is undergoing review and a permanent link will be posted once available).
The output files containing the gene trees and species trees generated by ROADIES for the manuscript are deposited in [Dryad](https://datadryad.org/stash). To access them, refer to the following:

Gupta, Anshu; Mirarab, Siavash; Turakhia, Yatish (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES [Dataset]. Dryad. [https://doi.org/10.5061/dryad.tht76hf73](https://doi.org/10.5061/dryad.tht76hf73).


42 changes: 32 additions & 10 deletions docs/index.md
@@ -96,7 +96,7 @@ docker build -t roadies_image .
docker run -it roadies_image
```

### Using installation script
### Using installation script (requires sudo access)

First clone the repository:

@@ -138,22 +138,28 @@ sudo apt-get install -y wget unzip make g++ python3 python3-pip python3-setuptoo
If you encounter issues with the Boost library, add its path to `$CPLUS_LIBRARY_PATH` and save it in `~/.bashrc`.


### Quick start (with provided test dataset)
## Quick start (with provided test dataset)

Once setup is done, you can run the ROADIES pipeline using the provided test dataset. Follow these steps for a 16-core machine:

1. Create a directory for the test data and download the test datasets:
1. Go to the ROADIES repository directory, if you are not already there:

```
cd ROADIES
```

2. Create a directory for the test data and download the test datasets with the following one-line command:

```
mkdir -p test/test_data && cat test/input_genome_links.txt | xargs -I {} sh -c 'wget -O test/test_data/$(basename {}) {}'
```
2. Run the ROADIES pipeline:
3. Run the pipeline with the following command (from the ROADIES directory):

```
python run_roadies.py --cores 16
```

The first command will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. The second command will run ROADIES for those 11 Drosophila genomes and save the final newick tree as `roadies.nwk` in a separate `ROADIES/output_files` folder upon completion.
The second command will download the 11 Drosophila genomic datasets (links provided in `test/input_genome_links.txt`) and save them in the `test/test_data` directory. The third command will run ROADIES on those 11 Drosophila genomes and, upon completion, save the final Newick tree as `roadies.nwk` in a separate `output_files` folder.

**Running ROADIES with different modes of operation**: To run ROADIES in the other modes of operation (fast, balanced, accurate), which are described in the [Modes of operation](index.md#modes-of-operation) section, try the following commands:

@@ -190,13 +196,13 @@ python run_roadies.py --cores 16 --mode fast --converge

The output files for all iterations will be saved in a separate `converge_files` folder, while `output_files` will hold the results of the last iteration. The species tree for each iteration will be saved in the `converge_files` folder as `iteration_<iteration_number>.nwk`.
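
To inspect the per-iteration species trees in order, the naming scheme above can be parsed with a few lines of Python. This is a minimal sketch, not part of ROADIES: the helper `iteration_trees` is hypothetical, and it assumes the `iteration_<iteration_number>.nwk` files sit directly inside the `converge_files` folder.

```python
import re
from pathlib import Path

# Matches the naming scheme described above, e.g. iteration_3.nwk
ITERATION_PATTERN = re.compile(r"^iteration_(\d+)\.nwk$")


def iteration_trees(converge_dir: Path) -> list:
    """Return (iteration_number, path) pairs sorted by iteration number."""
    found = []
    for path in converge_dir.iterdir():
        match = ITERATION_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)  # numeric sort, so iteration 2 precedes iteration 10


# Example call, run from the ROADIES directory:
# for number, tree_path in iteration_trees(Path("converge_files")):
#     print(number, tree_path.read_text().strip())
```

Sorting on the parsed integer rather than on the file name avoids the usual pitfall where `iteration_10.nwk` sorts before `iteration_2.nwk`.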

## Usage
## Detailed Usage

This section provides detailed instructions for configuring the ROADIES pipeline for your own genomic dataset and requirements. Once the environment setup is complete, follow the steps below.

### Step 1: Specify input genomic dataset

After installing the environment, you need to get input genomic sequences for creating the species tree. To run ROADIES with your own dataset, update the `config.yaml` file to include the path to your input datasets under the `GENOMES` parameter.
After installing the environment, you need input genomic sequences from which to build the species tree. To run ROADIES with your own dataset, update the `config.yaml` file (found in the `config` folder of the ROADIES directory) to include the path to your input datasets under the `GENOMES` parameter.

!!! Note
All input genome assemblies in the path given in `GENOMES` should be in `.fa` or `.fa.gz` format. Name each genome assembly file after its species (for example, Aardvark's genome assembly should be named `Aardvark.fa`). Each file should contain the genome assembly of one unique species. If a file contains multiple species, split it into individual genome files (`faSplit` can be used for this: `faSplit byname <input_dir> <output_dir>`). Moreover, the file name should not contain special characters such as `.` (underscores are fine). For example, if the file name is `Aardvark.1.fa`, rename it to `Aardvark_1.fa`.
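
The renaming rule in the note can be automated. The following is a minimal sketch, not part of ROADIES: `sanitized_name` and `rename_genomes` are hypothetical helpers, and they only handle the `.` case from the example above, not other special characters.

```python
from pathlib import Path

GENOME_SUFFIXES = (".fa.gz", ".fa")  # the two formats the note allows


def sanitized_name(file_name: str) -> str:
    """Replace dots in the species part of the name with underscores."""
    for suffix in GENOME_SUFFIXES:
        if file_name.endswith(suffix):
            stem = file_name[: -len(suffix)]
            return stem.replace(".", "_") + suffix
    raise ValueError(f"{file_name!r} is not a .fa or .fa.gz file")


def rename_genomes(genome_dir: Path) -> None:
    """Rename every genome file in genome_dir whose name needs fixing."""
    for path in sorted(genome_dir.iterdir()):
        if not path.is_file() or not path.name.endswith(GENOME_SUFFIXES):
            continue  # leave unrelated files untouched
        new_name = sanitized_name(path.name)
        if new_name != path.name:
            path.rename(path.with_name(new_name))
```

Calling `rename_genomes` on the directory listed under `GENOMES` would turn `Aardvark.1.fa` into `Aardvark_1.fa` and leave `Aardvark.fa` unchanged.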
@@ -230,15 +236,15 @@ Adjust other parameters listed in `config.yaml` as per specific user requirement

### Step 3: Run the ROADIES pipeline

Once the required installations are completed and the parameters are configured in `config/config.yaml` file, execute the following command:
Once the required installations are complete and the parameters are configured in the `config.yaml` file, execute the following command (from the ROADIES repository home directory):

```
python run_roadies.py --cores <number of cores>
```

This runs ROADIES in accurate mode by default with the specified number of cores. When execution completes, the output species tree in Newick format is saved as `roadies.nwk` in a separate `output_files` folder.

#### Command line arguments
### Command line arguments

There are multiple command line arguments through which user can change the mode of operation, specify the custom config file path, etc.

@@ -249,6 +255,12 @@ There are multiple command line arguments through which user can change the mode
| `--converge` | Run ROADIES in [converge](index.md#convergence-mechanism) mode if you do not know the optimal gene count to start with |
| `--config` | Provide an optional custom YAML file (in the same format as the `config.yaml` provided with this repository). If not given, `config/config.yaml` is used by default. |

For example:

```
python run_roadies.py --cores 16 --mode balanced --converge --config config/config.yaml
```

Use `--help` to get the list of command line arguments.
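
When ROADIES is launched from another script, the command line can be assembled programmatically. This is a minimal sketch, not part of ROADIES: `roadies_command` and `run_roadies` are hypothetical wrappers that use only the flags shown in this section (`--cores`, `--mode`, `--converge`, `--config`), and the argument `repo_dir` is assumed to point at the ROADIES repository home directory.

```python
import subprocess

VALID_MODES = ("fast", "balanced", "accurate")  # modes named in this document


def roadies_command(cores, mode=None, converge=False, config=None):
    """Build the argument list for run_roadies.py from the documented flags."""
    cmd = ["python", "run_roadies.py", "--cores", str(cores)]
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
        cmd += ["--mode", mode]
    if converge:
        cmd.append("--converge")
    if config is not None:
        cmd += ["--config", str(config)]
    return cmd


def run_roadies(repo_dir, **options):
    """Run the pipeline from repo_dir; raises CalledProcessError on failure."""
    subprocess.run(roadies_command(**options), cwd=repo_dir, check=True)
```

For instance, `roadies_command(16, mode="balanced", converge=True, config="config/config.yaml")` yields the same arguments as the example command above.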

### Step 4: Analyze output files
@@ -307,8 +319,18 @@ For extensive debugging, other intermediate output files for each stage of the p

We welcome contributions from the community. If you encounter any issues or have suggestions for improvement, please open an issue on GitHub. For general inquiries and support, reach out to our team.

Anshu Gupta - ang037 [at] ucsd [dot] edu

Yatish Turakhia - yturakhia [at] ucsd [dot] edu

## Citing ROADIES

If you use ROADIES in your research or publications, please cite the following paper:

Gupta A, Mirarab S, Turakhia Y, (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. _bioRxiv_. https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1
Gupta A, Mirarab S, Turakhia Y (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES. _bioRxiv_. [https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1](https://www.biorxiv.org/content/10.1101/2024.05.27.596098v1).

### Accessing ROADIES output files

The output files containing the gene trees and species trees generated by ROADIES for the manuscript are deposited in [Dryad](https://datadryad.org/stash). To access them, refer to the following:

Gupta, Anshu; Mirarab, Siavash; Turakhia, Yatish (2024). Accurate, scalable, and fully automated inference of species trees from raw genome assemblies using ROADIES [Dataset]. Dryad. [https://doi.org/10.5061/dryad.tht76hf73](https://doi.org/10.5061/dryad.tht76hf73).
