Welcome! imPhy is a small Python and R pipeline for simulating, imputing, and analyzing phylogenetic trees. The main purpose of imPhy is to assess the effectiveness of methods for imputing the positions of missing leaves in bifurcating trees. Effectiveness is measured by the distance between the original simulated trees and the imputed trees, using Robinson-Foulds and Billera-Holmes-Vogtmann (BHV) distances. Unless only a small number of trees are being imputed, it is recommended to run imPhy on a server.
Python is used to generate data and organize files, while R is used to remove some data to simulate missingness, and to make plots. If you have your own data and plotting software, there is no need to use R. Similarly, C++ is used to run our imputation method, but you can use your own imputation code in imPhy without interfacing with C++.
In the future, imPhy will be developed very slowly, if at all, so this repository is intended as a proof of concept and record for other researchers who would like to implement a similar pipeline.
Each file uses docopt, so running the file with `-h` will bring up a help guide if you need it.
- Clone this repository.
- Install dependencies.
- Switch directories, using `cd imPhy/src/`. It is safest to run the files from this directory.
- Edit `imPhy/src/cpp/Makefile` to match your Gurobi installation and C++ compiler, and run `make`.
  - If you would prefer to use your own imputation method, `cp` your imputation file to `imPhy/src/cpp/missing1.o`. It does not have to be written in C++, but for now the file will have to be named `missingX.o`, and run using the command `./missingX.o infile outfile`.
  - Please ensure that the input and output of your file correspond to the file APIs described in `imPhy/src/cpp/examples/`.
- You can now run your own experiments by editing `imPhy/src/settings.py`. A good starting command is `python3 experiment.py -dfp apicomplexa`, which will run the pipeline on the Apicomplexa dataset.
  - Notably, the "Testing Variables" section of `imPhy/src/settings.py` will be used if the `-t` flag is passed to `imPhy/src/experiment.py`. Otherwise, the "Experiment Variables" settings will be used.
  - To use a different imputation method, modify the `methods` variables in `imPhy/src/settings.py`. A list of `[1, 2]` corresponds to imputing with both the `missing1.o` and `missing2.o` files.
There are four main modules in the imPhy pipeline. They can all be run from `experiment.py` by modifying the `flow_dict` variables in `imPhy/src/settings.py`.
- Tree Generation (Python only)
  - Creates trees with DendroPy, a very useful phylogenetic library for Python. ImPhy uses the Yule process to create species trees, and the contained coalescent model to create gene trees. Trees are written to `imPhy/my_experiment/batch_A/nexus/`.
  - Parameters:
    - Species Depth: Height in generations of the species tree. Calculated by `(Effective Population Size) * (C-ratio)`.
    - Effective Population Size is equivalent to N, for a population of N haploid individuals, or 2N for a population of N diploid individuals. Defaults to 10000.
    - C-ratio: Ratio of Species Depth to Effective Population Size. Used as a proxy for species depth.
    - Number of Species Trees: The number of species trees to create with each depth and population size. This is effectively a "number of trials" adjuster.
    - Number of Species: Number of leaves in the species tree.
    - Number of Gene Trees: This many gene trees will be coalesced within each species tree. More gene trees means more information for the imputation software, but higher memory requirements.
    - Number of Individuals per Species: Controls the number of leaves in each gene tree, which is equal to `(Num Individuals per Species) * (Num Species)`.
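The two derived quantities in the parameter list are simple products. As a rough illustration only (the function and argument names below are hypothetical and are not taken from imPhy's `settings.py`), they could be computed like this:

```python
# Hypothetical helper names; imPhy's own settings and variable names may differ.

def species_depth(effective_pop_size, c_ratio):
    """Species tree height in generations: (Effective Population Size) * (C-ratio)."""
    return effective_pop_size * c_ratio


def gene_tree_leaves(individuals_per_species, num_species):
    """Leaves per gene tree: (Num Individuals per Species) * (Num Species)."""
    return individuals_per_species * num_species


# Example using the default effective population size of 10000.
depth = species_depth(effective_pop_size=10000, c_ratio=2)
leaves = gene_tree_leaves(individuals_per_species=2, num_species=5)
print(depth, leaves)
```

Because Species Depth is derived from the C-ratio, varying the C-ratio at a fixed population size is what changes the height of the simulated species trees.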
- Dropping Leaves (R only)
  - To impute leaves (individuals), some must be missing. For this purpose imPhy uses the APE package in R. There are two methods for choosing the number of leaves to drop, which are chosen automatically based on the value of the leaf dropping parameter p. The identities of leaves to be dropped are chosen randomly regardless of which method is chosen. In `experiment.py`, p is set using the `probs` list. Trees are written to `imPhy/my_experiment/batch_A/data/` in the data format found in `imPhy/src/cpp/examples/`, which uses vectorized distance matrices.
    - p < 1 causes p to be interpreted as the probability of success (dropping a leaf) in a binomial distribution with size equal to the number of leaves in the tree. A single draw is made from the distribution, which serves as the number of leaves to drop from the tree.
    - p >= 1 will result in 1/p of the leaves in a given tree being dropped. If the resulting number is not an integer, it will be rounded.
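The rule for choosing how many leaves to drop can be sketched in Python. imPhy performs this step in R with APE, so the following is only an illustration of the logic described above. The function names are hypothetical, and Python's built-in `round` is assumed to be an acceptable stand-in for whatever rounding the R code applies.

```python
import random


def num_leaves_to_drop(p, num_leaves, rng=random):
    """Illustrative only: choose how many leaves to drop for parameter p."""
    if p < 1:
        # One binomial(num_leaves, p) draw, built from independent Bernoulli trials.
        return sum(rng.random() < p for _ in range(num_leaves))
    # p >= 1: drop 1/p of the leaves, rounded to a whole number.
    return round(num_leaves / p)


def choose_leaves_to_drop(p, leaf_names, rng=random):
    """Pick the identities of the dropped leaves uniformly at random."""
    k = num_leaves_to_drop(p, len(leaf_names), rng)
    return rng.sample(leaf_names, k)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which can help when comparing imputation methods on the same pattern of missing leaves.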
- Imputation (Python and the imputation language)
  - Imputation can be performed using any file that matches `missingX.o`. A bash wrapper for imputation techniques that are not written in C++ is included in `imPhy/src/cpp/missing9.o`.
  - Our imputation method uses mutual information from multiple gene trees that have coalesced on the same species tree. Gurobi and C++ are used to impute leaves, so it is necessary to have a valid Gurobi installation and to compile the C++ file on the machine it will be run on. The imputation software can be swapped out with another file without impacting the rest of the pipeline, provided the inputs and outputs are of the same format.
    - Inputs go to `src/cpp/data/`, and outputs go to `src/cpp/sol/`.
    - Examples can be found in `src/cpp/examples/`.
    - For best results, name your imputation code `missingX.o`, where X is a number. Then, in `experiment.py`, the methods list can be set as:

          # impute using imPhy/src/cpp/missingX.o
          methods = [X]

          # impute using imPhy/src/cpp/missingX.o and
          # imPhy/src/cpp/missing1.o
          methods = [X, 1]
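To make the naming convention concrete, here is a minimal sketch of how a `methods` list could be turned into the `./missingX.o infile outfile` commands described earlier. The helper name and the example file names are hypothetical, and this is not imPhy's actual dispatch code.

```python
def imputation_commands(methods, infile, outfile):
    """Build one './missingX.o infile outfile' command per method number."""
    return [["./missing{}.o".format(x), infile, outfile] for x in methods]


# Hypothetical input and output file names, for illustration only.
for command in imputation_commands([9, 1], "example_input.txt", "example_output.sol"):
    print(" ".join(command))
```

A command list in this form could be handed to `subprocess.run` from the directory that holds the `missingX.o` files.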
- Analysis (Python, R, and Java)
  - In the analysis section, DendroPy is used to take the Robinson-Foulds distance between original trees and their imputed siblings, while Owen and Provan's GTP code is used to calculate the BHV geodesic. Files containing information about these distances are written to `imPhy/my_experiment/batch_A/stats/`. If the next step is impossible to run, these files can still provide interesting information. Headers are included in the CSVs.
  - CSV files (Python only)
    - The CSV files contain information about the distances between imputed and original leaves and trees. Files created in this step are located in `imPhy/my_experiment/`.
  - Diagnostic Plots (R only)
    - This step can be run by passing the `-d` flag to `experiment.py`. These plots are useful as a sanity check for imputation quality, but can only be used if the dependencies are installed. Files created in this step are located in `imPhy/my_experiment/` as well as `imPhy/my_experiment/heatmaps/`. The plots are most useful for smaller experiments.
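For readers unfamiliar with the Robinson-Foulds distance, the idea can be illustrated without DendroPy. The sketch below is not imPhy's analysis code: it assumes rooted trees written as nested tuples of leaf names (a representation chosen here purely for illustration) and counts the clades that appear in one tree but not the other. DendroPy works on its own tree objects and may treat rooting differently.

```python
def clades(tree):
    """Return the clades of a nested-tuple tree as a set of frozensets of leaf names."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            # A leaf: its only descendant leaf is itself.
            return frozenset([node])
        # An internal node: its clade is the union of its children's leaf sets.
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    leaves_below(tree)
    return found


def rooted_rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


original = (("A", "B"), ("C", "D"))
imputed = (("A", "C"), ("B", "D"))
print(rooted_rf_distance(original, imputed))
```

A distance of zero means the two trees have the same branching structure, so smaller values indicate that imputation placed the missing leaves closer to their original positions. Unlike the BHV distance, this measure ignores branch lengths.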
Other Features:
- Multiprocessing: imPhy uses Python's multiprocessing library to reduce total computation time. Parallel imputation methods have not been tested with the 'parallel' flag in the code, so caution is advised when using imputation methods that use more than one process.
- Logging: When big jobs are run on servers, it can be difficult to identify failure points. For this purpose, `imPhy/my_experiment/output.log` is placed inside the experiment folder. It contains, in sequence, all the output created by each process during the run.
This code has only been tested on Python 3.5 and R 3.3. Some functions may break when using earlier versions. Anaconda is recommended for installing Python and its packages. It can also be used to install R, but its R support is not as good as its Python support.
Python Packages:
R Packages:
Others:
Feel free to contact me on GitHub via the imPhy repo, at https://github.com/yasuiniko/imPhy. Problems can be reported using the Issues tab.
imPhy
│ License.md
│ README.md
│
└───my_experiment
│ │ interleaf_error.csv
│ │ interleaf_error.pdf
│ │ intertree_all.csv
│ │ intertree_error.csv
│ │ intertree_error.pdf
│ │ outlier_counts.pdf
│ │ output.log
│ │
│ └───batch_A
│ │ └───data
│ │ │ │ data_file_1.txt.gz
│ │ │ │ data_file_1_true.txt.gz
│ │ └───nexus
│ │ │ │ gene_trees_1.nex
│ │ │ │ separated.txt
│ │ │ │ species.nex
│ │ └───solutions
│ │ │ │ gene_trees_1.sol.gz
│ │ └───stats
│ │ │ │ gene_trees_1_all.txt.gz
│ │ │ │ gene_trees_1_tree_all.txt.gz
│ │ │ │ gene_trees_1_tree.txt.gz
│ │ │ │ gene_trees_1.txt.gz
│ └───batch_B
│ │
│ ...
│
└───src
│
...
I'd like to thank Dr. Yoshida for her leadership and excellent advice, Dr. Fukumizu for his sharp insight and for generously hosting me, and Dr. Vogiatzis for his constant support and development of the C++ imputation software. This software would not be possible without their great efforts.