We will use a vague gamma distribution centered on a mean evolutionary rate estimated by considering the tree height (molecular distance in substitutions per site) and the mean age for the root of our phylogeny (time unit = 100 Mya). As our rooted phylogeny has information about the branch lengths, we can use our R in-house script to calculate the tree heights. We also have a calibration to constrain the root age, which we will use as an approximate age for the root of our phylogeny to estimate the mean evolutionary rate.
By setting a vague shape (
Now, we have all the information we need to calculate the calculate_rateprior.R
to carry out all the tasks mentioned above. You can open this file in RStudio to find out the value of
First, we know that the molecular distance (tree height, distance from the root to present time) is equal to the mean evolutionary rate (in substitutions per site per time unit) times the age of the divergence time at the root (in time unit, which we can define later). If we have estimated our phylogeny, and therefore have estimated the value of the branch lengths, we will be able to calculate the tree height. The units of the tree height will be the following:
tree_height = rate * root_age --> units_tree_height = subst/site/time * time = subst/site
There are various functions we can use to calculate the tree heigt. We have chosen the R function `phytools::nodeHeights`. The maximum height calculated by this function corresponds to the length from the root to the heighest tip.
After calculating the tree height of our phylogeny (in subst/site) and considering the age of the root based on the fossil record or geological events (time unit = 1 Ga = 100 Mya = 1e9 years), we can get a rough estimate for the mean rate:
Time unit = 100 Mya --> mean_rate = tree_height / root_age = (subst/site) / (Mya) = subst/site/Mya (time unit = 100 Mya)
We also know that the mean of the gamma distribution that we want to use as rate prior is our parameter of interest: the mean evolutionary rate. Therefore:
mean_Gamma = mean_rate = alpha / beta
Time unit = 100 Mya: mean_rate = alpha / beta --> beta = alpha / mean_rate = 2 / mean_rate
The calibrated tree needs to incorporate the age constraints in the same time unit used to infer the mean evolutionary rate and establish the rate prior (i.e., do not forget to scale the calibrations accordingly if needed!).
If you run the R script calculate_rateprior.R
, you will see how all the steps described above take place and a new PDF file with the prior distribution to be used will be generated in a new directory called out_RData
.
We have then updated our template control files with the sigma2
parameter (i.e., variation in the clock) as 0.1 using a gamma prior with sigma2_gamma 1 10
.
Before running MCMCtree
using the approximate likelihood calculation to speed up timetree inference, we first need to calculate the vector of branch lengths, the gradient (vector), and the Hessian (matrix). We will use BASEML
for this purpose as our dataset is an amino acid alignment.
The file structure we will use is the following:
example_dating/
|
|- alignments/
| |- X/ # Directory for alignment X -- have as many directories as alignments
|
|- control_files/ # Pre-defined control file with flags to be later replaced with specific settings
|
|- Hessian/
| |- X # Directory for alignment X -- have as many directories as alignments
|
|- pipelines_Hessian # Directory where the pipeline to run `BASEML` will be executed
|
|- scripts # Scripts used to prepare control files to run `BASEML
|
|- trees
|- calibrated # Directory with the calibrated tree for `MCMCtree`
|- uncalibrated # Directory with the uncalibrated tree for `BASEML`
To create the example_dating
file structure, we run the following commands:
# Run the following commands from
# directory `00_BASEML`
mkdir -p tmp/example_dating
cd tmp/example_dating
num_aln=1
for i in `seq 1 $num_aln`
do
mkdir -p alignments/$i
mkdir -p Hessian/$i/prepare_baseml
mkdir -p trees/{uncalibrated/$i,calibrated/$i}
mkdir -p control_files/$i
done
mkdir -p pipelines_Hessian
mkdir scripts
Once the file structure is created, we can now populate it with the input files we have generated some minutes ago: alignment files, tree files, and control files. We will now relocate the location of this example_dating
directory so that our pipelines work:
# Run from `tmp/example_dating`
# Copy alignment
cp ../../../../00_data_formatting/01_inp_data/example_aln.phy alignments/1/
# Now, transfer calibrated tree
cp ../../../../00_data_formatting/01_inp_data/example_calib_MCMCtree.tree trees/calibrated/1
# Transfer uncalibrated tree
cp ../../../../00_data_formatting/01_inp_data/tree_example_uncalib.tree trees/uncalibrated/1
# Next, copy control file
cp ../../control_files/prepbaseml.ctl control_files/1
# Last, copy the in-house bash scripts with our pipeline
cp ../../scripts/*sh scripts/
cd ../
mv example_dating ../../../
cd ../
rm -r tmp
Now, we need to generate other input files to estimate the Hessian and the gradient: the input control files for BASEML
. To do this in a reproducible manner, you can use the script generate_prepbaseml.sh
, which you can find in the 01_PAML/00_BASEML/scripts
and which you should have in your example_dating
directory. Now,go to the example_dating
directory and run the next code snippet, where you will execute this script. Specifically, the generate_prepbaseml.sh
script needs one argument: the directory name where the alignments are saved: 1
, 2
, and 3
in our case!
# Run from `example_dating/scripts`
# Please change directories until
# you are there. Then, run the following
# commands.
chmod 775 *sh
# In this case, there are three alignments, so
# we can execute our script within a loop
num_aln=1
for i in `seq 1 $num_aln`
do
./generate_prepbaseml.sh $i
done
To make sure that all the paths have been properly extracted, you can run the following code snippet:
# Run from `example_dating/Hessian`
# Please change directories until
# you are there. Then, run the following
# commands.
grep 'seqfile' */prepare_baseml/*ctl
grep 'treefile' */prepare_baseml/*ctl
Now that we have the input files (alignment and tree files) and the instructions to run BASEML
(control file) in example_dating
, we will be manually running MCMCtree
inside each prepare_baseml
directory (see file structure above) in a special mode that launches BASEML
for the sole purpose we want: to infer the vectors and matrix required to approximate the likelihood calculation.
# Run `MCMCtree` from
# `example_dating/Hessian/1/prepare_baseml`.
# Please change directories until
# you are in there.
# The first command to change directories
# will work if you are still in
# `main/Hessian`, otherwise ignore and
# move to such directory with the command
# that best suits your current directory.
# If you had more than one alignment, you
# could write a `for` loop or access each
# dir individually.
dir=1
cd $dir/prepare_baseml
mcmctree prepbaseml*ctl # You may have other aliases to run `MCMCtree`,
# so run this command accordingly!
First, you will see that MCMCtree
starts parsing the first locus. Then, you will see something like the following printed on your screen (some sections may change depending on the PAML version you have installed on your cluster!):
*** Locus 1 ***
running baseml tmp0001.ctl
BASEML in paml version 4.10.7, June 2023
ns = 4 ls = 48
Reading sequences, sequential format..
Reading seq # 4: sp4
Sequences read..
48 site patterns read, 609 sites
Counting frequencies..
72 bytes for distance
4608 bytes for conP
1536 bytes for fhK
8000000 bytes for space
As soon as you see the last line, you will see that various tmp000X*
files will have been created, and hence you can stop this run by typing ctrl+C
on the terminal that you have used to run such command -- the dummy alignment is so short that you will not have time to even do this, but this trick is useful when having large phylogenomic datasets!
Once you have done this, you can check that the control file you will later need has been created:
# Run from the `example_dating/Hessian`
# Please change directories until
# you are there. Then, run the following
# commands.
grep 'seqfile' */*/tmp0001.ctl | wc -l # You should get as many datasets as you have, in this case 1
Note that, when we ran the commands above, we were not interested in running BASEML
or MCMCtree
. We just wanted to execute MCMCtree
with option usedata = 3
so that it generates the tmp000*
files that BASEML
will later need to estimate the branch lengths, the gradient, and the Hessian. We do this analysis in two steps given that there may be restrictions in the HPC you will be using in the future (or are using at the moment) that do not allow to run BASEML
+ MCMCtree
in one unique job within a reasonable amount of time (i.e., this applies to genomic datasets in case you wanted to scale your analyses with larger datasets, you could see how quickly this runs with a dummy dataset!). In addition, we want to modify some settings in the control file that is automatically generated when enabling usedata = 3
so that they match what we want to do for our inference. In a nutshell, this is what you will be doing:
- Run
MCMCtree
to generate thetmp000*
files. - Modify the
tmp0001.ctl
file according to the settings we want to enable to analyse our dataset withBASEML
. - Run
BASEML
using thetmp000*
files so that it estimates the branch lengths, the gradient, and the Hessian and saves them in a file calledrst2
. - Generate the final
in.BV
file for our dataset, which will be later used byMCMCtree
.
Once all tmp000*
files are generated for all alignments, we need to make sure that the correct evolutionary model has been enabled (i.e., model = 4
, ncatG=4
for HKY85+G4) and that option method = 1
is enabled, which will speed up the computation of the Hessian and the gradient. We can run the next code snippet to very that the four requirements aforementioned are met:
# Run from the `example_dating/Hessian`
# Please change directories until
# you are there. Then, run the following
# commands.
sed -i 's/method\ \=\ 0/method\ \=\ 1/' */*/tmp0001.ctl
grep 'method = 1' */*/tmp0001.ctl | wc -l # You should get as many as datasets you have
grep 'alpha' */*/tmp0001.ctl # You should see `fix_alpha = 0` and `alpha = 0.5`
grep 'ncatG' */*/tmp0001.ctl # You should see `ncatG = 4`
grep 'model' */*/tmp0001.ctl # You should see `model = 3` (i.e., empirical+F model)
We can now run BASEML
given that we have the control file ready as well as all the required input files!
We have created a template bash script with flags (i.e., see script pipeline_Hessian_BASEML_template_PC.sh
in the scripts
directory), which will be replaced with the appropriate values by another bash script (i.e.,generate_job_BASEML_PC.sh
, also saved in the scripts
directory). Please note that the second bash script will edit the template bash script according to the data alignment/s that will be analysed. We had already copied these scripts to the example_dating
directory when setting our file structure. Therefore, we just need to execute the following code snippet there:
# Run from `example_dating` dir.
# Please change directories until
# you are there. Then, run the following
# commands.
home_dir=$( pwd )
cd scripts
chmod 775 *sh
num_aln=1
# Arg1: Number of alignments
# Arg2: Path to the pipeline directory
# Arg3: Name of the working directory (i.e., `example_dating` in this analysis)
# Arg4: Name of the executable file for BASEML. E.g., `baseml4.10.7`, `baseml`, etc.
# Arg5: Boolean, PAML exported to the path? `Y` or `N`.
# If `N`, the executable file will be required to be in the home dirctory,
# i.e., directory which name you type as `Arg3`.
./generate_job_BASEML_PC.sh $num_aln $home_dir/pipelines_Hessian example_dating baseml Y
Next, we will go to the pipelines_Hessian
directory and run the script that will have been generated using the commands above:
# Run from `example_dating/pipelines_Hessian`.
# Please change directories until
# you are there. Then, run the following
# commands.
#
# If you list the content of this directory,
# you will see the pipeline you will need
# to execute in a bash script called
# `pipeline_Hessian.sh`
ll *
# Now, execute this bash script
chmod 775 *sh
./pipeline_Hessian.sh & # Include the `&` to run this job in the background!
Once BASEML
finishes, we are ready to generate the in.BV
file that we will later use when running MCMCtree
to approximate the likelihood calculation:
# Run from dir `example_dating/Hessian/`
# Please change directories until
# you are there. Then, run the following
# commands.
num_aln=1
for i in `seq 1 $num_aln`
do
printf "\nGenerating in.BV files for dir "$i" ... ...\n\n"
cp $i/rst2 $i/in.BV
done
We can now proceed to timetree inference with MCMCtree
! You can click this link to move to the next README.md
file!