09-maximum-likelihood.Rmd

# (PART\*) MAXIMUM LIKELIHOOD {-}

```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```

# What is Maximum Likelihood?

We are finally ready to try estimating trees using a true phylogenetic method. Maximum likelihood is a tree-searching method that attempts to find a tree that maximizes the likelihood of observing the data. Put another way, we want a tree that makes the data we see likely. If you've ever done any sort of statistical test (like a linear regression or a chi-square test), you've been doing maximum likelihood tests. 

Essentially, for each possible DNA mutation you've seen in the data, you are finding a tree that recreates the most likely pattern of mutation that you see in your data. This includes limiting the amount of homoplasy or mutational reversals.

Maximum likelihood searches are tree-based searches (like parsimony). You supply a starting tree, and the algorithm estimates the likelihood of that tree given the data. (The likelihood value is often very small, so we transform it using to the loglikelihood for ease.) Then the algorithm permutates (or changes) the tree and calculates the likelihood of the new tree. If the new tree has a better likelihood, then it is kept and another permutation is tried. If the likelihood of the new tree is worse than the original tree, then the new tree is discarded and the algorithm tries a different permutation on the old tree.

The advent of the maximum likelihood method (published by Joe Felsenstein in 1981) revolutionized the field of evolutionary biology and really kickstarted the creation of phylogenetics as it exists today. 

```{r, echo = FALSE, warning=FALSE, message = FALSE}
library(ape)
```

# Estimating a tree using Maximum Likelihood

We will be using the `phangorn` package in R to estimate maximum likelihood methods. 

`phangorn` has three possible tree permutation methods available in its maximum likelihood commands. The first is the nearest neighbor interchange (NNI) method, which we used in our parsimony estimation. However, the NNI algorithm can sometimes get stuck on a local optimum in tree space (especially as the tree gets larger). The other two (the stochastic approach and the likelihood ratchet) will make random NNI rearrangements to the tree, which can allow the tree to escape local optima and instead find a global optimum of our tree-searching space.


## Creating a starting tree

`phangorn` requires a starting tree for maximum likelihood searches. The tree itself doesn't matter much, but generally people supply a neighbor joining tree because they can be generated quickly. 

```{r, warning=FALSE, message = FALSE}
library(phangorn)

grass.align <- read.phyDat("grass_aligned-renamed.fasta", format = "fasta")

dist <- dist.ml(grass.align)
nj.tree <- nj(dist)
```

You can also upload your previously-generated neighbor joining tree and use that instead of creating a new one (just make sure that you used the same sample names in both your renamed fasta file and your neighbor joining tree).

```{r, warning=FALSE, message = FALSE, eval = FALSE}

nj.tree <- read.tree("nj.tre")
```


## Calculating likelihood score

Once you have both a tree and your fasta file loaded, you can calculated a likelihood score using the `pml` command.

```{r, warning=FALSE, message = FALSE}
fit <- pml(nj.tree, data = grass.align)
fit
```

One of the nice things about the `pml` output is that it includes the details of the substitution model used in the likelihood calculation. In this case, because we did not specify any model, the Jukes-Cantor (JC, or JC69) was used by default. We can tell this because the base frequencies are all 0.25 and there is only one substitution rate used in the rate matrix.

This output also reports both a log likelihood and an unconstrained log likelihood. The difference between the two values has to do with the size of the parameter space being searched. In my experience, the log likelihood (not the unconstrained log likelihood) is usually reported, though either is fine as long as you are consistent when testing multiple models.

## Optimizing the maximum likelihood parameters

The `pml` command simply calculates the likelihood of the data, given a tree. If we want to find the optimum tree (and we do), we need to optimize the tree topology (as well as branch length estimates) for our data. We can do this using the `optim.pml` command. For this particular model (JC), we simply need to supply a `pml` object (which we created above) and decide what type of tree arrangement we want to do.

```{r, warning=FALSE, message = FALSE}
fitJC <- optim.pml(fit, rearrangement = "NNI")
plot(fitJC$tree, main = "JC, NNI rearrangement")
```

With the choice of NNI rearrangement, you can see that only a few moves were necessary in order to find the ML tree. We can also choose a "stochastic" or "ratchet" (likelihood ratchet) tree rearrangement algorithm. I like to use the stochastic rearrangement.

Using either the stochastic or ratchet tree rearrangement takes more time than NNI. Here I've also added a command to hide the output while the algorithm is running, but feel free to leave it off so you can see what the algorithm is doing!

```{r, warning=FALSE, message = FALSE}
fitJC <- optim.pml(fit, rearrangement = "stochastic", control = pml.control(trace = 0))
plot(fitJC$tree, main = "JC, stochastic rearrangement")
```


## Changing the model for your ML search

All the work above is for the Jukes-Cantor model. However, there is a whole world of other substition models out there, and it would be nice to use them as well. We can update the `pml` object "fit" with other models.

According to the model test we ran on the grass data in the Neighbor Joining section, the best model for this dataset is the GTR + G model. The gamma distribution has two parameters, a shape parameter and a scale parameter. In phylogenetics, we commonly think of these as the number of rate categories and a shape parameter. When adding a gamma distribution, we can just specify the number of rate categories to start. The standard is 4.

```{r, warning=FALSE, message = FALSE}
fitGTR.G <- update(fit, model = "GTR", k = 4)
fitGTR.G
```

After we update the model and add the gamma distribution, we can see that fitGTR now 4 rate categories and a shape parameter = 1. However, the rate matrix and base frequencies are still the same as the JC model, because we did not update those. We could if we wanted to, but it's easier to let the `optim.pml` algorithm do it.

```{r, warning=FALSE, message = FALSE}
fitGTR.G <- optim.pml(fitGTR.G, model = "GTR", optGamma = T, rearrangement = "stochastic", control = pml.control(trace = 0))

fitGTR.G
plot(fitGTR.G$tree, main = "GTR + G")
```

We can also update the model to include an invariant sites parameter (inv = 0.2).

```{r, warning=FALSE, message = FALSE}
fitGTR.G.I <- update(fitGTR.G, inv = 0.2)

fitGTR.G.I <- optim.pml(fitGTR.G.I, model = "GTR", optGamma = T, optInv = T,  rearrangement = "stochastic", control = pml.control(trace = 0))

fitGTR.G.I
plot(fitGTR.G.I$tree, main = "GTR + I + G")
```

Keep in mind that we never want to overwrite our original "fit" object. By updating the "fit" object (and saving the update by a new name), we can easily go back and change the model for our ML estimation. Let's try a K80 (Kimura 2-parameter) model with an invariant sites parameter. In the `optim.pml` command, we can leave out the `optGamma = T` argument because we no longer have a gamma distribution included in the model.

```{r, warning=FALSE, message = FALSE}
fitK80.I <- update(fit, model = "K80", inv = 0.2)

fitK80.I <- optim.pml(fitK80.I, model = "K80", optInv = T,  rearrangement = "stochastic", control = pml.control(trace = 0))

fitK80.I
plot(fitK80.I$tree, main = "K80 + I")
```

## Bootstrapping the ML tree

Once we have decided which ML tree fits the data best (we can do so by choosing the one with the biggest log likelihood value), we can estimate the support for each branch in the tree with bootstrapping. Bootstrapping for ML trees can be _very, very_ slow, so be patient!

Ideally we want to do at least 1000 bootstrap replicates. Many guides will set the number of bootstrap replicates to 100, but this is often chosen for speed. We really need at least 1000 replicates to have any sort of trust in the values.

We will use the `bootstrap.pml` command, which does have an option to estimate trees in parallel, which can help speed up the bootstrapping process. This option doesn't appear to work on Windows machines, but it works fine on AnVIL. Using AnVIL itself will also help speed up the estimation.

Let's try bootstrapping the GTR + G tree, which was the model chosen for us using the `modelTest` command.

```{r, warning=FALSE, message = FALSE, eval = FALSE}
bs <- bootstrap.pml(fitGTR.G, bs=1000, multicore = T, optNni=TRUE,
+                     control = pml.control(trace = 0))
```

```{r, warning=FALSE, message = FALSE, scho = FALSE}
bs <- bootstrap.pml(fitGTR.G, bs=100, optNni=TRUE, control = pml.control(trace = 0))
```


To plot the bootstrap values, we can use a special plot function called `plotBS`. First, however, we need to root the tree. Then we can plot the rooted tree with our bootstrap values (saved in a list called "bs"), We are also going to format the bootstrap values so that only those values greater than 0.5 (50% bootstrap support) are shown using the `p` argument. We can also decide what color to make the bootstrap values and how many decimal places we want to see. (For a semi-complete list of colors in R, look [here](https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/).)

```{r, warning=FALSE, message = FALSE, eval=FALSE}
tree.root <- root(fitGTR.G$tree, outgroup = c('barley_D-hordein','Siberian wild rye_D-hordein'))

plotBS(tree.root, bs, main = "GTR + G bootstrap", type = "p",
       bs.col="orange", p = 0.5, digits = 2)
```


<img src="resources/images/09-bs_ml_tree.png" title="Major point!! example image" alt="Major point!! example image" style="display: block; margin: auto;" />


## Saving your bootstrap tree

Before you end your session, make sure to save your trees to your persistent disk.

```{r, warning=FALSE, message = FALSE, eval=FALSE}
write.tree(fitGTR.G$tree, file="grass_ml.tre")

write.tree(bs, file="grass_ml_bootstrap.tre")
```

```{r}
sessionInfo()
```