The derivation of the Bayes factors is described over pages 11 to 14 of the Supplementary Materials. The approach is set out on page 11:
That is, the Bayes factors are the ratio of the likelihoods of the sequence data under the competing introduction hypotheses.
Page 14 describes how the Bayes factor is obtained by combining the posterior and prior odds:
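In symbols (writing $I_1$ and $I_2$ for the single and two introduction hypotheses and $Y$ for the sequence data; this notation is a paraphrase rather than necessarily the authors'), these two statements amount to the standard identity:

$BF = \frac{P(Y|I_2)}{P(Y|I_1)} = \frac{P(I_2|Y)/P(I_1|Y)}{P(I_2)/P(I_1)}$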
Page 12 rewrites the posterior probabilities as joint probabilities and breaks them down over the relevant MRCA haplotypes:

Page 13 breaks down the joint probabilities:
Page 13 then describes the compatibility between the different topologies and MRCA haplotypes:
As noted in #7, the topology compatibilities are:

- $\tau = \tau_{2C}$ is compatible with $S_{MRCA} \in \{S_{C/C}, S_{T/T}\}$
- $\tau = \tau_{1C}$ is compatible with $S_{MRCA} \in \{S_{A}, S_{B}\}$
- $\tau = (\tau_{P}, \tau_{P})$ is compatible with $S_{MRCA} \in \{S_{A}, S_{B}, S_{C/C}, S_{T/T}\}$
Page 13 then describes how the compatibility statements define the renormalization:
This description is wrong. The renormalization is applied to the vector of MRCA haplotype probabilities:
$P(S_{MRCA}|\tau = \tau_{2C}) = (0, 0, 0.5, 0.5)$

$P(S_{MRCA}|\tau = \tau_{1C}) = (0.5, 0.5, 0, 0)$

$P(S_{MRCA}|\tau = (\tau_{P}, \tau_{P})) = (0.25, 0.25, 0.25, 0.25)$
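Assuming the vector entries are ordered $(S_{A}, S_{B}, S_{C/C}, S_{T/T})$ and the prior over the four MRCA haplotypes is uniform, these vectors are simply that prior renormalized over the haplotypes compatible with each topology:

$P(S_{MRCA}|\tau) = \frac{P(S_{MRCA}) \cdot \mathbf{1}[S_{MRCA} \text{ compatible with } \tau]}{\sum_{S'} P(S') \cdot \mathbf{1}[S' \text{ compatible with } \tau]}$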
These allow the equation to be expanded out, with the priors:

This is mathematically identical to the authors' equations, but simpler because it avoids the superfluous marginalization over the sub-topologies.
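Concretely, with these vectors the only marginalization needed is over the topologies themselves (again a restatement, not necessarily the authors' exact notation):

$P(S_{MRCA}|I_n) = \sum_{\tau} P(S_{MRCA}|\tau) \, P(\tau|I_n)$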
The authors inferred the posterior probabilities of the MRCA haplotypes:

The authors measured the likelihoods of the topologies from the epidemic simulations:
Therefore, the single introduction likelihoods are subject to two additional conditions:

1. each clade includes 30-70% of the total taxa (taxa being sampled from the first 50,000 infections); and
2. the clades are separated by two mutations (with one being basal for $P(\tau_{1C}|I_1)$ and both being derived for $P(\tau_{2C}|I_1)$).
The additional conditions 1 and 2 reduce the single introduction likelihoods. This makes the published Bayes factors (4.2 and 4.3) meaningless, since they may be caused by this reduction rather than a higher plausibility of the two introduction hypothesis.
This can be corrected by applying conditions 1 and 2 to the two introduction likelihood, $P((\tau_{P}, \tau_{P})|I_2)$.
This comment examines the effects of condition 1. A subsequent comment will examine condition 2.
Two simulations can be tested against condition 1 by going through the first 50,000 infections across both combined, and comparing the number of samples from each, e.g.:
def test_sizes(run_1, run_2):
    # read samples into separate sets for each run
    sample_dict = {run_1: set(), run_2: set()}
    # read transmissions into one list for both runs
    transmission_list = []
    for run in [run_1, run_2]:
        with open(f'simulations/{run:04d}/subsample_times.corrected.txt', 'r') as file:
            for line in file:
                u, t = line.strip().split('\t')
                sample_dict[run].add(u)
        with open(f'simulations/{run:04d}/transmission_network.subsampled.corrected.txt', 'r') as file:
            for line in file:
                u, v, t = line.strip().split()
                if u != v:
                    transmission_list.append((v, float(t), run))
    # sort the transmissions in order of time
    transmission_list = sorted(transmission_list, key=lambda x: x[1])
    # start sample counts from -1 because the primary case is not to be included
    n_valid_samples = {run_1: -1, run_2: -1}
    # go through the first 50,000 transmissions and count those that are sampled from each run
    for event_no, transmission in enumerate(transmission_list):
        if event_no == 50000:
            break
        if transmission[0] in sample_dict[transmission[2]]:
            n_valid_samples[transmission[2]] += 1
    # compare the numbers of sampled transmissions to the relative size condition
    if 0.3 <= n_valid_samples[run_1] / (n_valid_samples[run_1] + n_valid_samples[run_2]) <= 0.7:
        return True
    else:
        return False
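As a usage sketch (the run numbers are arbitrary; the directory layout is as in the code above):

# check whether runs 0001 and 0002 satisfy the relative size condition
if test_sizes(1, 2):
    print('runs 0001 and 0002 satisfy condition 1')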
The likelihood of two successful introductions producing basal polytomies and satisfying condition 1 can be measured by repeatedly drawing random pairs of simulations and testing them against both requirements, e.g.:
# sample 1100 topologies, as for the others
count_atLeastMinDescendants = 0  # pairs with two basal polytomies that satisfy condition 1
for i in range(1100):
    # draw two simulations
    run_1, run_2 = np.random.choice(range(1, 1101), 2, replace=True)
    # get the clade sizes
    clade_sizes_1 = clade_analyses_CC_d[run_1]['clade_sizes']
    clade_sizes_2 = clade_analyses_CC_d[run_2]['clade_sizes']
    # test them for basal polytomies
    if len(clade_sizes_1) >= min_polytomy_descendants and len(clade_sizes_2) >= min_polytomy_descendants:
        # test their sizes
        if test_sizes(run_1, run_2):
            count_atLeastMinDescendants += 1
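The count of qualifying pairs can then be converted into the likelihood estimate used below; a minimal sketch (the variable name is illustrative):

# fraction of the 1100 sampled pairs with two basal polytomies that satisfy condition 1
p_two_polytomies_cond1 = count_atLeastMinDescendants / 1100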
The function `clade_analysis_updated` in the notebook `stableCoalescence_cladeAnalysis.py` can be adapted to do this instead of counting the number of simulations with basal polytomies. The function `calculate_bf` can then be adapted to use the resulting likelihood as $P((\tau_{P}, \tau_{P})|I_2)$, e.g.:
def calculate_bf(asr_results, simulation_results):
    # Let t_p be a polytomy, t_1C be one clade, and t_2C be two clades. Note that t_p includes t_1C. t_p equals all topologies with a basal polytomy (Fig. 2a).
    # trees are in the order
    # (t_p, t_1C, t_2C, (t_p,(t_p,t_1C,t_2C)), (t_1C,(t_p,t_1C,t_2C)), (t_2C,(t_p,t_1C,t_2C)))
    ...
    pr_trees_given_I2 = np.concatenate([np.array([0]*3), np.array([simulation_results[0]]), np.array([0]*8)])
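In the ordering given in the comment above, index 3 corresponds to the $(\tau_{P}, \tau_{P})$ joint topology, so the measured likelihood is assigned to that entry and the remaining entries are left at zero.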
The notebook can then be rerun to produce the following results.
Reducing the inconsistency in the treatment of the different hypotheses reduces the Bayes factors by a factor of six.
Note that the authors measured the two-week doubling times of the simulated epidemics at the 1000th infection, and found a 95% highest density interval (HDI) of 1.35 to 5.44 days. This range of early growth rates suggests that two introductions are unlikely to grow to similar sizes within the first 50,000 infections, and the results confirm this.
Code and instructions for reproducing these results are available at this branch of the authors' repository.
This analysis has some limitations:

- `test_sizes` assumes both introductions are simultaneous. This maximises the likelihood of the two introductions having similar sizes;
- Each introduction has a different delay before the first samples due to the different times of first hospitalization. Realistically, the earlier time should apply to both. Therefore, this analysis misses some early samples in the simulation with the later time of first hospitalization. In some cases, missing these samples will drop branch counts below the polytomy threshold, or clade sizes below the relative size threshold, reducing the likelihood of a basal polytomy and of satisfying condition 1;
- Each introduction has a stable coalescence based on when it reaches 50,000 infections. Assuming the stable coalescence should instead be based on when the simulations reach 50,000 combined infections, this analysis removes more basal lineages than it should, increasing the likelihood of basal polytomies.
Overall, these limitations increase the likelihood of two simulations producing two basal polytomies and satisfying condition 1, so the corrected Bayes factors may be considered a rough upper bound.
It is theoretically possible that applying condition 3 to the two introduction likelihoods could increase the Bayes factors. This would require an increase in the likelihood of two introductions producing topologies compatible with the more probable MRCA haplotypes (i.e. A or B). This is considered unlikely, but will be confirmed either way in the next comment.