Drift trajectory implementation #1803
base: main
Codecov Report
```
@@            Coverage Diff             @@
##             main    #1803      +/-   ##
==========================================
- Coverage   91.05%   91.04%   -0.01%
==========================================
  Files          20       20
  Lines       10493    10523      +30
  Branches     2219     2225       +6
==========================================
+ Hits         9554     9581      +27
  Misses        489      489
- Partials      450      453       +3
```
Will take a look next week--probably Tuesday.
kevin, note this is just the neutral fixation case, an incremental step towards getting the soft sweep model in.
also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. Right now in `verification.py` i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out) and i'm relying on the rest of the structured coalescent machinery to just work. it should though.
Yes, we don't want anything slipping through the cracks, so more verification is a good idea.
```c
on loss (looking backwards in time) following Ewens */
double ux = -1.0 * freq;
int sign = u < 0.5 ? 1 : -1;
return freq + (ux * dt) + sign * sqrt(freq * (1.0 - freq) * dt);
```
Can we think of reasonable ways to handle the possibility of numeric errors here? It seems that there are no restrictions placed on the input values, so one can get negative frequencies out:
```python
>>> import math
>>> dt = 1e-8
>>> freq = 1e-8
>>> ux = -freq
>>> sign = -1
>>> freq + (ux * dt) + sign * math.sqrt(freq * (1.0 - freq) * dt)
-4.999999918875795e-17
```
Perhaps this should return `max(value, 0.0)`?
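A minimal Python sketch of the suggested clamping (a hypothetical port of the C snippet above for illustration; `neutral_step_backwards` is an invented name, not msprime's API):

```python
import math

def neutral_step_backwards(freq, dt, u):
    """One Euler step of the neutral allele-frequency trajectory,
    looking backwards in time, mirroring the C snippet above.

    freq : current allele frequency in (0, 1)
    dt   : time step
    u    : uniform(0, 1) draw choosing the sign of the noise term
    """
    ux = -freq
    sign = 1 if u < 0.5 else -1
    value = freq + ux * dt + sign * math.sqrt(freq * (1.0 - freq) * dt)
    # Clamp to [0, 1]: floating-point error can push the result just
    # outside the valid frequency range, as the session above shows.
    return min(max(value, 0.0), 1.0)
```

With the inputs from the session above (`freq = dt = 1e-8`, negative sign), the unclamped value is about `-5e-17`; the clamped version returns `0.0` instead.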
```c
@@ -1219,9 +1224,23 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
        goto out;
    }
}
```
I can't comment on the CPython bits -- need @jeromekelleher for that.
The model is one of a structured coalescent where selective backgrounds are
defined as in
`Braverman et al. (1995) <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1206652/>`_.
The implementation details here follow closely those in discoal.
Is the scaling identical to Schrider and Kern? The sweep model wasn't, so any differences here also need to be documented.
```python
def test_model_end_broken(self):
    # Checking that we're correctly detecting the fact that
    # sweeps are non-reentrant.
    model = msprime.NeutralFixation(
```
Totally opaque to me what this test is doing.
I think these are copy-pastes of the Sweep selection model tests, and testing a bunch of internal things. This is checking for a known limitation.
yep-- testing all this stuff in the `s=0` case. thought these would be good to include, though not necessary.
```python
    position=0.5, start_frequency=0.1, end_frequency=0.9, dt=0.01
)
for num_labels in [1, 3, 10]:
    # Not the best error, but this shouldn't be exposed to the user anyway.
```
This test is also unclear--`num_labels` is an undocumented part of the public API. What's going on here?
right-- `num_labels` refers to the number of labels we use internally in the structured coalescent. we generalized the bookkeeping of nodes to include a `label` a ways back. here we are making sure that the model returns the correct number of labels, in this case 2.
```python
)
assert ts.num_trees == 1

def test_neutral_fixation_no_recomb(self):
```
is this test helpful? If it failed, that wouldn't imply an error in the neutral fixations, but somewhere else in the library, right?
fair. can probably be removed.
```python
for tree in ts.trees():
    assert tree.num_roots == 1

def test_neutral_fixation_recomb(self):
```
Tests like this are a bit strange--the number of trees is only > 1 sometimes, depending on recombination rate, seed, etc., and the test could fail at a later time if some other part of the library is touched.
```python
for j in range(0, 10):
    models.extend(
        [
            # Start each sweep after 0.01 generations of Hudson
```
is this really generations, or "time units"?
should be generations-- internal time in msprime.
```python
@@ -1641,6 +1641,60 @@ def test_sojourn_time2(self):
        pyplot.savefig(f, dpi=72)
        pyplot.close("all")

    def test_sojourn_time3(self):
        """
        testing against expected sojourn time of a
```
Comments need more detail for future readers: what is the expectation, etc.?
```python
    reptimes[i] = np.max(tree_times)
    i += 1
data["alpha_means"].append(np.mean(reptimes))
data["exp_means"].append(2 * n)
```
This is why comments are useful: many will think that the expected time to fixation is 4N, but here it is written as 2N--why?
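As a sanity check on the 2N-vs-4N question, here is a hedged Monte Carlo sketch (illustrative only, not the PR's verification code). In the haploid Wright-Fisher scaling, Kimura and Ohta's diffusion result for the mean fixation time of a neutral allele, conditional on fixation, is t_fix(p) = -2N(1-p)ln(1-p)/p generations, which tends to 2N as p -> 0; the familiar 4N arises in the diploid scaling.

```python
import math
import numpy as np

def mean_conditional_fixation_time(N=100, p0=0.5, reps=500, seed=1):
    """Simulate haploid Wright-Fisher drift and average the absorption
    time over replicates that fix (illustrative sketch, not msprime code)."""
    rng = np.random.default_rng(seed)
    times = []
    while len(times) < reps:
        x, t = p0, 0
        while 0.0 < x < 1.0:
            x = rng.binomial(N, x) / N  # one generation of binomial drift
            t += 1
        if x == 1.0:  # condition on fixation
            times.append(t)
    return float(np.mean(times))

# Diffusion prediction for p0 = 0.5 is 2N * ln(2), i.e. about 1.39 N
# generations in this haploid scaling.
```

The simulated mean agrees with the 2N-scaled prediction to within a few percent for modest N, which is the kind of detail a comment at this line could point to.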
I haven't gone through the details @andrewkern, but it mostly seems good. My high-level input would be that there seems to be a lot of copy-paste boilerplate which is basically identical to the SweepGenicSelection model. Perhaps we could think about generalising that, and make the SweepGenicSelection class as a subclass of this for compatibility purposes?
Yeah I agree-- this is too much of the same code repeated. One thing that I was having a hard time wrapping my head around is how to generalize this sort of model; any advice would be appreciated here, Jerome.
It looks like the only thing that's actually different is the trajectory generator, so what I'd like to do is to decouple the allele frequency trajectory generator from the sweep model, something like:

```python
class TrajectoryGenerator:
    start_frequency: float
    end_frequency: float
    dt: float

class SweepModel:
    position: float
    trajectory: TrajectoryGenerator
```

We could then do stuff like:

```python
model = SweepModel(position=0.5, trajectory=GenicSelectionTrajectory(0.1, 0.9, 1e-6))
# or
model = SweepModel(position=0.5, trajectory=NeutralFixationTrajectory(0.1, 0.9, 1e-6))
```

We could keep support for the existing SweepGenicSelection class easily enough by using these building blocks. These TrajectoryGenerator classes would front-end a bunch of stuff from the CPython layer, and would eventually end up referring to the corresponding struct + function pointer in C (a bit like ancestry models or demographic events). This would mean that we could also test the trajectory generation code independently of the actual simulations (a big issue at the moment), and also do things like allow people to define trajectory generators in Python, which would be super slow but handy for prototyping (or maybe not -- we're just generating numpy arrays, so some models might be possible to implement efficiently using numpy).

Would that work? Are there more types of sweep that would fit into this framework, i.e., where we just define the trajectory generator and don't touch the actual ancestry simulation code at all? If so, then maybe the best thing to do would be to back out of all the Python level changes and just commit the neutral trajectory in C (and maybe any others you have lying around handy), and then I'll jump in and do the plumbing?
so i think it's a bit easier than this? all of the trajectories that we currently want (hard sweeps, soft sweeps, these neutral fixations) can be generated by the current C function. currently it is doing neutral fixations by asking about the selection coefficient, so what i did was tack on the python `NeutralFixation` model.

So maybe at the python level we have something like this?

```python
class SweepModel:
    position: float
    start_frequency: float
    end_frequency: float
    dt: float

class HardSweepModel(SweepModel):
    s: float

class SoftSweepModel(SweepModel):
    s: float
    f_0: float
```

etc.?
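The hierarchy sketched above can be written as runnable dataclasses; the class names follow the discussion's hypothetical API, not msprime's actual classes:

```python
import dataclasses

@dataclasses.dataclass
class SweepModel:
    # Common parameters shared by every sweep-like trajectory.
    position: float
    start_frequency: float
    end_frequency: float
    dt: float

@dataclasses.dataclass
class HardSweepModel(SweepModel):
    s: float  # selection coefficient

@dataclasses.dataclass
class SoftSweepModel(SweepModel):
    s: float
    f_0: float  # frequency at which selection begins

# A neutral fixation is then just the base model with no selection
# parameters (equivalently, a hard sweep with s = 0):
neutral = SweepModel(position=0.5, start_frequency=0.1,
                     end_frequency=0.9, dt=0.01)
hard = HardSweepModel(position=0.5, start_frequency=1e-4,
                      end_frequency=1.0, dt=1e-6, s=0.01)
```

One nicety of this layout is that dataclass inheritance puts the shared fields first, so every subclass keeps a consistent constructor signature.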
other use cases that I can think of that we would want to cover would be:
Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulation), then I'd rather specify it this way rather than trying to hide this detail. @molpopgen, any thoughts here?
so the biggest issue i see in separating the trajectory generation from the model entirely is that eventually (a few PRs down the road) we will want to get the trajectory generation connected to changing population size, e.g. Line 7049 in c7de424.

i was thinking we'd do this like
I'm imagining having something like this:

```c
pop_size = get_population_size(&simulator->populations[0], sim_time);
x = trajectory.transition_func(&trajectory, x, pop_size, gsl_rng_uniform(rng));
```

Rather than:

```c
pop_size = get_population_size(&simulator->populations[0], sim_time);
alpha = 2 * pop_size * trajectory.s;
if (alpha > 0) {
    x = 1.0
        - genic_selection_stochastic_forwards(
            trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
} else {
    x = neutral_stochastic_backwards(trajectory.dt, x, gsl_rng_uniform(rng));
}
```

So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory. Would that work?
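A Python analogue of the function-pointer idea, using hypothetical names (a closure stands in for the C struct plus `transition_func` pointer; the selection step is an illustrative Euler scheme, not msprime's exact C code):

```python
import math
import random

def neutral_transition(x, pop_size, dt, u):
    # Neutral drift step; pop_size is unused here but kept so every
    # transition function shares the same signature.
    sign = 1 if u < 0.5 else -1
    x = x - x * dt + sign * math.sqrt(x * (1.0 - x) * dt)
    return min(max(x, 0.0), 1.0)

def make_genic_transition(s):
    # The closure captures s, playing the role of the C trajectory
    # struct; the driver below only ever sees f(x, pop_size, dt, u).
    def transition(x, pop_size, dt, u):
        alpha = 2 * pop_size * s
        sign = 1 if u < 0.5 else -1
        # Illustrative Euler step: deterministic selection term plus
        # drift noise, clamped to the valid frequency range.
        x = x + alpha * x * (1.0 - x) * dt \
            + sign * math.sqrt(x * (1.0 - x) * dt)
        return min(max(x, 0.0), 1.0)
    return transition

def generate_trajectory(transition, x0, dt, pop_size, rng, max_steps=10**6):
    # One driver loop serves every trajectory type; only the transition
    # function differs, mirroring trajectory.transition_func in the C sketch.
    traj = [x0]
    x = x0
    for _ in range(max_steps):
        if x <= 0.0 or x >= 1.0:
            break
        x = transition(x, pop_size, dt, rng.random())
        traj.append(x)
    return traj
```

The driver never inspects model parameters, so adding a new trajectory type means writing one transition function and nothing else, which is the point of the abstraction.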
For the case of a single trajectory, restricted to no migration between demes, this is the way to go.
I don't see any issue with this, but it'd take a good validation test to make sure. EDIT: had forgotten that the "not storing in RAM" was already happening here.
Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you? @molpopgen we are storing these trajectories in RAM currently, and will continue to need to for the rejection sampling that we will eventually add to deal with population size change.
Sure yeah - I just want to make it possible to specify one of these in Python.

So looking ahead, is there a more complex scenario we'd want to model? A trajectory going through multiple demes that would need multiple population sizes? Or would that be better modelled with a specific class?
i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent. the weird bit is specifying the how/when of the beneficial allele reaching a 2nd (or 3rd, etc.) population in the context of migration between demes. we can however do everything else in a multipop setting, e.g., migration of neutral alleles, etc.
Doing this right requires either some diffusion results that I've never seen (fixation pathways conditional on a specific demography) or explicitly getting trajectories via forward sims, rejection sampling until you get what you need. It is tricky.
The thing that slowed this down, I think, is #1780. There's some idea that we need to get the right API sorted out "at once". Since we're now at 1.x, any fiddling we do risks breaking the expectations of semantic versioning. So it behooves us to get this right the first time.
After spending a fair bit of time on this a few months ago trying to see what the "right" API would look like, I've come to the conclusion that this is a hard problem, and realistically it's something that we'd need someone working on full time for a couple of months to really nail down. None of us have this kind of time on our hands, so I think we need to decide whether we want to wait and get it right, or patch things up as best we can for now while keeping in mind that we'll probably deprecate the current API some day, replacing it with something more general (when hopefully we'll have someone to work on it).

I'd vote for patching things up as best we can for now, without adding in too much extra code complexity or duplication. So, I guess the first thing to do is to rebase this branch and bring it up to date, and maybe resolve any issues that have been sorted out? Or perhaps it would be simpler to make a fresh PR so that we can do clean reviews?
this PR is dealing with issues #1762 and #1780. this implements a NeutralFixation model, which deals with a structured coalescent model akin to a selective sweep but for the case where the allele that we condition upon is neutral rather than beneficial.
This will need a rebase, etc., but i wanted to get some feedback on what i've done thus far.