
Drift trajectory implementation #1803

Open · wants to merge 7 commits into main
Conversation

@andrewkern (Member)

this PR deals with issues #1762 and #1780. it implements a NeutralFixation model: a structured coalescent model akin to a selective sweep, but for the case where the allele that we condition upon is neutral rather than beneficial.

This will need a rebase, etc, but wanted to get some feedback on what i've done thus far

@codecov (bot) commented Aug 6, 2021

Codecov Report

Merging #1803 (ea5fc6e) into main (63d1d25) will decrease coverage by 0.01%.
The diff coverage is 91.17%.


@@            Coverage Diff             @@
##             main    #1803      +/-   ##
==========================================
- Coverage   91.05%   91.04%   -0.01%     
==========================================
  Files          20       20              
  Lines       10493    10523      +30     
  Branches     2219     2225       +6     
==========================================
+ Hits         9554     9581      +27     
  Misses        489      489              
- Partials      450      453       +3     
Flag Coverage Δ
C 91.04% <91.17%> (-0.01%) ⬇️
python 97.53% <100.00%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
msprime/__init__.py 100.00% <ø> (ø)
msprime/_msprimemodule.c 89.58% <66.66%> (-0.15%) ⬇️
lib/msprime.c 85.74% <100.00%> (+0.02%) ⬆️
msprime/ancestry.py 98.75% <100.00%> (+0.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@molpopgen (Member)

Will take a look next week--probably Tuesday.

@andrewkern (Member Author)

kevin note this is just the neutral fixation case, an incremental step towards getting the soft sweep model in

@andrewkern
Copy link
Member Author

also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. right now in verification.py i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out), and i'm relying on the rest of the structured coalescent machinery to just work. it should, though.

@molpopgen (Member) commented Aug 7, 2021 via email

on loss (looking backwards in time) following Ewens */
double ux = -1.0 * freq;
int sign = u < 0.5 ? 1 : -1;
return freq + (ux * dt) + sign * sqrt(freq * (1.0 - freq) * dt);
Member:

Can we think of reasonable ways to handle the possibility of numeric errors here? It seems that there are no restrictions placed on the input values, so one can get negative frequencies out:

In [1]: dt = 1e-8

In [2]: freq = 1e-8

In [3]: ux = -freq

In [4]: sign = -1

In [5]: import math

In [7]: freq + (ux * dt) + sign * math.sqrt(freq * (1.0 - freq) * dt)
Out[7]: -4.999999918875795e-17

Perhaps this should return max(value, 0.0) ?
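A minimal Python sketch of the same one-step update with the suggested clamp (function name and the `[0, 1]` clamp are my suggestion, mirroring the C snippet above, not the PR's code):

```python
import math

def neutral_step_backwards(freq, dt, u):
    """One backward-in-time Euler step for a neutral allele headed to loss,
    clamped to [0, 1] to guard against floating-point under/overshoot."""
    ux = -freq                    # deterministic pull toward loss
    sign = 1 if u < 0.5 else -1   # random direction of the diffusion term
    x = freq + ux * dt + sign * math.sqrt(freq * (1.0 - freq) * dt)
    return min(max(x, 0.0), 1.0)
```

With the inputs from the session above (freq = dt = 1e-8, sign = -1), the raw step is about -5e-17 and the clamp returns 0.0 instead.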

@@ -1219,9 +1224,23 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
goto out;
}
}

Member:

I can't comment on the CPython bits -- need @jeromekelleher for that.

The model is one of a structured coalescent where selective backgrounds are
defined as in
`Braverman et al. (1995) <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1206652/>`_
The implementation details here follow closely those in discoal
Member:

Is the scaling identical to Schrider and Kern? The sweep model wasn't, so any differences here also need to be documented.

def test_model_end_broken(self):
# Checking that we're correctly detecting the fact that
# sweeps are non-reentrant.
model = msprime.NeutralFixation(
Member:

Totally opaque to me what this test is doing?

Member:

I think these are copy-pastes of the Sweep selection model tests, and testing a bunch of internal things. This is checking for a known limitation.

Member Author:

yep-- testing all this stuff in the s=0 case. thought these would be good to include, though not necessary.

position=0.5, start_frequency=0.1, end_frequency=0.9, dt=0.01
)
for num_labels in [1, 3, 10]:
# Not the best error, but this shouldn't be exposed to the user anyway.
Member:

This test is also unclear--num_labels is an undocumented part of the public API. What's going on here?

Member Author:

right-- num_labels refers to the number of labels we use internally in the structured coalescent. we generalized the bookkeeping of nodes to include a label a ways back. here we are making sure that the model returns the correct number of labels, in this case 2

)
assert ts.num_trees == 1

def test_neutral_fixation_no_recomb(self):
Member:

is this test helpful? If it failed, that wouldn't imply an error in the neutral fixations, but somewhere else in the library, right?

Member Author:

fair. can probably be removed.

for tree in ts.trees():
assert tree.num_roots == 1

def test_neutral_fixation_recomb(self):
Member:

Tests like this are a bit strange--the number of trees is only > 1 sometimes, depending on rec rate, seed, etc., and the test could fail at a later time if some other part of the library is touched.

for j in range(0, 10):
models.extend(
[
# Start each sweep after 0.01 generations of Hudson
Member:

is this really generations, or "time units"?

Member Author:

should be generations-- internal time in msprime

@@ -1641,6 +1641,60 @@ def test_sojourn_time2(self):
pyplot.savefig(f, dpi=72)
pyplot.close("all")

def test_sojourn_time3(self):
"""
testing against expected sojourn time of a
Member:

Comments need more detail for future readers: what is the expectation, etc.?

reptimes[i] = np.max(tree_times)
i += 1
data["alpha_means"].append(np.mean(reptimes))
data["exp_means"].append(2 * n)
Member:

This is why comments are useful: many will think that the expected time to fixation is 4N, but here it is written as 2N--why?
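For future readers, the standard diffusion result (Kimura and Ohta 1969) behind this expectation could go in the comment. The mean fixation time of a neutral allele conditioned on fixation, starting from frequency p, is

```math
\bar{t}_{\mathrm{fix}}(p) = -\frac{4N_e\,(1-p)\ln(1-p)}{p} \;\longrightarrow\; 4N_e \ \text{generations as } p \to 0
```

for a diploid population; for a haploid population of size n the corresponding limit is 2n. If `n` in this test is a haploid size (or time is scaled accordingly), that would explain the `2 * n` here--worth stating explicitly in the comment.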

@jeromekelleher (Member) left a comment:

I haven't gone through the details @andrewkern, but it mostly seems good. My high-level input would be that there seems to be a lot of copy-paste boilerplate which is basically identical to the SweepGenicSelection model. Perhaps we could think about generalising that, and make the SweepGenicSelection class a subclass of this for compatibility purposes?


@andrewkern (Member Author)

Yeah I agree -- this is too much of the same code repeated. One thing I was having a hard time wrapping my head around is how to generalize this sort of model and deal with the _to_lowlevel() stuff. in particular i'm not sure of the best way to deal with the associated parameters, e.g., this ugly thing that i've done here:
https://github.com/andrewkern/msprime/blob/c7de424a686d4bba11a93e010d345b0c5cd57cfc/msprime/ancestry.py#L1969

any advice would be appreciated here Jerome.

@jeromekelleher (Member) commented Aug 10, 2021

It looks like the only thing that's actually different is the trajectory generator, so what I'd like to do is to decouple the allele frequency trajectory generator from the Sweep model. Would it make sense to have something like

class TrajectoryGenerator:
    start_frequency: float
    end_frequency: float
    dt: float

class SweepModel:
    position: float
    trajectory: TrajectoryGenerator

We could then do stuff like

model = SweepModel(position=0.5, trajectory=GenicSelectionTrajectory(0.1, 0.9, 1e-6))
# or
model = SweepModel(position=0.5, trajectory=NeutralFixationTrajectory(0.1, 0.9, 1e-6))

We could keep support for the existing SweepGenicSelection class easily enough by using these building blocks.

These TrajectoryGenerator classes would front-end a bunch of stuff from the CPython layer, and would eventually end up referring to the corresponding struct + function pointer in C (a bit like ancestry models or demographic events). This would mean that we could also test the trajectory generation code independently of the actual simulations (a big issue at the moment), and also do things like allow people to define trajectory generators in Python, which would be super slow but handy for prototyping (or maybe not - we're just generating numpy arrays, so some models might be possible to implement efficiently using numpy).

Would that work? Are there more types of sweep that would fit into this framework, i.e., where we just define the trajectory generator and don't touch the actual ancestry simulation code at all?

If so, then maybe the best thing to do would be to back out of all the Python level changes and just commit the neutral trajectory in C (and maybe any others you have lying around handy), and then I'll jump in and do the plumbing?

@andrewkern (Member Author)

so i think it's a bit easier than this? all of the trajectories that we currently want (hard sweeps, soft sweeps, these neutral fixations) can be generated by the current C function genic_selection_generate_trajectory() (which itself should be renamed to something more general).

currently it is doing neutral fixations by asking about the selection coefficient, so what i did was tack on the python NeutralFixation model to basically do everything that the current SweepGenicSelection model was doing, but fix the selection coefficient at s=0. if we added one more parameter, call it f_0, which told the C trajectory function to switch between neutrality and selection, then we could add a SoftSweep model.

So maybe at the python level we have something like this?

class SweepModel:
    position: float
    start_frequency: float
    end_frequency: float
    dt: float

class HardSweepModel(SweepModel):
    s: float

class SoftSweepModel(SweepModel):
    s: float
    f_0: float
etc.?

@andrewkern (Member Author) commented Aug 10, 2021

other use cases that I can think of that we would want to cover would be:

  1. hard partial sweeps (hard sweep that ends at freq < 1; already can do this)
  2. soft partial sweeps
  3. neutral fixations (already implemented)
  4. long term balanced polymorphism (no trajectory there, just constant deme sizes among backgrounds)

@jeromekelleher (Member)

Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulation), then I'd rather specify it this way rather than trying to hide this detail.

@molpopgen, any thoughts here?

@andrewkern (Member Author)

so the biggest issue i see in separating the trajectory generation from the model entirely is that eventually (a few PRs down the road) we will want to get the trajectory generation connected to changing population size e.g.

/* TODO Wrap this in a rejection sample loop and get the population size

i was thinking we'd do this like pop_size = get_population_size(sim->populations[0], time); so we would need the simulation object during trajectory generation.

@jeromekelleher (Member)

I'm imagining having something like this:

        pop_size = get_population_size(&simulator->populations[0], sim_time);
        x = trajectory.transition_func(&trajectory, x, pop_size,  gsl_rng_uniform(rng));

Rather than:

        pop_size = get_population_size(&simulator->populations[0], sim_time);
        alpha = 2 * pop_size * trajectory.s;
        if (alpha > 0) {
            x = 1.0
                - genic_selection_stochastic_forwards(
                      trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
        } else {
            x = neutral_stochastic_backwards(trajectory.dt, x, gsl_rng_uniform(rng));
        }

So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.

Would that work?

@molpopgen (Member)

> Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulation), then I'd rather specify it this way rather than trying to hide this detail.
>
> @molpopgen, any thoughts here?

For the case of a single trajectory, restricted to no migration between demes, this is the way to go.

@molpopgen (Member) commented Aug 11, 2021

> Would that work?

I don't see any issue with this, but it'd take a good validation test to make sure.

EDIT: had forgotten that storing in RAM was already happening here.

@andrewkern (Member Author)

> So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.
>
> Would that work?

Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?

@molpopgen we are storing these trajectories in RAM currently, and will continue to need to for the rejection sampling that we will eventually add to deal with population size change

@jeromekelleher (Member)

> Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?

Sure yeah - I just want to make it possible to specify one of these in Python.

> For the case of a single trajectory, restricted to no migration between demes, this is the way to go.

So looking ahead, is there a more complex scenario we'd want to model? A trajectory going through multiple demes that would need multiple population sizes? Or would that be better modelled with a specific class?

@andrewkern (Member Author)

i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent. the weird bit is specifying the how / when of the beneficial allele reaching a 2nd (or 3rd etc) population in the context of migration between demes.

we can however do everything else in a multipop setting e.g., migration of neutral alleles, etc

@molpopgen (Member)

> i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent.

Doing this right requires either some diffusion results that I've never seen (fixation pathways conditional on a specific demography) or to explicitly get trajectories via forward sims, rejection sampling until you get what you need. It is tricky.
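The "trajectories via forward sims" route mentioned above can be sketched in plain Python (a hypothetical helper using a simple Wright-Fisher model, not msprime code): simulate forward from one copy and reject every trajectory that is lost.

```python
import random

def fixation_trajectory(n, p0=None, rng=None, max_tries=10_000):
    """Sample a neutral Wright-Fisher frequency trajectory conditioned on
    fixation by rejection: simulate forward from p0 and retry on loss."""
    rng = rng or random.Random()
    p0 = p0 if p0 is not None else 1.0 / (2 * n)  # a single new mutant copy
    for _ in range(max_tries):
        p, traj = p0, [p0]
        while 0.0 < p < 1.0:
            # binomial resampling of 2n gene copies each generation
            p = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
            traj.append(p)
        if p == 1.0:
            return traj  # accepted: the allele fixed
    raise RuntimeError("no fixation within max_tries")
```

Conditioning on a specific demography would mean running the forward model under that demography and rejecting until the desired fixation pattern appears, which is exactly where this gets expensive.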

@andrewkern (Member Author)

resurrected this because of recent attention on slack, #1955, #1962

should we try to pull this functionality in finally?

@molpopgen (Member)

The thing that slowed this down, I think, is #1780. There's some idea that we need to get the right API sorted out "all at once". Since we're now at 1.x, any fiddling we do risks breaking the expectations of semantic versioning. So it behooves us to get this right the first time.

@jeromekelleher (Member)

After spending a fair bit of time on this a few months ago trying to see what the "right" API would look like, I've come to the conclusion that this is a hard problem and realistically it's something that we'd need someone working on full time for a couple of months to really nail down. None of us have this kind of time on our hands, so I think we need to decide whether we want to wait and get it right, or patch things up as best we can for now while keeping in mind that we'll probably deprecate the current API some day, replacing with something more general (when hopefully we'll have someone to work on it).

I'd vote for patching things up as best we can for now, without adding in too much extra code complexity or duplication.

So, I guess the first thing to do is to rebase this branch and bring it up to date, and maybe resolve any issues that have been sorted out? Or perhaps it would be simpler to make a fresh PR so that we can do clean reviews?
