
Drift trajectory implementation #1803

Open · wants to merge 7 commits into main
Conversation

@andrewkern (Member)

this PR deals with issues #1762 and #1780. it implements a NeutralFixation model: a structured coalescent model akin to a selective sweep, but for the case where the allele that we condition upon is neutral rather than beneficial.

This will need a rebase, etc, but wanted to get some feedback on what i've done thus far

@codecov (bot) commented Aug 6, 2021

Codecov Report

Merging #1803 (ea5fc6e) into main (63d1d25) will decrease coverage by 0.01%.
The diff coverage is 91.17%.


@@            Coverage Diff             @@
##             main    #1803      +/-   ##
==========================================
- Coverage   91.05%   91.04%   -0.01%     
==========================================
  Files          20       20              
  Lines       10493    10523      +30     
  Branches     2219     2225       +6     
==========================================
+ Hits         9554     9581      +27     
  Misses        489      489              
- Partials      450      453       +3     
Flag Coverage Δ
C 91.04% <91.17%> (-0.01%) ⬇️
python 97.53% <100.00%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
msprime/__init__.py 100.00% <ø> (ø)
msprime/_msprimemodule.c 89.58% <66.66%> (-0.15%) ⬇️
lib/msprime.c 85.74% <100.00%> (+0.02%) ⬆️
msprime/ancestry.py 98.75% <100.00%> (+0.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@molpopgen (Member)

Will take a look next week--probably Tuesday.

@andrewkern (Member Author)

kevin note this is just the neutral fixation case, an incremental step towards getting the soft sweep model in

@andrewkern
Copy link
Member Author

also i should probably add verification against SLiM for patterns of linked polymorphism to this PR. right now in verification.py i've only got tests of the sojourn time of the neutral variants against the analytic expectation (which checks out), and i'm relying on the rest of the structured coalescent machinery to just work. it should, though.

@molpopgen (Member) commented Aug 7, 2021 via email

on loss (looking backwards in time) following Ewens */
double ux = -1.0 * freq;
int sign = u < 0.5 ? 1 : -1;
return freq + (ux * dt) + sign * sqrt(freq * (1.0 - freq) * dt);
Member:

Can we think of reasonable ways to handle the possibility of numeric errors here? It seems that there are no restrictions placed on the input values, so one can get negative frequencies out:

In [1]: dt = 1e-8

In [2]: freq = 1e-8

In [3]: ux = -freq

In [4]: sign = -1

In [5]: import math

In [7]: freq + (ux * dt) + sign * math.sqrt(freq * (1.0 - freq) * dt)
Out[7]: -4.999999918875795e-17

Perhaps this should return max(value, 0.0) ?
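A minimal Python sketch of the same one-step update with the suggested clamp (function name and the `[0, 1]` clamp are my suggestion, mirroring the C snippet above, not the PR's code):

```python
import math

def neutral_step_backwards(freq, dt, u):
    """One backward-in-time Euler step for a neutral allele headed to loss,
    clamped to [0, 1] to guard against floating-point under/overshoot."""
    ux = -freq                    # deterministic pull toward loss
    sign = 1 if u < 0.5 else -1   # random direction of the diffusion term
    x = freq + ux * dt + sign * math.sqrt(freq * (1.0 - freq) * dt)
    return min(max(x, 0.0), 1.0)
```

With the inputs from the session above (freq = dt = 1e-8, sign = -1), the raw step is about -5e-17 and the clamp returns 0.0 instead.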

@@ -1219,9 +1224,23 @@ Simulator_parse_simulation_model(Simulator *self, PyObject *py_model)
goto out;
}
}

Member:

I can't comment on the CPython bits -- need @jeromekelleher for that.

The model is one of a structured coalescent where selective backgrounds are
defined as in
`Braverman et al. (1995) <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1206652/>`_
The implementation details here follow closely those in discoal
Member:

Is the scaling identical to Schrider and Kern? The sweep model wasn't, so any differences here also need to be documented.

def test_model_end_broken(self):
# Checking that we're correctly detecting the fact that
# sweeps are non-reentrant.
model = msprime.NeutralFixation(
Member:

Totally opaque to me what this test is doing?

Member:

I think these are copy-pastes of the Sweep selection model tests, and testing a bunch of internal things. This is checking for a known limitation.

Member Author:

yep-- testing all this stuff in the s=0 case. thought these would be good to include, though not necessary.

position=0.5, start_frequency=0.1, end_frequency=0.9, dt=0.01
)
for num_labels in [1, 3, 10]:
# Not the best error, but this shouldn't be exposed to the user anyway.
Member:

This test is also unclear--num_labels is an undocumented part of the public API. What's going on here?

Member Author:

right-- num_labels refers to the number of labels we use internally in the structured coalescent. we generalized the bookkeeping of nodes to include a label a ways back. here we are making sure that the model returns the correct number of labels, in this case 2

)
assert ts.num_trees == 1

def test_neutral_fixation_no_recomb(self):
Member:

is this test helpful? If it failed, that wouldn't imply an error in the neutral fixations, but somewhere else in the library, right?

Member Author:

fair. can probably be removed.

for tree in ts.trees():
assert tree.num_roots == 1

def test_neutral_fixation_recomb(self):
Member:

Tests like this are a bit strange--the number of trees is only > 1 sometimes, depending on rec rate, seed, etc., and the test could fail at a later time if some other part of the library is touched.

for j in range(0, 10):
models.extend(
[
# Start each sweep after 0.01 generations of Hudson
Member:

is this really generations, or "time units"?

Member Author:

should be generations-- internal time in msprime

@@ -1641,6 +1641,60 @@ def test_sojourn_time2(self):
pyplot.savefig(f, dpi=72)
pyplot.close("all")

def test_sojourn_time3(self):
"""
testing against expected sojourn time of a
Member:

Comments need more detail for future readers: what is the expectation, etc.?

reptimes[i] = np.max(tree_times)
i += 1
data["alpha_means"].append(np.mean(reptimes))
data["exp_means"].append(2 * n)
Member:

This is why comments are useful: many will think that the expected time to fixation is 4N, but here it is written as 2N--why?
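For future readers, the standard diffusion result (Kimura and Ohta 1969) behind this expectation could go in the comment. The mean fixation time of a neutral allele conditioned on fixation, starting from frequency p, is

```math
\bar{t}_{\mathrm{fix}}(p) = -\frac{4N_e\,(1-p)\ln(1-p)}{p} \;\longrightarrow\; 4N_e \ \text{generations as } p \to 0
```

for a diploid population; for a haploid population of size n the corresponding limit is 2n. If `n` in this test is a haploid size (or time is scaled accordingly), that would explain the `2 * n` here--worth stating explicitly in the comment.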

@jeromekelleher (Member) left a comment:

I haven't gone through the details @andrewkern, but it mostly seems good. My high-level input would be that there seems to be a lot of copy-paste boilerplate which is basically identical to the SweepGenicSelection model. Perhaps we could think about generalising that, and make the SweepGenicSelection class a subclass of this for compatibility purposes?


@andrewkern (Member Author)

Yeah I agree -- this is too much of the same code repeated. One thing I was having a hard time wrapping my head around is how to generalize this sort of model and deal with the _to_lowlevel() stuff. in particular i'm not sure of the best way to deal with the associated parameters, e.g., this ugly thing that i've done here:
https://github.com/andrewkern/msprime/blob/c7de424a686d4bba11a93e010d345b0c5cd57cfc/msprime/ancestry.py#L1969

any advice would be appreciated here Jerome.

@jeromekelleher (Member) commented Aug 10, 2021

It looks like the only thing that's actually different is the trajectory generator, so what I'd like to do is to decouple the allele frequency trajectory generator from the Sweep model. Would it make sense to have something like

class TrajectoryGenerator:
    start_frequency: float
    end_frequency: float
    dt: float

class SweepModel:
    position: float
    trajectory: TrajectoryGenerator

We could then do stuff like

model = SweepModel(position=0.5, trajectory=GenicSelectionTrajectory(0.1, 0.9, 1e-6))
# or
model = SweepModel(position=0.5, trajectory=NeutralFixationTrajectory(0.1, 0.9, 1e-6))

We could keep support for the existing SweepGenicSelection class easily enough by using these building blocks.

These TrajectoryGenerator classes would front-end a bunch of stuff from the CPython layer, and would eventually end up referring to the corresponding struct + function pointer in C (a bit like ancestry models or demographic events). This would mean that we could also test the trajectory generation code independently of the actual simulations (a big issue at the moment), and also do things like allow people to define trajectory generators in Python, which would be super slow but handy for prototyping (or maybe not - we're just generating numpy arrays, so some models might be possible to implement efficiently using numpy).

Would that work? Are there more types of sweep that would fit into this framework, i.e., where we just define the trajectory generator and don't touch the actual ancestry simulation code at all?

If so, then maybe the best thing to do would be to back out of all the Python level changes and just commit the neutral trajectory in C (and maybe any others you have lying around handy), and then I'll jump in and do the plumbing?

@andrewkern (Member Author)

so i think it's a bit easier than this? all of the trajectories that we currently want (hard sweeps, soft sweeps, these neutral fixations) can be generated by the current C function genic_selection_generate_trajectory() (which itself should be renamed to something more general).

currently it is doing neutral fixations by asking about the selection coefficient, so what i did was tack on the python NeutralFixation model to basically do everything that the current SweepGenicSelection model was doing, but fix the selection coefficient at s=0. if we added one more parameter, call it f_0, which told the C trajectory function to switch between neutrality and selection, then we could add a SoftSweep model.

So maybe at the python level we have something like this?

class SweepModel:
    position: float
    start_frequency: float
    end_frequency: float
    dt: float

class HardSweepModel(SweepModel):
    s: float

class SoftSweepModel(SweepModel):
    s: float
    f_0: float
etc.?

@andrewkern (Member Author) commented Aug 10, 2021

other use cases that I can think of that we would want to cover would be:

  1. hard partial sweeps (hard sweep that ends at freq < 1; already can do this)
  2. soft partial sweeps
  3. neutral fixations (already implemented)
  4. long term balanced polymorphism (no trajectory there, just constant deme sizes among backgrounds)

@jeromekelleher (Member)

Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulation), then I'd rather specify it this way rather than trying to hide this detail.

@molpopgen, any thoughts here?

@andrewkern (Member Author)

so the biggest issue i see in separating the trajectory generation from the model entirely is that eventually (a few PRs down the road) we will want to get the trajectory generation connected to changing population size e.g.

/* TODO Wrap this in a rejection sample loop and get the population size

i was thinking we'd do this like pop_size = get_population_size(sim->populations[0], time); so we would need the simulation object during trajectory generation.

@jeromekelleher (Member)

I'm imagining having something like this:

        pop_size = get_population_size(&simulator->populations[0], sim_time);
        x = trajectory.transition_func(&trajectory, x, pop_size,  gsl_rng_uniform(rng));

Rather than:

        pop_size = get_population_size(&simulator->populations[0], sim_time);
        alpha = 2 * pop_size * trajectory.s;
        if (alpha > 0) {
            x = 1.0
                - genic_selection_stochastic_forwards(
                      trajectory.dt, 1.0 - x, alpha, gsl_rng_uniform(rng));
        } else {
            x = neutral_stochastic_backwards(trajectory.dt, x, gsl_rng_uniform(rng));
        }

So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.

Would that work?

@molpopgen (Member)

> Thinking ahead though @andrewkern, can you see reasons why we would want to specify different trajectories? I would rather put a bit of time in now and get the framework right. If this is what the sweep model really is (a trajectory supplied to the existing sweeps simulation), then I'd rather specify it this way rather than trying to hide this detail.
>
> @molpopgen, any thoughts here?

For the case of a single trajectory, restricted to no migration between demes, this is the way to go.

@molpopgen (Member) commented Aug 11, 2021

> Would that work?

I don't see any issue with this, but it'd take a good validation test to make sure.

EDIT: had forgotten that storing in RAM was already happening here.

@andrewkern (Member Author)

> So, we try to abstract out the details of everything except the core function for transitioning an allele frequency randomly. All the parameters that we need are encapsulated in the particular type of trajectory.
>
> Would that work?

Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?

@molpopgen we are storing these trajectories in RAM currently, and will continue to need to for the rejection sampling that we will eventually add to deal with population size change

@jeromekelleher (Member)

> Yeah I think that can work. I'd like to keep these allele freq transition functions on the C side if that's okay by you?

Sure yeah - I just want to make it possible to specify one of these in Python.

> For the case of a single trajectory, restricted to no migration between demes, this is the way to go.

So looking ahead, is there a more complex scenario we'd want to model? A trajectory going through multiple demes that would need multiple population sizes? Or would that be better modelled with a specific class?

@andrewkern (Member Author)

i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent. the weird bit is specifying the how / when of the beneficial allele reaching a 2nd (or 3rd etc) population in the context of migration between demes.

we can however do everything else in a multipop setting e.g., migration of neutral alleles, etc

@molpopgen (Member)

> i've never figured out the details of how to simulate alleles sweeping through multiple populations simultaneously in the coalescent.

Doing this right requires either some diffusion results that I've never seen (fixation pathways conditional on a specific demography) or to explicitly get trajectories via forward sims, rejection sampling until you get what you need. It is tricky.
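The "trajectories via forward sims" route mentioned above can be sketched in plain Python (a hypothetical helper using a simple Wright-Fisher model, not msprime code): simulate forward from one copy and reject every trajectory that is lost.

```python
import random

def fixation_trajectory(n, p0=None, rng=None, max_tries=10_000):
    """Sample a neutral Wright-Fisher frequency trajectory conditioned on
    fixation by rejection: simulate forward from p0 and retry on loss."""
    rng = rng or random.Random()
    p0 = p0 if p0 is not None else 1.0 / (2 * n)  # a single new mutant copy
    for _ in range(max_tries):
        p, traj = p0, [p0]
        while 0.0 < p < 1.0:
            # binomial resampling of 2n gene copies each generation
            p = sum(rng.random() < p for _ in range(2 * n)) / (2 * n)
            traj.append(p)
        if p == 1.0:
            return traj  # accepted: the allele fixed
    raise RuntimeError("no fixation within max_tries")
```

Conditioning on a specific demography would mean running the forward model under that demography and rejecting until the desired fixation pattern appears, which is exactly where this gets expensive.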

@andrewkern (Member Author)

resurrected this because of recent attention on slack, #1955, #1962

should we try to pull this functionality in finally?

@molpopgen (Member)

The thing that slowed this down, I think, is #1780. There's some idea that we need to get the right API sorted out "all at once". Since we're now at 1.x, any fiddling we do risks breaking the expectations of semantic versioning. So it behooves us to get this right the first time.

@jeromekelleher (Member)

After spending a fair bit of time on this a few months ago trying to see what the "right" API would look like, I've come to the conclusion that this is a hard problem and realistically it's something that we'd need someone working on full time for a couple of months to really nail down. None of us have this kind of time on our hands, so I think we need to decide whether we want to wait and get it right, or patch things up as best we can for now while keeping in mind that we'll probably deprecate the current API some day, replacing with something more general (when hopefully we'll have someone to work on it).

I'd vote for patching things up as best we can for now, without adding in too much extra code complexity or duplication.

So, I guess the first thing to do is to rebase this branch and bring it up to date, and maybe resolve any issues that have been sorted out? Or perhaps it would be simpler to make a fresh PR so that we can do clean reviews?
