Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add modular simplifier #2752

Draft
wants to merge 88 commits into
base: main
Choose a base branch
from

Conversation

molpopgen
Copy link
Member

@molpopgen molpopgen commented May 8, 2023

Description

WIP related to #2751. This PR adds the basic functionality for a "modular" simplifier along with tests.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

@codecov
Copy link

codecov bot commented May 9, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (1601a5e) 89.72% compared to head (c559f64) 77.00%.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #2752       +/-   ##
===========================================
- Coverage   89.72%   77.00%   -12.72%     
===========================================
  Files          30       30               
  Lines       30296    30369       +73     
  Branches     5896     5640      -256     
===========================================
- Hits        27183    23387     -3796     
- Misses       1781     5587     +3806     
- Partials     1332     1395       +63     
Flag Coverage Δ
c-tests 86.14% <84.93%> (+0.01%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.73% <ø> (ø)
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
c/tskit/tables.c 83.23% <84.93%> (+0.05%) ⬆️

... and 13 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1601a5e...c559f64. Read the comment docs.

@jeromekelleher
Copy link
Member

I've had a look through @molpopgen, and I think I sort-of get it. I wonder if it would be helpful to write some "real" client code here by making a new example overlapping generations WF simulator? It would help me understand the intended final interface, and might be a less annoying way of chasing down bugs than hand-crafting tests (which is super tedious!)?

@jeromekelleher
Copy link
Member

Could use either an array of tsk_edge_t or an tsk_edge_table_t instance as the edge buffer for the first pass (i.e., not worrying about the wasted space of storing the parent, which will always be the same)?

@molpopgen
Copy link
Member Author

Could use either an array of tsk_edge_t or an tsk_edge_table_t instance as the edge buffer for the first pass (i.e., not worrying about the wasted space of storing the parent, which will always be the same)?

For this to work, one would have to sort the edge table or implement a fully worked-out edge buffer (which is basically an edge table + an adjacency list). I was planning the latter for a separate PR. Here, the goal is to show that, in principle, granular control over the simplification steps is possible.

Using an edge table w/o the adjacency list to index it, one would have to do a bunch of sorting, too.

Such an example would be okay so long as we update in the future to actually be efficient.

@jeromekelleher
Copy link
Member

Just sorting the list of edges would be fine for a first pass I think? We're not going to get it all done in one go anyway, so may as well get some of the infrastructure in place that'll be needed to make development less tedious. I'd like to help, and having a simulator I can run and experiment with would make it much easier for me to understand what's going on.

I guess we could take a copy of the haploid_wright_fisher example, just adding a survival_probability argument (or something) to allow for overlapping generations? We could all this efficient_forward_simulation or something?

I think all you need to do is add the file, and do a straight copy of whatever we have for the other examples in the meson.build file

@molpopgen
Copy link
Member Author

Sounds good -- I'll get to the "soon".

@molpopgen
Copy link
Member Author

@jeromekelleher -- the new example is there. I can't break it unless I intentionally break the qsort callback.

@jeromekelleher
Copy link
Member

I've spent a few hours loading this up into my head again @molpopgen,
and have been trying to figure out an interface that keeps as much
detail internal as we can, and takes the burden of buffering edges
away from the user.

I'm starting to think that the way to do this is to provide a new method

int tsk_table_collection_simplify_with_parent_buffer(
    tsk_table_collection_t *self, tsk_parent_edge_buffer_t *parent_edges,
    ... standard args)
{
}

This function assumes that self is a well formed table collection that
is fully sorted, and its node table has contains all the nodes for edges
stored in the parent_edges object. How we implement the parent_buffer isn't
important initially, but we'll want to do it so that we don't have to sort the
edges in the standard way (as you've shown).

The function would take care of the hairy details about clearing out the parent
buffer for parents that are dead, but hanging on to those for the parents that
are still alive (I think that's what's happening).

So, the only API addition we'd need would be a new object, tsk_parent_edge_buffer_t,
which would need (at a minimum) a function add_edge(left, right, parent, child)
which would take care of the details of how to store the edges efficiently.

Here's a mock up in Python (the simplify_with_buffer function is just here to
show the interface - obviously the implementation is nonsense).

Does this get us moving in the right direction, or have I missed the point?

def simplify_with_buffer(tables, parent_buffer, samples):
    # Pretend this was done efficiently internally without any sorting
    # by creating a simplifier object and adding the ancstry for the
    # new parents appropriately before flushing through the rest of the
    # edges.
    for parent, edges in parent_buffer.items():
        for left, right, child in edges:
            tables.edges.add_row(left, right, parent, child)
    tables.sort()
    tables.simplify(samples)
    # We've exhausted the parent buffer, so clear it out. In reality we'd
    # do this more carefully, like KT does in the post_simplify step.
    parent_buffer.clear()


def wright_fisher_use_parent_buffer(N, death_proba, L, T, *, simplify_interval):
    tables = tskit.TableCollection(L)
    alive = [tables.nodes.add_row(time=T) for _ in range(N)]
    parent_buffer = collections.defaultdict(list)

    t = T
    n = N
    while t > 0:
        t -= 1
        next_alive = list(alive)
        for j in range(N):
            if random.random() < death_proba:
                # alive[j] is dead - replace it.
                u = tables.nodes.add_row(time=t)
                next_alive[j] = u
                a = random.randint(0, N - 1)
                b = random.randint(0, N - 1)
                x = random.uniform(0, L)
                parent_buffer[alive[a]].append((0, x, u))
                parent_buffer[alive[b]].append((x, L, u))
        alive = next_alive
        if t % simplify_interval == 0 or t == 0:
            print(f"Simplifying nodes={len(tables.nodes)} new edges={len(tables.edges)}")
            simplify_with_buffer(tables, parent_buffer, alive)
            print(f"Done        nodes={len(tables.nodes)} edges={len(tables.edges)}")
            alive = list(range(N))
    return tables.tree_sequence()

@molpopgen
Copy link
Member Author

@jeromekelleher The issue with that proposal is that it prohibits innovation. By requiring that someone use a provided buffer type, they cannot come up with better/more clever ways to buffer things themselves. For example, in tskit-rust, my intuition is to provide a buffer that only uses safe code (no calls into a C API, for example). (Also, I've already written such a type when I first prototype all of this... :) )

I think that such a function could, and maybe should, be provided. But I don't think that it is the only thing that tskit should support.

@molpopgen
Copy link
Member Author

The counter argument to my worry is the possibility of an internal refactor of the simplifier that does not include the steps "collect edges under a parent" and "process all the edges that we've collected for this parent". If that is likely, then we need to hide everything. (Or keep doing what we do and make it undocumented beyond some safety-related comments that don't end up in the manual.)

@jeromekelleher
Copy link
Member

I just pushed a couple of commits here with some experimental code that uses the Python simplifier, following the basic approach I outlined above. I think the basic approach works, but I'm missing a detail somewhere and need to close up for the day; I forgot how hard this stuff is!

The idea is that we use the BirthBuffer to keep all the edges since the last epoch, as well as all the edges for parents that were alive at the end of the last epoch (== simplify interval). This is slightly different to your approach (I think you buffer all edges after the first alive parent?), but following the same basic idea.

We then manually add the edges from each of these sources into the simplifier, in the correct order.

The only sorting that remains is where we currently sort the parents in the BirthBuffer. However, these should be inserted in sorted order almost always, so I think something simple like doing an insertion sort when a parent is inserted out-of-order would probably work quite well. I expect you've experimented with this already?

Regarding your points above, I think we do want to have this BirthBuffer and simplify_with_buffer interface in the library, because the majority of people will just want something that makes their forward sims go faster and won't be interested in experimenting with different ways of buffering edges. I suspect you are the only one who is going to do any innovating in this space!

Keeping that in mind, I suggest we implement the BirthBuffer and simplify_with_buffer interface as well as we can in C, which hopefully we can make as fast as your Rust versions. If you're still interested in accessing the guts of the simplifier, I'm happy to just put it undocumented in the external header, so you can access. It would be hard work to properly document what's here currently, and I'm really not clear what is required in terms of safety. It seems like a waste of time if only one person is every going to use it.

What do you think?

@molpopgen
Copy link
Member Author

I'll take a look. I buffer all the edges so that I don't have to sort anything for the case of overlapping generations. If the buffer is an array-based adjacency list, the way I do things allows you to go backwards (in time, and probably in array index space) through the parents and then process each one (that is not some sentinel value implying "no births for this parent").

@molpopgen
Copy link
Member Author

molpopgen commented Jun 23, 2023

The way that I've been implementing these buffers is very close to what is in this notebook: https://github.com/molpopgen/edge_buffering_explanation.

EDIT: if we do the "trick" of "stealing" all edges that overlap in time w/currently alive nodes, the sort in this notebook is no longer necesary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants