add modular simplifier #2752

molpopgen · 2023-05-08T18:17:58Z

Description

WIP related to #2751. This PR adds the basic functionality for a "modular" simplifier along with tests.

PR Checklist:

Tests that fully cover new/changed functionality.
Documentation including tutorial content if appropriate.
Changelogs, if there are API changes.

codecov · 2023-05-09T20:12:31Z

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (1601a5e) 89.72% compared to head (c559f64) 77.00%.

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2752       +/-   ##
===========================================
- Coverage   89.72%   77.00%   -12.72%     
===========================================
  Files          30       30               
  Lines       30296    30369       +73     
  Branches     5896     5640      -256     
===========================================
- Hits        27183    23387     -3796     
- Misses       1781     5587     +3806     
- Partials     1332     1395       +63

Flag	Coverage Δ
c-tests	`86.14% <84.93%> (+0.01%)`	⬆️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`67.73% <ø> (ø)`
python-tests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
c/tskit/tables.c	`83.23% <84.93%> (+0.05%)`	⬆️

... and 13 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1601a5e...c559f64. Read the comment docs.

jeromekelleher · 2023-05-30T13:19:02Z

I've had a look through @molpopgen, and I think I sort-of get it. I wonder if it would be helpful to write some "real" client code here by making a new example overlapping generations WF simulator? It would help me understand the intended final interface, and might be a less annoying way of chasing down bugs than hand-crafting tests (which is super tedious!)?

jeromekelleher · 2023-05-30T13:36:16Z

Could use either an array of tsk_edge_t or an tsk_edge_table_t instance as the edge buffer for the first pass (i.e., not worrying about the wasted space of storing the parent, which will always be the same)?

molpopgen · 2023-05-31T12:38:08Z

Could use either an array of tsk_edge_t or an tsk_edge_table_t instance as the edge buffer for the first pass (i.e., not worrying about the wasted space of storing the parent, which will always be the same)?

For this to work, one would have to sort the edge table or implement a fully worked-out edge buffer (which is basically an edge table + an adjacency list). I was planning the latter for a separate PR. Here, the goal is to show that, in principle, granular control over the simplification steps is possible.

Using an edge table w/o the adjacency list to index it, one would have to do a bunch of sorting, too.

Such an example would be okay so long as we update in the future to actually be efficient.

jeromekelleher · 2023-05-31T12:53:35Z

Just sorting the list of edges would be fine for a first pass I think? We're not going to get it all done in one go anyway, so may as well get some of the infrastructure in place that'll be needed to make development less tedious. I'd like to help, and having a simulator I can run and experiment with would make it much easier for me to understand what's going on.

I guess we could take a copy of the haploid_wright_fisher example, just adding a survival_probability argument (or something) to allow for overlapping generations? We could all this efficient_forward_simulation or something?

I think all you need to do is add the file, and do a straight copy of whatever we have for the other examples in the meson.build file

molpopgen · 2023-05-31T13:46:49Z

Sounds good -- I'll get to the "soon".

molpopgen · 2023-05-31T17:54:38Z

@jeromekelleher -- the new example is there. I can't break it unless I intentionally break the qsort callback.

jeromekelleher · 2023-06-22T14:21:08Z

I've spent a few hours loading this up into my head again @molpopgen,
and have been trying to figure out an interface that keeps as much
detail internal as we can, and takes the burden of buffering edges
away from the user.

I'm starting to think that the way to do this is to provide a new method

int tsk_table_collection_simplify_with_parent_buffer(
    tsk_table_collection_t *self, tsk_parent_edge_buffer_t *parent_edges,
    ... standard args)
{
}

This function assumes that self is a well formed table collection that
is fully sorted, and its node table has contains all the nodes for edges
stored in the parent_edges object. How we implement the parent_buffer isn't
important initially, but we'll want to do it so that we don't have to sort the
edges in the standard way (as you've shown).

The function would take care of the hairy details about clearing out the parent
buffer for parents that are dead, but hanging on to those for the parents that
are still alive (I think that's what's happening).

So, the only API addition we'd need would be a new object, tsk_parent_edge_buffer_t,
which would need (at a minimum) a function add_edge(left, right, parent, child)
which would take care of the details of how to store the edges efficiently.

Here's a mock up in Python (the simplify_with_buffer function is just here to
show the interface - obviously the implementation is nonsense).

Does this get us moving in the right direction, or have I missed the point?

def simplify_with_buffer(tables, parent_buffer, samples):
    # Pretend this was done efficiently internally without any sorting
    # by creating a simplifier object and adding the ancstry for the
    # new parents appropriately before flushing through the rest of the
    # edges.
    for parent, edges in parent_buffer.items():
        for left, right, child in edges:
            tables.edges.add_row(left, right, parent, child)
    tables.sort()
    tables.simplify(samples)
    # We've exhausted the parent buffer, so clear it out. In reality we'd
    # do this more carefully, like KT does in the post_simplify step.
    parent_buffer.clear()


def wright_fisher_use_parent_buffer(N, death_proba, L, T, *, simplify_interval):
    tables = tskit.TableCollection(L)
    alive = [tables.nodes.add_row(time=T) for _ in range(N)]
    parent_buffer = collections.defaultdict(list)

    t = T
    n = N
    while t > 0:
        t -= 1
        next_alive = list(alive)
        for j in range(N):
            if random.random() < death_proba:
                # alive[j] is dead - replace it.
                u = tables.nodes.add_row(time=t)
                next_alive[j] = u
                a = random.randint(0, N - 1)
                b = random.randint(0, N - 1)
                x = random.uniform(0, L)
                parent_buffer[alive[a]].append((0, x, u))
                parent_buffer[alive[b]].append((x, L, u))
        alive = next_alive
        if t % simplify_interval == 0 or t == 0:
            print(f"Simplifying nodes={len(tables.nodes)} new edges={len(tables.edges)}")
            simplify_with_buffer(tables, parent_buffer, alive)
            print(f"Done        nodes={len(tables.nodes)} edges={len(tables.edges)}")
            alive = list(range(N))
    return tables.tree_sequence()

molpopgen · 2023-06-22T14:51:51Z

@jeromekelleher The issue with that proposal is that it prohibits innovation. By requiring that someone use a provided buffer type, they cannot come up with better/more clever ways to buffer things themselves. For example, in tskit-rust, my intuition is to provide a buffer that only uses safe code (no calls into a C API, for example). (Also, I've already written such a type when I first prototype all of this... :) )

I think that such a function could, and maybe should, be provided. But I don't think that it is the only thing that tskit should support.

molpopgen · 2023-06-22T16:08:24Z

The counter argument to my worry is the possibility of an internal refactor of the simplifier that does not include the steps "collect edges under a parent" and "process all the edges that we've collected for this parent". If that is likely, then we need to hide everything. (Or keep doing what we do and make it undocumented beyond some safety-related comments that don't end up in the manual.)

jeromekelleher · 2023-06-23T15:59:16Z

I just pushed a couple of commits here with some experimental code that uses the Python simplifier, following the basic approach I outlined above. I think the basic approach works, but I'm missing a detail somewhere and need to close up for the day; I forgot how hard this stuff is!

The idea is that we use the BirthBuffer to keep all the edges since the last epoch, as well as all the edges for parents that were alive at the end of the last epoch (== simplify interval). This is slightly different to your approach (I think you buffer all edges after the first alive parent?), but following the same basic idea.

We then manually add the edges from each of these sources into the simplifier, in the correct order.

The only sorting that remains is where we currently sort the parents in the BirthBuffer. However, these should be inserted in sorted order almost always, so I think something simple like doing an insertion sort when a parent is inserted out-of-order would probably work quite well. I expect you've experimented with this already?

Regarding your points above, I think we do want to have this BirthBuffer and simplify_with_buffer interface in the library, because the majority of people will just want something that makes their forward sims go faster and won't be interested in experimenting with different ways of buffering edges. I suspect you are the only one who is going to do any innovating in this space!

Keeping that in mind, I suggest we implement the BirthBuffer and simplify_with_buffer interface as well as we can in C, which hopefully we can make as fast as your Rust versions. If you're still interested in accessing the guts of the simplifier, I'm happy to just put it undocumented in the external header, so you can access. It would be hard work to properly document what's here currently, and I'm really not clear what is required in terms of safety. It seems like a waste of time if only one person is every going to use it.

What do you think?

molpopgen · 2023-06-23T16:03:07Z

I'll take a look. I buffer all the edges so that I don't have to sort anything for the case of overlapping generations. If the buffer is an array-based adjacency list, the way I do things allows you to go backwards (in time, and probably in array index space) through the parents and then process each one (that is not some sentinel value implying "no births for this parent").

molpopgen · 2023-06-23T16:04:47Z

The way that I've been implementing these buffers is very close to what is in this notebook: https://github.com/molpopgen/edge_buffering_explanation.

EDIT: if we do the "trick" of "stealing" all edges that overlap in time w/currently alive nodes, the sort in this notebook is no longer necesary.

… let us use a common runner routine later

molpopgen added 17 commits February 5, 2024 14:00

add prototypes

365de88

clang-format

4ba5513

add implementations

2626292

alloc array for tracking parent contiguity

12a1245

check for contiguous parent processing

0656a78

test skeleton

c3d791b

building up tests

293a5f5

whew

e61c48c

free

03673a9

working on test

5ff0442

that was kinda brutal

b8b7253

more stress test

d37cb5d

internal checks so we don't have to test internal details

4b49a7a

modularize, test that we can catch birth order violations

57c1b01

move comments

bcc9a15

FIXME

7a96dcd

show some of the side effects that one can get

145abce

molpopgen and others added 29 commits February 5, 2024 14:00

force a failing test

0732c81

we don't need a particular order for the child nodes...

c81da32

add test of parent time error

8d6f8b4

reinstate test failure

3b39431

start refactor from temp buffer back into temp edge table. idea is to…

cb0bd03

… let us use a common runner routine later

at least no segfault any more...

8af74c1

some pring

9254283

we are not detecting parent time violations properly

154e5a3

now tests pass

ead9f5c

clean

5114645

delete printf

60ef2e4

start to abstract out the test runner step

c25d6f9

streamline for new test runner

7819804

use new runner for most/all tests

8d25cc6

cleanup

dc51f7d

add new .c

2e1be0f

update the meson stuff

b11525d

add some infrastructure

9f2d496

add some infrastructure

add7347

fmt

4e07d22

skeleton in place. we have a segfault

3dfc9fc

make valgrind happy

24a76bf

cleanup

a5ac4c4

cleanup

cc89815

add death rate, update lots of the new example internals

f095a76

overlapping gens seems to work

0c498c0

add an important note

fcd17f3

Add basic simulation for development

e19a5a6

Add almost-working version with Python simplifier

c559f64

molpopgen force-pushed the add_modular_simplifier branch from a3ed3e5 to c559f64 Compare February 5, 2024 22:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add modular simplifier #2752

add modular simplifier #2752

molpopgen commented May 8, 2023 •

edited

Loading

codecov bot commented May 9, 2023 •

edited

Loading

jeromekelleher commented May 30, 2023

jeromekelleher commented May 30, 2023

molpopgen commented May 31, 2023

jeromekelleher commented May 31, 2023

molpopgen commented May 31, 2023

molpopgen commented May 31, 2023

jeromekelleher commented Jun 22, 2023

molpopgen commented Jun 22, 2023

molpopgen commented Jun 22, 2023

jeromekelleher commented Jun 23, 2023

molpopgen commented Jun 23, 2023

molpopgen commented Jun 23, 2023 •

edited

Loading

add modular simplifier #2752

Are you sure you want to change the base?

add modular simplifier #2752

Conversation

molpopgen commented May 8, 2023 • edited Loading

Description

PR Checklist:

codecov bot commented May 9, 2023 • edited Loading

Codecov Report

jeromekelleher commented May 30, 2023

jeromekelleher commented May 30, 2023

molpopgen commented May 31, 2023

jeromekelleher commented May 31, 2023

molpopgen commented May 31, 2023

molpopgen commented May 31, 2023

jeromekelleher commented Jun 22, 2023

molpopgen commented Jun 22, 2023

molpopgen commented Jun 22, 2023

jeromekelleher commented Jun 23, 2023

molpopgen commented Jun 23, 2023

molpopgen commented Jun 23, 2023 • edited Loading

molpopgen commented May 8, 2023 •

edited

Loading

codecov bot commented May 9, 2023 •

edited

Loading

molpopgen commented Jun 23, 2023 •

edited

Loading