-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add modular simplifier #2752
base: main
Are you sure you want to change the base?
add modular simplifier #2752
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #2752 +/- ##
===========================================
- Coverage 89.72% 77.00% -12.72%
===========================================
Files 30 30
Lines 30296 30369 +73
Branches 5896 5640 -256
===========================================
- Hits 27183 23387 -3796
- Misses 1781 5587 +3806
- Partials 1332 1395 +63
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 13 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
I've had a look through @molpopgen, and I think I sort-of get it. I wonder if it would be helpful to write some "real" client code here by making a new example overlapping generations WF simulator? It would help me understand the intended final interface, and might be a less annoying way of chasing down bugs than hand-crafting tests (which is super tedious!)? |
Could use either an array of |
For this to work, one would have to sort the edge table or implement a fully worked-out edge buffer (which is basically an edge table + an adjacency list). I was planning the latter for a separate PR. Here, the goal is to show that, in principle, granular control over the simplification steps is possible. Using an edge table w/o the adjacency list to index it, one would have to do a bunch of sorting, too. Such an example would be okay so long as we update in the future to actually be efficient. |
Just sorting the list of edges would be fine for a first pass I think? We're not going to get it all done in one go anyway, so may as well get some of the infrastructure in place that'll be needed to make development less tedious. I'd like to help, and having a simulator I can run and experiment with would make it much easier for me to understand what's going on. I guess we could take a copy of the I think all you need to do is add the file, and do a straight copy of whatever we have for the other examples in the meson.build file |
Sounds good -- I'll get to the "soon". |
@jeromekelleher -- the new example is there. I can't break it unless I intentionally break the qsort callback. |
I've spent a few hours loading this up into my head again @molpopgen, I'm starting to think that the way to do this is to provide a new method int tsk_table_collection_simplify_with_parent_buffer(
tsk_table_collection_t *self, tsk_parent_edge_buffer_t *parent_edges,
... standard args)
{
} This function assumes that The function would take care of the hairy details about clearing out the parent So, the only API addition we'd need would be a new object, tsk_parent_edge_buffer_t, Here's a mock up in Python (the simplify_with_buffer function is just here to Does this get us moving in the right direction, or have I missed the point? def simplify_with_buffer(tables, parent_buffer, samples):
# Pretend this was done efficiently internally without any sorting
# by creating a simplifier object and adding the ancstry for the
# new parents appropriately before flushing through the rest of the
# edges.
for parent, edges in parent_buffer.items():
for left, right, child in edges:
tables.edges.add_row(left, right, parent, child)
tables.sort()
tables.simplify(samples)
# We've exhausted the parent buffer, so clear it out. In reality we'd
# do this more carefully, like KT does in the post_simplify step.
parent_buffer.clear()
def wright_fisher_use_parent_buffer(N, death_proba, L, T, *, simplify_interval):
tables = tskit.TableCollection(L)
alive = [tables.nodes.add_row(time=T) for _ in range(N)]
parent_buffer = collections.defaultdict(list)
t = T
n = N
while t > 0:
t -= 1
next_alive = list(alive)
for j in range(N):
if random.random() < death_proba:
# alive[j] is dead - replace it.
u = tables.nodes.add_row(time=t)
next_alive[j] = u
a = random.randint(0, N - 1)
b = random.randint(0, N - 1)
x = random.uniform(0, L)
parent_buffer[alive[a]].append((0, x, u))
parent_buffer[alive[b]].append((x, L, u))
alive = next_alive
if t % simplify_interval == 0 or t == 0:
print(f"Simplifying nodes={len(tables.nodes)} new edges={len(tables.edges)}")
simplify_with_buffer(tables, parent_buffer, alive)
print(f"Done nodes={len(tables.nodes)} edges={len(tables.edges)}")
alive = list(range(N))
return tables.tree_sequence() |
@jeromekelleher The issue with that proposal is that it prohibits innovation. By requiring that someone use a provided buffer type, they cannot come up with better/more clever ways to buffer things themselves. For example, in tskit-rust, my intuition is to provide a buffer that only uses safe code (no calls into a C API, for example). (Also, I've already written such a type when I first prototype all of this... :) ) I think that such a function could, and maybe should, be provided. But I don't think that it is the only thing that tskit should support. |
The counter argument to my worry is the possibility of an internal refactor of the simplifier that does not include the steps "collect edges under a parent" and "process all the edges that we've collected for this parent". If that is likely, then we need to hide everything. (Or keep doing what we do and make it undocumented beyond some safety-related comments that don't end up in the manual.) |
I just pushed a couple of commits here with some experimental code that uses the Python simplifier, following the basic approach I outlined above. I think the basic approach works, but I'm missing a detail somewhere and need to close up for the day; I forgot how hard this stuff is! The idea is that we use the BirthBuffer to keep all the edges since the last epoch, as well as all the edges for parents that were alive at the end of the last epoch (== simplify interval). This is slightly different to your approach (I think you buffer all edges after the first alive parent?), but following the same basic idea. We then manually add the edges from each of these sources into the simplifier, in the correct order. The only sorting that remains is where we currently sort the parents in the BirthBuffer. However, these should be inserted in sorted order almost always, so I think something simple like doing an insertion sort when a parent is inserted out-of-order would probably work quite well. I expect you've experimented with this already? Regarding your points above, I think we do want to have this BirthBuffer and simplify_with_buffer interface in the library, because the majority of people will just want something that makes their forward sims go faster and won't be interested in experimenting with different ways of buffering edges. I suspect you are the only one who is going to do any innovating in this space! Keeping that in mind, I suggest we implement the BirthBuffer and simplify_with_buffer interface as well as we can in C, which hopefully we can make as fast as your Rust versions. If you're still interested in accessing the guts of the simplifier, I'm happy to just put it undocumented in the external header, so you can access. It would be hard work to properly document what's here currently, and I'm really not clear what is required in terms of safety. It seems like a waste of time if only one person is every going to use it. What do you think? |
I'll take a look. I buffer all the edges so that I don't have to sort anything for the case of overlapping generations. If the buffer is an array-based adjacency list, the way I do things allows you to go backwards (in time, and probably in array index space) through the parents and then process each one (that is not some sentinel value implying "no births for this parent"). |
The way that I've been implementing these buffers is very close to what is in this notebook: https://github.com/molpopgen/edge_buffering_explanation. EDIT: if we do the "trick" of "stealing" all edges that overlap in time w/currently alive nodes, the sort in this notebook is no longer necesary. |
… let us use a common runner routine later
a3ed3e5
to
c559f64
Compare
Description
WIP related to #2751. This PR adds the basic functionality for a "modular" simplifier along with tests.
PR Checklist: