Emit all genealogical nodes? #2648

sgravel · 2022-11-24T19:46:45Z

sgravel
Nov 24, 2022

We are interested in using the general_stat function to compute quantities for specific genealogical individuals in genealogical simulations. For example: what is the total span of all the segments contributed by a genealogical individual to the samples? This does not seem to work out of the box using general_stat with mode="node" because (I believe) it does not consider contributions of genealogical individuals that happen via a unary node, or prior to the MRCA.

We would be happy to work to contribute such a feature. One possible workaround (option 1) would be to set all genealogical individuals as samples, do the genealogical simulation, but then set weights in general_stat to be positive only for the "real" samples. I think this would work, but might be quite inefficient in a large genealogy.

Option 2 is to write an option to emit_all_genealogical_nodes in a genealogical simulation, which would require a bit more work but might be worth doing.

I think we will try option 1 to start, but we are open to suggestions, or to being told that the approach is wrong.
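
To make the target quantity concrete, here is a toy pure-Python sketch of a node statistic in which only the "real" samples carry weight. It is not tskit's implementation: the data layout (a list of (span, parent-map) pairs), the name node_stat, and the example trees are all hypothetical, and it assumes the unary ancestors have already been recorded in each tree.

```python
from collections import defaultdict


def node_stat(trees, weights, f=lambda x: x):
    """Sum over trees of span * f(total sample weight at or below each node).

    trees   -- list of (span, parent) pairs; parent maps child -> parent
    weights -- maps each "real" sample node to its weight
    """
    total = defaultdict(float)
    for span, parent in trees:
        below = defaultdict(float)
        for sample, w in weights.items():
            # Walk from the sample to the root, crediting every ancestor,
            # including unary ones.
            u = sample
            while u is not None:
                below[u] += w
                u = parent.get(u)
        for u, x in below.items():
            total[u] += span * f(x)
    return dict(total)


# Two marginal trees covering 6 and 4 units of genome.
# Nodes 2 and 3 are unary ancestors; node 3 is absent from the second tree.
trees = [
    (6.0, {0: 2, 1: 3, 2: 4, 3: 4}),
    (4.0, {0: 2, 1: 4, 2: 4}),
]
weights = {0: 1.0, 1: 1.0}  # only the real samples carry weight
result = node_stat(trees, weights)
```

In this toy example the unary nodes 2 and 3 receive a value only because they appear in the parent maps, which is the information a simulator would need to record.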

petrelharp · 2022-11-24T21:04:17Z

petrelharp
Nov 24, 2022
Maintainer

Is the problem to do with "how to make sure the information is in the tree sequence in the first place" or "how to get the summaries out of the tree sequence", or both? (maybe both?)

I wonder if it's the former, because portions of ancestry for which a non-sample is unary are not retained; if you want to keep these in then you do need to do something (and what to do depends on the simulator?).

But, if all the information you need is actually in the tree sequence then I think this does what you want, regardless of unary-ness or MRCAs:

ts.sample_count_stat(ts.samples(), lambda x: x, 1, polarised=True, span_normalise=False, strict=False, mode="node")

2 replies

LukeAndersonTrocme Nov 24, 2022
Collaborator

Will dig a little deeper tomorrow, but I'm getting a weird error when I try to run that:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [17], line 1
----> 1 ts.sample_count_stat(ts.samples(), lambda x: x, 1, polarised=True, span_normalise=False, strict=False, mode="node")

File ~/anaconda3/envs/msprime-env/lib/python3.10/site-packages/tskit/trees.py:7161, in TreeSequence.sample_count_stat(self, sample_sets, f, output_dim, windows, polarised, mode, span_normalise, strict)
   7159 # helper function for common case where weights are indicators of sample sets
   7160 for U in sample_sets:
-> 7161     if len(U) != len(set(U)):
   7162         raise ValueError(
   7163             "Elements of sample_sets must be lists without repeated elements."
   7164         )
   7165     if len(U) == 0:

TypeError: object of type 'numpy.int32' has no len()

edit: solved issue with the following

ts.sample_count_stat([ts.samples().tolist()], lambda x: x, 1, polarised=True, span_normalise=False, strict=False, mode="node")

petrelharp Nov 25, 2022
Maintainer

ah, silly mistake on my part

sgravel · 2022-11-25T16:52:36Z

sgravel
Nov 25, 2022
Author

It's true that this is more a question for the simulator than for tskit, since it is about how to record nodes rather than how to parse them. So just to confirm here: if we modified the simulator (here msprime) to emit all nodes within the genealogy, we would be able to use general_stat as discussed to compute the expected total contributions by any individual in the genealogy (including if they do not have a coalescence)?

There is a second type of summary statistic that we were interested in computing and that might still be doable with a modification of general_stat: the amount of pairwise coalescence that occurs in a given genealogical individual (a generalization of the instantaneous coalescence rate to individuals). In principle, we could do this by computing all IBD and modifying the identitysegment object to record the node id where the IBD occurred. But I also think that this could be computed from a modification of the general_stat function in which the function f is applied not to the weights of a node itself, but to the weights of the children of the node. That is, if we select a pair of samples (s_1, s_2), the rate of pairwise coalescence in node u is the sum over all pairs (i, j) of children of u of the probability that s_1 descends from node i and s_2 descends from node j, plus the opposite pairing. In other words, the rate of pairwise coalescence in a node u corresponding to a genealogical individual is proportional to \sum_{children i \neq j of u} (w_i / n_samples) (w_j / n_samples).

Does that make sense?
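
As a sanity check on the formula above, here is a minimal pure-Python sketch for a single tree. It assumes that samples are leaves and that w_i counts the samples at or below child i; the names weights_below and pairwise_coalescence and the toy tree are hypothetical, and none of this is an existing tskit function.

```python
from itertools import permutations


def weights_below(children, samples, root):
    """Number of samples at or below each node (post-order traversal)."""
    w = {}

    def visit(u):
        w[u] = (1 if u in samples else 0) + sum(
            visit(c) for c in children.get(u, [])
        )
        return w[u]

    visit(root)
    return w


def pairwise_coalescence(children, samples, root):
    """For each node u, sum over ordered pairs of distinct children (i, j)
    of (w_i / n) * (w_j / n), where n is the number of samples."""
    n = len(samples)
    w = weights_below(children, samples, root)
    return {
        u: sum(w[i] * w[j] for i, j in permutations(children.get(u, []), 2))
        / n**2
        for u in w
    }


# Toy tree: root 5 has children 4 and 2; node 4 is unary above node 3;
# node 3 has children 0 and 1. The samples are the leaves 0, 1 and 2.
children = {5: [4, 2], 4: [3], 3: [0, 1]}
samples = {0, 1, 2}
rate = pairwise_coalescence(children, samples, root=5)
```

Here the unary node 4 gets a rate of zero, and the rates summed over all nodes equal 1 - 1/n, the probability that two samples drawn with replacement are distinct, since every such pair coalesces in exactly one node of a tree with a single root.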

10 replies

petrelharp Nov 30, 2022
Maintainer

That's where we're at, too - me and @nspope are going to discuss the tradeoffs soon. =)

But, if you have concrete examples of things that you really want to do that are not do-able based on that sort of edge-wide information, let us know!

General-Solution Mar 2, 2023

Hi all, I've started on some Python implementations for the sum over edges statistic:
https://github.com/General-Solution/tskit/blob/main/python/tests/test_parent_child_stats.py

The node_parent_child_stat() methods all implement it slightly differently. Right now node_parent_child_stat_4() is the fastest but might not be doable in C. node_parent_child_stat_2() is slightly slower.

I've also written some tests and benchmarking in the same file.

nspope Mar 2, 2023
Collaborator

Cool @General-Solution -- you might take a look at the implementation at

tskit/python/tskit/stats.py

Line 222 in c12f384

class CoalescenceTimeDistribution:

which does a similar sort of thing (to count coalescent events of various sorts -- e.g. number pairwise coalescence events between populations at each node).

sgravel Mar 2, 2023
Author

Thanks! There does seem to be sizeable overlap, so it might make sense to think about how best to package this. From my perspective, I would imagine that the computation of coalescences per node might be a distinct function, which we can then use to compute a TMRCA distribution (as you did), but also to pick apart the coalescences in other ways (e.g., by geographical area).

nspope Mar 2, 2023
Collaborator

Totally. Just to be clear, the idea underlying that class is to have a general interface for counting coalescence events of various "topologies" per node (for example, coalescence events of a particular pair of samples, or of a trio of populations). These are stored in a time-sorted table, so that it just takes a binary search to compute the number of events in an arbitrary time window (and calculate things like instantaneous coalescence rate within that window). The class methods are intended to give efficient access to that table wrt time windows, but the node-level values are probably interesting in their own right (e.g. the "weights" could be mapped back onto nodes).

The update scheme implemented in that class is (I think) a pretty good tradeoff between efficient and general, but it's definitely not the only possibility. I completely agree that it's worth hashing out other applications for these types of computations, to figure out what the C-level machinery should ultimately look like.
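
The time-window lookup described above can be sketched in pure Python as follows. This only illustrates the binary-search idea and is not the code of CoalescenceTimeDistribution; the table contents and the name events_in_window are made up for the example.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical per-node table, sorted by node time.
times = [0.5, 1.0, 1.0, 2.5, 4.0]  # node times
counts = [1, 2, 1, 3, 1]  # coalescence events counted at each node

# cumulative[k] is the number of events at the first k nodes.
cumulative = [0] + list(accumulate(counts))


def events_in_window(lo, hi):
    """Number of events at nodes with lo <= time < hi, via two binary searches."""
    return cumulative[bisect_left(times, hi)] - cumulative[bisect_left(times, lo)]


n_events = events_in_window(1.0, 3.0)
```

Dividing such a count by the window length, or by a suitable measure of opportunity for coalescence within the window, would then give a rate; that normalisation is left out here because it depends on the statistic being computed.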

petrelharp · 2022-11-26T16:51:08Z

petrelharp
Nov 26, 2022
Maintainer

> It's true that this is more a question for the simulator than for tskit, since it is about how to record nodes rather than how to parse them. So just to confirm here: if we modified the simulator (here msprime) to emit all nodes within the genealogy, we would be able to use general_stat as discussed to compute the expected total contributions by any individual in the genealogy (including if they do not have a coalescence)?

I don't see why not, yes? Although it's possible I'm missing some oddity, as I haven't done any verification.

Note that keeping this information is pretty straightforward in SLiM (and you can simulate in pedigrees there now).

2 replies

petrelharp Nov 26, 2022
Maintainer

Ah, and in msprime, you'd mark everyone as samples in the pedigree.

sgravel Nov 28, 2022
Author

Right, this would be a way to do this out of the box. Although with a genealogy of 5 million people, this would be a bit heavy (although probably doable, especially if we stop the simulation at the top of the genealogy).

jeromekelleher · 2022-11-28T09:44:40Z

jeromekelleher
Nov 28, 2022
Maintainer

Separate from the discussion about the stats (which sounds useful and interesting!) is the issue about keeping track of more nodes in msprime. This has also cropped up in the context of trying to improve performance in general in msprime (tskit-dev/msprime#2121) so maybe addressing the "keeping more of the ARG" in the pedigree simulation would be a good place to start?

cc @GertjanBisschop

2 replies

sgravel Nov 28, 2022
Author

It would be a good start, although in our case we will need two kinds of nodes that are not in the ARG: unary nodes with neither coalescence nor recombination, and nodes above the MRCA that may be relevant for some genealogical quantities.

jeromekelleher Nov 29, 2022
Maintainer

Ah, you want the full path through all pedigree nodes. That should be straightforward to do - it's really a case of not doing some logic where we record nodes if there's a coalescence. Ping me offline if you're keen to get it done quickly, we could see what can be done.
