Replies: 4 comments 16 replies
-
Is the problem to do with "how to make sure the information is in the tree sequence in the first place" or "how to get the summaries out of the tree sequence", or both? (maybe both?) I wonder if it's the former, becuase portions of ancestry for which a non-sample is unary are not retained; if you want to keep these in then you do need to do something (and what to do depends on the simulator?). But, if all the information you need is actually in the tree sequence then I think this does what you want, regardless of unary-ness or MRCAs:
|
Beta Was this translation helpful? Give feedback.
-
It's true that this is more a question for the simulator than for tskit, since it is about how to record nodes rather than how to parse them. So just to confirm here: if we modified the simulator (here msprime) to emit all nodes within the genealogy, we would be able to use general_stat as discussed to compute the expected total contributions by any individual in the genealogy (including if they do not have a coalescence)? There is a second type of summary statistic that we were interested in computing and that might still be doable with modification of general_stat: it is the amount of pairwise coalescence that occurs in a given genealogical individual (as a generalization to the instantaneous coalescence rate to individuals). In principle, we could do this by computing all IBD and modify the identitysegment object to record the node id where the IBD occurred. But I also think that this could be computed from a modification of the general_stat function where the function f applied not to the weights of a node itself, but to the weights of the children of the node (i.e., if we select a pair of samples (s_1,s_2), the rate of pairwise coalescence in node u is the sum over all pairs (i,j) of children of u of the probability that s_1 descends from node i and s_2 descends from node j, plus the opposite pairing. In other words, the rate of pairwise coalescence in a node u corresponding to a genealogical individual is proportional to \sum_{children i,j of u }(w_i/n_samples w_j/n_samples). Does that make sense? |
Beta Was this translation helpful? Give feedback.
-
I don't see why not, yes? Although it's possible I'm missing some oddity, as I haven't done any verification. Note that keeping this information is pretty straight forward in SLiM (and you can simulate in pedigrees there now). |
Beta Was this translation helpful? Give feedback.
-
Separate from the the discussion about the stats (which sounds useful and interesting!) is the issue about keeping track of more nodes in msprime. This has also cropped up in the context of trying to improve performance in general in msprime (tskit-dev/msprime#2121) so maybe addressing the "keeping more of the ARG" in the pedigree simulation would be a good place to start? |
Beta Was this translation helpful? Give feedback.
-
We are interested in using the general_stat function to compute quantities for specific genealogical individuals in genealogical simulations. For example, what is the total span of all the segments contributed by a genealogical individual to the samples. This does not seem to work out of the box using the general_stat in the mode="nodes" function because (I believe) it does not consider contributions of genealogical individuals which happen via a unary node, or prior to the MRCA.
We would be happy to work to contribute such a feature. One possible workaround (option 1) would be to set all genealogical individuals as samples, do the genealogical simulation, but then set weights in general_stat to be positive only for the "real" samples. I think this would work, but might be quite inefficient in a large genealogy.
Option 2 is to write a an option to emit_all_genealogical_nodes in a genealogical simulation, which would require a bit more work but might be worth doing.
I think we will try option 1 to start, but open to suggestions/told that the approach is wrong.
Beta Was this translation helpful? Give feedback.
All reactions