Ancestor lengths in genome and time bins #68
Conversation
Some issues that remain with this PR:
model.py
Outdated
num_x_wins = int(np.ceil(nodes_right.max() - nodes_left.min()) / win_x_size)
num_y_wins = int(np.ceil(nodes_time.max() / win_y_size))
heatmap_sums = np.zeros((num_x_wins, num_y_wins))
I'm creating a numpy 2D array here as it's easier to keep track of bin indices, and then flattening the array later. Is this acceptable?
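To illustrate the approach being asked about, here is a minimal sketch (with made-up sizes) of filling a 2D array by bin index and flattening it afterwards:

```python
import numpy as np

# Hypothetical bin counts for illustration only.
heatmap = np.zeros((3, 2))
heatmap[1, 0] = 5.0  # (x_bin, y_bin) indexing stays readable in 2D

# Flatten afterwards (e.g. for plotting); ravel avoids a copy where possible.
flat = heatmap.ravel()
```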
Have you looked at using np.digitize to do the binning, rather than flooring things manually?
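As an illustration of this suggestion, a minimal sketch of `np.digitize`-based binning; the node arrays and window sizes below are invented for the example:

```python
import numpy as np

# Hypothetical node data: left/right genome coordinates and node times.
nodes_left = np.array([0.0, 250.0, 900.0])
nodes_right = np.array([400.0, 800.0, 1000.0])
nodes_time = np.array([10.0, 35.0, 60.0])

win_x_size, win_y_size = 500.0, 25.0

# Explicit bin edges replace the manual flooring.
x_edges = np.arange(nodes_left.min(), nodes_right.max() + win_x_size, win_x_size)
y_edges = np.arange(0.0, nodes_time.max() + win_y_size, win_y_size)

# np.digitize returns, for each value, the index of the bin it falls into
# (1-based relative to the edges, hence the -1).
x_starts = np.digitize(nodes_left, x_edges) - 1
y_starts = np.digitize(nodes_time, y_edges) - 1
```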
model.py
Outdated
) # map the node span to the x-axis bins it overlaps
x_end = int(np.floor(nodes_right[u] / win_x_size))
y = max(0, int(np.floor(nodes_time[u] / win_y_size)) - 1)
heatmap_sums[x_start:x_end, y] += min(
The min() operation is only really required for the first and last bins, as they may not completely overlap the node span. Every other bin can simply be incremented by the full window size.
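A sketch of that optimisation for a single hypothetical span: only the two edge bins need the exact overlap computed, while fully covered interior bins just get the whole window size:

```python
import numpy as np

win_x_size = 100.0
left, right = 130.0, 470.0  # hypothetical node span

x_start = int(left // win_x_size)   # first (partial) bin
x_end = int(right // win_x_size)    # last (partial) bin

row = np.zeros(6)
# Interior bins are fully covered, so add the whole window size...
row[x_start + 1:x_end] += win_x_size
# ...and only the edge bins need the exact overlap with the span.
row[x_start] += (x_start + 1) * win_x_size - left
row[x_end] += right - x_end * win_x_size
```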
ancestor_span_heatmap.mp4
Force-pushed from 138b09e to 7106fcc
@savitakartik Would you like a review here now?
Yes that would be great, thanks @benjeffery.
Looks good - I wonder if we can use existing numpy tools a bit more though?
model.py
Outdated
@@ -551,3 +555,48 @@ def calc_mutations_per_tree(self):
        mutations_per_tree = np.zeros(self.ts.num_trees, dtype=np.int64)
        mutations_per_tree[unique_values] = counts
        return mutations_per_tree

    def compute_ancestor_spans_heatmap_data(self, win_x_size=1_000_000, win_y_size=500):
Are these sizes in genome coordinates? If so, it's going to be tricky to pick defaults that work across species (compare humans to SARS-CoV-2, e.g.).
What about using the number of x-bins and number of y-bins instead? That way you can be sure of a given level of resolution, and that you won't use too much memory.
You can use np.linspace to generate the bin coordinates.
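A sketch of this bin-count parameterisation with `np.linspace`; the parameter names and sizes here are hypothetical:

```python
import numpy as np

sequence_length = 29_903   # hypothetical: a small genome
max_time = 1200.0
num_x_bins, num_y_bins = 50, 20

# linspace gives num_bins + 1 evenly spaced edges regardless of genome size,
# so memory use is bounded no matter the species.
x_edges = np.linspace(0, sequence_length, num_x_bins + 1)
y_edges = np.linspace(0, max_time, num_y_bins + 1)

heatmap = np.zeros((num_x_bins, num_y_bins))
```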
Can you rebase please @savitakartik to pull in the logging changes?
Force-pushed from c778950 to 1b4b7ea
Now done, @jeromekelleher. I will try to address the failing tests now.
I just looked at the plot for the Unified genealogy trees and it's not terribly informative. Let's have a chat about it tomorrow.
Force-pushed from 1b4b7ea to 5541b73
Recording.2023-09-24.104825.mp4
I think this is very useful, @benjeffery can you help get it merged please?
@savitakartik Looks good - but I think you need to increment the cache version for nodes.
…unt on hover and plot titles added
Force-pushed from 5541b73 to 47536db
LGTM - we can merge now and follow up with a PR to address potential performance issues later?
@@ -449,6 +449,10 @@ def nodes_df(self):
        "time": ts.nodes_time,
        "num_mutations": self.nodes_num_mutations,
        "ancestors_span": child_right - child_left,
        "child_left": child_left,  # FIXME add test for this
        "child_right": child_right,  # FIXME add test for this
        "child_left": child_left,  # FIXME add test for this
Duplication here?
Also, looks like it's tested now?
x_ends = np.digitize(nodes_right, x_bins, right=True)
y_starts = np.digitize(nodes_time, y_bins, right=True)

for u in range(len(nodes_left)):
This is probably slow - should we do it with numba?
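One way to sketch the numba suggestion; the `accumulate` helper and its inputs are hypothetical, and the pure-Python fallback is only there so the snippet runs without numba installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:        # fall back to pure Python if numba is absent
    def njit(f):
        return f

@njit
def accumulate(heatmap, x_starts, x_ends, y_starts, win_x_size):
    # Under numba the per-node loop compiles to machine code, avoiding
    # Python-level overhead for large node tables. This simplified version
    # adds the full window size to every overlapped bin.
    for u in range(len(x_starts)):
        for x in range(x_starts[u], x_ends[u] + 1):
            heatmap[x, y_starts[u]] += win_x_size
    return heatmap

heatmap = np.zeros((4, 2))
x_starts = np.array([0, 1])
x_ends = np.array([2, 3])
y_starts = np.array([0, 1])
heatmap = accumulate(heatmap, x_starts, x_ends, y_starts, 100.0)
```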
Addresses #20
Added `child_left`, `child_right` columns to nodes_df, and tests for these. Changed the computation of ancestor-spans-heatmap data to iterate over nodes instead of bins.