Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ancestor lengths in genome and time bins #68

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

savitakartik
Copy link
Collaborator

@savitakartik savitakartik commented Aug 25, 2023

Addresses #20

Added child_left, child_right columns to nodes_df and tests for these.
Changed computation of ancestor-spans-heatmap data to avoid iterate over nodes instead of bins.

@savitakartik
Copy link
Collaborator Author

savitakartik commented Aug 25, 2023

Some issues that remain with this PR:

  • will need to remove empty trees or set better xlim values, as there are quite a few empty bins in the heatmap currently displayed.

  • x and y bin sizes have been set arbitrarily at 1Mb and 1000 time units but this may not be ideal for other tree sequences. Will need to find a better way to set bin sizes.

  • We don't need to record all three of these columns in the nodes_df: ancestor_span, child_left and child_right since we're assuming contiguous node spans. Left them in for now as other plots use ancestor_span column.

model.py Outdated

num_x_wins = int(np.ceil(nodes_right.max() - nodes_left.min()) / win_x_size)
num_y_wins = int(np.ceil(nodes_time.max() / win_y_size))
heatmap_sums = np.zeros((num_x_wins, num_y_wins))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm creating a numpy 2D array here as it's easier to keep track of bin indices, and then flattening the array later. Is this acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at using np.digitize to do the binning, rather than flooring things manually?

model.py Outdated
) # map the node span to the x-axis bins it overlaps
x_end = int(np.floor(nodes_right[u] / win_x_size))
y = max(0, int(np.floor(nodes_time[u] / win_y_size)) - 1)
heatmap_sums[x_start:x_end, y] += min(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min() operation is only really required for the first and last bins as they may not completely overlap the node span. Every other bin can just be summed by the window size.

model.py Outdated Show resolved Hide resolved
@savitakartik
Copy link
Collaborator Author

ancestor_span_heatmap.mp4

@savitakartik savitakartik marked this pull request as ready for review August 25, 2023 11:04
@benjeffery
Copy link
Member

@savitakartik Would you like a review here now?

@savitakartik
Copy link
Collaborator Author

savitakartik commented Aug 29, 2023

@savitakartik Would you like a review here now?

Yes that would be great, thanks @benjeffery.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - I wonder if we can use existing numpy tools a bit more though?

model.py Outdated

num_x_wins = int(np.ceil(nodes_right.max() - nodes_left.min()) / win_x_size)
num_y_wins = int(np.ceil(nodes_time.max() / win_y_size))
heatmap_sums = np.zeros((num_x_wins, num_y_wins))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you looked at using np.digitize to do the binning, rather than flooring things manually?

model.py Outdated
@@ -551,3 +555,48 @@ def calc_mutations_per_tree(self):
mutations_per_tree = np.zeros(self.ts.num_trees, dtype=np.int64)
mutations_per_tree[unique_values] = counts
return mutations_per_tree

def compute_ancestor_spans_heatmap_data(self, win_x_size=1_000_000, win_y_size=500):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these sizes of genome coordinates? If so, it's going to be tricky to make default that work across species (compare humans to sars 2, eg.).

What you using number of x-bins and number of y-bins instead? That way you can be sure of a given level of resolution and that you won't use too much memory.

Can use np.linspace to generate the bin coordinates.

@jeromekelleher
Copy link
Member

Can you rebase please @savitakartik to pull in the logging changes?

@savitakartik
Copy link
Collaborator Author

now done, @jeromekelleher. I will try to address the failing tests now.

@jeromekelleher
Copy link
Member

I just looked at the plot for the Unified genealogy trees and it's not terribly informative. Let's have a chat about it tomorrow.

@savitakartik
Copy link
Collaborator Author

Recording.2023-09-24.104825.mp4

@jeromekelleher
Copy link
Member

I think this is very useful, @benjeffery can you help getting it merge please?

@benjeffery
Copy link
Member

@savitakartik Looks good - but I think you need to increment the cache version for nodes.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - we can merge now and follow up with a PR to address potential performance issues later?

@@ -449,6 +449,10 @@ def nodes_df(self):
"time": ts.nodes_time,
"num_mutations": self.nodes_num_mutations,
"ancestors_span": child_right - child_left,
"child_left": child_left, # FIXME add test for this
"child_right": child_right, # FIXME add test for this
"child_left": child_left, # FIXME add test for this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplication here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, looks like it's tested now?

x_ends = np.digitize(nodes_right, x_bins, right=True)
y_starts = np.digitize(nodes_time, y_bins, right=True)

for u in range(len(nodes_left)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably slow - should we do it with numba?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants