Specialize findOverlaps for GRangesFactor objects #28

Open: LTLA wants to merge 9 commits into base: devel

LTLA commented Jun 22, 2019

This should improve the efficiency of overlap queries, which was the whole aim of this class in the first place.

hpages self-assigned this Jun 23, 2019

hpages (Contributor) commented Jun 23, 2019

Hmm, the findOverlaps(..., select="all") + selectHits() strategy is certainly going to slow things down a lot in some situations. This is because the findOverlaps#GenomicRanges#GenomicRanges method is highly optimized when select is "first", "last", or "arbitrary". In these cases the method collects at most 1 hit per query (and stores it directly in the integer vector to return) rather than collecting all the hits in a Hits object only to drop most of them later. In addition, the length of the integer vector is known in advance, so the vector can be pre-allocated. In the select="all" case the final size of the Hits object is not known in advance, so the object cannot be pre-allocated and has to be grown via re-allocations and copies:

```r
library(GenomicRanges)
query <- GRanges("chr1", IRanges(1, 1:9500))
subject <- GRanges("chr1", IRanges(1:9500, 9500))
system.time(q2s <- findOverlaps(query, subject, select="arbitrary"))
#    user  system elapsed
#   0.029   0.000   0.031
system.time(hits <- findOverlaps(query, subject, select="all"))
#    user  system elapsed
#   2.948   0.407   3.355
```

The higher the average number of hits per query, the worse select="all" will perform relative to select="first", "last", or "arbitrary".
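The pre-allocation point can be illustrated with plain R vectors. This is only a rough analogy, not the Hits object or the compiled code behind findOverlaps(): the function names `fill_preallocated` and `fill_by_growing` are made up for this sketch, and appending one element at a time is an assumed worst case rather than a description of how the Hits object is actually grown.

```r
# Illustration only: plain integer vectors, not Hits objects.
n <- 20000L

# Size known in advance: allocate once, then fill in place.
fill_preallocated <- function(n) {
    out <- integer(n)
    for (i in seq_len(n)) out[i] <- i
    out
}

# Size not known in advance: append one element at a time,
# which re-allocates and copies the vector on every step.
fill_by_growing <- function(n) {
    out <- integer(0)
    for (i in seq_len(n)) out <- c(out, i)
    out
}

# Both strategies produce the same result; only the cost differs.
stopifnot(identical(fill_preallocated(n), fill_by_growing(n)))
system.time(fill_preallocated(n))
system.time(fill_by_growing(n))
```

Timings depend on the machine and on `n`, so none are quoted here.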

The select="arbitrary" case is the workhorse behind overlapsAny(), %over%, and %within%.
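A small check of that relationship, using made-up toy ranges (the coordinates below are chosen for illustration only): a query overlaps something exactly when its select="arbitrary" result is not NA.

```r
library(GenomicRanges)

# Toy data: the first two queries each overlap one subject range,
# the third query overlaps nothing.
query <- GRanges("chr1", IRanges(c(1, 50, 200), width=10))
subject <- GRanges("chr1", IRanges(c(5, 55), width=10))

# One integer per query: the index of some overlapping subject, or NA.
q2s <- findOverlaps(query, subject, select="arbitrary")

# overlapsAny() and %over% agree with "has a non-NA arbitrary hit".
stopifnot(identical(!is.na(q2s), overlapsAny(query, subject)))
stopifnot(identical(overlapsAny(query, subject), query %over% subject))
```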

LTLA (Author) commented Jun 23, 2019

Well, I can't say it was easy, but the select != "all" optimizations are done. Note that the lack of special behaviour for a GRangesFactor subject when select="arbitrary" is deliberate; I'd have to unique the indices anyway to ensure that the query doesn't select a range that isn't used.

LTLA (Author) commented Sep 4, 2019

Nudge.
