-
-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug in drawWithoutReplacementSkip() #734
Conversation
Thanks, that looks like an easy fix (and simplification). Did you check any runtimes? Back in 2014, the Fisher-Yates implementation gave a big speedup for some settings (which I cannot remember exactly). |
Nope, I haven't tested any runtimes. I expect that std::sample is implementing Fisher-Yates (or something similar). Hopefully the std library folks have optimized this so that we don't have to :) I suppose that it may be slightly less efficient if you are running with a huge number of options and only a tiny sample (since it has to allocate the list of all indexes at start). I wouldn't expect that to be the bottleneck in most contexts, but makes sense that it would be good to test. |
Unfortunately, this is slower. Particularly for high-dimensional data and it's quite extreme for the GWAS settings (p>>n, few unique values). Here is an example (in R but results will be similar without R): #remotes::install_github("imbs-hl/ranger")
#remotes::install_github("sligocki/ranger@sample_skip_bug")
library(ranger)
library(microbenchmark)
# High dimensional
n <- 1000
p <- 10000
x <- matrix(rbinom(n * p, size = 1, prob = .5), nrow = n,
dimnames = list(NULL, paste0("X", 1:p)))
y <- rnorm(n)
microbenchmark(
default = ranger(x = x, y = y, num.threads = 1),
times = 1)
# Extremely high dimensional
n <- 100
p <- 100000
x <- matrix(rbinom(n * p, size = 1, prob = .5), nrow = n,
dimnames = list(NULL, paste0("X", 1:p)))
y <- rnorm(n)
microbenchmark(
default = ranger(x = x, y = y, num.threads = 1),
times = 1) On this machine, the first one takes about 10 seconds with the master and about 24 seconds with this PR. The second one is more extreme: less then 2 second with the master and about 17 seconds with this PR. For a real GWAS (or similar), the differences will probably be several hours. So it looks like I have to fix the skipping in |
Oh my, I see |
I don't think we can re-use them. This is the sampling of the |
I merged #738, so don't need this PR anymore. Thanks for your help. |
drawWithoutReplacementSkip()
appears to have a bug in both thedrawWithoutReplacementSimple()
anddrawWithoutReplacementFisherYates()
branches. Specifically, these bugs are for cases where there are multiple indexes to be skipped. These bugs are demonstrated by new tests addeddrawWithoutReplacementSkip.small_small4
(FisherYates) anddrawWithoutReplacementSkip.small_large2
(Simple).I fix these both by switching to using
std::sample()
which seems like a safer bet to avoid these sorts of subtle bugs. This change requires updating to (at least) C++17.This might be the source of #733