Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Biased estimator for Janitza p-values #555

Open
jonathanishhorowicz opened this issue Mar 14, 2021 · 5 comments
Open

Biased estimator for Janitza p-values #555

jonathanishhorowicz opened this issue Mar 14, 2021 · 5 comments

Comments

@jonathanishhorowicz
Copy link

jonathanishhorowicz commented Mar 14, 2021

Hi, thanks again for this excellent package.

While using it for genomics applications I have had some problems with p-values that are exactly zero when using the Janitza method. The paper "Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn" by Phipson and Smith, this should not be the case and it is better to use a positively-biased estimator for a p-value with m permutations:
p = (b + 1)/(m + 1),
where b is the number of permutations where the test statistic is greater than the observed value.

This functionality was removed in #206 so is there a reason I should stick with the unbiased estimator (which is b/m)?

Many thanks!

@jonathanishhorowicz
Copy link
Author

For completeness, if I were implementing this I would add the following at line 127 in importance.R:

pval.biased <- (1+length(vimp)-nSmaller)/(1+length(vimp))

@mnwright
Copy link
Member

Hm... I'm trying to remember what the problem with the +1 approach was. I think one problem is that you cannot increase the number of "permutation" with the Janitza approach because that's the number of zero-or-negative-importance variables.

@snembrini Do you remember anything else?

@jonathanishhorowicz
Copy link
Author

I also had that problem so have been returning the number of "permutations" for the Janitza method, so there is some idea of the precision of the estimated p-value. In the end I decided that not being able to increase the number of permutations was not as serious as being unable to do a multiple testing correction (due to the p-values being exactly zero), so I have gone with bootstrapped CIs for the impurity_corrected scores instead.

@snembrini
Copy link

snembrini commented Mar 15, 2021

@mnwright It is probably because the null distribution has a fixed m.
As I had previously shared with you, m could be increased by adding permutations of the regressors, but that inevitably slightly changes the original approach.

The information underlying m could be summarized using a Binomial confidence interval: With a small m, the CI will be wider.

I personally do not think that most P-value corrections can be applicable to the Janitza method, because very often Pvalues are required to be uniform under the null, which is not the case.

Also @jonathanishhorowicz, I know it is the same approach of Ishwaran, but CIs where there is no population parameters, is a term that should be avoided because it brings a lot of confusion

@jonathanishhorowicz
Copy link
Author

Sorry for taking so long to reply to this!

Thank you @snembrini, I will look at the binomial confidence intervals for my application, although I suspect that the Bonferroni correction will make them extremely wide.

The approach I am using is not quite that of Ishwaran, since I am using your debaised impurity VIMP measure I was under the impression I did not have to account for the double bootstrap effect, since the impurity is a feature of the tree and is not calculated with additional samples. I have just wrapped my RF fitting in a bootstrap resampling procedure - is this not a valid approach?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants