Recommend evaluating over just 100-1000 examples #28

Open
carlini opened this issue Feb 25, 2021 · 0 comments

Lots of papers say something to the effect of "we couldn't possibly hope to run [powerful attack] on the entire test set, so we'll run FGSM instead". While evaluating on more data is always preferable, substituting a weaker attack to make that feasible often comes at the cost of correctness.

I think it would make sense to recommend that people evaluate (at least initially) on just a few hundred or maybe a thousand examples. Assuming each test example is independent, a thousand samples pins down accuracy to within roughly ±1.5% at 95% confidence. In most cases, robust accuracy under attack is on the order of 5% or 10%, so this is more than enough statistical power.
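The back-of-the-envelope figure above can be checked with a normal-approximation (Wald) confidence interval; a minimal sketch, where the function name and sample counts are illustrative rather than anything prescribed here:

```python
import math

def accuracy_ci(successes, n, z=1.96):
    """Approximate 95% (Wald) confidence interval for an accuracy
    estimated from n independent test examples."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# With 1000 examples and a measured robust accuracy of 5%,
# the interval half-width comes out to about 1.4%.
lo, hi = accuracy_ci(50, 1000)
```

For accuracies this close to zero, an interval method with better small-p behavior (e.g. Wilson or Clopper–Pearson) would be a bit more accurate, but the normal approximation is enough for a sample-size sanity check.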

If papers really want to, they can put serious effort into attacking a small number of examples, then repeat the evaluation on a large number (perhaps with a faster attack), and check that the two confidence intervals overlap.
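That consistency check can be sketched directly; all numbers below are hypothetical, chosen only to illustrate comparing a strong attack on a subsample against a faster attack on the full test set:

```python
import math

def accuracy_ci(successes, n, z=1.96):
    """Approximate 95% (Wald) confidence interval for an accuracy
    estimated from n independent test examples."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical results: strong attack on 1000 examples,
# faster attack on a full 10000-example test set.
strong = accuracy_ci(60, 1000)     # 6.0% robust accuracy, small sample
fast = accuracy_ci(710, 10000)     # 7.1% robust accuracy, full set

# The two evaluations are mutually consistent if the intervals overlap.
consistent = strong[0] <= fast[1] and fast[0] <= strong[1]
```

Overlapping intervals don't prove the strong attack was run correctly at scale, but non-overlapping ones are a clear warning that the subsample result doesn't generalize.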
