Recommend evaluating over just 100-1000 examples #28

Open
carlini opened this issue Feb 25, 2021 · 0 comments

Lots of papers say something to the effect of "we couldn't possibly hope to run [powerful attack] on the entire test set, so we'll run FGSM instead". While evaluating on more data is always preferable, substituting a weaker attack to make that feasible often comes at the cost of correctness.

I think it would make sense to recommend that people evaluate (at least initially) on just a few hundred or maybe a thousand examples. Assuming each test example is independent, a thousand samples pins down accuracy to within roughly ±1.5% at 95% confidence. In most cases, robust accuracy under attack is on the order of 5% or 10%, so this is more than enough statistical power.
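The back-of-the-envelope figure above can be checked with a normal-approximation (Wald) confidence interval; a minimal sketch, where the function name and sample counts are illustrative rather than anything prescribed here:

```python
import math

def accuracy_ci(successes, n, z=1.96):
    """Approximate 95% (Wald) confidence interval for an accuracy
    estimated from n independent test examples."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# With 1000 examples and a measured robust accuracy of 5%,
# the interval half-width comes out to about 1.4%.
lo, hi = accuracy_ci(50, 1000)
```

For accuracies this close to zero, an interval method with better small-p behavior (e.g. Wilson or Clopper–Pearson) would be a bit more accurate, but the normal approximation is enough for a sample-size sanity check.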

If papers really want to, they can put serious effort into attacking a small number of examples, then repeat the evaluation on a large number (perhaps with a faster attack), and check that the two confidence intervals overlap.
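That consistency check can be sketched directly; all numbers below are hypothetical, chosen only to illustrate comparing a strong attack on a subsample against a faster attack on the full test set:

```python
import math

def accuracy_ci(successes, n, z=1.96):
    """Approximate 95% (Wald) confidence interval for an accuracy
    estimated from n independent test examples."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Hypothetical results: strong attack on 1000 examples,
# faster attack on a full 10000-example test set.
strong = accuracy_ci(60, 1000)     # 6.0% robust accuracy, small sample
fast = accuracy_ci(710, 10000)     # 7.1% robust accuracy, full set

# The two evaluations are mutually consistent if the intervals overlap.
consistent = strong[0] <= fast[1] and fast[0] <= strong[1]
```

Overlapping intervals don't prove the strong attack was run correctly at scale, but non-overlapping ones are a clear warning that the subsample result doesn't generalize.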
