Lots of papers say something to the effect of "we couldn't possibly hope to run [powerful attack] on the entire test set, so we'll run FGSM instead". While running attacks on more data is always preferable, this often comes at the cost of correctness: the evaluation falls back on a weak attack just to cover the whole test set.
I think it would make sense to recommend that people evaluate (at least initially) on just a few hundred or maybe a thousand examples. Assuming each test example is independent, a thousand samples let you pin down accuracy to within roughly +/- 1-2%. In most cases the robust accuracy being measured is around 5% or 10%, so this is more than enough statistical power.
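To make that concrete, here is a quick back-of-the-envelope sketch (not from the original comment) using the normal approximation to the binomial; the helper name and the example numbers are just for illustration:

```python
import math

def ci_halfwidth(p_hat, n, z=1.96):
    """Half-width of an approximate 95% (Wald) interval for a binomial proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Robust accuracy in the 5-10% range, measured on 1000 examples:
for p in (0.05, 0.10):
    print(f"p_hat={p:.2f}, n=1000 -> +/- {ci_halfwidth(p, 1000):.3f}")
# prints roughly +/- 0.014 at 5% accuracy and +/- 0.019 at 10% accuracy
```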
If papers really want to, they can push a strong attack as hard as possible on a small number of examples, then repeat the evaluation on a large number (perhaps with a faster attack), and check that the two confidence intervals overlap.
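A sketch of that overlap check, again with made-up numbers (the success counts and set sizes below are hypothetical):

```python
import math

def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a binomial proportion."""
    p = successes / n
    hw = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - hw), min(1.0, p + hw)

# Hypothetical numbers: the strong attack leaves 6/100 examples robust,
# the faster attack leaves 80/1000 robust.
strong_lo, strong_hi = wald_interval(6, 100)
fast_lo, fast_hi = wald_interval(80, 1000)

if strong_lo <= fast_hi and fast_lo <= strong_hi:
    print("intervals overlap -- the two evaluations are consistent")
else:
    print("intervals disagree -- the faster attack may be too weak")
```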