
Test Set missing files causing variation in published research #8

Open
lijuncheng16 opened this issue Apr 6, 2022 · 7 comments

@lijuncheng16

Hi, Dan and the other contributors:
Thank you for maintaining the repo so far!
AudioSet, to me, is a great resource, and still the best resource to understand the nature of sound.
We did a recent study (paper and code), where we found that recent research papers show a whopping ±5% variation due to test-set files going missing at download time. In addition, differences in label quality also contribute to the performance variation and make comparisons less fair (see Figure 2 in our paper).
I understand you have legal constraints around YouTube licensing, but this issue might be easier for the original authors to address: either advocate that the community use a common subset, or release an updated test set, given that you have already released updated strong labels.
Looking forward to your thoughts.

@lijuncheng16 lijuncheng16 changed the title Test Set missing files causing fluctuation in research Test Set missing files causing variation in published research Apr 6, 2022
@dpwe
Contributor

dpwe commented Apr 6, 2022 via email

@lijuncheng16
Author

Hi, Dan:
Thank you again for your prompt response!
In our paper, Table 1: due to differences in how AudioSet was downloaded, the number of train and test clips varies by a whopping ±5% across previous works.

  1. e.g., AST (Gong et al., 2021) reports 19,185 test clips vs. our 20,123: |19185 − 20123| / 19185 ≈ 0.049, which is where the 5% comes from (see the sketch after this list).
  2. Not to mention that ERANN (Verbitskiy et al.) used a test set of only 17,967 clips.
  • Using the exact AST training pipeline, we observe a 2.5% mAP drop when evaluating on our test set vs. theirs.
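To spell out the arithmetic behind the 5% figure, here is a minimal check using the counts quoted above:

```python
# Quick check of the relative test-set size difference quoted above.
ast_test_clips = 19185   # test clips reported for AST (Gong et al., 2021)
our_test_clips = 20123   # test clips in our download
rel_diff = abs(ast_test_clips - our_test_clips) / ast_test_clips
print(f"relative difference: {rel_diff:.3f}")  # -> 0.049, i.e. roughly 5%
```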

In particular, the differing test sizes can cause severe fluctuations in the final reported mAP, as seen in Figure 2 of our paper. For example, one could have downloaded the lower-label-quality test samples, which tanks the score; or one could test only on some high-label-quality samples and report a higher mAP.

@dpwe
Contributor

dpwe commented Apr 7, 2022 via email

@lijuncheng16
Author

lijuncheng16 commented Apr 7, 2022

Hi, Dan:
Thanks again for your reply.
I notice there can be minor performance fluctuations due to PyTorch/TensorFlow or hardware changes, but those are not very significant.
Yes, we already did that ablation; it is shown in the paper mentioned above:
[figure: test mAP across label-quality quantiles for recipes A1 (128x1024 features, with pretraining), A2 (128x1024, without pretraining), A3 (64x400, with pretraining), A4 (64x400, without pretraining); runs below 0.35 mAP not listed]
Basically, this graph shows that model performance at test time changes rapidly across the different label-quality-quantile test sets.

  • e.g., the CNN+Transformer model using our A2 recipe (128x1024 features and the scheduling described in our paper) reaches 0.526 test mAP if we report only on the >90% label-quality classes (ablating all low-quality classes), vs. 0.437 mAP for the same model on the full test set (100% of our 20,123 test files). I find the gap between 0.526 and 0.437 significant (see the sketch below).

All the models listed here are SOTA results I have reproduced/implemented. If you are interested, you can try running my pipeline here
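For reference, here is a minimal sketch of the kind of quality-filtered evaluation described above; it is not the exact code from our pipeline, and the array names and the 0.9 quality threshold are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def macro_map(y_true, y_score, class_mask=None):
    """Balanced (macro) mAP: unweighted mean of per-class average precision.

    y_true:     (num_clips, num_classes) binary label matrix
    y_score:    (num_clips, num_classes) model output scores
    class_mask: optional boolean vector selecting which classes to average over
    """
    num_classes = y_true.shape[1]
    if class_mask is None:
        class_mask = np.ones(num_classes, dtype=bool)
    aps = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(num_classes)
        if class_mask[c] and y_true[:, c].any()   # skip classes with no positives
    ]
    return float(np.mean(aps))

# Illustrative usage: `label_quality` holds per-class QA quality estimates in [0, 1].
# full_map = macro_map(y_true, y_score)                                  # all classes
# high_map = macro_map(y_true, y_score, class_mask=label_quality > 0.9)  # high-quality classes only
```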

@lijuncheng16
Author

Again, I really appreciate your correspondence here! I feel that trying to solve this on GitHub could be faster and perhaps easier than chatting at Interspeech or ICASSP, though of course I would love to do that as well if you are attending.

As an audio ML researcher, I have always felt AudioSet could be the ImageNet of the audio community, given the effort and resources you have already spent collecting it.
Look at ImageNet: folks spend time reporting 0.1% gains in Top-1 or Top-5 accuracy (surely there is plenty of black-box/incremental research...), but the point is that ImageNet has become more and more established as the de facto large-scale standard for the vision community, and that community has benefited from it.

That's why I strongly feel that a fair comparison on AudioSet would be helpful and is actually a pressing task. Hopefully you see where I am coming from.

@dpwe
Contributor

dpwe commented Apr 9, 2022

Subsetting classes is definitely going to have a large influence on the
summary score, because there's such a wide spread in per-class performances
(some classes are legitimately just more prone to confusion; some have
scarce training data, although this seems to matter less than I expect).
Here's a scatter of per-class mAP (for a basic resnet50 model) vs. the QA
quality estimate across all 527 classes:

[image: scatter of per-class mAP (basic resnet50 model) vs. QA quality estimate across all 527 classes]

Your figure 2, showing that the average over subsets of these points
(growing from the right, I guess) yields different overall averages, seems
natural given such a wide spread.

In practice, of course, multi-label data makes it impossible to select all the positive samples for one class without including some positives from multiple other classes, but you could drastically alter the prior of different classes. Note, though, that this wouldn't actually help you "goose" your results unless you had some threshold below which classes with too few samples were excluded from the final balanced average. Absent that, reducing but not eliminating a class's samples would add noise to its contribution to the balanced average, but wouldn't weaken it, since the balanced average weights each class equally regardless of the number of eval samples it's based on.

But the missing 1000+ segments in the smaller downloaded eval sets aren't going to be concentrated in a few classes. They should occur at random across all the segments and, in expectation, impact all classes equally.

I tested the impact of random deletions by taking multiple random subsets
of the eval set with different amounts of deletion, then calculating the
mean and SD of the metrics vs. the proportion of the eval set being
selected. So, proportion = 1.0 is the full-set metric, and shows no
variance because every draw has to be the same. As the proportion drops,
we expect the variance to go up because the different draws can be
increasingly unlike one another. Here's the result for d-prime (i.e.,
transformed AUC), which is our preferred within-class performance metric.
The shaded region represents +/- 1 SD away from the mean, over 100 draws
per proportion:

[image: mean d-prime, with ±1 SD shading, vs. proportion of eval set retained, 100 draws per proportion]
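For concreteness, here is a minimal sketch of the subsampling experiment described above (not the exact code behind these plots); it assumes a binary label matrix `y_true` and a score matrix `y_score` over the eval clips, and uses the standard d-prime transform d' = sqrt(2) * Phi^{-1}(AUC):

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def dprime(auc):
    """Transform ROC AUC to d-prime: d' = sqrt(2) * inverse-normal-CDF(AUC)."""
    return np.sqrt(2.0) * norm.ppf(auc)

def balanced_dprime(y_true, y_score):
    """Unweighted mean d-prime over classes with both positives and negatives."""
    vals = [
        dprime(roc_auc_score(y_true[:, c], y_score[:, c]))
        for c in range(y_true.shape[1])
        if 0 < y_true[:, c].sum() < len(y_true)
    ]
    return float(np.mean(vals))

def subsample_curve(y_true, y_score, proportions=(1.0, 0.9, 0.8, 0.7),
                    n_draws=100, seed=0):
    """Mean and SD of the metric over random subsets of the eval set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    results = {}
    for p in proportions:
        draws = []
        for _ in range(n_draws):
            # Randomly retain a proportion p of the eval clips (simulated deletion).
            idx = rng.choice(n, size=int(round(p * n)), replace=False)
            draws.append(balanced_dprime(y_true[idx], y_score[idx]))
        results[p] = (np.mean(draws), np.std(draws))
    return results
```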

We see that the average across 100 draws is approximately constant across
all proportions, but the spread grows for smaller sets. However, even for
proportion=0.7 (30% deletion), it's still within about 0.008 of the
full-set figure. I normally ignore differences in d-prime smaller than
0.02 or so (since we see variations on that scale just across
different trainings or checkpoints), so the erosion of the dataset doesn't
seem to be adding serious noise here.

Here's the same treatment for mAP:

[image: mean mAP, with ±1 SD shading, vs. proportion of eval set retained, 100 draws per proportion]

Now the spread across random samples at 70% is around 0.004, so again not a
huge effect in mAP, where changes below 0.01 aren't really worth paying too
much attention to. However, the means of the different proportions appear
to have a definite trend, rather than being estimates of the same
underlying value. This is not at all what I expected, and I can't explain
it off-hand, but maybe it's another reason not to use mAP (the big reason
being that mAP is conflated with the priors for each class in your
particular eval set, whereas ROC-curve metrics normalize that out). But,
even so, the bias due to the smaller set is only about another 0.005 at
proportion 0.7 (30% deletion).

So my belief is that random deletions from the eval set (which primarily
occur because videos get deleted from YouTube) are not as serious a threat
to metric repeatability as I feared at first. (In 2017, the sets were
disappearing at ~1% per month, but that seems to have slowed down). I hope
these plots reassure you too. I think there must be a different factor
causing the difference you saw in AST results.

@audioset audioset deleted a comment from lijuncheng16 Apr 9, 2022
@dpwe
Contributor

dpwe commented Apr 13, 2022

(To the curious, the comment I deleted was alerting me that my first attempt to upload the discussion of eval set erosion was missing the figures).
