
Test Set missing files causing variation in published research #8

Open
lijuncheng16 opened this issue Apr 6, 2022 · 7 comments

@lijuncheng16

Hi, Dan and the other contributors:
Thank you for maintaining the repo so far!
AudioSet, to me, is a great resource, and still the best resource to understand the nature of sound.
We did a recent study (paper and code), where we found that recent research papers show a whopping ±5% variation due to test-set files going missing at download time. In addition, differences in label quality also contribute to the performance variation and make comparisons less fair (see Figure 2 in our paper).
I understand you have legal constraints around YouTube licensing, but this issue might be easier for the original authors to address: either advocate that the community use a common subset, or release an updated test set, given that you have already released updated strong labels.
Looking forward to your thoughts.

@lijuncheng16 lijuncheng16 changed the title Test Set missing files causing fluctuation in research Test Set missing files causing variation in published research Apr 6, 2022
@dpwe
Contributor

dpwe commented Apr 6, 2022 via email

@lijuncheng16
Author

Hi, Dan:
Thank you again for your prompt response!
In our paper, Table 1: due to differences in how AudioSet was downloaded, the number of train and test clips varies by a whopping ±5% across previous works.

  1. e.g., AST (Gong et al., 2021) reports 19,185 test clips vs. our 20,123: |19185 − 20123| / 19185 ≈ 0.049, which is where the 5% comes from (see the sketch after this list).
  2. Not to mention that ERANN (Verbitskiy et al.) used a test set of only 17,967 clips.
  • Using the exact AST training pipeline, we observe a 2.5% mAP drop when evaluating on our test set vs. theirs.
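To spell out the arithmetic behind the 5% figure, here is a minimal check using the counts quoted above:

```python
# Quick check of the relative test-set size difference quoted above.
ast_test_clips = 19185   # test clips reported for AST (Gong et al., 2021)
our_test_clips = 20123   # test clips in our download
rel_diff = abs(ast_test_clips - our_test_clips) / ast_test_clips
print(f"relative difference: {rel_diff:.3f}")  # -> 0.049, i.e. roughly 5%
```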

In particular, the differing test sizes can cause severe fluctuations in the final reported mAP, as seen in Figure 2 of our paper. For example, one could have downloaded the lower-label-quality test samples, which tanks the score; or one could test only on some high-label-quality samples and report a higher mAP.

@dpwe
Contributor

dpwe commented Apr 7, 2022 via email

@lijuncheng16
Author

lijuncheng16 commented Apr 7, 2022

Hi, Dan:
Thanks again for your reply.
I notice there can be minor performance fluctuations due to PyTorch/TensorFlow or hardware changes, but those are not very significant.
Yes, we already did that ablation; it is shown in the paper mentioned above:
[figure: test mAP across label-quality quantiles for recipes A1 (128x1024 features, with pretraining), A2 (128x1024, without pretraining), A3 (64x400, with pretraining), A4 (64x400, without pretraining); runs below 0.35 mAP not listed]
Basically, this graph shows that model performance at test time changes rapidly across the different label-quality-quantile test sets.

  • e.g., the CNN+Transformer model using our A2 recipe (128x1024 features and the scheduling described in our paper) reaches 0.526 test mAP if we report only on the >90% label-quality classes (ablating all low-quality classes), vs. 0.437 mAP for the same model on the full test set (100% of our 20,123 test files). I find the gap between 0.526 and 0.437 significant (see the sketch below).

All the models listed here are SOTA results I have reproduced/implemented. If you are interested, you can try running my pipeline here
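For reference, here is a minimal sketch of the kind of quality-filtered evaluation described above; it is not the exact code from our pipeline, and the array names and the 0.9 quality threshold are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def macro_map(y_true, y_score, class_mask=None):
    """Balanced (macro) mAP: unweighted mean of per-class average precision.

    y_true:     (num_clips, num_classes) binary label matrix
    y_score:    (num_clips, num_classes) model output scores
    class_mask: optional boolean vector selecting which classes to average over
    """
    num_classes = y_true.shape[1]
    if class_mask is None:
        class_mask = np.ones(num_classes, dtype=bool)
    aps = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(num_classes)
        if class_mask[c] and y_true[:, c].any()   # skip classes with no positives
    ]
    return float(np.mean(aps))

# Illustrative usage: `label_quality` holds per-class QA quality estimates in [0, 1].
# full_map = macro_map(y_true, y_score)                                  # all classes
# high_map = macro_map(y_true, y_score, class_mask=label_quality > 0.9)  # high-quality classes only
```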

@lijuncheng16
Author

Again, I really appreciate your correspondence here! I feel that trying to solve this on GitHub could be faster and perhaps easier than chatting at Interspeech or ICASSP, though of course I would love to do that as well if you are attending.

As an audio ML researcher, I have always felt AudioSet could be the ImageNet of the audio community, given the effort and resources you have already spent collecting it.
Look at ImageNet: folks spend time reporting 0.1% gains in Top-1 or Top-5 accuracy (surely there is plenty of black-box/incremental research...), but the point is that ImageNet has become more and more established as the de facto large-scale standard for the vision community, and that community has benefited from it.

That's why I strongly feel that a fair comparison on AudioSet would be helpful and is actually a pressing task. Hopefully you see where I am coming from.

@dpwe
Contributor

dpwe commented Apr 9, 2022

Subsetting classes is definitely going to have a large influence on the
summary score, because there's such a wide spread in per-class performances
(some classes are legitimately just more prone to confusion; some have
scarce training data, although this seems to matter less than I expect).
Here's a scatter of per-class mAP (for a basic resnet50 model) vs. the QA
quality estimate across all 527 classes:

[image: scatter of per-class mAP (basic resnet50 model) vs. QA quality estimate across all 527 classes]

Your figure 2, showing that the average over subsets of these points
(growing from the right, I guess) yields different overall averages, seems
natural given such a wide spread.

In practice, of course, multi-label data makes it impossible to select all the positive samples for one class without including some positives from multiple other classes, but you could drastically alter the prior of different classes. Note, though, that this wouldn't actually help you "goose" your results unless you had some threshold below which classes with too few samples were excluded from the final balanced average. Absent that, reducing but not eliminating a class's samples would add noise to its contribution to the balanced average, but wouldn't weaken it, since the balanced average weights each class equally regardless of the number of eval samples it's based on.

But the missing 1000+ segments in the smaller downloaded eval sets aren't going to be concentrated in a few classes. They should occur at random across all the segments and, in expectation, impact all classes equally.

I tested the impact of random deletions by taking multiple random subsets
of the eval set with different amounts of deletion, then calculating the
mean and SD of the metrics vs. the proportion of the eval set being
selected. So, proportion = 1.0 is the full-set metric, and shows no
variance because every draw has to be the same. As the proportion drops,
we expect the variance to go up because the different draws can be
increasingly unlike one another. Here's the result for d-prime (i.e.,
transformed AUC), which is our preferred within-class performance metric.
The shaded region represents +/- 1 SD away from the mean, over 100 draws
per proportion:

[image: mean d-prime, with ±1 SD shading, vs. proportion of eval set retained, 100 draws per proportion]
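For concreteness, here is a minimal sketch of the subsampling experiment described above (not the exact code behind these plots); it assumes a binary label matrix `y_true` and a score matrix `y_score` over the eval clips, and uses the standard d-prime transform d' = sqrt(2) * Phi^{-1}(AUC):

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import roc_auc_score

def dprime(auc):
    """Transform ROC AUC to d-prime: d' = sqrt(2) * inverse-normal-CDF(AUC)."""
    return np.sqrt(2.0) * norm.ppf(auc)

def balanced_dprime(y_true, y_score):
    """Unweighted mean d-prime over classes with both positives and negatives."""
    vals = [
        dprime(roc_auc_score(y_true[:, c], y_score[:, c]))
        for c in range(y_true.shape[1])
        if 0 < y_true[:, c].sum() < len(y_true)
    ]
    return float(np.mean(vals))

def subsample_curve(y_true, y_score, proportions=(1.0, 0.9, 0.8, 0.7),
                    n_draws=100, seed=0):
    """Mean and SD of the metric over random subsets of the eval set."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    results = {}
    for p in proportions:
        draws = []
        for _ in range(n_draws):
            # Randomly retain a proportion p of the eval clips (simulated deletion).
            idx = rng.choice(n, size=int(round(p * n)), replace=False)
            draws.append(balanced_dprime(y_true[idx], y_score[idx]))
        results[p] = (np.mean(draws), np.std(draws))
    return results
```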

We see that the average across 100 draws is approximately constant across
all proportions, but the spread grows for smaller sets. However, even for
proportion=0.7 (30% deletion), it's still within about 0.008 of the
full-set figure. I normally ignore differences in d-prime smaller than
0.02 or so (since we see variations on that scale just across
different trainings or checkpoints), so the erosion of the dataset doesn't
seem to be adding serious noise here.

Here's the same treatment for mAP:

[image: mean mAP, with ±1 SD shading, vs. proportion of eval set retained, 100 draws per proportion]

Now the spread across random samples at 70% is around 0.004, so again not a
huge effect in mAP, where changes below 0.01 aren't really worth paying too
much attention to. However, the means of the different proportions appear
to have a definite trend, rather than being estimates of the same
underlying value. This is not at all what I expected, and I can't explain
it off-hand, but maybe it's another reason not to use mAP (the big reason
being that mAP is conflated with the priors for each class in your
particular eval set, whereas ROC-curve metrics normalize that out). But,
even so, the bias due to the smaller set is only about another 0.005 at
proportion 0.7 (30% deletion).

So my belief is that random deletions from the eval set (which primarily
occur because videos get deleted from YouTube) are not as serious a threat
to metric repeatability as I feared at first. (In 2017, the sets were
disappearing at ~1% per month, but that seems to have slowed down). I hope
these plots reassure you too. I think there must be a different factor
causing the difference you saw in AST results.

@audioset audioset deleted a comment from lijuncheng16 Apr 9, 2022
@dpwe
Contributor

dpwe commented Apr 13, 2022

(To the curious, the comment I deleted was alerting me that my first attempt to upload the discussion of eval set erosion was missing the figures).
