Test Set missing files causing variation in published research #8
Thanks for the message.
It’s not immediately obvious to me where the 5% number comes from, or how
you know it’s due to differences in missing files. Can you walk me through
it?
It strikes me that mAP is possibly more vulnerable than d-prime. We track
d-prime and have been surprised how consistent the overall results have
been as the data has eroded by >10% over time.
Thanks,
DAn.
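For readers following along, both metrics come from one matrix of per-clip scores and labels. Here is a minimal sketch, assuming the common AudioSet convention of deriving d-prime from ROC AUC via the inverse normal CDF; the function name and the sklearn/scipy calls are illustrative, not code from this repo:

```python
# Minimal sketch (not this repo's evaluation code) of the two metrics discussed
# above for multi-label audio tagging: mAP averages per-class average precision,
# and d-prime is derived from the mean per-class ROC AUC as sqrt(2) * Phi^-1(AUC),
# one common AudioSet convention.
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score, roc_auc_score

def map_and_dprime(scores: np.ndarray, labels: np.ndarray):
    """scores, labels: arrays of shape (num_eval_clips, num_classes); labels are 0/1."""
    aps, aucs = [], []
    for c in range(labels.shape[1]):
        pos = labels[:, c].sum()
        if pos == 0 or pos == labels.shape[0]:   # class has no positives (or no
            continue                             # negatives) in this subset; skip it
        aps.append(average_precision_score(labels[:, c], scores[:, c]))
        aucs.append(roc_auc_score(labels[:, c], scores[:, c]))
    d_prime = float(np.sqrt(2.0) * norm.ppf(np.mean(aucs)))
    return float(np.mean(aps)), d_prime
```

One plausible reason mAP is the more fragile of the two: each class's average precision depends on class prevalence and on the ranking of its (often few) positive clips, so losing a handful of positives can move a rare class's AP noticeably, whereas AUC, and therefore d-prime, is unaffected by prevalence in expectation.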
…On Wed, Apr 6, 2022 at 13:02 Billy ***@***.***> wrote:
Hi, Dan and the other contributors:
Thank you for maintaining the repo so far!
AudioSet, to me, is a great resource, and still the best resource to
understand the nature of sound.
We did a recent study: paper <https://arxiv.org/abs/2203.13448> and code
<https://github.com/lijuncheng16/AudioTaggingDoneRight>, where we found
that recent research papers have a whopping *±5%* difference in
performance due to test-set files going missing during download. Plus, the *difference
in label quality* also contributes to the performance variation and
makes comparisons less fair (see Figure 2 in our paper).
I understand you guys have legal constraints on YouTube licensing, but I guess
this issue could be easier for the original authors to address: either
advocate that the community use a common subset, or release an updated test
set, given you have already released updated strong labels.
Looking forward to your thoughts.
Hi, Dan:
Especially, the different test size could cause severe fluctuations in the final mAP reporting, as seen in Figure 2 of our paper.
Oh, I misunderstood; I thought 5% was the difference in the metric, not the
test set size. Yes, we’ve seen variations of >10% in available dataset
sizes, but there’s not much we can do since videos get taken down all the
time; from the beginning we tried to choose videos with a lower chance of
disappearing, but that wasn’t very successful.
I wouldn’t assume that differences in available videos are the main
factor in result variation; there are many other things at play. We’ve had
a very hard time matching published results, even reproducing our own past
results; sometimes it appears to be subtle changes in the underlying DNN
package (across releases) or arithmetic differences on different
accelerator hardware.
It would be very interesting to measure this directly, e.g. delete dataset
entries at random and see how that affects the resulting metric. If you’re
only looking at the impact of changes in the evaluation set, that could be
very quick, since you only need to apply the ablation in the final step
before summarizing the results across all the eval set items.
DAn.
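Because the ablation only touches the final step, it can run entirely on cached per-clip scores and labels from one evaluation pass. A minimal sketch of that idea, assuming such cached arrays and any metric helper of the form metric_fn(scores, labels); the names, number of draws, and retention fractions are illustrative:

```python
# Sketch of the ablation described above: re-score random subsets of the eval
# set at the final step, using cached per-clip scores and labels from a single
# evaluation pass, so no retraining or re-inference is needed.
# `metric_fn` is any summary computed from those arrays (e.g. mAP or d-prime).
import numpy as np

def eval_with_random_deletions(scores, labels, keep_fraction, metric_fn,
                               num_draws=100, seed=0):
    """Mean and spread of metric_fn over random subsets keeping `keep_fraction` of clips."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    k = int(round(keep_fraction * n))
    vals = []
    for _ in range(num_draws):
        idx = rng.choice(n, size=k, replace=False)   # random deletion of eval clips
        vals.append(metric_fn(scores[idx], labels[idx]))
    return float(np.mean(vals)), float(np.std(vals))

# Hypothetical usage, sweeping the retention fraction:
# for frac in (1.0, 0.9, 0.8, 0.7):
#     mean_val, spread = eval_with_random_deletions(scores, labels, frac, my_map_fn)
```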
…On Wed, Apr 6, 2022 at 13:58 Billy ***@***.***> wrote:
Hi, Dan:
Thank you again for your prompt response!
In our paper <https://arxiv.org/pdf/2203.13448.pdf>, Table 1 shows that, due to
differences in how AudioSet gets downloaded, the train and test set sizes vary
by a whopping ±5% across previous works.
1. e.g. AST (Gong et al., 2021), number of test clips: 19185 vs. ours: 20123;
|19185 - 20123| / 19185 ≈ 0.049, which is where the 5% comes from.
2. Not to mention ERANN (Verbitskiy et al.), whose test size was only
17967.
- We used the exact AST training pipeline and observed a 2.5% mAP drop when
using our test set vs. theirs.
In particular, the differing test-set composition can cause severe fluctuations
in the final reported mAP, as seen in Figure 2 of our paper:
e.g. one could have downloaded the lower-label-quality test samples, which
tank the score,
or one could test only on high-label-quality samples and report a
higher mAP.
Hi, Dan:
All the models listed here are SOTA results I have reproduced/implemented. If you are interested, you can try running my pipeline here.
Again, I appreciate your correspondence a lot here! I feel trying to solve it here on GitHub could be faster and maybe easier than going for a chat at Interspeech or ICASSP; of course, I would love to do that if you are going to attend. As an audio ML researcher, I have always felt AudioSet could be the ImageNet of the audio community, given you have already spent lots of effort and resources collecting it. That's why I strongly feel a fair comparison on AudioSet would be helpful and is actually a pressing task. Hopefully, you get where I am coming from.
Subsetting classes is definitely going to have a large influence on the …
Your figure 2, showing that the average over subsets of these points …
In practice, of course, multi-label makes it impossible to select all the …
But the missing 1000+ segments in the smaller downloaded eval sets aren't …
I tested the impact of random deletions by taking multiple random subsets …
[figure: d-prime across random eval-set subsets]
We see that the average across 100 draws is approximately constant across …
Here's the same treatment for mAP:
[figure: mAP across random eval-set subsets]
Now the spread across random samples at 70% is around 0.004, so again not a …
So my belief is that random deletions from the eval set (which primarily …
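To make the contrast in the (truncated) comment above concrete, here is a self-contained illustration on purely synthetic data of why uniform random deletion and class-biased deletion behave differently for a class-averaged metric like mAP; none of the numbers refer to AudioSet or to the experiment described above:

```python
# Purely synthetic illustration (NOT AudioSet data, NOT the experiment above):
# uniform random deletion leaves per-class prevalence unchanged in expectation,
# while deletion biased against one class's positives lowers that class's
# prevalence and hence its average precision, which shifts the class-averaged
# mAP by roughly (change in that AP) / num_classes.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_clips, n_classes = 5000, 20
labels = (rng.random((n_clips, n_classes)) < 0.05).astype(int)   # sparse multi-label
scores = rng.normal(size=(n_clips, n_classes)) + 1.5 * labels    # an imperfect classifier

def mean_ap(s, y):
    cols = [c for c in range(y.shape[1]) if 0 < y[:, c].sum() < len(y)]
    return float(np.mean([average_precision_score(y[:, c], s[:, c]) for c in cols]))

# (a) Uniform random deletion: keep 70% of clips, chosen uniformly.
keep = rng.choice(n_clips, size=int(0.7 * n_clips), replace=False)

# (b) Biased deletion: drop 30% of clips, preferentially those positive for class 0.
weights = np.where(labels[:, 0] == 1, 3.0, 1.0)
drop = rng.choice(n_clips, size=int(0.3 * n_clips), replace=False, p=weights / weights.sum())
mask = np.ones(n_clips, dtype=bool)
mask[drop] = False

for name, (s, y) in {"full": (scores, labels),
                     "uniform 70%": (scores[keep], labels[keep]),
                     "class-biased 70%": (scores[mask], labels[mask])}.items():
    print(f"{name:>16}:  class-0 AP = {average_precision_score(y[:, 0], s[:, 0]):.3f}"
          f"   mAP = {mean_ap(s, y):.3f}")
```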
(To the curious, the comment I deleted was alerting me that my first attempt to upload the discussion of eval set erosion was missing the figures.)