Cemgil Scores > 1 #414
Comments
Yes, that's definitely not the expected behaviour — but I am not sure if it is a bug in Cemgil's metric only. Swapping annotations and detections reduces Cemgil's score to 0.601, which is way more reasonable. However, I am not sure at all whether the other metrics behave as intended. Since your example has way more detections than annotations, information gain is also pretty close to its maximum. And even simple metrics like F-measure should be low given that many false positive detections. So it looks more like we are generally unable to handle that many detections. I think it is safest to do proper peak picking after thresholding. Have you tried ...?

P.S. I am not sure why I was swapping them in the first place, so I have to rethink the whole issue. Also I have to compare the results with ...
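As an aside, here is a minimal sketch of the difference between plain thresholding and peak picking on a beat activation function. `scipy.signal.find_peaks` is used only as a generic stand-in; the frame rate and the toy activation are assumptions, not madmom code:

```python
import numpy as np
from scipy.signal import find_peaks

fps = 100  # assumed frame rate of the beat activation function
t = np.arange(0, 3, 1. / fps)
# toy activation: Gaussian bumps around "true" beats at 1 s and 2 s
activations = np.exp(-(t - 1.) ** 2 / 0.001) + np.exp(-(t - 2.) ** 2 / 0.001)

threshold = 0.5
# plain thresholding: every frame above the threshold becomes a detection,
# so each beat produces a whole cluster of detections
cluster = t[activations > threshold]

# peak picking on top of the threshold: one detection per local maximum
peaks, _ = find_peaks(activations, height=threshold)
detections = t[peaks]

print(len(cluster), len(detections))  # e.g. 10 vs. 2
```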
Yeah, you are right — P-Score is where I also noticed it (although it never went > 1). Hmm, maybe we should rename this issue... Yes, in most cases peak picking fixes it completely. However, in the worst case (online networks/online algorithms) I'm still getting a ~3% difference for P-Score and Cemgil depending on the swapping. After all, this is an edge case, but probably still worth a closer look 🤷‍♂️
The issue
Hi there, using simple thresholding as a beat detection method can represent a viable baseline when evaluating different beat trackers. By its nature, thresholding tends to gather multiple detections around an annotation. However, this leads to Cemgil scores > 1 and thus usually ranks thresholding higher than all other algorithms. The reason is that annotations and detections are swapped (in contrast to the original implementation). This is also commented in the code:
madmom/madmom/evaluation/beats.py, lines 473–476 (at 41155f4)
I guess the swapping was done to prevent confusion about the parameter names? For a 'normal' use case this works fine, but I'm curious whether we should rethink that swapping...
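To make the effect concrete, here is a minimal NumPy sketch of the Cemgil measure computed in both directions. This is a simplified reconstruction for illustration, not madmom's actual code; sigma = 0.04 s follows the common definition of the measure:

```python
import numpy as np

SIGMA = 0.04  # Gaussian std. dev. in seconds (the usual Cemgil setting)

def cemgil(detections, annotations, sigma=SIGMA):
    """For each ANNOTATION, score the closest detection with a Gaussian."""
    detections = np.asarray(detections, dtype=float)
    annotations = np.asarray(annotations, dtype=float)
    if not len(detections) or not len(annotations):
        return 0.
    # distance from every annotation to its closest detection
    errors = np.min(np.abs(annotations[:, None] - detections[None, :]), axis=1)
    acc = np.sum(np.exp(-errors ** 2 / (2. * sigma ** 2)))
    # normalise by the mean number of detections and annotations
    return acc / (0.5 * (len(detections) + len(annotations)))

# one annotated beat, five thresholding detections clustered around it
annotations = [1.0]
detections = [0.96, 0.98, 1.00, 1.02, 1.04]

# original direction: the sum has at most len(annotations) terms, so the
# score is bounded by 2 * N_ann / (N_ann + N_det) <= 1
print(cemgil(detections, annotations))  # ~0.33

# swapped direction: the sum has one term per DETECTION, so a cluster of
# detections around a single annotation pushes the score above 1
print(cemgil(annotations, detections))  # ~1.33
```

With the swap, the bound becomes 2 * N_det / (N_ann + N_det), which approaches 2 when detections heavily outnumber annotations, consistent with the 1.233 reported below.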
Steps needed to reproduce the behaviour
Just evaluate this example. This results in Cemgil = 1.233.
detections.txt
annotations.txt
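For convenience, something along these lines should reproduce it, assuming `BeatEvaluation` accepts plain arrays of beat times as in madmom's evaluation API:

```python
import numpy as np
from madmom.evaluation.beats import BeatEvaluation

# the attached files contain one beat time (in seconds) per line
detections = np.loadtxt('detections.txt')
annotations = np.loadtxt('annotations.txt')

e = BeatEvaluation(detections, annotations)
print(e.cemgil)  # ~1.233 with the current swapping
```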