[SPORTEC] Load DFL Open Data #365
Cool! I didn't know about this great dataset. However, I wonder if it is supposed to be made public. The URL is a private Figshare link. Typically, its purpose is to share private data with reviewers. Moreover, the paper that is now published includes the following statement:
I assume there is also no license for the data? |
Good point, thanks for checking that. I assumed it would be freely available simply because:
|
Hey guys, I'm responsible for including the dataset in floodlight and can answer some of your questions. Our department got permission from the DFL to publish these seven matches under CC-BY-4.0 as an educational resource and benchmark dataset. We currently have an accompanying paper in review at Nature Scientific Data, as described in floodlight's documentation of the dataset. During the revision process at SData the datasets are hosted on a private figshare repository, as @probberechts pointed out. However, the paper by Henrik Biermann and colleagues is not related. It uses different data that we are not allowed to share. Since we made the private link to the repository open source, you can feel free (and I appreciate it!!) to add it to kloppy. DataBallPy already included it in their latest release. However, please be aware of the following: First, once the dataset is published, the figshare repository will change to public and the hard-coded links will most likely change. Second, please cite the accompanying paper in the documentation. So if you want to include the dataset in its current stage, you can do this, but you will have to fix the links and references upon final publication. Cheers! |
Hi @manuba95 thanks very much for the thorough reply! Is there any objection to rehosting the data on GitHub to avoid broken links in the future? Or @probberechts should we either wait for the public URL, or should I simply put a deprecation warning in place, so that if the URL appears broken users know to update their kloppy version? Citing the paper should not be an issue; we're still working on extending the docs but we'll include it! |
@manuba95 Great! Thanks for the clarification and good luck with the review process. Is there also a license in the Figshare repo (or somewhere else) for which you could provide the link?
The main reason for using a scientific data repository is exactly to avoid broken links in the future. 😃 For example, with Zenodo, once you upload something, there's almost no possibility to remove it, and it will associate a DOI with your deposited artifacts, which will permanently link to the corresponding version. I am not very familiar with Figshare, but I assume it is similar. On the other hand, you can remove/rename/edit a GitHub repo. That being said, I don't think it is a bad idea to mirror the dataset on GitHub. It makes it easier to discover and explore the data.
If @manuba95 is ok with it, I would make it available now. I would not use a deprecation warning, but a regular warning seems a good idea.
I think adding the citation to the docstring would be sufficient for now. Once we update the docs, we can include something like "if you use this dataset, please cite ..." |
Thanks @probberechts I'll include a regular warning. @manuba95 is there anything to cite yet, or is this only relevant after publication of your Nature paper? |
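For reference, a minimal sketch of what such a regular warning could look like. The function name and the placeholder URL are hypothetical and are not taken from the kloppy code base. One reason to prefer `UserWarning` here is that Python hides `DeprecationWarning` by default unless it is triggered from code running in `__main__`, so many users would never see it.

```python
import warnings

# Hypothetical placeholder; the real Figshare link is not reproduced here.
DATA_URL = "https://example.invalid/dfl-open-data"


def warn_about_temporary_url(url: str = DATA_URL) -> None:
    """Emit a regular UserWarning that the data URL may change."""
    warnings.warn(
        f"The DFL open data is currently served from a temporary link ({url}). "
        "If the download fails, the link may have changed after publication; "
        "try upgrading kloppy.",
        UserWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

Calling this once at the start of the loader would show the notice each time the data is fetched from the temporary link.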
Since it is published under CC-BY-4.0 you can share and distribute anywhere you like, from a licensing perspective. But imo (and as @probberechts pointed out) GitHub is not a data repository, figshare is. Once published (hopefully within the next weeks, fingers crossed) the links will be permanent and accessible. You can decide whether your users need another way to access the raw data. The repository does not have a license for the data. We have a letter of permission from the DFL that we also shared with the editorial office of SData. @UnravelSports we currently cite the paper as follows: Bassek, M., Weber, H., Rein, R., & Memmert, D. (2024). An integrated dataset of synchronized spatiotemporal and event data in elite soccer. In Submission. Which is sufficient until publication. After that I can update you with the final reference! |
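As an illustration of putting the citation in the docstring for now, a sketch along these lines might do. The function name, signature, and stub body are assumptions for illustration only, not the loader from this PR; only the citation text comes from the comment above.

```python
def load_open_data(match_id: str):
    """Load one match of the DFL open data (hypothetical stub).

    If you use this dataset, please cite:

        Bassek, M., Weber, H., Rein, R., & Memmert, D. (2024).
        An integrated dataset of synchronized spatiotemporal and event
        data in elite soccer. In Submission.
    """
    # The real loading logic is out of scope for this sketch.
    raise NotImplementedError("illustrative stub only")
```

The reference would need updating once the final publication details are available.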
Thanks @manuba95, I will include it. I understand GitHub is not file storage, but it seemed like a decent solution since other files supported by Kloppy are also hosted there (StatsBomb for example). Just wanted to make sure the data persists, and I'm not familiar with figshare. Either way, please update us with any information (new source, final reference) after publication when you find some time! |
Make sure to add a license to the repository once it gets published! Without a license, data is not truly open. This would be a great dataset for benchmarking but without an explicit license together with the data, we would not be allowed to use it in our research. |
I did some refactoring and added documentation. Do you agree with my changes @UnravelSports? |
Awesome @probberechts thanks for that, it looks a lot cleaner! |
The Floodlight Package has a way to load 7 games of DFL event and tracking data.
I have created an implementation to load this same data directly into Kloppy. We simply add:
I've added the following in the same file, but I'm not sure how desirable this is and if it needs to be put in another file.
Finally, when loading match_id="J03WN1" or match_id="J03WN1" tracking data I get an error, but match_id="J03WPY" works fine. So, this would be an issue that we'd have to resolve, although it seems unrelated to this specific implementation.
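To narrow down which matches fail, a small helper like the following could be run over all match IDs. This is only a sketch: `find_failing_matches` is a hypothetical name, and `load` stands in for whatever loader function is being tested, passed in as a callable so that nothing about the actual loader's API is assumed.

```python
from typing import Callable, Dict, Iterable, Optional


def find_failing_matches(
    load: Callable[[str], object], match_ids: Iterable[str]
) -> Dict[str, Optional[Exception]]:
    """Try to load each match; map match_id to the exception raised, or None."""
    results: Dict[str, Optional[Exception]] = {}
    for match_id in match_ids:
        try:
            load(match_id)
        except Exception as exc:  # keep going so every match gets checked
            results[match_id] = exc
        else:
            results[match_id] = None
    return results
```

The returned mapping keeps the exception objects, so the failing matches can be inspected individually afterwards.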