From 7079280973b2e9497eb01763e718ef82ae7736ff Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 15:49:28 +0200 Subject: [PATCH 1/7] cosmetics + citation --- README.md | 25 +++++++++---------------- 1 file changed, 9 insertions(+), 16 deletions(-) diff --git a/README.md b/README.md index 854078e..2684d37 100644 --- a/README.md +++ b/README.md @@ -3,13 +3,12 @@ TODO doi zenodo -Discogs-VI is a dataset of [music version](https://en.wikipedia.org/wiki/Cover_version) metadata and precomputed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset. +Discogs-VI is a dataset of [musical version](https://en.wikipedia.org/wiki/Cover_version) metadata and precomputed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset. In the VI literature the set of tracks that are versions of each other is defined as a *clique*. Here’s an example of the metadata for a [clique](./data/example_clique.json). 
*Discogs-VI* contains about 1.9 million versions belonging to around 348,000 cliques, while *Discogs-VI-YT* includes 493,000 versions across 98,000 cliques. This website accompanies the dataset and the related publication, providing summary information, instructions on access and usage, as well as the code to re-create the dataset, including audio downloads from the matched YouTube videos. - ## Table of contents * [Discogs](#discogs) @@ -31,7 +30,6 @@ This website accompanies the dataset and the related publication, providing summ Discogs regularly releases public [data dumps](https://www.discogs.com/data) containing comprehensive release metadata (such as artists, genres, styles, labels, release year, and country). See an [example](https://www.discogs.com/Prodigy-Firestarter/release/3804513) of a release page. See how the Discogs database is built [here](https://support.discogs.com/hc/en-us/articles/360008545114-Overview-Of-How-DiscogsIs-Built). You can see some statistics for all music releases submitted to Discogs on their [explore page](https://www.discogs.com/search/). - ## Dependencies We use Python 3.10.9 on Linux. @@ -49,10 +47,9 @@ Three types of data are associated with the dataset: clique metadata (*Discogs-V ### Metadata -TODO zip the files, describe the contents in a readme file inside the main directory. add a license. TODO upload to zenodo, add the url here. -We provide all the metadata including the intermediary files of the dataset creation process and the final outputs. Due to their sizes they are separated into two directories so that one does not have to download everything. If your goal is to download the main metadata and start working, download `discogs_20240701/main/` (21GB before compressing). If for some reason you are interested in the intermediary files, you download `discogs_20240701/intermediary/` (46GB before compressing). Contents of these folders are provided in [this section](#data-structure). 
+We provide the dataset including the intermediary files of the creation process. Due to their sizes they are separated into two directories so that an intereseted person do not have to download everything. If your goal is to download the main metadata and start working, download `discogs_20240701/main/` (21GB before compressing). If for some reason you are interested in the intermediary files, you download `discogs_20240701/intermediary/` (46GB before compressing). Contents of these folders are provided in [this section](#data-structure). ### Audio @@ -82,9 +79,7 @@ python discogs_vi_yt/audio_download_yt/download_missing_version_youtube_urls.py ### Audio representations -This repository does not contain the code for extracting the CQT audio representations used to train the `Discogs-VINet` described in the paper, nor the features themselves. The model and code to extract the features are available in a separate [repository](https://github.com/raraz15/Discogs-VINet). The features we extracted are available upon request for non-commercial scientific research purposes only. Please contact [Music Technology Group](https://www.upf.edu/web/mtg/contact) to make a request. - -Contact: R. Oğuz Araz +This repository does not contain the code for extracting the CQT audio representations used to train the `Discogs-VINet` described in the paper, nor the features themselves. The model and code to extract the features are available in a separate [repository](https://github.com/raraz15/Discogs-VINet). The extracted features are available upon request for non-commercial scientific research purposes. Please contact [Music Technology Group](https://www.upf.edu/web/mtg/contact) to make a request. ## Data Structure @@ -171,18 +166,16 @@ The steps to re-create the dataset is detailed in a separate [README](./README-r ## Cite -TODO: - Please cite the following publication when using the dataset: -> Araz, R. Oguz and Serra, Xavier and Bogdanov, Dmitry +> R. O. Araz, X. Serra, and D. 
Bogdanov, "Discogs-VI: A musical version identification dataset based on public editorial metadata," in Proc. of the 25th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2024. ```bibtex -@conference {, - author = "Araz, R. Oguz and Serra, Xavier and Bogdanov, Dmitry", - title = "Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Data", - booktitle = "", - year = "2024", +@inproceedings{araz_discogs-vi_2024, + title = {Discogs-{VI}: {A} musical version identification dataset based on public editorial metadata}, + booktitle = {Proc. of the 25th {Int}. {Soc}. for {Music} {Information} {Retrieval} {Conf}. ({ISMIR})}, + author = {Araz, R. Oguz and Serra, Xavier and Bogdanov, Dmitry}, + year = {2024}, } ``` From 76941daa11e5af7439d7430b9639fda2dd9431a8 Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 17:31:32 +0200 Subject: [PATCH 2/7] uncompressed --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2684d37..f426fea 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,7 @@ Three types of data are associated with the dataset: clique metadata (*Discogs-V TODO upload to zenodo, add the url here. -We provide the dataset including the intermediary files of the creation process. Due to their sizes they are separated into two directories so that an intereseted person do not have to download everything. If your goal is to download the main metadata and start working, download `discogs_20240701/main/` (21GB before compressing). If for some reason you are interested in the intermediary files, you download `discogs_20240701/intermediary/` (46GB before compressing). Contents of these folders are provided in [this section](#data-structure). +We provide the dataset including the intermediary files of the creation process. Due to their sizes they are separated into two directories so that you do not have to download everything. 
If your goal is to use the main metadata and start working, download `discogs_20240701/main.zip` (1.4GB compressed, 21GB uncompressed). If for some reason you are interested in the intermediary files, download `discogs_20240701/intermediary.zip` (8.7GB compressed, 46GB uncompressed). Contents of these folders are provided in [this section](#data-structure). ### Audio From c6548388107c07c2e050e2302476db2eeb9e35e7 Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 17:33:48 +0200 Subject: [PATCH 3/7] some more fixes --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f426fea..297b3fa 100644 --- a/README.md +++ b/README.md @@ -65,7 +65,7 @@ However, `Discogs-VI-20240701.jsonl.youtube_query_matched` contains more version python discogs_vi_yt/audio_download_yt/download_missing_version_youtube_urls.py Discogs-VI-20240701.jsonl.youtube_query_matched music_dir/ ``` -**NOTE**: We recommend parallelizing this operation because there are many audio files using `utilities/shuffle_and_split.sh`. However, if you use too many parallel processes you may get banned from YouTube. We experimented with 2-20 processes. After 10 processes we got banned a few times. In that case, you should stop downloading and wait a couple of days before trying again. +**NOTE**: We recommend parallelizing this operation using `utilities/shuffle_and_split.sh` because there are many audio files. However, if you use too many parallel processes you may get banned from YouTube. We experimented with 2-20 processes. Using more than 10 processes got us banned a few times. In that case, you should stop downloading and wait a couple of days before trying again.
```bash utilities/shuffle_and_split.sh Discogs-VI-YT-20240701.jsonl 16 From 10368e52b547bc6235980f046cedfae4d5a83705 Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 17:56:53 +0200 Subject: [PATCH 4/7] improve readme --- README.md | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/README.md b/README.md index 297b3fa..7ea4ccf 100644 --- a/README.md +++ b/README.md @@ -87,24 +87,25 @@ Below you can find some information about the contents of the dataset and how to ### Main files -* `Discogs-VI-20240701.jsonl` corresponds to the *Discogs-VI* dataset which contains all identified cliques and their metadata. The versions are not mapped to Youtube IDs. -* `Discogs-VI-YT-20240701.jsonl` corresponds to *Discogs-VI-YT* dataset subset, with versions mapped to YouTube IDs and post-processing to ensure that each clique has at least two downloaded versions. -* However we could match much more videos than we could download in Barcelona between 2023-2024. Maybe depending on your location you can download more. `Discogs-VI-20240701.jsonl.youtube_query_matched` contains all these videos. - * Some versions are matched to more than one alternative YouTube ID (1.4 videos per version on average) and the matches are sorted from the highest quality match to the lowest, although all matches are matches to official uploads. -* `Discogs-VI-20240701.jsonl` and `Discogs-VI-YT-20240701.jsonl` contain rich metadata, therefore these files are large in size (around 7 GB and 4 GB). Therefore we provide a file where only clique, version, and Youtube IDs are provided: `Discogs-VI-YT-light-20240701.json` -* We then create train, validation, and test partitions from `Discogs-VI-YT-light-20240701.jsonl` after dealing with Da-TACOS and SHS100K datasets (see the paper for more information). -* `discogs_20240701_artists.xml.jsonl.clean` contains detailed artist related information. 
+* `Discogs-VI-20240701.jsonl` corresponds to the *Discogs-VI* dataset which contains all identified cliques and their metadata. The versions are not matched to YouTube IDs. +* `Discogs-VI-YT-20240701.jsonl` corresponds to the *Discogs-VI-YT* subset, with versions matched to YouTube IDs and with post-processing applied to ensure that each clique has at least two downloaded versions. +* However, we could match more videos than we could download in Barcelona between 2023-2024. Depending on your location, you may be able to download more than we did. `Discogs-VI-20240701.jsonl.youtube_query_matched` contains all these YouTube IDs. + * Some versions are matched to more than one alternative YouTube ID (1.4 videos per version on average) and the matches are sorted from the highest quality match to the lowest, although all YouTube IDs are official uploads. +* `Discogs-VI-20240701.jsonl` and `Discogs-VI-YT-20240701.jsonl` contain rich metadata, therefore these files are large in size (around 7 GB and 4 GB). Therefore we provide a file where only clique, version, and Youtube IDs are provided: `Discogs-VI-YT-light-20240701.json`. This file is the basis for training neural networks. +* We then create train, validation, and test partitions from `Discogs-VI-YT-light-20240701.json` after dealing with the test sets of the Da-TACOS and SHS100K datasets (see the paper for more information). + * `Discogs-VI-YT-20240701-light.json.train`, `Discogs-VI-YT-20240701-light.json.val`, `Discogs-VI-YT-20240701-light.json.test` +* `discogs_20240701_artists.xml.jsonl.clean` contains detailed artist related information that may be usefull. +* `Discogs-VI-YT-20240701.jsonl.demo` should be used with the Streamlit demo for visualization purposes. **NOTE**: Every clique and version has a unique ID associated with it. Currently the clique IDs change between Discogs dumps (will be fixed in the code later).
### Intermediary files -* `discogs_20240701_artists.xml.jsonl` is the Discogs artist data dump xml file parsed to a json file with some processing. It contains artist information such as aliases, group memberships, or name variations. -* `discogs_20240701_releases.xml.jsonl` is the parsed releases file. +* `discogs_20240701_artists.xml.jsonl` is the Discogs artist data dump xml file parsed to a jsonl file with some processing. It contains artist information such as aliases, group memberships, or name variations. +* `discogs_20240701_releases.xml.jsonl` is the Discogs release data dump xml file parsed to a jsonl file with some processing. * `discogs_20240701_releases.xml.jsonl.clean` is the cleaned version. -* `discogs_20240701_releases.xml.jsonl.clean.tracks` parses the releases to tracks. -* `Discogs-VI-20240701-DaTACOS-SHS100K2_TEST-lost_cliques.txt` contains the clique ids in Discogs-VI that intersect with Da-TACOS and SHS100K datasets. +* `discogs_20240701_releases.xml.jsonl.clean.tracks` contains the tracks from the clean releases. It is used for identifying the cliques. +* `Discogs-VI-20240701-DaTACOS-SHS100K2_TEST-lost_cliques.txt` contains the clique ids in Discogs-VI that intersect with Da-TACOS and SHS100K test sets. * `Discogs-VI-20240701.jsonl.queries` contains the query strings that were created to search the versions on YouTube. ### Loading with python @@ -137,18 +138,18 @@ with open("Discogs-VI-YT-light-20240701.json") as in_f: # Access the data ``` +#### Rest of the files + ```python with open("discogs_20240701_artists.xml.jsonl.clean", encoding="utf-8") as infile: for jsonline in infile: artist = json.loads(jsonline) ``` -#### Rest of the files - -* `discogs_20240701_artists.xml.jsonl`, `discogs_20240701_releases.xml.jsonl`, `discogs_20240701_releases.xml.jsonl.clean`, discogs_20240701_releases.xml.jsonl.clean.tracks are JSONL files with utf-8 encoding. 
+* `discogs_20240701_artists.xml.jsonl`, `discogs_20240701_artists.xml.jsonl.clean`, `discogs_20240701_releases.xml.jsonl`, `discogs_20240701_releases.xml.jsonl.clean`, and `discogs_20240701_releases.xml.jsonl.clean.tracks` are JSONL files with utf-8 encoding. * `Discogs-VI-20240701-DaTACOS-SHS100K2_TEST-lost_cliques.txt` and `Discogs-VI-20240701.jsonl.queries` are line-delimited text files. -Please refer to the code for more examples. +Please refer to our [GitHub Repository](https://github.com/MTG/discogs-vi-dataset/) for more examples. ## Discogs-VI-YT Streamlit demo From f06a2202d745e4043ea9d79ce7724ecb40a7fc18 Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 18:17:54 +0200 Subject: [PATCH 5/7] expand citation --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 7ea4ccf..de028a7 100644 --- a/README.md +++ b/README.md @@ -169,12 +169,12 @@ The steps to re-create the dataset is detailed in a separate [README](./README-r Please cite the following publication when using the dataset: -> R. O. Araz, X. Serra, and D. Bogdanov, "Discogs-VI: A musical version identification dataset based on public editorial metadata," in Proc. of the 25th Int. Soc. for Music Information Retrieval Conf. (ISMIR), 2024. +> R. O. Araz, X. Serra, and D. Bogdanov, "Discogs-VI: A musical version identification dataset based on public editorial metadata," in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024. ```bibtex @inproceedings{araz_discogs-vi_2024, title = {Discogs-{VI}: {A} musical version identification dataset based on public editorial metadata}, - booktitle = {Proc. of the 25th {Int}. {Soc}. for {Music} {Information} {Retrieval} {Conf}. ({ISMIR})}, + booktitle = {Proceedings of the 25th {International} {Society} for {Music} {Information} {Retrieval} {Conference} ({ISMIR})}, author = {Araz, R. 
Oguz and Serra, Xavier and Bogdanov, Dmitry}, year = {2024}, } From 859d032f0f94053462e0f7fecc13293cd73d8eaf Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 20:35:13 +0200 Subject: [PATCH 6/7] update readme again --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index de028a7..47e6f0f 100644 --- a/README.md +++ b/README.md @@ -3,9 +3,9 @@ TODO doi zenodo -Discogs-VI is a dataset of [musical version](https://en.wikipedia.org/wiki/Cover_version) metadata and precomputed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset. +Discogs-VI is a dataset of [musical version](https://en.wikipedia.org/wiki/Cover_version) metadata and pre-computed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset. -In the VI literature the set of tracks that are versions of each other is defined as a *clique*. Here’s an example of the metadata for a [clique](./data/example_clique.json). 
*Discogs-VI* contains about 1.9 million versions belonging to around 348,000 cliques, while *Discogs-VI-YT* includes 493,000 versions across 98,000 cliques. +In the VI literature the set of tracks that are versions of each other is defined as a *clique*. Here’s an example of the metadata for a [clique](./data/example_clique.json). *Discogs-VI* contains approximately 1.9 million versions belonging to around 348,000 cliques, while *Discogs-VI-YT* includes approximately 493,000 versions across about 98,000 cliques. This website accompanies the dataset and the related publication, providing summary information, instructions on access and usage, as well as the code to re-create the dataset, including audio downloads from the matched YouTube videos. @@ -28,7 +28,7 @@ This website accompanies the dataset and the related publication, providing summ ## Discogs -Discogs regularly releases public [data dumps](https://www.discogs.com/data) containing comprehensive release metadata (such as artists, genres, styles, labels, release year, and country). See an [example](https://www.discogs.com/Prodigy-Firestarter/release/3804513) of a release page. See how the Discogs database is built [here](https://support.discogs.com/hc/en-us/articles/360008545114-Overview-Of-How-DiscogsIs-Built). You can see some statistics for all music releases submitted to Discogs on their [explore page](https://www.discogs.com/search/). +Discogs regularly releases public [data dumps](https://www.discogs.com/data) containing comprehensive release metadata (such as artists, genres, styles, labels, release year, and country). See an [example](https://www.discogs.com/master/92381-Benny-Benassi-Hypnotica) of a release page. See how the Discogs database is built [here](https://support.discogs.com/hc/en-us/articles/360008545114-Overview-Of-How-DiscogsIs-Built). You can see some statistics for all music releases submitted to Discogs on their [explore page](https://www.discogs.com/search/). 
## Dependencies @@ -49,7 +49,7 @@ Three types of data are associated with the dataset: clique metadata (*Discogs-V TODO upload to zenodo, add the url here. -We provide the dataset including the intermediary files of the creation process. Due to their sizes they are separated into two directories so that you do not have to download everything. If your goal is to use the main metadata and start working, download `discogs_20240701/main.zip` (1.4GB compressed, 21GB uncompressed). If for some reason you are interested in the intermediary files, download `discogs_20240701/intermediary.zip` (8.7GB compressed, 46GB uncompressed). Contents of these folders are provided in [this section](#data-structure). +We provide the dataset including the intermediary files of the creation process. Due to their sizes, they are separated into two directories so that you do not have to download everything. If your goal is to use the dataset and start working, download `main.zip` (1.4 GB compressed, 21 GB uncompressed). If for some reason you are interested in the intermediary files, download `intermediary.zip` (8.7 GB compressed, 46 GB uncompressed). Contents of these folders are provided in [this section](#data-structure). ### Audio @@ -91,11 +91,11 @@ * `Discogs-VI-YT-20240701.jsonl` corresponds to the *Discogs-VI-YT* subset, with versions matched to YouTube IDs and with post-processing applied to ensure that each clique has at least two downloaded versions. * However, we could match more videos than we could download in Barcelona between 2023-2024. Depending on your location, you may be able to download more than we did. `Discogs-VI-20240701.jsonl.youtube_query_matched` contains all these YouTube IDs. 
* Some versions are matched to more than one alternative YouTube ID (1.4 videos per version on average) and the matches are sorted from the highest quality match to the lowest, although all YouTube IDs are official uploads. -* `Discogs-VI-20240701.jsonl` and `Discogs-VI-YT-20240701.jsonl` contain rich metadata, therefore these files are large in size (around 7 GB and 4 GB). Therefore we provide a file where only clique, version, and Youtube IDs are provided: `Discogs-VI-YT-light-20240701.json`. This file is the basis for training neural networks. +* `Discogs-VI-20240701.jsonl` and `Discogs-VI-YT-20240701.jsonl` contain rich metadata and they are large in size (around 7 GB and 4 GB). Therefore, we provide a file that contains only clique, version, and YouTube IDs: `Discogs-VI-YT-light-20240701.json`. This file is the basis for training neural networks. * We then create train, validation, and test partitions from `Discogs-VI-YT-light-20240701.json` after dealing with the test sets of the Da-TACOS and SHS100K datasets (see the paper for more information). * `Discogs-VI-YT-20240701-light.json.train`, `Discogs-VI-YT-20240701-light.json.val`, `Discogs-VI-YT-20240701-light.json.test` -* `discogs_20240701_artists.xml.jsonl.clean` contains detailed artist related information that may be usefull. -* `Discogs-VI-YT-20240701.jsonl.demo` should be used with the Streamlit demo for visualization purposes. +* `discogs_20240701_artists.xml.jsonl.clean` contains detailed artist metadata that may be useful. +* `Discogs-VI-YT-20240701.jsonl.demo` is to be used with the Streamlit demo for visualization purposes. **NOTE**: Every clique and version has a unique ID associated with it. Currently the clique IDs change between Discogs dumps (will be fixed in the code later). 
From a0cbc8ccb96d738c553a1095a7a5eebbb665126a Mon Sep 17 00:00:00 2001 From: raraz15 Date: Wed, 23 Oct 2024 20:57:28 +0200 Subject: [PATCH 7/7] add zenodo --- README.md | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 47e6f0f..4359b36 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,6 @@ # Discogs-VI Dataset -TODO doi zenodo - +[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13983028.svg)](https://doi.org/10.5281/zenodo.13983028) Discogs-VI is a dataset of [musical version](https://en.wikipedia.org/wiki/Cover_version) metadata and pre-computed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset. @@ -47,9 +46,7 @@ Three types of data are associated with the dataset: clique metadata (*Discogs-V ### Metadata -TODO upload to zenodo, add the url here. - -We provide the dataset including the intermediary files of the creation process. Due to their sizes, they are separated into two directories so that you do not have to download everything. If your goal is to use the dataset and start working, download `main.zip` (1.4 GB compressed, 21 GB uncompressed). If for some reason you are interested in the intermediary files, download `intermediary.zip` (8.7 GB compressed, 46 GB uncompressed). Contents of these folders are provided in [this section](#data-structure). +We provide the dataset including the intermediary files of the creation process. 
Due to their sizes, they are separated into two directories so that you do not have to download everything. If your goal is to use the dataset and start working, download `main.zip` (1.4 GB compressed, 21 GB uncompressed). If for some reason you are interested in the intermediary files, download `intermediary.zip` (8.7 GB compressed, 46 GB uncompressed). Contents of these folders are provided in [this section](#data-structure). You can download the data from [Zenodo](https://doi.org/10.5281/zenodo.13983028). ### Audio