Merge pull request #2 from MTG/reviewed

cosmetics + citation
MTG · Oct 24, 2024 · ca798eb
2 parents 2675ef7 + a0cbc8c
commit ca798eb
Showing 1 changed file with 30 additions and 39 deletions.
diff --git a/README.md b/README.md
@@ -1,15 +1,13 @@
 # Discogs-VI Dataset
 
-TODO doi zenodo
-<!-- [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3826813.svg)](https://doi.org/10.5281/zenodo.3826813) -->
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13983028.svg)](https://doi.org/10.5281/zenodo.13983028)
 
-Discogs-VI is a dataset of [music version](https://en.wikipedia.org/wiki/Cover_version) metadata and precomputed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset.
+Discogs-VI is a dataset of [musical version](https://en.wikipedia.org/wiki/Cover_version) metadata and pre-computed audio representations, created for research on version identification (VI), also referred to as cover song identification (CSI). It was created using editorial metadata from the public [Discogs](https://discogs.com) music database by identifying version relationships among millions of tracks, utilizing metadata matching based on artist and writer credits as well as track title metadata. The identified versions comprise the *Discogs-VI* dataset, with a large portion of it mapped to official music uploads on YouTube, resulting in the *Discogs-VI-YT* subset.
 
-In the VI literature the set of tracks that are versions of each other is defined as a *clique*. Here’s an example of the metadata for a [clique](./data/example_clique.json). *Discogs-VI* contains about 1.9 million versions belonging to around 348,000 cliques, while *Discogs-VI-YT* includes 493,000 versions across 98,000 cliques.
+In the VI literature the set of tracks that are versions of each other is defined as a *clique*. Here’s an example of the metadata for a [clique](./data/example_clique.json). *Discogs-VI* contains approximately 1.9 million versions belonging to around 348,000 cliques, while *Discogs-VI-YT* includes approximately 493,000 versions across about 98,000 cliques.
 
 This website accompanies the dataset and the related publication, providing summary information, instructions on access and usage, as well as the code to re-create the dataset, including audio downloads from the matched YouTube videos.
 
-
 ## Table of contents
 
 * [Discogs](#discogs)
@@ -29,8 +27,7 @@ This website accompanies the dataset and the related publication, providing summ
 
 ## Discogs
 
-Discogs regularly releases public [data dumps](https://www.discogs.com/data) containing comprehensive release metadata (such as artists, genres, styles, labels, release year, and country). See an [example](https://www.discogs.com/Prodigy-Firestarter/release/3804513) of a release page. See how the Discogs database is built [here](https://support.discogs.com/hc/en-us/articles/360008545114-Overview-Of-How-DiscogsIs-Built). You can see some statistics for all music releases submitted to Discogs on their [explore page](https://www.discogs.com/search/).
-
+Discogs regularly releases public [data dumps](https://www.discogs.com/data) containing comprehensive release metadata (such as artists, genres, styles, labels, release year, and country). See an [example](https://www.discogs.com/master/92381-Benny-Benassi-Hypnotica) of a release page. See how the Discogs database is built [here](https://support.discogs.com/hc/en-us/articles/360008545114-Overview-Of-How-DiscogsIs-Built). You can see some statistics for all music releases submitted to Discogs on their [explore page](https://www.discogs.com/search/).
 
 ## Dependencies
 
@@ -49,10 +46,7 @@ Three types of data are associated with the dataset: clique metadata (*Discogs-V
 
 ### Metadata
 
-TODO zip the files, describe the contents in a readme file inside the main directory. add a license.
-TODO upload to zenodo, add the url here.
-
-We provide all the metadata including the intermediary files of the dataset creation process and the final outputs. Due to their sizes they are separated into two directories so that one does not have to download everything. If your goal is to download the main metadata and start working, download `discogs_20240701/main/` (21GB before compressing). If for some reason you are interested in the intermediary files, you download `discogs_20240701/intermediary/` (46GB before compressing). Contents of these folders are provided in [this section](#data-structure).
+We provide the dataset, including the intermediary files of the creation process. Due to their sizes, they are separated into two directories so that you do not have to download everything. If your goal is to use the dataset and start working, download `main.zip` (1.4 GB compressed, 21 GB uncompressed). If you are interested in the intermediary files, download `intermediary.zip` (8.7 GB compressed, 46 GB uncompressed). Contents of these folders are provided in [this section](#data-structure). You can download the data from [Zenodo](https://doi.org/10.5281/zenodo.13983028).
 
 ### Audio
 
@@ -68,7 +62,7 @@ However, `Discogs-VI-20240701.jsonl.youtube_query_matched` contains more version
 python discogs_vi_yt/audio_download_yt/download_missing_version_youtube_urls.py Discogs-VI-20240701.jsonl.youtube_query_matched music_dir/
 ```
 
-**NOTE**: We recommend parallelizing this operation because there are many audio files using `utilities/shuffle_and_split.sh`. However, if you use too many parallel processes you may get banned from YouTube. We experimented with 2-20 processes. After 10 processes we got banned a few times. In that case, you should stop downloading and wait a couple of days before trying again.
+**NOTE**: Because there are many audio files, we recommend parallelizing this operation using `utilities/shuffle_and_split.sh`. However, if you use too many parallel processes you may get banned from YouTube. We experimented with 2-20 processes. Using more than 10 processes got us banned a few times. In that case, you should stop downloading and wait a couple of days before trying again.
 
 ```bash
 utilities/shuffle_and_split.sh Discogs-VI-YT-20240701.jsonl 16
@@ -82,34 +76,33 @@ python discogs_vi_yt/audio_download_yt/download_missing_version_youtube_urls.py
 
 ### Audio representations
 
-This repository does not contain the code for extracting the CQT audio representations used to train the `Discogs-VINet` described in the paper, nor the features themselves. The model and code to extract the features are available in a separate [repository](https://github.com/raraz15/Discogs-VINet). The features we extracted are available upon request for non-commercial scientific research purposes only. Please contact [Music Technology Group](https://www.upf.edu/web/mtg/contact) to make a request.
-
-Contact: R. Oğuz Araz <[email protected]>
+This repository does not contain the code for extracting the CQT audio representations used to train the `Discogs-VINet` described in the paper, nor the features themselves. The model and code to extract the features are available in a separate [repository](https://github.com/raraz15/Discogs-VINet). The extracted features are available upon request for non-commercial scientific research purposes. Please contact [Music Technology Group](https://www.upf.edu/web/mtg/contact) to make a request.
 
 ## Data Structure
 
 Below you can find some information about the contents of the dataset and how to load them using Python.
 
 ### Main files
 
-* `Discogs-VI-20240701.jsonl` corresponds to the *Discogs-VI* dataset which contains all identified cliques and their metadata. The versions are not mapped to Youtube IDs.
-* `Discogs-VI-YT-20240701.jsonl` corresponds to *Discogs-VI-YT* dataset subset, with versions mapped to YouTube IDs and post-processing to ensure that each clique has at least two downloaded versions.
-* However we could match much more videos than we could download in Barcelona between 2023-2024. Maybe depending on your location you can download more. `Discogs-VI-20240701.jsonl.youtube_query_matched` contains all these videos.
-  * Some versions are matched to more than one alternative YouTube ID (1.4 videos per version on average) and the matches are sorted from the highest quality match to the lowest, although all matches are matches to official uploads.
-* `Discogs-VI-20240701.jsonl` and `Discogs-VI-YT-20240701.jsonl` contain rich metadata, therefore these files are large in size (around 7 GB and 4 GB). Therefore we provide a file where only clique, version, and Youtube IDs are provided: `Discogs-VI-YT-light-20240701.json`
-* We then create train, validation, and test partitions from `Discogs-VI-YT-light-20240701.jsonl` after dealing with Da-TACOS and SHS100K datasets (see the paper for more information).
-* `discogs_20240701_artists.xml.jsonl.clean` contains detailed artist related information.
-* `Discogs-VI-YT-20240701.jsonl.demo` should be used with the Streamlit demo for visualization purposes.
+* `Discogs-VI-20240701.jsonl` corresponds to the *Discogs-VI* dataset, which contains all identified cliques and their metadata. The versions are not matched to YouTube IDs.
+* `Discogs-VI-YT-20240701.jsonl` corresponds to the *Discogs-VI-YT* subset, with versions matched to YouTube IDs and with post-processing applied to ensure that each clique has at least two downloaded versions.
+* However, we could match more videos than we could download in Barcelona between 2023 and 2024. Depending on your location, you may be able to download more than we did. `Discogs-VI-20240701.jsonl.youtube_query_matched` contains all these YouTube IDs.
+  * Some versions are matched to more than one alternative YouTube ID (1.4 videos per version on average) and the matches are sorted from the highest quality match to the lowest, although all YouTube IDs are official uploads.
+* `Discogs-VI-20240701.jsonl` and `Discogs-VI-YT-20240701.jsonl` contain rich metadata and are large in size (around 7 GB and 4 GB). Therefore, we provide a file where only clique, version, and YouTube IDs are provided: `Discogs-VI-YT-light-20240701.json`. This file is the basis for training neural networks.
+* We then create train, validation, and test partitions from `Discogs-VI-YT-light-20240701.json` after dealing with the test sets of the Da-TACOS and SHS100K datasets (see the paper for more information).
+  * `Discogs-VI-YT-20240701-light.json.train`, `Discogs-VI-YT-20240701-light.json.val`, `Discogs-VI-YT-20240701-light.json.test`
+* `discogs_20240701_artists.xml.jsonl.clean` contains detailed artist metadata that may be useful.
+* `Discogs-VI-YT-20240701.jsonl.demo` is to be used with the Streamlit demo for visualization purposes.
 
 **NOTE**: Every clique and version has a unique ID associated with it. Currently, the clique IDs change between Discogs dumps (this will be fixed in the code later).
 
 ### Intermediary files
 
-* `discogs_20240701_artists.xml.jsonl` is the Discogs artist data dump xml file parsed to a json file with some processing. It contains artist information such as aliases, group memberships, or name variations.
-* `discogs_20240701_releases.xml.jsonl` is the parsed releases file.
+* `discogs_20240701_artists.xml.jsonl` is the Discogs artist data dump xml file parsed to a jsonl file with some processing. It contains artist information such as aliases, group memberships, or name variations.
+* `discogs_20240701_releases.xml.jsonl` is the Discogs release data dump xml file parsed to a jsonl file with some processing.
 * `discogs_20240701_releases.xml.jsonl.clean` is the cleaned version.
-* `discogs_20240701_releases.xml.jsonl.clean.tracks` parses the releases to tracks.
-* `Discogs-VI-20240701-DaTACOS-SHS100K2_TEST-lost_cliques.txt` contains the clique ids in Discogs-VI that intersect with Da-TACOS and SHS100K datasets.
+* `discogs_20240701_releases.xml.jsonl.clean.tracks` contains the tracks from the clean releases. It is used for identifying the cliques.
+* `Discogs-VI-20240701-DaTACOS-SHS100K2_TEST-lost_cliques.txt` contains the clique ids in Discogs-VI that intersect with the Da-TACOS and SHS100K test sets.
 * `Discogs-VI-20240701.jsonl.queries` contains the query strings that were created to search for the versions on YouTube.
 
 ### Loading with python
@@ -142,18 +135,18 @@ with open("Discogs-VI-YT-light-20240701.json") as in_f:
 # Access the data
 ```
 
+#### Rest of the files
+
 ```python
 with open("discogs_20240701_artists.xml.jsonl.clean", encoding="utf-8") as infile:
     for jsonline in infile:
         artist = json.loads(jsonline)
 ```
 
-#### Rest of the files
-
-* `discogs_20240701_artists.xml.jsonl`, `discogs_20240701_releases.xml.jsonl`, `discogs_20240701_releases.xml.jsonl.clean`, discogs_20240701_releases.xml.jsonl.clean.tracks are JSONL files with utf-8 encoding.
+* `discogs_20240701_artists.xml.jsonl`, `discogs_20240701_artists.xml.jsonl.clean`, `discogs_20240701_releases.xml.jsonl`, `discogs_20240701_releases.xml.jsonl.clean`, and `discogs_20240701_releases.xml.jsonl.clean.tracks` are JSONL files with utf-8 encoding.
 * `Discogs-VI-20240701-DaTACOS-SHS100K2_TEST-lost_cliques.txt` and `Discogs-VI-20240701.jsonl.queries` are line-delimited text files.
 
-Please refer to the code for more examples.
+Please refer to our [GitHub Repository](https://github.com/MTG/discogs-vi-dataset/) for more examples.
 
 ## Discogs-VI-YT Streamlit demo
 
@@ -171,18 +164,16 @@ The steps to re-create the dataset is detailed in a separate [README](./README-r
 
 ## Cite
 
-TODO:
-
 Please cite the following publication when using the dataset:
 
-> Araz, R. Oguz and Serra, Xavier and Bogdanov, Dmitry
+> R. O. Araz, X. Serra, and D. Bogdanov, "Discogs-VI: A musical version identification dataset based on public editorial metadata," in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024.
 
 ```bibtex
-@conference {,
-    author = "Araz, R. Oguz and Serra, Xavier and Bogdanov, Dmitry",
-    title = "Discogs-VI: A Musical Version Identification Dataset Based on Public Editorial Data",
-    booktitle = "",
-    year = "2024",
+@inproceedings{araz_discogs-vi_2024,
+ title = {Discogs-{VI}: {A} musical version identification dataset based on public editorial metadata},
+ booktitle = {Proceedings of the 25th {International} {Society} for {Music} {Information} {Retrieval} {Conference} ({ISMIR})},
+ author = {Araz, R. Oguz and Serra, Xavier and Bogdanov, Dmitry},
+ year = {2024},
 }
 ```
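
The loading notes in the README diff above describe two kinds of files: JSONL files with utf-8 encoding and line-delimited text files. A minimal sketch of reading both kinds follows. The helper names `iter_jsonl` and `read_lines` are hypothetical and not part of the repository, and skipping blank lines is an assumption rather than something the README states.

```python
import json
from typing import Any, Iterator, List


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a UTF-8 JSONL file."""
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            if line:  # assumption: blank lines carry no data and can be skipped
                yield json.loads(line)


def read_lines(path: str) -> List[str]:
    """Return the non-empty lines of a UTF-8 line-delimited text file."""
    with open(path, encoding="utf-8") as infile:
        return [line.rstrip("\n") for line in infile if line.strip()]
```

Streaming with a generator keeps memory use low, which may matter for the larger files such as `discogs_20240701_releases.xml.jsonl`. For example, `for artist in iter_jsonl("discogs_20240701_artists.xml.jsonl.clean"): ...` processes one record at a time.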