Data download size discrepancy #267
Hey :) Thanks for catching that! You're right: the file size displayed on Hugging Face (HF) only reflects the zipped files, not the extracted ones. We'll look into updating that to avoid any confusion. The points below cover:

- Download process
- Dataset size
- HF dataset structure & downloads
- Duplicate downloads
- Error in XCM/XCL test dataset
- Possible solutions list

If you still need clarification, feel free to reach out here or by email!
I'm downloading XCL using the following:
Based on the description on Hugging Face I expected 528,434 files (484 GB). However, I eventually ran out of storage, with the downloaded content exceeding 700 GB.
Additionally, when I restarted the download, it did not resume but instead started re-downloading into subdirectories with different long random hexadecimal names.
Two questions: (1) what is the full size of the XCL download; and (2) is there a way to avoid duplicate downloads of the same files using this API? This applies not only when a download gets interrupted, but also when downloading multiple datasets such as XCL and PER: ideally they would reference the same set of files on disk rather than store an additional copy of the Xeno-Canto files.
Edit: I was able to download the entire XCL after clearing up space. The `data_birdset/downloads` folder is 986 GB and `data_birdset/downloads/extracted` is 502 GB. Should I now delete the files in the `downloads` folder? (Are they temporary files that were extracted into `downloads/extracted`?) I'm also unclear on how to use the XCL/XCM datasets in general; is there a script somewhere that demonstrates training on XCL? After the download completes and the train/valid split is created using the code above, I get `KeyError: 'test_5s'`, which I guess is because this dataset (unlike HSN etc.) doesn't contain test data.
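On the `KeyError`: if XCL/XCM ship without a `test_5s` split, a small guard avoids the crash by picking whichever evaluation split exists, or carving one out of train. A sketch; the fallback names other than `test_5s` are assumptions, while `train_test_split` is the standard `datasets` method:

```python
def pick_eval_split(available):
    """Return the first evaluation split present in `available`
    (any container of split names), or None if there is none."""
    for name in ("test_5s", "test", "valid", "validation"):
        if name in available:
            return name
    return None

# Usage with a loaded DatasetDict `ds` (not run here):
# split = pick_eval_split(ds)
# if split is None:
#     ds = ds["train"].train_test_split(test_size=0.1, seed=42)
#     split = "test"
# eval_set = ds[split]
```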