
Data download size discrepancy #267

Open
sammlapp opened this issue Nov 1, 2024 · 2 comments

Comments

@sammlapp

sammlapp commented Nov 1, 2024

I'm downloading XCL using the following:

from birdset.datamodule import DatasetConfig
from birdset.datamodule.birdset_datamodule import BirdSetDataModule

# download a complete xeno canto snapshot included in BirdSet
# https://huggingface.co/datasets/DBD-research-group/BirdSet

# initiate the data module
dm = BirdSetDataModule(
    dataset=DatasetConfig(
        data_dir=".../data_birdset/",
        hf_path="DBD-research-group/BirdSet",
        hf_name="XCL",
        n_workers=4,
        val_split=0.2,
        task="multilabel",
        classlimit=500,
        eventlimit=5,
        sampling_rate=32000,
    ),
)
# prepare the data (download dataset, ...)
dm.prepare_data()

Based on the description on Hugging Face I expected 528,434 files totaling 484 GB. However, I eventually ran out of storage, with the downloaded content exceeding 700 GB.

Additionally, when I restarted the download, it did not resume but instead started re-downloading into subdirectories with different long random hexadecimal names.

Two questions: (1) what is the full size of the XCL download, and (2) is there a way to avoid duplicate downloads of the same files using this API? This applies not only to the case where a download gets interrupted, but also to downloading multiple datasets such as XCL and PER: ideally they would reference the same set of files on disk rather than storing an additional copy of the xeno canto files.

Edit: I was able to download the entire XCL after clearing up space. The data_birdset/downloads folder is 986 GB and data_birdset/downloads/extracted is 502 GB. Should I now delete the files in the downloads folder? (Are they temporary files that were extracted into downloads/extracted?) I'm also unclear on how to use the XCL/XCM datasets in general; is there a script somewhere that demonstrates training on XCL? After the download completes and the train/valid split is created using the code above, I get KeyError: 'test_5s', which I guess is because this dataset (unlike HSN etc.) doesn't contain test data.

Traceback (most recent call last):
  File "/home/sml161/birdset_download/download_XCL.py", line 22, in <module>
    dm.prepare_data()
  File "/home/sml161/BirdSet/birdset/datamodule/base_datamodule.py", line 120, in prepare_data
    dataset = self._preprocess_data(dataset)
  File "/home/sml161/BirdSet/birdset/datamodule/birdset_datamodule.py", line 130, in _preprocess_data
    dataset = DatasetDict({split: dataset[split] for split in ["train", "test_5s"]})
  File "/home/sml161/BirdSet/birdset/datamodule/birdset_datamodule.py", line 130, in <dictcomp>
    dataset = DatasetDict({split: dataset[split] for split in ["train", "test_5s"]})
  File "/home/sml161/miniconda3/envs/birdset/lib/python3.10/site-packages/datasets/dataset_dict.py", line 75, in __getitem__
    return super().__getitem__(k)
KeyError: 'test_5s'
@lurauch
Contributor

lurauch commented Nov 3, 2024

Hey :)

Thanks for catching that! You’re right—the file size displayed on Hugging Face (HF) only reflects the zipped files and not the extracted ones. We’ll look into updating that to avoid any confusion.

Download process:
HF should recognize that a download has already started and automatically resume it. If the download is restarting from scratch, it might be because the download folder path changed.

Dataset size:
After unpacking, the dataset should be around 993 GB total, with the extracted folder taking up about 510 GB. To be honest, we haven't thought about deleting files outside /downloads/extracted, but a quick check suggests that all paths point to the extracted files. This is a great point! Maybe we can add an automatic cleanup step in the builder script to remove unnecessary files; I'll explore this further.
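
If you want to double-check locally before deleting anything, here is a small sketch that compares the size of the two folders; the root path is just a placeholder for whatever you passed as data_dir:

from pathlib import Path

def folder_size_gb(path: Path) -> float:
    # sum the sizes of all files below `path` and report the total in GB
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

root = Path("/path/to/data_birdset")  # illustrative data_dir
print(f"downloads:           {folder_size_gb(root / 'downloads'):.1f} GB")
print(f"downloads/extracted: {folder_size_gb(root / 'downloads' / 'extracted'):.1f} GB")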

HF dataset structure & downloads:
Here are a few notes on how HF handles data structure and downloads:

  • When you download the dataset, HF automatically saves/unpacks it into subdirectories based on HF’s naming conventions.
  • Our data natively uses Audio(decode=False), which means preprocessing (using map in HF) without decoding only updates the metadata. If you use our example code with save_to_disk, you'll need the /downloads/extracted folder to load the result properly with ds = load_from_disk(dm.disk_save_path). Our XCL_processed folders end up around 6 GB, so the extracted downloads need to stay in place (see the sketch after this list).
  • If you choose to save_to_disk after decoding, HF saves the data in Arrow format. This unpacked data might be larger than the original files. In that case, you could delete the entire /downloads folder, though you'd lose the ability to unpack any additional files from the original set.
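
To make that more concrete, here is a minimal sketch; the path is illustrative, and it assumes the processed dataset was written with save_to_disk as in our example code and keeps an "audio" column stored with decode=False:

from datasets import Audio, load_from_disk

# load the processed (metadata-only) dataset written by save_to_disk
ds = load_from_disk("/path/to/data_birdset/XCL_processed")  # i.e. dm.disk_save_path

# with decode=False the audio column only stores file references that point
# into .../downloads/extracted, so that folder has to stay on disk
print(ds["train"][0]["audio"])

# decoding only happens once the column is cast back with decode=True
ds = ds.cast_column("audio", Audio(sampling_rate=32_000, decode=True))
waveform = ds["train"][0]["audio"]["array"]  # now an actual signal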

Duplicate downloads:
Duplicate downloads can be an issue, and while we've attempted to address this in our HF builder script, as far as we know HF doesn't offer a better solution right now. To help minimize duplicates, you could:

  1. Selective downloads: You can download only specific datasets like HSN_scape without pulling the XC files again.
  2. Subset creation: If you need the train subsets, a workaround is to first load XCL, then apply a custom mapping function to filter the specific eBird codes for each test set (we created our subsets this way), and save them with save_to_disk (see the sketch after this list). We'll look into integrating a simpler subset creation method in BirdSet. Thanks for the suggestion!
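
As a rough, untested sketch of that workaround (the label column name "ebird_code", the eBird codes, and the cache/output paths are assumptions, not guaranteed BirdSet names):

from datasets import load_dataset

# reuse the already-downloaded XCL snapshot from the existing cache directory
xcl_train = load_dataset(
    "DBD-research-group/BirdSet",
    "XCL",
    split="train",
    cache_dir="/path/to/data_birdset",  # illustrative
)

# hypothetical set of eBird codes belonging to one regional test set
subset_codes = {"amerob", "comrav"}

# keep only the recordings whose label falls into the subset, then persist it
subset = xcl_train.filter(lambda example: example["ebird_code"] in subset_codes)
subset.save_to_disk("/path/to/data_birdset/XCL_subset")  # illustrative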

Error in XCM/XCL test dataset:
The error comes from XCM/XCL needing a different datamodule, as these sets don't include a dedicated test dataset. This is covered in the "Reproduce Baselines" section of the docs (which I accidentally removed). You can also refer to the configs.
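
Until that section is restored, here is a hedged sketch of working around the missing test_5s split with plain Hugging Face datasets calls (the split fraction and cache path are illustrative, and the dedicated XCM/XCL datamodule may handle this differently):

from datasets import load_dataset

# loads from the existing cache; XCL/XCM expose only a "train" split
xcl = load_dataset("DBD-research-group/BirdSet", "XCL", cache_dir="/path/to/data_birdset")
print(xcl.keys())  # no "test_5s" here, hence the KeyError above

# create the validation split yourself instead of expecting "test_5s"
splits = xcl["train"].train_test_split(test_size=0.2, seed=42)
train_ds, valid_ds = splits["train"], splits["test"]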

If you still need clarification, feel free to reach out here or by email!

@lurauch
Contributor

lurauch commented Nov 25, 2024

@Moritz-Wirth

Possible solutions list:

  • Avoid extracting the files entirely.
    This approach works easily if we utilize the Audio class from Hugging Face. However, it has two significant drawbacks:

    • The true file paths would not be accessible.
    • The soundfile library would need to support partial loading of zipped files to load specific events without processing the entire file each time.
  • Delete the zipped files after extracting them via Hugging Face.
    This is the simplest solution but requires additional temporary disk space during extraction. The space is freed afterward, but the temporary demand is still large (see the sketch after this list).

  • Delete zipped files during the extraction process via Hugging Face.
    Implementing this directly in the dataset builder script may not be possible and could involve additional complexity.

  • Re-upload files without zipping them.
    Completely omit zipped files by re-uploading all content directly in a usable format, such as .ogg or .flac.
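
For the second option, a cautious interim sketch; the folder layout is assumed from this thread (archives directly under <data_dir>/downloads, audio under <data_dir>/downloads/extracted), and it should only be run after confirming the processed dataset still loads:

from pathlib import Path

downloads = Path("/path/to/data_birdset/downloads")  # illustrative data_dir
extracted = downloads / "extracted"

# remove only the top-level archive blobs and lock files, never the
# extracted/ directory that the dataset's audio paths point into
for entry in downloads.iterdir():
    if entry.is_file():
        print("removing", entry)
        entry.unlink()

assert extracted.exists()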
