Create dataset loader for Lio and the Central Flores languages #312
Comments
#self-assign
Hi!
Hi @joanitolopo, thank you for taking on this dataloader. Could we have multiple subsets instead of multiple dataloaders?
Hi @holylovenia. I assumed that we have 16 configs for each language because there are four languages and two tasks. For
Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.
Yes. For the MT, could you please use Sorry for the late reply.
Adding
Hi @joanitolopo, may I know if you need any help with the dataloader?
Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems almost impossible to create a useful ASR dataset from this data: the video (audio) cannot be aligned because there are no clear timestamps, and the audio is noisy and sometimes repetitive. Nonetheless, I think we can keep the machine translation task, as source-to-English sentence pairs are provided in the transcription file.
Noted, thanks @joanitolopo @SamuelCahyawijaya! But I'll keep the datasheet as-is with the ASR and MT tasks, since the dataset provides the resources needed for these tasks, albeit with additional postprocessing steps.
Dataloader name: lio_and_central_flores/lio_and_central_flores.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lio_and_central_flores
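A minimal sketch of the "multiple subsets instead of multiple dataloaders" idea raised in the comments, assuming each subset is one combination of language, task, and schema. The language codes, the schema names, and the `SubsetConfig` class are placeholders for illustration and not the project's actual identifiers. With four languages, two tasks, and two assumed schemas the enumeration yields 16 entries, which may or may not be how the 16 configs mentioned in the thread were counted.

```python
from dataclasses import dataclass
from itertools import product

DATASET_NAME = "lio_and_central_flores"  # taken from the dataloader name in this issue

# Placeholder codes: the thread only states that there are four languages.
LANGUAGES = ["lang_a", "lang_b", "lang_c", "lang_d"]
TASKS = ["asr", "mt"]  # the two tasks discussed in the thread
SCHEMAS = ["source", "seacrowd"]  # assumed schema variants, not confirmed here


@dataclass(frozen=True)
class SubsetConfig:
    """Hypothetical description of one subset served by a single dataloader."""

    name: str
    language: str
    task: str
    schema: str


def build_configs():
    """Enumerate one config per (language, task, schema) combination."""
    return [
        SubsetConfig(
            name=f"{DATASET_NAME}_{lang}_{task}_{schema}",
            language=lang,
            task=task,
            schema=schema,
        )
        for lang, task, schema in product(LANGUAGES, TASKS, SCHEMAS)
    ]


CONFIGS = build_configs()
# Lookup table so a single loader can select a subset by its config name.
CONFIGS_BY_NAME = {cfg.name: cfg for cfg in CONFIGS}
```

Keeping every subset in one list means a single dataloader file can dispatch on the selected config name rather than duplicating the loading logic per language.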