Create dataset loader for Lio and the Central Flores languages #312

Closed
SamuelCahyawijaya opened this issue Jan 10, 2024 · 10 comments · Fixed by #561
Labels
bonus +1 pr-ready A PR that closes this issue is Ready to be reviewed top-priority Needs to get done ASAP for the experiments

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: lio_and_central_flores/lio_and_central_flores.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lio_and_central_flores

Dataset: lio_and_central_flores
Description: A collection of language resources for Li'o, Ende, Nage, and So'a, collected in Ende, Flores, East Nusa Tenggara. The data comes from Alexander Elias's MA thesis, "Lio and the Central Flores languages".
Subsets: Lio Collection
Languages: end, nxe, ssq, ljl, eng
Tasks: Automatic Speech Recognition, Machine Translation
License: Unknown (unknown)
Homepage: https://archive.mpi.nl/tla/islandora/search/alexander%20elias?type=dismax&islandora_solr_search_navigation=0&f%5B0%5D=cmd.Contributor%3A%22Alexander%5C%20Elias%22
HF URL: -
Paper URL: https://studenttheses.universiteitleiden.nl/handle/1887/69452
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Jan 10, 2024
@joanitolopo
Contributor

#self-assign

@joanitolopo
Contributor

Hi!
I have a question regarding this dataset. Would you mind if we split it into separate data loaders per task (Speech Recognition and Machine Translation) for the sake of simplicity? If not, could you please share a reference that implements two or more tasks in a single data loader?
Thanks!

@holylovenia
Contributor

Hi! I have a question regarding this dataset. Would you mind if we split it into separate data loaders per task (Speech Recognition and Machine Translation) for the sake of simplicity? If not, could you please share a reference that implements two or more tasks in a single data loader? Thanks!

Hi @joanitolopo, thank you for taking on this dataloader. Could we have multiple subsets instead of multiple dataloaders?

seacrowd subsets

  1. lio_and_central_flores_asr_{lang}_seacrowd_sptext for all of the SEA languages
  2. lio_and_central_flores_mt_{lang}_seacrowd_sptext for all of the SEA languages

source subsets

  1. lio_and_central_flores_asr_{lang}_source for all of the SEA languages
  2. lio_and_central_flores_mt_{lang}_source for all of the SEA languages

@joanitolopo
Contributor

joanitolopo commented Feb 5, 2024

Hi @holylovenia.

Could we have multiple subsets instead of multiple dataloaders?

I assume we will have 16 configs in total, since there are four languages, two tasks, and two schemas (source and seacrowd).

For the seacrowd subsets, I used lio_and_central_flores_asr_{lang}_seacrowd_sptext for the ASR task and lio_and_central_flores_mt_{lang}_seacrowd_t2t for the MT task. Is that right? Thank you.


Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@holylovenia
Contributor

holylovenia commented Feb 26, 2024

I assume we will have 16 configs in total, since there are four languages, two tasks, and two schemas (source and seacrowd).

For the seacrowd subsets, I used lio_and_central_flores_asr_{lang}_seacrowd_sptext for the ASR task and lio_and_central_flores_mt_{lang}_seacrowd_t2t for the MT task. Is that right? Thank you.

Yes. For the MT, could you please use lio_and_central_flores_mt_eng_{lang}_seacrowd_t2t instead of lio_and_central_flores_mt_{lang}_seacrowd_t2t? Just for clarity's sake.

Sorry for the late reply.
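
As a rough illustration of the naming agreed on above, the config names could be enumerated as in the sketch below. The helper `build_config_names` and the constants are hypothetical names introduced here, not part of the SEACrowd codebase. The language codes are the four SEA codes from the datasheet, and the source MT name follows the earlier list in this thread, since the rename to `mt_eng_{lang}` was only requested for the seacrowd t2t subset.

```python
DATASET = "lio_and_central_flores"
# SEA language codes listed in the datasheet (eng is the translation side only).
LANGS = ["end", "nxe", "ssq", "ljl"]


def build_config_names(langs=LANGS):
    """Return the config names for every language, task, and schema.

    Hypothetical helper for illustration; it only builds strings.
    """
    names = []
    for lang in langs:
        # ASR: source schema and seacrowd speech-text schema.
        names.append(f"{DATASET}_asr_{lang}_source")
        names.append(f"{DATASET}_asr_{lang}_seacrowd_sptext")
        # MT: source name as listed earlier in the thread (assumption:
        # the eng_ prefix applies to the seacrowd t2t subset only).
        names.append(f"{DATASET}_mt_{lang}_source")
        names.append(f"{DATASET}_mt_eng_{lang}_seacrowd_t2t")
    return names


config_names = build_config_names()
print(len(config_names))  # 4 languages x 2 tasks x 2 schemas
for name in config_names:
    print(name)
```

With four languages, two tasks, and two schemas, this yields the 16 configs mentioned above.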

@holylovenia holylovenia added bonus +1 top-priority Needs to get done ASAP for the experiments labels Mar 12, 2024
@holylovenia
Contributor

Adding the top-priority and bonus +1 labels because we need this dataset for the experiments.

@holylovenia
Contributor

Hi @joanitolopo, may I know if you need any help with the dataloader?

@SamuelCahyawijaya
Collaborator Author

SamuelCahyawijaya commented Mar 30, 2024

Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems almost impossible to create a useful ASR dataset from this data: the video (audio) cannot be aligned with the transcription because there are no clear timestamps, and the audio is noisy and sometimes repetitive.

Nonetheless, I think we can keep the machine translation task, as the transcription file provides source-to-English sentence pairs.

@holylovenia
Contributor

Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems almost impossible to create a useful ASR dataset from this data: the video (audio) cannot be aligned with the transcription because there are no clear timestamps, and the audio is noisy and sometimes repetitive.

Nonetheless, I think we can keep the machine translation task, as the transcription file provides source-to-English sentence pairs.

Noted, thanks @joanitolopo @SamuelCahyawijaya! But I'll keep the datasheet as-is with ASR and MT tasks since the dataset provides the resources needed for these tasks—albeit with additional postprocessing steps.
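
For the MT task that remains, a single sentence pair could be wrapped into a text-to-text record roughly as sketched below. The field names reflect my understanding of the seacrowd t2t schema and should be checked against the repository. The helper `make_t2t_record`, the choice of the local-language sentence as `text_1`, and the placeholder sentences are assumptions for illustration, not data or code from this dataset.

```python
def make_t2t_record(idx, source_text, english_text, lang):
    """Build one text-to-text style record from a sentence pair.

    Hypothetical helper; field names are assumed, not confirmed here.
    """
    return {
        "id": str(idx),
        "text_1": source_text,   # sentence in the local language (assumed side)
        "text_2": english_text,  # English translation
        "text_1_name": lang,     # e.g. "ljl" for Li'o
        "text_2_name": "eng",
    }


# Placeholder strings stand in for a real pair from the transcription file.
record = make_t2t_record(0, "<local-language sentence>", "<English translation>", "ljl")
print(record)
```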

@holylovenia holylovenia added the pr-ready A PR that closes this issue is Ready to be reviewed label Apr 1, 2024
ljvmiranda921 pushed a commit that referenced this issue Apr 21, 2024
* Create lio_and_central_flores dataset loader

* fix requirement issue

* adding docstring and run make file
Projects
Status: Done

3 participants