Create dataset loader for Alorese Collection #448

Closed
SamuelCahyawijaya opened this issue Feb 18, 2024 · 1 comment · Fixed by #541
Labels
bonus +3 pr-ready A PR that closes this issue is Ready to be reviewed

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: alorese/alorese.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?alorese

| Dataset | alorese |
| --- | --- |
| Description | The Alorese Collection (or Alorese Corpus) is a collection of language data in a couple of Alorese varieties (Alor and Pantar Alorese). The collection is available in video, audio, and text formats, with genres including experiment or task, stimuli, discourse, and written materials. |
| Subsets | - |
| Languages | aol, ind |
| Tasks | Language Modeling, Automatic Speech Recognition, Machine Translation |
| License | Unknown (unknown) |
| Homepage | https://hdl.handle.net/1839/e10d7de5-0a6d-4926-967b-0a8cc6d21fb1 |
| HF URL | - |
| Paper URL | https://scholarlypublications.universiteitleiden.nl/handle/1887/70891 |
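
For anyone picking this up, the fields above could be collected in one place before the loader is written. The following is a minimal sketch in plain Python, not the SEACrowd loader API: the `DatasetCard` class, its field names, and the `loader_path` helper are hypothetical and exist only for illustration. The values are copied from the table in this issue.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical container for the fields listed in the issue table."""

    name: str
    languages: Tuple[str, ...]
    tasks: Tuple[str, ...]
    license: str
    homepage: str
    paper_url: str

    def loader_path(self) -> str:
        # Follows the "Dataloader name" pattern given in the issue: <name>/<name>.py
        return f"{self.name}/{self.name}.py"


# Values copied verbatim from the issue; nothing here is fetched or validated.
ALORESE = DatasetCard(
    name="alorese",
    languages=("aol", "ind"),
    tasks=(
        "Language Modeling",
        "Automatic Speech Recognition",
        "Machine Translation",
    ),
    license="Unknown (unknown)",
    homepage="https://hdl.handle.net/1839/e10d7de5-0a6d-4926-967b-0a8cc6d21fb1",
    paper_url="https://scholarlypublications.universiteitleiden.nl/handle/1887/70891",
)

if __name__ == "__main__":
    print(ALORESE.loader_path())
```

Keeping the metadata in a single immutable record makes it easy to check against the data catalogue entry, but the actual loader in #541 may organize these values differently.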
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Feb 18, 2024
@patrickamadeus
Collaborator

#self-assign

@sabilmakbar sabilmakbar added the pr-ready A PR that closes this issue is Ready to be reviewed label Mar 20, 2024
sabilmakbar pushed a commit that referenced this issue Apr 29, 2024
* feat: dataloader for text2text MT

* nitpick: block sp2t to pass tc for t2t task

* nitpick join

* feat: support sptext, sptext_translated

* feat: final alorese_source code

* chore: scrape entire URLs

* nitpick

* nitpick: config builder naming

* fix: nitpick naming a bit

* nitpick PR: formatting, abs import, invalid schema handler

* docs: add docstring scraping approach

* fix: add URL scrape timestamp, revise code formatting, citation

* nitpick year

* nitpick review

* fix: revise schema and remove subset

* nitpick formatting

* Update seacrowd/sea_datasets/alorese/alorese.py

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* Update alorese.py

fix formatting on `yield` of `_generate_examples`
Projects
Status: Done