Create dataset loader for ara-close-lange #14

Closed
SamuelCahyawijaya opened this issue Nov 1, 2023 · 14 comments · Fixed by #243
Labels: help wanted, pr-ready

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Nov 1, 2023

Dataloader name: ara_close/ara_close.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?ara_close

Dataset ara_close
Description The dataset contribution of this study is a compilation of short fictional stories written in Bikol for readability assessment. The data was combined with other collected Philippine language corpora, such as Tagalog and Cebuano. The data from these languages are all distributed across the Philippine elementary system's first three grade levels (L1, L2, L3). We sourced this dataset from Let's Read Asia (LRA), Bloom Library, Department of Education, and Adarna House.
Subsets -
Languages bcl, tgl, ceb
Tasks Readability Assessment
License Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage https://github.com/imperialite/ara-close-lang
HF URL -
Paper URL https://aclanthology.org/2023.findings-acl.331/
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Nov 1, 2023
@androstj

androstj commented Nov 2, 2023

#self-assign

@imperialite

Hello. I'm the owner of this dataset and was about to submit it through the form provided, but it seems someone has already done it. Can I submit a new form? I'm sure I can fill up some of the details more clearly. Also, so I can be attributed for the points.

@androstj androstj removed their assignment Nov 17, 2023
@androstj

@imperialite Hi, I've unassigned myself, so you can take this task :)

@sabilmakbar
Collaborator

Hi @imperialite do you want to take this dataloader task so you can get points from it?

@sabilmakbar
Collaborator

> Hello. I'm the owner of this dataset and was about to submit it through the form provided, but it seems someone has already done it. Can I submit a new form? I'm sure I can fill up some of the details more clearly. Also, so I can be attributed for the points.

An update on this: you may resubmit it with more complete information. If it's approved by the Datasheet Reviewers, you'll be attributed with points too, @imperialite.

@sabilmakbar
Collaborator

Hi @androstj, since he hasn't responded to your last comment on this issue after 2 weeks, I think you may retake this issue.

@imperialite

Hi @sabilmakbar, thanks for this. I'll resubmit the data details through the form. I'm contributing by adding more data rather than implementing data loaders. @androstj, feel free to retake the issue, and sorry for the delay :)

@SamuelCahyawijaya
Collaborator Author

@imperialite: Oh, I just realized that the previous one was about the datasheet. I will recheck the new submission and the score will be shared in this case.

@holylovenia holylovenia added the help wanted Extra attention is needed label Dec 10, 2023
@MJonibek
Collaborator

#self-assign

@MJonibek
Collaborator

Could you please advise which schema I should use for this dataset?

I was thinking about text schema (for text classification) with labels L1, L2 and L3, but I am not sure.

@imperialite

Hi @MJonibek yes using the labels L1, L2, and L3 is correct and these are the same as what we used in the paper. Thanks.

@sabilmakbar
Collaborator

For the schema in SEACrowd, you can use the text label schema (under utils/schema/text.py). However, since the READABILITY_ASSESSMENT task isn't available in the _SUPPORTED_TASK list yet, I'll help with adding that.
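
As a rough illustration of the schema choice discussed here, the sketch below shows what a single text-classification record with the labels L1, L2, and L3 could look like. It is plain Python and does not use the SEACrowd schema module; the names `LABELS`, `TextExample`, and `make_example` are hypothetical and only stand in for whatever the dataloader actually defines.

```python
from typing import TypedDict

# Grade-level labels confirmed in this thread.
LABELS = ("L1", "L2", "L3")


class TextExample(TypedDict):
    """Hypothetical record shape for one text-classification example."""

    id: str
    text: str
    label: str


def make_example(idx: int, text: str, label: str) -> TextExample:
    """Build one record, rejecting any label outside L1, L2, L3."""
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {"id": str(idx), "text": text, "label": label}
```

Rejecting unknown labels early is a design choice for this sketch; a real loader might instead rely on the label feature's own validation.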

@MJonibek
Collaborator

I found out that the data for tgl was deleted.

Also, the data has titles and text. For the text label schema (utils/schema/text.py),
{
"id": idx,
"text": text,
"label": label,
}

should I:

  1. append the title to the text,
  2. use the title as the id, or
  3. just ignore the title?

@imperialite

I deleted the Tagalog (tgl) data because the owner (Adarna House) would prefer if it is requested directly through them. The title can be the ID since it can be used for checking duplication and additional info on the books (useful for genre-based or author analysis).
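
A minimal sketch of the "title as ID" suggestion, assuming each source row is a dict with the keys `title`, `text`, and `label` (these column names and the function `generate_examples` are assumptions for illustration, not taken from the actual dataloader). Because an ID has to be unique, the sketch also raises an error when the same title appears twice, which mirrors the duplication check mentioned above.

```python
from typing import Dict, Iterable, Iterator, Tuple


def generate_examples(
    rows: Iterable[Dict[str, str]],
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (key, example) pairs, using the story title as the example id."""
    seen = set()
    for row in rows:
        title = row["title"].strip()
        # A repeated title would produce a non-unique id, so fail loudly.
        if title in seen:
            raise ValueError(f"duplicate title used as id: {title!r}")
        seen.add(title)
        yield title, {"id": title, "text": row["text"], "label": row["label"]}
```

If the source data turns out to contain repeated titles, one alternative would be to combine the title with a running index rather than raising an error.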

@sabilmakbar sabilmakbar added the pr-ready A PR that closes this issue is Ready to be reviewed label Dec 28, 2023
holylovenia pushed a commit that referenced this issue Jan 22, 2024
* Add ara_close dataloader

* Rename class name to AraCloseDataset
raileymontalan pushed a commit to raileymontalan/seacrowd-datahub that referenced this issue Feb 28, 2024
…owd#243)

* Add ara_close dataloader

* Rename class name to AraCloseDataset
MJonibek added a commit to MJonibek/seacrowd-datahub that referenced this issue Apr 18, 2024
* Fix bug unique ids

* Closes SEACrowd#162 | Add Bloom-Captioning Dataloader (SEACrowd#198)

* Init dataloader bloom captioning

* Fix issue on multiple splits from its source

* Change local var

* Cater 'test' and 'val' split and fix the '_id' generation

* fix: remove abstreact and change _LOCAL and _DESC

* fix: _DESC indent

* Format openslr.py and add init file

* Closes SEACrowd#271 | Implement dataloader for UiT-ViCTSD (SEACrowd#300)

* Implement UiT-ViCTSD dataloader

* Improve subset IDs, feature types, code to generate examples

* Closes SEACrowd#161 | Create dataset loader for ICON 161 (SEACrowd#317)

* Create icon.py

* Update icon.py

* Create __init__.py

* Closes SEACrowd#142 | Add Unimorph v4 dataloader (SEACrowd#168)

* Add Unimorph dataloader

Resolves SEACrowd#142

* Add Dataset to class name

* Closes SEACrowd#71 | Create dataset loader for MASSIVE (SEACrowd#196)

* add data loader for massive dataset

* modify the class name & refactor the function name

* change task name from pos tagging to slot filling & make check_file & change subset name to differentiate intent / slot filling tasks

* Closes SEACrowd#14 | Create dataset loader for ara-close-lange (SEACrowd#243)

* Add ara_close dataloader

* Rename class name to AraCloseDataset

* Closes SEACrowd#273 | Implement dataloader for UIT_ViON (SEACrowd#282)

* Implement dataloader for UIT_ViON

* Add __init__.py

* Add {lang} in subset id for openslr

* Closes SEACrowd#219 | Create dataloader for scb-mt-en-th-2020 (SEACrowd#287)

* Create dataloader for scb-mt-en-th-2020

* Rename the data loader files to its snakecase

* rename _DATASETNAME to snakecase

* Fix languages setting

* Update template.py

* Add docstring openslr.py

* Closes SEACrowd#277 | Implement dataloader for spamid_pair (SEACrowd#281)

* Implemente dataloader for spamid_pair

* Update seacrowd/sea_datasets/spamid_pair/spamid_pair.py

Co-authored-by: Lj Miranda <[email protected]>

* Add __init__.py

* Update __init__.py

---------

Co-authored-by: Lj Miranda <[email protected]>

* Implemented dataloader for indoler

* Add imqa schema and VISUAL_QUESTION_ANSWERING task (SEACrowd#380)

* Update template.py

Update DownloadManager documentation link in template.py

* Closes SEACrowd#54 | Implement Dataloader for IndoSMD (SEACrowd#258)

* feat: indosmd dataloader for source

* refactor by pre-commit

* IndoSMD: reformatted by pre-commit

* Update changes on indosmd.py

* revised line 223 in indosmd.py

* Close#143 | Create dataset loader for Abui WordNet (SEACrowd#285)

* add tydiqa dataloader

* add id_vaccines_tweet dataloader

* add uit-vicc dataloader

* add ICON dataloader

* add iaap_squad dataloader

* add stb_ext dataloader

* Revert "add iaap_squad dataloader"

This reverts commit 1f8a591.

* Revert "add tydiqa dataloader"

This reverts commit 6bf4546.

* Revert "add id_vaccines_tweet dataloader"

This reverts commit 1154087.

* Revert "add uit-vicc dataloader"

This reverts commit 09661fa.

* Revert "add ICON dataloader"

This reverts commit 0891e58.

* Update stb_ext.py

* add abui_wordnet dataloader

* Revert "Update stb_ext.py"

This reverts commit 59c5301.

* Delete seacrowd/sea_datasets/stb_ext/stb_ext.py

* Delete seacrowd/sea_datasets/stb_ext/__init__.py

* Update abui_wordnet.py

* Update abui_wordnet.py

* Update abui_wordnet.py

---------

Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>

* Added Morality Classification Tasks to constants.py (SEACrowd#371)

* Closes SEACrowd#216 |  Create dataset loader for Mozilla Pontoon (SEACrowd#260)

* Begin first draft of Mozilla Pontoon dataloader

* Add dataloader for Mozilla Pontoon

* Remove enumerate in _generate_examples

* Fix issues due to changed format, rename features and config names

* Closes SEACrowd#157 | Create dataset loader for M3Exam (SEACrowd#302)

* Add m3exam dataloader

* Small change in m3exam.py

* Fix bug during downloading

* Add meta feature to seacrowd schema for m3exam

* Rename class M3Exam to M3ExamDataset

* Add image question answering

* Merge two source schemas into one for m3exam

* Fix image path, choices and answer in m3exam

* Update CODEOWNERS

* Rectify SEACrowd Internal Vars (SEACrowd#386)

* Add missing __init__.py

* add init

* fix bug in phoatis load

* add lang variables in dataloaders

* Add dataset use ack on source HF repo into description

* Closes SEACrowd#204 | Implement dataloader for Melayu_Sabah (SEACrowd#234)

* Implement dataloader for Melayu_Sabah

* Update name for the dataloader

* Add _CITATION

* Update seacrowd/sea_datasets/melayu_sabah/melayu_sabah.py

* Applu suggestions from review

* Moving unnecessary content in dialogue text

* Update melayu_sabah.py

* Improvement: Workflow Message to Mention Assignee in Staled Issues (SEACrowd#400)

* Update stale.yml (SEACrowd#327)

* Update stale.yml

Test on adding vars on assignee & author of Issues & PR

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Closes SEACrowd#272 | Create dataset loader for SNLI (SEACrowd#290)

* [New Feature] Add SNLI dataloader

* [Fix] SNLI rev according to PR review

* [Chore] Add comment for accessibility

* Update common_parser.py (SEACrowd#333)

* Implement dataloader for UCLA Phonetic Corpus

* Implement dataloader for KDE4

* removed redundant builder_config

* Update cc3m_35l.py

Changed into no parallelization since it was kept being killed by the OS for some reason.

* Fix: Workflow Assignee Mention (SEACrowd#410)

* Update stale.yml

* Fix: wrong quote in message (SEACrowd#411)

* Update and fix bug on stale.yml

* Closes SEACrowd#17 | Implement dataloader for Philippine Fake News Corpus (SEACrowd#331)

* Implement dataloader

* Edit dataloader class name

* Simplify code

* Fix citation typo

* Closes SEACrowd#359 | Implement dataloader for LR-Sum (SEACrowd#368)

* Implement dataloader

* Fix short description

* feat: mswc dataloader skeleton

* feat: example for seacrowd schema

* Closes SEACrowd#265 | Implement dataloader for `myxnli` (SEACrowd#336)

* Implement dataloader for myxnli

* update myxnli

* Closes SEACrowd#112 | Implement Dataloader for Wisesight Thai Corpus (SEACrowd#279)

* Add wisesight_thai_sentiment dataset

* changes according to review

* changes according to review

* changes according to review

* Add changes according to review

* refactor: formatting

* fix: subset

* refactor: formatting

* Closes SEACrowd#6 | Add Loader for XCOPA (SEACrowd#286)

* initial add for loader

* edit to include multi language

* adjust comments

* apply suggestion

* fix by linter

---------

Co-authored-by: fawwaz.mayda <[email protected]>

* Closes SEACrowd#140 | Add Dengue Filipino (SEACrowd#259)

* add dengue filipino

* update license and tasks

* Update _LANGUAGE

* Update dengue_filipino.py

* feat: flores200 dataloader skeleton

* Set only one source schema

* Fix subnodes ids for root node alt_burmese_treebank

* implement Filipino Gay Language dataloader (SEACrowd#66)

* convert citation to raw string

* Closes SEACrowd#210 | Create dataset loader for Orchid Corpus (SEACrowd#303)

* Add orchid_pos dataloader

* Rename OrchidPOS to OrchidPOSDataset

* Fix parser bug in orchid_pos.py

* Add .strip() in source orchid_pos

* Cahange string for special char orchid_pos

* fix: remove useless loop

* refactor: remove unused loop

* Closes SEACrowd#159 | Create dataset loader for CC-Aligned (SEACrowd#298)

* Add cc_aligned_doc dataloader

* Rename class and format cc_aligned_doc

* Add SEACROWD_SCHEMA_NAME for cc_aligned_doc

* Closes SEACrowd#268 | Implement dataloader for Thai Toxicity Tweet Corpus (SEACrowd#301)

* Implement dataloader for Thai toxicity tweets

* Fix description grammar

* List labels as constant

* Change task to ABUSIVE_LANGUAGE_PREDICTION, improve _generate_examples

* Rename dataloader folder and file

* Remove comment, change license value

* Define SEACROWD_SCHEMA using _SUPPORTED_TASKS

* Fix bug where example ID and index do not match

* Closes SEACrowd#363 | Create dataset loader for identifikasi-bahasa (SEACrowd#379)

* [add]  initial commit

* [add] dataset loader for identifikasi_bahasa

* [refactor]  removed __main__

* Update seacrowd/sea_datasets/identifikasi_bahasa/identifikasi_bahasa.py

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#182. | Implement dataloader for `roots_vi_ted` (SEACrowd#329)

* Implement dataloader for roots_vi_ted

* update

* update

* update

* remove local data

* reformat

* Closes SEACrowd#180 | Implement `IndoMMLU` dataloader (SEACrowd#324)

* Implement dataloader for indommlu

* update

* update

* Closes SEACrowd#345 | Implemented dataloader for vlsp2016_ner (SEACrowd#372)

* Implemented dataloader for vlsp2016_ner

* Format vlsp2016_ner.py

* Closes SEACrowd#276 | Implement PRDECT-ID dataloader (SEACrowd#322)

* Implement PRDECT-ID dataloader

Closes SEACrowd#276

* Add better type formatting

* Follow id_google_play_review for structure

* Include source configs for both emotion and sentiment

* Closes SEACrowd#9 | Add bhinneka_korpus dataset loader (SEACrowd#175)

* Add bhinnek_korpus dataset loader

* Updating the suggested changes

* Resolved review suggestions

* Create indonesian_news_dataset dataloader

* Closes SEACrowd#183 | Implement `wongnai_reviews` dataloader (SEACrowd#325)

* Implement dataloader for wongnai_reviews

* add __init__.py

* update

* update

* Implement change requested by holylovenia

* Closes SEACrowd#348 | Implemented dataloader for indoner_tourism (SEACrowd#373)

* Implemented dataloader for indoner_tourism

* Perform changes requested by ljvmiranda921

* Closes SEACrowd#361 | Create dataset loader for Thai-Lao Parallel Corpus (SEACrowd#384)

* [add] dataloader for tha_lao_embassy_parcor, no citation yet

* [add] citation; removed debug code

* [style] make format restyle

* [refactor]  removed TODO code

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Update constants.py

* Closes SEACrowd#305 | Implement dataloader for UIT_ViOCD (SEACrowd#335)

* Implement dataloader for UIT_ViOCD

* update according to the review

* Update _SUPPORTED_TASKS

* Closes SEACrowd#362 | Create dataset loader for GKLMIP Khmer News Dataset (SEACrowd#383)

* [add] dataloader for gklmip_newsclass

* [refactor]  changed licence value

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#358 | Create dataset loader for GKLMIP Product Sentiment (SEACrowd#417)

* [add] dataset loader for gklmip_sentiment

* [refactor]  removed comment; removed "split" parameter in gen_kwargs

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Update constants.py

* Close SEACrowd#306 | Create dataset loader for ViHealthQA (SEACrowd#319)

* Create dataset loader for ViHealthQA SEACrowd#306

* add class docstring

* Update vihealthqa.py

* Closes SEACrowd#10 | Create beaye_lexicon dataset loader (SEACrowd#320)

* Create beaye_lexicon dataset loader

* add implementation of eng-day word pairs

* Closes SEACrowd#179 | Implement `indo_story_cloze` dataloader (SEACrowd#323)

* Implement indo_story_cloze dataloader.

* correct license

* update according to the feedback

* update

* Closes SEACrowd#353| Create dataset loader for FilWordNet (SEACrowd#377)

* Add dataloader for FilWordNet

* Update seacrowd/sea_datasets/filwordnet/filwordnet.py

Co-authored-by: Lj Miranda <[email protected]>

* Update seacrowd/sea_datasets/filwordnet/filwordnet.py

Co-authored-by: Lj Miranda <[email protected]>

* Fix formatting

---------

Co-authored-by: Lj Miranda <[email protected]>

* feat: id_sentiment_analysis dataloader

* refactor: remove print

* refactor: default config name

* feat: subsets

* Closes SEACrowd#350 | Implement dataloader for Indonesian PRONER (SEACrowd#399)

* Implement dataloader for Indonesian PRONER

* Add manual and automatic subsets

---------

Co-authored-by: Railey Montalan <[email protected]>

* Implement dataloader for IMAD Malay Corpus (SEACrowd#402)

Co-authored-by: ssfei81 <[email protected]>

* Update id_wsd.py

* add thaigov (SEACrowd#412)

* add thaigov

* Update thaigov.py

* add inline comment for file structure

* Update and rename snli.py to snli_indo.py

* Rename SNLI to SNLI Indo

* Update snli_indo.py

* [add]  dataloader for sarawak_malay

* Closes SEACrowd#264 | Create dataset loader for mySentence SEACrowd#264 (SEACrowd#291)

* add mysentences dataloader

* align the config name to subset_id

* update mysentence config

* Update mysentence.py

* remove comment line

* Update mysentence.py

* Update mysentence config

* Update mysentence.py

* Update seacrowd/sea_datasets/mysentence/mysentence.py

Fix the subset_id case-checking for data download

* added __init__.py to ucla_phonetic

* updated dataloader according to suggestions

* Update memolon.py

* fix: subset_id format

* refactor: prepend dataset name to subset id

* fix: first language is set to latin english

* Add thai depression

* Create __init__.py

* Create __init__.py

* Create __init__.py

* Implement dataloader for SeaEval

* Update template.py instruction for dataloader class name (SEACrowd#334)

* Add documentation for dataloader class name

* Update template.py

* Update REVIEWING.md

This modified the content of adding "Dataset" suffix into optional, and giving a reference to templates/templates.py for example

* Update REVIEWING.md

fix file reference name

---------

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* Closes SEACrowd#165 | Add BLOOM-LM dataset (SEACrowd#294)

* Init add BLOOM-LM dataset

* Adjusting changes based on review

* fix typing on _generate_examples

* update import based on formatter suggestion

* Closes SEACrowd#349 | Create dataset loader for QASiNa (SEACrowd#418)

* [add] dataloader for qasina

* [refactor] renamed dataset class

* [add]  added contex_title to qa_seacrowd schema

* [refactor, add]  changed QA type, added "answer_start", "contx_length" information to meta

* [refactor]  bug fixes

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#263 | Implement dataloader for VIVOS (SEACrowd#398)

* Implement dataloader for

* Implement dataloader for VIVOS

* Add missing __init__.py file

* Change _LANGUAGES into list

---------

Co-authored-by: Railey Montalan <[email protected]>

* Closes SEACrowd#190 | Create dataset loader for TydiQA  (SEACrowd#251)

* add tydiqa dataloader

* Update tydiqa.py

* add example helper and update config

* Update tydiqa.py

* Update Configs and _info

* Update features in _info()

* Update tydiqa.py

This update covers the requested changes from @jen-santoso and @jamesjaya, please advice if needs any further changes. Thanks.

* add tydiqa_id subset

* Update tydiqa.py

Reformat long lines in the code and add IndoNLG in citation

* remove tydiqa_id

* Closes SEACrowd#338 | Created DataLoader for IndonesianNMT (SEACrowd#367)

* Implementing Dataloader for indonesiannmt issue SEACrowd#338

* Update template.py

* Implementing Dataloader for indonesiannmt issue SEACrowd#338

* removed if __main__ section

* IndonesianNMT reconstructing dataloader

* Implement ssp task, implement suggestions

* format indonesiannmt

---------

Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Jonibek Mansurov <[email protected]>

* Closes SEACrowd#366 | Implement dataloader for Kheng.info Speech (SEACrowd#401)

* Implement dataloader for Kheng.info Speech

* Add init file

* Closes SEACrowd#226 | Vi Pubmed dataloader (SEACrowd#391)

* feat: vi_pubmed dataloader

* fix: homepage

* fix: non unique id error

* refactor: class name

* refactor: remove unused loop

* Create __init__.py

* [refactor]  removed comment

* Update flores200.py

* refactor: remove main function

* Closes SEACrowd#69 | Implement XStoryCloze Dataloader (SEACrowd#137)

* implement xstorycloze dataloader

* add __init__.py

* update

* remove ssp schema; add _LANGUAGES

* remove unnecessary import; pascal case for class name

* Closes SEACrowd#147 | implemented dataloader for gatitos dataset (SEACrowd#415)

* implemented dataloader for gatitos dataset

* added __init__.py to gatitos folder

* Updated gatitos

---------

Co-authored-by: ssfei81 <[email protected]>

* Update CODEOWNERS

* Patch Workflow on Stale Checking (SEACrowd#482)

* Update stale.yml

* Create add-new-comment-on-stale

* Update and rename stale.yml to stale-labeler.yml

* Update add-new-comment-on-stale

* Rename add-new-comment-on-stale to add-new-comment-on-stale.yml

* Sabilmakbar Patch Workflow (SEACrowd#484)

Bugfix on SEACrowd#482.

* Update add-new-comment-on-stale.yml

add workflow trigger criteria on PR message aswell

* Update add-new-comment-on-stale.yml

* Update add-new-comment-on-stale.yml

fix yaml indent

* Update add-new-comment-on-stale.yml

* Closes SEACrowd#340 | Implement Dataloader for emotes_3k (SEACrowd#397)

* Implement Dataloader for emotes_3k

* Implement Dataloader for emotes_3k

* Tasks updated from sentiment analysis to morality classification

* Implement Change Request

* formatting emotes_3k

---------

Co-authored-by: Jonibek Mansurov <[email protected]>

* refactor: remove main function

Co-authored-by: Lj Miranda <[email protected]>

* Update constants.py

* Closes SEACrowd#311 | Add dataloader for indonesian_madurese_bible_translation (SEACrowd#337)

* add dataloader for indonesian_madurese_bible_translation

* update the license of indonesian_madurese_bible_translation

* Update indonesian_madurese_bible_translation.py

* modify based on comments from holylovenia

* [indonesian_madurese_bible_translation]

* update based on the reviewer's comments

* Remove `CONTRIBUTING.md`, update PR Message Template, and add bash to initialize dataset (SEACrowd#468)

* add bash to initialize dataset

* delete CONTRIBUTING.md since it's duplicated with DATALOADER.md

* update the docs slightly on suggesting new dataloader contributors to use template

* fix few wordings

* Add info on required vars '_LOCAL'

* Add checklist on __init__.py

* fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val

* fix wordings on first section of PR msg

* add newline separator for better readability

* add info on some to-dos

* refactor: citation

* Closes SEACrowd#83 | Implement Dataloader for GlobalWoZ (SEACrowd#261)

* refactor by pre-commit

* reformatted by pre-commit

* refactor code for globalwoz

* Create dataset loader for IndoQA SEACrowd#430 (SEACrowd#431)

* Add CODE_SWITCHING_IDENTIFICATION task (SEACrowd#488)

* Closes SEACrowd#396 | Implement dataloader for CrossSum (SEACrowd#419)

* Implement dataloader

* Change to 3-letter ISO codes

* Change task to CROSS_LINGUAL_SUMMARIZATION

* Closes SEACrowd#92 | Create Jail break data loader (SEACrowd#390)

* feat: jailbreak dataloader

* fix: minor errors

* refactor: styling

* refactor: remove main entry

* refactor: class name

* refactor: remove unused loop

* fix: separate text column into different subsets

* Create __init__.py

* Implement CommonVoice 12.0 dataloader (SEACrowd#452)

* Closes SEACrowd#202 | Implement dataloader for WIT (SEACrowd#374)

* Implement dataloader for WIT

* Remove unnecessary commits

* Add to description

---------

Co-authored-by: Railey Montalan <[email protected]>

* Split into language subsets

* Split into language subsets

* Update seacrowd/sea_datasets/thai_depression/thai_depression.py

Co-authored-by: Lj Miranda <[email protected]>

* fix: change lincense to unknown

* fix: minor errors

* Closes SEACrowd#80 | Implement MSVD-Indonesian Dataloader (SEACrowd#135)

* implement id_msvd dataloader

* change logic for seacrowd schema (text first, then video); quality of life change to video schema

* revert seacrowd video key from "text" to "texts"

* change source logic to match original data implementation

* run make check_file

* Closes SEACrowd#34  |  Create dataset loader for MKQA (SEACrowd#177)

* Create dataset loader for MKQA SEACrowd#34

* Refactor class variables _LANGUAGES to global for MKQA SEACrowd#34

* Filter supported languages (SEA only) of seacrowd_qa schema for MKQA SEACrowd#34

* Filter supported languages (SEA only) of source schema for MKQA SEACrowd#34

* Filter supported languages (SEA only) for MKQA SEACrowd#34 (a leftover)

* Change language code from macrolanguage, msa to zlm, for MKQA SEACrowd#34

* Change to a more appropriate language code of  for Malaysian variant used in MKQA SEACrowd#34

* Changed the value of field 'type' of QA schema to be more general, and moved the more specific value to 'meta' field for MKQA SEACrowd#34

* Replace None value to empty array in 'answer_aliases' sub-field for consistency in MKQA SEACrowd#34

* Closes SEACrowd#193 | Create dataset loader for MALINDO Morph (SEACrowd#332)

* Implement dataloader for MALINDO morph

* Specify file encoding and remove newlines when loading data

* Add blank __init__.py

* Fix typos in docstring

* Fix typos

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

---------

Co-authored-by: Jennifer Santoso <[email protected]>

* fix: subsets

* Closes SEACrowd#314 | Add dataloader for Indonesia chinese mt robust eval (SEACrowd#388)

* add dataloader for indonesian_madurese_bible_translation

* update dataloader for indonesia_chinese_mtrobusteval

* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py

* Update indonesia_chinese_mtrobusteval.py

* update code based on the reviewer comments

* add __init__.py

* Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py

* Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py

---------

Co-authored-by: Jennifer Santoso <[email protected]>

* refactor: feature naming

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* fix: homepage url

* Closes SEACrowd#211 | Implement dataloader for SEAHORSE (SEACrowd#407)

* implement seahorse dataloader

* update

* update

* incorporate the latest comments though tensorflow still needed for tfds

* update

* update

* fix: lowercase feature name

* refactor: subset name

* fix: limit the sentence paths to the relevant languages

* refactor: remove possible error

* Change default split to TEST

* Closes SEACrowd#447 |  Create dataset loader for Aya Dataset (SEACrowd#457)

* Implementing data loader for Aya Dataset

* Fixing license serialization issue

* Update based on formatter for aya_dataset.py

* update xlsum to extend more langs

* update based on formatter

* Closes SEACrowd#360 | Implement dataloader for khpos (SEACrowd#376)

* Implement dataloader for khpos

* Remove unneeded comment

* Implemented Test and Validation loading

* Streamlining code

* Closes SEACrowd#116 | Add pho_ner_covid Dataloader (SEACrowd#461)

* feat: pho_ner_covid dataloader

* refactor: classname

Co-authored-by: Lj Miranda <[email protected]>

* fix: remove main function

Co-authored-by: Lj Miranda <[email protected]>

* refactor: remove inplace uses for dataframe

* refactor: remove duplicate statement

---------

Co-authored-by: Lj Miranda <[email protected]>

* refactor: remove trailing spaces

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* refactor: url format

* edit 'texts' to 'text' key (SEACrowd#499)

* Closes SEACrowd#217 | Implement dataloader for `wili_2018` (SEACrowd#381)

* Implement dataloader for wili_2018

* update

* Closes SEACrowd#104 | Add lazada_review_filipino (SEACrowd#409)

* Add lazada_review_filipino Closes SEACrowd#104

* Update lazada_review_filipino.py

Update config name

* Update lazada_review_filipino.py

fix typo

* Update lazada_review_filipino.py

bug fix - ValueError: Class label 5 greater than configured num_classes 5

* Update seacrowd/sea_datasets/lazada_review_filipino/lazada_review_filipino.py

---------

Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>

* Adjust bash script test_example.sh and test_example_source_only.sh (SEACrowd#171)

* update: adjust test_example.sh and test_example_source_only.sh

* fix: minor error message when dataset is empty

* updated kde4 language codes to iso639-3

* fix: citation

* refactor: use base config class

* create dataset loader for myanmar-rakhine parallel (SEACrowd#471)

* add pyreadr==0.5.0 (SEACrowd#504)

usage: reads/writes R RData and Rds files into/from pandas data frames

* Closes SEACrowd#97 | Inter-Agency Task Force for the Management of Emerging Infectious Diseases (IATF) COVID-19 Resolutions  (SEACrowd#460)

* Closes SEACrowd#274 | Create OIL data loader (SEACrowd#389)

* initial commit

* refactor: move module

* feat: dataset implementation

* feat: oil dataloader

* refactor: move dataloader file

* refactor: move dataloader file

* fix: non unique id error

* refactor: file formating

* refactor: remove comments

* fix: invalid config name exception raise

* refactor: audio cache file path

* fix: remove useless loop

* refactor: formatting

* Create __init__.py

* fix: citation

* fix: remove seacrowd schema

* Closes SEACrowd#49 | Updated existing TICO_19 dataloader to support more sea languages (SEACrowd#414)

* Updated existing TICO_19 dataloader to support more sea languages

* added sea languages to _LANGUAGES

---------

Co-authored-by: ssfei81 <[email protected]>

* Closes SEACrowd#443 | Add dataloader for ASR-STIDUSC (SEACrowd#493)

* Add dataloader for ASR-STIDUSC

* update task, dataset name, pythonic coding

* add relation extraction task (SEACrowd#502)

* fix: subset and config name

* Update bibtex id

* Closes SEACrowd#356 | Implement dataloader for CodeSwitch-Reddit (SEACrowd#451)

* Add CODE_SWITCHING_IDENTIFICATION task

* Implement dataloader

* Update codeswitch_reddit.py

fix column naming in source (using lowercase instead of capitalized)

* Closes SEACrowd#222 | Create dataset loader for CreoleRC (SEACrowd#469)

* Create dataset loaderfor CreoleRC

* remove changes to constants.py

* remove document_id, add normalized, add sanity check on offset value

* Update REVIEWING.md

Clarify wording in Dataloader Reviewing Doc

* Closes SEACrowd#341  | Create dataset loader for myParaphrase (SEACrowd#436)

* [add]  dataloader for my_paraphrase

* [refactor]  removed redundant breakpoint; put right default schema function

* [refactor]  changed schema for dataset

* [refactor]  split data into 3 categories(paraphrase, non_paraphrase, all)

* [refactor]  default config name is changed

* [refactor]  source configs for _paraphrase,_non_paraphrase,_all; altered schema naming

* [refactor]  cleaner conditioning, defined else clause

* Closes SEACrowd#269 | Create dataset loader for ViVQA SEACrowd#269 (SEACrowd#318)

* add vivqa dataloader

* Update vivqa.py

* update viviq dataloader config

* Update vivqa.py

* add vivqa dataloader

* Update vivqa.py

* update viviq dataloader config

* Update vivqa.py

* Update vivqa.py

* update

* Update vivqa.py

* Update vivqa.py

* Delete .idea/vcs.xml

* Delete .idea/seacrowd-datahub.iml

* Delete .idea/inspectionProfiles/profiles_settings.xml

* Delete .idea/inspectionProfiles/Project_Default.xml

* Update vivqa.py

* Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"

This reverts commit a96fa80, reversing
changes made to 23700ca.

* Delete .idea/vcs.xml

* Delete .idea/seacrowd-datahub.iml

* Delete .idea/inspectionProfiles/profiles_settings.xml

* Delete .idea/inspectionProfiles/Project_Default.xml

* Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"

This reverts commit a96fa80, reversing
changes made to 23700ca.

* Revert "Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa""

This reverts commit 5f1a3d6.

* fixing trailing space and run Makefile

* Closes SEACrowd#445 | Create dataset loader for malaysia-tweets-with-sentiment-labels (SEACrowd#450)

* Fix typo syntax dictionary at constants.py

* Add dataloader for malaysia_tweets

* Completed requested changes

* add dataloader for ASR-Sindodusc (SEACrowd#491)

* Closes SEACrowd#475 | Add dataloader for indonglish-dataset (SEACrowd#490)

* create dataloader for indonglish

* make subset_id unique, use ClassLabel for label

* Closes SEACrowd#215 | Implement dataloader for `thai_gpteacher` (SEACrowd#382)

* Implement dataloader for thai_gpteacher

* update

* update

* Closes SEACrowd#275 | Create dataset loader for UIT-ViCoV19QA SEACrowd#275 (SEACrowd#463)

* add SeaCrowd dataloader for uit_vicov19qa

* Merge subsets to one

* remove unused imported package

* Closes SEACrowd#309 | Create dataset loader for Vietnamese Hate Speech Detection (UIT-ViHSD) #309Uit vihsd (SEACrowd#501)

* create dataloader for uit_vihsd

* Update uit_vihsd.py

* Add some info for the labels

* Update example for Seacrowd schema

* Closes SEACrowd#441 | Add dataloader for ASR-SMALDUSC (SEACrowd#492)

* Add dataloader for ASR-SMALDUSC

* add prompt field

* Closes SEACrowd#307 | Implement dataloader for ViSoBERT  (SEACrowd#466)

* Update constants.py

* Implement dataloader for ViSoBERT

* Fix conflicts with constants.py

* Combine source and seacrowd_ssp schemas

---------

Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>

* add dataloader for wikitext_tl_39 (SEACrowd#486)

* Closes SEACrowd#393 | Create dataset loader for WEATHub (SEACrowd#496)

* [Feature] Add Weathub DataLoader

* [Fix] Add filter for SEA languages only + add constants + run formatter

* [Chore] Fix data loader naming

* [Fix] Implement requested changes from review

* Closes SEACrowd#188 | Implement dataloader for Sea-bench (SEACrowd#375)

* Implement dataloader for WIT

* Implement dataloader for sea_bench

* Remove WIT

* Remove logger and unnecessary variables

* Add instruction tuning and remove QA and summarization tasks

* Add __init__.py file

* Remove machine translation task

* Fix nitpicks

---------

Co-authored-by: Railey Montalan <[email protected]>

* Closes SEACrowd#115 | Create dataset loader for PhoMT dataset (SEACrowd#489)

* add dataloader for PhoMT dataset

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* update text1/2 name for PhoMT dataset

* Update phomt.py to replace en&vi to eng&vie

---------

Co-authored-by: Elyanah Aco <[email protected]>

* Closes SEACrowd#310 | Create dataset loader for ViSpamReviews SEACrowd#310 (SEACrowd#454)

* add vispamreviews dataloader

* update vispamreviews

* update schema

* Closes SEACrowd#530 | Add/Update Dataloader Tatabahasa (SEACrowd#540)

* feat: dataloader QA commonsense-reasoning

* nitpick

* Closes SEACrowd#267  | Add dataloader for struct_amb_ind (SEACrowd#506)

* Implement dataloader for struct_amb_ind

* Update seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py

Co-authored-by: Jonibek Mansurov <[email protected]>

---------

Co-authored-by: Jonibek Mansurov <[email protected]>

* Closes SEACrowd#347 | Create dataset loader for IndoWiki (SEACrowd#485)

* create dataset loader for IndoWiki

* remove seacrowd schema

* Closes SEACrowd#354 | Implement dataloader for ETOS (SEACrowd#416)

* Implement dataloader for ETOS

* Implement dataloader for ETOS

* Rename dataset class name to ETOSDataset

* Remove  schema due to insufficient annotations

* Change ETOS into a POS tagging dataset

* Add missing __init__.py file

* Fix nitpicks

* Add DEFAULT_CONFIG_NAME

---------

Co-authored-by: Railey Montalan <[email protected]>

* update common_parser for UD JV_CSUI (SEACrowd#558)

* Create dataset loader for UD Javanese-CSUI SEACrowd#427 (SEACrowd#432)

* Closes SEACrowd#446 | Add/Update Dataloader voxlingua (SEACrowd#543)

* add init voxlingua

* Update seacrowd/sea_datasets/voxlingua/voxlingua.py

Co-authored-by: Lj Miranda <[email protected]>

---------

Co-authored-by: Lj Miranda <[email protected]>

* Closes SEACrowd#428 | Create dataset loader for Indonesia BioNER (SEACrowd#434)

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update cc3m_35l.py

Changed "_LANGS" to "_LANGUAGES"

* init commit

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Closes SEACrowd#344 | Create dataset loader for VLSP2016-SA (SEACrowd#500)

* [add]  dataloader for vlsp2016_sa[local]

* [refactor]  changed schema name

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Fix the private datasheet link in POINTS.md (SEACrowd#568)

* Closes SEACrowd#192 | Create dataset loader for MALINDO_parallel (SEACrowd#385)

* add malindo_parallel.py

* cleanup

* Class name fix

Co-authored-by: Lj Miranda <[email protected]>

* Remove sample licenses

Co-authored-by: Lj Miranda <[email protected]>

* fix dataset formatting error, use original dataset id

---------

Co-authored-by: Lj Miranda <[email protected]>

* Closes SEACrowd#114 | Implement dataloader for VnDT (SEACrowd#467)

* Implement dataloader for VnDT

* Add utility to impute missing sent_id and text fields from CoNLL files

* Fix imputed outputs

---------

Co-authored-by: Railey Montalan <[email protected]>

* add ocr task (SEACrowd#555)

* PR for update subset composition of TydiQA | Close SEACrowd#465 (SEACrowd#503)

* update csubset composition

* Update Subset Composition

* Update Subset Composition

* update subset name

indonesian --> ind
thai --> tha

* Update nusaparagraph_emot.py

* Update nusaparagraph_emot.py

* Update configs.py

* Closes SEACrowd#346 | Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings) (SEACrowd#406)

* Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings)

* Create __init__.py for MUSE SEACrowd#346

* Remove unused comment lines for MUSE SEACrowd#346

* changed all 2 letters language codes to 3 letters

---------

Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>

* Closes SEACrowd#12 | Add/Update Dataloader BalitaNLP (SEACrowd#550)

* Implement dataloader for balita_nlp

* Remove articles with missing images from imtext schema

* Add details to metadata

* Adding New Citation for Bhinneka korpus (SEACrowd#599)

* Add bhinnek_korpus dataset loader

* Updating the suggested changes

* Resolved review suggestions

* adding new citation

---------

Co-authored-by: Holy Lovenia <[email protected]>

* Closes SEACrowd#270 | Create dataset loader for OpenViVQA SEACrowd#270 (SEACrowd#464)

* add sample

* init submit for openvivqa dataloader

* Update openvivqa.py

* Update openvivqa.py

* update dict format

* Closes SEACrowd#516 | Add/Update Dataloader id_newspaper_2018 (SEACrowd#551)

* Implement dataloader for id_newspaper_2018

* Specify JSON encoding

* Closes SEACrowd#429 | Implement dataloader for filipino_hatespeech_election (SEACrowd#487)

* Add dataloader for filipino_hatespeech_election

* update task

* update

* Closes SEACrowd#52 | Add cosem dataloader (SEACrowd#473)

* feat: cosem dataloader

* fix: citation

* refactor: dataloader class name

* fix: file parsing logic

* fix: id format

* fix: tab separator bug in text

* fix: check for unique id

* Closes SEACrowd#424 | Add Dataloader Bactrian-X

* Import `schemas` beforehand on `templates/template.py` (SEACrowd#644)

* add import statement for schemas

* add import statement for schemas

* Closes SEACrowd#313 | Add dataloader for Saltik (SEACrowd#387)

* add dataloader for indonesian_madurese_bible_translation

* add dataloader for saltik

* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py

* update based on the reviewer comment

* update based on the reviewer comment

* Remove the modified constants.py from PR

---------

Co-authored-by: Holy Lovenia <[email protected]>

* Add `.upper` method for `--schema` parameter (SEACrowd#648)

* add upper method for --schema

* revert code-style

* Closes SEACrowd#438 | Add dataloader for ASR-INDOCSC (SEACrowd#509)

* add dataloader for asr_indocsc

* Update asr_indocsc.py for data downloading instructions

---------

Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Co-authored-by: Elyanah Aco <[email protected]>
Co-authored-by: Yuze GAO <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: XU, Yan (Yana) <[email protected]>
Co-authored-by: Haochen Li <[email protected]>
Co-authored-by: Jennifer Santoso <[email protected]>
Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Muhammad Dehan Al Kautsar <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Maria Khelli <[email protected]>
Co-authored-by: Ishan Jindal <[email protected]>
Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: IvanHalimP <[email protected]>
Co-authored-by: Enliven26 <[email protected]>
Co-authored-by: Dan John Velasco <[email protected]>
Co-authored-by: Chenxi <[email protected]>
Co-authored-by: Bhavish Pahwa <[email protected]>
Co-authored-by: FawwazMayda <[email protected]>
Co-authored-by: fawwaz.mayda <[email protected]>
Co-authored-by: Ilham F Putra <[email protected]>
Co-authored-by: rafif-kewmann <[email protected]>
Co-authored-by: mrafifrbbn <[email protected]>
Co-authored-by: Yong Zheng-Xin <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: joan <[email protected]>
Co-authored-by: joanitolopo <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: ssun32 <[email protected]>
Co-authored-by: Tyson <[email protected]>
Co-authored-by: Ilham Firdausi Putra <[email protected]>
Co-authored-by: Johanes Lee <[email protected]>
Co-authored-by: Akhdan Fadhilah <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>
Co-authored-by: Börje Karlsson <[email protected]>
Co-authored-by: Muhammad Satrio Wicaksono <[email protected]>
Co-authored-by: Wenyu Zhang <[email protected]>
Co-authored-by: R. Damanhuri <[email protected]>
Co-authored-by: Patrick Amadeus Irawan <[email protected]>
Co-authored-by: Reza Qorib <[email protected]>
Co-authored-by: Bryan Wilie <[email protected]>
Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>
Labels
help wanted Extra attention is needed pr-ready A PR that closes this issue is Ready to be reviewed
Projects
Status: Done

6 participants