
Closes #83 | Implement Dataloader for GlobalWoZ #261

Merged
merged 6 commits on Mar 4, 2024

Conversation


@dehanalkautsar dehanalkautsar commented Dec 30, 2023

Closes #83.

Note

  • For the TOD SEACrowd schema features, I left the turn_label feature key as an empty array in this dataloader because the dataset does not provide it.
  • system_acts is populated with the dialog_act from the user utterance in the original dataset, as our schema dictates that system_acts should represent the system's intended actions based on the user's utterance.
  • Likewise, belief_state is populated with the span_info from the user utterance in the original dataset, as our schema dictates that belief_state should represent the system's belief state based on the user's utterance, not the system_utterance.
  • Some key features from the source dataset (e.g., goal and log.metadata) have not been incorporated into the SEACrowd TOD schema, as the schema has no corresponding features. A sketch of the resulting turn mapping follows this list.
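
A minimal sketch (not the actual dataloader code; map_turn and its inputs are hypothetical names shaped like the source sample below) of how one user/system turn pair from the source log maps onto the SEACrowd TOD schema under these notes:

import json

def map_turn(turn_idx, user_turn, system_turn):
    # Acts and belief state both come from the USER side, per the notes above.
    dialog_act = json.loads(user_turn["dialog_act"])
    return {
        "turn_idx": turn_idx,
        "turn_label": [],  # not provided by GlobalWoZ, left as an empty array
        "user_utterance": user_turn["text"],
        "system_utterance": system_turn["text"] if system_turn else "",
        # system_acts: the system's intended actions, derived from the user's dialog_act
        "system_acts": [[slot] for pairs in dialog_act.values() for slot, _ in pairs],
        # belief_state: derived from the user's span_info
        "belief_state": [
            {"slots": [[slot, value]], "act": act.split("-")[-1]}
            for act, slot, value, _, _ in user_turn["span_info"]
        ],
    }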

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscores for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators(), and _generate_examples() in the dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm the dataloader script works with the datasets.load_dataset function (see the example below this checklist).
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit tests in the PR (please copy-paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
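
A hedged example of that load_dataset check; the config name and data_dir come from the unit-test log in this PR, and data_dir is only needed because the dataset is local:

import datasets

ds = datasets.load_dataset(
    "seacrowd/sea_datasets/globalwoz/globalwoz.py",
    name="globalwoz_EandF_id_source",
    data_dir="./dataset/globalwoz",  # local copy of the GlobalWoZ files
)
print(ds["train"][0]["id"])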

Output of the Unit-test

Because this is a local dataset (I had some problems loading the datasets directly from the OneDrive URLs, so I needed to download them via this link: https://entuedu-my.sharepoint.com/personal/bosheng001_e_ntu_edu_sg/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fbosheng001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fglobalwoz%5Fv2&ga=1), I ran the test suite against the locally downloaded data using the script below.

My Unit-test script

python -m tests.test_seacrowd seacrowd/sea_datasets/globalwoz/globalwoz.py --data_dir=./dataset/globalwoz --subset_id globalwoz_EandF_id

Unit-test output

INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/globalwoz/globalwoz.py', schema=None, subset_id='globalwoz_EandF_id', data_dir='./dataset/globalwoz', use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/globalwoz/globalwoz.py
INFO:__main__:self.SUBSET_ID: globalwoz_EandF_id
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: ./dataset/globalwoz
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.globalwoz.globalwoz
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.E2E_TASK_ORIENTED_DIALOGUE: 'TOD'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'TOD'}
INFO:__main__:schemas_to_check: {'TOD'}
INFO:__main__:Checking load_dataset with config name globalwoz_EandF_id_source
D:\External\seacrowd-datahub\venv\lib\site-packages\datasets\load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
INFO:__main__:Checking load_dataset with config name globalwoz_EandF_id_seacrowd_tod
D:\External\seacrowd-datahub\venv\lib\site-packages\datasets\load.py:2088: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
INFO:__main__:Dataset sample [source]
{'id': '0', 'goal': {'attraction': '{}', 'hospital': '{}', 'hotel': '{"book": {"day": "Minggu", "invalid": false, "people": "dua", "pre_invalid": true, "stay": "tujuh"}, "fail_book": {"stay": "empat"}, "fail_info": {}, "info": {"internet": "no", "stars": "2", "type": "guesthouse"}}', 'police': '{}', 'restaurant': '{"book": {"day": "Minggu", "invalid": false, "people": "dua", "time": "17:15"}, "fail_book": {}, "fail_info": {}, "info": {"area": "pusat", "food": "Asia", "pricerange": 
"harga sedang"}}', 'taxi': '{}', 'train': '{}'}, 'log': [{'dialog_act': '{"Restaurant-Inform": [["food", "Asia"]]}', 'metadata': '{}', 'span_info': [['restaurant-inform', 'food', 'Asia', '71', '89']], 'text': "i'd really like to take my client out to a nice restaurant that serves Asia food."}, {'dialog_act': '{"Restaurant-Inform": [["choice", "many"], ["food", "Asia"], ["pricerange", "murah"]], "Restaurant-Request": [["area", "?"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": [], "people": [], "stay": []}, "semi": {"area": [], "internet": [], "name": [], "parking": [], "pricerange": [], "stars": [], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [], "day": [], "people": [], "time": []}, "semi": {"area": [], "food": ["Asia"], "name": [], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": 
[], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [['restaurant-inform', 'choice', 'many', '7', '11'], ['restaurant-inform', 'food', 'Asia', '35', '53'], ['restaurant-inform', 'pricerange', 'murah', '62', '86']], 'text': 'i show many restaurants that 
serve Asia food in murah. what area would you like to travel to?'}, {'dialog_act': '{"Restaurant-Inform": [["area", "pusat"], ["food", "Asia"], ["pricerange", "harga sedang"]]}', 'metadata': '{}', 'span_info': [['restaurant-inform', 'pricerange', 'harga sedang', '20', '44'], ['restaurant-inform', 'food', 'Asia', '45', '63'], ['restaurant-inform', 'area', 'pusat', '90', '108']], 'text': 'i am looking for an harga sedang Asia restaurant in the area of pusat.'}, {'dialog_act': '{"Booking-Inform": [["none", "none"]], "Restaurant-Recommend": [["area", "pusat"], ["food", "Asia"], ["name", "Resto Ngalam"], ["pricerange", "harga sedang"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": [], "people": [], "stay": []}, "semi": {"area": [], "internet": [], "name": [], "parking": [], "pricerange": [], "stars": [], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [], "day": [], "people": [], "time": []}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": [], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [['restaurant-recommend', 'name', 'Resto Ngalam', '18', '36'], ['restaurant-recommend', 'pricerange', 'harga sedang', '49', '73'], ['restaurant-recommend', 'food', 'Asia', '74', '92'], ['restaurant-recommend', 'area', 'pusat', '111', '129']], 'text': 'might i recommend Resto Ngalam? that is an harga sedang Asia restaurant in the pusat. i can book a table for you, if you like.'}, {'dialog_act': '{"Restaurant-Inform": [["bookday", "Minggu"], ["bookpeople", "dua"], ["booktime", "17:15"]]}', 'metadata': '{}', 'span_info': [['restaurant-inform', 'bookpeople', 'dua', '28', '52'], ['restaurant-inform', 'booktime', '17:15', '63', '85'], ['restaurant-inform', 'bookday', 'Minggu', '89', '110']], 'text': 'sure thing, please book for dua people at 17:15 on Minggu.'}, {'dialog_act': '{"Booking-Book": [["ref", "rf00jufq"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": [], "people": [], "stay": []}, "semi": {"area": [], "internet": [], "name": [], "parking": [], 
"pricerange": [], "stars": [], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": 
[], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [['booking-book', 'ref', 'rf00jufq', '94', '102']], 'text': 'booking was successful. the table will be reserved for 15 minutes. your reference number is : rf00jufq .'}, {'dialog_act': '{"Hotel-Inform": [["internet", "no"], ["stars", "2"]]}', 'metadata': '{}', 'span_info': [['hotel-inform', 'stars', '2', '61', '75']], 'text': "okay great! thank you so much. could you also help me find a 2 star hotel in the area. i don't need wifi either."}, {'dialog_act': '{"Booking-Inform": [["none", "none"]], "Hotel-Inform": [["area", 
"selatan"], ["internet", "yes"], ["name", "Everyday Smart Hotel Mayestik"], ["parking", "yes"], ["pricerange", "mahal"], ["stars", "2"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": 
{"booked": [], "day": [], "people": [], "stay": []}, "semi": {"area": [], "internet": ["no"], "name": [], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [['hotel-inform', 'name', 'Everyday Smart Hotel Mayestik', '0', '14'], ['hotel-inform', 'area', 'selatan', '25', '39'], ['hotel-inform', 'pricerange', 'mahal', '47', '67'], ['hotel-inform', 'stars', '2', '76', '90']], 'text': 'Everyday Smart Hotel Mayestik is in the selatan and is mahal. it has 2 stars and no internet or parking. would you like to book a room?'}, {'dialog_act': '{"Hotel-Inform": [["internet", "no"], ["parking", "no"]]}', 'metadata': '{}', 'span_info': [], 'text': 'that sounds great. please book that now.'}, {'dialog_act': '{"Booking-Request": [["bookpeople", "?"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": [], "people": [], "stay": []}, "semi": {"area": [], "internet": ["no"], "name": ["Everyday Smart Hotel Mayestik"], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [], 'text': 'may i ask how many people are in your group?'}, {'dialog_act': '{"Hotel-Inform": [["bookpeople", "dua"]]}', 'metadata': '{}', 'span_info': [['hotel-inform', 'bookpeople', 'dua', '7', '31']], 'text': 'i have dua people in my group.'}, {'dialog_act': '{"Booking-Request": [["bookstay", "?"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": [], "people": ["dua"], "stay": []}, "semi": {"area": [], "internet": ["no"], "name": ["Everyday Smart Hotel Mayestik"], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [], 'text': 'how many 
days would 
you like to stay?'}, {'dialog_act': '{"Hotel-Inform": [["bookday", "Minggu"], ["bookstay", "empat"]]}', 'metadata': '{}', 'span_info': [['hotel-inform', 'bookstay', 'empat', '0', '18'], ['hotel-inform', 'bookday', 'Minggu', '40', '61']], 'text': 'empat nights, starting the Minggu as the reservation.'}, {'dialog_act': '{"Booking-NoBook": [["none", "none"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": ["Minggu"], "people": ["dua"], "stay": ["empat"]}, "semi": {"area": [], "internet": ["no"], "name": ["Everyday Smart Hotel Mayestik"], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [], 'text': "i'm sorry. it looks like they're full. would you like me to look for something else?"}, {'dialog_act': '{}', 'metadata': '{}', 'span_info': [], 'text': 'yes please. is there something else available in that area?'}, {'dialog_act': '{"Hotel-Inform": [["area", "barat"], ["stars", "2"], ["type", "homestay"]], "Hotel-Request": [["area", "?"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [], "day": ["Minggu"], "people": ["dua"], "stay": ["empat"]}, "semi": {"area": [], "internet": ["no"], "name": ["Everyday Smart Hotel Mayestik"], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [['hotel-inform', 'stars', '2', '42', '56'], ['hotel-inform', 'type', 'homestay', '62', '76'], ['hotel-inform', 'area', 'barat', '80', '94']], 'text': "i'm sorry, it looks like that is the only 2 star homestay in barat, would you like me to look somewhere else?"}, {'dialog_act': '{"Hotel-Inform": [["bookstay", "tujuh"]]}', 'metadata': '{}', 'span_info': [['hotel-inform', 'bookstay', 'tujuh', '15', '33']], 'text': "can we try for tujuh night instead of 2? 
i'll need the reference number please."}, {'dialog_act': '{"Booking-Book": [["bookstay", "lima"], ["ref", "9xvt8m5t"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [{"name": ["Everyday Smart Hotel Mayestik"], "reference": "9xvt8m5t"}], "day": ["Minggu"], "people": ["dua"], "stay": ["tujuh"]}, "semi": {"area": [], "internet": ["no"], "name": ["Everyday Smart Hotel Mayestik"], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": 
{"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [['booking-book', 'bookstay', 'lima', '18', '36'], ['booking-book', 'ref', '9xvt8m5t', '68', '76']], 'text': 'i was able to get lima night, the reference number is 9xvt8m5t .'}, {'dialog_act': '{"general-thank": [["none", "none"]]}', 'metadata': '{}', 'span_info': 
[], 'text': 'thank you so much!'}, {'dialog_act': '{"general-welcome": [["none", "none"]]}', 'metadata': '{"attraction": {"book": {"booked": []}, "semi": {"area": [], "name": [], "type": []}}, "hospital": {"book": {"booked": []}, "semi": {"department": []}}, "hotel": {"book": {"booked": [{"name": ["Everyday Smart Hotel Mayestik"], "reference": "9xvt8m5t"}], "day": ["Minggu"], "people": ["dua"], "stay": ["tujuh"]}, "semi": {"area": [], "internet": ["no"], "name": ["Everyday Smart Hotel Mayestik"], "parking": [], "pricerange": [], "stars": ["2"], "type": []}}, "police": {"book": {"booked": []}, "semi": {}}, "restaurant": {"book": {"booked": [{"name": ["Resto Ngalam"], "reference": "rf00jufq"}], "day": ["Minggu"], "people": ["dua"], "time": ["17:15"]}, "semi": {"area": ["pusat"], "food": ["Asia"], "name": ["Resto Ngalam"], "pricerange": ["harga sedang"]}}, "taxi": {"book": {"booked": []}, "semi": {"arriveBy": [], "departure": [], "destination": [], "leaveAt": []}}, "train": {"book": {"booked": [], "ticket": []}, "semi": {"arriveBy": [], "day": [], "departure": [], "destination": [], "leaveAt": []}}}', 'span_info': [], 'text': "i'm glad to help, you're welcome!"}]}
INFO:__main__:Dataset sample [seacrowd_tod]
{'dialogue_idx': 0, 'dialogue': [{'turn_label': [], 'system_utterance': '', 'turn_idx': 0, 'belief_state': [{'slots': [['food', 'Asia']], 'act': 'inform'}], 'user_utterance': "i'd really like to take my client out to a nice restaurant that serves Asia food.", 'system_acts': [['food']]}, {'turn_label': [], 'system_utterance': 'i show many restaurants that serve Asia food in murah. what area would you like to travel to?', 'turn_idx': 1, 'belief_state': [{'slots': [['pricerange', 'harga sedang']], 'act': 'inform'}, {'slots': [['food', 'Asia']], 'act': 'inform'}, {'slots': [['area', 'pusat']], 'act': 'inform'}], 'user_utterance': 'i am looking 
for an harga sedang Asia restaurant in the area of pusat.', 'system_acts': [['area'], ['food'], ['pricerange']]}, {'turn_label': [], 'system_utterance': 'might i 
recommend Resto Ngalam? that is an harga sedang Asia restaurant in the pusat. i can book a table for you, if you like.', 'turn_idx': 2, 'belief_state': [{'slots': [['bookpeople', 'dua']], 'act': 'inform'}, {'slots': [['booktime', '17:15']], 'act': 'inform'}, {'slots': [['bookday', 'Minggu']], 'act': 'inform'}], 'user_utterance': 'sure thing, please book for dua people at 17:15 on Minggu.', 'system_acts': [['bookday'], ['bookpeople'], ['booktime']]}, {'turn_label': [], 'system_utterance': 'booking was successful. the table will be reserved for 15 minutes. your reference number is : rf00jufq .', 'turn_idx': 3, 'belief_state': [{'slots': [['stars', '2']], 'act': 'inform'}], 'user_utterance': "okay great! thank you so much. could you also help me find a 2 star hotel in the area. i don't need wifi either.", 'system_acts': [['internet'], ['stars']]}, {'turn_label': [], 'system_utterance': 'Everyday Smart Hotel Mayestik is in the selatan and is mahal. it has 2 stars and no internet or parking. would you like to book a room?', 'turn_idx': 4, 'belief_state': [], 'user_utterance': 'that sounds great. please book that now.', 'system_acts': [['internet'], ['parking']]}, {'turn_label': [], 'system_utterance': 'may i ask how many people are in your group?', 'turn_idx': 5, 'belief_state': [{'slots': [['bookpeople', 'dua']], 'act': 'inform'}], 'user_utterance': 'i have dua people in my group.', 'system_acts': [['bookpeople']]}, {'turn_label': [], 'system_utterance': 'how many days would you like to stay?', 'turn_idx': 6, 'belief_state': [{'slots': [['bookstay', 'empat']], 'act': 'inform'}, {'slots': [['bookday', 'Minggu']], 'act': 'inform'}], 'user_utterance': 'empat nights, starting the Minggu as the reservation.', 'system_acts': [['bookday'], ['bookstay']]}, {'turn_label': [], 'system_utterance': "i'm sorry. it looks like they're full. would you like me to look for something else?", 'turn_idx': 7, 'belief_state': [], 'user_utterance': 'yes please. is there something else available in that area?', 'system_acts': []}, {'turn_label': [], 'system_utterance': "i'm sorry, it looks like that is the only 2 star homestay in barat, would you like me to look somewhere else?", 'turn_idx': 8, 'belief_state': [{'slots': [['bookstay', 'tujuh']], 'act': 'inform'}], 'user_utterance': "can we try for tujuh night instead of 2? i'll need the reference number please.", 'system_acts': [['bookstay']]}, {'turn_label': [], 'system_utterance': 'i was able to get lima night, the reference number is 9xvt8m5t .', 'turn_idx': 9, 'belief_state': [], 'user_utterance': 'thank you so much!', 'system_acts': [['none']]}, {'turn_label': [], 'system_utterance': "i'm glad to help, you're welcome!", 'turn_idx': 10, 'belief_state': [], 'user_utterance': '', 'system_acts': []}]}
INFO:__main__:Checking global ID uniqueness
INFO:__main__:Found 0 unique IDs
INFO:__main__:Gathering schema statistics
INFO:__main__:Gathering schema statistics
train
==========
dialogue_idx: 10437
dialogue: 81959

.
----------------------------------------------------------------------
Ran 1 test in 18.155s

@sabilmakbar sabilmakbar removed the request for review from jensan-1 January 18, 2024 18:42
@sabilmakbar
Collaborator

Because this is a local dataset (I have some problems while loading the datasets directly from OneDrive URLs, so I need to download them via this link https://entuedu-my.sharepoint.com/personal/bosheng001_e_ntu_edu_sg/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fbosheng001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fglobalwoz%5Fv2&ga=1)

Just throwing out a possible workaround: have you tried copying the download link? After you click download, you should be able to obtain its direct download link (like this one) and pass it directly to the HF download methods.
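
A rough sketch of that workaround, assuming the direct link copied from the browser's download action can be handed to the datasets download manager; the URL below is a placeholder, not a verified working link:

import datasets

# Hypothetical direct-download URL copied from the browser's download action.
direct_url = "https://entuedu-my.sharepoint.com/personal/bosheng001_e_ntu_edu_sg/_layouts/15/download.aspx?SourceUrl=..."

dl_manager = datasets.DownloadManager()
local_path = dl_manager.download(direct_url)  # fails with HTTP 403 if authentication is required
print(local_path)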


@sabilmakbar sabilmakbar left a comment


Initial review. I won't review the dataloader methods for now, pending clarity on whether it's possible to make the dataset publicly available.

}
"""

_DATASETNAME = "[globalwoz]"

Keep this as a string without any non-word characters.
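
Presumably the fix being requested here, i.e. dropping the brackets:

_DATASETNAME = "globalwoz"  # plain string, no non-word characters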

Comment on lines 84 to 103
BUILDER_CONFIGS = [
seacrowd_config_constructor("EandF", "id", "source", _SOURCE_VERSION),
seacrowd_config_constructor("EandF", "th", "source", _SOURCE_VERSION),
seacrowd_config_constructor("EandF", "vi", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandE", "id", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandE", "th", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandE", "vi", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandF", "id", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandF", "th", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandF", "vi", "source", _SOURCE_VERSION),
seacrowd_config_constructor("EandF", "id", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("EandF", "th", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("EandF", "vi", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandE", "id", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandE", "th", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandE", "vi", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandF", "id", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandF", "th", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandF", "vi", "seacrowd_tod", _SEACROWD_VERSION),
]

Mind replacing this with itertools.product and iterating over it, rather than spelling out each expression?

Something like this:

Suggested change
BUILDER_CONFIGS = [
seacrowd_config_constructor("EandF", "id", "source", _SOURCE_VERSION),
seacrowd_config_constructor("EandF", "th", "source", _SOURCE_VERSION),
seacrowd_config_constructor("EandF", "vi", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandE", "id", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandE", "th", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandE", "vi", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandF", "id", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandF", "th", "source", _SOURCE_VERSION),
seacrowd_config_constructor("FandF", "vi", "source", _SOURCE_VERSION),
seacrowd_config_constructor("EandF", "id", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("EandF", "th", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("EandF", "vi", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandE", "id", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandE", "th", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandE", "vi", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandF", "id", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandF", "th", "seacrowd_tod", _SEACROWD_VERSION),
seacrowd_config_constructor("FandF", "vi", "seacrowd_tod", _SEACROWD_VERSION),
]
BUILDER_CONFIGS = [
    seacrowd_config_constructor(tod_format, lang, schema, _SOURCE_VERSION if schema == "source" else _SEACROWD_VERSION)
    for tod_format, lang, schema in itertools.product(("EandF", "FandE", "FandF"), ("id", "th", "vi"), ("source", "seacrowd_tod"))
]

DEFAULT_CONFIG_NAME = "globalwoz_EandF_id_source"

Since DEFAULT_CONFIG_NAME is actually optional, do you think it still makes sense to assign a default config to it?

@dehanalkautsar
Collaborator Author

dehanalkautsar commented Jan 22, 2024

Just throwing out a possible workaround: have you tried copying the download link? After you click download, you should be able to obtain its direct download link (like this one) and pass it directly to the HF download methods.

I've got an error like this whenever I try to download directly from the link:

raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")
ConnectionError: Couldn't reach https://entuedu-my.sharepoint.com/personal/bosheng001_e_ntu_edu_sg/_layouts/15/download.aspx?SourceUrl=%2Fpersonal%2Fbosheng001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fglobalwoz%5Fv2%2FE%26F%2Fid%2FE%26F%5Fid%2Ejson (error 403)

What are your thoughts on this? @sabilmakbar

@SamuelCahyawijaya
Collaborator

@dehanalkautsar : Do you have the local copy of the dataset? If so, would you be able to ask the authors for the license of the dataset?

Since the dataset is based on MultiWoZv2.2 and MultiWoZv2.2 is licensed as Apache 2.0 (https://huggingface.co/datasets/multi_woz_v22), I expect that GlobalWoZ will also have a similar license. If so, we can redistribute the dataset and make a copy of the dataset on a GitHub repo which will be easier to access.

@dehanalkautsar
Collaborator Author

@dehanalkautsar : Do you have the local copy of the dataset? If so, would you be able to ask the authors for the license of the dataset?

Since the dataset is based on MultiWoZv2.2 and MultiWoZv2.2 is licensed as Apache 2.0 (https://huggingface.co/datasets/multi_woz_v22), I expect that GlobalWoZ will also have a similar license. If so, we can redistribute the dataset and make a copy of the dataset on a GitHub repo which will be easier to access.

I need to download all 9 dataset subsets; so far I've downloaded only 1 because of their size. But that is certainly feasible. I will contact the authors about the dataset's license.

@sabilmakbar
Collaborator

sabilmakbar commented Feb 4, 2024

I've got an error like this whenever I try to download directly from the link:

raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")
ConnectionError: Couldn't reach https://entuedu-my.sharepoint.com/personal/bosheng001_e_ntu_edu_sg/_layouts/15/download.aspx?SourceUrl=%2Fpersonal%2Fbosheng001%5Fe%5Fntu%5Fedu%5Fsg%2FDocuments%2Fglobalwoz%5Fv2%2FE%26F%2Fid%2FE%26F%5Fid%2Ejson (error 403)
What are your thoughts on this? @sabilmakbar

Hi @dehanalkautsar, it looks like the dataset URL and its JSON download require user authentication. From what I read, it's tricky to get past the authentication step in order to access and download the data successfully (as you mentioned previously, you received a 403 response code). Maybe our workaround for now is to pass the local path of the downloaded files/folders per language, until we find a way to authenticate the drive via user e-mail.

Btw, this is the closest and probably easiest way to log in to the drive service via the Drive SDK, but I'm unclear on where and how to generate the client ID and token, and on whether this suffices to pass the authentication barrier to read the files.

@SamuelCahyawijaya
Collaborator

@sabilmakbar @dehanalkautsar : If the license allows, we can just make a copy of the dataset, put it somewhere else, and link the dataset to the new source link.

Btw, I saw the email from @dehanalkautsar to the authors of the dataset, and I think it will be better to email the last author (the supervisor) rather than the main authors (the students).

@dehanalkautsar
Collaborator Author

@sabilmakbar @dehanalkautsar : If the license allows, we can just make a copy of the dataset, put it somewhere else, and link the dataset to the new source link.

Btw, I saw the email from @dehanalkautsar to the authors of the dataset, and I think it will be better to email the last author (the supervisor) rather than the main authors (the students).

Ah okay, I've re-sent the e-mail to the last author 👍


Hi @ & @, may I know if you are still working on this PR?

@SamuelCahyawijaya
Collaborator

Hi @dehanalkautsar , Is there any update on this?

@dehanalkautsar
Collaborator Author

Hi @dehanalkautsar , Is there any update on this?

Hi @SamuelCahyawijaya, sadly there is no update, as neither the first nor the last author has replied to the e-mail.

@sabilmakbar
Collaborator

Can we test it locally first (by setting _LOCAL=True) and proceed with the PR, instead of waiting for their permission to copy the data to a new source, and then modify it slightly later (pointing to the new URL and setting _LOCAL=False)?

cc @SamuelCahyawijaya
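
A minimal sketch of this _LOCAL=True arrangement, assuming a hypothetical helper name and error message (this is not the actual dataloader code):

import os

_LOCAL = True  # test against local data first; flip to False once the data is re-hosted

def resolve_data_dir(data_dir):
    """Require a data_dir while the dataset is local-only."""
    if data_dir is None:
        raise ValueError(
            "globalwoz is a local dataset for now; download it from the OneDrive link "
            "and pass data_dir=<path-to-folder> when loading."
        )
    return os.path.expanduser(data_dir)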

@holylovenia
Contributor

I'm just passing by to remind @dehanalkautsar to add __init__.py.

@dehanalkautsar
Collaborator Author

@sabilmakbar, I have revised the code according to your earlier feedback. If you have any questions, please feel free to ask. Additionally, for local testing, I've used the following test script:

python -m tests.test_seacrowd seacrowd/sea_datasets/globalwoz/globalwoz.py --data_dir=./dataset/globalwoz --subset_id globalwoz_EandF_id

The output matches the initial unit-test output submitted in this PR's description.


@SamuelCahyawijaya SamuelCahyawijaya left a comment


@dehanalkautsar : Thank you for contributing! LGTM!


@sabilmakbar sabilmakbar left a comment


The dataset and unit test look okay, but do you have the data for the FandE and FandF formats ready to be tested similarly (so the implementation is verified for the other two formats), @dehanalkautsar?

If you don't have those and the datasets are also quite large, we can acknowledge this.

@dehanalkautsar
Copy link
Collaborator Author

dehanalkautsar commented Mar 4, 2024

@sabilmakbar: I have skimmed through the dialogue data for the other dataset types (F&E and F&F), and I think they share the same data format as E&F. This is also confirmed by the very slight differences in the dialogues, which can be seen in the image below:
[image: screenshot comparing the dialogue formats]

@sabilmakbar sabilmakbar merged commit 5fba8f0 into SEACrowd:master Mar 4, 2024
1 check passed
zwenyu pushed a commit to zwenyu/seacrowd-datahub that referenced this pull request Mar 14, 2024
* refactor by pre-commit

* reformatted by pre-commit

* refactor code for globalwoz
MJonibek added a commit to MJonibek/seacrowd-datahub that referenced this pull request Apr 18, 2024
* Fix bug unique ids

* Closes SEACrowd#162 | Add Bloom-Captioning Dataloader (SEACrowd#198)

* Init dataloader bloom captioning

* Fix issue on multiple splits from its source

* Change local var

* Cater 'test' and 'val' split and fix the '_id' generation

* fix: remove abstreact and change _LOCAL and _DESC

* fix: _DESC indent

* Format openslr.py and add init file

* Closes SEACrowd#271 | Implement dataloader for UiT-ViCTSD (SEACrowd#300)

* Implement UiT-ViCTSD dataloader

* Improve subset IDs, feature types, code to generate examples

* Closes SEACrowd#161 | Create dataset loader for ICON 161 (SEACrowd#317)

* Create icon.py

* Update icon.py

* Create __init__.py

* Closes SEACrowd#142 | Add Unimorph v4 dataloader (SEACrowd#168)

* Add Unimorph dataloader

Resolves SEACrowd#142

* Add Dataset to class name

* Closes SEACrowd#71 | Create dataset loader for MASSIVE (SEACrowd#196)

* add data loader for massive dataset

* modify the class name & refactor the function name

* change task name from pos tagging to slot filling & make check_file & change subset name to differentiate intent / slot filling tasks

* Closes SEACrowd#14 | Create dataset loader for ara-close-lange (SEACrowd#243)

* Add ara_close dataloader

* Rename class name to AraCloseDataset

* Closes SEACrowd#273 | Implement dataloader for UIT_ViON (SEACrowd#282)

* Implement dataloader for UIT_ViON

* Add __init__.py

* Add {lang} in subset id for openslr

* Closes SEACrowd#219 | Create dataloader for scb-mt-en-th-2020 (SEACrowd#287)

* Create dataloader for scb-mt-en-th-2020

* Rename the data loader files to its snakecase

* rename _DATASETNAME to snakecase

* Fix languages setting

* Update template.py

* Add docstring openslr.py

* Closes SEACrowd#277 | Implement dataloader for spamid_pair (SEACrowd#281)

* Implemente dataloader for spamid_pair

* Update seacrowd/sea_datasets/spamid_pair/spamid_pair.py

Co-authored-by: Lj Miranda <[email protected]>

* Add __init__.py

* Update __init__.py

---------

Co-authored-by: Lj Miranda <[email protected]>

* Implemented dataloader for indoler

* Add imqa schema and VISUAL_QUESTION_ANSWERING task (SEACrowd#380)

* Update template.py

Update DownloadManager documentation link in template.py

* Closes SEACrowd#54 | Implement Dataloader for IndoSMD (SEACrowd#258)

* feat: indosmd dataloader for source

* refactor by pre-commit

* IndoSMD: reformatted by pre-commit

* Update changes on indosmd.py

* revised line 223 in indosmd.py

* Close#143 | Create dataset loader for Abui WordNet (SEACrowd#285)

* add tydiqa dataloader

* add id_vaccines_tweet dataloader

* add uit-vicc dataloader

* add ICON dataloader

* add iaap_squad dataloader

* add stb_ext dataloader

* Revert "add iaap_squad dataloader"

This reverts commit 1f8a591.

* Revert "add tydiqa dataloader"

This reverts commit 6bf4546.

* Revert "add id_vaccines_tweet dataloader"

This reverts commit 1154087.

* Revert "add uit-vicc dataloader"

This reverts commit 09661fa.

* Revert "add ICON dataloader"

This reverts commit 0891e58.

* Update stb_ext.py

* add abui_wordnet dataloader

* Revert "Update stb_ext.py"

This reverts commit 59c5301.

* Delete seacrowd/sea_datasets/stb_ext/stb_ext.py

* Delete seacrowd/sea_datasets/stb_ext/__init__.py

* Update abui_wordnet.py

* Update abui_wordnet.py

* Update abui_wordnet.py

---------

Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>

* Added Morality Classification Tasks to constants.py (SEACrowd#371)

* Closes SEACrowd#216 |  Create dataset loader for Mozilla Pontoon (SEACrowd#260)

* Begin first draft of Mozilla Pontoon dataloader

* Add dataloader for Mozilla Pontoon

* Remove enumerate in _generate_examples

* Fix issues due to changed format, rename features and config names

* Closes SEACrowd#157 | Create dataset loader for M3Exam (SEACrowd#302)

* Add m3exam dataloader

* Small change in m3exam.py

* Fix bug during downloading

* Add meta feature to seacrowd schema for m3exam

* Rename class M3Exam to M3ExamDataset

* Add image question answering

* Merge two source schemas into one for m3exam

* Fix image path, choices and answer in m3exam

* Update CODEOWNERS

* Rectify SEACrowd Internal Vars (SEACrowd#386)

* Add missing __init__.py

* add init

* fix bug in phoatis load

* add lang variables in dataloaders

* Add dataset use ack on source HF repo into description

* Closes SEACrowd#204 | Implement dataloader for Melayu_Sabah (SEACrowd#234)

* Implement dataloader for Melayu_Sabah

* Update name for the dataloader

* Add _CITATION

* Update seacrowd/sea_datasets/melayu_sabah/melayu_sabah.py

* Applu suggestions from review

* Moving unnecessary content in dialogue text

* Update melayu_sabah.py

* Improvement: Workflow Message to Mention Assignee in Staled Issues (SEACrowd#400)

* Update stale.yml (SEACrowd#327)

* Update stale.yml

Test on adding vars on assignee & author of Issues & PR

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Update stale.yml

* Closes SEACrowd#272 | Create dataset loader for SNLI (SEACrowd#290)

* [New Feature] Add SNLI dataloader

* [Fix] SNLI rev according to PR review

* [Chore] Add comment for accessibility

* Update common_parser.py (SEACrowd#333)

* Implement dataloader for UCLA Phonetic Corpus

* Implement dataloader for KDE4

* removed redundant builder_config

* Update cc3m_35l.py

Changed into no parallelization since it was kept being killed by the OS for some reason.

* Fix: Workflow Assignee Mention (SEACrowd#410)

* Update stale.yml

* Fix: wrong quote in message (SEACrowd#411)

* Update and fix bug on stale.yml

* Closes SEACrowd#17 | Implement dataloader for Philippine Fake News Corpus (SEACrowd#331)

* Implement dataloader

* Edit dataloader class name

* Simplify code

* Fix citation typo

* Closes SEACrowd#359 | Implement dataloader for LR-Sum (SEACrowd#368)

* Implement dataloader

* Fix short description

* feat: mswc dataloader skeleton

* feat: example for seacrowd schema

* Closes SEACrowd#265 | Implement dataloader for `myxnli` (SEACrowd#336)

* Implement dataloader for myxnli

* update myxnli

* Closes SEACrowd#112 | Implement Dataloader for Wisesight Thai Corpus (SEACrowd#279)

* Add wisesight_thai_sentiment dataset

* changes according to review

* changes according to review

* changes according to review

* Add changes according to review

* refactor: formatting

* fix: subset

* refactor: formatting

* Closes SEACrowd#6 | Add Loader for XCOPA (SEACrowd#286)

* initial add for loader

* edit to include multi language

* adjust comments

* apply suggestion

* fix by linter

---------

Co-authored-by: fawwaz.mayda <[email protected]>

* Closes SEACrowd#140 | Add Dengue Filipino (SEACrowd#259)

* add dengue filipino

* update license and tasks

* Update _LANGUAGE

* Update dengue_filipino.py

* feat: flores200 dataloader skeleton

* Set only one source schema

* Fix subnodes ids for root node alt_burmese_treebank

* implement Filipino Gay Language dataloader (SEACrowd#66)

* convert citation to raw string

* Closes SEACrowd#210 | Create dataset loader for Orchid Corpus (SEACrowd#303)

* Add orchid_pos dataloader

* Rename OrchidPOS to OrchidPOSDataset

* Fix parser bug in orchid_pos.py

* Add .strip() in source orchid_pos

* Cahange string for special char orchid_pos

* fix: remove useless loop

* refactor: remove unused loop

* Closes SEACrowd#159 | Create dataset loader for CC-Aligned (SEACrowd#298)

* Add cc_aligned_doc dataloader

* Rename class and format cc_aligned_doc

* Add SEACROWD_SCHEMA_NAME for cc_aligned_doc

* Closes SEACrowd#268 | Implement dataloader for Thai Toxicity Tweet Corpus (SEACrowd#301)

* Implement dataloader for Thai toxicity tweets

* Fix description grammar

* List labels as constant

* Change task to ABUSIVE_LANGUAGE_PREDICTION, improve _generate_examples

* Rename dataloader folder and file

* Remove comment, change license value

* Define SEACROWD_SCHEMA using _SUPPORTED_TASKS

* Fix bug where example ID and index do not match

* Closes SEACrowd#363 | Create dataset loader for identifikasi-bahasa (SEACrowd#379)

* [add]  initial commit

* [add] dataset loader for identifikasi_bahasa

* [refactor]  removed __main__

* Update seacrowd/sea_datasets/identifikasi_bahasa/identifikasi_bahasa.py

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#182. | Implement dataloader for `roots_vi_ted` (SEACrowd#329)

* Implement dataloader for roots_vi_ted

* update

* update

* update

* remove local data

* reformat

* Closes SEACrowd#180 | Implement `IndoMMLU` dataloader (SEACrowd#324)

* Implement dataloader for indommlu

* update

* update

* Closes SEACrowd#345 | Implemented dataloader for vlsp2016_ner (SEACrowd#372)

* Implemented dataloader for vlsp2016_ner

* Format vlsp2016_ner.py

* Closes SEACrowd#276 | Implement PRDECT-ID dataloader (SEACrowd#322)

* Implement PRDECT-ID dataloader

Closes SEACrowd#276

* Add better type formatting

* Follow id_google_play_review for structure

* Include source configs for both emotion and sentiment

* Closes SEACrowd#9 | Add bhinneka_korpus dataset loader (SEACrowd#175)

* Add bhinnek_korpus dataset loader

* Updating the suggested changes

* Resolved review suggestions

* Create indonesian_news_dataset dataloader

* Closes SEACrowd#183 | Implement `wongnai_reviews` dataloader (SEACrowd#325)

* Implement dataloader for wongnai_reviews

* add __init__.py

* update

* update

* Implement change requested by holylovenia

* Closes SEACrowd#348 | Implemented dataloader for indoner_tourism (SEACrowd#373)

* Implemented dataloader for indoner_tourism

* Perform changes requested by ljvmiranda921

* Closes SEACrowd#361 | Create dataset loader for Thai-Lao Parallel Corpus (SEACrowd#384)

* [add] dataloader for tha_lao_embassy_parcor, no citation yet

* [add] citation; removed debug code

* [style] make format restyle

* [refactor]  removed TODO code

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Update constants.py

* Closes SEACrowd#305 | Implement dataloader for UIT_ViOCD (SEACrowd#335)

* Implement dataloader for UIT_ViOCD

* update according to the review

* Update _SUPPORTED_TASKS

* Closes SEACrowd#362 | Create dataset loader for GKLMIP Khmer News Dataset (SEACrowd#383)

* [add] dataloader for gklmip_newsclass

* [refactor]  changed licence value

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#358 | Create dataset loader for GKLMIP Product Sentiment (SEACrowd#417)

* [add] dataset loader for gklmip_sentiment

* [refactor]  removed comment; removed "split" parameter in gen_kwargs

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Update constants.py

* Close SEACrowd#306 | Create dataset loader for ViHealthQA (SEACrowd#319)

* Create dataset loader for ViHealthQA SEACrowd#306

* add class docstring

* Update vihealthqa.py

* Closes SEACrowd#10 | Create beaye_lexicon dataset loader (SEACrowd#320)

* Create beaye_lexicon dataset loader

* add implementation of eng-day word pairs

* Closes SEACrowd#179 | Implement `indo_story_cloze` dataloader (SEACrowd#323)

* Implement indo_story_cloze dataloader.

* correct license

* update according to the feedback

* update

* Closes SEACrowd#353| Create dataset loader for FilWordNet (SEACrowd#377)

* Add dataloader for FilWordNet

* Update seacrowd/sea_datasets/filwordnet/filwordnet.py

Co-authored-by: Lj Miranda <[email protected]>

* Update seacrowd/sea_datasets/filwordnet/filwordnet.py

Co-authored-by: Lj Miranda <[email protected]>

* Fix formatting

---------

Co-authored-by: Lj Miranda <[email protected]>

* feat: id_sentiment_analysis dataloader

* refactor: remove print

* refactor: default config name

* feat: subsets

* Closes SEACrowd#350 | Implement dataloader for Indonesian PRONER (SEACrowd#399)

* Implement dataloader for Indonesian PRONER

* Add manual and automatic subsets

---------

Co-authored-by: Railey Montalan <[email protected]>

* Implement dataloader for IMAD Malay Corpus (SEACrowd#402)

Co-authored-by: ssfei81 <[email protected]>

* Update id_wsd.py

* add thaigov (SEACrowd#412)

* add thaigov

* Update thaigov.py

* add inline comment for file structure

* Update and rename snli.py to snli_indo.py

* Rename SNLI to SNLI Indo

* Update snli_indo.py

* [add]  dataloader for sarawak_malay

* Closes SEACrowd#264 | Create dataset loader for mySentence SEACrowd#264 (SEACrowd#291)

* add mysentences dataloader

* align the config name to subset_id

* update mysentence config

* Update mysentence.py

* remove comment line

* Update mysentence.py

* Update mysentence config

* Update mysentence.py

* Update seacrowd/sea_datasets/mysentence/mysentence.py

Fix the subset_id case-checking for data download

* added __init__.py to ucla_phonetic

* updated dataloader according to suggestions

* Update memolon.py

* fix: subset_id format

* refactor: prepend dataset name to subset id

* fix: first language is set to latin english

* Add thai depression

* Create __init__.py

* Create __init__.py

* Create __init__.py

* Implement dataloader for SeaEval

* Update template.py instruction for dataloader class name (SEACrowd#334)

* Add documentation for dataloader class name

* Update template.py

* Update REVIEWING.md

This modified the content of adding "Dataset" suffix into optional, and giving a reference to templates/templates.py for example

* Update REVIEWING.md

fix file reference name

---------

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* Closes SEACrowd#165 | Add BLOOM-LM dataset (SEACrowd#294)

* Init add BLOOM-LM dataset

* Adjusting changes based on review

* fix typing on _generate_examples

* update import based on formatter suggestion

* Closes SEACrowd#349 | Create dataset loader for QASiNa (SEACrowd#418)

* [add] dataloader for qasina

* [refactor] renamed dataset class

* [add]  added contex_title to qa_seacrowd schema

* [refactor, add]  changed QA type, added "answer_start", "contx_length" information to meta

* [refactor]  bug fixes

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Closes SEACrowd#263 | Implement dataloader for VIVOS (SEACrowd#398)

* Implement dataloader for

* Implement dataloader for VIVOS

* Add missing __init__.py file

* Change _LANGUAGES into list

---------

Co-authored-by: Railey Montalan <[email protected]>

* Closes SEACrowd#190 | Create dataset loader for TydiQA  (SEACrowd#251)

* add tydiqa dataloader

* Update tydiqa.py

* add example helper and update config

* Update tydiqa.py

* Update Configs and _info

* Update features in _info()

* Update tydiqa.py

This update covers the requested changes from @jen-santoso and @jamesjaya, please advice if needs any further changes. Thanks.

* add tydiqa_id subset

* Update tydiqa.py

Reformat long lines in the code and add IndoNLG in citation

* remove tydiqa_id

* Closes SEACrowd#338 | Created DataLoader for IndonesianNMT (SEACrowd#367)

* Implementing Dataloader for indonesiannmt issue SEACrowd#338

* Update template.py

* Implementing Dataloader for indonesiannmt issue SEACrowd#338

* removed if __main__ section

* IndonesianNMT reconstructing dataloader

* Implement ssp task, implement suggestions

* format indonesiannmt

---------

Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Jonibek Mansurov <[email protected]>

* Closes SEACrowd#366 | Implement dataloader for Kheng.info Speech (SEACrowd#401)

* Implement dataloader for Kheng.info Speech

* Add init file

* Closes SEACrowd#226 | Vi Pubmed dataloader (SEACrowd#391)

* feat: vi_pubmed dataloader

* fix: homepage

* fix: non unique id error

* refactor: class name

* refactor: remove unused loop

* Create __init__.py

* [refactor]  removed comment

* Update flores200.py

* refactor: remove main function

* Closes SEACrowd#69 | Implement XStoryCloze Dataloader (SEACrowd#137)

* implement xstorycloze dataloader

* add __init__.py

* update

* remove ssp schema; add _LANGUAGES

* remove unnecessary import; pascal case for class name

* Closes SEACrowd#147 | implemented dataloader for gatitos dataset (SEACrowd#415)

* implemented dataloader for gatitos dataset

* added __init__.py to gatitos folder

* Updated gatitos

---------

Co-authored-by: ssfei81 <[email protected]>

* Update CODEOWNERS

* Patch Workflow on Stale Checking (SEACrowd#482)

* Update stale.yml

* Create add-new-comment-on-stale

* Update and rename stale.yml to stale-labeler.yml

* Update add-new-comment-on-stale

* Rename add-new-comment-on-stale to add-new-comment-on-stale.yml

* Sabilmakbar Patch Workflow (SEACrowd#484)

Bugfix on SEACrowd#482.

* Update add-new-comment-on-stale.yml

add workflow trigger criteria on PR message aswell

* Update add-new-comment-on-stale.yml

* Update add-new-comment-on-stale.yml

fix yaml indent

* Update add-new-comment-on-stale.yml

* Closes SEACrowd#340 | Implement Dataloader for emotes_3k (SEACrowd#397)

* Implement Dataloader for emotes_3k

* Implement Dataloader for emotes_3k

* Tasks updated from sentiment analysis to morality classification

* Implement Change Request

* formatting emotes_3k

---------

Co-authored-by: Jonibek Mansurov <[email protected]>

* refactor: remove main function

Co-authored-by: Lj Miranda <[email protected]>

* Update constants.py

* Closes SEACrowd#311 | Add dataloader for indonesian_madurese_bible_translation (SEACrowd#337)

* add dataloader for indonesian_madurese_bible_translation

* update the license of indonesian_madurese_bible_translation

* Update indonesian_madurese_bible_translation.py

* modify based on comments from holylovenia

* [indonesian_madurese_bible_translation]

* update based on the reviewer's comments

* Remove `CONTRIBUTING.md`, update PR Message Template, and add bash to initialize dataset (SEACrowd#468)

* add bash to initialize dataset

* delete CONTRIBUTING.md since it's duplicated with DATALOADER.md

* update the docs slightly on suggesting new dataloader contributors to use template

* fix few wordings

* Add info on required vars '_LOCAL'

* Add checklist on __init__.py

* fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val

* fix wordings on first section of PR msg

* add newline separator for better readability

* add info on some to-dos

* refactor: citation

* Closes SEACrowd#83 | Implement Dataloader for GlobalWoZ (SEACrowd#261)

* refactor by pre-commit

* reformatted by pre-commit

* refactor code for globalwoz

* Create dataset loader for IndoQA SEACrowd#430 (SEACrowd#431)

* Add CODE_SWITCHING_IDENTIFICATION task (SEACrowd#488)

* Closes SEACrowd#396 | Implement dataloader for CrossSum (SEACrowd#419)

* Implement dataloader

* Change to 3-letter ISO codes

* Change task to CROSS_LINGUAL_SUMMARIZATION

* Closes SEACrowd#92 | Create Jail break data loader (SEACrowd#390)

* feat: jailbreak dataloader

* fix: minor errors

* refactor: styling

* refactor: remove main entry

* refactor: class name

* refactor: remove unused loop

* fix: separate text column into different subsets

* Create __init__.py

* Implement CommonVoice 12.0 dataloader (SEACrowd#452)

* Closes SEACrowd#202 | Implement dataloader for WIT (SEACrowd#374)

* Implement dataloader for WIT

* Remove unnecessary commits

* Add to description

---------

Co-authored-by: Railey Montalan <[email protected]>

* Split into language subsets

* Split into language subsets

* Update seacrowd/sea_datasets/thai_depression/thai_depression.py

Co-authored-by: Lj Miranda <[email protected]>

* fix: change lincense to unknown

* fix: minor errors

* Closes SEACrowd#80 | Implement MSVD-Indonesian Dataloader (SEACrowd#135)

* implement id_msvd dataloader

* change logic for seacrowd schema (text first, then video); quality of life change to video schema

* revert seacrowd video key from "text" to "texts"

* change source logic to match original data implementation

* run make check_file

* Closes SEACrowd#34  |  Create dataset loader for MKQA (SEACrowd#177)

* Create dataset loader for MKQA SEACrowd#34

* Refactor class variables _LANGUAGES to global for MKQA SEACrowd#34

* Filter supported languages (SEA only) of seacrowd_qa schema for MKQA SEACrowd#34

* Filter supported languages (SEA only) of source schema for MKQA SEACrowd#34

* Filter supported languages (SEA only) for MKQA SEACrowd#34 (a leftover)

* Change language code from macrolanguage, msa to zlm, for MKQA SEACrowd#34

* Change to a more appropriate language code of  for Malaysian variant used in MKQA SEACrowd#34

* Changed the value of field 'type' of QA schema to be more general, and moved the more specific value to 'meta' field for MKQA SEACrowd#34

* Replace None value to empty array in 'answer_aliases' sub-field for consistency in MKQA SEACrowd#34

* Closes SEACrowd#193 | Create dataset loader for MALINDO Morph (SEACrowd#332)

* Implement dataloader for MALINDO morph

* Specify file encoding and remove newlines when loading data

* Add blank __init__.py

* Fix typos in docstring

* Fix typos

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/malindo_morph/malindo_morph.py

---------

Co-authored-by: Jennifer Santoso <[email protected]>

* fix: subsets

* Closes SEACrowd#314 | Add dataloader for Indonesia chinese mt robust eval (SEACrowd#388)

* add dataloader for indonesian_madurese_bible_translation

* update dataloader for indonesia_chinese_mtrobusteval

* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py

* Update indonesia_chinese_mtrobusteval.py

* update code based on the reviewer comments

* add __init__.py

* Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py

* Update seacrowd/sea_datasets/indonesia_chinese_mtrobusteval/indonesia_chinese_mtrobusteval.py

---------

Co-authored-by: Jennifer Santoso <[email protected]>

* refactor: feature naming

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* fix: homepage url

* Closes SEACrowd#211 | Implement dataloader for SEAHORSE (SEACrowd#407)

* implement seahorse dataloader

* update

* update

* incorporate the latest comments, though tensorflow is still needed for tfds

* update

* update

* fix: lowercase feature name

* refactor: subset name

* fix: limit the sentence paths to the relevant languages

* refactor: remove possible error

* Change default split to TEST

* Closes SEACrowd#447 |  Create dataset loader for Aya Dataset (SEACrowd#457)

* Implementing data loader for Aya Dataset

* Fixing license serialization issue

* Update based on formatter for aya_dataset.py

* update xlsum to extend more langs

* update based on formatter

* Closes SEACrowd#360 | Implement dataloader for khpos (SEACrowd#376)

* Implement dataloader for khpos

* Remove unneeded comment

* Implemented Test and Validation loading

* Streamlining code

* Closes SEACrowd#116 | Add pho_ner_covid Dataloader (SEACrowd#461)

* feat: pho_ner_covid dataloader

* refactor: classname

Co-authored-by: Lj Miranda <[email protected]>

* fix: remove main function

Co-authored-by: Lj Miranda <[email protected]>

* refactor: remove inplace uses for dataframe

* refactor: remove duplicate statement

---------

Co-authored-by: Lj Miranda <[email protected]>

* refactor: remove trailing spaces

Co-authored-by: Salsabil Maulana Akbar <[email protected]>

* refactor: url format

* edit 'texts' to 'text' key (SEACrowd#499)

* Closes SEACrowd#217 | Implement dataloader for `wili_2018` (SEACrowd#381)

* Implement dataloader for wili_2018

* update

* Closes SEACrowd#104 | Add lazada_review_filipino (SEACrowd#409)

* Add lazada_review_filipino Closes SEACrowd#104

* Update lazada_review_filipino.py

Update config name

* Update lazada_review_filipino.py

fix typo

* Update lazada_review_filipino.py

bug fix - ValueError: Class label 5 greater than configured num_classes 5
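
For context, a minimal sketch of the off-by-one behind this error, assuming a 1-to-5 star rating in the raw data (this is not the PR's exact code):

```python
import datasets

# ClassLabel(num_classes=5) only accepts labels 0..4, so a raw 5-star
# rating raises "Class label 5 greater than configured num_classes 5".
# Shifting the rating to be 0-indexed avoids it.
features = datasets.Features({"label": datasets.ClassLabel(num_classes=5)})
raw_rating = 5                        # 1..5 star rating from the source file
encoded = features.encode_example({"label": raw_rating - 1})  # stored as 0..4
```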

* Update seacrowd/sea_datasets/lazada_review_filipino/lazada_review_filipino.py

---------

Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>

* Adjust bash script test_example.sh and test_example_source_only.sh (SEACrowd#171)

* update: adjust test_example.sh and test_example_source_only.sh

* fix: minor error message when dataset is empty

* updated kde4 language codes to ISO 639-3

* fix: citation

* refactor: use base config class

* create dataset loader for myanmar-rakhine parallel (SEACrowd#471)

* add pyreadr==0.5.0 (SEACrowd#504)

usage: reads/writes R RData and Rds files into/from pandas data frames
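
For reference, a minimal sketch of typical pyreadr usage (the file name here is hypothetical):

```python
import pyreadr

# read_r returns an OrderedDict mapping R object names to pandas DataFrames
# (for .Rds files, the single unnamed object is stored under the key None).
result = pyreadr.read_r("data.RData")
df = next(iter(result.values()))  # first R object as a pandas DataFrame
print(df.head())
```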

* Closes SEACrowd#97 | Inter-Agency Task Force for the Management of Emerging Infectious Diseases (IATF) COVID-19 Resolutions (SEACrowd#460)

* Closes SEACrowd#274 | Create OIL data loader (SEACrowd#389)

* initial commit

* refactor: move module

* feat: dataset implementation

* feat: oil dataloader

* refactor: move dataloader file

* refactor: move dataloader file

* fix: non unique id error

* refactor: file formatting

* refactor: remove comments

* fix: raise exception on invalid config name

* refactor: audio cache file path

* fix: remove useless loop

* refactor: formatting

* Create __init__.py

* fix: citation

* fix: remove seacrowd schema

* Closes SEACrowd#49 | Updated existing TICO_19 dataloader to support more sea languages (SEACrowd#414)

* Updated existing TICO_19 dataloader to support more sea languages

* added sea languages to _LANGUAGES

---------

Co-authored-by: ssfei81 <[email protected]>

* Closes SEACrowd#443 | Add dataloader for ASR-STIDUSC (SEACrowd#493)

* Add dataloader for ASR-STIDUSC

* update task, dataset name, pythonic coding

* add relation extraction task (SEACrowd#502)

* fix: subset and config name

* Update bibtex id

* Closes SEACrowd#356 | Implement dataloader for CodeSwitch-Reddit (SEACrowd#451)

* Add CODE_SWITCHING_IDENTIFICATION task

* Implement dataloader

* Update codeswitch_reddit.py

fix column naming in source (using lowercase instead of capitalized)

* Closes SEACrowd#222 | Create dataset loader for CreoleRC (SEACrowd#469)

* Create dataset loader for CreoleRC

* remove changes to constants.py

* remove document_id, add normalized, add sanity check on offset value
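
A minimal sketch of what such an offset sanity check could look like; the argument names are illustrative, not CreoleRC's actual schema:

```python
# Hypothetical check: stored span offsets must lie inside the passage and
# slice out exactly the annotated surface form.
def check_offsets(text: str, start: int, end: int, surface: str) -> None:
    assert 0 <= start < end <= len(text), f"bad offsets ({start}, {end})"
    assert text[start:end] == surface, "offsets do not match surface form"

check_offsets("Creole languages of the region", 0, 6, "Creole")
```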

* Update REVIEWING.md

Clarify wording in Dataloader Reviewing Doc

* Closes SEACrowd#341 | Create dataset loader for myParaphrase (SEACrowd#436)

* [add]  dataloader for my_paraphrase

* [refactor]  removed redundant breakpoint; put right default schema function

* [refactor]  changed schema for dataset

* [refactor]  split data into 3 categories (paraphrase, non_paraphrase, all)

* [refactor]  default config name is changed

* [refactor]  source configs for _paraphrase, _non_paraphrase, _all; altered schema naming

* [refactor]  cleaner conditioning, defined else clause

* Closes SEACrowd#269 | Create dataset loader for ViVQA SEACrowd#269 (SEACrowd#318)

* add vivqa dataloader

* Update vivqa.py

* update vivqa dataloader config

* Update vivqa.py

* add vivqa dataloader

* Update vivqa.py

* update vivqa dataloader config

* Update vivqa.py

* Update vivqa.py

* update

* Update vivqa.py

* Update vivqa.py

* Delete .idea/vcs.xml

* Delete .idea/seacrowd-datahub.iml

* Delete .idea/inspectionProfiles/profiles_settings.xml

* Delete .idea/inspectionProfiles/Project_Default.xml

* Update vivqa.py

* Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"

This reverts commit a96fa80, reversing
changes made to 23700ca.

* Delete .idea/vcs.xml

* Delete .idea/seacrowd-datahub.iml

* Delete .idea/inspectionProfiles/profiles_settings.xml

* Delete .idea/inspectionProfiles/Project_Default.xml

* Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa"

This reverts commit a96fa80, reversing
changes made to 23700ca.

* Revert "Revert "Merge branch 'vivqa' of github.com:gyyz/seacrowd-datahub into vivqa""

This reverts commit 5f1a3d6.

* fix trailing spaces and run Makefile

* Closes SEACrowd#445 | Create dataset loader for malaysia-tweets-with-sentiment-labels (SEACrowd#450)

* Fix typo syntax dictionary at constants.py

* Add dataloader for malaysia_tweets

* Completed requested changes

* add dataloader for ASR-Sindodusc (SEACrowd#491)

* Closes SEACrowd#475 | Add dataloader for indonglish-dataset (SEACrowd#490)

* create dataloader for indonglish

* make subset_id unique, use ClassLabel for label

* Closes SEACrowd#215 | Implement dataloader for `thai_gpteacher` (SEACrowd#382)

* Implement dataloader for thai_gpteacher

* update

* update

* Closes SEACrowd#275 | Create dataset loader for UIT-ViCoV19QA SEACrowd#275 (SEACrowd#463)

* add SeaCrowd dataloader for uit_vicov19qa

* Merge subsets to one

* remove unused imported package

* Closes SEACrowd#309 | Create dataset loader for Vietnamese Hate Speech Detection (UIT-ViHSD) (SEACrowd#501)

* create dataloader for uit_vihsd

* Update uit_vihsd.py

* Add some info for the labels

* Update example for Seacrowd schema

* Closes SEACrowd#441 | Add dataloader for ASR-SMALDUSC (SEACrowd#492)

* Add dataloader for ASR-SMALDUSC

* add prompt field

* Closes SEACrowd#307 | Implement dataloader for ViSoBERT  (SEACrowd#466)

* Update constants.py

* Implement dataloader for ViSoBERT

* Fix conflicts with constants.py

* Combine source and seacrowd_ssp schemas

---------

Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>

* add dataloader for wikitext_tl_39 (SEACrowd#486)

* Closes SEACrowd#393 | Create dataset loader for WEATHub (SEACrowd#496)

* [Feature] Add Weathub DataLoader

* [Fix] Add filter for SEA languages only + add constants + run formatter

* [Chore] Fix data loader naming

* [Fix] Implement requested changes from review

* Closes SEACrowd#188 | Implement dataloader for Sea-bench (SEACrowd#375)

* Implement dataloader for WIT

* Implement dataloader for sea_bench

* Remove WIT

* Remove logger and unnecessary variables

* Add instruction tuning and remove QA and summarization tasks

* Add __init__.py file

* Remove machine translation task

* Fix nitpicks

---------

Co-authored-by: Railey Montalan <[email protected]>

* Closes SEACrowd#115 | Create dataset loader for PhoMT dataset (SEACrowd#489)

* add dataloader for PhoMT dataset

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* Update seacrowd/sea_datasets/phomt/phomt.py

Co-authored-by: Elyanah Aco <[email protected]>

* update text1/2 name for PhoMT dataset

* Update phomt.py to replace en&vi with eng&vie

---------

Co-authored-by: Elyanah Aco <[email protected]>

* Closes SEACrowd#310 | Create dataset loader for ViSpamReviews SEACrowd#310 (SEACrowd#454)

* add vispamreviews dataloader

* update vispamreviews

* update schema

* Closes SEACrowd#530 | Add/Update Dataloader Tatabahasa (SEACrowd#540)

* feat: dataloader QA commonsense-reasoning

* nitpick

* Closes SEACrowd#267 | Add dataloader for struct_amb_ind (SEACrowd#506)

* Implement dataloader for struct_amb_ind

* Update seacrowd/sea_datasets/struct_amb_ind/struct_amb_ind.py

Co-authored-by: Jonibek Mansurov <[email protected]>

---------

Co-authored-by: Jonibek Mansurov <[email protected]>

* Closes SEACrowd#347 | Create dataset loader for IndoWiki (SEACrowd#485)

* create dataset loader for IndoWiki

* remove seacrowd schema

* Closes SEACrowd#354 | Implement dataloader for ETOS (SEACrowd#416)

* Implement dataloader for ETOS

* Implement dataloader for ETOS

* Rename dataset class name to ETOSDataset

* Remove  schema due to insufficient annotations

* Change ETOS into a POS tagging dataset

* Add missing __init__.py file

* Fix nitpicks

* Add DEFAULT_CONFIG_NAME

---------

Co-authored-by: Railey Montalan <[email protected]>

* update common_parser for UD JV_CSUI (SEACrowd#558)

* Create dataset loader for UD Javanese-CSUI SEACrowd#427 (SEACrowd#432)

* Closes SEACrowd#446 | Add/Update Dataloader voxlingua (SEACrowd#543)

* add init voxlingua

* Update seacrowd/sea_datasets/voxlingua/voxlingua.py

Co-authored-by: Lj Miranda <[email protected]>

---------

Co-authored-by: Lj Miranda <[email protected]>

* Closes SEACrowd#428 | Create dataset loader for Indonesia BioNER (SEACrowd#434)

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update cc3m_35l.py

Changed "_LANGS" to "_LANGUAGES"

* init commit

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Update seacrowd/sea_datasets/cc3m_35l/cc3m_35l.py

Co-authored-by: Jennifer Santoso <[email protected]>

* Closes SEACrowd#344 | Create dataset loader for VLSP2016-SA (SEACrowd#500)

* [add]  dataloader for vlsp2016_sa [local]

* [refactor]  changed schema name

---------

Co-authored-by: Amir Djanibekov <[email protected]>

* Fix the private datasheet link in POINTS.md (SEACrowd#568)

* Closes SEACrowd#192 | Create dataset loader for MALINDO_parallel (SEACrowd#385)

* add malindo_parallel.py

* cleanup

* Class name fix

Co-authored-by: Lj Miranda <[email protected]>

* Remove sample licenses

Co-authored-by: Lj Miranda <[email protected]>

* fix dataset formatting error, use original dataset id

---------

Co-authored-by: Lj Miranda <[email protected]>

* Closes SEACrowd#114 | Implement dataloader for VnDT (SEACrowd#467)

* Implement dataloader for VnDT

* Add utility to impute missing sent_id and text fields from CoNLL files

* Fix imputed outputs
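
A minimal sketch of how such an imputation utility might work, assuming standard CoNLL-U conventions (tab-separated columns with FORM in column 2, blank lines between sentences); this is not the PR's actual utility:

```python
def impute_conllu_metadata(raw: str) -> str:
    """Insert missing '# sent_id' and '# text' comments into CoNLL-U blocks."""
    out = []
    for i, block in enumerate(raw.strip().split("\n\n"), start=1):
        lines = block.splitlines()
        tokens = [l.split("\t")[1] for l in lines if l and not l.startswith("#")]
        if not any(l.startswith("# sent_id") for l in lines):
            lines.insert(0, f"# sent_id = {i}")
        if not any(l.startswith("# text") for l in lines):
            lines.insert(1, f"# text = {' '.join(tokens)}")
        out.append("\n".join(lines))
    return "\n\n".join(out) + "\n"
```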

---------

Co-authored-by: Railey Montalan <[email protected]>

* add ocr task (SEACrowd#555)

* Update subset composition of TydiQA | Closes SEACrowd#465 (SEACrowd#503)

* update subset composition

* Update Subset Composition

* Update Subset Composition

* update subset name

indonesian --> ind
thai --> tha

* Update nusaparagraph_emot.py

* Update nusaparagraph_emot.py

* Update configs.py

* Closes SEACrowd#346 | Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings) (SEACrowd#406)

* Implement dataloader for MUSE (Multilingual Unsupervised and Supervised Embeddings)

* Create __init__.py for MUSE SEACrowd#346

* Remove unused comment lines for MUSE SEACrowd#346

* changed all 2-letter language codes to 3-letter ones
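
A minimal sketch of the kind of mapping involved (an illustrative subset, not the PR's full table):

```python
# ISO 639-1 (2-letter) to ISO 639-3 (3-letter) codes for a few SEA languages.
ISO_639_1_TO_3 = {"id": "ind", "th": "tha", "vi": "vie", "tl": "tgl", "my": "mya"}

def to_iso3(code: str) -> str:
    return ISO_639_1_TO_3.get(code, code)

assert to_iso3("id") == "ind"
```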

---------

Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>

* Closes SEACrowd#12 | Add/Update Dataloader BalitaNLP (SEACrowd#550)

* Implement dataloader for balita_nlp

* Remove articles with missing images from imtext schema

* Add details to metadata

* Adding New Citation for Bhinneka korpus (SEACrowd#599)

* Add bhinneka_korpus dataset loader

* Updating the suggested changes

* Resolved review suggestions

* adding new citation

---------

Co-authored-by: Holy Lovenia <[email protected]>

* Closes SEACrowd#270 | Create dataset loader for OpenViVQA SEACrowd#270 (SEACrowd#464)

* add sample

* init submit for openvivqa dataloader

* Update openvivqa.py

* Update openvivqa.py

* update dict format

* Closes SEACrowd#516 | Add/Update Dataloader id_newspaper_2018 (SEACrowd#551)

* Implement dataloader for id_newspaper_2018

* Specify JSON encoding
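
A minimal sketch of the encoding fix (the path is hypothetical): passing an explicit encoding keeps platform defaults from garbling non-ASCII text.

```python
import json

# Without encoding="utf-8", open() falls back to the platform default
# (e.g. cp1252 on Windows), which can corrupt Indonesian characters.
with open("data/id_newspaper_2018.json", encoding="utf-8") as f:
    articles = json.load(f)
```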

* Closes SEACrowd#429 | Implement dataloader for filipino_hatespeech_election (SEACrowd#487)

* Add dataloader for filipino_hatespeech_election

* update task

* update

* Closes SEACrowd#52 | Add cosem dataloader (SEACrowd#473)

* feat: cosem dataloader

* fix: citation

* refactor: dataloader class name

* fix: file parsing logic

* fix: id format

* fix: tab separator bug in text

* fix: check for unique id
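
A minimal sketch of such a uniqueness guard inside a `_generate_examples`-style generator (assumed structure, not the dataloader's exact code):

```python
def generate_examples(rows):
    seen_ids = set()
    for row in rows:
        ex_id = row["id"]
        if ex_id in seen_ids:
            raise ValueError(f"duplicate example id: {ex_id}")
        seen_ids.add(ex_id)
        yield ex_id, row
```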

* Closes SEACrowd#424 | Add Dataloader Bactrian-X

* Import `schemas` beforehand on `templates/template.py` (SEACrowd#644)

* add import statement for schemas

* add import statement for schemas

* Closes SEACrowd#313 | Add dataloader for Saltik (SEACrowd#387)

* add dataloader for indonesian_madurese_bible_translation

* add dataloader for saltik

* Delete seacrowd/sea_datasets/indonesian_madurese_bible_translation/indonesian_madurese_bible_translation.py

* update based on the reviewer comment

* update based on the reviewer comment

* Remove the modified constants.py from PR

---------

Co-authored-by: Holy Lovenia <[email protected]>

* Add `.upper` method for `--schema` parameter (SEACrowd#648)

* add upper method for --schema

* revert code-style
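
A minimal sketch of applying `.upper` to the `--schema` parameter via argparse (assumed wiring), so the schema name can be passed in any case:

```python
import argparse

parser = argparse.ArgumentParser()
# type=str.upper normalizes the value: "--schema seacrowd_tod" and
# "--schema SEACROWD_TOD" are equivalent downstream.
parser.add_argument("--schema", type=str.upper, default=None)

args = parser.parse_args(["--schema", "seacrowd_tod"])
print(args.schema)  # SEACROWD_TOD
```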

* Closes SEACrowd#438 | Add dataloader for ASR-INDOCSC (SEACrowd#509)

* add dataloader for asr_indocsc

* Update asr_indocsc.py for data downloading instructions

---------

Co-authored-by: Salsabil Maulana Akbar <[email protected]>
Co-authored-by: Elyanah Aco <[email protected]>
Co-authored-by: Yuze GAO <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: XU, Yan (Yana) <[email protected]>
Co-authored-by: Haochen Li <[email protected]>
Co-authored-by: Jennifer Santoso <[email protected]>
Co-authored-by: Holy Lovenia <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Samuel Cahyawijaya <[email protected]>
Co-authored-by: Muhammad Dehan Al Kautsar <[email protected]>
Co-authored-by: Lj Miranda <[email protected]>
Co-authored-by: Lucky Susanto <[email protected]>
Co-authored-by: Maria Khelli <[email protected]>
Co-authored-by: Ishan Jindal <[email protected]>
Co-authored-by: ssfei81 <[email protected]>
Co-authored-by: IvanHalimP <[email protected]>
Co-authored-by: Enliven26 <[email protected]>
Co-authored-by: Dan John Velasco <[email protected]>
Co-authored-by: Chenxi <[email protected]>
Co-authored-by: Bhavish Pahwa <[email protected]>
Co-authored-by: FawwazMayda <[email protected]>
Co-authored-by: fawwaz.mayda <[email protected]>
Co-authored-by: Ilham F Putra <[email protected]>
Co-authored-by: rafif-kewmann <[email protected]>
Co-authored-by: mrafifrbbn <[email protected]>
Co-authored-by: Yong Zheng-Xin <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: Amir Djanibekov <[email protected]>
Co-authored-by: joan <[email protected]>
Co-authored-by: joanitolopo <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: Railey Montalan <[email protected]>
Co-authored-by: ssun32 <[email protected]>
Co-authored-by: Tyson <[email protected]>
Co-authored-by: Ilham Firdausi Putra <[email protected]>
Co-authored-by: Johanes Lee <[email protected]>
Co-authored-by: Akhdan Fadhilah <[email protected]>
Co-authored-by: Frederikus Hudi <[email protected]>
Co-authored-by: Börje Karlsson <[email protected]>
Co-authored-by: Muhammad Satrio Wicaksono <[email protected]>
Co-authored-by: Wenyu Zhang <[email protected]>
Co-authored-by: R. Damanhuri <[email protected]>
Co-authored-by: Patrick Amadeus Irawan <[email protected]>
Co-authored-by: Reza Qorib <[email protected]>
Co-authored-by: Bryan Wilie <[email protected]>
Co-authored-by: Muhammad Ravi Shulthan Habibi <[email protected]>