-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #569 | Add/Update Dataloader MSL4Emergency #677
Closes #569 | Add/Update Dataloader MSL4Emergency #677
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow, I still get error when running .load_dataset
or testing via tests.test_seacrowd
, the error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 11: character maps to <undefined>
.
Do you know why this happens? Or is everything fine (OK/passed) on your side? @sabilmakbar
Hmm, on my end it works fine. You can check this log file, which is a sh output from this cmd: I'm thinking that your error might come from these lines (in L114-115): with open(text_path, "r") as f:
text_data = [data.split("\t") for data in f.readlines()] You can try to print the Update: Oh perhaps it's related to let me know any of these solved the bug on this code @muhammadravi251001 |
It's not about the
I get an OK response and get the dataset perfectly with The problem is: that I run this dataloader implementation on my Windows notebook, and Windows does not always run encoding on To avoid this future-encoding-error, I recommend adding an |
thanks! done making changes, pls have a look @muhammadravi251001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test passed on both via tests.seacrowd
and .load_dataset()
. LGTM. Approved. Great work!
Thanks for the change & your contribution @sabilmakbar!
Hi @ljvmiranda921, sorry for the sudden notice. I assigned you here in place of @luckysusanto due to the lack of response. I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
Sure! I'll just update the small nits myself as well just to speed things up! 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Merging this in a few
Please name your PR title and the first line of PR message after the issue it will close. You can use the following examples:
Closes #569
Checkbox
seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py
(please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its__init__.py
within{my_dataset}
folder._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_LOCAL
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
orpython -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}
.