Closes #424 | Add Dataloader Bactrian-X #552
LGTM!
@misc{li2023bactrianx,
    title={Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation},
    author={Haonan Li and Fajri Koto and Minghao Wu and Alham Fikri Aji and Timothy Baldwin},
    year={2023},
    eprint={2305.15011},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
Reviewed and tested. LGTM, thanks @akhdanfadh :)
Merging in a few
LGTM
Closes #424
I implemented one config per language/subset, so config names look like `bactrian_x_id_source`, `bactrian_x_km_seacrowd_t2t`, etc. When testing, pass `bactrian_x_<subset>` to the `--subset_id` parameter.

As the source schema has one more variable (the input that goes with the instruction), I added it manually as `"Instruction: {instruction}\nInput: {input}"` in `text_1` of the `seacrowd_t2t` schema. I don't know if that is allowed, so let's discuss.

Note that for the Khmer subset, the loaded data will look as follows:

At first, I thought this was an encoding problem that needed to be solved. But it turns out I get the same result when loading from HF directly as follows:
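
Going back to the `text_1` question raised above, here is a minimal sketch of how the instruction and input could be folded into a single string for the `seacrowd_t2t` schema. It is an illustration under assumptions, not the dataloader's actual code: the helper names `build_text_1` and `to_t2t` are hypothetical, the source row is assumed to carry `id`, `instruction`, `input` and `output` fields, and the `text_1_name` / `text_2_name` labels are placeholders.

```python
def build_text_1(instruction: str, input_text: str) -> str:
    """Join the two source fields using the template quoted in this PR."""
    return f"Instruction: {instruction}\nInput: {input_text}"


def to_t2t(row: dict) -> dict:
    """Map one source-schema row to a text-to-text style example.

    Assumption: the row has "id", "instruction", "input" and "output" keys.
    """
    return {
        "id": row["id"],
        "text_1": build_text_1(row["instruction"], row["input"]),
        "text_2": row["output"],
        "text_1_name": "instruction_and_input",  # hypothetical label
        "text_2_name": "output",  # hypothetical label
    }


# Made-up row, for illustration only.
example = to_t2t(
    {"id": "row-1", "instruction": "Translate", "input": "hello", "output": "halo"}
)
print(example["text_1"])
```

One consequence of this design is that rows with an empty input still get an `Input:` line, which may or may not be desirable and is part of what needs discussing.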
Checkbox
- Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Confirm dataloader script works with `datasets.load_dataset` function.
- Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
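
The one-config-per-language layout from the description, combined with the `BUILDER_CONFIGS` requirement in the checklist, can be sketched roughly as below. This is a self-contained illustration rather than the merged dataloader: `ConfigSketch` is a hypothetical stand-in for `SEACrowdConfig` (its fields are assumptions), and `_LANGUAGES` lists only the two language codes that appear in the config names above.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for SEACrowdConfig; field names are assumptions."""

    name: str
    schema: str
    subset_id: str


_DATASETNAME = "bactrian_x"
_LANGUAGES = ["id", "km"]  # only the codes named in this PR's description


def build_configs(languages):
    """Create one source config and one seacrowd_t2t config per language."""
    configs = []
    for lang in languages:
        subset_id = f"{_DATASETNAME}_{lang}"
        for schema in ("source", "seacrowd_t2t"):
            configs.append(
                ConfigSketch(
                    name=f"{subset_id}_{schema}",
                    schema=schema,
                    subset_id=subset_id,
                )
            )
    return configs


BUILDER_CONFIGS = build_configs(_LANGUAGES)
for config in BUILDER_CONFIGS:
    print(config.name)
```

Under this naming, the value passed to `--subset_id` is the config name without its schema suffix, for example `bactrian_x_id`.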