-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #5 | Add tatoeba dataset loader #22
Conversation
About this, I think it will be better to "override" this with the actual language code, |
@ryanignatius I agree with you. Could you please override it with the actual language codes? @ljvmiranda921 |
Should be done: aba9c65 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last thing, could you also please provide tatoeba_source
and tatoeba_seacrowd_t2t
schemas where they load all data? @ljvmiranda921
_SEACROWD_VERSION = "1.0.0" | ||
|
||
|
||
class TatoebaDatset(datasets.GeneratorBasedBuilder): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please change the class name to TatoebaDataset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching! Fixed d5bef23
|
||
SEACROWD_SCHEMA_NAME = "t2t" | ||
|
||
dataset_names = sorted([f"tatoeba.{lang}" for lang in _LANGUAGES]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using .
as the delimiter, could you please change it to _
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! e95f83e
Converting this to draft for now. Will still work on the schema that loads all datasets at once. |
9317dea
to
e95f83e
Compare
a434bfa
to
a44be2b
Compare
Hi @holylovenia ! I implemented it here: a44be2b The default config now is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the changes, @ljvmiranda921! Looks perfect to me. 👍 Let's wait for @ryanignatius's review then we can merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks good to me too,
thank you for the help @ljvmiranda921
Closes #5
Some notable things to point out from this dataset:
tatoeba.ind_seacrowd_t2t
,tatoeba.ind_source
and so on. When testing, we should pass a value to the--subset_id
parameter like so:source_lang
field in the original HF link seems incorrect? I'm not familiar with this dataset so for my implementation I just copied over what's in the source. Let me know if I should "override" this with the actual language code.Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py
(please use only lowercase and underscore for dataset naming)._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
.