-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closes #95 Adds Thai NNER Dataset #326
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bp-high
Thanks for implemeting the dataloader.
There are several issues I would like to address:
- The dataset is reported to be split to
train, dev, test
(numbers2935, 979, 980
in the paper, however it is onlytrain, test
(numbers3914, 980
in this dataloader. Could you recheck the link, and implement the dataset split as the one in the paper? This implementation fix would also apply to theSplitGenerator
. - Please revise the minor fixes to conform to our dataset format. The suggestions are provided as shown. Also, please be sure to run
make check_file
command.
Waiting for the update!
label_classes = ['O', 'S-org:religious', 'E-tv_show', 'I-soi', 'E-last', 'E-fund', 'B-country', 'B-port', 'B-airport', 'I-title', 'E-stock_exchange', 'I-restaurant', 'I-station', 'I-org:political', 'I-animal_species', 'I-index', 'E-latitude', 'B-law', 'S-middle', 'I-product:food', 'E-animal_species', 'I-goverment', 'S-city', 'E-longtitude', 'I-song', 'I-award', 'E-event:others', 'B-jargon', 'S-facility:other', 'S-money', 'I-stadium', 'E-space', 'S-animate', 'E-station', 'S-percent', 'E-country', 'I-year', 'S-district', 'E-org:religious', 'E-continent', 'I-food:ingredient', 'I-person', 'B-electronics', 'B-museum', 'B-product:drug', 'B-month', 'B-loc:others', 'B-org:religious', 'S-season', 'I-continent', 'E-norp:political', 'E-airport', 'B-money', 'S-role', 'B-cardinal', 'B-film', 'B-hotel', 'B-unit', 'E-energy', 'I-jargon', 'I-date', 'S-norp:others', 'S-country', 'E-film', 'I-energy', 'B-norp:political', 'E-index', 'I-concert', 'I-sciname', 'I-product:drug', 'I-org:other', 'S-animal_species', 'I-hospital', 'B-nationality', 'S-law', 'S-disease', 'I-latitude', 'E-award', 'B-periodic', 'I-roadname', 'S-jargon', 'E-city', 'S-person', 'S-date', 'E-goverment', 'B-stadium', 'S-religion', 'B-address', 'E-band', 'S-continent', 'I-fold', 'E-stadium', 'S-band', 'E-middle', 'B-org:edu', 'I-distance', 'I-island', 'S-hospital', 'S-loc:others', 'S-book', 'I-war', 'I-norp:political', 'E-title', 'S-war', 'S-org:edu', 'I-nationality', 'E-restaurant', 'S-province', 'E-war', 'I-quantity', 'S-orgcorp', 'E-person', 'E-day', 'E-disease', 'I-last', 'S-product:drug', 'S-space', 'E-product:drug', 'I-country', 'E-weight', 'B-fold', 'E-sciname', 'E-building', 'E-duration', 'B-longtitude', 'E-religion', 'S-soi', 'E-woa', 'B-date', 'I-city', 'I-role', 'E-org:edu', 'I-norp:others', 'B-title', 'I-namemod', 'E-speed', 'S-nicknametitle', 'E-mult', 'E-army', 'I-day', 'S-duration', 'B-time', 'I-rel', 'B-language', 'B-ocean', 'I-woa', 'E-animate', 'I-facility:other', 'E-psudoname', 'E-hotel', 'I-bridge', 'I-port', 'S-address', 'B-war', 'E-mountian', 'B-psudoname', 'I-sports_team', 'B-last', 'B-animate', 'B-latitude', 'E-cardinal', 'S-roadname', 'I-sub_district', 'E-food:ingredient', 'E-nickname', 'E-port', 'B-district', 'B-sports_team', 'E-org:political', 'I-loc:others', 'E-ocean', 'E-month', 'S-bridge', 'B-fund', 'I-money', 'I-state', 'S-unit', 'B-distance', 'B-food:ingredient', 'E-unit', 'S-namemod', 'S-airport', 'I-cardinal', 'S-sub_district', 'S-day', 'B-woa', 'S-army', 'B-tv_show', 'E-bridge', 'E-loc:others', 'S-sciname', 'I-law', 'B-space', 'B-religion', 'I-film', 'E-quantity', 'S-org:other', 'I-sports_event', 'I-event:others', 'S-firstname', 'I-band', 'I-weight', 'E-song', 'I-book', 'E-sports_event', 'S-weapon', 'I-space', 'B-speed', 'B-event:others', 'E-soi', 'I-org:edu', 'E-product:food', 'S-year', 'B-concert', 'B-province', 'B-sciname', 'I-vehicle', 'E-percent', 'I-percent', 'E-org:other', 'I-orgcorp', 'B-org:other', 'E-year', 'I-media', 'S-food:ingredient', 'I-electronics', 'E-hospital', 'I-psudoname', 'I-month', 'E-distance', 'B-duration', 'S-nationality', 'I-firstname', 'B-namemod', 'S-building', 'S-psudoname', 'B-restaurant', 'I-org:religious', 'B-vehicle', 'B-sports_event', 'B-book', 'B-temperature', 'E-norp:others', 'E-time', 'B-org:political', 'E-state', 'I-periodic', 'S-tv_show', 'S-product:food', 'S-postcode', 'S-mult', 'I-disease', 'S-month', 'B-weight', 'B-station', 'E-book', 'E-sub_district', 'E-orgcorp', 'S-fund', 'I-nickname', 'B-band', 'S-stock_exchange', 'S-natural_disaster', 'S-electronics', 'E-sports_team', 'E-museum', 'I-fund', 'B-river', 'B-rel', 'I-unit', 'I-time', 'S-mountian', 'S-port', 'I-temperature', 'I-airport', 'S-god', 'B-year', 'I-army', 'B-index', 'B-natural_disaster', 'E-roadname', 'E-namemod', 'S-game', 'I-hotel', 'B-quantity', 'B-person', 'B-mountian', 'S-river', 'E-vehicle', 'S-media', 'S-sports_team', 'I-river', 'I-mountian', 'B-hospital', 'S-state', 'B-mult', 'S-film', 'B-soi', 'E-role', 'S-org:political', 'S-station', 'E-jargon', 'S-last', 'S-ocean', 'S-quantity', 'S-award', 'E-law', 'I-middle', 'E-firstname', 'E-facility:other', 'I-address', 'I-religion', 'S-time', 'E-periodic', 'S-vehicle', 'B-nickname', 'S-fold', 'B-role', 'E-rel', 'S-goverment', 'I-district', 'B-song', 'E-money', 'E-island', 'I-duration', 'E-electronics', 'B-sub_district', 'E-language', 'E-temperature', 'B-building', 'S-sports_event', 'E-concert', 'B-percent', 'S-island', 'I-museum', 'B-island', 'B-media', 'S-language', 'B-stock_exchange', 'S-title', 'E-natural_disaster', 'B-animal_species', 'B-facility:other', 'B-state', 'E-address', 'I-animate', 'S-stadium', 'I-province', 'B-city', 'B-army', 'I-weapon', 'B-middle', 'S-cardinal', 'E-date', 'S-event:others', 'S-periodic', 'I-speed', 'B-day', 'B-goverment', 'E-media', 'B-norp:others', 'S-nickname', 'E-district', 'I-natural_disaster', 'E-fold', 'I-language', 'S-rel', 'B-product:food', 'B-bridge', 'I-longtitude', 'B-orgcorp', 'B-award', 'E-nationality', 'E-weapon', 'E-province', 'I-tv_show', 'B-energy', 'I-building', 'B-roadname', 'B-continent', 'B-firstname', 'I-stock_exchange', 'S-norp:political', 'S-weight', 'S-restaurant', 'B-weapon', 'S-song', 'B-disease', 'E-river'] | ||
BUILDER_CONFIGS = [ | ||
SEACrowdConfig( | ||
name="thai_nner_source", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name="thai_nner_source", | |
name=f"{_DATASETNAME}_source", |
version=SOURCE_VERSION, | ||
description=" Thai NNER source schema", | ||
schema="source", | ||
subset_id="thai_nner", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subset_id="thai_nner", | |
subset_id=f"{_DATASETNAME}", |
version=SEACROWD_VERSION, | ||
description="Thai NNER SEACrowd schema", | ||
schema="seacrowd_seq_label", | ||
subset_id="thai_nner", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subset_id="thai_nner", | |
subset_id=f"{_DATASETNAME}", |
else: | ||
raise ValueError(f"Invalid config: {self.config.name}") | ||
|
||
# This template is based on the following template from the datasets package: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the comment out part here and afterwards.
Hi @ & @, may I know if you are still working on this PR? |
I've been waiting for the changes for 2 weeks. Please fix or review immediately. |
@bp-high would you addressing @jen-santoso's review? |
Yes @gentaiscool on it. Will push the recommended changes by eod today. |
Hi @ & @, may I know if you are still working on this PR? |
hello @bp-high Would you mind addressing the changes I requested? I would like to get this PR soon, and also so that we all can proceed with completing this dataloader's PR. |
Hi @gentaiscool, @jen-santoso & @bp-high, may I know if you are still working on this PR? |
still working on this will address the changes in a few hours |
@jen-santoso Have added the necessary changes as suggested in the review and also run the make check_file command.
I found that the dataset present in the CONLL format (https://drive.google.com/drive/folders/1FIoSmj-JCRxaDTIbGsiuMhaWERmFUCWq )contains only train and test splits as also mentioned in the paper: Although I found that there is data present in json format which contains the dev split(https://drive.google.com/drive/folders/1F0fqRdnGMjryp8dAToPc7OEBJlA9Xdi7) when I tried to build it into the CONLL BIOES format tags that was provided by the authors it did not came out identical to the original conll data provided by the authors. I have mailed the authors to ask about this. |
@bp-high Thank you for the code updates and the investigation of the dataset. |
Agree. Let's wait for @gentaiscool's review. |
Replacing @gentaiscool with @patrickamadeus for review. |
def _custom_conll_loader(file_path): | ||
# Read file | ||
data = open(file_path, "r", encoding="utf-8").readlines() | ||
|
||
# Prepare buffer | ||
dataset = [] | ||
sentence, seq_label = [], [] | ||
for line in data: | ||
if len(line.strip()) > 0: | ||
# Handle lines with a single field | ||
if "\t" in line: | ||
token, label = line[:-1].split("\t") | ||
elif "\t" not in line and " " in line: | ||
line_split = line.split() | ||
token = line_split[0] | ||
label = line_split[1:] | ||
else: | ||
token, label = line.strip(), "O" # Default label if not provided | ||
|
||
sentence.append(token) | ||
seq_label.append(label) | ||
else: | ||
dataset.append({"sentence": sentence, "label": seq_label}) | ||
sentence = [] | ||
seq_label = [] | ||
return dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bp-high! Thanks for the contribution!
I only want to address a minor bug that might happen here since the implementation hasn't passed the test case.
THIS IS YOUR `row['sentence']`
['5', '_', 'มี', '.ค.', '_']
THIS IS YOUR `row['label']`
[['B-date', 'S-day', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'B-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'E-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]
I think there is a slight mistake within the loader function resulting in the captured label
becomes list of list instead of list of string, sequentially matched with each token from the sentence. Hence, it prompts TypeError: '<=' not supported between instances of 'int' and 'list'
during the check.
Could you please do revision on this? Other than that, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loader is specially modified to give in the format so that we get all labels(upto depth 8) of each word in a corresponding list marked for it.
if self.config.schema == "source": | ||
features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]}) | ||
|
||
elif self.config.schema == "seacrowd_seq_label": | ||
features = schemas.seq_label_features(self.label_classes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the clarification @bp-high . Since that's the case, could we revise the feature schema so the implementation satisfies the testcases?
if self.config.schema == "source": | |
features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]}) | |
elif self.config.schema == "seacrowd_seq_label": | |
features = schemas.seq_label_features(self.label_classes) | |
features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [[datasets.Value("string")]]}) | |
# and for SEACrowd seq_label |
same goes for the SEACrowd
schema
label_classes = [ | ||
"O", | ||
"S-org:religious", | ||
"E-tv_show", | ||
"I-soi", | ||
"E-last", | ||
"E-fund", | ||
"B-country", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I'm checking, I also found this error ValueError: Invalid string class label S-museum
, so far I noticed there are
"S-hotel",
"B-postcode",
"E-postcode",
"B-season",
"E-season",
....
and probably more labels that aren't present in label_classes
.
Could you please make sure the labels are complete and run python3 -m tests.test_seacrowd seacrowd/sea_datasets/thai_nner/thai_nner.py
again?
Thank you!
A friendly reminder for @bp-high to address @patrickamadeus's suggestions. |
Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. |
Hi @bp-high, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again! cc: @patrickamadeus |
Closes #95
Checkbox
seacrowd/sea_datasets/my_dataset/my_dataset.py
(please use only lowercase and underscore for dataset naming)._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
.