Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closes #95 Adds Thai NNER Dataset #326

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

bp-high
Copy link
Contributor

@bp-high bp-high commented Jan 15, 2024

Closes #95

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Copy link
Collaborator

@jensan-1 jensan-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bp-high
Thanks for implemeting the dataloader.

There are several issues I would like to address:

  1. The dataset is reported to be split to train, dev, test (numbers 2935, 979, 980 in the paper, however it is only train, test (numbers 3914, 980 in this dataloader. Could you recheck the link, and implement the dataset split as the one in the paper? This implementation fix would also apply to the SplitGenerator.
  2. Please revise the minor fixes to conform to our dataset format. The suggestions are provided as shown. Also, please be sure to run make check_file command.

Waiting for the update!

seacrowd/sea_datasets/thai_nner/thai_nner.py Outdated Show resolved Hide resolved
label_classes = ['O', 'S-org:religious', 'E-tv_show', 'I-soi', 'E-last', 'E-fund', 'B-country', 'B-port', 'B-airport', 'I-title', 'E-stock_exchange', 'I-restaurant', 'I-station', 'I-org:political', 'I-animal_species', 'I-index', 'E-latitude', 'B-law', 'S-middle', 'I-product:food', 'E-animal_species', 'I-goverment', 'S-city', 'E-longtitude', 'I-song', 'I-award', 'E-event:others', 'B-jargon', 'S-facility:other', 'S-money', 'I-stadium', 'E-space', 'S-animate', 'E-station', 'S-percent', 'E-country', 'I-year', 'S-district', 'E-org:religious', 'E-continent', 'I-food:ingredient', 'I-person', 'B-electronics', 'B-museum', 'B-product:drug', 'B-month', 'B-loc:others', 'B-org:religious', 'S-season', 'I-continent', 'E-norp:political', 'E-airport', 'B-money', 'S-role', 'B-cardinal', 'B-film', 'B-hotel', 'B-unit', 'E-energy', 'I-jargon', 'I-date', 'S-norp:others', 'S-country', 'E-film', 'I-energy', 'B-norp:political', 'E-index', 'I-concert', 'I-sciname', 'I-product:drug', 'I-org:other', 'S-animal_species', 'I-hospital', 'B-nationality', 'S-law', 'S-disease', 'I-latitude', 'E-award', 'B-periodic', 'I-roadname', 'S-jargon', 'E-city', 'S-person', 'S-date', 'E-goverment', 'B-stadium', 'S-religion', 'B-address', 'E-band', 'S-continent', 'I-fold', 'E-stadium', 'S-band', 'E-middle', 'B-org:edu', 'I-distance', 'I-island', 'S-hospital', 'S-loc:others', 'S-book', 'I-war', 'I-norp:political', 'E-title', 'S-war', 'S-org:edu', 'I-nationality', 'E-restaurant', 'S-province', 'E-war', 'I-quantity', 'S-orgcorp', 'E-person', 'E-day', 'E-disease', 'I-last', 'S-product:drug', 'S-space', 'E-product:drug', 'I-country', 'E-weight', 'B-fold', 'E-sciname', 'E-building', 'E-duration', 'B-longtitude', 'E-religion', 'S-soi', 'E-woa', 'B-date', 'I-city', 'I-role', 'E-org:edu', 'I-norp:others', 'B-title', 'I-namemod', 'E-speed', 'S-nicknametitle', 'E-mult', 'E-army', 'I-day', 'S-duration', 'B-time', 'I-rel', 'B-language', 'B-ocean', 'I-woa', 'E-animate', 'I-facility:other', 'E-psudoname', 'E-hotel', 'I-bridge', 'I-port', 'S-address', 'B-war', 'E-mountian', 'B-psudoname', 'I-sports_team', 'B-last', 'B-animate', 'B-latitude', 'E-cardinal', 'S-roadname', 'I-sub_district', 'E-food:ingredient', 'E-nickname', 'E-port', 'B-district', 'B-sports_team', 'E-org:political', 'I-loc:others', 'E-ocean', 'E-month', 'S-bridge', 'B-fund', 'I-money', 'I-state', 'S-unit', 'B-distance', 'B-food:ingredient', 'E-unit', 'S-namemod', 'S-airport', 'I-cardinal', 'S-sub_district', 'S-day', 'B-woa', 'S-army', 'B-tv_show', 'E-bridge', 'E-loc:others', 'S-sciname', 'I-law', 'B-space', 'B-religion', 'I-film', 'E-quantity', 'S-org:other', 'I-sports_event', 'I-event:others', 'S-firstname', 'I-band', 'I-weight', 'E-song', 'I-book', 'E-sports_event', 'S-weapon', 'I-space', 'B-speed', 'B-event:others', 'E-soi', 'I-org:edu', 'E-product:food', 'S-year', 'B-concert', 'B-province', 'B-sciname', 'I-vehicle', 'E-percent', 'I-percent', 'E-org:other', 'I-orgcorp', 'B-org:other', 'E-year', 'I-media', 'S-food:ingredient', 'I-electronics', 'E-hospital', 'I-psudoname', 'I-month', 'E-distance', 'B-duration', 'S-nationality', 'I-firstname', 'B-namemod', 'S-building', 'S-psudoname', 'B-restaurant', 'I-org:religious', 'B-vehicle', 'B-sports_event', 'B-book', 'B-temperature', 'E-norp:others', 'E-time', 'B-org:political', 'E-state', 'I-periodic', 'S-tv_show', 'S-product:food', 'S-postcode', 'S-mult', 'I-disease', 'S-month', 'B-weight', 'B-station', 'E-book', 'E-sub_district', 'E-orgcorp', 'S-fund', 'I-nickname', 'B-band', 'S-stock_exchange', 'S-natural_disaster', 'S-electronics', 'E-sports_team', 'E-museum', 'I-fund', 'B-river', 'B-rel', 'I-unit', 'I-time', 'S-mountian', 'S-port', 'I-temperature', 'I-airport', 'S-god', 'B-year', 'I-army', 'B-index', 'B-natural_disaster', 'E-roadname', 'E-namemod', 'S-game', 'I-hotel', 'B-quantity', 'B-person', 'B-mountian', 'S-river', 'E-vehicle', 'S-media', 'S-sports_team', 'I-river', 'I-mountian', 'B-hospital', 'S-state', 'B-mult', 'S-film', 'B-soi', 'E-role', 'S-org:political', 'S-station', 'E-jargon', 'S-last', 'S-ocean', 'S-quantity', 'S-award', 'E-law', 'I-middle', 'E-firstname', 'E-facility:other', 'I-address', 'I-religion', 'S-time', 'E-periodic', 'S-vehicle', 'B-nickname', 'S-fold', 'B-role', 'E-rel', 'S-goverment', 'I-district', 'B-song', 'E-money', 'E-island', 'I-duration', 'E-electronics', 'B-sub_district', 'E-language', 'E-temperature', 'B-building', 'S-sports_event', 'E-concert', 'B-percent', 'S-island', 'I-museum', 'B-island', 'B-media', 'S-language', 'B-stock_exchange', 'S-title', 'E-natural_disaster', 'B-animal_species', 'B-facility:other', 'B-state', 'E-address', 'I-animate', 'S-stadium', 'I-province', 'B-city', 'B-army', 'I-weapon', 'B-middle', 'S-cardinal', 'E-date', 'S-event:others', 'S-periodic', 'I-speed', 'B-day', 'B-goverment', 'E-media', 'B-norp:others', 'S-nickname', 'E-district', 'I-natural_disaster', 'E-fold', 'I-language', 'S-rel', 'B-product:food', 'B-bridge', 'I-longtitude', 'B-orgcorp', 'B-award', 'E-nationality', 'E-weapon', 'E-province', 'I-tv_show', 'B-energy', 'I-building', 'B-roadname', 'B-continent', 'B-firstname', 'I-stock_exchange', 'S-norp:political', 'S-weight', 'S-restaurant', 'B-weapon', 'S-song', 'B-disease', 'E-river']
BUILDER_CONFIGS = [
SEACrowdConfig(
name="thai_nner_source",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name="thai_nner_source",
name=f"{_DATASETNAME}_source",

seacrowd/sea_datasets/thai_nner/thai_nner.py Outdated Show resolved Hide resolved
version=SOURCE_VERSION,
description=" Thai NNER source schema",
schema="source",
subset_id="thai_nner",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
subset_id="thai_nner",
subset_id=f"{_DATASETNAME}",

seacrowd/sea_datasets/thai_nner/thai_nner.py Outdated Show resolved Hide resolved
version=SEACROWD_VERSION,
description="Thai NNER SEACrowd schema",
schema="seacrowd_seq_label",
subset_id="thai_nner",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
subset_id="thai_nner",
subset_id=f"{_DATASETNAME}",

seacrowd/sea_datasets/thai_nner/thai_nner.py Outdated Show resolved Hide resolved
else:
raise ValueError(f"Invalid config: {self.config.name}")

# This template is based on the following template from the datasets package:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the comment out part here and afterwards.

seacrowd/sea_datasets/thai_nner/thai_nner.py Outdated Show resolved Hide resolved
seacrowd/sea_datasets/thai_nner/thai_nner.py Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Feb 6, 2024

Hi @ & @, may I know if you are still working on this PR?

@jensan-1
Copy link
Collaborator

jensan-1 commented Feb 6, 2024

Hi @ & @, may I know if you are still working on this PR?

I've been waiting for the changes for 2 weeks. Please fix or review immediately.

@gentaiscool
Copy link
Collaborator

@bp-high would you addressing @jen-santoso's review?

@bp-high
Copy link
Contributor Author

bp-high commented Feb 13, 2024

Yes @gentaiscool on it. Will push the recommended changes by eod today.

Copy link

Hi @ & @, may I know if you are still working on this PR?

@jensan-1
Copy link
Collaborator

hello @bp-high

Would you mind addressing the changes I requested? I would like to get this PR soon, and also so that we all can proceed with completing this dataloader's PR.

Copy link

Hi @gentaiscool, @jen-santoso & @bp-high, may I know if you are still working on this PR?

@bp-high
Copy link
Contributor Author

bp-high commented Mar 11, 2024

still working on this will address the changes in a few hours

@bp-high
Copy link
Contributor Author

bp-high commented Mar 26, 2024

@jen-santoso Have added the necessary changes as suggested in the review and also run the make check_file command.
As for this comment:

The dataset is reported to be split to train, dev, test (numbers 2935, 979, 980 in the paper, however it is only train, test (numbers 3914, 980 in this dataloader. Could you recheck the link, and implement the dataset split as the one in the paper? This implementation fix would also apply to the SplitGenerator

I found that the dataset present in the CONLL format (https://drive.google.com/drive/folders/1FIoSmj-JCRxaDTIbGsiuMhaWERmFUCWq )contains only train and test splits as also mentioned in the paper:
Screenshot 2024-03-26 at 1 43 29 PM

Although I found that there is data present in json format which contains the dev split(https://drive.google.com/drive/folders/1F0fqRdnGMjryp8dAToPc7OEBJlA9Xdi7) when I tried to build it into the CONLL BIOES format tags that was provided by the authors it did not came out identical to the original conll data provided by the authors. I have mailed the authors to ask about this.

@jensan-1
Copy link
Collaborator

jensan-1 commented Mar 27, 2024

@bp-high Thank you for the code updates and the investigation of the dataset.
The dataloader runs well and LGTM for now.
For the updates on the splits, I guess for the time being let's leave the dataset split as is and update in another PR (or later) when the authors respond.
WDYT? @SamuelCahyawijaya @holylovenia ?

@holylovenia
Copy link
Contributor

@bp-high Thank you for the code updates and the investigation of the dataset. The dataloader runs well and LGTM for now. For the updates on the splits, I guess for the time being let's leave the dataset split as is and update in another PR (or later) when the authors respond. WDYT? @SamuelCahyawijaya @holylovenia ?

Agree. Let's wait for @gentaiscool's review.

@holylovenia holylovenia requested review from patrickamadeus and removed request for gentaiscool April 8, 2024 08:12
@holylovenia
Copy link
Contributor

Replacing @gentaiscool with @patrickamadeus for review.

Comment on lines +556 to +581
def _custom_conll_loader(file_path):
# Read file
data = open(file_path, "r", encoding="utf-8").readlines()

# Prepare buffer
dataset = []
sentence, seq_label = [], []
for line in data:
if len(line.strip()) > 0:
# Handle lines with a single field
if "\t" in line:
token, label = line[:-1].split("\t")
elif "\t" not in line and " " in line:
line_split = line.split()
token = line_split[0]
label = line_split[1:]
else:
token, label = line.strip(), "O" # Default label if not provided

sentence.append(token)
seq_label.append(label)
else:
dataset.append({"sentence": sentence, "label": seq_label})
sentence = []
seq_label = []
return dataset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bp-high! Thanks for the contribution!

I only want to address a minor bug that might happen here since the implementation hasn't passed the test case.

THIS IS YOUR `row['sentence']`
['5', '_', 'มี', '.ค.', '_']

THIS IS YOUR `row['label']`
[['B-date', 'S-day', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'B-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'E-month', 'O', 'O', 'O', 'O', 'O', 'O'], ['I-date', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

I think there is a slight mistake within the loader function resulting in the captured label becomes list of list instead of list of string, sequentially matched with each token from the sentence. Hence, it prompts TypeError: '<=' not supported between instances of 'int' and 'list' during the check.

Could you please do revision on this? Other than that, LGTM!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope this is expected as it is a depth based NER(NNER/ nested NER) with each word having eight depth based ner tags. So the list belongs to each word.
Screenshot 2024-04-26 at 4 08 59 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loader is specially modified to give in the format so that we get all labels(upto depth 8) of each word in a corresponding list marked for it.

@bp-high bp-high requested a review from patrickamadeus April 25, 2024 22:42
Comment on lines +516 to +520
if self.config.schema == "source":
features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]})

elif self.config.schema == "seacrowd_seq_label":
features = schemas.seq_label_features(self.label_classes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarification @bp-high . Since that's the case, could we revise the feature schema so the implementation satisfies the testcases?

Suggested change
if self.config.schema == "source":
features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [datasets.Value("string")]})
elif self.config.schema == "seacrowd_seq_label":
features = schemas.seq_label_features(self.label_classes)
features = datasets.Features({"index": datasets.Value("string"), "tokens": [datasets.Value("string")], "ner_tag": [[datasets.Value("string")]]})
# and for SEACrowd seq_label

same goes for the SEACrowd schema

Comment on lines +105 to +112
label_classes = [
"O",
"S-org:religious",
"E-tv_show",
"I-soi",
"E-last",
"E-fund",
"B-country",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm checking, I also found this error ValueError: Invalid string class label S-museum, so far I noticed there are

        "S-hotel",
        "B-postcode",
        "E-postcode",
        "B-season",
        "E-season",
       ....

and probably more labels that aren't present in label_classes.

Could you please make sure the labels are complete and run python3 -m tests.test_seacrowd seacrowd/sea_datasets/thai_nner/thai_nner.py again?

Thank you!

@holylovenia
Copy link
Contributor

A friendly reminder for @bp-high to address @patrickamadeus's suggestions.

@holylovenia
Copy link
Contributor

holylovenia commented May 13, 2024

Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@holylovenia
Copy link
Contributor

Hi @bp-high, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@holylovenia
Copy link
Contributor

Hi @bp-high, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @patrickamadeus

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create dataset loader for Thai-NNER
5 participants