Closes #113 | Create dataset loader for HSE Thai #557

Open
wants to merge 3 commits into master
Conversation

@khelli07 (Collaborator) commented Mar 29, 2024

Closes #113

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function (see the sketch after this list).
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/hse_thai/hse_thai.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
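As a quick sanity check for the datasets.load_dataset item above, something along these lines should work; a minimal sketch, assuming a local clone of the repo and the `datasets` library installed:

# Minimal sketch for verifying the loader via datasets.load_dataset;
# the script path is the one named in the checklist above.
import datasets

dset = datasets.load_dataset(
    "seacrowd/sea_datasets/hse_thai/hse_thai.py",
    trust_remote_code=True,  # newer `datasets` releases require this for script-based loaders
)
print(dset)              # splits and row counts
print(dset["train"][0])  # first example, to eyeball the schema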

@ljvmiranda921 (Collaborator) left a comment

Hi @khelli07 ! Thank you for this PR!

I have some comments regarding the dataset. It seems that, beyond language ID, we can use this for translation and POS tagging. The original source dataset seems richer than the one on Kaggle 🤔.

Comment on lines 12 to 18
_CITATION = """\
@misc{rtatman2017hse_thai,
author = {Rachel Tatman},
title = {HSE Thai Corpus},
howpublished = {\\url{https://www.kaggle.com/datasets/rtatman/hse-thai-corpus}},
note = {Accessed: 2023-11-22}
}
Collaborator

Hmm, I know that the source is from Kaggle, but I think the original authors of the dataset are different. I traced it a bit and it led me to this: http://web-corpora.net/ThaiCorpus/search/. Maybe it's better to cite this website and the mentioned authors instead?
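If we do switch, the _CITATION block might look roughly like this; a sketch only, since the actual author list from the website still needs to be confirmed (the author field is a deliberate placeholder, not real metadata):

# Hypothetical _CITATION pointing at the original corpus site instead of Kaggle.
# The author field is a TODO placeholder and must be confirmed before merging.
_CITATION = """\
@misc{hse_thai_corpus,
    author = {TODO: confirm authors from http://web-corpora.net/ThaiCorpus/search/},
    title = {HSE Thai Corpus},
    howpublished = {\\url{http://web-corpora.net/ThaiCorpus/search/}},
    note = {Accessed: 2023-11-22}
}
"""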

Collaborator Author

Let me take a look at it later!


_URLS = "rtatman/hse-thai-corpus"

_SUPPORTED_TASKS = [Tasks.LANGUAGE_IDENTIFICATION]
Collaborator

Another thing: if we look at the original source description, it seems the tasks can be extended beyond language identification, e.g. to translation and part-of-speech tagging:

This website gives access to the HSE Thai Corpus - the corpus of modern texts written in Thai language. The texts, containing in whole 50 million tokens, were collected from various Thai websites (mostly news websites). Each token was assigned its English translation and part of speech tag.

Collaborator Author

Yes, but I think that description refers to the source of the Kaggle dataset; those annotations are not included in the Kaggle dataset itself.

Comment on lines 130 to 131
for row in reader:
    i += 1
Collaborator

Maybe you can use enumerate here, so you don't have to set i = -1?
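For instance, a minimal sketch of that change, assuming `reader` is the csv reader already opened in _generate_examples() (the yielded field names are illustrative only):

# enumerate() supplies the running index, so there is no manual
# i = -1 / i += 1 bookkeeping; the yielded fields are illustrative.
for i, row in enumerate(reader):
    yield i, {"id": str(i), "text": row["text"]}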


_LANGUAGES = ["tha"]

_LICENSE = Licenses.APACHE_2_0.value
Collaborator

I don't think the license is correct here, since the dataset is a merge of two licenses and neither is Apache 2.0. Perhaps Licenses.OTHERS.value? What's the licensing practice here, @holylovenia?

This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand, which can be viewed here (In Thai).
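If Licenses.OTHERS is the right call, the fix itself is a one-liner; a sketch, pending confirmation from the maintainers:

# Sketch of the suggested change: the corpus mixes a Wikipedia license with the
# Thai Open Government License, so no single named license applies.
_LICENSE = Licenses.OTHERS.value  # instead of Licenses.APACHE_2_0.value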


_DATASETNAME = "hse_thai"

_DESCRIPTION = """\
HSE Thai Corpus is a corpus of modern texts written in Thai language. The texts, containing in whole 50 million tokens, were collected from various Thai websites (mostly news websites). To make it easier for non-Thai-speakers to comprehend and use texts in the corpus the researchers decided to separate words in each sentence with spaces. The data for the corpus was collected by means of Scrapy. To tokenize texts the Pythai module was used. The text in this dataset is encoded in UTF-8. This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand.
Collaborator

make check_file=seacrowd/sea_datasets/hse_thai/hse_thai.py returns an E501 error: line too long (684 > 250 characters). We can split the line here.
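One way to clear the E501 without changing the text is implicit concatenation of adjacent string literals; a sketch, with the description truncated here for brevity:

# Adjacent string literals are concatenated at compile time, so the value is
# unchanged while each source line stays under the 250-character limit.
_DESCRIPTION = (
    "HSE Thai Corpus is a corpus of modern texts written in Thai language. "
    "The texts, containing in whole 50 million tokens, were collected from "
    "various Thai websites (mostly news websites). "
    # ... remaining sentences split the same way ...
)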

@yongzx (Collaborator) commented Apr 19, 2024

I've run the test and it works. I agree with @ljvmiranda921's suggestions, and I have left further comments on the license.

@holylovenia (Contributor)

Friendly reminder for @khelli07 to address @yongzx and @ljvmiranda921's suggestions.

@khelli07 (Collaborator Author) commented May 2, 2024

Hmm, I downloaded the original data from the original source (http://web-corpora.net/ThaiCorpus/search/). Unfortunately I can't make much of it, since I don't understand Thai. Here are some screenshots.

List of folders:
[screenshot]

Files inside folders (most are XML):
[screenshot]

I can guess that <se> is a sentence and <w> is a word. For language identification, I guess I can make one sentence per data row, where a sentence is constructed by merging all the black strings inside the <w> tags. But for translation and part-of-speech tagging, I might have to pass this task to someone else :)
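For reference, a rough sketch of that merging with the standard library; it assumes plain XML files where each <se> element wraps <w> elements whose text is the Thai surface form (tag names are taken from the screenshots and are not verified against the full corpus):

# Rough sketch: one data row per <se> (sentence), built by joining the text of
# its <w> (word) children. Tag names are an assumption based on the screenshots.
import xml.etree.ElementTree as ET

def sentences_from_file(path):
    tree = ET.parse(path)
    for se in tree.iter("se"):
        words = [w.text for w in se.iter("w") if w.text]
        if words:
            # keep tokens space-separated, matching the corpus convention
            yield " ".join(words)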

@holylovenia (Contributor)

> Hmm, I downloaded the original data from the original source (http://web-corpora.net/ThaiCorpus/search/). […]

Hmmm, let me ask our Thai contributor. Hello @mrpeerat, could you please help @khelli07 understand this dataset? 🙏

@mrpeerat (Collaborator) commented May 6, 2024

> I can guess that <se> is a sentence and <w> is a word. […]

Hi, to construct a sentence, you can merge all the words (<w>, the black words) in the same sentence (<se>), as you mentioned. As for the translation, the blue words are word-level (word-to-word) translations and should not be used as sentence translations; you can see that the meaning is usually incorrect if you concatenate all the translated words of a sentence. As for the PoS, it is designed for English, not Thai. Feel free to ask more if you have any questions :)

@khelli07 (Collaborator Author) commented May 6, 2024

> the PoS is designed for English, not Thai

Hi, can you explain what you mean by this?

Also, so far, my understanding of what I need to do is:

  1. Instead of using the Kaggle source, use the original source.
  2. The task is still language identification/modelling (same as the original issue), since (a) the translation is not valid for a translation task, and (b) because of (a), the PoS is also not valid for a PoS tagging task.

Am I correct?

@mrpeerat (Collaborator) commented May 6, 2024

> Hi, can you explain what you mean by this? […] Am I correct?

I looked at the PoS and found that some tags were annotated for the translated English word, not for the Thai word. For instance, the word "กับ" could be a noun, preposition, or conjunction; the dataset annotates it as "preposition", but in Thai it should be a "prepositional phrase". In this case, the translation and the PoS of the translation are correct, but the PoS of the Thai word is incorrect.

  1. Correct.
  2a. It's a word-to-word translation (source-to-target, one word at a time, ignoring the semantics of the sentence), so I don't know how useful it is in 2024, since we already have a lot of bilingual corpora.
  2b. Correct.

@holylovenia (Contributor) commented May 13, 2024

Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@khelli07 (Collaborator Author)

Okay, doing it today or tomorrow!

@khelli07 (Collaborator Author) commented May 15, 2024

The download is soooo slow :' )
*It's not my internet; I think it's the server's upload bandwidth (this has happened to me before).

[screenshot]

@holylovenia (Contributor)

> The download is soooo slow :' ) […]

Hi @khelli07, did you manage to download it?

@holylovenia (Contributor)

Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @yongzx @ljvmiranda921

@holylovenia (Contributor)

Hi @khelli07, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @ljvmiranda921 @yongzx

@khelli07 (Collaborator Author) commented Jul 8, 2024

Hi, I ran into a problem while downloading the data (the server's upload bandwidth is too low), so I downloaded it first and uploaded it with Git LFS here: https://github.com/khelli07/hse-thai-for-seacrowd. It seems that the data can be redistributed (please double-check?). I'll try to finish this if this option works.
