Closes #113 | Create dataset loader for HSE Thai #557

Open
wants to merge 3 commits into master
Conversation

@khelli07 (Collaborator) commented Mar 29, 2024

Closes #113

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function (see the sketch after this list).
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/hse_thai/hse_thai.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
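As a quick sanity check for the datasets.load_dataset item above, something along these lines should work; a minimal sketch, assuming a local clone of the repo and the `datasets` library installed:

# Minimal sketch for verifying the loader via datasets.load_dataset;
# the script path is the one named in the checklist above.
import datasets

dset = datasets.load_dataset(
    "seacrowd/sea_datasets/hse_thai/hse_thai.py",
    trust_remote_code=True,  # newer `datasets` releases require this for script-based loaders
)
print(dset)              # splits and row counts
print(dset["train"][0])  # first example, to eyeball the schema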

@ljvmiranda921 (Collaborator) left a comment

Hi @khelli07 ! Thank you for this PR!

I have some comments regarding the dataset. It seems that, beyond language ID, we can use this for translation and POS tagging. The original source dataset seems richer than the one on Kaggle 🤔.

Comment on lines 12 to 18
_CITATION = """\
@misc{rtatman2017hse_thai,
author = {Rachel Tatman},
title = {HSE Thai Corpus},
howpublished = {\\url{https://www.kaggle.com/datasets/rtatman/hse-thai-corpus}},
note = {Accessed: 2023-11-22}
}
Collaborator

Hmm, I know that the source is from Kaggle, but I think the original authors of the dataset are different. I traced it a bit and it led me to this: http://web-corpora.net/ThaiCorpus/search/. Maybe it's better to cite this website and the mentioned authors instead?
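If we do switch, the _CITATION block might look roughly like this; a sketch only, since the actual author list from the website still needs to be confirmed (the author field is a deliberate placeholder, not real metadata):

# Hypothetical _CITATION pointing at the original corpus site instead of Kaggle.
# The author field is a TODO placeholder and must be confirmed before merging.
_CITATION = """\
@misc{hse_thai_corpus,
    author = {TODO: confirm authors from http://web-corpora.net/ThaiCorpus/search/},
    title = {HSE Thai Corpus},
    howpublished = {\\url{http://web-corpora.net/ThaiCorpus/search/}},
    note = {Accessed: 2023-11-22}
}
"""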

Collaborator Author

Let me take a look at it later!


_URLS = "rtatman/hse-thai-corpus"

_SUPPORTED_TASKS = [Tasks.LANGUAGE_IDENTIFICATION]
Collaborator

Another thing: if we look at the original source description, it seems the tasks can be extended beyond language identification, e.g. to translation and part-of-speech tagging:

This website gives access to the HSE Thai Corpus - the corpus of modern texts written in Thai language. The texts, containing in whole 50 million tokens, were collected from various Thai websites (mostly news websites). Each token was assigned its English translation and part of speech tag.

Collaborator Author

Yes, but I think that description refers to the source of the Kaggle dataset; those annotations are not included in the Kaggle dataset itself.

Comment on lines 130 to 131
for row in reader:
    i += 1
Collaborator

Maybe you can use enumerate here, so you don't have to set i = -1?
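For instance, a minimal sketch of that change, assuming `reader` is the csv reader already opened in _generate_examples() (the yielded field names are illustrative only):

# enumerate() supplies the running index, so there is no manual
# i = -1 / i += 1 bookkeeping; the yielded fields are illustrative.
for i, row in enumerate(reader):
    yield i, {"id": str(i), "text": row["text"]}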


_LANGUAGES = ["tha"]

_LICENSE = Licenses.APACHE_2_0.value
Collaborator

I don't think the license is correct here, since the dataset is a merge of two licenses and neither is Apache 2.0. Perhaps Licenses.OTHERS.value? What's the licensing practice here, @holylovenia?

This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand, which can be viewed here (In Thai).
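If Licenses.OTHERS is the right call, the fix itself is a one-liner; a sketch, pending confirmation from the maintainers:

# Sketch of the suggested change: the corpus mixes a Wikipedia license with the
# Thai Open Government License, so no single named license applies.
_LICENSE = Licenses.OTHERS.value  # instead of Licenses.APACHE_2_0.value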


_DATASETNAME = "hse_thai"

_DESCRIPTION = """\
HSE Thai Corpus is a corpus of modern texts written in Thai language. The texts, containing in whole 50 million tokens, were collected from various Thai websites (mostly news websites). To make it easier for non-Thai-speakers to comprehend and use texts in the corpus the researchers decided to separate words in each sentence with spaces. The data for the corpus was collected by means of Scrapy. To tokenize texts the Pythai module was used. The text in this dataset is encoded in UTF-8. This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand.
Collaborator

make check_file=seacrowd/sea_datasets/hse_thai/hse_thai.py returns an E501 error: line too long (684 > 250 characters). We can split the line here.
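One way to clear the E501 without changing the text is implicit concatenation of adjacent string literals; a sketch, with the description truncated here for brevity:

# Adjacent string literals are concatenated at compile time, so the value is
# unchanged while each source line stays under the 250-character limit.
_DESCRIPTION = (
    "HSE Thai Corpus is a corpus of modern texts written in Thai language. "
    "The texts, containing in whole 50 million tokens, were collected from "
    "various Thai websites (mostly news websites). "
    # ... remaining sentences split the same way ...
)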

@yongzx (Collaborator) commented Apr 19, 2024

I've run the test and it works. I agree with @ljvmiranda921's suggestions, and I have left further comments on the license.

@holylovenia (Contributor)

Friendly reminder for @khelli07 to address @yongzx and @ljvmiranda921's suggestions.

@khelli07 (Collaborator Author) commented May 2, 2024

Hmm, I downloaded the original data from the original source (http://web-corpora.net/ThaiCorpus/search/). Unfortunately I can't make much of it, since I don't understand Thai. Here are some screenshots.

List of folders:
[screenshot]

Files inside folders (most are XML):
[screenshot]

I can guess that <se> is a sentence and <w> is a word. For language identification, I guess I can make one sentence per data row, where a sentence is constructed by merging all the black strings inside the <w> tags. But for translation and part-of-speech tagging, I might have to pass this task to someone else :)
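For reference, a rough sketch of that merging with the standard library; it assumes plain XML files where each <se> element wraps <w> elements whose text is the Thai surface form (tag names are taken from the screenshots and are not verified against the full corpus):

# Rough sketch: one data row per <se> (sentence), built by joining the text of
# its <w> (word) children. Tag names are an assumption based on the screenshots.
import xml.etree.ElementTree as ET

def sentences_from_file(path):
    tree = ET.parse(path)
    for se in tree.iter("se"):
        words = [w.text for w in se.iter("w") if w.text]
        if words:
            # keep tokens space-separated, matching the corpus convention
            yield " ".join(words)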

@holylovenia (Contributor)

> Hmm, I downloaded the original data from the original source (http://web-corpora.net/ThaiCorpus/search/). […]

Hmmm, let me ask our Thai contributor. Hello @mrpeerat, could you please help @khelli07 understand this dataset? 🙏

@mrpeerat (Collaborator) commented May 6, 2024

> I can guess that <se> is a sentence and <w> is a word. […]

Hi, to construct a sentence, you can merge all the words (<w>, the black words) in the same sentence (<se>), as you mentioned. As for the translation, the blue words are word-level (word-to-word) translations and should not be used as sentence translations; you can see that the meaning is usually incorrect if you concatenate all the translated words of a sentence. As for the PoS, it is designed for English, not Thai. Feel free to ask more if you have any questions :)

@khelli07 (Collaborator Author) commented May 6, 2024

> the PoS is designed for English, not Thai

Hi, can you explain what you mean by this?

Also, so far, my understanding of what I need to do is:

  1. Instead of using the Kaggle source, use the original source.
  2. The task is still language identification/modelling (same as the original issue), since (a) the translation is not valid for a translation task, and (b) because of (a), the PoS is also not valid for a PoS tagging task.

Am I correct?

@mrpeerat (Collaborator) commented May 6, 2024

> Hi, can you explain what you mean by this? […] Am I correct?

I looked at the PoS and found that some tags were annotated for the translated English word, not for the Thai word. For instance, the word "กับ" could be a noun, preposition, or conjunction; the dataset annotates it as "preposition", but in Thai it should be a "prepositional phrase". In this case, the translation and the PoS of the translation are correct, but the PoS of the Thai word is incorrect.

  1. Correct.
  2a. It's a word-to-word translation (source-to-target, one word at a time, ignoring the semantics of the sentence), so I don't know how useful it is in 2024, since we already have a lot of bilingual corpora.
  2b. Correct.

@holylovenia (Contributor) commented May 13, 2024

Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.

@khelli07 (Collaborator Author)

Okay, doing it today or tomorrow!

@khelli07 (Collaborator Author) commented May 15, 2024

The download is soooo slow :' )
*It's not my internet; I think it's the server's upload bandwidth (this has happened to me before).

[screenshot]

@holylovenia (Contributor)

> The download is soooo slow :' ) […]

Hi @khelli07, did you manage to download it?

@holylovenia (Contributor)

Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @yongzx @ljvmiranda921

@holylovenia (Contributor)

Hi @khelli07, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

cc: @ljvmiranda921 @yongzx

@khelli07 (Collaborator Author) commented Jul 8, 2024

Hi, I ran into a problem while downloading the data (the server's upload bandwidth is too low), so I downloaded it first and uploaded it with Git LFS here: https://github.com/khelli07/hse-thai-for-seacrowd. It seems that the data can be redistributed (please double-check?). I'll try to finish this if this option works.
