Closes #113 | Create dataset loader for HSE Thai #557
base: master
Conversation
Hi @khelli07! Thank you for this PR!
I have some comments regarding the dataset. It seems that beyond language ID, we can use this for translation and POS tagging. The original source dataset seems richer than the one on Kaggle 🤔.
_CITATION = """\ | ||
@misc{rtatman2017hse_thai, | ||
author = {Rachel Tatman}, | ||
title = {HSE Thai Corpus}, | ||
howpublished = {\\url{https://www.kaggle.com/datasets/rtatman/hse-thai-corpus}}, | ||
note = {Accessed: 2023-11-22} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I know that the source is from Kaggle, but I think the original authors of the dataset are different. I traced it a bit and it led me to this: http://web-corpora.net/ThaiCorpus/search/. Maybe it's better to cite this website and the mentioned authors instead?
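For concreteness, a hypothetical replacement entry could look like the sketch below; the BibTeX key is made up here, and the author field is left as a placeholder to be filled in from the names listed on that website:

_CITATION = """\
@misc{hse_thai_corpus,
    author = {...},
    title = {HSE Thai Corpus},
    howpublished = {\\url{http://web-corpora.net/ThaiCorpus/search/}},
    note = {Accessed: 2023-11-22}
}
"""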
Let me look into it later!
_URLS = "rtatman/hse-thai-corpus"

_SUPPORTED_TASKS = [Tasks.LANGUAGE_IDENTIFICATION]
Another thing: if we look at the original source description, it seems the tasks can be extended beyond language identification, to translation and part-of-speech tagging (see the sketch after the quote):
This website gives access to the HSE Thai Corpus - the corpus of modern texts written in Thai language. The texts, containing in whole 50 million tokens, were collected from various Thai websites (mostly news websites). Each token was assigned its English translation and part of speech tag.
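If those annotations were included, the change might look like this sketch; the exact enum member names are assumptions and should be checked against seacrowd/utils/constants.py:

_SUPPORTED_TASKS = [
    Tasks.LANGUAGE_IDENTIFICATION,
    Tasks.MACHINE_TRANSLATION,  # assumed member name
    Tasks.POS_TAGGING,  # assumed member name
]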
Yes, but I think that description refers to the original source of the Kaggle dataset! Those annotations are not included in the Kaggle dataset itself.
for row in reader:
    i += 1
Maybe you can use enumerate here, so you don't have to set i = -1?
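A minimal sketch of the suggestion, with an assumed CSV layout and field names (not the PR's actual schema):

import csv

def _generate_examples(filepath):
    with open(filepath, encoding="utf-8") as f:
        reader = csv.reader(f)
        # enumerate() yields the running index with each row, so there is
        # no need to initialize i = -1 and increment it manually.
        for i, row in enumerate(reader):
            yield i, {"id": str(i), "text": row[0]}  # illustrative fields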
_LANGUAGES = ["tha"]

_LICENSE = Licenses.APACHE_2_0.value
I don't think the license is correct here, since the dataset merges two licenses and neither is Apache 2.0. Perhaps Licenses.OTHERS.value? What's the licensing practice here, @holylovenia?
This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand, which can be viewed here (in Thai).
Let's use Licenses.OTHERS.value, and please add the info that @yongzx quoted into the Licenses.OTHERS.value.
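Put together, the agreed change could look roughly like the sketch below; exactly how the quoted licensing info is attached alongside Licenses.OTHERS.value may differ in practice:

# Neither source license is Apache 2.0, so fall back to the generic OTHERS
# license and keep the source's own licensing statement next to it.
_LICENSE = Licenses.OTHERS.value
# "This dataset contains text from two sources: Wikipedia and thaigov.go.th.
# The former is licensed under a standard Wikipedia license, and the latter
# under an Open Government License for Thailand."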
_DATASETNAME = "hse_thai" | ||
|
||
_DESCRIPTION = """\ | ||
HSE Thai Corpus is a corpus of modern texts written in Thai language. The texts, containing in whole 50 million tokens, were collected from various Thai websites (mostly news websites). To make it easier for non-Thai-speakers to comprehend and use texts in the corpus the researchers decided to separate words in each sentence with spaces. The data for the corpus was collected by means of Scrapy. To tokenize texts the Pythai module was used. The text in this dataset is encoded in UTF-8. This dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under a standard Wikipedia license, and the latter under an Open Government License for Thailand. |
make check_file=seacrowd/sea_datasets/hse_thai/hse_thai.py returns an E501 line too long (684 > 250 characters) error. We can split the lines here.
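One way to satisfy the linter, sketched below: Python concatenates adjacent string literals, so the description can be split across short physical lines without changing its value (the text is truncated here for brevity):

_DESCRIPTION = (
    "HSE Thai Corpus is a corpus of modern texts written in Thai language. "
    "The texts, containing in whole 50 million tokens, were collected from "
    "various Thai websites (mostly news websites). ..."
)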
I've run the test and it works. I agree with @ljvmiranda921's suggestions, and I have left further comments on the license.
Friendly reminder for @khelli07 to address @yongzx and @ljvmiranda921's suggestions.
Hmm, I downloaded the original data from the original source (http://web-corpora.net/ThaiCorpus/search/). But unfortunately I have no idea at all, since I do not understand Thai. Here are some screenshots (files inside the folders; most are XML). I can guess that ...
Hmmm, let me ask our Thai contributor. Hello @mrpeerat, could you please help @khelli07 understand this dataset? 🙏
Hi, to construct the sentence, you can merge all the words.
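A minimal sketch of that suggestion, assuming the XML marks sentences with <se> elements and word tokens with <w> elements (the tag names are a guess, not confirmed from the corpus):

import xml.etree.ElementTree as ET

def sentences_from_xml(path):
    tree = ET.parse(path)
    for se in tree.iter("se"):
        # Merge the per-word tokens back into one sentence string; the
        # corpus keeps words space-separated, so join with a space.
        yield " ".join(w.text for w in se.iter("w") if w.text)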
Hi, can you explain what you mean by this? Also, so far, this is what I understand I need to do:

Am I correct?
I looked at the PoS tags and found that some of them were annotated for the translated word, not for the Thai word. For instance, given the word "กับ", the PoS could be a noun, preposition, or conjunction, and the dataset says "preposition". But in Thai, the word should be tagged as a prepositional phrase. In this case, it looks like the translation and the PoS of the translation are correct, but the PoS of the Thai word is incorrect.
Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) by 30 May, so it'd be great if we could wrap up the reviewing and merge this PR before then.
Okay, doing it today or tomorrow!
Hi @khelli07, did you manage to download it?
Hi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
Hi @khelli07, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again!
Hi, I ran into a problem while downloading the data (the server's upload capacity is too low), so I decided to download it first and upload it with Git LFS here (https://github.com/khelli07/hse-thai-for-seacrowd). It seems that the data can be redistributed (please double-check?). I'll try to finish this if this option works.
Closes #113
Checkbox

- Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in the dataloader script.
- Make sure the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
- Confirm the dataloader script works with the datasets.load_dataset function.
- Confirm the dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/hse_thai/hse_thai.py.