
Create dataset loader for HSE Thai Corpus #113

Open
SamuelCahyawijaya opened this issue Nov 22, 2023 · 17 comments · May be fixed by #557
Assignees
Labels
pr-ready A PR that closes this issue is Ready to be reviewed

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Nov 22, 2023

Dataloader name: hse_thai/hse_thai.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?hse_thai

Dataset hse_thai
Description The HSE Thai Corpus is a corpus of modern texts written in Thai. The texts, totaling 50 million tokens, were collected from various Thai websites (mostly news sites). To make the texts easier for non-Thai speakers to comprehend and use, the researchers separated the words in each sentence with spaces. The data was collected with Scrapy, and the texts were tokenized with the Pythai module. The text in this dataset is encoded in UTF-8. The dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under the standard Wikipedia license, and the latter under an Open Government License for Thailand.
Subsets -
Languages tha
Tasks Language Modeling, Language Identification
License Apache license 2.0 (apache-2.0)
Homepage https://www.kaggle.com/datasets/rtatman/hse-thai-corpus
HF URL -
Paper URL https://www.kaggle.com/datasets/rtatman/hse-thai-corpus/data
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Nov 22, 2023
@bp-high
Contributor

bp-high commented Nov 22, 2023

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@bp-high
Contributor

bp-high commented Dec 15, 2023

Yep, still working on this issue. I got busy with some things, but I'll try to wrap it up by next week.

@sabilmakbar
Collaborator

Thanks for letting us know, @bp-high; I'm removing the stale tag for now. Please add the pr-ready tag once you have finished your dataloader so that the bot won't mark this issue as stale, or let us know if you need more time for this issue.


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@bp-high
Contributor

bp-high commented Dec 30, 2023

Sorry, I couldn't work on this last weekend due to the Christmas holidays and celebrations. I'll try to conclude it this weekend.

@sabilmakbar
Collaborator

Thanks for the update, @bp-high! No rush on this; please take your time and enjoy your holiday!


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Collaborator

Hi @bp-high, may we ask for an update on this dataloader issue? It's been 3 weeks since the last poke from the SEACrowd stale-checker, and we may consider unassigning it if there's no progress update in the next 24 hours.


Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@holylovenia added the help wanted (Extra attention is needed) label and removed the staled-issue label Feb 19, 2024
@khelli07
Collaborator

#self-assign

@khelli07
Collaborator

khelli07 commented Mar 5, 2024

Hi, I want to ask about this.

On Kaggle there are two sources for this dataset, namely (a) thai-government-corpus.csv and (b) thai-wikipedia-corpus.csv. Both have "article" and "text" columns, and I assume the two sources should be combined. I have two questions:

  1. I'm a bit confused because dataset (a) has a lot of duplicate values (screenshot). Should we still include these?
  2. For the seacrowd schema, do we need to concatenate the columns as "{}-{}".format(article, text), or just take the text column? If we concatenate, the article value in dataset (a) is an integer while in dataset (b) it is a string. How should we handle this? (Compare the previous screenshot with the following one.) (screenshot)

@holylovenia
Contributor

> Hi, I wanna ask about this. In the Kaggle, there are two sources of the dataset […] How should we process this?

Hi @khelli07, I'm also not sure what the content is about since I don't understand Thai. May I ask for your suggestion on this dataset, @mrpeerat and @parinzee? 🙏

@mrpeerat
Collaborator

mrpeerat commented Mar 5, 2024

> Hi, I wanna ask about this. In the Kaggle, there are two sources of the dataset […] How should we process this?

  1. I looked at some samples and found that those are duplicate texts. Feel free to pick only one of them.
  2. It looks like the article column is the Wikipedia header. Picking only the text column is fine.
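
Following the suggestion above (drop duplicates, keep only the text column), a minimal sketch of the combining step might look like this. The tiny in-memory CSVs here are hypothetical stand-ins for the two Kaggle files; the column names match the ones discussed in the thread, but everything else is illustrative:

```python
import io

import pandas as pd

# Hypothetical miniature stand-ins for thai-government-corpus.csv (a)
# and thai-wikipedia-corpus.csv (b); note the "article" column is an
# integer in (a) and a string in (b), as observed in the thread.
gov_csv = io.StringIO("article,text\n1,ข่าว หนึ่ง\n1,ข่าว หนึ่ง\n2,ข่าว สอง\n")
wiki_csv = io.StringIO("article,text\nกรุงเทพ,กรุงเทพ เป็น เมืองหลวง\n")

gov = pd.read_csv(gov_csv)
wiki = pd.read_csv(wiki_csv)

# Keep only the "text" column, so the int-vs-string "article" mismatch
# never matters, then concatenate the two sources and drop duplicates.
corpus = pd.concat([gov[["text"]], wiki[["text"]]], ignore_index=True)
corpus = corpus.drop_duplicates(subset="text").reset_index(drop=True)
```

Taking only the text column sidesteps both questions at once: duplicates collapse under `drop_duplicates`, and the mixed-type article column is never read.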

@khelli07
Collaborator

Hi, I want to ask again. For this dataset, do we count it as local or public? As far as I know, you have to log in to download the dataset, so even though it is accessible to everyone, a login is required first. Another option is the Kaggle API, but it is CLI-based (and of course you still need to log in: https://github.com/Kaggle/kaggle-api).

@holylovenia
Contributor

> Hi, I want to ask again. For this dataset, do we count this as local or public? […]

Hi @khelli07, if it can be solved using the CLI, could we make it _LOCAL = False and attach a guide on how to use it to the _DESCRIPTION like this?
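
For the CLI route, a minimal sketch of the download step might look like the following. The helper names (`kaggle_download_cmd`, `download`) are hypothetical; the Kaggle CLI invocation itself (`kaggle datasets download -d <slug> -p <dir> --unzip`) is the documented usage, and it assumes the `kaggle` package is installed and an API token is configured in `~/.kaggle/kaggle.json`:

```python
import subprocess

# Dataset slug from the Kaggle homepage listed in the issue header
KAGGLE_DATASET = "rtatman/hse-thai-corpus"


def kaggle_download_cmd(dataset: str, out_dir: str = "data") -> list:
    """Build the Kaggle CLI command that fetches and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", out_dir, "--unzip"]


def download(dataset: str = KAGGLE_DATASET, out_dir: str = "data") -> None:
    """Download the corpus; requires `pip install kaggle` and a token
    in ~/.kaggle/kaggle.json (see https://github.com/Kaggle/kaggle-api)."""
    subprocess.run(kaggle_download_cmd(dataset, out_dir), check=True)
```

With `_LOCAL = False`, the `_DESCRIPTION` could then point users to `pip install kaggle` plus the token setup, and the loader would call `download()` before reading the two CSVs.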

@khelli07
Collaborator

The main code is done; I just haven't done the metadata yet. I'll do it in the near future.

@khelli07 khelli07 linked a pull request Mar 29, 2024 that will close this issue
8 tasks
@holylovenia added the pr-ready (A PR that closes this issue is Ready to be reviewed) label and removed the staled-issue label Apr 15, 2024