
Create dataset loader for HSE Thai Corpus #113

Open
SamuelCahyawijaya opened this issue Nov 22, 2023 · 17 comments · May be fixed by #557
Assignees
Labels
pr-ready A PR that closes this issue is Ready to be reviewed

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Nov 22, 2023

Dataloader name: hse_thai/hse_thai.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?hse_thai

Dataset hse_thai
Description The HSE Thai Corpus is a corpus of modern texts written in Thai. The texts, totaling 50 million tokens, were collected from various Thai websites (mostly news sites). To make the texts easier for non-Thai speakers to comprehend and use, the researchers separated the words in each sentence with spaces. The data was collected with Scrapy, and the texts were tokenized with the Pythai module. The text in this dataset is encoded in UTF-8. The dataset contains text from two sources: Wikipedia and thaigov.go.th. The former is licensed under the standard Wikipedia license, and the latter under an Open Government License for Thailand.
Subsets -
Languages tha
Tasks Language Modeling, Language Identification
License Apache license 2.0 (apache-2.0)
Homepage https://www.kaggle.com/datasets/rtatman/hse-thai-corpus
HF URL -
Paper URL https://www.kaggle.com/datasets/rtatman/hse-thai-corpus/data
@SamuelCahyawijaya SamuelCahyawijaya converted this from a draft issue Nov 22, 2023
@bp-high
Contributor

bp-high commented Nov 22, 2023

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@bp-high
Contributor

bp-high commented Dec 15, 2023

Yep, still working on this issue. I got busy with some things, but I'll try to wrap it up by next week.

@sabilmakbar
Collaborator

Thanks for letting us know, @bp-high; I'm removing the stale tag for now. Please add the pr-ready tag once you have finished your dataloader so that the bot won't mark this issue as stale, or let us know if you need more time for this issue.


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@bp-high
Contributor

bp-high commented Dec 30, 2023

Sorry, I couldn't work on this last weekend due to the Christmas holidays and celebrations. I'll try to conclude it this weekend.

@sabilmakbar
Collaborator

Thanks for the update, @bp-high! No rush on this; please take your time and enjoy your holiday!


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Collaborator

Hi @bp-high, may we ask for an update on this dataloader issue? It's been 3 weeks since the last poke from the SEACrowd stale-checker, and we may consider unassigning it if there's no progress update in the next 24 hours.


Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@holylovenia added the help wanted (Extra attention is needed) label and removed the staled-issue label Feb 19, 2024
@khelli07
Collaborator

#self-assign

@khelli07
Collaborator

khelli07 commented Mar 5, 2024

Hi, I want to ask about this.

On Kaggle there are two sources for this dataset, namely (a) thai-government-corpus.csv and (b) thai-wikipedia-corpus.csv. Both have "article" and "text" columns, and I assume the two sources should be combined. I have two questions:

  1. I'm a bit confused because dataset (a) has a lot of duplicate values (screenshot). Should we still include these?
  2. For the seacrowd schema, do we need to concatenate the columns as "{}-{}".format(article, text), or just take the text column? If we concatenate, the article value in dataset (a) is an integer while in dataset (b) it is a string. How should we handle this? (Compare the previous screenshot with the following one.) (screenshot)

@holylovenia
Contributor

> Hi, I wanna ask about this. In the Kaggle, there are two sources of the dataset […] How should we process this?

Hi @khelli07, I'm also not sure what the content is about since I don't understand Thai. May I ask for your suggestion on this dataset, @mrpeerat and @parinzee? 🙏

@mrpeerat
Collaborator

mrpeerat commented Mar 5, 2024

> Hi, I wanna ask about this. In the Kaggle, there are two sources of the dataset […] How should we process this?

  1. I looked at some samples and found that those are duplicate texts. Feel free to pick only one of them.
  2. It looks like the article column is the Wikipedia header. Picking only the text column is fine.
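
Following the suggestion above (drop duplicates, keep only the text column), a minimal sketch of the combining step might look like this. The tiny in-memory CSVs here are hypothetical stand-ins for the two Kaggle files; the column names match the ones discussed in the thread, but everything else is illustrative:

```python
import io

import pandas as pd

# Hypothetical miniature stand-ins for thai-government-corpus.csv (a)
# and thai-wikipedia-corpus.csv (b); note the "article" column is an
# integer in (a) and a string in (b), as observed in the thread.
gov_csv = io.StringIO("article,text\n1,ข่าว หนึ่ง\n1,ข่าว หนึ่ง\n2,ข่าว สอง\n")
wiki_csv = io.StringIO("article,text\nกรุงเทพ,กรุงเทพ เป็น เมืองหลวง\n")

gov = pd.read_csv(gov_csv)
wiki = pd.read_csv(wiki_csv)

# Keep only the "text" column, so the int-vs-string "article" mismatch
# never matters, then concatenate the two sources and drop duplicates.
corpus = pd.concat([gov[["text"]], wiki[["text"]]], ignore_index=True)
corpus = corpus.drop_duplicates(subset="text").reset_index(drop=True)
```

Taking only the text column sidesteps both questions at once: duplicates collapse under `drop_duplicates`, and the mixed-type article column is never read.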

@khelli07
Collaborator

Hi, I want to ask again. For this dataset, do we count it as local or public? As far as I know, you have to log in to download the dataset, so even though it is accessible to everyone, a login is required first. Another option is the Kaggle API, but it is CLI-based (and of course you still need to log in: https://github.com/Kaggle/kaggle-api).

@holylovenia
Contributor

> Hi, I want to ask again. For this dataset, do we count this as local or public? […]

Hi @khelli07, if it can be solved using the CLI, could we make it _LOCAL = False and attach a guide on how to use it to the _DESCRIPTION like this?
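
For the CLI route, a minimal sketch of the download step might look like the following. The helper names (`kaggle_download_cmd`, `download`) are hypothetical; the Kaggle CLI invocation itself (`kaggle datasets download -d <slug> -p <dir> --unzip`) is the documented usage, and it assumes the `kaggle` package is installed and an API token is configured in `~/.kaggle/kaggle.json`:

```python
import subprocess

# Dataset slug from the Kaggle homepage listed in the issue header
KAGGLE_DATASET = "rtatman/hse-thai-corpus"


def kaggle_download_cmd(dataset: str, out_dir: str = "data") -> list:
    """Build the Kaggle CLI command that fetches and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", out_dir, "--unzip"]


def download(dataset: str = KAGGLE_DATASET, out_dir: str = "data") -> None:
    """Download the corpus; requires `pip install kaggle` and a token
    in ~/.kaggle/kaggle.json (see https://github.com/Kaggle/kaggle-api)."""
    subprocess.run(kaggle_download_cmd(dataset, out_dir), check=True)
```

With `_LOCAL = False`, the `_DESCRIPTION` could then point users to `pip install kaggle` plus the token setup, and the loader would call `download()` before reading the two CSVs.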

@khelli07
Collaborator

The main code is done; I just haven't done the metadata yet. I'll do it in the near future.

@khelli07 khelli07 linked a pull request Mar 29, 2024 that will close this issue
8 tasks
@holylovenia added the pr-ready (A PR that closes this issue is Ready to be reviewed) label and removed the staled-issue label Apr 15, 2024