Create dataset loader for NUS SMS Corpus #221

Open
SamuelCahyawijaya opened this issue Dec 26, 2023 · 10 comments · May be fixed by #596
Labels: pr-ready (A PR that closes this issue is ready to be reviewed)

Comments

@SamuelCahyawijaya (Collaborator)

Dataloader name: nus_sms_corpus/nus_sms_corpus.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?nus_sms_corpus

Dataset: nus_sms_corpus
Description: This is a corpus of SMS (Short Message Service) messages collected for research at the Department of Computer Science at the National University of Singapore. This dataset consists of 67,093 SMS messages taken from the corpus on Mar 9, 2015. The messages largely originate from Singaporeans and mostly from students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The data collectors opportunistically collected as much metadata about the messages and their senders as possible, so as to enable different types of analyses.
Subsets: English, Mandarin Chinese
Languages: eng, cmn
Tasks: Language Modeling
License: Unknown (unknown)
Homepage: https://github.com/kite1988/nus-sms-corpus
HF URL: -
Paper URL: https://link.springer.com/article/10.1007/s10579-012-9197-9

@SamuelCahyawijaya converted this from a draft issue on Dec 26, 2023

@reynardryanda

#self-assign

Copy link

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar (Collaborator)

Hi @reynardryanda, may we have an update on this dataloader issue? It's been 3 weeks since the last poke from the SEACrowd stale-checker, and we might consider unassigning if there's no progress update in the next 24 hours.

Copy link

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@reynardryanda

I will try to create a PR by this weekend or next week if that's okay with you guys. So sorry for taking so long.

@reynardryanda

Hello @sabilmakbar or maybe @holylovenia, I think this corpus does not have a clear downstream task. The paper also concluded that the corpus may need further annotations for it to be used on other projects. Any suggestions? Please also check the sample data, just in case that I might be wrong.

@holylovenia (Contributor) commented Mar 11, 2024

> Hello @sabilmakbar or maybe @holylovenia, I think this corpus does not have a clear downstream task. The paper also concluded that the corpus may need further annotations for it to be used on other projects. Any suggestions? Please also check the sample data, just in case that I might be wrong.

Hi @reynardryanda, sorry I missed your comment. We can use Tasks.LANGUAGE_MODELING and the ssp schema for unlabeled data like this.

Here's the link to constants.py just in case you want to take a look at other tasks and schemas available.
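
As a rough illustration of what an unlabeled, text-only schema implies for this corpus, here is a minimal sketch. It assumes that an `ssp`-style example carries only an `id` and a `text` field, and that the messages have already been read into a list of strings. The function name `generate_ssp_examples` and the sample messages are hypothetical and are not taken from the SEACrowd codebase or from the corpus.

```python
from typing import Dict, Iterable, Iterator, Tuple


def generate_ssp_examples(messages: Iterable[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs that hold only an id and the raw text.

    Assumption: an unlabeled-text example needs just these two string fields.
    """
    key = 0
    for message in messages:
        text = message.strip()
        if not text:
            continue  # skip messages that are empty after stripping whitespace
        yield key, {"id": str(key), "text": text}
        key += 1


# Hypothetical sample input, not real corpus data.
sample_messages = ["Hello, are you free for lunch?", "   ", "Ok see you later"]
examples = list(generate_ssp_examples(sample_messages))
```

Because no labels are involved, the loader only has to assign stable ids and pass the text through, which is why a corpus without a clear downstream task can still be exposed this way.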

@akhdanfadh (Collaborator) commented Apr 1, 2024

@reynardryanda may we know if you are still working on this issue? It has already been one month since your last update.

@holylovenia (Contributor)

> @reynardryanda may we know if you are still working on this issue? It has already been one month since your last update.

I removed @reynardryanda's assignment due to the lack of response. Anyone can take this dataloader now.

@akhdanfadh (Collaborator)

#self-assign

@akhdanfadh linked a pull request on Apr 1, 2024 that will close this issue
@akhdanfadh added the pr-ready label (A PR that closes this issue is ready to be reviewed) on Apr 2, 2024
5 participants