Create dataset loader for Vietnamese Hate and Offensive Spans Detection (ViHOS) #218

Open
SamuelCahyawijaya opened this issue Dec 26, 2023 · 5 comments

@SamuelCahyawijaya
Collaborator

Dataloader name: vihos/vihos.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?vihos

Dataset: vihos
Description: This dataset consists of human-annotated hateful and offensive spans in Vietnamese Facebook and YouTube comments. Each comment has a corresponding list of indices indicating the characters included in these hate and offensive spans. Individual words and syllables are also tagged as inside or outside spans using the Inside-Outside-Beginning (IOB) tagging representation.
Subsets: -
Languages: vie
Tasks: Hate Speech Detection
License: MIT (mit)
Homepage: https://github.com/phusroyal/ViHOS
HF URL: -
Paper URL: https://aclanthology.org/2023.eacl-main.47
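
To make the description concrete, here is a minimal sketch of how a list of character indices could be grouped into contiguous spans. The function name `indices_to_spans`, the sample comment, and the index list are all hypothetical and are not taken from the ViHOS files; the actual column names and storage format should be checked against the dataset itself.

```python
from typing import List, Tuple


def indices_to_spans(indices: List[int]) -> List[Tuple[int, int]]:
    """Group character indices into contiguous [start, end) spans."""
    spans: List[Tuple[int, int]] = []
    for idx in sorted(set(indices)):
        if spans and idx == spans[-1][1]:
            # idx directly follows the last span, so extend it by one character.
            spans[-1] = (spans[-1][0], idx + 1)
        else:
            # Otherwise start a new one-character span.
            spans.append((idx, idx + 1))
    return spans


# Hypothetical example (not real ViHOS data).
comment = "abc bad word xyz"
indices = [4, 5, 6, 8, 9, 10, 11]

spans = indices_to_spans(indices)
texts = [comment[start:end] for start, end in spans]
print(spans)  # [(4, 7), (8, 12)]
print(texts)  # ['bad', 'word']
```

Half-open `(start, end)` pairs are used here so that each span can be sliced directly out of the comment string.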
@SamuelCahyawijaya converted this from a draft issue Dec 26, 2023
@elyanah-aco
Collaborator

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@elyanah-aco
Collaborator

@holylovenia @SamuelCahyawijaya @sabilmakbar

I was planning to implement the ABUSIVE_LANGUAGE_PREDICTION task here.

Would like your thoughts on whether the dataset can also support the SPAN_BASED_ABSA task. Spans are BIO-tagged, but all spans would be labelled "offensive".


github-actions bot commented Feb 4, 2024

Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@holylovenia
Contributor

> Spans are BIO-tagged

Hi @elyanah-aco, sorry I missed your question. Doesn't the sequence labeling version of the data use B- and I- for offensive words and O for others? This is my assumption based on a quick look at the data.
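
For reference, the tagging scheme discussed in this thread can be sketched roughly as below. This is only an illustration under stated assumptions: tokens are split on whitespace, spans are half-open character ranges, and the single label name `OFFENSIVE` is hypothetical. The actual tag names and tokenization in the ViHOS sequence labeling files may differ.

```python
import re
from typing import List, Optional, Tuple


def bio_tags(
    text: str,
    spans: List[Tuple[int, int]],
    label: str = "OFFENSIVE",  # hypothetical label name, not confirmed by the dataset
) -> List[Tuple[str, str]]:
    """Tag whitespace-separated tokens with B-/I-/O given [start, end) character spans."""
    tagged: List[Tuple[str, str]] = []
    prev_span: Optional[int] = None
    for match in re.finditer(r"\S+", text):
        # Index of the first span that overlaps this token, if any.
        hit = next(
            (i for i, (s, e) in enumerate(spans) if match.start() < e and match.end() > s),
            None,
        )
        if hit is None:
            tag = "O"
        elif hit == prev_span:
            tag = f"I-{label}"  # continues the span of the previous token
        else:
            tag = f"B-{label}"  # first token of a new span
        prev_span = hit
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical example (not real ViHOS data).
for token, tag in bio_tags("abc bad word xyz", [(4, 12)]):
    print(token, tag)
```

With only one label type, the B-/I- prefixes carry the span boundaries and nothing else, which is the point raised above about every span being labelled "offensive".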
