Create dataset loader for ProSub #683

SamuelCahyawijaya · 2024-05-27T04:53:56Z

Dataset	prosub
Description	ProSub is a collection of datasets and corpus annotations dealing with pronoun substitutes and related linguistic categories (personal pronouns, honorific titles, address terms). Pronoun substitutes are non-pronominal expressions (e.g. 'mother', 'aunt', 'teacher') used to refer to the speaker and the addressee, thus functioning like 1st and 2nd person personal pronouns. They are very common in the languages of Southeast Asia, Japan and Korea, but extremely limited elsewhere. The Common subset is based on a common questionnaire. It records whether a given concept (e.g. 'child') can be used as a 1st person expression, a 2nd person expression, a title, or an address term. Where a use exists, example sentences are also given. The Annotations subset contains annotations of 1st and 2nd person expressions, including both personal pronouns and pronoun substitutes, and of address terms. The corpora used differ from language to language, but the annotation scheme is the same across languages.
Subsets	Common, Annotations
Languages	zsm, ind, jav, tha, vie, mya
Tasks	Word Sense Disambiguation, Word lists, Semantic Role Labeling, Machine Translation
License	Creative Commons Attribution 4.0 (cc-by-4.0)
Homepage	https://github.com/matbahasa/ProSub
HF URL	-
Paper URL	https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P9-4.pdf
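
As a starting point for the loader, the Common subset described above could be modeled roughly as follows. This is only a sketch under assumptions: the class name `CommonEntry`, its field names, and the four usage keys are hypothetical and are not taken from the ProSub repository, whose actual file layout has to be checked before writing the loader. The sample values in the usage note are placeholders, not dataset content.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four uses named in the Common subset description (key names are assumed).
USES = ("first_person", "second_person", "title", "address_term")
# Language codes listed in this issue.
LANGUAGES = ("zsm", "ind", "jav", "tha", "vie", "mya")


@dataclass
class CommonEntry:
    """One questionnaire answer: a concept in one language (hypothetical schema)."""

    language: str  # one of LANGUAGES
    concept: str  # e.g. "child"
    attested: Dict[str, bool] = field(default_factory=dict)  # use -> exists?
    examples: Dict[str, List[str]] = field(default_factory=dict)  # use -> sentences

    def attested_uses(self) -> List[str]:
        """Return the uses marked as existing, in the order given by USES."""
        return [use for use in USES if self.attested.get(use, False)]

    def validate(self) -> None:
        """Raise ValueError if the entry contradicts the description."""
        if self.language not in LANGUAGES:
            raise ValueError(f"unexpected language code: {self.language}")
        unknown = (set(self.attested) | set(self.examples)) - set(USES)
        if unknown:
            raise ValueError(f"unknown use(s): {sorted(unknown)}")
        # Example sentences are only given for uses that exist.
        for use, sentences in self.examples.items():
            if sentences and not self.attested.get(use, False):
                raise ValueError(f"examples given for unattested use: {use}")
```

For instance, `CommonEntry("ind", "child", {"first_person": True}, {"first_person": ["<example sentence>"]})` passes `validate()`, while an entry that carries example sentences for a use marked as not existing is rejected.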


SamuelCahyawijaya added this to SEACrowd Data Hub May 27, 2024

SamuelCahyawijaya converted this from a draft issue May 27, 2024

SamuelCahyawijaya added the out-of-hackathon label May 27, 2024
