Welcome to NusaCrowd!

Read this README in Bahasa Indonesia.

Indonesian NLP is underrepresented in the research community, and one of the reasons is the lack of access to public datasets (Aji et al., 2022). To address this issue, we have initiated NusaCrowd, a joint collaboration to collect NLP datasets for Indonesian languages. Help us collect and centralize Indonesian NLP datasets, and become a co-author of our upcoming paper.

How to contribute?

You can contribute by proposing an NLP dataset that is not yet registered in our records. You can also propose datasets from your past work that have not been released to the public. Just fill out this form, and we will check and approve your entry.

We will award contribution points based on several factors, including dataset quality, language scarcity, and task scarcity.

You can submit multiple entries. Once your total contribution points exceed the threshold, we will include you as a co-author (proposing 1-2 datasets is generally enough). Read the full method of calculating points here.

Any other way to help?

Yes! Aside from collecting new datasets, we are also centralizing existing datasets in a single schema that makes it easier for researchers to use Indonesian NLP datasets. You can help us there by building dataset loaders. More details about that here.

FAQs

How can I find the appropriate license for my dataset?

The license for a dataset is not always obvious. Here are some strategies to try in your search:

check for files such as README or LICENSE that may be distributed with the dataset itself
check the dataset webpage
check publications that announce the release of the dataset
check the website of the organization providing the dataset

If no official license is listed anywhere, but you find a webpage that describes general data usage policies for the dataset, you can fall back to providing that URL in the _LICENSE variable. If you can't find any license information, please note this in your PR and put _LICENSE="Unknown" in your dataset script.

What if my dataset is not yet publicly available?

You can upload your dataset publicly first, e.g., on GitHub.

I am confused, can you help me?

Yes, you can ask for help in NusaCrowd's community channels! Please join our WhatsApp group or Slack server.

Thank you!

We greatly appreciate your help!

The artifacts of this hackathon will be described in a forthcoming academic paper targeting a machine learning or NLP audience. Please refer to this section for the rewards you can earn by contributing to Nusantara NLP. We recognize that some datasets require more effort than others, so please reach out if you have questions. Our goal is to be inclusive with credit!
