Home

IndoNLU

IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia.

~~For Wiki Bahasa Indonesia version please follow this [Link]~~

Dataset

12 Downstream Tasks

You can browse the task datasets at [Link]

We provide train, validation, and test sets; the test set is released with masked labels (no true labels). We are currently preparing a platform for automatic evaluation using CodaLab. Please stay tuned!

Indo4B

23GB Indo4B Pretraining Dataset [Link]

Model

IndoBERT models

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. It is pretrained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.
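To make the MLM objective concrete, here is a minimal sketch of how one masked training example can be built. It assumes the masking recipe from the original BERT paper (select roughly 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged). The exact settings used to pretrain IndoBERT are not stated here, and the function name `mask_tokens`, the toy vocabulary, and the example sentence are illustrative only.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Build one MLM example: return (inputs, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. Assumes BERT-style 80/10/10 masking.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens and unselected tokens are left untouched.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token as input
    return inputs, labels


# Toy sentence and vocabulary (hypothetical, for illustration only).
tokens = ["[CLS]", "aku", "adalah", "anak", "gembala", "[SEP]"]
toy_vocab = ["aku", "adalah", "anak", "gembala"]
inputs, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

During pretraining the loss is computed only at positions where `labels` is not `None`; NSP adds a separate binary label for whether the second sentence of a pair follows the first.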

All Pre-trained Models

Model	# Params	Arch.	Training Data	Link
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)	Link
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)	Link

How to use

Load model and tokenizer

from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p1")

Extract contextual representation

import torch
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
# model(x)[0] is the last hidden state, shape (1, seq_len, hidden_size)
print(x, model(x)[0].sum())
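For more than one sentence, a common pattern is to let the tokenizer pad the batch and then average the hidden states over real tokens only. The sketch below is one possible way to do that, not part of IndoNLU itself: the helper name `embed_sentences` and the choice of mean pooling are assumptions, and it expects the `tokenizer` and `model` objects loaded earlier in this section.

```python
def embed_sentences(sentences, tokenizer, model):
    """Return one mean-pooled vector per sentence, shape (batch, hidden_size).

    Hypothetical helper: mean pooling is one common choice, not the only one.
    """
    import torch  # imported here so the helper can be pasted on its own

    # Pad to the longest sentence; this also builds the attention mask.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    model.eval()  # disable dropout for deterministic features
    with torch.no_grad():  # no gradients are needed for feature extraction
        hidden = model(**batch)[0]  # last hidden state: (batch, seq_len, hidden_size)
    # Zero out padding positions, then average over the real tokens.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sentences(['aku adalah anak [MASK]'], tokenizer, model)` would return a tensor with one row per input sentence.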

Leaderboard

Community Portal and Public Leaderboard [Link]
Submission Portal https://competitions.codalab.org/competitions/26537

Submission Format

Please follow this [Link]

Quickstart

Prediction [Link]
Fine Tune [Link]
Reproduce Result [Link]

Contributing

Please follow this [Link]

Paper

IndoNLU has been accepted at AACL 2020; details are available at https://arxiv.org/abs/2009.05387. If you use any component of IndoNLU for research purposes, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

