louisowen6 / NLP_bahasa_resources Public

A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia

NLP Bahasa Indonesia Resources

This repository provides links to useful datasets and other resources for NLP in Bahasa Indonesia.

Last Update: 15 Mar 2022

Table of contents

Corpus
Dictionary
Articles and Papers
Pre-trained Models
Usable Library
Spelling Correction
Twitter Scraping
Other Resources

Corpus

Named Entity Recognition

Product NER. https://github.com/dziem/proner-labeled-text
NER-grit. https://github.com/grit-id/nergrit-corpus

POS-Tagging

IDN Tagged Corpus. https://github.com/famrashel/idn-tagged-corpus
Indonesian Part-of-Speech (POS) Tagging. https://github.com/kmkurn/id-pos-tagging/blob/master/data/dataset.tar.gz

Question and Answering

TydiQA. https://github.com/google-research-datasets/tydiqa

Paraphrasing

Quora Paraphrasing. https://github.com/louisowen6/quora_paraphrasing_id
Paraphrase Adversaries from Word Scrambling. https://github.com/Wikidepia/indonesian_datasets/tree/master/paraphrase/paws

Text Summarization

Indosum. https://github.com/kata-ai/indosum
Liputan6. https://huggingface.co/datasets/id_liputan6

Hate-speech

ID Multi Label Hate Speech. https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection

Word Analogy

KAWAT. https://github.com/kata-ai/kawat

Formal-Informal

Multilingual Parallel

Unsupervised Corpus

OSCAR. https://oscar-corpus.com/
Online Newspaper. https://github.com/feryandi/Dataset-Artikel
IndoNLU. https://huggingface.co/datasets/indonlu
IndoNLG. https://github.com/indobenchmark/indonlg
IndoNLI. https://github.com/ir-nlp-csui/indonli
IndoBERTweet. https://github.com/indolem/IndoBERTweet
http://data.statmt.org/cc-100/
https://huggingface.co/datasets/id_clickbait
https://huggingface.co/datasets/id_newspapers_2018
https://opus.nlpl.eu/QED.php

Voice-Text

Puisi and Pantun

https://github.com/ilhamfp/puisi-pantun-generator

Dictionary

Synonym

https://github.com/victoriasovereigne/tesaurus

Sentiment

Position or Degree

Root Words

I have combined the root words from all of the above repositories into a single list (combined_root_words.txt).

Slang Words

I have combined the slang words from all of the above repositories into a single dictionary (combined_slang_words.txt).
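
As a rough illustration of how a slang dictionary like this might be applied, the sketch below replaces slang tokens with their formal forms. It assumes combined_slang_words.txt is a JSON object mapping each slang word to its formal equivalent (an assumption about the file layout, so check the file first). `load_slang_map` and `normalize_slang` are hypothetical helper names, and the small inline mapping stands in for the real dictionary.

```python
import json


def load_slang_map(path):
    """Load a slang-to-formal mapping, assuming the file is a JSON object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def normalize_slang(text, slang_map):
    """Lowercase the text and replace each whitespace-separated token
    that has an entry in slang_map; other tokens are kept unchanged."""
    return " ".join(slang_map.get(tok, tok) for tok in text.lower().split())


# Inline sample standing in for load_slang_map("combined_slang_words.txt")
sample_map = {"gak": "tidak", "udah": "sudah", "bgt": "banget"}
normalized = normalize_slang("Aku gak tahu dia udah pulang", sample_map)
print(normalized)
```

Splitting on whitespace keeps the sketch short; text with punctuation attached to words would need a proper tokenizer before the lookup.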

Stop Words

I have combined the stop words from all of the above repositories into a single list (combined_stop_words.txt).
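
One possible way to use a stop words list is to filter tokens before further processing. The sketch below assumes combined_stop_words.txt holds one word per line (an assumption about the file layout, so check the file first). `load_word_list` and `remove_stop_words` are hypothetical helper names, and the inline sample set stands in for the real list.

```python
from pathlib import Path


def load_word_list(path):
    """Read a word list, assuming one entry per line (assumed file layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


# Inline sample standing in for load_word_list("combined_stop_words.txt")
sample_stop_words = {"yang", "di", "dan", "itu"}
filtered = remove_stop_words("Buku yang ada di meja itu baru", sample_stop_words)
print(filtered)
```

Storing the list as a set makes each membership check constant time, which matters once the real list and corpus are large.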

Swear Words

https://github.com/abhimantramb/elang/blob/master/word2vec/utils/swear-words.txt

Composite Words

https://github.com/panggi/pujangga/blob/master/resource/tokenizer/compositewords.txt

Number Words

https://github.com/panggi/pujangga/blob/master/resource/netagger/morphologicalfeature/number.txt

Calendar Words

https://github.com/onlyphantom/elang/blob/master/build/lib/elang/word2vec/utils/negative/calendar-words.txt

Emoticon

Acronym

Indonesia Region

Country

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/country.txt

Region

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/lpre.txt

Title of Name

https://github.com/panggi/pujangga/blob/master/resource/netagger/contextualfeature/ppre.txt

Gender by Name

https://github.com/seuriously/genderprediction/blob/master/namatraining.txt

Organization

https://github.com/panggi/pujangga/blob/master/resource/reference/opre.txt

Articles and Papers

POS-Tagging

https://medium.com/@puspitakaban/pos-tagging-bahasa-indonesia-dengan-flair-nlp-c12e45542860
Manually Tagged Indonesian Corpus [Paper] [GitHub]

Word Embedding

Topic Analysis

(Introduction to LSA & LDA). https://monkeylearn.com/blog/introduction-to-topic-modeling/
(Introduction to LDA w/ Code & Tips). https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
(Topic Modeling Methods Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(Original LDA Paper). http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
(LDA Python Library). https://pypi.org/project/lda/; https://radimrehurek.com/gensim/models/ldamodel.html; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
(Original CTM Paper). http://people.ee.duke.edu/~lcarin/Blei2005CTM.pdf
(CTM Python Library). https://pypi.org/project/tomotopy/; https://github.com/kzhai/PyCTM
(Gaussian LDA Paper). https://www.aclweb.org/anthology/P15-1077.pdf
(Gaussian LDA Library). https://github.com/rajarshd/Gaussian_LDA
(Temporal Topic Modeling Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(TOT: A Non-Markov Continuous-Time Model of Topical Trends Paper). https://people.cs.umass.edu/~mccallum/papers/tot-kdd06s.pdf
(TOT Library). https://github.com/ahmaurya/topics_over_time
(Example of LDA in Bahasa Project Code). https://github.com/kirralabs/text-clustering

Text Classification

Zero-shot Learning

(Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach) https://arxiv.org/pdf/1909.00161.pdf | https://github.com/yinwenpeng/BenchmarkingZeroShot
(Integrating Semantic Knowledge to Tackle Zero-shot Text Classification) https://arxiv.org/abs/1903.12626 | https://github.com/JingqingZ/KG4ZeroShotText
(Train Once, Test Anywhere: Zero-Shot Learning for Text Classification) https://arxiv.org/abs/1712.05972 | https://amitness.com/2020/05/zero-shot-text-classification/
(Zero-shot Text Classification With Generative Language Models) https://arxiv.org/abs/1912.10165 | https://amitness.com/2020/06/zero-shot-classification-via-generation/
(Zero-shot User Intent Detection via Capsule Neural Networks) https://arxiv.org/abs/1809.00385 | https://github.com/congyingxia/ZeroShotCapsule

Few-shot Learning

(Few-shot Text Classification with Distributional Signatures) https://arxiv.org/pdf/1908.06039.pdf | https://github.com/YujiaBao/Distributional-Signatures
(Few Shot Text Classification with a Human in the Loop) https://katbailey.github.io/talks/Few-shot%20text%20classification.pdf | https://github.com/katbailey/few-shot-text-classification
(Induction Networks for Few-Shot Text Classification) https://arxiv.org/pdf/1902.10482v2.pdf | https://github.com/zhongyuchen/few-shot-learning

Pre-trained Models

Indo-BERT. https://github.com/indobenchmark/indonlu & https://huggingface.co/indobenchmark/indobert-base-p1
Indo-BERTweet. https://github.com/indolem/IndoBERTweet & https://huggingface.co/indolem/indobertweet-base-uncased
Transformer-based Pre-trained Model in Bahasa. https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers
Generate word embeddings or sentence embeddings using a pre-trained multilingual BERT model. (https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=Zn0n2S-FWZih). P.S.: Just change the model to 'bert-base-multilingual-uncased'
https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset. [Paper]
https://github.com/Kyubyong/wordvectors
https://drive.google.com/uc?id=0B5YTktu2dOKKNUY1OWJORlZTcUU&export=download
https://github.com/deryrahman/word2vec-bahasa-indonesia
https://sites.google.com/site/rmyeid/projects/polyglot

Usable Library

Pujangga: Indonesian Natural Language Processing REST API. https://github.com/panggi/pujangga
Sastrawi Stemmer Bahasa Indonesia. https://github.com/sastrawi/sastrawi
NLP-ID. https://github.com/kumparan/nlp-id
MorphInd: Indonesian Morphological Analyzer. http://septinalarasati.com/morphind/
INDRA: Indonesian Resource Grammar. https://github.com/davidmoeljadi/INDRA
Typo Checker. https://github.com/mamat-rahmat/checker_id
Multilingual NLP Package. https://github.com/flairNLP/flair
spaCy [GitHub] [Tutorial]
https://github.com/yohanesgultom/nlp-experiments
https://github.com/yasirutomo/python-sentianalysis-id
https://github.com/riochr17/Analisis-Sentimen-ID
https://github.com/yusufsyaifudin/indonesia-ner

Spelling Correction

You can adapt this code to a Bahasa Indonesia corpus to perform spelling correction.
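
The referenced code is not reproduced here. As a minimal sketch of one common approach (not necessarily the one the link uses), the example below builds word frequencies from a corpus and corrects an unknown word to the most frequent known word that is one edit away. The function names and the toy corpus are hypothetical; in practice the counts would come from a large Bahasa Indonesia corpus.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def build_counts(corpus_text):
    """Count word frequencies in a plain-text corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away.
    Falls back to the input when no candidate is found."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# Toy corpus for illustration only
toy_corpus = "saya makan nasi. saya minum air. kamu makan roti. saya makan ikan."
counts = build_counts(toy_corpus)
print(correct("mkan", counts))
print(correct("minun", counts))
```

With this toy corpus, "mkan" has two known neighbours ("makan" and "ikan"), and the more frequent one wins, which is why corpus size and quality largely determine the quality of the corrections.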

Twitter Scraping

GetOldTweets3. https://github.com/Mottl/GetOldTweets3

Usage:

import GetOldTweets3 as got

# Build the search criteria: hashtag, date range, location, and language
tweetCriteria = (
	got.manager.TweetCriteria()
	.setQuerySearch('#CoronaVirusIndonesia')
	.setSince("2020-01-01")
	.setUntil("2020-03-05")
	.setNear("Jakarta, Indonesia")
	.setLang("id")
)

# Fetch the matching tweets and print the attributes of each one
tweets = got.manager.TweetManager.getTweets(tweetCriteria)
for tweet in tweets:
	print(tweet.username)
	print(tweet.text)
	print(tweet.date)
	print(tweet.to)
	print(tweet.retweets)
	print(tweet.favorites)
	print(tweet.mentions)
	print(tweet.hashtags)
	print(tweet.geo)

Tweepy. http://docs.tweepy.org/en/latest/

Step-by-step how to use Tweepy. https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1

Sign in to Twitter Developer. https://developer.twitter.com/en

Full List of Tweets Object. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

Increasing Tweepy’s standard API search limit. https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./

Other Resources