diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index b3416e5..8ee5aab 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -15,6 +15,6 @@ jobs:
         with:
           python-version: 3.x
       - name: Install dependencies
-        run: pip install mkdocs-material pillow cairosvg
+        run: pip install mkdocs-material pillow cairosvg mkdocs-embed-external-markdown
       - name: Deploy docs
         run: mkdocs gh-deploy --force
diff --git a/docs/projects/indonesian-sentence-embeddings.md b/docs/projects/indonesian-sentence-embeddings.md
deleted file mode 100644
index a1e2d55..0000000
--- a/docs/projects/indonesian-sentence-embeddings.md
+++ /dev/null
@@ -1,283 +0,0 @@
-# Indonesian Sentence Embeddings
-
-Inspired by [Thai Sentence Vector Benchmark](https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark), we decided to embark on the journey of training Indonesian sentence embedding models!
-
-[image: logo]
-
-## Evaluation
-
-### Semantic Textual Similarity
-
-We believe that a synthetic baseline is better than no baseline. Therefore, we followed approached done in the Thai Sentence Vector Benchmark project and translated the [STS-B](https://github.com/facebookresearch/SentEval) dev and test set to Indonesian via Google Translate API. This dataset will be used to evaluate our model's Spearman correlation score on the translated test set.
-
-> You can find the translated dataset on [🤗 HuggingFace Hub](https://huggingface.co/datasets/LazarusNLP/stsb_mt_id).
-
-### Retrieval
-
-To evaluate our models' capability to perform retrieval tasks, we evaluate them on Indonesian subsets of MIRACL and TyDiQA datasets. In both datasets, the model's ability to retrieve relevant documents given a query is tested. We employ R@1 (top-1 accuracy), MRR@10, and nDCG@10 metrics to measure our model's performance.
-
-### Classification
-
-For text classification, we will be doing emotion classification and sentiment analysis on the EmoT and SmSA subsets of [IndoNLU](https://huggingface.co/datasets/indonlp/indonlu), respectively. To do so, we will be doing the same approach as Thai Sentence Vector Benchmark and simply fit a Linear SVC on sentence representations of our texts with their corresponding labels. Thus, unlike conventional fine-tuning method where the backbone model is also updated, the Sentence Transformer stays frozen in our case; with only the classification head being trained.
-
-Further, we will evaluate our models using the official [MTEB](https://github.com/embeddings-benchmark/mteb.git) code that contains two Indonesian classification subtasks: `MassiveIntentClassification (id)` and `MassiveScenarioClassification (id)`.
-
-### Pair Classification
-
-We followed [MTEB](https://github.com/embeddings-benchmark/mteb.git)'s PairClassification evaluation procedure for pair classification. Specifically for zero-shot natural language inference tasks, all neutral pairs are dropped, while contradictions and entailments are re-mapped as `0`s and `1`s. The maximum average precision (AP) score is found by finding the best threshold value.
-
-We leverage the [IndoNLI](https://huggingface.co/datasets/indonli) dataset's two test subsets: `test_lay` and `test_expert`.
-
-## Methods
-
-### (Unsupervised) SimCSE
-
-We followed [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821) and trained a sentence embedding model in an unsupervised fashion. Unsupervised SimCSE allows us to leverage an unsupervised corpus -- which are plenty -- and with different dropout masks in the encoder, contrastively learn sentence representations. This is parallel with the situation that there is a lack of supervised Indonesian sentence similarity datasets, hence SimCSE is a natural first move into this field. We used the [Sentence Transformer implementation](https://www.sbert.net/examples/unsupervised_learning/README.html#simcse) of [SimCSE](https://github.com/princeton-nlp/SimCSE).
-
-### ConGen
-
-Like SimCSE, [ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation](https://github.com/KornWtp/ConGen) is another unsupervised technique to train a sentence embedding model. Since it is in-part a distillation method, ConGen relies on a teacher model which will then be distilled to a student model. The original paper proposes back-translation as the best data augmentation technique. However, due to the lack of resources, we implemented word deletion, which was found to be on-par with back-translation despite being trivial. We used the [official ConGen implementation](https://github.com/KornWtp/ConGen) which was written on top of the Sentence Transformers library.
-
-### SCT
-
-[SCT: An Efficient Self-Supervised Cross-View Training For Sentence Embedding](https://github.com/mrpeerat/SCT) is another unsupervised technique to train a sentence embedding model. It is very similar to ConGen in its knowledge distillation methodology, but also supports self-supervised training procedure without a teacher model. The original paper proposes back-translation as its data augmentation technique, but we implemented single-word deletion and found it to perform better than our backtranslated corpus. We used the [official SCT implementation](https://github.com/mrpeerat/SCT) which was written on top of the Sentence Transformers library.
-
-## Models
-
-| Model | #params | Base/Student Model | Teacher Model | Train Dataset | Supervised |
-| --------------------------------------------------------------------------------------------------------------------------- | :-----: | --------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------ | :--------: |
-| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 125M | [IndoRoBERTa Base](https://huggingface.co/flax-community/indonesian-roberta-base) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 12M | [IndoBERT Lite Base](https://huggingface.co/indobenchmark/indobert-lite-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 125M | [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | [Wikipedia](https://huggingface.co/datasets/LazarusNLP/wikipedia_id_20230520) | |
-| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 125M | [IndoBERT Base](https://huggingface.co/indobenchmark/indobert-base-p1) | N/A | [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) | ✅ |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 134M | [DistilBERT Base Multilingual](https://huggingface.co/distilbert-base-multilingual-cased) | mUSE | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 125M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | See: [SBERT](https://www.sbert.net/docs/pretrained_models.html#model-overview) | ✅ |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 118M | [Multilingual-MiniLM-L12-H384](https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-small) | ✅ |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 278M | [XLM-RoBERTa Base](https://huggingface.co/xlm-roberta-base) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-base) | ✅ |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 560M | [XLM-RoBERTa Large](https://huggingface.co/xlm-roberta-large) | See: [arXiv](https://arxiv.org/abs/2212.03533) | See: [🤗](https://huggingface.co/intfloat/multilingual-e5-large) | ✅ |
-
-## Results
-
-### Semantic Textual Similarity
-
-#### Machine Translated Indonesian STS-B
-
-| Model | Spearman's Correlation (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :--------------------------: |
-| [SimCSE-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/simcse-indobert-lite-base) | 44.08 |
-| [SimCSE-IndoRoBERTa Base](https://huggingface.co/LazarusNLP/simcse-indoroberta-base) | 61.26 |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 70.13 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 79.97 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 80.47 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 81.16 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 74.56 |
-| [S-IndoBERT Base mMARCO](https://huggingface.co/LazarusNLP/s-indobert-base-mmarco) | 72.95 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 75.08 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | **83.83** |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 78.89 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 79.72 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 79.44 |
-
-### Retrieval
-
-#### MIRACL
-
-| Model | R@1 (%) ↑ | MRR@10 (%) ↑ | nDCG@10 (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :-------: | :----------: | :-----------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 36.04 | 48.25 | 39.70 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 46.04 | 59.06 | 51.01 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 45.93 | 58.58 | 49.95 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 45.83 | 58.27 | 49.91 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 40.41 | 47.29 | 40.68 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 41.35 | 54.93 | 48.79 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 52.81 | 65.07 | 57.97 |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 68.33 | 78.85 | 73.84 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 68.95 | 78.92 | 74.58 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | **69.89** | **80.09** | **75.64** |
-
-#### TyDiQA
-
-| Model | R@1 (%) ↑ | MRR@10 (%) ↑ | nDCG@10 (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :-------: | :----------: | :-----------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 61.94 | 69.89 | 73.52 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 75.22 | 81.55 | 84.13 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 73.09 | 80.32 | 83.29 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 72.38 | 79.37 | 82.51 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 76.81 | 83.16 | 85.87 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 70.44 | 77.94 | 81.56 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 81.41 | 87.05 | 89.44 |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 90.97 | 94.14 | 95.25 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 91.85 | 94.88 | 95.82 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | **94.15** | **96.36** | **97.14** |
-
-### Classification
-
-#### MTEB - Massive Intent Classification `(id)`
-
-| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :------------: | :------------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 59.71 | 57.70 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 62.41 | 60.94 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 61.14 | 60.02 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 60.93 | 59.50 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 55.66 | 54.48 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 55.99 | 52.44 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 65.43 | 63.55 |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 64.16 | 61.33 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 66.63 | 63.88 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | **70.04** | **67.66** |
-
-#### MTEB - Massive Scenario Classification `(id)`
-
-| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :------------: | :------------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 66.14 | 65.56 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 67.25 | 66.53 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 67.72 | 67.32 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 67.12 | 66.64 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 61.89 | 60.97 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 65.25 | 63.45 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 70.72 | 70.58 |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 67.92 | 67.23 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 70.70 | 70.26 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | **74.11** | **73.82** |
-
-#### IndoNLU - Emotion Classification (EmoT)
-
-| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :------------: | :------------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 55.45 | 55.78 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 58.18 | 58.84 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 57.04 | 57.06 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 59.54 | 60.37 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 61.13 | 61.70 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 63.63 | 64.13 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 63.18 | 63.78 |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 64.54 | 65.04 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 68.63 | 69.07 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | **74.77** | **74.66** |
-
-#### IndoNLU - Sentiment Analysis (SmSA)
-
-| Model | Accuracy (%) ↑ | F1 Macro (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :------------: | :------------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 85.6 | 81.50 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 81.2 | 75.59 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 85.4 | 82.12 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 83.0 | 78.74 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 82.0 | 76.92 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 78.8 | 73.64 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 89.6 | **86.56** |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 83.6 | 79.51 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 89.4 | 86.22 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | **90.0** | 86.50 |
-
-### Pair Classification
-
-#### IndoNLI
-
-| Model | `test_lay` AP (%) ↑ | `test_expert` AP (%) ↑ |
-| --------------------------------------------------------------------------------------------------------------------------- | :-----------------: | :--------------------: |
-| [SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/simcse-indobert-base) | 56.06 | 50.72 |
-| [ConGen-IndoBERT Lite Base](https://huggingface.co/LazarusNLP/congen-indobert-lite-base) | 69.44 | 53.74 |
-| [ConGen-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-indobert-base) | 71.14 | 56.35 |
-| [ConGen-SimCSE-IndoBERT Base](https://huggingface.co/LazarusNLP/congen-simcse-indobert-base) | 70.80 | 56.59 |
-| [SCT-IndoBERT Base](https://huggingface.co/LazarusNLP/sct-indobert-base) | 59.82 | 53.41 |
-| [distiluse-base-multilingual-cased-v2](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v2) | 58.48 | 50.50 |
-| [paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | **74.87** | **57.96** |
-| [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) | 63.97 | 51.85 |
-| [multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 60.25 | 50.91 |
-| [multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 61.39 | 51.62 |
-
-## References
-
-```bibtex
-@misc{Thai-Sentence-Vector-Benchmark-2022,
-  author = {Limkonchotiwat, Peerat},
-  title = {Thai-Sentence-Vector-Benchmark},
-  year = {2022},
-  publisher = {GitHub},
-  journal = {GitHub repository},
-  howpublished = {\url{https://github.com/mrpeerat/Thai-Sentence-Vector-Benchmark}}
-}
-```
-
-```bibtex
-@inproceedings{reimers-2019-sentence-bert,
-  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-  author = "Reimers, Nils and Gurevych, Iryna",
-  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-  month = "11",
-  year = "2019",
-  publisher = "Association for Computational Linguistics",
-  url = "https://arxiv.org/abs/1908.10084",
-}
-```
-
-```bibtex
-@inproceedings{gao2021simcse,
-  title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
-  author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
-  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
-  year={2021}
-}
-```
-
-```bibtex
-@inproceedings{limkonchotiwat-etal-2022-congen,
-  title = "{ConGen}: Unsupervised Control and Generalization Distillation For Sentence Representation",
-  author = "Limkonchotiwat, Peerat and
-    Ponwitayarat, Wuttikorn and
-    Lowphansirikul, Lalita and
-    Udomcharoenchaikit, Can and
-    Chuangsuwanich, Ekapol and
-    Nutanong, Sarana",
-  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
-  year = "2022",
-  publisher = "Association for Computational Linguistics",
-}
-```
-
-```bibtex
-@article{10.1162/tacl_a_00620,
-  author = {Limkonchotiwat, Peerat and Ponwitayarat, Wuttikorn and Lowphansirikul, Lalita and Udomcharoenchaikit, Can and Chuangsuwanich, Ekapol and Nutanong, Sarana},
-  title = "{An Efficient Self-Supervised Cross-View Training For Sentence Embedding}",
-  journal = {Transactions of the Association for Computational Linguistics},
-  volume = {11},
-  pages = {1572-1587},
-  year = {2023},
-  month = {12},
-  issn = {2307-387X},
-  doi = {10.1162/tacl_a_00620},
-  url = {https://doi.org/10.1162/tacl\_a\_00620},
-  eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00620/2196817/tacl\_a\_00620.pdf},
-}
-```
-
-## Credits
-
-Indonesian Sentence Embeddings is developed with love by:
-
-[images: GitHub Profile ×4]
\ No newline at end of file
diff --git a/docs/projects/machine-translation.md b/docs/projects/machine-translation.md
index 56b55b4..41044d3 100644
--- a/docs/projects/machine-translation.md
+++ b/docs/projects/machine-translation.md
@@ -1,105 +1,4 @@
-# Machine Translation
+- [:material-github: GitHub Repository](https://github.com/LazarusNLP/machine-translation)
+- [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/indot5-6541fbdfa385933e811c2e1f)
+
-## Indo-mT5
-
-Indo-mT5 is mT5 fine-tuned for machine translation of regional languages of Indonesia. We release our dataset creation scripts, training code, and fine-tuned models for other to leverage.
-
-There are two types of models:
-
-- **Multilingual**: Many-to-many, multilingual translation model.
-- **Bilingual**: Unidirectional, bilingual translation model.
-
-We also further experiment with two settings:
-
-- **Baseline**: Model trained on 7 languages (`ace`, `ban`, `bug`, `ind`, `jav`, `min`, `sun`).
-- **All**: Model trained on 45 languages as listed [here](languages.md).
-
-## Training
-
-Our experiments are conducted in these steps:
-
-- **Multilingual Training on Bible**: We first fine-tuned mT5 on multilingual translation on parallel Bible dataset, creating Indo-mT5.
-- **Multilingual Training on NusaX**: We take Indo-mT5 and fine-tune them on multilingual pairs of the NusaX dataset.
-- **Bilingual Training on NusaX**: We take Indo-mT5 and fine-tune them on bilingual pairs of the NusaX dataset.
-
-Therefore, we have six training scripts:
-
-| Dataset | Config | Type | Training Script | Evaluation Script |
-| ------- | -------- | ------------ | ---------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
-| Bible | Baseline | Multilingual | [train_bible_baseline.sh](train_bible_baseline.sh) | [eval_bible_baseline.sh](eval_bible_baseline.sh) |
-| Bible | All (v2) | Multilingual | [train_bible_all.sh](train_bible_all.sh) | [eval_bible_all.sh](eval_bible_all.sh) |
-| NusaX | Baseline | Multilingual | [train_nusax_baseline_multilingual.sh](train_nusax_baseline_multilingual.sh) | [eval_nusax_baseline_multilingual.sh](eval_nusax_baseline_multilingual.sh) |
-| NusaX | All (v2) | Multilingual | [train_nusax_all_multilingual.sh](train_nusax_all_multilingual.sh) | [eval_nusax_all_multilingual.sh](eval_nusax_all_multilingual.sh) |
-| NusaX | Baseline | Bilingual | [train_nusax_baseline_bilingual.sh](train_nusax_baseline_bilingual.sh) | [eval_nusax_baseline_bilingual.sh](eval_nusax_baseline_bilingual.sh) |
-| NusaX | All (v2) | Bilingual | [train_nusax_all_bilingual.sh](train_nusax_all_bilingual.sh) | [eval_nusax_all_bilingual.sh](eval_nusax_all_bilingual.sh) |
-
-## Results
-
-We evaluated our models on NusaX (Winata et al., 2022) and compared them to existing models.
-
-### `ind -> x`
-
-| Model | #params | `ace` | `ban` | `bbc` | `bjn` | `bug` | `jav` | `mad` | `min` | `nij` | `sun` | avg |
-| ------------------------------------ | ------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
-| IndoGPT (Winata et al., 2022) | 117M | 9.60 | 14.17 | 8.20 | 22.23 | 5.18 | 24.05 | 14.44 | 26.95 | 17.56 | 23.15 | 16.55 |
-| IndoBART v2 (Winata et al., 2022) | 132M | **19.21** | **27.08** | **18.41** | **40.03** | **11.06** | **39.97** | 28.95 | 48.48 | **27.11** | **38.46** | 29.88 |
-| mBART-50 Large (Winata et al., 2022) | 610M | 17.21 | 22.67 | 17.79 | 34.26 | 10.78 | 35.33 | 28.63 | 43.87 | 25.91 | 31.21 | 26.77 |
-| mT5 Base (Winata et al., 2022) | 580M | 14.79 | 18.07 | 18.22 | 38.64 | 6.68 | 33.48 | 0.96 | 45.84 | 13.59 | 33.79 | 22.41 |
-| NLLB-200 Distilled (zero-shot) | 600M | 2.74 | 4.87 | - | - | 1.66 | 17.66 | - | 9.79 | - | 11.92 | 8.11 |
-| Indo-mT5 NusaX Multilingual | 580M | 16.02 | 22.48 | - | - | 8.86 | 33.65 | - | 33.65 | - | 29.76 | 24.07 |
-| Indo-mT5 NusaX Bilingual | 580M | 17.99 | 27.03 | - | - | 10.80 | 39.63 | - | **51.56** | - | 35.16 | **30.36** |
-| Indo-mT5 v2 NusaX Multilingual | 580M | 14.28 | 19.19 | 14.86 | 28.39 | 8.05 | 28.70 | 20.95 | 32.70 | 22.30 | 26.19 | 21.56 |
-| Indo-mT5 v2 NusaX Bilingual | 580M | 17.58 | 24.24 | 16.69 | 38.81 | 10.20 | 37.87 | **29.77** | 50.90 | 26.93 | 34.22 | 28.72 |
-
-### `x -> ind`
-
-| Model | #params | `ace` | `ban` | `bbc` | `bjn` | `bug` | `jav` | `mad` | `min` | `nij` | `sun` | avg |
-| ------------------------------------ | ------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
-| IndoGPT (Winata et al., 2022) | 117M | 7.01 | 13.23 | 5.27 | 19.53 | 1.98 | 27.31 | 13.75 | 23.03 | 10.83 | 23.18 | 14.51 |
-| IndoBART v2 (Winata et al., 2022) | 132M | 24.44 | 40.49 | 19.94 | **47.81** | 12.64 | **50.64** | 36.10 | 58.38 | **33.50** | **45.96** | 36.99 |
-| mBART-50 Large (Winata et al., 2022) | 610M | 18.45 | 34.23 | 17.43 | 41.73 | 10.87 | 39.66 | 32.11 | 59.66 | 29.84 | 35.19 | 31.92 |
-| mT5 Base (Winata et al., 2022) | 580M | 18.59 | 21.73 | 12.85 | 42.29 | 2.64 | 45.22 | 32.35 | 58.65 | 25.61 | 36.58 | 29.65 |
-| NLLB-200 Distilled (zero-shot) | 600M | 9.42 | 21.24 | - | - | 6.18 | 30.54 | - | 40.49 | - | 26.91 | 22.46 |
-| Indo-mT5 NusaX Multilingual | 580M | 23.94 | 35.30 | - | - | **16.68** | 29.76 | - | 48.10 | - | 36.54 | 31.72 |
-| Indo-mT5 NusaX Bilingual | 580M | **24.78** | **42.15** | - | - | 16.27 | 47.26 | - | **62.94** | - | 42.39 | **39.30** |
-| Indo-mT5 v2 NusaX Multilingual | 580M | 21.01 | 30.43 | 18.57 | 34.21 | 14.42 | 35.19 | 27.04 | 42.64 | 26.90 | 33.78 | 28.42 |
-| Indo-mT5 v2 NusaX Bilingual | 580M | 22.87 | 39.48 | **20.48** | 44.53 | 15.97 | 45.20 | **36.65** | 60.97 | 32.38 | 39.80 | 35.83 |
-
-## Parallel Bible Dataset Creation
-
-This will cover the creation process of our Bible machine-translation dataset.
-
-### Overview
-
-1. Scrape Bible Data
-2. Align Bible Verses
-3. Load as Machine-Translation Dataset
-
-### Bible Scraping
-
-```sh
-python utils/scrape_parallel.py \
-    --codes abun aceh ambdr aralle balantak bali bambam bauzi berik bugis dairi duri ende galela gorontalo iban jawa kaili_daa karo kupang lampung madura makasar mamasa manggarai mentawai meyah minang mongondow napu ngaju nias rote sabu sangir sasak simalungun sunda taa tabaru tb toba toraja uma yali yawa \
-    --outdir corpus \
-    -j 4
-```
-
-### Align Bible Verses
-
-```sh
-for LANGUAGE in abun aceh ambdr aralle balantak bali bambam bauzi berik bugis dairi duri ende galela gorontalo iban jawa kaili_daa karo kupang lampung madura makasar mamasa manggarai mentawai meyah minang mongondow napu ngaju nias rote sabu sangir sasak simalungun sunda taa tabaru tb toba toraja uma yali yawa
-do
-    python utils/align.py --path corpus/$LANGUAGE.json --outdir corpus_aligned
-done
-```
-
-You can read more about aligning Bible verses in [our blogpost](https://lazarusnlp.github.io/blogs/bible_alignment/).
-
-### Data Loading Script
-
-In the data loading script, we have to do these two steps:
-
-1. Split unique verse IDs into train/test/validation subsets.
-2. Generate permutations of every verse ID for every subset.
-
-You can find our data loading implementation in [src/alkitab-sabda-mt.py](src/alkitab-sabda-mt.py).
\ No newline at end of file
+{{ external_markdown('https://raw.githubusercontent.com/LazarusNLP/machine-translation/main/README.md', '') }}
\ No newline at end of file
diff --git a/docs/projects/sentence-embeddings.md b/docs/projects/sentence-embeddings.md
new file mode 100644
index 0000000..36310fb
--- /dev/null
+++ b/docs/projects/sentence-embeddings.md
@@ -0,0 +1,5 @@
+- [:material-github: GitHub Repository](https://github.com/LazarusNLP/indonesian-sentence-embeddings)
+- [:material-web: Documentation](https://lazarusnlp.github.io/indonesian-sentence-embeddings/)
+- [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/indonesian-sentence-embedding-6541fce662e82d932ff360c5)
+
+{{ external_markdown('https://raw.githubusercontent.com/LazarusNLP/indonesian-sentence-embeddings/main/README.md', '') }}
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index b39dd00..7e331ce 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -24,9 +24,22 @@ theme:
     - navigation.tabs
     - navigation.tabs.sticky
     # - navigation.sections
+    - navigation.top
+    # - toc.integrate
+
+nav:
+  - Home: index.md
+  - Blogs:
+    - Bible Alignment: blogs/bible_alignment.md
+    - Indonesian Accents and Regional Languages: blogs/accents_and_languages.md
+  - Projects:
+    - Sentence Embeddings: projects/sentence-embeddings.md
+    - Machine Translation: projects/machine-translation.md
 
 plugins:
   - social
+  - search
+  - external-markdown
 
 markdown_extensions:
   - attr_list
@@ -47,4 +60,5 @@ markdown_extensions:
       pygments_lang_class: true
   - pymdownx.inlinehilite
   - pymdownx.snippets
+  - pymdownx.details
   - pymdownx.superfences