Added NusaBERT

LazarusNLP · Mar 6, 2024 · fa7e609
1 parent 1aa139e
Showing 4 changed files with 36 additions and 8 deletions.
diff --git a/README.md b/README.md
@@ -10,24 +10,32 @@
 
 <table>
   <tr>
+    <td valign="top">
+      <h3>NusaBERT: Teaching IndoBERT to be multilingual and multicultural!</h3>
+      <p>This project aims to extend the multilingual and multicultural capability of <a href="https://github.com/IndoNLP/indonlu">IndoBERT</a>. We expanded the IndoBERT tokenizer on 12 new regional languages of Indonesia, and continued pre-training on a large-scale corpus consisting of the Indonesian language and 12 regional languages of Indonesia. Our models are highly competitive and robust on multilingual and multicultural benchmarks, such as <a href="https://github.com/IndoNLP/indonlu">IndoNLU</a>, <a href="https://github.com/IndoNLP/nusax">NusaX</a>, and <a href="https://github.com/IndoNLP/nusa-writes">NusaWrites</a>.</p>
+    </td>
     <td valign="top">
       <h3>IndoT5: T5 Language Models for the Indonesian Language</h3>
       <p>IndoT5 is a T5-based language model trained specifically for the Indonesian language. With just 8 hours of training on a limited budget, we developed a competitive sequence-to-sequence, encoder-decoder model capable of fine-tuning tasks such as summarization, chit-chat, and question-answering. Despite the limited training constraints, our model is competitive when evaluated on the <a href="https://github.com/IndoNLP/indonlg">IndoNLG</a> (text generation) benchmark.</p>
     </td>
+  </tr>
+  <tr>
     <td valign="top">
       <h3>Indonesian Sentence Embedding Models</h3>
       <p>We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic text similarity, and zero-shot text classification. We leverage existing pre-trained Indonesian language models like <a href="https://github.com/IndoNLP/indonlu">IndoBERT</a> and state-of-the-art unsupervised techniques and established sentence embedding benchmarks.</p>
     </td>
-  </tr>
-  <tr>
     <td valign="top">
       <h3>Indonesian Natural Language Inference Models</h3>
       <p>Open-source lightweight NLI models that are competitive with larger models on the IndoNLI benchmark, with significantly fewer parameters. We applied knowledge distillation methods to small existing pre-trained language models like IndoBERT Lite. These models offer efficient solutions for tasks requiring natural language inference capabilities, such as cross-encoder-based semantic search, while minimizing computational resources.</p>
     </td>
+  </tr>
+  <tr>
     <td valign="top">
       <h3>Many-to-Many Multilingual Translation Models</h3>
       <p>Adapting mT5 to 45 languages of Indonesia, we developed a robust baseline model for multilingual translation for languages of Indonesia. This facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the <a href="https://github.com/IndoNLP/nusax">NusaX</a> benchmark.</p>
     </td>
+    <td valign="top">
+    </td>
   </tr>
 </table>
 

diff --git a/docs/index.md b/docs/index.md
@@ -16,21 +16,27 @@ description: "Lazarus NLP is a collective initiative to revive the dying languag
 
 <table>
   <tr>
-    <td valign="top" width="50%">
+    <td valign="top">
+      <h3>NusaBERT: Teaching IndoBERT to be multilingual and multicultural!</h3>
+      <p>This project aims to extend the multilingual and multicultural capability of <a href="https://github.com/IndoNLP/indonlu">IndoBERT</a>. We expanded the IndoBERT tokenizer on 12 new regional languages of Indonesia, and continued pre-training on a large-scale corpus consisting of the Indonesian language and 12 regional languages of Indonesia. Our models are highly competitive and robust on multilingual and multicultural benchmarks, such as <a href="https://github.com/IndoNLP/indonlu">IndoNLU</a>, <a href="https://github.com/IndoNLP/nusax">NusaX</a>, and <a href="https://github.com/IndoNLP/nusa-writes">NusaWrites</a>.</p>
+    </td>
+    <td valign="top">
       <h3>IndoT5: T5 Language Models for the Indonesian Language</h3>
       <p>IndoT5 is a T5-based language model trained specifically for the Indonesian language. With just 8 hours of training on a limited budget, we developed a competitive sequence-to-sequence, encoder-decoder model capable of fine-tuning tasks such as summarization, chit-chat, and question-answering. Despite the limited training constraints, our model is competitive when evaluated on the <a href="https://github.com/IndoNLP/indonlg">IndoNLG</a> (text generation) benchmark.</p>
     </td>
-    <td valign="top" width="50%">
+  </tr>
+  <tr>
+    <td valign="top">
       <h3>Indonesian Sentence Embedding Models</h3>
       <p>We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic text similarity, and zero-shot text classification. We leverage existing pre-trained Indonesian language models like <a href="https://github.com/IndoNLP/indonlu">IndoBERT</a> and state-of-the-art unsupervised techniques and established sentence embedding benchmarks.</p>
     </td>
-  </tr>
-  <tr>
-    <td valign="top" width="50%">
+    <td valign="top">
       <h3>Indonesian Natural Language Inference Models</h3>
       <p>Open-source lightweight NLI models that are competitive with larger models on the IndoNLI benchmark, with significantly fewer parameters. We applied knowledge distillation methods to small existing pre-trained language models like IndoBERT Lite. These models offer efficient solutions for tasks requiring natural language inference capabilities, such as cross-encoder-based semantic search, while minimizing computational resources.</p>
     </td>
-    <td valign="top" width="50%">
+  </tr>
+  <tr>
+    <td valign="top">
       <h3>Many-to-Many Multilingual Translation Models</h3>
       <p>Adapting mT5 to 45 languages of Indonesia, we developed a robust baseline model for multilingual translation for languages of Indonesia. This facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the <a href="https://github.com/IndoNLP/nusax">NusaX</a> benchmark.</p>
     </td>

diff --git a/docs/projects/nusabert.md b/docs/projects/nusabert.md
@@ -0,0 +1,13 @@
+---
+title: "NusaBERT"
+description: "NusaBERT: Teaching IndoBERT to be multilingual and multicultural!"
+---
+
+<div class="grid cards" markdown>
+
+- [:material-github: GitHub Repository](https://github.com/LazarusNLP/NusaBERT)
+- [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/nusabert-65dc7abe183c499cc3588b58)
+
+</div>
+
+{{ external_markdown('https://raw.githubusercontent.com/LazarusNLP/NusaBERT/main/README.md', '') }}
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -36,6 +36,7 @@ nav:
       - Bible Alignment: blogs/bible_alignment.md
       - Indonesian Accents and Regional Languages: blogs/accents_and_languages.md
   - Projects:
+      - NusaBERT: projects/nusabert.md
       - Sentence Embeddings: projects/sentence-embeddings.md
       - Indonesian T5 Language Models: projects/t5-language-models.md
       - Machine Translation: projects/machine-translation.md
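
Note on the `{{ external_markdown(...) }}` call added in docs/projects/nusabert.md: the double-brace syntax suggests a template macro, most likely registered through the mkdocs-macros plugin, though this commit does not show its definition. The following is a minimal sketch of how such a macro might look, not the repository's actual code. The helper names `fetch_text` and `extract_section` are hypothetical, and the reading of the second argument as a heading name (with `''` meaning "include the whole file") is an assumption.

```python
import re
from urllib.request import urlopen


def fetch_text(url: str) -> str:
    """Download a remote file and decode it as UTF-8 (hypothetical helper)."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


def extract_section(markdown: str, section: str) -> str:
    """Return the whole document if `section` is empty, otherwise the text
    under the heading named `section`, up to the next heading of the same
    or a higher level (hypothetical helper; the semantics are assumed)."""
    if not section:
        return markdown
    collected = []
    level = None  # heading depth of the requested section, once found
    for line in markdown.splitlines():
        match = re.match(r"^(#+)\s+(.*?)\s*$", line)
        if level is None:
            # Still searching for the requested heading.
            if match and match.group(2) == section:
                level = len(match.group(1))
            continue
        # Stop at the next heading that is not nested deeper.
        if match and len(match.group(1)) <= level:
            break
        collected.append(line)
    return "\n".join(collected).strip()


def define_env(env):
    """Hook that mkdocs-macros calls to register custom macros."""

    @env.macro
    def external_markdown(url, section=""):
        # Assumed behavior: inline remote Markdown into the current page.
        return extract_section(fetch_text(url), section)
```

With a module like this in place, the page body stays in sync with the upstream README at build time instead of being copied into the docs repository. The sketch does not skip `#` lines inside fenced code blocks, which a production version would need to handle.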