From f10280b0ae6f999447604521aea9b236c7f25e98 Mon Sep 17 00:00:00 2001
From: w11wo
Date: Mon, 19 Feb 2024 16:54:55 +0700
Subject: [PATCH] Added Demo to Launch Blog

---
 docs/blogs/launch.md | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/docs/blogs/launch.md b/docs/blogs/launch.md
index 0e6e05b..5cbe9cb 100644
--- a/docs/blogs/launch.md
+++ b/docs/blogs/launch.md
@@ -9,6 +9,19 @@ Today we are launching LazarusNLP, an independent research group dedicated to le
 This blog aims to discuss the gaps in NLP research and development for Indonesian languages and introduce our initial projects. We are excited to share our work and invite the community to join us in our mission!
+You can try out our projects in the following web app demo:
+
+
+
+!!! info
+
+    This web app is available at our [🤗 HuggingFace Space](https://huggingface.co/spaces/LazarusNLP/LazarusNLP).
+
 ## Background
 Indonesia's linguistic landscape is rich and varied, with languages evolving independently across different regions. Despite the prevalence of Indonesian (*Bahasa Indonesia*) as the national language, many of these regional languages face the threat of extinction. UNESCO has identified 137 Indonesian languages as vulnerable or endangered, highlighting the urgent need for action[^1].
@@ -25,9 +38,13 @@ While advancements in NLP have benefited major languages like Indonesian, there
 IndoT5 is a T5-based language model trained specifically for the Indonesian language. With just 8 hours of training on a limited budget, we developed a competitive sequence-to-sequence, encoder-decode model capable of fine-tuning tasks such as summarization, chit-chat, and question-answering. Despite the limited training constraints, our model is competitive when evaluated on the [IndoNLG](https://github.com/IndoNLP/indonlg) (text generation) benchmark.
+
+
 - [:material-github: GitHub Repository](https://github.com/LazarusNLP/IndoT5/)
 - [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/indonesian-t5-language-models-65c1b9a0f6342b3eb3d6d450)
+
+
 ### Indonesian Sentence Embedding Models
@@ -36,23 +53,35 @@ IndoT5 is a T5-based lang
 We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!) semantic text similarity, and zero-shot text classification. We leverage existing pre-trained Indonesian language models like [IndoBERT](https://github.com/IndoNLP/indonlu) and state-of-the-art unsupervised techniques and established sentence embedding benchmarks.
+
+
 - [:material-github: GitHub Repository](https://github.com/LazarusNLP/indonesian-sentence-embeddings)
 - [:material-web: Documentation](https://lazarusnlp.github.io/indonesian-sentence-embeddings/)
 - [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/indonesian-sentence-embedding-6541fce662e82d932ff360c5)
+
+
 ### Indonesian Natural Language Inference (NLI) Models
 Open-source lightweight NLI models that are competitive with larger models on IndoNLI benchmark, with significantly less parameters. We applied knowledge distillation methods to small existing pre-trained language models like IndoBERT Lite. These models offer efficient solutions for tasks requiring natural language inference capabilities while minimizing computational resources such as cross-encoder-based semantic search.
+
+
 - [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/indonesian-natural-language-inference-65b9d95539ac63290a418d67)
+
+
 ### Many-to-Many Multilingual Translation Models
 Adapting mT5 to 45 languages of Indonesia, we developed a robust baseline model for multilingual translation for languages of Indonesia. This facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the [NusaX](https://github.com/IndoNLP/nusax) benchmark.
+
+
 - [:material-github: GitHub Repository](https://github.com/LazarusNLP/machine-translation)
 - [🤗 HuggingFace Collection](https://huggingface.co/collections/LazarusNLP/indot5-6541fbdfa385933e811c2e1f)
+
+
 ## Future Plans
 Our journey has just begun. Looking ahead, we are committed to expanding our repository of open-source pre-trained language models, with a focus on Indonesia's languages, multilinguality, culture, and code-switching. By democratizing access to NLP tools for all Indonesian languages, we aim to catalyze a renaissance in linguistic diversity.
@@ -65,6 +94,6 @@ We are always open to collaboration and welcome contributions from the community
 ---
-_Written by David Samuel Setiawan, Steven Limcorn, and Wilson Wongso. Last updated 13 February 2024._
+_Written by David Samuel Setiawan, Steven Limcorn, and Wilson Wongso. Last updated 19 February 2024._
 [^1]: Moseley, Christopher, ed. (2010). Atlas of the World’s Languages in Danger. Memory of Peoples (3rd ed.). Paris: UNESCO Publishing. ISBN 978-92-3-104096-2.