Play around with the many fine-tunes of NusaBERT that demonstrate the versatility of our base language model! It is extremely easy to adapt NusaBERT to your desired downstream task. Details about NusaBERT are available through the following links:

- HuggingFace Collections: https://huggingface.co/collections/LazarusNLP/nusabert-65dc7abe183c499cc3588b58
- arXiv Pre-print: https://arxiv.org/abs/2403.01817
About
This project aims to extend the multilingual and multicultural capability of IndoBERT (Wilie et al., 2020). We expanded the IndoBERT tokenizer to cover 12 regional languages of Indonesia, then continued pre-training on a large-scale corpus of Indonesian and those 12 regional languages. Our models are highly competitive and robust on multilingual and multicultural benchmarks such as IndoNLU, NusaX, and NusaWrites.
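As a starting point, adapting NusaBERT to a downstream task follows the standard Hugging Face Transformers workflow. The sketch below loads the base checkpoint with a fresh classification head; the checkpoint id `LazarusNLP/NusaBERT-base` and the 3-way label count (e.g. for a sentiment task like NusaX) are illustrative assumptions — substitute the checkpoint and label count for your task.

```python
# Minimal sketch: attach a classification head to NusaBERT for fine-tuning.
# The model id and label count below are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "LazarusNLP/NusaBERT-base"  # assumed checkpoint id
NUM_LABELS = 3  # e.g. a 3-way sentiment task such as NusaX


def load_for_classification(model_name: str = MODEL_NAME, num_labels: int = NUM_LABELS):
    """Load the tokenizer and a randomly initialized classification head
    on top of the pre-trained NusaBERT encoder."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_for_classification()
    inputs = tokenizer("Contoh kalimat bahasa Indonesia.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)  # one row of NUM_LABELS logits
```

The resulting model can then be fine-tuned as usual, for example with the Transformers `Trainer` API or a plain PyTorch training loop.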