Play around with the many fine-tunes of NusaBERT that demonstrate the versatility of our base language model! It is extremely easy to adapt NusaBERT to your desired downstream task. Details about NusaBERT are available through the following links:

- HuggingFace Collections: https://huggingface.co/collections/LazarusNLP/nusabert-65dc7abe183c499cc3588b58
- arXiv Pre-print: https://arxiv.org/abs/2403.01817
About
This project aims to extend the multilingual and multicultural capability of IndoBERT (Wilie et al., 2020). We expanded the IndoBERT tokenizer to cover 12 regional languages of Indonesia, then continued pre-training on a large-scale corpus of Indonesian and those 12 regional languages. Our models are highly competitive and robust on multilingual and multicultural benchmarks such as IndoNLU, NusaX, and NusaWrites.
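As a starting point, adapting NusaBERT to a downstream task follows the standard Hugging Face Transformers workflow. The sketch below loads the base checkpoint with a fresh classification head; the checkpoint id `LazarusNLP/NusaBERT-base` and the 3-way label count (e.g. for a sentiment task like NusaX) are illustrative assumptions — substitute the checkpoint and label count for your task.

```python
# Minimal sketch: attach a classification head to NusaBERT for fine-tuning.
# The model id and label count below are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "LazarusNLP/NusaBERT-base"  # assumed checkpoint id
NUM_LABELS = 3  # e.g. a 3-way sentiment task such as NusaX


def load_for_classification(model_name: str = MODEL_NAME, num_labels: int = NUM_LABELS):
    """Load the tokenizer and a randomly initialized classification head
    on top of the pre-trained NusaBERT encoder."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_for_classification()
    inputs = tokenizer("Contoh kalimat bahasa Indonesia.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.logits.shape)  # one row of NUM_LABELS logits
```

The resulting model can then be fine-tuned as usual, for example with the Transformers `Trainer` API or a plain PyTorch training loop.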