This repo provides a pretrained ALBERT model ("A Lite BERT") and a SentencePiece model (an unsupervised text tokenizer and detokenizer), both trained on a Mongolian text corpus.
You can use ALBERT-Mongolian in both PyTorch and TensorFlow 2.0 via the 🤗 transformers library.

Link to the HuggingFace model card 🤗
```python
import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

# Load the Mongolian ALBERT tokenizer and masked-LM model from the Hugging Face Hub.
tokenizer = AlbertTokenizer.from_pretrained('bayartsogt/albert-mongolian')
model = AlbertForMaskedLM.from_pretrained('bayartsogt/albert-mongolian')
```
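Below is a minimal sketch (not from the repo) of using the loaded tokenizer and model to fill in a masked token. The example sentence is illustrative only, and the `.logits` attribute assumes a recent transformers version.

```python
import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

tokenizer = AlbertTokenizer.from_pretrained("bayartsogt/albert-mongolian")
model = AlbertForMaskedLM.from_pretrained("bayartsogt/albert-mongolian")

# Illustrative sentence: "The capital of Mongolia is the city of [MASK]."
text = "Монгол улсын нийслэл [MASK] хот юм."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # assumes transformers >= 4.x output objects

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```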
- [Colab] Text classification using TPU on Colab: ALBERT_Mongolian_text_classification.ipynb (a minimal fine-tuning sketch follows this list)
- [Colab] Masked Language Modeling (MLM) on Colab: ALBERT_Mongolian_MLM.ipynb
- [Video] AWS-Mongolians e-meetup #3
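For a rough picture of what the text-classification notebook sets up, here is a minimal, hypothetical fine-tuning sketch using `AlbertForSequenceClassification`. The texts, labels, and hyperparameters are placeholders; the actual TPU-based setup lives in the Colab notebook above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AlbertTokenizer, AlbertForSequenceClassification

tokenizer = AlbertTokenizer.from_pretrained("bayartsogt/albert-mongolian")
model = AlbertForSequenceClassification.from_pretrained(
    "bayartsogt/albert-mongolian", num_labels=9  # 9 Eduge news categories
)

texts = ["...", "..."]   # placeholder news articles
labels = [0, 1]          # placeholder integer class ids

# Tokenize and wrap into a simple PyTorch dataset / loader.
enc = tokenizer(texts, truncation=True, padding=True, max_length=256, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for input_ids, attention_mask, y in loader:
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```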
| Model | Task | Dataset | Weighted F1 |
|---|---|---|---|
| ALBERT-base | Text Classification | Eduge dataset | 0.90 |
| ... | ... | ... | ... |
Note that while ALBERT-base is comparable to BERT-base in the results shown below, it is more than 10 times smaller on disk (about 135 MB vs. 1.2 GB).
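A quick way to sanity-check the size difference locally is to count the parameters of the downloaded checkpoint; this is only a sketch, and the exact on-disk size also depends on the serialization format.

```python
from transformers import AlbertForMaskedLM

model = AlbertForMaskedLM.from_pretrained("bayartsogt/albert-mongolian")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # float32 weights take roughly 4 bytes each
```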
- ALBERT-Mongolian:
```
              precision    recall  f1-score   support

байгал орчин       0.85      0.83      0.84       999
   боловсрол       0.80      0.80      0.80       873
       спорт       0.98      0.98      0.98      2736
   технологи       0.88      0.93      0.91      1102
     улс төр       0.92      0.85      0.89      2647
  урлаг соёл       0.93      0.94      0.94      1457
       хууль       0.89      0.87      0.88      1651
 эдийн засаг       0.83      0.88      0.86      2509
  эрүүл мэнд       0.89      0.92      0.90      1159

    accuracy                           0.90     15133
   macro avg       0.89      0.89      0.89     15133
weighted avg       0.90      0.90      0.90     15133
```
- BERT-Mongolian (from Mongolian Text Classification):
```
              precision    recall  f1-score   support

байгал орчин       0.82      0.84      0.83       999
   боловсрол       0.91      0.70      0.79       873
       спорт       0.97      0.98      0.97      2736
   технологи       0.91      0.85      0.88      1102
     улс төр       0.87      0.86      0.86      2647
  урлаг соёл       0.88      0.96      0.92      1457
       хууль       0.86      0.85      0.86      1651
 эдийн засаг       0.84      0.87      0.85      2509
  эрүүл мэнд       0.90      0.90      0.90      1159

    accuracy                           0.88     15133
   macro avg       0.88      0.87      0.87     15133
weighted avg       0.88      0.88      0.88     15133
```
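The two per-class tables above follow scikit-learn's `classification_report` layout. Here is a minimal sketch of how such a report could be produced; the predictions below are placeholders, and real ones would come from running the fine-tuned model on the Eduge test split.

```python
from sklearn.metrics import classification_report

label_names = [
    "байгал орчин", "боловсрол", "спорт", "технологи", "улс төр",
    "урлаг соёл", "хууль", "эдийн засаг", "эрүүл мэнд",
]
y_true = [0, 2, 5]   # placeholder gold labels
y_pred = [0, 2, 4]   # placeholder model predictions

print(classification_report(
    y_true, y_pred,
    labels=list(range(9)),
    target_names=label_names,
    zero_division=0,
))
```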
Pretrain from Scratch: you can follow the steps in PRETRAIN_SCRATCH.md to reproduce the results.
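Pretraining starts from a SentencePiece tokenizer trained on the raw corpus. The sketch below is a hypothetical minimal example; the input file, vocabulary size, and character coverage are placeholders, and the actual options are documented in PRETRAIN_SCRATCH.md.

```python
import sentencepiece as spm

# Hypothetical sketch: train a SentencePiece model on a raw-text Mongolian corpus.
spm.SentencePieceTrainer.train(
    input="mn_corpus.txt",        # one sentence per line, UTF-8 (placeholder path)
    model_prefix="mn_sp",         # writes mn_sp.model and mn_sp.vocab
    vocab_size=30000,             # placeholder; use the value from PRETRAIN_SCRATCH.md
    model_type="unigram",         # SentencePiece's default subword algorithm
    character_coverage=0.9995,    # placeholder coverage for Cyrillic text
)
```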
- ALBERT - official repo
- WikiExtractor
- Mongolian BERT
- ALBERT - Japanese
- Mongolian Text Classification
- You's paper
- AWS-Mongolia e-meetup #3
```bibtex
@misc{albert-mongolian,
  author       = {Bayartsogt Yadamsuren},
  title        = {ALBERT Pretrained Model on Mongolian Datasets},
  year         = {2020},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/bayartsogt-ya/albert-mongolian/}}
}
```