LOLA is a massively multilingual large language model trained on more than 160 languages using a sparse Mixture-of-Experts Transformer architecture. Evaluation results show competitive performance in natural language generation and understanding tasks. As an open-source model, LOLA promotes reproducibility and serves as a robust foundation for future research.
The final model weights, trained using the Megatron-DeepSpeed framework, are available at: https://files.dice-research.org/projects/LOLA/large/global_step296000/
Additional information about the model, along with its HuggingFace 🤗 implementation, can be found at: https://huggingface.co/dice-research/lola_v1
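For reference, loading the model with the Transformers library might look like the following minimal sketch (the model card linked above has the authoritative usage instructions; we assume here that the custom MoE architecture requires `trust_remote_code=True`):

```python
# Minimal sketch (not taken from this repository): loading LOLA via HuggingFace Transformers.
# trust_remote_code=True is assumed to be needed for the custom MoE architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dice-research/lola_v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```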
Note: This repository is a detached fork of https://github.com/microsoft/Megatron-DeepSpeed. It contains the training source code for LOLA, most of which resides in lola_ws/. Some implementations from the original source have been modified within this fork for our use case.
The original README.md can be found here: archive/README.md
This repository contains various utilities and implementations that can be used within the context of LOLA or adapted for other similar projects. Below is a list of key functionalities provided by this code repository:
You can find the scripts for fine-tuning the model in the lola_ws/fine-tune directory. We recommend the PEFT-based implementation, which fine-tunes the model on instructions in the Alpaca format using LoRA adapters on top of our model. These scripts are located in lola_ws/fine-tune/lora-peft and can be easily adapted for other similar (decoder-only) models or datasets; a rough sketch of the approach follows.
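The sketch below illustrates the general PEFT/LoRA pattern only; it is not the exact configuration used in lola_ws/fine-tune/lora-peft, and the hyperparameters and target module names are placeholders that must be matched to the model's actual layers.

```python
# Sketch: attaching LoRA adapters to a causal LM with PEFT.
# Hyperparameters and target_modules are illustrative placeholders, not the values
# used by the scripts in lola_ws/fine-tune/lora-peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "dice-research/lola_v1", trust_remote_code=True
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # must match the model's attention projections
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only the LoRA weights remain trainable
```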
To conduct your own analysis of the LOLA MoE routing, you can reuse the scripts in lola_ws/moe_analysis.
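As a generic starting point (this is not the moe_analysis code itself), router outputs can be inspected with PyTorch forward hooks; the module-name filter below is an assumption and must be adjusted to the actual gate/router module names in the loaded model.

```python
# Generic sketch: capturing gate/router outputs with forward hooks.
# The "gate" name filter is an assumption about the MoE layer naming.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dice-research/lola_v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

routing_records = {}

def make_hook(name):
    def hook(module, inputs, output):
        routing_records.setdefault(name, []).append(output)
    return hook

for name, module in model.named_modules():
    if "gate" in name.lower():             # adjust the filter to the real router modules
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    batch = tokenizer("Ein kurzer Beispielsatz.", return_tensors="pt")
    model(**batch)

print({name: len(outs) for name, outs in routing_records.items()})
```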
Note: Some scripts are configured for a specific SLURM-based computing cluster, such as noctua2. Feel free to modify them for your own use case.
You can pretrain a similar model from scratch or continue training the LOLA model using the script: lola_ws/gpt/run-gpt3-moe-pretrain.sh.
To prepare the CulturaX dataset for pretraining, refer to this README: lola_ws/README.md.
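Before running that preprocessing pipeline, the raw data can be pulled with the HuggingFace datasets library, e.g. as in the sketch below. The dataset path uonlp/CulturaX and the per-language configuration names are assumptions, and access may require accepting the dataset's terms on the Hub.

```python
# Sketch: streaming one language split of CulturaX from the HuggingFace Hub.
# "uonlp/CulturaX" and the "de" config name are assumptions; authentication or
# accepting the dataset's terms of use may be required first.
from datasets import load_dataset

culturax_de = load_dataset("uonlp/CulturaX", "de", split="train", streaming=True)

for i, record in enumerate(culturax_de):
    print(record["text"][:200])            # records are assumed to carry a "text" field
    if i >= 2:
        break
```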
If you plan to train your own model using frameworks like Megatron or Megatron-DeepSpeed, the scripts in lola_ws/ can be especially useful. For preprocessing large datasets, we included a distributed implementation inspired by Megatron-LM/issues/492. This approach significantly improves efficiency on computing clusters with ample CPU resources.
If you use this code or data in your research, please cite our work:
@misc{srivastava2024lolaopensourcemassively,
  title={LOLA -- An Open-Source Massively Multilingual Large Language Model},
  author={Nikit Srivastava and Denis Kuchelev and Tatiana Moteu Ngoli and Kshitij Shetty and Michael Röder and Hamada Zahera and Diego Moussallem and Axel-Cyrille Ngonga Ngomo},
  year={2024},
  eprint={2409.11272},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.11272},
}