Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments
Authors: Tuka Alhanai tuka@ghamut.com, Adam Kasumovic adam.kasumovic@ghamut.com, Mohammad Ghassemi ghassemi@ghamut.com, Aven Zitzelberger aven.zitzelberger@ghamut.com, Jessica Lundin jessica.lundin@gatesfoundation.org, Guillaume Chabot-Couture Guillaume.Chabot-Couture@gatesfoundation.org
This repository contains the benchmarks, results, and all the code required to reproduce the results, tables, and figures presented in our paper.
More specifically, this repository contains:
- Translated Winogrande Benchmarks: Human and machine translations of Winogrande into 8 African languages: Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, and Tsonga (as well as preexisting translations into Afrikaans, Zulu, and Xhosa).
- Translated MMLU-Clinical Benchmarks: Human and machine translations of the clinical sections of MMLU ("college medicine", "clinical knowledge", and "virology") into 8 African languages: Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, and Tsonga (as well as preexisting translations of the "clinical knowledge" and "college medicine" sections into Afrikaans, Zulu, and Xhosa; this release adds our translations of the "virology" section into those three languages as well).
- Human Annotation of Winogrande: Human annotations of the Winogrande dataset assessing translation quality and appropriateness in 11 African languages: Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, Tsonga, Afrikaans, Zulu, and Xhosa.
- Scripts to Reproduce Results: Code used to regenerate the results, tables, and figures presented in the paper.
Note that Meta's Belebele benchmark, in English as well as Shona, Igbo, Bambara, Amharic, Sepedi, Sesotho, Setswana, Tsonga, Afrikaans, Zulu, and Xhosa, is also included, since it served as an African-language evaluation benchmark in our experiments.
A PDF version of a slideshow presentation of our work is also provided in this repository.
Our translated datasets can also be accessed on HuggingFace at the following link: https://huggingface.co/datasets/Institute-Disease-Modeling/mmlu-winogrande-afr
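For example, the translated datasets can be pulled down with the Hugging Face datasets library. This is a minimal sketch, assuming `pip install datasets`; the configuration chosen below is illustrative, so consult the dataset card for the configurations you actually need:

```python
# Minimal sketch: load the translated benchmarks from HuggingFace.
# The configuration chosen below is illustrative; see the dataset card
# for the configurations you need.
from datasets import get_dataset_config_names, load_dataset

repo_id = "Institute-Disease-Modeling/mmlu-winogrande-afr"
configs = get_dataset_config_names(repo_id)  # list the available configurations
print(configs)

dataset = load_dataset(repo_id, configs[0])  # load one configuration as a demo
print(dataset)  # inspect the available splits and columns
```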
- data/: Contains data files used by the scripts in the scripts/ folder, including the translated evaluation benchmarks.
- scripts/: Contains Python scripts to generate the results, tables, and figures presented in the paper (where they can be re-created from the data provided). Instructions on how to run the scripts can be found inside the folder.
- results/: Contains the output from the Python scripts in scripts/, and mirrors the folder structure of scripts/.
- utils/: Contains utility functions used by the scripts.
Each of these folders contains an additional README with more detailed information.
The evaluation benchmarks (MMLU-Clinical [called mmlu_cm_ck_vir/], Winogrande [called winogrande_s/], and Belebele [called belebele/]) in English and 11 African languages are available as a ZIP file.
The ZIP file also includes machine translations of the benchmarks (see folders suffixed with _mt), as well as their backtranslations (see folders suffixed with _bt).
Within each benchmark directory, files are suffixed with ISO Language Codes (or no suffix for the original English) to denote the language the file is in (or was translated from for backtranslated versions). These language codes are used throughout the project to denote languages.
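As an illustration, a helper along these lines could build the path to a benchmark file for a given language. The directory layout, file-naming pattern, and .csv extension here are assumptions for illustration; adjust them to match the unzipped data:

```python
# Hypothetical helper for locating a benchmark file by ISO language code.
# The directory layout, file-naming pattern, and .csv extension are
# assumptions; adjust them to match the unzipped data.
from pathlib import Path

def benchmark_file(root: Path, benchmark: str, lang_code: str = "") -> Path:
    suffix = f"_{lang_code}" if lang_code else ""  # no suffix = original English
    return root / benchmark / f"{benchmark}{suffix}.csv"

print(benchmark_file(Path("data"), "winogrande_s", "am"))  # Amharic Winogrande
print(benchmark_file(Path("data"), "winogrande_s"))        # original English
```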
Note that if you are cloning this repository, be sure to run git lfs install first (use apt-get update; apt-get install git-lfs to install Git LFS if it isn't already installed). This is necessary to actually download the full data files needed to run many of the scripts. Cloning with git clone https://<your GitHub PAT>@github.com/ghamut/BMGF_AAAI_Paper.git should allow you to clone this private repository; see here for more information about cloning a private repository.
Next, set up a Python virtual environment (feel free to let your IDE do this automatically):
python3 -m venv venv # Set up virtual environment
source venv/bin/activate # Activate virtual environment
pip install --upgrade pip # Update pip
pip install -r requirements.txt # Install packages
conda install -c conda-forge firefox geckodriver # Run this command to get Bokeh working for creating one of the figures
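For context, firefox and geckodriver are needed because Bokeh exports static images by rendering the figure in a headless browser. The following is a minimal sketch of that kind of export, illustrative only and not a script from this repository:

```python
# Minimal sketch of the Bokeh export path that requires firefox/geckodriver:
# Bokeh renders the figure in a headless browser to produce the static file.
# Illustrative only; not a script from this repository.
from bokeh.io import export_svg
from bokeh.plotting import figure

p = figure(title="demo", output_backend="svg")  # SVG backend needed for export_svg
p.line([1, 2, 3], [1, 4, 9])
export_svg(p, filename="demo.svg")
```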
Note that the following step is only required if you plan to run scripts that incur monetary costs (e.g., those calling OpenAI's Batch API, which is not free).
cp config_template.py config.py
You will need to edit the newly created config.py, filling in the string for your OpenAI API key:
gpt_api_key = "your-OpenAI-GPT-API-key-here"
Your OpenAI API key can be found here: Creating/finding your API key
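Once config.py exists, a script can read the key from it. Below is a minimal sketch of that pattern using the official openai client; how this repository's scripts actually wire it up may differ:

```python
# Minimal sketch of consuming the key from config.py with the official
# `openai` client (pip install openai). The repository's own scripts may
# wire this up differently. Note: this call incurs a (small) API cost.
from openai import OpenAI

import config  # the file created from config_template.py above

client = OpenAI(api_key=config.gpt_api_key)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```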
To run every results-reproducing script that does not incur a monetary cost, simply run the following:
./run_everything.sh
The scripts will run in parallel but are not compute-heavy. The results can be viewed in results/. If using version control, only the figures should come up as "modified", due to how PDFs/SVGs are saved; the regenerated figures will nonetheless be visually identical to their predecessors.
The code in this repository was developed by IDM, the Bill & Melinda Gates Foundation, and Ghamut Corporation to further research in Large Language Models (LLMs) for low-resource African languages by allowing them to be evaluated on question-answering and commonsense reasoning tasks, like those commonly available in English. We’ve made it publicly available under the MIT License to provide others with a better understanding of our research and an opportunity to build upon it for their own work. We make no representations that the code works as intended or that we will provide support, address issues that are found, or accept pull requests. You are welcome to create your own fork and modify the code to suit your own modeling needs as contemplated under the MIT License.
This repository includes data derived from the following datasets, each subject to their respective licenses (copied from their respective GitHub repositories):
- MMLU Dataset
- GitHub Repository: https://github.com/hendrycks/test
- License: LICENSE-MMLU
- For more licensing details, see the license terms specified in the file.
- Citation (see below):
@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
- Winogrande Dataset
- GitHub Repository: https://github.com/allenai/winogrande
- License: LICENSE-Winogrande
- For more licensing details, see the license terms specified in the file.
- Citation (see below):
@article{sakaguchi2019winogrande,
  title={WinoGrande: An Adversarial Winograd Schema Challenge at Scale},
  author={Sakaguchi, Keisuke and Bras, Ronan Le and Bhagavatula, Chandra and Choi, Yejin},
  journal={arXiv preprint arXiv:1907.10641},
  year={2019}
}
- Belebele Dataset
- GitHub Repository: https://github.com/facebookresearch/belebele
- Licenses: LICENSE-Belebele-CC-BY-NC4.0 and LICENSE-Belebele-CC-BY-SA4.0
- For more licensing details, see the license terms specified in the files.
- Citation (see below):
@inproceedings{bandarkar-etal-2024-belebele,
  title = "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants",
  author = "Bandarkar, Lucas and Liang, Davis and Muller, Benjamin and Artetxe, Mikel and Shukla, Satya Narayan and Husa, Donald and Goyal, Naman and Krishnan, Abhinandan and Zettlemoyer, Luke and Khabsa, Madian",
  booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = aug,
  year = "2024",
  address = "Bangkok, Thailand and virtual meeting",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2024.acl-long.44",
  pages = "749--775",
}
Please note that the licenses for the included datasets are separate from and may impose additional restrictions beyond the repository's main license.
If you find this repository useful, please consider citing it:
@article{alhanai2024bridging,
title={Bridging the Gap: Enhancing LLM Performance for Low-Resource African Languages with New Benchmarks, Fine-Tuning, and Cultural Adjustments},
author={Tuka Alhanai and Adam Kasumovic and Mohammad Ghassemi and Aven Zitzelberger and Jessica Lundin and Guillaume Chabot-Couture},
year={2024}
}