BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models
This is the repository for the paper BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models. This is a work in progress and more materials will be added over time.
The repository currently contains:
- LINDSEA linguistic diagnostic datasets for Indonesian and Tamil
- Cultural representation datasets for Indonesian and Tamil
.
├── LICENSE
├── README.md
├── culture
│   └── representation
│       ├── README.md
│       ├── id                # Data for Indonesian cultural representation
│       └── ta                # Data for Tamil cultural representation
└── lindsea
    ├── README.md
    ├── id
    │   ├── pragmatics        # Data for Indonesian pragmatic reasoning (scalar implicatures/presuppositions)
    │   ├── prompts.yaml      # Prompts (English & Translated) for LINDSEA (Indonesian)
    │   ├── semantics         # Data for Indonesian semantic tests (coreference/translation)
    │   └── syntax            # Data for Indonesian syntactic tests (minimal pairs)
    └── ta
        ├── pragmatics        # Data for Tamil pragmatic reasoning (scalar implicatures/presuppositions)
        ├── prompts.yaml      # Prompts (English & Translated) for LINDSEA (Tamil)
        ├── semantics         # Data for Tamil semantic tests (coreference/translation)
        └── syntax            # Data for Tamil syntactic tests (minimal pairs)
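The layout above can be checked against a local clone with a few lines of Python. This is only a minimal sketch: the helper names `expected_paths` and `missing_paths` are hypothetical and not part of the repository, and the code only tests that the listed directories and `prompts.yaml` files exist. It makes no assumption about the file formats inside them.

```python
from pathlib import Path

# Language codes and LINDSEA categories, taken from the directory tree above.
LANGUAGES = ("id", "ta")
LINDSEA_CATEGORIES = ("pragmatics", "semantics", "syntax")


def expected_paths(repo_root):
    """Map a short label to each data path implied by the directory tree.

    Hypothetical helper for illustration; not shipped with the repository.
    """
    root = Path(repo_root)
    paths = {}
    for lang in LANGUAGES:
        paths[f"culture/{lang}"] = root / "culture" / "representation" / lang
        paths[f"lindsea/{lang}/prompts"] = root / "lindsea" / lang / "prompts.yaml"
        for category in LINDSEA_CATEGORIES:
            paths[f"lindsea/{lang}/{category}"] = root / "lindsea" / lang / category
    return paths


def missing_paths(repo_root):
    """Return the sorted labels whose path does not exist under repo_root."""
    return sorted(
        label for label, path in expected_paths(repo_root).items() if not path.exists()
    )


if __name__ == "__main__":
    # "." assumes the script is run from the root of a local clone.
    for label in missing_paths("."):
        print("missing:", label)
```

Run from the root of a clone, the script prints nothing when every listed directory and prompt file is present.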
This work is licensed under a Creative Commons Attribution 4.0 International License.
Please cite our paper if you use our data:
@misc{leong2023bhasa,
      title={BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models},
      author={Wei Qi Leong and Jian Gang Ngui and Yosephine Susanto and Hamsawardhini Rengarajan and Kengatharaiyer Sarveswaran and William Chandra Tjhi},
      year={2023},
      eprint={2309.06085},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}