A curated collection of multilingual datasets and large language models, intended for evaluating and improving model performance across diverse languages and tasks.
Dataset | Year | Languages | GitHub | Download |
---|---|---|---|---|
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models | 2024 | Chinese (zh) (🇨🇳), Russian (ru) (🇷🇺), French (fr) (🇫🇷), Spanish (es) (🇪🇸), Arabic (ar) (🇸🇦) | Github | Data |
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | 2024 | Chinese (zh) (🇨🇳), English (en) (🇬🇧), German (de) (🇩🇪), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), Korean (ko) (🇰🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Portuguese (pt) (🇵🇹), Catalan (ca) (🇦🇩) | Github | Data |
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | 2024 | English (en) (🇬🇧), Chinese (zh) (🇨🇳), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), German (de) (🇩🇪) | Github | Data |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | 2023 | English (🇺🇸), Chinese (🇨🇳), Italian (🇮🇹), Portuguese (🇧🇷), Vietnamese (🇻🇳), Thai (🇹🇭), Swahili (🇰🇪), Afrikaans (🇿🇦), Javanese (🇮🇩) | Github | Data |
Language models are multilingual chain-of-thought reasoners | 2023 | Bengali (🇧🇩), Chinese (🇨🇳), French (🇫🇷), German (🇩🇪), Japanese (🇯🇵), Russian (🇷🇺), Spanish (🇪🇸), Swahili (🇰🇪), Telugu (🇮🇳), Thai (🇹🇭) | Github | Data |
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages | 2023 | English [🇬🇧], Russian [🇷🇺], Spanish [🇪🇸], German [🇩🇪], French [🇫🇷], Chinese [🇨🇳], Italian [🇮🇹], Portuguese [🇵🇹], Polish [🇵🇱], Japanese [🇯🇵], Vietnamese [🇻🇳], Dutch [🇳🇱], Arabic [🇸🇦], Turkish [🇹🇷], Czech [🇨🇿], Persian [🇮🇷], Hungarian [🇭🇺], Greek [🇬🇷], Romanian [🇷🇴], Swedish [🇸🇪], Ukrainian [🇺🇦], Finnish [🇫🇮], Korean [🇰🇷], Danish [🇩🇰], Bulgarian [🇧🇬], Norwegian [🇳🇴], Hindi [🇮🇳], Slovak [🇸🇰], Thai [🇹🇭], Lithuanian [🇱🇹], Catalan [🇪🇸], Indonesian [🇮🇩], Bangla [🇧🇩], Estonian [🇪🇪], Slovenian [🇸🇮], Latvian [🇱🇻], Hebrew [🇮🇱], Serbian [🇷🇸], Tamil [🇮🇳], Albanian [🇦🇱], Azerbaijani [🇦🇿] | 🤗 | Data |
Wiki-40B: Multilingual Language Model Dataset | 2020 | English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Italian (🇮🇹), Japanese (🇯🇵), Chinese Simplified (🇨🇳), Chinese Traditional (🇹🇼), Polish (🇵🇱), Ukrainian (🇺🇦), Dutch (🇳🇱), Swedish (🇸🇪), Portuguese (🇵🇹), Serbian (🇷🇸), Hungarian (🇭🇺), Catalan (🇪🇸), Czech (🇨🇿), Finnish (🇫🇮), Arabic (🇸🇦), Korean (🇰🇷), Persian (🇮🇷), Norwegian (🇳🇴), Vietnamese (🇻🇳), Hebrew (🇮🇱), Indonesian (🇮🇩), Romanian (🇷🇴), Turkish (🇹🇷), Bulgarian (🇧🇬), Estonian (🇪🇪), Malay (🇲🇾), Danish (🇩🇰), Slovak (🇸🇰), Croatian (🇭🇷), Greek (🇬🇷), Lithuanian (🇱🇹), Slovenian (🇸🇮), Thai (🇹🇭), Hindi (🇮🇳), Latvian (🇱🇻), Filipino (🇵🇭) | 👁️ | Data |
Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning | 2021 | English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Hindi (🇮🇳), Vietnamese (🇻🇳), Bulgarian (🇧🇬), Chinese (🇨🇳), Dutch (🇳🇱), Italian (🇮🇹), Japanese (🇯🇵), Polish (🇵🇱), Portuguese (🇵🇹), Arabic (🇸🇦), Swahili (🇹🇿), Urdu (🇵🇰) | GitHub | Data |
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset | 2022 | Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) | GitHub | Data |
GEOMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models | 2022 | English (🇺🇸), Chinese (🇨🇳), Hindi (🇮🇳), Persian (🇮🇷), Swahili (🇰🇪) | GitHub | 🔍 |
Title | Year | Languages | Code | Demo |
---|---|---|---|---|
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model |
2024 | Afrikaans [🇿🇦], Amharic [🇪🇹], Arabic [🇸🇦], Azerbaijani [🇦🇿], Belarusian [🇧🇾], Bengali [🇧🇩], Bulgarian [🇧🇬], Catalan [🇪🇸], Cebuano [🇵🇭], Czech [🇨🇿], Welsh [🏴], Danish [🇩🇰], German [🇩🇪], Greek [🇬🇷], English [🇬🇧], Esperanto [🇪🇸], Estonian [🇪🇪], Basque [🇪🇸], Finnish [🇫🇮], Tagalog [🇵🇭], French [🇫🇷], Western Frisian [🇳🇱], Scottish Gaelic [🏴], Irish [🇮🇪], Galician [🇪🇸], Gujarati [🇮🇳], Haitian Creole [🇭🇹], Hausa [🇳🇪], Hebrew [🇮🇱], Hindi [🇮🇳], Hungarian [🇭🇺], Armenian [🇦🇲], Igbo [🇳🇬], Indonesian [🇮🇩], Icelandic [🇮🇸], Italian [🇮🇹], Javanese [🇮🇩], Japanese [🇯🇵], Kannada [🇮🇳], Georgian [🇬🇪], Kazakh [🇰🇿], Khmer [🇰🇭], Kyrgyz [🇰🇬], Korean [🇰🇷], Kurdish [🇹🇷], Lao [🇱🇦], Latvian [🇱🇻], Latin [🇻🇦], Lithuanian [🇱🇹], Luxembourgish [🇱🇺], Malayalam [🇮🇳], Marathi [🇮🇳], Macedonian [🇲🇰], Malagasy [🇲🇬], Maltese [🇲🇹], Mongolian [🇲🇳], Maori [🇳🇿], Malay [🇲🇾], Burmese [🇲🇲], Nepali [🇳🇵], Dutch [🇳🇱], Norwegian [🇳🇴], Northern Sotho [🇿🇦], Chichewa [🇲🇼], Oriya [🇮🇳], Punjabi [🇮🇳], Persian [🇮🇷], Polish [🇵🇱], Portuguese [🇵🇹], Pashto [🇦🇫], Romanian [🇷🇴], Russian [🇷🇺], Sinhala [🇱🇰], Slovak [🇸🇰], Slovenian [🇸🇮], Samoan [🇼🇸], Shona [🇿🇼], Sindhi [🇵🇰], Somali [🇸🇴], Southern Sotho [🇱🇸], Spanish [🇪🇸], Albanian [🇦🇱], Serbian [🇷🇸], Sundanese [🇮🇩], Swahili [🇰🇪], Swedish [🇸🇪], Tamil [🇮🇳], Telugu [🇮🇳], Tajik [🇹🇯], Thai [🇹🇭], Turkish [🇹🇷], Twi [🇬🇭], Ukrainian [🇺🇦], Urdu [🇵🇰], Uzbek [🇺🇿], Vietnamese [🇻🇳], Xhosa [🇿🇦], Yiddish [🇮🇱], Yoruba [🇳🇬], Chinese [🇨🇳], Zulu [🇿🇦] | Source | 🤗 |
LANGBRIDGE: Multilingual Reasoning Without Multilingual Supervision | 2024 | Arabic (ar) (🇸🇦), Bengali (bn) (🇧🇩), Chinese (zh) (🇨🇳), Danish (da) (🇩🇰), Dutch (nl) (🇳🇱), English (en) (🇬🇧), French (fr) (🇫🇷), German (de) (🇩🇪), Hindi (hi) (🇮🇳), Japanese (ja) (🇯🇵), Korean (ko) (🇰🇷), Marathi (mr) (🇮🇳), Punjabi (pa) (🇮🇳), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Swahili (sw) (🇰🇪), Telugu (te) (🇮🇳), Turkish (tr) (🇹🇷), Urdu (ur) (🇵🇰) | Github | 🤗 |
Orion-14B: Open-source Multilingual Large Language Models | 2024 | English [🇬🇧], Chinese [🇨🇳], Japanese [🇯🇵], Korean [🇰🇷], Spanish [🇪🇸], French [🇫🇷], German [🇩🇪], Arabic [🇸🇦] | Github | 🤗 |
Baichuan 2: Open Large-scale Language Models | 2023 | Arabic (ar) (🇸🇦), Chinese (zh) (🇨🇳), English (en) (🇬🇧), French (fr) (🇫🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), German (de) (🇩🇪), Japanese (ja) (🇯🇵) | Github | 🤗 |
Larger-Scale Transformers for Multilingual Masked Language Modeling |
2021 | Afrikaans (🇿🇦), Albanian (🇦🇱), Amharic (🇪🇹), Arabic (🇸🇦), Armenian (🇦🇲), Assamese (🇮🇳), Azerbaijani (🇦🇿), Basque (🇪🇸), Belarusian (🇧🇾), Bengali (🇧🇩), Bengali Romanize (🇧🇩), Bosnian (🇧🇦), Breton (🏴), Bulgarian (🇧🇬), Burmese (🇲🇲), Burmese zawgyi font (🇲🇲), Catalan (🇪🇸), Chinese (Simplified) (🇨🇳), Chinese (Traditional) (🇹🇼), Croatian (🇭🇷), Czech (🇨🇿), Danish (🇩🇰), Dutch (🇳🇱), English (🇬🇧), Esperanto (🏴), Estonian (🇪🇪), Filipino (🇵🇭), Finnish (🇫🇮), French (🇫🇷), Galician (🇪🇸), Georgian (🇬🇪), German (🇩🇪), Greek (🇬🇷), Gujarati (🇮🇳), Hausa (🇳🇬), Hebrew (🇮🇱), Hindi (🇮🇳), Hindi Romanize (🇮🇳), Hungarian (🇭🇺), Icelandic (🇮🇸), Indonesian (🇮🇩), Irish (🇮🇪), Italian (🇮🇹), Japanese (🇯🇵), Javanese (🇮🇩), Kannada (🇮🇳), Kazakh (🇰🇿), Khmer (🇰🇭), Korean (🇰🇷), Kurdish (Kurmanji) (🇹🇷), Kyrgyz (🇰🇬), Lao (🇱🇦), Latin (🏛️), Latvian (🇱🇻), Lithuanian (🇱🇹), Macedonian (🇲🇰), Malagasy (🇲🇬), Malay (🇲🇾), Malayalam (🇮🇳), Marathi (🇮🇳), Mongolian (🇲🇳), Nepali (🇳🇵), Norwegian (🇳🇴), Oriya (🇮🇳), Oromo (🇪🇹), Pashto (🇦🇫), Persian (🇮🇷), Polish (🇵🇱), Portuguese (🇵🇹), Punjabi (🇮🇳), Romanian (🇷🇴), Russian (🇷🇺), Sanskrit (🇮🇳), Scottish Gaelic (🏴), Serbian (🇷🇸), Sindhi (🇵🇰), Sinhala (🇱🇰), Slovak (🇸🇰), Slovenian (🇸🇮), Somali (🇸🇴), Spanish (🇪🇸), Sundanese (🇮🇩), Swahili (🇰🇪), Swedish (🇸🇪), Tamil (🇮🇳), Tamil Romanize (🇮🇳), Telugu (🇮🇳), Telugu Romanize (🇮🇳), Thai (🇹🇭), Turkish (🇹🇷), Ukrainian (🇺🇦), Urdu (🇵🇰), Urdu Romanize (🇵🇰), Uyghur (🇨🇳), Uzbek (🇺🇿), Vietnamese (🇻🇳), Welsh (🏴), Western Frisian (🇳🇱), Xhosa (🇿🇦), Yiddish (🇮🇱) | Github | 🔍 |
InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities | 2023 | English (🇺🇸), Chinese (🇨🇳) | Github | 🔍 |
PolyLM: An Open Source Polyglot Large Language Model | 2023 | English (EN) [🇬🇧], Chinese (ZH) [🇨🇳], Russian (RU) [🇷🇺], Spanish (ES) [🇪🇸], German (DE) [🇩🇪], French (FR) [🇫🇷], Italian (IT) [🇮🇹], Portuguese (PT) [🇵🇹], Japanese (JA) [🇯🇵], Vietnamese (VI) [🇻🇳], Indonesian (ID) [🇮🇩], Polish (PL) [🇵🇱], Dutch (NL) [🇳🇱], Arabic (AR) [🇦🇪], Turkish (TR) [🇹🇷], Thai (TH) [🇹🇭], Hebrew (HE) [🇮🇱], Korean (KO) [🇰🇷] | Model | 🔍 |
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | 2023 | Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) | Github | 🤗 |
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages |
2023 | hbs_Latn (🇭🇷), mal_Mlym (🇮🇳), aze_Latn (🇦🇿), guj_Gujr (🇮🇳), ben_Beng (🇮🇳), kan_Knda (🇮🇳), tel_Telu (🇮🇳), mlt_Latn (🇲🇹), fra_Latn (🇫🇷), spa_Latn (🇪🇸), eng_Latn (🇬🇧), fil_Latn (🇵🇭), nob_Latn (🇳🇴), rus_Cyrl (🇷🇺), deu_Latn (🇩🇪), tur_Latn (🇹🇷), pan_Guru (🇮🇳), mar_Deva (🇮🇳), por_Latn (🇵🇹), nld_Latn (🇳🇱), ara_Arab (🇸🇦), zho_Hani (🇨🇳), ita_Latn (🇮🇹), ind_Latn (🇮🇩), ell_Grek (🇬🇷), bul_Cyrl (🇧🇬), swe_Latn (🇸🇪), ces_Latn (🇨🇿), isl_Latn (🇮🇸), pol_Latn (🇵🇱), ron_Latn (🇷🇴), dan_Latn (🇩🇰), hun_Latn (🇭🇺), tgk_Cyrl (🇹🇯), srp_Latn (🇷🇸), fas_Arab (🇮🇷), ceb_Latn (🇵🇭), heb_Hebr (🇮🇱), hrv_Latn (🇭🇷), glg_Latn (🇪🇸), fin_Latn (🇫🇮), slv_Latn (🇸🇮), vie_Latn (🇻🇳), mkd_Cyrl (🇲🇰), slk_Latn (🇸🇰), nor_Latn (🇳🇴), est_Latn (🇪🇪), ltz_Latn (🇱🇺), eus_Latn (🇪🇸), lit_Latn (🇱🇹), kaz_Cyrl (🇰🇿), lav_Latn (🇱🇻), bos_Latn (🇧🇦), epo_Latn (🇺🇸), cat_Latn (🇪🇸), tha_Thai (🇹🇭), ukr_Cyrl (🇺🇦), tgl_Latn (🇵🇭), sin_Sinh (🇱🇰), gle_Latn (🇮🇪), hin_Deva (🇮🇳), kor_Hang (🇰🇷), ory_Orya (🇮🇳), urd_Arab (🇵🇰), swa_Latn (🇰🇪), sqi_Latn (🇦🇱), bel_Cyrl (🇧🇾), afr_Latn (🇿🇦), nno_Latn (🇳🇴), tat_Cyrl (🇷🇺), asm_Beng (🇮🇳), hil_Latn (🇵🇭), nso_Latn (🇿🇦), ibo_Latn (🇳🇬), kin_Latn (🇷🇼), tpi_Latn (🇵🇬), twi_Latn (🇬🇭), kir_Cyrl (🇰🇬), nep_Deva (🇳🇵), azj_Latn (🇦🇿), bcl_Latn (🇵🇭), xho_Latn (🇿🇦), cym_Latn (🏴), gaa_Latn (🇬🇭), ton_Latn (🇹🇴), tah_Latn (🇵🇫), lat_Latn (🇻🇦), srn_Latn (🇸🇷), ewe_Latn (🇬🇭), bem_Latn (🇿🇲), orm_Latn (🇪🇹), haw_Latn (🇺🇸), hmo_Latn (🇵🇬), kat_Geor (🇬🇪), pag_Latn (🇵🇭), loz_Latn (🇿🇲), fry_Latn (🇳🇱), mya_Mymr (🇲🇲), nds_Latn (🇩🇪), run_Latn (🇧🇮), pnb_Arab (🇵🇰), rar_Latn (🇨🇰), fij_Latn (🇫🇯), wls_Latn (🇼🇸), ckb_Arab (🇮🇶), ven_Latn (🇿🇦), zsm_Latn (🇲🇾), chv_Cyrl (🇷🇺), lua_Latn (🇨🇩), que_Latn (🇵🇪), sag_Latn (🇨🇫), guw_Latn (🇬🇼), bre_Latn (🇫🇷), toi_Latn (🇨🇫), pus_Arab (🇦🇫), che_Cyrl (🇷🇺), pis_Latn (🇸🇧), kon_Latn (🇨🇩), oss_Cyrl (🇷🇺), hyw_Armn (🇦🇲), iso_Latn (🇻🇺), nan_Latn (🇹🇼), lub_Latn (🇨🇩), lim_Latn (🇳🇱), tuk_Latn (🇹🇲), tir_Ethi (🇪🇹), tgk_Latn (🇹🇯), yua_Latn (🇲🇽), min_Latn (🇮🇩), lue_Latn (🇨🇩), khm_Khmr (🇰🇭), tum_Latn (🇲🇼), tll_Latn 
(🇳🇦), ekk_Latn (🇪🇪), lug_Latn (🇺🇬), niu_Latn (🇳🇺), tzo_Latn (🇲🇽), mah_Latn (🇲🇭), tvl_Latn (🇹🇻), jav_Latn (🇮🇩), hau_Latn (🇳🇬), som_Latn (🇸🇴), uzb_Latn (🇺🇿), sot_Latn (🇿🇦), uzb_Cyrl (🇺🇿), cos_Latn (🇫🇷), als_Latn (🇦🇱), amh_Ethi (🇪🇹), sun_Latn (🇮🇩), war_Latn (🇵🇭), div_Thaa (🇲🇻), yor_Latn (🇳🇬), fao_Latn (🇫🇴), uzn_Cyrl (🇺🇿), smo_Latn (🇼🇸), bak_Cyrl (🇷🇺), ilo_Latn (🇵🇭), tso_Latn (🇿🇦), mri_Latn (🇳🇿), hmn_Latn (🇺🇸), nau_Latn (🇳🇷), asm_Beng (🇮🇳), hil_Latn (🇵🇭), nso_Latn (🇿🇦), ibo_Latn (🇳🇬), kin_Latn (🇷🇼), tpi_Latn (🇵🇬), twi_Latn (🇬🇭), kir_Cyrl (🇰🇬), pap_Latn (🇳🇱), aze_Latn (🇦🇿), qvi_Latn (🇵🇪), cak_Latn (🇬🇹), kbp_Latn (🇧🇫), kri_Latn (🇸🇱), mau_Latn (🇲🇽), scn_Latn (🇮🇹), tyv_Cyrl (🇷🇺), ina_Latn (🇧🇪), btx_Latn (🇮🇩), nch_Latn (🇲🇽), ncj_Latn (🇲🇽), pau_Latn (🇵🇼), toj_Latn (🇲🇽), pcm_Latn (🇳🇬), dyu_Latn (🇧🇫), kss_Latn (🇳🇬), afb_Arab (🇸🇦), urh_Latn (🇳🇬), quc_Latn (🇬🇹), new_Deva (🇳🇵), yao_Latn (🇲🇼), ngl_Latn (🇲🇿), nyu_Latn (🇲🇿), kab_Latn (🇩🇿), tuk_Cyrl (🇹🇲), xmf_Geor (🇬🇪), ndc_Latn (🇲🇿), san_Deva (🇮🇳), nba_Latn (🇳🇬), bpy_Beng (🇮🇳), ncx_Latn (🇲🇽), qug_Latn (🇵🇪), rmn_Latn (🇮🇳), cjk_Latn (🇬🇹), arb_Arab (🇸🇦), kea_Latn (🇨🇻), mck_Latn (🇨🇩), arn_Latn (🇨🇱), pdt_Latn (🇩🇪), her_Latn (🇳🇦), tlh_Latn (🇺🇸), suz_Deva (🇮🇳), kat_Geor (🇬🇪), kmr_Cyrl (🇷🇺), gcr_Latn (🇬🇵), jbo_Latn (🇺🇸), tbz_Latn (🇵🇼), bam_Latn (🇲🇱), prk_Latn (🇸🇮), jam_Latn (🇯🇲), twx_Latn (🇹🇼), sme_Latn (🇫🇮), gom_Latn (🇮🇳), bum_Latn (🇨🇲), mgr_Latn (🇲🇼), ahk_Latn (🇵🇰), kur_Arab (🇮🇶), bas_Latn (🇨🇲), bin_Latn (🇳🇬), tsz_Latn (🇲🇽), sid_Latn (🇪🇹), diq_Latn (🇹🇷), srd_Latn (🇮🇹), tcf_Latn (🇲🇽), bzj_Latn (🇮🇳), udm_Cyrl (🇷🇺), cce_Latn (🇨🇲), meu_Latn (🇨🇩), chw_Latn (🇨🇲), cbk_Latn (🇵🇭), ibg_Latn (🇮🇩), bhw_Latn (🇮🇩), ngu_Latn (🇲🇽), nyy_Latn (🇹🇿), szl_Latn (🇵🇱), ish_Latn (🇹🇿), naq_Latn (🇳🇦), toh_Latn (🇳🇿), ttj_Latn (🇰🇪), nse_Latn (🇳🇬), ami_Latn (🇹🇼), alz_Latn (🇸🇩), apc_Arab (🇸🇾), vls_Latn (🇳🇱), mhr_Cyrl (🇷🇺), djk_Latn (🇩🇪), prs_Arab (🇦🇫), san_Latn (🇮🇳), som_Arab (🇸🇴), uig_Latn (🇨🇳), hau_Arab (🇳🇬) | Github | 🔍 |
Few-shot Learning with Multilingual Generative Language Models |
2022 | English (🇺🇸), Russian (🇷🇺), Chinese (🇨🇳), German (🇩🇪), Spanish (🇪🇸), French (🇫🇷), Japanese (🇯🇵), Italian (🇮🇹), Portuguese (🇵🇹), Greek (🇬🇷), Romanian (🇷🇴), Ukrainian (🇺🇦), Hungarian (🇭🇺), Korean (🇰🇷), Polish (🇵🇱), Norwegian (🇳🇴), Dutch (🇳🇱), Finnish (🇫🇮), Danish (🇩🇰), Indonesian (🇮🇩), Croatian (🇭🇷), Turkish (🇹🇷), Arabic (🇸🇦), Vietnamese (🇻🇳), Thai (🇹🇭), Bulgarian (🇧🇬), Persian (🇮🇷), Swedish (🇸🇪), Malay (🇲🇾), Hebrew (🇮🇱), Czech (🇨🇿), Slovak (🇸🇰), Catalan (🇪🇸), Lithuanian (🇱🇹), Slovene (🇸🇮), Hindi (🇮🇳), Estonian (🇪🇪), Latvian (🇱🇻), Tagalog (🇵🇭), Albanian (🇦🇱), Serbian (🇷🇸), Azerbaijani (🇦🇿), Bengali (🇧🇩), Tamil (🇮🇳), Urdu (🇵🇰), Kazakh (🇰🇿), Armenian (🇦🇲), Georgian (🇬🇪), Icelandic (🇮🇸), Belarusian (🇧🇾), Bosnian (🇧🇦), Malayalam (🇮🇳), Macedonian (🇲🇰), Swahili (🇹🇿), Afrikaans (🇿🇦), Telugu (🇮🇳), Arabic Romanized (🇸🇦), Mongolian (🇲🇳), Latin (🇮🇹), Nepali (🇳🇵), Sinhalese (🇱🇰), Marathi (🇮🇳), Kannada (🇮🇳), Somali (🇸🇴), Welsh (🏴), Javanese (🇮🇩), Pashto (🇦🇫), Uzbek (🇺🇿), Gujarati (🇮🇳), Khmer (🇰🇭), Urdu Romanized (🇵🇰), Amharic (🇪🇹), Bengali Romanized (🇧🇩), Punjabi (🇮🇳), Galician (🇪🇸), Hausa (🇳🇬), Sanskrit (🇮🇳), Basque (🇪🇸), Burmese (🇲🇲), Sundanese (🇮🇩), Oriya (🇮🇳), Haitian (🇭🇹), Lao (🇱🇦), Kyrgyz (🇰🇬), Breton (🇫🇷), Irish (🇮🇪), Yoruba (🇳🇬), Esperanto (🌐), Tamil Romanized (🇮🇳), Zulu (🇿🇦), Tigrinya (🇪🇷), Telugu Romanized (🇮🇳), Kurdish (🇹🇷), Oromo (🇪🇹), Xhosa (🇿🇦), Scottish Gaelic (🇬🇧), Igbo (🇳🇬), Assamese (🇮🇳), Ganda (🇺🇬), Wolof (🇸🇳), Western Frisian (🇳🇱), Tswana (🇧🇼), Fula (🇸🇳), Guaraní (🇵🇾), Sindhi (🇵🇰), Lingala (🇨🇩), Bambara (🇲🇱), Inuktitut (🇨🇦), Kongo (🇨🇩), Quechua (🇵🇪), Swati (🇸🇿), Unassigned (🌐) | Github | 🔍 |
Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions | 2024 | English (🇺🇸), Chinese (🇨🇳), Telugu (🇮🇳), Hindi (🇮🇳), Arabic (🇸🇦), Swahili (🇹🇿), Bengali (🇧🇩) | 🔍 | 🔍 |
Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning | 2022 | Afrikaans (🇿🇦), Amharic (🇪🇹), Hausa (🇳🇬), Igbo (🇳🇬), Malagasy (🇲🇬), Chichewa (🇲🇼), Oromo (🇪🇹), Naija (🇳🇬), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Shona (🇿🇼), Somali (🇸🇴), Sesotho (🇱🇸), Swahili (🇹🇿), isiXhosa (🇿🇦), Yoruba (🇳🇬), isiZulu (🇿🇦), English (🇬🇧), French (🇫🇷), Arabic (🇸🇦), Lingala (🇨🇩), Luganda (🇺🇬), Luo (🇰🇪), Wolof (🇸🇳) | GitHub | 🤗 |
MuRIL: Multilingual Representations for Indian Languages | 2021 | Assamese (🇮🇳), Bengali (🇧🇩), Gujarati (🇮🇳), Hindi (🇮🇳), Kannada (🇮🇳), Kashmiri (🇮🇳), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Oriya (🇮🇳), Punjabi (🇮🇳), Sanskrit (🇮🇳), Sindhi (🇵🇰), Tamil (🇮🇳), Telugu (🇮🇳), Urdu (🇮🇳), English (🇬🇧) | 🔍 | 🔍 |
From English to Foreign Languages: Transferring Pretrained Language Models | 2020 | French (🇫🇷), Russian (🇷🇺), Arabic (🇦🇪), Chinese (🇨🇳), Hindi (🇮🇳), Vietnamese (🇻🇳) | 🔍 | 🔍 |
- PaLI-X: On Scaling up a Multilingual Vision and Language Model (2023)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model (2023)
- Learning to Scale Multilingual Representations for Vision-Language Tasks (2020)
- A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias (2024)
- Towards Building Multilingual Language Model for Medicine (2024)
- What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models (2024)
- All Languages Matter: On the Multilingual Safety of Large Language Models (2024)
- Multilingual Jailbreak Challenges in Large Language Models (2024)
- EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation (2024)
- Chat2VIS: Fine-Tuning Data Visualisations using Multilingual Natural Language Text and Pre-Trained Large Language Models (2024)
- How Linguistically Fair Are Multilingual Pre-Trained Language Models? (2021)
- IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages (2020)
- Are Multilingual Models the Best Choice for Moderately Under-Resourced Languages? A Comprehensive Assessment for Catalan (2021)
- You Reap What You Sow: On the Challenges of Bias Evaluation Under Multilingual Settings (2022)
- How to Adapt Your Pretrained Multilingual Model to 1600 Languages (2021)
- MEGA: Multilingual Evaluation of Generative AI / GitHub (2023)
- XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (2023)