A curated collection of multilingual datasets and large language models for evaluating and improving LLM performance across diverse languages and tasks.
Dataset | Year | Languages | GitHub | Download |
---|---|---|---|---|
OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models | 2024 | Chinese (zh) (🇨🇳), Russian (ru) (🇷🇺), French (fr) (🇫🇷), Spanish (es) (🇪🇸), Arabic (ar) (🇸🇦) | GitHub | Data |
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property | 2024 | Chinese (zh) (🇨🇳), English (en) (🇬🇧), German (de) (🇩🇪), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), Korean (ko) (🇰🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Portuguese (pt) (🇵🇹), Catalan (ca) (🇦🇩) | GitHub | Data |
MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models | 2024 | English (en) (🇬🇧), Chinese (zh) (🇨🇳), Japanese (ja) (🇯🇵), French (fr) (🇫🇷), German (de) (🇩🇪) | GitHub | Data |
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models | 2023 | English (🇺🇸), Chinese (🇨🇳), Italian (🇮🇹), Portuguese (🇧🇷), Vietnamese (🇻🇳), Thai (🇹🇭), Swahili (🇰🇪), Afrikaans (🇿🇦), Javanese (🇮🇩) | GitHub | Data |
Language models are multilingual chain-of-thought reasoners | 2023 | Bengali (🇧🇩), Chinese (🇨🇳), French (🇫🇷), German (🇩🇪), Japanese (🇯🇵), Russian (🇷🇺), Spanish (🇪🇸), Swahili (🇰🇪), Telugu (🇮🇳), Thai (🇹🇭) | GitHub | Data |
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages | 2023 | English [🇬🇧], Russian [🇷🇺], Spanish [🇪🇸], German [🇩🇪], French [🇫🇷], Chinese [🇨🇳], Italian [🇮🇹], Portuguese [🇵🇹], Polish [🇵🇱], Japanese [🇯🇵], Vietnamese [🇻🇳], Dutch [🇳🇱], Arabic [🇸🇦], Turkish [🇹🇷], Czech [🇨🇿], Persian [🇮🇷], Hungarian [🇭🇺], Greek [🇬🇷], Romanian [🇷🇴], Swedish [🇸🇪], Ukrainian [🇺🇦], Finnish [🇫🇮], Korean [🇰🇷], Danish [🇩🇰], Bulgarian [🇧🇬], Norwegian [🇳🇴], Hindi [🇮🇳], Slovak [🇸🇰], Thai [🇹🇭], Lithuanian [🇱🇹], Catalan [🇪🇸], Indonesian [🇮🇩], Bangla [🇧🇩], Estonian [🇪🇪], Slovenian [🇸🇮], Latvian [🇱🇻], Hebrew [🇮🇱], Serbian [🇷🇸], Tamil [🇮🇳], Albanian [🇦🇱], Azerbaijani [🇦🇿] | 🤗 | Data |
Wiki-40B: Multilingual Language Model Dataset | 2020 | English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Italian (🇮🇹), Japanese (🇯🇵), Chinese Simplified (🇨🇳), Chinese Traditional (🇹🇼), Polish (🇵🇱), Ukrainian (🇺🇦), Dutch (🇳🇱), Swedish (🇸🇪), Portuguese (🇵🇹), Serbian (🇷🇸), Hungarian (🇭🇺), Catalan (🇪🇸), Czech (🇨🇿), Finnish (🇫🇮), Arabic (🇸🇦), Korean (🇰🇷), Persian (🇮🇷), Norwegian (🇳🇴), Vietnamese (🇻🇳), Hebrew (🇮🇱), Indonesian (🇮🇩), Romanian (🇷🇴), Turkish (🇹🇷), Bulgarian (🇧🇬), Estonian (🇪🇪), Malay (🇲🇾), Danish (🇩🇰), Slovak (🇸🇰), Croatian (🇭🇷), Greek (🇬🇷), Lithuanian (🇱🇹), Slovenian (🇸🇮), Thai (🇹🇭), Hindi (🇮🇳), Latvian (🇱🇻), Filipino (🇵🇭) | ๐๏ธ | Data |
Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning | 2021 | English (🇺🇸), German (🇩🇪), French (🇫🇷), Russian (🇷🇺), Spanish (🇪🇸), Hindi (🇮🇳), Vietnamese (🇻🇳), Bulgarian (🇧🇬), Chinese (🇨🇳), Dutch (🇳🇱), Italian (🇮🇹), Japanese (🇯🇵), Polish (🇵🇱), Portuguese (🇵🇹), Arabic (🇸🇦), Swahili (🇹🇿), Urdu (🇵🇰) | GitHub | Data |
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset | 2022 | Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) | GitHub | Data |
GEOMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models | 2022 | English (🇺🇸), Chinese (🇨🇳), Hindi (🇮🇳), Persian (🇮🇷), Swahili (🇰🇪) | GitHub | ๐ |
Title | Year | Languages | Code | Demo |
---|---|---|---|---|
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model |
2024 | Afrikaans [๐ฟ๐ฆ], Amharic [๐ช๐น], Arabic [๐ธ๐ฆ], Azerbaijani [๐ฆ๐ฟ], Belarusian [๐ง๐พ], Bengali [๐ง๐ฉ], Bulgarian [๐ง๐ฌ], Catalan [๐ช๐ธ], Cebuano [๐ต๐ญ], Czech [๐จ๐ฟ], Welsh [๐ด], Danish [๐ฉ๐ฐ], German [๐ฉ๐ช], Greek [๐ฌ๐ท], English [๐ฌ๐ง], Esperanto [๐ช๐ธ], Estonian [๐ช๐ช], Basque [๐ช๐ธ], Finnish [๐ซ๐ฎ], Tagalog [๐ต๐ญ], French [๐ซ๐ท], Western Frisian [๐ณ๐ฑ], Scottish Gaelic [๐ด], Irish [๐ฎ๐ช], Galician [๐ช๐ธ], Gujarati [๐ฎ๐ณ], Haitian Creole [๐ญ๐น], Hausa [๐ณ๐ช], Hebrew [๐ฎ๐ฑ], Hindi [๐ฎ๐ณ], Hungarian [๐ญ๐บ], Armenian [๐ฆ๐ฒ], Igbo [๐ณ๐ฌ], Indonesian [๐ฎ๐ฉ], Icelandic [๐ฎ๐ธ], Italian [๐ฎ๐น], Javanese [๐ฎ๐ฉ], Japanese [๐ฏ๐ต], Kannada [๐ฎ๐ณ], Georgian [๐ฌ๐ช], Kazakh [๐ฐ๐ฟ], Khmer [๐ฐ๐ญ], Kyrgyz [๐ฐ๐ฌ], Korean [๐ฐ๐ท], Kurdish [๐น๐ท], Lao [๐ฑ๐ฆ], Latvian [๐ฑ๐ป], Latin [๐ป๐ฆ], Lithuanian [๐ฑ๐น], Luxembourgish [๐ฑ๐บ], Malayalam [๐ฎ๐ณ], Marathi [๐ฎ๐ณ], Macedonian [๐ฒ๐ฐ], Malagasy [๐ฒ๐ฌ], Maltese [๐ฒ๐น], Mongolian [๐ฒ๐ณ], Maori [๐ณ๐ฟ], Malay [๐ฒ๐พ], Burmese [๐ฒ๐ฒ], Nepali [๐ณ๐ต], Dutch [๐ณ๐ฑ], Norwegian [๐ณ๐ด], Northern Sotho [๐ฟ๐ฆ], Chichewa [๐ฒ๐ผ], Oriya [๐ฎ๐ณ], Punjabi [๐ฎ๐ณ], Persian [๐ฎ๐ท], Polish [๐ต๐ฑ], Portuguese [๐ต๐น], Pashto [๐ฆ๐ซ], Romanian [๐ท๐ด], Russian [๐ท๐บ], Sinhala [๐ฑ๐ฐ], Slovak [๐ธ๐ฐ], Slovenian [๐ธ๐ฎ], Samoan [๐ผ๐ธ], Shona [๐ฟ๐ผ], Sindhi [๐ต๐ฐ], Somali [๐ธ๐ด], Southern Sotho [๐ฑ๐ธ], Spanish [๐ช๐ธ], Albanian [๐ฆ๐ฑ], Serbian [๐ท๐ธ], Sundanese [๐ฎ๐ฉ], Swahili [๐ฐ๐ช], Swedish [๐ธ๐ช], Tamil [๐ฎ๐ณ], Telugu [๐ฎ๐ณ], Tajik [๐น๐ฏ], Thai [๐น๐ญ], Turkish [๐น๐ท], Twi [๐ฌ๐ญ], Ukrainian [๐บ๐ฆ], Urdu [๐ต๐ฐ], Uzbek [๐บ๐ฟ], Vietnamese [๐ป๐ณ], Xhosa [๐ฟ๐ฆ], Yiddish [๐ฎ๐ฑ], Yoruba [๐ณ๐ฌ], Chinese [๐จ๐ณ], Zulu [๐ฟ๐ฆ] | Source | ๐ค |
LANGBRIDGE: Multilingual Reasoning Without Multilingual Supervision | 2024 | Arabic (ar) (🇸🇦), Bengali (bn) (🇧🇩), Chinese (zh) (🇨🇳), Danish (da) (🇩🇰), Dutch (nl) (🇳🇱), English (en) (🇬🇧), French (fr) (🇫🇷), German (de) (🇩🇪), Hindi (hi) (🇮🇳), Japanese (ja) (🇯🇵), Korean (ko) (🇰🇷), Marathi (mr) (🇮🇳), Punjabi (pa) (🇮🇳), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), Swahili (sw) (🇰🇪), Telugu (te) (🇮🇳), Turkish (tr) (🇹🇷), Urdu (ur) (🇵🇰) | GitHub | 🤗 |
Orion-14B: Open-source Multilingual Large Language Models | 2024 | English [🇬🇧], Chinese [🇨🇳], Japanese [🇯🇵], Korean [🇰🇷], Spanish [🇪🇸], French [🇫🇷], German [🇩🇪], Arabic [🇸🇦] | GitHub | 🤗 |
Baichuan 2: Open Large-scale Language Models | 2023 | Arabic (ar) (🇸🇦), Chinese (zh) (🇨🇳), English (en) (🇬🇧), French (fr) (🇫🇷), Russian (ru) (🇷🇺), Spanish (es) (🇪🇸), German (de) (🇩🇪), Japanese (ja) (🇯🇵) | GitHub | 🤗 |
Larger-Scale Transformers for Multilingual Masked Language Modeling |
2021 | Afrikaans (๐ฟ๐ฆ), Albanian (๐ฆ๐ฑ), Amharic (๐ช๐น), Arabic (๐ธ๐ฆ), Armenian (๐ฆ๐ฒ), Assamese (๐ฎ๐ณ), Azerbaijani (๐ฆ๐ฟ), Basque (๐ช๐ธ), Belarusian (๐ง๐พ), Bengali (๐ง๐ฉ), Bengali Romanize (๐ง๐ฉ), Bosnian (๐ง๐ฆ), Breton (๐ด), Bulgarian (๐ง๐ฌ), Burmese (๐ฒ๐ฒ), Burmese zawgyi font (๐ฒ๐ฒ), Catalan (๐ช๐ธ), Chinese (Simplified) (๐จ๐ณ), Chinese (Traditional) (๐น๐ผ), Croatian (๐ญ๐ท), Czech (๐จ๐ฟ), Danish (๐ฉ๐ฐ), Dutch (๐ณ๐ฑ), English (๐ฌ๐ง), Esperanto (๐ด), Estonian (๐ช๐ช), Filipino (๐ต๐ญ), Finnish (๐ซ๐ฎ), French (๐ซ๐ท), Galician (๐ช๐ธ), Georgian (๐ฌ๐ช), German (๐ฉ๐ช), Greek (๐ฌ๐ท), Gujarati (๐ฎ๐ณ), Hausa (๐ณ๐ฌ), Hebrew (๐ฎ๐ฑ), Hindi (๐ฎ๐ณ), Hindi Romanize (๐ฎ๐ณ), Hungarian (๐ญ๐บ), Icelandic (๐ฎ๐ธ), Indonesian (๐ฎ๐ฉ), Irish (๐ฎ๐ช), Italian (๐ฎ๐น), Japanese (๐ฏ๐ต), Javanese (๐ฎ๐ฉ), Kannada (๐ฎ๐ณ), Kazakh (๐ฐ๐ฟ), Khmer (๐ฐ๐ญ), Korean (๐ฐ๐ท), Kurdish (Kurmanji) (๐น๐ท), Kyrgyz (๐ฐ๐ฌ), Lao (๐ฑ๐ฆ), Latin (๐๏ธ), Latvian (๐ฑ๐ป), Lithuanian (๐ฑ๐น), Macedonian (๐ฒ๐ฐ), Malagasy (๐ฒ๐ฌ), Malay (๐ฒ๐พ), Malayalam (๐ฎ๐ณ), Marathi (๐ฎ๐ณ), Mongolian (๐ฒ๐ณ), Nepali (๐ณ๐ต), Norwegian (๐ณ๐ด), Oriya (๐ฎ๐ณ), Oromo (๐ช๐น), Pashto (๐ฆ๐ซ), Persian (๐ฎ๐ท), Polish (๐ต๐ฑ), Portuguese (๐ต๐น), Punjabi (๐ฎ๐ณ), Romanian (๐ท๐ด), Russian (๐ท๐บ), Sanskrit (๐ฎ๐ณ), Scottish Gaelic (๐ด), Serbian (๐ท๐ธ), Sindhi (๐ต๐ฐ), Sinhala (๐ฑ๐ฐ), Slovak (๐ธ๐ฐ), Slovenian (๐ธ๐ฎ), Somali (๐ธ๐ด), Spanish (๐ช๐ธ), Sundanese (๐ฎ๐ฉ), Swahili (๐ฐ๐ช), Swedish (๐ธ๐ช), Tamil (๐ฎ๐ณ), Tamil Romanize (๐ฎ๐ณ), Telugu (๐ฎ๐ณ), Telugu Romanize (๐ฎ๐ณ), Thai (๐น๐ญ), Turkish (๐น๐ท), Ukrainian (๐บ๐ฆ), Urdu (๐ต๐ฐ), Urdu Romanize (๐ต๐ฐ), Uyghur (๐จ๐ณ), Uzbek (๐บ๐ฟ), Vietnamese (๐ป๐ณ), Welsh (๐ด), Western Frisian (๐ณ๐ฑ), Xhosa (๐ฟ๐ฆ), Yiddish (๐ฎ๐ฑ) | Github | ๐ |
InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities | 2023 | English (🇺🇸), Chinese (🇨🇳) | GitHub | ๐ |
PolyLM: An Open Source Polyglot Large Language Model | 2023 | English (EN) [🇬🇧], Chinese (ZH) [🇨🇳], Russian (RU) [🇷🇺], Spanish (ES) [🇪🇸], German (DE) [🇩🇪], French (FR) [🇫🇷], Italian (IT) [🇮🇹], Portuguese (PT) [🇵🇹], Japanese (JA) [🇯🇵], Vietnamese (VI) [🇻🇳], Indonesian (ID) [🇮🇩], Polish (PL) [🇵🇱], Dutch (NL) [🇳🇱], Arabic (AR) [🇦🇪], Turkish (TR) [🇹🇷], Thai (TH) [🇹🇭], Hebrew (HE) [🇮🇱], Korean (KO) [🇰🇷] | Model | ๐ |
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | 2023 | Akan (🇬🇭), Arabic (🇸🇦), Assamese (🇮🇳), Bambara (🇲🇱), Basque (🇪🇸), Bengali (🇧🇩), Catalan (🇪🇸), Chichewa (🇲🇼), chiShona (🇿🇼), Chitumbuka (🇲🇼), English (🇬🇧), Fon (🇧🇯), French (🇫🇷), Gujarati (🇮🇳), Hindi (🇮🇳), Igbo (🇳🇬), Indonesian (🇮🇩), isiXhosa (🇿🇦), isiZulu (🇿🇦), Kannada (🇮🇳), Kikuyu (🇰🇪), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Lingala (🇨🇩), Luganda (🇺🇬), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Northern Sotho (🇿🇦), Odia (🇮🇳), Portuguese (🇵🇹), Punjabi (🇮🇳), Sesotho (🇱🇸), Setswana (🇧🇼), Simplified Chinese (🇨🇳), Spanish (🇪🇸), Swahili (🇰🇪), Tamil (🇮🇳), Telugu (🇮🇳), Traditional Chinese (🇹🇼), Twi (🇬🇭), Urdu (🇵🇰), Vietnamese (🇻🇳), Wolof (🇸🇳), Xitsonga (🇿🇦), Yoruba (🇳🇬), Programming Languages (💻) | GitHub | 🤗 |
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages |
2023 | hbs_Latn (๐ญ๐ท), mal_Mlym (๐ฎ๐ณ), aze_Latn (๐ฆ๐ฟ), guj_Gujr (๐ฎ๐ณ), ben_Beng (๐ฎ๐ณ), kan_Knda (๐ฎ๐ณ), tel_Telu (๐ฎ๐ณ), mlt_Latn (๐ฒ๐น), fra_Latn (๐ซ๐ท), spa_Latn (๐ช๐ธ), eng_Latn (๐ฌ๐ง), fil_Latn (๐ต๐ญ), nob_Latn (๐ณ๐ด), rus_Cyrl (๐ท๐บ), deu_Latn (๐ฉ๐ช), tur_Latn (๐น๐ท), pan_Guru (๐ฎ๐ณ), mar_Deva (๐ฎ๐ณ), por_Latn (๐ต๐น), nld_Latn (๐ณ๐ฑ), ara_Arab (๐ธ๐ฆ), zho_Hani (๐จ๐ณ), ita_Latn (๐ฎ๐น), ind_Latn (๐ฎ๐ฉ), ell_Grek (๐ฌ๐ท), bul_Cyrl (๐ง๐ฌ), swe_Latn (๐ธ๐ช), ces_Latn (๐จ๐ฟ), isl_Latn (๐ฎ๐ธ), pol_Latn (๐ต๐ฑ), ron_Latn (๐ท๐ด), dan_Latn (๐ฉ๐ฐ), hun_Latn (๐ญ๐บ), tgk_Cyrl (๐น๐ฏ), srp_Latn (๐ท๐ธ), fas_Arab (๐ฎ๐ท), ceb_Latn (๐ต๐ญ), heb_Hebr (๐ฎ๐ฑ), hrv_Latn (๐ญ๐ท), glg_Latn (๐ช๐ธ), fin_Latn (๐ซ๐ฎ), slv_Latn (๐ธ๐ฎ), vie_Latn (๐ป๐ณ), mkd_Cyrl (๐ฒ๐ฐ), slk_Latn (๐ธ๐ฐ), nor_Latn (๐ณ๐ด), est_Latn (๐ช๐ช), ltz_Latn (๐ฑ๐บ), eus_Latn (๐ช๐ธ), lit_Latn (๐ฑ๐น), kaz_Cyrl (๐ฐ๐ฟ), lav_Latn (๐ฑ๐ป), bos_Latn (๐ง๐ฆ), epo_Latn (๐บ๐ธ), cat_Latn (๐ช๐ธ), tha_Thai (๐น๐ญ), ukr_Cyrl (๐บ๐ฆ), tgl_Latn (๐ต๐ญ), sin_Sinh (๐ฑ๐ฐ), gle_Latn (๐ฎ๐ช), hin_Deva (๐ฎ๐ณ), kor_Hang (๐ฐ๐ท), ory_Orya (๐ฎ๐ณ), urd_Arab (๐ต๐ฐ), swa_Latn (๐ฐ๐ช), sqi_Latn (๐ฆ๐ฑ), bel_Cyrl (๐ง๐พ), afr_Latn (๐ฟ๐ฆ), nno_Latn (๐ณ๐ด), tat_Cyrl (๐ท๐บ), asm_Beng (๐ฎ๐ณ), hil_Latn (๐ต๐ญ), nso_Latn (๐ฟ๐ฆ), ibo_Latn (๐ณ๐ฌ), kin_Latn (๐ท๐ผ), tpi_Latn (๐ต๐ฌ), twi_Latn (๐ฌ๐ญ), kir_Cyrl (๐ฐ๐ฌ), nep_Deva (๐ณ๐ต), azj_Latn (๐ฆ๐ฟ), bcl_Latn (๐ต๐ญ), xho_Latn (๐ฟ๐ฆ), cym_Latn (๐ด), gaa_Latn (๐ฌ๐ญ), ton_Latn (๐น๐ด), tah_Latn (๐ต๐ซ), lat_Latn (๐ป๐ฆ), srn_Latn (๐ธ๐ท), ewe_Latn (๐ฌ๐ญ), bem_Latn (๐ฟ๐ฒ), orm_Latn (๐ช๐น), haw_Latn (๐บ๐ธ), hmo_Latn (๐ต๐ฌ), kat_Geor (๐ฌ๐ช), pag_Latn (๐ต๐ญ), loz_Latn (๐ฟ๐ฒ), fry_Latn (๐ณ๐ฑ), mya_Mymr (๐ฒ๐ฒ), nds_Latn (๐ฉ๐ช), run_Latn (๐ง๐ฎ), pnb_Arab (๐ต๐ฐ), rar_Latn (๐จ๐ฐ), fij_Latn (๐ซ๐ฏ), wls_Latn (๐ผ๐ธ), ckb_Arab (๐ฎ๐ถ), ven_Latn (๐ฟ๐ฆ), zsm_Latn (๐ฒ๐พ), chv_Cyrl (๐ท๐บ), lua_Latn (๐จ๐ฉ), que_Latn (๐ต๐ช), sag_Latn (๐จ๐ซ), guw_Latn (๐ฌ๐ผ), bre_Latn (๐ซ๐ท), toi_Latn (๐จ๐ซ), pus_Arab (๐ฆ๐ซ), che_Cyrl (๐ท๐บ), pis_Latn (๐ธ๐ง), 
kon_Latn (๐จ๐ฉ), oss_Cyrl (๐ท๐บ), hyw_Armn (๐ฆ๐ฒ), iso_Latn (๐ป๐บ), nan_Latn (๐น๐ผ), lub_Latn (๐จ๐ฉ), lim_Latn (๐ณ๐ฑ), tuk_Latn (๐น๐ฒ), tir_Ethi (๐ช๐น), tgk_Latn (๐น๐ฏ), yua_Latn (๐ฒ๐ฝ), min_Latn (๐ฎ๐ฉ), lue_Latn (๐จ๐ฉ), khm_Khmr (๐ฐ๐ญ), tum_Latn (๐ฒ๐ผ), tll_Latn (๐ณ๐ฆ), ekk_Latn (๐ช๐ช), lug_Latn (๐บ๐ฌ), niu_Latn (๐ณ๐บ), tzo_Latn (๐ฒ๐ฝ), mah_Latn (๐ฒ๐ญ), tvl_Latn (๐น๐ป), jav_Latn (๐ฎ๐ฉ), hau_Latn (๐ณ๐ฌ), som_Latn (๐ธ๐ด), uzb_Latn (๐บ๐ฟ), sot_Latn (๐ฟ๐ฆ), uzb_Cyrl (๐บ๐ฟ), cos_Latn (๐ซ๐ท), als_Latn (๐ฆ๐ฑ), amh_Ethi (๐ช๐น), sun_Latn (๐ฎ๐ฉ), war_Latn (๐ต๐ญ), div_Thaa (๐ฒ๐ป), yor_Latn (๐ณ๐ฌ), fao_Latn (๐ซ๐ด), uzn_Cyrl (๐บ๐ฟ), smo_Latn (๐ผ๐ธ), bak_Cyrl (๐ท๐บ), ilo_Latn (๐ต๐ญ), tso_Latn (๐ฟ๐ฆ), mri_Latn (๐ณ๐ฟ), hmn_Latn (๐บ๐ธ), nau_Latn (๐ณ๐ท), asm_Beng (๐ฎ๐ณ), hil_Latn (๐ต๐ญ), nso_Latn (๐ฟ๐ฆ), ibo_Latn (๐ณ๐ฌ), kin_Latn (๐ท๐ผ), tpi_Latn (๐ต๐ฌ), twi_Latn (๐ฌ๐ญ), kir_Cyrl (๐ฐ๐ฌ), pap_Latn (๐ณ๐ฑ), aze_Latn (๐ฆ๐ฟ), qvi_Latn (๐ต๐ช), cak_Latn (๐ฌ๐น), kbp_Latn (๐ง๐ซ), kri_Latn (๐ธ๐ฑ), mau_Latn (๐ฒ๐ฝ), scn_Latn (๐ฎ๐น), tyv_Cyrl (๐ท๐บ), ina_Latn (๐ง๐ช), btx_Latn (๐ฎ๐ฉ), nch_Latn (๐ฒ๐ฝ), ncj_Latn (๐ฒ๐ฝ), pau_Latn (๐ต๐ผ), toj_Latn (๐ฒ๐ฝ), pcm_Latn (๐ณ๐ฌ), dyu_Latn (๐ง๐ซ), kss_Latn (๐ณ๐ฌ), afb_Arab (๐ธ๐ฆ), urh_Latn (๐ณ๐ฌ), quc_Latn (๐ฌ๐น), new_Deva (๐ณ๐ต), yao_Latn (๐ฒ๐ผ), ngl_Latn (๐ฒ๐ฟ), nyu_Latn (๐ฒ๐ฟ), kab_Latn (๐ฉ๐ฟ), tuk_Cyrl (๐น๐ฒ), xmf_Geor (๐ฌ๐ช), ndc_Latn (๐ฒ๐ฟ), san_Deva (๐ฎ๐ณ), nba_Latn (๐ณ๐ฌ), bpy_Beng (๐ฎ๐ณ), ncx_Latn (๐ฒ๐ฝ), qug_Latn (๐ต๐ช), rmn_Latn (๐ฎ๐ณ), cjk_Latn (๐ฌ๐น), arb_Arab (๐ธ๐ฆ), kea_Latn (๐จ๐ป), mck_Latn (๐จ๐ฉ), arn_Latn (๐จ๐ฑ), pdt_Latn (๐ฉ๐ช), her_Latn (๐ณ๐ฆ), tlh_Latn (๐บ๐ธ), suz_Deva (๐ฎ๐ณ), kat_Geor (๐ฌ๐ช), kmr_Cyrl (๐ท๐บ), gcr_Latn (๐ฌ๐ต), jbo_Latn (๐บ๐ธ), tbz_Latn (๐ต๐ผ), bam_Latn (๐ฒ๐ฑ), prk_Latn (๐ธ๐ฎ), jam_Latn (๐ฏ๐ฒ), twx_Latn (๐น๐ผ), sme_Latn (๐ซ๐ฎ), gom_Latn (๐ฎ๐ณ), bum_Latn (๐จ๐ฒ), mgr_Latn (๐ฒ๐ผ), ahk_Latn (๐ต๐ฐ), kur_Arab (๐ฎ๐ถ), bas_Latn (๐จ๐ฒ), bin_Latn (๐ณ๐ฌ), tsz_Latn (๐ฒ๐ฝ), sid_Latn (๐ช๐น), diq_Latn (๐น๐ท), srd_Latn (๐ฎ๐น), tcf_Latn 
(๐ฒ๐ฝ), bzj_Latn (๐ฎ๐ณ), udm_Cyrl (๐ท๐บ), cce_Latn (๐จ๐ฒ), meu_Latn (๐จ๐ฉ), chw_Latn (๐จ๐ฒ), cbk_Latn (๐ต๐ญ), ibg_Latn (๐ฎ๐ฉ), bhw_Latn (๐ฎ๐ฉ), ngu_Latn (๐ฒ๐ฝ), nyy_Latn (๐น๐ฟ), szl_Latn (๐ต๐ฑ), ish_Latn (๐น๐ฟ), naq_Latn (๐ณ๐ฆ), toh_Latn (๐ณ๐ฟ), ttj_Latn (๐ฐ๐ช), nse_Latn (๐ณ๐ฌ), ami_Latn (๐น๐ผ), alz_Latn (๐ธ๐ฉ), apc_Arab (๐ธ๐พ), vls_Latn (๐ณ๐ฑ), mhr_Cyrl (๐ท๐บ), djk_Latn (๐ฉ๐ช), prs_Arab (๐ฆ๐ซ), san_Latn (๐ฎ๐ณ), som_Arab (๐ธ๐ด), uig_Latn (๐จ๐ณ), hau_Arab (๐ณ๐ฌ) | Github | ๐ |
Few-shot Learning with Multilingual Generative Language Models |
2022 | English (๐บ๐ธ), Russian (๐ท๐บ), Chinese (๐จ๐ณ), German (๐ฉ๐ช), Spanish (๐ช๐ธ), French (๐ซ๐ท), Japanese (๐ฏ๐ต), Italian (๐ฎ๐น), Portuguese (๐ต๐น), Greek (๐ฌ๐ท), Romanian (๐ท๐ด), Ukrainian (๐บ๐ฆ), Hungarian (๐ญ๐บ), Korean (๐ฐ๐ท), Polish (๐ต๐ฑ), Norwegian (๐ณ๐ด), Dutch (๐ณ๐ฑ), Finnish (๐ซ๐ฎ), Danish (๐ฉ๐ฐ), Indonesian (๐ฎ๐ฉ), Croatian (๐ญ๐ท), Turkish (๐น๐ท), Arabic (๐ธ๐ฆ), Vietnamese (๐ป๐ณ), Thai (๐น๐ญ), Bulgarian (๐ง๐ฌ), Persian (๐ฎ๐ท), Swedish (๐ธ๐ช), Malay (๐ฒ๐พ), Hebrew (๐ฎ๐ฑ), Czech (๐จ๐ฟ), Slovak (๐ธ๐ฐ), Catalan (๐ช๐ธ), Lithuanian (๐ฑ๐น), Slovene (๐ธ๐ฎ), Hindi (๐ฎ๐ณ), Estonian (๐ช๐ช), Latvian (๐ฑ๐ป), Tagalog (๐ต๐ญ), Albanian (๐ฆ๐ฑ), Serbian (๐ท๐ธ), Azerbaijani (๐ฆ๐ฟ), Bengali (๐ง๐ฉ), Tamil (๐ฎ๐ณ), Urdu (๐ต๐ฐ), Kazakh (๐ฐ๐ฟ), Armenian (๐ฆ๐ฒ), Georgian (๐ฌ๐ช), Icelandic (๐ฎ๐ธ), Belarusian (๐ง๐พ), Bosnian (๐ง๐ฆ), Malayalam (๐ฎ๐ณ), Macedonian (๐ฒ๐ฐ), Swahili (๐น๐ฟ), Afrikaans (๐ฟ๐ฆ), Telugu (๐ฎ๐ณ), Arabic Romanized (๐ธ๐ฆ), Mongolian (๐ฒ๐ณ), Latin (๐ฎ๐น), Nepali (๐ณ๐ต), Sinhalese (๐ฑ๐ฐ), Marathi (๐ฎ๐ณ), Kannada (๐ฎ๐ณ), Somali (๐ธ๐ด), Welsh (๐ด), Javanese (๐ฎ๐ฉ), Pashto (๐ฆ๐ซ), Uzbek (๐บ๐ฟ), Gujarati (๐ฎ๐ณ), Khmer (๐ฐ๐ญ), Urdu Romanized (๐ต๐ฐ), Amharic (๐ช๐น), Bengali Romanized (๐ง๐ฉ), Punjabi (๐ฎ๐ณ), Galician (๐ช๐ธ), Hausa (๐ณ๐ฌ), Sanskrit (๐ฎ๐ณ), Basque (๐ช๐ธ), Burmese (๐ฒ๐ฒ), Sundanese (๐ฎ๐ฉ), Oriya (๐ฎ๐ณ), Haitian (๐ญ๐น), Lao (๐ฑ๐ฆ), Kyrgyz (๐ฐ๐ฌ), Breton (๐ซ๐ท), Irish (๐ฎ๐ช), Yoruba (๐ณ๐ฌ), Esperanto (๐), Tamil Romanized (๐ฎ๐ณ), Zulu (๐ฟ๐ฆ), Tigrinya (๐ช๐ท), Telugu Romanized (๐ฎ๐ณ), Kurdish (๐น๐ท), Oromo (๐ช๐น), Xhosa (๐ฟ๐ฆ), Scottish Gaelic (๐ฌ๐ง), Igbo (๐ณ๐ฌ), Assamese (๐ฎ๐ณ), Ganda (๐บ๐ฌ), Wolof (๐ธ๐ณ), Western Frisian (๐ณ๐ฑ), Tswana (๐ง๐ผ), Fula (๐ธ๐ณ), Guaranรญ (๐ต๐พ), Sindhi (๐ต๐ฐ), Lingala (๐จ๐ฉ), Bambara (๐ฒ๐ฑ), Inuktitut (๐จ๐ฆ), Kongo (๐จ๐ฉ), Quechua (๐ต๐ช), Swati (๐ธ๐ฟ), Unassigned (๐) | Github | ๐ |
Introducing L2M3, A Multilingual Medical Large Language Model to Advance Health Equity in Low-Resource Regions | 2024 | English (🇺🇸), Chinese (🇨🇳), Telugu (🇮🇳), Hindi (🇮🇳), Arabic (🇸🇦), Swahili (🇹🇿), Bengali (🇧🇩) | ๐ | ๐ |
Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning | 2022 | Afrikaans (🇿🇦), Amharic (🇪🇹), Hausa (🇳🇬), Igbo (🇳🇬), Malagasy (🇲🇬), Chichewa (🇲🇼), Oromo (🇪🇹), Naija (🇳🇬), Kinyarwanda (🇷🇼), Kirundi (🇧🇮), Shona (🇿🇼), Somali (🇸🇴), Sesotho (🇱🇸), Swahili (🇹🇿), isiXhosa (🇿🇦), Yoruba (🇳🇬), isiZulu (🇿🇦), English (🇬🇧), French (🇫🇷), Arabic (🇸🇦), Lingala (🇨🇩), Luganda (🇺🇬), Luo (🇰🇪), Wolof (🇸🇳) | GitHub | 🤗 |
MuRIL: Multilingual Representations for Indian Languages | 2021 | Assamese (🇮🇳), Bengali (🇧🇩), Gujarati (🇮🇳), Hindi (🇮🇳), Kannada (🇮🇳), Kashmiri (🇮🇳), Malayalam (🇮🇳), Marathi (🇮🇳), Nepali (🇳🇵), Oriya (🇮🇳), Punjabi (🇮🇳), Sanskrit (🇮🇳), Sindhi (🇵🇰), Tamil (🇮🇳), Telugu (🇮🇳), Urdu (🇮🇳), English (🇬🇧) | ๐ | ๐ |
From English to Foreign Languages: Transferring Pretrained Language Models | 2020 | French (🇫🇷), Russian (🇷🇺), Arabic (🇦🇪), Chinese (🇨🇳), Hindi (🇮🇳), Vietnamese (🇻🇳) | ๐ | ๐ |
- PaLI-X: On Scaling up a Multilingual Vision and Language Model (2023)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model (2023)
- Learning to Scale Multilingual Representations for Vision-Language Tasks (2020)
- A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias (2024)
- Towards Building Multilingual Language Model for Medicine (2024)
- What is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models (2024)
- All Languages Matter: On the Multilingual Safety of Large Language Models (2024)
- Multilingual Jailbreak Challenges in Large Language Models (2024)
- EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation (2024)
- Chat2VIS: Fine-Tuning Data Visualisations using Multilingual Natural Language Text and Pre-Trained Large Language Models (2024)
- How Linguistically Fair Are Multilingual Pre-Trained Language Models? (2021)
- IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages (2020)
- Are Multilingual Models the Best Choice for Moderately Under-Resourced Languages? A Comprehensive Assessment for Catalan (2021)
- You Reap What You Sow: On the Challenges of Bias Evaluation Under Multilingual Settings (2022)
- How to Adapt Your Pretrained Multilingual Model to 1600 Languages (2021)
- MEGA: Multilingual Evaluation of Generative AI / GitHub (2023)
- XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models (2023)