An up-to-date awesome list of open-source large language models, instruction-tuning and RLHF datasets, and high-performance instruction-tuning codebases.
- Open Large Language Models
- Open Instruction-following Large Language Models
- Open Instruction Tuning and RLHF Datasets
- High-performance Open Instruction Tuning Codebase
Model | Stars | Organization | Language[^1] | Checkpoints |
---|---|---|---|---|
LLaMa[^2] | -- | Meta AI | Multi-Lang | 7B & 13B & 30B & 65B |
OPT | -- | Meta AI | Eng. | 125M & 350M & 1.3B & 2.7B & 6.7B & 13B & 30B & 66B & 175B[^3] |
BLOOM | -- | BigScience | Multi-Lang | 560M & 1.1B & 1.7B & 3B & 7.1B & 176B |
GLM[^4] | -- | Tsinghua | CHI. & Eng. | 2B & 10B & 130B[^5] |
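Most of these checkpoints can be loaded with the Hugging Face `transformers` library. Below is a minimal sketch, assuming the publicly hosted `facebook/opt-1.3b` checkpoint (any other checkpoint from the table with a Hugging Face repo loads the same way):

```python
# Minimal sketch: load one of the open LLMs listed above and generate text.
# "facebook/opt-1.3b" is the publicly hosted OPT 1.3B checkpoint; swap in any
# other Hugging Face-hosted checkpoint from the table.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Open-source language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```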
Model | Stars | Organization | Base Model | Checkpoints |
---|---|---|---|---|
BLOOMZ | -- | BigScience | BLOOM | 560M & 1.1B & 1.7B & 3B & 7.1B & 176B |
Alpaca | -- | Stanford | LLaMa | 7B |
Vicuna | -- | lm-sys | LLaMa | 7B & 13B |
Chinese-Vicuna | -- | Facico | LLaMa | 7B |
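Instruction-tuned checkpoints like these are prompted with natural-language instructions rather than raw continuation text. A minimal sketch, assuming the publicly hosted `bigscience/bloomz-560m` checkpoint (the prompt wording is only an example):

```python
# Minimal sketch: prompt an instruction-following checkpoint from the table above.
# "bigscience/bloomz-560m" is the smallest publicly hosted BLOOMZ checkpoint;
# the instruction text is just an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Translate to English: Je t'aime."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```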
Dataset | Language | #Samples | Annotation Type | Online Link |
---|---|---|---|---|
alpaca_chinese_dataset | Chinese | 52K | GPT-generated & Human-annotated | hikariming/alpaca_chinese_dataset |
BELLE/data | Chinese | 1.5M | GPT-generated | BELLE/data/1.5M |
pCLUE | Chinese | 1.2M | Formatted from Existing Datasets | CLUEbenchmark/pCLUE |
Med-ChatGLM/data | Chinese | 7K | GPT-generated | SCIR-HI/Med-ChatGLM |
COIG | Chinese | 181K | See the COIG paper | BAAI/COIG |
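Several of the datasets above follow the Alpaca-style `instruction` / `input` / `output` schema. The sketch below shows one way to turn such a file into training prompts; the `data.json` path and the prompt template are illustrative assumptions, not part of any specific dataset's tooling:

```python
# Minimal sketch: read an Alpaca-style instruction dataset and build prompts.
# "data.json" is a hypothetical local export; the instruction/input/output
# field names follow the common Alpaca schema used by several datasets above.
import json

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

with open("data.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples[:3]:
    template = PROMPT_WITH_INPUT if sample.get("input") else PROMPT_NO_INPUT
    print(template.format(**sample) + sample["output"])
```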
- DeepSpeed Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
- ColossalChat: A project to implement an LLM with RLHF, powered by the Colossal-AI project.
- LMFlow: An extensible, convenient, and efficient toolbox for finetuning large machine learning models, designed to be user-friendly, speedy and reliable, and accessible to the entire community.
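All three toolkits wrap the same basic supervised instruction-tuning step behind their own launchers and configs. As a rough, framework-agnostic illustration of that step (not the API of DeepSpeed Chat, ColossalChat, or LMFlow), here is a plain Hugging Face `transformers` sketch; the base model, data file, and hyperparameters are placeholders:

```python
# Rough sketch of supervised instruction tuning with plain Hugging Face
# Transformers. This is NOT the API of DeepSpeed Chat, ColossalChat, or LMFlow,
# only the step those toolkits automate; paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/opt-1.3b"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Expects an Alpaca-style JSON file with instruction/input/output fields.
dataset = load_dataset("json", data_files="instructions.json", split="train")

def to_text(example):
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"text": prompt + example["output"]}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = dataset.map(to_text)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-out",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```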
[^1]: Languages used in the pretraining texts. CHI. is short for Chinese; Eng. is short for English.
[^2]: Direct access to the LLaMa models is currently not possible due to Meta AI's policies. However, interested individuals can apply for access by filling out a form, which can be found here. Alternatively, the LLaMa checkpoints can be accessed on the Decapoda Research page on Hugging Face.
[^3]: Direct access to the OPT model with 175B parameters is currently not possible due to Meta AI's policies. However, interested individuals can apply for access by filling out a form, which can be found here.
[^4]: Other GLM models (trained on different pretraining corpora) are available on the THUDM page on Hugging Face. GLM comes in several sizes: 10B / 2B / 515M / 410M / 335M / 110M (English) and 10B / 335M (Chinese). The table above lists the English GLM models.
[^5]: Direct access to the GLM model with 130B parameters is currently not possible due to THUDM's policies. However, interested individuals can apply for access by filling out a form, which can be found here.