Stage | Pre-Training | Supervised Fine-Tuning | Reward Modeling | Reinforcement Learning |
---|---|---|---|---|
Training Data | Trillions of tokens from websites, books, etc. | Prompt-Response Pairs for various tasks | Response Preferences | Prompts |
Modeling Method | Language Modeling (negative log-likelihood) | Language Modeling (negative log-likelihood) | Binary Classification or Regression | Reinforcement Learning (PPO) |
Model | Base Model | SFT Model | Reward Model | RL Model |
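
Reward modeling over response preferences is commonly trained as a pairwise (Bradley-Terry) binary classification: the reward assigned to the preferred response should exceed the reward assigned to the rejected one. A minimal PyTorch sketch of that loss (the function name and tensor shapes are illustrative assumptions, not from the table):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the preferred
    response above the reward of the rejected one.

    reward_chosen / reward_rejected: scalar rewards per pair, shape (batch,).
    """
    # -log(sigmoid(r_chosen - r_rejected)): minimized when chosen >> rejected.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: rewards produced by a reward-model head for two preference pairs.
loss = reward_model_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
```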
- How a language model is trained:
  - Pre-training: unsupervised pre-training on massive amounts of text (the "pre-school" stage).
  - Alignment (strictly speaking, both SFT and RLHF below are alignment techniques):
    - SFT (supervised fine-tuning): supervised training on a small amount of labeled data (the "school" stage).
    - RLHF: reinforcement learning driven by real human feedback.
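
Pre-training and SFT optimize the same next-token negative log-likelihood; only the training data differs. A minimal sketch of that objective (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token negative log-likelihood.

    logits: (batch, seq_len, vocab_size) model outputs.
    tokens: (batch, seq_len) input token ids; targets are the inputs shifted left.
    """
    shift_logits = logits[:, :-1, :]   # position t predicts token t+1
    shift_targets = tokens[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```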
- Llama introduction
- Llama text/chat completion
- RMSNorm & SwiGLU (see the sketch after this list)
- RoPE
- RoPE & apply_rotary_emb (see the sketch after this list)
- KV cache
- GQA
- KV cache & the generate process (covered together with GQA in the sketch after this list)
- Llama 3: the strongest small model
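
For the RMSNorm & SwiGLU item, a minimal PyTorch sketch following the structure of Llama's layers (the w1/w2/w3 weight names mirror the Llama reference code; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the activations,
    with a learned gain and no mean subtraction or bias (unlike LayerNorm)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: silu(w1 x) gated elementwise by (w3 x),
    then projected back down with w2, as in the Llama FFN."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))
```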
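
For RoPE & apply_rotary_emb, a minimal sketch of rotary position embeddings in the complex-number formulation used by the Llama reference code (function names follow that code; shapes are assumptions):

```python
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    """Complex rotation factors e^{i * m * theta_k} for every position m
    and frequency theta_k = theta^(-2k / head_dim)."""
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, freqs)                      # (seq_len, head_dim // 2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of q/k by a position-dependent angle,
    so the q.k dot product depends only on relative position.

    x: (batch, seq_len, n_heads, head_dim), freqs_cis: (seq_len, head_dim // 2).
    """
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_rotated = x_complex * freqs_cis[None, :, None, :]
    return torch.view_as_real(x_rotated).flatten(-2).type_as(x)
```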
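
For KV cache, GQA, and the generate process, a minimal sketch: during incremental decoding each step writes the new token's keys/values into a pre-allocated cache and attends over the whole prefix, while GQA expands a smaller set of key/value heads to match the query heads (repeat_kv mirrors the Llama reference code; the KVCache class is an illustrative assumption):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """GQA: expand n_kv_heads key/value heads so each group is shared by
    n_rep query heads. x: (batch, seq_len, n_kv_heads, head_dim)."""
    if n_rep == 1:
        return x
    b, s, h, d = x.shape
    return x[:, :, :, None, :].expand(b, s, h, n_rep, d).reshape(b, s, h * n_rep, d)

class KVCache:
    """Pre-allocated key/value cache for incremental decoding: each step
    writes the new tokens' k/v at start_pos and returns the full prefix."""
    def __init__(self, batch: int, max_seq_len: int, n_kv_heads: int, head_dim: int):
        self.k = torch.zeros(batch, max_seq_len, n_kv_heads, head_dim)
        self.v = torch.zeros(batch, max_seq_len, n_kv_heads, head_dim)

    def update(self, start_pos: int, k: torch.Tensor, v: torch.Tensor):
        seq_len = k.size(1)
        self.k[:, start_pos:start_pos + seq_len] = k
        self.v[:, start_pos:start_pos + seq_len] = v
        # Attention at this step runs over everything seen so far.
        return self.k[:, :start_pos + seq_len], self.v[:, :start_pos + seq_len]
```

The cache trades memory for speed: without it, each generated token would recompute keys and values for the entire prefix.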