
idea: Precision scaling research #127

Open
hahuyhoang411 opened this issue Nov 20, 2024 · 7 comments
Comments

@hahuyhoang411
Contributor

hahuyhoang411 commented Nov 20, 2024

Problem Statement

Hypothesis: Increasing numerical precision during training can improve the performance of small language models (≈1B parameters), potentially enabling them to achieve capabilities comparable to larger models (3B-7B parameters).

Implications

If validated, this hypothesis could:

  • Reduce the computational resources needed for training effective language models
  • Enable broader adoption of smaller, more efficient models
  • Lead to new approaches in optimizer design and implementation

Idea

Reference: https://arxiv.org/pdf/2411.04330
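
To make the setting concrete, below is a minimal, hedged sketch (not code from this issue or the paper) of how one might load the weights in a chosen numerical precision with Hugging Face transformers before training. The model id matches the experiments reported later in this thread; everything else is a placeholder.

```python
# Minimal sketch: choose the numerical precision the model's weights are
# loaded in before training. Assumes the `transformers` library; the
# checkpoint name is the model discussed later in this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full fp32 weights (the higher-precision setting in the hypothesis)
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)

# bf16 weights (the lower-precision baseline)
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```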

@hahuyhoang411 hahuyhoang411 added the type: idea Research, data, any new ideas label Nov 20, 2024
@hahuyhoang411
Contributor Author

hahuyhoang411 commented Nov 21, 2024

@bachvudinh, as co-author, please help me add the exploded runs and the MMLU scores.

We ran some tests training Llama 3.2 1B Instruct to check whether fp32 can outperform bf16; a rough training-step sketch follows the results table below.

| Precision | Learning Rate | Weight Decay | Global Batch Size | Trained Samples | Final Loss | MMLU |
|-----------|---------------|--------------|-------------------|-----------------|------------|------|
| fp32 | 3e-4   | 0.01 | 96 | 0.2M | 1.24     |       |
| fp32 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.22     |       |
| bf16 | 3e-4   | 0.01 | 96 | 0.2M | exploded |       |
| bf16 | 2e-4   | 0.01 | 96 | 0.2M | exploded |       |
| bf16 | 2.5e-4 | 0.01 | 96 | 0.2M | 1.26     |       |
| fp32 | 3e-4   | 0.2  | ?  | 0.5M | 0.67     | 25.54 |
| fp32 | 1e-4   | 0.05 | ?  | 1.7M | 1.32     | 23.18 |
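
For reference, here is a rough sketch of the kind of fp32-vs-bf16 comparison described above. This is not the actual training script: the model and data are toy placeholders, and only the learning rate and weight decay are taken from the table.

```python
# Toy sketch contrasting a full-fp32 training step with a bf16 autocast step.
# The model and data are placeholders; lr/weight decay mirror the table above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.01)
loss_fn = nn.MSELoss()
x, y = torch.randn(96, 256), torch.randn(96, 256)  # "global batch size" 96

USE_BF16 = False  # flip to True for the lower-precision setting

for step in range(3):
    optimizer.zero_grad()
    if USE_BF16:
        # bf16 autocast: matmuls run in bfloat16, optimizer state stays fp32
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            loss = loss_fn(model(x), y)
    else:
        # plain fp32: activations, gradients, and accumulations stay in float32
        loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```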

For the fp32 training config with learning rate 1e-4 and weight decay 0.05, there are some odd MMLU results at checkpoint steps 1000, 2000, and 3000 (a hedged evaluation sketch follows the screenshots):

  • step 1000:
    Screenshot from 2024-11-19 10-42-42

  • step 2000:
    Screenshot from 2024-11-19 10-56-42

  • step 3000:
    Screenshot from 2024-11-19 11-04-04

@hahuyhoang411
Contributor Author

hahuyhoang411 commented Nov 21, 2024

fp32 3e-4
Screenshot 2024-11-21 at 01 09 02

@hahuyhoang411
Contributor Author

fp32 2.5e-4
Screenshot 2024-11-21 at 01 09 49

@hahuyhoang411
Contributor Author

bf16 2.5e-4
Screenshot 2024-11-21 at 01 10 39

@hahuyhoang411
Contributor Author

hahuyhoang411 commented Nov 21, 2024

fp32 0.5M
Screenshot 2024-11-21 at 01 19 44

@hahuyhoang411
Contributor Author

fp32 1.7M
Screenshot 2024-11-21 at 01 20 14

@tikikun
Collaborator

tikikun commented Nov 21, 2024

A few pending issues:

  • It's quite clear that, even though training is more stable, it is still not converging and still has hiccups, which indicates the optimizer itself is not able to give directions good enough for the optimization process
  • Since we are currently hitting a wall on the optimizer itself, we will not continue scaling up the precision

Next steps:

  • @tikikun to do his own research on optimizers
  • Use the cluster to train Qwen 32B Instruct

cc @0xSage in case you're interested

@tikikun tikikun assigned tikikun and unassigned hahuyhoang411 and bachvudinh Nov 21, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Jan & Cortex Nov 22, 2024
@tikikun tikikun moved this from Investigating to Icebox in Jan & Cortex Nov 25, 2024
@bachvudinh bachvudinh added this to the Icebox milestone Nov 25, 2024