
grad norm too large (gradient explosion) #6249

Closed
1 task done
Zbaoli opened this issue Dec 5, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments

Zbaoli commented Dec 5, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

Name: llamafactory
Version: 0.9.2.dev0

Reproduction

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}

Expected behavior

[screenshot: training log showing grad_norm values in the hundreds of thousands]

The grad norm reaches several hundred thousand. I don't understand why this happens; is gradient clipping not taking effect?
Normally the grad norm should be below 10.

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label Dec 5, 2024
hiyouga (Owner) commented Dec 5, 2024

use bf16
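
fp16 overflows easily because its 5-bit exponent caps representable values at about 65504, which is why fp16 training relies on loss scaling and can still produce inf/NaN gradients; bf16 keeps float32's 8-bit exponent range. A minimal sketch of the difference, assuming PyTorch is available:

import torch

# fp16: 5-bit exponent, overflows past ~65504, hence the need for loss scaling
print(torch.finfo(torch.float16).max)   # 65504.0

# bf16: float32's 8-bit exponent, trading mantissa precision for range,
# so gradient overflow is far less likely
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38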

hiyouga closed this as completed Dec 5, 2024
hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Dec 5, 2024
Zbaoli (Author) commented Dec 6, 2024

bf16 is set to true, and gradient_clipping is set to 1 as well; see the log output:

[2024-12-05 05:20:48,838] [INFO] [config.py:988:print]   zero_enabled ................. True
[2024-12-05 05:20:48,838] [INFO] [config.py:988:print]   zero_force_ds_cpu_optimizer .. True
[2024-12-05 05:20:48,838] [INFO] [config.py:988:print]   zero_optimization_stage ...... 2
[2024-12-05 05:20:48,838] [INFO] [config.py:974:print_user_config]   json = {
    "train_batch_size": 64, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 32, 
    "gradient_clipping": 1.0, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true, 
        "round_robin_gradients": true
    }, 
    "steps_per_print": inf
}
{'loss': 1.5237, 'grad_norm': 242584.1252844052, 'learning_rate': 3.0303030303030305e-06, 'epoch': 0.15}
{'loss': 0.8214, 'grad_norm': 20728.77593919863, 'learning_rate': 6.060606060606061e-06, 'epoch': 0.3}
{'loss': 0.6033, 'grad_norm': 23017.38460381631, 'learning_rate': 9.090909090909091e-06, 'epoch': 0.46}
{'loss': 0.5532, 'grad_norm': 19155.674684672424, 'learning_rate': 9.985826900114391e-06, 'epoch': 0.61}
{'loss': 0.5194, 'grad_norm': 7820.733340574591, 'learning_rate': 9.916600996652726e-06, 'epoch': 0.76}
{'loss': 0.5978, 'grad_norm': 14871.576720585146, 'learning_rate': 9.790518603735191e-06, 'epoch': 0.91}
{'loss': 0.6351, 'grad_norm': 41228.25537177143, 'learning_rate': 9.609037761631552e-06, 'epoch': 1.07}
{'loss': 0.4846, 'grad_norm': 19792.373880171628, 'learning_rate': 9.374257148593824e-06, 'epoch': 1.22}
{'loss': 0.4728, 'grad_norm': 9547.522599148431, 'learning_rate': 9.088891811350164e-06, 'epoch': 1.37}
{'loss': 0.4904, 'grad_norm': 37734.130961736744, 'learning_rate': 8.756241767798934e-06, 'epoch': 1.53}

Zbaoli (Author) commented Dec 6, 2024

It was probably a hardware problem: training works fine after switching to a different GPU, and the GPU that produced the exploding grad norm is now reporting ECC errors.
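
For anyone debugging a similar symptom, one quick check is to dump the GPU's ECC counters via nvidia-smi; a minimal sketch, assuming the NVIDIA driver utilities are installed:

import subprocess

# Print ECC error counters for all visible GPUs; nonzero uncorrectable
# (double-bit) counts point to failing GPU memory.
subprocess.run(["nvidia-smi", "-q", "-d", "ECC"], check=True)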
