
grad norm too large (gradient explosion) #6249

Closed
1 task done
Zbaoli opened this issue Dec 5, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments

Zbaoli commented Dec 5, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

Name: llamafactory
Version: 0.9.2.dev0

Reproduction

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}

Expected behavior

[screenshot: training log showing grad_norm values in the hundreds of thousands]

The grad norm reaches several hundred thousand. I don't understand why this happens; is gradient clipping not taking effect?
Normally the grad norm should be below 10.

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label Dec 5, 2024
hiyouga (Owner) commented Dec 5, 2024

use bf16
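
fp16 overflows easily because its 5-bit exponent caps representable values at about 65504, which is why fp16 training relies on loss scaling and can still produce inf/NaN gradients; bf16 keeps float32's 8-bit exponent range. A minimal sketch of the difference, assuming PyTorch is available:

import torch

# fp16: 5-bit exponent, overflows past ~65504, hence the need for loss scaling
print(torch.finfo(torch.float16).max)   # 65504.0

# bf16: float32's 8-bit exponent, trading mantissa precision for range,
# so gradient overflow is far less likely
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38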

hiyouga closed this as completed Dec 5, 2024
hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Dec 5, 2024
Zbaoli (Author) commented Dec 6, 2024

bf16 is set to true, and gradient_clipping is set to 1 as well; see the log output:

[2024-12-05 05:20:48,838] [INFO] [config.py:988:print]   zero_enabled ................. True
[2024-12-05 05:20:48,838] [INFO] [config.py:988:print]   zero_force_ds_cpu_optimizer .. True
[2024-12-05 05:20:48,838] [INFO] [config.py:988:print]   zero_optimization_stage ...... 2
[2024-12-05 05:20:48,838] [INFO] [config.py:974:print_user_config]   json = {
    "train_batch_size": 64, 
    "train_micro_batch_size_per_gpu": 1, 
    "gradient_accumulation_steps": 32, 
    "gradient_clipping": 1.0, 
    "zero_allow_untested_optimizer": true, 
    "fp16": {
        "enabled": false, 
        "loss_scale": 0, 
        "loss_scale_window": 1000, 
        "initial_scale_power": 16, 
        "hysteresis": 2, 
        "min_loss_scale": 1
    }, 
    "bf16": {
        "enabled": true
    }, 
    "zero_optimization": {
        "stage": 2, 
        "offload_optimizer": {
            "device": "cpu", 
            "pin_memory": true
        }, 
        "allgather_partitions": true, 
        "allgather_bucket_size": 5.000000e+08, 
        "overlap_comm": true, 
        "reduce_scatter": true, 
        "reduce_bucket_size": 5.000000e+08, 
        "contiguous_gradients": true, 
        "round_robin_gradients": true
    }, 
    "steps_per_print": inf
}
{'loss': 1.5237, 'grad_norm': 242584.1252844052, 'learning_rate': 3.0303030303030305e-06, 'epoch': 0.15}
{'loss': 0.8214, 'grad_norm': 20728.77593919863, 'learning_rate': 6.060606060606061e-06, 'epoch': 0.3}
{'loss': 0.6033, 'grad_norm': 23017.38460381631, 'learning_rate': 9.090909090909091e-06, 'epoch': 0.46}
{'loss': 0.5532, 'grad_norm': 19155.674684672424, 'learning_rate': 9.985826900114391e-06, 'epoch': 0.61}
{'loss': 0.5194, 'grad_norm': 7820.733340574591, 'learning_rate': 9.916600996652726e-06, 'epoch': 0.76}
{'loss': 0.5978, 'grad_norm': 14871.576720585146, 'learning_rate': 9.790518603735191e-06, 'epoch': 0.91}
{'loss': 0.6351, 'grad_norm': 41228.25537177143, 'learning_rate': 9.609037761631552e-06, 'epoch': 1.07}
{'loss': 0.4846, 'grad_norm': 19792.373880171628, 'learning_rate': 9.374257148593824e-06, 'epoch': 1.22}
{'loss': 0.4728, 'grad_norm': 9547.522599148431, 'learning_rate': 9.088891811350164e-06, 'epoch': 1.37}
{'loss': 0.4904, 'grad_norm': 37734.130961736744, 'learning_rate': 8.756241767798934e-06, 'epoch': 1.53}

Zbaoli (Author) commented Dec 6, 2024

It was probably a hardware problem: training works fine after switching to a different GPU, and the GPU that produced the exploding grad norm is now reporting ECC errors.
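
For anyone debugging a similar symptom, one quick check is to dump the GPU's ECC counters via nvidia-smi; a minimal sketch, assuming the NVIDIA driver utilities are installed:

import subprocess

# Print ECC error counters for all visible GPUs; nonzero uncorrectable
# (double-bit) counts point to failing GPU memory.
subprocess.run(["nvidia-smi", "-q", "-d", "ECC"], check=True)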
