[Appendix D] apply gradient clipping after warmup steps to avoid exploding gradients #445
Nevermetyou65 asked this question in Q&A (Unanswered)
-
Hi,

Appendix D's modified train_model function includes code for applying gradient clipping. I can understand why we want to apply this technique after the warm-up period, but I don't understand why it helps avoid exploding gradients. Could someone explain the reason to me?
-
Good question. It's basically to prevent the gradients from becoming too large, because gradients that are too large can cause weight updates that are too large, which can overshoot the (local) loss minimum you are aiming for. During warmup the learning rate is still small, so the updates are small anyway; but once the learning rate reaches its full value, large gradients become much more of a problem.
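
In case code makes it easier to follow, here is a minimal, self-contained sketch of the idea: a toy model with a linear warmup schedule, where the gradient norm is clipped only once warmup is over. The model, batch sizes, and names like `peak_lr` and `warmup_steps` are illustrative placeholders, not the book's exact train_model code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)  # tiny stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0)
peak_lr, warmup_steps, total_steps = 1e-3, 5, 20

for step in range(total_steps):
    # Linear warmup: ramp the learning rate from 0 up to peak_lr.
    lr = peak_lr * min(1.0, (step + 1) / warmup_steps)
    for group in optimizer.param_groups:
        group["lr"] = lr

    x = torch.randn(8, 10)  # dummy batch
    y = torch.randn(8, 1)
    loss = nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad()
    loss.backward()

    # After warmup the learning rate is at its peak, so a single large
    # gradient could cause an oversized weight update. Capping the global
    # gradient norm at 1.0 bounds the update size.
    if step + 1 > warmup_steps:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
```

The key line is `clip_grad_norm_`: it rescales all gradients so their combined norm never exceeds `max_norm`, which is why it directly limits how far a single step can move the weights.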