
Some demos on how to configure offloading tensors to an NVMe device #6752

Open
niebowen666 opened this issue Nov 15, 2024 · 2 comments

Comments

@niebowen666

Dear authors:
I am wondering if I could get some demo configs for training a language model with ZeRO-Infinity?
I am quite confused about how to configure "offload_param" and "offload_optimizer".
Thanks!
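For reference, a ZeRO-Infinity config that offloads to NVMe generally follows the sketch below. The nvme_path "/local_nvme" is a hypothetical mount point, and the buffer and aio values are illustrative, not tuned recommendations:

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # hypothetical directory on the NVMe drive
            "pin_memory": True,
            "buffer_count": 4
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        }
    },
    # Asynchronous I/O settings used by the NVMe offload engine
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True
    }
}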

@niebowen666 (Author)

Thanks for your reply!
But I have run into some new issues.
I used the following config while training LLaMA:

ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.00006,
            "betas": [0.9, 0.95],
            "weight_decay": 0.01
        }
    },
    "zero_optimization": {
        "stage": 3,  # ZeRO-3: partition parameters, gradients, and optimizer states
        "contiguous_gradients": True,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9,
        "offload_optimizer": {
            "device": "cpu"  # offload optimizer states to CPU memory
        },
        "offload_param": {
            "device": "cpu"  # offload parameters to CPU memory
        }
    }
}
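This dict is handed to DeepSpeed in the usual way; a minimal sketch, with model construction and the training loop omitted and `model` assumed to be an already-built torch.nn.Module:

import deepspeed

# Wrap the model with the DeepSpeed engine using the config above.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)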

I get an "out of memory" error when I set train_batch_size to 64.
I also read your source code, and one thing confuses me:
In runtime/engine.py, lines 314 to 322, it looks as if configuring the "optimizer" as Adam means _configure_zero_optimizer is never run, so the tensors generated by the model would not be offloaded to CPU.
Is my understanding right?
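One knob worth noting for the OOM at batch size 64: DeepSpeed can split the global batch into micro-batches via gradient accumulation, which lowers per-step activation memory while keeping the effective batch size. A sketch with illustrative values (train_batch_size = micro_batch × accumulation_steps × number of GPUs):

ds_config_batch = {
    # Effective batch of 64 on a single GPU: 8 micro-batch × 8 accumulation steps
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 8
}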
