Describe the bug
When running DeepSpeed with torch.compile and activation checkpointing for LLAMA2 model training (fine-tuning), we found accuracy issues if the activation checkpoint function is not disabled for the compiler.
DeepSpeed implemented a workaround (WA) for this accuracy issue: #5590
However, the root cause was never identified, and a proper fix addressing the root cause should be provided instead of a workaround.
To Reproduce
The following conditions need to be satisfied to reproduce (a minimal setup sketch follows the list):
We used the LLAMA2 7B model
torch.compile is used for compiling the model being trained
PEFT LoRA optimization is applied to the model
DeepSpeed ZeRO3 training
Activation checkpointing is enabled
Remove the compiler.disable decorator from the DeepSpeed checkpoint function.
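As a rough illustration of this setup, here is a minimal wiring sketch. The model name, LoRA target modules, and ds_config values are illustrative assumptions rather than the exact script used for this report; engine.compile() is the DeepSpeed engine API for torch.compile in recent releases, and torch.compile can also be applied to the module directly. The last condition (removing the compiler.disable decorator) is a change in the DeepSpeed source, not in the training script.

import deepspeed
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LLAMA2 7B with activation checkpointing enabled and PEFT LoRA applied
# (target modules chosen for illustration; the report exercises the DeepSpeed
# checkpoint function, and how checkpointing is wired in is setup-specific)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))

# DeepSpeed ZeRO3 training
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}
engine, _, _, _ = deepspeed.initialize(model=model, model_parameters=model.parameters(), config=ds_config)

# torch.compile the model being trained
engine.compile()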
Expected behavior
The accuracy with or without the compiler.disable decorator on the DeepSpeed checkpoint function should be the same.
In reality, without the compiler.disable decorator on the DeepSpeed checkpoint function, the final accuracy achieved is much lower.
Investigation
We recently conducted an in-depth investigation and analysis of this DeepSpeed issue and found its root cause. This bug entry tracks the problem we found behind the scenes, so that we can provide a fix PR that can be understood and accepted.
What is happening:
When LoRA and activation checkpointing are in use, the code line "x = x.to(lora_A.weight.dtype)" triggers a special path in ZeRO3 parameter offloading: the ZeROOrderedDict __getitem__ method.
result = self.base_layer(x, *args, **kwargs)
lora_A = self.lora_A[active_adapter]
lora_B = self.lora_B[active_adapter]
dropout = self.lora_dropout[active_adapter]
scaling = self.scaling[active_adapter]
x = x.to(lora_A.weight.dtype)
result = result + lora_B(lora_A(dropout(x))) * scaling
ZeROOrderedDict __getitem__ calls into param.all_gather if the param data is not available.
if hasattr(param, "ds_status") and getattr(param, "ds_status") == ZeroParamStatus.NOT_AVAILABLE:
    if self._parent_module._parameters._in_forward:
        register_external_parameter(FWD_MODULE_STACK[-1], param)
        param.all_gather()
        print_rank_0(f'Registering external parameter from getter {key} ds_id = {param.ds_id}', force=False)
When this happens in a torch.compile context, the _allgather_params function called by param.all_gather eventually gets compiled by torch.compile as well.
torch.compile then compiles the following line of code in _allgather_params:
param.data = replicated_tensor.data
Dynamo traces the param.data = replicated_tensor.data assignment into param.set_(replicated_tensor.detach()).
After set_ has been called on a parameter this way, reduce_partition_and_remove_grads (registered as a hook on the parameter's AccumulateGrad node) is no longer invoked for it, so its gradient data is wrong, which causes the accuracy issue.
Based on the PyTorch community discussion in pytorch/pytorch#139742, register_post_accumulate_grad_hook is the robust API for this work, instead of hooks registered on the AccumulateGrad node.
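To illustrate the suggested direction, here is a minimal standalone sketch (my own illustration, not DeepSpeed code) contrasting the two hook styles on a toy parameter; it requires PyTorch 2.1+ for register_post_accumulate_grad_hook.

import torch

param = torch.nn.Parameter(torch.randn(4))

# 1) Hook on the AccumulateGrad autograd node ("AccuNode"), obtained through a
#    temporary expand_as view; this is the style ZeRO3 uses today for
#    reduce_partition_and_remove_grads, and per the analysis above it is the
#    hook that stops firing once dynamo rewrites the .data assignment into set_().
grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]
grad_acc.register_hook(lambda *unused: print("AccumulateGrad hook fired"))

# 2) register_post_accumulate_grad_hook is attached to the parameter itself and
#    runs after the gradient has been accumulated into param.grad.
param.register_post_accumulate_grad_hook(
    lambda p: print("post-accumulate hook fired, grad norm:", p.grad.norm().item()))

# Both hooks fire during backward in this toy example.
(param * 2).sum().backward()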
I have verified that register_post_accumulate_grad_hook works and solves the accuracy issue in the original environment where the issue was reproduced.