Gradient caching vs Model dropout #12
In short, in the second for loop, for every sub-batch's query and passage loss backward, you put the recomputed query and passage embeddings back into their positions in the original batch and calculate the gradient for the current query/passage, so you can make sure the dropout behavior doesn't change your gradient.
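To make that second loop concrete, here is a rough, self-contained sketch under illustrative assumptions (the tiny `encoder`, the `sub_batches` inputs, and the `cached_rep_grads` tensors are stand-ins, not objects from the repository): each sub-batch is re-encoded with autograd enabled, and the representation gradient cached from the full-batch loss is back-propagated through it.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a tiny encoder, two sub-batches, and the
# per-sub-batch representation gradients cached from the full-batch loss.
encoder = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.1), nn.Linear(16, 8))
sub_batches = [torch.randn(4, 16), torch.randn(4, 16)]
cached_rep_grads = [torch.randn(4, 8), torch.randn(4, 8)]

for inputs, rep_grad in zip(sub_batches, cached_rep_grads):
    # In the real training code this forward runs inside the snapshotted
    # random-state context, so dropout draws the same masks as the first pass.
    reps = encoder(inputs)               # second forward pass, autograd on
    # A dot-product surrogate back-propagates the cached gradient; it yields
    # the same parameter gradients as reps.backward(gradient=rep_grad).
    surrogate = torch.sum(reps * rep_grad)
    surrogate.backward()                 # gradients accumulate in encoder params
```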
In our training code, the random states are snapshotted using the RandContext class (lines 53 to 69 in 79e1fe0).
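The idea behind such a class is to snapshot the CPU and per-device CUDA RNG states during the first (representation-only) pass and restore them around the second forward pass, so dropout draws identical masks both times. A minimal sketch of this kind of context manager (not necessarily identical to the linked lines) could look like:

```python
import torch
from torch.utils.checkpoint import get_device_states, set_device_states

class RandContext:
    """Snapshot the CPU and CUDA RNG states; restore them when entered."""

    def __init__(self, *tensors):
        self.cpu_state = torch.get_rng_state()
        # Record the RNG state of every GPU the given tensors live on.
        self.gpu_devices, self.gpu_states = get_device_states(*tensors)

    def __enter__(self):
        # Fork the global RNG so restoring the old states does not leak out.
        self._fork = torch.random.fork_rng(devices=self.gpu_devices, enabled=True)
        self._fork.__enter__()
        torch.set_rng_state(self.cpu_state)
        set_device_states(self.gpu_devices, self.gpu_states)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._fork.__exit__(exc_type, exc_val, exc_tb)
        self._fork = None
```

Usage would be roughly: create `ctx = RandContext(*model_inputs)` right before the first forward pass of a sub-batch, then run the second forward pass of that sub-batch inside `with ctx:`.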
Oh, okay. I was using DeepSpeed + gradient caching, so the model is wrapped in a DeepSpeed-defined object and RandContext doesn't work on my side. But it's good to learn from your code :)
To recap the original question: GC-DPR has two steps, a first gradient-free pass that computes all query/passage representations and caches the representation gradients from the full-batch contrastive loss, and a second pass that re-encodes each sub-batch with autograd enabled to back-propagate those cached gradients. However, during the computation there might be one issue: the dropout masks drawn in the second pass can differ from those in the first, which is exactly what RandContext addresses.
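To see the mismatch concretely, here is a small, self-contained demonstration (illustrative model and shapes only): without restoring the RNG state, two forward passes through a dropout layer disagree; with the snapshotted state restored, they match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.1))
model.train()                      # dropout active, as during training
x = torch.randn(4, 8)

state = torch.get_rng_state()      # snapshot before the first pass
y_first = model(x)                 # "representation" pass
y_naive = model(x)                 # naive second pass: new dropout mask
torch.set_rng_state(state)         # restore the snapshot
y_replayed = model(x)              # second pass with the original mask

print(torch.allclose(y_first, y_naive))     # almost surely False
print(torch.allclose(y_first, y_replayed))  # True
```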