We currently have a working solution for gradient accumulation based on overloading the train_step method of a tf.keras.Model. This is quite straightforward and easy to implement, and we provide a convenience class for this exact task called GAModelWrapper. However, overloading train_step might not be optimal: there are situations where people build advanced models and need to overload train_step themselves. In that case, using our GAModelWrapper would be a bad idea, as it would overwrite their own edits. They would instead have to incorporate what we did in GAModelWrapper into their own train_step, which somewhat defeats the purpose... On the other hand, the optimizer wrapper approach does not seem to work currently, so the model wrapper approach seems to be the only viable option in TF 2 for gradient accumulation. Any thoughts? What could we do to make a solution that is good enough to be integrated into Keras?
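For context, a minimal sketch of the train_step-overloading idea is shown below. This is not the exact GAModelWrapper implementation, just an illustration of the pattern under the assumption that the model is constructed via the functional API (inputs/outputs passed to the constructor), so the trainable variables already exist when the accumulators are created; the class name GAModel and the accum_steps argument are placeholders.

```python
import tensorflow as tf


class GAModel(tf.keras.Model):
    """Sketch: gradient accumulation by overriding train_step (TF 2.x)."""

    def __init__(self, *args, accum_steps=4, **kwargs):
        super().__init__(*args, **kwargs)
        self.accum_steps = tf.constant(accum_steps, dtype=tf.int32)
        self.step_counter = tf.Variable(0, dtype=tf.int32, trainable=False)
        # One non-trainable accumulator per trainable weight.
        self.accum_grads = [
            tf.Variable(tf.zeros_like(v), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        x, y = data
        self.step_counter.assign_add(1)

        # Standard Keras train_step boilerplate for TF 2.x.
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

        # Scale so the accumulated gradient is the mean over accum_steps batches.
        grads = tape.gradient(loss, self.trainable_variables)
        for acc, g in zip(self.accum_grads, grads):
            if g is not None:
                acc.assign_add(g / tf.cast(self.accum_steps, g.dtype))

        # Apply and reset only every accum_steps mini-batches.
        tf.cond(
            tf.equal(self.step_counter, self.accum_steps),
            self._apply_accumulated_gradients,
            lambda: None,
        )

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def _apply_accumulated_gradients(self):
        self.optimizer.apply_gradients(
            zip(self.accum_grads, self.trainable_variables)
        )
        self.step_counter.assign(0)
        for acc in self.accum_grads:
            acc.assign(tf.zeros_like(acc))
```

The drawback described above follows directly from this sketch: the whole technique lives inside train_step, so any user who already overrides train_step cannot use the wrapper without merging the two implementations by hand.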
Replies: 1 comment
In the latest release, v0.3.0, we now support both approaches:
https://github.com/andreped/GradientAccumulator/releases/tag/v0.3.0
The main reason why optimizer wrapping is a better solution with the current state of TF2 is that the multi-GPU distribution strategy is incompatible with our train_step approach. It should, however, work with the optimizer wrapper approach; multi-GPU support is to be added in the future.
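To make the distinction concrete, here is a minimal sketch of the optimizer-wrapper idea. It is not the library's actual API, just the pattern: accumulate gradients over accum_steps calls to apply_gradients and only forward the averaged result to the wrapped optimizer on the last call. The class name and arguments are hypothetical, and the sketch assumes an eager custom training loop; a compile()-compatible version would need to subclass tf.keras.optimizers.Optimizer.

```python
import tensorflow as tf


class GradientAccumulatingOptimizer:
    """Sketch: wrap an optimizer so weights only update every accum_steps calls."""

    def __init__(self, optimizer, accum_steps=4):
        self.optimizer = optimizer      # e.g. tf.keras.optimizers.Adam(1e-3)
        self.accum_steps = accum_steps
        self._step = 0
        self._accum = None              # created on the first apply_gradients call

    def apply_gradients(self, grads_and_vars):
        grads_and_vars = list(grads_and_vars)
        if self._accum is None:
            # One non-trainable accumulator per variable.
            self._accum = [
                tf.Variable(tf.zeros_like(v), trainable=False)
                for _, v in grads_and_vars
            ]

        # Add the scaled micro-batch gradients into the accumulators.
        for acc, (g, _) in zip(self._accum, grads_and_vars):
            if g is not None:
                acc.assign_add(g / self.accum_steps)

        self._step += 1
        if self._step % self.accum_steps == 0:
            # Effective batch step: apply the averaged gradients, then reset.
            self.optimizer.apply_gradients(
                zip(self._accum, [v for _, v in grads_and_vars])
            )
            for acc in self._accum:
                acc.assign(tf.zeros_like(acc))


# Usage in a custom training loop (hypothetical):
# wrapped = GradientAccumulatingOptimizer(tf.keras.optimizers.Adam(1e-3), accum_steps=4)
# grads = tape.gradient(loss, model.trainable_variables)
# wrapped.apply_gradients(zip(grads, model.trainable_variables))
```

Because the accumulation logic sits entirely inside the optimizer rather than in train_step, users keep full control over their own train_step, which is exactly why this direction plays more nicely with custom models and with multi-GPU distribution strategies.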