# 37 Implementation details of PPO #17
Our original minibatch shape defaults were bad. I did a sweep over a couple of values, monitoring performance, wall-clock time and GPU utilisation (on an RTX 4090). Running with the following options:

seemed to make training go quite a bit faster (40% or so on the RTX 4090), and the performance metrics seem roughly the same or maybe a bit better. So we set this as the new default.
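Below is a minimal sketch (not the repo's actual code) of how a PureJaxRL-style PPO update turns a rollout into minibatches, just to make the shape being swept concrete. The names `num_envs`, `num_steps` and `num_minibatches` are illustrative, and the specific values from the sweep above are not reproduced here.

```python
import jax
import jax.numpy as jnp


def make_minibatches(rng, batch, num_envs, num_steps, num_minibatches):
    """Shuffle a rollout pytree and split it into (num_minibatches, minibatch_size, ...)."""
    batch_size = num_envs * num_steps
    assert batch_size % num_minibatches == 0, "batch must split evenly into minibatches"
    permutation = jax.random.permutation(rng, batch_size)
    # Flatten the (num_steps, num_envs, ...) rollout leaves to (batch_size, ...).
    flat = jax.tree_util.tree_map(
        lambda x: x.reshape((batch_size,) + x.shape[2:]), batch
    )
    # Shuffle, then carve the leading axis into minibatches.
    shuffled = jax.tree_util.tree_map(lambda x: jnp.take(x, permutation, axis=0), flat)
    return jax.tree_util.tree_map(
        lambda x: x.reshape((num_minibatches, -1) + x.shape[1:]), shuffled
    )
```

How the batch is carved up determines the size of each kernel launch and the number of optimiser steps per epoch, which is where the wall-clock and GPU-utilisation differences seen in the sweep would come from.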
Commit 090dcec introduces four new metrics:
Usman ran some experiments and found that 0.1 works better, solving a stability issue we had encountered. We should use this going forward (along with lr=5e-5 instead of 5e-4). We should still remember to try both clip-range annealing and learning-rate annealing at some point.
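For reference, here is a sketch of how both kinds of annealing could be wired in with optax. The clip coefficient of 0.1 and lr of 5e-5 come from the comment above; `num_updates`, the schedule end values and the gradient-norm clip are placeholders or common PPO defaults, not settings confirmed in this repo.

```python
import optax

num_updates = 1_000  # hypothetical total number of PPO updates (or gradient steps)

# Learning-rate annealing: linear decay from 5e-5 towards 0 over training.
lr_schedule = optax.linear_schedule(
    init_value=5e-5, end_value=0.0, transition_steps=num_updates
)
tx = optax.chain(
    optax.clip_by_global_norm(0.5),  # 0.5 is the usual PPO max-grad-norm default
    optax.adam(learning_rate=lr_schedule),
)

# Clip-range annealing: the clip coefficient is not an optimiser hyper-parameter,
# so it would be evaluated per update and passed into the PPO loss.
clip_schedule = optax.linear_schedule(
    init_value=0.1, end_value=0.0, transition_steps=num_updates
)
clip_coef = clip_schedule(10)  # e.g. the clip range used at update 10
```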
Our baselines use a PPO algorithm that is adapted from PureJaxRL, but it doesn't appear to stick to all of the relevant implementation details from Huang et al., 2022 (henceforth 37Details). We're not necessarily trying to 'replicate PPO', but we should consider trying each of these if/when we get a chance and see whether it makes a big difference for our environments:
- **Core implementation details.**
  - Debug variables: `policy_loss`, `value_loss`, `entropy_loss`, `clipfrac` (fraction of training data triggering the clipped objective) and `approxkl` (KL estimator using `(-logratio).mean()` AND `((ratio - 1) - logratio).mean()`, see linked post). A sketch of these metrics follows this list.
- **Atari-specific implementation details.** Most of these are n/a.
  - `FIRE` action at the start of each episode for environments that don't start until this happens.
- **Hyper-parameters from the original PPO paper** (for some reason these aren't considered part of the 37 details).
  - `0.01` in the original paper, `0.001` instead.
- **Details for continuous action domains.** Most of these are really N/A and not even worth writing down, with one exception.
- **LSTM implementation details** (e.g. the `initialize_carry` method). A carry-initialisation sketch follows this list.
- There is a final detail of multi-discrete action support, which is not necessary for me at the moment.
- **Auxiliary implementation details** (not used by the original PPO implementation, but potentially useful in some situations).
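As referenced in the core-details item above, here is a minimal sketch of how those debug metrics could be computed inside a PPO loss, using the two KL estimators quoted there. The variable names (`new_log_prob`, `old_log_prob`, `clip_coef`) are illustrative, not necessarily the repo's.

```python
import jax.numpy as jnp


def ppo_debug_metrics(new_log_prob, old_log_prob, clip_coef):
    """Debug variables logged per minibatch during the PPO update."""
    logratio = new_log_prob - old_log_prob
    ratio = jnp.exp(logratio)
    # Fraction of samples for which the clipped objective is active.
    clipfrac = jnp.mean(jnp.abs(ratio - 1.0) > clip_coef)
    # The two KL estimators mentioned above.
    approx_kl_old = jnp.mean(-logratio)                 # (-logratio).mean()
    approx_kl_new = jnp.mean((ratio - 1.0) - logratio)  # ((ratio - 1) - logratio).mean()
    return {
        "clipfrac": clipfrac,
        "approx_kl_old": approx_kl_old,
        "approx_kl_new": approx_kl_new,
    }
```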
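And the carry-initialisation sketch referenced in the LSTM item above. The `initialize_carry` API has changed across Flax versions, so treat this as illustrative (it assumes the newer flax.linen instance-method form); the sizes are placeholders, not repo settings.

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

num_envs, obs_dim, hidden = 8, 4, 128  # illustrative sizes

cell = nn.LSTMCell(features=hidden)

# One carry per parallel environment, matching the batched input shape.
carry = cell.initialize_carry(jax.random.PRNGKey(0), (num_envs, obs_dim))
fresh_carry = cell.initialize_carry(jax.random.PRNGKey(0), (num_envs, obs_dim))

# At episode boundaries the per-env carry is typically reset to a fresh one,
# masked by the `done` flags.
done = jnp.zeros((num_envs,), dtype=bool)
carry = jax.tree_util.tree_map(
    lambda c, c0: jnp.where(done[:, None], c0, c), carry, fresh_carry
)
```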