use alpha not log alpha in autotune #414
Conversation
@SobhanMP Nice catch. Thanks a lot for bringing this to our attention. Indeed, both your code reference and the 2018 SAC paper stipulate the usage of alpha instead of log_alpha. Are you perhaps already testing your change? Otherwise I can get some runs started on my side to check the effect of this change. Chances are the results will not be that much different, but let's definitely stick as close as possible to the correct theory and implementation.
@timoklein Hello there. It seems the same issue also occurs in the discrete variant (Line 314 in f36d4a6). Is there any particular reason?
Also noticed that stable-baselines3 seems to have the same behavior. Attempt at pinging @ARrafin.
```diff
@@ -276,7 +276,7 @@ def get_action(self, x):
         if args.autotune:
             with torch.no_grad():
                 _, log_pi, _ = actor.get_action(data.observations)
-            alpha_loss = (-log_alpha * (log_pi + target_entropy)).mean()
+            alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()
```
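For context, here is a minimal, self-contained sketch of the autotune update this diff touches. All concrete values here (device, target_entropy, the log_pi batch, the learning rate) are hypothetical stand-ins for what the real script derives from the environment and replay buffer:

```python
import torch

# Hypothetical stand-ins for values the actual script computes.
device = torch.device("cpu")
target_entropy = -6.0                      # e.g. -action_dim in continuous SAC
log_pi = torch.randn(256, device=device)   # log-probs of actions sampled by the actor

# alpha is parameterized through log_alpha so that alpha = exp(log_alpha) > 0,
# as a Lagrange multiplier must be.
log_alpha = torch.zeros(1, requires_grad=True, device=device)
a_optimizer = torch.optim.Adam([log_alpha], lr=1e-3)

# The fix: weight the loss by alpha = exp(log_alpha), not by log_alpha itself.
alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()

a_optimizer.zero_grad()
alpha_loss.backward()
a_optimizer.step()
alpha = log_alpha.exp().item()  # the value plugged into the actor and critic losses
```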
I have taken the liberty of changing torch.exp(log_alpha) to log_alpha.exp(), which is slightly cleaner 😅
I recall doing it that way before, but it was likely lost during refactors.
Works for me.
@dosssman would be nice if you can test it, I'm having a bit of a rough time installing cleanrl 😅
@SobhanMP Actually tried to install a clean environment since most of the ones I have are quite dated, but I also get a lot of failures from poetry. What kind of errors do you get when trying to install cleanrl now?
Poetry seems to shoot itself in the foot and remove tomli. Other than that, GL.h is not found when I try to run it on the cluster. Ah, and pyyaml doesn't work with cython 3 anymore, something about the license link being broken.
Installing poetry as root with a package manager seems to overcome a bunch of the issues.
I usually create a fresh conda env first, within which I then use poetry. Also, notes for solving mujoco_py related issues: Mujoco install notes.
Tracking benchmark results in this Wandb report: https://api.wandb.ai/links/openrlbenchmark/owsly8ve
EDIT: given that alpha values are very small with or without the fix, we likely won't see much difference.
@vwxyzjn Hello there. Any idea why the vercel test fails? Thanks for your time.
It's confusing for SAC discrete. The paper states a similar formula (using alpha), whereas the author's original code uses log_alpha. According to the SAC authors this should be a non-issue though, and most other major libraries use log_alpha as well. I'm honestly not quite sure what to do here. @SobhanMP's plots show improved stability, so we should change it I guess?
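For reference, here is roughly what the two variants look like in the discrete case. This is only a sketch: the action_probs/log_pis naming and the probability-weighted expectation follow CleanRL's discrete SAC style, and all tensors are hypothetical stand-ins:

```python
import torch

# Hypothetical batch of policy outputs: 256 states, 6 discrete actions.
action_probs = torch.softmax(torch.randn(256, 6), dim=-1)
log_pis = torch.log(action_probs + 1e-8)
target_entropy = 0.98 * torch.log(torch.tensor(6.0))  # e.g. a fraction of the max entropy
log_alpha = torch.zeros(1, requires_grad=True)

# Variant in the author's original discrete-SAC code (and several libraries):
# the loss is weighted by log_alpha directly.
alpha_loss_log = (action_probs.detach() * (-log_alpha * (log_pis + target_entropy).detach())).mean()

# Variant matching the formula in the paper, as adopted by this PR:
# the loss is weighted by alpha = exp(log_alpha).
alpha_loss_exp = (action_probs.detach() * (-log_alpha.exp() * (log_pis + target_entropy).detach())).mean()
```

Note that both losses yield gradients of the same sign with respect to log_alpha, differing only by a factor of alpha, which is presumably why the SAC authors consider the choice a non-issue in practice.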
Actually, let me benchmark it again; I ran the original code w/o any change, though it is suspicious that I have less variance.
Benchmarks available here: https://api.wandb.ai/links/openrlbenchmark/eykdulk2. Should be ready to merge.
@dosssman thanks!
I'm gonna run some experiments with SAC-discrete to verify that it works there as well. It's a good opportunity for me to get into the rlops/benchmarking stuff, which I haven't used before. After everything works we can merge it, and also merge #383 while we're at it.
Great. I went ahead and tested the combination of that PR with this one (yellow lines on top of teal and pink, as specified in the earlier post). Also no significant regression in performance. Any idea why the vercel related test fails here though?
0 clue about vercel. I'm not using it personally and haven't really interacted with it.
Atari results take a while, and I can't run that many experiments in parallel due to RAM constraints for the replay buffer. Here are some results for SAC-discrete on BeamRider and Breakout. Pong results are still pending, but I can already say:
Did you manage to identify the specific reason why it is dying?
By dying I mean the performance collapses and never recovers. Sorry for not being clear.
To verify this claim, I ran the same seed that failed before (seed=1) with log_alpha. I'm gonna add the results to the report. A non-performing seed isn't the end of the world IMO.
Ready for merge from my side @dosssman
@timoklein Thanks a lot. Looking good on my side too.
Who wants the honor of merging it? Or should we wait for Costa?
LGTM.
Great work everyone.
Really great work everyone! Thanks so much for the PR and the benchmark effort. @dosssman or @timoklein, feel free to merge it!
Description
The Lagrange multiplier alpha in the entropy constraint (H(\pi) \geq \bar{H}) needs to be positive for the relaxation to work. This means that alpha, not log alpha, should be used here. See the Softlearning code at:
https://github.com/rail-berkeley/softlearning/blob/13cf187cc93d90f7c217ea2845067491c3c65464/softlearning/algorithms/sac.py#L256
but the real question is:
A) does this matter?
B) am I making sense?
I'll answer A tomorrow; not sure about B though.
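To spell out the positivity argument (my paraphrase of the constrained formulation in the 2018 SAC paper, with \bar{H} denoting the target entropy):

```latex
% Constrained objective: maximize return subject to a minimum expected entropy.
\max_{\pi} \; \mathbb{E}\Big[\sum_t r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{a_t \sim \pi}\big[-\log \pi(a_t \mid s_t)\big] \ge \bar{H} \quad \forall t.

% Lagrangian relaxation: the multiplier \alpha must satisfy \alpha \ge 0 for the
% relaxation to bound the constrained problem, giving the temperature loss
J(\alpha) = \mathbb{E}_{a_t \sim \pi}\big[-\alpha \big(\log \pi(a_t \mid s_t) + \bar{H}\big)\big].

% Optimizing log_alpha keeps alpha = e^{\log \alpha} > 0 automatically, but the
% loss itself must still be weighted by alpha, not by log_alpha.
```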
Types of changes

Checklist:
- pre-commit run --all-files passes (required).
- Documentation changes previewed via mkdocs serve.

If you need to run benchmark experiments for a performance-impacting change:
- Tracked experiments submitted, optionally with --capture-video.
- RLops performed with python -m openrlbenchmark.rlops.
- Learning curves generated by the python -m openrlbenchmark.rlops utility added to the documentation.
- Links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, added to the documentation.