
use alpha not log alpha in autotune #414

Merged: 4 commits merged into vwxyzjn:master on Sep 14, 2023

Conversation

SobhanMP
Contributor

@SobhanMP SobhanMP commented Aug 20, 2023

Description

The Lagrange multiplier alpha in the entropy constraint (H(\pi) >= H_target) needs to be positive for the relaxation to work. This means that alpha, not log alpha, should be used here. See the Softlearning code at:
https://github.com/rail-berkeley/softlearning/blob/13cf187cc93d90f7c217ea2845067491c3c65464/softlearning/algorithms/sac.py#L256

but the real question is:
A) does this matter?
B) am I making sense?

I'll answer A tomorrow, not sure about B though.
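
For concreteness, here is a minimal sketch of what the corrected autotune update would look like (variable names mirror cleanrl's sac_continuous_action.py but should be treated as assumptions; this is an illustration, not the exact patch):

import torch

# log_alpha is the unconstrained parameter being optimized; the dual variable
# alpha = exp(log_alpha) is then guaranteed to be positive, as the relaxation requires.
log_alpha = torch.zeros(1, requires_grad=True)
a_optimizer = torch.optim.Adam([log_alpha], lr=1e-3)

def update_alpha(log_pi: torch.Tensor, target_entropy: float) -> torch.Tensor:
    # before: alpha_loss = (-log_alpha * (log_pi + target_entropy)).mean()
    # after:  multiply by the positive coefficient alpha = log_alpha.exp()
    alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()
    a_optimizer.zero_grad()
    alpha_loss.backward()
    a_optimizer.step()
    return log_alpha.exp().detach()  # alpha as used by the actor and critic losses

# toy usage with placeholder log-probabilities from a sampled batch
alpha = update_alpha(log_pi=torch.randn(256), target_entropy=-3.0)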

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting change:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture-video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (e.g., the original paper or another reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.

@vercel
vercel bot commented Aug 20, 2023

The latest updates on your projects.

Name: cleanrl | Status: ✅ Ready | Updated (UTC): Aug 30, 2023 0:07am

@dosssman
Collaborator

@SobhanMP Nice catch. Thanks a lot for bringing this to our attention.

Indeed, both your code reference and the 2018 SAC paper stipulate the usage of alpha instead of log_alpha.

[screenshot: the corresponding equation from the 2018 SAC paper, written in terms of alpha]

Are you perhaps already testing your change? Otherwise I can get some runs started on my side to check the effect of this change. Chances are the results will not be that much different, but let's definitely stick as close as possible to the correct theory and implementation.

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there? Or is it an artifact of the sac_continuous variant?

alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi + target_entropy).detach())).mean()

Also noticed that stable-baselines3 seems to have the same behavior. Attempting to ping @ARrafin:
https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

@@ -276,7 +276,7 @@ def get_action(self, x):
         if args.autotune:
             with torch.no_grad():
                 _, log_pi, _ = actor.get_action(data.observations)
-                alpha_loss = (-log_alpha * (log_pi + target_entropy)).mean()
+                alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()
Collaborator


I have taken the liberty of changing torch.exp(log_alpha) to log_alpha.exp(), which is slightly cleaner 😅
I recall doing it that way before, but it was likely lost during refactors.

Contributor Author


Works for me.

@SobhanMP
Contributor Author

@dosssman would be nice if you could test it, I'm having a bit of a rough time installing cleanrl 😅

@dosssman
Collaborator

@SobhanMP I actually tried to set up a clean environment, since most of the ones I have are quite dated, but I also get a lot of failures from poetry.

What kind of errors do you get when trying to install cleanrl now?

@SobhanMP
Contributor Author

SobhanMP commented Aug 20, 2023

Poetry seems to shoot itself in the foot and remove tomli. Other than that, GL.h is not found when I try to run it on the cluster. Also, pyyaml doesn't build with Cython 3 anymore; something about the license link being broken.

@SobhanMP
Contributor Author

Installing poetry as root with a package manager seems to overcome a bunch of the issues.

@dosssman
Collaborator

I usually create a fresh conda env first, within which I then use poetry install.
Most issues did not happen when using python=3.9 instead of 3.10.

Also, notes for solving mujoco_py related issues:

Mujoco install notes

  • GL/osmesa.h not found when building mujoco-py: sudo apt-get install libosmesa6-dev.
  • patchelf not found: sudo apt-get install patchelf ...

@SobhanMP
Contributor Author

sudo apt-get install libosmesa6-dev
I figured that much, I don't have root on the cluster...

See the runs below: the left is from the openrlbenchmark runs, the right is with the fix. On another note, I think it would be a good idea to record alpha.
[figure: episodic return, openrlbenchmark baseline (left) vs. this PR's fix (right)]
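
For reference, recording alpha could be as simple as the following (a sketch that assumes a TensorBoard SummaryWriter named writer and a global_step counter, as in the cleanrl scripts; not part of this PR):

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/sac_alpha_demo")   # hypothetical run directory
log_alpha = torch.zeros(1, requires_grad=True)

# inside the training loop, after the alpha update:
global_step = 0
writer.add_scalar("losses/alpha", log_alpha.exp().item(), global_step)
writer.close()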

@dosssman
Collaborator

dosssman commented Aug 21, 2023

alpha is also tracked by default, autotune or not.

Tracking benchmark results in this Wandb report: https://api.wandb.ai/links/openrlbenchmark/owsly8ve
Light teal for the log_alpha (master) variant, pink for the PR's fix.

EDIT: given that alpha values are very small with or without the fix, we likely won't see much difference.
At least not on the three tasks we usually benchmark. Maybe a more complex task that benefits more from autotune would see some improvements though.
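
For intuition on how the two formulations differ mechanically: once the .exp() is applied, the gradient with respect to log_alpha is scaled by alpha itself, so for very small alpha the updates shrink accordingly. A toy check (the numbers are placeholders, not taken from these runs):

import torch

c = torch.tensor(0.5)                        # stand-in for (log_pi + target_entropy).mean()
log_alpha = torch.tensor(-3.0, requires_grad=True)

# master: d/d(log_alpha) of -log_alpha * c is -c, independent of alpha's magnitude
(-log_alpha * c).backward()
print(log_alpha.grad)                        # tensor(-0.5000)

log_alpha.grad = None
# this PR: d/d(log_alpha) of -exp(log_alpha) * c is -alpha * c
(-log_alpha.exp() * c).backward()
print(log_alpha.grad)                        # about -0.025, since alpha = exp(-3) ≈ 0.05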

@dosssman
Collaborator

@vwxyzjn Hello there. Any idea why the vercel test fails? Thanks for your time.

@timoklein
Collaborator

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there? Or is it an artifact of the sac_continuous variant?

alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi + target_entropy).detach())).mean()

Also noticed that stable-baselines3 seems to have the same behavior. Attempting to ping @ARrafin: https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

It's confusing for SAC discrete. The paper states a similar formula (using alpha)
[screenshot: the alpha loss formula from the SAC-discrete paper, written in terms of alpha]

whereas the author's original code uses log_alpha, as can be seen here:

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/135d3e2e06bbde2868047d738e3fc2d73fd8cc93/agents/actor_critic_agents/SAC.py#L184

According to the SAC authors this should be a non-issue, though, and most other major libraries use log_alpha as well, e.g. tianshou.

I'm honestly not quite sure what to do here. @SobhanMP's plots show improved stability, so we should change it, I guess?
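
For reference, the analogous change in the discrete variant would look roughly like this (a standalone sketch; the batch size, number of actions, and entropy scale are placeholders, not the benchmarked settings):

import torch

batch_size, n_actions = 32, 6                                              # placeholders
action_probs = torch.softmax(torch.randn(batch_size, n_actions), dim=-1)   # pi(a|s)
log_pi = torch.log(action_probs + 1e-8)                                    # log pi(a|s)
target_entropy = -0.89 * torch.log(1 / torch.tensor(float(n_actions)))     # scaled max entropy
log_alpha = torch.zeros(1, requires_grad=True)

# before: -log_alpha * (...); after: use the positive multiplier alpha = exp(log_alpha)
alpha_loss = (action_probs.detach()
              * (-log_alpha.exp() * (log_pi + target_entropy).detach())).mean()
alpha_loss.backward()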


@SobhanMP
Contributor Author

Actually, let me benchmark it again; I ran the original code without any change, though it is suspicious that I have less variance.

@dosssman
Collaborator

dosssman commented Aug 22, 2023

Benchmarks available here: https://api.wandb.ai/links/openrlbenchmark/eykdulk2.
No significant difference with either log_alpha (light teal lines) or alpha (pink lines).
Even the values of the alpha coefficient itself are similar across variants.

[figures: benchmark learning curves and alpha values comparing the log_alpha and alpha variants]

Should be ready to merge.

@SobhanMP
Contributor Author

@dosssman thanks!

@timoklein
Collaborator

I'm gonna run some experiments with SAC-discrete to verify that it works there as well. It's a good opportunity for me to get into the rlops/benchmarking stuff which I haven't used before.

After everything works we can merge it and also merge #383 while we're at it.

@dosssman
Collaborator

After everything works we can merge it and also merge #383 while we're at it.

Great. I went ahead and tested the combination of that PR (#383) with this one; yellow lines on top of the teal and pink specified in the earlier post:

[figures: benchmark curves with #383 combined with this PR (yellow) overlaid on the earlier log_alpha and alpha runs]

Also no significant regression in performance.

Any idea why the vercel-related test fails here, though?

@timoklein
Collaborator

0 clue about vercel. I'm not using it personally and haven't really interacted with it.

@timoklein
Collaborator

Atari results take a while, and I can't run that many experiments in parallel due to RAM constraints for the replay buffer. Here are some results for SAC-discrete on BeamRider and BreakOut.

[figures: SAC-discrete learning curves on BeamRider and BreakOut]

Pong results are still pending, but I can already say:

  • Using alpha.exp() in the autotune loss leads to one run dying. It can most likely be fixed by adjusting the target entropy scale from 0.89 to a slightly lower value. Pong has been very sensitive to entropy scaling in my past experiments and this parameter is best tuned for each environment individually anyway.

@dosssman
Collaborator

Using alpha.exp() in the autotune loss leads to one run dying. It can most likely be fixed by adjusting the target entropy scale from 0.89 to a slightly lower value. Pong has been very sensitive to entropy scaling in my past experiments and this parameter is best tuned for each environment individually anyway.

Did you manage to identify the specific reason why it is dying?
Since you mentioned RAM constraints, in my experience I have had some runs die due to those.
Might be useful to check the System / Allocated System Memory around the time the run died.
It might not have anything to do with alpha.exp(), unless it explicitly throws an error regarding an invalid computation, or something along those lines (Wandb logs?).

@timoklein
Collaborator

Did you manage to identify the specific reason why it is dying?

By dying I mean the performance collapses and never recovers. Sorry for not being clear.
I had one other run crash due to RAM constraints but I'm re-running it.

@timoklein
Collaborator

Here are the results for Pong:
[figure: SAC-discrete learning curve on Pong]

As I said, this can probably be fixed by adjusting the target entropy scale, which is very seed- and environment-dependent.
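
For context, the target entropy in the discrete variant is the maximum achievable entropy scaled by this coefficient, along the lines of the sketch below (the action count is a placeholder):

import torch

n_actions = 6                    # placeholder, e.g. a small Atari action set
target_entropy_scale = 0.8       # lowered from 0.89, as discussed above

# target H = scale * log(|A|), i.e. a fraction of the uniform policy's entropy
target_entropy = -target_entropy_scale * torch.log(1 / torch.tensor(float(n_actions)))
print(target_entropy)            # roughly 0.8 * 1.79 ≈ 1.43 nats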

@timoklein
Collaborator

As I said, this can probably be fixed by adjusting the target entropy scale, which is very seed- and environment-dependent.

To verify this claim, I ran the same seed that failed before (seed=1) with target_entropy_scale=0.8. Results are below:

[figure: Pong, seed 1, target_entropy_scale=0.8]

I'm gonna add the .exp() later as well. It is explicitly stated in the docs that

Tuning this parameter to the environment at hand is advised and can lead to significant performance gains.

so a non-performing seed isn't the end of the world IMO.

@timoklein
Collaborator

Ready for merge from my side @dosssman

@dosssman
Collaborator

@timoklein Thanks a lot. Looking good on my side too.
I have also validated the actor_loss shape fix in #383 (combined with this PR's changes too, no regression), so that one is also good to merge.

@timoklein
Collaborator

Who wants the honor of merging it? Or should we wait for Costa?

Collaborator

@dosssman dosssman left a comment


LGTM.
Great work everyone.

@vwxyzjn
Owner

vwxyzjn commented Sep 12, 2023

Really great work everyone! Thanks so much for the PR and the benchmark effort. @dosssman or @timoklein, feel free to merge it!

@timoklein timoklein merged commit 7e24ae2 into vwxyzjn:master Sep 14, 2023
20 checks passed
@SobhanMP SobhanMP deleted the lagrange branch September 14, 2023 23:16