
use alpha not log alpha in autotune #414

Merged: 4 commits merged into vwxyzjn:master on Sep 14, 2023

Conversation

SobhanMP
Contributor

@SobhanMP SobhanMP commented Aug 20, 2023

Description

The Lagrange multiplier alpha in the entropy constraint (H(\pi) >= H_target) needs to be positive for the relaxation to work. This means that alpha, not log alpha, should be used here. See the Softlearning code at:
https://github.com/rail-berkeley/softlearning/blob/13cf187cc93d90f7c217ea2845067491c3c65464/softlearning/algorithms/sac.py#L256

but the real question is:
A) does this matter?
B) am I making sense?

I'll answer A tomorrow, not sure about B though.
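
For concreteness, here is a minimal sketch of what the corrected autotune update would look like (variable names mirror cleanrl's sac_continuous_action.py but should be treated as assumptions; this is an illustration, not the exact patch):

import torch

# log_alpha is the unconstrained parameter being optimized; the dual variable
# alpha = exp(log_alpha) is then guaranteed to be positive, as the relaxation requires.
log_alpha = torch.zeros(1, requires_grad=True)
a_optimizer = torch.optim.Adam([log_alpha], lr=1e-3)

def update_alpha(log_pi: torch.Tensor, target_entropy: float) -> torch.Tensor:
    # before: alpha_loss = (-log_alpha * (log_pi + target_entropy)).mean()
    # after:  multiply by the positive coefficient alpha = log_alpha.exp()
    alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()
    a_optimizer.zero_grad()
    alpha_loss.backward()
    a_optimizer.step()
    return log_alpha.exp().detach()  # alpha as used by the actor and critic losses

# toy usage with placeholder log-probabilities from a sampled batch
alpha = update_alpha(log_pi=torch.randn(256), target_entropy=-3.0)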

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the tests accordingly (if applicable).
  • I have updated the documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers.

If you need to run benchmark experiments for a performance-impacting change:

  • I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team.
  • I have used the benchmark utility to submit the tracked experiments to the openrlbenchmark/cleanrl W&B project, optionally with --capture-video.
  • I have performed RLops with python -m openrlbenchmark.rlops.
    • For new feature or bug fix:
      • I have used the RLops utility to understand the performance impact of the changes and confirmed there is no regression.
    • For new algorithm:
      • I have created a table comparing my results against those from reputable sources (e.g., the original paper or another reference implementation).
    • I have added the learning curves generated by the python -m openrlbenchmark.rlops utility to the documentation.
    • I have added links to the tracked experiments in W&B, generated by python -m openrlbenchmark.rlops ....your_args... --report, to the documentation.

@vercel
vercel bot commented Aug 20, 2023

The latest updates on your projects.

Name: cleanrl | Status: ✅ Ready | Updated (UTC): Aug 30, 2023 0:07am

@dosssman
Collaborator

@SobhanMP Nice catch. Thanks a lot for bringing this to our attention.

Indeed, both your code reference and the 2018 SAC paper stipulate the usage of alpha instead of log_alpha.

[screenshot: the corresponding equation from the 2018 SAC paper, written in terms of alpha]

Are you perhaps already testing your change? Otherwise I can get some runs started on my side to check the effect of this change. Chances are the results will not be that much different, but let's definitely stick as close as possible to the correct theory and implementation.

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there? Or is it an artifact of the sac_continuous variant?

alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi + target_entropy).detach())).mean()

Also noticed that stable-baselines3 seems to have the same behavior. Attempting to ping @ARrafin:
https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

@@ -276,7 +276,7 @@ def get_action(self, x):
         if args.autotune:
             with torch.no_grad():
                 _, log_pi, _ = actor.get_action(data.observations)
-                alpha_loss = (-log_alpha * (log_pi + target_entropy)).mean()
+                alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()
Collaborator


I have taken the liberty of changing torch.exp(log_alpha) to log_alpha.exp(), which is slightly cleaner 😅
I recall doing it that way before, but it was likely lost during refactors.

Contributor Author


Works for me.

@SobhanMP
Contributor Author

@dosssman would be nice if you could test it, I'm having a bit of a rough time installing cleanrl 😅

@dosssman
Collaborator

@SobhanMP I actually tried to set up a clean environment, since most of the ones I have are quite dated, but I also get a lot of failures from poetry.

What kind of errors do you get when trying to install cleanrl now?

@SobhanMP
Contributor Author

SobhanMP commented Aug 20, 2023

Poetry seems to shoot itself in the foot and remove tomli. Other than that, GL.h is not found when I try to run it on the cluster. Also, pyyaml doesn't build with Cython 3 anymore; something about the license link being broken.

@SobhanMP
Contributor Author

Installing poetry as root with a package manager seems to overcome a bunch of the issues.

@dosssman
Collaborator

I usually create a fresh conda env first, within which I then use poetry install.
Most issues did not happen when using python=3.9 instead of 3.10.

Also, notes for solving mujoco_py related issues:

Mujoco install notes

  • GL/osmesa.h not found when building mujoco-py: sudo apt-get install libosmesa6-dev.
  • patchelf not found: sudo apt-get install patchelf ...

@SobhanMP
Contributor Author

sudo apt-get install libosmesa6-dev
I figured that much, I don't have root on the cluster...

See the runs below: the left is from the openrlbenchmark runs, the right is with the fix. On another note, I think it would be a good idea to record alpha.
[figure: episodic return, openrlbenchmark baseline (left) vs. this PR's fix (right)]
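
For reference, recording alpha could be as simple as the following (a sketch that assumes a TensorBoard SummaryWriter named writer and a global_step counter, as in the cleanrl scripts; not part of this PR):

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/sac_alpha_demo")   # hypothetical run directory
log_alpha = torch.zeros(1, requires_grad=True)

# inside the training loop, after the alpha update:
global_step = 0
writer.add_scalar("losses/alpha", log_alpha.exp().item(), global_step)
writer.close()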

@dosssman
Collaborator

dosssman commented Aug 21, 2023

alpha is also tracked by default, autotune or not.

Tracking benchmark results in this Wandb report: https://api.wandb.ai/links/openrlbenchmark/owsly8ve
Light teal for the log_alpha (master) variant, pink for the PR's fix.

EDIT: given that alpha values are very small with or without the fix, we likely won't see much difference.
At least not on the three tasks we usually benchmark. Maybe a more complex task that benefits more from autotune would see some improvements though.
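
For intuition on how the two formulations differ mechanically: once the .exp() is applied, the gradient with respect to log_alpha is scaled by alpha itself, so for very small alpha the updates shrink accordingly. A toy check (the numbers are placeholders, not taken from these runs):

import torch

c = torch.tensor(0.5)                        # stand-in for (log_pi + target_entropy).mean()
log_alpha = torch.tensor(-3.0, requires_grad=True)

# master: d/d(log_alpha) of -log_alpha * c is -c, independent of alpha's magnitude
(-log_alpha * c).backward()
print(log_alpha.grad)                        # tensor(-0.5000)

log_alpha.grad = None
# this PR: d/d(log_alpha) of -exp(log_alpha) * c is -alpha * c
(-log_alpha.exp() * c).backward()
print(log_alpha.grad)                        # about -0.025, since alpha = exp(-3) ≈ 0.05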

@dosssman
Collaborator

@vwxyzjn Hello there. Any idea why the vercel test fails? Thanks for your time.

@timoklein
Collaborator

@timoklein Hello there. It seems the same issue also occurs in the discrete variant. Is there any particular reason log_alpha is used there? Or is it an artifact of the sac_continuous variant?

alpha_loss = (action_probs.detach() * (-log_alpha * (log_pi + target_entropy).detach())).mean()

Also noticed that stable-baselines3 seems to have the same behavior. Attempting to ping @ARrafin: https://github.com/DLR-RM/stable-baselines3/blob/f4ec0f6afa1a23b0e0b746174cd0074471cc0b89/stable_baselines3/sac/sac.py#L231

It's confusing for SAC discrete. The paper states a similar formula (using alpha)
[screenshot: the alpha loss formula from the SAC-discrete paper, written in terms of alpha]

whereas the author's original code uses log_alpha, as can be seen here:

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/135d3e2e06bbde2868047d738e3fc2d73fd8cc93/agents/actor_critic_agents/SAC.py#L184

According to the SAC authors this should be a non-issue, though, and most other major libraries use log_alpha as well, e.g. tianshou.

I'm honestly not quite sure what to do here. @SobhanMP's plots show improved stability, so we should change it, I guess?
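
For reference, the analogous change in the discrete variant would look roughly like this (a standalone sketch; the batch size, number of actions, and entropy scale are placeholders, not the benchmarked settings):

import torch

batch_size, n_actions = 32, 6                                              # placeholders
action_probs = torch.softmax(torch.randn(batch_size, n_actions), dim=-1)   # pi(a|s)
log_pi = torch.log(action_probs + 1e-8)                                    # log pi(a|s)
target_entropy = -0.89 * torch.log(1 / torch.tensor(float(n_actions)))     # scaled max entropy
log_alpha = torch.zeros(1, requires_grad=True)

# before: -log_alpha * (...); after: use the positive multiplier alpha = exp(log_alpha)
alpha_loss = (action_probs.detach()
              * (-log_alpha.exp() * (log_pi + target_entropy).detach())).mean()
alpha_loss.backward()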


@SobhanMP
Contributor Author

Actually, let me benchmark it again; I ran the original code without any change, though it is suspicious that I have less variance.

@dosssman
Collaborator

dosssman commented Aug 22, 2023

Benchmarks available here: https://api.wandb.ai/links/openrlbenchmark/eykdulk2.
No significant difference with either log_alpha (light teal lines) or alpha (pink lines).
Even the values of the alpha coefficient itself are similar across variants.

[figures: benchmark learning curves and alpha values comparing the log_alpha and alpha variants]

Should be ready to merge.

@SobhanMP
Contributor Author

@dosssman thanks!

@timoklein
Collaborator

I'm gonna run some experiments with SAC-discrete to verify that it works there as well. It's a good opportunity for me to get into the rlops/benchmarking stuff which I haven't used before.

After everything works we can merge it and also merge #383 while we're at it.

@dosssman
Collaborator

After everything works we can merge it and also merge #383 while we're at it.

Great. I went ahead and tested the combination of that PR (#383) with this one; yellow lines on top of the teal and pink specified in the earlier post:

[figures: benchmark curves with #383 combined with this PR (yellow) overlaid on the earlier log_alpha and alpha runs]

Also no significant regression in performance.

Any idea why the vercel-related test fails here, though?

@timoklein
Collaborator

0 clue about vercel. I'm not using it personally and haven't really interacted with it.

@timoklein
Collaborator

Atari results take a while, and I can't run that many experiments in parallel due to RAM constraints for the replay buffer. Here are some results for SAC-discrete on BeamRider and BreakOut.

[figures: SAC-discrete learning curves on BeamRider and BreakOut]

Pong results are still pending, but I can already say:

  • Using alpha.exp() in the autotune loss leads to one run dying. It can most likely be fixed by adjusting the target entropy scale from 0.89 to a slightly lower value. Pong has been very sensitive to entropy scaling in my past experiments and this parameter is best tuned for each environment individually anyway.

@dosssman
Collaborator

Using alpha.exp() in the autotune loss leads to one run dying. It can most likely be fixed by adjusting the target entropy scale from 0.89 to a slightly lower value. Pong has been very sensitive to entropy scaling in my past experiments and this parameter is best tuned for each environment individually anyway.

Did you manage to identify the specific reason why it is dying?
Since you mentioned RAM constraints, in my experience I have had some runs die due to those.
Might be useful to check the System / Allocated System Memory around the time the run died.
It might not have anything to do with alpha.exp(), unless it explicitly throws an error regarding an invalid computation, or something along those lines (Wandb logs?).

@timoklein
Collaborator

Did you manage to identify the specific reason why it is dying?

By dying I mean the performance collapses and never recovers. Sorry for not being clear.
I had one other run crash due to RAM constraints but I'm re-running it.

@timoklein
Collaborator

Here are the results for Pong:
[figure: SAC-discrete learning curve on Pong]

As I said, this can probably be fixed by adjusting the target entropy scale, which is very seed- and environment-dependent.
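
For context, the target entropy in the discrete variant is the maximum achievable entropy scaled by this coefficient, along the lines of the sketch below (the action count is a placeholder):

import torch

n_actions = 6                    # placeholder, e.g. a small Atari action set
target_entropy_scale = 0.8       # lowered from 0.89, as discussed above

# target H = scale * log(|A|), i.e. a fraction of the uniform policy's entropy
target_entropy = -target_entropy_scale * torch.log(1 / torch.tensor(float(n_actions)))
print(target_entropy)            # roughly 0.8 * 1.79 ≈ 1.43 nats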

@timoklein
Collaborator

As I said, this can probably be fixed by adjusting the target entropy scale, which is very seed- and environment-dependent.

To verify this claim, I ran the same seed that failed before (seed=1) with target_entropy_scale=0.8. Results are below:

[figure: Pong, seed 1, target_entropy_scale=0.8]

I'm gonna add the .exp() later as well. It is explicitly stated in the docs that

Tuning this parameter to the environment at hand is advised and can lead to significant performance gains.

so a non-performing seed isn't the end of the world IMO.

@timoklein
Collaborator

Ready for merge from my side @dosssman

@dosssman
Collaborator

@timoklein Thanks a lot. Looking good on my side too.
I have also validated the actor_loss shape fix in #383 (combined with this PR's changes too, no regression), so that one is also good to merge.

@timoklein
Collaborator

Who wants the honor of merging it? Or should we wait for Costa?

Collaborator

@dosssman dosssman left a comment


LGTM.
Great work everyone.

@vwxyzjn
Owner

vwxyzjn commented Sep 12, 2023

Really great work everyone! Thanks so much for the PR and the benchmark effort. @dosssman or @timoklein, feel free to merge it!

@timoklein timoklein merged commit 7e24ae2 into vwxyzjn:master Sep 14, 2023
20 checks passed
@SobhanMP SobhanMP deleted the lagrange branch September 14, 2023 23:16