
Muesli (LunarLander-v2)

Introduction

This is a simple implementation of the Muesli algorithm. Muesli matches MuZero's performance and network architecture, but it can be trained without MCTS lookahead search, using only a one-step lookahead. This reduces the computational cost significantly compared to MuZero.

Paper: Muesli: Combining Improvements in Policy Optimization, Hessel et al., 2021 (v2)

You can run this code via the Colab demo links below: train the agent, monitor training with TensorBoard, and play the LunarLander-v2 environment with the trained network. The agent can solve LunarLander-v2 within 1-2 hours on the Google Colab CPU backend, reaching an average score above 250.
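Muesli's one-step lookahead is captured by the clipped MPO (CMPO) target policy from the paper, pi_cmpo(a|s) ∝ pi_prior(a|s) * exp(clip(adv(s,a), -c, c)), where the advantage estimate comes from the learned model's one-step lookahead rather than a deep tree search. Below is a minimal sketch of that computation; the function and argument names are illustrative, not this repository's actual identifiers.

```python
import torch
import torch.nn.functional as F

def cmpo_policy(prior_logits, advantages, clip_c=1.0):
    """CMPO target policy: pi_cmpo(a|s) ∝ pi_prior(a|s) * exp(clip(adv, -c, c)).

    prior_logits: (B, A) logits of the prior (target-network) policy.
    advantages:   (B, A) one-step lookahead advantages q_hat(s,a) - v_hat(s),
                  assumed already normalized.
    """
    log_prior = F.log_softmax(prior_logits, dim=-1)
    clipped_adv = torch.clamp(advantages, -clip_c, clip_c)
    # softmax renormalizes, so the result is a proper distribution over actions.
    return F.softmax(log_prior + clipped_adv, dim=-1)
```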

Implemented

  • MuZero network
  • 5 step unroll
  • L_pg+cmpo
  • L_v
  • L_r
  • L_m (5 step)
  • Stacking 8 observations
  • Mini-batch update
  • Hidden state scaled within [-1,1]
  • Gradient clipping by value [-1,1]
  • Dynamics network gradient scale 1/2
  • Target network (prior parameters) moving-average update (see the sketch after this list)
  • Categorical representation (value and reward models; see the sketch after this list)
  • Normalized advantage
  • Tensorboard monitoring
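Two of the items above can be made concrete. Below is a minimal sketch of the two-hot categorical encoding used for the value and reward heads (the same trick as in MuZero) and of the moving-average update of the target (prior) network. All names and the support size are assumptions, not this repository's actual code.

```python
import torch

def scalar_to_two_hot(x, support_size=30):
    """Two-hot projection of a scalar batch onto the categorical support
    {-support_size, ..., +support_size} (support size is an assumption)."""
    x = x.clamp(-support_size, support_size)
    low = x.floor().long()
    high = (low + 1).clamp(max=support_size)
    frac = x - x.floor()                       # weight assigned to the upper bin
    dist = torch.zeros(*x.shape, 2 * support_size + 1, device=x.device)
    dist.scatter_(-1, (low + support_size).unsqueeze(-1), (1 - frac).unsqueeze(-1))
    dist.scatter_add_(-1, (high + support_size).unsqueeze(-1), frac.unsqueeze(-1))
    return dist

def categorical_to_scalar(logits, support_size=30):
    """Expected value of the categorical head's distribution over the support."""
    probs = torch.softmax(logits, dim=-1)
    support = torch.arange(-support_size, support_size + 1,
                           device=logits.device, dtype=probs.dtype)
    return (probs * support).sum(-1)

@torch.no_grad()
def update_target(target_net, online_net, alpha=0.05):
    """Moving-average (EMA) update of the target / prior parameters."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(1.0 - alpha).add_(alpha * o)
```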

Todo

  • Retrace estimator
  • CNN representation network
  • LSTM dynamics network
  • Atari environment

Differences from paper

  • Self-play uses the agent network (the paper uses the target network)

Self-play

Flow of self-play: [Figure: self-play flow diagram]
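As a rough illustration of that flow, here is a minimal self-play loop with 8-observation stacking, acting with the agent network as noted under "Differences from paper". The `agent.policy` interface and the classic (pre-0.26) Gym step API are assumptions, not this repository's actual code.

```python
from collections import deque

import gym
import numpy as np
import torch

def self_play_episode(agent, env, stack=8):
    """Collect one episode, acting on a stack of the 8 most recent observations.
    `agent.policy(obs)` is an assumed interface that returns action logits."""
    obs = env.reset()
    frames = deque([np.zeros_like(obs)] * (stack - 1) + [obs], maxlen=stack)
    trajectory, done = [], False
    while not done:
        stacked = torch.as_tensor(np.concatenate(list(frames)), dtype=torch.float32)
        with torch.no_grad():
            logits = agent.policy(stacked.unsqueeze(0))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        obs, reward, done, _ = env.step(action)
        trajectory.append((stacked, action, reward))
        frames.append(obs)
    return trajectory

# Usage sketch:
# env = gym.make("LunarLander-v2")
# trajectory = self_play_episode(agent, env)
```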

Unroll structure

Target network 1-step unroll: used when calculating v_pi_prior(s) and the second term of L_pg+cmpo.

Agent network 5-step unroll: the agent network is unrolled 5 steps and optimized.

Target network 1-step unrolls for L_m: used when calculating the pi_cmpo targets of L_m. [Figure: unroll structure diagram]
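Putting these pieces together, here is a sketch of how the 5-step agent-network unroll could accumulate L_v, L_r, and L_m. The pi_cmpo targets produced by the target network's 1-step unrolls are assumed to be precomputed in the batch; every attribute and batch key is an illustrative assumption, and cross-entropy with soft (two-hot) targets requires PyTorch 1.10+.

```python
import torch
import torch.nn.functional as F

def unroll_losses(model, batch, unroll_steps=5, grad_scale=0.5, eps=1e-8):
    """Unroll the agent network with the stored actions and accumulate
    L_v, L_r (categorical cross-entropy vs. two-hot targets) and L_m
    (KL to the precomputed target-network pi_cmpo)."""
    state = model.representation(batch["obs"])          # h_0 from stacked observations
    loss_v = loss_r = loss_m = 0.0
    for k in range(unroll_steps):
        policy_logits, value_logits = model.prediction(state)
        loss_v = loss_v + F.cross_entropy(value_logits, batch["value_target"][k])
        loss_m = loss_m + F.kl_div(F.log_softmax(policy_logits, dim=-1),
                                   batch["pi_cmpo_target"][k],
                                   reduction="batchmean")
        state, reward_logits = model.dynamics(state, batch["actions"][k])
        loss_r = loss_r + F.cross_entropy(reward_logits, batch["reward_target"][k])
        # Scale the gradient flowing into the dynamics network by 1/2 (as in MuZero).
        state = grad_scale * state + (1 - grad_scale) * state.detach()
        # Min-max rescale the hidden state into [-1, 1].
        s_min = state.min(dim=-1, keepdim=True).values
        s_max = state.max(dim=-1, keepdim=True).values
        state = 2 * (state - s_min) / (s_max - s_min + eps) - 1
    return loss_v + loss_r + loss_m
```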

Results

[Figure: score graph]
[Figure: loss graph]
[Figure: LunarLander play length and last rewards]
[Figure: variance variables of advantage normalization]

Comment

Contributions, advice, and questions are all welcome.

Contact: emtgit2@gmail.com (available languages: English, Korean)

Links

Author's presentation: https://icml.cc/virtual/2021/poster/10769

LunarLander-v2 environment documentation: https://www.gymlibrary.dev/environments/box2d/lunar_lander/

Colab demo link (main branch)

Colab demo link (develop branch)
