DERL is a deep reinforcement learning package with a focus on simplicity. It is built on top of PyTorch.
*Parts of episodes from policies trained with PPO.*
Currently implemented algorithms:
- A2C
- PPO
- SAC
- DQN (n-step, double, dueling, with prioritized experience replay, with noisy networks for exploration)
The installation script will install torch, but it might be better to install it manually beforehand in order to ensure proper system and CUDA dependencies.
```
git clone https://github.com/MichaelKonobeev/derl.git
pip install -e derl
```
`gym[atari]` and `pybullet` will be installed by `setup.py`, but you will need to install other environment requirements such as mujoco manually.
Now you can run training:
```
derl ppo --env-id BreakoutNoFrameskip-v4 --logdir logdir/breakout.00
```
or 3-step double DQN with prioritized experience replay and dueling architecture:
```
derl dqn --env-id BreakoutNoFrameskip-v4 --logdir logdir/breakout.00
```
Or, if `gym[mujoco]` is installed:

```
derl ppo --env-id HalfCheetah-v3 --logdir logdir/half-cheetah.00
```
Different learning algorithms are selected by passing a positional argument to the `derl` script. For a full list of arguments you can use `derl -h`; to see the arguments of a particular algorithm (call it `alg`), use `derl alg`.
The script automatically selects different hyperparameters for atari and mujoco envs.
The package defines four essential components:
- `Policy` takes a `torch.nn.Module` during construction and, upon calling the `act` method, returns a dictionary of predictions made with that policy. This method can be called in two modes: one to collect rollouts and another useful for training, for example when we want Q-value predictions for all actions in order to perform training. Switching between the modes is done by passing a boolean to the `training` keyword argument.
- `EnvRunner` takes a `Policy` and an `env` during construction and collects rollouts. `EnvRunner`s count the number of steps performed during agent-environment interaction, which is useful for several reasons, including annealing variables (such as learning rate and exploration epsilon) and writing summaries. Composition of `EnvRunner`s can easily be done using `EnvRunnerWrapper`. Additionally, one can apply transformations to a rollout; in this package, for example, this is used to compute Generalized Advantage Estimators for PPO. Experience replay is also implemented as an `EnvRunner`, which simplifies the interface when working with learning algorithms. A simplified sketch of the `Policy` and `EnvRunner` design follows this list.
- `Alg` uses an `EnvRunner` to collect rollouts and perform updates. In the current implementation, algorithms wrap an `AlgLoss` class which in turn defines algorithm loss functions (they tend to be rather complicated in RL, which is why using a separate class for them makes a lot of sense), but of course there is no requirement to do so thanks to duck typing. All of the currently implemented algorithms write nice summaries of their training to TensorBoard.
- `Factory` constructs first `EnvRunner`, then `Alg` instances. It simplifies working with a large number of keyword arguments as much as possible. In particular, by default it ensures that all keyword arguments passed to it during construction were used during algorithm instantiation.
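
To make the division of responsibilities more concrete, here is a minimal, self-contained sketch of the `Policy`/`EnvRunner` design described above. It only mirrors the interfaces explained in this section: the method name `get_next`, the rollout format, and the constructor signatures are illustrative assumptions, not derl's actual code, and the snippet assumes the classic `gym` reset/step API.

```python
# Illustrative sketch of the Policy / EnvRunner design described above.
# Method names, signatures and the rollout format are assumptions.
import gym
import torch
import torch.nn as nn


class Policy:
    """Wraps a torch.nn.Module; `act` returns a dict of predictions."""

    def __init__(self, model):
        self.model = model

    def act(self, obs, training=False):
        # Rollout-collection mode only needs actions; training mode may also
        # need, e.g., Q-value predictions for all actions.
        obs = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        qvalues = self.model(obs)
        predictions = {"actions": qvalues.argmax(-1).item()}
        if training:
            predictions["qvalues"] = qvalues
        return predictions


class EnvRunner:
    """Collects fixed-length rollouts with a Policy and counts env steps."""

    def __init__(self, policy, env, nsteps=8):
        self.policy, self.env, self.nsteps = policy, env, nsteps
        self.step_count = 0  # handy for annealing variables and summaries
        self.obs = self.env.reset()  # classic gym API assumed

    def get_next(self):
        rollout = []
        for _ in range(self.nsteps):
            action = self.policy.act(self.obs)["actions"]
            next_obs, reward, done, _ = self.env.step(action)
            rollout.append((self.obs, action, reward, done))
            self.obs = self.env.reset() if done else next_obs
            self.step_count += 1
        return rollout


env = gym.make("CartPole-v1")
model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
runner = EnvRunner(Policy(model), env)
rollout = runner.get_next()  # list of (obs, action, reward, done) tuples
```

In derl itself, `Factory` would take care of constructing the `EnvRunner` and `Alg` instances (and their many keyword arguments) instead of wiring everything by hand as above.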
In addition, there are several smaller things that greatly simplify working on an RL project with typical benchmarks: it is easy to create an appropriately wrapped env instance just by calling `derl.env.make`, and it is easy to create a `torch.nn.Module` for mujoco or atari environments using the `make_model` function.
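
For example, a hypothetical usage sketch (the import paths and call signatures below are assumptions based on the description above, not a documented API):

```python
# Hypothetical usage sketch: derl.env.make and make_model are mentioned above,
# but the import paths and call signatures here are assumptions.
from derl.env import make
from derl.models import make_model  # module path assumed

# An env wrapped with the standard preprocessing for its benchmark suite.
env = make("BreakoutNoFrameskip-v4")

# A torch.nn.Module with an architecture suited to atari observations; here
# we assume the model can be built from the env's observation and action
# spaces.
model = make_model(env.observation_space, env.action_space)
```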