This is the official implementation of "Approximating Gradients for Differentiable Quality Diversity in Reinforcement Learning" by Bryon Tjanaka, Matt Fontaine, Julian Togelius, and Stefanos Nikolaidis. Below is the abstract of the paper:
Consider the problem of training robustly capable agents. One approach is to generate a diverse collection of agent policies. Training can then be viewed as a quality diversity (QD) optimization problem, where we search for a collection of performant policies that are diverse with respect to quantified behavior. Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available. However, agent policies typically assume that the environment is not differentiable. To apply DQD algorithms to training agent policies, we must approximate gradients for performance and behavior. We propose two variants of the current state-of-the-art DQD algorithm that compute gradients via approximation methods common in reinforcement learning (RL). We evaluate our approach on four simulated locomotion tasks. One variant achieves results comparable to the current state-of-the-art in combining QD and RL, while the other performs comparably in two locomotion tasks. These results provide insight into the limitations of current DQD algorithms in domains where gradients must be approximated. Source code is available at https://github.com/icaros-usc/dqd-rl
For more info, visit the following links:

- Paper website: https://dqd-rl.github.io
- arXiv: https://arxiv.org/abs/2202.03666
To cite this paper, please use the following bibtex:
```
@misc{tjanaka2022approximating,
  title = {Approximating Gradients for Differentiable Quality Diversity
           in Reinforcement Learning},
  author = {Bryon Tjanaka and Matthew C. Fontaine and Julian Togelius
            and Stefanos Nikolaidis},
  year = {2022},
  eprint = {2202.03666},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://dqd-rl.github.io},
  note = "\url{https://dqd-rl.github.io}",
}
```
We primarily use the pyribs library in this implementation. If you use this code in your research, please also cite pyribs:
```
@misc{pyribs,
  title = {pyribs: A bare-bones Python library for quality diversity
           optimization},
  author = {Bryon Tjanaka and Matthew C. Fontaine and David H. Lee and
            Yulun Zhang and Trung Tran Minh Vu and Sam Sommerer and
            Nathan Dennler and Stefanos Nikolaidis},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/icaros-usc/pyribs}},
}
```
- Manifest
- Getting Started
- Running Experiments
- Running Analysis and Generating Figures
- Results
- Implementation
- Miscellaneous
- License
- `config/`: gin configuration files.
- `docs/`: Additional documentation.
- `src/`: Python implementations and related tools.
- `scripts/`: Bash scripts.
- Clone the repo:

  ```bash
  git clone https://github.com/icaros-usc/dqd-rl.git
  ```
- Install Singularity: All of our code runs in a Singularity / Apptainer container. Refer to the Singularity / Apptainer installation documentation to install it.
- Build or download the container: Build the Singularity container with

  ```bash
  sudo make container.sif
  ```

  Alternatively, download the container here and place it in the root directory of this repo.
- (Optional) Install NVIDIA drivers and CUDA: The node where the main script runs should have a GPU with NVIDIA drivers and CUDA installed (we have not included CUDA in the container). This is only necessary if you are running algorithms which use TD3; a quick sanity check is sketched below.
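A quick way to sanity-check the GPU setup before launching a TD3-based run is sketched below. This is just a convenience, not part of the repo's scripts, and it assumes PyTorch is available inside the container:

```bash
# Confirm the NVIDIA driver is installed on the host.
nvidia-smi

# Confirm that PyTorch inside the container can see the GPU
# (--nv exposes the host GPU to the Singularity container).
singularity exec --nv container.sif python -c "import torch; print(torch.cuda.is_available())"
```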
There are two commands for running experiments.
- `scripts/run_local.sh` runs on a local machine:

  ```bash
  bash scripts/run_local.sh CONFIG SEED NUM_WORKERS
  ```

  Where `CONFIG` is a gin file in `config/`, `SEED` is a random integer seed (we used values from 1-100), and `NUM_WORKERS` is the number of worker processes.

- `scripts/run_slurm.sh` runs on a SLURM cluster:

  ```bash
  bash scripts/run_slurm.sh CONFIG SEED HPC_CONFIG
  ```

  Here, `HPC_CONFIG` is the path to a config in `config/hpc`. It specifies the number of nodes on the cluster and the number of workers per node.
In our paper, we evaluated five algorithms (CMA-MEGA (ES), CMA-MEGA (TD3, ES), PGA-MAP-Elites, ME-ES, MAP-Elites) in four environments (QD Ant, QD Half-Cheetah, QD Hopper, QD Walker). We have included config files for all of these experiments. To replicate results from the paper, you will need to run each of the following commands 5 times with different random seeds.
```bash
# QD Ant
bash scripts/run_slurm.sh config/qd_ant/cma_mega_es.gin SEED config/hpc/100.sh
bash scripts/run_slurm.sh config/qd_ant/cma_mega_td3_es.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_ant/pga_me.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_ant/me_es.gin SEED config/hpc/100_high_mem.sh
bash scripts/run_slurm.sh config/qd_ant/map_elites.gin SEED config/hpc/100.sh

# QD Half-Cheetah
bash scripts/run_slurm.sh config/qd_half_cheetah/cma_mega_es.gin SEED config/hpc/100.sh
bash scripts/run_slurm.sh config/qd_half_cheetah/cma_mega_td3_es.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_half_cheetah/pga_me.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_half_cheetah/me_es.gin SEED config/hpc/100_high_mem.sh
bash scripts/run_slurm.sh config/qd_half_cheetah/map_elites.gin SEED config/hpc/100.sh

# QD Hopper
bash scripts/run_slurm.sh config/qd_hopper/cma_mega_es.gin SEED config/hpc/100.sh
bash scripts/run_slurm.sh config/qd_hopper/cma_mega_td3_es.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_hopper/pga_me.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_hopper/me_es.gin SEED config/hpc/100_high_mem.sh
bash scripts/run_slurm.sh config/qd_hopper/map_elites.gin SEED config/hpc/100.sh

# QD Walker
bash scripts/run_slurm.sh config/qd_walker/cma_mega_es.gin SEED config/hpc/100.sh
bash scripts/run_slurm.sh config/qd_walker/cma_mega_td3_es.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_walker/pga_me.gin SEED config/hpc/100_gpu.sh
bash scripts/run_slurm.sh config/qd_walker/me_es.gin SEED config/hpc/100_high_mem.sh
bash scripts/run_slurm.sh config/qd_walker/map_elites.gin SEED config/hpc/100.sh
```
These commands run with 100 workers, but more workers would not help since we only evaluate 100 solutions at a time. To run locally, replace `run_slurm.sh` with `run_local.sh` and pass a number of workers instead of an HPC config:

```bash
bash scripts/run_local.sh config/qd_ant/cma_mega_es.gin SEED 100
```
Regardless of whether experiments are run locally or on a cluster, all results are placed in a logging directory under `logs/`. The directory's name is of the form `logs/%Y-%m-%d_%H-%M-%S_dashed-name`, e.g. `logs/2020-12-01_15-00-30_experiment-1`. Refer to the logging directory manifest for a list of files in the directory. `run_slurm.sh` additionally outputs a separate directory which stores the stdout of the scheduler and workers; see below for more info.
See below for how to analyze results and generate figures.
The remainder of this section provides useful info for running experiments.
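For convenience, the most recent logging directory can be located with a standard shell one-liner (this is generic shell usage, not part of the repo's scripts):

```bash
# List logging directories sorted by modification time, newest first.
ls -td logs/*/ | head -n 1
```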
Each logging directory contains the following files:
```
- config.gin          # All experiment config variables, lumped into one file.
- seed                # Text file containing the seed for the experiment.
- reload.pkl          # Data necessary to reload the experiment if it fails.
- reload_td3.pkl      # Pickle data for TD3 (only applicable in some experiments).
- reload_td3.pth      # PyTorch models for TD3 (only applicable in some experiments).
- metrics.json        # Metrics like QD score; intended for MetricLogger.
- all_results.pkl     # All returns and BCs from function evaluations during the run.
- hpc_config.sh       # Same as the config in the Slurm dir, if Slurm is used.
- archive/            # Snapshots of the full archive, including solutions and
                      # metadata, in pickle format.
- archive_history.pkl # Stores objective values and behavior values necessary
                      # to reconstruct the archive. Solutions and metadata are
                      # excluded to save memory.
- slurm_YYYY-MM-DD_HH-MM-SS/ # Slurm log dir (only exists if using Slurm).
                             # There can be a few of these if there were reloads.
  - config/
    - [config].sh     # Copied from `hpc/config`
  - job_ids.txt       # Job IDs; can be used to cancel job (scripts/slurm_cancel.sh).
  - logdir            # File containing the name of the main logdir.
  - scheduler.slurm   # Slurm script for scheduler and experiment invocation.
  - scheduler.out     # stdout and stderr from running scheduler.slurm.
  - worker-{i}.slurm  # Slurm script for worker i.
  - worker-{i}.out    # stdout and stderr for worker i.
```
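For a quick look at the recorded metrics, `metrics.json` can be pretty-printed with Python's built-in `json.tool` (a generic convenience, not part of the repo's tooling; the directory name below is just the earlier example):

```bash
python -m json.tool logs/2020-12-01_15-00-30_experiment-1/metrics.json | less
```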
In addition to a logging directory, `run_slurm.sh` outputs a Slurm directory with items such as the stdout of the scheduler and workers. To move these into the logging directory, run `slurm_postprocess.sh` (see below).
There are a number of helpful utilities associated with Slurm scripts. These reminders are output on the command line by `run_slurm.sh` after it executes:

- `tail -f ...` - Use this to monitor stdout and stderr of the main experiment script.
- `bash scripts/slurm_cancel.sh ...` - This will cancel the job.
- `ssh -N ...` - This will set up a tunnel from the HPC to your laptop so you can monitor the Dask dashboard. Run this on your local machine.
- `bash scripts/slurm_postprocess.sh ...` - This will move the slurm logs into the logging directory. Run it after the experiment has finished.
You can monitor the status of your Slurm experiments with:

```bash
watch scripts/slurm_dashboard.sh
```

Since the dashboard output can be quite long, it can be useful to scroll through it. For this, consider an alternative to `watch`, such as viddy.
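For example, assuming viddy is installed, it can be substituted directly for `watch`:

```bash
viddy scripts/slurm_dashboard.sh
```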
To test an experiment configuration with smaller settings, add `_test` to the end of a name, e.g. `config/qd_ant/cma_mega_es.gin_test`. Then, the original config (`config/qd_ant/cma_mega_es.gin`) and `config/test.gin` will be included.
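For example, the following runs a scaled-down CMA-MEGA (ES) experiment locally (the seed and worker count here are arbitrary placeholders):

```bash
bash scripts/run_local.sh config/qd_ant/cma_mega_es.gin_test 42 4
```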
While the experiment is running, its state is saved to "reload files" (AKA checkpoints) in the logging directory. If the experiment fails, e.g. due to memory limits, time limits, or network connection issues, run this command with the name of the existing logging directory:
```bash
bash scripts/slurm_reload.sh LOGDIR
```

This will continue the job with the exact same configurations as before. For finer-grained control, refer to the `-r` flag in `run_slurm.sh`. `run_local.sh` also provides an option for reloading:

```bash
bash scripts/run_local.sh CONFIG SEED NUM_WORKERS LOGDIR
```
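For instance, reusing the sample directory name from earlier, a local reload might look like:

```bash
bash scripts/run_local.sh config/qd_ant/cma_mega_es.gin SEED 100 logs/2020-12-01_15-00-30_experiment-1
```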
Refer to `src/analysis/figures.py` and `src/analysis/supplemental.py`.
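These are Python scripts meant to be run inside the container; their exact command-line arguments are defined in the scripts themselves, so the following is only a hypothetical sketch:

```bash
# Open a shell in the Singularity container (see the Implementation section).
make shell

# Inside the container, invoke the analysis scripts, e.g. (arguments are
# script-specific; check each script before running):
# python -m src.analysis.figures
# python -m src.analysis.supplemental
```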
The following plot shows all metrics for all algorithms after 1 million evaluations. Refer to the appendix of our paper for final numerical values.
Each experiment is structured as shown in the following diagram. Dask is the distributed compute library we use. When we run an experiment, we connect to a Dask scheduler, which is in turn connected to one or more Dask workers. Each component runs in a Singularity container.
The algorithm implementations are primarily located in the following files:

- `CMA-MEGA (ES)` and `CMA-MEGA (TD3, ES)`: `src/emitters/gradient_improvement_emitter.py`
- `PGA-ME`: `src/emitters/pga_emitter.py`, `src/emitters/gaussian_emitter.py`
- `ME-ES`: `src/me_es/` (adapted from the authors' implementation)
- `MAP-Elites`: `src/emitters/gaussian_emitter.py`

Other key files include:

- `src/main.py`: Entry point for all experiments.
- `src/manager.py`: Handles all experiments that are implemented with pyribs.
- `src/objectives/gym_control/`: Code that evaluates all solutions in the QDGym environments.
The Makefile has several useful commands. Run `make` for a full command reference.
There are some tests alongside the code to ensure basic correctness. To run these, start a Singularity container with:

```bash
make shell
```

Within that container, execute:

```bash
make test
```
To understand the code, it will be useful to be familiar with the following libraries:

- pyribs (quality diversity optimization)
- Dask (distributed compute)
- gin (configuration)
- In the codebase, we refer to `behavior_values` and `BCs` (behavior characteristics). These are synonymous with `measures` in the paper.
- We use `PGA-ME` and `PGA-MAP-Elites` interchangeably in the code.
- We also use `iterations` and `generations` interchangeably.
- In our code (specifically `src/manager.py`), we measure `Robustness` on every iteration. However, this metric is only the robustness of the best-performing solution. The `Mean Robustness` that we describe in the paper is computed in a separate script (`src/analysis/robustness.py`) after experiments are completed.
This code is released under the MIT License, with the following exceptions:
- The `src/me_es/` directory is derived from Colas 2020 (repository) and is released under the Uber Non-Commercial License.
- The `src/qd_gym/` directory is adapted from Olle Nilsson's QDgym and is released under an MIT license.