Akella Ravi Tej1, Kamyar Azizzadenesheli1, Mohammad Ghavamzadeh2, Anima Anandkumar3, Yisong Yue3
1Purdue University, 2Google Research, 3Caltech
Preprint: arxiv.org/abs/2006.15637
Publication: AAAI-21 (also presented at NeurIPS Deep RL and Real-World RL Workshops 2020)
Project Website: akella17.github.io/publications/Deep-Bayesian-Quadrature-Policy-Optimization/
Bayesian quadrature (BQ) is an approach from probabilistic numerics for approximating numerical integrals. When estimating the policy gradient integral, replacing standard Monte-Carlo estimation with Bayesian quadrature provides
- more accurate gradient estimates with a significantly lower variance
- a consistent improvement in the sample complexity and average return for several policy gradient algorithms
- a methodological way to quantify the uncertainty in gradient estimation.
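To make the contrast with Monte-Carlo estimation concrete, here is a minimal, self-contained sketch of Bayesian quadrature on a toy one-dimensional integral, E[cos(x)] under a standard normal. It is not the implementation in this repository: it assumes a fixed RBF kernel and a Gaussian integration measure so that the kernel mean has a closed form, and the names `rbf`, `cholesky_solve`, and `bayesian_quadrature` are illustrative only.

```python
import math
import random


def rbf(x, y, ell):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 ell^2))."""
    return math.exp(-0.5 * (x - y) ** 2 / ell ** 2)


def cholesky_solve(A, b):
    """Solve A w = b for a symmetric positive-definite matrix A."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n  # forward substitution: L y = b
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    w = [0.0] * n  # back substitution: L^T w = y
    for i in reversed(range(n)):
        w[i] = (y[i] - sum(L[k][i] * w[k] for k in range(i + 1, n))) / L[i][i]
    return w


def bayesian_quadrature(f, xs, ell=1.0, jitter=1e-6):
    """Posterior mean and variance of the integral of f against N(0, 1).

    Assumes a zero-mean GP prior on f with the RBF kernel above.
    """
    n = len(xs)
    K = [[rbf(xs[i], xs[j], ell) + (jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Kernel mean z_i = integral of k(x, x_i) N(x; 0, 1) dx (closed form).
    scale = math.sqrt(ell ** 2 / (ell ** 2 + 1.0))
    z = [scale * math.exp(-0.5 * x ** 2 / (ell ** 2 + 1.0)) for x in xs]
    w = cholesky_solve(K, z)  # quadrature weights K^{-1} z
    mean = sum(wi * f(xi) for wi, xi in zip(w, xs))
    # Prior variance of the integral: double integral of k against N(0, 1).
    prior_var = math.sqrt(ell ** 2 / (ell ** 2 + 2.0))
    var = max(prior_var - sum(wi * zi for wi, zi in zip(w, z)), 0.0)
    return mean, var


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [rng.gauss(0.0, 1.0) for _ in range(20)]
    mc_estimate = sum(math.cos(x) for x in samples) / len(samples)
    bq_mean, bq_var = bayesian_quadrature(math.cos, samples)
    print("exact value  :", math.exp(-0.5))
    print("Monte-Carlo  :", mc_estimate)
    print("BQ mean / var:", bq_mean, bq_var)
```

Both estimators use the same samples. Monte-Carlo weights them uniformly, whereas BQ derives the weights from the kernel and also returns a posterior variance, which is the one-dimensional analogue of the gradient covariance mentioned above.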
This repository contains a computationally efficient implementation of BQ for estimating the policy gradient integral (gradient vector) and the estimation uncertainty (gradient covariance matrix). The source code is written in a modular fashion, currently supporting three policy gradient estimators and three policy gradient algorithms (9 combinations overall):
Policy Gradient Estimators:
- Monte-Carlo Estimation
- Deep Bayesian Quadrature Policy Gradient (DBQPG)
- Uncertainty Aware Policy Gradient (UAPG)
Policy Gradient Algorithms:
- Vanilla Policy Gradient
- Natural Policy Gradient (NPG)
- Trust-Region Policy Optimization (TRPO)
This codebase requires Python 3.6 (or higher). We recommend using Anaconda or Miniconda to set up the virtual environment. Here is a walkthrough of the installation and project setup.
git clone https://github.com/Akella17/Deep-Bayesian-Quadrature-Policy-Optimization.git
cd Deep-Bayesian-Quadrature-Policy-Optimization
conda create -n DBQPG python=3.6
conda activate DBQPG
pip install -r requirements.txt
Modular implementation:
python agent.py --env-name <gym_environment_name> --pg_algorithm <VanillaPG/NPG/TRPO> --pg_estimator <MC/BQ> --UAPG_flag
Each experiment runs for 1000 policy updates, and the logs are stored in the session_logs/ folder. To reproduce the results in the paper, use the following commands:
# Running Monte-Carlo baselines
python agent.py --env-name <gym_environment_name> --pg_algorithm <VanillaPG/NPG/TRPO> --pg_estimator MC
# DBQPG as the policy gradient estimator
python agent.py --env-name <gym_environment_name> --pg_algorithm <VanillaPG/NPG/TRPO> --pg_estimator BQ
# UAPG as the policy gradient estimator
python agent.py --env-name <gym_environment_name> --pg_algorithm <VanillaPG/NPG/TRPO> --pg_estimator BQ --UAPG_flag
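To sweep all nine estimator/algorithm combinations, the commands above can be generated programmatically. The following is a small sketch that only builds the command lines from the flags shown above and does not launch anything. The helper `build_commands` and the environment name `Swimmer-v2` are assumptions for illustration; substitute any Gym environment you have installed.

```python
from itertools import product

ALGORITHMS = ["VanillaPG", "NPG", "TRPO"]

# Estimator label -> CLI flags, mirroring the three commands listed above.
ESTIMATORS = {
    "MC": ["--pg_estimator", "MC"],
    "DBQPG": ["--pg_estimator", "BQ"],
    "UAPG": ["--pg_estimator", "BQ", "--UAPG_flag"],
}


def build_commands(env_name):
    """Return one argument list per (algorithm, estimator) pair."""
    commands = []
    for algorithm, flags in product(ALGORITHMS, ESTIMATORS.values()):
        commands.append(
            ["python", "agent.py", "--env-name", env_name,
             "--pg_algorithm", algorithm, *flags]
        )
    return commands


if __name__ == "__main__":
    # "Swimmer-v2" is a placeholder environment name, not a recommendation.
    for command in build_commands("Swimmer-v2"):
        print(" ".join(command))
```

Each returned list can be passed to `subprocess.run` if you want to launch the runs from Python rather than copy the printed lines into a shell.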
For more customization options, see arguments.py.
visualize.ipynb can be used to visualize the Tensorboard files stored in session_logs/ (requires jupyter and tensorboard to be installed).
- pytorch-trpo
  - TRPO and NPG implementation.
- GPyTorch library
  - Structured kernel interpolation (SKI) with Toeplitz method for RBF kernel.
  - Kernel learning with GPU acceleration.
- fbpca
  - Fast randomized singular value decomposition (SVD) through implicit matrix-vector multiplications.
- "A new trick for calculating Jacobian vector products"
  - Efficient Jvp computation through regular reverse-mode autodiff (more details in Appendix D of our paper).
Contributions are very welcome. If you know how to make this code better, please open an issue. If you want to submit a pull request, please open an issue first. Also see the todo list below.
- Implement policy network for discrete action space and test on Arcade Learning Environment (ALE).
- Add other policy gradient algorithms.
If you find this work useful, please consider citing:
@article{ravi2020DBQPG,
title={Deep Bayesian Quadrature Policy Optimization},
author={Akella Ravi Tej and Kamyar Azizzadenesheli and Mohammad Ghavamzadeh and Anima Anandkumar and Yisong Yue},
journal={arXiv preprint arXiv:2006.15637},
year={2020}
}