MSc project on Deep Reinforcement Learning. A3C was chosen as the baseline algorithm. My thesis is available here.
Currently, the master branch supports only a minimal working example of the most relevant contributions:
- Learning policy quantization, run with `python3 train.py --model a3clpq` or `python3 train.py --model a3cglpq`
- Local weight sharing layers, run with `python3 train.py --model a3clws`
- Spatial softmax and what-and-where architectures, run with `python3 train.py --model a3css` or `python3 train.py --model a3cww`
For other command-line arguments, check out `common/config.py`, in which you'll find a `Config` class that contains the parameters and their default values. Running `pip install -r requirements.txt` should get you up and running!
Learning policy quantization (LPQ) is an output layer that is largely inspired by learning vector quantization (LVQ): it replaces the softmax operator with a prototype-based LVQ layer. The learning mechanism must be slightly adapted, since LVQ normally applies to supervised learning problems. By using a soft class assignment, as in robust soft learning vector quantization, we can still apply the default actor-critic update to the policy weights. The sign and magnitude of the advantage then implicitly 'supervise' the direction in which the prototypes move.
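As a rough illustration, here is a minimal sketch of such a soft prototype-based policy layer. It assumes one prototype per action and negative squared Euclidean distance as the similarity measure; the actual layer in this repo may differ (e.g. multiple prototypes per action):

```python
import numpy as np

def lpq_policy(features, prototypes):
    """Soft LPQ-style policy: turns negative squared distances to action
    prototypes into probabilities, analogous to robust soft LVQ.

    features:   (d,) feature vector from the last hidden layer
    prototypes: (num_actions, d) matrix, one prototype per action (an
                assumption made for this sketch)
    """
    # Negative squared Euclidean distance plays the role of the logit.
    logits = -np.sum((prototypes - features) ** 2, axis=1)
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()      # action probabilities, as with softmax
```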
The video here demonstrates an agent with an LPQ layer:
I formerly called these layers 'spatial interpolation soft weight sharing layers'. For further info, see this repo.
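For intuition, below is a minimal 1-D sketch of the local weight sharing idea: each output position gets its own kernel, interpolated between a few shared filter banks according to the position's proximity to each bank's centre. The Gaussian interpolation scheme and the fixed centres here are assumptions for illustration, not necessarily what the repo implements:

```python
import numpy as np

def lws_conv1d(x, banks, centers, sigma=0.2):
    """1-D local weight sharing: every output position uses its own kernel,
    formed by softly interpolating between a few shared filter banks.

    x:       (n,) input signal
    banks:   (B, k) B filter banks of width k
    centers: (B,) bank centres in normalised position space [0, 1]
    sigma:   bandwidth of the (assumed) Gaussian interpolation
    """
    B, k = banks.shape
    n_out = len(x) - k + 1
    out = np.empty(n_out)
    for t in range(n_out):
        pos = t / max(n_out - 1, 1)  # normalised position of this output unit
        # Gaussian similarity of this position to every bank centre.
        sim = np.exp(-((pos - centers) ** 2) / (2 * sigma ** 2))
        alpha = sim / sim.sum()
        kernel = alpha @ banks       # per-position interpolated kernel
        out[t] = kernel @ x[t:t + k]
    return out
```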
The spatial softmax and what-and-where architectures use spatial softmax layers, which are taken from Levine et al. (2015). They generally perform worse than the default A3C architecture. The video below demonstrates a trained spatial softmax agent on Pong:
The what-and-where architecture works better for certain games, e.g.:
Here, A3C WW is the what-and-where architecture, A3C SS is the architecture that uses spatial softmax, and A3C FF is the default architecture as proposed by Mnih et al. (2016). The what-and-where architecture combines ordinary ReLU activations on the output features with the spatial softmax mechanism, so that the network can decide for itself which of the two to use (similar in spirit to Inception layers). A sketch of the spatial softmax computation follows below.
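To make the mechanism concrete, here is a minimal sketch of a spatial softmax as described by Levine et al.: per channel, a softmax over all spatial positions, followed by the expected (x, y) coordinate of the resulting distribution. A what-and-where head would then concatenate these coordinates with plain ReLU activations of the same feature maps; the exact wiring in this repo may differ:

```python
import numpy as np

def spatial_softmax(features):
    """Spatial softmax: per channel, softmax over all spatial positions,
    then the expected (x, y) coordinate of that distribution.

    features: (C, H, W) activation maps
    returns:  (C, 2) expected coordinates in [-1, 1] per channel
    """
    C, H, W = features.shape
    flat = features.reshape(C, -1)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    # Pixel coordinate grids, normalised to [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, H),
                         np.linspace(-1.0, 1.0, W), indexing="ij")
    expected_x = probs @ xs.ravel()                 # E[x] per channel
    expected_y = probs @ ys.ravel()                 # E[y] per channel
    return np.stack([expected_x, expected_y], axis=1)
```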
`train.py` is the default script. Check out the `Config` class in `common/config.py` to see the command-line arguments that you can specify. Any member of `Config` can be set through the command line.
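As an illustration of that pattern, the sketch below shows one way a `Config` class can expose its members as command-line flags via `argparse`. This is not the repo's actual code; the attribute names and default values are hypothetical, and only `model` corresponds to a flag shown above:

```python
import argparse

class Config:
    # Hypothetical members with illustrative defaults; the real
    # common/config.py may define different parameters.
    model = "a3cff"          # which architecture to train
    learning_rate = 7e-4
    n_threads = 8

    def parse_args(self):
        parser = argparse.ArgumentParser()
        # Expose every public class attribute as a --flag with its default.
        for name, default in vars(type(self)).items():
            if name.startswith("_") or callable(default):
                continue
            parser.add_argument(f"--{name}", type=type(default), default=default)
        args = parser.parse_args()
        for name, value in vars(args).items():
            setattr(self, name, value)
        return self

# Usage: `python3 train.py --model a3clpq` would set config.model accordingly.
config = Config().parse_args()
```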