A general distributed RL platform based on a modified IMPALA architecture, written in pure Python (>=3.8). It provides features for training, testing, and evaluating distributed RL agents on a single-node computational instance. Multi-node scaling is not supported.
The platform presently supports these agents:
- V-trace (IMPALA)
For mixing on- and off-policy data, the following replay buffer methods are supported:
- Experience replay
- Attentive experience replay
- Elite experience replay
Environment support:
- ALE environments
Before using the RL platform, all required dependencies must be installed. The application is implemented in Python 3.8 and PyTorch 1.8.2; therefore, it can be run on any OS with a Python 3 interpreter. We recommend building the execution environment manually with a Python virtual environment, or using Docker for automatic deployment. Required Python modules are listed in requirements.txt, located in the top-level folder of the project alongside main.py. Beware that some transitive dependencies may not be listed.
pip install -r requirements.txt
wget http://www.atarimania.com/roms/Roms.rar
sudo apt install unrar
unrar e Roms.rar
unzip ROMS.zip
ale-import-roms ROMS
Before starting any training, it is advantageous to study the file option_flags.py, which contains all application (hyper)parameters and their default values. The entry point of the application is located inside main.py. Each execution is uniquely identified (within the computational instance) by the agent's environment name plus a UNIX timestamp. All files related to a specific execution, regardless of their purpose, are stored inside the folder <project>/results/<environment_name>_<unix_timestamp>.
A user can safely interrupt the application with a SIGINT signal (CTRL+C in the terminal). Training progress will be safely stored before termination.
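For illustration, a minimal sketch of how such a per-run results folder could be derived; the actual implementation lives in main.py, and the helper name below is hypothetical:

```python
import os
import time

def make_run_dir(env_name: str, results_root: str = "results") -> str:
    """Hypothetical sketch: build the <environment_name>_<unix_timestamp> run folder."""
    run_id = f"{env_name}_{int(time.time())}"   # e.g. PongNoFrameskip-v4_1700000000
    run_dir = os.path.join(results_root, run_id)
    os.makedirs(run_dir, exist_ok=True)         # checkpoints and metrics for this run go here
    return run_dir
```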
The current implementation only supports environments from ALE with the explicit suffix NoFrameskip-v4. Frame skipping is handled by a custom environment pre-processing wrapper, and the usage of sticky actions has not been tested yet; therefore, it is not supported.
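The project's own pre-processing wrapper is not reproduced here; a minimal frame-skip wrapper in the spirit described above could look like the sketch below (classic gym step API assumed, class name hypothetical):

```python
import gym

class FrameSkipWrapper(gym.Wrapper):
    """Hypothetical sketch: repeat each action `skip` times, since a
    NoFrameskip-v4 environment leaves frame skipping to the wrapper."""

    def __init__(self, env: gym.Env, skip: int = 4):
        super().__init__(env)
        self._skip = skip

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        obs = None
        for _ in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```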
Presently, the RL platform only supports the V-trace RL algorithm and can be operated in 3 modes: new training, testing, and training from a checkpoint.
python main.py --op_mode=train \
--env=PongNoFrameskip-v4 \
--environment_max_steps=1000000 \
--batch_size=10 \
--r_f_steps=10 \
--worker_count=2 \
--envs_per_worker=2 \
--replay_parameters='[{"type": "queue", "capacity": 1, "sample_ratio": 0.5}, {"type": "standard", "capacity": 1000, "sample_ratio": 0.5}]'
python main.py --op_mode=train_w_load \
--environment_max_steps=1000000 \
--load_model_url=<path_to_model>
python main.py --op_mode=test \
--test_episode_count=1000 \
--load_model_url=<path_to_model> \
--render=True
Multiple experiments can be executed in sequence using the Python loop in multi_train.py, or a custom loop in a terminal script (e.g., a bash script) applied to the standard application entry point in main.py. The order of importance of the different application arguments is as follows:
- Standard command-line arguments (argv[1:])
- Additional arguments passed to the application from multi_train.py with the change_args function
- Arguments loaded from a saved checkpoint file
- Default argument values stored inside option_flags.py
Values of arguments with higher priority overwrite those with lower priority.
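Conceptually, this priority can be pictured as layering argument sources from lowest to highest priority; the snippet below only illustrates the overwrite order, not the platform's actual flag handling (all names are hypothetical):

```python
def resolve_args(defaults, checkpoint_args, multi_train_overrides, cli_args):
    """Hypothetical sketch: later updates overwrite earlier ones,
    mirroring the priority order described above (lowest priority first)."""
    resolved = {}
    for layer in (defaults, checkpoint_args, multi_train_overrides, cli_args):
        resolved.update(layer)          # higher-priority layer wins on conflicts
    return resolved

# Example: the command-line value for batch_size wins over every other source.
print(resolve_args(
    {"batch_size": 8, "r_f_steps": 20},   # option_flags.py defaults
    {"batch_size": 10},                    # saved checkpoint
    {"r_f_steps": 10},                     # multi_train.py change_args
    {"batch_size": 16},                    # command line (argv[1:])
))
```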
Another important thing to note is that each training needs at least 1 replay object, and the sum of the sample_ratios of all used replays must be 1. The sample ratio dictates the proportion of samples taken from each replay to form a batch. If we don't want to use a replay buffer and only want to pass experiences as they are being generated, we can use a queue with capacity 1:
replay_parameters='[{"type": "queue", "capacity": 1, "sample_ratio": 1}]'
- Multi-learner architecture
- Adaptive asynchronous batching (caching)
- Support other environment collections like MuJoCo, DeepMind Lab
- Implement PPO based distributed algorithm, i.e., IMPACT
- System for saving performance metric values into text files in chunks in periodical intervals
- Custom testing-worker used solely for collecting values of performance metrics by following current policy
- Multi-GPU support
- Implement a graphical user interface (GUI) for monitoring training progress and hardware utilization
Elite experience replay is a replay buffer method that utilizes the elite sampling technique, which uses an estimate of the "off-policiness" of n-step state transitions to prioritize the samples selected from the replay and thereby increase the overall sample efficiency of the RL algorithm. Elite sampling calculates the similarity between the same state sequence encoded with the behavioral policy (encoded when the sequence is generated by a worker) and with the target policy. States encoded into several values using the policy NN model are referred to as state feature vectors.
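For intuition, the distance used to rank samples could be a simple norm between the behavioral and target feature vectors of the same state sequence. This is only a sketch of the idea; the actual metric is selected via the dist_function parameter (e.g. ln_norm), and the function below is hypothetical:

```python
import torch

def off_policy_distance(behaviour_features: torch.Tensor,
                        target_features: torch.Tensor,
                        p: int = 2) -> torch.Tensor:
    """Hypothetical sketch: average L-p distance between the state feature vectors
    of one trajectory encoded under the behavioral and the current (target) policy.
    behaviour_features, target_features: [n_steps, feature_dim]."""
    per_step = torch.norm(behaviour_features - target_features, p=p, dim=-1)
    return per_step.mean()    # lower distance = more on-policy ("elite") sample
```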
We have tested elite experience replay in combination with the V-trace agent on several environments from ALE and compared its performance to an agent with a standard replay buffer. Our experiments showed that elite sampling improves the agent's performance over uniform sampling in the highly policy-volatile parts of the training process. Furthermore, the decrease in the agent's training speed caused by the computation of the feature-vector distance metric can be partially counteracted by preparing training batches pre-emptively in the background with caching.
(Performance comparison plots: Breakout | Seaquest)
Implemented elite sampling strategies (sampling is executed only on a small random subset of the replay, whose size is determined by the batch size and the batch multiplier hyperparameters); a sketch of strategy 2 follows the list:
1. Pick a batch number of samples with the lowest off-policy distance metric.
2. Sort the samples based on the off-policy distance metric, then divide them into a batch number of subsets. From each subset, pick the trajectory with the lowest off-policy distance metric.
3. Same as 1, with the addition that we prioritize the samples that have been sampled the least.
4. Same as 2, with the addition that we prioritize the samples that have been sampled the least.
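As a rough illustration of strategy 2 (stratified elite sampling), not the platform's actual implementation (helper name is hypothetical):

```python
def stratified_elite_sample(candidates, distances, batch_size):
    """Hypothetical sketch of strategy 2: sort candidates by off-policy distance,
    split them into batch_size contiguous subsets, and take the best
    (lowest-distance) trajectory from each subset."""
    order = sorted(range(len(candidates)), key=lambda i: distances[i])
    subset_size = max(1, len(order) // batch_size)
    batch = []
    for b in range(batch_size):
        subset = order[b * subset_size:(b + 1) * subset_size]
        if subset:                      # first index has the lowest distance in the subset
            batch.append(candidates[subset[0]])
    return batch
```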
python main.py --op_mode=train \
--env=PongNoFrameskip-v4 \
--environment_max_steps=1000000 \
--replay_parameters='[{"type": "custom", "capacity": 1000, "sample_ratio": 0.5, "dist_function":"ln_norm", "sample_strategy":"elite_sampling", "lambda_batch_multiplier":6, "alfa_annealing_factor":2.0} ,{"type": "queue", "capacity": 1, "sample_ratio": 0.5}]'