A collection of off-policy reinforcement learning algorithms with Mujoco benchmark results from 4 seeds.
Welcome to our off-policy reinforcement learning (RL) repository, which brings together three state-of-the-art algorithms implemented in PyTorch: Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), and TD7, the most recent of the three, introduced by S. Fujimoto in 2023.
The repository aims to provide clean and comprehensive implementations of these three off-policy RL algorithms. Each algorithm is evaluated on the Mujoco benchmark across four seeds for a thorough comparison of their capabilities.
RL Algorithms:
- SAC: Soft Actor-Critic, a powerful off-policy algorithm known for its excellent sample efficiency and stable training.
- TD3: Twin Delayed DDPG, a proven performer in off-policy learning and the algorithm that TD7 builds on (its core target update is sketched after this list).
- TD7: The latest advancement from S. Fujimoto, which adds four further techniques on top of TD3 to improve both performance and stability.
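To make the distinction concrete, below is a minimal PyTorch sketch of the value targets these algorithms compute: the clipped double-Q target with target policy smoothing used by TD3 (and inherited by TD7), and SAC's entropy-augmented target. The function names, arguments, and network call signatures are illustrative placeholders, not this repository's actual module layout.

```python
import torch


def td3_target(reward, not_done, next_state,
               actor_target, q1_target, q2_target,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target policy smoothing (TD3-style)."""
    with torch.no_grad():
        # Target policy smoothing: perturb the target action with clipped noise.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)
        # Clipped double Q-learning: take the minimum of the two target critics.
        target_q = torch.min(q1_target(next_state, next_action),
                             q2_target(next_state, next_action))
        return reward + not_done * gamma * target_q


def sac_target(reward, not_done, next_state,
               actor, q1_target, q2_target,
               alpha=0.2, gamma=0.99):
    """Entropy-augmented soft value target (SAC-style)."""
    with torch.no_grad():
        # The actor is assumed to return a sampled action and its log-probability.
        next_action, log_prob = actor(next_state)
        soft_q = torch.min(q1_target(next_state, next_action),
                           q2_target(next_state, next_action)) - alpha * log_prob
        return reward + not_done * gamma * soft_q
```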
# Clone repo.
git clone https://github.com/seungju-k1m/sac-td3-td7.git
# Change Directory.
cd sac-td3-td7
# Create a virtual environment (Python 3.10).
python -m venv .venv --prompt rl
# Activate virtual env.
source .venv/bin/activate
# Install dependencies.
pip install -r requirements.txt
# There are three ways to run training.
# First, use the CLI.
make cli
>>> rl sac --help
>>> rl td3 --help
>>> rl td7 --help
>>> # TD7 Example.
>>> rl td7 Ant-v4 ant@seed111 --seed 1111 --record-video
>>> # Replay your training result.
>>> replay save/td7/<your_train_dir>
>>> # Press Ctrl+D to exit CLI mode.
# Second, run the Python file directly.
python cli.py rl td7 Ant-v4 ant@seed111 --seed 1111
# Third, reproduce all the experiments: 4 different seeds for each Mujoco environment.
source run.sh
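The --seed option is what makes the per-seed benchmark runs reproducible. As a rough illustration, a typical seeding routine for a Gymnasium Mujoco environment looks like the sketch below; the helper name and exact call order are assumptions, not necessarily what this repository does internally.

```python
import random

import numpy as np
import torch
import gymnasium as gym


def seed_everything(seed: int, env: gym.Env) -> None:
    """Seed Python, NumPy, PyTorch, and the environment (illustrative helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    env.reset(seed=seed)
    env.action_space.seed(seed)


env = gym.make("Ant-v4")
seed_everything(1111, env)
```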
We compared our Mujoco benchmark results with the paper in two aspects: performance and training stability. Performance was measured as the average return across four seeds, while training stability was assessed by recording each agent's intermediate performance throughout training and visualizing how it evolves over the course of training for each seed.
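Concretely, each entry in the table below is a mean ± standard deviation over the per-seed evaluation returns at a given timestep, as in this small example (the numbers and the choice of sample standard deviation are for illustration only):

```python
import numpy as np

# Hypothetical per-seed evaluation returns for one environment at one checkpoint.
returns_per_seed = np.array([6612.0, 6487.0, 6590.0, 6535.0])

mean = returns_per_seed.mean()
std = returns_per_seed.std(ddof=1)  # sample standard deviation across seeds
print(f"{mean:.0f} ± {std:.0f}")    # reported as "mean ± std" in the table
```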
Environment | Timesteps | TD7 (Ours) | TD7 (Paper) | TD3 (Ours) | SAC (Ours) |
---|---|---|---|---|---|
Ant-v4 | 300k | 6556 ± 95 | 6171 ± 831 | 3230 ± 700 | 1953 ± 785 |
Ant-v4 | 1M | 7729 ± 504 | 8509 ± 422 | 5423 ± 682 | 4623 ± 984 |
HalfCheetah-v4 | 300k | 14833 ± 575 | 15031 ± 401 | 7316 ± 118 | 7847 ± 384 |
HalfCheetah-v4 | 1M | 17356 ± 299 | 17434 ± 155 | 10626 ± 829 | 10469 ± 1123 |
Hopper-v4 | 300k | 2858 ± 1051 | 2948 ± 464 | 733 ± 817 | 1908 ± 1117 |
Hopper-v4 | 1M | 3905 ± 128 | 3512 ± 315 | 2337 ± 1157 | 3149 ± 661 |
Walker2d-v4 | 300k | 5016 ± 452 | 5379 ± 328 | 1541.5 ± 1040 | 2560 ± 820 |
Walker2d-v4 | 1M | 5938 ± 487 | 6097 ± 570 | 3439 ± 225 | 4293 ± 500 |
Humanoid-v4 | 300k | 4910 ± 767 | 5332 ± 714 | 2504 ± 1700 | 3199 ± 458 |
Humanoid-v4 | 1M | 7287 ± 352 | 7429 ± 153 | 5255 ± 220 | 5044 ± 390 |