Actor-only policy learning with zeroth-order gradient representations of the Critic outperforms heuristic selection of feasible Critic models. We establish this in our work on model-free learning of optimal deterministic resource allocations in wireless systems via action-space exploration. See the paper for our PD-ZDPG+ algorithm: https://ieeexplore.ieee.org/abstract/document/9596327. If you find our algorithm useful, please consider citing our paper.
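For readers new to the idea, the sketch below illustrates what a zeroth-order gradient representation means in general: the gradient of a function with respect to the action is estimated purely from function evaluations at randomly perturbed actions, with no learned Critic model. This is a generic two-point Gaussian-smoothing estimator, not the PD-ZDPG+ algorithm from the paper; the names `zeroth_order_gradient` and `toy_reward`, the smoothing radius `mu`, and the sample count are all assumptions made for illustration.

```python
import random


def zeroth_order_gradient(f, x, mu=1e-3, num_samples=2000, seed=0):
    """Estimate the gradient of f at x using only evaluations of f.

    Generic two-point estimator: average of
    (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over Gaussian directions u.
    """
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        # Random exploration direction in action space.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Finite-difference slope along u.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad


def toy_reward(a):
    # Hypothetical quadratic reward; its true gradient at (0, 0) is (2, -4).
    return -(a[0] - 1.0) ** 2 - (a[1] + 2.0) ** 2


estimate = zeroth_order_gradient(toy_reward, [0.0, 0.0])
print(estimate)
```

The estimate is noisy and only approaches the true gradient as the number of sampled directions grows, which is the usual trade-off of replacing a Critic with function evaluations.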
Deterministic Policy Gradient via Action-Space Exploration:
Before running the experiments, please clone gym-cstr-optim from here. Afterwards, run the following:
pip install -e gym-cstr-optim
sudo apt-get install texlive-latex-recommended
sudo apt install texlive-latex-extra
sudo apt install dvipng
sudo apt install cm-super
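After the installation steps above, a quick check such as the following can confirm that the LaTeX tools are reachable. It is a minimal sketch that assumes the TeX packages are needed for LaTeX-rendered plot labels (the README does not state this) and that they provide the `latex` and `dvipng` executables; the helper name `find_missing_tools` is hypothetical.

```python
import shutil


def find_missing_tools(tools=("latex", "dvipng")):
    """Return the names in `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("latex and dvipng were found on PATH.")
```

If anything is reported missing, re-running the corresponding install command is the first thing to try.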