multi-agent RL works!
Stefan Schneider committed Jul 1, 2020
1 parent 56d8120 commit 6415829
Showing 5 changed files with 56 additions and 16 deletions.
9 changes: 6 additions & 3 deletions README.md
@@ -2,7 +2,7 @@

Using deep RL for mobility management.

![example](docs/gifs/v05.gif)
![example](docs/gifs/v06.gif)

## Setup

@@ -61,8 +61,9 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt
* Multi-agent: Separate agents for each UE. I should look into ray/rllib: https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical
* Collaborative learning: Share experience or gradients to train agents together. Use the same NN for now; later separate NNs? Federated learning
* Possibilities: Higher = better
1. Use & train exactly the same NN for all UEs (still per-UE decisions).
1. DONE: Use & train exactly the same NN for all UEs (still per-UE decisions).
2. Separate NNs for each agent, but share gradient updates or experiences occasionally
* Larger scenarios with more UEs and BS. Auto-create random BS and UEs; just configure their number in the env.
* Generic utility function: Currently, reward is a step function (pos if enough rate, neg if not). Could also be any other function of the rate, eg, logarithmic
* Efficient caching of connection data rate:
* Currently always recalculate the data rate per connection per UE, eg, when calculating reward or checking whether we can connect
@@ -73,7 +74,7 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt

### Findings

* Binary observations: [BS available?, BS connected?] work very well
* Binary observations: (BS available?, BS connected?) work very well
* Replacing binary "BS available?" with achievable data rate by BS does not work at all
* Probably because the data rate is orders of magnitude larger (up to 150x) than "BS connected?" --> the agent becomes blind to the 2nd part of the obs
* Just cutting the data rate off at some small value (eg, 3 Mbit/s) leads to much better results (see the sketch after this list)
@@ -82,6 +83,8 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt
* Central agent with observations and actions for all UEs in every time step works fine with 2 UEs
* Even with rate-fair sharing, agent tends to connect UEs as long as possible (until connection drops) rather than actively disconnecting UEs that are far away
* This is improved by adding a penalty for losing connections (without active disconnect) and adding obs about the total current dr of each UE (from all connections combined)
* Adding this extra obs about total UE dr (over all BS connections) seems to improve the reward slightly, but not by much
* Multi-agent RL learns better results more quickly than a centralized RL agent
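
A minimal sketch of the data rate processing described above (clip at a small value and scale relative to the UE's required rate); the function name and exact scaling are assumptions, not the repo's actual code:

```python
def dr_observation(achievable_dr, req_dr):
    """Clip the achievable data rate at the UE's required rate and scale it into [-1, 1]."""
    clipped = min(achievable_dr, req_dr)  # anything above the requirement is equally good
    return 2 * clipped / req_dr - 1       # 0 Mbit/s -> -1, requirement met -> +1
```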

## Development

Binary file added docs/gifs/v06.gif
49 changes: 43 additions & 6 deletions docs/mdp.md
@@ -1,13 +1,50 @@
# MDP Formulation & Release Details

## [v0.5](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.5): Improved radio model (week 26)
## Latest MDP Formulation

Using the multi-agent environment with the latest common configuration.

Observations: Observation for each agent (controlling a single UE)

* Achievable data rate to each BS. Processed/normalized to `[-1, 1]` depending on the UE's requested data rate
* Total current data rate of the UE over all its current connections. Also normalized to `[-1,1]`.
* Currently connected BS (binary vector).

Actions:

* Discrete selection of either noop (0) or one of the BS.
* The latter toggles the connection status: it either tries to connect or disconnects the UE to/from that BS, depending on whether the UE is currently connected (see the sketch after this list).
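
A minimal sketch of this toggle behavior (the helper and attribute names `connected_bs`, `disconnect`, and `try_connect` are illustrative assumptions, not the repo's API):

```python
from gym.spaces import Discrete

def make_action_space(bs_list):
    # one discrete action per BS, plus action 0 = noop
    return Discrete(len(bs_list) + 1)

def apply_action(ue, action, bs_list):
    if action == 0:
        return                     # noop
    bs = bs_list[action - 1]
    if bs in ue.connected_bs:      # already connected --> disconnect
        ue.disconnect(bs)
    else:                          # not connected --> try to connect (fails if the BS is out of range)
        ue.try_connect(bs)
```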

Reward: Immediate rewards for each time step (a code sketch follows this list)

* For each UE:
* +10 if its requested data rate is covered by all its combined connections, -10 otherwise
* -3 for unsuccessful connection attempts (because the BS is out of range)
* -x where x is the number of lost connections during movement (that were not actively disconnected)
* In multi-UE envs, the total reward is summed up for all UEs
* In multi-agent RL, each agent still only learns from its own reward
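
A minimal sketch of this per-UE reward (function and argument names are illustrative, not the repo's actual code):

```python
def ue_reward(total_dr, req_dr, num_failed_connects, num_lost_connections):
    """Immediate reward for a single UE in one time step."""
    reward = 10 if total_dr >= req_dr else -10  # requested data rate covered by all connections combined?
    reward -= 3 * num_failed_connects           # connection attempts to out-of-range BS
    reward -= num_lost_connections              # connections lost through movement (not actively disconnected)
    return reward

# Example: requirement met, but one connection lost through movement --> 10 - 1 = 9
assert ue_reward(total_dr=12, req_dr=10, num_failed_connects=0, num_lost_connections=1) == 9
```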


## Release Details and MDP Changes

### [v0.6](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.6): Multi-agent RL (week 27)

* Support for multi-agent RL: Each UE is trained by its own RL agent
* Currently, all agents share the same RL algorithm and NN (see the config sketch below)
* Already with 2 UEs, multi-agent leads to better results more quickly than a central agent
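
A minimal sketch of how such a shared policy can be configured in RLlib (the env class and import path are from this repo; `env_config` and the per-UE spaces are assumed to match `drl_mobile/main.py`; this is not the repo's exact training code):

```python
from ray import tune
from drl_mobile.env.multi_ue.multi_agent import MultiAgentMobileEnv

# env_config: the same dict built in drl_mobile/main.py
env = MultiAgentMobileEnv(env_config)  # assumed to expose per-UE obs/action spaces
config = {
    "env": MultiAgentMobileEnv,
    "env_config": env_config,
    "multiagent": {
        # a single policy shared by all UEs: same NN, still per-UE decisions
        "policies": {"shared_policy": (None, env.observation_space, env.action_space, {})},
        "policy_mapping_fn": lambda agent_id: "shared_policy",
    },
}
tune.run("PPO", config=config, stop={"timesteps_total": 25000})
```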

Example: Multi-agent PPO after 25k training

![v0.6 example](gifs/v06.gif)

### [v0.5](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.5): Improved radio model (week 26)

* Improved [radio model](https://github.com/CN-UPB/deep-rl-mobility-management/blob/master/docs/model.md):
* Configurable model for sharing resources/data rate between connected UEs at a BS. Support capacity maximization, rate-fair, and resource-fair sharing. Use rate-fair as new default.
* Allow UEs to connect based on an SNR threshold rather than a data rate threshold
* Clean up: Removed unused interference calculation from model (assume no interference)
* Improved observations:
* Variant `CentralRemainingDrEnv` with extra observation indicating each UE's total current data rate in `[-1, 1]`: 0 = requirements exactly fulfilled
* Environment variant with extra observation indicating each UE's total current data rate in `[-1, 1]`: 0 = requirements exactly fulfilled
* Improves avg. episode reward from 343 (+- 92) to 388 (+- 111); after 30k train, tested over 30 eps
* Dict space obs allow distinguishing continuous data rate obs from binary connected obs. Both were treated as binary (Box) before --> smaller obs space now (see the sketch after this list)
* Penalty for losing connection to BS through movement rather than actively disconnecting --> Agent learns to disconnect
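
A minimal sketch of such a Dict observation space (illustrative, with 2 BS; not the repo's exact spaces):

```python
import numpy as np
from gym.spaces import Box, Dict, MultiBinary

num_bs = 2
obs_space = Dict({
    "dr": Box(low=-1.0, high=1.0, shape=(num_bs,), dtype=np.float32),  # processed data rate per BS
    "connected": MultiBinary(num_bs),                                  # binary "connected to this BS?" flags
})
```
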
@@ -17,7 +54,7 @@ Example: Centralized PPO agent controlling two UEs after 30k training with RLlib

![v0.5 example](gifs/v05.gif)

## [v0.4](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.4): Replaced stable_baselines with ray's RLlib (week 26)
### [v0.4](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.4): Replaced stable_baselines with ray's RLlib (week 26)

* Replaced the RL framework: [RLlib](https://docs.ray.io/en/latest/rllib.html) instead of [stable_baselines](https://stable-baselines.readthedocs.io/en/master/)
* Benefit: RLlib is more powerful and supports multi-agent environments
@@ -28,7 +65,7 @@ Example: Centralized PPO agent controlling two UEs after 20k training with RLlib

![v0.4 example](gifs/v04.gif)

## [v0.3](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.3): Centralized, single-agent, multi-UE-BS selection, basic radio model (week 25)
### [v0.3](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.3): Centralized, single-agent, multi-UE-BS selection, basic radio model (week 25)

* Simple but improved radio load model:
* Split achievable load equally among connected UEs
@@ -53,7 +90,7 @@ Example: Centralized PPO agent controlling two UEs after 20k training

![v0.3 example](gifs/v03.gif)

## [v0.2](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.2): Just BS selection, basic radio model (week 21)
### [v0.2](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.2): Just BS selection, basic radio model (week 21)

* Same as v0, but with path loss, SNR to data rate calculation. No interference or scheduling yet.
* State/Observation: S = [Achievable data rates per BS (processed), connected BS]
@@ -73,7 +110,7 @@ Example: PPO with auto clipping & normalization observations after 10k training

![v0.2 example](gifs/v02.gif)

## [v0.1](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.1): Just BS selection, no radio model (week 19)
### [v0.1](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.1): Just BS selection, no radio model (week 19)

Env. dynamics:

3 changes: 1 addition & 2 deletions drl_mobile/env/multi_ue/multi_agent.py
@@ -30,6 +30,7 @@ def reset(self):
def step(self, action_dict):
"""
Apply actions of all agents (here UEs) and step the environment
:param action_dict: Dict of UE IDs --> selected action
:return: obs, rewards, dones, infos. Again in the form of dicts: UE ID --> value
"""
@@ -70,5 +71,3 @@ def step(self, action_dict):
infos = {ue.id: {'time': self.time} for ue in self.ue_list}
self.log.info("Step", time=self.time, prev_obs=prev_obs, action=action_dict, rewards=rewards, next_obs=self.obs, done=done)
return self.obs, rewards, dones, infos

# TODO: implement similar variant with total current dr as in the central env
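
# Note (not shown in this diff): as an RLlib MultiAgentEnv, the `dones` dict returned by step()
# must also contain the special '__all__' key, e.g. dones['__all__'] = done, which tells RLlib
# when the episode is over for all agents.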
11 changes: 6 additions & 5 deletions drl_mobile/main.py
@@ -7,7 +7,7 @@
from ray.rllib.env.multi_agent_env import MultiAgentEnv

from drl_mobile.env.single_ue.variants import BinaryMobileEnv, DatarateMobileEnv
from drl_mobile.env.multi_ue.central import CentralMultiUserEnv, CentralRemainingDrEnv
from drl_mobile.env.multi_ue.central import CentralMultiUserEnv
from drl_mobile.env.multi_ue.multi_agent import MultiAgentMobileEnv
from drl_mobile.util.simulation import Simulation
from drl_mobile.util.logs import config_logging
@@ -22,6 +22,7 @@
def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=None):
"""
Create environment and RLlib config. Return config.
:param eps_length: Number of time steps per episode (parameter of the environment)
:param num_workers: Number of RLlib workers for training. For longer training, num_workers = cpu_cores-1 makes sense
:param train_batch_size: Number of sampled env steps in a single training iteration
@@ -36,11 +37,11 @@ def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=Non
bs1 = Basestation('bs1', pos=Point(50, 50))
bs2 = Basestation('bs2', pos=Point(100, 50))
bs_list = [bs1, bs2]
env_class = CentralMultiUserEnv
env_class = MultiAgentMobileEnv

env_config = {
'episode_length': eps_length, 'map': map, 'bs_list': bs_list, 'ue_list': ue_list, 'dr_cutoff': 'auto',
'sub_req_dr': True, 'curr_dr_obs': False, 'seed': seed
'sub_req_dr': True, 'curr_dr_obs': True, 'seed': seed
}

# create and return the config
@@ -83,10 +84,10 @@ def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=Non
# 'episode_reward_mean': 250
}
# train or load trained agent; only set train=True for ppo agent
train = True
train = False
agent_name = 'ppo'
# name of the RLlib dir to load the agent from for testing
agent_path = '../training/PPO/PPO_CentralRemainingDrEnv_0_2020-07-01_11-12-05vmts7p3t/checkpoint_25/checkpoint-25'
agent_path = '../training/PPO/PPO_MultiAgentMobileEnv_0_2020-07-01_15-42-31ypyfzmte/checkpoint_25/checkpoint-25'
# seed for agent & env
seed = 42
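
# Sketch (assumption, not part of this commit): with train=False, the checkpoint above could be
# restored for testing roughly like this, using standard RLlib APIs:
#   from ray.rllib.agents.ppo import PPOTrainer
#   agent = PPOTrainer(config=config, env=MultiAgentMobileEnv)
#   agent.restore(agent_path)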

