diff --git a/README.md b/README.md
index f6244665..3373f52e 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

 Using deep RL for mobility management.

-![example](docs/gifs/v05.gif)
+![example](docs/gifs/v06.gif)

 ## Setup

@@ -61,8 +61,9 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt
 * Multi-agent: Separate agents for each UE. I should look into ray/rllib: https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical
 * Collaborative learning: Share experience or gradients to train agents together. Use same NN. Later separate NNs? Federated learning
 * Possibilities: Higher=better
-    1. Use & train exactly same NN for all UEs (still per UE decisions).
+    1. DONE: Use & train exactly same NN for all UEs (still per UE decisions).
     2. Separate NNs for each agent, but share gradient updates or experiences occasionally
+* Larger scenarios with more UEs and BS. Auto-create random BS and UEs; just configure their number in the env.
 * Generic utility function: Currently, reward is a step function (pos if enough rate, neg if not). Could also be any other function of the rate, eg, logarithmic
 * Efficient caching of connection data rate:
     * Currently always recalculate the data rate per connection per UE, eg, when calculating reward or checking whether we can connect
@@ -73,7 +74,7 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt

 ### Findings

-* Binary observations: [BS available?, BS connected?] work very well
+* Binary observations: (BS available?, BS connected?) work very well
 * Replacing binary "BS available?" with achievable data rate by BS does not work at all
     * Probably, because data rate is magnitudes larger (up to 150x) than "BS connected?" --> agent becomes blind to 2nd part of obs
     * Just cutting the data rate off at some small value (eg, 3 Mbit/s) leads to much better results
@@ -82,6 +83,8 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt
 * Central agent with observations and actions for all UEs in every time step works fine with 2 UEs
 * Even with rate-fair sharing, agent tends to connect UEs as long as possible (until connection drops) rather than actively disconnecting UEs that are far away
     * This is improved by adding a penalty for losing connections (without active disconnect) and adding obs about the total current dr of each UE (from all connections combined)
+* Adding this extra obs about total UE dr (over all BS connections) seems to improve the reward slightly, but not by much
+* Multi-agent RL achieves better results more quickly than a centralized RL agent

 ## Development

diff --git a/docs/gifs/v06.gif b/docs/gifs/v06.gif
new file mode 100644
index 00000000..bdbb7f4f
Binary files /dev/null and b/docs/gifs/v06.gif differ
diff --git a/docs/mdp.md b/docs/mdp.md
index 280caf32..fe444ac6 100644
--- a/docs/mdp.md
+++ b/docs/mdp.md
@@ -1,13 +1,50 @@
 # MDP Formulation & Release Details

-## [v0.5](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.5): Improved radio model (week 26)
+## Latest MDP Formulation
+
+Using the multi-agent environment with the latest common configuration.
+
+Observations: Observation for each agent (controlling a single UE)
+
+* Achievable data rate to each BS. Processed/normalized to `[-1, 1]` depending on the UE's requested data rate
+* Total current data rate of the UE over all its current connections. Also normalized to `[-1, 1]`.
+* Currently connected BS (binary vector).
+
+Actions:
+
+* Discrete selection of either noop (0) or one of the BS.
+* The latter toggles the connection status and either tries to connect or disconnect the UE to/from the BS, depending on whether it is currently connected.
+
+Reward: Immediate rewards for each time step
+
+* For each UE:
+    * +10 if its requested data rate is covered by all its combined connections, -10 otherwise
+    * -3 for unsuccessful connection attempts (because the BS is out of range)
+    * -x where x is the number of lost connections during movement (that were not actively disconnected)
+* In multi-UE envs, the total reward is summed up for all UEs
+    * In multi-agent RL, each agent still only learns from its own reward
+
+
+## Release Details and MDP Changes
+
+### [v0.6](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.6): Multi-agent RL (week 27)
+
+* Support for multi-agent RL: Each UE is trained by its own RL agent
+* Currently, all agents share the same RL algorithm and NN
+* Already with 2 UEs, multi-agent RL leads to better results more quickly than a central agent
+
+Example: Multi-agent PPO after 25k training
+
+![v0.6 example](gifs/v06.gif)
+
+### [v0.5](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.5): Improved radio model (week 26)

 * Improved [radio model](https://github.com/CN-UPB/deep-rl-mobility-management/blob/master/docs/model.md):
     * Configurable model for sharing resources/data rate between connected UEs at a BS. Support capacity maximization, rate-fair, and resource-fair sharing. Use rate-fair as new default.
     * Allow UEs to connect based on SNR not data rate threshold
     * Clean up: Removed unused interference calculation from model (assume no interference)
 * Improved observations:
-    * Variant `CentralRemainingDrEnv` with extra observation indicating each UE's total current data rate in `[-1, 1]`: 0 = requirements exactly fulfilled
+    * Environment variant with extra observation indicating each UE's total current data rate in `[-1, 1]`: 0 = requirements exactly fulfilled
         * Improves avg. episode reward from 343 (+- 92) to 388 (+- 111); after 30k train, tested over 30 eps
     * Dict space obs allow distinguishing continuous data rate obs and binary connected obs. Were both treated as binary (Box) before --> smaller obs space now
 * Penalty for losing connection to BS through movement rather than actively disconnecting --> Agent learns to disconnect
@@ -17,7 +54,7 @@ Example: Centralized PPO agent controlling two UEs after 30k training with RLlib

 ![v0.5 example](gifs/v05.gif)

-## [v0.4](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.4): Replaced stable_baselines with ray's RLlib (week 26)
+### [v0.4](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.4): Replaced stable_baselines with ray's RLlib (week 26)

 * Replaced the RL framework: [RLlib](https://docs.ray.io/en/latest/rllib.html) instead of [stable_baselines](https://stable-baselines.readthedocs.io/en/master/)
 * Benefit: RLlib is more powerful and supports multi-agent environments
@@ -28,7 +65,7 @@ Example: Centralized PPO agent controlling two UEs after 20k training with RLlib

 ![v0.4 example](gifs/v04.gif)

-## [v0.3](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.3): Centralized, single-agent, multi-UE-BS selection, basic radio model (week 25)
+### [v0.3](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.3): Centralized, single-agent, multi-UE-BS selection, basic radio model (week 25)

 * Simple but improved radio load model:
     * Split achievable load equally among connected UEs
@@ -53,7 +90,7 @@ Example: Centralized PPO agent controlling two UEs after 20k training

 ![v0.3 example](gifs/v03.gif)

-## [v0.2](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.2): Just BS selection, basic radio model (week 21)
+### [v0.2](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.2): Just BS selection, basic radio model (week 21)

 * Same as v0, but with path loss, SNR to data rate calculation. No interference or scheduling yet.
 * State/Observation: S = [Achievable data rates per BS (processed), connected BS]
@@ -73,7 +110,7 @@ Example: PPO with auto clipping & normalization observations after 10k training

 ![v0.2 example](gifs/v02.gif)

-## [v0.1](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.1): Just BS selection, no radio model (week 19)
+### [v0.1](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.1): Just BS selection, no radio model (week 19)

 Env. dynamics:

diff --git a/drl_mobile/env/multi_ue/multi_agent.py b/drl_mobile/env/multi_ue/multi_agent.py
index c2ef2bb1..2f3348ef 100644
--- a/drl_mobile/env/multi_ue/multi_agent.py
+++ b/drl_mobile/env/multi_ue/multi_agent.py
@@ -30,6 +30,7 @@ def reset(self):
     def step(self, action_dict):
         """
         Apply actions of all agents (here UEs) and step the environment
+        :param action_dict: Dict of UE IDs --> selected action
         :return: obs, rewards, dones, infos.
             Again in the form of dicts: UE ID --> value
         """
@@ -70,5 +71,3 @@ def step(self, action_dict):
         infos = {ue.id: {'time': self.time} for ue in self.ue_list}
         self.log.info("Step", time=self.time, prev_obs=prev_obs, action=action_dict, rewards=rewards, next_obs=self.obs, done=done)
         return self.obs, rewards, dones, infos
-
-# TODO: implement similar variant with total current dr as in the central env
diff --git a/drl_mobile/main.py b/drl_mobile/main.py
index 0e521711..9c9e7462 100644
--- a/drl_mobile/main.py
+++ b/drl_mobile/main.py
@@ -7,7 +7,7 @@
 from ray.rllib.env.multi_agent_env import MultiAgentEnv

 from drl_mobile.env.single_ue.variants import BinaryMobileEnv, DatarateMobileEnv
-from drl_mobile.env.multi_ue.central import CentralMultiUserEnv, CentralRemainingDrEnv
+from drl_mobile.env.multi_ue.central import CentralMultiUserEnv
 from drl_mobile.env.multi_ue.multi_agent import MultiAgentMobileEnv
 from drl_mobile.util.simulation import Simulation
 from drl_mobile.util.logs import config_logging
@@ -22,6 +22,7 @@
 def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=None):
     """
     Create environment and RLlib config. Return config.
+    :param eps_length: Number of time steps per episode (parameter of the environment)
     :param num_workers: Number of RLlib workers for training. For longer training, num_workers = cpu_cores-1 makes sense
     :param train_batch_size: Number of sampled env steps in a single training iteration
@@ -36,11 +37,11 @@ def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=Non
     bs1 = Basestation('bs1', pos=Point(50, 50))
     bs2 = Basestation('bs2', pos=Point(100, 50))
     bs_list = [bs1, bs2]
-    env_class = CentralMultiUserEnv
+    env_class = MultiAgentMobileEnv
     env_config = {
         'episode_length': eps_length, 'map': map, 'bs_list': bs_list, 'ue_list': ue_list, 'dr_cutoff': 'auto',
-        'sub_req_dr': True, 'curr_dr_obs': False, 'seed': seed
+        'sub_req_dr': True, 'curr_dr_obs': True, 'seed': seed
     }

     # create and return the config
@@ -83,10 +84,10 @@ def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=Non
     # 'episode_reward_mean': 250
 }
 # train or load trained agent; only set train=True for ppo agent
-train = True
+train = False
 agent_name = 'ppo'
 # name of the RLlib dir to load the agent from for testing
-agent_path = '../training/PPO/PPO_CentralRemainingDrEnv_0_2020-07-01_11-12-05vmts7p3t/checkpoint_25/checkpoint-25'
+agent_path = '../training/PPO/PPO_MultiAgentMobileEnv_0_2020-07-01_15-42-31ypyfzmte/checkpoint_25/checkpoint-25'
 # seed for agent & env
 seed = 42
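
As a quick, self-contained illustration of the per-UE reward rules documented in the new `docs/mdp.md` section above: the sketch below is hypothetical (the function name and its arguments are not part of `drl_mobile`) and simply restates the +10/-10 requirement check, the -3 penalty per failed connection attempt, and the -x penalty for x connections lost through movement.

```python
# Illustrative sketch only, not code from drl_mobile: immediate reward of a
# single UE for one time step, following the rules in docs/mdp.md above.

def ue_step_reward(req_dr_covered: bool, failed_conn_attempts: int, lost_connections: int) -> float:
    """+10/-10 for (not) covering the requested data rate, -3 per unsuccessful
    connection attempt, -1 per connection lost through movement."""
    reward = 10.0 if req_dr_covered else -10.0
    reward -= 3.0 * failed_conn_attempts
    reward -= 1.0 * lost_connections
    return reward


# Example: requirement met, one failed connection attempt, nothing lost --> 10 - 3 = 7
assert ue_step_reward(True, failed_conn_attempts=1, lost_connections=0) == 7.0
```

In the central multi-UE envs, these per-UE values are summed into a single scalar reward, while the multi-agent env hands each agent only its own UE's value, as described above.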