diff --git a/README.md b/README.md
index f6244665..3373f52e 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

 Using deep RL for mobility management.

-![example](docs/gifs/v05.gif)
+![example](docs/gifs/v06.gif)

 ## Setup

@@ -61,8 +61,9 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt
 * Multi-agent: Separate agents for each UE. I should look into ray/rllib: https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical
 * Collaborative learning: Share experience or gradients to train agents together. Use same NN. Later separate NNs? Federated learning
 * Possibilities: Higher=better
-    1. Use & train exactly same NN for all UEs (still per UE decisions).
+    1. DONE: Use & train exactly same NN for all UEs (still per UE decisions).
     2. Separate NNs for each agent, but share gradient updates or experiences occasionally
+* Larger scenarios with more UEs and BS. Auto-create random BS and UEs; just configure their number in the env.
 * Generic utility function: Currently, reward is a step function (pos if enough rate, neg if not). Could also be any other function of the rate, eg, logarithmic
 * Efficient caching of connection data rate:
     * Currently always recalculate the data rate per connection per UE, eg, when calculating reward or checking whether we can connect
@@ -73,7 +74,7 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt

 ### Findings

-* Binary observations: [BS available?, BS connected?] work very well
+* Binary observations: (BS available?, BS connected?) work very well
 * Replacing binary "BS available?" with achievable data rate by BS does not work at all
     * Probably, because data rate is magnitudes larger (up to 150x) than "BS connected?" --> agent becomes blind to 2nd part of obs
     * Just cutting the data rate off at some small value (eg, 3 Mbit/s) leads to much better results
@@ -82,6 +83,8 @@ Run the command in a WSL not a PyCharm terminal. Tensorboard is available at htt
 * Central agent with observations and actions for all UEs in every time step works fine with 2 UEs
 * Even with rate-fair sharing, agent tends to connect UEs as long as possible (until connection drops) rather than actively disconnecting UEs that are far away
     * This is improved by adding a penalty for losing connections (without active disconnect) and adding obs about the total current dr of each UE (from all connections combined)
+* Adding this extra obs about total UE dr (over all BS connections) seems to improve the reward slightly, but not by much
+* Multi-agent RL achieves better results more quickly than a centralized RL agent

 ## Development

diff --git a/docs/gifs/v06.gif b/docs/gifs/v06.gif
new file mode 100644
index 00000000..bdbb7f4f
Binary files /dev/null and b/docs/gifs/v06.gif differ
diff --git a/docs/mdp.md b/docs/mdp.md
index 280caf32..fe444ac6 100644
--- a/docs/mdp.md
+++ b/docs/mdp.md
@@ -1,13 +1,50 @@
 # MDP Formulation & Release Details

-## [v0.5](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.5): Improved radio model (week 26)
+## Latest MDP Formulation
+
+Using the multi-agent environment with the latest common configuration.
+
+Observations: Observation for each agent (controlling a single UE)
+
+* Achievable data rate to each BS. Processed/normalized to `[-1, 1]` depending on the UE's requested data rate
+* Total current data rate of the UE over all its current connections. Also normalized to `[-1, 1]`.
+* Currently connected BS (binary vector).
+
+Actions:
+
+* Discrete selection of either noop (0) or one of the BS.
+* The latter toggles the connection status and either tries to connect or disconnect the UE to/from the BS, depending on whether it is currently connected.
+
+Reward: Immediate rewards for each time step
+
+* For each UE:
+    * +10 if its requested data rate is covered by all its combined connections, -10 otherwise
+    * -3 for unsuccessful connection attempts (because the BS is out of range)
+    * -x where x is the number of lost connections during movement (that were not actively disconnected)
+* In multi-UE envs, the total reward is summed up for all UEs
+    * In multi-agent RL, each agent still only learns from its own reward
+
+
+## Release Details and MDP Changes
+
+### [v0.6](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.6): Multi-agent RL (week 27)
+
+* Support for multi-agent RL: Each UE is trained by its own RL agent
+* Currently, all agents share the same RL algorithm and NN
+* Already with 2 UEs, multi-agent RL leads to better results more quickly than a central agent
+
+Example: Multi-agent PPO after 25k training
+
+![v0.6 example](gifs/v06.gif)
+
+### [v0.5](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.5): Improved radio model (week 26)

 * Improved [radio model](https://github.com/CN-UPB/deep-rl-mobility-management/blob/master/docs/model.md):
     * Configurable model for sharing resources/data rate between connected UEs at a BS. Support capacity maximization, rate-fair, and resource-fair sharing. Use rate-fair as new default.
     * Allow UEs to connect based on SNR not data rate threshold
     * Clean up: Removed unused interference calculation from model (assume no interference)
 * Improved observations:
-    * Variant `CentralRemainingDrEnv` with extra observation indicating each UE's total current data rate in `[-1, 1]`: 0 = requirements exactly fulfilled
+    * Environment variant with extra observation indicating each UE's total current data rate in `[-1, 1]`: 0 = requirements exactly fulfilled
         * Improves avg. episode reward from 343 (+- 92) to 388 (+- 111); after 30k train, tested over 30 eps
     * Dict space obs allow distinguishing continuous data rate obs and binary connected obs. Were both treated as binary (Box) before --> smaller obs space now
 * Penalty for losing connection to BS through movement rather than actively disconnecting --> Agent learns to disconnect
@@ -17,7 +54,7 @@ Example: Centralized PPO agent controlling two UEs after 30k training with RLlib

 ![v0.5 example](gifs/v05.gif)

-## [v0.4](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.4): Replaced stable_baselines with ray's RLlib (week 26)
+### [v0.4](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.4): Replaced stable_baselines with ray's RLlib (week 26)

 * Replaced the RL framework: [RLlib](https://docs.ray.io/en/latest/rllib.html) instead of [stable_baselines](https://stable-baselines.readthedocs.io/en/master/)
 * Benefit: RLlib is more powerful and supports multi-agent environments
@@ -28,7 +65,7 @@ Example: Centralized PPO agent controlling two UEs after 20k training with RLlib

 ![v0.4 example](gifs/v04.gif)

-## [v0.3](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.3): Centralized, single-agent, multi-UE-BS selection, basic radio model (week 25)
+### [v0.3](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.3): Centralized, single-agent, multi-UE-BS selection, basic radio model (week 25)

 * Simple but improved radio load model:
     * Split achievable load equally among connected UEs
@@ -53,7 +90,7 @@ Example: Centralized PPO agent controlling two UEs after 20k training

 ![v0.3 example](gifs/v03.gif)

-## [v0.2](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.2): Just BS selection, basic radio model (week 21)
+### [v0.2](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.2): Just BS selection, basic radio model (week 21)

 * Same as v0, but with path loss, SNR to data rate calculation. No interference or scheduling yet.
 * State/Observation: S = [Achievable data rates per BS (processed), connected BS]
@@ -73,7 +110,7 @@ Example: PPO with auto clipping & normalization observations after 10k training

 ![v0.2 example](gifs/v02.gif)

-## [v0.1](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.1): Just BS selection, no radio model (week 19)
+### [v0.1](https://github.com/CN-UPB/deep-rl-mobility-management/releases/tag/v0.1): Just BS selection, no radio model (week 19)

 Env. dynamics:

diff --git a/drl_mobile/env/multi_ue/multi_agent.py b/drl_mobile/env/multi_ue/multi_agent.py
index c2ef2bb1..2f3348ef 100644
--- a/drl_mobile/env/multi_ue/multi_agent.py
+++ b/drl_mobile/env/multi_ue/multi_agent.py
@@ -30,6 +30,7 @@ def reset(self):
     def step(self, action_dict):
         """
         Apply actions of all agents (here UEs) and step the environment
+        :param action_dict: Dict of UE IDs --> selected action
         :return: obs, rewards, dones, infos.
             Again in the form of dicts: UE ID --> value
         """
@@ -70,5 +71,3 @@ def step(self, action_dict):
         infos = {ue.id: {'time': self.time} for ue in self.ue_list}
         self.log.info("Step", time=self.time, prev_obs=prev_obs, action=action_dict, rewards=rewards, next_obs=self.obs, done=done)
         return self.obs, rewards, dones, infos
-
-# TODO: implement similar variant with total current dr as in the central env
diff --git a/drl_mobile/main.py b/drl_mobile/main.py
index 0e521711..9c9e7462 100644
--- a/drl_mobile/main.py
+++ b/drl_mobile/main.py
@@ -7,7 +7,7 @@
 from ray.rllib.env.multi_agent_env import MultiAgentEnv

 from drl_mobile.env.single_ue.variants import BinaryMobileEnv, DatarateMobileEnv
-from drl_mobile.env.multi_ue.central import CentralMultiUserEnv, CentralRemainingDrEnv
+from drl_mobile.env.multi_ue.central import CentralMultiUserEnv
 from drl_mobile.env.multi_ue.multi_agent import MultiAgentMobileEnv
 from drl_mobile.util.simulation import Simulation
 from drl_mobile.util.logs import config_logging
@@ -22,6 +22,7 @@
 def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=None):
     """
     Create environment and RLlib config. Return config.
+    :param eps_length: Number of time steps per episode (parameter of the environment)
     :param num_workers: Number of RLlib workers for training. For longer training, num_workers = cpu_cores-1 makes sense
     :param train_batch_size: Number of sampled env steps in a single training iteration
@@ -36,11 +37,11 @@ def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=Non
     bs1 = Basestation('bs1', pos=Point(50, 50))
     bs2 = Basestation('bs2', pos=Point(100, 50))
     bs_list = [bs1, bs2]
-    env_class = CentralMultiUserEnv
+    env_class = MultiAgentMobileEnv
     env_config = {
         'episode_length': eps_length, 'map': map, 'bs_list': bs_list, 'ue_list': ue_list, 'dr_cutoff': 'auto',
-        'sub_req_dr': True, 'curr_dr_obs': False, 'seed': seed
+        'sub_req_dr': True, 'curr_dr_obs': True, 'seed': seed
     }

     # create and return the config
@@ -83,10 +84,10 @@ def create_env_config(eps_length, num_workers=1, train_batch_size=1000, seed=Non
     # 'episode_reward_mean': 250
 }
 # train or load trained agent; only set train=True for ppo agent
-train = True
+train = False
 agent_name = 'ppo'
 # name of the RLlib dir to load the agent from for testing
-agent_path = '../training/PPO/PPO_CentralRemainingDrEnv_0_2020-07-01_11-12-05vmts7p3t/checkpoint_25/checkpoint-25'
+agent_path = '../training/PPO/PPO_MultiAgentMobileEnv_0_2020-07-01_15-42-31ypyfzmte/checkpoint_25/checkpoint-25'
 # seed for agent & env
 seed = 42
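
As a quick, self-contained illustration of the per-UE reward rules documented in the new `docs/mdp.md` section above: the sketch below is hypothetical (the function name and its arguments are not part of `drl_mobile`) and simply restates the +10/-10 requirement check, the -3 penalty per failed connection attempt, and the -x penalty for x connections lost through movement.

```python
# Illustrative sketch only, not code from drl_mobile: immediate reward of a
# single UE for one time step, following the rules in docs/mdp.md above.

def ue_step_reward(req_dr_covered: bool, failed_conn_attempts: int, lost_connections: int) -> float:
    """+10/-10 for (not) covering the requested data rate, -3 per unsuccessful
    connection attempt, -1 per connection lost through movement."""
    reward = 10.0 if req_dr_covered else -10.0
    reward -= 3.0 * failed_conn_attempts
    reward -= 1.0 * lost_connections
    return reward


# Example: requirement met, one failed connection attempt, nothing lost --> 10 - 3 = 7
assert ue_step_reward(True, failed_conn_attempts=1, lost_connections=0) == 7.0
```

In the central multi-UE envs, these per-UE values are summed into a single scalar reward, while the multi-agent env hands each agent only its own UE's value, as described above.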