Commit 1d9d30b: clean up
Stefan Schneider committed Dec 8, 2020 (1 parent: e62c06c)

Showing 10 changed files with 30 additions and 2,734 deletions.
137 changes: 22 additions & 115 deletions README.md
@@ -1,28 +1,31 @@
# deep-rl-mobility-management
# DeepCoMP: Self-Learning Dynamic Multi-Cell Selection for Coordinated Multipoint (CoMP)

Using deep RL for mobility management.
Deep reinforcement learning for dynamic multi-cell selection in CoMP scenarios.
Three variants: DeepCoMP (central agent), DD-CoMP (distributed agents sharing a central policy), and D3-CoMP (distributed agents with separate policies).

![example](docs/gifs/v10.gif)

![example](docs/gifs/v010.gif)

## Setup

You need Python 3.8+.
To install everything, run

```
# on ubuntu
# only on ubuntu
sudo apt update
sudo apt upgrade
sudo apt install cmake build-essential zlib1g-dev python3-dev
# while the issues below persist
# then install rllib and structlog manually for now
pip install ray[rllib]==1
pip install git+https://github.com/stefanbschneider/structlog.git@dev
# on all systems
# complete installation of remaining dependencies
python setup.py install
```

Tested on Ubuntu 20.04 (on WSL) with Python 3.8. RLlib does not ([yet](https://github.com/ray-project/ray/issues/631)) run on Windows, but it does on WSL.
Tested on Ubuntu 20.04 and Windows 10 with Python 3.8.

For saving videos and GIFs, you also need to install ffmpeg (not needed on Windows) and [ImageMagick](https://imagemagick.org/index.php).
On Ubuntu:
@@ -31,30 +34,27 @@ On Ubuntu:
sudo apt install ffmpeg imagemagick
```

**As long as structlog does not support deepcopy:**

Install the patched version from my `structlog` fork and branch:
## Usage

```
pip install git+https://github.com/stefanbschneider/structlog.git@dev
# get an overview of all options
deepcomp -h
```

**Other known issues:**

* [`ray does not provide extra 'rllib'`](https://github.com/ray-project/ray/issues/11274): uninstall `ray` and reinstall it via `pip` instead of `setup.py`
* [Unable to schedule actor or task](https://github.com/ray-project/ray/issues/6781#issuecomment-708281404)


## Usage
For example:

```
deepcomp -h
deepcomp --env medium --slow-ues 3 --fast-ues 0 --agent central --workers 2 --train-steps 50000 --seed 42 --video both --sharing mixed
```

Adjust further settings in `drl_mobile/main.py`.
To run DeepCoMP, use `--alg ppo --agent central`.
For DD-CoMP, use `--alg ppo --agent multi`, and for D3-CoMP, use `--alg ppo --agent multi --separate-agent-nns`.
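
For reference, the three variants then differ only in the agent-related flags. A possible invocation for each, reusing the example options from above (all other flags keep their defaults):

```
# DeepCoMP: one central agent
deepcomp --alg ppo --agent central --env medium --slow-ues 3 --train-steps 50000
# DD-CoMP: distributed agents sharing one central policy
deepcomp --alg ppo --agent multi --env medium --slow-ues 3 --train-steps 50000
# D3-CoMP: distributed agents with separate policies
deepcomp --alg ppo --agent multi --separate-agent-nns --env medium --slow-ues 3 --train-steps 50000
```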

Training logs, results, videos, and trained agents are saved in the `results` directory.

#### Accessing results remotely

When running remotely, you can serve the replay video by running:

@@ -69,101 +69,8 @@ Then access at `<remote-ip>:8000`.
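
The command itself sits in a collapsed, unchanged part of the diff. A minimal sketch, assuming Python's built-in HTTP file server is sufficient and that the videos are served from the `results` directory (the port matches the address below):

```
# serve the results directory on port 8000 (assumed; the README's actual command is not shown in this diff)
cd results && python -m http.server 8000
```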
To view learning curves (and other metrics) when training an agent, use Tensorboard:

```
tensorboard --logdir results/PPO/ --host 0.0.0.0
tensorboard --logdir results/PPO/ (--host 0.0.0.0)
```

Run the command in a WSL terminal, not a PyCharm terminal. Tensorboard is then available at http://localhost:6006

## Documentation

* See the documents in the `docs` folder
* See the docstrings in the code (TODO: generate Read the Docs documentation for the v1.0 release)

## Research

Evaluation results: https://github.com/CN-UPB/b5g-results

### Available Machines

tango4, tango5, (swc01)

### Status

* RL learns reasonable behavior, very close to the greedy-all heuristic, i.e., trying to connect to all BS
  * For multi-agent PPO, that makes sense, since each agent/UE greedily tries to maximize its own utility, even if that hurts other UEs' utilities (which are not part of its reward)
  * It can still learn to drop weak connections of UEs whose data rate is already fully satisfied through another connection
  * For central PPO, that reasoning does not apply, but it still does not learn fairer behavior
  * That is odd, because greedy-best, with a single connection per UE, often achieves better overall utility, which is exactly what central PPO optimizes
* The trade-off in the problem is not clear:
  * Fairness? Should UEs only connect to multiple BS if it increases their utility enough to justify small reductions in utility for other connected UEs?
  * Or an explicit cost/overhead for multiple concurrent connections? (see the sketch after this list)
  * Even when penalizing concurrent connections, the RL agent still only learned to behave similarly to greedy-all.
  * It should have learned to only use concurrent connections where they really improve utility, i.e., at the cell edge, not when the UE is close to another BS anyway.
* The problem scenario is not clear: do we typically have more than one UE per BS? That is, few BS and many UEs, or the other way around? Or neither?
* I tried many variations of observations (different components, different normalization).
  * Overall, normalization is crucial for central PPO (oddly, not so much for multi-agent)
  * Binary connected, dr, and total_dr observations seem to work best so far
  * Adding information about connected UEs per BS, BS in range, the number of connected BS, unshared dr, position & movement (distance to BS), etc. did not help or even reduced performance
* Training takes long for many UEs (>5). But the multi-agent approach transfers at inference time to envs with more UEs and works fine even with 30 or 40 UEs (still similar to greedy-all)
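
One illustrative way to make the second option above (an explicit cost for concurrent connections) concrete; this is an assumption for discussion, not the reward actually implemented in the code:

```
R_t = \sum_{i \in \mathrm{UEs}} u_i\big(r_i(t)\big) - \lambda \sum_{i \in \mathrm{UEs}} \big|B_i(t)\big|
```

Here, `u_i` is UE i's utility, `r_i(t)` its current data rate, `B_i(t)` the set of BS it is connected to, and `lambda` the assumed per-connection overhead.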

### Todos

* Always return `done=False` for infinite episodes, but set some evaluation episode length in the simulation
* Implement an LTE baseline and an optimization approach
* Evaluation:
  * Double-check all units in my scenario, especially for movement, distance, and data rate. Do they make sense?
  * Different utilities for each UE? Shift the log function to cross the x-axis at different points corresponding to each UE's requirement (see the sketch after this section)
  * Then normalize data rates accordingly
  * Are there real-world traces for UE movement somewhere? From the 5G mmWave measurement paper?

Later:

* Let the agent actively coordinate the number of RBs per connected UE. With a log utility, a centralized agent should then learn proportional-fair scheduling by itself (see the sketch below).
* Optimize performance by using more numpy arrays and less looping over UEs
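
An illustrative form of such a shifted log utility; this is an assumption for clarity, not necessarily the function used in the code. Each UE i with data rate requirement `r_i_req` gets

```
u_i(r_i) = \log\left(\frac{r_i}{r_i^{\mathrm{req}}}\right)
```

which crosses the x-axis exactly at `r_i = r_i_req`. Maximizing the sum of these utilities over all UEs is then equivalent to maximizing the proportional-fair objective `sum_i log(r_i)`, since the requirement terms only add a constant.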


### Findings

* Binary observations (BS available?, BS connected?) work very well
  * Replacing the binary "BS available?" with the achievable data rate per BS does not work at all
    * Probably because the data rate is orders of magnitude larger (up to 150x) than the "BS connected?" part, so the agent becomes blind to the second part of the observation
    * Just cutting the data rate off at some small value (e.g., 3 Mbit/s) leads to much better results (see the sketch after this list)
  * The agent keeps trying to connect to all BS, even if they are out of range; subtracting the UE's required data rate plus a higher penalty (both!) solves the issue
  * Normalizing loses the information about which BS has enough data rate and connectivity, so it does not work as well
* A central agent with observations and actions for all UEs in every time step works fine with 2 UEs
  * Even with rate-fair sharing, the agent tends to keep UEs connected as long as possible (until the connection drops) rather than actively disconnecting UEs that are far away
  * This is improved by adding a penalty for losing connections (without an active disconnect) and adding an observation of each UE's total current data rate (from all connections combined)
  * Adding this extra observation of the total UE data rate (over all BS connections) seems to slightly improve the reward, but not by much
* Multi-agent RL achieves better results more quickly than a centralized RL agent
  * Multi-agents using the same NN vs. separate NNs achieve comparable performance (slightly worse with separate NNs)
  * Theoretically, separate NNs should take more training as each only sees one agent's observations, but they allow learning different policies for different agents (e.g., slow vs. fast UEs)
* Training with many workers in parallel on a server for much longer (e.g., 100 iterations) does improve performance!
* More training plus an extra observation of the number of connected UEs: the central agent learns not to be too greedy and connects to only 1 BS so as not to take resources away from other UEs
  * This seems to be due to the longer training, not the additional observation (even though the episode reward is slightly higher with the observation)
  * In the multi-agent setting, the extra observation rather hurts the agent and leads to worse reward, so it is disabled
* The agent also learns well with random waypoint UE movement. Multi-agent RL learns much faster than the centralized agent.
* Another benefit of multi-agent RL is that we can train with few UEs and then test with many more UEs that use the same NN. That does not work with centralized RL, as the fixed NN size depends on the number of UEs.
* Log utility also works well (at least for multi-agent)! Absolute rewards are not comparable between the step and log utilities
  * A different normalization and cutoff work better for the log utility
  * The central agent is much more sensitive to normalization!
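
To make the data rate cutoff from the findings above concrete; an illustrative formulation, not a quote from the code:

```
\tilde{d}_{u,b} = \min\big(d_{u,b},\, d_{\mathrm{cut}}\big), \qquad d_{\mathrm{cut}} = 3\ \mathrm{Mbit/s}
```

Clipping the achievable data rate `d_{u,b}` keeps it on a scale comparable to the binary "BS connected?" flags, while, unlike full normalization, any value at the cutoff still clearly signals that the BS offers enough data rate.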

## Development

* The latest version uses the [RLlib](https://docs.ray.io/en/latest/rllib.html) library for multi-agent RL.
* There is also an older version using [stable_baselines](https://stable-baselines.readthedocs.io/en/master/) for single-agent RL
in the [stable_baselines branch](https://github.com/CN-UPB/deep-rl-mobility-management/tree/stable_baselines) (used for v0.1-v0.3).
* The RLlib version on the `rllib` branch is functionally roughly equivalent to the `stable_baselines` branch (same model, MDP, agent), just with a different framework.
* Development continues in the `dev` branch.
* The current versions on `master` and `dev` do not support `stable_baselines` anymore.

## Things to Evaluate

* Impact of the number of UEs (fixed or varying within an episode)
* Distance between BS (density)
* UE movement
* Fairness parameter of the multi-agent approach
* Sequentialization of the multi-agent approach
* Resource sharing models
* Scalability: number of BS and UEs
* Generalization
Tensorboard is available at http://localhost:6006
