This repository contains the code for our implementation of a solution for the 2022 CityLearn challenge. |
---|
- Introduction
- Adequate Tree-Search Solution
- Alternative Rule-Based solution
- Tunable parameters
- Results analysis
- Prerequisites
- Files in the repository
- Credits
The imposed problem is a multi-agent scenario, where the agents are the buildings in the smart grid.
There is a battery in each building, which is used to store energy, residents' load, and a solar panel to produce
energy. Each building has its own energy consumption and production, and the goal is to minimize the utility,
which is a specified measure of the net energy consumption of the buildings in the grid, parts of which are global to
the whole district, while others are local to each building.
The action space for each agent (building) is the amount of energy to be charged/discharged to the battery, which is
continues in
The crux of the problem is that:
- Each building's actions affect its future net consumption, so the net consumption of each building has to be predicted.
- The utility involves global parts, so the optimal action for one building depends on the actions of the other buildings.
- The natural periodicity of the net consumption is 24 hours, which for vanilla planning using tree-search
algorithms, even with moderate branching factors, is intractable (e.g., for
$5$ , one gets$5^{24}=6\cdot 10^{16}$ nodes).
The net consumption of the
The non-shiftable load
We can therefore factorize the problem into two sub-environments, one for the agent and one for the "world", where the actions only affect the agent's environment, and the world's environment dictates the time-evolution.
The actions affect the net electricity consumption via the equation above, and accordingly affect the utility function.
We implement and use:
- Uniform-Cost Search algorithm (a type of tree-search), which is a modified Dijkstra's algorithm, to find the next optimal action for each building.
- Various predictors to predict the net consumption of each building for the next time frame (
$1$ or$24$ h) - Battery model to translate action to change in battery's state.
- Depth-selective search, where the search is performed only on specified levels of the tree, and the rest is bridged by steps among which the action is uniformly divided.
- Instantaneous approximation of the total utility function that is formally calculated only after complete episode (year). Moreover, some of the utility terms are global (affected by the actions of all the buildings in the district). However, we use decentralized controllers, where each building takes actions to optimize the local reward function. Therefore, a local estimation of the global utility is established and used to guide the search for each agent.
- Residual corrections between the sum of local trajectories and the optimal global behavior are taken care of by the last agent. This is done by a simple heuristic, which optimizes a global utility with the sum of net consumptions (modified with the planned actions).
Let's break down the name of the repository: MAD CATS (Multi-Agent Distributed Control with Adequate Tree-Search)
CityLearn
is the name of the challenge, referring to the fact that there is a collective (city) learning goal.Multi-Agent
is the type of problem that basically refers to the same thing.Distributed Control
refers to the fact that the actions are performed in a distributed manner, i.e., each agent has its own policy and search tree, which is independent of others.Adequate Tree-Search
refers to the uniform-cost tree-search algorithm we use to solve the problem, with its extra spices that contribute efficiency, e.g., depth-selective search and non-uniform action space discretization, and to the fact that it is based on a battery model we employed, and use mixed decision-makers for the decentralized controllers.
To use search algorithms such as Uniform-Cost-Search (UCS), we need to have a model of the world as a Markov Decision Process (MDP) that we can use for offline planning. The given CityLearn environment, as previously mentioned, can be factored into two parts:
- The Environment model, which consists of the weather parameters (e.g. temperature, solar irradiance, etc.), the grid parameters (e.g. electricity price, carbon intensity, etc.) and the buildings' electricity measured data (e.g. non-shiftable load and solar generation).
- The Battery model, which consists of the battery's State of Charge (SoC), Capacity, Nominal-Power, and so on.
Therefore, we formulate the model of the battery as an MDP, and together with a predictor that predicts the future behavior of the grid and each building's electricity consumption and production, we use UCS to find the best plan of battery actions from each state for each building.
The battery MDP
- The state
$S$ is composed of the battery state (SoC derivative, Soc, and capacity) at time step$t$ ,$s_t = (SoC_{t} - SoC_{t-1}, SoC_{t}, Capacity_{t})$ . - The action
$A$ is defined as some discretization of the continuous action space$[-1, 1]$ , e.g.,$[-1.0, -0.5, -0.1, 0, 0.1, 0.5, 1.0]$ . - The transition
$T$ is given by the physical model of the battery, reverse-engineered from the CityLearn environment. - The reward
$R$ is a local reward function (cost function in our case) which we handcrafted to be globally consistent with the original CityLearn's utility.
The goal here is to find a trajectory of battery charge/discharge actions with minimal cost, given a set of environment predictions, execute the single first or the first few actions from that trajectory, and then re-plan.
Before delving into further details, it is important to focus on the different timescales which motivated our solution.
-
Each action is taken for a time frame of
$1$ hour, and the environment is updated every hour, so this is the immediate timescale to predict and act upon. -
The basic operational timescale is the 24-hour cycle, which is the natural periodicity of the net consumption of the buildings, as the generation peak occurs at about noon, and the consumption peak occurs during the evening. This timescale is also the timescale of the electricity price and the carbon intensity, which constitute the global utility terms.
The gap between these two timescales is the problem's main challenge and is why we are to use tree-search algorithms to find the optimal action for each building.
-
The long-term timescale is the whole year, which is the timescale of the data we have to train our decision-makers on.
-
Intermediate timescales, like the one-month used in the load factor utility term or the long drifting time of the weather, are not considered in our solution, as they are not relevant to the immediate action of the agent.
We reverse-engineered the battery model from the CityLearn
environment, and used it as the MDP's (Markov decision
process) transition function for the planning as mentioned above.
The key parameters of the model are the battery's capacity, charging/ discharging efficiency, and nominal power. For a complete mathematical analysis of the battery model used in the CityLearn's environment, see battery model analysis.
The utility is a function of the net consumption (
To motivate the construction of the instantaneous utility function, we first observe it as it's defined in the CityLearn 2022 challenge environment.
The utility is a weighted sum of four terms:
Where
Next, let us break down each one of the utility terms.
Here
Note that this part of the utility can be directly decomposed into the sum of instantaneous utilities at each time step (and be rewritten as a dot product). Still, a global knowledge of the district's net consumption is required to execute the ReLU.
To approximate this global trend, we use a leaky ReLU, where the slope of the negative part is a parameter, which
we set to
where
This approximation only applies to the local utility estimation.
Here
Similarly to the price term, this part of the utility can be directly decomposed into the sum of instantaneous utilities, but a global knowledge of the district's net consumption is required to execute the absolute value.
To approximate this, we use a factored ReLU, with a scaling factor
Once again, this approximation only applies to the local utility estimation.
This is a somewhat cumbersome term, but let's break it down intuitively.
In the numerator, we have a sum of the net consumption over all (730) time-steps of the month
To approximate this term, we take a heuristic approach, using the median and max consumption of the no-op trajectory over each month as a proxy for penalizing high (candidate peak) consumptions.
where
We use a decentralized control setting for all agents, where each building has its own set of decision-makers. The last agent is knowledgeable about the rest of the agents (net consumptions modified by the planned actions), and can use this information to make better decisions in the context of the group.
Schematic illustration of the decentralized controller architecture in use.
When using tree search algorithms, the search worst-case time complexity reads
Usually, UCSs are considered with a clearly defined goal state. This is not the case here as, although using UCS, we do not look for a goal state. Instead, we optimize the search to reach the environment's correlation depth (natural periodicity), which is about 24 hours (24 search steps).
The problem is that the search gets intractable even with a moderate branching factor. For example, with
Therefore, to reach search depth of
This way, we can expand much deeper trees where we trade off width with depth and reach the interesting correlations in the environment while easing the burden of the exponential branching barrier.
Depth Selective Search (DSS) algorithm illustration. We search only in a few selective parts of the tree (pink), while on the other parts (yellow), we use rollout with constant predefined actions.
As mentioned before, the net consumption is the measure which is used to evaluate the performance of the agents, and is
controlled by them via the battery consumption.
Crucially, the net consumption according to which the action has to be optimized is not the one that is observed by the
agent, but the one of the following time step.
Furthermore, for the tree search we need to predict the net consumption for the next
To this end, we implemented several alternative predictors:
- IDX predictor: This is the simplest predictor, which predicts the net consumption according to the known time index and building number.
-
Dot product: Generates prediction by finding the maximal dot-product overlap of the past
$24$ h consumption and training data. It is a simple and fast method, but it does not generalize very well. -
Multi-Layer Perceptron: This is a Multi-Layer Perceptron (MLP) network trained to predict the net
consumption from the last
$24$ h history. Its architecture is Time-delayed Neural Net (TDNN), which means that it takes the whole history as input, and the output is the prediction for the next 24 time steps at once, without any time roll. It features three hidden layers with$128$ ,$16384$ , and$64$ neurons, and a ReLU activation function. The network is trained with the Adam optimizer, and the loss function is the mean squared error (MSE), with exponential weighting for improving the accuracy at the initial times out of the 24 hours output. Given the small dataset, its prediction is not very accurate, but it is a great improvement over the dot product.
Alternatively, when not using search, we use a set of rules that defines the next move for each building independently (locally), based on the next hour prediction. The rules were defined to "flatten" the net consumption curve (closing the temporal gap / phase-shift between peak production and peak demand) and, by this, to minimize the utility:
- If the next hour's production is higher than the consumption, the battery is charged by the extra amount.
- If the next hour's consumption exceeds production, the battery is discharged by the missing amount.
The action is also cropped by the available energy (/room) in the battery, and by its nominal power, which limits the maximal charge/discharge per hour.
On top of that, the rules treat the cases where the battery is fully charged or fully discharged. We also penalize the battery charge in hours when the carbon intensity is below its median, as in such times the utility for using the grid power is relatively lower.
The rules are defined in two cases, for a single building and for a group of buildings. The essence is the same, just that in the latter case, the input is the net consumption of the group. We also skip the carbon intensity condition for the global case, as flattening the district's curve was found to have a more significant effect on the utility in this case.
An important hyperparameter is thus the number of buildings that use the group rules. Namely, shall only the last building (which is fully knowledgeable) make its decisions based on the district's total net consumption, or shall more agents (from the end) take altruistic actions?
Rule-based solution for a single building
Controller parameters:
-
random_order
: Whether to choose the order of the buildings randomly each time-step or not. -
prediction_method
: The method to use for predicting the net consumption of each building. Choose from:-
IDX
: Use the time and building indices for perfect prediction over the training set. -
CSV
: Load the predictions from a CSV file. -
DOT
: Generate prediction by finding the maximal dot-product overlap of the past$24$ h consumption and training data. -
MLP
: Predict with Multi-Layer Perceptron, using$24$ h history of net consumption and global variables.
-
-
agent_type
: The type of decision-maker to use for all agents except the last one ($N-1$ agents). Choose from:-
RB-local
: Use Rule-Based agents. -
PLAN-local
: Use the Uniform-Cost Search algorithm.
-
-
last_agent_type
: The type of decision-maker to use for the last agent. Choose from:-
RB-local
: Use Rule-Based agent -- egoistic decision-maker using only individual net consumption. -
RB-global
: Use Rule-Based agent, but with global net consumption -- altruistic decision-maker using only collective district's net consumption. -
PLAN-local
: Use the Uniform-Cost Search algorithm. -
PLAN-global
: Use the Uniform-Cost Search algorithm, but with global net consumption. [not implemented]
-
Planner parameters:
search_depths
: list of depths to search to. The gaps are filled with constant action of the last searched depth.max_serach_time
: time to terminate search if not finished, and keep to the best trajectory found by this time.d_action
: the action-space is discrete, so this is the step-size of the action-space.acion_space_list
: list of actions to search over. IfNone
, then the action-space is discretized tod_action
steps.utility_weighting
: re-weighting of the different terms in the local utility.
Here we review some results to highlight the differences between the different methods.
We compare the performance of the different decision-makers, and first observe how they affect the net-consumption trajectory of an individual building and then the whole district.
We pick an arbitrary time frame (6500-6800) and plot the net consumption of the first building in the following figure (black line), this is the total consumption for the baseline case, where the agents are not allowed to use their batteries. Namely, performing a series of no-op actions. The blue line is the net consumption of the same building when it uses its battery, according to the rule-based solution, and the orange line is the net consumption when the agent uses the search algorithm.
A comparison of the decision-makers. The net consumption of a single building is shown, and the actions are taken by the different decision-makers.
We can see that the rule-based solution is able to flatten the net consumption curve at the first dip, where it charges the battery for few consecutive hours, until it is full and it reduces to the baseline consumption. The search algorithm, on the other hand, spreads the charge over a longer time frame within the dip. This is because the search algorithm is able to see the future net consumption, and thus it can charge the battery more efficiently. This is also the reason why the search algorithm is able to better compensate the net consumption curve at the first peak.
Next, we compare the net consumption of all the buildings in the district, and we see that this feature is even more prominent (see, for example, the red-marked dip and following peak).
A comparison of the variation in decision-makers behaviour across different buildings. The net consumption of a single building is shown, and the actions are taken by the different decision-makers for each building.
We also observe that the general trend of the net consumption is similar for all the buildings, and the differences are mainly in the high-frequency fluctuations. We therefore focus on the total net consumption of the district, and plot it in the following figure.
A comparison of the control configurations. The net consumption of the whole district (sum of all building's) is shown, and the actions are taken by the different decision-makers in each control configuration: No-op, local Rule-Based,local RB with global RB, planners with (or w/o) last global RB.
Here we compare the net consumption of the whole district, with additional control configurations. The black line is again the baseline, where the agents are not allowed to use their batteries, and the colored lines are the different control configurations. The total utilities of the different configurations are provided in the table below.
- Blue is the local Rule-Based solution, where each agent plays egoistically according to the Rule-Based policy. This one is similar to the previous figure, but now we get the averaging effect of the district.
- Orange is the local Rule-Based solution with the last agent using the global Rule-Based policy, thus incorporating the residual net consumption of the rest of the district (altruism). We can see that this configuration is able to flatten the net consumption curve more precisely, and thus achieve a lower utility.
- Green is the same as the orange one, but here we introduce randomization in the order of the buildings, such that the last agent is not always the same building, to distribute the altruistic behaviour across the district. This configuration results in a slightly higher utility than the previous one, because the coefficients of the rule-based solution were optimized for the specific last building.
- The red line is the local planner, which as in the single building case, is able to exploit the future net consumption and flatten the net consumption curve more efficiently.
- The purple line is the local planner with the last agent using the global rule-based. This hybrid approach combines the best of both worlds: the planner is able to consider farther future, and flatten the net consumption curve more efficiently for a single building, and the last agent is able to incorporate the residual net consumption of the rest of the district using the rule-based solution.
Below we provide the utility of the different configurations as a function of the time throughout the whole year.
A comparison of the control configurations utilities: No-op, local Rule-Based,local RB with global RB, planners with (or w/o) last global RB.
The colors are the same as in the previous figure, and the black line at
Utilities for these examples:
Decision-maker | Utility (total) | Price cost | Emission cost | Grid cost |
---|---|---|---|---|
No-op | 1.000 | 1.000 | 1.000 | 1.000 |
Local RB | 0.7845 | 0.721 | 0.7764 | 0.8561 |
Local RB + global RB | 0.7822 | 0.6858 | 0.8174 | 0.8436 |
Local RB + global RB + random | 0.7939 | 0.685 | 0.8248 | 0.8718 |
Local planner | 0.855 | 0.7708 | 0.8958 | 0.8985 |
Local planner + global RB | 0.846 | 0.7405 | 0.9391 | 0.8584 |
In what follows, we provide a more detailed analysis of the different decision-makers.
Here we focus on the special features of our UCS planner.
In the figure below, we compare the district's net consumption for the local planner using different depth values for the search algorithm. We avoid using the global rule-based decision-maker for the last agent, to preserve the raw performance of the planner.
A comparison of several search depths for the local planners. The net consumption of the whole district (sum of all building's) is shown, and the actions are taken by the planners using
[0,1]
,[0,1,2,3]
, and[0,1,2,8]
search depths (without last global RB).
The [0,1]
depth planner (blue) cannot exploit information from the future, and thus it is not able to flatten the net
consumption curve as efficiently as the other two.
The [0,1,2,3]
depth planner (orange) is able to flatten the net consumption curve more efficiently, but it is not able
to perform as good as the [0,1,2,8]
depth planner (green), which is able to utilize information from the far future.
In particular, note the length of the "gaps" between the planner and the no-op trajectories. The [0,1,2,8]
variant manages
to open wide ( [0,1,2,3]
depth planner is restricted to shorter ones, and even further for the [0,1]
variant.
Using the same configuration as in the previous figure, we leverage the local utility approximation to weight the different
utilities in the optimization problem. The weights are corresponding to the price, emission, and grid costs, respectively,
where the grid cost is the sum of the ramping and load factors. The input weights we consider are therefore multiplied
by [1/3, 1/3, 1/6, 1/6]
and normalized to sum to one, such that [1, 1, 1, 1]
corresponds to the original challenge
weights.
A comparison of different utility weightings for the local planners. The original (environment's) utility is shown, and the actions are taken by the planners using
[1, 1, 1, 1]
,[1, 0, 0, 0]
,[0, 1, 0, 0]
, and[0, 0, 1, 1]
weightings (without last global RB).
Unsurprisingly, each utility weighting pulls the decision-maker towards its own utility.
The time-averaged utility of the different weightings are:
Utility weighting | Utility (total) | Price cost | Emission cost | Grid cost |
---|---|---|---|---|
[1, 1, 1, 1] |
0.8757 | 0.8102 | 0.9108 | 0.9061 |
[1, 0, 0,0] |
0.9039 | 0.7112 | 0.9155 | 1.085 |
[0, 1, 0, 0] |
0.962 | 0.8296 | 0.906 | 1.1503 |
[0, 0, 1, 1] |
0.9763 | 0.9221 | 1.0185 | 0.9882 |
We consider the effect of shuffling the order of the agents, where the last agent that uses rule-based decision-maker is randomly selected at each time step to distribute the altruistic behaviour across the district.
A comparison of the effect of shuffling the order of the agents. The net consumption of a single building is shown (top), and the actions are taken by the planners using
[0,1,4]
search depth (with last global RB). Below, the net consumption of the whole district (sum of all building's) is presented.
We observe that the shuffling of the order of the agents each time step introduces a negative effect on the performance, as the future is "cropped" for the agents whenever they are selected to use as the rule-based altruistic decision-maker.
The total (time-averaged) utility of these two experiments are:
Random order | Utility (total) |
---|---|
True |
0.8759 |
False |
0.8696 |
Here we use the special rule-based decision-maker, and focus on the effect of the global decision-maker. To this end, we avoid agents randomization, and present the egoistic and altruistic behaviours of the last agent with and without the global decision-maker (top), and their effect on the district's net consumption (bottom).
The global decision-maker seems to not take the optimal actions for itself (upper panel, blue curve compared to the orange one), but it is able to flatten the net consumption curve of the whole district (bottom panel, orange curve compared to the blue one).
The rule-based solution at this stage is superior to the planner, as it is able to flatten the net consumption curve more precisely.
However, the special abilities of the planner to incorporate information from the far future, are anticipated to be
able to improve the performance of the planner.
Moreover, we found that synergizing imperfect planners with last rule-based agent is a beneficial approach, as it is able to
exploit both far future (local) information and the accurate global information required for "polishing" the net consumption curve.
Note that, as any planner-based, this solution is only as good as its function predictor, and thus it is expected to be
able to perform better with an improved one.
Furthermore, the adequate selection of the search depth enabled the planner to exploit information from the far future.
Still, we limited it to
You can create the environment using the requirements.txt file or the environment.yml file, both should work fine.
Library | Version |
---|---|
python |
3.9.13 |
matplotlib |
3.5.2 |
tqdm |
4.64.1 |
torch |
1.12.1 |
plus the CityLearn package itself, with its dependencies. Note to get the 1.3.6 version, from: https://github.com/intelligent-environments-lab/CityLearn
├── README.md
├── agents
- folder for the agents' modules
│ ├── battery_model_rb_agent.py
- rule-based agent with a battery model
│ ├── brute_force_day_agent.py
- planner agent combined with a rule-based agent
│ └── controller.py
- the multi-agent control class
├── battery
- folder for the battery model modules
│ |── BATTERY.md
- explanation of the battery model
│ ├── battery_model.py
- battery model class
│ ├── battery_model_new.py
- battery model class (second version)
│ ├── env_battery_model.py
- battery model class as it is written in the CityLearn environment
│ └── tree_search.py
- the tree search algorithm (UCS) implementation
├── data
- the CityLearn phase 1 data and the data needed for MLP training and prediction
├── environment.yml
├── figures
- figures for README files
├── main.py
- main script for locally evaluating the model on the training data
├── net_consumption_mlp
- folder for the MLP implementation
│ ├── data_preprocessing.py
- CityLearn data preprocessor for MLP training
│ ├── eval_exp_mlp.py
- MLP evaluator
│ ├── main_mlp.py
- main module for MLP training
│ ├── mlp_model.py
- the MLP class
│ ├── mlp_predictor.py
- a CityLearn MLP-based predictor
├── notebooks
- folder for notebooks collection
├── predictors
- folder for predictor modules
│ ├── check_predictors.py
│ ├── consumption_predictors.py
│ └── predictors_wrapper.py
- a wrapper for the different predictors being used
├── requirements.txt
├── rewards
- folder for reward and utility modules
│ ├── agent_inst_u.py
- a local instantaneous utility function approximation module
│ ├── agent_inst_u_with_last.py
- a local and global instantaneous utility function approximation module
│ ├── get_reward.py
- not in use in our implementation but kept for CityLearn compatibility
│ └── user_reward.py
- not in use in our implementation but kept for CityLearn compatibility
└── utils.py
- utility functions for the main script
- CityLearn challenge. https://www.aicrowd.com/challenges /neurips-2022-citylearn-challenge
This solution and its repository were developed by:
-
Gal Ness - gness67{at}gmail.com
-
Roy Elkabetz - elkabetzroy{at}gmail.com