Note: This is Alpha code for evaluation purposes.
Intel® Platform Resource Manager (Intel® PRM) is a suite of software packages that helps you co-locate best-effort jobs with latency-critical jobs on a node and in a cluster. The suite contains the following:
- Agent (eris agent) to monitor and control platform resources (CPU cycles, last-level cache, memory bandwidth, etc.) on each node.
- Analysis tool (analyze tool) to build a model for platform resource contention detection.
The following software is required to build and run Intel® PRM:
- Python 3.6.x
- Python libraries: numpy, pandas, scipy, scikit-learn, docker, prometheus-client
- Golang compiler
- gcc
- git
- Docker
Assuming all requirements are installed and configured properly, follow the steps below to set up a working environment.
- Install the intel-cmt-cat tool with the commands:

  ```
  git clone https://github.com/intel/intel-cmt-cat
  cd intel-cmt-cat
  make
  sudo make install PREFIX=/usr
  ```
- Build the Intel® Platform Resource Manager with the commands:

  ```
  git clone https://github.com/intel/platform-resource-manager
  cd platform-resource-manager
  ./setup.sh
  cd eris
  ```
- Prepare the workload configuration file.

  To use the Intel® PRM tool, you must provide a workload configuration JSON file in advance. Each entry in the file describes the name, id, type (latency-critical or best-effort), and requested CPU count of one task (container). The following example demonstrates the file format (a short loading sketch follows the example).
  ```
  {
      "cassandra_workload": {
          "cpus": 10,
          "type": "latency_critical"
      },
      "django_workload": {
          "cpus": 8,
          "type": "latency_critical"
      },
      "memcache_workload_1": {
          "cpus": 2,
          "type": "latency_critical"
      },
      "memcache_workload_2": {
          "cpus": 2,
          "type": "latency_critical"
      },
      "memcache_workload_3": {
          "cpus": 2,
          "type": "latency_critical"
      },
      "stress-ng": {
          "cpus": 2,
          "type": "best_efforts"
      },
      "tensorflow_training": {
          "cpus": 1,
          "type": "best_efforts"
      }
  }
  ```
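For illustration only, here is a minimal sketch of loading such a file and partitioning tasks by type. The `load_workloads` helper is hypothetical and not part of the eris agent:

```python
import json

def load_workloads(path):
    """Split a workload configuration file into task groups by type."""
    with open(path) as f:
        conf = json.load(f)
    latency_critical = {name: t for name, t in conf.items()
                        if t["type"] == "latency_critical"}
    best_efforts = {name: t for name, t in conf.items()
                    if t["type"] == "best_efforts"}
    return latency_critical, best_efforts

lc, be = load_workloads("workload.json")
print("latency-critical CPUs requested:", sum(t["cpus"] for t in lc.values()))
print("best-effort CPUs requested:", sum(t["cpus"] for t in be.values()))
```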
This section lists command line arguments for the eris agent and the analyze tool.
```
usage: eris.py [-h] [-v] [-g] [-d] [-c] [-r] [-i] [-e] [-n] [-x] [-p]
               [-u UTIL_INTERVAL] [-m METRIC_INTERVAL] [-l LLC_CYCLES]
               [-q QUOTA_CYCLES] [-k MARGIN_RATIO] [-t THRESH_FILE]
               workload_conf_file

eris agent monitors container CPU utilization and platform metrics, detects
potential resource contention, and regulates task resource usage

positional arguments:
  workload_conf_file    workload configuration file describing each task
                        name, type, id, and requested cpu count

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -g, --collect-metrics
                        collect platform performance metrics (CPI, MPKI,
                        etc.)
  -d, --detect          detect resource contention between containers
  -c, --control         regulate best-effort task resource usage
  -r, --record          record container CPU utilization and platform
                        metrics in a csv file
  -i, --key-cid         use container id in the workload configuration file
                        as the key id
  -e, --enable-hold     hold container resource usage at its current level
                        while the usage is close to but does not exceed the
                        throttle threshold
  -n, --disable-cat     disable CAT control during resource regulation
  -x, --exclusive-cat   use exclusive CAT control during resource regulation
  -p, --enable_prometheus
                        allow eris to send metrics to Prometheus
  -u UTIL_INTERVAL, --util-interval UTIL_INTERVAL
                        CPU utilization monitor interval (1, 10)
  -m METRIC_INTERVAL, --metric-interval METRIC_INTERVAL
                        platform metrics monitor interval (2, 60)
  -l LLC_CYCLES, --llc-cycles LLC_CYCLES
                        cycle number in the LLC controller
  -q QUOTA_CYCLES, --quota-cycles QUOTA_CYCLES
                        cycle number in the CPU CFS quota controller
  -k MARGIN_RATIO, --margin-ratio MARGIN_RATIO
                        margin ratio, relative to one logical processor,
                        used in CPU cycle regulation
  -t THRESH_FILE, --thresh-file THRESH_FILE
                        threshold model file built by the analyze.py tool
```
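As background for the CFS quota controller mentioned above, CPU cycle regulation of best-effort tasks is commonly implemented through the kernel's CFS bandwidth control in cgroups. The sketch below illustrates that general mechanism only; the cgroup path, container id, and helper are assumptions, not code from the eris agent:

```python
# Illustrative sketch of CFS bandwidth control (cgroup v1).
# CGROUP_ROOT and the container id are hypothetical; the eris agent's
# actual controller logic may differ.
CGROUP_ROOT = "/sys/fs/cgroup/cpu/docker"

def set_cpu_quota(container_id, cpu_count, period_us=100000):
    """Cap a container at cpu_count logical CPUs of time per period."""
    base = "%s/%s" % (CGROUP_ROOT, container_id)
    with open(base + "/cpu.cfs_period_us", "w") as f:
        f.write(str(period_us))
    with open(base + "/cpu.cfs_quota_us", "w") as f:
        f.write(str(int(cpu_count * period_us)))

# Example (requires root and a running container):
# set_cpu_quota("<container-id>", 1.5)   # limit to 1.5 logical CPUs
```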
```
usage: analyze.py [-h] [-v] [-t THRESH]
                  [-a {gmm-standard,gmm-origin}]
                  [-f {gmm-strict,gmm-normal}]
                  [-m METRIC_FILE] [-u UTIL_FILE] [-o] [-i]
                  workload_conf_file

This tool analyzes CPU utilization and platform metrics collected from the
eris agent and builds a data model for contention detection and resource
regulation.

positional arguments:
  workload_conf_file    workload configuration file describing each task
                        name, type, id, and requested cpu count

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -t THRESH, --thresh THRESH
                        threshold used in outlier detection
  -a {gmm-standard,gmm-origin}, --fense-method {gmm-standard,gmm-origin}
                        fense method used in outlier detection
  -f {gmm-strict,gmm-normal}, --fense-type {gmm-strict,gmm-normal}
                        fense type used in outlier detection
  -m METRIC_FILE, --metric-file METRIC_FILE
                        metrics file collected from the eris agent
  -u UTIL_FILE, --util-file UTIL_FILE
                        utilization file collected from the eris agent
  -o, --offline         do offline analysis based on the given metrics file
  -i, --key-cid         use container id in the workload configuration file
                        as the key id
```
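To make the fense options concrete, here is a minimal sketch of a GMM-based outlier fence on a single metric, using scikit-learn (already a prerequisite). The two-component fit and the mean-plus-threshold-sigma fence are illustrative assumptions, not the actual model analyze.py builds:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_fence(samples, thresh=3.0):
    """Fit a 1-D, two-component GMM and return an upper fence derived
    from the dominant component: mean + thresh * stddev."""
    x = np.asarray(samples, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    k = int(np.argmax(gmm.weights_))            # dominant component
    mean = gmm.means_[k, 0]
    std = np.sqrt(gmm.covariances_[k, 0, 0])
    return mean + thresh * std

# Synthetic example: mostly normal CPI samples plus a contended tail.
cpi = np.concatenate([np.random.normal(1.0, 0.05, 500),
                      np.random.normal(1.6, 0.10, 30)])
print("CPI fence:", gmm_fence(cpi))
```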
- Run latency-critical tasks and stress workloads on one node. CPU utilization will be recorded in util.csv and platform metrics in metrics.csv.

  ```
  sudo python eris.py --collect-metrics --record workload.json
  ```
- Analyze the data collected from the eris agent and build the data model for resource contention detection and regulation. This step generates a model file, threshold.json.

  ```
  sudo python analyze.py workload.json
  ```
- Add best-effort tasks to the node, restart the monitor, and detect potential resource contention.

  ```
  sudo python eris.py --collect-metrics --record --detect workload.json
  ```

  Optionally, you can enable resource regulation on best-effort tasks with the following command:

  ```
  sudo python eris.py --collect-metrics --record --detect --control workload.json
  ```
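The recorded util.csv and metrics.csv files can also be inspected directly, for example with pandas (already a prerequisite). The column name used below is an assumption for illustration; check the headers of the files produced on your node:

```python
import pandas as pd

# Files produced by `eris.py --collect-metrics --record`.
util = pd.read_csv("util.csv")
metrics = pd.read_csv("metrics.csv")

print(util.head())
print(metrics.head())

# Hypothetical: if metrics.csv keys rows by a "cid" column,
# summarize the mean of each numeric metric per container.
if "cid" in metrics.columns:
    print(metrics.groupby("cid").mean(numeric_only=True))
```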
Intel® PRM is an open source project licensed under the Apache v2 License.
Coding style
Intel® PRM follows the standard formatting recommendations and language idioms for C, Go, and Python.
Pull requests
We accept GitHub pull requests.
Issue tracking
If you have a problem, please let us know. If you find a bug that is not already documented, please file a new issue on GitHub so we can work toward a resolution.