Prometheus Cheatsheets

Basics
Curated Examples
Example Queries
Scrape Configs
Recording Rules
External Sources

Basics

Counter: A counter metric always increases
Gauge: A gauge metric can increase or decrease
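Histogram: A histogram metric samples observations (such as request durations) into configurable cumulative buckets, and also exposes a count and a sum of all observations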
Source and Statistics 101

Query Functions:

rate - The rate function calculates the average per-second rate at which a counter increases over a given time window. src
irate - Also calculates the per-second rate at which a counter increases over a defined time window, but it only looks at the last two data points in that window. This makes irate well suited for graphing volatile and/or fast-moving counters. src
increase - The increase function calculates the counter increase over a given time frame. src
resets - The function gives you the number of counter resets over a given time window. src
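
To make the difference between these four functions concrete, here is a minimal Python sketch that applies simplified versions of them to a hand-written list of samples. It is not how Prometheus implements them: the real rate and increase also extrapolate to the window boundaries, which this sketch skips. The sample data and the helper names are made up for illustration.

```python
# Simplified models of rate, irate, increase and resets.
# Assumption: samples are (unix_seconds, counter_value) pairs from one range window.
# Unlike Prometheus, no extrapolation to the window edges is done.

def resets(samples):
    """Number of times the counter value dropped between consecutive samples."""
    return sum(1 for (_, a), (_, b) in zip(samples, samples[1:]) if b < a)

def increase(samples):
    """Total counter growth, treating every drop as a restart from zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

def rate(samples):
    """Average per-second growth between the first and the last sample."""
    return increase(samples) / (samples[-1][0] - samples[0][0])

def irate(samples):
    """Per-second growth between the last two samples only."""
    return rate(samples[-2:])

# Hypothetical counter scraped every 15s; the process restarts before t=45.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 70)]

print(resets(window))    # 1
print(increase(window))  # 30 + 30 + 10 + 60 = 130.0
print(rate(window))      # 130 / 60
print(irate(window))     # 60 / 15 = 4.0
```

Note how irate reports 4.0 per second because it only sees the last two samples, while rate averages over the whole window.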

Curated Examples

Example queries per exporter / service:

Node Metrics

Questions and Answers

How can I get the number of requests over a given time (dashboard time):

sum by (uri) (increase(http_requests_total[$__range]))

How many pod restarts per minute?

rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace="default"}[15m]) * 60

View the pod restarts over time:

sum(kube_pod_container_status_restarts_total{container="my-service"}) by (pod)

Example Queries

Show me all the metric names for the job=app:

group ({job="app"}) by (__name__)

How many nodes are up?

count(up == 1)

Combining values from 2 different vectors (Hostname with a Metric):

up * on(instance) group_left(nodename) (node_uname_info)
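
The join above can be pictured as a lookup from each up series into node_uname_info by the instance label. The following Python sketch models that under simple assumptions: a series is a (labels, value) pair, and the instance and nodename values are made up. Because node_uname_info always has the value 1, the multiplication leaves the value of up unchanged and only adds the nodename label.

```python
# A sketch of: up * on(instance) group_left(nodename) (node_uname_info)
# Assumption: a series is modelled as (labels_dict, value); the data is made up.

def join_group_left(left, right, on, copy_labels):
    """Multiply each left series by the single right series sharing the `on`
    label, and copy `copy_labels` from the right side onto the result."""
    index = {}
    for labels, value in right:
        key = labels[on]
        if key in index:
            raise ValueError(f"duplicate right-hand series for {on}={key!r}")
        index[key] = (labels, value)
    result = []
    for labels, value in left:
        match = index.get(labels[on])
        if match is None:
            continue  # left series without a partner are dropped
        right_labels, right_value = match
        merged = dict(labels)
        merged.update({name: right_labels[name] for name in copy_labels})
        result.append((merged, value * right_value))
    return result

up = [
    ({"instance": "10.0.2.66:9100", "job": "node"}, 1),
    ({"instance": "10.0.2.67:9100", "job": "node"}, 0),
]
node_uname_info = [
    ({"instance": "10.0.2.66:9100", "nodename": "rpi-02"}, 1),
    ({"instance": "10.0.2.67:9100", "nodename": "rpi-03"}, 1),
]

joined = join_group_left(up, node_uname_info, on="instance", copy_labels=["nodename"])
for labels, value in joined:
    print(labels, value)
```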

Exclude labels:

sum without(job) (up * on(instance)  group_left(nodename)  (node_uname_info))

Count targets per job:

count by (job) (up)

Amount of Memory Available:

node_memory_MemAvailable_bytes

Amount of Memory Available in MB:

node_memory_MemAvailable_bytes/1024/1024

Amount of Memory Available in MB 10 minutes ago:

node_memory_MemAvailable_bytes offset 10m / 1024 / 1024

Average Memory Available for Last 5 Minutes:

avg_over_time(node_memory_MemAvailable_bytes[5m])/1024/1024

Memory Usage in Percent:

100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))

CPU Utilization:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle", instance="my-instance"}[5m])) * 100 )
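
The arithmetic behind this query is short enough to sketch in Python. The per-core idle rates below are made-up values standing in for what irate would return for mode="idle"; the function name is hypothetical.

```python
# Assumption: per-core results of irate(node_cpu_seconds_total{mode="idle"}[5m])
# for one instance, i.e. the fraction of each second the core spent idle.
idle_rate_per_core = {"0": 0.90, "1": 0.70, "2": 0.80, "3": 0.60}

def cpu_utilization_percent(idle_rates):
    """Mirror of: 100 - (avg by (instance) (idle rate) * 100)."""
    avg_idle = sum(idle_rates.values()) / len(idle_rates)
    return 100 - avg_idle * 100

print(round(cpu_utilization_percent(idle_rate_per_core), 2))  # 25.0
```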

CPU Utilization 24 hours ago:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle", instance="my-instance"}[5m] offset 24h)) * 100 )

CPU Utilization per Core:

( (1 - rate(node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}[$__interval])) / ignoring(cpu) group_left count without (cpu)( node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}) )

CPU Utilization by Node:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m]) * 100) * on(instance) group_left(nodename) (node_uname_info))

Memory Available by Node:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Or if you rely on labels from other metrics:

(node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"} - node_memory_Buffers_bytes{job="node-exporter"} - node_memory_Cached_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$nodename"})

Load Average in percentage:

avg(node_load1{instance=~"$name", job=~"$job"}) /  count(count(node_cpu_seconds_total{instance=~"$name", job=~"$job"}) by (cpu)) * 100

Load Average per Instance:

sum(node_load5{}) by (instance) / count(node_cpu_seconds_total{mode="user"}) by (instance) * 100

Load Average (averaged per instance_id, for example when two series share the same instance label value but have different instance_id values):

avg by (instance_id, instance) (node_load1{job=~"node-exporter", aws_environment="dev", instance="debug-dev"})
# {instance="debug-dev",instance_id="i-aaaaaaaaaaaaaaaaa"}
# {instance="debug-dev",instance_id="i-bbbbbbbbbbbbbbbbb"}

Disk Available by Node:

node_filesystem_free_bytes{mountpoint="/"} * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Outbound:

sum(rate(node_disk_read_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Inbound:

sum(rate(node_disk_written_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Network IO per Node:

sum(rate(node_network_receive_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
sum(rate(node_network_transmit_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Process Restarts:

changes(process_start_time_seconds{job=~".+"}[15m])

Container Cycling:

(time() - container_start_time_seconds{job=~".+"}) < 60

Histogram:

histogram_quantile(1.00, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (handler, le)) * 1e3

Metrics 24 hours ago (nice when you compare today with yesterday):

# query a
total_number_of_errors{instance="my-instance", region="eu-west-1"}
# query b
total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h

# related:
# https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/

Number of Nodes (Up):

count(up{job="cadvisor_my-swarm"})

Running Containers per Node:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id)

Running Containers per Node, include corresponding hostnames:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id) * ON (container_label_com_docker_swarm_node_id) GROUP_LEFT(node_name) node_meta

HAProxy Response Codes:

haproxy_server_http_responses_total{backend=~"$backend", server=~"$server", code=~"$code", alias=~"$alias"} > 0

Metrics with the most time series:

topk(10, count by (__name__)({__name__=~".+"}))

the same, but per job:

topk(10, count by (__name__, job)({__name__=~".+"}))

or the jobs that have the most time series:

topk(10, count by (job)({__name__=~".+"}))

Top 5 per value:

sort_desc(topk(5, aws_service_costs))

Table - Top 5 (enable instant as well):

sort(topk(5, aws_service_costs))

Most metrics per job, sorted:

sort_desc (sum by (job) (count by (__name__, job)({job=~".+"})))

Group per Day (Table) - wip

aws_service_costs{service=~"$service"} + ignoring(year, month, day) group_right
  count_values without() ("year", year(timestamp(
    count_values without() ("month", month(timestamp(
      count_values without() ("day", day_of_month(timestamp(
        aws_service_costs{service=~"$service"}
      )))
    )))
  ))) * 0

Group Metrics per node hostname:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

..
{cloud_provider="amazon",instance="x.x.x.x:9100",job="node_n1",my_hostname="n1.x.x",nodename="n1.x.x"}

Subtract two gauge metrics (ignoring the label that doesn't match):

polkadot_block_height{instance="polkadot", chain=~"$chain", status="sync_target"} - ignoring(status) polkadot_block_height{instance="polkadot", chain=~"$chain", status="finalized"}

Conditional joins when labels exist:

(
    # source: https://stackoverflow.com/a/72218915
    # For all sensors that have a name (label "label"), join them with `node_hwmon_sensor_label` to get that name.
    (node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label)
  or
    # For all sensors that do NOT have a name (label "label") in `node_hwmon_sensor_label`, assign them `label="unknown-sensor-name"`.
    # `label_replace()` only adds the new label, it does not remove the old one.
    (label_replace((node_hwmon_temp_celsius unless ignoring(label) node_hwmon_sensor_label), "label", "unknown-sensor-name", "", ".*"))
)
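
The end result of the two branches can be sketched in Python as a lookup with a fallback. This is only an illustration of the outcome, not of how PromQL evaluates `or` and `unless`; the (chip, sensor) keys and the values are hypothetical.

```python
# Assumption: series are keyed by hypothetical (chip, sensor) label pairs.
temperatures = {("chip0", "temp1"): 45.0, ("chip0", "temp2"): 51.5}
sensor_names = {("chip0", "temp1"): "CPU"}  # temp2 has no name series

def with_sensor_names(temperatures, sensor_names, default="unknown-sensor-name"):
    """Attach a name to every temperature series, falling back to `default`."""
    named = {}
    for key, value in temperatures.items():
        if key in sensor_names:
            # First branch: the sensor has a matching name series, so join it.
            named[key] = (sensor_names[key], value)
        else:
            # Second branch: no name series, so assign the fallback label value.
            named[key] = (default, value)
    return named

for key, (name, value) in with_sensor_names(temperatures, sensor_names).items():
    print(key, name, value)
```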

Container CPU Average for 5m:

(sum by(instance, container_label_com_amazonaws_ecs_container_name, container_label_com_amazonaws_ecs_cluster) (rate(container_cpu_usage_seconds_total[5m])) * 100)

Container Memory Usage: Total:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})

Container Memory, per Task, Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id)

Container Memory per Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_node_id)

Memory Usage per Stack:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_stack_namespace)

Remove series from the results that do not contain a specific label:

container_cpu_usage_seconds_total{container_label_com_amazonaws_ecs_cluster!=""}

Remove labels from a metric:

sum without (age, country) (people_metrics)

View top 10 biggest metrics by name:

topk(10, count by (__name__)({__name__=~".+"}))

View top 10 biggest metrics by name, job:

topk(10, count by (__name__, job)({__name__=~".+"}))

View all metrics for a specific job:

{__name__=~".+", job="node-exporter"}

View all metrics for more than one job using a regex matcher in the vector selector:

{__name__=~".+", job=~"traefik|cadvisor|prometheus"}

Website uptime with blackbox-exporter:

# https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for

avg_over_time(probe_success{job="node"}[15m]) * 100
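
Since probe_success is 1 for a passing probe and 0 for a failing one, averaging it over a window gives the fraction of successful probes. A tiny Python sketch with made-up samples:

```python
# Assumption: probe_success samples from one 15m window (1 = passed, 0 = failed).
probe_success = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

def uptime_percent(samples):
    """Mirror of: avg_over_time(probe_success[15m]) * 100."""
    return 100 * sum(samples) / len(samples)

print(uptime_percent(probe_success))  # 80.0
```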

Remove / Replace:

https://medium.com/@texasdave2/replace-and-remove-a-label-in-a-prometheus-query-9500faa302f0

Client Request Counts:

irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Client Response Time:

irate(http_client_requests_seconds_sum{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m]) / 
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Requests per Minute:

sum(increase(http_server_requests_seconds_count{service="my-service", env="dev"}[1m])) by (uri)

is the same as:

sum(rate(http_server_requests_seconds_count{service="my-service", env="dev"}[1m]) * 60 ) by (uri)

See this SO thread for more details

p95 Request Latencies with histogram_quantile (the latency in seconds that 95% of requests complete within; only the slowest 5% take longer):

histogram_quantile(0.95, sum by (le, store) (rate(myapp_latency_seconds_bucket{application="product-service", category=~".+"}[5m])))
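
To show where the returned number comes from, here is a simplified Python sketch of quantile estimation over cumulative buckets using linear interpolation inside the matching bucket. It ignores several details of the real function (for example negative bucket bounds and native histograms), and the bucket counts are made up.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket. Sketch only: requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound

# Hypothetical per-bucket rates, already summed by le.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

print(histogram_quantile(0.95, buckets))  # roughly 0.8125
print(histogram_quantile(1.0, buckets))   # 1.0, the highest finite bucket bound
```

The estimate can never be more precise than the bucket layout: with these buckets, any p95 between 0.5 and 1.0 is interpolated from just two counts.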

Resource Requests and Limits:

# for cpu: average rate of cpu usage over 15minutes
rate(container_cpu_usage_seconds_total{job="kubelet",container="my-application"}[15m])

# for mem: shows in mb
container_memory_usage_bytes{job="kubelet",container="my-application"}  / (1024 * 1024)

Scrape Config

relabel configs:

# full example: https://gist.github.com/ruanbekker/72216bea59fc56af189f5a7b2e3a8002
scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
    - targets: ['ip-192-168-64-29.multipass:9100']
      labels:
        env: test
    - targets: ['ip-192-168-64-30.multipass:9100']
      labels:
        env: test
    # https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/#internal-labels
    relabel_configs:
    - source_labels: [__address__]
      separator: ':'
      regex: '(.*):(.*)'
      replacement: '${1}'
      target_label: instance
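
As a rough illustration of what this relabel rule does to a target, the Python sketch below mimics the default replace action: join the source labels, match the fully anchored regex, and write the expanded replacement into the target label. The function name is hypothetical, and Python's re module is not the RE2 engine Prometheus uses, so treat it as an approximation.

```python
import re

def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Sketch of a relabel rule with the default action (replace)."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored at both ends
    if match is None:
        return labels  # no match: the labels are left untouched
    # Translate ${1}-style references into Python's \g<1> form.
    template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", replacement)
    return {**labels, target_label: match.expand(template)}

target = {"__address__": "ip-192-168-64-29.multipass:9100"}
relabeled = relabel_replace(
    target,
    source_labels=["__address__"],
    regex="(.*):(.*)",
    replacement="${1}",
    target_label="instance",
    separator=":",
)
print(relabeled["instance"])  # ip-192-168-64-29.multipass
```

The port is stripped because the first capture group holds everything before the colon.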

static_configs:

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
    - targets: ['localhost:9090']
      labels:
        region: 'eu-west-1'

dns_sd_configs:

scrape_configs:
  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    dns_sd_configs:
    - names:
      - 'tasks.mysql-exporter'
      type: 'A'
      port: 9104
    relabel_configs:
    - source_labels: [__address__]
      regex: '.*'
      target_label: instance
      replacement: 'mysqld-exporter'

Useful links:

https://gist.github.com/ruanbekker/72216bea59fc56af189f5a7b2e3a8002
https://gist.github.com/trastle/1aa205354577ef0b329d4b8cc84c674a
prometheus/docs#341
https://medium.com/quiq-blog/prometheus-relabeling-tricks-6ae62c56cbda
https://blog.freshtracks.io/prometheus-relabel-rules-and-the-action-parameter-39c71959354a
https://www.robustperception.io/relabel_configs-vs-metric_relabel_configs
https://training.robustperception.io/courses/prometheus-configuration/lectures/3170347

Grafana with Prometheus

If you have output like this in Grafana:

{instance="10.0.2.66:9100",job="node",nodename="rpi-02"}

and you only want to show the hostnames, you can apply the following in the "Legend" input:

{{nodename}}

If you want exported_instance to appear in the result of a query such as:

sum(exporter_memory_usage{exported_instance="my_app"})

You would need to aggregate by that label:

sum by (exported_instance) (exporter_memory_usage{exported_instance="my_app"})

Then on Legend:

{{exported_instance}}

Variables

Hostname:

name: node
label: node
query: label_values(node_uname_info, nodename)

Then in Grafana you can use:

sum(rate(node_disk_read_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$node"})

Node Exporter Address

type: query
query: label_values(node_network_up, instance)

MySQL Exporter Address

type: query
query: label_values(mysql_up, instance)

Static Values:

type: custom
name: dc
label: dc
values separated by comma: eu-west-1a,eu-west-1b,eu-west-1c

Docker Swarm Stack Names

name: stack
label: stack
query: label_values(container_last_seen,container_label_com_docker_stack_namespace)

Docker Swarm Service Names

name: service_name
label: service_name
query: label_values(container_last_seen,container_label_com_docker_swarm_service_name)

Docker Swarm Manager NodeId:

label_values(container_last_seen{container_label_com_docker_swarm_service_name=~"proxy_traefik", container_label_com_docker_swarm_node_id=~".*"}, container_label_com_docker_swarm_node_id)

Docker Swarm Stacks Running on Managers

label_values(container_last_seen{container_label_com_docker_swarm_node_id=~"$manager_node_id"},container_label_com_docker_stack_namespace)

Recording Rules

@deploy.live's Recording Rules Post

Application Instrumentation

Python Flask

@ramdesh flask-prometheus-grafana-example

External Sources

Prometheus
PromQL for Beginners
Prometheus 101
Section.io: Prometheus Querying
InnoQ: Prometheus Counters
Biggest Metrics
Top Metrics
Ordina-Jworks
Infinity Works
Prometheus Relabeling Tricks
@Valyala: PromQL Tutorial for Beginners
@Jitendra: PromQL Cheat Sheet
InfinityWorks: Prometheus Example Queries
Timber: PromQL for Humans
SectionIO: Prometheus Querying
RobustPerception
- RobustPerception: Understanding Machine CPU Usage
- RobustPerception: Common Query Patterns
- RobustPerception: Website Uptime
- RobustPerception: Prometheus Histogram
- RobustPerception: Prometheus Counter
- RobustPerception: Prometheus Gauge
- RobustPerception: Prometheus Summary
DevConnected: The Definitive Guide to Prometheus
@showmax Prometheus Introduction
@rancher Cluster Monitoring
Prometheus CPU Stats
@aws Prometheus Rewrite Rules for k8s
ec2_sd_configs
- Prometheus AWS Cross Account ec2_sd_config
- Prometheus AWS ec2_sd_config role
kubernetes_sd_configs
- fabianlee kubernetes configs
@metricfire.com: Understanding the Rate Function Dashboarding:
Alerting on Missing Labels and Metrics
@devconnected Disk IO Dashboarding
@deploy.live recording rules
CPU and Memory Requests
Prometheus Counter Metrics
last9.io PromQL Cheatsheet

Setups:

Simulating AWS Tags in Local Prometheus
