
Prometheus Cheatsheets

Basics

  • Counter: A counter metric always increases
  • Gauge: A gauge metric can increase or decrease
  • Histogram: A histogram samples observations (for example, request durations) and counts them in configurable buckets (see the sketch after this list)
  • Source and Statistics 101
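
As a quick sketch of how a gauge and a histogram typically show up in queries (using metric names that appear later in this cheatsheet):

# gauge: use the value directly, or smooth it with an *_over_time function
avg_over_time(node_memory_MemAvailable_bytes[5m])
# histogram: query the *_bucket series with histogram_quantile()
histogram_quantile(0.95, sum by (le) (rate(prometheus_http_request_duration_seconds_bucket[5m])))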

Query Functions:

  • rate - Calculates the per-second rate at which a counter increases, averaged over a given time window. src
  • irate - Also calculates the per-second rate of increase, but only looks at the last two data points in the window. This makes irate well suited for graphing volatile and/or fast-moving counters. src
  • increase - Calculates the total counter increase over a given time window. src
  • resets - Gives you the number of counter resets over a given time window. src (all four functions are shown together in the example after this list)
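
A minimal sketch applying all four functions to the same counter (http_requests_total, as used elsewhere in this cheatsheet):

# per-second request rate, averaged over a 5 minute window
rate(http_requests_total[5m])
# per-second request rate, based only on the last two samples in the window
irate(http_requests_total[5m])
# total number of requests added over the last hour
increase(http_requests_total[1h])
# number of counter resets (e.g. process restarts) over the last hour
resets(http_requests_total[1h])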

Curated Examples

Example queries per exporter / service:

Questions and Answers

How can I get the number of requests over a given time range (the dashboard time range)?

sum by (uri) (increase(http_requests_total[$__range]))

How many pod restarts per minute?

# restarts per minute, averaged over the last 15 minutes
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace="default"}[15m]) * 60

View the pod restarts over time:

sum(kube_pod_container_status_restarts_total{container="my-service"}) by (pod)

Example Queries

Show me all the metric names for the job=app:

group ({job="app"}) by (__name__)

How many nodes are up?

up
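
up returns one series per scrape target (1 when the scrape succeeds, 0 when it fails), so to get a single number you can count the healthy targets, for example:

count(up == 1)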

Combining values from 2 different vectors (Hostname with a Metric):

up * on(instance) group_left(nodename) (node_uname_info)

Exclude labels:

sum without(job) (up * on(instance)  group_left(nodename)  (node_uname_info))

Count targets per job:

count by (job) (up)

Amount of Memory Available:

node_memory_MemAvailable_bytes

Amount of Memory Available in MB:

node_memory_MemAvailable_bytes/1024/1024

Amount of Memory Available in MB 10 minutes ago:

node_memory_MemAvailable_bytes/1024/1024 offset 10m

Average Memory Available for Last 5 Minutes:

avg_over_time(node_memory_MemAvailable_bytes[5m])/1024/1024

Memory Usage in Percent:

100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))

CPU Utilization:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle", instance="my-instance"}[5m])) * 100 ) 

CPU Utilization, offset by 24 hours:

100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle", instance="my-instance"}[5m] offset 24h)) * 100 )

CPU Utilization per Core:

( (1 - rate(node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}[$__interval])) / ignoring(cpu) group_left count without (cpu)( node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}) ) 

CPU Utilization by Node:

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m]) * 100) * on(instance) group_left(nodename) (node_uname_info))

Memory Available by Node:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Or if you rely on labels from other metrics:

(node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"} - node_memory_Buffers_bytes{job="node-exporter"} - node_memory_Cached_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$nodename"})

Load Average in percentage:

avg(node_load1{instance=~"$name", job=~"$job"}) /  count(count(node_cpu_seconds_total{instance=~"$name", job=~"$job"}) by (cpu)) * 100

Load Average per Instance:

sum(node_load5{}) by (instance) / count(node_cpu_seconds_total{mode="user"}) by (instance) * 100

Load Average averaged per instance_id (for example, when two series share the same instance label but have different instance_id labels):

avg by (instance_id, instance) (node_load1{job=~"node-exporter", aws_environment="dev", instance="debug-dev"})
# {instance="debug-dev",instance_id="i-aaaaaaaaaaaaaaaaa"}
# {instance="debug-dev",instance_id="i-bbbbbbbbbbbbbbbbb"}

Disk Available by Node:

node_filesystem_free_bytes{mountpoint="/"} * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Outbound:

sum(rate(node_disk_read_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Disk IO per Node: Inbound:

sum(rate(node_disk_written_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Network IO per Node:

sum(rate(node_network_receive_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
sum(rate(node_network_transmit_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)

Process Restarts:

changes(process_start_time_seconds{job=~".+"}[15m])

Container Cycling:

(time() - container_start_time_seconds{job=~".+"}) < 60

Histogram quantiles (here the 1.0 quantile, i.e. the maximum request duration, converted to milliseconds):

histogram_quantile(1.00, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (handler, le)) * 1e3

Metrics 24 hours ago (useful when comparing today with yesterday):

# query a
total_number_of_errors{instance="my-instance", region="eu-west-1"}
# query b
total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h

# related:
# https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/
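
To turn the two queries above into a single day-over-day difference (a sketch using the same metric and labels):

total_number_of_errors{instance="my-instance", region="eu-west-1"}
  - total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h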

Number of Nodes (Up):

count(up{job="cadvisor_my-swarm"})

Running Containers per Node:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id)

Running Containers per Node, including corresponding hostnames:

count(container_last_seen) BY (container_label_com_docker_swarm_node_id) * ON (container_label_com_docker_swarm_node_id) GROUP_LEFT(node_name) node_meta 

HAProxy Response Codes:

haproxy_server_http_responses_total{backend=~"$backend", server=~"$server", code=~"$code", alias=~"$alias"} > 0

Metrics with the most time series (highest cardinality):

topk(10, count by (__name__)({__name__=~".+"}))

the same, but per job:

topk(10, count by (__name__, job)({__name__=~".+"}))

or which jobs have the most time series:

topk(10, count by (job)({__name__=~".+"}))

Top 5 per value:

sort_desc(topk(5, aws_service_costs))

Table - Top 5 (enable instant as well):

sort(topk(5, aws_service_costs))

Most metrics per job, sorted:

sort_desc (sum by (job) (count by (__name__, job)({job=~".+"})))

Group per Day (Table) - wip

aws_service_costs{service=~"$service"} + ignoring(year, month, day) group_right
  count_values without() ("year", year(timestamp(
    count_values without() ("month", month(timestamp(
      count_values without() ("day", day_of_month(timestamp(
        aws_service_costs{service=~"$service"}
      )))
    )))
  ))) * 0

Group Metrics per node hostname:

node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

Which returns series like:
{cloud_provider="amazon",instance="x.x.x.x:9100",job="node_n1",my_hostname="n1.x.x",nodename="n1.x.x"}

Subtract two gauge metrics (ignore the label that doesn't match):

polkadot_block_height{instance="polkadot", chain=~"$chain", status="sync_target"} - ignoring(status) polkadot_block_height{instance="polkadot", chain=~"$chain", status="finalized"}

Conditional joins when labels exist:

(
    # source: https://stackoverflow.com/a/72218915
    # For all sensors that have a name (label "label"), join them with `node_hwmon_sensor_label` to get that name.
    (node_hwmon_temp_celsius * ignoring(label) group_left(label) node_hwmon_sensor_label)
  or
    # For all sensors that do NOT have a name (label "label") in `node_hwmon_sensor_label`, assign them `label="unknown-sensor-name"`.
    # `label_replace()` only adds the new label, it does not remove the old one.
    (label_replace((node_hwmon_temp_celsius unless ignoring(label) node_hwmon_sensor_label), "label", "unknown-sensor-name", "", ".*"))
)

Container CPU Average for 5m:

(sum by(instance, container_label_com_amazonaws_ecs_container_name, container_label_com_amazonaws_ecs_cluster) (rate(container_cpu_usage_seconds_total[5m])) * 100) 

Container Memory Usage: Total:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})

Container Memory, per Task, Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id)

Container Memory per Node:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_node_id)

Memory Usage per Stack:

sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_stack_namespace)

Remove metrics from results that do not contain a specific label:

container_cpu_usage_seconds_total{container_label_com_amazonaws_ecs_cluster!=""}

Remove labels from a metric:

sum without (age, country) (people_metrics)

View top 10 biggest metrics by name:

topk(10, count by (__name__)({__name__=~".+"}))

View top 10 biggest metrics by name, job:

topk(10, count by (__name__, job)({__name__=~".+"}))

View all metrics for a specific job:

{__name__=~".+", job="node-exporter"}

View all metrics for more than one job using vector selectors:

{__name__=~".+", job=~"traefik|cadvisor|prometheus"}

Website uptime with blackbox-exporter:

# https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for

avg_over_time(probe_success{job="node"}[15m]) * 100

Remove / Replace:

Client Request Counts:

irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Client Response Time:

irate(http_client_requests_seconds_sum{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m]) / 
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])

Requests per Second:

sum(increase(http_server_requests_seconds_count{service="my-service", env="dev"}[1m])) by (uri)

is the same as:

sum(rate(http_server_requests_seconds_count{service="my-service", env="dev"}[1m]) * 60 ) by (uri)

See this SO thread for more details

p95 Request Latencies with histogram_quantile (the latency, in seconds, below which 95% of requests complete; the slowest 5% take longer):

histogram_quantile(0.95, sum by (le, store) (rate(myapp_latency_seconds_bucket{application="product-service", category=~".+"}[5m])))

Resource Requests and Limits:

# for cpu: average rate of CPU usage over the last 15 minutes
rate(container_cpu_usage_seconds_total{job="kubelet",container="my-application"}[15m])

# for memory: usage in MB
container_memory_usage_bytes{job="kubelet",container="my-application"} / (1024 * 1024)
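
The configured requests and limits themselves come from kube-state-metrics rather than the kubelet; assuming kube-state-metrics v2+ is scraped, a sketch:

# configured CPU request (in cores) and memory limit (in bytes) per container
kube_pod_container_resource_requests{resource="cpu", container="my-application"}
kube_pod_container_resource_limits{resource="memory", container="my-application"}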

Scrape Config

relabel configs:

# full example: https://gist.github.com/ruanbekker/72216bea59fc56af189f5a7b2e3a8002
scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
    - targets: ['ip-192-168-64-29.multipass:9100']
      labels:
        env: test
    - targets: ['ip-192-168-64-30.multipass:9100']
      labels:
        env: test
    # https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/#internal-labels
    relabel_configs:
    - source_labels: [__address__]
      separator: ':'
      regex: '(.*):(.*)'
      replacement: '${1}'
      target_label: instance

static_configs:

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
    - targets: ['localhost:9090']
      labels:
        region: 'eu-west-1'

dns_sd_configs:

scrape_configs:
  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    dns_sd_configs:
    - names:
      - 'tasks.mysql-exporter'
      type: 'A'
      port: 9104
    relabel_configs:
    - source_labels: [__address__]
      regex: '.*'
      target_label: instance
      replacement: 'mysqld-exporter'

Useful links:

Grafana with Prometheus

If you have output like this in Grafana:

{instance="10.0.2.66:9100",job="node",nodename="rpi-02"}

and you only want to show the hostnames, you can apply the following in "Legend" input:

{{nodename}}

If you want the exported_instance label in the output of:

sum(exporter_memory_usage{exported_instance="myapp"})

You would need to do:

sum by (exported_instance) (exporter_memory_usage{exported_instance="myapp"})

Then on Legend:

{{exported_instance}}

Variables

  • Hostname:

name: node label: node query: label_values(node_uname_info, nodename)

Then in Grafana you can use:

sum(rate(node_disk_read_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$node"})
  • Node Exporter Address

type: query query: label_values(node_network_up, instance)

  • MySQL Exporter Address

type: query query: label_values(mysql_up, instance)

  • Static Values:

type: custom name: dc label: dc values separated by comma: eu-west-1a,eu-west-1b,eu-west-1c

  • Docker Swarm Stack Names

name: stack label: stack query: label_values(container_last_seen,container_label_com_docker_stack_namespace)

  • Docker Swarm Service Names

name: service_name label: service_name query: label_values(container_last_seen,container_label_com_docker_swarm_service_name)

  • Docker Swarm Manager NodeId:

name: manager_node_id label: manager_node_id query:

label_values(container_last_seen{container_label_com_docker_swarm_service_name=~"proxy_traefik", container_label_com_docker_swarm_node_id=~".*"}, container_label_com_docker_swarm_node_id)
  • Docker Swarm Stacks Running on Managers

name: stack_on_manager label: stack_on_manager query:

label_values(container_last_seen{container_label_com_docker_swarm_node_id=~"$manager_node_id"},container_label_com_docker_stack_namespace)
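
As an example of wiring these variables into a panel query (a sketch combining the stack and service name variables above with the container memory queries from earlier):

sum(container_memory_rss{container_label_com_docker_stack_namespace=~"$stack", container_label_com_docker_swarm_service_name=~"$service_name"}) by (container_label_com_docker_swarm_service_name)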

Recording Rules
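
A minimal sketch of a recording rules file; the group and rule names are illustrative, and the expression reuses the CPU utilization query from above:

groups:
  - name: node-rules
    rules:
      - record: instance:node_cpu_utilisation:avg_irate5m
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)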

Application Instrumentation

Python Flask

External Sources

Setups: