[enhancement] Disregard non-running pods #128

rakvay · 2024-10-30T13:50:50Z

Describe the enhancement you'd like

I have been using the Kubernetes dashboards provided in this repository, and I appreciate the work that has gone into creating them. However, I’ve noticed that some of the metrics currently include non-running pods, which can lead to inaccurate resource usage and performance insights.
I would like the PromQL queries for resource requests and limits (kube_pod_container_resource_requests and kube_pod_container_resource_limits) to explicitly filter out non-running pods, for example by joining them against kube_pod_status_phase with phase="Running".

Current expressions:

sum(kube_pod_container_resource_requests{namespace=~"$namespace", resource="cpu", cluster="$cluster"})

sum(kube_pod_container_resource_limits{namespace=~"$namespace", resource="memory", cluster="$cluster"})

Proposed modifications:

sum(kube_pod_container_resource_requests{namespace=~"$namespace", resource="cpu", cluster="$cluster"} * on(namespace, pod) group_left() (sum(kube_pod_status_phase{phase="Running", cluster="$cluster"}) by (pod, namespace) == 1))

sum(kube_pod_container_resource_limits{namespace=~"$namespace", resource="memory", cluster="$cluster"} * on(namespace, pod) group_left() (sum(kube_pod_status_phase{phase="Running", cluster="$cluster"}) by (pod, namespace) == 1))

Similar modifications should be applied to all relevant metrics to accurately reflect the state of running pods.
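
For instance, here is a sketch of the same filter applied to CPU limits. It is untested against the dashboards, and it assumes the cluster="$cluster" matcher should stay on the left-hand selector, as in the current expressions:

```promql
sum(
  kube_pod_container_resource_limits{namespace=~"$namespace", resource="cpu", cluster="$cluster"}
  * on(namespace, pod) group_left()
  (sum(kube_pod_status_phase{phase="Running", cluster="$cluster"}) by (pod, namespace) == 1)
)
```

The right-hand side evaluates to 1 for each running pod, so the multiplication leaves the limit values unchanged. Pods without a matching Running series have nothing to join against and drop out of the sum.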

Additional context

No response


martin-ilavsky · 2024-11-21T08:24:10Z

Hi, we have a similar issue with the memory/CPU metrics. When a pod is restarted, its old metrics persist for a little while, which creates peaks in the CPU/memory panels because the two series are summed together. We added id to the grouping in the expression to keep them separate.

Current expression:

"sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=~\"$pod\", image!=\"\", container!=\"\", cluster=\"$cluster\"}[$__rate_interval])) by (container)"

Changed expression:

"sum(rate(container_cpu_usage_seconds_total{namespace=\"$namespace\", pod=~\"$pod\", image!=\"\", container!=\"\", cluster=\"$cluster\"}[$__rate_interval])) by (container,id)"

We made the same change to the network and other metrics.
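
As a rough sketch of what that looks like for memory (the metric name container_memory_working_set_bytes is an assumption about the panel in question, not copied from the dashboard JSON):

```promql
sum(
  container_memory_working_set_bytes{namespace="$namespace", pod=~"$pod", image!="", container!="", cluster="$cluster"}
) by (container, id)
```

Grouping by id keeps the old and the new container instance as separate series for as long as both are reported, instead of adding them into one inflated value.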

rakvay added the enhancement label ("New feature or request") Oct 30, 2024

rakvay assigned dotdc Oct 30, 2024
