Invalid metric values for completed tasks task.cpu_usage_percent/task.mem_usage_percent #43460

eanikindfi · 2024-10-28T09:20:16Z

eanikindfi
Oct 28, 2024

Apache Airflow version

2.10.2

What happened?

We run Airflow 2.10.2 in a k8s cluster, deployed with the official Helm chart. The Helm release includes the statsd component, and Airflow sends its metrics to statsd.
We use the Celery executor, so our tasks run inside worker pods.
We also have a VictoriaMetrics release in this cluster and scrape metrics from statsd with a VMScrapeConfig.

Out of all the metrics provided by Airflow we need 2 essential ones:

task.cpu_usage_percent.<dag_id>.<task_id>
task.mem_usage_percent.<dag_id>.<task_id>

We can see the lifecycles of our tasks/DAGs in the Airflow web interface, so we know when each task started and ended.
But if we read the metric values from statsd after a task has already ended (even hours after completion), we still get cpu_usage and mem_usage values for that task.

What you think should happen instead?

According to the documentation, task.cpu_usage_percent.<dag_id>.<task_id> and task.mem_usage_percent.<dag_id>.<task_id> are gauges that show:

Percentage of CPU/memory used by a task

So we assume that if a task ended at 02:15 PM, then at 02:16 PM or later Airflow shouldn't send either task.cpu_usage_percent.* or task.mem_usage_percent.* for this task to statsd, right?

The essential purpose of these 2 metrics is to show how much of each resource is in use by each task/DAG at the moment. That way we can visualize the dynamics of Airflow's resource usage or build alerting on top of it. Correct me if I'm wrong.

How to reproduce

Configuration

statsd:

statsd:
  extraMappings:
  - match: airflow.task.cpu_usage.*.*
    name: "airflow_task_cpu_usage"
    help: "Percentage of CPU used by a task"
    labels:
      dag_id: "$1"
      task_id: "$2"
  - match: airflow.task.mem_usage.*.*
    name: "airflow_task_mem_usage"
    help: "Percentage of memory used by a task"
    labels:
      dag_id: "$1"
      task_id: "$2"

VMScrapeConfig:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMScrapeConfig
metadata:
  name: airflow-service-scrape
  namespace: monitoring
spec:
  staticConfigs:
    - targets: [airflow-statsd.airflow.svc.cluster.local:9102]
  metricsPath: /metrics
  scrapeInterval: 30s
  scrapeTimeout: 15s

Operating System

Kubernetes v1.31.0-eks

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Even though I have provided the VMScrapeConfig configuration, it is not needed to reproduce the issue: we also checked the statsd endpoint (port 9102) directly to see the raw metrics delivered from Airflow, and it shows the same behavior.
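For anyone reproducing this without VictoriaMetrics, the raw text served on port 9102 can be inspected directly. Below is a minimal sketch of a parser for that text output. The metric names are the ones from the mapping above; the sample payload, the `dag_id`/`task_id` values, and the function name `parse_task_gauges` are made up for illustration, and the label parsing is deliberately simplistic (it does not handle `}` inside label values).

```python
import re
from typing import Dict, List, Tuple

# One exposition line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_task_gauges(
    text: str,
    names: Tuple[str, ...] = ("airflow_task_cpu_usage", "airflow_task_mem_usage"),
) -> List[Sample]:
    """Return (name, labels, value) for the task resource gauges found in text."""
    samples: List[Sample] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") not in names:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical payload, shaped like what the exporter would serve on /metrics.
EXAMPLE_PAYLOAD = """\
# HELP airflow_task_cpu_usage Percentage of CPU used by a task
# TYPE airflow_task_cpu_usage gauge
airflow_task_cpu_usage{dag_id="example_dag",task_id="extract"} 12.5
airflow_task_mem_usage{dag_id="example_dag",task_id="extract"} 3.25
airflow_scheduler_heartbeat 42
"""

for name, labels, value in parse_task_gauges(EXAMPLE_PAYLOAD):
    print(name, labels["dag_id"], labels["task_id"], value)
```

Running this against the endpoint some time after a task has finished, and still seeing its `dag_id`/`task_id` pair listed, would reproduce the behavior described above.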

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

2024-10-28T09:20:18Z

boring-cyborg[bot]
bot Oct 28, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

potiuk · 2024-10-28T11:08:59Z

potiuk
Oct 28, 2024
Collaborator

I think it's the nature of statsd and we cannot do much about it (but I am not 100% sure), and we are moving away from statsd in favour of open-telemetry: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#setup-opentelemetry.

While we are not removing statsd, it's very likely we are not going to invest any time into improving or even fixing statsd problems.

Maybe you could switch to open-telemetry and see if you have a similar problem there. It is the "future" of Airflow telemetry anyhow, and it has way better features than statsd (including support for open-telemetry traces in 2.10.*), so the best course of action would be to try it out @eanikindfi. Otherwise you might hope that someone will look at this one, but it's unlikely statsd will get a lot of love from the maintainers.

cc: @howardyoo @ferruzzi
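To illustrate the "nature of statsd" point: a gauge is a last-value metric, so a store that only ever overwrites values will keep reporting the final reading until something removes it. The sketch below is a toy model of that behavior, not statsd or statsd_exporter code; the class `LastValueGaugeStore` and its methods are invented for illustration. The optional `ttl_seconds` mimics the idea of expiring series that stop receiving updates. statsd_exporter's mapping configuration documents a `ttl` option along those lines, but whether it behaves as wanted through the Helm chart's extraMappings has not been verified here.

```python
from typing import Dict, Optional, Tuple


class LastValueGaugeStore:
    """Toy gauge store: keeps the last value per name, optionally with a TTL."""

    def __init__(self, ttl_seconds: Optional[float] = None) -> None:
        self.ttl_seconds = ttl_seconds
        # name -> (last value, time of last update)
        self._gauges: Dict[str, Tuple[float, float]] = {}

    def set_gauge(self, name: str, value: float, now: float) -> None:
        self._gauges[name] = (value, now)

    def scrape(self, now: float) -> Dict[str, float]:
        """Return what a scraper would see at time `now`."""
        if self.ttl_seconds is not None:
            # Drop series that have not been updated within the TTL.
            self._gauges = {
                name: (value, updated_at)
                for name, (value, updated_at) in self._gauges.items()
                if now - updated_at <= self.ttl_seconds
            }
        return {name: value for name, (value, _) in self._gauges.items()}


GAUGE = "task.cpu_usage.example_dag.extract"  # hypothetical dag_id/task_id

no_ttl = LastValueGaugeStore()
no_ttl.set_gauge(GAUGE, 37.0, now=0.0)
print(no_ttl.scrape(now=7200.0))    # the last reading is still there two hours later

with_ttl = LastValueGaugeStore(ttl_seconds=90.0)
with_ttl.set_gauge(GAUGE, 37.0, now=0.0)
print(with_ttl.scrape(now=7200.0))  # the series has expired
```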

ferruzzi · 2024-10-28T20:29:56Z

ferruzzi
Oct 28, 2024
Collaborator

It's an interesting one. As it stands, the metric emits the current loads every so often, but there is no code in the task cleanup stage which triggers that to update down (is downdate a word?) to 0%. It would be an interesting puzzle. At what point should that get zeroed out? If it is done during the cleanup stage and something happens, then it's misleading because it hasn't actually finished yet.

Perhaps some kind of check for "get all the tasks which have completed since the last time these metrics were updated and zero out their resource metrics"?
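A rough sketch of what such a check could look like, assuming the caller already has the list of tasks that finished since the last update. Nothing here is existing Airflow code: `zero_finished_tasks`, `GaugeSink`, and `RecordingSink` are hypothetical names, and the metric prefixes are taken from the statsd mapping in the original post rather than from the Airflow source.

```python
from typing import Iterable, List, Protocol, Tuple


class GaugeSink(Protocol):
    """Anything with a statsd-style gauge(name, value) method."""

    def gauge(self, stat: str, value: float) -> None: ...


def zero_finished_tasks(
    sink: GaugeSink,
    finished: Iterable[Tuple[str, str]],
    prefixes: Tuple[str, ...] = ("task.cpu_usage", "task.mem_usage"),
) -> List[str]:
    """Emit 0 for every resource gauge of each finished (dag_id, task_id)."""
    emitted: List[str] = []
    for dag_id, task_id in finished:
        for prefix in prefixes:
            name = f"{prefix}.{dag_id}.{task_id}"
            sink.gauge(name, 0.0)
            emitted.append(name)
    return emitted


class RecordingSink:
    """Stand-in for a real statsd client; it only records what was sent."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, float]] = []

    def gauge(self, stat: str, value: float) -> None:
        self.sent.append((stat, value))


sink = RecordingSink()
zero_finished_tasks(sink, [("example_dag", "extract")])
print(sink.sent)
```

This leaves the open question from above untouched: deciding when a task counts as "finished" is the hard part, and the sketch simply takes that list as input.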

I wonder if this can be handled on the user/dashboarding end, something like "if the percentages haven't been updated after a certain amount of time then display them as zero"?
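On the consumer side, a scraper only sees the current value on each scrape, not when it was last written, so one possible reading of "haven't been updated" is "the value has not changed for a while". The sketch below implements that heuristic; `StaleGaugeFilter` and its threshold are hypothetical, and the approach has an obvious weakness: a task whose usage is genuinely constant would also be shown as zero.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

SeriesKey = Tuple[str, str]  # (dag_id, task_id)


@dataclass
class _Seen:
    value: float
    changed_at: float  # scrape time at which this value was first observed


class StaleGaugeFilter:
    """Report a gauge as 0 once its value has been unchanged for too long."""

    def __init__(self, max_unchanged_seconds: float) -> None:
        self.max_unchanged_seconds = max_unchanged_seconds
        self._seen: Dict[SeriesKey, _Seen] = {}

    def observe(self, key: SeriesKey, value: float, scraped_at: float) -> float:
        previous = self._seen.get(key)
        if previous is None or previous.value != value:
            # New series or a fresh reading: remember it and pass it through.
            self._seen[key] = _Seen(value, scraped_at)
            return value
        if scraped_at - previous.changed_at > self.max_unchanged_seconds:
            return 0.0  # treated as stale
        return value


stale_filter = StaleGaugeFilter(max_unchanged_seconds=120.0)
key = ("example_dag", "extract")  # hypothetical dag_id/task_id
for scraped_at in (0.0, 30.0, 150.0):
    print(scraped_at, stale_filter.observe(key, 12.5, scraped_at))
```

With the 30s scrape interval from the original post, a threshold of a few scrape intervals would be a natural starting point, though the right value depends on how often Airflow actually emits these gauges.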

potiuk · 2024-10-29T07:19:49Z

potiuk
Oct 29, 2024
Collaborator

Yeah. It looks like a "recipient-only" thing, where you can compare the status of the task with the metrics.

Let me convert it to a discussion, because it's unlikely to be solved differently.

eanikindfi · 2024-10-30T03:05:25Z

eanikindfi
Oct 30, 2024
Author

Thank you guys for the quick answers!

So right now we basically have 2 ways to solve it:

Switch to the open-telemetry solution instead of statsd;
Extend our monitoring with additional rules that compare live metrics from Airflow with the actual state of the task/DAG.

Questions:

@potiuk, do you have any successful test cases for this problem with open-telemetry? We know little about this component, so it would be awesome if you could share your experience before we go there blindly. How does the Airflow >> open-telemetry stack handle the mentioned metrics?
Is it possible to mark the statsd solution as "not preferred" in the Airflow documentation, based on this discussion?
@potiuk @ferruzzi, if we try to compare resource usage and states, any suggestions on which metric we can use to check the actual state of the task?
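For the second option (comparing resource usage with task state), the combining step itself is small once the set of currently running tasks is known. The sketch below shows only that step. Where the running set comes from is left open, since that is exactly the unanswered question above; one candidate is Airflow's own metadata, for example listing task instances by state through the stable REST API. The function name `mask_inactive_tasks` and the sample values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

TaskKey = Tuple[str, str]  # (dag_id, task_id)


def mask_inactive_tasks(
    usage: Dict[TaskKey, float],
    running: Iterable[TaskKey],
) -> Dict[TaskKey, float]:
    """Keep the gauge value for running tasks and report 0 for all others."""
    running_set = set(running)
    return {
        key: (value if key in running_set else 0.0)
        for key, value in usage.items()
    }


# Hypothetical scrape result: "extract" finished hours ago, "load" is running.
scraped_cpu = {
    ("example_dag", "extract"): 37.0,
    ("example_dag", "load"): 12.5,
}
currently_running = [("example_dag", "load")]

print(mask_inactive_tasks(scraped_cpu, currently_running))
```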

eanikindfi · 2024-12-04T09:14:27Z

eanikindfi
Dec 4, 2024
Author

Friendly bump
