Invalid metric values for completed tasks task.cpu_usage_percent/task.mem_usage_percent #43460
Replies: 6 comments
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue please do so, no need to wait for approval.
-
I think it's the nature of statsd and we cannot do much about it (but I am not 100% sure), and we are moving away from statsd in favour of OpenTelemetry: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html#setup-opentelemetry. While we are not removing statsd, it's very likely we are not going to invest any time into improving or even fixing statsd problems. Maybe you could switch to OpenTelemetry and see if you have a similar problem there. It is the "future" of Airflow telemetry anyhow, and it has way better features than statsd (including support for OpenTelemetry traces in 2.10.*), so the best course of action would be to try it out, @eanikindfi. Otherwise you might hope that someone will look at this one, but it's unlikely statsd will get a lot of love from the maintainers. cc: @howardyoo @ferruzzi
-
It's an interesting one. As it stands, the metric emits the current loads every so often, but there is no code in the task cleanup stage which triggers that to update down (is downdate a word?) to 0%. It would be an interesting puzzle. At what point should that get zeroed out? If it is done during the cleanup stage and something happens, then it's misleading because it hasn't actually finished yet. Perhaps some kind of check for "get all the tasks which have completed since the last time these metrics were updated and zero out their resource metrics"? I wonder if this can be handled on the user/dashboarding end, something like "if the percentages haven't been updated after a certain amount of time then display them as zero"?
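To make the dashboarding-side idea concrete, here is a minimal sketch of "treat a gauge as zero if it hasn't been refreshed recently". It is not Airflow or statsd code: `GaugeSample`, `effective_value` and the 120-second `stale_after` default are hypothetical names and values chosen for illustration, and it assumes the consumer can see when each gauge was last updated.

```python
from dataclasses import dataclass


@dataclass
class GaugeSample:
    value: float       # last reported percentage for a task
    updated_at: float  # Unix timestamp of the last update (assumed to be available)


def effective_value(sample: GaugeSample, now: float, stale_after: float = 120.0) -> float:
    """Return the gauge value, or 0.0 if it has not been refreshed within `stale_after` seconds."""
    if now - sample.updated_at > stale_after:
        return 0.0
    return sample.value


sample = GaugeSample(value=37.5, updated_at=1000.0)
fresh = effective_value(sample, now=1060.0)  # updated 60 s ago, still trusted
stale = effective_value(sample, now=5000.0)  # updated long ago, displayed as zero
```

The threshold would need to be longer than the interval at which the metric is normally emitted, otherwise a running task would flicker to zero between updates.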
-
Yeah. It looks like a "recipient-only" thing, where you can compare the status of the task with the metrics. Let me convert it to a discussion, because it's unlikely to be solved differently.
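A rough sketch of what that comparison could look like on the receiving side, assuming the consumer already has the task end times from somewhere (for example, exported from the Airflow metadata). `mask_finished` and the dictionary shapes are hypothetical, not an existing Airflow API:

```python
from typing import Dict, Optional, Tuple

TaskKey = Tuple[str, str]  # (dag_id, task_id)


def mask_finished(
    gauges: Dict[TaskKey, float],
    end_times: Dict[TaskKey, Optional[float]],
    now: float,
) -> Dict[TaskKey, float]:
    """Replace the gauge value with 0.0 for every task known to have ended by `now`."""
    masked = {}
    for key, value in gauges.items():
        ended_at = end_times.get(key)  # None means "still running or unknown"
        finished = ended_at is not None and ended_at <= now
        masked[key] = 0.0 if finished else value
    return masked


gauges = {("my_dag", "extract"): 12.0, ("my_dag", "load"): 40.0}
end_times = {("my_dag", "extract"): 100.0, ("my_dag", "load"): None}
result = mask_finished(gauges, end_times, now=200.0)
```

Tasks with no known end time keep their reported value, so a missing status never hides a live task.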
-
Thank you guys for the quick answers! So right now we basically have 2 ways of solving it:
Questions:
-
Friendly bump
-
Apache Airflow version
2.10.2
What happened?
We have Airflow (2.10.2) in a k8s cluster deployed by the official helm chart. This helm release contains a statsd component, and Airflow sends its metrics to statsd.
We use celery-executor, so our tasks run inside worker pods.
We also have a VictoriaMetrics release in this cluster. We scrape metrics from statsd with VMScrapeConfig.
Out of all the metrics provided by Airflow, we need 2 essential ones:
task.cpu_usage_percent.<dag_id>.<task_id>
task.mem_usage_percent.<dag_id>.<task_id>
We can see the lifecycles of our tasks/dags in the Airflow web interface, so we know when each task started and ended.
But if we get metric values from statsd when our task has already ended (hours after completion), we still get the cpu_usage and mem_usage for this task.
What you think should happen instead?
According to the documentation, task.cpu_usage_percent.<dag_id>.<task_id> and task.mem_usage_percent.<dag_id>.<task_id> are gauges that show:
So we assume that if a task ended at 02:15 PM, then at 02:16 PM or later Airflow shouldn't send either task.cpu_usage_percent.* or task.mem_usage_percent.* for this task to statsd, right?
The essential purpose of these 2 metrics is to show how many resources are in use by each task/dag at the moment. That way we can visualize the dynamics of Airflow's resource usage or build alerting. Correct me if I'm wrong.
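For context on the behaviour we observe: a gauge generally keeps the last value it was given until something overwrites it, so a reading hours after completion can simply be the last sample from while the task was running. A toy illustration of that semantics follows. It is not the code of Airflow or statsd; `ToyGaugeStore` and the metric name are hypothetical.

```python
class ToyGaugeStore:
    """Toy in-memory store that mimics last-value gauge semantics."""

    def __init__(self):
        self._values = {}

    def set(self, name, value):
        # A gauge update overwrites the previous value for that name.
        self._values[name] = value

    def scrape(self):
        # A scrape returns whatever was last stored, no matter how old it is.
        return dict(self._values)


store = ToyGaugeStore()
name = "task.cpu_usage_percent.my_dag.my_task"
store.set(name, 12.0)
store.set(name, 85.0)  # last sample emitted while the task was running
# The task ends here and nothing sends a final 0 for it.
later = store.scrape()  # still reports the last running value
```

Under this model, the value only changes if the sender emits a new sample or the receiver expires old entries.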
How to reproduce
Configuration
statsd:
VMScrapeConfig:
Operating System
Kubernetes v1.31.0-eks
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Even though I have provided the VMScrapeConfig configuration, there is no need to use it in test cases to reproduce the issue, because we also check the statsd endpoint (on port 9102) to see the raw metrics delivered from Airflow, and it has the same issue.
Are you willing to submit PR?
Code of Conduct