Dag and task metrics should be initialized to zero at startup #68

Open
prabhuakshai92 opened this issue Oct 1, 2019 · 3 comments

prabhuakshai92 commented Oct 1, 2019

Airflow metrics are not reset after a restart; however, they are also never initialized. This leads to unexpected PromQL responses when querying over periods with missing data.

For example, a task's 'failed' state is set to '1' at the first failure of the task, but before that failure no data exists for the task with state 'failed'. A PromQL query that checks whether the task executed at least once over a time period, using the 'increase' function on either the 'success' or 'failed' state count, responds as if neither state changed over that period: 'increase' extrapolates from the samples available within the window, and a series with no samples at all produces no result.

The Prometheus documentation discusses this issue.

A potential fix for this issue is to initialize all dag and task metrics to zero at startup.
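
For reference, here is a minimal sketch of what zero initialization could look like in an exporter built on the Python prometheus_client library. This is an illustration, not this exporter's actual code: the metric and label names simply mirror airflow_task_status from this thread, a Gauge is used because .labels() instantiates a child series at 0 as soon as it is touched, and get_known_tasks() is a hypothetical helper (in practice, e.g., a query against the Airflow metadata database).

```python
# Sketch only: zero-initialize per-task series at exporter startup so that
# PromQL functions such as increase() see a sample before the first failure.
import time

from prometheus_client import Gauge, start_http_server

# Metric/label names mirror airflow_task_status from this thread.
TASK_STATUS = Gauge(
    "airflow_task_status",
    "Number of task instances in each state",
    ["dag_id", "task_id", "status"],
)

def get_known_tasks():
    # Hypothetical helper: in a real exporter this would enumerate
    # (dag_id, task_id) pairs, e.g. from the Airflow metadata database.
    yield from [("example_dag", "extract"), ("example_dag", "load")]

def init_metrics_to_zero():
    for dag_id, task_id in get_known_tasks():
        for status in ("success", "failed"):
            # labels() creates the child series with value 0.0, so it is
            # exported immediately instead of first appearing only after
            # the first event in that state.
            TASK_STATUS.labels(dag_id=dag_id, task_id=task_id, status=status)

if __name__ == "__main__":
    init_metrics_to_zero()
    start_http_server(9112)  # port chosen arbitrarily for this sketch
    while True:
        time.sleep(60)
```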

@prabhuakshai92 prabhuakshai92 changed the title Dag and task metrics should be initialized to zero Dag and task metrics should be initialized to zero when exported to prometheus Oct 1, 2019
@prabhuakshai92 prabhuakshai92 changed the title Dag and task metrics should be initialized to zero when exported to prometheus Dag and task metrics should be initialized to zero at startup Oct 1, 2019
@WakeupTsai

A workaround here:

```promql
sum without (pod, instance) (increase(airflow_task_status{status="failed"}[10m])) > 0
or
max without (pod, instance) (
    airflow_task_status{status="failed"} != 0
  unless
    airflow_task_status{status="failed"} offset 10m
)
```

The first clause catches failures counted within the window; the second catches a non-zero series that has no sample 10 minutes back, i.e. the very first failure, which 'increase' alone would miss.

Reference: prometheus/prometheus#1673

@jasonstitt

A caveat with the workaround: the exporter reports a total count of past failures, so when the exporter first starts (or after a sufficiently long interruption in metrics), everything that ever failed shows up as a new failure. Zero initialization would therefore be superior.

shalberd commented Sep 5, 2024

Agreed, this is not a good long-term solution. The issue is still present in Airflow 2.8.x and later as well.

Does anyone have a hint as to where in the code the statsd metrics and their values are produced? Setting the counter to zero there makes total sense.

I am seeing a similar issue with the first failure of any dag, i.e. the metric airflow_dagrun_failed_count{}, which is likewise a counter with variable dag_id label values.

@WakeupTsai does have a good solution, but it is a workaround in the face of Prometheus' architecture.
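
As a rough illustration of the statsd side: a counter event with a count of 0 ("<name>:0|c") is valid statsd, and statsd-exporter creates the corresponding series on the first event it receives regardless of value, so something like the sketch below could run once at startup. Everything here is an assumption rather than Airflow's actual behavior: the statsd metric name is inferred from airflow_dagrun_failed_count above, and the final Prometheus name depends on the statsd-exporter mapping configuration.

```python
# Sketch only: emit a zero-count statsd event per dag at startup so the
# corresponding Prometheus series exists before the first real failure.
# Assumes the `statsd` PyPI client and a reachable statsd(-exporter) host;
# the metric name is an assumption inferred from the thread above.
import statsd

from airflow.models import DagBag

client = statsd.StatsClient(host="localhost", port=8125, prefix="airflow")

def zero_init_dagrun_counters():
    # DagBag().dags maps dag_id -> DAG for every dag Airflow knows about.
    for dag_id in DagBag().dags:
        # An increment of 0 creates the counter series at 0 if it does not
        # exist yet, and leaves it unchanged if it does.
        client.incr(f"dagrun.failed_count.{dag_id}", 0)

if __name__ == "__main__":
    zero_init_dagrun_counters()
```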
