Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tag workflow_failed counter metric with exception type #563

Open
tsurdilo opened this issue Nov 20, 2024 · 11 comments
Open

Tag workflow_failed counter metric with exception type #563

tsurdilo opened this issue Nov 20, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@tsurdilo
Copy link

tsurdilo commented Nov 20, 2024

tag workflow_failed counter metric with the application failure type that caused workflow to fail

this would allow users to filter this metric and omit failure types for which their business logic is expecting to fail the exec, allowing them to still be able to alert on workflow failed counts for unexpected failures

@tsurdilo tsurdilo added the enhancement New feature or request label Nov 20, 2024
@cretz
Copy link
Member

cretz commented Nov 20, 2024

I am not sure we can blindly do this by default. Many people send metrics off to third parties that charge per label/cardinality IIUC and to potentially increase this significantly for a user on SDK upgrade may not be the best thing. We may be able to make it an opt-in option. Also not every failure is an application failure, so would need to know what to set the label to on those.

What we usually tell users that want more than what our metrics offer by default is to make their own metrics. Granted on some SDKs this may not properly catch all types (such as non-determinism in core-based SDKs when configured to make it a failure instead of task failure), but in general an interceptor should work for this case. The benefit of this approach is when a user wants the next label for workflow failure based on something, and then the next, and so on, this can easily be done without altering SDKs.

@tsurdilo
Copy link
Author

not sure i fully understand as temporal_activity_execution_failed does include "exception" field that includes the exception type thrown on activity failure (in java sdk at least).
not sure why we could not do the same thing for workflow failures?

@cretz
Copy link
Member

cretz commented Nov 21, 2024

I think that is only in Java and it's not documented at https://docs.temporal.io/references/sdk-metrics#activity_execution_failed. It's not that we can't add this, it's that we need to be cautious about adding by default and increasing label counts and cardinality on users that simple do an SDK upgrade. In the meantime, users can, for the most part, emit their own metrics with whatever error information they want (including the type). Also there's a difference between "error/exception data type" and "application failure error type" that would have to be discussed too.

@tsurdilo
Copy link
Author

got it. ok its just that this use case comes up a lot. basically users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures. but then that skews their completion metrics too. and workarounds of custom metrics and/or custom search attributes are not really good workarounds at all (at least havent been in past when we suggested to users)

@Sushisource
Copy link
Member

Sushisource commented Nov 22, 2024

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

@longquanzheng
Copy link

longquanzheng commented Nov 22, 2024 via email

@cretz
Copy link
Member

cretz commented Nov 22, 2024

add some tags

Note this is about a single, predefined tag on a single metric that we probably cannot even enable by default, not general purpose addition of tags to SDK metrics. Many metrics are emitted throughout the SDK and even with the metrics abstraction, context is not provided at metrics record time. If possible, any custom metrics needs may need to be done with custom metrics. Granted there are parts of the internal code we don't have user callbacks/interceptors within.

@tsurdilo
Copy link
Author

tsurdilo commented Nov 22, 2024

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

as much as id like to agree with this..well i do, but often its more difficult than that.

its currently not possible to filter executions (visibility api nor metrics) for completed workflows by their result value. this then forces users to use custom search attributes and upsert to some business "failure" value, which on cloud costs an action too. then they no longer can alert on this via metrics either, and have to resort to visibility api, so all in all, to me this is more of a philosophical question, but was told many times by users that they actually want to fail execution in case of business failures. idk whats best but as things are currently, having exception type in workflow failed metric would be useful imho.

the custom metrics approach seems ok but we have sdks that iirc dont allow custom tags as of yet, so still a bit hard to do overall

@gandhisamay
Copy link

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

@Sushisource @tsurdilo if there are scenarios where retrying doesn't make any sense, simply because the activity responded with a bad request, in that case what is the ideal way to stop the workflow execution? Currently what we do is, we set the property non_retryable=true and throw an application failure. Is there any better way of tackling these, so that we can generate correct metrics for that as well?

@drewhoskins-temporal
Copy link

Would it help to be able to specify an exception's severity (info, warn, error, critical or the like), tag it in metrics, and then filter that tag out? That would address the metrics cardinality issue.

@cretz
Copy link
Member

cretz commented Nov 25, 2024

the custom metrics approach seems ok but we have sdks that iirc dont allow custom tags as of yet, so still a bit hard to do overall

Is this only TypeScript at temporalio/sdk-typescript#1229? We need to solve this, I will look into prioritizing. There are a lot of customizations people want to do to metrics.

tag it in metrics, and then filter that tag out?

We have no real concept of filtering tags out today (but of course users can do whatever they want in the abstractions/collectors). The concern is automatically adding new tags by default on those with an expectation of their tag list not surprisingly growing. We can do anything opt-in of course. Yes it can seem unfortunate that we can't add tags at will by default on SDK and server-side metrics. We can add metrics though, and we can allow opt in, and we do allow custom metrics. Or we can decide as a team if adding tags by default on SDK upgrade is worth the cost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants