Tag workflow_failed counter metric with exception type #563

tsurdilo · 2024-11-20T22:36:24Z

tag workflow_failed counter metric with the application failure type that caused workflow to fail

this would allow users to filter this metric and omit failure types for which their business logic is expecting to fail the exec, allowing them to still be able to alert on workflow failed counts for unexpected failures

cretz · 2024-11-20T22:55:54Z

I am not sure we can blindly do this by default. Many people send metrics off to third parties that charge per label/cardinality IIUC and to potentially increase this significantly for a user on SDK upgrade may not be the best thing. We may be able to make it an opt-in option. Also not every failure is an application failure, so would need to know what to set the label to on those.

What we usually tell users that want more than what our metrics offer by default is to make their own metrics. Granted on some SDKs this may not properly catch all types (such as non-determinism in core-based SDKs when configured to make it a failure instead of task failure), but in general an interceptor should work for this case. The benefit of this approach is when a user wants the next label for workflow failure based on something, and then the next, and so on, this can easily be done without altering SDKs.

tsurdilo · 2024-11-21T18:50:01Z

not sure i fully understand as temporal_activity_execution_failed does include "exception" field that includes the exception type thrown on activity failure (in java sdk at least).
not sure why we could not do the same thing for workflow failures?

cretz · 2024-11-21T22:48:21Z

I think that is only in Java and it's not documented at https://docs.temporal.io/references/sdk-metrics#activity_execution_failed. It's not that we can't add this, it's that we need to be cautious about adding by default and increasing label counts and cardinality on users that simple do an SDK upgrade. In the meantime, users can, for the most part, emit their own metrics with whatever error information they want (including the type). Also there's a difference between "error/exception data type" and "application failure error type" that would have to be discussed too.

tsurdilo · 2024-11-21T23:16:33Z

got it. ok its just that this use case comes up a lot. basically users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures. but then that skews their completion metrics too. and workarounds of custom metrics and/or custom search attributes are not really good workarounds at all (at least havent been in past when we suggested to users)

Sushisource · 2024-11-22T00:57:53Z

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

longquanzheng · 2024-11-22T01:11:20Z

I actually really wanted to add some tags to the metrics emitted by SDK (and optional). It saved a lot of time if we had this Thanks, Quanzheng

…

On Thu, Nov 21, 2024 at 4:58 PM Spencer Judge ***@***.***> wrote: users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures That says it all right there, then. Workflow failures are generally something that is supposed to be *because* of unexpected reasons. If they're failing them *intentionally* because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow. — Reply to this email directly, view it on GitHub <#563 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABCQPM32KICCKFPK4H4G4TL2BZ6SPAVCNFSM6AAAAABSFUEOBGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJSGY2TKNRUGI> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

cretz · 2024-11-22T14:53:55Z

add some tags

Note this is about a single, predefined tag on a single metric that we probably cannot even enable by default, not general purpose addition of tags to SDK metrics. Many metrics are emitted throughout the SDK and even with the metrics abstraction, context is not provided at metrics record time. If possible, any custom metrics needs may need to be done with custom metrics. Granted there are parts of the internal code we don't have user callbacks/interceptors within.

tsurdilo · 2024-11-22T23:07:22Z

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

as much as id like to agree with this..well i do, but often its more difficult than that.

its currently not possible to filter executions (visibility api nor metrics) for completed workflows by their result value. this then forces users to use custom search attributes and upsert to some business "failure" value, which on cloud costs an action too. then they no longer can alert on this via metrics either, and have to resort to visibility api, so all in all, to me this is more of a philosophical question, but was told many times by users that they actually want to fail execution in case of business failures. idk whats best but as things are currently, having exception type in workflow failed metric would be useful imho.

the custom metrics approach seems ok but we have sdks that iirc dont allow custom tags as of yet, so still a bit hard to do overall

gandhisamay · 2024-11-23T12:43:30Z

users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures

That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow.

@Sushisource @tsurdilo if there are scenarios where retrying doesn't make any sense, simply because the activity responded with a bad request, in that case what is the ideal way to stop the workflow execution? Currently what we do is, we set the property non_retryable=true and throw an application failure. Is there any better way of tackling these, so that we can generate correct metrics for that as well?

drewhoskins-temporal · 2024-11-23T16:18:43Z

Would it help to be able to specify an exception's severity (info, warn, error, critical or the like), tag it in metrics, and then filter that tag out? That would address the metrics cardinality issue.

cretz · 2024-11-25T14:07:24Z

the custom metrics approach seems ok but we have sdks that iirc dont allow custom tags as of yet, so still a bit hard to do overall

Is this only TypeScript at temporalio/sdk-typescript#1229? We need to solve this, I will look into prioritizing. There are a lot of customizations people want to do to metrics.

tag it in metrics, and then filter that tag out?

We have no real concept of filtering tags out today (but of course users can do whatever they want in the abstractions/collectors). The concern is automatically adding new tags by default on those with an expectation of their tag list not surprisingly growing. We can do anything opt-in of course. Yes it can seem unfortunate that we can't add tags at will by default on SDK and server-side metrics. We can add metrics though, and we can allow opt in, and we do allow custom metrics. Or we can decide as a team if adding tags by default on SDK upgrade is worth the cost.

tsurdilo added the enhancement New feature or request label Nov 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tag workflow_failed counter metric with exception type #563

Tag workflow_failed counter metric with exception type #563

tsurdilo commented Nov 20, 2024 •

edited

Loading

cretz commented Nov 20, 2024

tsurdilo commented Nov 21, 2024

cretz commented Nov 21, 2024 •

edited

Loading

tsurdilo commented Nov 21, 2024

Sushisource commented Nov 22, 2024 •

edited

Loading

longquanzheng commented Nov 22, 2024 via email

cretz commented Nov 22, 2024

tsurdilo commented Nov 22, 2024 •

edited

Loading

gandhisamay commented Nov 23, 2024

drewhoskins-temporal commented Nov 23, 2024

cretz commented Nov 25, 2024 •

edited

Loading

Tag workflow_failed counter metric with exception type #563

Tag workflow_failed counter metric with exception type #563

Comments

tsurdilo commented Nov 20, 2024 • edited Loading

cretz commented Nov 20, 2024

tsurdilo commented Nov 21, 2024

cretz commented Nov 21, 2024 • edited Loading

tsurdilo commented Nov 21, 2024

Sushisource commented Nov 22, 2024 • edited Loading

longquanzheng commented Nov 22, 2024 via email

cretz commented Nov 22, 2024

tsurdilo commented Nov 22, 2024 • edited Loading

gandhisamay commented Nov 23, 2024

drewhoskins-temporal commented Nov 23, 2024

cretz commented Nov 25, 2024 • edited Loading

tsurdilo commented Nov 20, 2024 •

edited

Loading

cretz commented Nov 21, 2024 •

edited

Loading

Sushisource commented Nov 22, 2024 •

edited

Loading

tsurdilo commented Nov 22, 2024 •

edited

Loading

cretz commented Nov 25, 2024 •

edited

Loading