-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tag workflow_failed counter metric with exception type #563
Comments
I am not sure we can blindly do this by default. Many people send metrics off to third parties that charge per label/cardinality IIUC and to potentially increase this significantly for a user on SDK upgrade may not be the best thing. We may be able to make it an opt-in option. Also not every failure is an application failure, so would need to know what to set the label to on those. What we usually tell users that want more than what our metrics offer by default is to make their own metrics. Granted on some SDKs this may not properly catch all types (such as non-determinism in core-based SDKs when configured to make it a failure instead of task failure), but in general an interceptor should work for this case. The benefit of this approach is when a user wants the next label for workflow failure based on something, and then the next, and so on, this can easily be done without altering SDKs. |
not sure i fully understand as temporal_activity_execution_failed does include "exception" field that includes the exception type thrown on activity failure (in java sdk at least). |
I think that is only in Java and it's not documented at https://docs.temporal.io/references/sdk-metrics#activity_execution_failed. It's not that we can't add this, it's that we need to be cautious about adding by default and increasing label counts and cardinality on users that simple do an SDK upgrade. In the meantime, users can, for the most part, emit their own metrics with whatever error information they want (including the type). Also there's a difference between "error/exception data type" and "application failure error type" that would have to be discussed too. |
got it. ok its just that this use case comes up a lot. basically users cannot fail their executions as part of their business logic, because later on they can no longer alert on unexpected workflow failures. but then that skews their completion metrics too. and workarounds of custom metrics and/or custom search attributes are not really good workarounds at all (at least havent been in past when we suggested to users) |
That says it all right there, then. Workflow failures are generally something that is supposed to be because of unexpected reasons. If they're failing them intentionally because of business reasons, well, then that really shouldn't be a failure, but a different return value from the workflow. |
I actually really wanted to add some tags to the metrics emitted by SDK
(and optional). It saved a lot of time if we had this
Thanks,
Quanzheng
…On Thu, Nov 21, 2024 at 4:58 PM Spencer Judge ***@***.***> wrote:
users cannot fail their executions as part of their business logic,
because later on they can no longer alert on unexpected workflow failures
That says it all right there, then. Workflow failures are generally
something that is supposed to be *because* of unexpected reasons. If
they're failing them *intentionally* because of business reasons, well,
then that really shouldn't be a failure, but a different return value from
the workflow.
—
Reply to this email directly, view it on GitHub
<#563 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABCQPM32KICCKFPK4H4G4TL2BZ6SPAVCNFSM6AAAAABSFUEOBGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJSGY2TKNRUGI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
Note this is about a single, predefined tag on a single metric that we probably cannot even enable by default, not general purpose addition of tags to SDK metrics. Many metrics are emitted throughout the SDK and even with the metrics abstraction, context is not provided at metrics record time. If possible, any custom metrics needs may need to be done with custom metrics. Granted there are parts of the internal code we don't have user callbacks/interceptors within. |
as much as id like to agree with this..well i do, but often its more difficult than that. its currently not possible to filter executions (visibility api nor metrics) for completed workflows by their result value. this then forces users to use custom search attributes and upsert to some business "failure" value, which on cloud costs an action too. then they no longer can alert on this via metrics either, and have to resort to visibility api, so all in all, to me this is more of a philosophical question, but was told many times by users that they actually want to fail execution in case of business failures. idk whats best but as things are currently, having exception type in workflow failed metric would be useful imho. the custom metrics approach seems ok but we have sdks that iirc dont allow custom tags as of yet, so still a bit hard to do overall |
@Sushisource @tsurdilo if there are scenarios where retrying doesn't make any sense, simply because the activity responded with a bad request, in that case what is the ideal way to stop the workflow execution? Currently what we do is, we set the property |
Would it help to be able to specify an exception's severity (info, warn, error, critical or the like), tag it in metrics, and then filter that tag out? That would address the metrics cardinality issue. |
Is this only TypeScript at temporalio/sdk-typescript#1229? We need to solve this, I will look into prioritizing. There are a lot of customizations people want to do to metrics.
We have no real concept of filtering tags out today (but of course users can do whatever they want in the abstractions/collectors). The concern is automatically adding new tags by default on those with an expectation of their tag list not surprisingly growing. We can do anything opt-in of course. Yes it can seem unfortunate that we can't add tags at will by default on SDK and server-side metrics. We can add metrics though, and we can allow opt in, and we do allow custom metrics. Or we can decide as a team if adding tags by default on SDK upgrade is worth the cost. |
tag workflow_failed counter metric with the application failure type that caused workflow to fail
this would allow users to filter this metric and omit failure types for which their business logic is expecting to fail the exec, allowing them to still be able to alert on workflow failed counts for unexpected failures
The text was updated successfully, but these errors were encountered: