Semantics of error states ambiguous #179

uniqueg · 2022-06-03T18:04:18Z

Description of the problem

Running different WES implementations, we ran into the issue that the definition of state EXECUTOR_ERROR is somewhat ambiguous.

The relevant states are currently defined as:

    + COMPLETE: The task has completed running. Executors have exited without error
    and output files have been successfully uploaded.
    + EXECUTOR_ERROR: The task encountered an error in one of the Executor processes. Generally,
  this means that an Executor exited with a non-zero exit code.
    + SYSTEM_ERROR: The task was stopped due to a system error, but not from an Executor,
  for example an upload failed due to network issues, the worker's ran out of disk space, etc.

Now, the defintion of COMPLETE perhaps makes it fairly clear that it should only be assigned if a workflow completed successfully with no issues whatsoever (return_code is 0). However, what is perhaps not so clear is when an EXECUTOR_ERROR should be assigned. Would it include only failures of jobs or system errors"of the executor itself as well. The EXECUTOR_ERROR definition rather suggests the latter, the SYSTEM_ERROR definition rather the former.

The question becomes perhaps clearer in a WES/TES setting. Should WES return EXECUTOR_ERROR even if a TES job returns a SYSTEM_ERROR (note that the definitions of EXECUTOR_ERROR and SYSTEM_ERROR are the same for WES and TES)? It's arguably an error "in the executor" either way, and so by the definition of SYSTEM_ERROR in WES is not eligible to be assigned a WES SYSTEM_ERROR.

Personally, I feel that SYSTEM_ERRORs in executors should be assigned SYSTEM_ERRORs in WES as well, because of the implications this has for users: For an EXECUTOR_ERROR there is likely something wrong on the user's end, and the user should carefully check the workflow and run request before bothering with trying again. Whereas for a SYSTEM_ERROR the user might want to try again and file a bug report if the error persists.

Possible solution

Make it more clear what is and what isn't to be considered an EXECUTOR_ERROR and a SYSTEM_ERROR on the level of WES, e.g., by including what the states signify to the client/user.

The text was updated successfully, but these errors were encountered:

patmagee · 2022-07-11T19:18:40Z

@uniqueg thanks for raising this. Ambiguity here has also lead me to likely misusing these error states in different cases as well. I have typically held to the broad definition the SYSTEM_ERROR is something related to network issues, quota (if running on a cloud), or infrastructure related problems (ie system goes down, db goes down etc), while EXECUTOR_ERROR is pretty much everything else.

To me EXECUTOR_ERROR has always been a bit strange, the name seems to hint at an application error as opposed to an error encountered during the running of a workflow. Thinking about this from a user's perspective, I think they broadly only care about whether their workflow completed successfully or not. A simpler team may have been failed.. but this tangent is a bit beside the point.

IMO the difference between a SYSTEM_ERROR happening in the WES server, or in distributed TES backend should not be exposed to the user. To the user, they are belong to a single API and so should be treated in a uniform way when conveying state. To that end I think it makese sense to return SYSTEM_ERROR when the TES backend (or any other beckend component) malfunctions

uniqueg · 2022-07-12T07:39:51Z

Thanks @patmagee. I generally agree with your interpretation, though I thought that there is perhaps some useful distinction for the user:

In the case of an EXECUTOR_ERROR, it will hardly make sense for the user to try to re-run the same request (likely something is wrong with the workflow/task configuration, causing the workflow engine or a tool to fail).
In the case of a SYSTEM_ERROR, they might well try to re-run, because it could be due to a temporary problem of the WES/TES or similar.

However, certainly this is at best a rule of thumb. Surely there will be cases in any implementation, where re-running a SYSTEM_ERROR would not have any chance of succeeding when replaying the request, as well as cases where an EXECUTOR_ERROR might still not be the user's "fault".

patmagee added the Type: Documentation label Jul 21, 2022

uniqueg mentioned this issue Oct 31, 2022

Field for run/task error messages (beyond stderr) #195

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Semantics of error states ambiguous #179

Semantics of error states ambiguous #179

uniqueg commented Jun 3, 2022

patmagee commented Jul 11, 2022

uniqueg commented Jul 12, 2022

Semantics of error states ambiguous #179

Semantics of error states ambiguous #179

Comments

uniqueg commented Jun 3, 2022

Description of the problem

Possible solution

patmagee commented Jul 11, 2022

uniqueg commented Jul 12, 2022