diff --git a/design/20230313_action_failure_reason_communication.md b/design/20230313_action_failure_reason_communication.md
new file mode 100644
index 0000000..6c76ba5
--- /dev/null
+++ b/design/20230313_action_failure_reason_communication.md
@@ -0,0 +1,32 @@
+# Action Failure Reason Communication
+
+## Context
+
+Tink Worker is the client launched by an Operating System Installation Environment (OSIE) that communicates with Tink Server to retrieve actions to run on the node. Actions are OCI images that Tink Worker can launch using a container runtime such as Docker. If an action fails, Tinkerbell provides no mechanism for the user-defined action to communicate why it failed. This makes debugging `Workflow`s difficult, as users leveraging `kubectl` cannot observe a specific failure reason and must resort to running commands directly on the node.
+
+As part of the Tink CRD Refactor proposal we introduced `Reason` and `Message` fields to indicate why an action entered a failure state. The proposal does not detail how these fields are populated but envisages, at minimum, the `Reason` being used to communicate timeout failures.
+
+* A `Reason` is a machine-readable CamelCase word or phrase that succinctly describes the failure reason.
+* A `Message` is a human-readable string that elaborates on the failure reason to provide specifics.
+
+This proposal lays out a contract for action containers to communicate why they exited with a non-zero exit code.
+
+## Goals/Non-goals
+
+**Goals**
+
+- Define a contract for action images to communicate failure information that is exposed via associated custom resource definitions and is therefore inspectable with `kubectl`.
+
+## Proposal
+
+Actions communicate a `Reason` by writing it to `/tinkerbell/failure-reason`. The reason must follow the same formatting expectations as defined in the CRD refactor proposal: most importantly, it must not contain spaces or newlines. If the reason does not follow the convention, we will report `InvalidActionReason` in place of the reported reason.
+
+Actions communicate a failure message by writing to `/tinkerbell/failure-message`. The message must not contain newlines.
+
+When an action exits with a non-zero exit code, Tink Worker will arrange to read the reason and message provided by the action image and transmit them, with the action result, to Tink Server. Tink Server will update the action state, reason and message. Providing the reason and message on the action status will ensure the controller populates the `Succeeded` condition as detailed in the Tink CRD Refactor proposal.
+
+![Reason propagation](https://raw.githubusercontent.com/tinkerbell/roadmap/7e4e769305edf5c5679a406ebf0564eb754fe57a/design/images/tink_worker_failure_reasons/reason_propagation.png)
+
+The reason and message files will be mounted with `0666` permissions, granting read-write access to everyone. This ensures images launched with a different UID will still be able to write a reason and message.
+
+The implementation behind the reason and message files will be transparent to the action maintainer. For example, the files may be backed by Unix domain sockets or a host text file.
\ No newline at end of file
diff --git a/design/images/action_failure_reason_communication/reason_propagation.png b/design/images/action_failure_reason_communication/reason_propagation.png
new file mode 100644
index 0000000..06cabb2
Binary files /dev/null and b/design/images/action_failure_reason_communication/reason_propagation.png differ
diff --git a/design/images/action_failure_reason_communication/sequence.puml b/design/images/action_failure_reason_communication/sequence.puml
new file mode 100644
index 0000000..6e66566
--- /dev/null
+++ b/design/images/action_failure_reason_communication/sequence.puml
@@ -0,0 +1,24 @@
+@startuml reason_propagation
+
+participant Action as action
+participant "Tink Worker" as worker
+participant "Tink Server" as server
+entity "Workflow" as workflow
+
+
+worker -> server ++ : Request actions
+activate worker
+server -> workflow : Get actions
+server <-- workflow
+worker <-- server -- : Actions
+loop for every action
+worker -> action ++ : Execute
+action -> action : Write to /tinkerbell/failure-reason\nWrite to /tinkerbell/failure-message
+worker <- action -- : Exit non-zero
+worker -> action : Extract reason and message
+worker <-- action --
+worker -> server -- : Report status, reason\nand message
+server -> workflow : Update action
+end
+
+@enduml
\ No newline at end of file
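As an illustration of the contract proposed in the design document above, the following minimal Python sketch shows how a consumer such as Tink Worker might read and validate the two files. It is not taken from the Tinkerbell code base: the function names `normalize_reason` and `read_failure`, the `directory` parameter, the sample reason `DiskNotFound`, and the regular expression used to approximate "CamelCase with no spaces or newlines" are all assumptions made for this example. Only the file names and the `InvalidActionReason` fallback come from the proposal.

```python
import re
import tempfile
from pathlib import Path
from typing import Tuple

# File names and fallback value as named in the proposal.
REASON_FILE = "failure-reason"
MESSAGE_FILE = "failure-message"
INVALID_REASON = "InvalidActionReason"

# Assumption: a valid reason starts with an uppercase letter and contains
# only ASCII letters and digits, which rules out spaces and newlines.
_REASON_RE = re.compile(r"[A-Z][A-Za-z0-9]*")


def normalize_reason(raw: str) -> str:
    """Return the reason unchanged if it matches the assumed convention,
    otherwise return the fallback reason from the proposal."""
    return raw if _REASON_RE.fullmatch(raw) else INVALID_REASON


def read_failure(directory: Path) -> Tuple[str, str]:
    """Read the reason and message files from `directory` (hypothetical
    helper; in the proposal the files live under /tinkerbell)."""
    reason = (directory / REASON_FILE).read_text()
    message = (directory / MESSAGE_FILE).read_text()
    return normalize_reason(reason), message


if __name__ == "__main__":
    # Simulate an action writing its failure details, then read them back.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / REASON_FILE).write_text("DiskNotFound")
        (root / MESSAGE_FILE).write_text("no block device found at /dev/sda")
        print(read_failure(root))
```

Under this strict reading, a reason written with a trailing newline (as a plain `echo` would produce) is reported as `InvalidActionReason`; whether the real implementation would trim such whitespace is not stated in the proposal.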