stackabletech · sbernauer · Nov 7, 2023 · Oct 10, 2023
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/deploy/helm/trino-operator/crds/crds.yaml b/deploy/helm/trino-operator/crds/crds.yaml
@@ -85,10 +85,6 @@ spec:
                           description: matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed.
                           type: object
                       type: object
-                    gracefulShutdownTimeout:
-                      default: 1h
-                      description: Time period the trino workers have to gracefully shut down, e.g. `1h`, `30m` or `2d`. Consult the trino-operator documentation for details.
-                      type: string
                     listenerClass:
                       default: cluster-internal
                       description: |-
@@ -628,6 +624,10 @@ spec:
                                   type: array
                               type: object
                           type: object
+                        gracefulShutdownTimeout:
+                          description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
+                          nullable: true
+                          type: string
                         logging:
                           default:
                             enableVectorAgent: null
@@ -4118,6 +4118,10 @@ spec:
                                         type: array
                                     type: object
                                 type: object
+                              gracefulShutdownTimeout:
+                                description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
+                                nullable: true
+                                type: string
                               logging:
                                 default:
                                   enableVectorAgent: null
@@ -7668,6 +7672,10 @@ spec:
                                   type: array
                               type: object
                           type: object
+                        gracefulShutdownTimeout:
+                          description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
+                          nullable: true
+                          type: string
                         logging:
                           default:
                             enableVectorAgent: null
@@ -11158,6 +11166,10 @@ spec:
                                         type: array
                                     type: object
                                 type: object
+                              gracefulShutdownTimeout:
+                                description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
+                                nullable: true
+                                type: string
                               logging:
                                 default:
                                   enableVectorAgent: null

diff --git a/docs/modules/trino/pages/usage-guide/operations/graceful-shutdown.adoc b/docs/modules/trino/pages/usage-guide/operations/graceful-shutdown.adoc
@@ -1,32 +1,95 @@
 = Graceful shutdown
 
-== How it works
-Trino supports https://trino.io/docs/current/admin/graceful-shutdown.html[graceful shutdown] of the workers.
+You can configure the graceful shutdown as described in xref:concepts:operations/graceful_shutdown.adoc[].
+
+== Coordinators
+
+As a default, coordinators have `15 minutes` to terminate gracefully.
+
+The coordinator process will receive a `SIGTERM` signal when Kubernetes wants to terminate the Pod.
+If the process has still not exited when the graceful shutdown timeout runs out, Kubernetes will issue a `SIGKILL` signal.
+
+When a coordinator gets restarted, all currently running queries will fail and cannot be recovered after the restart process is finished.
+As of Trino version `428` this cannot be prevented (e.g. by using multiple coordinators).
+
+== Workers
+
+As a default, workers have `60 minutes` to terminate gracefully.
+
+Trino supports https://trino.io/docs/current/admin/graceful-shutdown.html[gracefully shutting down] workers.
 This operator always adds a https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/[`PreStop` hook] to gracefully shut them down.
 No additional configuration is needed, this guide is intended for users that need to tweak this mechanism.
 
-The default graceful shutdown period is 1 hour, but it can be tuned using `spec.clusterConfig.gracefulShutdownTimeout` which uses string values like `1h` (1 hour), `30m` (30 minutes) or `2d` (2 days).
+The default graceful shutdown period is `1` hour, but it can be configured as follows:
+
+[source,yaml]
+----
+apiVersion: trino.stackable.tech/v1alpha1
+kind: TrinoCluster
+metadata:
+  name: trino
+spec:
+  # ...
+  workers:
+    config:
+      gracefulShutdownTimeout: 1h
+    roleGroups:
+      default:
+        replicas: 1
+----
+
+== Implementation
 
 Once a worker Pod is asked to terminate, the `PreStop` hook is executed and the following timeline occurs:
 
 1. The worker goes into `SHUTTING_DOWN` state.
-2. The worker sleeps for 60 seconds to ensure that the coordinator has noticed the shutdown and stops scheduling new tasks on the worker.
+2. The worker sleeps for `30` seconds to ensure that the coordinator has noticed the shutdown and stops scheduling new tasks on the worker.
 3. The worker now waits till all tasks running on it complete. This will take as long as the longest running query takes.
-4. The worker sleeps for 60 seconds to ensure that the coordinator has
+4. The worker sleeps for `30` seconds to ensure that the coordinator has
 noticed that all tasks are complete
 5. The `PreStop` hook will never return, but the JVM will be shut down by the graceful shutdown mechanism.
-6. When the graceful shutdown is not quick enough (e.g. a query runs longer than the graceful shutdown period), after `<graceful shutdown period> + 60s of step 2 + 60s of step 4 + 30s safety overhead` the Pod gets killed, regardless if it has shut down gracefully or not. This is achieved by setting `terminationGracePeriodSeconds` on the worker Pods.
+6. If the graceful shutdown doesn't complete quickly enough (e.g. a query runs longer than the graceful shutdown period), the Pod gets killed after `<graceful shutdown period> + 30s of step 2 + 30s of step 4 + 10s safety overhead`, regardless of whether it has shut down gracefully or not. This is achieved by setting `terminationGracePeriodSeconds` on the worker Pods. Queries still running on the worker at that point will fail and cannot be recovered.
+
+[WARNING]
+====
+As of SDP version `23.7`, the secret-operator issues TLS certificates with a lifetime of 24h.
+It also adds an annotation to the Pod, indicating it requires a restart 30 minutes before the certificate expires (after 23.5 hours in this case).
+Currently, this results in all Pods using HTTPS (both coordinator and workers in a typical setup) being restarted every 23.5 hours.
 
-WARNING: As of 23.7, the secret-operator issues TLS certificates with a lifetime of 24h. It also adds an annotation to the Pod, so that it is restarted 30 minutes before the certificate expires (23.5h hours in this case). Bot can not be configured. This results in all Pod using https (both coordinator and workers in a typical setup) restarting every 23.5 hours. This problem will be addressed in a future release by e.g. making the certification lifetime configurable.
+The TLS certificate lifetime can be configured using `podOverrides` by setting `secrets.stackable.tech/backend.autotls.cert.lifetime` on every secret-operator volume.
+One sample configuration could look like:
+
+[source,yaml]
+----
+spec:
+  workers:
+    podOverrides:
+      spec:
+        volumes:
+          - name: server-tls-mount
+            ephemeral:
+              volumeClaimTemplate:
+                metadata:
+                  annotations:
+                    secrets.stackable.tech/backend.autotls.cert.lifetime: 14d
+          - name: internal-tls-mount
+            ephemeral:
+              volumeClaimTemplate:
+                metadata:
+                  annotations:
+                    secrets.stackable.tech/backend.autotls.cert.lifetime: 14d
+----
+====
 
 == Implications
-All queries that take less than the graceful shutdown period are guaranteed to not be disturbed by regular termination of Pods.
-They can obviously still fail when e.g. a Kubernetes node dies completely or the Pod does not get the time it takes (e.g. 1h by default) to properly gracefully shut down.
 
-Because of this reason the operator automatically restricts the execution time of queries to the configured graceful shutdown period using the Trino configuration `query.max-execution-time=3600s`.
+All queries that take less than the shortest graceful shutdown period of all roleGroups (`1` hour as a default) are guaranteed to not be disturbed by regular termination of Pods.
+They can obviously still fail when, for example, a Kubernetes node dies or gets rebooted before it is fully drained.
+
+Because of this, the operator automatically restricts the execution time of queries to the shortest graceful shutdown period of all roleGroups using the Trino configuration `query.max-execution-time=3600s`.
 This causes all queries that take longer than 1 hour to fail with the error message `Query failed: Query exceeded the maximum execution time limit of 3600s.00s`.
 
-In case you need to execute queries that take longer than the configured graceful shutdown period, you need to increase the `query.max-execution-time=3600s` as follows:
+In case you need to execute queries that take longer than the configured graceful shutdown period, you need to increase the `query.max-execution-time` property as follows:
 
 [source,yaml]
 ----
@@ -37,20 +100,15 @@
         query.max-execution-time: 24h
 ----
 
-Please keep in mind, that queries taking longer than the graceful shutdown period are now subject to failure when a Trino worker dies.
-This can be circumvented by using https://trino.io/docs/current/admin/fault-tolerant-execution.html[Fault-tolerant execution], which support for might be added in the future.
-Until then, you have to use configOverrides to enable it.
+Please keep in mind that queries taking longer than the graceful shutdown period are now subject to failure when a Trino worker gets shut down.
+This can be avoided by using https://trino.io/docs/current/admin/fault-tolerant-execution.html[Fault-tolerant execution], which is not yet supported natively.
+Until native support is added, you will have to use `configOverrides` to enable it.
 
-== Kubernetes cluster requirements
-Pods need to have the ability to take as long as they need to gracefully shut down without getting killed.
+== Authorization requirements
 
-Imagine the situation that you set the graceful shutdown period to 24 hours (using `spec.clusterConfig.gracefulShutdownTimeout: 24h`).
-in case of e.g. an on-prem Kubernetes cluster the Kubernetes infrastructure team wants to drain the Kubernetes node, so that they can do regular maintenance, such as rebooting the node. They will have some upper limit on how long they will wait for Pods on the Node to terminate, until they will reboot the Kubernetes node regardless.
+WARNING: When you are not using OPA for authorization, the user `admin` is not allowed to gracefully shut down workers.
+If you need graceful shutdown, you need to either use OPA or make sure `admin` is allowed to gracefully shut down workers (e.g. by having your own authorizer or patching Trino).
 
-When setting up a production cluster, you need to check with your Kubernetes administrator (or cloud provider) what time period your Pods have to terminate gracefully.
-It is not sufficient to have a look at the `spec.terminationGracePeriodSeconds` and come to the conclusion that the Pods have e.g. 24 hours to gracefully shut down, as e.g. an administrator reboots the Kubernetes node before the time period is reached.
-
-== OPA requirements
 In case you use OPA to authorize Trino requests, you need to make sure the user `admin` is authorized to trigger a graceful shutdown of the workers.
 You can achieve this e.g. by adding the following rule, which grants `admin` the permissions to do anything - including graceful shutdown.
 
@@ -61,4 +119,6 @@
 }
 ----
 
-NOTE: We plan to add CustomResources, so that you can define your Trino ACLs via Kubernetes objects. In this case the trino-operator will generate the rego-rules for you, and will add the needed rules for graceful shutdown for you.
+In case the user `admin` does not have the permission to gracefully shut down a worker, the error message `curl: (22) The requested URL returned error: 403 Forbidden` will be shown in the worker log and the worker will shut down immediately.
+
+NOTE: We plan to add CustomResources, so that you can define your Trino ACLs via Kubernetes objects. In this case the trino-operator will generate the rego-rules for you, including the rules needed for graceful shutdown. Until then, you need to grant the permission yourself.
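
For reviewers, the arithmetic described in the new docs (step 6 of the worker timeline and the derived `query.max-execution-time`) can be sketched as follows. This is a minimal illustration in Python, not the operator's actual code: the function names are hypothetical, and the duration parser is an assumption that only handles the single-unit forms appearing in the docs (`30s`, `30m`, `1h`, `2d`).

```python
import re
from typing import Iterable

# Assumption: only single-unit durations as shown in the docs are handled.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Values taken from the documented worker shutdown timeline.
COORDINATOR_NOTICE_SLEEP_SECONDS = 30  # applied in step 2 and again in step 4
SAFETY_OVERHEAD_SECONDS = 10


def parse_duration(value: str) -> int:
    """Convert a duration string such as `1h` into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def termination_grace_period_seconds(graceful_shutdown_timeout: str) -> int:
    """Graceful shutdown period + 30s (step 2) + 30s (step 4) + 10s overhead."""
    return (
        parse_duration(graceful_shutdown_timeout)
        + 2 * COORDINATOR_NOTICE_SLEEP_SECONDS
        + SAFETY_OVERHEAD_SECONDS
    )


def max_execution_time(role_group_timeouts: Iterable[str]) -> str:
    """Shortest graceful shutdown period of all roleGroups, formatted in seconds."""
    return f"{min(parse_duration(t) for t in role_group_timeouts)}s"


if __name__ == "__main__":
    print(termination_grace_period_seconds("1h"))
    print(max_execution_time(["1h", "2d"]))
```

With the default worker timeout of `1h`, this yields a `terminationGracePeriodSeconds` of 3670 and a `query.max-execution-time` of `3600s`, which matches the values given in the documentation.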