refactor!: Move gracefulShutdownTimeout to roleGroup config #486

Merged
merged 25 commits into from
Nov 7, 2023
118 changes: 59 additions & 59 deletions Cargo.lock

Large diffs are not rendered by default.

20 changes: 16 additions & 4 deletions deploy/helm/trino-operator/crds/crds.yaml
@@ -85,10 +85,6 @@ spec:
description: matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value". The requirements are ANDed.
type: object
type: object
gracefulShutdownTimeout:
default: 1h
description: Time period the trino workers have to gracefully shut down, e.g. `1h`, `30m` or `2d`. Consult the trino-operator documentation for details.
type: string
listenerClass:
default: cluster-internal
description: |-
@@ -628,6 +624,10 @@ spec:
type: array
type: object
type: object
gracefulShutdownTimeout:
description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
nullable: true
type: string
logging:
default:
enableVectorAgent: null
@@ -4118,6 +4118,10 @@ spec:
type: array
type: object
type: object
gracefulShutdownTimeout:
description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
nullable: true
type: string
logging:
default:
enableVectorAgent: null
@@ -7668,6 +7672,10 @@ spec:
type: array
type: object
type: object
gracefulShutdownTimeout:
description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
nullable: true
type: string
logging:
default:
enableVectorAgent: null
@@ -11158,6 +11166,10 @@ spec:
type: array
type: object
type: object
gracefulShutdownTimeout:
description: Time period Pods have to gracefully shut down, e.g. `30m`, `1h` or `2d`. Consult the operator documentation for details.
nullable: true
type: string
logging:
default:
enableVectorAgent: null
106 changes: 83 additions & 23 deletions docs/modules/trino/pages/usage-guide/operations/graceful-shutdown.adoc
@@ -1,32 +1,95 @@
= Graceful shutdown

== How it works
Trino supports https://trino.io/docs/current/admin/graceful-shutdown.html[graceful shutdown] of the workers.
You can configure the graceful shutdown as described in xref:concepts:operations/graceful_shutdown.adoc[].

== Coordinators

As a default, coordinators have `15 minutes` to terminate gracefully.

The coordinator process will receive a `SIGTERM` signal when Kubernetes wants to terminate the Pod.
If the process has still not exited once the graceful shutdown timeout runs out, Kubernetes will issue a `SIGKILL` signal.

When a coordinator gets restarted, all currently running queries will fail and cannot be recovered after the restart process is finished.
As of Trino version `428` this cannot be prevented (e.g. by using multiple coordinators).

== Workers

As a default, workers have `60 minutes` to terminate gracefully.

Trino supports https://trino.io/docs/current/admin/graceful-shutdown.html[gracefully shutting down] workers.
This operator always adds a https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/[`PreStop` hook] to gracefully shut them down.
No additional configuration is needed; this guide is intended for users who need to tweak this mechanism.

The default graceful shutdown period is 1 hour, but it can be tuned using `spec.clusterConfig.gracefulShutdownTimeout` which uses string values like `1h` (1 hour), `30m` (30 minutes) or `2d` (2 days).
The default graceful shutdown period is `1` hour, but it can be configured as follows:

[source,yaml]
----
apiVersion: trino.stackable.tech/v1alpha1
kind: TrinoCluster
metadata:
  name: trino
spec:
  # ...
  workers:
    config:
      gracefulShutdownTimeout: 1h
    roleGroups:
      default:
        replicas: 1
----
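
Values such as `1h`, `30m` or `2d` translate to seconds in the obvious way. The following is a minimal sketch in Python of that conversion, not the operator's parser: the function name `parse_duration_seconds` is hypothetical, and it assumes only single `<number><unit>` values with the units `s`, `m`, `h` and `d`. The operator itself may accept more forms.

```python
import re

# Seconds per unit. The supported unit set is an assumption for illustration.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration_seconds(value: str) -> int:
    """Convert a single-unit duration string such as '30m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

For example, `parse_duration_seconds("1h")` yields `3600`, the default graceful shutdown period expressed in seconds.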

== Implementation

Once a worker Pod is asked to terminate, the `PreStop` hook is executed and the following timeline occurs:

1. The worker goes into `SHUTTING_DOWN` state.
2. The worker sleeps for 60 seconds to ensure that the coordinator has noticed the shutdown and stops scheduling new tasks on the worker.
2. The worker sleeps for `30` seconds to ensure that the coordinator has noticed the shutdown and stops scheduling new tasks on the worker.
3. The worker now waits till all tasks running on it complete. This will take as long as the longest running query takes.
4. The worker sleeps for 60 seconds to ensure that the coordinator has
4. The worker sleeps for `30` seconds to ensure that the coordinator has
noticed that all tasks are complete
5. The `PreStop` hook will never return, but the JVM will be shut down by the graceful shutdown mechanism.
6. When the graceful shutdown is not quick enough (e.g. a query runs longer than the graceful shutdown period), after `<graceful shutdown period> + 60s of step 2 + 60s of step 4 + 30s safety overhead` the Pod gets killed, regardless if it has shut down gracefully or not. This is achieved by setting `terminationGracePeriodSeconds` on the worker Pods.
6. If the graceful shutdown doesn't complete quickly enough (e.g. a query runs longer than the graceful shutdown period), after `<graceful shutdown period> + 30s of step 2 + 30s of step 4 + 10s safety overhead` the Pod gets killed, regardless of whether it has shut down gracefully or not. This is achieved by setting `terminationGracePeriodSeconds` on the worker Pods. Queries still running on the worker at that point will fail and cannot be recovered.
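
The budget in step 6 is plain addition. The following is a minimal sketch in Python of that arithmetic, not the operator's implementation: the function and constant names are hypothetical and only mirror the `30s + 30s + 10s` figures stated in the new version of the timeline.

```python
# Hypothetical constants mirroring the figures in step 6 (not taken from the operator source).
COORDINATOR_NOTICE_SECONDS = 30     # sleep in step 2
TASKS_COMPLETE_NOTICE_SECONDS = 30  # sleep in step 4
SAFETY_OVERHEAD_SECONDS = 10        # extra margin before the Pod is killed


def termination_grace_period_seconds(graceful_shutdown_seconds: int) -> int:
    """Return the Pod-level grace period for a given graceful shutdown period."""
    if graceful_shutdown_seconds < 0:
        raise ValueError("graceful shutdown period must not be negative")
    return (
        graceful_shutdown_seconds
        + COORDINATOR_NOTICE_SECONDS
        + TASKS_COMPLETE_NOTICE_SECONDS
        + SAFETY_OVERHEAD_SECONDS
    )
```

Under these assumptions, the default period of 1 hour gives `termination_grace_period_seconds(3600)`, i.e. 3670 seconds.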

Check notice on line 51 in docs/modules/trino/pages/usage-guide/operations/graceful-shutdown.adoc

GitHub Actions / LanguageTool

[LanguageTool] docs/modules/trino/pages/usage-guide/operations/graceful-shutdown.adoc#L51

A comma may be missing after the conjunctive/linking adverb ‘Currently’. (SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA[1]) Suggestions: `Currently,` URL: https://languagetool.org/insights/post/linking-words/ Rule: https://community.languagetool.org/rule/show/SENT_START_CONJUNCTIVE_LINKING_ADVERB_COMMA?lang=en-US&subId=1 Category: PUNCTUATION

[WARNING]
====
As of SDP version `23.7`, the secret-operator issues TLS certificates with a lifetime of 24h.
It also adds an annotation to the Pod, indicating it requires a restart 30 minutes before the certificate expires (after 23.5 hours in this case).
Currently, this results in all Pods using HTTPS (both coordinators and workers in a typical setup) being restarted every 23.5 hours.

WARNING: As of 23.7, the secret-operator issues TLS certificates with a lifetime of 24h. It also adds an annotation to the Pod, so that it is restarted 30 minutes before the certificate expires (after 23.5 hours in this case). Both cannot be configured. This results in all Pods using HTTPS (both coordinators and workers in a typical setup) restarting every 23.5 hours. This problem will be addressed in a future release by e.g. making the certificate lifetime configurable.
The TLS certificate lifetime can be configured using `podOverrides` by setting `secrets.stackable.tech/backend.autotls.cert.lifetime` on every secret-operator volume.
One sample configuration could look like:

[source,yaml]
----
spec:
  workers:
    podOverrides:
      spec:
        volumes:
          - name: server-tls-mount
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  annotations:
                    secrets.stackable.tech/backend.autotls.cert.lifetime: 14d
          - name: internal-tls-mount
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  annotations:
                    secrets.stackable.tech/backend.autotls.cert.lifetime: 14d
----
====
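
The restart interval described in the warning follows from the certificate lifetime minus the restart margin. The following is a minimal sketch in Python of that relationship. The names are hypothetical, and it assumes the 30 minute margin stays fixed when the lifetime changes, which the text above does not confirm.

```python
from datetime import timedelta

# Assumption: the restart margin is always 30 minutes, as stated for the 24h default.
RESTART_MARGIN = timedelta(minutes=30)


def restart_interval(cert_lifetime: timedelta) -> timedelta:
    """Return how long a Pod runs before the certificate-driven restart."""
    if cert_lifetime <= RESTART_MARGIN:
        raise ValueError("certificate lifetime must exceed the restart margin")
    return cert_lifetime - RESTART_MARGIN
```

With the default lifetime, `restart_interval(timedelta(hours=24))` equals 23.5 hours, matching the figure in the warning.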

== Implications
All queries that take less than the graceful shutdown period are guaranteed to not be disturbed by regular termination of Pods.
They can obviously still fail when e.g. a Kubernetes node dies completely or the Pod does not get the time it takes (e.g. 1h by default) to properly gracefully shut down.

Because of this reason the operator automatically restricts the execution time of queries to the configured graceful shutdown period using the Trino configuration `query.max-execution-time=3600s`.
All queries that take less than the minimal graceful shutdown period of all roleGroups (`1` hour as a default) are guaranteed to not be disturbed by regular termination of Pods.
They can obviously still fail when, for example, a Kubernetes node dies or gets rebooted before it is fully drained.

Because of this, the operator automatically restricts the execution time of queries to the minimal graceful shutdown period of all roleGroups using the Trino configuration `query.max-execution-time=3600s`.
This causes all queries that take longer than 1 hour to fail with the error message `Query failed: Query exceeded the maximum execution time limit of 3600.00s`.
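
The limit is derived from the smallest graceful shutdown period among the roleGroups. The following is a minimal sketch in Python of that selection, not the operator's code: the function name and the roleGroup names used in the example are hypothetical.

```python
def max_execution_time_property(role_group_timeouts_seconds: dict) -> str:
    """Build the Trino property from per-roleGroup graceful shutdown periods (in seconds)."""
    if not role_group_timeouts_seconds:
        raise ValueError("at least one roleGroup is required")
    # The shortest period wins, so that no roleGroup is shut down mid-query.
    minimal = min(role_group_timeouts_seconds.values())
    return f"query.max-execution-time={minimal}s"
```

For instance, with a hypothetical `default` roleGroup at 3600 seconds and a `large` roleGroup at 7200 seconds, the sketch produces `query.max-execution-time=3600s`.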

In case you need to execute queries that take longer than the configured graceful shutdown period, you need to increase the `query.max-execution-time=3600s` as follows:
In case you need to execute queries that take longer than the configured graceful shutdown period, you need to increase the `query.max-execution-time` property as follows:

[source,yaml]
----
@@ -37,20 +100,15 @@
query.max-execution-time: 24h
----

Please keep in mind, that queries taking longer than the graceful shutdown period are now subject to failure when a Trino worker dies.
This can be circumvented by using https://trino.io/docs/current/admin/fault-tolerant-execution.html[Fault-tolerant execution], which support for might be added in the future.
Until then, you have to use configOverrides to enable it.
Please keep in mind that queries taking longer than the graceful shutdown period are now subject to failure when a Trino worker gets shut down.
This can be circumvented by using https://trino.io/docs/current/admin/fault-tolerant-execution.html[Fault-tolerant execution], which is not yet supported natively.
Until native support is added, you will have to use `configOverrides` to enable it.

== Kubernetes cluster requirements
Pods need to have the ability to take as long as they need to gracefully shut down without getting killed.
== Authorization requirements

Imagine the situation that you set the graceful shutdown period to 24 hours (using `spec.clusterConfig.gracefulShutdownTimeout: 24h`).
In the case of e.g. an on-prem Kubernetes cluster, the Kubernetes infrastructure team wants to drain the Kubernetes node so that they can do regular maintenance, such as rebooting the node. They will have some upper limit on how long they will wait for Pods on the node to terminate before they reboot the Kubernetes node regardless.
WARNING: When you are not using OPA for authorization, the user `admin` is not allowed to gracefully shut down workers.
If you need graceful shutdown, you either need to use OPA or make sure `admin` is allowed to gracefully shut down workers (e.g. by having your own authorizer or patching Trino).

When setting up a production cluster, you need to check with your Kubernetes administrator (or cloud provider) what time period your Pods have to terminate gracefully.
It is not sufficient to have a look at the `spec.terminationGracePeriodSeconds` and come to the conclusion that the Pods have e.g. 24 hours to gracefully shut down, as e.g. an administrator reboots the Kubernetes node before the time period is reached.

== OPA requirements
In case you use OPA to authorize Trino requests, you need to make sure the user `admin` is authorized to trigger a graceful shutdown of the workers.
You can achieve this e.g. by adding the following rule, which grants `admin` the permissions to do anything - including graceful shutdown.

@@ -61,4 +119,6 @@
}
----

NOTE: We plan to add CustomResources, so that you can define your Trino ACLs via Kubernetes objects. In this case the trino-operator will generate the rego-rules for you, and will add the needed rules for graceful shutdown for you.
In case the user `admin` does not have the permission to gracefully shut down a worker, the error message `curl: (22) The requested URL returned error: 403 Forbidden` will be shown in the worker log and the worker will shut down immediately.

NOTE: We plan to add CustomResources, so that you can define your Trino ACLs via Kubernetes objects. In this case the trino-operator will generate the rego-rules for you, and will add the needed rules for graceful shutdown for you. Until then, you need to grant the permission yourself.