Document Custom Stacks + Observability autocancellation

Document the new custom stacks and monitor-based auto-cancellation features we've exposed. This seems straightforward enough for now.
pluralsh · Jul 29, 2024 · 950a087 · 950a087
1 parent 468c3fb
commit 950a087
Showing 3 changed files with 186 additions and 0 deletions.
diff --git a/pages/stacks/auto-cancellation.md b/pages/stacks/auto-cancellation.md
@@ -0,0 +1,91 @@
+---
+title: Auto Cancellation
+description: Automatically cancel complex Terraform applies when alarms fire
+---
+
+## Overview
+
+A common problem when changing Kubernetes infrastructure is how long the operations run. Cluster upgrades can take hours for large node counts, plenty can go wrong along the way, and engineers waste many hours babysitting their infrastructure automation to make sure nothing does.
+
+Plural helps solve this by polling the monitors you likely already have set up to track infrastructure health in tools like Datadog or NewRelic, and automatically cancelling your IaC run when they fire.  Because we manage the commands closely ourselves, we shut them down gracefully, ensuring that things like state locks are cleaned up and no resources are left dangling.  The goal is to automate one of the most tedious but labor-intensive parts of your DevOps workflow.
+
+Setting this up is simple: create an `ObservabilityProvider` resource, then set a list of `observableMetrics` on your stack.
+
+## Create an ObservabilityProvider
+
+To do this in one go for Datadog, create resources like:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: datadog
+  namespace: stacks
+stringData:
+  apiKey: YOUR_API_KEY
+  appKey: YOUR_APP_KEY
+---
+apiVersion: deployments.plural.sh/v1alpha1
+kind: ObservabilityProvider
+metadata:
+  name: datadog
+spec:
+  type: DATADOG
+  name: datadog
+  credentials:
+    datadog:
+      name: datadog
+      namespace: stacks
+```
+
+{% callout severity="info" %}
+You can also create the provider in the UI to avoid creating the secret, and reference it by name without the `credentials` block.
+{% /callout %}
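+
+A minimal sketch of that by-name reference, assuming a provider called `datadog` has already been created in the UI and that `type` and `name` are still set as in the example above (an assumption, since the callout only says the `credentials` block is dropped):
+
+```yaml
+apiVersion: deployments.plural.sh/v1alpha1
+kind: ObservabilityProvider
+metadata:
+  name: datadog
+spec:
+  type: DATADOG
+  name: datadog # assumed to match the provider name created in the UI
+  # no credentials block: the API and app keys live with the UI-created provider
+```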
+
+## Add `observableMetrics` to your Stack
+
+Here is an example setup:
+
+```yaml
+apiVersion: deployments.plural.sh/v1alpha1
+kind: InfrastructureStack
+metadata:
+  name: cancellable-stack
+spec:
+  name: cancellable-stack
+  detach: false
+  type: TERRAFORM
+  approval: true
+  manageState: true
+  observableMetrics:
+  - identifier: My Datadog Monitor
+    observabilityProviderRef:
+      kind: ObservabilityProvider
+      name: datadog
+      namespace: stacks
+  repositoryRef:
+    name: infra
+    namespace: infra
+  clusterRef:
+    name: mgmt
+    namespace: infra
+  git:
+    ref: main
+    folder: stacks/cancellable-stack
+```
+
+What qualifies as an `identifier` in each observable metric varies by provider: in Datadog it's simply a monitor name, while in NewRelic it's an entity.
+
+{% callout severity="info" %}
+This puts a lot of ownership on the monitors you configure.  Building them well takes a measure of craftsmanship and insight into what the cluster is doing, which typically requires a DevOps engineer.  You can also split the logic into multiple monitors for completeness; the system will poll all of them and cancel if any fire.
+{% /callout %}
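+
+Splitting the logic across several monitors might look like the following fragment of the stack's `spec`. This is only a sketch: both monitor names are hypothetical placeholders, and the provider reference reuses the `datadog` provider from the example above.
+
+```yaml
+observableMetrics:
+# hypothetical monitor tracking node health during the upgrade
+- identifier: Node Not Ready
+  observabilityProviderRef:
+    kind: ObservabilityProvider
+    name: datadog
+    namespace: stacks
+# hypothetical monitor tracking workload restarts
+- identifier: Pod Restart Rate
+  observabilityProviderRef:
+    kind: ObservabilityProvider
+    name: datadog
+    namespace: stacks
+```
+
+With both entries listed, either monitor firing is enough to cancel the run.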
+
+## Remediation post-cancellation
+
+How you take action once your stack is cancelled is ultimately going to depend on the failure mode causing the incident.  Here are some examples:
+
+1. If it was ultimately a red herring due to workloads restarting loudly on the cluster, simply let it settle, then restart the stack run in the UI.
+2. If there's some underlying flaw in the setup of the change, such as a k8s version incompatibility or a bad node AMI, make the change in your Git repository, push it to the tracked branch, and let the stack run resume with the corrected code.
+3. If it's a flaw in a downstream service, correct it there, then restart the stack run in the UI.
+
+By and large, you should have full freedom to respond, and the various touchpoints in the Plural product will make the process as self-serviceable as possible.
diff --git a/pages/stacks/custom-stacks.md b/pages/stacks/custom-stacks.md
@@ -0,0 +1,87 @@
+---
+title: Custom Stacks
+description: Define your own command workflows to be executed via Stacks
+---
+
+## Overview
+
+Plural allows you to define your own command workflows in place of the standard patterns for tools, like the `terraform plan` -> `terraform apply` chain for Terraform, or the `ansible-playbook` command for Ansible.  This can serve a number of useful purposes:
+
+1. Supporting a GitOps workflow for CLI-based Kubernetes provisioners like `k3s` or GKE Anthos' `gkectl`.
+2. Supporting in-house provisioner scripts for which you want a more scalable, GitOps approach to configuration, alongside the UI the Plural Console offers.
+3. Automating bulk scripting based on any declarative config, e.g. forcing manual node refreshes.
+
+It works off a `StackDefinition` resource, and requires extending one of our base Docker images.
+
+## Extend a Plural `harness` container image
+
+The first step to defining your own custom stack is building your own base image.  The standard path here is to simply extend ours, copying the `harness` binary into an executable path.  This [PR](https://github.com/pluralsh/deployment-operator/pull/248) provides a simple example of how that can be done, with the new image consisting of a Debian base with the AWS CLI installed.
+
+There are a few things to note (all handled in the PR):
+
+1. For security reasons, we always execute stacks as uid 65535.  This prevents run-as-root vulnerabilities, but also means you might need to manually create that user and its home directory in your image if you're installing utilities that need them.
+2. The images you can use are in either the `ghcr.io/pluralsh/stackrun-harness-base` repository or the `ghcr.io/pluralsh/harness` repository.  The latter has finished images with `terraform`, `ansible` and other executables installed.
+3. Make sure to include the same `WORKDIR` and `ENTRYPOINT` as the existing images, e.g.:
+
+```
+WORKDIR /plural
+
+ENTRYPOINT ["harness", "--working-dir=/plural"]
+```
+
+## Creating a StackDefinition
+
+`StackDefinition` CRDs are fairly self-explanatory: they specify the commands you want the stack to run and any base configuration.  Here's an example:
+
+```yaml
+apiVersion: deployments.plural.sh/v1alpha1
+kind: StackDefinition
+metadata:
+  name: my-custom-stack
+spec:
+  description: "example of a basic custom stack"
+  configuration:
+    image: ghcr.io/pluralsh/harness # replace with your new base image
+    tag: 0.4.42-terraform-1.8 # replace with your new tag
+  steps:
+  - cmd: /bin/sh
+    args:
+    - ./stack.sh
+    stage: PLAN
+  - cmd: echo
+    args:
+    - APPLYING
+    stage: APPLY
+```
+
+The `stage` field maps to the standard Terraform workflow.  The main point to note is that the `APPLY` stage cannot be executed until the stack run has been approved, if the stack has `approval` enabled on its spec.
+
+The `configuration` block is a way to specify default image setup for stacks using this definition.
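+
+As a further sketch of how `steps` and `stage` fit together, the definition below uses only the fields shown above to run one script per stage. The resource name and the `plan.sh` and `apply.sh` scripts are hypothetical and would need to exist in the stack's Git folder; the image and tag are placeholders for your own.
+
+```yaml
+apiVersion: deployments.plural.sh/v1alpha1
+kind: StackDefinition
+metadata:
+  name: scripted-stack # hypothetical name
+spec:
+  description: "dry run in PLAN, real change in APPLY"
+  configuration:
+    image: ghcr.io/pluralsh/harness # replace with your own base image
+    tag: 0.4.42-terraform-1.8 # replace with your own tag
+  steps:
+  - cmd: /bin/sh
+    args:
+    - ./plan.sh # hypothetical script that only reports what would change
+    stage: PLAN
+  - cmd: /bin/sh
+    args:
+    - ./apply.sh # hypothetical script that performs the change
+    stage: APPLY
+```
+
+On a stack with `approval: true`, the `APPLY` step here would wait for approval, so the output of the `PLAN` step can be reviewed first.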
+
+## Instantiating a Custom Stack
+
+Finally, creating an instance of your custom stack is quick: create an `InfrastructureStack` resource pointing to the `StackDefinition`:
+
+```yaml
+apiVersion: deployments.plural.sh/v1alpha1
+kind: InfrastructureStack
+metadata:
+  name: custom
+spec:
+  name: custom
+  detach: false
+  type: CUSTOM # must be this type
+  approval: true
+  stackDefinitionRef:
+    name: my-custom-stack # points to CR above
+    namespace: stacks
+  repositoryRef:
+    name: infra
+    namespace: infra
+  clusterRef:
+    name: mgmt
+    namespace: infra
+  git:
+    ref: main
+    folder: stacks/custom
+```
diff --git a/src/NavData.tsx b/src/NavData.tsx
@@ -184,6 +184,14 @@ const rootNavData: NavMenu = deepFreeze([
             title: 'Executing IaC Locally',
             href: '/stacks/local-execution',
           },
+          {
+            title: 'Custom Stacks',
+            href: '/stacks/custom-stacks',
+          },
+          {
+            title: 'Auto-Cancellation',
+            href: '/stacks/auto-cancellation',
+          },
         ],
       },
       {