Document Custom Stacks + Observability autocancellation
Document the new custom stacks and monitor-based auto-cancellation features we've exposed. This seems straightforward enough for now.
michaeljguarino committed Jul 29, 2024
1 parent 468c3fb commit 950a087
Showing 3 changed files with 186 additions and 0 deletions.
91 changes: 91 additions & 0 deletions pages/stacks/auto-cancellation.md
@@ -0,0 +1,91 @@
---
title: Auto Cancellation
description: Automatically cancel complex Terraform applies when alarms fire
---

## Overview

A common problem when managing changes to Kubernetes infrastructure is how long-running the operations are. Cluster upgrades can take hours for large node counts, plenty can go wrong along the way, and an immense number of engineering hours are wasted babysitting your infrastructure automation to make sure nothing does.

Plural helps solve this by polling the monitors you likely already have set up to track infrastructure health in tools like Datadog or NewRelic, and automatically cancelling your IaC runs when they fire. Because we closely manage the commands themselves, we can shut them down gracefully, ensuring things like annoying state locks are cleaned up and no resources are left dangling. We're basically trying to automate one of the most boring but labor-intensive parts of your DevOps workflow.

Setting this up is really simple: you'll need to create an `ObservabilityProvider` resource and then set a list of `observableMetrics` on your stack.

## Create an ObservabilityProvider

To do this in one fell swoop for Datadog, create resources like:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: stacks
stringData:
  apiKey: YOUR_API_KEY
  appKey: YOUR_APP_KEY
---
apiVersion: deployments.plural.sh/v1alpha1
kind: ObservabilityProvider
metadata:
  name: datadog
spec:
  type: DATADOG
  name: datadog
  credentials:
    datadog:
      name: datadog
      namespace: stacks
```
{% callout severity="info" %}
You can also create the provider in the UI to avoid creating the secret, and then reference it by name without the `credentials` block.
{% /callout %}
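
As a rough sketch of the UI-based path described in the callout above, the resource might shrink to something like the following. This assumes a provider called `datadog` was already created in the UI and that `spec.name` is what ties the resource to it; check the CRD reference before relying on it.

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: ObservabilityProvider
metadata:
  name: datadog
spec:
  type: DATADOG
  # Assumption: matches the name given to the provider in the UI.
  name: datadog
  # No `credentials` block: the API keys were entered in the UI instead.
```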

## Add `observableMetrics` to your Stack

An example setup for this is here:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: InfrastructureStack
metadata:
  name: cancellable-stack
spec:
  name: cancellable-stack
  detach: false
  type: TERRAFORM
  approval: true
  manageState: true
  observableMetrics:
  - identifier: My Datadog Monitor
    observabilityProviderRef:
      kind: ObservabilityProvider
      name: datadog
      namespace: stacks
  repositoryRef:
    name: infra
    namespace: infra
  clusterRef:
    name: mgmt
    namespace: infra
  git:
    ref: main
    folder: stacks/cancellable-stack
```

What qualifies as an `identifier` for each observable metric varies by provider: in Datadog it's simply a monitor name, while in NewRelic it's an entity.

{% callout severity="info" %}
This puts a lot of responsibility on the monitors you configure. Getting them right takes a measure of craftsmanship and insight into what the cluster is doing, which is where a DevOps engineer's judgment comes in. You can also split the logic into multiple monitors for completeness, and the system will poll all of them, cancelling if any fire.
{% /callout %}
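
To illustrate splitting the logic across monitors, here is a minimal sketch of just the `observableMetrics` portion of an `InfrastructureStack` spec with two entries. Both monitor names are hypothetical placeholders for monitors you would already have defined in Datadog; the provider reference is the one from the example above.

```yaml
spec:
  observableMetrics:
  # Hypothetical monitor name: fires when nodes stop reporting Ready.
  - identifier: Node Not Ready
    observabilityProviderRef:
      kind: ObservabilityProvider
      name: datadog
      namespace: stacks
  # Hypothetical monitor name: fires on elevated API server errors.
  - identifier: API Server Error Rate
    observabilityProviderRef:
      kind: ObservabilityProvider
      name: datadog
      namespace: stacks
```

With a setup like this, the run would be cancelled as soon as either monitor fires.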

## Remediation post-cancellation

How you take action once your stack is cancelled is ultimately going to depend on the failure mode causing the incident. Here are some examples:

1. If it was ultimately a red herring due to workloads restarting loudly on the cluster, simply let it settle, then restart the stack run in the UI.
2. If there's some underlying flaw in the setup of the change, such as a k8s version incompatibility or a bad node AMI, make the change in your Git repository, push it to the tracked branch, and let the stack run resume with the corrected code.
3. If it's a flaw of a downstream service, correct it there, then restart the stack run in the UI.

By and large, you should have full freedom to respond, and the various touchpoints in the Plural product will make the process as self-serviceable as possible.
87 changes: 87 additions & 0 deletions pages/stacks/custom-stacks.md
@@ -0,0 +1,87 @@
---
title: Custom Stacks
description: Define your own command workflows to be executed via Stacks
---

## Overview

Plural allows you to define your own command workflows in place of the standard patterns for tools, like the `terraform plan` -> `terraform apply` chain for Terraform, or the `ansible-playbook` command for Ansible. This can serve a number of useful purposes:

1. Supporting a GitOps workflow for CLI-based Kubernetes provisioners like `k3s` or GKE Anthos' `gkectl`.
2. Supporting in-house provisioner scripts for which you want a more scalable, GitOps approach to configuration, alongside the elegant UI the Plural Console can offer.
3. Automating bulk scripting based on any declarative config, e.g. forcing manual node refreshes.

It works off a `StackDefinition` resource and requires extending one of our base Docker images.

## Extend a Plural `harness` container image

The first step to defining your own custom stack is building your own base image. The standard path here is to simply extend ours, copying the `harness` binary into an executable path. This [PR](https://github.com/pluralsh/deployment-operator/pull/248) provides a simple example of how that can be done, with the new image simply consisting of a Debian base with the AWS CLI installed.

There are a few things to be aware of (all handled in the PR):

1. For security reasons, we always execute stacks with the 65535 uid. This is to prevent run-as-root vulnerabilities, but also means you might need to manually create that user and its home directory in your image if you're installing utilities that might need them.
2. The images you can use are in either the `ghcr.io/pluralsh/stackrun-harness-base` repository or the `ghcr.io/pluralsh/harness` repository. The latter has finished images with `terraform`, `ansible` and other executables installed.
3. You should make sure to include the `WORKDIR` and `ENTRYPOINT` as in the existing images, e.g.:

```
WORKDIR /plural
ENTRYPOINT ["harness", "--working-dir=/plural"]
```

## Creating a StackDefinition

Stack definition CRDs are pretty self-explanatory: they just specify the commands you want the stack to run and any base configuration. Here's an example:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: StackDefinition
metadata:
  name: my-custom-stack
spec:
  description: "example of a basic custom stack"
  configuration:
    image: ghcr.io/pluralsh/harness # replace with your new base image
    tag: 0.4.42-terraform-1.8 # replace with your new tag
  steps:
  - cmd: /bin/sh
    args:
    - ./stack.sh
    stage: PLAN
  - cmd: echo
    args:
    - APPLYING
    stage: APPLY
```
The `stage` field maps to the standard Terraform workflow. The main point of importance is that the `APPLY` stage cannot be executed until the stack has been approved, if the stack has `approval` enabled on its spec.

The `configuration` block is a way to specify default image setup for stacks using this definition.
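
As a further sketch of how `stage` and `configuration` fit together, the definition below pairs a read-only check in `PLAN` with a script in `APPLY`. The image name, tag, resource name, and `apply.sh` script are all hypothetical; it assumes an image built from a Plural base with the AWS CLI installed, along the lines of the PR linked earlier.

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: StackDefinition
metadata:
  name: aws-cli-stack # hypothetical name
spec:
  description: "sketch of a custom stack driven by the AWS CLI"
  configuration:
    image: ghcr.io/your-org/aws-harness # hypothetical image extending a Plural base
    tag: 0.1.0 # hypothetical tag
  steps:
  # Read-only sanity check, safe to run during PLAN.
  - cmd: aws
    args:
    - sts
    - get-caller-identity
    stage: PLAN
  # Waits for approval when the stack sets `approval: true`.
  - cmd: /bin/sh
    args:
    - ./apply.sh # hypothetical script kept in the stack's git folder
    stage: APPLY
```

Keeping anything that mutates infrastructure in the `APPLY` stage means the approval gate still protects it.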

## Instantiating a Custom Stack

Finally, creating an instance of your custom stack is very quick: simply create an `InfrastructureStack` resource pointing to the `StackDefinition`:

```yaml
apiVersion: deployments.plural.sh/v1alpha1
kind: InfrastructureStack
metadata:
  name: custom
spec:
  name: custom
  detach: false
  type: CUSTOM # must be this type
  approval: true
  stackDefinitionRef:
    name: my-custom-stack # points to CR above
    namespace: stacks
  repositoryRef:
    name: infra
    namespace: infra
  clusterRef:
    name: mgmt
    namespace: infra
  git:
    ref: main
    folder: stacks/custom
```
8 changes: 8 additions & 0 deletions src/NavData.tsx
@@ -184,6 +184,14 @@ const rootNavData: NavMenu = deepFreeze([
title: 'Executing IaC Locally',
href: '/stacks/local-execution',
},
{
title: 'Custom Stacks',
href: '/stacks/custom-stacks',
},
{
title: 'Auto-Cancellation',
href: '/stacks/auto-cancellation',
},
],
},
{