RFC: k8s storage #77

Open
wants to merge 18 commits into
base: master
167 changes: 167 additions & 0 deletions 074-k8s-storage/proposal.md
@@ -0,0 +1,167 @@
# Terms
- **cache object** a BLOB and relevant metadata that Concourse needs to persist. These could be Resource Caches, Task Caches or Build Caches.
- **worker** Concourse executes steps on a **worker** and implements some **worker** interface. Concourse is agnostic of the runtime representation of the worker (eg. K8s pod, node or cluster).

# Summary

After spiking on a few solutions for storage on Kubernetes, our recommendation is to use an image registry to store **cache objects** for steps.

@cirocosta cirocosta Oct 1, 2020


it might be interesting to check out the approach that kaniko took - it essentially caches layers in a container registry if you don't specify a local path (so, e.g., if you want to run a Job that runs kaniko but you don't want to care about providing it a volume - which would imply concurrency=1)

$ k logs -f build-v1.18.6-4tn7w  | grep cache
INFO[0008] Checking for cached layer cirocosta/envtest-cache:8642ec4096d286832017c667e68548ac74436793bd49b1cf59d6a238044f6b0a...
INFO[0008] No cached layer found for cmd RUN set -x &&          apt update -y && apt install -y rsync &&                git clone https://github.com/kubernetes/kubernetes &&                 cd ./kubernetes &&                      git checkout $KUBE_TAG &&            go mod download &&                       make generated_files
INFO[0194] Checking for cached layer cirocosta/envtest-cache:21931b04c009b9e67db0c6c8a1f907ee699dd32ad3fef2aa14efaf6b1a5dc995...
INFO[0313] Found cached layer, extracting to filesystem
INFO[0348] Checking for cached layer cirocosta/envtest-cache:b862d7fa29205a55ec028679dc2bb416b707a1a29d24feb9714075e4be5a5c74...
INFO[0435] Found cached layer, extracting to filesystem
INFO[0447] Checking for cached layer cirocosta/envtest-cache:e6c3ee315be35a8483933a79682375de049bdef0560721ce9962d728993803d0...
INFO[0464] Found cached layer, extracting to filesystem
INFO[0470] Checking for cached layer cirocosta/envtest-cache:e78396635e5a8c3a4b750b11796109b2a6a954b628c6e24a3dfbc1f579ce7024...
INFO[0471] No cached layer found for cmd RUN go install -v ./cmd/envtest
INFO[0508] Pushing layer cirocosta/envtest-cache:e78396635e5a8c3a4b750b11796109b2a6a954b628c6e24a3dfbc1f579ce7024 to cache now

(https://github.com/GoogleContainerTools/kaniko/blob/master/pkg/cache/cache.go)

i'm not familiar with the details of the implementation, but it might serve as some inspiration 😁


buildkit seems to allow something similar: https://github.com/moby/buildkit/blob/master/README.md#export-cache


Hey @cirocosta
We can see this style of local caching being beneficial for pulling the contents of an input for a task. Is that what you meant? We weren't sure how this might be useful for pushing 🤔


oh, I didn't have any specifics in mind, it was more about showcasing the pattern of using a registry as a place where you just throw layers at a registry regardless of what you're building and then try to reuse those layers (pulling if they exist) when running another build


# Motivation

As we started thinking about the Kubernetes runtime, we realized that we needed to decide on a storage solution before proceeding with any other part of the implementation. Storage has a huge effect on how Concourse interacts with the runtime (Kubernetes). Storage also had a lot of unknowns: we didn't know what the storage landscape on Kubernetes looked like or what options were available to us. Finally, storage has a huge impact on the performance of the cluster, with regard to both storage and initialization of steps.

## Requirements
An ideal storage solution can do the following:

- image fetching from the CRI k8s is using
- transfer **cache objects** between steps (whatever represents a step, most likely a pod)
- cache for resources and tasks
- stream **cache objects** across worker runtimes (k8s worker sends artifact to garden worker)

## Criteria
- security
- performance, aka initialization time (time spent running workloads on a single k8s worker, as well as across workers)
- resource usage to run this storage solution

# Proposal

**TL;DR**: We recommend going with the image registry option because it satisfies all the requirements and gives us a bunch of options to improve performance when compared to the blobstore option. It also provides a very flexible solution that works across multiple runtime workers. [See Details](#image-registry-to-store-artifacts)


Trade-study - push/pull from registry using

  1. init & side car containers
  2. CSI with independent volume/input (eg. image populator)
  3. Custom image per combination of step for pulling and TODO for pushing


Furthermore, the CSI is a useful interface for building the storage component against. [See Details](#csi)

# Storage Options considered
## Baggageclaim Daemonset
### Description
A privileged baggageclaim pod would manage all the **cache objects** for step pods. The pod can be given sufficient privilege to create overlay mounts by setting `mountPropagation` to `Bidirectional`. The `volumeMount` object allows specifying a volume `subPath`.

This approach didn't work using GCE PDs or vSphere Volumes ([Issue](https://github.com/kubernetes/kubernetes/issues/95049)). It does work using `hostPath` option, however, that would require a large root volume and wouldn't be able to leverage IaaS based persistent disks.

The pod would run on all nodes that Concourse would execute steps on.
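
As a rough illustration of the mount setup described above, the following Python sketch builds the relevant pod spec fragments as plain dictionaries. The field names (`mountPropagation`, `subPath`, `securityContext.privileged`, `hostPath`) are Kubernetes API fields; the image name, the host directory and the helper function names are hypothetical and were not part of the spike.

```python
# Hypothetical host directory managed by the privileged baggageclaim pod.
HOST_VOLUMES_DIR = "/var/lib/concourse/volumes"


def host_path_volume():
    """Pod-level volume backed by a directory on the node (the hostPath option)."""
    return {
        "name": "volumes",
        "hostPath": {"path": HOST_VOLUMES_DIR, "type": "DirectoryOrCreate"},
    }


def baggageclaim_container(image="example/baggageclaim:dev"):
    """Container spec for the DaemonSet pod. The image name is an assumption."""
    return {
        "name": "baggageclaim",
        "image": image,
        # Bidirectional mount propagation is only allowed for privileged containers.
        "securityContext": {"privileged": True},
        "volumeMounts": [
            {
                "name": "volumes",
                "mountPath": HOST_VOLUMES_DIR,
                # Mounts created inside this container propagate back to the host.
                "mountPropagation": "Bidirectional",
            }
        ],
    }


def step_volume_mount(cache_object_id, mount_path):
    """Mount a single cache object directory into a step container via subPath."""
    return {
        "name": "volumes",
        "mountPath": mount_path,
        "subPath": cache_object_id,
    }
```

The step pod would declare the same `hostPath` volume and mount only the sub-directory of the **cache object** it needs.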

### Pros
+ Leverage baggageclaim
+ volume streaming between nodes would work using the current Concourse architecture
+ resource and task caches would also work using the current Concourse architecture
+ would be able to stream **cache objects** across worker runtimes as it would be mediated via the web
+ Concourse would have complete control over volume lifecycle
+ would have negligible overhead for steps scheduled on the same node as no input/output stream would be required

### Cons
- Not being able to use IaaS-based persistent disks makes this a less viable solution. K8s nodes would need to have large root volumes.

@ari-becker ari-becker Oct 3, 2020


Is this actually an issue? Under Persistent Volumes you note as a con:

IaaS based limits on volume limits per node prevents this from being a scalable solution

Large root volumes is closer to how Concourse Workers work today (i.e. closer to the implicit foundational design assumptions in the Concourse domain that this RFC attempts to deal with), and if there are scalability issues with Persistent Volumes, then doesn't this represent a simpler way (because it builds on prior baggageclaim work) to get started with building the Kubernetes runtime?


@xtreme-sameer-vohra xtreme-sameer-vohra Oct 5, 2020


Hey @ari-becker,
Yes, this is true. It would be useful to get perspectives such as yours for us to understand what constraints or lack thereof folks are dealing with.

We are assuming that the cons:

  • resizing of nodes with larger root volumes
  • running privileged baggageclaim pods
  • being ok with baggageclaim's current lack of security (this can be enhanced)

create larger challenges for operators.

It sounds like you'd be okay with this setup?


@xtreme-sameer-vohra currently, we already run Concourse workers on a Kubernetes cluster, one where we've architected our workers such that the workers are restricted to their own underlying nodes (we manually scale both the underlying nodes and the worker StatefulSet in tandem) and the worker pods are privileged. So all of your cons are already present in our deployment and therefore implicitly acceptable.

Maybe you need to clarify what the goals are supposed to be of developing a Kubernetes runtime for Concourse? Is it to phase out the "worker" in favor of ATC directly scheduling Kubernetes Jobs that represent Concourse jobs or steps? Because then running privileged baggageclaim pods is fine; basically the worker isn't really phased out, it's just simplified down to only the baggageclaim code, and presumably we'd be able to configure ATC to create Jobs with node restrictions / affinity / taints / tolerations to get the Jobs to run on a dedicated worker instance group. Is it the ability to run on common / shared instances? Because then running privileged baggageclaim pods is probably unacceptable. Should untrusted Concourse workloads run on common / shared instances in the first place, even if they're not allowed to be privileged? Ehhh, debatable, and a lot of security experts will tell you no, particularly if the jobs are not configured with PodSecurityPolicies and the like, which in a CI / development environment rarely makes sense.

The main value that we'd get out of a Kubernetes-native runtime would be the ability to set up Cluster Autoscaler so that Kubernetes will add additional nodes (and reduce nodes) as needed according to scheduling workloads, as currently we need to scale workers by hand. Cluster Autoscaler works well with workloads that are scheduled to dedicated instance groups, adding and subtracting nodes from the dedicated instance group that will permit Jobs that are restricted to those instance groups.


Ah yes, we are working on the K8s Runtime RFC that should have preceded this one 😸
The primary objective is to leverage the scaling & operational capabilities of K8s by deploying steps as K8s pods without changing the pipeline schema. The latter would allow pipelines to be portable across runtimes. Along the same lines, we're also assessing:

  • the least amount of required permissions on K8s for fulfilling the same requirements as a current worker (container & storage management)
  • the # of dependencies (besides postgres) for a Concourse deployment on K8s

As you mentioned, the container management would be delegated to k8s. We are spiking on using IaaS PVs + propagating mounts using CSI, hoping to ultimately leverage baggageclaim. We are early in the exploration, but this path might also provide the ability of using volume location for the purposes of scheduling 💭

The initial draft of the proposal has a worker being mapped to a K8s cluster + namespace. "Instance Groups" seems interesting. Do you see it fitting in naturally with the cluster+namespace mapping, or would you propose something different?


@xtreme-sameer-vohra One of the things that worries me about using an external image registry is the prospect of cross-availability-zone (on AWS, $0.01/GB) and/or VPC egress traffic costs (on AWS, much higher). Currently, all of our workers are in the same availability zone (for CI workers, we're less concerned with the prospect of an AZ-wide outage and more concerned with the traffic costs) and so our traffic costs are much lower.

Effectively using an external image registry while keeping costs under control would mean a) deploying a separate image registry per availability zone alongside the workers, where workers are correctly configured to use the image registry that is in their availability zone, and b) having the image registry itself use either a local disk or an IaaS PV (so as not to incur traffic costs when storing layers in s3, for example), so really all that an image registry does is consolidate storage usage in one place. This eventually leads to cost savings compared to using a large local disk / IaaS PV per worker, with the tradeoff of additional operational complexity.

When talking about instance groups, I'm using the Kops nomenclature. An instance group maps to an AWS AutoscalingGroup, where the nodes are of a certain instance type (i.e. CI is generally best-served with burstable nodes, which on AWS are t3 nodes, whereas t3 nodes may not make sense for other workloads). In order to schedule workers on a specific ASG that we set up for them, we use a combination of node affinity to direct Kubernetes to schedule the CI workloads (currently, the worker pods) to those nodes, and taints and tolerations to not allow any other workloads to schedule there. This is orthogonal to Kubernetes namespacing, which is a good idea anyway to improve logical isolation between individual CI workloads.

- Wouldn't have support for hosting images by default. However, `baggageclaim` could be extended to add the APIs
- `baggageclaim` itself doesn't have any authentication/authorization or transport security (https) mechanisms built into it


Suggested change
+ - `baggageclaim` needs to be run in a privileged container. It would be useful to understand the perspective of operators on this constraint. Is this tenable?


@cirocosta cirocosta Oct 21, 2020


as far as i know, TKG clusters created via TMC expects pods to adhere to a podsecurity policy that's quite restrictive:

buuut, I'd imagine someone installing a CSI plugin (like anything network related) would expect to grant higher privileges anyway (being an "infra component"), but not to any container (i.e., the containers that run the workloads, like tasks, etc)


Yep, this does make the assumption that installing Concourse requires more than just pod* capabilities.
However, installing a CSI driver would be an Operator action similar to installing concourse today.

@taylorsilva and I were also discussing that this approach wouldn't preclude us from using the IaaS PVs directly for simpler installations where user doesn't have privileges to install a CSI driver. This would have performance implications in the overhead to create/destroy volumes and limit of volumes / node, but would be a very simple solution 🤔

## Image Registry to store artifacts
### Description
Each **cache object** is represented as an image layer for a repository in an image registry. [SPIKE using registry to store artifacts](https://github.com/concourse/concourse/issues/3740). Concourse would require a managed image registry as a dependency. For each step, Concourse would generate an image config and manifest with all the relevant inputs modeled as image layers.
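
To make the "inputs modeled as image layers" idea concrete, here is a minimal Python sketch that assembles an image config and manifest from already-pushed layers. The media types and the `rootfs.diff_ids` field come from the OCI image-spec; the `CacheObjectLayer` type, the `build_step_image` helper and the hard-coded `amd64`/`linux` platform are assumptions made for illustration, not Concourse code.

```python
import hashlib
import json
from dataclasses import dataclass

MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"
CONFIG_TYPE = "application/vnd.oci.image.config.v1+json"
LAYER_TYPE = "application/vnd.oci.image.layer.v1.tar+gzip"


@dataclass(frozen=True)
class CacheObjectLayer:
    """Hypothetical record of a cache object already pushed as a layer blob."""
    digest: str   # digest of the compressed layer blob in the registry
    diff_id: str  # digest of the uncompressed layer tar
    size: int     # size in bytes of the compressed blob


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_step_image(base_layers, input_layers):
    """Return (config_bytes, manifest) for the step image plus its inputs."""
    layers = list(base_layers) + list(input_layers)
    config = {
        "architecture": "amd64",  # assumed platform
        "os": "linux",
        "rootfs": {"type": "layers", "diff_ids": [l.diff_id for l in layers]},
    }
    config_bytes = json.dumps(config, sort_keys=True).encode()
    manifest = {
        "schemaVersion": 2,
        "mediaType": MANIFEST_TYPE,
        "config": {
            "mediaType": CONFIG_TYPE,
            "digest": sha256_digest(config_bytes),
            "size": len(config_bytes),
        },
        "layers": [
            {"mediaType": LAYER_TYPE, "digest": l.digest, "size": l.size}
            for l in layers
        ],
    }
    return config_bytes, manifest
```

Because only descriptors are assembled here, no layer data has to be downloaded to produce a new image for a step; the config and manifest would still need to be pushed to the registry.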

### Pros
- Would have support for building an image in a step and using it as the image for a subsequent step. This would require the image registry to be accessible by the CRI subsystem on a node
- Image registries are critical to operating on K8s, and as such there are plenty of options, ranging from managed IaaS-based solutions such as GCR, ECR and ACR to on-prem solutions like Harbor. Therefore, it would be a safe assumption that a Concourse on K8s user would already have a registry available for use.
- Could explore further de-coupling by exploring [csi-driver-image-populator](https://github.com/kubernetes-csi/csi-driver-image-populator) when using registries for storing artifacts. Note that it is listed as a sample driver in the CSI docs, its README says it is not production ready, and its last commit was in Oct 2019. There is also another utility, [imgpkg](https://github.com/k14s/imgpkg), which allows arbitrary data to be stored in images as layers.
- Leverage performance enhancements to registries such as [pull through cache](https://docs.docker.com/registry/recipes/mirror/)
- Use a standardized and documented [OCI image-spec protocol](https://github.com/opencontainers/image-spec)
- LRU based local caching of image layers by the K8s CRI
- Established ways of securely pushing/pulling blobs from an image registry
- As this would be a centralized storage solution
  - it doesn't impact what a K8s based Concourse worker looks like
  - Simplified GC
  - Would support streaming across worker runtimes
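
The LRU-based local caching mentioned in the list above can be pictured with the following small Python sketch. It only illustrates the eviction idea; the actual policy on a node is implemented by the kubelet and the container runtime and is not controlled by Concourse. The `LayerCache` class and its byte-based capacity are assumptions.

```python
from collections import OrderedDict


class LayerCache:
    """Toy LRU cache of image layers, keyed by digest and bounded by total size."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._layers = OrderedDict()  # digest -> size, least recently used first

    def get(self, digest):
        """Return True on a cache hit and mark the layer as recently used."""
        if digest not in self._layers:
            return False
        self._layers.move_to_end(digest)
        return True

    def put(self, digest, size):
        """Add a pulled layer and return the digests evicted to make room."""
        if digest in self._layers:
            self._layers.move_to_end(digest)
            return []
        self._layers[digest] = size
        self.used_bytes += size
        evicted = []
        # Evict least recently used layers, but never the one just added.
        while self.used_bytes > self.capacity_bytes and len(self._layers) > 1:
            old_digest, old_size = self._layers.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_digest)
        return evicted
```

A step whose input layers are still in the node's cache would skip the pull from the registry; a step scheduled on a node that has evicted them would pay the pull cost again.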

### Cons
- Some registries such as GCR don't expose an API to delete layers directly
- **cache object** would have to have a fixed static path in the image file system to be able to reuse the same layer. This would require some additional handling on Concourse to support [input-mapping](https://concourse-ci.org/jobs.html#schema.step.task-step.input_mapping) and [output-mapping](https://concourse-ci.org/jobs.html#schema.step.task-step.output_mapping)
- Adds extra development overhead to generate new image config & manifests to leverage **cache object** layers
- Adds extra initialization overhead. Concourse wouldn't have control over the local caches on K8s nodes, so volumes would always have to be pushed to the centralized registry and pulled at least once when executing a step
- Potentially adds substantial load on registry, as Concourse would be creating a new file system layer for every **cache object**
- There isn't a well documented approach to set up a secure in-cluster registry. The setup requires exposing an in-cluster registry externally with traffic routed via an LB. [Prior spike](https://github.com/concourse/concourse/issues/3796)
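
One way to picture the "fixed static path" con and the extra handling it implies for `input_mapping`: each layer always unpacks to the same path, and the step's working directory exposes it under the name the task expects. The Python sketch below plans such links; `CACHE_ROOT`, `layer_path` and `plan_input_links` are hypothetical names, and linking is only one possible approach, not something the RFC decides.

```python
import posixpath

# Hypothetical fixed location where every cache object layer is unpacked.
CACHE_ROOT = "/concourse/cache-objects"


def layer_path(cache_object_id):
    """Static path of a cache object inside the image file system."""
    return posixpath.join(CACHE_ROOT, cache_object_id)


def plan_input_links(workdir, task_inputs, artifacts, input_mapping=None):
    """Map each task input path to the fixed path of its cache object.

    task_inputs:   input names declared by the task config
    artifacts:     artifact name in the build plan -> cache object id
    input_mapping: task input name -> artifact name in the build plan
    """
    input_mapping = input_mapping or {}
    links = {}
    for name in task_inputs:
        # Without a mapping, the task input name is also the artifact name.
        artifact = input_mapping.get(name, name)
        links[posixpath.join(workdir, name)] = layer_path(artifacts[artifact])
    return links
```

For example, with `input_mapping` of `{"repo": "repo-master"}`, the task sees `repo` in its working directory while the underlying layer keeps its static path, so the same layer can be reused by steps that name the input differently.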

## S3 Compatible Blobstore
### Description
Each **cache object** is stored in a blobstore. Concourse would require a managed blobstore as a dependency. For each step, Concourse would pull down the relevant blobs for inputs and push blobs for outputs.
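
The pull-inputs, run, push-outputs flow described above can be sketched in Python as follows. An in-memory store stands in for a real S3-compatible client, and the `BlobStore` protocol, the key layout and `run_step` are assumptions made for illustration only.

```python
from typing import Callable, Dict, Protocol


class BlobStore(Protocol):
    """Minimal interface a blobstore client would need for this flow."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in for a real blobstore, for illustration and tests."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def run_step(
    store: BlobStore,
    input_keys: Dict[str, str],
    step_fn: Callable[[Dict[str, bytes]], Dict[str, bytes]],
    output_prefix: str,
) -> Dict[str, str]:
    """Pull every input blob, run the step, then push every output blob."""
    inputs = {name: store.get(key) for name, key in input_keys.items()}
    outputs = step_fn(inputs)
    output_keys = {}
    for name, data in outputs.items():
        key = f"{output_prefix}/{name}"  # hypothetical key layout
        store.put(key, data)
        output_keys[name] = key
    return output_keys
```

Every step pays for a full pull and push in this model, which is the initialization overhead listed under the cons below.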

### Pros
- Scale well (GCR uses GCS as the underlying storage)
- Could explore further de-coupling by exploring CSI driver
- Established ways of securely fetching/pushing blobs from a blobstore
- As this would be a centralized storage solution
  - it doesn't impact what a K8s based Concourse worker looks like
  - Simplified GC
  - Would support streaming across worker runtimes

### Cons
- Wouldn't have support for hosting images by default.
- Adds another dependency for Concourse (depending on where Concourse is deployed there might be managed solutions available)
- Lack of standardized APIs
- Adds extra initialization overhead. Concourse wouldn't have a local cache, so volumes would always have to be pushed & pulled for steps
- Concourse would potentially be a heavy user of the blobstore

## Persistent Volumes
Each **cache object** would be stored in its own persistent volume. Persistent volume snapshots would be used to reference **cache object** versions.
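
As a sketch of how this could map onto Kubernetes objects, the Python below builds a `PersistentVolumeClaim` per **cache object**, a `VolumeSnapshot` for a version of it, and a new claim restored from that snapshot. The resource kinds and field names are Kubernetes and CSI snapshot API fields; the object names, sizes, storage class and the `snapshot.storage.k8s.io/v1` API version are assumptions (older clusters serve the snapshot API under a beta version).

```python
SNAPSHOT_API_GROUP = "snapshot.storage.k8s.io"


def cache_object_pvc(name, size="1Gi", storage_class="standard", from_snapshot=None):
    """PVC holding one cache object, optionally restored from a snapshot."""
    spec = {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": storage_class,  # assumed class name
        "resources": {"requests": {"storage": size}},
    }
    if from_snapshot is not None:
        spec["dataSource"] = {
            "name": from_snapshot,
            "kind": "VolumeSnapshot",
            "apiGroup": SNAPSHOT_API_GROUP,
        }
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": spec,
    }


def cache_object_snapshot(name, pvc_name, snapshot_class):
    """VolumeSnapshot referencing one version of a cache object."""
    return {
        "apiVersion": SNAPSHOT_API_GROUP + "/v1",  # version depends on the cluster
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }
```

A step that needs a given version would mount a claim created with `from_snapshot` set, provided the CSI driver in use supports snapshots.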

### Pros
- Would leverage native k8s offering
- Maps well to Concourse's use of **cache objects** and offloads the heavy lifting to K8s
- Potentially wouldn't require volumes to be streamed at all

### Cons
- Wouldn't have support for hosting images by default.
- IaaS-based [volume limits per node](https://kubernetes.io/docs/concepts/storage/storage-limits/#dynamic-volume-limits) prevent this from being a scalable solution

not super related, but found this idea interesting: argoproj/argo-workflows#4130

- CSI Snapshotting feature is optional and not every driver supports it ([Drivers & features they support](https://kubernetes-csi.github.io/docs/drivers.html#production-drivers))
- As this would NOT be a centralized storage solution, it wouldn't support workers across multiple runtimes or even K8s clusters

## K8s POC (Baggageclaim peer-to-peer)
Each step would have a sidecar container to populate input **cache objects** and host output **cache objects** via an HTTP API. `beltloader` is used to populate inputs. `baggageclaim` is used to host outputs. `baggageclaim` was also modified to allow **cache objects** to be accessed via the registry APIs (to support images).

### Pros
- No external dependencies are required
- Supports worker-to-worker streaming bypassing Concourse web

### Cons
- the `step` pod's lifecycle is tied to the **cache object** lifecycle (pods have to be kept around until the **cache object** they host is required). This would increase the CPU & memory usage of a cluster.
- there isn't a simple mechanism to allow the k8s container runtime to securely access the `baggageclaim` endpoints to fetch images
- As this would NOT be a centralized storage solution, it would require exposing the `baggageclaim` endpoints via `services` to be accessed externally
- `baggageclaim` itself doesn't have any authentication/authorization or transport security (https) mechanisms built into it

# Other considerations
## CSI
The [Container Storage Interface](https://github.com/container-storage-interface/spec/blob/master/spec.md) provides a generic interface for providing storage to containers.

CSI was developed as a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes. With the adoption of the Container Storage Interface, the Kubernetes volume layer becomes truly extensible. Using CSI, third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code. This gives Kubernetes users more options for storage and makes the system more secure and reliable. [Source](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/#why-csi)

The CSI spec can be used to wrap every solution listed above. It provides an API through which the chosen solution would be consumed.
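
A rough Python sketch of what "wrapping" a solution could look like: the node-side handlers delegate to a pluggable backend, so the registry, blobstore or baggageclaim options could sit behind the same calls. The method names mirror the `NodePublishVolume` and `NodeUnpublishVolume` RPCs from the CSI spec, but this is not a gRPC server; `CacheObjectBackend`, `InMemoryBackend` and the `payload` file name are hypothetical.

```python
import os
import shutil
import tempfile
from abc import ABC, abstractmethod


class CacheObjectBackend(ABC):
    """Hypothetical seam behind which any storage option could be plugged."""

    @abstractmethod
    def materialize(self, volume_id: str, target_path: str) -> None:
        """Make the cache object's contents available at target_path."""


class InMemoryBackend(CacheObjectBackend):
    """Toy backend that writes stored bytes to a single file."""

    def __init__(self, objects):
        self.objects = objects  # volume id -> bytes

    def materialize(self, volume_id, target_path):
        with open(os.path.join(target_path, "payload"), "wb") as f:
            f.write(self.objects[volume_id])


class NodeService:
    """Shaped after the CSI Node service; simplified to plain method calls."""

    def __init__(self, backend):
        self.backend = backend

    def node_publish_volume(self, volume_id, target_path):
        os.makedirs(target_path, exist_ok=True)
        self.backend.materialize(volume_id, target_path)

    def node_unpublish_volume(self, target_path):
        shutil.rmtree(target_path, ignore_errors=True)


def demo():
    """Publish one cache object into a temporary 'pod' directory and read it back."""
    root = tempfile.mkdtemp()
    target = os.path.join(root, "pods", "step-1", "input")
    node = NodeService(InMemoryBackend({"vol-1": b"hello"}))
    node.node_publish_volume("vol-1", target)
    with open(os.path.join(target, "payload"), "rb") as f:
        data = f.read()
    node.node_unpublish_volume(target)
    shutil.rmtree(root, ignore_errors=True)
    return data
```

Swapping the storage mechanism would then mean providing another `CacheObjectBackend` implementation, leaving the calls made on behalf of the step pod unchanged.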

### Pros
- Can be deployed/managed using k8s resources ([hostPath CSI Driver example](https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/docs/deploy-1.17-and-later.md))
- Allows the storage mechanisms to be swapped more easily
- can be an extension point for Concourse community
- De-couples Concourse from its usage of storage
- the driver could be patched/upgraded independently of Concourse
- The CSI Spec is quite flexible and has a minimum set of required methods (the other set of features are opt-in)
- CSI supports multiple deployment topologies (master, master+node, node)
- Provides a scheduling extension point for volume aware scheduling

### Cons
- extra overhead for development, packaging and deployment
- the CSI version may be tied to a K8s version

## Fuse
This might simplify our usage of external storage solutions such as blobstores. There isn't a supported solution in K8s at the moment. However, this would be something worth considering if that were to change. [Click here to view the current issue requesting K8s development](https://github.com/kubernetes/kubernetes/issues/7890).

# Open Questions

- Do we implement our own version of the csi-image-populator?
- Should we implement this as a CSI driver?


# Answered Questions


# Related Links
- [Storage Spike](https://github.com/concourse/concourse/issues/6036)
- [Review k8s worker POC](https://github.com/concourse/concourse/issues/5986)


# New Implications

Will drive the rest of the Kubernetes runtime work.