# RFC: k8s storage (#77)

# Terms

- **cache object**: a BLOB and the relevant metadata that Concourse needs to persist. These could be resource caches, task caches, or build caches.
- **worker**: Concourse executes steps on a **worker**, which implements some **worker** interface. Concourse is agnostic of the runtime representation of the worker (e.g. a K8s pod, node, or cluster).

# Summary

After spiking on a few solutions for storage on Kubernetes, our recommendation is to use an image registry to store **cache objects** for steps.

# Motivation

As we started thinking about the Kubernetes runtime, we realized that we needed to settle on a storage solution before proceeding with any other part of the implementation. Storage has a huge effect on how Concourse interacts with the runtime (Kubernetes). It also came with a lot of unknowns: we didn't know what the storage landscape on Kubernetes looked like or what options were available to us. Finally, storage has a huge impact on the performance of the cluster, both in terms of storage overhead and step initialization time.

## Requirements

An ideal storage solution can do the following:

- serve images to the CRI that K8s is using (image fetching)
- transfer **cache objects** between steps (whatever represents a step, most likely a pod)
- cache resources and tasks
- stream **cache objects** across worker runtimes (e.g. a K8s worker sends an artifact to a Garden worker)

## Criteria

- security
- performance, i.e. initialization time (time spent running workloads on a single K8s worker, as well as across workers)
- resource usage to run the storage solution

# Proposal

**TL;DR**: We recommend going with the image registry option because it satisfies all the requirements and gives us a number of options to improve performance compared to the blobstore option. It also provides a very flexible solution that works across multiple worker runtimes. [See Details](#image-registry-to-store-artifacts)

> **Review comment:** Trade-study - push/pull from registry using

Furthermore, the CSI is a useful interface for building the storage component against. [See Details](#csi)

# Storage Options Considered

## Baggageclaim Daemonset

### Description

A privileged baggageclaim pod would manage all the **cache objects** for step pods. The pod can be granted sufficient privilege to create overlay mounts by using the `Bidirectional` value for `mountPropagation`. The `volumeMount` object also allows specifying a volume `subPath`.

This approach didn't work using GCE PDs or vSphere Volumes ([Issue](https://github.com/kubernetes/kubernetes/issues/95049)). It does work using the `hostPath` option; however, that would require a large root volume and wouldn't be able to leverage IaaS based persistent disks.

The pod would run on all nodes that Concourse executes steps on.
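
For concreteness, here is a minimal sketch (in Go, using the Kubernetes API types) of the parts of such a DaemonSet spec that this option hinges on: a privileged container and a `Bidirectional` volume mount. The names, image, and host path are illustrative assumptions, not part of this proposal.

```go
package storage

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// baggageclaimDaemonSet sketches the pod spec pieces that matter here: a
// privileged container plus a Bidirectional mount, so overlay mounts created
// inside the pod propagate back to the host (and into step pods).
func baggageclaimDaemonSet() *appsv1.DaemonSet {
	privileged := true
	bidirectional := corev1.MountPropagationBidirectional

	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{Name: "baggageclaim"},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "baggageclaim"},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: map[string]string{"app": "baggageclaim"},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "baggageclaim",
						Image: "concourse/baggageclaim", // illustrative image name
						SecurityContext: &corev1.SecurityContext{
							Privileged: &privileged,
						},
						VolumeMounts: []corev1.VolumeMount{{
							Name:             "volumes",
							MountPath:        "/volumes",
							MountPropagation: &bidirectional,
						}},
					}},
					Volumes: []corev1.Volume{{
						Name: "volumes",
						VolumeSource: corev1.VolumeSource{
							// hostPath is the variant that worked in the spike;
							// the path is an illustrative assumption.
							HostPath: &corev1.HostPathVolumeSource{Path: "/var/concourse/volumes"},
						},
					}},
				},
			},
		},
	}
}
```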

### Pros

+ Leverages baggageclaim
+ Volume streaming between nodes would work using the current Concourse architecture
+ Resource and task caches would also work using the current Concourse architecture
+ Would be able to stream **cache objects** across worker runtimes, as streaming would be mediated via the web node
+ Concourse would have complete control over the volume lifecycle
+ Would have negligible overhead for steps scheduled on the same node, as no input/output stream would be required

### Cons

- Not being able to use IaaS based persistent disks doesn't offer a viable solution; K8s nodes would need to have large root volumes.
- Wouldn't have support for hosting images by default; however, `baggageclaim` could be extended to add the registry APIs
- `baggageclaim` itself doesn't have any authentication/authorization or transport security (HTTPS) mechanisms built into it

> **Review thread (on the root-volume con):**
>
> **@ari-becker:** Is this actually an issue? Under [...] Large root volumes is closer to how Concourse workers work today (i.e. closer to the implicit foundational design assumptions in the Concourse domain that this RFC attempts to deal with), and if there are scalability issues with Persistent Volumes, then doesn't this represent a simpler way (because it builds on prior [...]
>
> **@xtreme-sameer-vohra:** Hey @ari-becker, we are assuming the cons [...] creates larger challenges for operators. It sounds like you'd be okay with this setup?
>
> **@ari-becker:** Currently, we already run Concourse workers on a Kubernetes cluster, one where we've architected our workers such that the workers are restricted to their own underlying nodes (we manually scale both the underlying nodes and the worker StatefulSet in tandem) and the worker pods are privileged. So all of your cons are already present in our deployment and therefore implicitly acceptable. Maybe you need to clarify what the goals are supposed to be of developing a Kubernetes runtime for Concourse? Is it to phase out the "worker" in favor of the ATC directly scheduling Kubernetes Jobs that represent Concourse jobs or steps? Because then running privileged baggageclaim pods is fine; basically the worker isn't really phased out, it's just simplified down to only the baggageclaim code, and presumably we'd be able to configure the ATC to create Jobs with node restrictions / affinity / taints / tolerations to get the Jobs to run on a dedicated worker instance group. Is it the ability to run on common / shared instances? Because then running privileged baggageclaim pods is probably unacceptable. Should untrusted Concourse workloads run on common / shared instances in the first place, even if they're not allowed to be privileged? Ehhh, debatable, and a lot of security experts will tell you no, particularly if the jobs are not configured with PodSecurityPolicies and the like, which in a CI / development environment rarely makes sense. The main value that we'd get out of a Kubernetes-native runtime would be the ability to set up Cluster Autoscaler so that Kubernetes will add (and reduce) nodes as needed according to scheduled workloads, as currently we need to scale workers by hand. Cluster Autoscaler works well with workloads that are scheduled to dedicated instance groups, adding and subtracting nodes from the dedicated instance group that will permit Jobs that are restricted to those instance groups.
>
> **@xtreme-sameer-vohra:** Ah yes, we are working on the K8s Runtime RFC that should have preceded this one 😸 As you mentioned, the container management would be delegated to K8s. We are spiking on using IaaS PVs + propagating mounts using CSI, hoping to ultimately leverage [...] The initial draft of the proposal has a worker being mapped to a K8s cluster + namespace. "Instance Groups" seems interesting. Do you see it fitting in naturally with the cluster+namespace mapping, or would you propose something different?
>
> **@ari-becker:** One of the things that worries me about using an external image registry is the prospect of cross-availability-zone (on AWS, $0.01/GB) and/or VPC egress traffic costs (on AWS, much higher). Currently, all of our workers are in the same availability zone (for CI workers, we're less concerned with the prospect of an AZ-wide outage and more concerned with the traffic costs) and so our traffic costs are much lower. Effectively using an external image registry while keeping costs under control would mean a) deploying a separate image registry per availability zone alongside the workers, where workers are correctly configured to use the image registry that is in their availability zone, and b) the image registry itself needs to use either a local disk / IaaS PV (so as not to incur traffic costs when storing layers in S3, for example), so really all that an image registry does is consolidate storage usage in one place. This eventually leads to cost savings compared to using a large local disk / IaaS PV per worker, with the tradeoff of additional operational complexity. When talking about instance groups, I'm using the Kops nomenclature. An instance group maps to an AWS AutoscalingGroup, where the nodes are of a certain instance type (i.e. CI is generally best-served with burstable nodes, which on AWS are t3 nodes, whereas t3 nodes may not make sense for other workloads). In order to schedule workers on a specific ASG that we set up for them, we use a combination of node affinity to direct Kubernetes to schedule the CI workloads (currently, the worker pods) to those nodes, and taints and tolerations to not allow any other workloads to schedule there. This is orthogonal to Kubernetes namespacing, which is a good idea anyway to improve logical isolation between individual CI workloads.

> **Review thread:**
>
> **Comment:** As far as I know, TKG clusters created via TMC expect pods to adhere to a pod security policy that's quite restrictive [...] buuut, I'd imagine someone installing a CSI plugin (like anything network related) would expect to grant higher privileges anyway (being an "infra component"), but not to any container (i.e., the containers that run the workloads, like tasks, etc.)
>
> **Reply:** Yep, this does make the assumption that installing Concourse requires more than just pod* capabilities. @taylorsilva and I were also discussing that this approach wouldn't preclude us from using the IaaS PVs directly for simpler installations where the user doesn't have privileges to install a

## Image Registry to store artifacts

### Description

Each **cache object** is represented as an image layer of a repository in an image registry ([SPIKE using registry to store artifacts](https://github.com/concourse/concourse/issues/3740)). Concourse would require a managed image registry as a dependency. For each step, Concourse would generate an image config and manifest with all the relevant inputs modeled as image layers.
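
As an illustration of "inputs modeled as image layers", the sketch below assembles an OCI image manifest whose layers are the step's cache objects. The structs are minimal local mirrors of the OCI image-spec schema (kept local so the sketch is self-contained); how layer digests and sizes are produced is left out.

```go
package storage

import "encoding/json"

// Descriptor and Manifest mirror the corresponding OCI image-spec types
// (https://github.com/opencontainers/image-spec), reduced to the fields
// this sketch needs.
type Descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

type Manifest struct {
	SchemaVersion int          `json:"schemaVersion"`
	MediaType     string       `json:"mediaType"`
	Config        Descriptor   `json:"config"`
	Layers        []Descriptor `json:"layers"`
}

// manifestForStep builds a manifest for one step, with one (reusable)
// layer per input cache object. Pushing the same unchanged cache object
// again reuses the blob, since its digest is content-addressed.
func manifestForStep(config Descriptor, inputs []Descriptor) ([]byte, error) {
	m := Manifest{
		SchemaVersion: 2,
		MediaType:     "application/vnd.oci.image.manifest.v1+json",
		Config:        config,
		Layers:        inputs, // each input is a tar.gz layer pushed as a registry blob
	}
	return json.MarshalIndent(m, "", "  ")
}
```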

### Pros

- Would have support for building an image in a step and using it as the image for a subsequent step. This would require the image registry to be accessible by the CRI subsystem on each node
- Image registries are critical to operating on K8s, so there are plenty of options, from managed IaaS based solutions such as GCR, ECR and ACR to on-prem solutions like Harbor. It is therefore a safe assumption that a Concourse-on-K8s user would already have a registry available for use.
- Could explore further de-coupling via the [csi-driver-image-populator](https://github.com/kubernetes-csi/csi-driver-image-populator) when using registries for storing artifacts. It is listed as a sample driver in the CSI docs, and its README says it is not production ready (the last commit was Oct 2019). There is also another utility, [imgpkg](https://github.com/k14s/imgpkg), which allows arbitrary data to be stored in images as layers.
- Leverages performance enhancements to registries, such as a [pull through cache](https://docs.docker.com/registry/recipes/mirror/)
- Uses the standardized and documented [OCI image-spec protocol](https://github.com/opencontainers/image-spec)
- LRU based local caching of image layers by the K8s CRI
- Established ways of securely pushing/pulling blobs to/from an image registry
- As this would be a centralized storage solution:
  - it doesn't impact what a K8s based Concourse worker looks like
  - GC is simplified
  - it would support streaming across worker runtimes

### Cons

- Some registries, such as GCR, don't expose an API to delete layers directly
- A **cache object** would have to have a fixed static path in the image file system in order to reuse the same layer. This would require some additional handling in Concourse to support [input-mapping](https://concourse-ci.org/jobs.html#schema.step.task-step.input_mapping) and [output-mapping](https://concourse-ci.org/jobs.html#schema.step.task-step.output_mapping)
- Adds extra development overhead to generate new image configs & manifests to leverage **cache object** layers
- Adds extra initialization overhead. Concourse wouldn't have control over the local caches on K8s nodes, so volumes would always have to be pushed to the centralized registry and pulled at least once when executing a step
- Potentially adds substantial load on the registry, as Concourse would be creating a new file system layer for every **cache object**
- There isn't a well documented approach to setting up a secure in-cluster registry. The setup requires exposing an in-cluster registry externally with traffic routed via a LB. [Prior spike](https://github.com/concourse/concourse/issues/3796)

## S3 Compatible Blobstore

### Description

Each **cache object** is stored in a blobstore. Concourse would require a managed blobstore as a dependency. For each step, Concourse would pull down the relevant blobs for inputs and push blobs for outputs.
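
A minimal sketch of that per-step flow, assuming an S3-compatible blobstore accessed via the AWS SDK for Go v2; the bucket name and key layout are illustrative assumptions:

```go
package storage

import (
	"context"
	"fmt"
	"io"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// pushOutput uploads a step output (e.g. a tarball of the output directory)
// as a blob keyed by the cache object's identity.
func pushOutput(ctx context.Context, client *s3.Client, cacheObjectID, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("concourse-cache-objects"), // illustrative bucket name
		Key:    aws.String("cache-objects/" + cacheObjectID + ".tgz"),
		Body:   f,
	})
	return err
}

// pullInput downloads the blob for a step input and writes it to disk,
// to be extracted into the step's input directory.
func pullInput(ctx context.Context, client *s3.Client, cacheObjectID, dest string) error {
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("concourse-cache-objects"),
		Key:    aws.String("cache-objects/" + cacheObjectID + ".tgz"),
	})
	if err != nil {
		return fmt.Errorf("get object: %w", err)
	}
	defer out.Body.Close()

	f, err := os.Create(dest)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, out.Body)
	return err
}
```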

### Pros

- Scales well (GCR uses GCS as the underlying storage)
- Could explore further de-coupling via a CSI driver
- Established ways of securely fetching/pushing blobs from/to a blobstore
- As this would be a centralized storage solution:
  - it doesn't impact what a K8s based Concourse worker looks like
  - GC is simplified
  - it would support streaming across worker runtimes

### Cons

- Wouldn't have support for hosting images by default
- Adds another dependency to Concourse (depending on where Concourse is deployed, there might be managed solutions available)
- Lack of standardized APIs
- Adds extra initialization overhead. Concourse wouldn't have a local cache, so volumes would always have to be pushed & pulled for steps
- Concourse would potentially be a heavy user of the blobstore

## Persistent Volumes

### Description

Each **cache object** would be stored in its own persistent volume. Persistent volume snapshots would be used to reference **cache object** versions.
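
To illustrate how snapshots could reference cache object versions, here is a sketch of declaring a PVC restored from a `VolumeSnapshot`. It relies on the `snapshot.storage.k8s.io` API group provided by the CSI external-snapshotter; the names and size are illustrative assumptions.

```go
package storage

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcFromSnapshot declares a PVC whose contents are restored from a
// VolumeSnapshot, so a step could mount a specific cache object version
// without re-streaming its contents.
func pvcFromSnapshot(pvcName, snapshotName string) *corev1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: pvcName},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// DataSource points the PVC at a snapshot of the cache object.
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     snapshotName,
			},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"), // illustrative size
				},
			},
		},
	}
}
```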

### Pros

- Would leverage a native K8s offering
- Maps well to Concourse's use of **cache objects** and offloads the heavy lifting to K8s
- Potentially wouldn't require volumes to be streamed at all

### Cons

- Wouldn't have support for hosting images by default
- IaaS based [volume limits per node](https://kubernetes.io/docs/concepts/storage/storage-limits/#dynamic-volume-limits) prevent this from being a scalable solution
- The CSI snapshotting feature is optional, and not every driver supports it ([drivers & the features they support](https://kubernetes-csi.github.io/docs/drivers.html#production-drivers))
- As this would NOT be a centralized storage solution, it wouldn't support workers across multiple runtimes or even across K8s clusters

> **Review comment (on the volume-limits con):** Not super related, but found this idea interesting: argoproj/argo-workflows#4130

## K8s POC (Baggageclaim peer-to-peer)

### Description

Each step would have a sidecar container to populate input **cache objects** and host output **cache objects** via an HTTP API. `beltloader` is used to populate inputs, and `baggageclaim` is used to host outputs. `baggageclaim` was also modified to allow **cache objects** to be accessed via the registry APIs (to support images).
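
To make the pattern concrete, here is a sketch of the kind of HTTP endpoint such a sidecar could expose to stream an output to a downstream step as a tar archive. This illustrates the peer-to-peer idea only; the route, port, and on-disk layout are hypothetical, not baggageclaim's actual API.

```go
package main

import (
	"archive/tar"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

// streamVolume tars up an output directory and streams it to the caller.
// A downstream step's sidecar would GET this endpoint and untar the body
// into its input directory.
func streamVolume(w http.ResponseWriter, r *http.Request) {
	// Hypothetical route: GET /volumes/{handle}/stream
	handle := strings.TrimSuffix(strings.TrimPrefix(r.URL.Path, "/volumes/"), "/stream")
	root := filepath.Join("/volumes", handle) // hypothetical on-disk layout

	w.Header().Set("Content-Type", "application/x-tar")
	tw := tar.NewWriter(w)
	defer tw.Close()

	// Walk the output and write each regular file into the archive.
	filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || !info.Mode().IsRegular() {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name, _ = filepath.Rel(root, path)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
}

func main() {
	http.HandleFunc("/volumes/", streamVolume)
	// Plain HTTP, which mirrors the auth/TLS con noted below.
	http.ListenAndServe(":7788", nil)
}
```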

### Pros

- No external dependencies are required
- Supports worker-to-worker streaming, bypassing Concourse web

### Cons

- The `step` pod's lifecycle is tied to the **cache object** lifecycle (pods have to be kept around for as long as the **cache objects** they host are required). This would increase the CPU & memory usage of the cluster.
- There isn't a simple mechanism to allow the K8s container runtime to securely access the `baggageclaim` endpoints to fetch images
- As this would NOT be a centralized storage solution, it would require exposing the `baggageclaim` endpoints via `services` to be accessed externally
- `baggageclaim` itself doesn't have any authentication/authorization or transport security (HTTPS) mechanisms built into it

# Other considerations

## CSI

The [Container Storage Interface](https://github.com/container-storage-interface/spec/blob/master/spec.md) (CSI) provides a generic interface for providing storage to containers.

> CSI was developed as a standard for exposing arbitrary block and file storage systems to containerized workloads on Container Orchestration Systems (COs) like Kubernetes. With the adoption of the Container Storage Interface, the Kubernetes volume layer becomes truly extensible. Using CSI, third-party storage providers can write and deploy plugins exposing new storage systems in Kubernetes without ever having to touch the core Kubernetes code. This gives Kubernetes users more options for storage and makes the system more secure and reliable. ([Source](https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/#why-csi))

The CSI spec can be used to wrap every solution listed above; it provides an API through which the chosen solution would be consumed.
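
To give a feel for how small the mandatory surface is, the sketch below stubs the required Identity service plus the node-publish/unpublish calls a node-only Concourse storage driver would implement, using the Go bindings from the CSI spec repo. The driver name and the behavior described in comments are illustrative assumptions, not a committed design.

```go
// Package csidriver sketches the skeleton of a CSI driver wrapping
// Concourse's chosen storage backend.
package csidriver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// identity implements the CSI Identity service, the only service every
// driver must provide.
type identity struct{}

func (identity) GetPluginInfo(ctx context.Context, req *csi.GetPluginInfoRequest) (*csi.GetPluginInfoResponse, error) {
	return &csi.GetPluginInfoResponse{
		Name:          "storage.concourse-ci.org", // illustrative driver name
		VendorVersion: "0.1.0",
	}, nil
}

func (identity) GetPluginCapabilities(ctx context.Context, req *csi.GetPluginCapabilitiesRequest) (*csi.GetPluginCapabilitiesResponse, error) {
	// No optional capabilities advertised: a node-only driver.
	return &csi.GetPluginCapabilitiesResponse{}, nil
}

func (identity) Probe(ctx context.Context, req *csi.ProbeRequest) (*csi.ProbeResponse, error) {
	return &csi.ProbeResponse{}, nil
}

// node sketches the Node service calls kubelet uses to make a volume
// available to a pod; a Concourse driver would materialize the cache
// object here (pull from registry/blobstore, or overlay-mount).
type node struct{}

func (node) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	// Fetch/mount the cache object at the requested target path.
	return &csi.NodePublishVolumeResponse{}, nil
}

func (node) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpublishVolumeRequest) (*csi.NodeUnpublishVolumeResponse, error) {
	// Unmount and release the cache object at the target path.
	return &csi.NodeUnpublishVolumeResponse{}, nil
}
```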

### Pros

- Can be deployed/managed using K8s resources ([hostPath CSI driver example](https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/docs/deploy-1.17-and-later.md))
- Allows the storage mechanism to be swapped more easily
  - can be an extension point for the Concourse community
- De-couples Concourse from its usage of storage
  - the driver could be patched/upgraded independently of Concourse
- The CSI spec is quite flexible and has a minimal set of required methods (the other features are opt-in)
- CSI supports multiple deployment topologies (master, master+node, node)
- Provides a scheduling extension point for volume aware scheduling

### Cons

- Extra overhead for development, packaging and deployment
- The CSI version may be tied to a K8s version

## FUSE

FUSE might simplify our usage of external storage solutions such as blobstores. There isn't a supported solution in K8s at the moment; however, this would be worth considering if that were to change ([issue requesting K8s support](https://github.com/kubernetes/kubernetes/issues/7890)).

# Open Questions

- Do we implement our own version of the csi-image-populator?
- Should we implement this as a CSI driver?

# Answered Questions

# Related Links

- [Storage Spike](https://github.com/concourse/concourse/issues/6036)
- [Review k8s worker POC](https://github.com/concourse/concourse/issues/5986)

# New Implications

This RFC will drive the rest of the Kubernetes runtime work.

> **Review thread (on the RFC overall):**
>
> **@cirocosta:** It might be interesting to check out the approach that kaniko took - it essentially caches layers in a container registry if you don't specify a local path (so, e.g., if you want to run a `Job` that runs `kaniko` but you don't want to care about providing it a volume - which would imply `concurrency=1`): https://github.com/GoogleContainerTools/kaniko/blob/master/pkg/cache/cache.go. I'm not familiar with the details of the implementation, but it might serve as some inspiration 😁 buildkit seems to allow something similar: https://github.com/moby/buildkit/blob/master/README.md#export-cache
>
> **Reply:** Hey @cirocosta, we can see this style of local caching being beneficial for pulling the contents of an input for a task. Is that what you meant? We weren't sure how this might be useful for pushing 🤔
>
> **@cirocosta:** Oh, I didn't have any specifics in mind. It was more about showcasing the pattern of using a registry as a place where you just throw layers, regardless of what you're building, and then try to reuse those layers (pulling if they exist) when running another build.