Disk usage not correctly reported, preventing GC trigger #5459

sthroner opened this issue Oct 28, 2024 · 3 comments
Labels
area/rootless rootless mode

sthroner commented Oct 28, 2024

Hello,

we are running Buildkit rootless in a Kubernetes installation and have defined a GC policy with keepBytes:

[[worker.oci.gcpolicy]]
    all = true
    keepBytes = "250GB"  # 50GB less than the PVC size for /home/user/.local/share/buildkit

However, the rule is not always triggered when we hit the limit. We have already tried to pin down the issue; the details we have found so far are below.
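As a side note on the numbers in the policy above, the headroom between keepBytes and the PVC depends on how the "GB" suffix is read. The sketch below computes it for both the decimal and the binary reading. Which reading BuildKit applies is not stated in this report, so treat both as assumptions; the helper name `headroom_gib` is made up for illustration.

```python
# Headroom between the GC threshold and the PVC size, under two readings of "GB".
# Which reading BuildKit applies is an assumption left open here.

GIB = 1024 ** 3   # binary unit, as in the Kubernetes "300Gi" storage request
GB = 1000 ** 3    # decimal unit

pvc_bytes = 300 * GIB

def headroom_gib(keep_bytes: int) -> float:
    """Space left between the GC threshold and the PVC size, in GiB."""
    return (pvc_bytes - keep_bytes) / GIB

decimal_headroom = headroom_gib(250 * GB)   # "250GB" read as 250 * 10^9 bytes
binary_headroom = headroom_gib(250 * GIB)   # "250GB" read as 250 * 2^30 bytes

print(f"decimal reading: {decimal_headroom:.1f} GiB headroom")
print(f"binary reading:  {binary_headroom:.1f} GiB headroom")
```

Under either reading the threshold sits below the nominal PVC size, although the usable capacity of the filesystem is typically somewhat smaller than the requested 300Gi.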

GC Triggered based on Disk Usage

Most of the time, the GC works fine and removes cached data above the set limit. From time to time, however, the GC is not triggered, and the buildkit instance runs out of storage and responds with the following error:

error: failed to solve: ResourceExhausted: failed to prepare k4ovv028ht6dewfcgpus32fn7 as q40z7n0str2xd0ec1u7mjz1r7: copying of parent failed: failed to copy files: write /home/user/.local/share/buildkit/runc-native/snapshots/snapshots/new-19411978/usr/lib/x86_64-linux-gnu/libperl.so.5.36.0: copy_file_range: no space left on device

After some tests, it looks like the buildctl disk usage command (buildctl du) does not report the same amount as the actual disk usage (du). Since the disk usage reported by buildctl was lower than the keepBytes value set in the GC policy, the GC was not triggered.

Disk Usage Reported by Buildkit based on type

| Record Type | Size |
| --- | --- |
| source.local | 27.36 MiB |
| regular | 110.78 GiB |
| Total | 110.80 GiB |

Disk Usage System

291.7G	/home/user/.local/share/buildkit/runc-native
291.7G	/home/user/.local/share/buildkit/
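To make the mismatch concrete, here is a small sketch using the figures from this report. It models the GC decision as a plain comparison of tracked usage against the threshold, which is a simplification and not BuildKit's actual code. It also assumes that all sizes are in GiB and that keepBytes is read as 250 GiB.

```python
# Figures copied from the report; units are assumed to be GiB throughout.
reported_gib = 110.80     # total from `buildctl du`
on_disk_gib = 291.7       # total from `du` on the buildkit root
keep_gib = 250.0          # keepBytes, read here as 250 GiB (assumption)

def gc_would_trigger(usage_gib: float, threshold_gib: float) -> bool:
    """Simplified model: GC prunes only when tracked usage exceeds the threshold."""
    return usage_gib > threshold_gib

unaccounted_gib = on_disk_gib - reported_gib

print(gc_would_trigger(reported_gib, keep_gib))  # decision based on tracked usage
print(gc_would_trigger(on_disk_gib, keep_gib))   # decision if on-disk usage were used
print(f"{unaccounted_gib:.1f} GiB on disk is not reflected in buildctl du")
```

In this model the tracked usage never crosses the threshold, while the on-disk usage does, which matches the behaviour we observe.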

When we run the GC manually via buildctl prune, it does clean up all the space. The garbage collector itself therefore seems to work fine; the problem looks more like an issue with how disk usage is measured.

Wrong Permission within Cache Folder

What we also noticed during the analysis was that the permissions for some folders within the cache were not set as we would expect them to be.

Running du fails because of permissions:

du: can't open '/home/user/.local/share/buildkit/runc-native/snapshots/snapshots/1762/var/cache/apt/archives/partial': Permission denied

Permissions for the folder:

~/.local/share/buildkit/runc-native/snapshots/snapshots/1762/var/cache/apt/archives $ ls -la
total 12
drwxr-xr-x    3 user     user          4096 Oct  8 13:21 .
drwxr-xr-x    3 user     user          4096 Oct  8 13:21 ..
-rw-r-----    1 user     user             0 Aug 13 00:43 lock
drwx------    2 100041   user          4096 Aug 13 00:43 partial
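The numeric owner 100041 in the listing above is what a subordinate UID typically looks like from outside a user namespace. The sketch below shows the usual mapping arithmetic. The subordinate range (start 100000, length 65536) is an assumption about the contents of /etc/subuid in the rootless image and is not taken from this report; the function name is hypothetical.

```python
from typing import Optional

# Hypothetical subordinate-ID range; the real value lives in /etc/subuid
# inside the rootless image and is assumed here, not taken from the report.
SUBUID_START = 100000
SUBUID_COUNT = 65536
HOST_UID = 1000  # runAsUser from the StatefulSet

def host_to_container_uid(host_uid: int) -> Optional[int]:
    """Map a UID seen on the volume to the UID inside the build's user namespace."""
    if host_uid == HOST_UID:
        return 0  # the unprivileged user appears as root inside the namespace
    offset = host_uid - SUBUID_START
    if 0 <= offset < SUBUID_COUNT:
        return offset + 1  # the subordinate range starts at container UID 1
    return None  # outside the mapped range

print(host_to_container_uid(100041))
```

Under these assumptions the directory belongs to a non-root account inside the build container, and with mode drwx------ the pod's user (UID 1000) cannot list it from outside the namespace.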

We also see other folders with permissions similar to var/cache/apt/archives/partial, so this does not appear to be specific to the apt package manager.
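To find every affected directory, a walk that records failures instead of aborting can help. This is a minimal sketch and not how BuildKit measures usage; whether unreadable directories are related to the undercounting in buildctl du is an open question in this report. The function name `measure_tree` is made up for illustration.

```python
import os
from typing import List, Tuple

def measure_tree(root: str) -> Tuple[int, List[str]]:
    """Sum file sizes under root and collect paths that could not be read."""
    unreadable: List[str] = []

    def on_error(err: OSError) -> None:
        # os.walk calls this when listing a directory fails (e.g. EACCES).
        unreadable.append(err.filename)

    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat counts a symlink itself instead of following it.
                # st_size is the apparent size, not the allocated blocks du reports.
                total += os.lstat(path).st_size
            except OSError:
                unreadable.append(path)
    return total, unreadable
```

Calling `measure_tree("/home/user/.local/share/buildkit")` inside the pod would return the readable total together with the list of directories that produced "Permission denied".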

### Setup
We currently use version 0.16 of the rootless image (https://hub.docker.com/layers/moby/buildkit/v0.16.0-rootless) in a Kubernetes setup.

StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: buildkit-temp
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: buildkit-temp
  serviceName: buildkit-temp
  template:
    metadata:
      labels:
        app.kubernetes.io/name: buildkit-temp
    spec:
      containers:
        - args:
            - --config
            - /var/config/buildkit.toml
          image: moby/buildkit:rootless
          name: buildkit
          ports:
            - containerPort: 1234
              protocol: TCP
          securityContext:
            allowPrivilegeEscalation: true
            capabilities:
              add:
              - CHOWN
              - DAC_OVERRIDE
              - FOWNER
              - FSETID
              - SETGID
              - SETUID
              - SETFCAP
              drop:
              - ALL
            privileged: false
            runAsGroup: 1000
            runAsNonRoot: true
            runAsUser: 1000
            seccompProfile:
              type: Unconfined
          volumeMounts:
            - mountPath: /home/user/.local/share/buildkit
              name: buildkit
            - mountPath: /var/config
              name: config
      securityContext:
        fsGroup: 1000
      ## Include the Service Account in the deployment
      serviceAccount: gitlab-runner-master-buildkit
      serviceAccountName: gitlab-runner-master-buildkit
      volumes:
        - name: config
          configMap:
            name: buildkit-temp
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: buildkit
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 300Gi
        volumeMode: Filesystem

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: buildkit-temp
data:
  buildkit.toml: |
    debug = true

    [grpc]
      address = [
        "tcp://0.0.0.0:1234", # tcp is for buildctl connections
        "unix:///run/user/1000/buildkit/buildkitd.sock", # non-root socket when running as non-root
      ]

    [worker.containerd]
      enabled = false

    [worker.oci]
      enabled = true
      # Enable automatic garbage collection, runs every minute
      gc = true
      # Allow running in main pid namespace when privileged: false
      noProcessSandbox = true

      [[worker.oci.gcpolicy]]
        all = true
        keepBytes = "250GB"  # 50GB less than the PVC size for /home/user/.local/share/buildkit

devthejo commented Nov 12, 2024

I seem to have a similar problem here. This is my full config (in case I'm missing something):

debug = true
root = "/var/lib/buildkit"

commands

[history]
  maxAge = 172800
  maxEntries = 50

[worker]

[worker.containerd]
  enabled = false

[worker.oci]
  enabled = true
  
  rootless = true
  
  max-parallelism = 4

  gc = true
  
  snapshotter = auto

  platforms = ["linux/amd64"]
  
  [[worker.oci.gcpolicy]]
    filters = ["type==exec.cachemount"]
    keepBytes = 90%
    keepDuration = 30d

  [[worker.oci.gcpolicy]]
    all = true
    keepBytes = 90%
    keepDuration = 30d

But my volume, which is mounted on /home/user/.local/share/buildkit and used only by buildkit, is 96% full, causing a no space left on disk error when trying to run a build task.
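For comparison, a percentage threshold can be checked against the filesystem directly. The sketch below assumes that a percentage is meant relative to the total size of the filesystem holding the cache; how BuildKit actually resolves a percentage is not confirmed in this thread. Both function names are hypothetical.

```python
import shutil

def percent_threshold(total_bytes: int, percent: int) -> int:
    """Bytes corresponding to `percent` of the filesystem size (integer math)."""
    return total_bytes * percent // 100

def over_threshold(path: str, percent: int) -> bool:
    """True when the filesystem holding `path` uses more than `percent` of its size."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.used > percent_threshold(usage.total, percent)
```

On a volume that is 96% full, `over_threshold("/home/user/.local/share/buildkit", 90)` returns True, so a check based on filesystem usage would have flagged the volume even though the GC did not run.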

EDIT:
Another observation: after restarting the pod (and resizing the volume), it seems that the cleanup was performed.

jedevc (Member) commented Nov 12, 2024

@devthejo what version of buildkit do you see this on? The original issue seems to be on v0.16, with rootless mode, is that the same setup you have?

@jedevc jedevc added the area/rootless rootless mode label Nov 12, 2024

devthejo commented Nov 13, 2024

> @devthejo what version of buildkit do you see this on? The original issue seems to be on v0.16, with rootless mode, is that the same setup you have?

@jedevc It was v0.13.0. I have now upgraded to v0.17.1 and am waiting to see whether it is reproducible on the new version (I needed to fix the problem quickly and didn't have enough time to investigate further). I'm not sure it's the same issue, but it looked like it.
