Controller stops accepting jobs from the cluster queue #302

Open
aressem opened this issue Apr 8, 2024 · 9 comments · Fixed by #442

Comments

@aressem

aressem commented Apr 8, 2024

We have agent-stack-k8s up and running, and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs (with debug logging turned on) is:

2024-04-08T11:38:23.100Z	DEBUG	limiter	scheduler/limiter.go:77	max-in-flight reached	{"in-flight": 25}

We currently only have a single pipeline, single cluster and single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again and it starts jobs from the queue.

I suspect that there is something that should decrement the in-flight number, but fails to do so. We are now running a test where this number is set to 0 to see if that works around the problem.
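
To illustrate the suspected failure mode, here is a minimal sketch of a max-in-flight limiter that leaks a slot when a completion is never reported. It is a toy model, not the controller's actual code: the class `InFlightLimiter` and its methods are hypothetical names, and treating 0 as "no limit" is an assumption based on the workaround described above.

```python
class InFlightLimiter:
    """Toy model of a max-in-flight limiter (hypothetical, not the controller's code)."""

    def __init__(self, max_in_flight):
        # Assumption: 0 means "no limit", matching the workaround tried above.
        self.max_in_flight = max_in_flight
        self.in_flight = set()

    def try_start(self, job_uuid):
        """Return True if the job may start, False if it is tracked already or the limit is hit."""
        if job_uuid in self.in_flight:
            return False
        if self.max_in_flight > 0 and len(self.in_flight) >= self.max_in_flight:
            return False
        self.in_flight.add(job_uuid)
        return True

    def finish(self, job_uuid):
        """Must run for every finished job; otherwise its slot is never freed."""
        self.in_flight.discard(job_uuid)


limiter = InFlightLimiter(max_in_flight=2)
assert limiter.try_start("job-a")
assert limiter.try_start("job-b")
# Suppose the completion event for job-a is lost: finish("job-a") never runs.
limiter.finish("job-b")
assert limiter.try_start("job-c")      # takes the slot freed by job-b
assert not limiter.try_start("job-d")  # blocked by the leaked job-a slot
```

If every slot leaks this way, the limiter reports "max-in-flight reached" even though nothing is running, which would match the symptom described here.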

@DrJosh9000
Contributor

Hi @aressem, did you discover anything with your tests where the number is set to 0?

@aressem
Author

aressem commented Apr 23, 2024

@DrJosh9000, the pipeline works as expected with in-flight set to 0. I don't know what that number might be now, but I suspect it is steadily increasing :)

@artem-zinnatullin
Contributor

Same issue when testing with max-in-flight: 1 on v0.11.0. At some point the controller stops taking new jobs, even though there are no jobs/pods running in the namespace besides the controller itself.

2024-05-21T21:31:57.923Z	DEBUG	limiter	scheduler/limiter.go:79	max-in-flight reached	{"in-flight": 1}

@calvinbui

I saw the same issue: num-in-flight does not decrease, so available-tokens eventually reaches 0 and no new jobs are run.

@DrJosh9000
Contributor

num-in-flight and available-tokens are now somewhat decoupled, so it would be useful to compare available-tokens against the number of job pods actually pending or running in the k8s cluster.

🤔 Maybe the controller should periodically survey the cluster, and adjust tokens accordingly.
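
A rough sketch of what such a survey could look like, assuming the controller's bookkeeping is a plain set of job names and the cluster state is a dict shaped like the output of `kubectl get jobs -o json`. The function names `active_job_names` and `reconcile` are hypothetical; this is not how the controller is implemented.

```python
def active_job_names(job_list):
    """Names of Jobs in a `kubectl get jobs -o json` style dict that have not finished."""
    names = set()
    for job in job_list.get("items", []):
        conditions = job.get("status", {}).get("conditions", [])
        finished = any(
            c.get("type") in ("Complete", "Failed") and c.get("status") == "True"
            for c in conditions
        )
        if not finished:
            names.add(job["metadata"]["name"])
    return names


def reconcile(tracked, job_list):
    """Release tracked entries that have no unfinished Job in the cluster.

    Mutates `tracked` in place and returns the set of names that were released.
    """
    stale = tracked - active_job_names(job_list)
    tracked.difference_update(stale)
    return stale


# Hypothetical state: three jobs tracked, but only one still active in the cluster.
tracked = {"buildkite-aaa", "buildkite-bbb", "buildkite-ccc"}
cluster = {
    "items": [
        {"metadata": {"name": "buildkite-aaa"}, "status": {"active": 1}},
        {
            "metadata": {"name": "buildkite-bbb"},
            "status": {"conditions": [{"type": "Complete", "status": "True"}]},
        },
    ]
}
released = reconcile(tracked, cluster)
```

Here `released` contains `buildkite-bbb` (finished) and `buildkite-ccc` (absent from the cluster), and only `buildkite-aaa` remains tracked. A real implementation would also need to avoid releasing jobs that were created moments ago and are not yet visible in the listing.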

@artem-zinnatullin
Contributor

We just had a CI outage partially caused by this behavior. Here is the gist:

  1. We've gradually (%) switched CI jobs from https://github.com/EmbarkStudios/k8s-buildkite-plugin to https://github.com/buildkite/agent-stack-k8s
  2. Because of how this controller works, it adds its own containers (agent, etc.) to the Job definition, which raises the resource requests above those of the actual container definition. That is fine, and we override them. The EmbarkStudios plugin, however, runs agents completely separately, so it never added anything on top of the K8S Job definition, even if the overall overhead is the same (again, that's fine, but it is an important detail)
  3. We have cron jobs for benchmarking that try to acquire an entire CI K8S node so that other jobs can't affect its performance. We did this by setting resources.requests.memory very close to the node limit (we should probably have used taints)
  4. Due to the overhead of the additional containers added to the K8S Job by https://github.com/buildkite/agent-stack-k8s, these few benchmark jobs now overshot the limit and got stuck in K8S because they couldn't fit in any node's memory
  5. For about a week these benchmark jobs accumulated in the Buildkite queue
  6. At some point https://github.com/buildkite/agent-stack-k8s v0.18.0 stopped taking new Buildkite jobs, at 93 jobs in our case, even though we have max-in-flight: 250
  7. I tried deploying max-in-flight: 0 to remove the limit, but the controller still wasn't taking any new jobs, even though people were pushing more PRs and those builds would have fit on the nodes. kubectl get jobs displayed only these 93 jobs that could never run in the cluster.
  8. The controller didn't pick up any new jobs until I cancelled all the stale pending benchmark jobs in the Buildkite UI; then other jobs started to be processed
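
The overshoot in steps 2 to 4 comes down to simple arithmetic: the scheduler compares the sum of the pod's container memory requests against what a node can allocate. The sketch below uses made-up numbers (a 64Gi node, a 62Gi benchmark container, and three extra containers); none of these values come from our cluster, and init containers and pod overhead are ignored for simplicity.

```python
def parse_memory(quantity):
    """Convert a small subset of Kubernetes memory quantities (integer Ki/Mi/Gi or bytes) to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def pod_memory_request(container_requests):
    """Sum of the memory requests of the pod's regular containers, in bytes."""
    return sum(parse_memory(q) for q in container_requests)


# All values below are hypothetical and only illustrate the effect.
node_allocatable = parse_memory("64Gi")
benchmark_only = ["62Gi"]                                 # request sized close to the node limit
with_added_containers = benchmark_only + ["1Gi", "1Gi", "512Mi"]

fits_alone = pod_memory_request(benchmark_only) <= node_allocatable
fits_with_overhead = pod_memory_request(with_added_containers) <= node_allocatable
```

With these numbers `fits_alone` is true and `fits_with_overhead` is false, so a pod that used to schedule stays pending on every node.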

The logs indicated that there were available tokens, yet it got stuck at a lower number.

2024-11-12T17:59:37.861Z	DEBUG	limiter	scheduler/limiter.go:87	Create: job is already in-flight	{"uuid": "01931db6-67ea-403c-8687-e01ab64e8e94", "num-in-flight": 93, "available-tokens": 162}
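
For anyone who wants to track these two numbers over time, here is a small sketch that pulls them out of a limiter log line. It assumes the columns are tab-separated with a trailing JSON object, as they appear in the lines quoted in this thread; `parse_log_line` is a hypothetical helper, not part of agent-stack-k8s.

```python
import json


def parse_log_line(line):
    """Split one tab-separated controller log line into its columns and JSON fields."""
    parts = line.rstrip("\n").split("\t", 5)
    if len(parts) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    timestamp, level, logger, caller, message = parts[:5]
    # The sixth column, when present, is a JSON object with structured fields.
    fields = json.loads(parts[5]) if len(parts) == 6 and parts[5] else {}
    return {
        "timestamp": timestamp,
        "level": level,
        "logger": logger,
        "caller": caller,
        "message": message,
        "fields": fields,
    }


sample = "\t".join([
    "2024-11-12T17:59:37.861Z",
    "DEBUG",
    "limiter",
    "scheduler/limiter.go:87",
    "Create: job is already in-flight",
    '{"uuid": "01931db6-67ea-403c-8687-e01ab64e8e94", "num-in-flight": 93, "available-tokens": 162}',
])
record = parse_log_line(sample)
in_flight = record["fields"]["num-in-flight"]
available = record["fields"]["available-tokens"]
```

Feeding each limiter line through this and comparing `in_flight` with the number of job pods actually present in the namespace should show whether the two drift apart.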

We will be adding alarms for stale Buildkite jobs in the queue, but something still seems wrong with the controller, because it should still have scheduled other K8S Jobs into the cluster.

@evict

evict commented Nov 21, 2024

I am also running into this issue. I have to restart my Kubernetes deployment basically every day. 😅

@DrJosh9000
Contributor

I'm still looking into this one.

I have a new theory: k8s jobs can be successfully created, but fail without ever starting a pod. This state isn't handled properly: the job remains present until the TTL, so can't be recreated under the same name. That remains the oldest job available, so before #427 the controller repeatedly tries and fails to recreate it. With #427 other jobs get a shot at being created instead, but this isn't much help if the jobs are failing because the cluster is very busy.
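
A toy model of that theory, to make the difference concrete. This is only a sketch of the behaviour as described above, not the controller's code: `schedule_round` and its `retry_oldest_only` flag are hypothetical, with the flag standing in for the behaviour before and after #427.

```python
def schedule_round(queue, existing_names, retry_oldest_only):
    """One scheduling pass over `queue` (oldest first).

    A job whose name already exists in the cluster cannot be created again.
    Returns the list of job names created in this pass.
    """
    created = []
    for name in queue:
        if name in existing_names:
            if retry_oldest_only:
                break    # keeps failing on this job; nothing behind it is created
            continue     # skip the blocked job and give later jobs a chance
        existing_names.add(name)
        created.append(name)
    return created


queue = ["job-1", "job-2", "job-3"]
# job-1 failed without ever starting a pod, but its k8s job still exists until the TTL.
leftover = {"job-1"}

blocked = schedule_round(queue, set(leftover), retry_oldest_only=True)
unblocked = schedule_round(queue, set(leftover), retry_oldest_only=False)
```

In this model `blocked` is empty while `unblocked` contains `job-2` and `job-3`. It does not capture the second half of the theory: if the cluster is too busy, the newly created jobs can fail in the same way.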

@DrJosh9000
Contributor

v0.20.0 shipped with a few improvements in this area, and I'm curious to see what has helped. If possible, when trying v0.20.0 or later, please gather some Prometheus metrics to get a better sense of what else is going wrong:

# values.yaml
config:
  prometheus-port: 9216  # or some other port of your choosing

Here's an example PodMonitor when using the Prometheus operator:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: agent-stack-k8s
  labels:
    app: agent-stack-k8s
spec:
  jobLabel: app
  namespaceSelector:
    matchNames:
      - buildkite
  selector:
    matchLabels:
      app: agent-stack-k8s
  podMetricsEndpoints:
    - port: metrics    # defined in the Helm chart when prometheus-port is set
      interval: 1s     # feel free to tune
