Back-off restarting failed container katib-controller #2440

Open
qazserfv123 opened this issue Oct 10, 2024 · 1 comment
Comments

@qazserfv123

What happened?

After installing the latest changes of the Katib control plane, the controller pod keeps crashing.

Running kubectl get pod -n kubeflow gives:

root@k8master:~# kubectl get pod -n kubeflow
NAME                                READY   STATUS             RESTARTS         AGE
katib-controller-86fbb67df-5mgpx    0/1     CrashLoopBackOff   52 (4m39s ago)   5h49m
katib-db-manager-7c8745f44b-4tzm5   0/1     CrashLoopBackOff   56 (54s ago)     5h49m
katib-mysql-77b9495867-fqb5l        0/1     Pending            0                5h49m
katib-ui-5d9c77cfc4-4bfzl           1/1     Running            0                5h49m

Running kubectl describe pod katib-controller-86fbb67df-5mgpx -n kubeflow gives:

Name:             katib-controller-86fbb67df-5mgpx
Namespace:        kubeflow
Priority:         0
Service Account:  katib-controller
Node:             k8node02/192.168.100.12
Start Time:       Thu, 10 Oct 2024 02:20:03 +0000
Labels:           katib.kubeflow.org/component=controller
                  katib.kubeflow.org/metrics-collector-injection=disabled
                  pod-template-hash=86fbb67df
Annotations:      prometheus.io/port: 8080
                  prometheus.io/scrape: true
                  sidecar.istio.io/inject: false
Status:           Running
IP:               10.244.0.3
IPs:
  IP:           10.244.0.3
Controlled By:  ReplicaSet/katib-controller-86fbb67df
Containers:
  katib-controller:
    Container ID:  docker://ec8cfc87a2c33a75ae61fd2d7ac906ccf52800fb49159e6e6253f129c0fd86bf
    Image:         docker.io/kubeflowkatib/katib-controller:latest
    Image ID:      docker-pullable://kubeflowkatib/katib-controller@sha256:103962f0810467fc5f6edcb46b8343387a289dd113dce38933ab15d3b0713261
    Ports:         8443/TCP, 8080/TCP, 18080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      ./katib-controller
    Args:
      --katib-config=/katib-config.yaml
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 10 Oct 2024 08:10:54 +0000
      Finished:     Thu, 10 Oct 2024 08:11:24 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 10 Oct 2024 08:04:52 +0000
      Finished:     Thu, 10 Oct 2024 08:05:22 +0000
    Ready:          False
    Restart Count:  53
    Liveness:       http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      KATIB_CORE_NAMESPACE:  kubeflow (v1:metadata.namespace)
    Mounts:
      /katib-config.yaml from katib-config (ro,path="katib-config.yaml")
      /tmp/cert from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s4x2k (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  katib-webhook-cert
    Optional:    false
  katib-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      katib-config
    Optional:  false
  kube-api-access-s4x2k:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     36m (x39 over 4h20m)     kubelet  (combined from similar events): Successfully pulled image "docker.io/kubeflowkatib/katib-controller:latest" in 20.234160626s (20.234172377s including waiting)
  Warning  Unhealthy  6m18s (x261 over 4h49m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  BackOff    85s (x1164 over 4h48m)   kubelet  Back-off restarting failed container katib-controller in pod katib-controller-86fbb67df-5mgpx_kubeflow(c1cd3096-6bcc-4db2-969b-8f0ac265ae05)

Thanks!

What did you expect to happen?

Running kubectl get pod -n kubeflow should show every pod as Running:

root@k8master:~# kubectl get pod -n kubeflow
NAME                                READY   STATUS             RESTARTS         AGE
katib-controller-86fbb67df-5mgpx    1/1     Running            52 (4m39s ago)   5h49m
katib-db-manager-7c8745f44b-4tzm5   1/1     Running            56 (54s ago)     5h49m
katib-mysql-77b9495867-fqb5l        1/1     Running            0                5h49m
katib-ui-5d9c77cfc4-4bfzl           1/1     Running            0                5h49m

Environment

Kubernetes version:

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.0", GitCommit:"1b4df30b3cdfeaba6024e81e559a6cd09a089d65", GitTreeState:"clean", BuildDate:"2023-04-11T17:10:18Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.16", GitCommit:"cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a", GitTreeState:"clean", BuildDate:"2024-07-17T01:44:26Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}

Katib controller version:

docker.io/kubeflowkatib/katib-controller:latest

Katib Python SDK version:

Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /root/miniconda3/lib/python3.10/site-packages
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by:

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@andreyvelich
Member

Sorry for the late reply, @qazserfv123!
As far as I can see, your Katib MySQL pod is Pending. Can you describe it?

kubectl describe pod katib-mysql-77b9495867-fqb5l -n kubeflow
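
For anyone following along, here is a minimal sketch of how the unhealthy pods could be picked out of the listing above and inspected one by one. The helper names `not_running` and `inspect_pod` are hypothetical and not part of Katib or kubectl. The PVC check is only an assumption about one common reason a pod stays Pending; nothing in this issue confirms that is the cause here.

```shell
#!/bin/sh
# Print the name and status of every pod that is not Running.
# Reads the table printed by `kubectl get pod` on stdin and skips the header row.
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1 "\t" $3 }'
}

# Collect the usual follow-up details for one pod (hypothetical helper).
# $1 = pod name, $2 = namespace
inspect_pod() {
  kubectl describe pod "$1" -n "$2"     # scheduling events and probe failures
  kubectl logs "$1" -n "$2" --previous  # output of the last terminated container, if any
  kubectl get pvc -n "$2"               # assumption: an unbound claim may keep a pod Pending
}
```

Against a live cluster this might be used as `kubectl get pod -n kubeflow | not_running`, followed by `inspect_pod katib-mysql-77b9495867-fqb5l kubeflow`.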
