[prometheus-kube-stack] missing cpu statistics with chart deployed by argocd #5070

Open
adippl opened this issue Dec 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments


adippl commented Dec 17, 2024

Describe the bug

My kube-prometheus-stack Grafana stopped displaying pod CPU statistics after I upgraded my cluster to 1.30.

The node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate metric is missing from Prometheus.

There are no Prometheus-related alerts in Alertmanager.

What's your helm version?

argocd v2.12.4

What's your kubectl version?

Client Version: v1.30.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.6

Which chart?

kube-prometheus-stack

What's the chart version?

67.2.0

What happened?

The chart deployed by Argo CD doesn't work correctly.

What you expected to happen?

I want kube-prometheus-stack to work when deployed with Argo CD.

How to reproduce it?

Install kube-prometheus-stack with Argo CD.
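
For reference, the chart is deployed through an Argo CD Application roughly like the sketch below (simplified and not my exact manifest; the repo URL, project, release name and sync options are assumed from my setup):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 67.2.0
    helm:
      releaseName: kps-prometheus
      valuesObject:
        fullnameOverride: "kps"
        # ... remaining values as listed in the next section
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated: {}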

Enter the changed values of values.yaml?

fullnameOverride: "kps"
prometheus:
  networkPolicy:
    enabled: false
    flavor: kubernetes
  ingress:
    enabled: true
    ingressClassName: admin-ingress
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    labels: {}
    hosts:
      - prometheus.k8s3.domain.example
    path: /
    tls:
    - hosts:
        - prometheus.k8s3.domain.example
      secretName: prometheus.k8s3.domain.example-tls
    pathType: Prefix
  prometheusSpec:
    priorityClassName: "high-priority"
    externalLabels:
      cluster: k8s3
    retention: 14d
    replicas: 1
    podAntiAffinity: "hard"
    scrapeTimeout: 30s
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector:
      matchLabels:
        prometheus: main
    podMonitorSelector: {}
    podMonitorNamespaceSelector:
      matchLabels:
        prometheus: main
    ruleSelector: {}
    ruleNamespaceSelector:
      matchLabels:
        prometheus: main
    resources:
      requests:
        cpu: 250m
        memory: 1536Mi
      limits:
        cpu: 2000m
        memory: 2048Mi

grafana:
  adminPassword: xxxxxx
  sidecar:
    dashboards:
      searchNamespace: ALL
  serviceMonitor:
    scrapeTimeout: 10s
  ingress:
    enabled: true
    ingressClassName: admin-ingress
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    labels: {}
    hosts:
      - grafana.k8s3.domain.example
    path: /
    tls:
    - hosts:
        - grafana.k8s3.domain.example
      secretName: grafana.k8s3.domain.example-tls
  resources:
    requests:
      cpu: 150m
      memory: 384Mi
    limits:
      cpu: 500m
      memory: 512Mi
  prometheusSpec:
    priorityClassName: "high-priority"

prometheusOperator:
  resources:
    requests:
      cpu: 1m
      memory: 64Mi
    limits:
      cpu: 500m
      memory: 200Mi
  priorityClassName: "high-priority"
  networkPolicy:
    enabled: false

alertmanager:
  alertmanagerSpec:
    priorityClassName: "high-priority"
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 200Mi
  enabled: true
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'email'
      routes:
      - receiver: "null"
        matchers:
        - alertname =~ "Watchdog|InfoInhibitor"
      - receiver: "null"
        matchers:
        - alertname =~ "CephNodeNetworkPacketDrops"
        - device =~ "vnet.*|cilium_wg0"
    receivers:
    - name: "null"
    - name: 'email'
      email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: xxxx.xxxxxxx.xxx:587
        auth_username: 'xxxxxxx'
        auth_password: 'xxxxxxxxxxxxxxxxxxxxxxxxx'
        require_tls: true
    templates:
    - '/etc/alertmanager/config/*.tmpl'

defaultRules:
  rules:
    kubeProxy: false

kubeProxy:
  enabled: false

Enter the command that you execute that is failing/misfunctioning.

Prometheus query: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
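
For context: as far as I understand, this metric is not scraped from any target but produced by a recording rule in the chart's bundled k8s.rules group, roughly like this (approximate; the exact expression may differ between chart versions):

- record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  expr: |
    sum by (cluster, namespace, pod, container) (
      irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
    ) * on (cluster, namespace, pod) group_left (node)
    topk by (cluster, namespace, pod) (
      1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
    )

So it only exists if the kubelet cAdvisor endpoint and kube-state-metrics are both scraped and the PrometheusRule carrying this rule is actually loaded by Prometheus.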

Anything else we need to know?

No response

adippl added the bug label Dec 17, 2024
adippl commented Dec 17, 2024

I think I've encountered the Argo CD instance-changing issue mentioned here: #1769 (comment)

I've tried to change the release: label by changing the Argo CD app name, but it doesn't help.

The kube-state-metrics ServiceMonitor looks fine and its selectors match the kube-state-metrics labels:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2024-12-17T12:29:21Z"
  generation: 1
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: kps-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.14.0
    argocd.argoproj.io/instance: prometheus
    helm.sh/chart: kube-state-metrics-5.27.0
    release: kps-prometheus
  name: kps-prometheus-kube-state-metrics
  namespace: monitoring
  resourceVersion: "21744314"
  uid: 1d6eeb69-6e35-465a-848f-b07cf447a576
spec:
  endpoints:
  - honorLabels: true
    port: http
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/instance: kps-prometheus
      app.kubernetes.io/name: kube-state-metrics

Meanwhile, the kube-state-metrics pod has the following metadata:

Name:             kps-prometheus-kube-state-metrics-78bcb4676d-rrv59
Namespace:        monitoring
Priority:         0
Service Account:  kps-prometheus-kube-state-metrics
Node:             gh-k8s3-worker-1/10.0.9.84
Start Time:       Tue, 17 Dec 2024 13:29:19 +0100
Labels:           app.kubernetes.io/component=metrics
                  app.kubernetes.io/instance=kps-prometheus
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=kube-state-metrics
                  app.kubernetes.io/part-of=kube-state-metrics
                  app.kubernetes.io/version=2.14.0
                  helm.sh/chart=kube-state-metrics-5.27.0
                  pod-template-hash=78bcb4676d
                  release=kps-prometheus

The pod is getting scraped (screenshot: 20241217_15h04m02s_grim).

Is there something else I'm missing here?

I would like to solve this issue without modifying default Argo CD behavior.
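
Since the metric comes from a recording rule, the next thing I plan to check is whether the namespace selectors from my values (serviceMonitorNamespaceSelector / ruleNamespaceSelector with matchLabels prometheus: main) actually match the labels on the monitoring namespace, and whether the chart's PrometheusRule objects are selected at all. Roughly like this (illustrative commands; resource names taken from my install and may differ):

kubectl get namespace monitoring --show-labels
kubectl -n monitoring get prometheus -o jsonpath='{.items[*].spec.serviceMonitorNamespaceSelector}'
kubectl -n monitoring get prometheus -o jsonpath='{.items[*].spec.ruleNamespaceSelector}'
kubectl -n monitoring get prometheusrules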

adippl changed the title from "[prometheus-kube-stack] missing cpu statistics after kubernetes upgrade to 1.30" to "[prometheus-kube-stack] missing cpu statistics with chart deployed by argocd" Dec 17, 2024