[prometheus-kube-stack] missing cpu statistics with chart deployed by argocd #5070

Open
adippl opened this issue Dec 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments


adippl commented Dec 17, 2024

Describe the bug

My kube-prometheus-stack Grafana stopped displaying pod CPU statistics after I upgraded my cluster to 1.30.

The node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate metric is missing from Prometheus.

There are no Prometheus-related alerts in Alertmanager.

What's your helm version?

argocd v2.12.4

What's your kubectl version?

Client Version: v1.30.6
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.6

Which chart?

kube-prometheus-stack

What's the chart version?

67.2.0

What happened?

The chart deployed by Argo CD doesn't work correctly.

What you expected to happen?

I want kube-prometheus-stack to work when deployed with Argo CD.

How to reproduce it?

Install kube-prometheus-stack with Argo CD.
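
For reference, the chart is deployed through an Argo CD Application roughly like the sketch below (simplified and not my exact manifest; the repo URL, project, release name and sync options are assumed from my setup):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 67.2.0
    helm:
      releaseName: kps-prometheus
      valuesObject:
        fullnameOverride: "kps"
        # ... remaining values as listed in the next section
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated: {}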

Enter the changed values of values.yaml?

fullnameOverride: "kps"
prometheus:
  networkPolicy:
    enabled: false
    flavor: kubernetes
  ingress:
    enabled: true
    ingressClassName: admin-ingress
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    labels: {}
    hosts:
      - prometheus.k8s3.domain.example
    path: /
    tls:
    - hosts:
        - prometheus.k8s3.domain.example
      secretName: prometheus.k8s3.domain.example-tls
    pathType: Prefix
  prometheusSpec:
    priorityClassName: "high-priority"
    externalLabels:
      cluster: k8s3
    retention: 14d
    replicas: 1
    podAntiAffinity: "hard"
    scrapeTimeout: 30s
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 200Gi
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector:
      matchLabels:
        prometheus: main
    podMonitorSelector: {}
    podMonitorNamespaceSelector:
      matchLabels:
        prometheus: main
    ruleSelector: {}
    ruleNamespaceSelector:
      matchLabels:
        prometheus: main
    resources:
      requests:
        cpu: 250m
        memory: 1536Mi
      limits:
        cpu: 2000m
        memory: 2048Mi

grafana:
  adminPassword: xxxxxx
  sidecar:
    dashboards:
      searchNamespace: ALL
  serviceMonitor:
    scrapeTimeout: 10s
  ingress:
    enabled: true
    ingressClassName: admin-ingress
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    labels: {}
    hosts:
      - grafana.k8s3.domain.example
    path: /
    tls:
    - hosts:
        - grafana.k8s3.domain.example
      secretName: grafana.k8s3.domain.example-tls
  resources:
    requests:
      cpu: 150m
      memory: 384Mi
    limits:
      cpu: 500m
      memory: 512Mi
  prometheusSpec:
    priorityClassName: "high-priority"

prometheusOperator:
  resources:
    requests:
      cpu: 1m
      memory: 64Mi
    limits:
      cpu: 500m
      memory: 200Mi
  priorityClassName: "high-priority"
  networkPolicy:
    enabled: false

alertmanager:
  alertmanagerSpec:
    priorityClassName: "high-priority"
    resources:
      requests:
        cpu: 10m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 200Mi
  enabled: true
  config:
    global:
      resolve_timeout: 5m
    inhibit_rules:
      - source_matchers:
          - 'severity = critical'
        target_matchers:
          - 'severity =~ warning|info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'severity = warning'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
          - 'alertname'
      - source_matchers:
          - 'alertname = InfoInhibitor'
        target_matchers:
          - 'severity = info'
        equal:
          - 'namespace'
    route:
      group_by: ['namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'email'
      routes:
      - receiver: "null"
        matchers:
        - alertname =~ "Watchdog|InfoInhibitor"
      - receiver: "null"
        matchers:
        - alertname =~ "CephNodeNetworkPacketDrops"
        - device =~ "vnet.*|cilium_wg0"
    receivers:
    - name: "null"
    - name: 'email'
      email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: xxxx.xxxxxxx.xxx:587
        auth_username: 'xxxxxxx'
        auth_password: 'xxxxxxxxxxxxxxxxxxxxxxxxx'
        require_tls: true
    templates:
    - '/etc/alertmanager/config/*.tmpl'

defaultRules:
  rules:
    kubeProxy: false

kubeProxy:
  enabled: false

Enter the command that you execute that is failing/misfunctioning.

Prometheus query: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
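
For context: as far as I understand, this metric is not scraped from any target but produced by a recording rule in the chart's bundled k8s.rules group, roughly like this (approximate; the exact expression may differ between chart versions):

- record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
  expr: |
    sum by (cluster, namespace, pod, container) (
      irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
    ) * on (cluster, namespace, pod) group_left (node)
    topk by (cluster, namespace, pod) (
      1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
    )

So it only exists if the kubelet cAdvisor endpoint and kube-state-metrics are both scraped and the PrometheusRule carrying this rule is actually loaded by Prometheus.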

Anything else we need to know?

No response

adippl added the bug label Dec 17, 2024
adippl commented Dec 17, 2024

I think I've encountered the Argo CD instance-changing issue mentioned here: #1769 (comment)

I've tried to change the release: label by changing the Argo CD app name, but it doesn't help.

The kube-state-metrics ServiceMonitor looks fine and its selectors match the kube-state-metrics labels:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2024-12-17T12:29:21Z"
  generation: 1
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: kps-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.14.0
    argocd.argoproj.io/instance: prometheus
    helm.sh/chart: kube-state-metrics-5.27.0
    release: kps-prometheus
  name: kps-prometheus-kube-state-metrics
  namespace: monitoring
  resourceVersion: "21744314"
  uid: 1d6eeb69-6e35-465a-848f-b07cf447a576
spec:
  endpoints:
  - honorLabels: true
    port: http
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/instance: kps-prometheus
      app.kubernetes.io/name: kube-state-metrics

Meanwhile, the kube-state-metrics pod has the following metadata:

Name:             kps-prometheus-kube-state-metrics-78bcb4676d-rrv59
Namespace:        monitoring
Priority:         0
Service Account:  kps-prometheus-kube-state-metrics
Node:             gh-k8s3-worker-1/10.0.9.84
Start Time:       Tue, 17 Dec 2024 13:29:19 +0100
Labels:           app.kubernetes.io/component=metrics
                  app.kubernetes.io/instance=kps-prometheus
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=kube-state-metrics
                  app.kubernetes.io/part-of=kube-state-metrics
                  app.kubernetes.io/version=2.14.0
                  helm.sh/chart=kube-state-metrics-5.27.0
                  pod-template-hash=78bcb4676d
                  release=kps-prometheus

The pod is getting scraped (screenshot: 20241217_15h04m02s_grim).

Is there something else I'm missing here?

I would like to solve this issue without modifying default Argo CD behavior.
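
Since the metric comes from a recording rule, the next thing I plan to check is whether the namespace selectors from my values (serviceMonitorNamespaceSelector / ruleNamespaceSelector with matchLabels prometheus: main) actually match the labels on the monitoring namespace, and whether the chart's PrometheusRule objects are selected at all. Roughly like this (illustrative commands; resource names taken from my install and may differ):

kubectl get namespace monitoring --show-labels
kubectl -n monitoring get prometheus -o jsonpath='{.items[*].spec.serviceMonitorNamespaceSelector}'
kubectl -n monitoring get prometheus -o jsonpath='{.items[*].spec.ruleNamespaceSelector}'
kubectl -n monitoring get prometheusrules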

adippl changed the title from "[prometheus-kube-stack] missing cpu statistics after kubernetes upgrade to 1.30" to "[prometheus-kube-stack] missing cpu statistics with chart deployed by argocd" Dec 17, 2024