feat(helm): update rook-ceph group to v1.16.0 (minor) #746
base: main
Path: @@ -73,11 +73,25 @@
# imagePullSecrets:
# - name: my-registry-secret
---
+# Source: rook-ceph-cluster/templates/rbac.yaml
+# Service account for other components
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: rook-ceph-default
+ namespace: default # namespace:cluster
+ labels:
+ operator: rook
+ storage-backend: ceph
+# imagePullSecrets:
+# - name: my-registry-secret
+---
# Source: rook-ceph-cluster/templates/configmap.yaml
kind: ConfigMap
apiVersion: v1
metadata:
name: rook-config-override
+ namespace: default # namespace:cluster
data:
config: |2
[global]
@@ -96,16 +110,17 @@
pool: ceph-blockpool
clusterID: default
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
- csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
+ csi.storage.k8s.io/controller-expand-secret-namespace: 'default'
csi.storage.k8s.io/fstype: ext4
csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
- csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
+ csi.storage.k8s.io/node-stage-secret-namespace: 'default'
csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
- csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
+ csi.storage.k8s.io/provisioner-secret-namespace: 'default'
imageFeatures: layering
imageFormat: "2"
reclaimPolicy: Delete
allowVolumeExpansion: true
+volumeBindingMode: Immediate
---
# Source: rook-ceph-cluster/templates/cephfilesystem.yaml
apiVersion: storage.k8s.io/v1
@@ -120,14 +135,15 @@
pool: ceph-filesystem-data0
clusterID: default
csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
- csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
+ csi.storage.k8s.io/controller-expand-secret-namespace: 'default'
csi.storage.k8s.io/fstype: ext4
csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
- csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
+ csi.storage.k8s.io/node-stage-secret-namespace: 'default'
csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
- csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
+ csi.storage.k8s.io/provisioner-secret-namespace: 'default'
reclaimPolicy: Delete
allowVolumeExpansion: true
+volumeBindingMode: Immediate
---
# Source: rook-ceph-cluster/templates/cephobjectstore.yaml
apiVersion: storage.k8s.io/v1
@@ -136,6 +152,7 @@
name: ceph-bucket
provisioner: default.ceph.rook.io/bucket
reclaimPolicy: Delete
+volumeBindingMode: Immediate
parameters:
objectStoreName: ceph-objectstore
objectStoreNamespace: default
@@ -179,10 +196,10 @@
namespace: default # namespace:cluster
rules:
# this is needed for rook's "key-management" CLI to fetch the vault token from the secret when
- # validating the connection details
+ # validating the connection details and for key rotation operations.
- apiGroups: [""]
resources: ["secrets"]
- verbs: ["get"]
+ verbs: ["get", "update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
@@ -191,23 +208,6 @@
verbs: ["get", "list", "create", "update", "delete"]
---
# Source: rook-ceph-cluster/templates/rbac.yaml
-kind: Role
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-rules:
- # Placeholder role so the rgw service account will
- # be generated in the csv. Remove this role and role binding
- # when fixing https://github.com/rook/rook/issues/10141.
- - apiGroups:
- - ""
- resources:
- - configmaps
- verbs:
- - get
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
# Aspects of ceph-mgr that operate within the cluster's namespace
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
@@ -242,9 +242,31 @@
- apiGroups:
- ceph.rook.io
resources:
- - "*"
+ - cephclients
+ - cephclusters
+ - cephblockpools
+ - cephfilesystems
+ - cephnfses
+ - cephobjectstores
+ - cephobjectstoreusers
+ - cephobjectrealms
+ - cephobjectzonegroups
+ - cephobjectzones
+ - cephbuckettopics
+ - cephbucketnotifications
+ - cephrbdmirrors
+ - cephfilesystemmirrors
+ - cephfilesystemsubvolumegroups
+ - cephblockpoolradosnamespaces
+ - cephcosidrivers
verbs:
- - "*"
+ - get
+ - list
+ - watch
+ - create
+ - update
+ - delete
+ - patch
- apiGroups:
- apps
resources:
@@ -339,102 +361,6 @@
- update
---
# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-default-psp
- namespace: default # namespace:cluster
- labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: default
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-rgw-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-mgr-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-mgr
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-cmd-reporter-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-cmd-reporter
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-purge-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-purge-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
# Allow the operator to create resources in this cluster's namespace
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
@@ -467,22 +393,6 @@
namespace: default # namespace:cluster
---
# Source: rook-ceph-cluster/templates/rbac.yaml
-# Allow the rgw pods in this namespace to work with configmaps
-kind: RoleBinding
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: Role
- name: rook-ceph-rgw
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
# Allow the ceph mgr to access resources scoped to the CephCluster namespace necessary for mgr modules
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
@@ -582,6 +492,7 @@
kind: Ingress
metadata:
name: default-dashboard
+ namespace: default # namespace:cluster
spec:
rules:
- host: rook.${SECRET_DOMAIN}
@@ -599,11 +510,14 @@
- hosts:
- rook.${SECRET_DOMAIN}
---
+
+---
# Source: rook-ceph-cluster/templates/cephblockpool.yaml
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
name: ceph-blockpool
+ namespace: default # namespace:cluster
spec:
failureDomain: host
replicated:
@@ -614,12 +528,13 @@
kind: CephCluster
metadata:
name: default
+ namespace: default # namespace:cluster
spec:
monitoring:
enabled: true
cephVersion:
allowUnsupported: false
- image: quay.io/ceph/ceph:v16.2.10
+ image: quay.io/ceph/ceph:v19.2.0
cleanupPolicy:
allowUninstallWithVolumes: false
confirmation: ""
@@ -636,8 +551,6 @@
urlPrefix: /
dataDirHostPath: /var/lib/rook
disruptionManagement:
- machineDisruptionBudgetNamespace: openshift-machine-api
- manageMachineDisruptionBudgets: false
managePodBudgets: true
osdMaintenanceTimeout: 30
pgHealthCheckTimeout: 0
@@ -659,16 +572,24 @@
disabled: false
osd:
disabled: false
+ logCollector:
+ enabled: true
+ maxLogSize: 500M
+ periodicity: daily
mgr:
allowMultiplePerNode: false
count: 2
- modules:
- - enabled: true
- name: pg_autoscaler
+ modules: null
mon:
allowMultiplePerNode: false
count: 3
network:
+ connections:
+ compression:
+ enabled: false
+ encryption:
+ enabled: false
+ requireMsgr2: false
provider: host
priorityClassNames:
mgr: system-cluster-critical
@@ -678,49 +599,48 @@
resources:
cleanup:
limits:
- cpu: 500m
memory: 1Gi
requests:
cpu: 500m
memory: 100Mi
crashcollector:
limits:
- cpu: 500m
memory: 60Mi
requests:
cpu: 100m
memory: 60Mi
+ exporter:
+ limits:
+ memory: 128Mi
+ requests:
+ cpu: 50m
+ memory: 50Mi
logcollector:
limits:
- cpu: 500m
memory: 1Gi
requests:
cpu: 100m
memory: 100Mi
mgr:
limits:
- cpu: 1000m
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
mgr-sidecar:
limits:
- cpu: 500m
memory: 100Mi
requests:
cpu: 100m
memory: 40Mi
mon:
limits:
- cpu: 2000m
memory: 2Gi
requests:
cpu: 1000m
memory: 1Gi
osd:
limits:
- cpu: 2000m
memory: 4Gi
requests:
cpu: 1000m
@@ -747,6 +667,7 @@
name: k8s-worker03
useAllDevices: false
useAllNodes: false
+ upgradeOSDRequiresHealthyPGs: false
waitTimeoutForHealthyOSDInMinutes: 10
---
# Source: rook-ceph-cluster/templates/cephfilesystem.yaml
@@ -754,6 +675,7 @@
kind: CephFilesystem
metadata:
name: ceph-filesystem
+ namespace: default # namespace:cluster
spec:
dataPools:
- failureDomain: host
@@ -769,37 +691,55 @@
priorityClassName: system-cluster-critical
resources:
limits:
- cpu: 2000m
memory: 4Gi
requests:
cpu: 1000m
memory: 4Gi
---
+# Source: rook-ceph-cluster/templates/cephfilesystem.yaml
+apiVersion: ceph.rook.io/v1
+kind: CephFilesystemSubVolumeGroup
+metadata:
+ name: ceph-filesystem-csi # lets keep the svg crd name same as `filesystem name + csi` for the default csi svg
+ namespace: default # namespace:cluster
+spec:
+ # The name of the subvolume group. If not set, the default is the name of the subvolumeGroup CR.
+ name: csi
+ # filesystemName is the metadata name of the CephFilesystem CR where the subvolume group will be created
+ filesystemName: ceph-filesystem
+ # reference https://docs.ceph.com/en/latest/cephfs/fs-volumes/#pinning-subvolumes-and-subvolume-groups
+ # only one out of (export, distributed, random) can be set at a time
+ # by default pinning is set with value: distributed=1
+ # for disabling default values set (distributed=0)
+ pinning:
+ distributed: 1 # distributed=<0, 1> (disabled=0)
+ # export: # export=<0-256> (disabled=-1)
+ # random: # random=[0.0, 1.0](disabled=0.0)
+---
# Source: rook-ceph-cluster/templates/cephobjectstore.yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
name: ceph-objectstore
+ namespace: default # namespace:cluster
spec:
dataPool:
erasureCoded:
codingChunks: 1
dataChunks: 2
failureDomain: host
+ parameters:
+ bulk: "true"
gateway:
instances: 1
port: 80
priorityClassName: system-cluster-critical
resources:
limits:
- cpu: 2000m
memory: 2Gi
requests:
cpu: 1000m
memory: 1Gi
- healthCheck:
- bucket:
- interval: 60s
metadataPool:
failureDomain: host
replicated:
@@ -817,810 +757,881 @@
namespace: default
spec:
# Import the raw prometheus rules since they have descriptions that should not be processed with the helm templates
- # copied from https://github.com/ceph/ceph/blob/master/monitoring/ceph-mixin/prometheus_alerts.yml
+ # Copied from https://github.com/ceph/ceph/blob/master/monitoring/ceph-mixin/prometheus_alerts.yml
+ # Attention: This is not a 1:1 copy of ceph-mixin alerts. This file contains several Rook-related adjustments.
+ # List of main adjustments:
+ # - Alerts related to cephadm are excluded
+ # - The PrometheusJobMissing alert is adjusted for the rook-ceph-mgr job, and the PrometheusJobExporterMissing alert is added
groups:
- - name: cluster health
- rules:
- - alert: CephHealthError
- expr: ceph_health_status == 2
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.2.1
- annotations:
- summary: Cluster is in the ERROR state
- description: >
- The cluster state has been HEALTH_ERROR for more than 5 minutes. Please check "ceph health detail" for more information.
-
- - alert: CephHealthWarning
- expr: ceph_health_status == 1
- for: 15m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- summary: Cluster is in the WARNING state
- description: >
- The cluster state has been HEALTH_WARN for more than 15 minutes. Please check "ceph health detail" for more information.
-
- - name: mon
+ - name: "cluster health"
rules:
- - alert: CephMonDownQuorumAtRisk
- expr: ((ceph_health_detail{name="MON_DOWN"} == 1) * on() (count(ceph_mon_quorum_status == 1) == bool (floor(count(ceph_mon_metadata) / 2) + 1))) == 1
- for: 30s
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.3.1
+ - alert: "CephHealthError"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down
- summary: Monitor quorum is at risk
- description: |
- {{ $min := query "floor(count(ceph_mon_metadata) / 2) +1" | first | value }}Quorum requires a majority of monitors (x {{ $min }}) to be active
- Without quorum the cluster will become inoperable, affecting all services and connected clients.
-
- The following monitors are down:
- {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonDown
- expr: (count(ceph_mon_quorum_status == 0) <= (count(ceph_mon_metadata) - floor(count(ceph_mon_metadata) / 2) + 1))
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down
- summary: One or more monitors down
- description: |
- {{ $down := query "count(ceph_mon_quorum_status == 0)" | first | value }}{{ $s := "" }}{{ if gt $down 1.0 }}{{ $s = "s" }}{{ end }}There are {{ $down }} monitor{{ $s }} down.
- Quorum is still intact, but the loss of an additional monitor will make your cluster inoperable.
-
- The following monitors are down:
- {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonDiskspaceCritical
- expr: ceph_health_detail{name="MON_DISK_CRIT"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.3.2
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-crit
- summary: Filesystem space on at least one monitor is critically low
- description: |
- The free space available to a monitor's store is critically low.
- You should increase the space available to the monitor(s). The default directory
- is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and under
- /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook.
- Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files.
- Also check any other directories under /var/lib/rook and other directories on the
- same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are;
- {{- range query "ceph_mon_metadata"}}
- - {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonDiskspaceLow
- expr: ceph_health_detail{name="MON_DISK_LOW"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-low
- summary: Disk space on at least one monitor is approaching full
- description: |
- The space available to a monitor's store is approaching full (>70% is the default).
- You should increase the space available to the monitor(s). The default directory
- is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and under
- /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook.
- Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files.
- Also check any other directories under /var/lib/rook and other directories on the
- same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are;
- {{- range query "ceph_mon_metadata"}}
- - {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonClockSkew
- expr: ceph_health_detail{name="MON_CLOCK_SKEW"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-clock-skew
- summary: Clock skew detected among monitors
- description: |
- Ceph monitors rely on closely synchronized time to maintain
- quorum and cluster consistency. This event indicates that time on at least
- one mon has drifted too far from the lead mon.
-
- Review cluster status with ceph -s. This will show which monitors
- are affected. Check the time sync status on each monitor host with
- "ceph time-sync-status" and the state and peers of your ntpd or chrony daemon.
- - name: osd
+ description: "The cluster state has been HEALTH_ERROR for more than 5 minutes. Please check 'ceph health detail' for more information."
+ summary: "Ceph is in the ERROR state"
+ expr: "ceph_health_status == 2"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.2.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephHealthWarning"
+ annotations:
+ description: "The cluster state has been HEALTH_WARN for more than 15 minutes. Please check 'ceph health detail' for more information."
+ summary: "Ceph is in the WARNING state"
+ expr: "ceph_health_status == 1"
+ for: "15m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "mon"
rules:
- - alert: CephOSDDownHigh
- expr: count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.1
+ - alert: "CephMonDownQuorumAtRisk"
annotations:
- summary: More than 10% of OSDs are down
- description: |
- {{ $value | humanize }}% or {{ with query "count(ceph_osd_up == 0)" }}{{ . | first | value }}{{ end }} of {{ with query "count(ceph_osd_up)" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%).
-
- The following OSDs are down:
- {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephOSDHostDown
- expr: ceph_health_detail{name="OSD_HOST_DOWN"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.8
- annotations:
- summary: An OSD host is offline
- description: |
- The following OSDs are down:
- {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
- - {{ .Labels.hostname }} : {{ .Labels.ceph_daemon }}
- {{- end }}
- - alert: CephOSDDown
- expr: ceph_health_detail{name="OSD_DOWN"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.2
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-down
- summary: An OSD has been marked down
- description: |
- {{ $num := query "count(ceph_osd_up == 0)" | first | value }}{{ $s := "" }}{{ if gt $num 1.0 }}{{ $s = "s" }}{{ end }}{{ $num }} OSD{{ $s }} down for over 5mins.
-
- The following OSD{{ $s }} {{ if eq $s "" }}is{{ else }}are{{ end }} down:
- {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0"}}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephOSDNearFull
- expr: ceph_health_detail{name="OSD_NEARFULL"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.3
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-nearfull
- summary: OSD(s) running low on free space (NEARFULL)
- description: |
- One or more OSDs have reached the NEARFULL threshold
-
- Use 'ceph health detail' and 'ceph osd df' to identify the problem.
- To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data.
- - alert: CephOSDFull
- expr: ceph_health_detail{name="OSD_FULL"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.6
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-full
- summary: OSD full, writes blocked
- description: |
- An OSD has reached the FULL threshold. Writes to pools that share the
- affected OSD will be blocked.
-
- Use 'ceph health detail' and 'ceph osd df' to identify the problem.
- To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data.
- - alert: CephOSDBackfillFull
- expr: ceph_health_detail{name="OSD_BACKFILLFULL"} > 0
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-backfillfull
- summary: OSD(s) too full for backfill operations
- description: "An OSD has reached the BACKFILL FULL threshold. This will prevent rebalance operations\nfrom completing. \nUse 'ceph health detail' and 'ceph osd df' to identify the problem.\n\nTo resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data.\n"
- - alert: CephOSDTooManyRepairs
- expr: ceph_health_detail{name="OSD_TOO_MANY_REPAIRS"} == 1
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-too-many-repairs
- summary: OSD reports a high number of read errors
- description: |
- Reads from an OSD have used a secondary PG to return data to the client, indicating
- a potential failing disk.
- - alert: CephOSDTimeoutsPublicNetwork
- expr: ceph_health_detail{name="OSD_SLOW_PING_TIME_FRONT"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- summary: Network issues delaying OSD heartbeats (public network)
- description: |
- OSD heartbeats on the cluster's 'public' network (frontend) are running slow. Investigate the network
- for latency or loss issues. Use 'ceph health detail' to show the affected OSDs.
- - alert: CephOSDTimeoutsClusterNetwork
- expr: ceph_health_detail{name="OSD_SLOW_PING_TIME_BACK"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- summary: Network issues delaying OSD heartbeats (cluster network)
- description: |
- OSD heartbeats on the cluster's 'cluster' network (backend) are running slow. Investigate the network
- for latency or loss issues. Use 'ceph health detail' to show the affected OSDs.
- - alert: CephOSDInternalDiskSizeMismatch
- expr: ceph_health_detail{name="BLUESTORE_DISK_SIZE_MISMATCH"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
+ description: "{{ $min := query \"floor(count(ceph_mon_metadata) / 2) + 1\" | first | value }}Quorum requires a majority of monitors (x {{ $min }}) to be active. Without quorum the cluster will become inoperable, affecting all services and connected clients. The following monitors are down: {{- range query \"(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down"
+ summary: "Monitor quorum is at risk"
+ expr: |
+ (
+ (ceph_health_detail{name="MON_DOWN"} == 1) * on() (
+ count(ceph_mon_quorum_status == 1) == bool (floor(count(ceph_mon_metadata) / 2) + 1)
+ )
+ ) == 1
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.3.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephMonDown"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-disk-size-mismatch
- summary: OSD size inconsistency error
description: |
- One or more OSDs have an internal inconsistency between metadata and the size of the device.
- This could lead to the OSD(s) crashing in future. You should redeploy the affected OSDs.
- - alert: CephDeviceFailurePredicted
- expr: ceph_health_detail{name="DEVICE_HEALTH"} == 1
- for: 1m
+ {{ $down := query "count(ceph_mon_quorum_status == 0)" | first | value }}{{ $s := "" }}{{ if gt $down 1.0 }}{{ $s = "s" }}{{ end }}You have {{ $down }} monitor{{ $s }} down. Quorum is still intact, but the loss of an additional monitor will make your cluster inoperable. The following monitors are down: {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down"
+ summary: "One or more monitors down"
+ expr: |
+ count(ceph_mon_quorum_status == 0) <= (count(ceph_mon_metadata) - floor(count(ceph_mon_metadata) / 2) + 1)
+ for: "30s"
labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#id2
- summary: Device(s) predicted to fail soon
- description: |
- The device health module has determined that one or more devices will fail
- soon. To review device status use 'ceph device ls'. To show a specific
- device use 'ceph device info <dev id>'.
-
- Mark the OSD out so that data may migrate to other OSDs. Once
- the OSD has drained, destroy the OSD, replace the device, and redeploy the OSD.
- - alert: CephDeviceFailurePredictionTooHigh
- expr: ceph_health_detail{name="DEVICE_HEALTH_TOOMANY"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.7
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephMonDiskspaceCritical"
+ annotations:
+ description: "The free space available to a monitor's store is critically low. You should increase the space available to the monitor(s). The default directory is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook. Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files. Also check any other directories under /var/lib/rook and other directories on the same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are; {{- range query \"ceph_mon_metadata\"}} - {{ .Labels.hostname }} {{- end }}"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-crit"
+ summary: "Filesystem space on at least one monitor is critically low"
+ expr: "ceph_health_detail{name=\"MON_DISK_CRIT\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.3.2"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephMonDiskspaceLow"
+ annotations:
+ description: "The space available to a monitor's store is approaching full (>70% is the default). You should increase the space available to the monitor(s). The default directory is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook. Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files. Also check any other directories under /var/lib/rook and other directories on the same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are; {{- range query \"ceph_mon_metadata\"}} - {{ .Labels.hostname }} {{- end }}"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-low"
+ summary: "Drive space on at least one monitor is approaching full"
+ expr: "ceph_health_detail{name=\"MON_DISK_LOW\"} == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephMonClockSkew"
+ annotations:
+ description: "Ceph monitors rely on closely synchronized time to maintain quorum and cluster consistency. This event indicates that the time on at least one mon has drifted too far from the lead mon. Review cluster status with ceph -s. This will show which monitors are affected. Check the time sync status on each monitor host with 'ceph time-sync-status' and the state and peers of your ntpd or chrony daemon."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-clock-skew"
+ summary: "Clock skew detected among monitors"
+ expr: "ceph_health_detail{name=\"MON_CLOCK_SKEW\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "osd"
+ rules:
+ - alert: "CephOSDDownHigh"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-toomany
- summary: Too many devices are predicted to fail, unable to resolve
- description: |
- The device health module has determined that devices predicted to
- fail can not be remediated automatically, since too many OSDs would be removed from the
- cluster to ensure performance and availabililty. Prevent data
- integrity issues by adding new OSDs so that data may be relocated.
- - alert: CephDeviceFailureRelocationIncomplete
- expr: ceph_health_detail{name="DEVICE_HEALTH_IN_USE"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
+ description: "{{ $value | humanize }}% or {{ with query \"count(ceph_osd_up == 0)\" }}{{ . | first | value }}{{ end }} of {{ with query \"count(ceph_osd_up)\" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%). The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
+ summary: "More than 10% of OSDs are down"
+ expr: "count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephOSDHostDown"
+ annotations:
+ description: "The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.hostname }} : {{ .Labels.ceph_daemon }} {{- end }}"
+ summary: "An OSD host is offline"
+ expr: "ceph_health_detail{name=\"OSD_HOST_DOWN\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.8"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDDown"
+ annotations:
+ description: |
+ {{ $num := query "count(ceph_osd_up == 0)" | first | value }}{{ $s := "" }}{{ if gt $num 1.0 }}{{ $s = "s" }}{{ end }}{{ $num }} OSD{{ $s }} down for over 5mins. The following OSD{{ $s }} {{ if eq $s "" }}is{{ else }}are{{ end }} down: {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0"}} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-down"
+ summary: "An OSD has been marked down"
+ expr: "ceph_health_detail{name=\"OSD_DOWN\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDNearFull"
+ annotations:
+ description: "One or more OSDs have reached the NEARFULL threshold. Use 'ceph health detail' and 'ceph osd df' to identify the problem. To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-nearfull"
+ summary: "OSD(s) running low on free space (NEARFULL)"
+ expr: "ceph_health_detail{name=\"OSD_NEARFULL\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.3"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDFull"
+ annotations:
+ description: "An OSD has reached the FULL threshold. Writes to pools that share the affected OSD will be blocked. Use 'ceph health detail' and 'ceph osd df' to identify the problem. To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-full"
+ summary: "OSD full, writes blocked"
+ expr: "ceph_health_detail{name=\"OSD_FULL\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.6"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephOSDBackfillFull"
+ annotations:
+ description: "An OSD has reached the BACKFILL FULL threshold. This will prevent rebalance operations from completing. Use 'ceph health detail' and 'ceph osd df' to identify the problem. To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-backfillfull"
+ summary: "OSD(s) too full for backfill operations"
+ expr: "ceph_health_detail{name=\"OSD_BACKFILLFULL\"} > 0"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDTooManyRepairs"
+ annotations:
+ description: "Reads from an OSD have used a secondary PG to return data to the client, indicating a potential failing drive."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-too-many-repairs"
+ summary: "OSD reports a high number of read errors"
+ expr: "ceph_health_detail{name=\"OSD_TOO_MANY_REPAIRS\"} == 1"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDTimeoutsPublicNetwork"
+ annotations:
+ description: "OSD heartbeats on the cluster's 'public' network (frontend) are running slow. Investigate the network for latency or loss issues. Use 'ceph health detail' to show the affected OSDs."
+ summary: "Network issues delaying OSD heartbeats (public network)"
+ expr: "ceph_health_detail{name=\"OSD_SLOW_PING_TIME_FRONT\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDTimeoutsClusterNetwork"
+ annotations:
+ description: "OSD heartbeats on the cluster's 'cluster' network (backend) are slow. Investigate the network for latency issues on this subnet. Use 'ceph health detail' to show the affected OSDs."
+ summary: "Network issues delaying OSD heartbeats (cluster network)"
+ expr: "ceph_health_detail{name=\"OSD_SLOW_PING_TIME_BACK\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDInternalDiskSizeMismatch"
+ annotations:
+ description: "One or more OSDs have an internal inconsistency between metadata and the size of the device. This could lead to the OSD(s) crashing in future. You should redeploy the affected OSDs."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-disk-size-mismatch"
+ summary: "OSD size inconsistency error"
+ expr: "ceph_health_detail{name=\"BLUESTORE_DISK_SIZE_MISMATCH\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephDeviceFailurePredicted"
+ annotations:
+ description: "The device health module has determined that one or more devices will fail soon. To review device status use 'ceph device ls'. To show a specific device use 'ceph device info <dev id>'. Mark the OSD out so that data may migrate to other OSDs. Once the OSD has drained, destroy the OSD, replace the device, and redeploy the OSD."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#id2"
+ summary: "Device(s) predicted to fail soon"
+ expr: "ceph_health_detail{name=\"DEVICE_HEALTH\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephDeviceFailurePredictionTooHigh"
+ annotations:
+ description: "The device health module has determined that devices predicted to fail can not be remediated automatically, since too many OSDs would be removed from the cluster to ensure performance and availability. Prevent data integrity issues by adding new OSDs so that data may be relocated."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-toomany"
+ summary: "Too many devices are predicted to fail, unable to resolve"
+ expr: "ceph_health_detail{name=\"DEVICE_HEALTH_TOOMANY\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.7"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephDeviceFailureRelocationIncomplete"
+ annotations:
+ description: "The device health module has determined that one or more devices will fail soon, but the normal process of relocating the data on the device to other OSDs in the cluster is blocked. \nEnsure that the cluster has available free space. It may be necessary to add capacity to the cluster to allow data from the failing device to successfully migrate, or to enable the balancer."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-in-use"
+ summary: "Device failure is predicted, but unable to relocate data"
+ expr: "ceph_health_detail{name=\"DEVICE_HEALTH_IN_USE\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDFlapping"
+ annotations:
+ description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} was marked down and back up {{ $value | humanize }} times once a minute for 5 minutes. This may indicate a network issue (latency, packet loss, MTU mismatch) on the cluster network, or the public network if no cluster network is deployed. Check the network stats on the listed host(s)."
+ documentation: "https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd#flapping-osds"
+ summary: "Network issues are causing OSDs to flap (mark each other down)"
+ expr: "(rate(ceph_osd_up[5m]) * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) * 60 > 1"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.4"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDReadErrors"
+ annotations:
+ description: "An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors"
+ summary: "Device read errors detected"
+ expr: "ceph_health_detail{name=\"BLUESTORE_SPURIOUS_READ_ERRORS\"} == 1"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGImbalance"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-in-use
- summary: Device failure is predicted, but unable to relocate data
- description: |
- The device health module has determined that one or more devices will fail
- soon, but the normal process of relocating the data on the device to other
- OSDs in the cluster is blocked.
-
- Ensure that the cluster has available free space. It may be necessary to add
- capacity to the cluster to allow the data from the failing device to
- successfully migrate, or to enable the balancer.
- - alert: CephOSDFlapping
- expr: |
- (
- rate(ceph_osd_up[5m])
- * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
- ) * 60 > 1
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.4
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd#flapping-osds
- summary: Network issues are causing OSDs to flap (mark each other down)
- description: >
- OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} was marked down and back up {{ $value | humanize }} times once a minute for 5 minutes. This may indicate a network issue (latency, packet loss, MTU mismatch) on the cluster network, or the public network if no cluster network is deployed. Check network stats on the listed host(s).
-
- - alert: CephOSDReadErrors
- expr: ceph_health_detail{name="BLUESTORE_SPURIOUS_READ_ERRORS"} == 1
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors
- summary: Device read errors detected
- description: >
- An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel.
-
- # alert on high deviation from average PG count
- - alert: CephPGImbalance
+ description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
+ summary: "PGs are not balanced across OSDs"
expr: |
abs(
- (
- (ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)
- ) / on (job) group_left avg(ceph_osd_numpg > 0) by (job)
+ ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
+ on (job) group_left avg(ceph_osd_numpg > 0) by (job)
) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
- for: 5m
+ for: "5m"
labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.5
- annotations:
- summary: PGs are not balanced across OSDs
- description: >
- OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count.
-
- # alert on high commit latency...but how high is too high
- - name: mds
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.5"
+ severity: "warning"
+ type: "ceph_default"
+ - name: "mds"
rules:
- - alert: CephFilesystemDamaged
- expr: ceph_health_detail{name="MDS_DAMAGE"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.1
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages
- summary: CephFS filesystem is damaged.
- description: >
- Filesystem metadata has been corrupted. Data may be inaccessible. Analyze metrics from the MDS daemon admin socket, or escalate to support.
-
- - alert: CephFilesystemOffline
- expr: ceph_health_detail{name="MDS_ALL_DOWN"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.3
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-all-down
- summary: CephFS filesystem is offline
- description: >
- All MDS ranks are unavailable. The MDS daemons managing metadata are down, rendering the filesystem offline.
-
- - alert: CephFilesystemDegraded
- expr: ceph_health_detail{name="FS_DEGRADED"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.4
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-degraded
- summary: CephFS filesystem is degraded
- description: >
- One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available, at worst the filesystem is completely unusable.
-
- - alert: CephFilesystemMDSRanksLow
- expr: ceph_health_detail{name="MDS_UP_LESS_THAN_MAX"} > 0
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-up-less-than-max
- summary: MDS daemon count is lower than configured
- description: >
- The filesystem's "max_mds" setting defines the number of MDS ranks in the filesystem. The current number of active MDS daemons is less than this value.
-
- - alert: CephFilesystemInsufficientStandby
- expr: ceph_health_detail{name="MDS_INSUFFICIENT_STANDBY"} > 0
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-insufficient-standby
- summary: Ceph filesystem standby daemons too few
- description: >
- The minimum number of standby daemons required by standby_count_wanted is less than the current number of standby daemons. Adjust the standby count or increase the number of MDS daemons.
-
- - alert: CephFilesystemFailureNoStandby
- expr: ceph_health_detail{name="FS_WITH_FAILED_MDS"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.5
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-with-failed-mds
- summary: MDS daemon failed, no further standby available
- description: >
- An MDS daemon has failed, leaving only one active rank and no available standby. Investigate the cause of the failure or add a standby MDS.
-
- - alert: CephFilesystemReadOnly
- expr: ceph_health_detail{name="MDS_HEALTH_READ_ONLY"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.2
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages
- summary: CephFS filesystem in read only mode due to write error(s)
- description: >
- The filesystem has switched to READ ONLY due to an unexpected error when writing to the metadata pool.
-
- Analyze the output from the MDS daemon admin socket, or escalate to support.
-
- - name: mgr
+ - alert: "CephFilesystemDamaged"
+ annotations:
+ description: "Filesystem metadata has been corrupted. Data may be inaccessible. Analyze metrics from the MDS daemon admin socket, or escalate to support."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages"
+ summary: "CephFS filesystem is damaged."
+ expr: "ceph_health_detail{name=\"MDS_DAMAGE\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemOffline"
+ annotations:
+ description: "All MDS ranks are unavailable. The MDS daemons managing metadata are down, rendering the filesystem offline."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-all-down"
+ summary: "CephFS filesystem is offline"
+ expr: "ceph_health_detail{name=\"MDS_ALL_DOWN\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemDegraded"
+ annotations:
+ description: "One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available, at worst the filesystem is completely unusable."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-degraded"
+ summary: "CephFS filesystem is degraded"
+ expr: "ceph_health_detail{name=\"FS_DEGRADED\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemMDSRanksLow"
+ annotations:
+ description: "The filesystem's 'max_mds' setting defines the number of MDS ranks in the filesystem. The current number of active MDS daemons is less than this value."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-up-less-than-max"
+ summary: "Ceph MDS daemon count is lower than configured"
+ expr: "ceph_health_detail{name=\"MDS_UP_LESS_THAN_MAX\"} > 0"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephFilesystemInsufficientStandby"
+ annotations:
+ description: "The minimum number of standby daemons required by standby_count_wanted is less than the current number of standby daemons. Adjust the standby count or increase the number of MDS daemons."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-insufficient-standby"
+ summary: "Ceph filesystem standby daemons too few"
+ expr: "ceph_health_detail{name=\"MDS_INSUFFICIENT_STANDBY\"} > 0"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephFilesystemFailureNoStandby"
+ annotations:
+ description: "An MDS daemon has failed, leaving only one active rank and no available standby. Investigate the cause of the failure or add a standby MDS."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-with-failed-mds"
+ summary: "MDS daemon failed, no further standby available"
+ expr: "ceph_health_detail{name=\"FS_WITH_FAILED_MDS\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.5"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemReadOnly"
+ annotations:
+ description: "The filesystem has switched to READ ONLY due to an unexpected error when writing to the metadata pool. Either analyze the output from the MDS daemon admin socket, or escalate to support."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages"
+ summary: "CephFS filesystem in read only mode due to write error(s)"
+ expr: "ceph_health_detail{name=\"MDS_HEALTH_READ_ONLY\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.2"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "mgr"
rules:
- - alert: CephMgrModuleCrash
- expr: ceph_health_detail{name="RECENT_MGR_MODULE_CRASH"} == 1
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.6.1
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#recent-mgr-module-crash
- summary: A manager module has recently crashed
- description: >
- One or more mgr modules have crashed and have yet to be acknowledged by an administrator. A crashed module may impact functionality within the cluster. Use the 'ceph crash' command to determine which module has failed, and archive it to acknowledge the failure.
-
- - alert: CephMgrPrometheusModuleInactive
- expr: up{job="ceph"} == 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.6.2
- annotations:
- summary: The mgr/prometheus module is not available
- description: >
- The mgr/prometheus module at {{ $labels.instance }} is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down.
-
- Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine the mgr/prometheus module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.
-
- - name: pgs
+ - alert: "CephMgrModuleCrash"
+ annotations:
+ description: "One or more mgr modules have crashed and have yet to be acknowledged by an administrator. A crashed module may impact functionality within the cluster. Use the 'ceph crash' command to determine which module has failed, and archive it to acknowledge the failure."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#recent-mgr-module-crash"
+ summary: "A manager module has recently crashed"
+ expr: "ceph_health_detail{name=\"RECENT_MGR_MODULE_CRASH\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.6.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephMgrPrometheusModuleInactive"
+ annotations:
+ description: "The mgr/prometheus module at {{ $labels.instance }} is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'."
+ summary: "The mgr/prometheus module is not available"
+ expr: "up{job=\"ceph\"} == 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.6.2"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "pgs"
rules:
- - alert: CephPGsInactive
- expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_active) > 0
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.1
- annotations:
- summary: One or more placement groups are inactive
- description: >
- {{ $value }} PGs have been inactive for more than 5 minutes in pool {{ $labels.name }}. Inactive placement groups are not able to serve read/write requests.
-
- - alert: CephPGsUnclean
- expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_clean) > 0
- for: 15m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.2
- annotations:
- summary: One or more placement groups are marked unclean
- description: >
- {{ $value }} PGs have been unclean for more than 15 minutes in pool {{ $labels.name }}. Unclean PGs have not recovered from a previous failure.
-
- - alert: CephPGsDamaged
- expr: ceph_health_detail{name=~"PG_DAMAGED|OSD_SCRUB_ERRORS"} == 1
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.4
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-damaged
- summary: Placement group damaged; manual intervention needed
- description: >
- Scrubs have flagged at least one PG as damaged or inconsistent.
-
- Check to see which PG is affected, and attempt a manual repair if necessary. To list problematic placement groups, use 'ceph health detail' or 'rados list-inconsistent-pg <pool>'. To repair PGs use the 'ceph pg repair <pg_num>' command.
-
- - alert: CephPGRecoveryAtRisk
- expr: ceph_health_detail{name="PG_RECOVERY_FULL"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.5
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-recovery-full
- summary: OSDs are too full for recovery
- description: >
- Data redundancy is at risk since one or more OSDs are at or above the 'full' threshold. Add capacity to the cluster, restore down/out OSDs, or delete unwanted data.
-
- - alert: CephPGUnavailableBlockingIO
- # PG_AVAILABILITY, but an OSD is not in a DOWN state
- expr: ((ceph_health_detail{name="PG_AVAILABILITY"} == 1) - scalar(ceph_health_detail{name="OSD_DOWN"})) == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.3
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-availability
- summary: PG is unavailable, blocking I/O
- description: >
- Data availability is reduced, impacting the cluster's ability to service I/O. One or more placement groups (PGs) are in a state that blocks I/O.
-
- - alert: CephPGBackfillAtRisk
- expr: ceph_health_detail{name="PG_BACKFILL_FULL"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.6
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-backfill-full
- summary: Backfill operations are blocked due to lack of free space
- description: >
- Data redundancy may be at risk due to lack of free space within the cluster. One or more OSDs have breached their 'backfillfull' threshold. Add more capacity, or delete unwanted data.
-
- - alert: CephPGNotScrubbed
- expr: ceph_health_detail{name="PG_NOT_SCRUBBED"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
+ - alert: "CephPGsInactive"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-scrubbed
- summary: Placement group(s) have not been scrubbed
- description: |
- One or more PGs have not been scrubbed recently. Scrubs check metadata integrity,
- protecting against bit-rot. They check that metadata
- is consistent across data replicas. When PGs miss their scrub interval, it may
- indicate that the scrub window is too small, or PGs were not in a 'clean' state during the
- scrub window.
-
- You can manually initiate a scrub with: ceph pg scrub <pgid>
- - alert: CephPGsHighPerOSD
- expr: ceph_health_detail{name="TOO_MANY_PGS"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
+ description: "{{ $value }} PGs have been inactive for more than 5 minutes in pool {{ $labels.name }}. Inactive placement groups are not able to serve read/write requests."
+ summary: "One or more placement groups are inactive"
+ expr: "ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_active) > 0"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGsUnclean"
+ annotations:
+ description: "{{ $value }} PGs have been unclean for more than 15 minutes in pool {{ $labels.name }}. Unclean PGs have not recovered from a previous failure."
+ summary: "One or more placement groups are marked unclean"
+ expr: "ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_clean) > 0"
+ for: "15m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGsDamaged"
+ annotations:
+ description: "During data consistency checks (scrub), at least one PG has been flagged as being damaged or inconsistent. Check to see which PG is affected, and attempt a manual repair if necessary. To list problematic placement groups, use 'rados list-inconsistent-pg <pool>'. To repair PGs use the 'ceph pg repair <pg_num>' command."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-damaged"
+ summary: "Placement group damaged, manual intervention needed"
+ expr: "ceph_health_detail{name=~\"PG_DAMAGED|OSD_SCRUB_ERRORS\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGRecoveryAtRisk"
+ annotations:
+ description: "Data redundancy is at risk since one or more OSDs are at or above the 'full' threshold. Add more capacity to the cluster, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-recovery-full"
+ summary: "OSDs are too full for recovery"
+ expr: "ceph_health_detail{name=\"PG_RECOVERY_FULL\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.5"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGUnavailableBlockingIO"
+ annotations:
+ description: "Data availability is reduced, impacting the cluster's ability to service I/O. One or more placement groups (PGs) are in a state that blocks I/O."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-availability"
+ summary: "PG is unavailable, blocking I/O"
+ expr: "((ceph_health_detail{name=\"PG_AVAILABILITY\"} == 1) - scalar(ceph_health_detail{name=\"OSD_DOWN\"})) == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGBackfillAtRisk"
+ annotations:
+ description: "Data redundancy may be at risk due to lack of free space within the cluster. One or more OSDs have reached the 'backfillfull' threshold. Add more capacity, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-backfill-full"
+ summary: "Backfill operations are blocked due to lack of free space"
+ expr: "ceph_health_detail{name=\"PG_BACKFILL_FULL\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.6"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGNotScrubbed"
+ annotations:
+ description: "One or more PGs have not been scrubbed recently. Scrubs check metadata integrity, protecting against bit-rot. They check that metadata is consistent across data replicas. When PGs miss their scrub interval, it may indicate that the scrub window is too small, or PGs were not in a 'clean' state during the scrub window. You can manually initiate a scrub with: ceph pg scrub <pgid>"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-scrubbed"
+ summary: "Placement group(s) have not been scrubbed"
+ expr: "ceph_health_detail{name=\"PG_NOT_SCRUBBED\"} == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGsHighPerOSD"
+ annotations:
+ description: "The number of placement groups per OSD is too high (exceeds the mon_max_pg_per_osd setting).\n Check that the pg_autoscaler has not been disabled for any pools with 'ceph osd pool autoscale-status', and that the profile selected is appropriate. You may also adjust the target_size_ratio of a pool to guide the autoscaler based on the expected relative size of the pool ('ceph osd pool set cephfs.cephfs.meta target_size_ratio .1') or set the pg_autoscaler mode to 'warn' and adjust pg_num appropriately for one or more pools."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks/#too-many-pgs"
+ summary: "Placement groups per OSD is too high"
+ expr: "ceph_health_detail{name=\"TOO_MANY_PGS\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGNotDeepScrubbed"
+ annotations:
+ description: "One or more PGs have not been deep scrubbed recently. Deep scrubs protect against bit-rot. They compare data replicas to ensure consistency. When PGs miss their deep scrub interval, it may indicate that the window is too small or PGs were not in a 'clean' state during the deep-scrub window."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-deep-scrubbed"
+ summary: "Placement group(s) have not been deep scrubbed"
+ expr: "ceph_health_detail{name=\"PG_NOT_DEEP_SCRUBBED\"} == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "nodes"
+ rules:
+ - alert: "CephNodeRootFilesystemFull"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks/#too-many-pgs
- summary: Placement groups per OSD is too high
- description: |
- The number of placement groups per OSD is too high (exceeds the mon_max_pg_per_osd setting).
-
- Check that the pg_autoscaler has not been disabled for any pools with 'ceph osd pool autoscale-status',
- and that the profile selected is appropriate. You may also adjust the target_size_ratio of a pool to guide
- the autoscaler based on the expected relative size of the pool
- ('ceph osd pool set cephfs.cephfs.meta target_size_ratio .1') or set the pg_autoscaler
- mode to "warn" and adjust pg_num appropriately for one or more pools.
- - alert: CephPGNotDeepScrubbed
- expr: ceph_health_detail{name="PG_NOT_DEEP_SCRUBBED"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
+ description: "Root volume is dangerously full: {{ $value | humanize }}% free."
+ summary: "Root filesystem is dangerously full"
+ expr: "node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100 < 5"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephNodeNetworkPacketDrops"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-deep-scrubbed
- summary: Placement group(s) have not been deep scrubbed
- description: |
- One or more PGs have not been deep scrubbed recently. Deep scrubs
- protect against bit-rot. They compare data
- replicas to ensure consistency. When PGs miss their deep scrub interval, it may indicate
- that the window is too small or PGs were not in a 'clean' state during the deep-scrub
- window.
-
- You can manually initiate a deep scrub with: ceph pg deep-scrub <pgid>
- - name: nodes
- rules:
- - alert: CephNodeRootFilesystemFull
- expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 5
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.1
- annotations:
- summary: Root filesystem is dangerously full
- description: >
- Root volume is dangerously full: {{ $value | humanize }}% free.
-
- # alert on packet errors and drop rate
- - alert: CephNodeNetworkPacketDrops
+ description: "Node {{ $labels.instance }} experiences packet drop > 0.5% or > 10 packets/s on interface {{ $labels.device }}."
+ summary: "One or more NICs reports packet drops"
expr: |
(
- increase(node_network_receive_drop_total{device!="lo"}[1m]) +
- increase(node_network_transmit_drop_total{device!="lo"}[1m])
+ rate(node_network_receive_drop_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_drop_total{device!="lo"}[1m])
) / (
- increase(node_network_receive_packets_total{device!="lo"}[1m]) +
- increase(node_network_transmit_packets_total{device!="lo"}[1m])
- ) >= 0.0001 or (
- increase(node_network_receive_drop_total{device!="lo"}[1m]) +
- increase(node_network_transmit_drop_total{device!="lo"}[1m])
+ rate(node_network_receive_packets_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_packets_total{device!="lo"}[1m])
+ ) >= 0.0050000000000000001 and (
+ rate(node_network_receive_drop_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_drop_total{device!="lo"}[1m])
) >= 10
labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.2
- annotations:
- summary: One or more NICs reports packet drops
- description: >
- Node {{ $labels.instance }} experiences packet drop > 0.01% or > 10 packets/s on interface {{ $labels.device }}.
-
- - alert: CephNodeNetworkPacketErrors
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeNetworkPacketErrors"
+ annotations:
+ description: "Node {{ $labels.instance }} experiences packet errors > 0.01% or > 10 packets/s on interface {{ $labels.device }}."
+ summary: "One or more NICs reports packet errors"
expr: |
(
- increase(node_network_receive_errs_total{device!="lo"}[1m]) +
- increase(node_network_transmit_errs_total{device!="lo"}[1m])
+ rate(node_network_receive_errs_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_errs_total{device!="lo"}[1m])
) / (
- increase(node_network_receive_packets_total{device!="lo"}[1m]) +
- increase(node_network_transmit_packets_total{device!="lo"}[1m])
+ rate(node_network_receive_packets_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_packets_total{device!="lo"}[1m])
) >= 0.0001 or (
- increase(node_network_receive_errs_total{device!="lo"}[1m]) +
- increase(node_network_transmit_errs_total{device!="lo"}[1m])
+ rate(node_network_receive_errs_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_errs_total{device!="lo"}[1m])
) >= 10
labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.3
- annotations:
- summary: One or more NICs reports packet errors
- description: >
- Node {{ $labels.instance }} experiences packet errors > 0.01% or > 10 packets/s on interface {{ $labels.device }}.
-
- # Restrict to device names beginning with '/' to skip false alarms from
- # tmpfs, overlay type filesystems
- - alert: CephNodeDiskspaceWarning
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.3"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeNetworkBondDegraded"
+ annotations:
+ description: "Bond {{ $labels.master }} is degraded on Node {{ $labels.instance }}."
+ summary: "Degraded Bond on Node {{ $labels.instance }}"
expr: |
- predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) *
- on(instance) group_left(nodename) node_uname_info < 0
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.4
- annotations:
- summary: Host filesystem free space is low
- description: >
- Mountpoint {{ $labels.mountpoint }} on {{ $labels.nodename }} will be full in less than 5 days based on the 48 hour trailing fill rate.
-
- - alert: CephNodeInconsistentMTU
- expr: node_network_mtu_bytes{device!="lo"} * (node_network_up{device!="lo"} > 0) != on() group_left() (quantile(0.5, node_network_mtu_bytes{device!="lo"}))
+ node_bonding_slaves - node_bonding_active != 0
labels:
- severity: warning
- type: ceph_default
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeDiskspaceWarning"
+ annotations:
+ description: "Mountpoint {{ $labels.mountpoint }} on {{ $labels.nodename }} will be full in less than 5 days based on the 48 hour trailing fill rate."
+ summary: "Host filesystem free space is getting low"
+ expr: "predict_linear(node_filesystem_free_bytes{device=~\"/.*\"}[2d], 3600 * 24 * 5) *on(instance) group_left(nodename) node_uname_info < 0"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.4"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeInconsistentMTU"
+ annotations:
+ description: "Node {{ $labels.instance }} has a different MTU size ({{ $value }}) than the median of devices named {{ $labels.device }}."
+ summary: "MTU settings across Ceph hosts are inconsistent"
+ expr: "node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0) == scalar( max by (device) (node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) != quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) )or node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0) == scalar( min by (device) (node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) != quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) )"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "pools"
+ rules:
+ - alert: "CephPoolGrowthWarning"
annotations:
- summary: MTU settings across hosts are inconsistent
- description: >
- Node {{ $labels.instance }} has a different MTU size ({{ $value }}) than the median value on device {{ $labels.device }}.
-
- - name: pools
+ description: "Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours."
+ summary: "Pool growth rate may soon exceed capacity"
+ expr: "(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id, instance, pod) group_right() ceph_pool_metadata) >= 95"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.9.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPoolBackfillFull"
+ annotations:
+ description: "A pool is approaching the near full threshold, which will prevent recovery/backfill operations from completing. Consider adding more capacity."
+ summary: "Free space in a pool is too low for recovery/backfill"
+ expr: "ceph_health_detail{name=\"POOL_BACKFILLFULL\"} > 0"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPoolFull"
+ annotations:
+ description: "A pool has reached its MAX quota, or OSDs supporting the pool have reached the FULL threshold. Until this is resolved, writes to the pool will be blocked. Pool Breakdown (top 5) {{- range query \"topk(5, sort_desc(ceph_pool_percent_used * on(pool_id) group_right ceph_pool_metadata))\" }} - {{ .Labels.name }} at {{ .Value }}% {{- end }} Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>)"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pool-full"
+ summary: "Pool is full - writes are blocked"
+ expr: "ceph_health_detail{name=\"POOL_FULL\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.9.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPoolNearFull"
+ annotations:
+ description: "A pool has exceeded the warning (percent full) threshold, or OSDs supporting the pool have reached the NEARFULL threshold. Writes may continue, but you are at risk of the pool going read-only if more capacity isn't made available. Determine the affected pool with 'ceph df detail', looking at QUOTA BYTES and STORED. Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>). Also ensure that the balancer is active."
+ summary: "One or more Ceph pools are nearly full"
+ expr: "ceph_health_detail{name=\"POOL_NEAR_FULL\"} > 0"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "healthchecks"
rules:
- - alert: CephPoolGrowthWarning
- expr: |
- (predict_linear((max(ceph_pool_percent_used) without (pod, instance))[2d:1h], 3600 * 24 * 5) * on(pool_id)
- group_right ceph_pool_metadata) >= 95
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.9.2
- annotations:
- summary: Pool growth rate may soon exceed capacity
- description: >
- Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours.
-
- - alert: CephPoolBackfillFull
- expr: ceph_health_detail{name="POOL_BACKFILLFULL"} > 0
- labels:
- severity: warning
- type: ceph_default
+ - alert: "CephSlowOps"
annotations:
- summary: Free space in a pool is too low for recovery/backfill
- description: >
- A pool is approaching the near full threshold, which will prevent recovery/backfill from completing. Consider adding more capacity.
-
- - alert: CephPoolFull
- expr: ceph_health_detail{name="POOL_FULL"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.9.1
+ description: "{{ $value }} OSD requests are taking too long to process (osd_op_complaint_time exceeded)"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops"
+ summary: "OSD operations are slow to complete"
+ expr: "ceph_healthcheck_slow_ops > 0"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephDaemonSlowOps"
+ annotations:
+ description: "{{ $labels.ceph_daemon }} operations are taking too long to process (complaint time exceeded)"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops"
+ summary: "{{ $labels.ceph_daemon }} operations are slow to complete"
+ expr: "ceph_daemon_health_metrics{type=\"SLOW_OPS\"} > 0"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "hardware"
+ rules:
+ - alert: "HardwareStorageError"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pool-full
- summary: Pool is full - writes are blocked
- description: |
- A pool has reached its MAX quota, or OSDs supporting the pool
- have reached the FULL threshold. Until this is resolved, writes to
- the pool will be blocked.
- Pool Breakdown (top 5)
- {{- range query "topk(5, sort_desc(ceph_pool_percent_used * on(pool_id) group_right ceph_pool_metadata))" }}
- - {{ .Labels.name }} at {{ .Value }}%
- {{- end }}
- Increase the pool's quota, or add capacity to the cluster
- then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>)
- - alert: CephPoolNearFull
- expr: ceph_health_detail{name="POOL_NEAR_FULL"} > 0
- for: 5m
- labels:
- severity: warning
- type: ceph_default
+ description: "Some storage devices are in error. Check `ceph health detail`."
+ summary: "Storage devices error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_STORAGE\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareMemoryError"
+ annotations:
+ description: "DIMM error(s) detected. Check `ceph health detail`."
+ summary: "DIMM error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_MEMORY\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.2"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareProcessorError"
+ annotations:
+ description: "Processor error(s) detected. Check `ceph health detail`."
+ summary: "Processor error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_PROCESSOR\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareNetworkError"
+ annotations:
+ description: "Network error(s) detected. Check `ceph health detail`."
+ summary: "Network error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_NETWORK\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwarePowerError"
+ annotations:
+ description: "Power supply error(s) detected. Check `ceph health detail`."
+ summary: "Power supply error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_POWER\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.5"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareFanError"
+ annotations:
+ description: "Fan error(s) detected. Check `ceph health detail`."
+ summary: "Fan error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_FANS\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.6"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "PrometheusServer"
+ rules:
+ - alert: "PrometheusJobMissing"
annotations:
- summary: One or more Ceph pools are nearly full
- description: |
- A pool has exceeded the warning (percent full) threshold, or OSDs
- supporting the pool have reached the NEARFULL threshold. Writes may
- continue, but you are at risk of the pool going read-only if more capacity
- isn't made available.
-
- Determine the affected pool with 'ceph df detail', looking
- at QUOTA BYTES and STORED. Increase the pool's quota, or add
- capacity to the cluster then increase the pool's quota
- (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>).
- Also ensure that the balancer is active.
- - name: healthchecks
+ description: "The prometheus job that scrapes from Ceph MGR is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance."
+ summary: "The scrape job for Ceph MGR is missing from Prometheus"
+ expr: "absent(up{job=\"rook-ceph-mgr\"})"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.12.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "PrometheusJobExporterMissing"
+ annotations:
+ description: "The prometheus job that scrapes from Ceph Exporter is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance."
+ summary: "The scrape job for Ceph Exporter is missing from Prometheus"
+ expr: "sum(absent(up{job=\"rook-ceph-exporter\"})) and sum(ceph_osd_metadata{ceph_version=~\"^ceph version (1[89]|[2-9][0-9]).*\"}) > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.12.1"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "rados"
rules:
- - alert: CephSlowOps
- expr: ceph_healthcheck_slow_ops > 0
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops
- summary: OSD operations are slow to complete
- description: >
- {{ $value }} OSD requests are taking too long to process (osd_op_complaint_time exceeded)
-
- # Object related events
- - name: rados
+ - alert: "CephObjectMissing"
+ annotations:
+ description: "The latest version of a RADOS object can not be found, even though all OSDs are up. I/O requests for this object from clients will block (hang). Resolving this issue may require the object to be rolled back to a prior version manually, and manually verified."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#object-unfound"
+ summary: "Object(s) marked UNFOUND"
+ expr: "(ceph_health_detail{name=\"OBJECT_UNFOUND\"} == 1) * on() (count(ceph_osd_up == 1) == bool count(ceph_osd_metadata)) == 1"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.1"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "generic"
rules:
- - alert: CephObjectMissing
- expr: (ceph_health_detail{name="OBJECT_UNFOUND"} == 1) * on() (count(ceph_osd_up == 1) == bool count(ceph_osd_metadata)) == 1
- for: 30s
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.10.1
+ - alert: "CephDaemonCrash"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#object-unfound
- summary: Object(s) marked UNFOUND
- description: |
- The latest version of a RADOS object can not be found, even though all OSDs are up. I/O
- requests for this object from clients will block (hang). Resolving this issue may
- require the object to be rolled back to a prior version manually, and manually verified.
- # Generic
- - name: generic
+ description: "One or more daemons have crashed recently, and need to be acknowledged. This notification ensures that software crashes do not go unseen. To acknowledge a crash, use the 'ceph crash archive <id>' command."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash"
+ summary: "One or more Ceph daemons have crashed, and are pending acknowledgement"
+ expr: "ceph_health_detail{name=\"RECENT_CRASH\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.1.2"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "rbdmirror"
rules:
- - alert: CephDaemonCrash
- expr: ceph_health_detail{name="RECENT_CRASH"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.1.2
+ - alert: "CephRBDMirrorImagesPerDaemonHigh"
annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash
- summary: One or more Ceph daemons have crashed, and are pending acknowledgement
- description: |
- One or more daemons have crashed recently, and need to be acknowledged. This notification
- ensures that software crashes do not go unseen. To acknowledge a crash, use the
- 'ceph crash archive <id>' command.
+ description: "Number of image replications per daemon is not supposed to go beyond threshold 100"
+ summary: "Number of image replications are now above 100"
+ expr: "sum by (ceph_daemon, namespace) (ceph_rbd_mirror_snapshot_image_snapshots) > 100"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.2"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephRBDMirrorImagesNotInSync"
+ annotations:
+ description: "Both local and remote RBD mirror images should be in sync."
+ summary: "Some of the RBD mirror images are not in sync with the remote counter parts."
+ expr: "sum by (ceph_daemon, image, namespace, pool) (topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_local_timestamp) - topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_remote_timestamp)) != 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephRBDMirrorImagesNotInSyncVeryHigh"
+ annotations:
+ description: "More than 10% of the images have synchronization problems"
+ summary: "Number of unsynchronized images are very high."
+ expr: "count by (ceph_daemon) ((topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_local_timestamp) - topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_remote_timestamp)) != 0) > (sum by (ceph_daemon) (ceph_rbd_mirror_snapshot_snapshots)*.1)"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephRBDMirrorImageTransferBandwidthHigh"
+ annotations:
+ description: "Detected a heavy increase in bandwidth for rbd replications (over 80%) in the last 30 min. This might not be a problem, but it is good to review the number of images being replicated simultaneously"
+ summary: "The replication network usage has been increased over 80% in the last 30 minutes. Review the number of images being replicated. This alert will be cleaned automatically after 30 minutes"
+ expr: "rate(ceph_rbd_mirror_journal_replay_bytes[30m]) > 0.80"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.5"
+ severity: "warning"
+ type: "ceph_default"
+ - name: "nvmeof"
+ rules:
+ - alert: "NVMeoFSubsystemNamespaceLimit"
+ annotations:
+ description: "Subsystems have a max namespace limit defined at creation time. This alert means that no more namespaces can be added to {{ $labels.nqn }}"
+ summary: "{{ $labels.nqn }} subsystem has reached its maximum number of namespaces "
+ expr: "(count by(nqn) (ceph_nvmeof_subsystem_namespace_metadata)) >= ceph_nvmeof_subsystem_namespace_limit"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFTooManyGateways"
+ annotations:
+ description: "You may create many gateways, but 4 is the tested limit"
+ summary: "Max supported gateways exceeded "
+ expr: "count(ceph_nvmeof_gateway_info) > 4.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFMaxGatewayGroupSize"
+ annotations:
+ description: "You may create many gateways in a gateway group, but 2 is the tested limit"
+ summary: "Max gateways within a gateway group ({{ $labels.group }}) exceeded "
+ expr: "count by(group) (ceph_nvmeof_gateway_info) > 2.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFSingleGatewayGroup"
+ annotations:
+ description: "Although a single member gateway group is valid, it should only be used for test purposes"
+ summary: "The gateway group {{ $labels.group }} consists of a single gateway - HA is not possible "
+ expr: "count by(group) (ceph_nvmeof_gateway_info) == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighGatewayCPU"
+ annotations:
+ description: "Typically, high CPU may indicate degraded performance. Consider increasing the number of reactor cores"
+ summary: "CPU used by {{ $labels.instance }} NVMe-oF Gateway is high "
+ expr: "label_replace(avg by(instance) (rate(ceph_nvmeof_reactor_seconds_total{mode=\"busy\"}[1m])),\"instance\",\"$1\",\"instance\",\"(.*):.*\") > 80.00"
+ for: "10m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFGatewayOpenSecurity"
+ annotations:
+ description: "It is good practice to ensure subsystems use host security to reduce the risk of unexpected data loss"
+ summary: "Subsystem {{ $labels.nqn }} has been defined without host level security "
+ expr: "ceph_nvmeof_subsystem_metadata{allow_any_host=\"yes\"}"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFTooManySubsystems"
+ annotations:
+ description: "Although you may continue to create subsystems in {{ $labels.gateway_host }}, the configuration may not be supported"
+ summary: "The number of subsystems defined to the gateway exceeds supported values "
+ expr: "count by(gateway_host) (label_replace(ceph_nvmeof_subsystem_metadata,\"gateway_host\",\"$1\",\"instance\",\"(.*):.*\")) > 16.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFVersionMismatch"
+ annotations:
+ description: "This may indicate an issue with deployment. Check cephadm logs"
+ summary: "The cluster has different NVMe-oF gateway releases active "
+ expr: "count(count by(version) (ceph_nvmeof_gateway_info)) > 1"
+ for: "1h"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighClientCount"
+ annotations:
+ description: "The supported limit for clients connecting to a subsystem is 32"
+ summary: "The number of clients connected to {{ $labels.nqn }} is too high "
+ expr: "ceph_nvmeof_subsystem_host_count > 32.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighHostCPU"
+ annotations:
+ description: "High CPU on a gateway host can lead to CPU contention and performance degradation"
+ summary: "The CPU is high ({{ $value }}%) on NVMeoF Gateway host ({{ $labels.host }}) "
+ expr: "100-((100*(avg by(host) (label_replace(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]),\"host\",\"$1\",\"instance\",\"(.*):.*\")) * on(host) group_right label_replace(ceph_nvmeof_gateway_info,\"host\",\"$1\",\"instance\",\"(.*):.*\")))) >= 80.00"
+ for: "10m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFInterfaceDown"
+ annotations:
+ description: "A NIC used by one or more subsystems is in a down state"
+ summary: "Network interface {{ $labels.device }} is down "
+ expr: "ceph_nvmeof_subsystem_listener_iface_info{operstate=\"down\"}"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.14.1"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFInterfaceDuplex"
+ annotations:
+ description: "Until this is resolved, performance from the gateway will be degraded"
+ summary: "Network interface {{ $labels.device }} is not running in full duplex mode "
+ expr: "ceph_nvmeof_subsystem_listener_iface_info{duplex!=\"full\"}"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighReadLatency"
+ annotations:
+ description: "High latencies may indicate a constraint within the cluster e.g. CPU, network. Please investigate"
+ summary: "The average read latency over the last 5 mins has reached 10 ms or more on {{ $labels.gateway }}"
+ expr: "label_replace((avg by(instance) ((rate(ceph_nvmeof_bdev_read_seconds_total[1m]) / rate(ceph_nvmeof_bdev_reads_completed_total[1m])))),\"gateway\",\"$1\",\"instance\",\"(.*):.*\") > 0.01"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighWriteLatency"
+ annotations:
+ description: "High latencies may indicate a constraint within the cluster e.g. CPU, network. Please investigate"
+ summary: "The average write latency over the last 5 mins has reached 20 ms or more on {{ $labels.gateway }}"
+ expr: "label_replace((avg by(instance) ((rate(ceph_nvmeof_bdev_write_seconds_total[5m]) / rate(ceph_nvmeof_bdev_writes_completed_total[5m])))),\"gateway\",\"$1\",\"instance\",\"(.*):.*\") > 0.02"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
---
---
-apiVersion: snapshot.storage.k8s.io/v1beta1
+apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
 name: csi-rbdplugin-snapclass
Path: @@ -1,86 +1,4 @@
---
-# Source: rook-ceph/templates/psp.yaml
-# We expect most Kubernetes teams to follow the Kubernetes docs and have these PSPs.
-# * privileged (for kube-system namespace)
-# * restricted (for all logged in users)
-#
-# PSPs are applied based on the first match alphabetically. `rook-ceph-operator` comes after
-# `restricted` alphabetically, so we name this `00-rook-privileged`, so it stays somewhere
-# close to the top and so `rook-system` gets the intended PSP. This may need to be renamed in
-# environments with other `00`-prefixed PSPs.
-#
-# More on PSP ordering: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#policy-order
-apiVersion: policy/v1beta1
-kind: PodSecurityPolicy
-metadata:
- name: 00-rook-privileged
- annotations:
- seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
- seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
-spec:
- privileged: true
- allowedCapabilities:
- # required by CSI
- - SYS_ADMIN
- - MKNOD
- fsGroup:
- rule: RunAsAny
- # runAsUser, supplementalGroups - Rook needs to run some pods as root
- # Ceph pods could be run as the Ceph user, but that user isn't always known ahead of time
- runAsUser:
- rule: RunAsAny
- supplementalGroups:
- rule: RunAsAny
- # seLinux - seLinux context is unknown ahead of time; set if this is well-known
- seLinux:
- rule: RunAsAny
- volumes:
- # recommended minimum set
- - configMap
- - downwardAPI
- - emptyDir
- - persistentVolumeClaim
- - secret
- - projected
- # required for Rook
- - hostPath
- # allowedHostPaths can be set to Rook's known host volume mount points when they are fully-known
- # allowedHostPaths:
- # - pathPrefix: "/run/udev" # for OSD prep
- # readOnly: false
- # - pathPrefix: "/dev" # for OSD prep
- # readOnly: false
- # - pathPrefix: "/var/lib/rook" # or whatever the dataDirHostPath value is set to
- # readOnly: false
- # Ceph requires host IPC for setting up encrypted devices
- hostIPC: true
- # Ceph OSDs need to share the same PID namespace
- hostPID: true
- # hostNetwork can be set to 'false' if host networking isn't used
- hostNetwork: true
- hostPorts:
- # Ceph messenger protocol v1
- - min: 6789
- max: 6790 # <- support old default port
- # Ceph messenger protocol v2
- - min: 3300
- max: 3300
- # Ceph RADOS ports for OSDs, MDSes
- - min: 6800
- max: 7300
- # # Ceph dashboard port HTTP (not recommended)
- # - min: 7000
- # max: 7000
- # Ceph dashboard port HTTPS
- - min: 8443
- max: 8443
- # Ceph mgr Prometheus Metrics
- - min: 9283
- max: 9283
- # port for CSIAddons
- - min: 9070
- max: 9070
----
# Source: rook-ceph/templates/cluster-rbac.yaml
# Service account for Ceph OSDs
apiVersion: v1
@@ -155,6 +73,19 @@
# imagePullSecrets:
# - name: my-registry-secret
---
+# Source: rook-ceph/templates/cluster-rbac.yaml
+# Service account for other components
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: rook-ceph-default
+ namespace: default # namespace:cluster
+ labels:
+ operator: rook
+ storage-backend: ceph
+# imagePullSecrets:
+# - name: my-registry-secret
+---
# Source: rook-ceph/templates/serviceaccount.yaml
# Service account for the Rook-Ceph operator
apiVersion: v1
@@ -211,6 +142,20 @@
# imagePullSecrets:
# - name: my-registry-secret
---
+# Source: rook-ceph/templates/serviceaccount.yaml
+# Service account for Ceph COSI driver
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: objectstorage-provisioner
+ namespace: default # namespace:operator
+ labels:
+ app.kubernetes.io/part-of: container-object-storage-interface
+ app.kubernetes.io/component: driver-ceph
+ app.kubernetes.io/name: cosi-driver-ceph
+# imagePullSecrets:
+# - name: my-registry-secret
+---
# Source: rook-ceph/templates/configmap.yaml
# Operator settings that can be updated without an operator restart
# Operator settings that require an operator restart are found in the operator env vars
@@ -218,36 +163,53 @@
apiVersion: v1
metadata:
name: rook-ceph-operator-config
+ namespace: default # namespace:operator
data:
ROOK_LOG_LEVEL: "INFO"
ROOK_CEPH_COMMANDS_TIMEOUT_SECONDS: "15"
ROOK_OBC_WATCH_OPERATOR_NAMESPACE: "true"
+ ROOK_CEPH_ALLOW_LOOP_DEVICES: "false"
+ ROOK_ENABLE_DISCOVERY_DAEMON: "false"
ROOK_CSI_ENABLE_RBD: "true"
ROOK_CSI_ENABLE_CEPHFS: "true"
+ ROOK_CSI_DISABLE_DRIVER: "false"
CSI_ENABLE_CEPHFS_SNAPSHOTTER: "true"
+ CSI_ENABLE_NFS_SNAPSHOTTER: "true"
CSI_ENABLE_RBD_SNAPSHOTTER: "true"
CSI_PLUGIN_ENABLE_SELINUX_HOST_MOUNT: "false"
CSI_ENABLE_ENCRYPTION: "false"
CSI_ENABLE_OMAP_GENERATOR: "false"
CSI_ENABLE_HOST_NETWORK: "true"
+ CSI_ENABLE_METADATA: "false"
+ CSI_ENABLE_VOLUME_GROUP_SNAPSHOT: "true"
CSI_PLUGIN_PRIORITY_CLASSNAME: "system-node-critical"
CSI_PROVISIONER_PRIORITY_CLASSNAME: "system-cluster-critical"
- CSI_RBD_FSGROUPPOLICY: "ReadWriteOnceWithFSType"
- CSI_CEPHFS_FSGROUPPOLICY: "ReadWriteOnceWithFSType"
- CSI_NFS_FSGROUPPOLICY: "ReadWriteOnceWithFSType"
- ROOK_CSI_ENABLE_GRPC_METRICS: "false"
- CSI_ENABLE_VOLUME_REPLICATION: "false"
+ CSI_RBD_FSGROUPPOLICY: "File"
+ CSI_CEPHFS_FSGROUPPOLICY: "File"
+ CSI_NFS_FSGROUPPOLICY: "File"
+ ROOK_CSI_CEPH_IMAGE: "quay.io/cephcsi/cephcsi:v3.13.0"
+ ROOK_CSI_REGISTRAR_IMAGE: "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1"
+ ROOK_CSI_PROVISIONER_IMAGE: "registry.k8s.io/sig-storage/csi-provisioner:v5.0.1"
+ ROOK_CSI_SNAPSHOTTER_IMAGE: "registry.k8s.io/sig-storage/csi-snapshotter:v8.0.1"
+ ROOK_CSI_ATTACHER_IMAGE: "registry.k8s.io/sig-storage/csi-attacher:v4.6.1"
+ ROOK_CSI_RESIZER_IMAGE: "registry.k8s.io/sig-storage/csi-resizer:v1.11.1"
+ ROOK_CSI_IMAGE_PULL_POLICY: "IfNotPresent"
CSI_ENABLE_CSIADDONS: "false"
+ ROOK_CSIADDONS_IMAGE: "quay.io/csiaddons/k8s-sidecar:v0.11.0"
+ CSI_ENABLE_TOPOLOGY: "false"
ROOK_CSI_ENABLE_NFS: "false"
CSI_FORCE_CEPHFS_KERNEL_CLIENT: "true"
CSI_GRPC_TIMEOUT_SECONDS: "150"
CSI_PROVISIONER_REPLICAS: "2"
- CSI_RBD_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : csi-omap-generator\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_RBD_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_CEPHFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_CEPHFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_NFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n"
- CSI_NFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n"
+ CSI_RBD_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n limits:\n memory: 1Gi\n- name : csi-omap-generator\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_RBD_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_CEPHFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_CEPHFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_NFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : csi-attacher\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n"
+ CSI_NFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n"
+ CSI_CEPHFS_ATTACH_REQUIRED: "true"
+ CSI_RBD_ATTACH_REQUIRED: "true"
+ CSI_NFS_ATTACH_REQUIRED: "true"
---
# Source: rook-ceph/templates/clusterrole.yaml
kind: ClusterRole
@@ -271,9 +233,24 @@
- apiGroups: [""]
resources: ["pods/exec"]
verbs: ["create"]
- - apiGroups: ["admissionregistration.k8s.io"]
- resources: ["validatingwebhookconfigurations"]
- verbs: ["create", "get", "delete", "update"]
+ - apiGroups: ["csiaddons.openshift.io"]
+ resources: ["networkfences"]
+ verbs: ["create", "get", "update", "delete", "watch", "list", "deletecollection"]
+ - apiGroups: ["apiextensions.k8s.io"]
+ resources: ["customresourcedefinitions"]
+ verbs: ["get"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["cephconnections"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["clientprofiles"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["operatorconfigs"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["drivers"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
---
# Source: rook-ceph/templates/clusterrole.yaml
# The cluster role for managing all the cluster-specific resources in a namespace
@@ -332,9 +309,8 @@
# Node access is needed for determining nodes where mons should run
- nodes
- nodes/proxy
- - services
# Rook watches secrets which it uses to configure access to external resources.
- # e.g., external Ceph cluster; TLS certificates for the admission controller or object store
+ # e.g., external Ceph cluster or object store
- secrets
# Rook watches for changes to the rook-operator-config configmap
- configmaps
@@ -352,6 +328,7 @@
- persistentvolumeclaims
# Rook creates endpoints for mgr and object store access
- endpoints
+ - services
verbs:
- get
- list
@@ -380,6 +357,7 @@
- create
- update
- delete
+ - deletecollection
# The Rook operator must be able to watch all ceph.rook.io resources to reconcile them.
- apiGroups: ["ceph.rook.io"]
resources:
@@ -399,6 +377,7 @@
- cephfilesystemmirrors
- cephfilesystemsubvolumegroups
- cephblockpoolradosnamespaces
+ - cephcosidrivers
verbs:
- get
- list
@@ -467,6 +446,14 @@
- delete
- deletecollection
- apiGroups:
+ - apps
+ resources:
+ # This is to add osd deployment owner ref on key rotation
+ # cron jobs.
+ - deployments/finalizers
+ verbs:
+ - update
+ - apiGroups:
- healthchecking.openshift.io
resources:
- machinedisruptionbudgets
@@ -651,19 +638,19 @@
rules:
- apiGroups: [""]
resources: ["nodes"]
- verbs: ["get", "list", "watch"]
- - apiGroups: [""]
- resources: ["namespaces"]
- verbs: ["get", "list"]
+ verbs: ["get"]
- apiGroups: [""]
- resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "update"]
- - apiGroups: ["storage.k8s.io"]
- resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update"]
+ resources: ["secrets"]
+ verbs: ["get"]
- apiGroups: [""]
resources: ["configmaps"]
- verbs: ["get", "list"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts/token"]
+ verbs: ["create"]
---
# Source: rook-ceph/templates/clusterrole.yaml
kind: ClusterRole
@@ -675,11 +662,20 @@
resources: ["secrets"]
verbs: ["get", "list"]
- apiGroups: [""]
+ resources: ["configmaps"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["nodes"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: ["storage.k8s.io"]
+ resources: ["csinodes"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: [""]
resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
+ verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
- verbs: ["get", "list", "watch", "update"]
+ verbs: ["get", "list", "watch", "patch", "update"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["get", "list", "watch"]
@@ -688,31 +684,40 @@
verbs: ["list", "watch", "create", "update", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update", "patch"]
+ verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
- apiGroups: [""]
- resources: ["nodes"]
- verbs: ["get", "list", "watch"]
- - apiGroups: [""]
resources: ["persistentvolumeclaims/status"]
- verbs: ["update", "patch"]
+ verbs: ["patch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
- verbs: ["get", "list", "watch", "update", "patch"]
- - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshotcontents"]
- verbs: ["create", "get", "list", "watch", "update", "delete", "patch"]
+ verbs: ["get", "list", "watch", "update", "patch", "create"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
+ resources: ["volumesnapshotcontents"]
+ verbs: ["get", "list", "watch", "patch", "update", "create"]
+ - apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotcontents/status"]
verbs: ["update", "patch"]
- - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshots/status"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotclasses"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents"]
+ verbs: ["get", "list", "watch", "update", "patch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents/status"]
verbs: ["update", "patch"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts/token"]
+ verbs: ["create"]
---
# Source: rook-ceph/templates/clusterrole.yaml
kind: ClusterRole
@@ -730,26 +735,23 @@
resources: ["secrets"]
verbs: ["get", "list"]
- apiGroups: [""]
- resources: ["nodes"]
- verbs: ["get", "list", "watch"]
- - apiGroups: [""]
- resources: ["namespaces"]
- verbs: ["get", "list"]
- - apiGroups: [""]
resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "update"]
+ verbs: ["get", "list"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update"]
+ verbs: ["get", "list"]
- apiGroups: [""]
resources: ["configmaps"]
- verbs: ["get", "list"]
+ verbs: ["get"]
- apiGroups: [""]
resources: ["serviceaccounts"]
verbs: ["get"]
- apiGroups: [""]
resources: ["serviceaccounts/token"]
verbs: ["create"]
+ - apiGroups: [""]
+ resources: ["nodes"]
+ verbs: ["get"]
---
# Source: rook-ceph/templates/clusterrole.yaml
kind: ClusterRole
@@ -762,13 +764,19 @@
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
+ verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
- apiGroups: [""]
resources: ["persistentvolumeclaims"]
verbs: ["get", "list", "watch", "update"]
- apiGroups: ["storage.k8s.io"]
+ resources: ["storageclasses"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: [""]
+ resources: ["events"]
+ verbs: ["list", "watch", "create", "update", "patch"]
+ - apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update", "patch"]
+ verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["storage.k8s.io"]
resources: ["volumeattachments/status"]
verbs: ["patch"]
@@ -776,71 +784,64 @@
resources: ["nodes"]
verbs: ["get", "list", "watch"]
- apiGroups: ["storage.k8s.io"]
- resources: ["storageclasses"]
+ resources: ["csinodes"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
- resources: ["events"]
- verbs: ["list", "watch", "create", "update", "patch"]
+ resources: ["persistentvolumeclaims/status"]
+ verbs: ["patch"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshots"]
- verbs: ["get", "list", "watch", "update", "patch"]
- - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshotcontents"]
- verbs: ["create", "get", "list", "watch", "update", "delete", "patch"]
+ verbs: ["get", "list", "watch", "update", "patch", "create"]
- apiGroups: ["snapshot.storage.k8s.io"]
resources: ["volumesnapshotclasses"]
verbs: ["get", "list", "watch"]
- apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshotcontents/status"]
- verbs: ["update", "patch"]
+ resources: ["volumesnapshotcontents"]
+ verbs: ["get", "list", "watch", "patch", "update", "create"]
- apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshots/status"]
+ resources: ["volumesnapshotcontents/status"]
verbs: ["update", "patch"]
- - apiGroups: [""]
- resources: ["persistentvolumeclaims/status"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotclasses"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents"]
+ verbs: ["get", "list", "watch", "update", "patch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents/status"]
verbs: ["update", "patch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplications", "volumereplicationclasses"]
- verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplications/finalizers"]
- verbs: ["update"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplications/status"]
- verbs: ["get", "patch", "update"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplicationclasses/status"]
- verbs: ["get"]
- apiGroups: [""]
resources: ["serviceaccounts"]
verbs: ["get"]
- apiGroups: [""]
resources: ["serviceaccounts/token"]
verbs: ["create"]
+ - apiGroups: [""]
+ resources: ["nodes"]
+ verbs: ["get", "list", "watch"]
---
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
+# Source: rook-ceph/templates/clusterrole.yaml
kind: ClusterRole
+apiVersion: rbac.authorization.k8s.io/v1
metadata:
- name: 'psp:rook'
+ name: objectstorage-provisioner-role
labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-rules:
- - apiGroups:
- - policy
- resources:
- - podsecuritypolicies
- resourceNames:
- - 00-rook-privileged
- verbs:
- - use
+ app.kubernetes.io/part-of: container-object-storage-interface
+ app.kubernetes.io/component: driver-ceph
+ app.kubernetes.io/name: cosi-driver-ceph
+rules:
+ - apiGroups: ["objectstorage.k8s.io"]
+ resources: ["buckets", "bucketaccesses", "bucketclaims", "bucketaccessclasses", "buckets/status", "bucketaccesses/status", "bucketclaims/status", "bucketaccessclasses/status"]
+ verbs: ["get", "list", "watch", "update", "create", "delete"]
+ - apiGroups: ["coordination.k8s.io"]
+ resources: ["leases"]
+ verbs: ["get", "watch", "list", "delete", "update", "create"]
+ - apiGroups: [""]
+ resources: ["secrets", "events"]
+ verbs: ["get", "delete", "update", "create"]
---
# Source: rook-ceph/templates/cluster-rbac.yaml
# Allow the ceph mgr to access cluster-wide resources necessary for the mgr modules
@@ -946,28 +947,30 @@
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
- name: cephfs-csi-nodeplugin
+ name: cephfs-csi-provisioner-role
subjects:
- kind: ServiceAccount
- name: rook-csi-cephfs-plugin-sa
+ name: rook-csi-cephfs-provisioner-sa
namespace: default # namespace:operator
roleRef:
kind: ClusterRole
- name: cephfs-csi-nodeplugin
+ name: cephfs-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
# Source: rook-ceph/templates/clusterrolebinding.yaml
+# This is required by operator-sdk to map the cluster/clusterrolebindings with SA
+# otherwise operator-sdk will create an individual file for these.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
- name: cephfs-csi-provisioner-role
+ name: cephfs-csi-nodeplugin-role
subjects:
- kind: ServiceAccount
- name: rook-csi-cephfs-provisioner-sa
+ name: rook-csi-cephfs-plugin-sa
namespace: default # namespace:operator
roleRef:
kind: ClusterRole
- name: cephfs-external-provisioner-runner
+ name: cephfs-csi-nodeplugin
apiGroup: rbac.authorization.k8s.io
---
# Source: rook-ceph/templates/clusterrolebinding.yaml
@@ -984,81 +987,24 @@
name: rbd-external-provisioner-runner
apiGroup: rbac.authorization.k8s.io
---
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
- name: rook-ceph-system-psp
- labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-system
- namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
+# Source: rook-ceph/templates/clusterrolebinding.yaml
+# RBAC for ceph cosi driver service account
kind: ClusterRoleBinding
-metadata:
- name: rook-csi-cephfs-provisioner-sa-psp
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-csi-cephfs-provisioner-sa
- namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
metadata:
- name: rook-csi-cephfs-plugin-sa-psp
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
+ name: objectstorage-provisioner-role-binding
+ labels:
+ app.kubernetes.io/part-of: container-object-storage-interface
+ app.kubernetes.io/component: driver-ceph
+ app.kubernetes.io/name: cosi-driver-ceph
subjects:
- kind: ServiceAccount
- name: rook-csi-cephfs-plugin-sa
+ name: objectstorage-provisioner
namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
- name: rook-csi-rbd-plugin-sa-psp
roleRef:
- apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-csi-rbd-plugin-sa
- namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
- name: rook-csi-rbd-provisioner-sa-psp
-roleRef:
+ name: objectstorage-provisioner-role
apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-csi-rbd-provisioner-sa
- namespace: default # namespace:operator
---
# Source: rook-ceph/templates/cluster-rbac.yaml
kind: Role
@@ -1068,10 +1014,10 @@
namespace: default # namespace:cluster
rules:
# this is needed for rook's "key-management" CLI to fetch the vault token from the secret when
- # validating the connection details
+ # validating the connection details and for key rotation operations.
- apiGroups: [""]
resources: ["secrets"]
- verbs: ["get"]
+ verbs: ["get", "update"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update", "delete"]
@@ -1080,23 +1026,6 @@
verbs: ["get", "list", "create", "update", "delete"]
---
# Source: rook-ceph/templates/cluster-rbac.yaml
-kind: Role
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-rules:
- # Placeholder role so the rgw service account will
- # be generated in the csv. Remove this role and role binding
- # when fixing https://github.com/rook/rook/issues/10141.
- - apiGroups:
- - ""
- resources:
- - configmaps
- verbs:
- - get
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
# Aspects of ceph-mgr that operate within the cluster's namespace
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
@@ -1131,9 +1060,31 @@
- apiGroups:
- ceph.rook.io
resources:
- - "*"
+ - cephclients
+ - cephclusters
+ - cephblockpools
+ - cephfilesystems
+ - cephnfses
+ - cephobjectstores
+ - cephobjectstoreusers
+ - cephobjectrealms
+ - cephobjectzonegroups
+ - cephobjectzones
+ - cephbuckettopics
+ - cephbucketnotifications
+ - cephrbdmirrors
+ - cephfilesystemmirrors
+ - cephfilesystemsubvolumegroups
+ - cephblockpoolradosnamespaces
+ - cephcosidrivers
verbs:
- - "*"
+ - get
+ - list
+ - watch
+ - create
+ - update
+ - delete
+ - patch
- apiGroups:
- apps
resources:
@@ -1269,6 +1220,7 @@
- create
- update
- delete
+ - deletecollection
- apiGroups:
- batch
resources:
@@ -1284,6 +1236,13 @@
- get
- create
- delete
+ - apiGroups:
+ - multicluster.x-k8s.io
+ resources:
+ - serviceexports
+ verbs:
+ - get
+ - create
---
# Source: rook-ceph/templates/role.yaml
kind: Role
@@ -1292,12 +1251,6 @@
name: cephfs-external-provisioner-cfg
namespace: default # namespace:operator
rules:
- - apiGroups: [""]
- resources: ["endpoints"]
- verbs: ["get", "watch", "list", "delete", "update", "create"]
- - apiGroups: [""]
- resources: ["configmaps"]
- verbs: ["get", "list", "create", "delete"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
@@ -1309,113 +1262,11 @@
name: rbd-external-provisioner-cfg
namespace: default # namespace:operator
rules:
- - apiGroups: [""]
- resources: ["endpoints"]
- verbs: ["get", "watch", "list", "delete", "update", "create"]
- - apiGroups: [""]
- resources: ["configmaps"]
- verbs: ["get", "list", "watch", "create", "delete", "update"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["get", "watch", "list", "delete", "update", "create"]
---
# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-default-psp
- namespace: default # namespace:cluster
- labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: default
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-rgw-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-mgr-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-mgr
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-cmd-reporter-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-cmd-reporter
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-purge-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-purge-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
# Allow the operator to create resources in this cluster's namespace
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
@@ -1448,22 +1299,6 @@
namespace: default # namespace:cluster
---
# Source: rook-ceph/templates/cluster-rbac.yaml
-# Allow the rgw pods in this namespace to work with configmaps
-kind: RoleBinding
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: Role
- name: rook-ceph-rgw
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
# Allow the ceph mgr to access resources scoped to the CephCluster namespace necessary for mgr modules
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
@@ -1615,6 +1450,7 @@
kind: Deployment
metadata:
name: rook-ceph-operator
+ namespace: default # namespace:operator
labels:
operator: rook
storage-backend: ceph
@@ -1633,39 +1469,37 @@
labels:
app: rook-ceph-operator
spec:
+ tolerations:
+ - effect: NoExecute
+ key: node.kubernetes.io/unreachable
+ operator: Exists
+ tolerationSeconds: 5
containers:
- name: rook-ceph-operator
- image: "rook/ceph:v1.9.12"
+ image: "docker.io/rook/ceph:v1.16.0"
imagePullPolicy: IfNotPresent
args: ["ceph", "operator"]
securityContext:
+ capabilities:
+ drop:
+ - ALL
+ runAsGroup: 2016
runAsNonRoot: true
runAsUser: 2016
- runAsGroup: 2016
volumeMounts:
- mountPath: /var/lib/rook
name: rook-config
- mountPath: /etc/ceph
name: default-config-dir
- - mountPath: /etc/webhook
- name: webhook-cert
- ports:
- - containerPort: 9443
- name: https-webhook
- protocol: TCP
env:
- name: ROOK_CURRENT_NAMESPACE_ONLY
value: "false"
- name: ROOK_HOSTPATH_REQUIRES_PRIVILEGED
value: "false"
- - name: ROOK_ENABLE_SELINUX_RELABELING
- value: "true"
- name: ROOK_DISABLE_DEVICE_HOTPLUG
value: "false"
- - name: ROOK_ENABLE_DISCOVERY_DAEMON
- value: "false"
- - name: ROOK_DISABLE_ADMISSION_CONTROLLER
- value: "false"
+ - name: ROOK_DISCOVER_DEVICES_INTERVAL
+ value: "60m"
- name: NODE_NAME
valueFrom:
fieldRef:
@@ -1680,7 +1514,6 @@
fieldPath: metadata.namespace
resources:
limits:
- cpu: 500m
memory: 256Mi
requests:
cpu: 10m
@@ -1691,5 +1524,7 @@
emptyDir: {}
- name: default-config-dir
emptyDir: {}
- - name: webhook-cert
- emptyDir: {}
+# Source: rook-ceph/templates/securityContextConstraints.yaml
+# scc for the Rook and Ceph daemons
+# for creating cluster in openshift
+---
MegaLinter status: ❌ ERROR
See error details in the "MegaLinter reports" artifact on the CI job page.
| datasource | package           | from    | to      |
| ---------- | ----------------- | ------- | ------- |
| helm       | rook-ceph         | v1.9.12 | v1.16.0 |
| helm       | rook-ceph         | v1.9.12 | v1.16.0 |
| helm       | rook-ceph         | v1.9.12 | v1.16.0 |
| helm       | rook-ceph-cluster | v1.9.12 | v1.16.0 |
| docker     | rook/ceph         | v1.9.13 | v1.16.0 |
| docker     | rook/ceph         | v1.9.13 | v1.16.0 |
This PR contains the following updates:
v1.9.12 -> v1.16.0
v1.9.12 -> v1.16.0
v1.9.13 -> v1.16.0
⚠ Dependency Lookup Warnings ⚠
Warnings were logged while processing this repo. Please check the Dependency Dashboard for more information.
Release Notes
rook/rook
v1.16.0
Compare Source
Upgrade Guide
To upgrade from previous versions of Rook, see the Rook upgrade guide.
Breaking Changes
Features
[...] `statusCheck` is enabled on the parent CephBlockPool.
[...] `additionalConfig.bucketPolicy` field (see #15138).
[...] `opsLogSidecar` in the gateway settings.
v1.15.7
Compare Source
Improvements
Rook v1.15.7 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.15.6
Compare Source
Improvements
Rook v1.15.6 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.15.5
Compare Source
Improvements
Rook v1.15.5 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `/run/udev` in the init container for ceph-volume activate (#14901, @guits)
v1.15.4
Compare Source
Improvements
Rook v1.15.4 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.15.3
Compare Source
Improvements
Rook v1.15.3 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.15.2
Compare Source
Improvements
Rook v1.15.2 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.15.1
Compare Source
Improvements
Rook v1.15.1 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `mon.zones` spec (#14636, @BenoitKnecht)
v1.15.0
Compare Source
Upgrade Guide
To upgrade from previous versions of Rook, see the Rook upgrade guide.
Breaking Changes
[...] `csi-*plugin-holder-*` in the Rook operator namespace, see the detailed documentation to disable them. This deprecation process will be required before upgrading to the future Rook v1.16.
[...] `spec.hosting` configurations are set. Use the new `spec.hosting.advertiseEndpoint` config to define required behavior as documented.
Features
[...] `allowDeviceClassUpdate: true` is set in the CephCluster CR.
[...] `allowOsdCrushWeightUpdate: true` is set in the CephCluster CR.
[...] `docker.io/rook/ceph`) in operator manifests and helm charts.
Experimental Features
[...] `operator.yaml`.
v1.14.12
Compare Source
Improvements
Rook v1.14.12 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.11
Compare Source
Improvements
Rook v1.14.11 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.10
Compare Source
Improvements
Rook v1.14.10 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.9
Compare Source
Improvements
Rook v1.14.9 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.8
Compare Source
Improvements
Rook v1.14.8 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.7
Compare Source
What's Changed
monitoring: fix CephPoolGrowthWarning expression (#14346, @matofeder)
monitoring: Set honor labels on the service monitor (#14339, @travisn)
Full Changelog: rook/rook@v1.14.6...v1.14.7
v1.14.6
Compare Source
What's Changed
v1.14.5
Compare Source
Improvements
Rook v1.14.5 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.4
Compare Source
Improvements
Rook v1.14.4 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.3
Compare Source
Improvements
Rook v1.14.3 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.2
Compare Source
Improvements
Rook v1.14.2 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.1
Compare Source
Improvements
Rook v1.14.1 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.14.0
Compare Source
Upgrade Guide
To upgrade from previous versions of Rook, see the Rook upgrade guide.
Breaking Changes
[...] `repository` and `tag` settings are specified separately in the helm chart values.yaml for the CSI images. Helm users previously specifying the CSI images with the `image` setting will need to update their values.yaml with the separate `repository` and `tag` settings.
[...] `csi-*plugin-holder-*` in the Rook operator namespace, see the holder pod deprecation documentation to disable them. Migration of affected clusters is optional for v1.14, but will be required in a future release.
[...] `CSI_ENABLE_READ_AFFINITY` was removed. v1.13 clusters that have modified this value to be `"true"` must set the option as desired in each CephCluster as documented here before upgrading to v1.14.
Features
[...] `default` service account now use a new `rook-ceph-default` service account.
[...] `application` can be applied to a CephBlockPool CR.
[...] `rook-ceph` namespace).
[...] `kubectl` output for Rook CRDs.
v1.13.10
Compare Source
Improvements
Rook v1.13.10 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.13.9
Compare Source
Improvements
Rook v1.13.9 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.13.8
Compare Source
Improvements
Rook v1.13.8 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.13.7
Compare Source
Improvements
Rook v1.13.7 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `monitoring` section of CephCluster to ceph-exporter (#13902, @rkachach)
v1.13.6
Compare Source
Improvements
Rook v1.13.6 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `master` tag in the values.yaml with the release tag (#13897, @travisn)
v1.13.5
Compare Source
Improvements
Rook v1.13.5 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.13.4
Compare Source
Improvements
Rook v1.13.4 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.13.3
Compare Source
Improvements
Rook v1.13.3 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.13.2
Compare Source
Improvements
Rook v1.13.2 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `encryptedDevice` is not yet supported for host-based clusters (#13452, @cupnes)
v1.13.1
Compare Source
Improvements
Rook v1.13.1 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `spec.csi` section in the CephCluster documentation (#13375, @Rakshith-R)
v1.13.0
Compare Source
Upgrade Guide
To upgrade from previous versions of Rook, see the Rook upgrade guide.
Breaking Changes
Features
[...] `cephConfig` to the CephCluster CR to allow setting Ceph config options in the Ceph MON config store via the CRD. These settings supersede the ceph.conf override settings.
[...] `ceph.rook.io/do-not-reconcile` for all Ceph daemons. This is helpful when using the debug command in the kubectl rook-ceph plugin.
v1.12.11
Compare Source
Improvements
Rook v1.12.11 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.12.10
Compare Source
Improvements
Rook v1.12.10 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.12.9
Compare Source
Improvements
Rook v1.12.9 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
v1.12.8
Compare Source
Improvements
Rook v1.12.8 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.
[...] `all` placement for net addr detect job (#13206, @BlaineEXE)
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about these updates again.
This PR has been generated by Renovate Bot.