feat(helm): update rook-ceph group to v1.16.0 (minor) #746

chii-bot · 2022-08-31T22:21:31Z

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| rook-ceph | minor | `v1.9.12` -> `v1.16.0` |
| rook-ceph-cluster | minor | `v1.9.12` -> `v1.16.0` |
| rook/ceph | minor | `v1.9.13` -> `v1.16.0` |

⚠ Dependency Lookup Warnings ⚠

Warnings were logged while processing this repo. Please check the Dependency Dashboard for more information.

Release Notes

rook/rook

`v1.16.0`

Compare Source

Upgrade Guide

To upgrade from previous versions of Rook, see the Rook upgrade guide.

Breaking Changes

Removed support for Ceph Quincy (v17) since it has reached end of life. Reef (v18) and Squid (v19) are the currently supported Ceph versions.
Rook has removed CSI network "holder" pods. If there are pods named `csi-*plugin-holder-*` in the Rook operator namespace, see the detailed documentation to disable them before upgrading to v1.16.
The minimum K8s version is increased to v1.27.

Features

Ceph-CSI driver v3.13, including support for volume group snapshots, CephFS support for omap in rados namespaces, and other CSI improvements.
Enable mirroring for CephBlockPoolRadosNamespaces
Enable periodic monitoring for CephBlockPoolRadosNamespaces mirroring if the statusCheck is enabled on the parent CephBlockPool.
Allow migration of PVC based OSDs to enable or disable encryption.
Support multiple instances of object stores to enable scenarios such as RGW instances with only admin-ops enabled.
ObjectBucketClaim management of s3 bucket policy via the additionalConfig.bucketPolicy field (see #15138).
Object stores can now be configured with arbitrary command line parameters or Ceph configuration settings.
Enable RGW admin ops logs by enabling the opsLogSidecar in the gateway settings.
Added support for K8s version v1.32.
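
As a rough illustration of the additionalConfig.bucketPolicy field mentioned in the features above, an ObjectBucketClaim could look something like the sketch below. The claim name, bucket name, user name, and policy contents are all hypothetical, and the storage class name `ceph-bucket` is borrowed from the chart diff in this PR. The assumption here is that the policy is supplied as a JSON string in S3 bucket policy syntax; check the Rook OBC documentation for the exact shape before relying on it.

```yaml
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: example-claim              # hypothetical claim name
  namespace: default
spec:
  bucketName: example-bucket       # hypothetical; a fixed name so the policy can reference it
  storageClassName: ceph-bucket    # bucket storage class from the chart diff in this PR
  additionalConfig:
    # Assumption: policy is a JSON document in S3 bucket policy syntax
    bucketPolicy: |
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Principal": {"AWS": ["arn:aws:iam:::user/example-user"]},
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-bucket/*"]
          }
        ]
      }
```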

`v1.15.7`

Compare Source

Improvements

Rook v1.15.7 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

object: Update s5cmd to resolve vulnerabilities (#15178, @TomHellier)
object: COSI user to be created explicitly instead of automated by the operator (#15144, @BlaineEXE)
file: Add support for named MDS metadata pool names without the filesystem prefix (#15056, @NotTheEvilOne)
csi: update to the v3.12.3 Ceph-CSI release (#15058, @Madhu-1)
rbdmirror: Add a timeout for the RBD import cmd that may hang (#15051, @parth-gr)
osd: Fix device class label on the OSD deployment (#15066, @parth-gr)
core: Fix Annotations.Merge to prevent side effects (#15080, @OdedViner)
rgw: Fix shared pools for zone (#15038, @arttor)

`v1.15.6`

Compare Source

Improvements

Rook v1.15.6 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

osd: Log warning when duplicate node topology values are detected (#15016, @solidDoWant)
core: Configure remaining pods with the revision history limit (#14976, @obnoxxx)
helm: Set service account for toolbox pod (#15019, @amrut-asm)
osd: Import keyring file on activate to ceph auth if not imported yet (#14826, @prazumovsky)
mon: Allow failover of the arbiter mon (#14981, @GrantFleming)

`v1.15.5`

Compare Source

Improvements

Rook v1.15.5 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

rgw: Add support for pool placements (#14588 #14715 #14884 #14951, @arttor)
osd: Mount /run/udev in the init container for ceph-volume activate (#14901, @guits)
osd: Allow scheduling OSDs on unschedulable nodes (#14949, @travisn)
core: Allow setting resources on the detect version job (#14941, @travisn)
mds: Wait for mds standby upgrade for the same filesystem instead of any filesystem (#14952, @travisn)
csi: Remove version check for k8s and cephcsi (#14942, @travisn)
kms: Key rotation support for vault kms (#14818, @iPraveenParihar)
object: Also use system certs for validating RGW cert (#14835, @BlaineEXE)
core: Cleanup blockpool during uninstall if corresponding annotation is set (#14895, @Madhu-1)
object: set OBC user quota(s) in one SetUserQuota() call (#14827, @jhoblitt)

`v1.15.4`

Compare Source

Improvements

Rook v1.15.4 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Define empty securityContext for pods to fix CIS 5.7.3 (#14823, @prazumovsky)
core: Fix deletion of the osd-replace-config configmap during OSD migration (#14862, @sp98)
core: Allow removal of exporter pods from a node no longer having ceph daemons (#14854, @travisn)
docs: Add documentation for RBD VolumeGroupSnapshot (#14845, @black-dragon74)
csi: Disable fencing in Rook due to unreliable IPs being fenced (#14831, @Madhu-1)
multus: Do not force delete in validation cleanup (#14820, @BlaineEXE)
mon: Do not remove extra mon in middle of failover (#14805, @travisn)
mds: Fix liveness probe timeout when ceph timeout is reached (#14798, @BlaineEXE)

`v1.15.3`

Compare Source

Improvements

Rook v1.15.3 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

rgw: Allow CephObjectZone and CephObjectStore creation based on pre-existing pools (#14801 #14772, @jhoblitt)
helm: Add enforce host network setting (#14791, @travisn)
core: Allow configuration of the revision history limit (#14775, @obnoxxx)
core: Preserve pool application name change (#14755, @sp98)
csi: Update privileges in CSI logrotate sidecar container (#14782, @parth-gr)
docs: Declare cephconfig settings stable in the CephCluster CR (#14752, @travisn)
build: Allow building with golang 1.23 (#14748, @obnoxxx)
csi: Fix the ROOK_CSI_DISABLE_DRIVER flag in the CSI driver reconcile (#14746, @parth-gr)
external: Update MDS caps for the healthchecker/cephfs users (#14722, @subhamkrai)
docs: Update external docs with a better structure (#14718, @parth-gr)

`v1.15.2`

Compare Source

Improvements

Rook v1.15.2 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Enable annotations on crash collector (#14731, @travisn)
exporter: Configure prio-limit for ceph exporter pod (#14717, @arttor)
docs: Add grafana dashboards files to docs (#14679, @galexrt)
pool: Allow negative step num in crush rule (#14709, @travisn)
csi: Stop deleting csi-operator resources when not enabled (#14693, @subhamkrai)
core: Check for duplicate ceph fs pool names (#14653, @sp98)
csi: Update to CephCSI patch release v3.12.2 (#14694, @Madhu-1)
osd: Discover metadata and wal devices for raw device cleanup (#14645, @Papawy)
network: Allow enforcing host network on all pods (#14585, @obnoxxx)
mon: Remove extra mon from quorum before taking down pod (#14667, @travisn)

`v1.15.1`

Compare Source

Improvements

Rook v1.15.1 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

csi: Update csi-addons to v0.9.1 (#14671, @Madhu-1)
helm: Reorder volumes in rook-ceph-csi scc for argocd diff to show no changes (#14642, @raynay-r)
rgw: Allow users to add custom volume mounts (#14616, @BlaineEXE)
core: Spread Ceph mons across zones when using mon.zones spec (#14636, @BenoitKnecht)
external: Remove the false bool values from config file (#14627, @parth-gr)
core: Host cleanup jobs to read flags correctly (#14631, @sp98)
multus: Fix default service account handling (#14629, @BlaineEXE)
csi: Use specific CSI operator version tag instead of latest image (#14618, @subhamkrai)

`v1.15.0`

Compare Source

Upgrade Guide

To upgrade from previous versions of Rook, see the Rook upgrade guide.

Breaking Changes

Minimum version of Kubernetes supported is increased to K8s v1.26.
During CephBlockPool updates, Rook will now return an error if an invalid device class is specified. Pools with invalid device classes may start failing until the correct device class is specified. For more details, see #14057.
Rook has deprecated CSI network "holder" pods. If there are pods named csi-*plugin-holder-* in the Rook operator namespace, see the detailed documentation to disable them. This deprecation process will be required before upgrading to the future Rook v1.16.
Ceph COSI driver images have been updated. This impacts existing COSI Buckets, BucketClaims, and BucketAccesses. Update existing clusters following the guide here.
CephObjectStore, CephObjectStoreUser, and OBC endpoint behavior has changed when CephObjectStore spec.hosting configurations are set. Use the new spec.hosting.advertiseEndpoint config to define required behavior as documented.

Features

Added support for Ceph Squid (v19), in addition to Reef (v18) and Quincy (v17). Quincy support will be removed in Rook v1.16.
Ceph-CSI driver v3.12, including new options for RBD, log rotation, and updated sidecar images.
Allow updating the device class of OSDs, if allowDeviceClassUpdate: true is set in the CephCluster CR.
Allow updating the weight of an OSD, if allowOsdCrushWeightUpdate: true is set in the CephCluster CR.
Use fully-qualified image names (docker.io/rook/ceph) in operator manifests and helm charts.
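
For the two OSD update flags listed above, a minimal CephCluster fragment might look like this. The release notes only say the flags are set "in the CephCluster CR"; placing them under `spec.storage` is an assumption, and the cluster name and namespace are illustrative.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph                    # illustrative cluster name
  namespace: rook-ceph               # illustrative namespace
spec:
  storage:
    # Assumption: both flags live under spec.storage
    allowDeviceClassUpdate: true     # permit changing the device class of existing OSDs
    allowOsdCrushWeightUpdate: true  # permit changing the CRUSH weight of existing OSDs
```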

Experimental Features

CephObjectStore support for keystone authentication for S3 and Swift. See the Object store documentation to configure.
CSI operator: CSI settings are moving to CRs managed by a new operator. Once enabled, Rook will convert the settings previously defined in the operator configmap or env vars into the new CRs managed by the CSI operator. There are two steps to enable:
- Create csi-operator.yaml
- Set ROOK_USE_CSI_OPERATOR: true in operator.yaml.
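
The second step above could be sketched as the following operator ConfigMap fragment. The ConfigMap name `rook-ceph-operator-config` and the `rook-ceph` namespace are assumptions based on the default operator.yaml; ConfigMap values must be strings, so the boolean is quoted.

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: rook-ceph-operator-config   # assumption: default operator settings ConfigMap
  namespace: rook-ceph              # assumption: operator namespace
data:
  # Experimental: hand CSI settings over to the CSI operator
  ROOK_USE_CSI_OPERATOR: "true"
```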

`v1.14.12`

Compare Source

Improvements

Rook v1.14.12 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

object: Also use system certs for validating RGW cert (#14835, @BlaineEXE)
osd: mount /run/udev in the init container for ceph-volume activate (#14901, @guits)
core: Define empty securityContext for pods to fix CIS 5.7.3 (#14823, @prazumovsky)
csi: Disable fencing in Rook (#14831, @Madhu-1)
mds: Fix liveness probe timeout (#14798, @BlaineEXE)

`v1.14.11`

Compare Source

Improvements

Rook v1.14.11 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Enable annotations on crash collector (#14731, @travisn)
helm: Reorder volumes in rook-ceph-csi scc for argocd diff to show no changes (#14642, @raynay-r)
core: Fix Ceph monitor placement when zones are specifically defined in a non-stretch cluster (#14636, @BenoitKnecht)
core: Fix host cleanup jobs to read flags correctly (#14631, @sp98)
multus: Default service account handling for the multus tool (#14629, @BlaineEXE)

`v1.14.10`

Compare Source

Improvements

Rook v1.14.10 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Configuration option added for metrics bindAddress (#14598, @jrcichra)
core: Annotations and labels configurable on detect version jobs (#14576, @travisn)
docs: Troubleshooting topic for containerd LimitNOFILE issue (#14500, @nicofnt)

`v1.14.9`

Compare Source

Improvements

Rook v1.14.9 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

manifest: Update the ceph recommended version to v18.2.4 (#14491, @travisn)
mgr: Properly detect if dashboard cert already exists to avoid unnecessary dashboard module restarts (#14484, @travisn)
mgr: Lookup cluster crd on active mgr watch (#14482, @arttor)
csi: Make kube apiserver qps configurable (#14420, @YiteGu)
multus: Reset validation tool debounce time to 30 (#14451, @BlaineEXE)
multus: Add host checking to validation tool (#14230, @BlaineEXE)
pool: Skip updating crush rules for stretch clusters (#14447, @travisn)

`v1.14.8`

Compare Source

Improvements

Rook v1.14.8 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

osd: Fix activate failure when block device moves (#14374, @BlaineEXE)
csi: Update csi-addons repo link for correctly versioned downloads (#14408, @Madhu-1)
build: Update go-retryablehttp from 0.7.6 to 0.7.7 (#14391, @subhamkrai)
osd: Use old passphrase to kill the LUKS slot during key rotation (#14367, @black-dragon74)
csi: Skip creating networkFence when csi is disabled (#14294, @subhamkrai)

`v1.14.7`

Compare Source

What's Changed

monitoring: fix CephPoolGrowthWarning expression (#14346, @matofeder)
monitoring: Set honor labels on the service monitor (#14339, @travisn)

Full Changelog: rook/rook@v1.14.6...v1.14.7

`v1.14.6`

Compare Source

What's Changed

build: add result of codegen (#14287, @obnoxxx)
build: remove iproute build dependency on centos repo (#14299, @BlaineEXE)
mon: Allow overriding the mon endpoint with annotation (#13500, @travisn)
multus: add and test ipv6 support for validation tool (#14302, @BlaineEXE)
monitoring: fix exporter service monitor selector (#14313, @matofeder)
monitoring: update to the latest ceph prometheus rules (#14312, @matofeder)
doc: add recommendation for nfs in external cluster (#13876, @parth-gr)
pool: get the exact deviceClass name instead of crushroot+deviceClass (#14325, @ideepika)
helm: allow custom labels and annotations for storage classes (#14323, @catdog2)
core: Update go modules for snyk security check (#14331, @travisn)

`v1.14.5`

Compare Source

Improvements

Rook v1.14.5 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

mon: Fix the bind address when IPv6 and msgr2 are enabled (#14248, @BlaineEXE)
osd: Configure cluster full settings related to OSDs filling up (#14281, @travisn)
core: Remove unnecessary owner refs in resource cleanup jobs (#14234, @sp98)
mgr: Set balancer mode for the balancer mgr module in the CephCluster CR (#14232, @sp98)
osd: Reduce safe-to-destroy retry timeout to 15s (#14257, @bdowling)
docs: Document how to define a StorageClass to consume a RADOS namespace (#14173, @obnoxxx)
core: Fix missing env in subvolume group cleanup job (#14236, @sp98)

`v1.14.4`

Compare Source

Improvements

Rook v1.14.4 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Remove obsolete Ceph Pacific checks (#14210, @satoru-takeuchi)
osd: Add cephcluster status for deprecated OSDs that should be replaced (#14187, @travisn)
mgr: Fix UpdateActiveMgrLabel to retry label update on failure (#14160, @rkachach)
ci: Update ubuntu image from 20.04 to 22.04 (#14166, @subhamkrai)

`v1.14.3`

Compare Source

Improvements

Rook v1.14.3 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

csi: Fix missing namespace in internal csi cluster config map (#14154, @BlaineEXE)
osd: Limit storageClassDeviceSet names to 40 chars (#14134, @subhamkrai)
mon: Disable the msgr v1 port listening inside the mon pod if msgr2 is required (#14147, @travisn)
external: Restructure external cluster examples manifests (#13932, @smoshiur1237)
mon: Allow mon scale-down when mons are portable (#14106, @subhamkrai)
osd: Legacy LVM-based OSDs on PVCs crash on resize init container (#14100, @travisn)
csi: Update csi sidecars image version (#14129, @iPraveenParihar)
csi: Create csi configmap if csi controller is disabled (#14125, @parth-gr)
operator: Support custom dashboard service labels and annotations (#14115, @sfackler)
external: Add support for rados namespace for rbd EC pools (#13769, @parth-gr)
ci: Use markdownlint to enforce mkdocs compatibility (#14114, @BlaineEXE)

`v1.14.2`

Compare Source

Improvements

Rook v1.14.2 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

ci: Add K8s 1.30 support (#14093, @subhamkrai)
helm: Use correct metadata and data EC block pool (#14088, @travisn)
csi: Only create CSI config configmap in CSI reconciler (#14089, @BlaineEXE)

`v1.14.1`

Compare Source

Improvements

Rook v1.14.1 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

crds: More verbose kubectl info for CephBlockPoolRadosNamespace and CephFilesystemSubVolumeGroup (#14049, @NymanRobin)
subvolumegroup: Add support for quota and datapool (#14036, @Madhu-1)
osd: Add option to require healthy PGs during OSD upgrade (#14040, @mmaoyu)
core: Cleanup RADOS namespace with forced deletion annotation (#14052, @sp98)
core: Cleanup Subvolumegroups with forced deletion annotation (#14026, @sp98)
osd: Prevent osd reconcile when device set names duplicated (#14002, @travisn)
doc: Host networking required for CSI driver (#14023, @BlaineEXE)
operator: Ensure cluster owner info is set in LoadClusterInfo (#14079, @BlaineEXE)

`v1.14.0`

Compare Source

Upgrade Guide

To upgrade from previous versions of Rook, see the Rook upgrade guide.

Breaking Changes

The minimum supported version of Kubernetes is v1.25. Upgrade to Kubernetes v1.25 or higher before upgrading Rook.
The image repository and tag settings are specified separately in the helm chart values.yaml for the CSI images. Helm users previously specifying the CSI images with the image setting will need to update their values.yaml with the separate repository and tag settings.
Rook is beginning the process of deprecating CSI network "holder" pods. If there are pods named csi-*plugin-holder-* in the Rook operator namespace, see the holder pod deprecation documentation to disable them. Migration of affected clusters is optional for v1.14, but will be required in a future release.
The Rook operator config CSI_ENABLE_READ_AFFINITY was removed. v1.13 clusters that have modified this value to be "true" must set the option as desired in each CephCluster as documented here before upgrading to v1.14.
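
To illustrate the CSI image change in the breaking changes above, a values.yaml override for the rook-ceph chart might move from a single image string to separate fields roughly as follows. The key names under `csi.cephcsi` and the tag value are assumptions for illustration; confirm them against the chart's own values.yaml for the version being installed.

```yaml
# Before (assumed v1.13-style single image setting):
# csi:
#   cephcsi:
#     image: quay.io/cephcsi/cephcsi:v3.10.2

# After (assumed v1.14+ layout with separate repository and tag):
csi:
  cephcsi:
    repository: quay.io/cephcsi/cephcsi
    tag: v3.11.0   # hypothetical tag; use the one matching your Rook release
```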

Features

Kubernetes versions v1.25 through v1.29 are supported. K8s v1.30 will be supported as soon as released.
Ceph daemon pods using the default service account now use a new rook-ceph-default service account.
A custom Ceph application can be applied to a CephBlockPool CR.
Object stores can be created with shared metadata and data pools. Isolation between object stores is enabled via RADOS namespaces. This configuration is recommended to limit the number of pools when multiple object stores are created.
Support for VolumeGroupSnapshot is available for the RBD and CephFS CSI drivers.
Support for virtual style hosting for s3 buckets is added in the CephObjectStore, by adding hosting.dnsNames to the object store.
A static prefix can be specified for the CSI drivers and OBC provisioner (the default prefix is the rook-ceph namespace).
Azure Key Vault KMS support is added for storing OSD encryption keys.
Additional status columns added to the kubectl output for Rook CRDs.
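
As a sketch of the virtual-host-style S3 feature listed above, hosting.dnsNames could be added to a CephObjectStore like this. The object store name and namespace follow the chart diff in this PR, the DNS name is hypothetical, and the remaining required spec fields (pools, gateway) are omitted for brevity.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: ceph-objectstore      # object store name used in the chart diff
  namespace: default
spec:
  # metadataPool, dataPool and gateway settings omitted in this fragment
  hosting:
    dnsNames:
      - s3.example.com        # hypothetical DNS name that buckets are served under
```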

`v1.13.10`

Compare Source

Improvements

Rook v1.13.10 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

osd: Fix activate failure when block device moves (#14374, @BlaineEXE)
csi: Update csi-addons repo link for correctly versioned download (#14408, @Madhu-1)

`v1.13.9`

Compare Source

Improvements

Rook v1.13.9 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

mgr: Fix UpdateActiveMgrLabel to retry label update on failure (#14160, @rkachach)
core: Remove obsolete Ceph Pacific checks (#14210, @satoru-takeuchi)
osd: Add cephcluster status for deprecated OSDs that should be replaced (#14187, @travisn)
osd: Remove support for resize of legacy LVM-based OSDs on PVCs due to crash in resize container (#14100, @travisn)
osd: Prevent osd reconcile when device set names duplicated (#14002, @travisn)

`v1.13.8`

Compare Source

Improvements

Rook v1.13.8 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

external: Fix v2 port check in external script (#13982, @parth-gr)
security: Update go dependency go-jose to pass Snyk security scan (#13960, @subhamkrai)
osd: Start encrypted OSDs with metadata device using shared key (#13830, @cupnes)
helm: Use toYaml for discovery nodeAffinity (#13931, @hhk7734)

`v1.13.7`

Compare Source

Improvements

Rook v1.13.7 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Set default ceph version to v18.2.2 (#13913, @travisn)
monitoring: Increase default metrics scraping interval from 5s to 10s (#13923, @rkachach)
exporter: Apply labels from monitoring section of CephCluster to ceph-exporter (#13902, @rkachach)

`v1.13.6`

Compare Source

Improvements

Rook v1.13.6 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

helm: Replace the master tag in the values.yaml with the release tag (#13897, @travisn)
manifest: Reduce CRD size by removing some descriptions (#13793, @rkachach)
csi: Update CSIDriverOption params during saving cluster config (#13836, @Rakshith-R)
external: Remove requirement for v1 port and allow exclusive v2 mon port configuration (#13856, @parth-gr)
csi: Update sidecars to latest release (#13846, @Madhu-1)
operator: Use Linux container CPU quota (#13816, @uhthomas)
helm: Fix links to obsolete ceph master documentation (#13877, @galexrt)

`v1.13.5`

Compare Source

Improvements

Rook v1.13.5 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

pool: Skip crush rule update when not needed (#13772, @travisn)
osd: Support OSD creation with a metadata partition (#13314, @microyahoo)
csi: Update Ceph-CSI image to 3.10.2 (#13736, @Madhu-1)
mon: Set mon PDB max unavailable as 2 when there are 5 or more mons (#13794, @sp98)
external: fix syntax error import-external-cluster.sh (#13780, @timolow)
core: Continue processing PVs for network fencing when no node IPs found (#13768, @Madhu-1)
mgr: Remove unnecessary privileged security context from mgr sidecar container (#13741, @rkachach)
network: Disallow legacy hostNetwork provider when a non-default provider is specified (#13693, @obnoxxx)
csi: Disable CephFS network fencing (#13806, @subhamkrai)

`v1.13.4`

Compare Source

Improvements

Rook v1.13.4 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

helm: Remove cpu limits from all pods (#13722, @travisn)
core: Set blocking PDB even if no unhealthy PGs appear (#13511, @ushitora-anqou)
mgr: Update the dashboard password when the secret changes (#13644, @rkachach)
core: Skip reconcile if override configmap is unchanged (#13652, @travisn)
core: remove invalid ownerRef from networkFence (#13728, @subhamkrai)
osd: Correctly count the devices when metadataDevice is set (#13673, @satoru-takeuchi)
csi: Update network fence CR name (#13615, @riya-singhal31)
object: Add check specific to name and namespace for ceph cosi driver (#13623, @thotz)
exporter: Don't delete exporter service on daemon deletion (#13653, @travisn)
csi: Fix NetNamespaceFilePath generation with namespace instead of name (#13663, @iPraveenParihar)
csi: Option to set a static csi driver name (#13622, @Madhu-1)
object: Fix the default multisite zonegroup creation (#13655, @parth-gr)
docs: Declare the max supported K8s version (#13646, @parth-gr)
ci: Reformat the python script (#13645, @parth-gr)
object: Watch for updates to the cosidriver CRD (#13621, @thotz)
mgr: Improvements to dashboard configuration handling (#13604, @rkachach)

`v1.13.3`

Compare Source

Improvements

Rook v1.13.3 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

operator: Increase resource limits to 1.5 CPU (#13619, @travisn)
helm: Remove duplicated toolbox keyring (#13609, @eb4x)
exporter: Skip reconcile on exporter deletion (#13597, @travisn)
manifest: Remove obsolete pg_autoscaler from mgr modules examples (#13588, @travisn)
csi: Make leader election flags configurable (#13573, @Madhu-1)
csi: Update csi provisioner to 3.6.3 (#13579, @Madhu-1)
csi: Update feature gates cmdline args (#13258, @iPraveenParihar)

`v1.13.2`

Compare Source

Improvements

Rook v1.13.2 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

helm: Update cluster chart and all examples to ceph v18.2.1 (#13499, @travisn)
mds: Increase max limit of mds active daemons (#13561, @travisn)
external: Support the cluster-name legacy flag in the external script (#13540, @parth-gr)
core: Fix error handling on setting watcher (#13479, @satoru-takeuchi)
osd: Create ceph conf and keyring files before osd migration (#13524, @sp98)
doc: Resizing encryptedDevice is not yet supported for host-based clusters (#13452, @cupnes)
manifest: Shorten CRD descriptions to 100 chars (#13517, @travisn)
multus: Use nginx-unprivileged image from quay for multus tool (#13506, @BlaineEXE)

`v1.13.1`

Compare Source

Improvements

Rook v1.13.1 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

build: Update base and example manifests to ceph v18.2.1 (#13428, @BlaineEXE)
csi: Update default Ceph-CSI version to v3.10.1 (#13442, @riya-singhal31)
csi: Update the CSI-Addons sidecar to v0.8.0 (#13411, @nixpanic)
csi: Implement network fencing for CephFS (#13348, @riya-singhal31)
helm: Allow configuring monitoring interval (#13408, @charlie-haley)
mon: Allow changing hostNetwork settings (#12369, @sp98)
csi: Remove obsolete gRPC metrics service (#13439, @iPraveenParihar)
helm: Fix duplicate tolerations (#13418, @jfcoz)
ci: Run K8s v1.29 in the CI (#13400, @subhamkrai)
docs: Add spec.csi section in the CephCluster documentation (#13375, @Rakshith-R)

`v1.13.0`

Compare Source

Upgrade Guide

To upgrade from previous versions of Rook, see the Rook upgrade guide.

Breaking Changes

Removed support for Ceph Pacific (v16). Ceph Quincy (v17) and Ceph Reef (v18) are the only currently supported versions.
The minimum supported Kubernetes version is v1.23.
The minimum supported Ceph-CSI driver is v3.9.
The admission controller is removed. If the admission controller is enabled (it is disabled by default), it is recommended to be disabled before the upgrade. See the upgrade guide for more details.

Features

Added experimental cephConfig to the CephCluster CR to allow setting Ceph config options in the Ceph MON config store via the CRD. These settings supersede the ceph.conf override settings.
CephCSI v3.10 is now the default CSI driver version.
- Per-cluster CSI settings for read affinity moved from the operator configmap settings to the CephCluster CR
The default CephFS SubvolumeGroup has pinning enabled by default to distribute load across MDS ranks in predictable and stable ways.
The Ceph exporter daemon is updated to use a Ceph keyring with reduced privileges instead of the admin keyring.
If the host network setting changes in the CephCluster CR, the mons will now automatically failover to enable the new configuration.
Allow for additional advanced maintenance and troubleshooting of Ceph daemons, by respecting the label ceph.rook.io/do-not-reconcile for all Ceph daemons. This is helpful when using the debug command in the kubectl rook-ceph plugin.
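
The experimental cephConfig setting described in this feature list might be used along these lines. This is only a sketch: the nesting (a config section such as `global`, then option names with string values) is an assumption, and the option shown is an arbitrary example rather than a recommendation.

```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph                    # illustrative cluster name
  namespace: rook-ceph               # illustrative namespace
spec:
  cephConfig:
    # Assumption: first level is the Ceph config section, values are strings
    global:
      osd_pool_default_size: "3"     # example option written to the MON config store
```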

`v1.12.11`

Compare Source

Improvements

Rook v1.12.11 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

exporter: Skip reconcile on exporter deletion (#13597, @travisn)
helm: Allow configuring monitoring interval (#13408, @charlie-haley)
core: Golang linter issues with variables in loops and update linter version (#13324, @travisn)
multus: Use nginx-unprivileged image from quay (#13506, @BlaineEXE)

`v1.12.10`

Compare Source

Improvements

Rook v1.12.10 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

helm: Fix the namespace for the object store ingress (#13312, @jouve)
external: Allow run as a user flag for a non-default external user (#13383, @parth-gr)
mon: Proper detection of mon failover when the host path changes (#13360, @sp98)

`v1.12.9`

Compare Source

Improvements

Rook v1.12.9 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

core: Report node metrics using ceph telemetry (#12850, @parth-gr)
helm: Add namespace to all resource templates (#13288, @travisn)
core: Add pgHealthyRegex to DisruptionManagementSpec (#13225, @ushitora-anqou)
mgr: Adding CEPH_ARGS to the mgr pod so radosgw-admin can use it (#13256, @rkachach)
exporter: Change deployment strategy to Recreate (#13265, @weirdwiz)
helm: Use csiaddonsport parameter (#13259, @satoru-takeuchi)
mgr: Get servicemonitor exporter's interval from MonitoringSpec (#13248, @rkachach)
rgw: Handle mgr-proxied rgw cli commands in multus scenarios (#13237, @zer0def)
mgr: Honor the continueUpgradeAfterChecksEvenIfNotHealthy flag for mgr daemon (#13222, @obnoxxx)

`v1.12.8`

Compare Source

Improvements

Rook v1.12.8 is a patch release limited in scope and focusing on feature additions and bug fixes to the Ceph operator.

multus: Enable all placement for net addr detect job (#13206, @BlaineEXE)
nfs: Add liveness-probe to nfs-ganesha container (#12845, @synarete)
pool: Allow updating deviceClass on existing pool (#13069, @subhamkrai)
osd: Revert encrypted OSDs on partitions since encryption was not working properly (#13169, @satoru-takeuchi)
multus: Use rook image for ip range detection (#13129, @BlaineEXE)
mgr: Set interval of serviceMonitor to the value from MonitoringSpec ([#13179](https://togi

Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about these updates again.

This PR has been generated by Renovate Bot.

chii-bot · 2022-08-31T22:22:16Z

Path: cluster/core/rook-ceph/cluster/helm-release.yaml
Version: v1.9.12 -> v1.16.0

@@ -73,11 +73,25 @@
 # imagePullSecrets:
 # - name: my-registry-secret
 ---
+# Source: rook-ceph-cluster/templates/rbac.yaml
+# Service account for other components
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: rook-ceph-default
+ namespace: default # namespace:cluster
+ labels:
+ operator: rook
+ storage-backend: ceph
+# imagePullSecrets:
+# - name: my-registry-secret
+---
 # Source: rook-ceph-cluster/templates/configmap.yaml
 kind: ConfigMap
 apiVersion: v1
 metadata:
 name: rook-config-override
+ namespace: default # namespace:cluster
 data:
 config: |2
 [global]
@@ -96,16 +110,17 @@
 pool: ceph-blockpool
 clusterID: default
 csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
- csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
+ csi.storage.k8s.io/controller-expand-secret-namespace: 'default'
 csi.storage.k8s.io/fstype: ext4
 csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
- csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
+ csi.storage.k8s.io/node-stage-secret-namespace: 'default'
 csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
- csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
+ csi.storage.k8s.io/provisioner-secret-namespace: 'default'
 imageFeatures: layering
 imageFormat: "2"
 reclaimPolicy: Delete
 allowVolumeExpansion: true
+volumeBindingMode: Immediate
 ---
 # Source: rook-ceph-cluster/templates/cephfilesystem.yaml
 apiVersion: storage.k8s.io/v1
@@ -120,14 +135,15 @@
 pool: ceph-filesystem-data0
 clusterID: default
 csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
- csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
+ csi.storage.k8s.io/controller-expand-secret-namespace: 'default'
 csi.storage.k8s.io/fstype: ext4
 csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
- csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
+ csi.storage.k8s.io/node-stage-secret-namespace: 'default'
 csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
- csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
+ csi.storage.k8s.io/provisioner-secret-namespace: 'default'
 reclaimPolicy: Delete
 allowVolumeExpansion: true
+volumeBindingMode: Immediate
 ---
 # Source: rook-ceph-cluster/templates/cephobjectstore.yaml
 apiVersion: storage.k8s.io/v1
@@ -136,6 +152,7 @@
 name: ceph-bucket
 provisioner: default.ceph.rook.io/bucket
 reclaimPolicy: Delete
+volumeBindingMode: Immediate
 parameters:
 objectStoreName: ceph-objectstore
 objectStoreNamespace: default
@@ -179,10 +196,10 @@
 namespace: default # namespace:cluster
 rules:
 # this is needed for rook's "key-management" CLI to fetch the vault token from the secret when
- # validating the connection details
+ # validating the connection details and for key rotation operations.
 - apiGroups: [""]
 resources: ["secrets"]
- verbs: ["get"]
+ verbs: ["get", "update"]
 - apiGroups: [""]
 resources: ["configmaps"]
 verbs: ["get", "list", "watch", "create", "update", "delete"]
@@ -191,23 +208,6 @@
 verbs: ["get", "list", "create", "update", "delete"]
 ---
 # Source: rook-ceph-cluster/templates/rbac.yaml
-kind: Role
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-rules:
- # Placeholder role so the rgw service account will
- # be generated in the csv. Remove this role and role binding
- # when fixing https://github.com/rook/rook/issues/10141.
- - apiGroups:
- - ""
- resources:
- - configmaps
- verbs:
- - get
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
 # Aspects of ceph-mgr that operate within the cluster's namespace
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
@@ -242,9 +242,31 @@
 - apiGroups:
 - ceph.rook.io
 resources:
- - "*"
+ - cephclients
+ - cephclusters
+ - cephblockpools
+ - cephfilesystems
+ - cephnfses
+ - cephobjectstores
+ - cephobjectstoreusers
+ - cephobjectrealms
+ - cephobjectzonegroups
+ - cephobjectzones
+ - cephbuckettopics
+ - cephbucketnotifications
+ - cephrbdmirrors
+ - cephfilesystemmirrors
+ - cephfilesystemsubvolumegroups
+ - cephblockpoolradosnamespaces
+ - cephcosidrivers
 verbs:
- - "*"
+ - get
+ - list
+ - watch
+ - create
+ - update
+ - delete
+ - patch
 - apiGroups:
 - apps
 resources:
@@ -339,102 +361,6 @@
 - update
 ---
 # Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-default-psp
- namespace: default # namespace:cluster
- labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: default
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-rgw-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-mgr-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-mgr
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-cmd-reporter-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-cmd-reporter
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-purge-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-purge-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
 # Allow the operator to create resources in this cluster's namespace
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
@@ -467,22 +393,6 @@
 namespace: default # namespace:cluster
 ---
 # Source: rook-ceph-cluster/templates/rbac.yaml
-# Allow the rgw pods in this namespace to work with configmaps
-kind: RoleBinding
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: Role
- name: rook-ceph-rgw
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph-cluster/templates/rbac.yaml
 # Allow the ceph mgr to access resources scoped to the CephCluster namespace necessary for mgr modules
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
@@ -582,6 +492,7 @@
 kind: Ingress
 metadata:
 name: default-dashboard
+ namespace: default # namespace:cluster
 spec:
 rules:
 - host: rook.${SECRET_DOMAIN}
@@ -599,11 +510,14 @@
 - hosts:
 - rook.${SECRET_DOMAIN}
 ---
+
+---
 # Source: rook-ceph-cluster/templates/cephblockpool.yaml
 apiVersion: ceph.rook.io/v1
 kind: CephBlockPool
 metadata:
 name: ceph-blockpool
+ namespace: default # namespace:cluster
 spec:
 failureDomain: host
 replicated:
@@ -614,12 +528,13 @@
 kind: CephCluster
 metadata:
 name: default
+ namespace: default # namespace:cluster
 spec:
 monitoring:
 enabled: true
 cephVersion:
 allowUnsupported: false
- image: quay.io/ceph/ceph:v16.2.10
+ image: quay.io/ceph/ceph:v19.2.0
 cleanupPolicy:
 allowUninstallWithVolumes: false
 confirmation: ""
@@ -636,8 +551,6 @@
 urlPrefix: /
 dataDirHostPath: /var/lib/rook
 disruptionManagement:
- machineDisruptionBudgetNamespace: openshift-machine-api
- manageMachineDisruptionBudgets: false
 managePodBudgets: true
 osdMaintenanceTimeout: 30
 pgHealthCheckTimeout: 0
@@ -659,16 +572,24 @@
 disabled: false
 osd:
 disabled: false
+ logCollector:
+ enabled: true
+ maxLogSize: 500M
+ periodicity: daily
 mgr:
 allowMultiplePerNode: false
 count: 2
- modules:
- - enabled: true
- name: pg_autoscaler
+ modules: null
 mon:
 allowMultiplePerNode: false
 count: 3
 network:
+ connections:
+ compression:
+ enabled: false
+ encryption:
+ enabled: false
+ requireMsgr2: false
 provider: host
 priorityClassNames:
 mgr: system-cluster-critical
@@ -678,49 +599,48 @@
 resources:
 cleanup:
 limits:
- cpu: 500m
 memory: 1Gi
 requests:
 cpu: 500m
 memory: 100Mi
 crashcollector:
 limits:
- cpu: 500m
 memory: 60Mi
 requests:
 cpu: 100m
 memory: 60Mi
+ exporter:
+ limits:
+ memory: 128Mi
+ requests:
+ cpu: 50m
+ memory: 50Mi
 logcollector:
 limits:
- cpu: 500m
 memory: 1Gi
 requests:
 cpu: 100m
 memory: 100Mi
 mgr:
 limits:
- cpu: 1000m
 memory: 1Gi
 requests:
 cpu: 500m
 memory: 512Mi
 mgr-sidecar:
 limits:
- cpu: 500m
 memory: 100Mi
 requests:
 cpu: 100m
 memory: 40Mi
 mon:
 limits:
- cpu: 2000m
 memory: 2Gi
 requests:
 cpu: 1000m
 memory: 1Gi
 osd:
 limits:
- cpu: 2000m
 memory: 4Gi
 requests:
 cpu: 1000m
@@ -747,6 +667,7 @@
 name: k8s-worker03
 useAllDevices: false
 useAllNodes: false
+ upgradeOSDRequiresHealthyPGs: false
 waitTimeoutForHealthyOSDInMinutes: 10
 ---
 # Source: rook-ceph-cluster/templates/cephfilesystem.yaml
@@ -754,6 +675,7 @@
 kind: CephFilesystem
 metadata:
 name: ceph-filesystem
+ namespace: default # namespace:cluster
 spec:
 dataPools:
 - failureDomain: host
@@ -769,37 +691,55 @@
 priorityClassName: system-cluster-critical
 resources:
 limits:
- cpu: 2000m
 memory: 4Gi
 requests:
 cpu: 1000m
 memory: 4Gi
 ---
+# Source: rook-ceph-cluster/templates/cephfilesystem.yaml
+apiVersion: ceph.rook.io/v1
+kind: CephFilesystemSubVolumeGroup
+metadata:
+ name: ceph-filesystem-csi # lets keep the svg crd name same as `filesystem name + csi` for the default csi svg
+ namespace: default # namespace:cluster
+spec:
+ # The name of the subvolume group. If not set, the default is the name of the subvolumeGroup CR.
+ name: csi
+ # filesystemName is the metadata name of the CephFilesystem CR where the subvolume group will be created
+ filesystemName: ceph-filesystem
+ # reference https://docs.ceph.com/en/latest/cephfs/fs-volumes/#pinning-subvolumes-and-subvolume-groups
+ # only one out of (export, distributed, random) can be set at a time
+ # by default pinning is set with value: distributed=1
+ # for disabling default values set (distributed=0)
+ pinning:
+ distributed: 1 # distributed=<0, 1> (disabled=0)
+ # export: # export=<0-256> (disabled=-1)
+ # random: # random=[0.0, 1.0](disabled=0.0)
+---
 # Source: rook-ceph-cluster/templates/cephobjectstore.yaml
 apiVersion: ceph.rook.io/v1
 kind: CephObjectStore
 metadata:
 name: ceph-objectstore
+ namespace: default # namespace:cluster
 spec:
 dataPool:
 erasureCoded:
 codingChunks: 1
 dataChunks: 2
 failureDomain: host
+ parameters:
+ bulk: "true"
 gateway:
 instances: 1
 port: 80
 priorityClassName: system-cluster-critical
 resources:
 limits:
- cpu: 2000m
 memory: 2Gi
 requests:
 cpu: 1000m
 memory: 1Gi
- healthCheck:
- bucket:
- interval: 60s
 metadataPool:
 failureDomain: host
 replicated:
@@ -817,810 +757,881 @@
 namespace: default
 spec:
 # Import the raw prometheus rules since they have descriptions that should not be processed with the helm templates
- # copied from https://github.com/ceph/ceph/blob/master/monitoring/ceph-mixin/prometheus_alerts.yml
+ # Copied from https://github.com/ceph/ceph/blob/master/monitoring/ceph-mixin/prometheus_alerts.yml
+ # Attention: This is not a 1:1 copy of ceph-mixin alerts. This file contains several Rook-related adjustments.
+ # List of main adjustments:
+ # - Alerts related to cephadm are excluded
+ # - The PrometheusJobMissing alert is adjusted for the rook-ceph-mgr job, and the PrometheusJobExporterMissing alert is added
 groups:
- - name: cluster health
- rules:
- - alert: CephHealthError
- expr: ceph_health_status == 2
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.2.1
- annotations:
- summary: Cluster is in the ERROR state
- description: >
- The cluster state has been HEALTH_ERROR for more than 5 minutes. Please check "ceph health detail" for more information.
-
- - alert: CephHealthWarning
- expr: ceph_health_status == 1
- for: 15m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- summary: Cluster is in the WARNING state
- description: >
- The cluster state has been HEALTH_WARN for more than 15 minutes. Please check "ceph health detail" for more information.
-
- - name: mon
+ - name: "cluster health"
 rules:
- - alert: CephMonDownQuorumAtRisk
- expr: ((ceph_health_detail{name="MON_DOWN"} == 1) * on() (count(ceph_mon_quorum_status == 1) == bool (floor(count(ceph_mon_metadata) / 2) + 1))) == 1
- for: 30s
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.3.1
+ - alert: "CephHealthError"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down
- summary: Monitor quorum is at risk
- description: |
- {{ $min := query "floor(count(ceph_mon_metadata) / 2) +1" | first | value }}Quorum requires a majority of monitors (x {{ $min }}) to be active
- Without quorum the cluster will become inoperable, affecting all services and connected clients.
-
- The following monitors are down:
- {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonDown
- expr: (count(ceph_mon_quorum_status == 0) <= (count(ceph_mon_metadata) - floor(count(ceph_mon_metadata) / 2) + 1))
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down
- summary: One or more monitors down
- description: |
- {{ $down := query "count(ceph_mon_quorum_status == 0)" | first | value }}{{ $s := "" }}{{ if gt $down 1.0 }}{{ $s = "s" }}{{ end }}There are {{ $down }} monitor{{ $s }} down.
- Quorum is still intact, but the loss of an additional monitor will make your cluster inoperable.
-
- The following monitors are down:
- {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonDiskspaceCritical
- expr: ceph_health_detail{name="MON_DISK_CRIT"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.3.2
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-crit
- summary: Filesystem space on at least one monitor is critically low
- description: |
- The free space available to a monitor's store is critically low.
- You should increase the space available to the monitor(s). The default directory
- is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and under
- /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook.
- Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files.
- Also check any other directories under /var/lib/rook and other directories on the
- same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are;
- {{- range query "ceph_mon_metadata"}}
- - {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonDiskspaceLow
- expr: ceph_health_detail{name="MON_DISK_LOW"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-low
- summary: Disk space on at least one monitor is approaching full
- description: |
- The space available to a monitor's store is approaching full (>70% is the default).
- You should increase the space available to the monitor(s). The default directory
- is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and under
- /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook.
- Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files.
- Also check any other directories under /var/lib/rook and other directories on the
- same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are;
- {{- range query "ceph_mon_metadata"}}
- - {{ .Labels.hostname }}
- {{- end }}
- - alert: CephMonClockSkew
- expr: ceph_health_detail{name="MON_CLOCK_SKEW"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-clock-skew
- summary: Clock skew detected among monitors
- description: |
- Ceph monitors rely on closely synchronized time to maintain
- quorum and cluster consistency. This event indicates that time on at least
- one mon has drifted too far from the lead mon.
-
- Review cluster status with ceph -s. This will show which monitors
- are affected. Check the time sync status on each monitor host with
- "ceph time-sync-status" and the state and peers of your ntpd or chrony daemon.
- - name: osd
+ description: "The cluster state has been HEALTH_ERROR for more than 5 minutes. Please check 'ceph health detail' for more information."
+ summary: "Ceph is in the ERROR state"
+ expr: "ceph_health_status == 2"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.2.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephHealthWarning"
+ annotations:
+ description: "The cluster state has been HEALTH_WARN for more than 15 minutes. Please check 'ceph health detail' for more information."
+ summary: "Ceph is in the WARNING state"
+ expr: "ceph_health_status == 1"
+ for: "15m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "mon"
 rules:
- - alert: CephOSDDownHigh
- expr: count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.1
+ - alert: "CephMonDownQuorumAtRisk"
 annotations:
- summary: More than 10% of OSDs are down
- description: |
- {{ $value | humanize }}% or {{ with query "count(ceph_osd_up == 0)" }}{{ . | first | value }}{{ end }} of {{ with query "count(ceph_osd_up)" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%).
-
- The following OSDs are down:
- {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephOSDHostDown
- expr: ceph_health_detail{name="OSD_HOST_DOWN"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.8
- annotations:
- summary: An OSD host is offline
- description: |
- The following OSDs are down:
- {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0" }}
- - {{ .Labels.hostname }} : {{ .Labels.ceph_daemon }}
- {{- end }}
- - alert: CephOSDDown
- expr: ceph_health_detail{name="OSD_DOWN"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.2
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-down
- summary: An OSD has been marked down
- description: |
- {{ $num := query "count(ceph_osd_up == 0)" | first | value }}{{ $s := "" }}{{ if gt $num 1.0 }}{{ $s = "s" }}{{ end }}{{ $num }} OSD{{ $s }} down for over 5mins.
-
- The following OSD{{ $s }} {{ if eq $s "" }}is{{ else }}are{{ end }} down:
- {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0"}}
- - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }}
- {{- end }}
- - alert: CephOSDNearFull
- expr: ceph_health_detail{name="OSD_NEARFULL"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.3
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-nearfull
- summary: OSD(s) running low on free space (NEARFULL)
- description: |
- One or more OSDs have reached the NEARFULL threshold
-
- Use 'ceph health detail' and 'ceph osd df' to identify the problem.
- To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data.
- - alert: CephOSDFull
- expr: ceph_health_detail{name="OSD_FULL"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.6
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-full
- summary: OSD full, writes blocked
- description: |
- An OSD has reached the FULL threshold. Writes to pools that share the
- affected OSD will be blocked.
-
- Use 'ceph health detail' and 'ceph osd df' to identify the problem.
- To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data.
- - alert: CephOSDBackfillFull
- expr: ceph_health_detail{name="OSD_BACKFILLFULL"} > 0
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-backfillfull
- summary: OSD(s) too full for backfill operations
- description: "An OSD has reached the BACKFILL FULL threshold. This will prevent rebalance operations\nfrom completing. \nUse 'ceph health detail' and 'ceph osd df' to identify the problem.\n\nTo resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data.\n"
- - alert: CephOSDTooManyRepairs
- expr: ceph_health_detail{name="OSD_TOO_MANY_REPAIRS"} == 1
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-too-many-repairs
- summary: OSD reports a high number of read errors
- description: |
- Reads from an OSD have used a secondary PG to return data to the client, indicating
- a potential failing disk.
- - alert: CephOSDTimeoutsPublicNetwork
- expr: ceph_health_detail{name="OSD_SLOW_PING_TIME_FRONT"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- summary: Network issues delaying OSD heartbeats (public network)
- description: |
- OSD heartbeats on the cluster's 'public' network (frontend) are running slow. Investigate the network
- for latency or loss issues. Use 'ceph health detail' to show the affected OSDs.
- - alert: CephOSDTimeoutsClusterNetwork
- expr: ceph_health_detail{name="OSD_SLOW_PING_TIME_BACK"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- summary: Network issues delaying OSD heartbeats (cluster network)
- description: |
- OSD heartbeats on the cluster's 'cluster' network (backend) are running slow. Investigate the network
- for latency or loss issues. Use 'ceph health detail' to show the affected OSDs.
- - alert: CephOSDInternalDiskSizeMismatch
- expr: ceph_health_detail{name="BLUESTORE_DISK_SIZE_MISMATCH"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
+ description: "{{ $min := query \"floor(count(ceph_mon_metadata) / 2) + 1\" | first | value }}Quorum requires a majority of monitors (x {{ $min }}) to be active. Without quorum the cluster will become inoperable, affecting all services and connected clients. The following monitors are down: {{- range query \"(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down"
+ summary: "Monitor quorum is at risk"
+ expr: |
+ (
+ (ceph_health_detail{name="MON_DOWN"} == 1) * on() (
+ count(ceph_mon_quorum_status == 1) == bool (floor(count(ceph_mon_metadata) / 2) + 1)
+ )
+ ) == 1
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.3.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephMonDown"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-disk-size-mismatch
- summary: OSD size inconsistency error
 description: |
- One or more OSDs have an internal inconsistency between metadata and the size of the device.
- This could lead to the OSD(s) crashing in future. You should redeploy the affected OSDs.
- - alert: CephDeviceFailurePredicted
- expr: ceph_health_detail{name="DEVICE_HEALTH"} == 1
- for: 1m
+ {{ $down := query "count(ceph_mon_quorum_status == 0)" | first | value }}{{ $s := "" }}{{ if gt $down 1.0 }}{{ $s = "s" }}{{ end }}You have {{ $down }} monitor{{ $s }} down. Quorum is still intact, but the loss of an additional monitor will make your cluster inoperable. The following monitors are down: {{- range query "(ceph_mon_quorum_status == 0) + on(ceph_daemon) group_left(hostname) (ceph_mon_metadata * 0)" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-down"
+ summary: "One or more monitors down"
+ expr: |
+ count(ceph_mon_quorum_status == 0) <= (count(ceph_mon_metadata) - floor(count(ceph_mon_metadata) / 2) + 1)
+ for: "30s"
 labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#id2
- summary: Device(s) predicted to fail soon
- description: |
- The device health module has determined that one or more devices will fail
- soon. To review device status use 'ceph device ls'. To show a specific
- device use 'ceph device info <dev id>'.
-
- Mark the OSD out so that data may migrate to other OSDs. Once
- the OSD has drained, destroy the OSD, replace the device, and redeploy the OSD.
- - alert: CephDeviceFailurePredictionTooHigh
- expr: ceph_health_detail{name="DEVICE_HEALTH_TOOMANY"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.7
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephMonDiskspaceCritical"
+ annotations:
+ description: "The free space available to a monitor's store is critically low. You should increase the space available to the monitor(s). The default directory is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook. Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files. Also check any other directories under /var/lib/rook and other directories on the same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are; {{- range query \"ceph_mon_metadata\"}} - {{ .Labels.hostname }} {{- end }}"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-crit"
+ summary: "Filesystem space on at least one monitor is critically low"
+ expr: "ceph_health_detail{name=\"MON_DISK_CRIT\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.3.2"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephMonDiskspaceLow"
+ annotations:
+ description: "The space available to a monitor's store is approaching full (>70% is the default). You should increase the space available to the monitor(s). The default directory is /var/lib/ceph/mon-*/data/store.db on traditional deployments, and /var/lib/rook/mon-*/data/store.db on the mon pod's worker node for Rook. Look for old, rotated versions of *.log and MANIFEST*. Do NOT touch any *.sst files. Also check any other directories under /var/lib/rook and other directories on the same filesystem, often /var/log and /var/tmp are culprits. Your monitor hosts are; {{- range query \"ceph_mon_metadata\"}} - {{ .Labels.hostname }} {{- end }}"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-disk-low"
+ summary: "Drive space on at least one monitor is approaching full"
+ expr: "ceph_health_detail{name=\"MON_DISK_LOW\"} == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephMonClockSkew"
+ annotations:
+ description: "Ceph monitors rely on closely synchronized time to maintain quorum and cluster consistency. This event indicates that the time on at least one mon has drifted too far from the lead mon. Review cluster status with ceph -s. This will show which monitors are affected. Check the time sync status on each monitor host with 'ceph time-sync-status' and the state and peers of your ntpd or chrony daemon."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#mon-clock-skew"
+ summary: "Clock skew detected among monitors"
+ expr: "ceph_health_detail{name=\"MON_CLOCK_SKEW\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "osd"
+ rules:
+ - alert: "CephOSDDownHigh"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-toomany
- summary: Too many devices are predicted to fail, unable to resolve
- description: |
- The device health module has determined that devices predicted to
- fail can not be remediated automatically, since too many OSDs would be removed from the
- cluster to ensure performance and availabililty. Prevent data
- integrity issues by adding new OSDs so that data may be relocated.
- - alert: CephDeviceFailureRelocationIncomplete
- expr: ceph_health_detail{name="DEVICE_HEALTH_IN_USE"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
+ description: "{{ $value | humanize }}% or {{ with query \"count(ceph_osd_up == 0)\" }}{{ . | first | value }}{{ end }} of {{ with query \"count(ceph_osd_up)\" }}{{ . | first | value }}{{ end }} OSDs are down (>= 10%). The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}"
+ summary: "More than 10% of OSDs are down"
+ expr: "count(ceph_osd_up == 0) / count(ceph_osd_up) * 100 >= 10"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephOSDHostDown"
+ annotations:
+ description: "The following OSDs are down: {{- range query \"(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0\" }} - {{ .Labels.hostname }} : {{ .Labels.ceph_daemon }} {{- end }}"
+ summary: "An OSD host is offline"
+ expr: "ceph_health_detail{name=\"OSD_HOST_DOWN\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.8"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDDown"
+ annotations:
+ description: |
+ {{ $num := query "count(ceph_osd_up == 0)" | first | value }}{{ $s := "" }}{{ if gt $num 1.0 }}{{ $s = "s" }}{{ end }}{{ $num }} OSD{{ $s }} down for over 5mins. The following OSD{{ $s }} {{ if eq $s "" }}is{{ else }}are{{ end }} down: {{- range query "(ceph_osd_up * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) == 0"}} - {{ .Labels.ceph_daemon }} on {{ .Labels.hostname }} {{- end }}
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-down"
+ summary: "An OSD has been marked down"
+ expr: "ceph_health_detail{name=\"OSD_DOWN\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDNearFull"
+ annotations:
+ description: "One or more OSDs have reached the NEARFULL threshold. Use 'ceph health detail' and 'ceph osd df' to identify the problem. To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-nearfull"
+ summary: "OSD(s) running low on free space (NEARFULL)"
+ expr: "ceph_health_detail{name=\"OSD_NEARFULL\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.3"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDFull"
+ annotations:
+ description: "An OSD has reached the FULL threshold. Writes to pools that share the affected OSD will be blocked. Use 'ceph health detail' and 'ceph osd df' to identify the problem. To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-full"
+ summary: "OSD full, writes blocked"
+ expr: "ceph_health_detail{name=\"OSD_FULL\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.6"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephOSDBackfillFull"
+ annotations:
+ description: "An OSD has reached the BACKFILL FULL threshold. This will prevent rebalance operations from completing. Use 'ceph health detail' and 'ceph osd df' to identify the problem. To resolve, add capacity to the affected OSD's failure domain, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-backfillfull"
+ summary: "OSD(s) too full for backfill operations"
+ expr: "ceph_health_detail{name=\"OSD_BACKFILLFULL\"} > 0"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDTooManyRepairs"
+ annotations:
+ description: "Reads from an OSD have used a secondary PG to return data to the client, indicating a potential failing drive."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#osd-too-many-repairs"
+ summary: "OSD reports a high number of read errors"
+ expr: "ceph_health_detail{name=\"OSD_TOO_MANY_REPAIRS\"} == 1"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDTimeoutsPublicNetwork"
+ annotations:
+ description: "OSD heartbeats on the cluster's 'public' network (frontend) are running slow. Investigate the network for latency or loss issues. Use 'ceph health detail' to show the affected OSDs."
+ summary: "Network issues delaying OSD heartbeats (public network)"
+ expr: "ceph_health_detail{name=\"OSD_SLOW_PING_TIME_FRONT\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDTimeoutsClusterNetwork"
+ annotations:
+ description: "OSD heartbeats on the cluster's 'cluster' network (backend) are slow. Investigate the network for latency issues on this subnet. Use 'ceph health detail' to show the affected OSDs."
+ summary: "Network issues delaying OSD heartbeats (cluster network)"
+ expr: "ceph_health_detail{name=\"OSD_SLOW_PING_TIME_BACK\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDInternalDiskSizeMismatch"
+ annotations:
+ description: "One or more OSDs have an internal inconsistency between metadata and the size of the device. This could lead to the OSD(s) crashing in future. You should redeploy the affected OSDs."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-disk-size-mismatch"
+ summary: "OSD size inconsistency error"
+ expr: "ceph_health_detail{name=\"BLUESTORE_DISK_SIZE_MISMATCH\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephDeviceFailurePredicted"
+ annotations:
+ description: "The device health module has determined that one or more devices will fail soon. To review device status use 'ceph device ls'. To show a specific device use 'ceph device info <dev id>'. Mark the OSD out so that data may migrate to other OSDs. Once the OSD has drained, destroy the OSD, replace the device, and redeploy the OSD."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#id2"
+ summary: "Device(s) predicted to fail soon"
+ expr: "ceph_health_detail{name=\"DEVICE_HEALTH\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephDeviceFailurePredictionTooHigh"
+ annotations:
+ description: "The device health module has determined that devices predicted to fail can not be remediated automatically, since too many OSDs would be removed from the cluster to ensure performance and availability. Prevent data integrity issues by adding new OSDs so that data may be relocated."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-toomany"
+ summary: "Too many devices are predicted to fail, unable to resolve"
+ expr: "ceph_health_detail{name=\"DEVICE_HEALTH_TOOMANY\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.7"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephDeviceFailureRelocationIncomplete"
+ annotations:
+ description: "The device health module has determined that one or more devices will fail soon, but the normal process of relocating the data on the device to other OSDs in the cluster is blocked. \nEnsure that the cluster has available free space. It may be necessary to add capacity to the cluster to allow data from the failing device to successfully migrate, or to enable the balancer."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-in-use"
+ summary: "Device failure is predicted, but unable to relocate data"
+ expr: "ceph_health_detail{name=\"DEVICE_HEALTH_IN_USE\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDFlapping"
+ annotations:
+ description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} was marked down and back up {{ $value | humanize }} times once a minute for 5 minutes. This may indicate a network issue (latency, packet loss, MTU mismatch) on the cluster network, or the public network if no cluster network is deployed. Check the network stats on the listed host(s)."
+ documentation: "https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd#flapping-osds"
+ summary: "Network issues are causing OSDs to flap (mark each other down)"
+ expr: "(rate(ceph_osd_up[5m]) * on(ceph_daemon) group_left(hostname) ceph_osd_metadata) * 60 > 1"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.4"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephOSDReadErrors"
+ annotations:
+ description: "An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors"
+ summary: "Device read errors detected"
+ expr: "ceph_health_detail{name=\"BLUESTORE_SPURIOUS_READ_ERRORS\"} == 1"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGImbalance"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#device-health-in-use
- summary: Device failure is predicted, but unable to relocate data
- description: |
- The device health module has determined that one or more devices will fail
- soon, but the normal process of relocating the data on the device to other
- OSDs in the cluster is blocked.
-
- Ensure that the cluster has available free space. It may be necessary to add
- capacity to the cluster to allow the data from the failing device to
- successfully migrate, or to enable the balancer.
- - alert: CephOSDFlapping
- expr: |
- (
- rate(ceph_osd_up[5m])
- * on(ceph_daemon) group_left(hostname) ceph_osd_metadata
- ) * 60 > 1
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.4
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-osd#flapping-osds
- summary: Network issues are causing OSDs to flap (mark each other down)
- description: >
- OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} was marked down and back up {{ $value | humanize }} times once a minute for 5 minutes. This may indicate a network issue (latency, packet loss, MTU mismatch) on the cluster network, or the public network if no cluster network is deployed. Check network stats on the listed host(s).
-
- - alert: CephOSDReadErrors
- expr: ceph_health_detail{name="BLUESTORE_SPURIOUS_READ_ERRORS"} == 1
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#bluestore-spurious-read-errors
- summary: Device read errors detected
- description: >
- An OSD has encountered read errors, but the OSD has recovered by retrying the reads. This may indicate an issue with hardware or the kernel.
-
- # alert on high deviation from average PG count
- - alert: CephPGImbalance
+ description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count."
+ summary: "PGs are not balanced across OSDs"
 expr: |
 abs(
- (
- (ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)
- ) / on (job) group_left avg(ceph_osd_numpg > 0) by (job)
+ ((ceph_osd_numpg > 0) - on (job) group_left avg(ceph_osd_numpg > 0) by (job)) /
+ on (job) group_left avg(ceph_osd_numpg > 0) by (job)
 ) * on (ceph_daemon) group_left(hostname) ceph_osd_metadata > 0.30
- for: 5m
+ for: "5m"
 labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.4.5
- annotations:
- summary: PGs are not balanced across OSDs
- description: >
- OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count.
-
- # alert on high commit latency...but how high is too high
- - name: mds
+ oid: "1.3.6.1.4.1.50495.1.2.1.4.5"
+ severity: "warning"
+ type: "ceph_default"
+ - name: "mds"
 rules:
- - alert: CephFilesystemDamaged
- expr: ceph_health_detail{name="MDS_DAMAGE"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.1
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages
- summary: CephFS filesystem is damaged.
- description: >
- Filesystem metadata has been corrupted. Data may be inaccessible. Analyze metrics from the MDS daemon admin socket, or escalate to support.
-
- - alert: CephFilesystemOffline
- expr: ceph_health_detail{name="MDS_ALL_DOWN"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.3
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-all-down
- summary: CephFS filesystem is offline
- description: >
- All MDS ranks are unavailable. The MDS daemons managing metadata are down, rendering the filesystem offline.
-
- - alert: CephFilesystemDegraded
- expr: ceph_health_detail{name="FS_DEGRADED"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.4
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-degraded
- summary: CephFS filesystem is degraded
- description: >
- One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available, at worst the filesystem is completely unusable.
-
- - alert: CephFilesystemMDSRanksLow
- expr: ceph_health_detail{name="MDS_UP_LESS_THAN_MAX"} > 0
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-up-less-than-max
- summary: MDS daemon count is lower than configured
- description: >
- The filesystem's "max_mds" setting defines the number of MDS ranks in the filesystem. The current number of active MDS daemons is less than this value.
-
- - alert: CephFilesystemInsufficientStandby
- expr: ceph_health_detail{name="MDS_INSUFFICIENT_STANDBY"} > 0
- for: 1m
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-insufficient-standby
- summary: Ceph filesystem standby daemons too few
- description: >
- The minimum number of standby daemons required by standby_count_wanted is less than the current number of standby daemons. Adjust the standby count or increase the number of MDS daemons.
-
- - alert: CephFilesystemFailureNoStandby
- expr: ceph_health_detail{name="FS_WITH_FAILED_MDS"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.5
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-with-failed-mds
- summary: MDS daemon failed, no further standby available
- description: >
- An MDS daemon has failed, leaving only one active rank and no available standby. Investigate the cause of the failure or add a standby MDS.
-
- - alert: CephFilesystemReadOnly
- expr: ceph_health_detail{name="MDS_HEALTH_READ_ONLY"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.5.2
- annotations:
- documentation: https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages
- summary: CephFS filesystem in read only mode due to write error(s)
- description: >
- The filesystem has switched to READ ONLY due to an unexpected error when writing to the metadata pool.
-
- Analyze the output from the MDS daemon admin socket, or escalate to support.
-
- - name: mgr
+ - alert: "CephFilesystemDamaged"
+ annotations:
+ description: "Filesystem metadata has been corrupted. Data may be inaccessible. Analyze metrics from the MDS daemon admin socket, or escalate to support."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages"
+ summary: "CephFS filesystem is damaged."
+ expr: "ceph_health_detail{name=\"MDS_DAMAGE\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemOffline"
+ annotations:
+ description: "All MDS ranks are unavailable. The MDS daemons managing metadata are down, rendering the filesystem offline."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-all-down"
+ summary: "CephFS filesystem is offline"
+ expr: "ceph_health_detail{name=\"MDS_ALL_DOWN\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemDegraded"
+ annotations:
+ description: "One or more metadata daemons (MDS ranks) are failed or in a damaged state. At best the filesystem is partially available, at worst the filesystem is completely unusable."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-degraded"
+ summary: "CephFS filesystem is degraded"
+ expr: "ceph_health_detail{name=\"FS_DEGRADED\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemMDSRanksLow"
+ annotations:
+ description: "The filesystem's 'max_mds' setting defines the number of MDS ranks in the filesystem. The current number of active MDS daemons is less than this value."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-up-less-than-max"
+ summary: "Ceph MDS daemon count is lower than configured"
+ expr: "ceph_health_detail{name=\"MDS_UP_LESS_THAN_MAX\"} > 0"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephFilesystemInsufficientStandby"
+ annotations:
+ description: "The minimum number of standby daemons required by standby_count_wanted is less than the current number of standby daemons. Adjust the standby count or increase the number of MDS daemons."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#mds-insufficient-standby"
+ summary: "Ceph filesystem standby daemons too few"
+ expr: "ceph_health_detail{name=\"MDS_INSUFFICIENT_STANDBY\"} > 0"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephFilesystemFailureNoStandby"
+ annotations:
+ description: "An MDS daemon has failed, leaving only one active rank and no available standby. Investigate the cause of the failure or add a standby MDS."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages/#fs-with-failed-mds"
+ summary: "MDS daemon failed, no further standby available"
+ expr: "ceph_health_detail{name=\"FS_WITH_FAILED_MDS\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.5"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephFilesystemReadOnly"
+ annotations:
+ description: "The filesystem has switched to READ ONLY due to an unexpected error when writing to the metadata pool. Either analyze the output from the MDS daemon admin socket, or escalate to support."
+ documentation: "https://docs.ceph.com/en/latest/cephfs/health-messages#cephfs-health-messages"
+ summary: "CephFS filesystem in read only mode due to write error(s)"
+ expr: "ceph_health_detail{name=\"MDS_HEALTH_READ_ONLY\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.5.2"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "mgr"
 rules:
- - alert: CephMgrModuleCrash
- expr: ceph_health_detail{name="RECENT_MGR_MODULE_CRASH"} == 1
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.6.1
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#recent-mgr-module-crash
- summary: A manager module has recently crashed
- description: >
- One or more mgr modules have crashed and have yet to be acknowledged by an administrator. A crashed module may impact functionality within the cluster. Use the 'ceph crash' command to determine which module has failed, and archive it to acknowledge the failure.
-
- - alert: CephMgrPrometheusModuleInactive
- expr: up{job="ceph"} == 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.6.2
- annotations:
- summary: The mgr/prometheus module is not available
- description: >
- The mgr/prometheus module at {{ $labels.instance }} is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down.
-
- Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine the mgr/prometheus module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'.
-
- - name: pgs
+ - alert: "CephMgrModuleCrash"
+ annotations:
+ description: "One or more mgr modules have crashed and have yet to be acknowledged by an administrator. A crashed module may impact functionality within the cluster. Use the 'ceph crash' command to determine which module has failed, and archive it to acknowledge the failure."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#recent-mgr-module-crash"
+ summary: "A manager module has recently crashed"
+ expr: "ceph_health_detail{name=\"RECENT_MGR_MODULE_CRASH\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.6.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephMgrPrometheusModuleInactive"
+ annotations:
+ description: "The mgr/prometheus module at {{ $labels.instance }} is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'."
+ summary: "The mgr/prometheus module is not available"
+ expr: "up{job=\"ceph\"} == 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.6.2"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "pgs"
 rules:
- - alert: CephPGsInactive
- expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_active) > 0
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.1
- annotations:
- summary: One or more placement groups are inactive
- description: >
- {{ $value }} PGs have been inactive for more than 5 minutes in pool {{ $labels.name }}. Inactive placement groups are not able to serve read/write requests.
-
- - alert: CephPGsUnclean
- expr: ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_clean) > 0
- for: 15m
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.2
- annotations:
- summary: One or more placement groups are marked unclean
- description: >
- {{ $value }} PGs have been unclean for more than 15 minutes in pool {{ $labels.name }}. Unclean PGs have not recovered from a previous failure.
-
- - alert: CephPGsDamaged
- expr: ceph_health_detail{name=~"PG_DAMAGED|OSD_SCRUB_ERRORS"} == 1
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.4
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-damaged
- summary: Placement group damaged; manual intervention needed
- description: >
- Scrubs have flagged at least one PG as damaged or inconsistent.
-
- Check to see which PG is affected, and attempt a manual repair if necessary. To list problematic placement groups, use 'ceph health detail' or 'rados list-inconsistent-pg <pool>'. To repair PGs use the 'ceph pg repair <pg_num>' command.
-
- - alert: CephPGRecoveryAtRisk
- expr: ceph_health_detail{name="PG_RECOVERY_FULL"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.5
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-recovery-full
- summary: OSDs are too full for recovery
- description: >
- Data redundancy is at risk since one or more OSDs are at or above the 'full' threshold. Add capacity to the cluster, restore down/out OSDs, or delete unwanted data.
-
- - alert: CephPGUnavailableBlockingIO
- # PG_AVAILABILITY, but an OSD is not in a DOWN state
- expr: ((ceph_health_detail{name="PG_AVAILABILITY"} == 1) - scalar(ceph_health_detail{name="OSD_DOWN"})) == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.3
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-availability
- summary: PG is unavailable, blocking I/O
- description: >
- Data availability is reduced, impacting the cluster's ability to service I/O. One or more placement groups (PGs) are in a state that blocks I/O.
-
- - alert: CephPGBackfillAtRisk
- expr: ceph_health_detail{name="PG_BACKFILL_FULL"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.7.6
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-backfill-full
- summary: Backfill operations are blocked due to lack of free space
- description: >
- Data redundancy may be at risk due to lack of free space within the cluster. One or more OSDs have breached their 'backfillfull' threshold. Add more capacity, or delete unwanted data.
-
- - alert: CephPGNotScrubbed
- expr: ceph_health_detail{name="PG_NOT_SCRUBBED"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
+ - alert: "CephPGsInactive"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-scrubbed
- summary: Placement group(s) have not been scrubbed
- description: |
- One or more PGs have not been scrubbed recently. Scrubs check metadata integrity,
- protecting against bit-rot. They check that metadata
- is consistent across data replicas. When PGs miss their scrub interval, it may
- indicate that the scrub window is too small, or PGs were not in a 'clean' state during the
- scrub window.
-
- You can manually initiate a scrub with: ceph pg scrub <pgid>
- - alert: CephPGsHighPerOSD
- expr: ceph_health_detail{name="TOO_MANY_PGS"} == 1
- for: 1m
- labels:
- severity: warning
- type: ceph_default
+ description: "{{ $value }} PGs have been inactive for more than 5 minutes in pool {{ $labels.name }}. Inactive placement groups are not able to serve read/write requests."
+ summary: "One or more placement groups are inactive"
+ expr: "ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_active) > 0"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGsUnclean"
+ annotations:
+ description: "{{ $value }} PGs have been unclean for more than 15 minutes in pool {{ $labels.name }}. Unclean PGs have not recovered from a previous failure."
+ summary: "One or more placement groups are marked unclean"
+ expr: "ceph_pool_metadata * on(pool_id,instance) group_left() (ceph_pg_total - ceph_pg_clean) > 0"
+ for: "15m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGsDamaged"
+ annotations:
+ description: "During data consistency checks (scrub), at least one PG has been flagged as being damaged or inconsistent. Check to see which PG is affected, and attempt a manual repair if necessary. To list problematic placement groups, use 'rados list-inconsistent-pg <pool>'. To repair PGs use the 'ceph pg repair <pg_num>' command."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-damaged"
+ summary: "Placement group damaged, manual intervention needed"
+ expr: "ceph_health_detail{name=~\"PG_DAMAGED|OSD_SCRUB_ERRORS\"} == 1"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGRecoveryAtRisk"
+ annotations:
+ description: "Data redundancy is at risk since one or more OSDs are at or above the 'full' threshold. Add more capacity to the cluster, restore down/out OSDs, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-recovery-full"
+ summary: "OSDs are too full for recovery"
+ expr: "ceph_health_detail{name=\"PG_RECOVERY_FULL\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.5"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGUnavailableBlockingIO"
+ annotations:
+ description: "Data availability is reduced, impacting the cluster's ability to service I/O. One or more placement groups (PGs) are in a state that blocks I/O."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-availability"
+ summary: "PG is unavailable, blocking I/O"
+ expr: "((ceph_health_detail{name=\"PG_AVAILABILITY\"} == 1) - scalar(ceph_health_detail{name=\"OSD_DOWN\"})) == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGBackfillAtRisk"
+ annotations:
+ description: "Data redundancy may be at risk due to lack of free space within the cluster. One or more OSDs have reached the 'backfillfull' threshold. Add more capacity, or delete unwanted data."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-backfill-full"
+ summary: "Backfill operations are blocked due to lack of free space"
+ expr: "ceph_health_detail{name=\"PG_BACKFILL_FULL\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.7.6"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPGNotScrubbed"
+ annotations:
+ description: "One or more PGs have not been scrubbed recently. Scrubs check metadata integrity, protecting against bit-rot. They check that metadata is consistent across data replicas. When PGs miss their scrub interval, it may indicate that the scrub window is too small, or PGs were not in a 'clean' state during the scrub window. You can manually initiate a scrub with: ceph pg scrub <pgid>"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-scrubbed"
+ summary: "Placement group(s) have not been scrubbed"
+ expr: "ceph_health_detail{name=\"PG_NOT_SCRUBBED\"} == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGsHighPerOSD"
+ annotations:
+ description: "The number of placement groups per OSD is too high (exceeds the mon_max_pg_per_osd setting).\n Check that the pg_autoscaler has not been disabled for any pools with 'ceph osd pool autoscale-status', and that the profile selected is appropriate. You may also adjust the target_size_ratio of a pool to guide the autoscaler based on the expected relative size of the pool ('ceph osd pool set cephfs.cephfs.meta target_size_ratio .1') or set the pg_autoscaler mode to 'warn' and adjust pg_num appropriately for one or more pools."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks/#too-many-pgs"
+ summary: "Placement groups per OSD is too high"
+ expr: "ceph_health_detail{name=\"TOO_MANY_PGS\"} == 1"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPGNotDeepScrubbed"
+ annotations:
+ description: "One or more PGs have not been deep scrubbed recently. Deep scrubs protect against bit-rot. They compare data replicas to ensure consistency. When PGs miss their deep scrub interval, it may indicate that the window is too small or PGs were not in a 'clean' state during the deep-scrub window."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-deep-scrubbed"
+ summary: "Placement group(s) have not been deep scrubbed"
+ expr: "ceph_health_detail{name=\"PG_NOT_DEEP_SCRUBBED\"} == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "nodes"
+ rules:
+ - alert: "CephNodeRootFilesystemFull"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks/#too-many-pgs
- summary: Placement groups per OSD is too high
- description: |
- The number of placement groups per OSD is too high (exceeds the mon_max_pg_per_osd setting).
-
- Check that the pg_autoscaler has not been disabled for any pools with 'ceph osd pool autoscale-status',
- and that the profile selected is appropriate. You may also adjust the target_size_ratio of a pool to guide
- the autoscaler based on the expected relative size of the pool
- ('ceph osd pool set cephfs.cephfs.meta target_size_ratio .1') or set the pg_autoscaler
- mode to "warn" and adjust pg_num appropriately for one or more pools.
- - alert: CephPGNotDeepScrubbed
- expr: ceph_health_detail{name="PG_NOT_DEEP_SCRUBBED"} == 1
- for: 5m
- labels:
- severity: warning
- type: ceph_default
+ description: "Root volume is dangerously full: {{ $value | humanize }}% free."
+ summary: "Root filesystem is dangerously full"
+ expr: "node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"} * 100 < 5"
+ for: "5m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephNodeNetworkPacketDrops"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pg-not-deep-scrubbed
- summary: Placement group(s) have not been deep scrubbed
- description: |
- One or more PGs have not been deep scrubbed recently. Deep scrubs
- protect against bit-rot. They compare data
- replicas to ensure consistency. When PGs miss their deep scrub interval, it may indicate
- that the window is too small or PGs were not in a 'clean' state during the deep-scrub
- window.
-
- You can manually initiate a deep scrub with: ceph pg deep-scrub <pgid>
- - name: nodes
- rules:
- - alert: CephNodeRootFilesystemFull
- expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 5
- for: 5m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.1
- annotations:
- summary: Root filesystem is dangerously full
- description: >
- Root volume is dangerously full: {{ $value | humanize }}% free.
-
- # alert on packet errors and drop rate
- - alert: CephNodeNetworkPacketDrops
+ description: "Node {{ $labels.instance }} experiences packet drop > 0.5% or > 10 packets/s on interface {{ $labels.device }}."
+ summary: "One or more NICs reports packet drops"
 expr: |
 (
- increase(node_network_receive_drop_total{device!="lo"}[1m]) +
- increase(node_network_transmit_drop_total{device!="lo"}[1m])
+ rate(node_network_receive_drop_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_drop_total{device!="lo"}[1m])
 ) / (
- increase(node_network_receive_packets_total{device!="lo"}[1m]) +
- increase(node_network_transmit_packets_total{device!="lo"}[1m])
- ) >= 0.0001 or (
- increase(node_network_receive_drop_total{device!="lo"}[1m]) +
- increase(node_network_transmit_drop_total{device!="lo"}[1m])
+ rate(node_network_receive_packets_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_packets_total{device!="lo"}[1m])
+ ) >= 0.0050000000000000001 and (
+ rate(node_network_receive_drop_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_drop_total{device!="lo"}[1m])
 ) >= 10
 labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.2
- annotations:
- summary: One or more NICs reports packet drops
- description: >
- Node {{ $labels.instance }} experiences packet drop > 0.01% or > 10 packets/s on interface {{ $labels.device }}.
-
- - alert: CephNodeNetworkPacketErrors
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeNetworkPacketErrors"
+ annotations:
+ description: "Node {{ $labels.instance }} experiences packet errors > 0.01% or > 10 packets/s on interface {{ $labels.device }}."
+ summary: "One or more NICs reports packet errors"
 expr: |
 (
- increase(node_network_receive_errs_total{device!="lo"}[1m]) +
- increase(node_network_transmit_errs_total{device!="lo"}[1m])
+ rate(node_network_receive_errs_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_errs_total{device!="lo"}[1m])
 ) / (
- increase(node_network_receive_packets_total{device!="lo"}[1m]) +
- increase(node_network_transmit_packets_total{device!="lo"}[1m])
+ rate(node_network_receive_packets_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_packets_total{device!="lo"}[1m])
 ) >= 0.0001 or (
- increase(node_network_receive_errs_total{device!="lo"}[1m]) +
- increase(node_network_transmit_errs_total{device!="lo"}[1m])
+ rate(node_network_receive_errs_total{device!="lo"}[1m]) +
+ rate(node_network_transmit_errs_total{device!="lo"}[1m])
 ) >= 10
 labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.3
- annotations:
- summary: One or more NICs reports packet errors
- description: >
- Node {{ $labels.instance }} experiences packet errors > 0.01% or > 10 packets/s on interface {{ $labels.device }}.
-
- # Restrict to device names beginning with '/' to skip false alarms from
- # tmpfs, overlay type filesystems
- - alert: CephNodeDiskspaceWarning
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.3"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeNetworkBondDegraded"
+ annotations:
+ description: "Bond {{ $labels.master }} is degraded on Node {{ $labels.instance }}."
+ summary: "Degraded Bond on Node {{ $labels.instance }}"
 expr: |
- predict_linear(node_filesystem_free_bytes{device=~"/.*"}[2d], 3600 * 24 * 5) *
- on(instance) group_left(nodename) node_uname_info < 0
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.8.4
- annotations:
- summary: Host filesystem free space is low
- description: >
- Mountpoint {{ $labels.mountpoint }} on {{ $labels.nodename }} will be full in less than 5 days based on the 48 hour trailing fill rate.
-
- - alert: CephNodeInconsistentMTU
- expr: node_network_mtu_bytes{device!="lo"} * (node_network_up{device!="lo"} > 0) != on() group_left() (quantile(0.5, node_network_mtu_bytes{device!="lo"}))
+ node_bonding_slaves - node_bonding_active != 0
 labels:
- severity: warning
- type: ceph_default
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeDiskspaceWarning"
+ annotations:
+ description: "Mountpoint {{ $labels.mountpoint }} on {{ $labels.nodename }} will be full in less than 5 days based on the 48 hour trailing fill rate."
+ summary: "Host filesystem free space is getting low"
+ expr: "predict_linear(node_filesystem_free_bytes{device=~\"/.*\"}[2d], 3600 * 24 * 5) *on(instance) group_left(nodename) node_uname_info < 0"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.8.4"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephNodeInconsistentMTU"
+ annotations:
+ description: "Node {{ $labels.instance }} has a different MTU size ({{ $value }}) than the median of devices named {{ $labels.device }}."
+ summary: "MTU settings across Ceph hosts are inconsistent"
+ expr: "node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0) == scalar( max by (device) (node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) != quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) )or node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0) == scalar( min by (device) (node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) != quantile by (device) (.5, node_network_mtu_bytes * (node_network_up{device!=\"lo\"} > 0)) )"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "pools"
+ rules:
+ - alert: "CephPoolGrowthWarning"
 annotations:
- summary: MTU settings across hosts are inconsistent
- description: >
- Node {{ $labels.instance }} has a different MTU size ({{ $value }}) than the median value on device {{ $labels.device }}.
-
- - name: pools
+ description: "Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours."
+ summary: "Pool growth rate may soon exceed capacity"
+ expr: "(predict_linear(ceph_pool_percent_used[2d], 3600 * 24 * 5) * on(pool_id, instance, pod) group_right() ceph_pool_metadata) >= 95"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.9.2"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPoolBackfillFull"
+ annotations:
+ description: "A pool is approaching the near full threshold, which will prevent recovery/backfill operations from completing. Consider adding more capacity."
+ summary: "Free space in a pool is too low for recovery/backfill"
+ expr: "ceph_health_detail{name=\"POOL_BACKFILLFULL\"} > 0"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephPoolFull"
+ annotations:
+ description: "A pool has reached its MAX quota, or OSDs supporting the pool have reached the FULL threshold. Until this is resolved, writes to the pool will be blocked. Pool Breakdown (top 5) {{- range query \"topk(5, sort_desc(ceph_pool_percent_used * on(pool_id) group_right ceph_pool_metadata))\" }} - {{ .Labels.name }} at {{ .Value }}% {{- end }} Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>)"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#pool-full"
+ summary: "Pool is full - writes are blocked"
+ expr: "ceph_health_detail{name=\"POOL_FULL\"} > 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.9.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephPoolNearFull"
+ annotations:
+ description: "A pool has exceeded the warning (percent full) threshold, or OSDs supporting the pool have reached the NEARFULL threshold. Writes may continue, but you are at risk of the pool going read-only if more capacity isn't made available. Determine the affected pool with 'ceph df detail', looking at QUOTA BYTES and STORED. Increase the pool's quota, or add capacity to the cluster first then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>). Also ensure that the balancer is active."
+ summary: "One or more Ceph pools are nearly full"
+ expr: "ceph_health_detail{name=\"POOL_NEAR_FULL\"} > 0"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "healthchecks"
 rules:
- - alert: CephPoolGrowthWarning
- expr: |
- (predict_linear((max(ceph_pool_percent_used) without (pod, instance))[2d:1h], 3600 * 24 * 5) * on(pool_id)
- group_right ceph_pool_metadata) >= 95
- labels:
- severity: warning
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.9.2
- annotations:
- summary: Pool growth rate may soon exceed capacity
- description: >
- Pool '{{ $labels.name }}' will be full in less than 5 days assuming the average fill-up rate of the past 48 hours.
-
- - alert: CephPoolBackfillFull
- expr: ceph_health_detail{name="POOL_BACKFILLFULL"} > 0
- labels:
- severity: warning
- type: ceph_default
+ - alert: "CephSlowOps"
 annotations:
- summary: Free space in a pool is too low for recovery/backfill
- description: >
- A pool is approaching the near full threshold, which will prevent recovery/backfill from completing. Consider adding more capacity.
-
- - alert: CephPoolFull
- expr: ceph_health_detail{name="POOL_FULL"} > 0
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.9.1
+ description: "{{ $value }} OSD requests are taking too long to process (osd_op_complaint_time exceeded)"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops"
+ summary: "OSD operations are slow to complete"
+ expr: "ceph_healthcheck_slow_ops > 0"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "CephDaemonSlowOps"
+ annotations:
+ description: "{{ $labels.ceph_daemon }} operations are taking too long to process (complaint time exceeded)"
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops"
+ summary: "{{ $labels.ceph_daemon }} operations are slow to complete"
+ expr: "ceph_daemon_health_metrics{type=\"SLOW_OPS\"} > 0"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - name: "hardware"
+ rules:
+ - alert: "HardwareStorageError"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#pool-full
- summary: Pool is full - writes are blocked
- description: |
- A pool has reached its MAX quota, or OSDs supporting the pool
- have reached the FULL threshold. Until this is resolved, writes to
- the pool will be blocked.
- Pool Breakdown (top 5)
- {{- range query "topk(5, sort_desc(ceph_pool_percent_used * on(pool_id) group_right ceph_pool_metadata))" }}
- - {{ .Labels.name }} at {{ .Value }}%
- {{- end }}
- Increase the pool's quota, or add capacity to the cluster
- then increase the pool's quota (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>)
- - alert: CephPoolNearFull
- expr: ceph_health_detail{name="POOL_NEAR_FULL"} > 0
- for: 5m
- labels:
- severity: warning
- type: ceph_default
+ description: "Some storage devices are in error. Check `ceph health detail`."
+ summary: "Storage devices error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_STORAGE\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareMemoryError"
+ annotations:
+ description: "DIMM error(s) detected. Check `ceph health detail`."
+ summary: "DIMM error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_MEMORY\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.2"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareProcessorError"
+ annotations:
+ description: "Processor error(s) detected. Check `ceph health detail`."
+ summary: "Processor error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_PROCESSOR\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareNetworkError"
+ annotations:
+ description: "Network error(s) detected. Check `ceph health detail`."
+ summary: "Network error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_NETWORK\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwarePowerError"
+ annotations:
+ description: "Power supply error(s) detected. Check `ceph health detail`."
+ summary: "Power supply error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_POWER\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.5"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "HardwareFanError"
+ annotations:
+ description: "Fan error(s) detected. Check `ceph health detail`."
+ summary: "Fan error(s) detected"
+ expr: "ceph_health_detail{name=\"HARDWARE_FANS\"} > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.13.6"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "PrometheusServer"
+ rules:
+ - alert: "PrometheusJobMissing"
 annotations:
- summary: One or more Ceph pools are nearly full
- description: |
- A pool has exceeded the warning (percent full) threshold, or OSDs
- supporting the pool have reached the NEARFULL threshold. Writes may
- continue, but you are at risk of the pool going read-only if more capacity
- isn't made available.
-
- Determine the affected pool with 'ceph df detail', looking
- at QUOTA BYTES and STORED. Increase the pool's quota, or add
- capacity to the cluster then increase the pool's quota
- (e.g. ceph osd pool set quota <pool_name> max_bytes <bytes>).
- Also ensure that the balancer is active.
- - name: healthchecks
+ description: "The prometheus job that scrapes from Ceph MGR is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance."
+ summary: "The scrape job for Ceph MGR is missing from Prometheus"
+ expr: "absent(up{job=\"rook-ceph-mgr\"})"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.12.1"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "PrometheusJobExporterMissing"
+ annotations:
+ description: "The prometheus job that scrapes from Ceph Exporter is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance."
+ summary: "The scrape job for Ceph Exporter is missing from Prometheus"
+ expr: "sum(absent(up{job=\"rook-ceph-exporter\"})) and sum(ceph_osd_metadata{ceph_version=~\"^ceph version (1[89]|[2-9][0-9]).*\"}) > 0"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.12.1"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "rados"
 rules:
- - alert: CephSlowOps
- expr: ceph_healthcheck_slow_ops > 0
- for: 30s
- labels:
- severity: warning
- type: ceph_default
- annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#slow-ops
- summary: OSD operations are slow to complete
- description: >
- {{ $value }} OSD requests are taking too long to process (osd_op_complaint_time exceeded)
-
- # Object related events
- - name: rados
+ - alert: "CephObjectMissing"
+ annotations:
+ description: "The latest version of a RADOS object can not be found, even though all OSDs are up. I/O requests for this object from clients will block (hang). Resolving this issue may require the object to be rolled back to a prior version manually, and manually verified."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks#object-unfound"
+ summary: "Object(s) marked UNFOUND"
+ expr: "(ceph_health_detail{name=\"OBJECT_UNFOUND\"} == 1) * on() (count(ceph_osd_up == 1) == bool count(ceph_osd_metadata)) == 1"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.1"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "generic"
 rules:
- - alert: CephObjectMissing
- expr: (ceph_health_detail{name="OBJECT_UNFOUND"} == 1) * on() (count(ceph_osd_up == 1) == bool count(ceph_osd_metadata)) == 1
- for: 30s
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.10.1
+ - alert: "CephDaemonCrash"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks#object-unfound
- summary: Object(s) marked UNFOUND
- description: |
- The latest version of a RADOS object can not be found, even though all OSDs are up. I/O
- requests for this object from clients will block (hang). Resolving this issue may
- require the object to be rolled back to a prior version manually, and manually verified.
- # Generic
- - name: generic
+ description: "One or more daemons have crashed recently, and need to be acknowledged. This notification ensures that software crashes do not go unseen. To acknowledge a crash, use the 'ceph crash archive <id>' command."
+ documentation: "https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash"
+ summary: "One or more Ceph daemons have crashed, and are pending acknowledgement"
+ expr: "ceph_health_detail{name=\"RECENT_CRASH\"} == 1"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.1.2"
+ severity: "critical"
+ type: "ceph_default"
+ - name: "rbdmirror"
 rules:
- - alert: CephDaemonCrash
- expr: ceph_health_detail{name="RECENT_CRASH"} == 1
- for: 1m
- labels:
- severity: critical
- type: ceph_default
- oid: 1.3.6.1.4.1.50495.1.2.1.1.2
+ - alert: "CephRBDMirrorImagesPerDaemonHigh"
 annotations:
- documentation: https://docs.ceph.com/en/latest/rados/operations/health-checks/#recent-crash
- summary: One or more Ceph daemons have crashed, and are pending acknowledgement
- description: |
- One or more daemons have crashed recently, and need to be acknowledged. This notification
- ensures that software crashes do not go unseen. To acknowledge a crash, use the
- 'ceph crash archive <id>' command.
+ description: "Number of image replications per daemon is not supposed to go beyond threshold 100"
+ summary: "Number of image replications are now above 100"
+ expr: "sum by (ceph_daemon, namespace) (ceph_rbd_mirror_snapshot_image_snapshots) > 100"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.2"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephRBDMirrorImagesNotInSync"
+ annotations:
+ description: "Both local and remote RBD mirror images should be in sync."
+ summary: "Some of the RBD mirror images are not in sync with the remote counter parts."
+ expr: "sum by (ceph_daemon, image, namespace, pool) (topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_local_timestamp) - topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_remote_timestamp)) != 0"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.3"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephRBDMirrorImagesNotInSyncVeryHigh"
+ annotations:
+ description: "More than 10% of the images have synchronization problems"
+ summary: "Number of unsynchronized images are very high."
+ expr: "count by (ceph_daemon) ((topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_local_timestamp) - topk by (ceph_daemon, image, namespace, pool) (1, ceph_rbd_mirror_snapshot_image_remote_timestamp)) != 0) > (sum by (ceph_daemon) (ceph_rbd_mirror_snapshot_snapshots)*.1)"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.4"
+ severity: "critical"
+ type: "ceph_default"
+ - alert: "CephRBDMirrorImageTransferBandwidthHigh"
+ annotations:
+ description: "Detected a heavy increase in bandwidth for rbd replications (over 80%) in the last 30 min. This might not be a problem, but it is good to review the number of images being replicated simultaneously"
+ summary: "The replication network usage has been increased over 80% in the last 30 minutes. Review the number of images being replicated. This alert will be cleaned automatically after 30 minutes"
+ expr: "rate(ceph_rbd_mirror_journal_replay_bytes[30m]) > 0.80"
+ for: "1m"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.10.5"
+ severity: "warning"
+ type: "ceph_default"
+ - name: "nvmeof"
+ rules:
+ - alert: "NVMeoFSubsystemNamespaceLimit"
+ annotations:
+ description: "Subsystems have a max namespace limit defined at creation time. This alert means that no more namespaces can be added to {{ $labels.nqn }}"
+ summary: "{{ $labels.nqn }} subsystem has reached its maximum number of namespaces "
+ expr: "(count by(nqn) (ceph_nvmeof_subsystem_namespace_metadata)) >= ceph_nvmeof_subsystem_namespace_limit"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFTooManyGateways"
+ annotations:
+ description: "You may create many gateways, but 4 is the tested limit"
+ summary: "Max supported gateways exceeded "
+ expr: "count(ceph_nvmeof_gateway_info) > 4.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFMaxGatewayGroupSize"
+ annotations:
+ description: "You may create many gateways in a gateway group, but 2 is the tested limit"
+ summary: "Max gateways within a gateway group ({{ $labels.group }}) exceeded "
+ expr: "count by(group) (ceph_nvmeof_gateway_info) > 2.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFSingleGatewayGroup"
+ annotations:
+ description: "Although a single member gateway group is valid, it should only be used for test purposes"
+ summary: "The gateway group {{ $labels.group }} consists of a single gateway - HA is not possible "
+ expr: "count by(group) (ceph_nvmeof_gateway_info) == 1"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighGatewayCPU"
+ annotations:
+ description: "Typically, high CPU may indicate degraded performance. Consider increasing the number of reactor cores"
+ summary: "CPU used by {{ $labels.instance }} NVMe-oF Gateway is high "
+ expr: "label_replace(avg by(instance) (rate(ceph_nvmeof_reactor_seconds_total{mode=\"busy\"}[1m])),\"instance\",\"$1\",\"instance\",\"(.*):.*\") > 80.00"
+ for: "10m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFGatewayOpenSecurity"
+ annotations:
+ description: "It is good practice to ensure subsystems use host security to reduce the risk of unexpected data loss"
+ summary: "Subsystem {{ $labels.nqn }} has been defined without host level security "
+ expr: "ceph_nvmeof_subsystem_metadata{allow_any_host=\"yes\"}"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFTooManySubsystems"
+ annotations:
+ description: "Although you may continue to create subsystems in {{ $labels.gateway_host }}, the configuration may not be supported"
+ summary: "The number of subsystems defined to the gateway exceeds supported values "
+ expr: "count by(gateway_host) (label_replace(ceph_nvmeof_subsystem_metadata,\"gateway_host\",\"$1\",\"instance\",\"(.*):.*\")) > 16.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFVersionMismatch"
+ annotations:
+ description: "This may indicate an issue with deployment. Check cephadm logs"
+ summary: "The cluster has different NVMe-oF gateway releases active "
+ expr: "count(count by(version) (ceph_nvmeof_gateway_info)) > 1"
+ for: "1h"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighClientCount"
+ annotations:
+ description: "The supported limit for clients connecting to a subsystem is 32"
+ summary: "The number of clients connected to {{ $labels.nqn }} is too high "
+ expr: "ceph_nvmeof_subsystem_host_count > 32.00"
+ for: "1m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighHostCPU"
+ annotations:
+ description: "High CPU on a gateway host can lead to CPU contention and performance degradation"
+ summary: "The CPU is high ({{ $value }}%) on NVMeoF Gateway host ({{ $labels.host }}) "
+ expr: "100-((100*(avg by(host) (label_replace(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]),\"host\",\"$1\",\"instance\",\"(.*):.*\")) * on(host) group_right label_replace(ceph_nvmeof_gateway_info,\"host\",\"$1\",\"instance\",\"(.*):.*\")))) >= 80.00"
+ for: "10m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFInterfaceDown"
+ annotations:
+ description: "A NIC used by one or more subsystems is in a down state"
+ summary: "Network interface {{ $labels.device }} is down "
+ expr: "ceph_nvmeof_subsystem_listener_iface_info{operstate=\"down\"}"
+ for: "30s"
+ labels:
+ oid: "1.3.6.1.4.1.50495.1.2.1.14.1"
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFInterfaceDuplex"
+ annotations:
+ description: "Until this is resolved, performance from the gateway will be degraded"
+ summary: "Network interface {{ $labels.device }} is not running in full duplex mode "
+ expr: "ceph_nvmeof_subsystem_listener_iface_info{duplex!=\"full\"}"
+ for: "30s"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighReadLatency"
+ annotations:
+ description: "High latencies may indicate a constraint within the cluster e.g. CPU, network. Please investigate"
+ summary: "The average read latency over the last 5 mins has reached 10 ms or more on {{ $labels.gateway }}"
+ expr: "label_replace((avg by(instance) ((rate(ceph_nvmeof_bdev_read_seconds_total[1m]) / rate(ceph_nvmeof_bdev_reads_completed_total[1m])))),\"gateway\",\"$1\",\"instance\",\"(.*):.*\") > 0.01"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
+ - alert: "NVMeoFHighWriteLatency"
+ annotations:
+ description: "High latencies may indicate a constraint within the cluster e.g. CPU, network. Please investigate"
+ summary: "The average write latency over the last 5 mins has reached 20 ms or more on {{ $labels.gateway }}"
+ expr: "label_replace((avg by(instance) ((rate(ceph_nvmeof_bdev_write_seconds_total[5m]) / rate(ceph_nvmeof_bdev_writes_completed_total[5m])))),\"gateway\",\"$1\",\"instance\",\"(.*):.*\") > 0.02"
+ for: "5m"
+ labels:
+ severity: "warning"
+ type: "ceph_default"
 ---
 
 ---
-apiVersion: snapshot.storage.k8s.io/v1beta1
+apiVersion: snapshot.storage.k8s.io/v1
 kind: VolumeSnapshotClass
 metadata:
 name: csi-rbdplugin-snapclass

chii-bot · 2022-08-31T22:22:20Z

Path: cluster/core/rook-ceph/operator/helm-release.yaml
Version: v1.9.12 -> v1.16.0

@@ -1,86 +1,4 @@
 ---
-# Source: rook-ceph/templates/psp.yaml
-# We expect most Kubernetes teams to follow the Kubernetes docs and have these PSPs.
-# * privileged (for kube-system namespace)
-# * restricted (for all logged in users)
-#
-# PSPs are applied based on the first match alphabetically. `rook-ceph-operator` comes after
-# `restricted` alphabetically, so we name this `00-rook-privileged`, so it stays somewhere
-# close to the top and so `rook-system` gets the intended PSP. This may need to be renamed in
-# environments with other `00`-prefixed PSPs.
-#
-# More on PSP ordering: https://kubernetes.io/docs/concepts/policy/pod-security-policy/#policy-order
-apiVersion: policy/v1beta1
-kind: PodSecurityPolicy
-metadata:
- name: 00-rook-privileged
- annotations:
- seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'runtime/default'
- seccomp.security.alpha.kubernetes.io/defaultProfileName: 'runtime/default'
-spec:
- privileged: true
- allowedCapabilities:
- # required by CSI
- - SYS_ADMIN
- - MKNOD
- fsGroup:
- rule: RunAsAny
- # runAsUser, supplementalGroups - Rook needs to run some pods as root
- # Ceph pods could be run as the Ceph user, but that user isn't always known ahead of time
- runAsUser:
- rule: RunAsAny
- supplementalGroups:
- rule: RunAsAny
- # seLinux - seLinux context is unknown ahead of time; set if this is well-known
- seLinux:
- rule: RunAsAny
- volumes:
- # recommended minimum set
- - configMap
- - downwardAPI
- - emptyDir
- - persistentVolumeClaim
- - secret
- - projected
- # required for Rook
- - hostPath
- # allowedHostPaths can be set to Rook's known host volume mount points when they are fully-known
- # allowedHostPaths:
- # - pathPrefix: "/run/udev" # for OSD prep
- # readOnly: false
- # - pathPrefix: "/dev" # for OSD prep
- # readOnly: false
- # - pathPrefix: "/var/lib/rook" # or whatever the dataDirHostPath value is set to
- # readOnly: false
- # Ceph requires host IPC for setting up encrypted devices
- hostIPC: true
- # Ceph OSDs need to share the same PID namespace
- hostPID: true
- # hostNetwork can be set to 'false' if host networking isn't used
- hostNetwork: true
- hostPorts:
- # Ceph messenger protocol v1
- - min: 6789
- max: 6790 # <- support old default port
- # Ceph messenger protocol v2
- - min: 3300
- max: 3300
- # Ceph RADOS ports for OSDs, MDSes
- - min: 6800
- max: 7300
- # # Ceph dashboard port HTTP (not recommended)
- # - min: 7000
- # max: 7000
- # Ceph dashboard port HTTPS
- - min: 8443
- max: 8443
- # Ceph mgr Prometheus Metrics
- - min: 9283
- max: 9283
- # port for CSIAddons
- - min: 9070
- max: 9070
----
 # Source: rook-ceph/templates/cluster-rbac.yaml
 # Service account for Ceph OSDs
 apiVersion: v1
@@ -155,6 +73,19 @@
 # imagePullSecrets:
 # - name: my-registry-secret
 ---
+# Source: rook-ceph/templates/cluster-rbac.yaml
+# Service account for other components
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: rook-ceph-default
+ namespace: default # namespace:cluster
+ labels:
+ operator: rook
+ storage-backend: ceph
+# imagePullSecrets:
+# - name: my-registry-secret
+---
 # Source: rook-ceph/templates/serviceaccount.yaml
 # Service account for the Rook-Ceph operator
 apiVersion: v1
@@ -211,6 +142,20 @@
 # imagePullSecrets:
 # - name: my-registry-secret
 ---
+# Source: rook-ceph/templates/serviceaccount.yaml
+# Service account for Ceph COSI driver
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: objectstorage-provisioner
+ namespace: default # namespace:operator
+ labels:
+ app.kubernetes.io/part-of: container-object-storage-interface
+ app.kubernetes.io/component: driver-ceph
+ app.kubernetes.io/name: cosi-driver-ceph
+# imagePullSecrets:
+# - name: my-registry-secret
+---
 # Source: rook-ceph/templates/configmap.yaml
 # Operator settings that can be updated without an operator restart
 # Operator settings that require an operator restart are found in the operator env vars
@@ -218,36 +163,53 @@
 apiVersion: v1
 metadata:
 name: rook-ceph-operator-config
+ namespace: default # namespace:operator
 data:
 ROOK_LOG_LEVEL: "INFO"
 ROOK_CEPH_COMMANDS_TIMEOUT_SECONDS: "15"
 ROOK_OBC_WATCH_OPERATOR_NAMESPACE: "true"
+ ROOK_CEPH_ALLOW_LOOP_DEVICES: "false"
+ ROOK_ENABLE_DISCOVERY_DAEMON: "false"
 ROOK_CSI_ENABLE_RBD: "true"
 ROOK_CSI_ENABLE_CEPHFS: "true"
+ ROOK_CSI_DISABLE_DRIVER: "false"
 CSI_ENABLE_CEPHFS_SNAPSHOTTER: "true"
+ CSI_ENABLE_NFS_SNAPSHOTTER: "true"
 CSI_ENABLE_RBD_SNAPSHOTTER: "true"
 CSI_PLUGIN_ENABLE_SELINUX_HOST_MOUNT: "false"
 CSI_ENABLE_ENCRYPTION: "false"
 CSI_ENABLE_OMAP_GENERATOR: "false"
 CSI_ENABLE_HOST_NETWORK: "true"
+ CSI_ENABLE_METADATA: "false"
+ CSI_ENABLE_VOLUME_GROUP_SNAPSHOT: "true"
 CSI_PLUGIN_PRIORITY_CLASSNAME: "system-node-critical"
 CSI_PROVISIONER_PRIORITY_CLASSNAME: "system-cluster-critical"
- CSI_RBD_FSGROUPPOLICY: "ReadWriteOnceWithFSType"
- CSI_CEPHFS_FSGROUPPOLICY: "ReadWriteOnceWithFSType"
- CSI_NFS_FSGROUPPOLICY: "ReadWriteOnceWithFSType"
- ROOK_CSI_ENABLE_GRPC_METRICS: "false"
- CSI_ENABLE_VOLUME_REPLICATION: "false"
+ CSI_RBD_FSGROUPPOLICY: "File"
+ CSI_CEPHFS_FSGROUPPOLICY: "File"
+ CSI_NFS_FSGROUPPOLICY: "File"
+ ROOK_CSI_CEPH_IMAGE: "quay.io/cephcsi/cephcsi:v3.13.0"
+ ROOK_CSI_REGISTRAR_IMAGE: "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1"
+ ROOK_CSI_PROVISIONER_IMAGE: "registry.k8s.io/sig-storage/csi-provisioner:v5.0.1"
+ ROOK_CSI_SNAPSHOTTER_IMAGE: "registry.k8s.io/sig-storage/csi-snapshotter:v8.0.1"
+ ROOK_CSI_ATTACHER_IMAGE: "registry.k8s.io/sig-storage/csi-attacher:v4.6.1"
+ ROOK_CSI_RESIZER_IMAGE: "registry.k8s.io/sig-storage/csi-resizer:v1.11.1"
+ ROOK_CSI_IMAGE_PULL_POLICY: "IfNotPresent"
 CSI_ENABLE_CSIADDONS: "false"
+ ROOK_CSIADDONS_IMAGE: "quay.io/csiaddons/k8s-sidecar:v0.11.0"
+ CSI_ENABLE_TOPOLOGY: "false"
 ROOK_CSI_ENABLE_NFS: "false"
 CSI_FORCE_CEPHFS_KERNEL_CLIENT: "true"
 CSI_GRPC_TIMEOUT_SECONDS: "150"
 CSI_PROVISIONER_REPLICAS: "2"
- CSI_RBD_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : csi-omap-generator\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_RBD_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_CEPHFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_CEPHFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n"
- CSI_NFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n cpu: 200m\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n"
- CSI_NFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n cpu: 100m\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n cpu: 500m\n"
+ CSI_RBD_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n limits:\n memory: 1Gi\n- name : csi-omap-generator\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_RBD_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n- name : csi-rbdplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_CEPHFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-resizer\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-attacher\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-snapshotter\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_CEPHFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n- name : csi-cephfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : liveness-prometheus\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n"
+ CSI_NFS_PROVISIONER_RESOURCE: "- name : csi-provisioner\n resource:\n requests:\n memory: 128Mi\n cpu: 100m\n limits:\n memory: 256Mi\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n- name : csi-attacher\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n"
+ CSI_NFS_PLUGIN_RESOURCE: "- name : driver-registrar\n resource:\n requests:\n memory: 128Mi\n cpu: 50m\n limits:\n memory: 256Mi\n- name : csi-nfsplugin\n resource:\n requests:\n memory: 512Mi\n cpu: 250m\n limits:\n memory: 1Gi\n"
+ CSI_CEPHFS_ATTACH_REQUIRED: "true"
+ CSI_RBD_ATTACH_REQUIRED: "true"
+ CSI_NFS_ATTACH_REQUIRED: "true"
 ---
 # Source: rook-ceph/templates/clusterrole.yaml
 kind: ClusterRole
@@ -271,9 +233,24 @@
 - apiGroups: [""]
 resources: ["pods/exec"]
 verbs: ["create"]
- - apiGroups: ["admissionregistration.k8s.io"]
- resources: ["validatingwebhookconfigurations"]
- verbs: ["create", "get", "delete", "update"]
+ - apiGroups: ["csiaddons.openshift.io"]
+ resources: ["networkfences"]
+ verbs: ["create", "get", "update", "delete", "watch", "list", "deletecollection"]
+ - apiGroups: ["apiextensions.k8s.io"]
+ resources: ["customresourcedefinitions"]
+ verbs: ["get"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["cephconnections"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["clientprofiles"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["operatorconfigs"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
+ - apiGroups: ["csi.ceph.io"]
+ resources: ["drivers"]
+ verbs: ["create", "delete", "get", "list", "update", "watch"]
 ---
 # Source: rook-ceph/templates/clusterrole.yaml
 # The cluster role for managing all the cluster-specific resources in a namespace
@@ -332,9 +309,8 @@
 # Node access is needed for determining nodes where mons should run
 - nodes
 - nodes/proxy
- - services
 # Rook watches secrets which it uses to configure access to external resources.
- # e.g., external Ceph cluster; TLS certificates for the admission controller or object store
+ # e.g., external Ceph cluster or object store
 - secrets
 # Rook watches for changes to the rook-operator-config configmap
 - configmaps
@@ -352,6 +328,7 @@
 - persistentvolumeclaims
 # Rook creates endpoints for mgr and object store access
 - endpoints
+ - services
 verbs:
 - get
 - list
@@ -380,6 +357,7 @@
 - create
 - update
 - delete
+ - deletecollection
 # The Rook operator must be able to watch all ceph.rook.io resources to reconcile them.
 - apiGroups: ["ceph.rook.io"]
 resources:
@@ -399,6 +377,7 @@
 - cephfilesystemmirrors
 - cephfilesystemsubvolumegroups
 - cephblockpoolradosnamespaces
+ - cephcosidrivers
 verbs:
 - get
 - list
@@ -467,6 +446,14 @@
 - delete
 - deletecollection
 - apiGroups:
+ - apps
+ resources:
+ # This is to add osd deployment owner ref on key rotation
+ # cron jobs.
+ - deployments/finalizers
+ verbs:
+ - update
+ - apiGroups:
 - healthchecking.openshift.io
 resources:
 - machinedisruptionbudgets
@@ -651,19 +638,19 @@
 rules:
 - apiGroups: [""]
 resources: ["nodes"]
- verbs: ["get", "list", "watch"]
- - apiGroups: [""]
- resources: ["namespaces"]
- verbs: ["get", "list"]
+ verbs: ["get"]
 - apiGroups: [""]
- resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "update"]
- - apiGroups: ["storage.k8s.io"]
- resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update"]
+ resources: ["secrets"]
+ verbs: ["get"]
 - apiGroups: [""]
 resources: ["configmaps"]
- verbs: ["get", "list"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts/token"]
+ verbs: ["create"]
 ---
 # Source: rook-ceph/templates/clusterrole.yaml
 kind: ClusterRole
@@ -675,11 +662,20 @@
 resources: ["secrets"]
 verbs: ["get", "list"]
 - apiGroups: [""]
+ resources: ["configmaps"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["nodes"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: ["storage.k8s.io"]
+ resources: ["csinodes"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: [""]
 resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
+ verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
 - apiGroups: [""]
 resources: ["persistentvolumeclaims"]
- verbs: ["get", "list", "watch", "update"]
+ verbs: ["get", "list", "watch", "patch", "update"]
 - apiGroups: ["storage.k8s.io"]
 resources: ["storageclasses"]
 verbs: ["get", "list", "watch"]
@@ -688,31 +684,40 @@
 verbs: ["list", "watch", "create", "update", "patch"]
 - apiGroups: ["storage.k8s.io"]
 resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update", "patch"]
+ verbs: ["get", "list", "watch", "patch"]
 - apiGroups: ["storage.k8s.io"]
 resources: ["volumeattachments/status"]
 verbs: ["patch"]
 - apiGroups: [""]
- resources: ["nodes"]
- verbs: ["get", "list", "watch"]
- - apiGroups: [""]
 resources: ["persistentvolumeclaims/status"]
- verbs: ["update", "patch"]
+ verbs: ["patch"]
 - apiGroups: ["snapshot.storage.k8s.io"]
 resources: ["volumesnapshots"]
- verbs: ["get", "list", "watch", "update", "patch"]
- - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshotcontents"]
- verbs: ["create", "get", "list", "watch", "update", "delete", "patch"]
+ verbs: ["get", "list", "watch", "update", "patch", "create"]
 - apiGroups: ["snapshot.storage.k8s.io"]
 resources: ["volumesnapshotclasses"]
 verbs: ["get", "list", "watch"]
 - apiGroups: ["snapshot.storage.k8s.io"]
+ resources: ["volumesnapshotcontents"]
+ verbs: ["get", "list", "watch", "patch", "update", "create"]
+ - apiGroups: ["snapshot.storage.k8s.io"]
 resources: ["volumesnapshotcontents/status"]
 verbs: ["update", "patch"]
- - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshots/status"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotclasses"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents"]
+ verbs: ["get", "list", "watch", "update", "patch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents/status"]
 verbs: ["update", "patch"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts"]
+ verbs: ["get"]
+ - apiGroups: [""]
+ resources: ["serviceaccounts/token"]
+ verbs: ["create"]
 ---
 # Source: rook-ceph/templates/clusterrole.yaml
 kind: ClusterRole
@@ -730,26 +735,23 @@
 resources: ["secrets"]
 verbs: ["get", "list"]
 - apiGroups: [""]
- resources: ["nodes"]
- verbs: ["get", "list", "watch"]
- - apiGroups: [""]
- resources: ["namespaces"]
- verbs: ["get", "list"]
- - apiGroups: [""]
 resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "update"]
+ verbs: ["get", "list"]
 - apiGroups: ["storage.k8s.io"]
 resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update"]
+ verbs: ["get", "list"]
 - apiGroups: [""]
 resources: ["configmaps"]
- verbs: ["get", "list"]
+ verbs: ["get"]
 - apiGroups: [""]
 resources: ["serviceaccounts"]
 verbs: ["get"]
 - apiGroups: [""]
 resources: ["serviceaccounts/token"]
 verbs: ["create"]
+ - apiGroups: [""]
+ resources: ["nodes"]
+ verbs: ["get"]
 ---
 # Source: rook-ceph/templates/clusterrole.yaml
 kind: ClusterRole
@@ -762,13 +764,19 @@
 verbs: ["get", "list", "watch"]
 - apiGroups: [""]
 resources: ["persistentvolumes"]
- verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
+ verbs: ["get", "list", "watch", "create", "update", "delete", "patch"]
 - apiGroups: [""]
 resources: ["persistentvolumeclaims"]
 verbs: ["get", "list", "watch", "update"]
 - apiGroups: ["storage.k8s.io"]
+ resources: ["storageclasses"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: [""]
+ resources: ["events"]
+ verbs: ["list", "watch", "create", "update", "patch"]
+ - apiGroups: ["storage.k8s.io"]
 resources: ["volumeattachments"]
- verbs: ["get", "list", "watch", "update", "patch"]
+ verbs: ["get", "list", "watch", "patch"]
 - apiGroups: ["storage.k8s.io"]
 resources: ["volumeattachments/status"]
 verbs: ["patch"]
@@ -776,71 +784,64 @@
 resources: ["nodes"]
 verbs: ["get", "list", "watch"]
 - apiGroups: ["storage.k8s.io"]
- resources: ["storageclasses"]
+ resources: ["csinodes"]
 verbs: ["get", "list", "watch"]
 - apiGroups: [""]
- resources: ["events"]
- verbs: ["list", "watch", "create", "update", "patch"]
+ resources: ["persistentvolumeclaims/status"]
+ verbs: ["patch"]
 - apiGroups: ["snapshot.storage.k8s.io"]
 resources: ["volumesnapshots"]
- verbs: ["get", "list", "watch", "update", "patch"]
- - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshotcontents"]
- verbs: ["create", "get", "list", "watch", "update", "delete", "patch"]
+ verbs: ["get", "list", "watch", "update", "patch", "create"]
 - apiGroups: ["snapshot.storage.k8s.io"]
 resources: ["volumesnapshotclasses"]
 verbs: ["get", "list", "watch"]
 - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshotcontents/status"]
- verbs: ["update", "patch"]
+ resources: ["volumesnapshotcontents"]
+ verbs: ["get", "list", "watch", "patch", "update", "create"]
 - apiGroups: ["snapshot.storage.k8s.io"]
- resources: ["volumesnapshots/status"]
+ resources: ["volumesnapshotcontents/status"]
 verbs: ["update", "patch"]
- - apiGroups: [""]
- resources: ["persistentvolumeclaims/status"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotclasses"]
+ verbs: ["get", "list", "watch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents"]
+ verbs: ["get", "list", "watch", "update", "patch"]
+ - apiGroups: ["groupsnapshot.storage.k8s.io"]
+ resources: ["volumegroupsnapshotcontents/status"]
 verbs: ["update", "patch"]
 - apiGroups: [""]
 resources: ["configmaps"]
 verbs: ["get"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplications", "volumereplicationclasses"]
- verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplications/finalizers"]
- verbs: ["update"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplications/status"]
- verbs: ["get", "patch", "update"]
- - apiGroups: ["replication.storage.openshift.io"]
- resources: ["volumereplicationclasses/status"]
- verbs: ["get"]
 - apiGroups: [""]
 resources: ["serviceaccounts"]
 verbs: ["get"]
 - apiGroups: [""]
 resources: ["serviceaccounts/token"]
 verbs: ["create"]
+ - apiGroups: [""]
+ resources: ["nodes"]
+ verbs: ["get", "list", "watch"]
 ---
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
+# Source: rook-ceph/templates/clusterrole.yaml
 kind: ClusterRole
+apiVersion: rbac.authorization.k8s.io/v1
 metadata:
- name: 'psp:rook'
+ name: objectstorage-provisioner-role
 labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-rules:
- - apiGroups:
- - policy
- resources:
- - podsecuritypolicies
- resourceNames:
- - 00-rook-privileged
- verbs:
- - use
+ app.kubernetes.io/part-of: container-object-storage-interface
+ app.kubernetes.io/component: driver-ceph
+ app.kubernetes.io/name: cosi-driver-ceph
+rules:
+ - apiGroups: ["objectstorage.k8s.io"]
+ resources: ["buckets", "bucketaccesses", "bucketclaims", "bucketaccessclasses", "buckets/status", "bucketaccesses/status", "bucketclaims/status", "bucketaccessclasses/status"]
+ verbs: ["get", "list", "watch", "update", "create", "delete"]
+ - apiGroups: ["coordination.k8s.io"]
+ resources: ["leases"]
+ verbs: ["get", "watch", "list", "delete", "update", "create"]
+ - apiGroups: [""]
+ resources: ["secrets", "events"]
+ verbs: ["get", "delete", "update", "create"]
 ---
 # Source: rook-ceph/templates/cluster-rbac.yaml
 # Allow the ceph mgr to access cluster-wide resources necessary for the mgr modules
@@ -946,28 +947,30 @@
 kind: ClusterRoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
- name: cephfs-csi-nodeplugin
+ name: cephfs-csi-provisioner-role
 subjects:
 - kind: ServiceAccount
- name: rook-csi-cephfs-plugin-sa
+ name: rook-csi-cephfs-provisioner-sa
 namespace: default # namespace:operator
 roleRef:
 kind: ClusterRole
- name: cephfs-csi-nodeplugin
+ name: cephfs-external-provisioner-runner
 apiGroup: rbac.authorization.k8s.io
 ---
 # Source: rook-ceph/templates/clusterrolebinding.yaml
+# This is required by operator-sdk to map the cluster/clusterrolebindings with SA
+# otherwise operator-sdk will create a individual file for these.
 kind: ClusterRoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
 metadata:
- name: cephfs-csi-provisioner-role
+ name: cephfs-csi-nodeplugin-role
 subjects:
 - kind: ServiceAccount
- name: rook-csi-cephfs-provisioner-sa
+ name: rook-csi-cephfs-plugin-sa
 namespace: default # namespace:operator
 roleRef:
 kind: ClusterRole
- name: cephfs-external-provisioner-runner
+ name: cephfs-csi-nodeplugin
 apiGroup: rbac.authorization.k8s.io
 ---
 # Source: rook-ceph/templates/clusterrolebinding.yaml
@@ -984,81 +987,24 @@
 name: rbd-external-provisioner-runner
 apiGroup: rbac.authorization.k8s.io
 ---
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
- name: rook-ceph-system-psp
- labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-system
- namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
+# Source: rook-ceph/templates/clusterrolebinding.yaml
+# RBAC for ceph cosi driver service account
 kind: ClusterRoleBinding
-metadata:
- name: rook-csi-cephfs-provisioner-sa-psp
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-csi-cephfs-provisioner-sa
- namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
 apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
 metadata:
- name: rook-csi-cephfs-plugin-sa-psp
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
+ name: objectstorage-provisioner-role-binding
+ labels:
+ app.kubernetes.io/part-of: container-object-storage-interface
+ app.kubernetes.io/component: driver-ceph
+ app.kubernetes.io/name: cosi-driver-ceph
 subjects:
 - kind: ServiceAccount
- name: rook-csi-cephfs-plugin-sa
+ name: objectstorage-provisioner
 namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
- name: rook-csi-rbd-plugin-sa-psp
 roleRef:
- apiGroup: rbac.authorization.k8s.io
 kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-csi-rbd-plugin-sa
- namespace: default # namespace:operator
----
-# Source: rook-ceph/templates/psp.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
- name: rook-csi-rbd-provisioner-sa-psp
-roleRef:
+ name: objectstorage-provisioner-role
 apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: 'psp:rook'
-subjects:
- - kind: ServiceAccount
- name: rook-csi-rbd-provisioner-sa
- namespace: default # namespace:operator
 ---
 # Source: rook-ceph/templates/cluster-rbac.yaml
 kind: Role
@@ -1068,10 +1014,10 @@
 namespace: default # namespace:cluster
 rules:
 # this is needed for rook's "key-management" CLI to fetch the vault token from the secret when
- # validating the connection details
+ # validating the connection details and for key rotation operations.
 - apiGroups: [""]
 resources: ["secrets"]
- verbs: ["get"]
+ verbs: ["get", "update"]
 - apiGroups: [""]
 resources: ["configmaps"]
 verbs: ["get", "list", "watch", "create", "update", "delete"]
@@ -1080,23 +1026,6 @@
 verbs: ["get", "list", "create", "update", "delete"]
 ---
 # Source: rook-ceph/templates/cluster-rbac.yaml
-kind: Role
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-rules:
- # Placeholder role so the rgw service account will
- # be generated in the csv. Remove this role and role binding
- # when fixing https://github.com/rook/rook/issues/10141.
- - apiGroups:
- - ""
- resources:
- - configmaps
- verbs:
- - get
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
 # Aspects of ceph-mgr that operate within the cluster's namespace
 kind: Role
 apiVersion: rbac.authorization.k8s.io/v1
@@ -1131,9 +1060,31 @@
 - apiGroups:
 - ceph.rook.io
 resources:
- - "*"
+ - cephclients
+ - cephclusters
+ - cephblockpools
+ - cephfilesystems
+ - cephnfses
+ - cephobjectstores
+ - cephobjectstoreusers
+ - cephobjectrealms
+ - cephobjectzonegroups
+ - cephobjectzones
+ - cephbuckettopics
+ - cephbucketnotifications
+ - cephrbdmirrors
+ - cephfilesystemmirrors
+ - cephfilesystemsubvolumegroups
+ - cephblockpoolradosnamespaces
+ - cephcosidrivers
 verbs:
- - "*"
+ - get
+ - list
+ - watch
+ - create
+ - update
+ - delete
+ - patch
 - apiGroups:
 - apps
 resources:
@@ -1269,6 +1220,7 @@
 - create
 - update
 - delete
+ - deletecollection
 - apiGroups:
 - batch
 resources:
@@ -1284,6 +1236,13 @@
 - get
 - create
 - delete
+ - apiGroups:
+ - multicluster.x-k8s.io
+ resources:
+ - serviceexports
+ verbs:
+ - get
+ - create
 ---
 # Source: rook-ceph/templates/role.yaml
 kind: Role
@@ -1292,12 +1251,6 @@
 name: cephfs-external-provisioner-cfg
 namespace: default # namespace:operator
 rules:
- - apiGroups: [""]
- resources: ["endpoints"]
- verbs: ["get", "watch", "list", "delete", "update", "create"]
- - apiGroups: [""]
- resources: ["configmaps"]
- verbs: ["get", "list", "create", "delete"]
 - apiGroups: ["coordination.k8s.io"]
 resources: ["leases"]
 verbs: ["get", "watch", "list", "delete", "update", "create"]
@@ -1309,113 +1262,11 @@
 name: rbd-external-provisioner-cfg
 namespace: default # namespace:operator
 rules:
- - apiGroups: [""]
- resources: ["endpoints"]
- verbs: ["get", "watch", "list", "delete", "update", "create"]
- - apiGroups: [""]
- resources: ["configmaps"]
- verbs: ["get", "list", "watch", "create", "delete", "update"]
 - apiGroups: ["coordination.k8s.io"]
 resources: ["leases"]
 verbs: ["get", "watch", "list", "delete", "update", "create"]
 ---
 # Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-default-psp
- namespace: default # namespace:cluster
- labels:
- operator: rook
- storage-backend: ceph
- app.kubernetes.io/part-of: rook-ceph-operator
- app.kubernetes.io/managed-by: Helm
- app.kubernetes.io/created-by: helm
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: default
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-rgw-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-mgr-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-mgr
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-cmd-reporter-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-cmd-reporter
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: rook-ceph-purge-osd-psp
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: ClusterRole
- name: psp:rook
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-purge-osd
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
 # Allow the operator to create resources in this cluster's namespace
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
@@ -1448,22 +1299,6 @@
 namespace: default # namespace:cluster
 ---
 # Source: rook-ceph/templates/cluster-rbac.yaml
-# Allow the rgw pods in this namespace to work with configmaps
-kind: RoleBinding
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: Role
- name: rook-ceph-rgw
-subjects:
- - kind: ServiceAccount
- name: rook-ceph-rgw
- namespace: default # namespace:cluster
----
-# Source: rook-ceph/templates/cluster-rbac.yaml
 # Allow the ceph mgr to access resources scoped to the CephCluster namespace necessary for mgr modules
 kind: RoleBinding
 apiVersion: rbac.authorization.k8s.io/v1
@@ -1615,6 +1450,7 @@
 kind: Deployment
 metadata:
 name: rook-ceph-operator
+ namespace: default # namespace:operator
 labels:
 operator: rook
 storage-backend: ceph
@@ -1633,39 +1469,37 @@
 labels:
 app: rook-ceph-operator
 spec:
+ tolerations:
+ - effect: NoExecute
+ key: node.kubernetes.io/unreachable
+ operator: Exists
+ tolerationSeconds: 5
 containers:
 - name: rook-ceph-operator
- image: "rook/ceph:v1.9.12"
+ image: "docker.io/rook/ceph:v1.16.0"
 imagePullPolicy: IfNotPresent
 args: ["ceph", "operator"]
 securityContext:
+ capabilities:
+ drop:
+ - ALL
+ runAsGroup: 2016
 runAsNonRoot: true
 runAsUser: 2016
- runAsGroup: 2016
 volumeMounts:
 - mountPath: /var/lib/rook
 name: rook-config
 - mountPath: /etc/ceph
 name: default-config-dir
- - mountPath: /etc/webhook
- name: webhook-cert
- ports:
- - containerPort: 9443
- name: https-webhook
- protocol: TCP
 env:
 - name: ROOK_CURRENT_NAMESPACE_ONLY
 value: "false"
 - name: ROOK_HOSTPATH_REQUIRES_PRIVILEGED
 value: "false"
- - name: ROOK_ENABLE_SELINUX_RELABELING
- value: "true"
 - name: ROOK_DISABLE_DEVICE_HOTPLUG
 value: "false"
- - name: ROOK_ENABLE_DISCOVERY_DAEMON
- value: "false"
- - name: ROOK_DISABLE_ADMISSION_CONTROLLER
- value: "false"
+ - name: ROOK_DISCOVER_DEVICES_INTERVAL
+ value: "60m"
 - name: NODE_NAME
 valueFrom:
 fieldRef:
@@ -1680,7 +1514,6 @@
 fieldPath: metadata.namespace
 resources:
 limits:
- cpu: 500m
 memory: 256Mi
 requests:
 cpu: 10m
@@ -1691,5 +1524,7 @@
 emptyDir: {}
 - name: default-config-dir
 emptyDir: {}
- - name: webhook-cert
- emptyDir: {}
+# Source: rook-ceph/templates/securityContextConstraints.yaml
+# scc for the Rook and Ceph daemons
+# for creating cluster in openshift
+---

chii-bot · 2022-08-31T22:24:17Z

MegaLinter status: ❌ ERROR

Descriptor	Linter	Files	Errors	Elapsed time
❌ COPYPASTE	jscpd	yes	2	1.01s
✅ REPOSITORY	git_diff	yes	no	0.02s
✅ REPOSITORY	secretlint	yes	no	1.25s
✅ YAML	prettier	4	0	0.66s
✅ YAML	yamllint	4	0	0.23s

See error details in the MegaLinter reports artifact on the CI job page
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff
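
As a minimal sketch of the setting named in the message above, the MegaLinter configuration file could contain the line below. The file name `mega-linter.yml` is taken from the message itself; MegaLinter's documented default is `.mega-linter.yml` at the repository root, so the exact path used in this repository is an assumption.

```yaml
# MegaLinter configuration file (path assumed, see the note above).
# Lint every file in the repository instead of only the files changed in the diff.
VALIDATE_ALL_CODEBASE: true
```

With this enabled, a run will likely take longer and may report findings in files this PR does not touch.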

github-advanced-security · 2024-06-21T18:24:27Z

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

| datasource | package           | from    | to      |
| ---------- | ----------------- | ------- | ------- |
| helm       | rook-ceph         | v1.9.12 | v1.16.0 |
| helm       | rook-ceph         | v1.9.12 | v1.16.0 |
| helm       | rook-ceph         | v1.9.12 | v1.16.0 |
| helm       | rook-ceph-cluster | v1.9.12 | v1.16.0 |
| docker     | rook/ceph         | v1.9.13 | v1.16.0 |
| docker     | rook/ceph         | v1.9.13 | v1.16.0 |

chii-bot bot requested a review from toboshii as a code owner August 31, 2022 22:21

chii-bot bot added renovate/container renovate/helm type/minor area/cluster Changes made in the cluster directory size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 31, 2022

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.0 (minor)~~ feat(helm): update rook-ceph group to v1.10.1 (minor) Sep 9, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from 4c696c0 to ecc5e00 Compare September 9, 2022 20:22

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.1 (minor)~~ feat(helm): update rook-ceph group to v1.10.2 (minor) Sep 27, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from ecc5e00 to cb07759 Compare September 27, 2022 20:26

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.2 (minor)~~ feat(helm): update rook-ceph group to v1.10.3 (minor) Oct 6, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from cb07759 to 40bd676 Compare October 6, 2022 21:20

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.3 (minor)~~ feat(helm): update rook-ceph group to v1.10.4 (minor) Oct 20, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from 40bd676 to 2b51770 Compare October 20, 2022 20:25

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.4 (minor)~~ feat(helm): update rook-ceph group to v1.10.5 (minor) Nov 3, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from 2b51770 to c1d7a2d Compare November 3, 2022 22:17

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.5 (minor)~~ feat(helm): update rook-ceph group to v1.10.6 (minor) Nov 18, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from c1d7a2d to 9d0fb81 Compare November 18, 2022 01:41

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.6 (minor)~~ feat(helm): update rook-ceph group to v1.10.7 (minor) Dec 6, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from 9d0fb81 to efdb31b Compare December 6, 2022 22:16

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.7 (minor)~~ feat(helm): update rook-ceph group to v1.10.8 (minor) Dec 21, 2022

chii-bot bot force-pushed the renovate/rook-ceph branch from efdb31b to 3e46cca Compare December 21, 2022 18:20

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.8 (minor)~~ feat(helm): update rook-ceph group to v1.10.9 (minor) Jan 12, 2023

chii-bot bot force-pushed the renovate/rook-ceph branch from 3e46cca to 221e89d Compare January 12, 2023 22:17

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.9 (minor)~~ feat(helm): update rook-ceph group to v1.10.10 (minor) Jan 18, 2023

chii-bot bot force-pushed the renovate/rook-ceph branch from 221e89d to 9422233 Compare January 18, 2023 18:21

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.10.10 (minor)~~ feat(helm): update rook-ceph group to v1.10.11 (minor) Feb 10, 2023

chii-bot bot force-pushed the renovate/rook-ceph branch from b8fc3e0 to 0af6374 Compare May 30, 2024 21:18

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.4 (minor)~~ feat(helm): update rook-ceph group to v1.14.5 (minor) May 30, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 0af6374 to c097aad Compare June 13, 2024 23:18

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.5 (minor)~~ feat(helm): update rook-ceph group to v1.14.6 (minor) Jun 13, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from c097aad to ccdacd2 Compare June 21, 2024 18:22

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.6 (minor)~~ feat(helm): update rook-ceph group to v1.14.7 (minor) Jun 21, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from ccdacd2 to 6568064 Compare July 3, 2024 20:17

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.7 (minor)~~ feat(helm): update rook-ceph group to v1.14.8 (minor) Jul 3, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 6568064 to 00cf479 Compare July 25, 2024 22:17

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.8 (minor)~~ feat(helm): update rook-ceph group to v1.14.9 (minor) Jul 25, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 00cf479 to 04cc8d9 Compare August 20, 2024 23:19

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.9 (minor)~~ feat(helm): update rook-ceph group to v1.14.10 (minor) Aug 20, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 04cc8d9 to 89e3bb5 Compare August 21, 2024 01:12

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.14.10 (minor)~~ feat(helm): update rook-ceph group to v1.15.0 (minor) Aug 21, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 89e3bb5 to 61cf10e Compare September 4, 2024 22:17

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.0 (minor)~~ feat(helm): update rook-ceph group to v1.15.1 (minor) Sep 4, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 61cf10e to b5a4b14 Compare September 19, 2024 21:18

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.1 (minor)~~ feat(helm): update rook-ceph group to v1.15.2 (minor) Sep 19, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from b5a4b14 to 1a5b1d0 Compare October 3, 2024 22:19

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.2 (minor)~~ feat(helm): update rook-ceph group to v1.15.3 (minor) Oct 3, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 1a5b1d0 to d7eba24 Compare October 17, 2024 21:19

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.3 (minor)~~ feat(helm): update rook-ceph group to v1.15.4 (minor) Oct 17, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from d7eba24 to cafe080 Compare November 6, 2024 21:19

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.4 (minor)~~ feat(helm): update rook-ceph group to v1.15.5 (minor) Nov 6, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from cafe080 to 3a20c80 Compare November 21, 2024 22:20

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.5 (minor)~~ feat(helm): update rook-ceph group to v1.15.6 (minor) Nov 21, 2024

chii-bot bot force-pushed the renovate/rook-ceph branch from 3a20c80 to 10ea576 Compare December 17, 2024 21:17

chii-bot bot changed the title ~~feat(helm): update rook-ceph group to v1.15.6 (minor)~~ feat(helm): update rook-ceph group to v1.16.0 (minor) Dec 17, 2024
