From 62e4a01cb9f4406b932bfd883fd689de7ee8fc39 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nils=20M=C3=BCller?=
Date: Mon, 8 Apr 2024 21:26:44 +0200
Subject: [PATCH] docs(rook-ceph): add ceph health error runbook

---
 .../docs/runbooks/ceph-health-error.md      | 79 +++++++++++++++++++
 .backstage/components/rook-ceph/mkdocs.yml  |  1 +
 2 files changed, 80 insertions(+)
 create mode 100644 .backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md

diff --git a/.backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md b/.backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md
new file mode 100644
index 000000000..12476a194
--- /dev/null
+++ b/.backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md
@@ -0,0 +1,79 @@
+# CephHealthError
+
+## Meaning
+
+The `CephHealthError` alert indicates that the overall health of the Ceph cluster is in an `ERROR` state. This is a serious alert that signifies critical issues with the cluster needing immediate attention.
+
+The `CephHealthError` alert can be triggered by a variety of issues, including but not limited to:
+
+- One or more Object Storage Daemons (OSDs) are down.
+- The cluster is running low on storage space.
+- Too many Placement Groups (PGs) are in a degraded or inconsistent state.
+- Network issues are preventing the OSDs from communicating with each other.
+
+When you see this alert, investigate and resolve the issue as soon as possible to prevent data loss or further degradation of the cluster's health.
+You can use the `ceph health detail` command to get more information about the issues affecting the cluster's health.
+
+## Impact
+
+The impact of a `CephHealthError` alert in a Ceph storage cluster can be significant. Potential impacts include:
+
+- Data Loss: If the issues causing the `CephHealthError` alert are not resolved quickly, they could lead to data loss.
+  For example, if multiple Object Storage Daemons (OSDs) are down and the data they store is not replicated elsewhere, that data could be lost.
+
+- Data Unavailability: Even if no data is lost, the issues causing the `CephHealthError` alert could make data unavailable.
+  For example, if there are network issues preventing the OSDs from communicating, clients may not be able to access their data.
+
+- Reduced Performance: The issues causing the `CephHealthError` alert could degrade the performance of the Ceph cluster.
+  For example, if the cluster is running low on storage space, it may have to spend more resources on data management, which could slow down data access.
+
+- Increased Risk: The `CephHealthError` alert indicates that the Ceph cluster is in a vulnerable state. If additional issues occur before the current issues are resolved, the impact could be even greater.
+
+## Diagnosis
+
+Check the Ceph Health Status: Use the `ceph health detail` command to get detailed information about the health of the Ceph cluster. This command provides additional information about the issues causing the `CephHealthError` alert.
+
+```console
+ceph health detail
+```
+
+Check the OSD Status: Use the `ceph osd tree` command to check the status of the Object Storage Daemons (OSDs). If any OSDs are down, they could be causing the `CephHealthError` alert.
+
+```console
+ceph osd tree
+```
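+
+Optionally, `ceph osd stat` and `ceph osd df` can complement `ceph osd tree`: the former prints a one-line summary of how many OSDs are up and in, and the latter shows per-OSD utilization, which can help spot both down OSDs and OSDs that are running out of space.
+
+```console
+ceph osd stat
+ceph osd df
+```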
+
+Check the PG Status: Use the `ceph pg stat` command to check the status of the Placement Groups (PGs).
+If any PGs are in a degraded or inconsistent state, they could be causing the `CephHealthError` alert.
+
+```console
+ceph pg stat
+```
+
+Check the Cluster Logs: The Ceph cluster logs may contain useful information about the issues causing the `CephHealthError` alert. You can find the cluster logs in the `/var/log/ceph` directory on the Ceph monitors.
+
+```console
+less /var/log/ceph/ceph-mon.*.log
+```
+
+Check the Hardware: Issues with the underlying hardware, such as disk failures or network partitions, can cause a `CephHealthError` alert. Check the health of the disks and the network on the Ceph hosts.
+
+## Mitigation
+
+A Ceph health issue caused by recent crash reports can be resolved with the following steps.
+
+Display the list of crash reports with the `ceph crash ls` command:
+
+```console
+ceph crash ls
+```
+
+Optional: read the details of a crash report with the `ceph crash info` command:
+
+```console
+ceph crash info <crash-id>
+```
+
+Acknowledge/archive the crash report with the `ceph crash archive` command:
+
+```console
+ceph crash archive <crash-id>
+```
diff --git a/.backstage/components/rook-ceph/mkdocs.yml b/.backstage/components/rook-ceph/mkdocs.yml
index 557f225e3..f9c2c216c 100644
--- a/.backstage/components/rook-ceph/mkdocs.yml
+++ b/.backstage/components/rook-ceph/mkdocs.yml
@@ -4,6 +4,7 @@
 nav:
   - Home: index.md
   - Runnbooks:
     - CephPGsDamaged: runbooks/ceph-pgs-damaged.md
+    - CephHealthError: runbooks/ceph-health-error.md
 plugins:
   - techdocs-core