From 62e4a01cb9f4406b932bfd883fd689de7ee8fc39 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nils=20M=C3=BCller?=
Date: Mon, 8 Apr 2024 21:26:44 +0200
Subject: [PATCH] docs(rook-ceph): add ceph health error runbook

---
 .../docs/runbooks/ceph-health-error.md      | 79 +++++++++++++++++++
 .backstage/components/rook-ceph/mkdocs.yml  |  1 +
 2 files changed, 80 insertions(+)
 create mode 100644 .backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md

diff --git a/.backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md b/.backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md
new file mode 100644
index 000000000..12476a194
--- /dev/null
+++ b/.backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md
@@ -0,0 +1,79 @@
+# CephHealthError
+
+## Meaning
+
+The `CephHealthError` alert indicates that the overall health of the Ceph cluster is in an `ERROR` state. This is a serious alert that signifies critical issues with the cluster needing immediate attention.
+
+The `CephHealthError` alert can be triggered by a variety of issues, including but not limited to:
+
+- One or more Object Storage Daemons (OSDs) are down.
+- The cluster is running low on storage space.
+- Too many Placement Groups (PGs) are in a degraded or inconsistent state.
+- Network issues are preventing the OSDs from communicating with each other.
+
+When you see this alert, investigate and resolve the issue as soon as possible to prevent data loss or further degradation of the cluster's health.
+You can use the `ceph health detail` command to get more information about the issues affecting the cluster's health.
+
+## Impact
+
+The impact of a `CephHealthError` alert in a Ceph storage cluster can be significant. Potential impacts include:
+
+- Data Loss: If the issues causing the `CephHealthError` alert are not resolved quickly, they could lead to data loss.
+  For example, if multiple Object Storage Daemons (OSDs) are down and the data they store is not replicated elsewhere, that data could be lost.
+
+- Data Unavailability: Even if no data is lost, the issues causing the `CephHealthError` alert could make data unavailable.
+  For example, if there are network issues preventing the OSDs from communicating, clients may not be able to access their data.
+
+- Reduced Performance: The issues causing the `CephHealthError` alert could degrade the performance of the Ceph cluster.
+  For example, if the cluster is running low on storage space, it may have to spend more resources on data management, which could slow down data access.
+
+- Increased Risk: The `CephHealthError` alert indicates that the Ceph cluster is in a vulnerable state. If additional issues occur before the current issues are resolved, the impact could be even greater.
+
+## Diagnosis
+
+Check the Ceph Health Status: Use the `ceph health detail` command to get detailed information about the health of the Ceph cluster. This command provides additional information about the issues causing the `CephHealthError` alert.
+
+```console
+ceph health detail
+```
+
+Check the OSD Status: Use the `ceph osd tree` command to check the status of the Object Storage Daemons (OSDs). If any OSDs are down, they could be causing the `CephHealthError` alert.
+
+```console
+ceph osd tree
+```
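+
+Optionally, `ceph osd stat` and `ceph osd df` can complement `ceph osd tree`: the former prints a one-line summary of how many OSDs are up and in, and the latter shows per-OSD utilization, which can help spot both down OSDs and OSDs that are running out of space.
+
+```console
+ceph osd stat
+ceph osd df
+```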
+
+Check the PG Status: Use the `ceph pg stat` command to check the status of the Placement Groups (PGs).
+If any PGs are in a degraded or inconsistent state, they could be causing the `CephHealthError` alert.
+
+```console
+ceph pg stat
+```
+
+Check the Cluster Logs: The Ceph cluster logs may contain useful information about the issues causing the `CephHealthError` alert. You can find the cluster logs in the `/var/log/ceph` directory on the Ceph monitors.
+
+```console
+less /var/log/ceph/ceph-mon.*.log
+```
+
+Check the Hardware: Issues with the underlying hardware, such as disk failures or network partitions, can cause a `CephHealthError` alert. Check the health of the disks and the network on the Ceph hosts.
+
+## Mitigation
+
+A Ceph health issue caused by recent crash reports can be resolved with the following steps.
+
+Display the list of crash reports with the `ceph crash ls` command:
+
+```console
+ceph crash ls
+```
+
+Optional: read the details of a crash report with the `ceph crash info` command:
+
+```console
+ceph crash info <crash-id>
+```
+
+Acknowledge/archive the crash report with the `ceph crash archive` command:
+
+```console
+ceph crash archive <crash-id>
+```
diff --git a/.backstage/components/rook-ceph/mkdocs.yml b/.backstage/components/rook-ceph/mkdocs.yml
index 557f225e3..f9c2c216c 100644
--- a/.backstage/components/rook-ceph/mkdocs.yml
+++ b/.backstage/components/rook-ceph/mkdocs.yml
@@ -4,6 +4,7 @@
 nav:
   - Home: index.md
   - Runnbooks:
     - CephPGsDamaged: runbooks/ceph-pgs-damaged.md
+    - CephHealthError: runbooks/ceph-health-error.md
 plugins:
   - techdocs-core