From fed8624a326fdccf792ca7cc632e583805a7b724 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nils=20M=C3=BCller?= <tyriis@users.noreply.github.com>
Date: Mon, 8 Apr 2024 21:08:37 +0200
Subject: [PATCH] docs(rook-ceph): update runbook for ceph pgs damaged alert
 with co-pilot

---
 .../docs/runbooks/ceph-pgs-damaged.md         | 72 ++++++++++++++++++-
 1 file changed, 69 insertions(+), 3 deletions(-)

diff --git a/.backstage/components/rook-ceph/docs/runbooks/ceph-pgs-damaged.md b/.backstage/components/rook-ceph/docs/runbooks/ceph-pgs-damaged.md
index 82df87310..9e44866f8 100644
--- a/.backstage/components/rook-ceph/docs/runbooks/ceph-pgs-damaged.md
+++ b/.backstage/components/rook-ceph/docs/runbooks/ceph-pgs-damaged.md
@@ -2,14 +2,80 @@
 
 ## Meaning
 
-Sometimes a Placement Group (PG) might become inconsistent. To return the PG to an active+clean state, you must first determine which of the PGs has become inconsistent and then run the pg repair command on it.
+The `CephPGsDamaged` alert in Ceph, a distributed storage system, indicates that one or more Placement Groups (PGs) in the Ceph cluster are in a damaged state.
+
+A Placement Group (PG) in Ceph is a logical partition of objects in the system.
+It's the smallest unit of data distribution and replication. When a PG is damaged, it means that some of the data in the PG is inconsistent or missing, which can lead to data loss or unavailability.
+
+This alert is typically triggered when there are issues with the underlying storage (like disk failures), network partitions, or bugs in the Ceph software itself.
+
+When you see this alert, it's important to investigate and resolve the issue as soon as possible to prevent data loss.
+
+## Impact
+
+When a Placement Group (PG) is damaged, it means that some of the data in the PG is inconsistent or missing. This can lead to:
+
+Data Loss: If the damaged PGs contain unique data that isn't replicated elsewhere, that data might be lost.
+
+Data Unavailability: Even if the data is replicated, the damaged PGs might cause the data to be unavailable until the issue is resolved.
+
+Reduced Redundancy: Ceph uses PGs for data distribution and replication. If PGs are damaged, the redundancy of your data might be reduced, which increases the risk of data loss if further failures occur.
+
+Performance Degradation: The Ceph cluster might need to spend resources to recover the damaged PGs, which could degrade the performance of the cluster.
 
 ## Diagnosis
 
-Placement group damaged, manual intervention needed
+Diagnosing a CephPGsDamaged alert involves several steps to identify the cause of the damaged Placement Groups (PGs). Here are some steps you can follow:
+
+Check the Ceph Health Status: Use the `ceph health detail` command to get detailed information about the health of the Ceph cluster. This command will list the damaged PGs and may provide additional information about the issue.
+
+```console
+ceph health detail
+```
+
+Check the PG Status: Use the `ceph pg ls` command to list all PGs and their statuses. You can use the `--sort-by option` to sort the list by status, which can make it easier to find the damaged PGs.
+
+```console
+ceph pg ls --sort-by=status
+```
+
+Check the OSD Logs: The logs of the Object Storage Daemons (OSDs) that host the damaged PGs may contain useful information about the issue. You can find the OSD logs in the `/var/log/ceph` directory on the OSD hosts.
+
+```console
+less /var/log/ceph/ceph-osd.*.log
+```
+
+Check the Hardware: Issues with the underlying hardware, such as disk failures or network partitions, can cause PGs to become damaged. Check the health of the disks and the network on the OSD hosts.
+
+Check for Software Bugs: If you can't find any issues with the hardware or the OSD logs, the issue might be caused by a bug in the Ceph software.
+Check the Ceph issue tracker and the Ceph mailing list for known issues that might be related to your problem.
+
+Remember, diagnosing a CephPGsDamaged alert can be complex and time-consuming. If you're not able to resolve the issue yourself, consider reaching out to the Ceph community or a professional support provider for help.
 
 ## Mitigation
 
 Manual intervention is required!
 
-This [page](https://docs.ceph.com/en/latest/rados/operations/pg-repair/) contains commands for diagnosing PGs and the command for repairing PGs that have become inconsistent.
+Identify Damaged PGs: Use the ceph health detail command to get a list of damaged PGs.
+
+```console
+ceph health detail
+```
+
+Repair Damaged PGs: Once you've identified the damaged PGs, you can attempt to repair them using the `ceph pg repair` command. Replace `<pg-id>` with the ID of the damaged PG.
+
+```console
+ceph pg repair <pg-id>
+```
+
+This command instructs Ceph to try to repair the PG by checking the objects in the PG against their checksums and fixing any mismatches.
+
+Monitor the Repair Process: After starting the repair process, monitor the Ceph cluster's health using the ceph health command. If the repair is successful, the CephPGsDamaged alert should eventually clear.
+
+Check Hardware and Network: If the repair process doesn't resolve the issue, check the health of the disks and the network on the OSD hosts. Replace any failed disks and fix any network issues.
+
+Seek Professional Help: If you're unable to resolve the issue yourself, consider reaching out to the Ceph community or a professional support provider for help.
+
+Please note that the ceph pg repair command should be used with caution. It can cause data loss if used improperly. Always make sure you have up-to-date backups before attempting to repair a damaged PG.
+
+[Docs Ceph - PG repair](https://docs.ceph.com/en/latest/rados/operations/pg-repair/) contains commands for diagnosing PGs and the command for repairing PGs that have become inconsistent.