Skip to content

Commit

Permalink
docs(rook-ceph): add ceph health error runbook
Browse files Browse the repository at this point in the history
  • Loading branch information
tyriis committed Apr 8, 2024
1 parent fed8624 commit 62e4a01
Show file tree
Hide file tree
Showing 2 changed files with 80 additions and 0 deletions.
79 changes: 79 additions & 0 deletions .backstage/components/rook-ceph/docs/runbooks/ceph-health-error.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
# CephHealthError

## Meaning

The `CephHealthError` alert in Ceph, indicates that the overall health of the Ceph cluster is in an `ERROR` state. This is a serious alert that signifies there are critical issues with the cluster that need immediate attention.

The `CephHealthError` alert can be triggered by a variety of issues, including but not limited to:

One or more Object Storage Daemons (OSDs) are down.
The cluster is running low on storage space.
There are too many Placement Groups (PGs) in a degraded or inconsistent state.
There are network issues preventing the OSDs from communicating with each other.
When you see this alert, it's important to investigate and resolve the issue as soon as possible to prevent data loss or further degradation of the cluster's health.
You can use the `ceph health detail` command to get more information about the issues affecting the cluster's health.

## Impact

The impact of a `CephHealthError` alert in a Ceph storage cluster can be significant and can lead to serious consequences. Here are some potential impacts:

Data Loss: If the issues causing the `CephHealthError` alert are not resolved quickly, they could lead to data loss.
For example, if multiple Object Storage Daemons (OSDs) are down and the data they store is not replicated elsewhere, that data could be lost.

Data Unavailability: Even if no data is lost, the issues causing the `CephHealthError` alert could make data unavailable.
For example, if there are network issues preventing the OSDs from communicating, clients may not be able to access their data.

Reduced Performance: The issues causing the `CephHealthError` alert could degrade the performance of the Ceph cluster.
For example, if the cluster is running low on storage space, it may have to spend more resources on data management, which could slow down data access.

Increased Risk: The `CephHealthError` alert indicates that the Ceph cluster is in a vulnerable state. If additional issues occur before the current issues are resolved, the impact could be even greater.

## Diagnosis

Check the Ceph Health Status: Use the `ceph health detail` command to get detailed information about the health of the Ceph cluster. This command will provide additional information about the issues causing the `CephHealthError` alert.

```console
ceph health detail
```

Check the OSD Status: Use the `ceph osd tree` command to check the status of the Object Storage Daemons (OSDs). If any OSDs are down, they could be causing the `CephHealthError` alert.

```console
ceph osd tree
```

Check the PG Status: Use the `ceph pg stat` command to check the status of the Placement Groups (PGs). If any PGs are in a degraded or inconsistent state, they could be causing the `CephHealthError` alert.

```console
ceph pg stat
```

Check the Cluster Logs: The Ceph cluster logs may contain useful information about the issues causing the `CephHealthError` alert. You can find the cluster logs in the `/var/log/ceph` directory on the Ceph monitors.

```console
less /var/log/ceph/ceph-mon.*.log
```

Check the Hardware: Issues with the underlying hardware, such as disk failures or network partitions, can cause a `CephHealthError` alert. Check the health of the disks and the network on the Ceph hosts.

## Mitigation

Ceph HEALTH_WARN can be resolved with the following steps:

Display the list of crash reports with the command `ceph crash ls`.

```console
ceph crash ls
```

Optional: to read the message use the command:

```console
ceph crash info <id>
```

Aknowledge/archive the message with the command:

```console
ceph crash archive <id>
```
1 change: 1 addition & 0 deletions .backstage/components/rook-ceph/mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ nav:
- Home: index.md
- Runnbooks:
- CephPGsDamaged: runbooks/ceph-pgs-damaged.md
- CephHealthError: runbooks/ceph-health-error.md

plugins:
- techdocs-core

0 comments on commit 62e4a01

Please sign in to comment.