Failed to add OSD #161

keuko · 2024-12-23T00:36:00Z

Hi,

I'm not sure what the underlying issue is, but when I deploy to virtual machines with Cinder volumes, adding an OSD fails intermittently (quite often).

TASK [stackhpc.cephadm.cephadm : Add OSDs individually] ***********************************************************************************************************************************************************
failed: [ceph2] (item=/dev/sdb) => {"ansible_loop_var": "item", "changed": true, "cmd": ["cephadm", "shell", "--", "ceph", "orch", "daemon", "add", "osd", "ceph2:/dev/sdb"], "delta": "0:00:12.391850", "end": "2024-12-22 23:59:10.466541", "item": "/dev/sdb", "msg": "non-zero return code", "rc": 1, "start": "2024-12-22 23:58:58.074691", "stderr": "Using ceph image with id '2bc0b0f4375d' and tag 'v18' created on 2024-07-23 22:19:35 +0000 UTC\nquay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906\nError initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')", "stderr_lines": ["Using ceph image with id '2bc0b0f4375d' and tag 'v18' created on 2024-07-23 22:19:35 +0000 UTC", "quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906", "Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')"], "stdout": "", "stdout_lines": []}
failed: [ceph2] (item=/dev/sdc) => {"ansible_loop_var": "item", "changed": true, "cmd": ["cephadm", "shell", "--", "ceph", "orch", "daemon", "add", "osd", "ceph2:/dev/sdc"], "delta": "0:00:15.046005", "end": "2024-12-22 23:59:26.111456", "item": "/dev/sdc", "msg": "non-zero return code", "rc": 1, "start": "2024-12-22 23:59:11.065451", "stderr": "Using ceph image with id '2bc0b0f4375d' and tag 'v18' created on 2024-07-23 22:19:35 +0000 UTC\nquay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906\nError initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')", "stderr_lines": ["Using ceph image with id '2bc0b0f4375d' and tag 'v18' created on 2024-07-23 22:19:35 +0000 UTC", "quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906", "Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')"], "stdout": "", "stdout_lines": []}

It always succeeds on the second attempt, though. Could you please add a retry to the "Add OSDs individually" task?


keuko · 2024-12-23T00:36:34Z

It's completely random. The disks have been wiped with dd, so everything should be OK.

This change ensures the `Add OSDs individually` task is retried up to 3 times with a 10-second delay between attempts if the Ceph orchestrator command fails (non-zero return code). This enhances task resilience by allowing transient issues to resolve before marking the operation as failed. Resolves stackhpc#161
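For illustration, a retry of this kind could look roughly like the task below. This is a minimal sketch, not the actual task from the stackhpc.cephadm role or from the linked pull request: the command line is reconstructed from the `cmd` field in the failure log above, while `osd_devices`, `osd_add_result`, and the use of `inventory_hostname` for the host part are assumptions made for the example.

```yaml
# Sketch only: reconstructed from the "cmd" field in the failure log,
# not copied from the stackhpc.cephadm role.
- name: Add OSDs individually
  ansible.builtin.command:
    # Assumption: the inventory hostname matches the Ceph host name (e.g. ceph2).
    cmd: "cephadm shell -- ceph orch daemon add osd {{ inventory_hostname }}:{{ item }}"
  loop: "{{ osd_devices }}"      # hypothetical variable, e.g. ['/dev/sdb', '/dev/sdc']
  register: osd_add_result       # hypothetical variable name
  # Re-run the command for the current item until it exits with return code 0.
  until: osd_add_result.rc == 0
  retries: 3                     # retry limit described in the change
  delay: 10                      # seconds to wait between attempts
```

With `until`, `retries`, and `delay` set, Ansible re-runs the command for each loop item separately, so a transient failure on one device does not affect how the other devices are handled.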

keuko closed this as completed Dec 23, 2024

keuko reopened this Dec 23, 2024

keuko linked a pull request Dec 23, 2024 that will close this issue

Add retry mechanism to Ceph OSD addition task #162

Open
