Rook orchestrates our storage technology, which is currently based on Ceph. The Rook operator is installed in the clusters through a Helm chart and watches the cluster for CRDs, from which it creates the storage clusters.
The CRDs we currently use are block and object storage.
- The operator runs in the nais namespace.
- Rook agents run on all worker nodes to attach/detach persistent volumes to pods.
- Ceph pods run in the nais-rook namespace, on dedicated nodes with storage disks.
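A quick way to verify that these pieces are in place (a rough sketch; the pod and CRD names assume the default Rook naming):
kubectl -n nais get pods | grep rook-ceph-operator
kubectl -n nais-rook get pods -o wide
kubectl get crd | grep -iE 'rook|ceph'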
There is a toolbox pod running with Ceph tools installed. Exec into it with:
kubectl -n nais-rook exec -it rook-ceph-tools -- sh
Example troubleshooting tools:
ceph status
ceph osd status
ceph df
rados df
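Other standard Ceph commands that are often useful from the same toolbox:
ceph osd tree
ceph osd df
ceph health detail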
- Provision CoreOS node(s) in Basta, adding "disk til lagringsnode" (disk for storage node).
- Specs are 2 CPUs, 16 GB RAM and a 400 GB disk.
- Add the node(s) to nais-inventory as worker and storage nodes.
- Profit?
Note that adding a node will cause Ceph to rebalance the cluster. See the section on speeding up cluster recovery.
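To verify that the new node was picked up, a rough sketch (using the nais.io/storage-node label and the toolbox pod referenced elsewhere in this document):
kubectl get nodes -l nais.io/storage-node
kubectl exec rook-ceph-tools -n nais-rook -- sh -c "ceph osd tree"
The new host should show up in the OSD tree with its OSD marked as up.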
Removing a storage node is currently a manual process where we need to ensure that data is rebalanced to the other storage nodes. The steps below use a30apvl00016.oera.no as the example node.
Remove the storage node label from the node:
kubectl label node a30apvl00016.oera.no nais.io/storage-node-
kubectl -n nais-rook edit cm rook-ceph-osd-orchestration-status
kubectl exec rook-ceph-tools -n nais-rook -- sh -c "ceph osd status"
The OSD you are looking for should NOT be marked as "up".
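The following steps use $OSD_ID for the numeric id of the OSD that ran on the node being removed. One way to find it is to list the OSD tree from the toolbox and note the id under that host; the value below is purely illustrative:
kubectl exec rook-ceph-tools -n nais-rook -- sh -c "ceph osd tree"
OSD_ID=3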
kubectl -n nais-rook delete cm rook-ceph-osd-$OSD_ID-fs-backup
kubectl -n nais-rook delete cm rook-ceph-osd-a30apvl00016.oera.no-config
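If in doubt about the exact configmap names, listing the OSD-related configmaps first can help (a sketch):
kubectl -n nais-rook get cm | grep rook-ceph-osd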
Exec into the toolbox and run:
ceph osd crush rm osd.$OSD_ID
ceph auth del osd.$OSD_ID
ceph osd rm $OSD_ID
ceph osd crush rm a30apvl00016-oera-no
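To verify that the OSD and the host are gone from the CRUSH map, still from the toolbox:
ceph osd tree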
Make sure the cluster is healing itself by running:
ceph health detail
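You can also follow the rebalance progress from the toolbox, for example:
ceph status
ceph -w
(ceph -w streams cluster events until interrupted.)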
Note that removing a node will also cause Ceph to rebalance the cluster. See the section on speeding up cluster recovery.
Exec into the toolbox pod and run:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
Note that speeding up recovery may put strain on the cluster in general.
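Once recovery has finished, the values can be lowered again. A sketch assuming the upstream defaults of 1 max backfill and 3 active recovery ops (verify the defaults for the Ceph version in use):
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 3'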
If a resharding operation is stuck and blocking S3 puts, cancel resharding of the bucket:
radosgw-admin reshard cancel --bucket=<bucket>
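To inspect resharding activity before or after cancelling, these radosgw-admin commands (run from the toolbox) may help:
radosgw-admin reshard list
radosgw-admin reshard status --bucket=<bucket>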