Rook orchestrates our storage technology, which is currently based on Ceph. The Rook operator is installed in the clusters through a Helm chart and watches the cluster for CRDs, from which it creates the storage clusters.
The CRDs we currently use are block and object storage.
- The operator runs in the nais namespace.
- Rook agents run on all worker nodes to attach/detach persistent volumes to pods.
- Ceph pods run in the nais-rook namespace, on dedicated nodes with storage disks.
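A quick way to verify that these pieces are in place (a rough sketch; the pod and CRD names assume the default Rook naming):
kubectl -n nais get pods | grep rook-ceph-operator
kubectl -n nais-rook get pods -o wide
kubectl get crd | grep -iE 'rook|ceph'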
There is a toolbox pod running with Ceph tools installed. Exec into it with:
kubectl -n nais-rook exec -it rook-ceph-tools -- sh
Example troubleshooting tools:
ceph status
ceph osd status
ceph df
rados df
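Other standard Ceph commands that are often useful from the same toolbox:
ceph osd tree
ceph osd df
ceph health detail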
- Provision CoreOS node(s) in Basta, adding "disk til lagringsnode" (disk for storage node).
- Specs are 2 CPUs, 16 GB RAM and a 400 GB disk.
- Add the node(s) to nais-inventory as worker and storage nodes.
- Profit?
Note that adding a node will cause Ceph to rebalance the cluster. See the section on speeding up cluster recovery.
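To verify that the new node was picked up, a rough sketch (using the nais.io/storage-node label and the toolbox pod referenced elsewhere in this document):
kubectl get nodes -l nais.io/storage-node
kubectl exec rook-ceph-tools -n nais-rook -- sh -c "ceph osd tree"
The new host should show up in the OSD tree with its OSD marked as up.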
Removing a storage node is currently a manual process where we need to ensure that data is rebalanced to the other storage nodes. The steps below use a30apvl00016.oera.no as the example node.
Remove the storage node label from the node:
kubectl label node a30apvl00016.oera.no nais.io/storage-node-
kubectl -n nais-rook edit cm rook-ceph-osd-orchestration-status
kubectl exec rook-ceph-tools -n nais-rook -- sh -c "ceph osd status"
The OSD you are looking for should NOT be marked as "up".
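The following steps use $OSD_ID for the numeric id of the OSD that ran on the node being removed. One way to find it is to list the OSD tree from the toolbox and note the id under that host; the value below is purely illustrative:
kubectl exec rook-ceph-tools -n nais-rook -- sh -c "ceph osd tree"
OSD_ID=3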
kubectl -n nais-rook delete cm rook-ceph-osd-$OSD_ID-fs-backup
kubectl -n nais-rook delete cm rook-ceph-osd-a30apvl00016.oera.no-config
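If in doubt about the exact configmap names, listing the OSD-related configmaps first can help (a sketch):
kubectl -n nais-rook get cm | grep rook-ceph-osd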
Exec into the toolbox and run:
ceph osd crush rm osd.$OSD_ID
ceph auth del osd.$OSD_ID
ceph osd rm $OSD_ID
ceph osd crush rm a30apvl00016-oera-no
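To verify that the OSD and the host are gone from the CRUSH map, still from the toolbox:
ceph osd tree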
Make sure the cluster is healing itself by running:
ceph health detail
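You can also follow the rebalance progress from the toolbox, for example:
ceph status
ceph -w
(ceph -w streams cluster events until interrupted.)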
Note that removing a node will also cause Ceph to rebalance the cluster. See the section on speeding up cluster recovery.
Exec into the toolbox pod and run:
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
Note that speeding up recovery may put strain on the cluster in general.
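Once recovery has finished, the values can be lowered again. A sketch assuming the upstream defaults of 1 max backfill and 3 active recovery ops (verify the defaults for the Ceph version in use):
ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 3'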
If a resharding operation is stuck and blocking S3 puts, cancel resharding of the bucket:
radosgw-admin reshard cancel --bucket=<bucket>
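To inspect resharding activity before or after cancelling, these radosgw-admin commands (run from the toolbox) may help:
radosgw-admin reshard list
radosgw-admin reshard status --bucket=<bucket>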