Snapshots not working because of incorrect volumeHandle #84
Mine looks to be failing earlier. I'll do some digging, but in my case I applied the YAML deployments and the CRDs/snapshot controller from commit c72f087f751abb285a90c4f5bb02df9014d2bc19 of https://github.com/kubernetes-csi/external-snapshotter.git. I might roll this back to a release tag so I'm not on any pre-release state, but this is what I have. There's also some interesting reading in the v7.0.0 changes around the webhook being deprecated. There's quite a bit of automation in this test, since I'm testing with the CloudNativePG operator, which manages all of the snapshot and restore capability for this cluster instance. It's all very cool, but it may be out of step with some of the snapshot controller logic.
The VolumeSnapshot and VolumeSnapshotContent were created, but of course the snapshot itself wasn't; I had to remove the finalizers to get rid of them. I have updated CNPG by a minor increment to be on the latest release, but I'll have a proper poke around at some point. I didn't see your error, but I suspect I might once I get past this first hurdle.
@emmetog OK, I have this working. It's actually quite a simple issue, but it's locked into the configs that this repo gives you.

The maturity of https://github.com/kubernetes-csi/external-snapshotter/tree/master drives some requirements for the synology-csi snapshotter config, which is quite old as shipped in this repo and uses the container image registry.k8s.io/sig-storage/csi-snapshotter:v4.2.1 as the actual interface that talks gRPC to the Synology driver (this image) in the same pod for snapshot functions. The gRPC interface is stabilised, so to match the requirements of a later version of the snapshot controller (which goes into kube-system and sets and acts on certain elements of the CRDs), the csi-snapshotter also has to behave according to the updated spec. This doesn't affect the CSI driver, but you will need to update the image and RBAC for the csi-snapshotter container in the snapshotter pod to match the version of the controller, so it can make all the resource changes it does in response to snapshot triggers and meet the spec set by the snapshot CRDs and snapshot controller.

For me the diff to deploy/kubernetes/v1.20/snapshotter/snapshot.yaml looks like the sketch below, assuming the current head of this project and using the v7.0.1 image for the snapshot controller as well. In terms of compatibility this is a good approach: the snapshotter image version must match the controller, and the contract sets out the RBAC requirements for the service account. The CSI driver lives as a sidecar to the snapshotter and has a stable gRPC interface, so that's also good. What we lack in this project's docs is anything explaining that the deployment is partly based on an old version of the external-snapshotter project and needs to be maintained. I'm only showing the change here to illustrate what I updated; it's best to lift and shift the current state from https://github.com/kubernetes-csi/external-snapshotter/tree/master/deploy/kubernetes/csi-snapshotter for the specific version you are updating to.
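A minimal sketch of that kind of change, assuming the snapshot controller in kube-system is at v7.0.1 (the exact tag, and the RBAC for the service account, should be lifted from the external-snapshotter repo for whichever version you deploy):

```yaml
# deploy/kubernetes/v1.20/snapshotter/snapshot.yaml (sketch): bump the csi-snapshotter sidecar
# so it matches the snapshot controller version running in kube-system
containers:
  - name: csi-snapshotter
    image: registry.k8s.io/sig-storage/csi-snapshotter:v7.0.1   # was v4.2.1 in this repo
```

The matching RBAC ships alongside the csi-snapshotter deployment in the external-snapshotter repo and is worth copying wholesale rather than patching by hand.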
Noting this possibly isn't the end of the issue: attempting to create a new CNPG instance off a snapshot gave me errors about the mount point already existing. Since the component actually responsible for recreating the volume on the node is possibly the node-driver-registrar (the description doesn't completely match, but since this IS the DaemonSet and the only pod guaranteed per node, it's my first guess), that image possibly also needs to be updated in the DaemonSet, along with any corresponding RBAC changes. Again, this may work when I try it, but getting the sig-storage images described in these configs up to a current version is probably the first port of call. I've only done the snapshotter so far and need to think about dinner right now.
This does fully work with updated components; however, there is a different issue that I'm facing. Either NodeUnstageVolume is never called, or it is not unmounting and disconnecting the node, because when I use the CloudNativePG snapshot restore it does indeed create a volume from the snapshot but then stalls, with a logged message saying that the mount point already exists when trying to mount the PVC again. I think this is partly init container/operator behaviour triggering this case, but I have also found that if I delete a PV that has been in use and delete the LUN/Target from the NAS, the node keeps attempting to reconnect to the target and the NAS logs messages that there was an attempt to log in to an 'un-exist' Target. I need to do a bit more digging before raising a different issue or an MR, but I think the snapshot restore probably works as well.
OK, I figured out what was happening here, and everything works as it should, including snapshot restore. What was happening is that I was trying to schedule a restore of the snapshot and attach the resulting PV to the same host that the original PV was connected to. This does NOT work, because the filesystem UUID inside the volume created from the snapshot is the same as the original, and the mount command bombs out saying 'File exists'. If instead you restore it to another node, or shut the first workload down so the original PV is unmounted, then snapshot restore works fine as well. On a final point, when this happens it leaves the SCSI initiator logged into the PV, so if you delete the PV from the NAS at this point you end up with those 'un-exist' Target errors while the initiator tries to reconnect.
Summary here, and things that might go wrong, but I think this fixes the original issue and similar ones (all of this is related to iSCSI LUNs, btw).

Snapshotter version

This repo contains a snapshot.yaml that references an old version of the csi-snapshotter sidecar; this image needs to be more in sync with the current version of the snapshot CRDs and snapshot controller. Version 4.0.0 is listed in the README.txt, but this is quite out of date with the snapshot controller being at v7.0.1 at this point in time, so an update to the image in snapshot.yaml AND the associated RBAC for the service account (based on the current upstream RBAC) is probably what is needed. I'm using K8s 1.29 and will soon migrate to 1.30, so I think it's good to keep these versions up to date.

The CSI driver is still in spec to work with the various components if they are updated, but it does get flagged as a 'trivial' provider when a modern version of the csi-attacher is used, due to the lack of the ControllerPublish/Unpublish capability that the current CSI spec treats as a minimum, even though it offers little utility for this use case. Features relevant to the CSI driver are generally called directly by kubelet, but the snapshot controller may well pass volume name information out of step with a significantly different sidecar version, and it is definite that the sidecar being out of step with the controller/CRDs produces errors at the time of creating the snapshot. It's quite probable that getting this out of sync by using helm with the latest versions, or missing RBAC because of a change in requirements not delivered by helm, could lead you astray, so possibly test first with manual updates to the chart.

I deployed the following image changes successfully. controller.yml (a sketch follows):
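As an illustrative sketch only (these are real sig-storage tags, but not necessarily the exact ones I used; pick the releases that match your Kubernetes version, and note your controller.yml may list a slightly different set of sidecars):

```yaml
# controller.yml sidecar images (illustrative tags; the synology plugin container itself is left unchanged)
containers:
  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v4.0.0
  - name: csi-attacher
    image: registry.k8s.io/sig-storage/csi-attacher:v4.5.0
  - name: csi-resizer
    image: registry.k8s.io/sig-storage/csi-resizer:v1.10.0
```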
node.yml (also needing the RBAC changes noted above; a sketch follows):
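A similar sketch for node.yml, again with an illustrative tag (the container name may differ in your copy):

```yaml
# node.yml: node-driver-registrar sidecar in the DaemonSet (illustrative tag)
containers:
  - name: csi-driver-registrar
    image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0
```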
If using microk8s

If using microk8s, then node.yml also needs every occurrence of /var/lib/kubelet updated so that a path to the real location inside the snap install is referenced. In a standard install, /var/lib/kubelet is a symlink into /var/snap/microk8s/common/var/lib/kubelet, but this is often on a separate partition and will not be traversed properly by the node-driver-registrar. Replace that part of the path in the hostPath path entries under the kubelet-dir, plugin-dir and registration-dir volumes (a sketch of the updated hostPath entries follows at the end of this comment).

Snapshot restore testing

Do NOT attempt to restore a snapshot of a volume to the same node that still has the source PV mounted. The volume created from the snapshot will have the original random filesystem UUID from the volume it was cloned from, will not mount, and just gives a 'File exists' error, probably leaving your node in an odd state with respect to the SCSI initiator (see above for the gory details). If you stop the original workload and allow the driver to unmount the volume, then you should be able to mount the volume created from the snapshot on the same node; alternatively, if you are testing side by side, mount it on another node. It may be possible to regenerate the volume UUID, but this is a filesystem-specific task and you must take care to have the iSCSI volume attached but not mounted read-write when doing so. For btrfs this would be a btrfstune -u operation. Performing this task while mounted will almost certainly corrupt the volume, as on unmount the OS flushes changes that no longer match the identifier. The safest approach is to avoid using a volume created from another on the same node.

Final note

It goes without saying: test in a test environment and make sure you have other backups!
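For the microk8s path change mentioned above, a sketch of the volumes section in node.yml (the plugin directory name is an assumption based on the csi.san.synology.com driver name; mirror whatever your existing entries use and only swap the /var/lib/kubelet prefix):

```yaml
# node.yml volumes on microk8s (sketch): point hostPath at the real snap location
volumes:
  - name: kubelet-dir
    hostPath:
      path: /var/snap/microk8s/common/var/lib/kubelet
      type: Directory
  - name: plugin-dir
    hostPath:
      path: /var/snap/microk8s/common/var/lib/kubelet/plugins/csi.san.synology.com  # assumed plugin dir
      type: DirectoryOrCreate
  - name: registration-dir
    hostPath:
      path: /var/snap/microk8s/common/var/lib/kubelet/plugins_registry
      type: Directory
```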
Noting that updating the VolumeSnapshot controller might be quite important with K8s 1.30. I'm hanging back in my home lab because I'm running CNPG, which doesn't support 1.30 yet, and there still seem to be a few things to fix irrespective of iSCSI PVs and snapshots. A bit of a shame, since my GKE test clusters at work are on 1.30 already (no CNPG), but I still prefer to test in the home lab first.
So I'm digging here and trying to debug why my snapshots are not working and I think I've come across a bug in how the snapshotter finds the volume in DSM.
To reproduce
First we create a PVC with the storage class set to use the Synology CSI driver; here's an example:
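A minimal PVC of that shape, as a sketch (the storage class name is an assumption, not the one from my cluster):

```yaml
# Example PVC bound to the Synology CSI storage class (class name is an assumption)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: synology-iscsi-storage
```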
The `synology-csi-controller` (the provisioner) then successfully creates a PV and a LUN on my NAS; here's an example of the full PV definition that it creates in Kubernetes:
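A trimmed sketch of that PV (only metadata.name and the volumeHandle value discussed below are the real ones; the remaining fields are illustrative):

```yaml
# Trimmed sketch of the provisioned PV; only metadata.name and volumeHandle are the real values
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-2b7d51ac-0e95-4d36-9dc5-675c057d8324
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Gi
  csi:
    driver: csi.san.synology.com
    volumeHandle: 5743c6be-31e6-4528-a55f-2254b5716227   # random id, not the LUN name in DSM
  persistentVolumeReclaimPolicy: Delete
  storageClassName: synology-iscsi-storage
```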
The PVC itself works fine in that the pods have persistent storage and things look OK in DSM; however, note the `volumeHandle: 5743c6be-31e6-4528-a55f-2254b5716227` line, as that's a problem for snapshots.

Next we'll try to trigger a snapshot. I'm using Velero to do this, but I'd imagine there are other ways of triggering this driver to take a snapshot:
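Velero drives this through its own CRDs, but the end result is equivalent to a VolumeSnapshot like this sketch (the class and claim names are assumptions):

```yaml
# Example VolumeSnapshot against the PVC above (names are assumptions)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-pvc-snapshot
spec:
  volumeSnapshotClassName: synology-snapshotclass
  source:
    persistentVolumeClaimName: test-pvc
```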
The snapshot doesn't work correctly; we see this in the logs of the `synology-csi-snapshotter-0` pod:

Here is the contents of the `snapshot.storage.k8s.io/VolumeSnapshotContent` object that is created:

In my DSM the LUN for this PV is called `k8s-csi-pvc-2b7d51ac-0e95-4d36-9dc5-675c057d8324`, which matches the PV's name (although with `k8s-csi-` prefixed). There is no mention of `volumeHandle: 5743c6be-31e6-4528-a55f-2254b5716227` in DSM.

My interpretation
What I think is happening here is that when the PV is created, the driver doesn't specify the `volumeHandle`, and so the CSI layer assigns a random unique id. Then when the snapshot is triggered, this driver tries to use the `volumeHandle` field to match the volume in DSM, but of course that doesn't exist.

If I'm right, then the solution is either:

- set the `volumeHandle` explicitly to the name of the LUN in DSM, or
- match on the `name` of the PV (using whatever format was used when the LUN was created, i.e. prefixing `k8s-csi-` to the PV name) and ignore the `volumeHandle` entirely.

Versions / Environment
For versions, I'm using Kubernetes version `v1.29.2+k3s1`, and I'm using Flux to deploy the Synology driver using the chart here; here's my manifest:

And I'm installing the CSI snapshot controller separately using this chart, like this:
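As a sketch of the shape of those Flux manifests (the repository, chart and version names here are placeholders, not the ones I actually use):

```yaml
# Sketch of a Flux HelmRelease for the Synology CSI driver; chart/repo names and version are placeholders
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: synology-csi
  namespace: synology-csi
spec:
  interval: 10m
  chart:
    spec:
      chart: synology-csi            # placeholder chart name
      version: "0.x.x"               # placeholder version
      sourceRef:
        kind: HelmRepository
        name: synology-csi-charts    # placeholder HelmRepository name
        namespace: flux-system
```

The snapshot controller is deployed the same way, as a second HelmRelease pointing at its own chart repository.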