
MTV-1536 | Use CDI for disk transfer in cold migration #1109

Open

mnecas wants to merge 1 commit into main from cdi_cold_migration

Conversation


@mnecas mnecas commented Oct 16, 2024

Issues:

  • Allow migration of "unknown" guests [1]
    Right now we can't migrate guests whose operating system is unknown or unsupported by virt-v2v [2].

  • Unifying the process and potential speedup [3]
    Right now we use two different methods for the disk transfer: virt-v2v for cold local migrations and CDI for warm and remote migrations. Maintaining two paths means additional engineering effort, and two different flows are harder to debug.
    virt-v2v transfers the disks sequentially, whereas with CDI we can start multiple disk imports in parallel. This can improve the migration speed.

Fix:

Start using the CNV CDI for the disk transfer instead of virt-v2v.
MTV already uses the CNV CDI for warm and remote migrations. We just need to adjust the code to remove the virt-v2v transfer and rely on the CNV CDI to do it for us.
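For illustration, the CDI-based cold transfer comes down to creating one DataVolume per source disk with a VDDK source, the same mechanism MTV already uses for warm and remote migrations. The manifest below is a minimal sketch, not code from this PR; every name, URL, UUID and size is a placeholder:

```
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: example-vm-disk-0                    # placeholder; one DataVolume per source disk
spec:
  source:
    vddk:
      url: "https://vcenter.example.com"                      # placeholder vCenter URL
      uuid: "42000000-0000-0000-0000-000000000000"            # placeholder source VM UUID
      backingFile: "[datastore1] example-vm/example-vm.vmdk"  # placeholder source disk path
      thumbprint: "AA:BB:CC:DD:EE:FF:00:11:22:33:44:55:66:77:88:99:AA:BB:CC:DD"  # placeholder TLS thumbprint
      secretRef: "vcenter-credentials"                        # placeholder secret with vCenter user/password
      initImageURL: "quay.io/example/vddk:latest"             # VDDK init image, hence the VDDK requirement below
  storage:
    resources:
      requests:
        storage: 20Gi
```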

Drawbacks:

  • CNV CDI requires the VDDK, which until now was only highly recommended.
  • CNV CDI is not maintained inside MTV, and escalating and backporting patches might be problematic because CNV has a different release cycle.
  • Because we will be migrating all disks in parallel, we need to optimise our migration scheduler so we don't consume too much of the hosts' and network's resources. I have already done some optimisations in [4,5,6]; see also the sketch after this list.
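As an illustration of the throttling mentioned in the last point: assuming the usual ForkliftController CR layout, the controller's existing max-in-flight setting bounds how many VM migrations (and therefore parallel disk imports) run at once; the namespace and value below are placeholders, not values from this PR:

```
apiVersion: forklift.konveyor.io/v1beta1
kind: ForkliftController
metadata:
  name: forklift-controller
  namespace: konveyor-forklift        # illustrative namespace
spec:
  # Assumption for this sketch: cap concurrent VM migrations so the parallel
  # CDI disk imports do not saturate the hosts or the network.
  controller_max_vm_inflight: 5
```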

Notes:

This change removes the usage of virt-v2v for the disk transfer; we will only use virt-v2v-in-place.

Ref:

[1] https://issues.redhat.com/browse/MTV-1536
[2] https://access.redhat.com/articles/1351473
[3] https://issues.redhat.com/browse/MTV-1581
[4] #1088
[5] #1087
[6] #1086

@mnecas mnecas requested a review from yaacov as a code owner October 16, 2024 11:59
@mnecas mnecas force-pushed the cdi_cold_migration branch from 40b3f51 to 419605d on October 16, 2024 12:07
@codecov-commenter

Codecov Report

Attention: Patch coverage is 0% with 22 lines in your changes missing coverage. Please review.

Project coverage is 16.33%. Comparing base (bc87852) to head (419605d).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
pkg/controller/plan/adapter/vsphere/builder.go 0.00% 9 Missing ⚠️
pkg/controller/plan/scheduler/vsphere/scheduler.go 0.00% 5 Missing ⚠️
pkg/controller/plan/kubevirt.go 0.00% 4 Missing ⚠️
pkg/controller/plan/migration.go 0.00% 3 Missing ⚠️
virt-v2v/cmd/entrypoint.go 0.00% 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1109      +/-   ##
==========================================
+ Coverage   16.20%   16.33%   +0.13%     
==========================================
  Files         112      112              
  Lines       19882    19719     -163     
==========================================
  Hits         3222     3222              
+ Misses      16375    16212     -163     
  Partials      285      285              
Flag Coverage Δ
unittests 16.33% <0.00%> (+0.13%) ⬆️




mnecas commented Oct 16, 2024

@rwmjones This change removes the usage of virt-v2v for the disk transfer; we will only use virt-v2v-in-place. Not sure when this will land; I'm aiming for the next Y release (2.8.0) since it includes some changes for the users. Just wanted to let you know.

@rwmjones
Contributor

We won't be able to support this. We only support conversions using virt-v2v. We don't support the use of virt-v2v-in-place at all, nor do we recommend using it. Basically, if you do this, you're on your own with supporting v2v.

@mnecas mnecas force-pushed the cdi_cold_migration branch 2 times, most recently from 514634a to 23e30da on October 17, 2024 12:34
Issues:
[1] Allow migration of "unknown" guests
Right now we can't migrate guests whose operating system is unknown or
unsupported by virt-v2v [3].

[2] Unifying the process and potential speedup
Right now we are using two different methods for the disk transfer.
Maintaining two paths means additional engineering effort, and two
different flows are harder to debug.
virt-v2v transfers the disks sequentially, whereas with CDI we can
start multiple disk imports in parallel. This can improve the
migration speed.

Fix:
MTV already uses the CNV CDI for warm and remote migrations. We just
need to adjust the code to remove the virt-v2v transfer and rely on
the CNV CDI to do it for us.

Drawbacks:
- CNV CDI *requires* the VDDK, which until now was only highly
  recommended.
- CNV CDI is not maintained inside MTV, and escalating and backporting
  patches might be problematic because CNV has a different release
  cycle.
- Because we will be migrating all disks in parallel, we need to
  optimise our migration scheduler so we don't consume too much of the
  hosts' and network's resources. I have already done some
  optimisations in [4,5,6].

Notes:
This change removes the usage of virt-v2v for the disk transfer; we
will only use virt-v2v-in-place.

Ref:
[1] https://issues.redhat.com/browse/MTV-1536
[2] https://issues.redhat.com/browse/MTV-1581
[3] https://access.redhat.com/articles/1351473
[4] kubev2v#1088
[5] kubev2v#1087
[6] kubev2v#1086

Signed-off-by: Martin Necas <[email protected]>
@mnecas mnecas force-pushed the cdi_cold_migration branch from 23e30da to 75ac8a7 on October 18, 2024 13:12
mnecas added a commit to mnecas/forklift that referenced this pull request Oct 18, 2024
Issue:
When migrating VMs, the VM disks need to be transferred over the
network. This is inefficient, takes additional storage and is slow. We
need an alternative way to do the disk transfer.

Design:
This PR is a PoC of adding an offload plugin for the disk transfer.
The offload plugin is a specific tool that needs to be implemented
by the CSI provider. To use a storage offload plugin, the user
specifies it in the StorageMap destination. This allows users to
migrate a VM even when its disks span multiple datastore types.
For example, some disks could be managed by the offloadPlugin and some
would still go over the network if needed.

Example of the storage map with offload plugin:
```
spec:
  map:
    - destination:
        offloadPlugin: 'quay.io/kubev2v/offload:latest'
        storageClass: CSI
      source:
        id: datastore-30
...
```
The offload plugin is started right after the CreateDataVolumes step;
this way the Kubernetes CSI will have created empty PVs into which the
disks can be transferred.

The OffloadPlugin step creates a job on the destination cluster. The
job runs the provided offload plugin image. The job is started with
the following parameters:

`HOST` = URL of the ESXi host
`PLAN_NAME` = plan name
`NAMESPACE` = namespace where the migration is running
In addition to these variables, it also mounts the secrets for the
vCenter.

Note:
This change additionally requires kubev2v#1109, because right now the
cold migration transfer is managed by virt-v2v. kubev2v#1109 removes
this dependency and moves the transfer to the CNV CDI. This allows us
to split the transfer and conversion steps, which from the Forklift
perspective used to be a single step. Once the disks are transferred,
Forklift runs `virt-v2v-in-place` on the disks and starts the VM.
The offload plugin path works the same way: we transfer the disks
using the offload plugin and then run `virt-v2v-in-place` on the
disks.

TODO:
- Check if the OffloadPlugin exists
- Allow combining storage map entries with and without the OffloadPlugin
- Add checking of the offload plugin disk transfer status

Signed-off-by: Martin Necas <[email protected]>

mnecas added a commit to mnecas/forklift that referenced this pull request Oct 18, 2024
Issue:
When migrating VMs, the VM disks need to be transferred over the
network. This is inefficient, takes additional storage and is slow. We
need an alternative way to do the disk transfer.

Design:
This PR is a PoC of adding an offload plugin for the disk transfer.
The offload plugin is a specific tool that needs to be implemented
by the CSI provider. To use a storage offload plugin, the user
specifies it in the StorageMap destination. This allows users to
migrate a VM even when its disks span multiple datastore types.
For example, some disks could be managed by the offloadPlugin and some
would still go over the network if needed.

Example of the storage map with offload plugin:
```
spec:
  map:
    - destination:
        offloadPlugin: 'quay.io/kubev2v/offload:latest'
        storageClass: CSI
      source:
        id: datastore-30
...
```
The offload plugin is started right after the CreateDataVolumes step;
this way the Kubernetes CSI will have created empty PVs into which the
disks can be transferred.

The OffloadPlugin step creates a job on the destination cluster. The
job runs the provided offload plugin image. The job is started with
the following parameters:
- `HOST` = URL of the ESXi host
- `PLAN_NAME` = plan name
- `NAMESPACE` = namespace where the migration is running

In addition to these variables, it also mounts the secrets for the
vCenter. The secrets are mounted at the path `/etc/secret` and the
files are:
- `accessKeyId` with the username
- `secretKey` with the password

Note:
This change additionally requires kubev2v#1109, because right now the
cold migration transfer is managed by virt-v2v. kubev2v#1109 removes
this dependency and moves the transfer to the CNV CDI. This allows us
to split the transfer and conversion steps, which from the Forklift
perspective used to be a single step. Once the disks are transferred,
Forklift runs `virt-v2v-in-place` on the disks and starts the VM.
The offload plugin path works the same way: we transfer the disks
using the offload plugin and then run `virt-v2v-in-place` on the
disks.

TODO:
[ ] Add a design doc with more details; this is just a PoC
[ ] Add a check that the OffloadPlugin image exists
[ ] Add a check of the offload plugin disk transfer status
[ ] Allow combining storage map entries with and without the OffloadPlugin
[ ] Improve the name of the offload plugin job; right now it's the VM ID

Signed-off-by: Martin Necas <[email protected]>
@mnecas mnecas mentioned this pull request Oct 18, 2024
mnecas added a commit to mnecas/forklift that referenced this pull request Oct 19, 2024
Issue:
When migrating VMs, the VM disks need to be transferred over the
network. This is inefficient, takes additional storage and is slow. We
need an alternative way to do the disk transfer.

Design:
This PR is a PoC of adding an offload plugin for the disk transfer.
The offload plugin is a specific tool that needs to be implemented
by the CSI provider. To use a storage offload plugin, the user
specifies it in the StorageMap destination. This allows users to
migrate a VM even when its disks span multiple datastore types.
For example, some disks could be managed by the offloadPlugin and some
would still go over the network if needed.

Example of the storage map with offload plugin:
```
spec:
  map:
    - destination:
        offloadPlugin:
          image: 'quay.io/mnecas0/offload:latest'
          vars:
            test1: test1
            test2: test2
        storageClass: CSI
      source:
        id: datastore-30
...
```
The offload plugin is started right after the CreateDataVolumes step;
this way the Kubernetes CSI will have created empty PVs into which the
disks can be transferred.

The OffloadPlugin step creates a job on the destination cluster. The
job runs the provided offload plugin image. The job is started with
the following parameters:
- `HOST` = URL of the ESXi host
- `PLAN_NAME` = plan name
- `NAMESPACE` = namespace where the migration is running

In addition to these variables, it also mounts the secrets for the
vCenter. The secrets are mounted at the path `/etc/secret` and the
files are:
- `accessKeyId` with the username
- `secretKey` with the password

Note:
This change additionally requires kubev2v#1109, because right now the
cold migration transfer is managed by virt-v2v. kubev2v#1109 removes
this dependency and moves the transfer to the CNV CDI. This allows us
to split the transfer and conversion steps, which from the Forklift
perspective used to be a single step. Once the disks are transferred,
Forklift runs `virt-v2v-in-place` on the disks and starts the VM.
The offload plugin path works the same way: we transfer the disks
using the offload plugin and then run `virt-v2v-in-place` on the
disks.

TODO:
[ ] Add a design doc with more details; this is just a PoC
[ ] Add a check that the OffloadPlugin image exists
[ ] Add a check of the offload plugin disk transfer status
[ ] Allow combining storage map entries with and without the OffloadPlugin
[ ] Improve the name of the offload plugin job; right now it's the VM ID

Signed-off-by: Martin Necas <[email protected]>
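For illustration only: the Job that the OffloadPlugin step described above would create could look roughly like the sketch below. This manifest is not part of the PR; the Job name, namespace, ESXi URL, plan name and secret name are placeholders inferred from the description (env vars HOST/PLAN_NAME/NAMESPACE plus the user-defined vars, and the vCenter credentials mounted at /etc/secret):

```
apiVersion: batch/v1
kind: Job
metadata:
  name: offload-vm-1234                 # placeholder; the PoC currently names the job after the VM ID
  namespace: demo-migration             # placeholder namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: offload-plugin
          image: quay.io/mnecas0/offload:latest     # image taken from the StorageMap destination
          env:
            - name: HOST
              value: "https://esxi01.example.com"   # URL of the ESXi host (placeholder)
            - name: PLAN_NAME
              value: "demo-plan"                    # plan name (placeholder)
            - name: NAMESPACE
              value: "demo-migration"               # namespace where the migration runs (placeholder)
            - name: test1                           # user-defined vars from the StorageMap
              value: "test1"
            - name: test2
              value: "test2"
          volumeMounts:
            - name: vcenter-secret
              mountPath: /etc/secret                # contains the accessKeyId and secretKey files
              readOnly: true
      volumes:
        - name: vcenter-secret
          secret:
            secretName: vcenter-credentials         # placeholder secret name
```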
mnecas added a commit to mnecas/forklift that referenced this pull request Nov 5, 2024
Issue:
This is an alternative to kubev2v#1109,
which removes the virt-v2v disk transfer and replaces it with the CDI.
The main disadvantage of that solution is that if a customer has
problems with the new CDI method, there won't be any easy way to go
back to virt-v2v.

Fix:
This change adds a new feature toggle `feature_vsphere_cold_cdi` to
the forklift controller. By default the feature will be enabled, and
in case of problems such as disk corruption the user can disable it.

Signed-off-by: Martin Necas <[email protected]>
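For illustration only, and assuming the toggle is exposed on the ForkliftController CR like other `feature_*` settings (an assumption, not something shown in this PR), disabling the CDI path could look like:

```
apiVersion: forklift.konveyor.io/v1beta1
kind: ForkliftController
metadata:
  name: forklift-controller
  namespace: konveyor-forklift        # illustrative namespace
spec:
  # Assumption for this sketch: the toggle is surfaced as a spec field;
  # "false" falls back to the virt-v2v disk transfer for vSphere cold migrations.
  feature_vsphere_cold_cdi: "false"
```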