Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix live migration #25

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

fix live migration #25

wants to merge 1 commit into from

Conversation

joyent-automation
Copy link

fix live migration

This PR was migrated-from-gerrit, https://cr.joyent.us/#/c/5750/.
The raw archive of this CR is here.
See MANTA-4594 for info on Joyent Eng's migration from Gerrit.

CR discussion

@YanChii commented at 2019-03-18T19:35:20

Uploaded patch set 2: Patch Set 1 was rebased.

@YanChii commented at 2019-03-29T12:54:25

Patch Set 3: Patch Set 2 was rebased

@YanChii commented at 2019-04-01T15:03:12

Topic set to How to test: https://binaryparadise.com/~janci/livemig/smartos-livemig-howto.txt

@YanChii commented at 2019-04-01T15:03:36

Topic changed from How to test: https://binaryparadise.com/~janci/livemig/smartos-livemig-howto.txt to live migration

@pfmooney commented at 2019-04-05T13:54:12

Patch Set 3:

(3 comments)

@rmustacc commented at 2019-04-05T14:58:53

Patch Set 3:

(6 comments)

@YanChii commented at 2019-04-05T17:22:53

Patch Set 3:

(2 comments)

@YanChii commented at 2019-04-05T18:06:13

Patch Set 3:

(3 comments)

@rmustacc commented at 2019-04-05T18:47:15

Patch Set 3:

(1 comment)

Patch Set 3 code comments
Makefile#62 @pfmooney

See comment in kvm.c about time.h vs sys/time.h

kvm.c#317 @pfmooney

Since this is being built as a module, it would probably be better to use sys/time.h (from uts/common/sys)

kvm.c#317 @rmustacc

Yes, this needs to be sys/time.h and the correspond bit should be removed from the Makefile.

kvm.c#317 @YanChii

Ok, I'll change and test that.

kvm.c#2540 @rmustacc

Please run make check/cstyle. This and many other lines aren't cstyle clean.

kvm.c#2612 @rmustacc

There are a couple of things that need to be checked here. First, we're getting a uint64_t value, but casting it to an int64_t (hrtime_t). This needs to be checked to make sure we don't create a negative number and therefore we need to reject anything larger than INT64_MAX. As this should represent nanoseconds since boot roughly.

How are we supposed to handle the fact that the host's hrtime_t may be substantially less than the one that the target came from? In this case, does a negative value make sense and do you believe we'll do the right thing through all the other uses of this?

kvm.c#2612 @YanChii

The raw value comes always from KVM_GET_CLOCK where it is cast from hrtime_t to uint64_t. It is unfortunate that this needs to be done but we need to fit the value into already-defined qemu structure (kvm_clock_data).

Therefore the casting goes like this:
hrtime_t (boot_hrtime) -> uint64_t (user_ns.clock) -> qemu save/restore -> uint64_t (user_ns.clock) -> hrtime_t (boot_hrtime)

I'm open to other suggestions how to deal with this.

kvm.c#2612 @rmustacc

That's the wrong way to think about this. This is a kernel API, you need to assume you'll get any arbitrary value from a hostile user. What you've described is the theoretical happy path. Unfortunately, all we know is that we get an arbitrary uint64_t and we need to validate that it fits inside an int64_t.

The suggestion is what I laid out above. We need to make sure that we check the validity of the value. Since this should in theory represent how much time the system has been up, it should fit within the positive range of an int64_t and it seems reasonable to generate an error about that fact.

kvm.c#2626 @rmustacc

Right now I don't see the boot_hrtime value ever set. I feel like that means we'll be transmitting an arbitrarily high value here. Are you seeing it set somewhere that I'm not?

kvm.c#2626 @YanChii

19:47 < janci> pmooney: I'm still wondering about one thing (also rmustacc mentioned it): why the boot_hrtime is not set anywhere?
19:48 < pmooney> no good reason
19:48 < pmooney> it's probably a bug
19:48 < janci> as far as I understand, it works because the kvm-clock implementation in the guest cares only about time diff
19:49 < janci> so we are counting from an arbitrary value but it doesn't matter that much

kvm_x86.c#760 @rmustacc

Why is this being cast to its own type?

kvm_x86.c#760 @YanChii

I wanted to be sure. I'll remove that.

kvm_x86.h#158 @pfmooney

Could you explain the motivation behind this change? Why double the effective number of entries in the struct (100 from kvm_msrsentries, 100 from msr_dataentries)?

kvm_x86.h#158 @rmustacc

Note, the struct kvm_msrs should be the identical thing to the QEMU version of the struct msr_data. The fact that the msrs and padding are split apart in the kernel version instead of in the nested kvm_msrs struct shouldn't change anything for userland from an ABI perspective.

kvm_x86.h#158 @YanChii

I took the struct 1:1 directly from kvm-cmd:
https://github.com/joyent/illumos-kvm-cmd/blob/master/qemu-kvm-x86.c#L725

I did not alter the struct hierarchy. As seen e.g. here
YanChii@31a2e4c#diff-b509329e45fc968aef7a785ae00ab4bfL1504

@rmustacc commented at 2019-04-08T21:17:39

Patch Set 5:

(3 comments)

@YanChii commented at 2019-04-08T22:23:00

Patch Set 5:

(1 comment)

Patch Set 5 code comments
kvm.c#2611 @rmustacc

We don't use C++ style comments, only C style comments.

kvm.c#2622 @rmustacc

Why is this a problem? Why can't it be in the future? Wouldn't that stop valid migrations? The hrtime_t is a signed value, so it could be negative, which means when we subtract from it in the future, we'll add to it.

kvm.c#2622 @YanChii

user_ns.clock value effectively contains VM uptime in hrtime_t resolution, not the boot_hrtime itself. Therefore I've made a mistake and correct condition should be:
if (user_ns.clock < 0) {
...

Now the answer to your question:
How we can get a negative number here?
We cannot, user_ns.clock is uint64. The check is useless.

But thinking forward: we can get an overflow for user_ns.clock that will get caught by INT64_MAX if:

  1. the boot_hrtime value was uninitialized (this is the present state of the code, fixed in this commit)
  2. (probably) the hypervisor's clock was set backwards - before the VM boot time (so the gethrtime() returns smaller number than boot_hrtime).
  3. intentional forgery that does not achieve anything besides inconsistent migration.

I don't see 1. and 3. as a problem.
It will work correctly also without this check.

The 2. is more tricky because (in theory) it might affect the valid migration. And this check will not help to solve this problem anyway.

So after re-thinking I think we can remove this check because it doesn't help.

I'll check for overflow in KVM_GET_CLOCK instead.

kvm_x86.c#761 @rmustacc

This has inconsistent indentation with normal continuations (4 spaces after the tab).

@YanChii commented at 2019-04-10T11:18:15

Patch Set 6:

(1 comment)

@rmustacc commented at 2019-04-16T15:29:38

Patch Set 6:

(2 comments)

@YanChii commented at 2019-04-17T09:57:33

Patch Set 6:

(1 comment)

Patch Set 6 code comments
kvm.c#2614 @YanChii

rmustacc: I still don't like this one. It prevents us from transferring a negative value (e.g. when host clock goes backwards a lot after VM boot).

How about this proposal:

KVM_GET_CLOCK:

Variant A:

  • cast hrtime_t to uint64_t directly (there's no overflow when only casting)
    user_ns.clock = (uint64_t)timediff;

Variant B:

  • shift the signed value to the middle of unsigned range
    user_ns.clock = (uint64_t)timediff;
    user_ns.clock += INT64_MAX + 1;

KVM_GET_CLOCK:

Variant A:

  • cast int64_t hrtime_t directly (I know you've refused this one but there's no overflow when only casting and we are getting the right value. Even if malicious user forges the number, it will always fit in as we only change the representation of the same binary number).
    timediff = (hrtime_t)user_ns.clock;

Variant B:
timediff = (hrtime_t)(user_ns.clock - (INT64_MAX + 1));

kvm.c#2614 @rmustacc

So, let's take a step back for a moment. There are a few different things to consider.

First off, let's make sure we're on the same page about what the boot_hrtime is supposed to represent. When we're writing into the clock page via kvm_write_guest_time(), we're using this to calculate pvclock->system_time(). With the addition of the proper initialization of boot_hrtime(), we're using this to represent nanoseconds since boot of the VM.

This means, that the value we're transmitting when we get the clock base is how long the VM has been up. This isn't a time that should jump backwards, it should always be increasing as long as the VM is running, right? Can you further describe how this moves backwards in this scheme?

Based on this, the value we get is correct today. I'm not sure we even should ever be on the negative path that you have below based on this logic. Though that depends on how we deal with setting.

Therefore, when we come back around and set this value here, we need to set it such that a gethrtime() - boot_hrtime = ioctl value.

This, to me, suggests that the boot_hrtime can be negative; however, the value passed in by the user should only ever be a positive int64_t. The value we're transferring out via GET_CLOCK and in via SET_CLOCK should only ever represent nanoseconds since boot of the VM, which is a positive value.

Finally, I think the casting concerns are valid, but not for the reasons you expect. Let's say I you pass in a large positive value, like INT64_MAX + 1 as a uint64_t as the ioctl takes it. When you cast that to an int64_t, that will become -9223372036854775808. If we subtract that from any positive hrtime_t value, as gethrtime() - -9223372036854775808. This will result in overflow of the type, leading to undefined behavior.

It doesn't matter what casting games we play on get. We should just make sure get represents what we want. We have to safeguard set, probably more strictly than we do today to make sure that we don't result in having calculations that'll overflow. That said, if we treat this as the actual nanoseconds since boot of the VM, which I think is right, then valid uses will never hit this problem, because it'd take a lot of nanoseconds since boot for both the VM and system in question for it to be a problem.

Does this make sense? It's not about whether the resulting boot_hrtime will be positive or negative, but rather what user_ns.clock is supposed to represent and how we make sure that doesn't induce undefined behavior.

kvm.c#2614 @YanChii

I understand. Thank you for your effort in explaining your thinking to me.
I'll treat the value of VM uptime as always positive and act accordingly.

Implications:
SET_CLOCK: check for INT64_MAX as you've suggested and simply use the supplied value
GET_CLOCK: simply output the value of gethrtime() - boot_hrtime

We can do it simple here because I've realized that the actual problem is in the other part of the kernel:
https://github.com/joyent/illumos-kvm/blob/master/kvm_x86.c#L753
When calculating the pvclock->system_time for guest, we currently do:
pvclock->system_time = hrt - v->kvm->arch.boot_hrtime;

But the actual pvclock->system_time is uint64_t. Therefore after fixing the boot_hrtime initialization and moving the host OS clock backwards, we get an overflow (and currently we are getting overflows randomly which is possibly even worse).

I'm not sure however if correcting this other thing should be part of this patch.

kvm_x86.c#760 @rmustacc

This is unneeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants