[Bug] Processes get stuck after resuming VM from snapshot #4099

bduffany · 2023-09-10T19:46:49Z

Describe the bug

After resuming a VM from snapshot, processes occasionally get stuck. A minimal example is an init binary that just runs while (true) { sleep 100ms ; print 'hello' } - after resuming from snapshot, it only sometimes is able to resume the loop, and other times it gets stuck and does not print anything at all.

To Reproduce

Clone the following repo, which has the minimal code needed to reproduce: https://github.com/bduffany/firecracker-sleep-issue
Make sure you have make, cc (C compiler, from gcc package), glibc-static (for statically linking the init binary), and jq (for parsing firecracker API output)
Run make test

make test is doing the following:

Fetches the firecracker v1.4.0 release binary from GitHub as well as vmlinux 4.14 from the https://github.com/firecracker-microvm/firecracker-demo repo
Runs firecracker with an initrd where the init binary just loops infinitely, printing "running" then sleeping for 100ms.
Streams the VM logs to the kernel (using tail -f in the background)
Every 1s, pauses the VM, takes a snapshot, then kills firecracker (SIGTERM), then restarts firecracker, resuming the VM from snapshot.

Expected behaviour

When running make test in that repo, the init binary should print running several times after each resume. But on some resumes, it appears stuck, and does not print anything until the next resume.

Interesting details:

The issue does NOT reproduce if I replace the nanosleep syscalls with an NOP loop of 1e9 iterations (for (int i=0; i<1e9; i++) continue;)
The issue seems specific to snapshotting, not just pausing and resuming the VM. I tried just doing pause/resume without taking a snapshot and then restarting the firecracker binary in between, but could not reproduce.
I could not reproduce this on an Intel CPU so far, only AMD (have tried 2 different Intel machines and 2 different AMD machines).

Environment

Firecracker v1.4.0
Host kernel v6.2.0 (Ubuntu 22.04)
- UPDATE: also reproduced on host kernel v5.10 - m6a.metal instance
Guest kernel 4.14.55-84.37.amzn2.x86_64, from https://github.com/firecracker-microvm/firecracker-demo
- Also tried compiling 5.10 using the recommended guest config - the issue still reproduces.
Rootfs: none (initrd only)
Architecture: x86_64 (AMD Ryzen CPU)
- UPDATE: Also reproduced with AMD EPYC (m6a.metal instance)
GLIBC 2.35 (for init binary)

Additional context

This repro above is a minimal example of a much more troublesome issue where we are having trouble reconnecting to microVMs after resuming them from snapshots. We are running a server inside the VM and are having trouble connecting to over vsock. I suspected that the guest process was "stuck" somehow since when running sleep 1 && print('hello') in a background loop, it sometimes doesn't print anything. I came up with this minimal reproducer for this behavior.

Checks

Have you searched the Firecracker Issues database for similar problems?
- [Bug] Loading big memory snapshot can put processes in uninterruptible sleep #3020 - not sure if this is related, but it was closed without resolution. The final comment said to re-open if it could be reproduced with a supported kernel version
Have you read the existing relevant Firecracker documentation?
- I have read the FAQ about guest clock drift / NTP, but this appears more significant than just clock drift since I think nanosleep should work based on relative timing? I could be wrong, though.
Are you certain the bug being reported is a Firecracker issue?
- Not 100% certain, but given that it happens only when loading a snapshot, it seems like it could be Firecracker-related

The text was updated successfully, but these errors were encountered:

bduffany · 2023-09-10T20:31:53Z

I'm not sure if this helps, but I randomly discovered that the issue can no longer be reproduced if I set the nolapic kernel boot param. (I don't really understand what that param does, but I've read somewhere that it only allows the system to use 1 CPU, though I haven't verified this... so not sure yet if it's a real solution)

Also, I can't seem to reproduce this on an Intel machine (running host kernel v5.15). So maybe this issue is AMD-specific? I noticed my local AMD machine has a log line in the guest kernel showing [Firmware Bug]: TSC doesn't count with P0 frequency!, not sure if that could be related. Maybe #815 is also related somehow since I'm also using AMD.

Here are the full kernel boot logs:

Logs

[    0.000000] Linux version 4.14.55-84.37.amzn2.x86_64 (mockbuild@ip-10-0-1-79) (gcc version 7.3.1 20180303 (Red Hat 7.3.1-5) (GCC)) #1 SMP Wed Jul 25 18:47:15 UTC 2018
[    0.000000] Command line: ro console=ttyS0 reboot=k panic=1 pci=off nomodule random.trust_cpu=on i8042.noaux i8042.nomux i8042.nopnp i8042.dumbkbd tsc=reliable ipv6.disable=1
[    0.000000] [Firmware Bug]: TSC doesn't count with P0 frequency!
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.000000] x86/fpu: xstate_offset[9]: 2432, xstate_sizes[9]:    8
[    0.000000] x86/fpu: Enabled xstate features 0x2e7, context size is 2440 bytes, using 'compacted' format.
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000007ffffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] Hypervisor detected: KVM
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] e820: last_pfn = 0x8000 max_arch_pfn = 0x400000000
[    0.000000] MTRR: Disabled
[    0.000000] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[    0.000000] CPU MTRRs all blank - virtualized system.
[    0.000000] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC  
[    0.000000] found SMP MP-table at [mem 0x0009fc00-0x0009fc0f] mapped at [ffffffffff200c00]
[    0.000000] Scanning 1 areas for low memory corruption
[    0.000000] Using GB pages for direct mapping
[    0.000000] RAMDISK: [mem 0x07f24000-0x07ffffff]
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x0000000007ffffff]
[    0.000000] NODE_DATA(0) allocated [mem 0x07f02000-0x07f23fff]
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 0:7f00001, primary cpu clock
[    0.000000] kvm-clock: using sched offset of 89597168 cycles
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x0000000007ffffff]
[    0.000000]   Normal   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x0000000007ffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x0000000007ffffff]
[    0.000000] Intel MultiProcessor Specification v1.4
[    0.000000] MPTABLE: OEM ID: FC      
[    0.000000] MPTABLE: Product ID: 000000000000
[    0.000000] MPTABLE: APIC at: 0xFEE00000
[    0.000000] Processor #0 (Bootup-CPU)
[    0.000000] IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
[    0.000000] Processors: 1
[    0.000000] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
[    0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.000000] PM: Registered nosave memory: [mem 0x0009f000-0x000fffff]
[    0.000000] e820: [mem 0x08000000-0xffffffff] available for PCI devices
[    0.000000] Booting paravirtualized kernel on KVM
[    0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.000000] random: get_random_bytes called from start_kernel+0x94/0x486 with crng_init=0
[    0.000000] setup_percpu: NR_CPUS:128 nr_cpumask_bits:128 nr_cpu_ids:1 nr_node_ids:1
[    0.000000] percpu: Embedded 41 pages/cpu @ffff880007c00000 s128728 r8192 d31016 u2097152
[    0.000000] KVM setup async PF for cpu 0
[    0.000000] kvm-stealtime: cpu 0, msr 7c15040
[    0.000000] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 32137
[    0.000000] Policy zone: DMA32
[    0.000000] Kernel command line: ro console=ttyS0 reboot=k panic=1 pci=off nomodule random.trust_cpu=on i8042.noaux i8042.nomux i8042.nopnp i8042.dumbkbd tsc=reliable ipv6.disable=1
[    0.000000] PID hash table entries: 512 (order: 0, 4096 bytes)
[    0.000000] Memory: 110184K/130680K available (8204K kernel code, 622K rwdata, 1464K rodata, 1268K init, 2820K bss, 20496K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.004000] Hierarchical RCU implementation.
[    0.004000]  RCU restricting CPUs from NR_CPUS=128 to nr_cpu_ids=1.
[    0.004000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[    0.004000] NR_IRQS: 4352, nr_irqs: 48, preallocated irqs: 16
[    0.004000] Console: colour dummy device 80x25
[    0.004000] console [ttyS0] enabled
[    0.004000] tsc: Detected 4491.540 MHz processor
[    0.004000] Calibrating delay loop (skipped) preset value.. 8983.08 BogoMIPS (lpj=17966160)
[    0.004000] pid_max: default: 32768 minimum: 301
[    0.004000] Security Framework initialized
[    0.004000] SELinux:  Initializing.
[    0.004000] Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.004000] Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
[    0.004000] Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.004000] Mountpoint-cache hash table entries: 512 (order: 0, 4096 bytes)
[    0.004000] Last level iTLB entries: 4KB 512, 2MB 512, 4MB 256
[    0.004000] Last level dTLB entries: 4KB 3072, 2MB 3072, 4MB 1536, 1GB 0
[    0.004000] Spectre V2 : Mitigation: Full AMD retpoline
[    0.004000] Spectre V2 : Spectre v2 mitigation: Enabling Indirect Branch Prediction Barrier
[    0.004000] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    0.004000] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.006983] Freeing SMP alternatives memory: 28K
[    0.007832] smpboot: Max logical packages: 1
[    0.008055] x2apic enabled
[    0.008213] Switched APIC routing to physical x2apic.
[    0.008780] ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
[    0.008980] smpboot: CPU0: AMD EPYC (family: 0x19, model: 0x61, stepping: 0x2)
[    0.009245] Performance Events: 
[    0.009247] core perfctr but no constraints; unknown hardware!
[    0.009528] no PMU driver, software events only.
[    0.009711] Hierarchical SRCU implementation.
[    0.011485] smp: Bringing up secondary CPUs ...
[    0.011626] smp: Brought up 1 node, 1 CPU
[    0.011752] smpboot: Total of 1 processors activated (8983.08 BogoMIPS)
[    0.012000] devtmpfs: initialized
[    0.012000] x86/mm: Memory block size: 128MB
[    0.012000] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.012000] futex hash table entries: 256 (order: 2, 16384 bytes)
[    0.012152] NET: Registered protocol family 16
[    0.012359] cpuidle: using governor ladder
[    0.012488] cpuidle: using governor menu
[    0.014151] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.014359] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.014640] dmi: Firmware registration failed.
[    0.014818] NetLabel: Initializing
[    0.014924] NetLabel:  domain hash size = 128
[    0.015059] NetLabel:  protocols = UNLABELED CIPSOv4 CALIPSO
[    0.015241] NetLabel:  unlabeled traffic allowed by default
[    0.015444] clocksource: Switched to clocksource kvm-clock
[    0.015624] VFS: Disk quotas dquot_6.6.0
[    0.015748] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.016000] NET: Registered protocol family 2
[    0.016000] TCP established hash table entries: 1024 (order: 1, 8192 bytes)
[    0.016000] TCP bind hash table entries: 1024 (order: 2, 16384 bytes)
[    0.016000] TCP: Hash tables configured (established 1024 bind 1024)
[    0.016000] UDP hash table entries: 256 (order: 1, 8192 bytes)
[    0.016000] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes)
[    0.016000] NET: Registered protocol family 1
[    0.016000] Unpacking initramfs...
[    0.016000] Freeing initrd memory: 880K
[    0.016000] platform rtc_cmos: registered platform RTC device (no PNP device found)
[    0.016000] Scanning for low memory corruption every 60 seconds
[    0.016350] audit: initializing netlink subsys (disabled)
[    0.016640] Initialise system trusted keyrings
[    0.016780] Key type blacklist registered
[    0.016921] audit: type=2000 audit(1694379250.871:1): state=initialized audit_enabled=0 res=1
[    0.017192] workingset: timestamp_bits=36 max_order=15 bucket_order=0
[    0.017793] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.018927] Key type asymmetric registered
[    0.019056] Asymmetric key parser 'x509' registered
[    0.019214] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 254)
[    0.019452] io scheduler noop registered (default)
[    0.019617] io scheduler cfq registered
[    0.019764] Serial: 8250/16550 driver, 1 ports, IRQ sharing disabled
[    0.040269] serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a U6_16550A
[    0.041195] loop: module loaded
[    0.041300] tun: Universal TUN/TAP device driver, 1.6
[    0.041472] hidraw: raw HID events driver (C) Jiri Kosina
[    0.041657] nf_conntrack version 0.5.0 (1024 buckets, 4096 max)
[    0.041893] ip_tables: (C) 2000-2006 Netfilter Core Team
[    0.042070] Initializing XFRM netlink socket
[    0.042206] IPv6: Loaded, but administratively disabled, reboot required to enable
[    0.042438] NET: Registered protocol family 17
[    0.042582] Bridge firewalling registered
[    0.042810] sched_clock: Marking stable (40263980, 0)->(65107117, -24843137)
[    0.043068] registered taskstats version 1
[    0.043197] Loading compiled-in X.509 certificates
[    0.043628] Loaded X.509 cert 'Build time autogenerated kernel key: 3472798b31ba23b86c1c5c7236c9c91723ae5ee9'
[    0.043943] zswap: default zpool zbud not available
[    0.044111] zswap: pool creation failed
[    0.044354] Key type encrypted registered
[    0.044957] Freeing unused kernel memory: 1268K
[    0.052047] Write protecting the kernel read-only data: 12288k
[    0.053097] Freeing unused kernel memory: 2016K
[    0.053771] Freeing unused kernel memory: 584K

zulinx86 · 2023-09-11T13:17:50Z

Thanks for reporting this. At the moment, firecracker officially supports host kernel 4.14 and 5.10 [README.md] and will add host kernel 6.1 support. Are you able to reproduce it on host kernel 5.10 and 6.1, although we'll also try to reproduce the issue on our side?

bduffany · 2023-09-11T13:45:07Z

I don't have easy access to an AMD machine with any of those kernel versions, but later this week I will try and see if I can get 5.10 or 6.1 working with my current setup.

Does Firecracker have any test suites that run against AMD machines? (I suspect this may be AMD-specific.) I would be happy to contribute a test that demonstrates the issue - that would be a lot easier for me than installing a different kernel version, and hopefully easier to find a fix as well once there's a failing test.

zulinx86 · 2023-09-11T13:47:32Z

Does Firecracker have any test suites that run against AMD machines? I would be happy to contribute a test that demonstrates the issue.

Yes. m6a.metal is AMD instance that firecracker is tested on. Thanks for your help and effort!

bduffany · 2023-09-11T13:48:15Z

Ah OK, I will actually just spin one of those up on AWS then :) Thanks

bduffany · 2023-09-11T14:20:23Z

I was able to reproduce this on an m6a.metal instance with Linux 5.10.

I did aws ec2 run-instances --image-id=ami-0f409bae3775dc8e5 --count=1 --instance-type=m6a.metal then added an SSH rule in the security group settings so I could SSH to it.

Once SSH'd, uname -a shows:

Linux ip-172-31-75-28.ec2.internal 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

If it helps, I had to run sudo yum install git make gcc glibc-static jq in order for the make test command in my repro repository to work. I also ran sudo yum groupinstall "Developer Tools" but I'm not sure if that part was actually needed.

bduffany · 2023-09-11T14:22:28Z

Here is what it looks like when running the repro on that EC2 instance - whenever it prints --- Round <N>: Letting the VM run for a little while... this means that it is resuming from snapshot, and I expect to see init: running... being printed several times (once per 100ms), but it appears to only get printed some of the time:

simplescreenrecorder-2023-09-11_10.20.48.mp4

The kernel logs on this m6a.metal instance show the same error message in the kernel logs:

[Firmware Bug]: TSC doesn't count with P0 frequency!

bduffany · 2023-09-11T14:55:09Z

I've tried codesandbox's fork of firecracker which includes this commit: codesandbox@4164371

I am not sure if this is the "right" fix, but the issue no longer reproduces when using this commit. Any thoughts on whether this commit should be upstreamed?

CuriousGeorgiy · 2023-09-15T14:11:26Z

Not sure this is related, but our implementation of container snapshots backed by firecracker VMs using firecracker-containerd in vHive stumbles upon a similar problem: it seems like firecracker-containerd's in-VM agent stops responding and hangs after a VM snapshot is created and the VM is resumed (vhive-serverless/vHive#818).

We use a default setup suggested by firecracker-containerd.

bduffany · 2023-09-15T14:51:07Z

@CuriousGeorgiy are you hitting this issue on an AMD CPU, or Intel? Also, could you try running the repro steps I posted in the issue description, and letting us know if you see the same problem that I described? This would be very helpful as an extra data point 🙏

CuriousGeorgiy · 2023-09-15T15:11:23Z

@bduffany I am renting a node on CloudLab with an Intel CPU. I tried running the repro steps you posted, but after waiting for 100 rounds I didn't see any problems you described.

bduffany · 2023-09-15T19:50:01Z

@CuriousGeorgiy thanks for running the repro steps! It is interesting that you couldn't repro the issue on an Intel machine. It seems increasingly likely that the original issue I reported is AMD-specific, since I couldn't reproduce it on Intel either.

I think the issue that you are seeing is probably different than the one reported here, since the repro script doesn't exhibit the problematic behavior when you tried it. It may be worth reporting a new issue to the firecracker-containerd repo, with more details about your setup/environment and ideally a minimal reproducer.

wearyzen · 2023-09-28T17:14:10Z

Hi @bduffany, thanks for providing the repro steps. We are working on this issue and to provide an update :

I was able to reproduce the issue on an AMD instance
I modified your steps to run 5 times instead of 100 and saved snapshots from each run. When tried manually to load the snapshots I am able to see the prints which were otherwise not seen through the script. This means the guest takes more time to start in the issue scenario but, the guest is always able to boot if tried manually.
In some cases I saw that the prints were seen if a sleep of 5 seconds was added instead of 1 but it was random.
We'll continue investigating this and let you know if we have an update.

bduffany · 2023-11-02T00:35:29Z

Friendly ping, is this issue still on your radar? This is still a frequent pain point for us when using Firecracker on AMD machines, and has us effectively locked in to Intel which is not very desirable.

wearyzen · 2023-11-06T10:53:59Z

Hi @bduffany, we are tracking this as one of our high priority task. We'll let you know as soon as we have an update on this.

ShadowCurse · 2023-11-14T09:55:12Z

Hi @bduffany. We are still working on the root cause of this issue. We will update you as soon as possible.

pb8o · 2024-01-23T11:35:30Z

Hi @bduffany, I am currently looking at this. I found that passing lapic=notscdeadline in the kernel boot command line also seems to fix it, so I think that narrows it down a bit. I will continue investigating, but could you confirm?

bduffany · 2024-01-23T13:21:35Z

@pb8o thank you for looking into this! I tried setting lapic=notscdeadline in my repro script above on my AMD machine, and confirmed that the issue stops reproducing afterwards. If you learn more, I'd be interested to hear why it stops reproducing with this flag, and whether this is the "right" solution or not :)

bduffany · 2024-02-29T19:39:28Z

hey @pb8o, curious whether there are any leads on this that you can share? We would love to be able to run our Firecracker workloads on AMD machines, because they are much cheaper. Happy to help any way that I can - not much of an expert on virtualization but can maybe help try to find a more consistent repro, if that is part of what makes this tricky? (And thank you again for looking into this!)

pb8o · 2024-03-18T12:00:37Z

Hi @bduffany,

Sorry for the delay, unfortunately we were not able to make progress regarding this issue.
We are still tracking the problem and will provide an update once we will have more data to share.

bduffany · 2024-03-26T21:26:04Z

FYI, we started hitting this again, but on Intel this time (on GCP under nested virtualization), after GCP upgraded our nodes from kernel version 5.15.0-1036-gke to 5.15.0-1052-gke. We were seeing similar symptoms this time, but with some RCU stalls in the kernel logs. (I vaguely remember seeing RCU stalls on AMD too, so this might not be a new symptom.) The same repro script from the PR description also can reproduce the issue on the upgraded Intel nodes. Setting lapic=notscdeadline also seems to work in this case as a mitigation, though we don't fully understand the implications yet.

We never quite figured out how to map those GCP kernel versions to upstream kernel versions, or how to figure out which commits are included between those versions. Maybe someone with contacts at Google Cloud could help 😅

AMD hosts do not have TSC deadline enabled, and that seems to cause issues like in firecracker-microvm#4099. Only expose TSC_DEADLINE to the guest if the host supports it. Closes firecracker-microvm#4099 Signed-off-by: Pablo Barbáchano <[email protected]>

pb8o · 2024-07-02T09:52:57Z

Hi @bduffany, first of all thanks for your patience. We are sorry that it is taking so long to root cause and the time between updates, and I wanted to reassure you that we take this issue seriously and we are still investigating a possible solution.

Regarding the patch

While investigating, we came across this kernel patch which seems possibly relevant: KVM: x86: Fix lapic timer interrupt lost after loading a snapshot

I was able to reproduce the issue on Ubuntu 24.04, which had a 6.8.0 kernel with that patch. I was also able to reproduce the issue with newer guest kernels and newer AMD CPUs like Genoa, so we can rule out something that is fixed in newer kernels or CPUs.

At this point we believe the issue is that we use TSC deadline feature in both Intel and AMD. Intel does support TSC deadline, so works fine. In the AMD CPUs we tested however, TSC deadline is not enabled but Firecracker still presents it to the guest via CPUID. In this case KVM emulates TSC deadline and during a snapshot resume this emulation seems to not work (this is the part we haven't figured out yet if it's Firecracker or KVM who is at fault here).

We have tried working around the issue in #4655 by only enabling TSC deadline if it's enabled on the host. This works fine, but creates a boot time regression of 40-50% in our tests. We are currently trying to understand where this time is going and if we can mitigate it.

AMD hosts do not have TSC deadline enabled, and that seems to cause issues like in firecracker-microvm#4099. Only expose TSC_DEADLINE to the guest if the host supports it. Closes firecracker-microvm#4099 Signed-off-by: Pablo Barbáchano <[email protected]>

The `wait_for_up()` sends a dummy SSH command with a timeout of 10s, with the intention of blocking the integration test until a VM is responsive after booting. However, post-snapshot restore it runs into issue firecracker-microvm#4099, where on AMD inside the guest, sshd gets stuck inside a nanosleep syscall for longer than intended (and thus longer than the 10s timeout). Alternatively, we could pass `lapic=notscdeadline` in the kernel commandline arguments, but we do not have an easy way to _append_ to the default kernel command line from integration tests, and I'd like to avoid duplicating our default across a bunch of places. Signed-off-by: Patrick Roy <[email protected]>

The `wait_for_up()` sends a dummy SSH command with a timeout of 10s, with the intention of blocking the integration test until a VM is responsive after booting. However, post-snapshot restore it runs into issue #4099, where on AMD inside the guest, sshd gets stuck inside a nanosleep syscall for longer than intended (and thus longer than the 10s timeout). Alternatively, we could pass `lapic=notscdeadline` in the kernel commandline arguments, but we do not have an easy way to _append_ to the default kernel command line from integration tests, and I'd like to avoid duplicating our default across a bunch of places. Signed-off-by: Patrick Roy <[email protected]>

When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming. One example of such an implicit dependency between MSRs is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this restoration implies that the guest was waiting for the tsc_deadline timer to expire at the time of being paused for snapshotting. If yes, it primes a (either harddware or software depending on support) timer on the host to make sure the guest will receive the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison will yield a wrong result (as the deadline value is compared with something uninitialized). This can either result in KVM determining the guest wasn't waiting for a timing expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future. Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms on the other hand always fires (potentially firing immediately if the deadline value is in the past). This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. Fixes firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming. One example of such an implicit dependency between MSRs is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this restoration implies that the guest was waiting for the tsc_deadline timer to expire at the time of being paused for snapshotting. If yes, it primes a (either harddware or software depending on support) timer on the host to make sure the guest will receive the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison will yield a wrong result (as the deadline value is compared with something uninitialized). This can either result in KVM determining the guest wasn't waiting for a timing expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future. Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms on the other hand always fires (potentially firing immediately if the deadline value is in the past). This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. We realize this by splitting the lists of MSRs that KVM gives us into one additional chunk containing all "deferred" MSRs that needs to be restored "as late as possible". This splitting happens at snapshot creation time, to get it off the hot-path. Fixes firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

roypat · 2024-07-08T06:41:43Z

Good morning @bduffany,
Sorry again for taking so long for getting to the bottom of this. We believe to have found the root-cause of this issue and have a proposed fix in #4666. The problem was Firecracker not correctly taking into account dependencies between MSR_IA32_TSC and MSR_IA32_TSC_DEADLINE, which lead to MSR_IA32_TSC_DEADLINE being restored before MSR_IA32_TSC. However, since KVM relies on the guest TSC value (and thus MSR_IA32_TSC) for restoration of the deadline MSR, this lead to incorrect timer intialization.

With the commit from #4666, I am no longer able to reproduce the issue. Could you also give this a try? Thanks!

To follow up on some loose ends from @pb8o's last update: The alternate fix of simply disabling the TSC_DEADLINE feature (either via CPUID or lapic=notscdeadline kernel boot argument) is equally valid. This forces the kernel to fall back onto the APIC timer's oneshot and period modes to setting up timer interrupts. The only difference is that periodic/one-shot modes of the APIC timer run at a different clock frequency than the TSC_DEADLINE time (which counts at the TSC frequency). Thus, the kernel needs to calibrate on startup to find out how to convert between nanoseconds and APIC timer clock cycles. This calibration takes 100ms (depending on kernel configuration), and is the reason why we were seeing a boottime regression with this option. This calibration cannot be avoided, as there is currently no way for a hypervisor to tell a Linux guest about the APIC timer frequency on AMD (funnily enough, there are CPUID leaves that KVM checks on Intel, but that's exactly the case where we don't have any problems...). Other than this initial calibration, there should be no impact on guest functionality though.

When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming. One example of such an implicit dependency between MSRs is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this restoration implies that the guest was waiting for the tsc_deadline timer to expire at the time of being paused for snapshotting. If yes, it primes a (either harddware or software depending on support) timer on the host to make sure the guest will receive the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison will yield a wrong result (as the deadline value is compared with something uninitialized). This can either result in KVM determining the guest wasn't waiting for a timing expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future. Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms on the other hand always fires (potentially firing immediately if the deadline value is in the past). This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. We realize this by splitting the lists of MSRs that KVM gives us into one additional chunk containing all "deferred" MSRs that needs to be restored "as late as possible". This splitting happens at snapshot creation time, to get it off the hot-path. Fixes firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

add changelog entry about #4099 Signed-off-by: Patrick Roy <[email protected]>

When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming. One example of such an implicit dependency between MSRs is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this restoration implies that the guest was waiting for the tsc_deadline timer to expire at the time of being paused for snapshotting. If yes, it primes a (either harddware or software depending on support) timer on the host to make sure the guest will receive the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison will yield a wrong result (as the deadline value is compared with something uninitialized). This can either result in KVM determining the guest wasn't waiting for a timing expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future. Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms on the other hand always fires (potentially firing immediately if the deadline value is in the past). This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. We realize this by splitting the lists of MSRs that KVM gives us into one additional chunk containing all "deferred" MSRs that needs to be restored "as late as possible". This splitting happens at snapshot creation time, to get it off the hot-path. Fixes firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming. One example of such an implicit dependency between MSRs is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this restoration implies that the guest was waiting for the tsc_deadline timer to expire at the time of being paused for snapshotting. If yes, it primes a (either harddware or software depending on support) timer on the host to make sure the guest will receive the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison will yield a wrong result (as the deadline value is compared with something uninitialized). This can either result in KVM determining the guest wasn't waiting for a timing expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future. Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms on the other hand always fires (potentially firing immediately if the deadline value is in the past). This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. We realize this by splitting the lists of MSRs that KVM gives us into one additional chunk containing all "deferred" MSRs that needs to be restored "as late as possible". This splitting happens at snapshot creation time, to get it off the hot-path. Fixes #4099 Signed-off-by: Patrick Roy <[email protected]>

add changelog entry about #4099 Signed-off-by: Patrick Roy <[email protected]>

bduffany · 2024-07-15T22:31:37Z

@roypat Thank you so much for the fix! I verified on my AMD machine that the script is able to complete 100 iterations without getting stuck, without the lapic=notscdeadline workaround.

The `wait_for_up()` sends a dummy SSH command with a timeout of 10s, with the intention of blocking the integration test until a VM is responsive after booting. However, post-snapshot restore it runs into issue firecracker-microvm#4099, where on AMD inside the guest, sshd gets stuck inside a nanosleep syscall for longer than intended (and thus longer than the 10s timeout). Alternatively, we could pass `lapic=notscdeadline` in the kernel commandline arguments, but we do not have an easy way to _append_ to the default kernel command line from integration tests, and I'd like to avoid duplicating our default across a bunch of places. Signed-off-by: Patrick Roy <[email protected]>

When restoring MSRs, we currently do not enforce any sort of relative ordering. However, some MSRs have implicit dependencies on other MSRs being restored before them, and failing to fulfill these dependencies can result in incorrect VM behavior after resuming. One example of such an implicit dependency between MSRs is the pair (`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring `MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this restoration implies that the guest was waiting for the tsc_deadline timer to expire at the time of being paused for snapshotting. If yes, it primes a (either harddware or software depending on support) timer on the host to make sure the guest will receive the expected interrupt after restoration. To determine whether this is needed, KVM looks at the guest's timestamp counter (TSC) and compares it with the requested tsc_deadline value. The value KVM reads for the guest's TSC depends on the value of MSR_IA32_TSC. Thus, if MSR_IA32_TSC_DEADLINE is set before MSR_IA32_TSC is restored, this comparison will yield a wrong result (as the deadline value is compared with something uninitialized). This can either result in KVM determining the guest wasn't waiting for a timing expiry at the time of snapshotting, or cause it to schedule the timer interrupt too far into the future. Note that the former is only a problem on AMD platforms, which do not support the TSC_DEADLINE feature at the hardware level. Here, KVM falls back to a software timer, which explicitly does not notify the vCPU if the deadline value is "in the past". The hardware timer used on other x86 platforms on the other hand always fires (potentially firing immediately if the deadline value is in the past). This commit fixes the above by ensuring we always restore MSR_IA32_TSC before restoring MSR_IA32_TSC_DEADLINE. We realize this by splitting the lists of MSRs that KVM gives us into one additional chunk containing all "deferred" MSRs that needs to be restored "as late as possible". This splitting happens at snapshot creation time, to get it off the hot-path. Fixes firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

zulinx86 added the Type: Bug Indicates an unexpected problem or unintended behavior label Sep 11, 2023

zulinx86 self-assigned this Sep 11, 2023

bduffany changed the title ~~[Bug] Processes that use nanosleep get stuck after resuming VM from snapshot~~ [Bug] Processes get stuck after resuming VM from snapshot Sep 11, 2023

CuriousGeorgiy mentioned this issue Sep 15, 2023

firecracker-containerd in-VM agent is not responsive after snapshot creation when using vanilla firecracker setup vhive-serverless/vHive#818

Open

ShadowCurse assigned wearyzen Sep 25, 2023

ShadowCurse added the Priority: High Indicates than an issue or pull request should be resolved ahead of issues or pull requests labelled label Sep 25, 2023

pb8o changed the title ~~[Bug] Processes get stuck after resuming VM from snapshot~~ [Bug] [AMD] Processes get stuck after resuming VM from snapshot Jan 23, 2024

pb8o self-assigned this Jan 23, 2024

pb8o mentioned this issue Feb 12, 2024

[Bug] Boot timer seems off #4429

Closed

3 tasks

roypat added a commit to pb8o/firecracker that referenced this issue Jul 1, 2024

Merge branch 'main' into firecracker-microvmgh-4099

4079fdb

roypat mentioned this issue Jul 3, 2024

fix(test): remove wait_for_up from test_vmgenid #4664

Merged

9 tasks

roypat mentioned this issue Jul 5, 2024

fix: Always restore MSR_IA32_TSC_DEADLINE after MSR_IA32_TSC #4666

Merged

9 tasks

roypat added a commit to roypat/firecracker that referenced this issue Jul 8, 2024

doc: Add CHANGELOG entry

a186424

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

roypat added a commit to roypat/firecracker that referenced this issue Jul 8, 2024

doc: Add CHANGELOG entry

6128a95

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

pb8o closed this as completed in #4666 Jul 8, 2024

pb8o closed this as completed in e41bc79 Jul 8, 2024

pb8o pushed a commit that referenced this issue Jul 8, 2024

doc: Add CHANGELOG entry

b5bb8ec

add changelog entry about #4099 Signed-off-by: Patrick Roy <[email protected]>

roypat added a commit to roypat/firecracker that referenced this issue Jul 9, 2024

doc: Add CHANGELOG entry

3c95758

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

ShadowCurse pushed a commit that referenced this issue Jul 9, 2024

doc: Add CHANGELOG entry

acecc8d

add changelog entry about #4099 Signed-off-by: Patrick Roy <[email protected]>

ghost pushed a commit to JamesC1305/firecracker that referenced this issue Aug 14, 2024

doc: Add CHANGELOG entry

bf9ce8f

add changelog entry about firecracker-microvm#4099 Signed-off-by: Patrick Roy <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Processes get stuck after resuming VM from snapshot #4099

[Bug] Processes get stuck after resuming VM from snapshot #4099

bduffany commented Sep 10, 2023 •

edited

Loading

bduffany commented Sep 10, 2023 •

edited

Loading

zulinx86 commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

zulinx86 commented Sep 11, 2023

bduffany commented Sep 11, 2023

bduffany commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

CuriousGeorgiy commented Sep 15, 2023 •

edited

Loading

bduffany commented Sep 15, 2023

CuriousGeorgiy commented Sep 15, 2023

bduffany commented Sep 15, 2023 •

edited

Loading

wearyzen commented Sep 28, 2023

bduffany commented Nov 2, 2023 •

edited

Loading

wearyzen commented Nov 6, 2023

ShadowCurse commented Nov 14, 2023

pb8o commented Jan 23, 2024

bduffany commented Jan 23, 2024 •

edited

Loading

bduffany commented Feb 29, 2024 •

edited

Loading

pb8o commented Mar 18, 2024

bduffany commented Mar 26, 2024 •

edited

Loading

pb8o commented Jul 2, 2024

roypat commented Jul 8, 2024

bduffany commented Jul 15, 2024 •

edited

Loading

[Bug] Processes get stuck after resuming VM from snapshot #4099

[Bug] Processes get stuck after resuming VM from snapshot #4099

Comments

bduffany commented Sep 10, 2023 • edited Loading

Describe the bug

To Reproduce

Expected behaviour

Environment

Additional context

Checks

bduffany commented Sep 10, 2023 • edited Loading

zulinx86 commented Sep 11, 2023 • edited Loading

bduffany commented Sep 11, 2023 • edited Loading

zulinx86 commented Sep 11, 2023

bduffany commented Sep 11, 2023

bduffany commented Sep 11, 2023 • edited Loading

bduffany commented Sep 11, 2023 • edited Loading

bduffany commented Sep 11, 2023 • edited Loading

CuriousGeorgiy commented Sep 15, 2023 • edited Loading

bduffany commented Sep 15, 2023

CuriousGeorgiy commented Sep 15, 2023

bduffany commented Sep 15, 2023 • edited Loading

wearyzen commented Sep 28, 2023

bduffany commented Nov 2, 2023 • edited Loading

wearyzen commented Nov 6, 2023

ShadowCurse commented Nov 14, 2023

pb8o commented Jan 23, 2024

bduffany commented Jan 23, 2024 • edited Loading

bduffany commented Feb 29, 2024 • edited Loading

pb8o commented Mar 18, 2024

bduffany commented Mar 26, 2024 • edited Loading

pb8o commented Jul 2, 2024

roypat commented Jul 8, 2024

bduffany commented Jul 15, 2024 • edited Loading

bduffany commented Sep 10, 2023 •

edited

Loading

bduffany commented Sep 10, 2023 •

edited

Loading

zulinx86 commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

bduffany commented Sep 11, 2023 •

edited

Loading

CuriousGeorgiy commented Sep 15, 2023 •

edited

Loading

bduffany commented Sep 15, 2023 •

edited

Loading

bduffany commented Nov 2, 2023 •

edited

Loading

bduffany commented Jan 23, 2024 •

edited

Loading

bduffany commented Feb 29, 2024 •

edited

Loading

bduffany commented Mar 26, 2024 •

edited

Loading

bduffany commented Jul 15, 2024 •

edited

Loading