VM stuck in unresponsive state and prohibits listing processes on host #389

ddrazyk · 2023-05-08T10:03:35Z

We had an issue on 3 out of 4 hosts in an ovirt cluster (4.5.4-1.el8) where one VM gets stuck in an unresponsive state. It can be neither powered down nor restarted, and as long as its qemu process is running I can't list processes on that host. The VM is unreachable over the network and through ovirt's VNC console. The only way to resolve the issue is to restart the host from the ovirt webUI (or kill the qemu process).
The vdsm logs contain entries like this:

2023-05-05 21:27:52,848+0200 ERROR (qgapoller/1) [virt.periodic.Operation] <bound method QemuGuestAgentPoller._poller of <vdsm.virt.qemuguestagent.QemuGuestAgentPoller object at 0x7fe08c0d9630>> operation failed (periodic:187)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/periodic.py", line 185, in __call__
    self._func()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 476, in _poller
    vm_id, self._qga_call_get_vcpus(vm_obj))
  File "/usr/lib/python3.6/site-packages/vdsm/virt/qemuguestagent.py", line 797, in _qga_call_get_vcpus
    if 'online' in vcpus:
TypeError: argument of type 'NoneType' is not iterable

This eventually leads to:
2023-05-05 21:45:17,709+0200 ERROR (vm/220746d4) [virt.vm] (vmId='220746d4-56a5-40cc-8633-1285c167c4fe') Failed to update CPU set of the VM to match shared pool (cpumanagement:121)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 104, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 114, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 78, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2303, in pinVcpu
    raise libvirtError('virDomainPinVcpu() failed')
libvirt.libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/cpumanagement.py", line 108, in _assign_shared
    vm.pin_vcpu(vcpu, cpuset)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 6306, in pin_vcpu
    self._dom.pinVcpu(vcpu, cpuset)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 112, in f
    raise toe
vdsm.virt.virdomain.TimeoutError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)

This causes a CPU to get stuck on the qemu process. If I forcibly kill the process, everything returns to normal, but ovirt reports the VM's state as "unresponsive", or "powering down" if I try to shut it down from the webUI.
The hosts use glusterfs storage mounted via FUSE; gluster runs on separate hosts (3 hosts with replica 3 and a JBOD setup with 6 NVMe disks).
All hosts (hypervisors and gluster) run CentOS 8 Stream.

Version-Release number of selected component:
4.50.3.4-1.el8.x86_64

mz-pdm · 2023-05-15T07:34:20Z

As for the first traceback, the issue is fixed in Vdsm 4.50.5. It may be worth upgrading Vdsm to see whether that fixes the problem.
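
For illustration, the failure mode in the first traceback (a membership test on a value that turned out to be None) can be reproduced and guarded against roughly as follows. This is only a minimal sketch, not the actual Vdsm fix: the function name `guest_cpu_count`, the `cpu_count` key, and the assumed reply shape (a dict with an `'online'` CPU list string such as `'0-3'`, or None when the guest agent returns nothing) are all hypothetical.

```python
def guest_cpu_count(vcpus):
    """Count online vCPUs from a guest agent reply.

    Hypothetical reply shape (an assumption, not taken from Vdsm):
    a dict such as {'online': '0,2-3'}, or None when no data came back.
    """
    # Without this guard, "'online' in None" raises
    # "TypeError: argument of type 'NoneType' is not iterable".
    if not vcpus or 'online' not in vcpus:
        return {}

    count = 0
    for part in vcpus['online'].split(','):
        # Each part is either a single CPU ("2") or a range ("0-3").
        first, _, last = part.partition('-')
        count += int(last or first) - int(first) + 1
    return {'cpu_count': count}
```

With the guard in place, a missing reply yields an empty result instead of an exception in the polling thread, so the poller can simply try again on its next cycle.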

ddrazyk · 2023-05-15T08:46:58Z

Hi @mz-pdm, I will update to Vdsm 4.50.5 during the next update window and see whether the error message goes away.
As for the crashes, they seem unrelated to vdsm: after migrating all hypervisor hosts to Rocky8, the issue did not occur for 4 consecutive days.
