kernel: watchdog: BUG: soft lockup - CPU - CoreOS 41 CPU lockup on nfs activity #1841
Comments
If you can reproduce this and don't know whether it has been fixed upstream yet, consider trying out different kernels to see if the behavior has been fixed. You can either try fresh installs with builds from https://builds.coreos.fedoraproject.org/browser?stream=rawhide (select ...), or you can override replace the kernel on an existing machine with one of the kernels from https://bodhi.fedoraproject.org/updates/?packages=kernel Example:
NOTE: preferably this would be on a throwaway machine you don't care about.
I have been running stable on kernel 6.10.12-200.fc40.x86_64 since November 21st. I am now moving to upstream kernel-6.12.0-65.fc42 and will report on the behavior.
Ah well, that didn't take too long: the issue is still there upstream, and it just caused a complete crash on my box.
I hate "me too" comments, but I found this after updating to Ubuntu 24.10, which updated the kernel to 6.11. I have an Ubuntu 24.10 mini PC with a lot of Docker containers doing heavy writes to an NFS share. I removed that individual commit only and recompiled, and have not run into it in a week. It had happened 2 times the previous week.
Hey @ch3lmi, not being a kernel dev myself, really the only things I would do are search the net to see if anyone else has something similar or whether it is being discussed upstream but not fixed yet, OR try to bisect to find the kernel commit that introduced the regression, which usually means you then know who to poke about it, and they will probably fix it.
Hi @dustymabe, thanks for the recommendation! I had actually done a fair amount of research already and found someone who has the same issue and traced the regression to a specific commit in the 6.11 kernel, all of which is described here: https://bugzilla.kernel.org/show_bug.cgi?id=219508 (see my initial post as well).
Ahh. Since you know the commit, maybe try emailing the commit author and the others who signed off? You can also look at the recent git history for the files touched by that commit to see if any new changes were made that look like a bug fix in this area.
It looks like the specific commit was reverted due to incorrect assumptions (as we've found here). It is part of the 6.13 RCs that are out now so I think it's safe to assume when 6.13 is out, this should work. Edit: It's part of the 6.11.11 kernel patch as well.
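For anyone wanting to check quickly whether a given machine is in the window discussed in this thread, here is a minimal sketch. It only encodes what was reported above (regression in the 6.11 series, revert reported in 6.11.11 and the 6.13 RCs, 6.12.0 reported as still affected), so treat the labels as assumptions from these comments, not as an authoritative list of affected kernels. The function names and labels are made up for illustration.

```python
import re

# Assumption taken only from this thread: the revert is reported in 6.11.11.
FIXED_IN_6_11 = (6, 11, 11)


def parse_release(release: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from e.g. '6.11.5-300.fc41.x86_64'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"no kernel version found in {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def classify(release: str) -> str:
    """Label a kernel release according to the reports in this thread."""
    version = parse_release(release)
    if version < (6, 11, 0):
        return "before-regression"
    if version < FIXED_IN_6_11:
        return "suspect"
    if version[:2] == (6, 12):
        # 6.12.0 was reported above as still affected; the status of
        # later 6.12.x builds is not stated in this thread.
        return "unknown"
    # 6.11.11 and later 6.11.x, plus 6.13 and newer.
    return "revert-reported"
```

On a live system the input could come from `platform.release()`, for example `classify(platform.release())`.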
Describe the bug
A CoreOS VM has been running for the past 4 years with no issues; it looks like kernel 6.11.5 / CoreOS 41 may have doomed it.
The sole purpose of the VM is to act as a Docker host; the OS is running nothing besides Docker.
It is running 36 containers, including Portainer.
The containers have various purposes and performance needs: some very low (mrtg, acme), some a little more (freshRSS, phpMyAdmin, alpine-cron, mariadb) and a few with higher needs (gitlab, nextcloud, mastodon, transmission, borgmatic, synching).
The issues started on November 8th, coinciding with the release of CoreOS 41; the VM was running flawlessly until then.
Since that day the machine has been freezing very frequently, almost daily, during NFS write operations. When the issue occurs, the machine runs into a soft lockup: the process currently using NFS produces a CPU spike and remains locked (i.e. it cannot be killed, even with kill -9).
Please see below a dmesg and journalctl dump.
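For anyone comparing against their own logs, a minimal sketch for pulling soft lockup reports out of kernel log text follows. It assumes the usual watchdog message shape, `soft lockup - CPU#<n> stuck for <s>s! [<task>:<pid>]`; the `Lockup` type and `find_soft_lockups` name are made up for illustration, and the pattern may need adjusting if your kernel formats the message differently.

```python
import re
from typing import Iterable, NamedTuple

# Assumed message shape: "... soft lockup - CPU#1 stuck for 26s! [task:1234]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]"
)


class Lockup(NamedTuple):
    cpu: int       # CPU number reported by the watchdog
    seconds: int   # how long the CPU was reported stuck
    task: str      # name of the task that was running
    pid: int       # PID of that task


def find_soft_lockups(lines: Iterable[str]) -> list[Lockup]:
    """Return one Lockup per log line that matches the watchdog message."""
    hits = []
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            cpu, seconds, task, pid = match.groups()
            hits.append(Lockup(int(cpu), int(seconds), task, int(pid)))
    return hits
```

It can be fed any iterable of lines, for example an open file containing saved `journalctl -k` or `dmesg` output.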
I have searched around the issue and found one other user reporting it (not on CoreOS though). I contacted that user, who informed me that a particular commit in the 6.11 kernel branch was causing the issue and that they are now running production with that commit removed and no issues. The issue has been reported as a kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=219508
Reproduction steps
Stage a CoreOS virtual machine and install containers that write heavily to NFS.
In my case the issue usually happens during a borg backup (borgmatic) from the local disk in the VM to an NFS-mounted backup location, or when adding a large torrent (for instance from https://torrent.fedoraproject.org/) to a transmission container where the destination disk is NFS-mounted.
I have spun up a second CoreOS box and transferred all my workloads to it; it took less than 2 days for the issue to recur. So while I can't reproduce it in the sense of doing something specific that triggers it, it can definitely be reproduced on a fresh installation.
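For anyone who wants to generate sustained NFS writes without setting up borgmatic or transmission, a minimal sketch is below. It simply writes a number of large files of random data and fsyncs each one. There is no guarantee it triggers the lockup (as said above, nothing specific triggers it on demand); the function name, file names and default sizes are arbitrary choices for illustration, and the target directory is whatever NFS mount you want to test.

```python
import os
from pathlib import Path


def write_load(target_dir, file_count=4, file_size=64 * 1024 * 1024,
               chunk_size=1024 * 1024):
    """Write file_count files of random data into target_dir.

    Each file is flushed and fsynced so the data is actually pushed to the
    backing store (the NFS server, if target_dir is on an NFS mount).
    Returns the total number of bytes written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    total = 0
    for index in range(file_count):
        path = target / f"nfs-load-{index:04d}.bin"
        with open(path, "wb") as handle:
            remaining = file_size
            while remaining > 0:
                size = min(chunk_size, remaining)
                handle.write(os.urandom(size))
                remaining -= size
            handle.flush()
            os.fsync(handle.fileno())
        total += file_size
    return total
```

Run it from inside a container or on the host against the NFS-mounted path, and raise `file_count` and `file_size` to approximate a backup-sized workload. Do this on a throwaway machine, since the lockup can take the whole box down.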
Expected behavior
Writing extensive sets of data to NFS-mounted disks should succeed with no issue.
Actual behavior
Writing extensive sets of data to NFS-mounted disks fails with a CPU soft lockup.
System details
Fedora CoreOS VM running on a TrueNAS SCALE host (Dragonfish-24.04.2.5), based on QEMU
2 CPUs with 2 cores and 2 threads, 16 GB memory, UEFI bootloader, 200 GB disk and one NIC.
Fedora CoreOS version 41.20241027.3.0
Linux coreos 6.11.5-300.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 22 20:11:15 UTC 2024 x86_64 GNU/Linux
Butane or Ignition config
Additional information
No response