Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: watchdog: BUG: soft lockup - CPU - CoreOS 41 CPU lockup on nfs activity #1841

Open
ch3lmi opened this issue Nov 21, 2024 · 8 comments
Labels
component/kernel kind/bug status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release.

Comments

@ch3lmi
Copy link

ch3lmi commented Nov 21, 2024

Describe the bug

A coreOS VM has been running for the past 4 years with no issues, it looks like kernel 6.11.5 / CoreOS 41 may have doomed it.

The sole purpose of the VM is to act as a docker host, the OS is running nothing besides docker.
It is running 36 containers including portainer.
The container have various purposes and performance needs, some very low (mrtg, acme) some a little more (freshRSS, phpMyAdmin, alpine-cron, mariadb) and a few with more needs (gitlab, nextcloud, mastodon, transmission, borgmatic, synching).

The issues started on November 8th, coinciding with the release of CoreOS 41, it was running flawlessly until then.

Since that day the machine is freezing very frequently, almost daily, during nfs write operations. When the issue occurs the machine will run into a softlockup, the process currently using nfs will produce a cpu spike and remain locked (e.g. not possible to kill -9)

Please see below a dmesg and journalctl dump.

I have searched around the issue and found one other user reporting it (not on coreOS though). I have contacted that user who informed me that a particular commit in the 6.11 kernel branch was causing the issue and that they are now running production with that commit removed with no issue. The issue has been reported as a kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=219508

Reproduction steps

Stage a coreOS virtual machine, install NFS-heavy write containers.
In my case the issue usually happens during a borg backup (borgmatic) from the local disk in the VM to an nfs mounted backup location or when adding a large torrent (for instance from https://torrent.fedoraproject.org/) to a transmission container where the destination disk is nfs mounted.

I have spun a second coreOS box and transferred all my workloads to it, it did not take 2 days for the issue to re-occur so while I can’t reproduce it in the sense that I can’t “do something” that will trigger it, it can definitely be reproduced on a fresh installation.

Expected behavior

Writing extensive sets of data on NFS mounted disks should succeed with no issue

Actual behavior

Writing extensive sets of data on NFS mounted disks fail with a CPU soft lockup

$ dmesg November 17th
Nov 17 15:41:58 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 78s! [kworker/u16:7:101418]
Nov 17 15:41:58 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 17 15:41:59 coreos kernel: #1: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #2: 100% system, 1% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #4: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #5: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
Nov 17 15:41:59 coreos kernel: ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 17 15:41:59 coreos kernel: CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 17 15:41:59 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 17 15:41:59 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 17 15:41:59 coreos kernel: Workqueue: writeback wb_workfn (flush-0:279)
Nov 17 15:41:59 coreos kernel: RIP: 0010:nfs_page_group_unlock+0x28/0x40 [nfs]
Nov 17 15:41:59 coreos kernel: Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf 48 8b 43 38 c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9 e2 79 f1 dd
Nov 17 15:41:59 coreos kernel: RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
Nov 17 15:41:59 coreos kernel: RAX: 0000000000000027 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
Nov 17 15:42:00 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
Nov 17 15:42:00 coreos kernel: RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
Nov 17 15:42:00 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
Nov 17 15:42:00 coreos kernel: R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
Nov 17 15:42:00 coreos kernel: FS: 0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
Nov 17 15:42:00 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 17 15:42:00 coreos kernel: CR2: 00003c020063e000 CR3: 000000028a42a001 CR4: 0000000000370ef0
Nov 17 15:42:00 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 17 15:42:00 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 17 15:42:00 coreos kernel: Call Trace:
Nov 17 15:42:00 coreos kernel:
Nov 17 15:42:08 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 17 15:42:08 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 17 15:42:11 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 17 15:42:15 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 17 15:42:19 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 17 15:42:21 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 17 15:42:25 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 17 15:42:44 coreos kernel:
Nov 17 15:43:02 coreos kernel:
$ journalctl November 17th
[ 8520.270364] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u16:7:101418]
[ 8520.271777] CPU#2 Utilization every 4s during lockup:
[ 8520.272344] #1: 84% system, 16% softirq, 1% hardirq, 0% idle
[ 8520.273036] #2: 98% system, 2% softirq, 1% hardirq, 0% idle
[ 8520.273717] #3: 100% system, 1% softirq, 0% hardirq, 0% idle
[ 8520.274405] #4: 100% system, 0% softirq, 1% hardirq, 0% idle
[ 8520.275130] #5: 100% system, 1% softirq, 1% hardirq, 0% idle
[ 8520.275821] Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 8520.275890] ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
[ 8520.275901] CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Not tainted 6.11.5-300.fc41.x86_64 #1
[ 8520.275904] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 8520.275908] Workqueue: writeback wb_workfn (flush-0:279)
[ 8520.275918] RIP: 0010:nfs_page_group_unlock+0x24/0x40 [nfs]
[ 8520.275955] Code: 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf <48> 8b 43 38 f6 c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9
[ 8520.275957] RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
[ 8520.275961] RAX: 0000000000000001 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
[ 8520.275963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
[ 8520.275964] RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
[ 8520.275966] R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
[ 8520.275967] R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
[ 8520.275969] FS: 0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
[ 8520.275970] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8520.275972] CR2: 00003c020063e000 CR3: 0000000341ad2004 CR4: 0000000000370ef0
[ 8520.275975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8520.275976] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8520.275978] Call Trace:
[ 8520.275980]
[ 8520.275985] ? watchdog_timer_fn.cold+0x233/0x311
[ 8520.275990] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 8520.275993] ? __hrtimer_run_queues+0x113/0x280
[ 8520.275996] ? kvm_clock_get_cycles+0x18/0x30
[ 8520.275999] ? hrtimer_interrupt+0xfa/0x210
[ 8520.276002] ? __sysvec_apic_timer_interrupt+0x52/0x100
[ 8520.276005] ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 8520.276008]
[ 8520.276009]
[ 8520.276010] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 8520.276014] ? nfs_page_group_unlock+0x24/0x40 [nfs]
[ 8520.276045] nfs_lock_and_join_requests+0x1b3/0x270 [nfs]
[ 8520.276082] nfs_page_async_flush+0x1b/0x1f0 [nfs]
[ 8520.276111] ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
[ 8520.276141] nfs_writepages_callback+0x30/0x50 [nfs]
[ 8520.276168] write_cache_pages+0x65/0xc0
[ 8520.276172] nfs_writepages+0x162/0x2e0 [nfs]
[ 8520.276200] ? __dev_queue_xmit+0x274/0xe80
[ 8520.276203] ? selinux_ip_postroute+0x114/0x430
[ 8520.276209] do_writepages+0x7e/0x280
[ 8520.276212] __writeback_single_inode+0x41/0x340
[ 8520.276216] writeback_sb_inodes+0x21c/0x4f0
[ 8520.276223] __writeback_inodes_wb+0x4c/0xf0
[ 8520.276226] wb_writeback+0x193/0x310
[ 8520.276229] wb_workfn+0x2af/0x450
[ 8520.276233] process_one_work+0x176/0x330
[ 8520.276237] worker_thread+0x252/0x390
[ 8520.276240] ? __pfx_worker_thread+0x10/0x10
[ 8520.276243] kthread+0xcf/0x100
[ 8520.276245] ? __pfx_kthread+0x10/0x10
[ 8520.276247] ret_from_fork+0x31/0x50
[ 8520.276250] ? __pfx_kthread+0x10/0x10
[ 8520.276252] ret_from_fork_asm+0x1a/0x30
[ 8520.276256]
$ journalctl November 21st (newly staged coreOS 41 from scratch, all containers restored)
Nov 21 07:32:06 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 25008s! [borg:491201]
Nov 21 07:32:06 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 21 07:32:06 coreos kernel: #1: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #2: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #4: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: #5: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:06 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace xt_nat veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl qxl virtio_balloon drm_ttm_helper ttm i2c_piix4 i2c_smbus joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_blk virtio_net net_failover failover nvme_tcp nvme_fabrics ata_generic nvme_keyring pata_acpi nvme_core nvme_auth serio_raw be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Nov 21 07:32:06 coreos kernel: scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 21 07:32:06 coreos kernel: CPU: 2 UID: 0 PID: 491201 Comm: borg Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 21 07:32:06 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 21 07:32:06 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 21 07:32:06 coreos kernel: RIP: 0010:nfs_lock_and_join_requests+0x19d/0x270 [nfs]
Nov 21 07:32:06 coreos kernel: Code: 08 48 89 ef e8 54 b7 ff ff 48 8b 6d 48 49 39 ec 75 e8 4c 89 e7 e8 f3 b6 ff ff 48 89 df 49 63 df e8 38 b7 ff ff e9 fd fe ff ff <4d> 8b 64 24 48 4c 39 e3 0f 85 45 ff ff ff 48 89 df e8 8d ab ff ff
Nov 21 07:32:06 coreos kernel: RSP: 0018:ffff9c9d0832f6d8 EFLAGS: 00000246
Nov 21 07:32:06 coreos kernel: RAX: 0000000000000001 RBX: ffff8bd836060200 RCX: 0000000000000001
Nov 21 07:32:06 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8bd96f644434
Nov 21 07:32:06 coreos kernel: RBP: ffff8bd836060238 R08: 0000000000000000 R09: 000000000000001b
Nov 21 07:32:06 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff8bd96f644400
Nov 21 07:32:06 coreos kernel: R13: ffffdc3800ddd000 R14: 0000000000000001 R15: 0000000000000000
Nov 21 07:32:06 coreos kernel: FS: 00007f65b7d6bb08(0000) GS:ffff8bdb2fb00000(0000) knlGS:0000000000000000
Nov 21 07:32:06 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 07:32:06 coreos kernel: CR2: 00007f65b0991000 CR3: 00000001256fc006 CR4: 0000000000370ef0
Nov 21 07:32:06 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 21 07:32:06 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 21 07:32:06 coreos kernel: Call Trace:
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:06 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 21 07:32:06 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 21 07:32:06 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 21 07:32:06 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 21 07:32:06 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 21 07:32:06 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 21 07:32:06 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:06 coreos kernel: ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
Nov 21 07:32:06 coreos kernel: ? nfs_lock_and_join_requests+0x19d/0x270 [nfs]
Nov 21 07:32:06 coreos kernel: ? nfs_lock_and_join_requests+0xcc/0x270 [nfs]
Nov 21 07:32:06 coreos kernel: nfs_page_async_flush+0x1b/0x1f0 [nfs]
Nov 21 07:32:06 coreos kernel: ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
Nov 21 07:32:06 coreos kernel: nfs_writepages_callback+0x30/0x50 [nfs]
Nov 21 07:32:06 coreos kernel: write_cache_pages+0x65/0xc0
Nov 21 07:32:06 coreos kernel: nfs_writepages+0x162/0x2e0 [nfs]
Nov 21 07:32:06 coreos kernel: do_writepages+0x7e/0x280
Nov 21 07:32:06 coreos kernel: filemap_fdatawrite_wbc+0x6e/0xa0
Nov 21 07:32:06 coreos kernel: __filemap_fdatawrite_range+0x60/0x90
Nov 21 07:32:06 coreos kernel: file_write_and_wait_range+0x47/0xb0
Nov 21 07:32:06 coreos kernel: nfs_file_fsync+0xa2/0x1c0 [nfs]
Nov 21 07:32:06 coreos kernel: __x64_sys_fdatasync+0x4f/0x90
Nov 21 07:32:06 coreos kernel: do_syscall_64+0x82/0x160
Nov 21 07:32:06 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:06 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:06 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:06 coreos kernel: ? do_vmi_align_munmap+0x266/0x550
Nov 21 07:32:06 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:06 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:06 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:06 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:06 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:06 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:06 coreos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Nov 21 07:32:06 coreos kernel: RIP: 0033:0x7f65b7d28862
Nov 21 07:32:06 coreos kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 e9 16 cc ff ff 41 54 b8 02 00 00 00 55 48 89 f5 be 00 88 08 00
Nov 21 07:32:06 coreos kernel: RSP: 002b:00007ffe9bc6b658 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Nov 21 07:32:06 coreos kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f65b7d28862
Nov 21 07:32:06 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
Nov 21 07:32:06 coreos kernel: RBP: 00007f65b7d6bb08 R08: 0000000000000000 R09: 0000000000000000
Nov 21 07:32:06 coreos kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f65b74c67d0
Nov 21 07:32:06 coreos kernel: R13: 00007f65b7c49d78 R14: 00007f65b75c6480 R15: 00007f65b74c6838
Nov 21 07:32:06 coreos kernel:
Nov 21 07:32:34 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 25034s! [borg:491201]
Nov 21 07:32:34 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 21 07:32:34 coreos kernel: #1: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #2: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #4: 100% system, 1% softirq, 1% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: #5: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 21 07:32:34 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace xt_nat veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl qxl virtio_balloon drm_ttm_helper ttm i2c_piix4 i2c_smbus joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_blk virtio_net net_failover failover nvme_tcp nvme_fabrics ata_generic nvme_keyring pata_acpi nvme_core nvme_auth serio_raw be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Nov 21 07:32:34 coreos kernel: scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 21 07:32:34 coreos kernel: CPU: 2 UID: 0 PID: 491201 Comm: borg Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 21 07:32:34 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 21 07:32:34 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 21 07:32:34 coreos kernel: RIP: 0010:nfs_folio_find_head_request+0x59/0x90 [nfs]
Nov 21 07:32:34 coreos kernel: Code: db 74 1f 48 39 5b 50 75 45 48 8d 7b 34 b8 01 00 00 00 f0 0f c1 43 34 85 c0 74 27 8d 50 01 09 c2 78 14 4c 89 e7 e8 d7 0b d7 d5 <48> 89 d8 5b 5d 41 5c c3 cc cc cc cc be 01 00 00 00 e8 41 17 43 d5
Nov 21 07:32:34 coreos kernel: RSP: 0018:ffff9c9d0832f6b8 EFLAGS: 00000246
Nov 21 07:32:34 coreos kernel: RAX: 0000000000000001 RBX: ffff8bd836060200 RCX: 0000000000000001
Nov 21 07:32:34 coreos kernel: RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8bd9bdac0cbc
Nov 21 07:32:34 coreos kernel: RBP: ffffdc3800ddd000 R08: 0000000000000000 R09: 000000000000001b
Nov 21 07:32:34 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff8bd9bdac0cbc
Nov 21 07:32:34 coreos kernel: R13: ffffdc3800ddd000 R14: 0000000000000001 R15: 0000000000000000
Nov 21 07:32:34 coreos kernel: FS: 00007f65b7d6bb08(0000) GS:ffff8bdb2fb00000(0000) knlGS:0000000000000000
Nov 21 07:32:34 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 07:32:34 coreos kernel: CR2: 00007f65b0991000 CR3: 00000001256fc006 CR4: 0000000000370ef0
Nov 21 07:32:34 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 21 07:32:34 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 21 07:32:34 coreos kernel: Call Trace:
Nov 21 07:32:34 coreos kernel:
Nov 21 07:32:34 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 21 07:32:34 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 21 07:32:34 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 21 07:32:34 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 21 07:32:34 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 21 07:32:34 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 21 07:32:34 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 21 07:32:34 coreos kernel:
Nov 21 07:32:34 coreos kernel:
Nov 21 07:32:34 coreos kernel: ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
Nov 21 07:32:34 coreos kernel: ? nfs_folio_find_head_request+0x59/0x90 [nfs]
Nov 21 07:32:34 coreos kernel: ? nfs_folio_find_head_request+0x59/0x90 [nfs]
Nov 21 07:32:34 coreos kernel: nfs_lock_and_join_requests+0x64/0x270 [nfs]
Nov 21 07:32:34 coreos kernel: nfs_page_async_flush+0x1b/0x1f0 [nfs]
Nov 21 07:32:34 coreos kernel: ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
Nov 21 07:32:34 coreos kernel: nfs_writepages_callback+0x30/0x50 [nfs]
Nov 21 07:32:34 coreos kernel: write_cache_pages+0x65/0xc0
Nov 21 07:32:34 coreos kernel: nfs_writepages+0x162/0x2e0 [nfs]
Nov 21 07:32:34 coreos kernel: do_writepages+0x7e/0x280
Nov 21 07:32:34 coreos kernel: filemap_fdatawrite_wbc+0x6e/0xa0
Nov 21 07:32:34 coreos kernel: __filemap_fdatawrite_range+0x60/0x90
Nov 21 07:32:34 coreos kernel: file_write_and_wait_range+0x47/0xb0
Nov 21 07:32:34 coreos kernel: nfs_file_fsync+0xa2/0x1c0 [nfs]
Nov 21 07:32:34 coreos kernel: __x64_sys_fdatasync+0x4f/0x90
Nov 21 07:32:34 coreos kernel: do_syscall_64+0x82/0x160
Nov 21 07:32:34 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:34 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:34 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:34 coreos kernel: ? do_vmi_align_munmap+0x266/0x550
Nov 21 07:32:34 coreos kernel: ? __vm_munmap+0xb9/0x170
Nov 21 07:32:34 coreos kernel: ? syscall_exit_to_user_mode+0x10/0x220
Nov 21 07:32:34 coreos kernel: ? do_syscall_64+0x8e/0x160
Nov 21 07:32:34 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:34 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:34 coreos kernel: ? clear_bhb_loop+0x25/0x80
Nov 21 07:32:34 coreos kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Nov 21 07:32:34 coreos kernel: RIP: 0033:0x7f65b7d28862
Nov 21 07:32:34 coreos kernel: Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 e9 16 cc ff ff 41 54 b8 02 00 00 00 55 48 89 f5 be 00 88 08 00
Nov 21 07:32:34 coreos kernel: RSP: 002b:00007ffe9bc6b658 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
Nov 21 07:32:34 coreos kernel: RAX: ffffffffffffffda RBX: 000000000000004b RCX: 00007f65b7d28862
Nov 21 07:32:34 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005
Nov 21 07:32:34 coreos kernel: RBP: 00007f65b7d6bb08 R08: 0000000000000000 R09: 0000000000000000
Nov 21 07:32:34 coreos kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007f65b74c67d0
Nov 21 07:32:34 coreos kernel: R13: 00007f65b7c49d78 R14: 00007f65b75c6480 R15: 00007f65b74c6838
Nov 21 07:32:34 coreos kernel:

System details

State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Thu 2024-11-21 12:08:47 UTC)
BootedDeployment:
● fedora:fedora/x86_64/coreos/stable
                  Version: 41.20241027.3.0 (2024-11-08T21:22:46Z)
                   Commit: 8caf08dc8e25f389ade284ad2ad3ef64763e30a63038459d89a616b8725d12a6
             GPGSignature: Valid signature by 466CF2D8B60BC3057AA9453ED0622462E99D6AD1

Fedora coreOS VM running on a Truenas Scale host (Dragonfish-24.04.2.5), based on QEmu
2 CPUs with 2 Cores and 2 Threads, 16 Gb memory, UEFI bootloader, 200 Gb disk and one NIC.

Fedora CoreOS version 41.20241027.3.0
Linux coreos 6.11.5-300.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 22 20:11:15 UTC 2024 x86_64 GNU/Linux

Butane or Ignition config

variant: fcos
version: 1.3.0
passwd:
  users:
    - name: user
      ssh_authorized_keys:
        - ssh-rsa XXXXXX
      password_hash: XXXXX
storage:
  files:
    - path: /etc/ssh/sshd_config.d/20-enable-passwords.conf
      mode: 0644
      contents:
        inline: |
          # Fedora CoreOS disables SSH password login by default.
          # Enable it.
          # This file must sort before 40-disable-passwords.conf.
          PasswordAuthentication yes
    - path: /etc/profile.d/systemd-pager.sh
      mode: 0644
      contents:
        inline: |
          # Tell systemd to not use a pager when printing information
          export SYSTEMD_PAGER=cat
    - path: /etc/sysctl.d/20-silence-audit.conf
      mode: 0644
      contents:
        inline: |
          # Raise console message logging level from DEBUG (7) to WARNING (4)
          # to hide audit messages from the interactive console
          kernel.printk=4
    - path: /etc/hostname
      mode: 0644
      contents:
        inline: |
          coreos
systemd:
  units:
    - name: docker.portainer.service
      enabled: true
      contents: |-
        [Unit]
        Description=Portainer Admin Container
        After=docker.service
        Requires=docker.service network.target network-online.target

        [Service]
        Type=oneshot
        RemainAfterExit=yes
        TimeoutStartSec=0
        ExecStartPre=-/usr/bin/docker stop %n
        ExecStartPre=-/usr/bin/docker rm %n
        ExecStartPre=/usr/bin/docker pull portainer/portainer-ee
        # Privileged mode is required for binding to local socket to work due to SELINUX (https://github.com/portainer/portainer/issues/849)
        ExecStart=/usr/bin/docker run --privileged=true -d -p 9000:9000 --name %n --restart always -v /var/run/docker.sock:/var/run/docker.sock -v portainer_data:/data portainer/portainer-ee
        ExecStop=/usr/bin/docker stop -t 15 %n

        [Install]
        WantedBy=multi-user.target

Additional information

No response

@dustymabe
Copy link
Member

dustymabe commented Nov 22, 2024

it can definitely be reproduced on a fresh installation.

if you can reproduce and you don't know if it has been fixed upstream yet then consider trying out different kernels to see if the behavior has been fixed.

You can either try fresh installs with builds from https://builds.coreos.fedoraproject.org/browser?stream=rawhide (select rawhide stream if you want the very latest)

Or you can override replace the kernel on an existing machine with one of the kernels from https://bodhi.fedoraproject.org/updates/?packages=kernel

Example:

sudo rpm-ostree override replace https://bodhi.fedoraproject.org/updates/FEDORA-2024-c5baa66128

NOTE: preferably this would be on a throw away machine you don't care about

@ch3lmi
Copy link
Author

ch3lmi commented Dec 2, 2024

I have been running stable on kernel 6.10.12-200.fc40.x86_64 since November 21st.

I am now moving to upstream kernel-6.12.0-65.fc42 and will report on the behavior.

@ch3lmi
Copy link
Author

ch3lmi commented Dec 2, 2024

Ah well, that didn't take too long, issue still there in upstream, just caused a complete crash on my box.

[  474.898009] br-ff4b6fc15e1b: port 1(veth49fe0d5) entered forwarding state
[15934.071816] hrtimer: interrupt took 2744199 ns
[27395.991743] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kworker/u34:9:238078]
[27395.991781] CPU#6 Utilization every 4s during lockup:
[27395.991788] 	#1: 100% system,	  0% softirq,	  1% hardirq,	  0% idle
[27395.991797] 	#2: 100% system,	  0% softirq,	  1% hardirq,	  0% idle
[27395.991805] 	#3: 100% system,	  0% softirq,	  1% hardirq,	  0% idle
[27395.991813] 	#4: 100% system,	  0% softirq,	  1% hardirq,	  0% idle
[27395.991821] 	#5: 100% system,	  0% softirq,	  0% hardirq,	  0% idle
[27395.991831] Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver xt_nat nfs lockd grace nfs_localio veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon qxl drm_ttm_helper i2c_piix4 ttm i2c_smbus joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_blk nvme_tcp virtio_net nvme_fabrics net_failover failover nvme_keyring ata_generic pata_acpi nvme_core nvme_auth serio_raw be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi
[27395.991906]  scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
[27395.991922] CPU: 6 UID: 0 PID: 238078 Comm: kworker/u34:9 Not tainted 6.12.0-65.fc42.x86_64 #1
[27395.991925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[27395.991929] Workqueue: writeback wb_workfn (flush-0:326)
[27395.991939] RIP: 0010:_raw_spin_lock+0x1b/0x30
[27395.991943] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 ff 05 88 73 e1 50 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 05 c3 cc cc cc cc 89 c6 e8 d7 04 00 00 90 c3 cc cc cc cc 90 90
[27395.991945] RSP: 0018:ffffb6e2c1b73800 EFLAGS: 00000246
[27395.991947] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[27395.991948] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9c7659df3e94
[27395.991950] RBP: fffffb4745409e00 R08: 0000000000000000 R09: 0000000000000000
[27395.991951] R10: 0000000000000000 R11: ffff9c797fd217c0 R12: ffff9c7659df3e94
[27395.991952] R13: fffffb4745409e00 R14: 0000000000000001 R15: 0000000000000000
[27395.991954] FS:  0000000000000000(0000) GS:ffff9c797fd00000(0000) knlGS:0000000000000000
[27395.991956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[27395.991957] CR2: 00007fd9dd3c7000 CR3: 000000043782a004 CR4: 0000000000372ef0
[27395.991961] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[27395.991962] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[27395.991965] Call Trace:
[27395.991967]  <IRQ>
[27395.991970]  ? watchdog_timer_fn.cold+0x233/0x311
[27395.991975]  ? __pfx_watchdog_timer_fn+0x10/0x10
[27395.991978]  ? __hrtimer_run_queues+0x113/0x280
[27395.991981]  ? kvm_clock_get_cycles+0x18/0x30
[27395.991984]  ? hrtimer_interrupt+0xfa/0x210
[27395.991986]  ? __sysvec_apic_timer_interrupt+0x52/0x100
[27395.991989]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[27395.991991]  </IRQ>
[27395.991992]  <TASK>
[27395.991993]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[27395.991997]  ? _raw_spin_lock+0x1b/0x30
[27395.992000]  nfs_folio_find_head_request+0x29/0x90 [nfs]
[27395.992040]  nfs_lock_and_join_requests+0x64/0x270 [nfs]
[27395.992072]  nfs_page_async_flush+0x1b/0x1f0 [nfs]
[27395.992100]  ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
[27395.992128]  nfs_writepages_callback+0x30/0x50 [nfs]
[27395.992155]  write_cache_pages+0x65/0xc0
[27395.992159]  nfs_writepages+0x162/0x2e0 [nfs]
[27395.992190]  do_writepages+0x7e/0x280
[27395.992193]  ? __mark_inode_dirty+0x87/0x350
[27395.992196]  __writeback_single_inode+0x41/0x340
[27395.992199]  writeback_sb_inodes+0x21d/0x4e0
[27395.992207]  __writeback_inodes_wb+0x4c/0xf0
[27395.992210]  wb_writeback+0x193/0x310
[27395.992214]  wb_workfn+0x355/0x450
[27395.992218]  process_one_work+0x176/0x330
[27395.992222]  worker_thread+0x252/0x390
[27395.992225]  ? __pfx_worker_thread+0x10/0x10
[27395.992227]  kthread+0xcf/0x100
[27395.992230]  ? __pfx_kthread+0x10/0x10
[27395.992232]  ret_from_fork+0x31/0x50
[27395.992234]  ? __pfx_kthread+0x10/0x10
[27395.992236]  ret_from_fork_asm+0x1a/0x30
[27395.992240]  </TASK>

@RichardD012
Copy link

I hate "me too" comments but I found this after updating to Ubuntu 24.10 which updated the kernel to 6.11. I have a Ubuntu 24.10 mini PC that has a lot of docker images doing heavy writes to an NFS share. I removed that individual commit only and recompiled and have not run into it in a week. It had happened 2 times the previous week.

@dustymabe
Copy link
Member

dustymabe commented Dec 3, 2024

hey @ch3lmi - not being a kernel dev myself really the only thing I would do is search net to see if anyone else has something similar or it is being discussed upstream but not fixed yet.. OR try to bisect to find the kernel commit that introduced the regression, which usually means you now know who to poke to ask about it and will probably fix it.

@ch3lmi
Copy link
Author

ch3lmi commented Dec 4, 2024

Hi @dustymabe thanks for the recommendation !

I actually did my fair amount of research already, found someone who has the issue and regressed to a specific commit in the 6.11 kernel, all of which is described here: https://bugzilla.kernel.org/show_bug.cgi?id=219508 (see my initial post as well).
This is also confirmed by @RichardD012 now who also confirmed that this commit is causing the issue.
Honestly not really sure what more can be done at this stage unfortunately.

@dustymabe
Copy link
Member

ahh. since you know the commit then maybe try emailing the commit author and others who signed off?

You can also look at recent git history for the files touched from that commit to see if any new changes were made that seem like fixing a bug in this area.

@RichardD012
Copy link

RichardD012 commented Dec 9, 2024

It looks like the specific commit was reverted due to incorrect assumptions (as we've found here). It is part of the 6.13 RCs that are out now so I think it's safe to assume when 6.13 is out, this should work.

Edit: It's part of the 6.11.11 kernel patch as well.

@dustymabe dustymabe added the status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. label Dec 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/kernel kind/bug status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release.
Projects
None yet
Development

No branches or pull requests

4 participants