
using the fan-overlay network causes system instability #12161

Closed
freddrueck opened this issue Aug 18, 2023 · 8 comments

Comments
@freddrueck

Distribution: Ubuntu
Distribution version: 22.04.3
Kernel version: Linux hv1 6.2.0-26-generic #26 SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
LXD version: 5.15, installed via snap (5.15-002fa0f)
Storage backend in use: zfs
Issue description
When using LXD with an automatically created Fan overlay network, a kernel bug triggers that causes the system to almost completely lock up.

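For context, the fan network that lxd init creates corresponds roughly to the following manual command (lxdfan0 is the lxd init default name; "auto" picks the default gateway's subnet as the underlay):

lxc network create lxdfan0 bridge.mode=fan fan.underlay_subnet=auto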

Steps to reproduce

1. Run lxd init and accept the default Fan overlay network (full transcript below).
2. Launch a container attached to the lxdfan0 network.
3. Generate sustained network traffic from the container; the kernel warning below fires within a few minutes (a command sketch follows this list).
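A minimal sketch of steps 2 and 3 (the container name c1 and the ubuntu:22.04 image are illustrative assumptions, not from the report):

lxc launch ubuntu:22.04 c1          # the default profile attaches eth0 to lxdfan0
lxc exec c1 -- ping -c 600 8.8.8.8  # any sustained traffic over the fan network suffices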
Information to attach
Output from journalctl:

Aug 17 13:24:48 hv1 kernel: ------------[ cut here ]------------
Aug 17 13:24:48 hv1 kernel: Voluntary context switch within RCU read-side critical section!
Aug 17 13:24:48 hv1 kernel: WARNING: CPU: 5 PID: 9611 at kernel/rcu/tree_plugin.h:318 rcu_note_context_switch+0x2a7/0x2f0
Aug 17 13:24:48 hv1 kernel: Modules linked in: veth nft_masq nft_chain_nat vxlan ip6_udp_tunnel udp_tunnel dummy bridge stp llc ebtable_filter ebtables ip6table_raw ip6table_ma>
Aug 17 13:24:48 hv1 kernel: mei_me soundcore mei intel_pch_thermal mac_hid acpi_pad sch_fq_codel msr parport_pc ppdev lp parport ramoops reed_solomon pstore_blk pstore_zone ef>
Aug 17 13:24:48 hv1 kernel: CPU: 5 PID: 9611 Comm: rsyslogd Tainted: P IO 6.2.0-26-generic #26
Aug 17 13:24:48 hv1 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z170 Extreme6, BIOS P7.50 10/18/2018
Aug 17 13:24:48 hv1 kernel: RIP: 0010:rcu_note_context_switch+0x2a7/0x2f0
Aug 17 13:24:48 hv1 kernel: Code: 08 f0 83 44 24 fc 00 48 89 de 4c 89 f7 e8 61 c4 ff ff e9 1e fe ff ff 48 c7 c7 98 4e 53 9d c6 05 ee b7 3f 02 01 e8 09 1b f3 ff <0f> 0b e9 bd fd>
Aug 17 13:24:48 hv1 kernel: RSP: 0018:ffffae450d4df910 EFLAGS: 00010046
Aug 17 13:24:48 hv1 kernel: RAX: 0000000000000000 RBX: ffff9c9336172e40 RCX: 0000000000000000
Aug 17 13:24:48 hv1 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Aug 17 13:24:48 hv1 kernel: RBP: ffffae450d4df930 R08: 0000000000000000 R09: 0000000000000000
Aug 17 13:24:48 hv1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Aug 17 13:24:48 hv1 kernel: R13: ffff9c844e928000 R14: 0000000000000000 R15: 0000000000000000
Aug 17 13:24:48 hv1 kernel: FS: 00007f418098dc40(0000) GS:ffff9c9336140000(0000) knlGS:0000000000000000
Aug 17 13:24:48 hv1 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 17 13:24:48 hv1 kernel: CR2: 00007f4180d42000 CR3: 000000017152c002 CR4: 00000000003706e0
Aug 17 13:24:48 hv1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 17 13:24:48 hv1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 17 13:24:48 hv1 kernel: Call Trace:
Aug 17 13:24:48 hv1 kernel: <TASK>
Aug 17 13:24:48 hv1 kernel: __schedule+0xbc/0x5f0
Aug 17 13:24:48 hv1 kernel: schedule+0x68/0x110
Aug 17 13:24:48 hv1 kernel: schedule_hrtimeout_range_clock+0x97/0x130
Aug 17 13:24:48 hv1 kernel: ? __pfx_hrtimer_wakeup+0x10/0x10
Aug 17 13:24:48 hv1 kernel: schedule_hrtimeout_range+0x13/0x30
Aug 17 13:24:48 hv1 kernel: do_poll.constprop.0+0x22a/0x3b0
Aug 17 13:24:48 hv1 kernel: do_sys_poll+0x166/0x260
Aug 17 13:24:48 hv1 kernel: ? ___sys_sendmsg+0x95/0xe0
Aug 17 13:24:48 hv1 kernel: ? __mod_lruvec_state+0x37/0x50
Aug 17 13:24:48 hv1 kernel: ? __mod_lruvec_page_state+0xa0/0x160
Aug 17 13:24:48 hv1 kernel: ? folio_memcg_unlock+0x38/0x80
Aug 17 13:24:48 hv1 kernel: ? unlock_page_memcg+0x18/0x60
Aug 17 13:24:48 hv1 kernel: ? page_add_file_rmap+0x89/0x2b0
Aug 17 13:24:48 hv1 kernel: ? __pfx_pollwake+0x10/0x10
Aug 17 13:24:48 hv1 kernel: ? __sys_sendmmsg+0x100/0x210
Aug 17 13:24:48 hv1 kernel: ? __secure_computing+0x9b/0x110
Aug 17 13:24:48 hv1 kernel: ? __seccomp_filter+0x3df/0x5e0
Aug 17 13:24:48 hv1 kernel: ? syscall_exit_to_user_mode+0x2a/0x50
Aug 17 13:24:48 hv1 kernel: ? ktime_get_ts64+0x52/0x110
Aug 17 13:24:48 hv1 kernel: __x64_sys_poll+0xb5/0x150
Aug 17 13:24:48 hv1 kernel: do_syscall_64+0x59/0x90
Aug 17 13:24:48 hv1 kernel: ? exc_page_fault+0x92/0x1b0
Aug 17 13:24:48 hv1 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
Aug 17 13:24:48 hv1 kernel: RIP: 0033:0x7f4180d32d47
Aug 17 13:24:48 hv1 kernel: Code: 00 00 00 5b 49 8b 45 10 5d 41 5c 41 5d 41 5e c3 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff>
Aug 17 13:24:48 hv1 kernel: RSP: 002b:00007ffdc5692788 EFLAGS: 00000246 ORIG_RAX: 0000000000000007
Aug 17 13:24:48 hv1 kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4180d32d47
Aug 17 13:24:48 hv1 kernel: RDX: 0000000000001388 RSI: 0000000000000001 RDI: 00007ffdc56928b8
Aug 17 13:24:48 hv1 kernel: RBP: 0000000034c6ac4a R08: 0000000000000005 R09: 0000000000000000
Aug 17 13:24:48 hv1 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002
Aug 17 13:24:48 hv1 kernel: R13: 00007ffdc5692880 R14: 00007ffdc56928b8 R15: 00007f4180e3c340
Aug 17 13:24:48 hv1 kernel: </TASK>
Aug 17 13:24:48 hv1 kernel: ---[ end trace 0000000000000000 ]---


Here is how I ran lxd init:

root@hv1:/snap/lxd/25112# lxd init
Would you like to use LXD clustering? (yes/no) [default=no]: yes
What IP address or DNS name should be used to reach this server? [default=192.168.3.1]:
Are you joining an existing cluster? (yes/no) [default=no]:
What member name should be used to identify this server in the cluster? [default=hv1]:
Do you want to configure a new local storage pool? (yes/no) [default=yes]:
Name of the storage backend to use (btrfs, dir, lvm, zfs) [default=zfs]:
Would you like to create a new zfs dataset under rpool/lxd? (yes/no) [default=yes]:
Do you want to configure a new remote storage pool? (yes/no) [default=no]:
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]:
Would you like to create a new Fan overlay network? (yes/no) [default=yes]:
What subnet should be used as the Fan underlay? [default=auto]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]: yes
config:
  core.https_address: 192.168.3.1:8443
networks:
- config:
    bridge.mode: fan
    fan.underlay_subnet: auto
  description: ""
  name: lxdfan0
  type: ""
  project: default
storage_pools:
- config:
    source: rpool/lxd
  description: ""
  name: local
  driver: zfs
profiles:
- config: {}
  description: ""
  devices:
    eth0:
      name: eth0
      network: lxdfan0
      type: nic
    root:
      path: /
      pool: local
      type: disk
  name: default
projects: []
cluster:
  server_name: hv1
  enabled: true
  member_config: []
  cluster_address: ""
  cluster_certificate: ""
  server_address: ""
  cluster_password: ""
  cluster_certificate_path: ""
  cluster_token: ""
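For reference, this preseed can be replayed non-interactively on a fresh host (standard lxd init behavior; the preseed.yaml filename is an assumption):

cat preseed.yaml | lxd init --preseed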
It's hard to be precise about exactly when the bug triggers. Running lxd init as above does not trigger it while no containers are running, and a container with no network connection does not trigger it either. However, with at least one container running with an active network connection, the bug seems to trigger reliably. Within a few minutes the system is usually so unstable that it is barely usable, though at other times it remains somewhat usable for at least 20 minutes (if not more).

What does reliably happen is that the system will not cleanly reboot. I can only get it to reboot using the SysRq magic key; presumably this is related to this kernel message:

Aug 17 16:36:21 hv1 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P5846 } 243611 jiffies s: 801 root: 0x0/T
Aug 17 16:36:21 hv1 kernel: rcu: blocking rcu_node structures (internal RCU debug):

Presumably the stalled tasks cannot be terminated, so the system will not reboot.
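For anyone else stuck at this point, a minimal sketch of forcing the reboot through the magic SysRq interface (a standard kernel facility; assumes SysRq is enabled in the kernel config):

echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
echo s > /proc/sysrq-trigger      # emergency sync of all filesystems
echo b > /proc/sysrq-trigger      # immediate reboot, bypassing the stalled shutdown path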

@tomponline
Member

This looks like a kernel issue. Did it only start occurring recently when you upgraded to the 6.2 HWE kernel?

Please could you report this to the Ubuntu kernel team here:

https://bugs.launchpad.net/ubuntu/+source/linux

Also, they've asked that you run ubuntu-bug linux on your system and then submit that info as well to help them.

Thanks!

@GodBleak

GodBleak commented Oct 9, 2023

I'm also experiencing this issue. @freddrueck, did you open an issue with the Ubuntu kernel team that I can follow?

@MggMuggins
Contributor

I'm also experiencing this on Jammy VMs on my lxd branch based on 6413a94. Opened LP#2064176. Happy to try and reproduce on a release version if needed.

@roosterfish
Contributor

@MggMuggins thanks for digging this up.

I was facing the same issues when testing the PowerFlex storage driver and suspected the error was on that side. But instead it seems to be fan related.

@tomponline
Member

@MggMuggins have you tried the non-HWE kernel?
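For reference, switching a Jammy system to the non-HWE (GA) 5.15 kernel looks roughly like this (standard Ubuntu package names; verify the installed flavour first):

uname -r                        # e.g. 6.2.0-26-generic indicates the HWE kernel
sudo apt install linux-generic  # pulls in the GA 5.15 kernel series
sudo reboot                     # then select the 5.15 entry from the GRUB menu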

@tomponline
Member

And thanks for opening a kernel issue as the problem does appear to lie there.

@MggMuggins
Contributor

I had trouble getting 5.15 to boot in my LXD VM and didn't sink the time into figuring it out; I tried with Focal HWE and had no issues.

@mihalicyn
Member

Fix: https://lists.ubuntu.com/archives/kernel-team/2024-September/153511.html
