Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

autoconnect nbft systemd service #2078

Merged
merged 3 commits into from
Nov 8, 2023
Merged

autoconnect nbft systemd service #2078

merged 3 commits into from
Nov 8, 2023

Conversation

igaw
Copy link
Collaborator

@igaw igaw commented Oct 6, 2023

Add support to auto connect via NBFT.

@mwilck during updating our downstream packages I noticed these patches never made it upstream. I think they should also be upstream. Any objections?

@mwilck
Copy link
Contributor

mwilck commented Oct 9, 2023

No objections.


[Service]
Type=oneshot
ExecStartPre=/sbin/modprobe nvme-fabrics

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using ExecStartPre to load the kernel module, I think that the preferred way is to do as follows under the [Unit] section:

[Unit]

Wants=modprobe@nvme_fabrics.service
After=modprobe@nvme_fabrics.service

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. Good catch! Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call me stupid, but where are the actual 'connect' calls?
Are they done by the 'nvmf-autoconnect.service'?
If so, shouldn't we add an explicit order like 'Before' here?
If not, how does autoconnect happen?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, incidentally, we're using 'ExecStartPre' in 'nvmf-autoconnect.in', too. Maybe we should be changing that, too.

Copy link
Collaborator Author

@igaw igaw Oct 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connect call is there ExecStart=@SBINDIR@/nvme connect-nbft. The github web UI is not really good for reviewing. See the last line is a below this comment.

Anyway, I think this should be:

Wants=modprobe@nvme_fabrics.service
After=modprobe@nvme_fabrics.service
After=network-online.target
Before=remote-fs-pre.target

Yep, makes sense to update nvmf-autoconnect.in alongside.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call me stupid, but where are the actual 'connect' calls? Are they done by the 'nvmf-autoconnect.service'? If so, shouldn't we add an explicit order like 'Before' here? If not, how does autoconnect happen?

The nvm connect-all --nbft call is currently done by dracut.

https://github.com/dracutdevs/dracut/blob/6acfecae572fb457115b276b5b64d9424ad5187b/modules.d/95nvmf/nvmf-autoconnect.sh#L35

So maybe it's not a good idea to add it here.

Copy link
Collaborator

@hreinecke hreinecke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argl. I so hate the 'nbft' (and it's previous incarnation 'ibft') network device naming.
You always have the trouble to keep the udev rules in sync between initrd and the running system, not to mention the confusion arising when for some reason not booting from nbft
or nbft parsing fails after configuration.

@igaw
Copy link
Collaborator Author

igaw commented Oct 11, 2023

Argl. I so hate the 'nbft' (and it's previous incarnation 'ibft') network device naming. You always have the trouble to keep the udev rules in sync between initrd and the running system, not to mention the confusion arising when for some reason not booting from nbft or nbft parsing fails after configuration.

Any ideas how to improve this then? I know with newer kernels a device can have alternative names. If we could make one of these 'stable' we don't have to do this udev rename dance.

@mwilck
Copy link
Contributor

mwilck commented Oct 11, 2023

@hreinecke, we have been discussing this up and down in various places. My conclusion is that using a separate name space like ibft$X or nbft$X is the best thing we can currently do. Dracut's code for network interface naming would need to be completely rewritten if we were to use eth$X or system-style network interace names. It's actually a benefit that the names of the NBFT interfaces don't depend on the persistent network device naming rules in the root FS: We don't need to make sure that these rules are always in sync betwen root FS and initrd, because we use a separate name space.

See https://bugzilla.suse.com/show_bug.cgi?id=1194910#c45 and follow-up comments.
If you read through that bug, you can see that the initial intention was to have dracut generate eth$X names that are consistent with the present udev rules, that we seriously tried to achieve this, and that we eventually abandoned the idea because we couldn't figure out how to make it work reliably.

Your point about "not booting from the NBFT" is valid. But TBH, if a system that's normally booted via NBFT now is booted differently, it's a different system. A common problem is the difference between installation time (possibly without NBFT) and runtime (with NBFT), but our NVMe-oF/TCP testing with the nbft$X approach has not revealed any issues, neither on VMs nor on actual hardware. The point is that wicked sees the interface configuration through its plugin, and that the udev rule in this PR makes sure no renaming happens.

Note also that the udev rule in this PR is much simpler than the rule-generator approach that we currently have for iSCSI. Changing iSCSI to use the same approach is on my todo list.

The current status quo, like it or not, is that dracut generates the nbft$X interface name, which is consistent with it's behavior for iSCSI, and has not been controversial in the Timberland group or upstream dracut PRs. If nbft* names exist, they shouldn't be changed. That's all what the udev rule ensures. If you or anyone else can come up with initrd code that takes the to-be-booted system's persistent interface naming rules into account in a reliable way, fine with me; in that case the rule from this PR will be a no-op because no nbft$X interfaces will exist any more.

@mwilck
Copy link
Contributor

mwilck commented Oct 11, 2023

Obviously, upstream nvme-cli should not enforce any interface naming policy. But the udev rule in this PR doesn't. It just ensures that names matching a certain pattern, which is distinct from any other interface name patterns, shouldn't be changed. It's debatable whether such a rule should be part of upstream nvme-cli, but don't see a big problem with it.

Copy link
Contributor

@johnmeneghini johnmeneghini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should work. NVMe/FC boot does rely on nvme-cli, but since the nvmfc-boot-connections.service runs first, and nvme connect-all doesn't currently default --nbft, this should work.


[Service]
Type=oneshot
ExecStart=@SBINDIR@/nvme connect-nbft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has any one tested this with nvme-fc boot? We also support nvme-fc boot too.

Copy link
Contributor

@mwilck mwilck Oct 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did some testing, but more can't hurt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnmeneghini's comment makes me realize that this is the legacy syntax. It must be nvme connect-all --nbft for upstream.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the nvme invocation. Anything else to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have some concerns about how this will work together with dracut and network manager during nbft boot. I've asked @tbzatek to review this. Can we add Tomas as a reviewer?

https://github.com/tbzatek

I'd also like to test these changes in my nvmft boot testbed before we merge. I'll do that next week.


[Service]
Type=oneshot
ExecStart=@SBINDIR@/nvme connect-nbft
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnmeneghini's comment makes me realize that this is the legacy syntax. It must be nvme connect-all --nbft for upstream.

Use Wants/After to let systemd handle loading the kernel module, which
is the preferred way.

Signed-off-by: Daniel Wagner <[email protected]>
Create a separate unit file for connecting to NBFT-defined subsystems.
This unit is intended to be called in "post-up" scripts from network
management software if an interface defined in the HFI section of the
NBFT is brought up (L3-configured).

In simple scenarios with just one HFI, this won't be necessary because the
interface must be brought up in the initramfs already. But in multipath
scenarios, the initramfs may choose not to wait for every HFI to come up, and
thus it may be necessary to bring up the secondary connection(s) later on.

Signed-off-by: Martin Wilck <[email protected]>
[dwagner: use unit options instead of ExecStartPre
          update nvme command line]
Signed-off-by: Daniel Wagner <[email protected]>
In the initramfs, the interface naming is taken care of by dracut.
But there are various network-interface-naming policies in place which
may attempt to rename the interface, causing confusion and possibly
wrong interface parameters.

Add an udev rule that avoids renaming any network interface that
has been assigned a name nbft$N, which is by convention the naming
scheme that is used for NBFT device in the initramfs.

Note: The simpler 'NAME:="%k"' directive doesn't work because udev rejects
it ('Ignoring NAME="%k", as it will take no effect.'). The ":=" syntax makes
sure the interface isn't renamed any more by later rules. "INTERFACE" is set
by the kernel in the "add" uevent for a network interface.

Signed-off-by: Martin Wilck <[email protected]>
Copy link
Contributor

@johnmeneghini johnmeneghini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pvalena and @tbzatek can you please review these changes?

@pvalena
Copy link

pvalena commented Nov 3, 2023

@pvalena and @tbzatek can you please review these changes?

Looks good. I have no insight into nvme-cli, but FWIW I can see no conflict with what dracut does.

Copy link
Contributor

@tbzatek tbzatek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a brief look it looks fine, I'm not a systemd expert though. I'm going to test it on a real environment but I don't see anything that should block this pull request.

@mwilck
Copy link
Contributor

mwilck commented Nov 7, 2023

@johnmeneghini, wrt your concerns about NetworkManager: The autoconnect scripts are only concerned with the NVMe part of the connection establishment process.

Making another attempt to connect after switching root makes sense if dracut didn't bring up all connections during initrd processing, which can happen

  1. if some network connections failed / were unavailable during boot,
  2. in multipath scenarios, when the rootfs was mounted before all connections were established.

In case 1., making a second autoconnect attempt after switching root will only have an effect if the network management tool (NetworkManager) had succeeded in bringing up network routes that dracut had failed to bring up. Whether that has any chance to succeed depends on the configuration. In case 2., autoconnect may succeed even with no post-switch-root changes in the network configuration.

Either way, this is orthogonal to any attempts to set up NBFT-defined network connections. The service starts after "network-online.target", and that's where NM comes into play. While working towards "network-online.target", NM might read the NBFT, see if all NBFT-defined network interfaces have been configured, and try to bring up any missing ones. I think we all understand that "network-online.target" is a rather clumsy all-or-nothing target which is a good fit for laptops that need a working internet connection, but less so for a server booting from NVMeoF in a complex networking environment.

@tbzatek
Copy link
Contributor

tbzatek commented Nov 8, 2023

Okay, this appears to be working fine on my simple test setup. Uptime 19 hours without any glitch.

@igaw igaw merged commit 096c3e6 into linux-nvme:master Nov 8, 2023
12 checks passed
@igaw
Copy link
Collaborator Author

igaw commented Nov 8, 2023

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants