Discovery log page optimizations #690

calebsander · 2023-08-08T02:44:16Z

Several improvements to fetching the discovery log page:

Don't set the Retain Asynchronous Event bit. To ensure notifications aren't missed, this bit needs to be cleared on at least once Get Log Page command sent, which is not currently happening. The simplest solution is just to clear the bit in all commands, so do that.
Avoid fetching the log page header twice in a row in case of a GENCTR mismatch
Reduce the number of bytes of the header fetched from 1K to 20, since only the first 18 bytes currently have a defined meaning
Implement nvmf_get_discovery_log() in terms of nvmf_get_discovery_wargs() to avoid code duplication
Pass struct nvme_get_discovery_args instead of struct nvme_get_log_args to nvme_discovery_log(). This avoids confusion about which function is responsible for filling in which arguments, and avoids redundant arguments.

Tested with the mock ioctl test infrastructure in #688. The test cases were updated to match the new discovery behavior.

It also works when run against a real NVMe/TCP controller:

$ nvme discover --transport tcp --traddr 192.168.1.12 --trsvcid 4420

Discovery Log Number of Records 4, Generation counter 5
=====Discovery Log Entry 0======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  12549
trsvcid: 4420
subnqn:  nqn.2010-06.com.purestorage:flasharray.58152d9d09e68236
traddr:  192.168.4.12
eflags:  none
sectype: none
=====Discovery Log Entry 1======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  12550
trsvcid: 4420
subnqn:  nqn.2010-06.com.purestorage:flasharray.58152d9d09e68236
traddr:  192.168.2.12
eflags:  none
sectype: none
=====Discovery Log Entry 2======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  12294
trsvcid: 4420
subnqn:  nqn.2010-06.com.purestorage:flasharray.58152d9d09e68236
traddr:  192.168.1.12
eflags:  none
sectype: none
=====Discovery Log Entry 3======
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  12293
trsvcid: 4420
subnqn:  nqn.2010-06.com.purestorage:flasharray.58152d9d09e68236
traddr:  192.168.3.12
eflags:  none
sectype: none

igaw · 2023-08-08T08:11:58Z

From a quick glance, the refactoring makes sense. Obviously, the main point I need to wrap my head around is the RAE changes. @hreinecke @keithbusch

src/nvme/fabrics.c

hreinecke · 2023-08-11T10:22:24Z

Be careful here. Clearing the RAE bit means that everyone else will be missing the event. Especially the kernel driver, which relies on the asynchronous event for normal operations. So modifying the RAE bit from userspace should be disallowed.
So: NACK.

calebsander · 2023-08-11T15:11:17Z

@hreinecke I'm not sure I quite understand the scenario you're describing. As I understand the spec, clearing the RAE bit can never prevent an Asynchronous Event Notification from being delivered. An event occurring should be atomic with sending AENs in response to all relevant AERs received. So clearing the RAE bit only re-enables sending AENs in response to a new event.
Perhaps you are thinking of non-discovery controllers, where some log pages contain relative information and fetching them with the RAE bit cleared does cause that information to no longer be retained? In those cases I agree with your point that if multiple users are accessing the log page, they would need to coordinate the RAE bit to avoid missing an event. But the Discovery Log Page is absolute; the value of the RAE bit set in a Get Log Page command does not affect its contents in that fetch or future fetches. This change only affects the Discover Log Page, so I don't see how "everyone else will be missing the event".

hreinecke · 2023-08-11T16:46:12Z

I would agree with you if we were able to hold operations on the discovery controller purely in userspace, ie ensure that the kernel never opens persistent discovery controller connections internally.
Shouldn't be too hard, though, but I'm pretty sure that we haven't discussed that scenario so far. I'll see to bring it up a the ALPSS conference end of September, and once we have a consensus we can move forward here.

calebsander · 2023-08-11T17:06:48Z

I would agree with you if we were able to hold operations on the discovery controller purely in userspace, ie ensure that the kernel never opens persistent discovery controller connections internally.

I guess I'm still not understanding what your concern is. Could you explain the scenario more precisely?
If the kernel opened a persistent discovery connection internally, it wouldn't be exposed to userspace and therefore libnvme (running in userspace) would have no way to send a Get Log Page command to it. So I don't understand how this change would impact a (hypothetical) kernel-internal discovery controller.
In the current implementation of persistent discovery controllers, the kernel submits an AER initially, and upon receiving an AEN, notifies userspace via a udev event and resubmits the AER. So the kernel isn't itself interested in Discovery Log Page changes. Even if the kernel were interested, userspace clearing the RAE bit could only result in the kernel receiving a new notification sooner than it otherwise would. So more notifications are possible, but not fewer or missed notifications. If userspace really wanted to mess with the kernel, it would try to submit AERs to reach the controller's AER limit, blocking the kernel from submitting its AERs. That could prevent the kernel from receiving a notification.

hreinecke · 2023-08-12T09:12:45Z

I would agree with you if we were able to hold operations on the discovery controller purely in userspace, ie ensure that the kernel never opens persistent discovery controller connections internally.

I guess I'm still not understanding what your concern is. Could you explain the scenario more precisely? If the kernel opened a persistent discovery connection internally, it wouldn't be exposed to userspace and therefore libnvme (running in userspace) would have no way to send a Get Log Page command to it. So I don't understand how this change would impact a (hypothetical) kernel-internal discovery controller. In the current implementation of persistent discovery controllers, the kernel submits an AER initially, and upon receiving an AEN, notifies userspace via a udev event and resubmits the AER. So the kernel isn't itself interested in Discovery Log Page changes. Even if the kernel were interested, userspace clearing the RAE bit could only result in the kernel receiving a new notification sooner than it otherwise would. So more notifications are possible, but not fewer or missed notifications. If userspace really wanted to mess with the kernel, it would try to submit AERs to reach the controller's AER limit, blocking the kernel from submitting its AERs. That could prevent the kernel from receiving a notification.

Wrong. Any connection to a persistent discovery controller will be handled by the kernel; nvme-cli merely issues the command to the kernel to initiate the connection. The kernel will create the connection which will then be reflected back to userspace via sysfs. So whenever there is a persistent discovery connection the kernel will know about it, and will be receiving AENs from that connection. As the RAE bit is one-shot only (there can be only one place where the bit is cleared) one needs to establish clear rules who is responsible for clearing that bit.
Problem now is that we cannot know if there is a userspace program able and willing to clear the RAE bit; and even if there were, we don't have guarantees that there will be only one userspace program; there might be several all fighting to clear the bit.
Hence the only sane solution here is to have the kernel clear the RAE bit, and reflect any AEN back to userspace via uevents. And disallow userspace to clear the RAE bit, ever.

calebsander · 2023-08-12T16:48:03Z

Wrong.

Let's take a step back here; it sounds like we are coming from different premises.

For a log page that stores a diff (or "events"), I agree multiple concurrent users cannot safely issue Get Log Page commands in response to AENs to the same controller. One of them will have to clear the RAE bit so they can continue receiving notifications, and any users who fetch the Log Page later will miss the event. The only safe solution is to have one user per controller responding to AENs with Get Log Page commands. Perhaps I am misunderstanding your suggestion "to have the kernel clear the RAE bit". It seems like having the kernel immediately send a Get Log Page command with the RAE bit cleared would make things worse by adding yet another user on the same controller. The only way I can see this work is for the kernel to be the sole user, storing the results of the Get Log Page command and exposing them to userspace via some other mechanism (e.g. sysfs) so no users are trying to issue Get Log Page commands concurrently.
I don't think any of this applies to the Discovery Log Page for the reason I mentioned before: it is absolute, not a diff. So clearing the RAE bit on a Get Log Page command for the Discovery Log Page does not affect the Log Page's contents in future fetches. If multiple users are listening for AENs on a persistent discovery controller, then all can clear the RAE bit in their Get Log Page commands without affecting what the others see. The controller will resume sending AENs after the first Get Log Page command it receives, so the other users may receive additional AENs if the Discovery Log Page changes again before they can fetch it. But that's fine; there are no lost events. If there's no user listening for AENs, I don't think that's a problem either; there's no need to clear the RAE bit just to generate another ignored AEN in the future. Having the kernel send its own Get Log Page command to clear the RAE bit seems like it would add the overhead of executing another command, and possibly result in additional AENs before the first user has even fetched the Log Page in response to the first one. So I really don't see what benefit that would buy.
So I buy your argument that it's dangerous to allow userspace to submit Get Log Page commands with the RAE bit cleared when (1) this is a Log Page where that actually causes the event information to be lost and (2) the kernel or other users are also submitting Get Log Page commands in response to AENs on that controller. If the kernel is not using the Log Page itself, I don't think this is a big issue; userspace programs will need to coordinate access to a shared device with each other. If the kernel is using the Log Page, I could see that being a reason to block the Get Log Page command with the RAE bit cleared. It looks, however, like the kernel is already accustomed to dealing with this possibility (see the comment in nvme_clear_changed_ns_log()).
I don't see anything broken with the current userspace/kernel division of labor around the persistent discovery controller. If you do think there's something wrong, please describe a specific scenario that leads to the wrong results. I would, however, like to fix the bug I've identified where the RAE bit is never cleared if the Discovery Log Page has no entries.

igaw · 2023-10-11T09:20:05Z

Unfortunately, we missed the opportunity to discuss this during ALPSS in person. FWIW, your argument that the Discovery Log page is consumed as absolute and not a diff make sense to me. Given this I don't think we have a problem.

Anyway, I think this topic is worth to bring to the nvme mailing list and get everybody on board.

hreinecke · 2023-10-11T09:32:08Z

Well; the current design is that the kernel keeps track of all connected controllers, and hence will be receiving AENs from all controllers. Once an AEN is received the kernel reads the log page and clears the AEN. It then converts the AEN into a uevent for userspace consumption.
That is established practice, and I don't think we should change that.
But is also means that we should not allow to clear the RAE bit from userspace irrespective of which log page is requested.
As you outlined above, for 'normal' AENs this is potentially dangerous as we might miss events. And for discovery log page it doesn't matter as we'll always be able to read the entire contents.
But as it doesn't matter we might as well disallow clearing the RAE bit from userspace for all log pages and save us quite some hassle in the kernel trying to figure out on which log page clearing the RAE bit allowed and where not.

calebsander · 2023-10-12T03:06:45Z

Once an AEN is received the kernel reads the log page and clears the AEN.

Are you saying that the kernel is reading the Discovery Log Page in response to the AEN? I don't think that's true. If you look at the kernel code, in response to NVME_AER_NOTICE_DISC_CHANGED, it simply records the CQE result in aen_result for reporting in the uevent. This is different from the other types of AENs, to which the kernel responds to by actually retrieving the updated information. In the current model, it is userspace's responsibility to issue the Get Log Page command in response to a discovery AEN and process the results. (In nvme-cli, the uevent triggers a systemd service that calls nvme connect-all on the discovery controller.) Having looked at tcpdumps for persistent discovery connections, I can also confirm anecdotally that the kernel only resubmits the AER command in response to the AEN; it is nvme-cli that is sending the Get Log Page command to fetch the new contents of the discovery log page.

static void nvme_aen_uevent(struct nvme_ctrl *ctrl)
{
	char *envp[2] = { NULL, NULL };
	u32 aen_result = ctrl->aen_result;

	ctrl->aen_result = 0;
	if (!aen_result)
		return;

	envp[0] = kasprintf(GFP_KERNEL, "NVME_AEN=%#08x", aen_result);
	if (!envp[0])
		return;
	kobject_uevent_env(&ctrl->device->kobj, KOBJ_CHANGE, envp);
	kfree(envp[0]);
}

static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
{
	u32 aer_notice_type = nvme_aer_subtype(result);
	bool requeue = true;

	switch (aer_notice_type) {
	case NVME_AER_NOTICE_NS_CHANGED:
		set_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events);
		nvme_queue_scan(ctrl);
		break;
	case NVME_AER_NOTICE_FW_ACT_STARTING:
		/*
		 * We are (ab)using the RESETTING state to prevent subsequent
		 * recovery actions from interfering with the controller's
		 * firmware activation.
		 */
		if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) {
			nvme_auth_stop(ctrl);
			requeue = false;
			queue_work(nvme_wq, &ctrl->fw_act_work);
		}
		break;
#ifdef CONFIG_NVME_MULTIPATH
	case NVME_AER_NOTICE_ANA:
		if (!ctrl->ana_log_buf)
			break;
		queue_work(nvme_wq, &ctrl->ana_work);
		break;
#endif
	case NVME_AER_NOTICE_DISC_CHANGED:
		ctrl->aen_result = result;
		break;
	default:
		dev_warn(ctrl->device, "async event result %08x\n", result);
	}
	return requeue;
}

I agree with you that having the kernel re-fetch the log page and expose it to userspace somehow is another possible model, but it's not how Linux and nvme-cli currently work. Given that the current model works, I don't see a great reason to spend effort implementing Discovery Log Page fetching in the kernel, other than to mirror the other AENs, as you've outlined above.

Please let me know if I am misunderstanding you.

In several circumstances, nvme_discovery_log() leaves the RAE bit set for all Get Log Page commands it issues. This happens if the header indicates the log page has no records, or if a command other than the header re-fetch errors out. If the RAE bit is never cleared, the controller will not send more Discovery Asynchronous Event Notifications. Setting the RAE bit is not necessary to avoid missed events since the use of GENCTR ensures the log page is fetched atomically. It is better to risk receiving additional AENs than for the discovery controller to never send more AENs because it is waiting for a Get Log Page command that never comes. So clear the RAE bit when fetching each piece of the log page. Signed-off-by: Caleb Sander <[email protected]>

In case of a GENCTR mismatch, nvme_discovery_log() currently fetches the Discovery Log Page header again before fetching the entries. This is unnecessary since it was just fetched to determine GENCTR. So only fetch the header once and re-use it for the next loop iteration. Remove a couple of unnecessary variables in the process. Signed-off-by: Caleb Sander <[email protected]>

Most of the 1 KB Discovery Log Page header is reserved. Only the first 18 bytes are currently defined. So avoid transfering more data from the controller than necessary. Signed-off-by: Caleb Sander <[email protected]>

There is a lot of duplicated code between nvmf_get_discovery_log() and nvmf_get_discovery_wargs(). Since nvmf_get_discovery_wargs() is more general, use it to implement nvmf_get_discovery_log(). Signed-off-by: Caleb Sander <[email protected]>

nvme_discovery_log() takes nvme_get_log_args as input and then overwrites many of its fields. This leads to wasted code in the caller setting up unused fields. Instead pass nvme_get_discovery_args, which more accurately expresses the possible discovery log arguments. As an added benefit, it also subsumes the other arguments. And nvmf_get_discovery_wargs() can pass its arguments straight through. Signed-off-by: Caleb Sander <[email protected]>

igaw · 2023-10-31T11:16:32Z

I agree with @calebsander reasoning. The kernel is not relying on the RAE for the Asynchronous Event Type Notice. That means, userspace is full control of the RAE for the Notice type.

hreinecke

Wrong. 'genctr' might have been changed after the first call, so we need to repeatedly retrieve the header until we get a stable 'genctr'.

src/nvme/fabrics.c

igaw · 2023-11-02T13:45:34Z

Wrong. 'genctr' might have been changed after the first call, so we need to repeatedly retrieve the header until we get a stable 'genctr'.

I've discussed this offline with Hannes. So after fixing the initial get header retry, all should be fine.

calebsander · 2023-11-02T17:05:21Z

Sorry, I'm not following either of your comments. The only change to when the Discovery Log Page header is fetched is that we don't fetch it twice in a row in case of a GENCTR mismatch, as described in the commit description of "fabrics: avoid redundant Get Log Page on retry". The purpose of GENCTR is to ensure the log page is fetched atomically even though multiple Get Log Page commands may be issued. By verifying that GENCTR matches before and after fetching the log page entries (and retrying if it doesn't), we ensure the fetched log page isn't torn between Get Log Page commands. There is no requirement to check for GENCTR to be "stable", and indeed there's no way to ensure GENCTR isn't going to change in the future.

Here are some example scenarios to compare the old and new behavior (these are also illustrated by the unit tests):

(GENCTR = A, NUMREC = 0): no Get Log Page commands for the entries is necessary, so the log page fetch was atomic. The behavior is unchanged.
(GENCTR = A, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = A): GENCTR matches after fetching the full log page, so it was fetched atomically. The behavior is unchanged.
(GENCTR = A, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = B, NUMREC > 0) -> (GENCTR = B) -> fetch NUMREC entries -> (GENCTR = B): GENCTR mismatch after fetching the log pages the first time, so we retry. The new behavior is to skip the second GENCTR = B header fetch in a row, since we already re-fetched the header to check GENCTR.
(GENCTR = A, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = B, NUMREC > 0) -> (GENCTR = C, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = C): GENCTR changes again between the two consecutive header fetches and doesn't change again before the last header fetch. The new behavior would only detect it after fetching all the entries and the header again, so another retry would be required: (GENCTR = A, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = B, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = C, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = C).
(GENCTR = A, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = B, NUMREC > 0) -> (GENCTR = B) -> fetch NUMREC entries -> (GENCTR = C, NUMREC > 0) -> (GENCTR = C) -> fetch NUMREC entries -> (GENCTR = C): GENCTR changes again after the new header is fetched. The new behavior skips the 2 redundant header fetches: (GENCTR = A, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = B, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = C, NUMREC > 0) -> fetch NUMREC entries -> (GENCTR = C).

Both approaches achieve the atomicity requirement. In the unlikely event that there's another GENCTR change between fetching the header after fetching the log page and fetching the header again in the old code and no more GENCTR changes before the full log page is fetched, then the new behavior could require another retry (at least 2 Get Log Page commands). In the more likely case that GENCTR doesn't change again, or changes while fetching the log page entries, then the new behavior avoids fetching the header twice in a row. If you would prefer to keep the current behavior of fetching the header twice, I can drop that commit.

Please let me know if I am misunderstanding you.

igaw · 2023-11-02T17:18:31Z

Sorry, I'm not following either of your comments

My bad. You can ignore the point from Hannes about GENCTR. Your change is not breaking anything.

My comment is about the initial header get code, it misses the retry logic:

	nvme_msg(r, LOG_DEBUG, "%s: get header (try %d/%d)\n",
		 name, retries, args->max_retries);
	log_args.log = log;
	log_args.len = DISCOVERY_HEADER_LEN;
	if (nvme_get_log_page(fd, NVME_LOG_PAGE_PDU_SIZE, &log_args)) {
		nvme_msg(r, LOG_INFO,
			 "%s: discover try %d/%d failed, error %d\n",
			 name, retries, args->max_retries, errno);
		goto out_free_log;
	}

This should be something like

do {
	nvme_msg(r, LOG_DEBUG, "%s: get header (try %d/%d)\n",
		 name, retries, args->max_retries);
	log_args.log = log;
	log_args.len = DISCOVERY_HEADER_LEN;
	if (nvme_get_log_page(fd, NVME_LOG_PAGE_PDU_SIZE, &log_args)) {
		nvme_msg(r, LOG_INFO,
			 "%s: discover try %d/%d failed, error %d\n",
			 name, retries, args->max_retries, errno);
		goto out_free_log;
	}
} while (++retries < args->max_retries);
[...]

calebsander · 2023-11-03T02:11:23Z

Ah, thanks for explaining. But isn't the existing code already not retrying in case the header fetch fails?

ret = nvme_get_log_page(fd, NVME_LOG_PAGE_PDU_SIZE, args);
if (ret) {
	nvme_msg(r, LOG_INFO,
		 "%s: discover try %d/%d failed, error %d\n",
		 name, retries, max_retries, errno);
	goto out_free_log;
}
// ...
out_free_log:
free(log);
return NULL;

My understanding was that max_retries was just a limit on the number of times we would start over in case of repeated GENCTR mismatches. But we stop immediately if any of the Get Log Page commands fail.

igaw · 2023-11-03T07:32:21Z

Yeah, makes sense. Maybe we should update the error log message because it confused me :)

calebsander · 2023-11-06T16:32:07Z

Are there other things you want to discuss about these changes, or do you think they are ready to merge in?

igaw · 2023-11-07T07:07:29Z

Sorry for the delay. Busy at SUSE's Hackweek :)

calebsander · 2023-11-07T07:14:00Z

Thanks!

martin-belanger reviewed Aug 8, 2023

View reviewed changes

src/nvme/fabrics.c Outdated Show resolved Hide resolved

calebsander force-pushed the opt/discovery branch 2 times, most recently from ac5bf30 to 2e90d57 Compare August 9, 2023 20:24

calebsander force-pushed the opt/discovery branch from 2e90d57 to d32a5c0 Compare August 11, 2023 14:34

calebsander marked this pull request as ready for review August 11, 2023 14:52

calebsander added 5 commits October 13, 2023 12:43

fabrics: fetch smaller Discovery Log Page header

0fc136c

Most of the 1 KB Discovery Log Page header is reserved. Only the first 18 bytes are currently defined. So avoid transfering more data from the controller than necessary. Signed-off-by: Caleb Sander <[email protected]>

calebsander force-pushed the opt/discovery branch from d32a5c0 to 5835e18 Compare October 13, 2023 19:06

hreinecke requested changes Nov 2, 2023

View reviewed changes

igaw reviewed Nov 2, 2023

View reviewed changes

src/nvme/fabrics.c Show resolved Hide resolved

igaw merged commit 2d631ea into linux-nvme:master Nov 7, 2023
13 checks passed

calebsander deleted the opt/discovery branch November 7, 2023 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discovery log page optimizations #690

Discovery log page optimizations #690

calebsander commented Aug 8, 2023 •

edited

Loading

igaw commented Aug 8, 2023

hreinecke commented Aug 11, 2023

calebsander commented Aug 11, 2023

hreinecke commented Aug 11, 2023

calebsander commented Aug 11, 2023

hreinecke commented Aug 12, 2023

calebsander commented Aug 12, 2023

igaw commented Oct 11, 2023

hreinecke commented Oct 11, 2023

calebsander commented Oct 12, 2023 •

edited

Loading

igaw commented Oct 31, 2023

hreinecke left a comment

igaw commented Nov 2, 2023

calebsander commented Nov 2, 2023

igaw commented Nov 2, 2023 •

edited

Loading

calebsander commented Nov 3, 2023

igaw commented Nov 3, 2023

calebsander commented Nov 6, 2023

igaw commented Nov 7, 2023

calebsander commented Nov 7, 2023

Discovery log page optimizations #690

Discovery log page optimizations #690

Conversation

calebsander commented Aug 8, 2023 • edited Loading

igaw commented Aug 8, 2023

hreinecke commented Aug 11, 2023

calebsander commented Aug 11, 2023

hreinecke commented Aug 11, 2023

calebsander commented Aug 11, 2023

hreinecke commented Aug 12, 2023

calebsander commented Aug 12, 2023

igaw commented Oct 11, 2023

hreinecke commented Oct 11, 2023

calebsander commented Oct 12, 2023 • edited Loading

igaw commented Oct 31, 2023

hreinecke left a comment

Choose a reason for hiding this comment

igaw commented Nov 2, 2023

calebsander commented Nov 2, 2023

igaw commented Nov 2, 2023 • edited Loading

calebsander commented Nov 3, 2023

igaw commented Nov 3, 2023

calebsander commented Nov 6, 2023

igaw commented Nov 7, 2023

calebsander commented Nov 7, 2023

calebsander commented Aug 8, 2023 •

edited

Loading

calebsander commented Oct 12, 2023 •

edited

Loading

igaw commented Nov 2, 2023 •

edited

Loading