-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discovery log page optimizations #690
Conversation
From a quick glance, the refactoring makes sense. Obviously, the main point I need to wrap my head around is the RAE changes. @hreinecke @keithbusch |
ac5bf30
to
2e90d57
Compare
Be careful here. Clearing the RAE bit means that everyone else will be missing the event. Especially the kernel driver, which relies on the asynchronous event for normal operations. So modifying the RAE bit from userspace should be disallowed. |
2e90d57
to
d32a5c0
Compare
@hreinecke I'm not sure I quite understand the scenario you're describing. As I understand the spec, clearing the RAE bit can never prevent an Asynchronous Event Notification from being delivered. An event occurring should be atomic with sending AENs in response to all relevant AERs received. So clearing the RAE bit only re-enables sending AENs in response to a new event. |
I would agree with you if we were able to hold operations on the discovery controller purely in userspace, ie ensure that the kernel never opens persistent discovery controller connections internally. |
I guess I'm still not understanding what your concern is. Could you explain the scenario more precisely? |
Wrong. Any connection to a persistent discovery controller will be handled by the kernel; nvme-cli merely issues the command to the kernel to initiate the connection. The kernel will create the connection which will then be reflected back to userspace via sysfs. So whenever there is a persistent discovery connection the kernel will know about it, and will be receiving AENs from that connection. As the RAE bit is one-shot only (there can be only one place where the bit is cleared) one needs to establish clear rules who is responsible for clearing that bit. |
Let's take a step back here; it sounds like we are coming from different premises. For a log page that stores a diff (or "events"), I agree multiple concurrent users cannot safely issue Get Log Page commands in response to AENs to the same controller. One of them will have to clear the RAE bit so they can continue receiving notifications, and any users who fetch the Log Page later will miss the event. The only safe solution is to have one user per controller responding to AENs with Get Log Page commands. Perhaps I am misunderstanding your suggestion "to have the kernel clear the RAE bit". It seems like having the kernel immediately send a Get Log Page command with the RAE bit cleared would make things worse by adding yet another user on the same controller. The only way I can see this work is for the kernel to be the sole user, storing the results of the Get Log Page command and exposing them to userspace via some other mechanism (e.g. sysfs) so no users are trying to issue Get Log Page commands concurrently. |
Unfortunately, we missed the opportunity to discuss this during ALPSS in person. FWIW, your argument that the Discovery Log page is consumed as absolute and not a diff make sense to me. Given this I don't think we have a problem. Anyway, I think this topic is worth to bring to the nvme mailing list and get everybody on board. |
Well; the current design is that the kernel keeps track of all connected controllers, and hence will be receiving AENs from all controllers. Once an AEN is received the kernel reads the log page and clears the AEN. It then converts the AEN into a uevent for userspace consumption. |
Are you saying that the kernel is reading the Discovery Log Page in response to the AEN? I don't think that's true. If you look at the kernel code, in response to static void nvme_aen_uevent(struct nvme_ctrl *ctrl)
{
char *envp[2] = { NULL, NULL };
u32 aen_result = ctrl->aen_result;
ctrl->aen_result = 0;
if (!aen_result)
return;
envp[0] = kasprintf(GFP_KERNEL, "NVME_AEN=%#08x", aen_result);
if (!envp[0])
return;
kobject_uevent_env(&ctrl->device->kobj, KOBJ_CHANGE, envp);
kfree(envp[0]);
}
static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
{
u32 aer_notice_type = nvme_aer_subtype(result);
bool requeue = true;
switch (aer_notice_type) {
case NVME_AER_NOTICE_NS_CHANGED:
set_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events);
nvme_queue_scan(ctrl);
break;
case NVME_AER_NOTICE_FW_ACT_STARTING:
/*
* We are (ab)using the RESETTING state to prevent subsequent
* recovery actions from interfering with the controller's
* firmware activation.
*/
if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) {
nvme_auth_stop(ctrl);
requeue = false;
queue_work(nvme_wq, &ctrl->fw_act_work);
}
break;
#ifdef CONFIG_NVME_MULTIPATH
case NVME_AER_NOTICE_ANA:
if (!ctrl->ana_log_buf)
break;
queue_work(nvme_wq, &ctrl->ana_work);
break;
#endif
case NVME_AER_NOTICE_DISC_CHANGED:
ctrl->aen_result = result;
break;
default:
dev_warn(ctrl->device, "async event result %08x\n", result);
}
return requeue;
} I agree with you that having the kernel re-fetch the log page and expose it to userspace somehow is another possible model, but it's not how Linux and Please let me know if I am misunderstanding you. |
In several circumstances, nvme_discovery_log() leaves the RAE bit set for all Get Log Page commands it issues. This happens if the header indicates the log page has no records, or if a command other than the header re-fetch errors out. If the RAE bit is never cleared, the controller will not send more Discovery Asynchronous Event Notifications. Setting the RAE bit is not necessary to avoid missed events since the use of GENCTR ensures the log page is fetched atomically. It is better to risk receiving additional AENs than for the discovery controller to never send more AENs because it is waiting for a Get Log Page command that never comes. So clear the RAE bit when fetching each piece of the log page. Signed-off-by: Caleb Sander <[email protected]>
In case of a GENCTR mismatch, nvme_discovery_log() currently fetches the Discovery Log Page header again before fetching the entries. This is unnecessary since it was just fetched to determine GENCTR. So only fetch the header once and re-use it for the next loop iteration. Remove a couple of unnecessary variables in the process. Signed-off-by: Caleb Sander <[email protected]>
Most of the 1 KB Discovery Log Page header is reserved. Only the first 18 bytes are currently defined. So avoid transfering more data from the controller than necessary. Signed-off-by: Caleb Sander <[email protected]>
There is a lot of duplicated code between nvmf_get_discovery_log() and nvmf_get_discovery_wargs(). Since nvmf_get_discovery_wargs() is more general, use it to implement nvmf_get_discovery_log(). Signed-off-by: Caleb Sander <[email protected]>
nvme_discovery_log() takes nvme_get_log_args as input and then overwrites many of its fields. This leads to wasted code in the caller setting up unused fields. Instead pass nvme_get_discovery_args, which more accurately expresses the possible discovery log arguments. As an added benefit, it also subsumes the other arguments. And nvmf_get_discovery_wargs() can pass its arguments straight through. Signed-off-by: Caleb Sander <[email protected]>
d32a5c0
to
5835e18
Compare
I agree with @calebsander reasoning. The kernel is not relying on the RAE for the Asynchronous Event Type Notice. That means, userspace is full control of the RAE for the Notice type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong. 'genctr' might have been changed after the first call, so we need to repeatedly retrieve the header until we get a stable 'genctr'.
I've discussed this offline with Hannes. So after fixing the initial get header retry, all should be fine. |
Sorry, I'm not following either of your comments. The only change to when the Discovery Log Page header is fetched is that we don't fetch it twice in a row in case of a GENCTR mismatch, as described in the commit description of "fabrics: avoid redundant Get Log Page on retry". The purpose of GENCTR is to ensure the log page is fetched atomically even though multiple Get Log Page commands may be issued. By verifying that GENCTR matches before and after fetching the log page entries (and retrying if it doesn't), we ensure the fetched log page isn't torn between Get Log Page commands. There is no requirement to check for GENCTR to be "stable", and indeed there's no way to ensure GENCTR isn't going to change in the future. Here are some example scenarios to compare the old and new behavior (these are also illustrated by the unit tests):
Both approaches achieve the atomicity requirement. In the unlikely event that there's another GENCTR change between fetching the header after fetching the log page and fetching the header again in the old code and no more GENCTR changes before the full log page is fetched, then the new behavior could require another retry (at least 2 Get Log Page commands). In the more likely case that GENCTR doesn't change again, or changes while fetching the log page entries, then the new behavior avoids fetching the header twice in a row. If you would prefer to keep the current behavior of fetching the header twice, I can drop that commit. Please let me know if I am misunderstanding you. |
My bad. You can ignore the point from Hannes about GENCTR. Your change is not breaking anything. My comment is about the initial header get code, it misses the retry logic: nvme_msg(r, LOG_DEBUG, "%s: get header (try %d/%d)\n",
name, retries, args->max_retries);
log_args.log = log;
log_args.len = DISCOVERY_HEADER_LEN;
if (nvme_get_log_page(fd, NVME_LOG_PAGE_PDU_SIZE, &log_args)) {
nvme_msg(r, LOG_INFO,
"%s: discover try %d/%d failed, error %d\n",
name, retries, args->max_retries, errno);
goto out_free_log;
} This should be something like do {
nvme_msg(r, LOG_DEBUG, "%s: get header (try %d/%d)\n",
name, retries, args->max_retries);
log_args.log = log;
log_args.len = DISCOVERY_HEADER_LEN;
if (nvme_get_log_page(fd, NVME_LOG_PAGE_PDU_SIZE, &log_args)) {
nvme_msg(r, LOG_INFO,
"%s: discover try %d/%d failed, error %d\n",
name, retries, args->max_retries, errno);
goto out_free_log;
}
} while (++retries < args->max_retries);
[...] |
Ah, thanks for explaining. But isn't the existing code already not retrying in case the header fetch fails? ret = nvme_get_log_page(fd, NVME_LOG_PAGE_PDU_SIZE, args);
if (ret) {
nvme_msg(r, LOG_INFO,
"%s: discover try %d/%d failed, error %d\n",
name, retries, max_retries, errno);
goto out_free_log;
}
// ...
out_free_log:
free(log);
return NULL; My understanding was that |
Yeah, makes sense. Maybe we should update the error log message because it confused me :) |
Are there other things you want to discuss about these changes, or do you think they are ready to merge in? |
Sorry for the delay. Busy at SUSE's Hackweek :) |
Thanks! |
Several improvements to fetching the discovery log page:
nvmf_get_discovery_log()
in terms ofnvmf_get_discovery_wargs()
to avoid code duplicationstruct nvme_get_discovery_args
instead ofstruct nvme_get_log_args
tonvme_discovery_log()
. This avoids confusion about which function is responsible for filling in which arguments, and avoids redundant arguments.Tested with the mock ioctl test infrastructure in #688. The test cases were updated to match the new discovery behavior.
It also works when run against a real NVMe/TCP controller: