-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filters: handle ENOENT when nvme_core isn't loaded #795
Conversation
Give each sysfs test case its own name instead of using "sysfs" for all. This will make it easier to distinguish each test case in the test results if more cases are added in the future. Signed-off-by: Caleb Sander Mateos <[email protected]>
If the nvme_core kernel module isn't loaded, /sys/class/nvme and /sys/class/nvme-subsystem won't exist. This leads to ENOENT errors in nvme_scan_ctrls() and nvme_scan_subsystems(). Suppress these errors so an empty topology will be created. Signed-off-by: Caleb Sander Mateos <[email protected]>
Create a sysfs test where /sys/class/nvme and /sys/class/nvme-subsystem don't exist, simulating the nvme_core module being unloaded. Ensure nvme_scan_topology() is successful and creates an empty tree. Signed-off-by: Caleb Sander Mateos <[email protected]>
88681d9
to
e87a456
Compare
/* If nvme_core isn't loaded, /sys/class/nvme-subsystem won't exist. | ||
* Exit successfully with no NVMe subsystems. | ||
*/ | ||
if (ret < 0 && errno == ENOENT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I was pondering on this for a while and I think we should not change this in the library, because any users of this call can't know if there was no data or if the modules were not loaded. This error needs to be handled on the caller side IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From man scandir
:
ERRORS
ENOENT The path in dirp does not exist.
It seems pretty clear that ENOENT
means /sys/class/nvme
(or /sys/class/nvme-subsystem
) doesn't exist. What cause could there be other than that the nvme_core
module isn't loaded? I'm not sure what you mean by "there was no data".
Mostly, it seems weird to me that the caller would have to special-case whether or not nvme_core
is loaded (or handle an error). I can't imagine a caller that would care whether nvme_core
is unloaded vs. nvme_core
is loaded but there are no NVMe devices connected.
As far as "caller", do you mean the check should move to nvme_scan_topology()
? Or to every caller in nvme-cli
? Seems like a lot of duplicated work to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filtering errors (or in this case state) in the middle of a stack out is generally a bad idea. ENOENT
might be the wrong error code for module not loaded. But the caller needs to know if 'no results' is because there is nothing or if the operation didn't work. The caller has to handles this anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean by "the operation didn't work" or by "The caller has to handles this anyway". What failure mode do you have in mind? Do you mean a general caller of these functions or the nvme-cli
/libnvme
callers in particular? And why would a caller need to handle ENOENT
; regardless of whether you think it's the right approach, it is certainly possible to handle ENOENT
in these functions themselves?
Let me try to explain our use case better. We would like to be able to issue nvme disconnect-all
to disconnect all NVMe devices (to clean up the system state). We want it to succeed; a failed command is something we have to look into. And we would prefer not to have another check that nvme_core
is loaded before making the call.
So it seems like we are talking about 3 situations:
nvme_core
is loaded, but there are no NVMe devices:/sys/class/nvme{,-subsystem}
are emptynvme_core
is not loaded (so there are no NVMe devices):/sys/class/nvme{,-subsystem}
don't exist- Some sort of error occurs while scanning
/sys/class/nvme{,-subsystem}
that is not due tonvme_core
being unloaded (do you have an example?)
We would like nvme-cli
/libnvme
not to distinguish between the first 2 cases because they are identical as far as the topology is concerned. It sounds like you would like to distinguish between the last 2.
To me, ENOENT
here seems pretty unambiguous that we are in the second case. Since scanndir()
is documented to only set errno = ENOENT
if the directory doesn't exist, the only possibilities are that /sys
, /sys/class
, or /sys/class/nvme{,-subsystem}
are missing. And /sys
or /sys/class
being absent seems highly unlikely (at least on a Linux system). If you would prefer, I can add an explicit check in the error case that /sys/class
exists, or that the nvme_core
module is unloaded.
As far as where the check belongs, I think checking as close as possible to the scandir()
operation that may fail is a good idea. Otherwise we have to duplicate it in every function that calls this one. And if we put the check higher up in the call stack, I think it's harder to know what call ENOENT
may be coming from. Also keep in mind that nvme_scan_subsystems()
and nvme_scan_ctrls()
are exported from libnvme
, so even if they are "in the middle of a stack" for nvme-cli
's use-case, they could be getting called directly by other users of libnvme
.
I would appreciate a concrete suggestion of where you would like to handle the errors that result from nvme_core
being unloaded. Or let me know if you have decided nvme-cli
just won't support this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are talking past each other. I mean we just have to fix the nvme-cli code base to handle the error correctly. I don't expect that the caller of nvme-cli has to handle it.
I just don't like to address this here in the library as there more than just nvme-cli which use this interface (e.g. libudisk, nvme-stas, etc.) and these tools might also want to distinguish between no results because no modules loaded or nothing there.
ENOENT could be mapped to ENODEV to make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't imagine a caller that would care whether
nvme_core
is unloaded vs.nvme_core
is loaded but there are no NVMe devices connected.
Not exactly nvme_core
but I often find nvme_fabrics
not loaded which leads to confusion on a caller side. Similar situation, the caller would then just typically modprobe whatever is needed and retry.
Please also clarify return values in the docs:
Lines 53 to 59 in 691f809
/** | |
* nvme_scan_subsystems() - Scan for subsystems | |
* @subsys: Pointer to array of dirents | |
* | |
* Return: number of entries in @subsys | |
*/ | |
int nvme_scan_subsystems(struct dirent ***subsys); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about something like this:
modified src/nvme/tree.c
@@ -148,6 +148,8 @@ int nvme_scan_topology(struct nvme_root *r, nvme_scan_filter_t f, void *f_args)
if (ctrls.num < 0) {
nvme_msg(r, LOG_DEBUG, "failed to scan ctrls: %s\n",
strerror(errno));
+ if (errno == ENOENT)
+ errno = ENODEV;
return ctrls.num;
}
@@ -169,6 +171,8 @@ int nvme_scan_topology(struct nvme_root *r, nvme_scan_filter_t f, void *f_args)
if (subsys.num < 0) {
nvme_msg(r, LOG_DEBUG, "failed to scan subsystems: %s\n",
strerror(errno));
+ if (errno == ENOENT)
+ errno = ENODEV;
return subsys.num;
}
modified src/nvme/tree.h
@@ -1259,7 +1259,9 @@ const char *nvme_subsystem_get_iopolicy(nvme_subsystem_t s);
* Scans the NVMe topology and filters out the resulting elements
* by applying @f.
*
- * Return: Number of elements scanned
+ * Return: Number of elements scanned. If the operation fails, the function
+ * return -1 and sets updates errno. ENODEV means the nvme modules are not
+ * loaded.
*/
And the also update nvme-cli accordingly. We could issue 'modules not loaded' message too. I would consider it an error but it seems there workflows which want a success status in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern with pushing the error farther up the stack is that the farther removed the error handling code is from the scandir()
operations, the more possible causes of ENOENT
(or ENODEV
) there are. So it becomes harder to say that ENODEV
is due to nvme_core
not being loaded, rather than some other failure. There would also be a lot of code duplication to handle this error in every caller of nvme_scan_topology()
in nvme-cli
.
I take it you are opposed to handling the error (returning 0 with an empty topology) in nvme_scan_topology()
as well?
How would you feel about creating a separate function nvme_scan_topology_safe()
(name up for bikeshedding) that can handle nvme_core
not being loaded? That way we wouldn't change the semantics of existing libnvme
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, EBUSY.
Let's try to split the discussion into two separate threads:
- which error should be returned
nvme_scan_topology
which returns an empty topology also when the modules are missing
Re 1.: we could just introduce another libnvme specific error code when the module is missing (
Line 45 in 86e6d82
enum nvme_connect_err { |
Re 2.: sure if this makes things simpler we can add an inline function along the lines of:
static inline int nvme_scan_topology_safe(struct nvme_root *r, nvme_scan_filter_t f, void *f_args)
{
int err;
err = nvme_scan_topology(r, f, f_args);
if (err < 0 && errno == ENVME_SCAN_NVME_NOT_AVAILABLE) {
errno = 0;
return 0;
}
return err;
}
After thinking about this some more, I think we have different opinions about what the default behavior should be when |
If the
nvme_core
kernel module isn't loaded,/sys/class/nvme
and/sys/class/nvme-subsystem
won't exist.This leads to
ENOENT
errorsin
nvme_scan_ctrls()
andnvme_scan_subsystems()
.Suppress these errors so an empty topology will be created.
Also add a sysfs unit test for this case.
Fixes linux-nvme/nvme-cli#2241 (for real)