nexus zone failed to come up due to uninitialized underlay that appears to have been initialized? #5314
Comments
Re-launching the environment from scratch did not reproduce this issue, so it appears to be non-deterministic.
My guess here is that the
Perhaps |
Quick test to confirm that the module can be unloaded after the underlay is initialised:
I think if we do so, we would also need to add an ioctl to clear the underlay, if only to allow for easier local testing. AFAIK we remove the underlay by unloading XDE today.
Would we expect this behavior on non-DEBUG kernels? I don't believe I'm running debug bits and this just torpedoed another test run.
I think it's possible modules can be unloaded for reasons other than the DEBUG bits' enthusiasm for doing it regularly. If memory is tight, for example, I'm pretty sure one of the things we might do to free some up is unload modules that are willing to unload. Conceivably it could also be some of our own software doing it; e.g., omicron/tools/virtual_hardware.sh, line 92 in 8fae16a.
I think this is probably critical to do; unloading can happen at any time. If your module has discretionary state set up that you don't want to lose (such as the operator having configured the underlay), you need to prevent yourself from detaching (if you're something that has instances) or unloading (if you are not). In the case that you don't have a driver instance, I believe you can take a hold on yourself to essentially make yourself busy -- and then release that hold after someone unconfigures the underlay etc.
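The "hold" idea in the comment above can be sketched as a small state model. This is only an illustration, not the OPTE or illumos implementation: `Module`, `Underlay`, `UnloadError`, and every method name here are hypothetical, and the real mechanism lives in the kernel module's unload path rather than in a plain struct.

```rust
// Hypothetical model: none of these names come from OPTE or illumos.

/// Why an unload request was refused.
#[derive(Debug, PartialEq)]
enum UnloadError {
    /// The module holds discretionary state (here, a configured underlay).
    Busy,
}

/// Stand-in for underlay configuration; the real state is richer.
#[derive(Debug, Clone, PartialEq)]
struct Underlay {
    ports: (String, String),
}

#[derive(Default)]
struct Module {
    underlay: Option<Underlay>,
}

impl Module {
    /// Configuring the underlay implicitly takes the "hold".
    fn set_underlay(&mut self, u: Underlay) {
        self.underlay = Some(u);
    }

    /// An explicit clear releases the hold and returns whatever was set.
    fn clear_underlay(&mut self) -> Option<Underlay> {
        self.underlay.take()
    }

    /// What the module would answer when the system asks to unload it.
    fn try_unload(&self) -> Result<(), UnloadError> {
        if self.underlay.is_some() {
            Err(UnloadError::Busy)
        } else {
            Ok(())
        }
    }
}
```

With this shape, an opportunistic unload while the underlay is configured is refused, and only an explicit clear makes the module unloadable again.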
Do you know if there are any counters for this (or some other easy way to tell if it's happened)?
You can see how many unloads there have been in general with
but not anything that would tell you it was |
The OPTE-side work is laid out in oxidecomputer/opte#485, which is not at all complex. I believe we can integrate it here once #5423 and its followups land.
OPTE now prevents itself from being unloaded if its underlay state is set. Currently, underlay setup is performed only once, and it seems to be the case that XDE can be unloaded in some scenarios (e.g., `a4x2` setup). However, a consequence is that removing the driver requires an extra operation to explicitly clear the underlay state. This PR adds this operation to the `cargo xtask virtual-hardware destroy` command. This is currently blocked on opte#485 being approved/merged. Closes #5314.
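The teardown ordering implied by the PR description (clear the underlay first, then remove the driver) might look roughly like the following. This is a sketch under assumptions: the `Driver` trait, `destroy`, and `FakeDriver` are hypothetical names for illustration and are not the actual `cargo xtask virtual-hardware destroy` code or the OPTE interface. Treating the clear step as idempotent is also an assumption, made here so that repeated teardowns do not fail.

```rust
// Hypothetical sketch: the trait and method names are illustrative only and
// are not the real xtask or OPTE interfaces.

#[derive(Debug, PartialEq)]
enum DriverError {
    /// Unload refused because underlay state is still set.
    Busy,
}

/// The two operations the teardown needs from the driver.
trait Driver {
    /// Clears underlay state; returns false if none was set.
    fn clear_underlay(&mut self) -> bool;
    /// Asks the system to remove the driver.
    fn unload(&mut self) -> Result<(), DriverError>;
}

/// Teardown in the order the PR describes: clear first, then remove.
fn destroy<D: Driver>(driver: &mut D) -> Result<(), DriverError> {
    // Assumption: clearing an already-clear underlay is not an error.
    let _was_set = driver.clear_underlay();
    driver.unload()
}

/// In-memory fake used to exercise the ordering.
#[derive(Default)]
struct FakeDriver {
    underlay_set: bool,
    loaded: bool,
}

impl Driver for FakeDriver {
    fn clear_underlay(&mut self) -> bool {
        // Report whether there was anything to clear, then clear it.
        std::mem::replace(&mut self.underlay_set, false)
    }

    fn unload(&mut self) -> Result<(), DriverError> {
        if self.underlay_set {
            return Err(DriverError::Busy);
        }
        self.loaded = false;
        Ok(())
    }
}
```

Calling `unload` directly on a fake with the underlay still set returns `Busy`, while `destroy` succeeds because it clears the state first.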
In an a4x2 run, I had a cluster come all the way up, except for a single nexus zone. Attached is the log from the sled agent responsible for launching the missing nexus zone.
The reason for the zone not coming up is sled-agent observing an uninitialized xde underlay state.
However, earlier in the log file, sled agent appears to initialize the underlay.
sled-agent-no-nexus.log