-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNI: use tmpfs location for ipam plugin #24650
Conversation
When a Nomad host reboots, the network namespace files in the tmpfs in `/var/run` are wiped out. So when we restore allocations after a host reboot, we need to be able to restore both the network namespace and the network configuration. But because the netns is newly created and we need to run the CNI plugins again, this create potential conflicts with the IPAM plugin which has written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones advertised to Consul, so there's no particular reason to keep them around after a host reboot because all virtual interfaces need to be recreated too. Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state directory. We already expect this location to be created by CNI because the netns files are hard-coded to be created there too in `libcni`. Note this does not fix the problem described for Docker in #24292 because that appears to be related to the netns itself being restored unexpectedly from Docker's state. Ref: #24292 (comment) Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
I'm going to move this back into draft until we've had a bit more time to look into the CNI check command workflow being discussed in #24292 (comment) |
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: #24292 Ref: #24650
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: #24292 Ref: #24650
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: #24292 Ref: #24650
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I tested this using an AWS environment with 1 server and 1 client instance. The client instance has CNI v1.3.0 plugins installed.
JobSpec:
job "example" {
group "webserver" {
network {
mode = "bridge"
port "http" {
to = 80
}
}
task "python3" {
driver = "exec"
config {
command = "python3"
args = ["-m", "http.server", "80"]
}
resources {
cpu = 500
memory = 256
}
}
}
}
Running through the steps using a build from main f4529485563924462dbdccdd1b4cacbd9d68616e
when I rebooted the instance the allocation failed with the error detailed below:
2024-12-13T10:57:44Z Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.12 has been allocated to 0dc525f4-c491-7e25-a67d-0121069ad55e, duplicate allocation is not allowed
2024-12-13T10:57:41Z Failed Restoring Task failed to restore task; will not run until server is contacted
I then tested this patch 9e1a365ae3d22f04c6bb8aa0ce0fee6d1f83ae6
f I performed the same steps as detailed in the PR and performed in the previous test. When the instance was rebooted the following task events are recorded:
2024-12-13T11:08:42Z Started Task started by client
2024-12-13T11:08:41Z Failed Restoring Task failed to restore task; will not run until server is contacted
The allocation HTTP server is accessible and responds as it did before the reboot.
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: #24292 Ref: #24650 retry and tests [squashme]
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: #24292 Ref: #24650
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon. When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0 to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail). This should fix the gap left #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability. Fixes: #24292 Ref: #24650
When a Nomad host reboots, the network namespace files in the tmpfs in
/var/run
are wiped out. So when we restore allocations after a host reboot, we need to be able to restore both the network namespace and the network configuration. But because the netns is newly created and we need to run the CNI plugins again, this create potential conflicts with the IPAM plugin which has written state to persistent disk at/var/lib/cni
. These IPs aren't the ones advertised to Consul, so there's no particular reason to keep them around after a host reboot because all virtual interfaces need to be recreated too.Reconfigure the CNI bridge configuration to use
/var/run/cni
as its state directory. We already expect this location to be created by CNI because the netns files are hard-coded to be created there too inlibcni
.Note this does not fix the problem described for Docker in #24292 because that appears to be related to the netns itself being restored unexpectedly from Docker's state.
Ref: #24292 (comment)
Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
Testing & Reproduction steps
Run a cluster on a set of VMs, with at least one client. This can't be a server+client because we need to reboot the hosts. You should probably set the
server.heartbeat_grace = "5m"
to give yourself time to work.network.mode = "bridge"
. Wait for it to be healthy.Contributor Checklist
changelog entry using the
make cl
command.ensure regressions will be caught.
and job configuration, please update the Nomad website documentation to reflect this. Refer to
the website README for docs guidelines. Please also consider whether the
change requires notes within the upgrade guide.
Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.