don't reconfigure networkd on "stop" #107

Merged
merged 4 commits into from
Mar 8, 2024

Contributor

@nmeyerhans nmeyerhans commented Mar 5, 2024

Issue #, if available: n/a

Description of changes:

This makes some changes to the behavior of ec2-net-utils when the policy-routes service is stopped. The major change is that stop no longer removes the generated config. This reduces the amount of work done and eliminates reloading of systemd-networkd when doing so provides no meaningful benefit. Any routes and policy rules associated with an instance are deleted when an interface is removed, so the config removal is not meaningful.

This fixes an issue observed when stopping policy-routes@.service that caused forwarded connections (e.g. from a local Docker bridge network) to be flushed from the conntrack tables, leading to dropped packets.

There are other smaller changes to the systemd unit files:

  • Set KillMode to only signal the top-level process, rather than the default behavior of signalling all processes in the cgroup
  • Add Wants= and Also= relationships between [email protected] and [email protected] units to clarify the relationship.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Noah Meyerhans added 2 commits March 6, 2024 16:03
Previously, stopping policy-routes@foo.service would delete the
installed configuration for the foo interface and trigger a networkd
configuration reload.  Doing so would revert the interface's
configuration back to the default, and the subsequent networkd reload
would reset any conntrack state for connections associated with that
interface.  That reset would cause traffic for any connections that
relied on the RELATED or ESTABLISHED conntrack properties to be
dropped, when the expectation is that it would continue to be passed.

Impact from this issue was particularly visible on systems running
Docker in bridged networking mode, where the containers rely on the
Docker-installed iptables rules for connectivity, including an ACCEPT
rule based on established connections, by default.  In this case, any
connections open from local containers to a remote service would see
100% packet loss after stopping policy-routes@foo.service (where foo
is the interface through which container-generated traffic would
egress).
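
To make the failure mode above concrete, here is a deliberately simplified toy model of a stateful forwarding rule. It is only a sketch: `ToyFirewall` and its methods are hypothetical names, the addresses are documentation examples, and real netfilter behaviour (NAT, TCP pickup of mid-stream flows, timeouts) is not modelled. It only shows that reply traffic accepted on the basis of tracked state is dropped once that state is reset.

```python
# Toy model only: not real netfilter behaviour. All names are hypothetical.

class ToyFirewall:
    def __init__(self):
        # Tracked flows, keyed by (container_addr, remote_addr).
        self.conntrack = set()

    def forward_outbound(self, container, remote):
        """Container -> remote: always accepted, and the flow becomes tracked."""
        self.conntrack.add((container, remote))
        return "ACCEPT"

    def forward_inbound(self, remote, container):
        """Remote -> container: accepted only for an already-tracked flow,
        mimicking an ACCEPT rule that matches RELATED,ESTABLISHED state."""
        if (container, remote) in self.conntrack:
            return "ACCEPT"
        return "DROP"

    def flush(self):
        """Stand-in for the conntrack reset that followed the networkd reload."""
        self.conntrack.clear()


fw = ToyFirewall()
fw.forward_outbound("172.17.0.2:40000", "203.0.113.10:443")

# Reply on the established flow is accepted while state exists.
before = fw.forward_inbound("203.0.113.10:443", "172.17.0.2:40000")

# After the state is reset, the same reply no longer matches anything.
fw.flush()
after = fw.forward_inbound("203.0.113.10:443", "172.17.0.2:40000")

print(before, after)
```

Avoiding the reload on stop sidesteps the `flush()` step entirely, which is the approach this change takes.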

With this change, the generated config is left behind after stopping
policy-routes@foo.service, even after an ENI is removed.  In
practice, this is not a problem because:

1. Re-attaching the same ENI will use the old configuration, with any
configuration changes picked up by the policy-routes service.
2. Connecting a different ENI in the same slot (thus with the same
name) will not match the MAC Address value, and will use the default
configuration.  The policy-routes service will then generate the
correct ENI-specific configuration, overwriting any existing
configuration left behind by the previously attached ENI.
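
The two cases in the list above can be sketched as a small matching function. This is an illustration under stated assumptions, not the actual ec2-net-utils shell code or networkd's matching logic: `StoredConfig`, `effective_settings`, `regenerate`, and the MAC addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT = "default"


@dataclass
class StoredConfig:
    """Hypothetical stand-in for a generated per-interface config file."""
    ifname: str
    mac: str
    settings: str


def effective_settings(stored: Optional[StoredConfig], ifname: str, mac: str) -> str:
    """Leftover settings apply only if both interface name and MAC match;
    otherwise the interface falls back to the default configuration."""
    if stored and stored.ifname == ifname and stored.mac.lower() == mac.lower():
        return stored.settings
    return DEFAULT


def regenerate(ifname: str, mac: str) -> StoredConfig:
    """Stand-in for the policy-routes service writing fresh ENI-specific
    config, overwriting whatever the previous ENI left behind."""
    return StoredConfig(ifname, mac, f"eni-specific for {mac}")


# Config left behind by a previously attached ENI in slot "foo".
leftover = regenerate("foo", "02:00:00:aa:bb:01")

# Case 1: the same ENI is re-attached, so the old config is used.
same = effective_settings(leftover, "foo", "02:00:00:aa:bb:01")

# Case 2: a different ENI in the same slot does not match the stored MAC.
other = effective_settings(leftover, "foo", "02:00:00:aa:bb:02")

# The service then regenerates config for the new ENI.
leftover = regenerate("foo", "02:00:00:aa:bb:02")
after_regen = effective_settings(leftover, "foo", "02:00:00:aa:bb:02")

print(same, "|", other, "|", after_regen)
```

In neither case does the stale file end up applied to the wrong ENI, which is why leaving it in place on stop is harmless.
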
The systemd default of `control-group` for KillMode is more
aggressive than we want.
@nmeyerhans nmeyerhans marked this pull request as ready for review March 7, 2024 00:29
@nmeyerhans nmeyerhans changed the title WIP: don't reconfigure networkd on "stop" don't reconfigure networkd on "stop" Mar 7, 2024
...rather than explicitly in the udev rules.
Collaborator

@vigh-m vigh-m left a comment

LGTM!

@vigh-m vigh-m merged commit d34a1d4 into amazonlinux:main Mar 8, 2024
4 checks passed
@rickwargo

@nmeyerhans When will this become available? I lost network connectivity last night and saw this service ultimately time out. I also received the Systems Manager role issue (EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials - resulting in 404 from get http://169.254.169.254/latest (I think)). I have a newly installed image (Apr 3) and it is fairly vanilla. I am running gunicorn/uvicorn (I have seen that in another post with the same errors). It's odd, as I only have one network interface (enX0). I'd like to try this to see if my instance stays stable.

@nmeyerhans
Contributor Author

@rickwargo I'm no longer involved in Amazon Linux development and thus cannot answer your question. Maybe @vigh-m can help. I suspect this is blocked on #108

@nmeyerhans nmeyerhans deleted the no-stop branch April 10, 2024 16:34
3 participants