Support attaching multiple network interfaces to the same network card #2859

Open
wants to merge 3 commits into
base: release-3.12
@hanwen-cluster hanwen-cluster commented Jan 7, 2025

Description of changes

  • Cherry-picked from develop branch

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hanwen-cluster hanwen-cluster requested review from a team as code owners January 7, 2025 18:30
@hanwen-cluster hanwen-cluster changed the title Release 3.12 Support attaching multiple network interfaces to the same network card Jan 7, 2025
Prior to this commit, the code made two assumptions:
1. An instance with a single network card can have only one network interface.
 a. Therefore, when there was more than one network interface, the code made an IMDS call to retrieve the device index (network interface index), which is only available on instances with multiple network cards. A secondary network interface on a single-network-card instance made that call fail and caused instance launch failures.
2. Each network card can have only one network interface.
 a. Therefore, the route table was unique to each network card. Multiple network interfaces on one network card confused the code and produced wrong route tables.

To fix (1), this commit falls back to the value 0 when retrieval of the device index fails.
To fix (2), this commit names the route tables so that each name is unique to the combination of network interface and network card.

FYI:
1. A network card is the physical card. A network interface is a virtual concept. Each network card can have multiple network interfaces (which in AWS is 1 or 2).
2. "Network interface" is a synonym for "device".

Signed-off-by: Hanwen <[email protected]>
This commit fixes a bug from aws#2855, which made the route table/metric number larger (meaning lower priority). As a result, some unwanted default rules on AL2023 took priority and caused the test_multiple_nics integration test to fail on AL2023.

This commit makes the number smaller (meaning higher priority) to fix the issue.
For example, prior to this commit, the number for the table for 1,1 was 1001001. After this commit, the number is 101+10=111. The "+10" is there to properly handle the table for 0,0, which now has number 10. Without the "+10", that table would conflict with table 0 from the OS.

FYI: the numbers of the unwanted default AL2023 rules start at 10101.
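
The arithmetic above can be checked with a short sketch. The function names are hypothetical and the formulas are assumptions reconstructed from the example numbers in this message (1001001 and 111 for 1,1, and 10 for 0,0), not copied from the PR's code.

```python
# Threshold taken from the commit message: the unwanted AL2023 default rules.
AL2023_DEFAULT_RULE_START = 10101


def table_number_after_2855(card: int, device: int) -> int:
    """Assumed scheme from aws#2855: large numbers, hence low priority."""
    return 1000000 + card * 1000 + device


def table_number_after_this_commit(card: int, device: int) -> int:
    """Assumed scheme from this commit: digits '<card>0<device>' plus 10."""
    return int(f"{card}0{device}") + 10


if __name__ == "__main__":
    old = table_number_after_2855(1, 1)
    new = table_number_after_this_commit(1, 1)
    # A smaller number means a higher priority than the AL2023 default rules.
    print(old, old < AL2023_DEFAULT_RULE_START)
    print(new, new < AL2023_DEFAULT_RULE_START)
```

Under these assumptions the old number for 1,1 sorts after the AL2023 defaults while the new one sorts before them, and the "+10" offset keeps the 0,0 table away from number 0.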

Signed-off-by: Hanwen <[email protected]>
aws#2855 made the pcluster route table/metric number larger (meaning lower priority). As a result, some unwanted default rules from ec2-net-utils on AL2023 took priority and caused the test_multiple_nics integration test to fail on AL2023.
Then, aws#2857 made the number too small, which interfered with the route table configurations from IMDS on AL2.

Therefore, this commit tries to imitate the priority used prior to these two PRs. This is not the cleanest fix, because it stays in the lucky priority range instead of fully resolving the issue (i.e. preventing IMDS and ec2-net-utils from configuring the route tables). However, this commit is the least breaking change, so I propose to go with it.

Metric number range before the two PRs:
Network card (0,0): 1000
Network card (0,1): 1000 (which was causing conflicts and is the reason for all these PRs)
Network card (n,1): 100n (for p5, which has 32 network cards, it will be 1000-10031)

Metric number range after the first PR:
Network card (0,0): 1000000
Network card (0,1): 1000001 (conflict fixed :) )
Network card (n,1): 1000001+n*1000 (for p5, it will be 1000000-1031001)

Metric number range after the second PR:
Network card (0,0): 10
Network card (0,1): 75
Network card (n,1): 0x(hexadecimal number)n01+10 (for p5, it will be 10-12555. The hexadecimal interpretation was introduced accidentally, because bash automatically interprets numbers starting with "00" as hexadecimal numbers)

Metric number range after this commit:
Network card (0,0): 1000
Network card (0,1): 1001
Network card (n,1): n01+1000 (for p5, it will be 1000-4101)
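
The ranges listed above can be reproduced with a small sketch. The three functions are assumptions reconstructed from the listed numbers (the names are hypothetical and this is not the PR's code); the scheme from the second PR is left out because its values depend on how the shell parsed the digits.

```python
from itertools import product

P5_NETWORK_CARDS = 32      # from the message: p5 has 32 network cards
DEVICES_PER_CARD = 2       # from the first commit's FYI: 1 or 2 interfaces per card


def metric_before(card: int, device: int) -> int:
    """'100n': digits 100 followed by the card index; the device is ignored."""
    return int(f"100{card}")


def metric_after_first_pr(card: int, device: int) -> int:
    """1000000 + n*1000 + device, matching 1000001 + n*1000 for device 1."""
    return 1000000 + card * 1000 + device


def metric_after_this_commit(card: int, device: int) -> int:
    """Digits '<card>0<device>' read as a decimal number, plus 1000."""
    return int(f"{card}0{device}") + 1000


def has_collision(scheme) -> bool:
    """True if two (card, device) pairs on a p5-sized instance share a metric."""
    pairs = list(product(range(P5_NETWORK_CARDS), range(DEVICES_PER_CARD)))
    return len({scheme(c, d) for c, d in pairs}) < len(pairs)


if __name__ == "__main__":
    for scheme in (metric_before, metric_after_first_pr, metric_after_this_commit):
        print(scheme.__name__, scheme(0, 0), scheme(31, 1), has_collision(scheme))
```

Under these assumptions only the original scheme collides (0,0 and 0,1 both map to 1000), while the scheme from this commit stays unique and remains in the same order of magnitude as the original numbers.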

Signed-off-by: Hanwen <[email protected]>