Add/nodeipamcontroller #679
Conversation
Welcome @DockToFuture!
Hi @DockToFuture. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @cartermckinnon I'll give this a look 👍
/assign @M00nF1sh
/ok-to-test
(force-pushed from 5c1d5e5 to 93424bc)
/test all
/kind feature
@DockToFuture how did you test this? I can't spot any tests in this PR. Are parts of this controller borrowed from other places? Could we borrow the tests that exist there as well?
Hello @dims, I've deployed the controller into our IPv6 Kubernetes clusters and tested both modes (--dualstack=false and --dualstack=true).
@DockToFuture so what I was thinking was: if we were to enhance one of the existing harness(es) to enable this controller, for example,
if we take this harness (switch on IPv6!) and add a way to ensure that this new controller is activated/enabled, that would be the first step. Then we could add some basic Go-based tests in the test/e2e folder to ensure we don't break this controller going forward. You called out ... Does that make sense?
@DockToFuture also, there are some unit tests in that other repo you pointed to; some of those may be ported over here as well?
It should be possible to use this as a drop-in replacement for the IPAM controller we implemented in kOps. kOps would not support dualstack though, so it may be an idea to mark that option experimental unless there is another way of getting that e2e-tested.
(force-pushed from 93424bc to d198bf0)
(force-pushed from 0e5cf91 to ba8554d)
New changes are detected. LGTM label has been removed.
@dims I've rebased the PR. Can you take a look?
/retest
@DockToFuture kicked off the tests again hoping these are flakes; if not, then I think we need to add an option to enable the controller on demand and leave it disabled otherwise. (Then we can work on a follow-up PR to enable it by default.)
@DockToFuture: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@dims a valid DNS zone needs to be set in the environment variable "DNS_ZONE" in order to create the kops cluster successfully; see my earlier comment #679 (comment) for more information.
@M00nF1sh any suggestions on how to make this work?
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
// cidrSet are mapped to clusterCIDR by index
cidrSets := make([]*cidrset.CidrSet, len(allocatorParams.ClusterCIDRs))
for idx, cidr := range allocatorParams.ClusterCIDRs {
	cidrSet, err := cidrset.NewCIDRSet(cidr, allocatorParams.NodeCIDRMaskSizes[idx])
This can cause a crash if len(allocatorParams.NodeCIDRMaskSizes) < len(allocatorParams.ClusterCIDRs).
It's a nit comment given this will be supplied by cluster admins, so it's low risk, but it would be good to have validation with a clear error message when len(allocatorParams.NodeCIDRMaskSizes) != len(allocatorParams.ClusterCIDRs).
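A minimal sketch of the kind of up-front validation suggested above, assuming the allocatorParams field names shown in the diff; this is illustrative, not the PR's actual code.

```go
package allocator

import (
	"fmt"
	"net"
)

// AllocatorParams mirrors the fields referenced in the diff above
// (hypothetical struct layout, for illustration only).
type AllocatorParams struct {
	ClusterCIDRs      []*net.IPNet
	NodeCIDRMaskSizes []int
}

// validateAllocatorParams fails fast with a clear message instead of letting
// an index-out-of-range panic happen later inside the allocation loop.
func validateAllocatorParams(p AllocatorParams) error {
	if len(p.ClusterCIDRs) != len(p.NodeCIDRMaskSizes) {
		return fmt.Errorf("mismatched configuration: %d cluster CIDRs but %d node CIDR mask sizes",
			len(p.ClusterCIDRs), len(p.NodeCIDRMaskSizes))
	}
	return nil
}
```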
// If the node has a pre-allocated CIDR that does not exist in our CIDRs.
// This will happen if the cluster went from dualstack (multiple CIDRs) to non-dualstack;
// then we have no way of locking it.
if idx >= len(r.cidrSets) {
Nit: we can improve performance and readability by doing the check once before the loop, e.g.:
if len(node.Spec.PodCIDRs) < len(r.cidrSets) {
	return fmt.Errorf("some error")
}
for idx, cidr := range node.Spec.PodCIDRs {
	// ...original logic without the idx >= len(r.cidrSets) check
}
// If the node has a pre-allocated CIDR that does not exist in our CIDRs.
// This will happen if the cluster went from dualstack (multiple CIDRs) to non-dualstack;
// then we have no way of locking it.
if idx >= len(r.cidrSets) {
Same as the previous comment: I'd like to move this check before the loop for performance and readability.
When it's not 0, len(node.Spec.PodCIDRs) seems like it must always be >= len(r.cidrSets).
// function you have to make sure to update nodesInProcessing properly with the
// disposition of the node when the work is done.
func (r *rangeAllocator) AllocateOrOccupyCIDR(node *v1.Node) error {
	if node == nil {
Just curious: when would node be nil? Is this a sort of defensive programming (if so, let's add a comment like "this should never happen" for readability), or should we remove this check if we can ensure it's non-nil?
for idx := range r.cidrSets {
	podCIDR, err := r.cidrSets[idx].AllocateNext()
	if err != nil {
If this happens, it seems the allocations for the other indexes (< the current idx) will be leaked.
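A hedged sketch of how the earlier allocations could be rolled back when a later AllocateNext fails, assuming only the AllocateNext/Release methods visible in the diff; the interface and helper name are made up for this sketch, it is not the PR's code.

```go
package allocator

import (
	"fmt"
	"net"

	"k8s.io/klog/v2"
)

// cidrAllocator captures just the two cidrset methods used in the diff
// (hypothetical interface for this sketch).
type cidrAllocator interface {
	AllocateNext() (*net.IPNet, error)
	Release(*net.IPNet) error
}

// allocateAllOrRollback allocates one CIDR from every set; if any allocation
// fails, the CIDRs already taken are released so nothing is leaked.
func allocateAllOrRollback(cidrSets []cidrAllocator) ([]*net.IPNet, error) {
	allocated := make([]*net.IPNet, 0, len(cidrSets))
	for idx, set := range cidrSets {
		podCIDR, err := set.AllocateNext()
		if err != nil {
			// Best-effort rollback of everything allocated so far.
			for i, cidr := range allocated {
				if relErr := cidrSets[i].Release(cidr); relErr != nil {
					klog.Errorf("failed to release CIDR %v during rollback: %v", cidr, relErr)
				}
			}
			return nil, fmt.Errorf("failed to allocate CIDR from cidrSet %d: %v", idx, err)
		}
		allocated = append(allocated, podCIDR)
	}
	return allocated, nil
}
```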
}

klog.V(4).Infof("release CIDR %s for node:%v", cidr, node.Name)
if err = r.cidrSets[idx].Release(podCIDR); err != nil {
Shall we do a best effort by trying to release from each cidrSet and returning an aggregated error?
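A possible shape for that best-effort release, using the aggregate-error helper from apimachinery; the releaser interface and function name are invented for this sketch.

```go
package allocator

import (
	"net"

	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// releaser captures the Release method used in the diff
// (hypothetical interface for this sketch).
type releaser interface {
	Release(*net.IPNet) error
}

// releaseAll keeps releasing the remaining CIDRs even if an earlier release
// fails, and reports all failures as one aggregated error (nil if none).
func releaseAll(cidrSets []releaser, podCIDRs []*net.IPNet) error {
	var errs []error
	for idx, cidr := range podCIDRs {
		if idx >= len(cidrSets) {
			continue // no matching cidrSet, nothing to release
		}
		if err := cidrSets[idx].Release(cidr); err != nil {
			errs = append(errs, err)
		}
	}
	return utilerrors.NewAggregate(errs)
}
```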
}

// aws:///eu-central-1a/i-07577a7bcf3e576f2
instanceID, _ := awsv1.KubernetesInstanceID(node.Spec.ProviderID).MapToAWSInstanceID()
The error is ignored. What if the instance ID is not a valid instance ID, like a Fargate task?
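A sketch (not the PR's code) of propagating that error instead of discarding it, using the same awsv1 helper shown in the diff; the import path and the handleProviderID function name are assumptions.

```go
package allocator

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	awsv1 "k8s.io/cloud-provider-aws/pkg/providers/v1" // assumed path for the awsv1 alias in the diff
)

// handleProviderID surfaces provider IDs that do not map to an EC2 instance
// (for example a Fargate task) instead of silently continuing with an empty ID.
func handleProviderID(node *v1.Node) (awsv1.InstanceID, error) {
	instanceID, err := awsv1.KubernetesInstanceID(node.Spec.ProviderID).MapToAWSInstanceID()
	if err != nil {
		return "", fmt.Errorf("node %s has provider ID %q that does not map to an EC2 instance: %v",
			node.Name, node.Spec.ProviderID, err)
	}
	return instanceID, nil
}
```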
}

// aws:///eu-central-1a/i-07577a7bcf3e576f2
instanceID, _ := awsv1.KubernetesInstanceID(node.Spec.ProviderID).MapToAWSInstanceID()
The error is ignored here as well.
recorder: recorder,
nodesInProcessing: sets.NewString(),
}
nodeList, err := listNodes(kubeClient)
Why do we need to use listNodes? Shouldn't the informer handle all existing nodes?
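A small illustrative sketch of the point above (hypothetical wiring, not the PR's code): once a shared informer is started and synced, pre-existing nodes are delivered through the Add handler, so a separate listNodes call may be unnecessary.

```go
package allocator

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// startNodeInformer shows that existing nodes arrive as Add events after the
// initial cache sync, without an explicit List call.
func startNodeInformer(ctx context.Context, client kubernetes.Interface, handle func(*v1.Node)) {
	factory := informers.NewSharedInformerFactory(client, 0)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if node, ok := obj.(*v1.Node); ok {
				handle(node) // called for pre-existing nodes on startup as well
			}
		},
	})
	factory.Start(ctx.Done())
	if !cache.WaitForCacheSync(ctx.Done(), nodeInformer.HasSynced) {
		klog.Error("timed out waiting for node informer cache to sync")
	}
}
```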
Took a more in-depth review of the code.

Background

- The Kubernetes KCM has a built-in node IPAM controller that is capable of either:
  - allocating CIDRs to nodes from preconfigured CIDRs, or
  - delegating to GCE's specific implementation (deprecated and not built under the 'providerless' build tag).
- The Kubernetes CloudProvider interface doesn't define an interface for "NodeIPAM".
  - With a strict definition of "CloudProvider" as "everything in the CloudProvider interface", this shouldn't be part of "AWSCloudProvider".
  - With a general definition of "CloudProvider" as "everything AWS related", this controller could be part of "AWSCloudProvider"; however, I personally don't favor that, for a few reasons:
    - I prefer to keep the AWSCloudProvider as thin as possible, so that it only contains the common and critical parts needed to get Kubernetes on AWS working.
    - Similar to why we moved the "Service/Ingress implementation" into a dedicated AWSLoadBalancerController (which also falls into this general definition of "CloudProvider"), dedicated controllers allow us to iterate faster and more flexibly (new features / bug fixes), and we don't need to worry that our changes would break other Kubernetes-on-AWS users, since the versioning of those optional features can be upgraded independently of the core AWS cloud provider functionality.

High-level comments on the code itself:

- The use case of the "cidr_allocator" seems confusing to me. From the code it performs both "allocate from CIDR" and "allocate from IPv6" (it's not "dualstack" but more like a cidr_and_ipv6prefix_allocator :D). I'm wondering whether there is a real use case for this mode.
- The implementation patterns of "cidr_allocator" and "ipv6_allocator" differ a lot, even though technically they are doing the same work, and I find that a bit hard to maintain, to be honest.
  - With the "cidr_allocator", the major work is handled directly in the informer's event handlers, while a work queue (channel) is used to defer the expensive IPv6 prefix lookup and CIDR updates (see the work-queue sketch below):
    - It's an anti-pattern to do expensive computation in event handlers.
    - There are no retries if there are failures in the event handlers.
    - I understand that "nodesInProcessing" aims to allow only one worker thread to process a node at any given time, but it's a bit hard to follow, to be honest, given that removeNodeFromProcessing is called from multiple places/goroutines.
I will close this PR, as we have decided to implement our own IPAM controller, as suggested by @M00nF1sh, to keep cloud-provider-aws as slim as possible (analogous to the AWS Load Balancer Controller). Furthermore, this approach gives us more freedom in how the tests for the IPAM controller are run and executed.
I want to share the freshly created Gardener aws-ipam-controller, which originated from this PR and might be beneficial to everyone considering IPv6 prefix delegation for single-stack or dual-stack scenarios.
Thanks @DockToFuture!
What type of PR is this?
/kind feature
What this PR does / why we need it:
This change adds a nodeipam controller to the AWS cloud-controller-manager for IPv6 prefix delegation. The node IPAM controller assigns pod addresses to Kubernetes nodes based on prefix delegations. The controller supports plain IPv6 (--dualstack=false) and dual stack (--dualstack=true). In dual-stack mode, the IPAM for IPv4 is done the same way it was done in the kube-controller-manager.

Which issue(s) this PR fixes:
Fixes #608
Special notes for your reviewer:
Does this PR introduce a user-facing change?: