Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop Node event when EC2 instance does not exist #753

Merged

Conversation

cartermckinnon
Copy link
Contributor

@cartermckinnon cartermckinnon commented Nov 27, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

This avoids unnecessary retries when the ec2:CreateTags call fails with an InvalidInstanceId.NotFound error. Excessive retries for each event can lead to a growing work queue that may increase dequeue latency dramatically.

If the Node is newly-created, we requeue the event and retry, to handle eventual-consistency of this API. If the Node is not newly-created, we drop the event.

This PR is only concerned with retries for a single event; all nodes still have the implicit "retry" that results from each update event (every ~5m).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 27, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 27, 2023
@cartermckinnon cartermckinnon force-pushed the instance-not-exists-handling branch 3 times, most recently from 1eef0b0 to 1ccb35f Compare November 27, 2023 21:54
@cartermckinnon
Copy link
Contributor Author

Something seems wonky with the CI, I'll look into it.

@cartermckinnon
Copy link
Contributor Author

/retest

All failures are related to persistent volumes, doesn't seem related to this change.

Copy link
Contributor

@tzneal tzneal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It look ok to me, but can we add a test for this new behavior in tagging_controller_test.go?

@cartermckinnon
Copy link
Contributor Author

/retest

@cartermckinnon
Copy link
Contributor Author

The CI is definitely hosed, same cases have been red in k/k since ~11/24: https://testgrid.k8s.io/presubmits-ec2#pull-kubernetes-e2e-ec2

@cartermckinnon cartermckinnon force-pushed the instance-not-exists-handling branch from 1ccb35f to da5beab Compare November 30, 2023 04:51
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 30, 2023
@cartermckinnon cartermckinnon force-pushed the instance-not-exists-handling branch 4 times, most recently from 02a76b9 to e80c120 Compare November 30, 2023 05:03
@cartermckinnon cartermckinnon force-pushed the instance-not-exists-handling branch from e80c120 to 1108878 Compare November 30, 2023 05:05
@cartermckinnon
Copy link
Contributor Author

@tzneal added a couple unit test cases 👍

@tzneal
Copy link
Contributor

tzneal commented Nov 30, 2023

Seems ok, any idea on the CI problem?

@cartermckinnon
Copy link
Contributor Author

Haven't had a chance to go down the rabbit hole. Looks like things broke when the kubekins image was bumped in the test spec, I assume a change in cloud-provider-aws-test-infra is the issue but I don't see anything obvious in the commit log.

@dims do you have a guess?

@dims
Copy link
Member

dims commented Nov 30, 2023

@cartermckinnon no, i have not looked at this yet ..

@cartermckinnon
Copy link
Contributor Author

@dims I'll try to get a fix up 👍

@cartermckinnon
Copy link
Contributor Author

CI should be fixed by this: kubernetes-sigs/provider-aws-test-infra#232

@cartermckinnon
Copy link
Contributor Author

/retest

Copy link
Contributor

@mmerkes mmerkes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 1, 2023
Copy link
Contributor

@ndbaker1 ndbaker1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up! did we also want to make the queue size visible with some logging around here? (since dequeuing latency isn't the most direct metric)

@k8s-ci-robot
Copy link
Contributor

@ndbaker1: changing LGTM is restricted to collaborators

In response to this:

Thanks for picking this up! did we also want to make the queue size visible with some logging around here? (since dequeuing latency isn't the most direct metric)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cartermckinnon
Copy link
Contributor Author

did we also want to make the queue size visible

I plan to add a metric for this in a separate PR, because it'd be helpful to debug in the future; but I think dequeue latency is still the more important metric to track and alarm on. There can be many events in the queue that are no-ops, and that doesn't necessarily have an impact e.g. how quickly a new Node is tagged.

@k8s-ci-robot
Copy link
Contributor

@ndbaker1: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Copy link
Member

dims commented Dec 1, 2023

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, mmerkes, ndbaker1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants