-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-33958: secret re-creation scenario for externalCertificate with active informer #614
base: master
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
41a1750
to
f51a3f2
Compare
/jira refresh |
@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
70121a3
to
f655bd2
Compare
@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@chiragkyal: This pull request references Jira Issue OCPBUGS-33958, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Signed-off-by: chiragkyal <[email protected]>
I played with the fix on a live cluster:
Two things I noticed about the status of the route:
I think that we need to test again those cases and try to find out whether there is any race or a glitch. |
Thanks for doing the test. I also tried on a live cluster. I think there is a race condition happening while updating the status of the route. The secret handler status update is racing with the status update of the plugin re-trigger. The top level plugin has a lock : router/pkg/router/controller/router_controller.go Lines 210 to 217 in be831fa
which means all routes are getting processed serially, unlike secret handler. As a solution, I think the |
Signed-off-by: chiragkyal <[email protected]>
Signed-off-by: chiragkyal <[email protected]>
@alebedev87 @Miciah As per our discussion over sync call, I've updated the contention tracker to ignore the status update done by the secret manager. Please take a final look at the PR, and let's try to get this merged. Thanks! |
I did some testing on a real cluster with multiple routes using the same secret, and after the latest changes, the transition of the status were as expected and consistent through out the routes.
|
/approve Thanks for the testing, we should be good now. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alebedev87 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
tide/merge-method-squash |
/label tide/merge-method-squash |
if ignoreIngressConditionReason.Has(a.Reason) || ignoreIngressConditionReason.Has(b.Reason) { | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed today that this condition may be broader than necessary, but we couldn't think of a realistic scenario in which it would cause problems. However, we may need to revisit this condition if we ever observe any strange behavior with respect to status updates for routes with external certificates during, say, a rolling update of a router deployment.
// - If the external certificate has changed, it unregisters the old one and registers the new one. | ||
// - If the external certificate has not changed, it revalidates and updates the in-memory TLS certificate and key. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is a little unclear. Presumably by "the external certificate has changed" or "the external certificate has not changed", you mean that the external certificate secret reference has or has not changed. That is, if the secret reference changes, you need to track the new referent, but if the reference is the same, you only need to check whether the referent secret's contents have changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean that the external certificate secret reference has or has not changed
Yes exactly, if the secret reference is changed to a new secret, we will track the new secret. And if the secret name is unchanged, we will check the contents of the secret.
// Update the route status to notify plugins, including this plugin, for re-evaluation. | ||
// - If the route is admitted (Admitted=True), record an update event. | ||
// - If the route is not admitted, record a rejection event (keep it rejected). | ||
if isRouteAdmittedTrue(route.DeepCopy()) { | ||
p.recorder.RecordRouteUpdate(route, ExtCrtStatusReasonSecretUpdated, msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isRouteAdmittedTrue
checks whether any router has admitted the route:
router/pkg/router/controller/route_secret_manager.go
Lines 396 to 407 in 3966d00
// isRouteAdmittedTrue returns true if the given route has been admitted | |
// by any of its ingress controllers, otherwise false. | |
func isRouteAdmittedTrue(route *routev1.Route) bool { | |
for _, ingress := range route.Status.Ingress { | |
for _, condition := range ingress.Conditions { | |
if condition.Type == routev1.RouteAdmitted && condition.Status == kapi.ConditionTrue { | |
return true | |
} | |
} | |
} | |
return false | |
} |
Suppose that the current router has previously rejected the route, but another router admitted it. Then will this logic cause the current router to admit the route erroneously?
The route status could have a stale status entry from a router that is no longer running, or from a router that is using an old image, or from a router that is configured differently (for example, it might allow wildcards while the current one does not). Why wouldn't you check only whether this router has admitted the route?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't you check only whether this router has admitted the route?
Thanks for the suggestion. Yeah, it makes sense to me. I have updated the logic in 62daef4 commit to pass the routerName to RouteSecretManager and check the status change for only this specific router.
// - Handles secret recreation: If a secret is recreated (created after being | ||
// deleted), it fetches the latest route object associated with it using the | ||
// RouteLister and updates the route's status to trigger a full | ||
// re-evaluation of the route by the entire plugin chain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a gap here. Consider the following scenario:
- Start with a router deployment.
- Create a route with an external certificate reference to secret.
- Delete the secret.
- Restart the router deployment's pods (for example, for a rolling update).
- Recreate the secret.
The new router pods will reject the route at step 4 and will miss the secret recreate event at step 5, right?
It might not be feasible to close this gap, but I want to raise this concern to make sure I understand correctly and to note that it is a potential issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is likely possible because the route pod is managing the lifecycle of the secret manager and the secret informers. Restarting the router deployment's pods will kill the previously stored watch, and new pods will not register the new watch because the secret won't be there. So, yeah, in this scenario, we will miss the secret recreate event.
It might not be feasible to close this gap, but I want to raise this concern to make sure I understand correctly and to note that it is a potential issue.
I agree on this, given the secret manager's dependency on the router pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see something related to this in the EP as well.
The router will bootstrap the secret monitor to ensure it can keep the certificate up-to-date. This means the router pod will maintain active watches for every secret that is referenced by a route.
Signed-off-by: chiragkyal <[email protected]>
@chiragkyal: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR:
Handles secret re-creation scenarios for externalCertificate where a previously deleted secret is added again.
Introduces a
deletedSecrets
map inRouteSecretManager
to track routes for which the associated secret was deleted. This helps differentiate between new secret creation and recreation of a previously deleted secret.Refactors secret handling logic in
generateSecretHandler
to address these changes.AddFunc
handles the secret recreations. Upon detecting a recreated secret, it revalidates the route, updates its TLS configuration, and triggers aModified
event for the next plugins.DeleteFunc
does not callUnregisterRoute()
, instead tracks deleted secrets.Improves the logic for handling
Modified
events for routes with externalCertificate to avoid blindly unregistering and re-registrations with the secret manager.UnregisterRoute()
, which leads to the stopping of the informer.Makes stopping of the informer self-contained to the
RouteSecretManager
plugin.RemoveRoute
(templateRouter
) should not callUnregisterRoute()
to avoid removal of informer whenRouteSecretManager
calls the next plugin chain withwatch.Deleted
event.externalCertificate
field is removedexternalCertificate
field is changed to a new secret name. [Old secret name is cached insideSecretManager
'sregisteredHandlers
map.Adds unit tests to cover the new logic and scenarios.
New Helper Functions:
validate()
to encapsulate the externalCertificate validation logic.populateRouteTLSFromSecret()
to update a route's in-memory TLS configuration from the referenced secret.unregister()
to remove the route's registration with the secret manager and clean up references indeletedSecrets
.Part of : https://issues.redhat.com//browse/OCPBUGS-33958