OCPBUGS-29894: Check if CRLs are downloaded when determining ready status #595
base: master
@rfredette: This pull request references Jira Issue OCPBUGS-29894, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@rfredette: This pull request references Jira Issue OCPBUGS-29894, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest
/assign
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
645c9ea to e6243d4
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This makes a router pod start failing readiness checks if it has outdated CRLs, right?
To fix OCPBUGS-29894, it should be sufficient to fail readiness only for the initial synch, so that startup probes (which use the readiness endpoint) fail until the initial synch is done.
Once the router pod has done the initial synch, we want readiness checks to pass even if refresh fails, for two reasons:
- The expectation is to restore the behavior prior to cluster-ingress-operator#939 ("OCPBUGS-6661, OCPBUGS-9464: Move mTLS CRL handling into the router, and fix accidental duplication of CRLs") and #472 ("OCPBUGS-6661, OCPBUGS-9464: Handle mTLS CRLs, and fix accidental CRL duplication"). That behavior was to prevent a router pod from serving traffic until it had CRLs, not to prevent a router pod from serving traffic if it had outdated CRLs.
- It is generally less bad to continue using outdated CRLs than to stop serving traffic entirely when refresh fails.
This does make me realize that we need a Prometheus metric and an alert when refresh fails for a prolonged period. Failure to refresh has two nasty implications:
- Router pods are using outdated CRLs.
- The next rolling update of the router deployment (for an upgrade, configuration change, or whatever reason) could get stuck as presumably the new pods would fail on initial synch.
Ack, I'll update this so that the CRLs readiness check is only used for the initial sync.
That makes sense, although I think that's out of the scope of this bug. I'll open a Jira issue for that.
This fixes OCPBUGS-29894
e7b4fc2 to 4b7b65f
e2e-upgrade failed during bootstrap.
/test e2e-upgrade
@rfredette: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale
/remove-lifecycle stale
Require all CRLs to be downloaded before the router can report that it's ready. This prevents forwarding requests to a router until it's ready to handle mTLS.
This fixes OCPBUGS-29894