OCPBUGS-29894: Check if CRLs are downloaded when determining ready status #595

Open, wants to merge 1 commit into base: master

rfredette
Contributor

Require all CRLs to be downloaded before the router can report that it's ready. This prevents forwarding requests to a router until it's ready to handle mTLS.

This fixes OCPBUGS-29894

openshift-ci-robot added the jira/severity-moderate, jira/valid-reference, and jira/invalid-bug labels on May 13, 2024
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-29894, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

openshift-ci bot requested review from frobware and gcs278 on May 13, 2024 19:15
@rfredette
Contributor Author

/jira refresh

openshift-ci-robot added the jira/valid-bug label on May 13, 2024
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-29894, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

openshift-ci-robot removed the jira/invalid-bug label on May 13, 2024
openshift-ci bot requested a review from lihongan on May 13, 2024 19:19
@rfredette
Contributor Author

/retest

@Miciah
Contributor

Miciah commented Jun 5, 2024

/assign

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Sep 4, 2024
@lihongan
Contributor

lihongan commented Sep 4, 2024

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label on Sep 4, 2024
Contributor

openshift-ci bot commented Sep 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from miciah. For more information see the Kubernetes Code Review Process.

Contributor

@Miciah Miciah left a comment

This makes a router pod start failing readiness checks if it has outdated CRLs, right?

To fix OCPBUGS-29894, it should be sufficient to fail readiness only for the initial synch, so that startup probes (which use the readiness endpoint) fail until the initial synch is done.

Once the router pod has done the initial synch, we want readiness checks to pass even if refresh fails, for two reasons:

This does make me realize that we need a Prometheus metric and an alert when refresh fails for a prolonged period. Failure to refresh has two nasty implications:

  • Router pods are using outdated CRLs.
  • The next rolling update of the router deployment (for an upgrade, configuration change, or whatever reason) could get stuck as presumably the new pods would fail on initial synch.

Review thread on pkg/router/crl/crl.go (outdated, resolved)
@rfredette
Contributor Author

> Once the router pod has done the initial synch, we want readiness checks to pass even if refresh fails

Ack, I'll update this so that the CRLs readiness check is only used for the initial sync.

> This does make me realize that we need a Prometheus metric and an alert when refresh fails for a prolonged period.

That makes sense, although I think that's out of the scope of this bug. I'll open a Jira issue for that.

@rfredette
Contributor Author

e2e-upgrade failed during bootstrap.

/test e2e-upgrade

Contributor

openshift-ci bot commented Sep 24, 2024

@rfredette: all tests passed!

@openshift-bot
Contributor

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Dec 24, 2024
@lihongan
Contributor

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label on Dec 24, 2024
Labels
jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
5 participants