Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-43745: Add support for IdleCloseTerminationPolicy #1166

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

frobware
Copy link
Contributor

@frobware frobware commented Nov 19, 2024

Introduce logic in desiredRouterDeployment to set the environmentvariable ROUTER_IDLE_CLOSE_ON_RESPONSE when the IdleConnectionTerminationPolicy field in the IngressController spec is set to Deferred. This change enables configuring HAProxy with the idle-close-on-response option for better control over idle connection termination behaviour.

Requires:

@openshift-ci openshift-ci bot requested review from alebedev87 and Miciah November 19, 2024 15:19
@frobware frobware changed the title OCPBUGS 43745 idle close on response OCPBUGS:43745: idle close on response Nov 19, 2024
@frobware frobware changed the title OCPBUGS:43745: idle close on response OCPBUGS-43745: idle close on response Nov 19, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 19, 2024
@openshift-ci-robot
Copy link
Contributor

@frobware: This pull request references Jira Issue OCPBUGS-43745, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.19.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@frobware
Copy link
Contributor Author

/hold

Many commits should never merge.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 19, 2024
@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from a7dea46 to 991d367 Compare November 20, 2024 15:53
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/release-4.18/e2e-aws-operator openshift/router#639

Copy link
Contributor

openshift-ci bot commented Nov 20, 2024

@frobware, testwith: could not generate prow job. ERROR:

no ref for requested test included in command

@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

@frobware frobware changed the title OCPBUGS-43745: idle close on response OCPBUGS-43745: Add support for IdleCloseTerminationPolicy Nov 20, 2024
@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 991d367 to 052f6b6 Compare November 20, 2024 19:23
@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 052f6b6 to 0aa6244 Compare November 21, 2024 07:47
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 0aa6244 to 1072f8f Compare November 21, 2024 08:04
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 27, 2024
@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch 2 times, most recently from f89fbb2 to 9707f72 Compare December 10, 2024 13:44
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2024
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 9707f72 to 77673b3 Compare December 10, 2024 14:49
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 77673b3 to 3029807 Compare December 10, 2024 15:43
@alebedev87
Copy link
Contributor

/assign

@candita
Copy link
Contributor

candita commented Dec 11, 2024

/assign @grzpiotrowski

Copy link
Contributor

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from alebedev87. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 650e796 to 6449f4c Compare December 13, 2024 18:33
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

Add a `/custom-response` endpoint that returns a configurable response
via the `CUSTOM_RESPONSE` environment variable.
Pickup openshift/api#2102

% git show 27316471eb72fe8fcf0d44fb5a0602f698f253dc
commit 27316471eb72fe8fcf0d44fb5a0602f698f253dc
Merge: de9de05a8 b7417509c
Author: openshift-merge-bot[bot] <148852131+openshift-merge-bot[bot]@users.noreply.github.com>
Date:   Wed Dec 18 10:31:50 2024 +0000

    Merge pull request #2102 from frobware/OCPBUGS-43745-idle-close-on-response

    OCPBUGS-43745: Add IdleCloseOnResponse field to IngressControllerSpec

Vendoring steps:

$ go mod edit -replace github.com/openshift/api=github.com/openshift/api@27316471eb72fe8fcf0d44fb5a0602f698f253dc
$ go mod tidy
$ go mod vendor
$ make update
@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 6449f4c to 60b3714 Compare December 19, 2024 10:19
@frobware
Copy link
Contributor Author

openshift/api#2102 has merged. Vendor bump in 6a882ac.

This is ready for review but needs openshift/router#639 for the new e2e test to pass.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 19, 2024
@frobware
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Dec 19, 2024
@openshift-ci-robot
Copy link
Contributor

@frobware: This pull request references Jira Issue OCPBUGS-43745, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Dec 19, 2024
@openshift-ci openshift-ci bot requested a review from lihongan December 19, 2024 10:22
Copy link
Contributor

@alebedev87 alebedev87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change LGTM. All the comments are about the e2e.

Comment on lines 514 to 162
expectedBackendName := fmt.Sprintf("be_http:%s:%s", routeName.Namespace, routeName.Name)
expectedServerName := fmt.Sprintf("pod:%s:%s:http:%s:%d", tc.pods[serviceIndex].Name, service.Name, tc.pods[serviceIndex].Status.PodIP, service.Spec.Ports[0].Port)

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
if err := waitForHAProxyConfigUpdate(ctx, t, ic, expectedBackendName, expectedServerName); err != nil {
return nil, fmt.Errorf("error waiting for HAProxy configuration update for route %s to point to service %s/%s: %w", routeName, service.Namespace, service.Name, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That adds a lot of moving parts (== code) to the test. Also, will it work in DCM case? The server address may be updated dynamically (I have to check how DCM behaves during a server switch). May be a simple time.Sleep(6 * time.Seconds) would be enough (and would suit the DCM case too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all good points. Given this is a parallel test I could just sleep 30s (for busy clusters).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will rip this out and sleep 30s instead. Why 30s? To avoid flakes if we chose 6s (i.e., a 1s more than the 5s routing update default). There's little point in chasing test flakes when the test is a) parallel and b) runs for a couple of minutes anyway.

Comment on lines 262 to 273
podList, err := getPods(t, kclient, deployment)
if err != nil {
return nil, fmt.Errorf("failed to fetch pods for deployment %s/%s: %w", deployment.Namespace, deployment.Name, err)
}

if len(podList.Items) == 0 {
return nil, fmt.Errorf("no pods in deployment %s/%s", deployment.Namespace, deployment.Name)
}

for i := range podList.Items {
tc.pods = append(tc.pods, &podList.Items[i])
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this for the config check? I'm asking because the check for Complete deployment guarantees us available pods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was local assertions, or post-conditions. Exiting this function should have setup everything required by the remainder of the test code.

Comment on lines 258 to 260
if err := waitForDeploymentComplete(t, kclient, deployment, 2*time.Minute); err != nil {
return nil, fmt.Errorf("deployment %s/%s is not ready: %w", deployment.Namespace, deployment.Name, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do a check for the deployment Complete in idleConnectionCreateBackendService already.

Comment on lines 292 to 30
if len(tc.deployments) != 2 {
return nil, fmt.Errorf("expected 2 deployments, but got %d", len(tc.deployments))
}

if len(tc.services) != 2 {
return nil, fmt.Errorf("expected 2 services, but got %d", len(tc.services))
}

if len(tc.pods) != 2 {
return nil, fmt.Errorf("expected 2 pods, but got %d", len(tc.pods))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks seem redundant, if we reached this point in code it means that the API server assured us that these objects are created in idleConnectionCreateBackendService.

Comment on lines 756 to 437
if i == 0 && policy == currentIdleTerminationPolicy {
t.Logf("Skipping initial policy switch as current policy %q already matches %q", currentIdleTerminationPolicy, policy)
} else {
if err := idleConnectionSwitchIdleTerminationPolicy(t, ic, icName, policy); err != nil {
t.Fatalf("failed to switch to policy %q: %v", policy, err)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking whether 2 test cases (1 for Immediate and 1 for Deferred) would be a simpler approach. There would be no need 1) for idleConnectionSwitchIdleTerminationPolicy function, 2) to care for the order of the policies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a lot of repetition to have two test cases for 10 lines of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it’s important to test the case where the termination policy is switched. If we had two private ICs pre-configured with Deferred and Immediate policies, we would never perform an end-to-end test to validate the behaviour of switching between them.

return nil
}

func idleConnectionCreateDeployment(ctx context.Context, namespace string, serviceNumber int, serverResponse, image string) (*appsv1.Deployment, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want your test workload to be a Deployment for a resilience reason? I'm asking because build*Pod functions from utils_test.go (example) are used in most cases and look shorter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the test benefit form just a pod? I don't know. I just chose a deployment as a reasonable abstraction.

Comment on lines 367 to 368
{Name: "TLS_CERT", Value: "/etc/serving-cert/tls.crt"},
{Name: "TLS_KEY", Value: "/etc/serving-cert/tls.key"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need all the hustle with the serving cert secret if at the end we hit the plain http port?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I rely on the server in test/http2 which requires certificates for the server that listens on port https.

Copy link
Contributor

@alebedev87 alebedev87 Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now. Does it make sense to make the listening on HTTPS port conditional (== don't listen on HTTPS if envvar is not provided)?

@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 60b3714 to 1c5e4e9 Compare December 24, 2024 09:55
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

Introduce logic in desiredRouterDeployment to set the environment
variable `ROUTER_IDLE_CLOSE_ON_RESPONSE` when the
`IdleConnectionTerminationPolicy` field in the IngressController spec is
set to `Deferred`. This change enables configuring HAProxy with the
`idle-close-on-response` option for better control over idle connection
termination behaviour.
@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 1c5e4e9 to 014b62f Compare December 24, 2024 15:47
@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

@frobware frobware force-pushed the OCPBUGS-43745-idle-close-on-response branch from 014b62f to 49e561e Compare December 24, 2024 15:57
Copy link
Contributor

openshift-ci bot commented Dec 24, 2024

@frobware: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/verify 014b62f link true /test verify

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@frobware
Copy link
Contributor Author

/testwith openshift/cluster-ingress-operator/master/e2e-aws-operator openshift/router#639

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants