
Policy sync RFC #36

Merged
merged 7 commits into Kuadrant:main on Nov 30, 2023
Conversation

@sergioifg94 (Contributor) commented Nov 2, 2023

@sergioifg94 sergioifg94 mentioned this pull request Nov 2, 2023
@Boomatang (Contributor) left a comment:

I have a few comments. The main takeaway can be framed as a question: is there a strong argument that I'm missing for wanting to do this work?

to be defined in the hub cluster, as well as replicated in the multiple spoke clusters.

As Kuadrant users:
* Gateway-admin has a set of homogeneous clusters and needs to apply per cluster rate limits across the entire set.
Contributor:

Is there evidence that Gateway-admins use sets of comparable clusters? Given how OCM allows centrally managing on-prem, AWS, GCP, and other clusters, it's hard to believe these clusters would be homogeneous.

Even looking only at AWS, it would be reasonable to have two clusters, one in us-west and the other in us-east. If most customers are in us-west, that region would get most of the traffic and would need a larger cluster. us-east might only be there for HA if us-west goes down. us-east could still handle traffic, but not at the same scale, meaning the limits would also be different.

I believe the more common use case at the gateway level would be to add limits A to cluster A and limits B to cluster B.

Contributor Author:

From the point of view of a user who wants to create and manage policies for their Gateway, the clusters are homogeneous. The underlying infrastructure is irrelevant for that purpose.

As for the case of applying specific policies to a subset of the clusters, that's what the hierarchy and overriding section is for. A user can create a RateLimitPolicy in the hub cluster to apply to all clusters, and create a RateLimitPolicy in only one of the clusters that overrides the synced policy.

Basically, the purpose of this RFC is to allow creating policies that target all spoke clusters, in a centralised way, with the possibility of overriding them per cluster.
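As a sketch of that override idea (all resource names are hypothetical; the fields follow Kuadrant's RateLimitPolicy v1beta2 shape, but the exact API may differ):

```yaml
# Hub cluster: policy targeting the multicluster Gateway, synced to all spokes.
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: fleet-wide-limits          # hypothetical name
  namespace: multi-cluster-gateways
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web                 # hypothetical Gateway
  limits:
    "global":
      rates:
        - limit: 100
          duration: 1
          unit: second
---
# Spoke cluster: a locally created policy that would take precedence
# over the synced one under the overriding behaviour discussed here.
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: local-limits               # hypothetical name
  namespace: multi-cluster-gateways
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web
  limits:
    "global":
      rates:
        - limit: 20                # stricter limit for this one cluster
          duration: 1
          unit: second
```

The point is that both policies target the same Gateway; the spoke-local one wins the conflict under the proposed rules.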

Contributor:

At some point the capacity of the underlying infrastructure becomes relevant, and I believe this point is at the level of the gateway or gateway class. What the actual underlying infrastructure is (AWS, GCP), that is irrelevant for this purpose.

As of today the RateLimitPolicy and AuthPolicy don't support defaults and overrides, which I am assuming would be required for this. There is talk of adding that support, so I am OK with this RFC landing before it exists.

Where I do see a problem with the defaults-and-overrides use case is that Kuadrant Policies can only target one resource, and I have not heard any talk of changing this. Take the example of creating a RateLimitPolicy on the hub that targets the Gateway and is copied to the spokes: if the Gateway owner wanted to override that policy on a per-spoke basis, the overrides would need to be added to the RateLimitPolicy attached to the HTTPRoute, which may be owned by a different org.

To create the overrides in a RateLimitPolicy on the Gateway, it would require a RateLimitPolicy targeting the GatewayClass, which is not supported today.

Contributor Author:

I think there's a mix-up about the concept of overrides. When I talk about overrides in this RFC, I mean adding support in the Kuadrant operator to ignore a policy that has been synced from the hub if it conflicts with a policy created in the spoke. That support would be part of the implementation of this RFC.

Maybe I could add an item to the nomenclature section of the document to avoid confusion.

Collaborator:

> Where I do see a problem with the use case of using defaults and overrides, the Kuadrant Policies can only target one resource. And I have not heard any talks about changing this. With the example of creating a RateLimitPolicy on the hub targeting the gateway and is copied to the spokes. If the Gateway owner wanted to override that policy on a per spoke bases, the overrides would need to be added to the RateLimitingPolicy attached to the HTTPRoute. Which maybe owned by different orgs.

If they wanted to do something per cluster, they would go back to creating the policy on a per-cluster basis, which is what they have to do now.
If they want something replicated across many clusters with the same definition, targeting the multicluster Gateway, they would define the policy in the hub targeting the multicluster Gateway.

Collaborator (@maleck13) commented Nov 10, 2023:

Some of this is really also to do with the fact that we don't let the user set a counter strategy in any way, i.e. counterStrategy for limit A is clusterUnique while for limit B it is shared. Being able to define the counter strategy would let you decide whether something should be shared or unique.
Also, it is possible in an RLP to set a limit that only gets triggered "when" a certain value is passed. So if you had a unique value being passed from each gateway, you could use that to distinguish limits (i.e. if gateway A, set limit to x every y minutes).
Finally, another option here is for the policy sync to add some form of label to identify the cluster; these labels could be injected into the config for the WASMPlugin and used in conditions. That said, I would probably prefer if this were something that could be done on a per-cluster basis, i.e. via the Kuadrant CRD, set a cluster identifier that is injected, or some such.

All of this is to say that the syncing of the policy is not necessarily where we want to, or need to, solve changing a set of limits for a particular cluster.
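The "when" idea above could look roughly like this. This is a sketch: the `x-cluster-id` header is an invented attribute standing in for whatever per-cluster identifier would be injected; real RLP `when` selectors operate on well-known request/auth attributes and the field names may differ.

```yaml
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: per-cluster-limit          # hypothetical name
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web                 # hypothetical Gateway
  limits:
    "gateway-a-only":
      rates:
        - limit: 50
          duration: 60
          unit: second
      when:
        # Assumes some mechanism injects a unique per-cluster value;
        # only traffic carrying that value triggers this limit.
        - selector: request.headers.x-cluster-id
          operator: eq
          value: "gateway-a"
```

This keeps per-cluster variation inside the limit definition itself, rather than in the sync mechanism.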

Collaborator:


In order for a Policy to be supported for syncing, the MGC must have permissions
to watch/list/get the resource, and the implementation of the downstream Gateway
controller must be aware of the `policy-synced` annotation.
Contributor:

Does this problem get far more complex if/when abstraction policies are implemented?

#34

Contributor Author:

I'm not familiar with that concept. I'll give it a look and see how it affects this RFC.
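For concreteness, the `policy-synced` marker from the quoted RFC text might look like this when the hub wraps a policy in an OCM ManifestWork. The annotation key and all names here are illustrative assumptions; the RFC only fixes the `policy-synced` concept, not its exact spelling.

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: sync-fleet-wide-limits       # hypothetical name
  namespace: spoke-cluster-1         # OCM convention: the managed cluster's namespace on the hub
spec:
  workload:
    manifests:
      - apiVersion: kuadrant.io/v1beta2
        kind: RateLimitPolicy
        metadata:
          name: fleet-wide-limits
          namespace: multi-cluster-gateways
          annotations:
            # Hypothetical key; the downstream Gateway controller must
            # recognise this to apply its override/ignore logic.
            kuadrant.io/policy-synced: "true"
        spec:
          targetRef:
            group: gateway.networking.k8s.io
            kind: Gateway
            name: prod-web
```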

to watch/list/get the resource, and the implementation of the downstream Gateway
controller must be aware of the `policy-synced` annotation.

# Rationale and alternatives
Contributor:

I am surprised there are no alternatives listed. There are two questions I am asking myself with this section: how are the Gateway-admins and Platform-admins syncing resources across their fleets today? And are these the only resources that the admins would need to sync across clusters?

To the latter, I think the answer is no; there will be other resources. If the admins currently don't have a way to sync resources across clusters, then they would want a solution that can sync all their resources. While this policy sync would be nice, I think the user will be trying to solve the bigger problem.

To the first question, there are a few methods that I can come up with. Note also that I don't know what OCM can offer to help with this problem.

  1. Using GitOps with something like Argo CD. With Argo you can target multiple clusters from a single instance. With Helm templates it is even possible to adjust resources depending on which cluster they are being deployed to. This can allow cluster A to have higher limits than cluster B while still using the same base configuration. This method would allow all resources to be synced to the clusters.
  2. It is possible to manage k8s resources with Ansible. I have zero experience using Ansible, but this is an interesting one. On some mail thread (can find the link if required) there was the notion that users don't choose to be multi-cluster but end up that way through acquisition, change of leadership, or regulatory constraints. By the same reasoning, it is possible that customers started out not on k8s but on servers and VMs. In that case you can easily see Ansible being used to manage a fleet, so it is possible they also use it today to manage k8s resources.
  3. The third option I will call a hand-rolled deployment system, and the example I will use is qontract-server. This may not be a widely known or used system, and the shared link is only one component of it. Once again it targets multiple clusters, using the concepts of GitOps, but it also has a rich validation and user-permission management system.

From the alternative methods listed, what is the likelihood of the proposed policy sync being used in the wild?
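Alternative 1 might be sketched with an Argo CD ApplicationSet using the cluster generator. All names, the repo URL, and the per-cluster values-file layout are assumptions for illustration, not part of the RFC or any existing setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: rate-limit-policies            # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters: {}                     # one Application per cluster registered with Argo CD
  template:
    metadata:
      name: '{{name}}-rate-limits'
    spec:
      project: default
      source:
        repoURL: https://example.com/fleet-config.git   # hypothetical repo
        targetRevision: main
        path: charts/rate-limits
        helm:
          valueFiles:
            - values-{{name}}.yaml     # per-cluster overrides, e.g. different limits
      destination:
        server: '{{server}}'
        namespace: kuadrant-system
```

This is the "same base configuration, per-cluster values" pattern described in point 1.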

Contributor Author:

There are many ways to sync resources across clusters, but the goal of this RFC is not only to sync policies, but to manage the synced resources from the hub cluster alongside the policies that target the Multicluster Gateway.

The alternatives you listed would either leave the syncing of the policies in the hands of the user, or add extra dependencies to the controller, whereas the solution proposed in this RFC relies on OCM, which is already the main framework used to place resources in the spoke clusters.

Contributor:

Where you say "to manage the synced resources", I am assuming this means that if a user modifies the resource on the spoke cluster, the hub would revert those changes. Of the three examples I gave above, only one does not do that out of the box.

If OCM can do this and is already being used to place resources in the spoke clusters, then that is what we should aim to use. But when no alternatives are stated, it comes across as if OCM and this RFC are the only option.

Contributor Author:

I'll add these options to the list of alternatives to make it clear why OCM is the best fit for our situation.

rfcs/0004-policy-sync-v1.md (outdated; resolved)
@sergioifg94 (Contributor Author):
@Boomatang @maleck13 Pushed a commit addressing the comments

.gitignore (outdated; resolved)
rfcs/0004-policy-sync-v1.md (outdated; resolved)
#### Dynamic Policy watches

The Multicluster Gateway Controller reconciles parameters referenced by the
GatewayClass of a Gateway. A new field is added to the parameters that allows

Will I have to add that I want to sync AP/RLP to every gateway class I create?

Contributor Author:

Not necessarily. Since that is configured in a ConfigMap, multiple GatewayClasses can reference a single ConfigMap, so the configuration doesn't have to be repeated.
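The shared-ConfigMap arrangement could look like the sketch below. The `parametersRef` mechanism is standard Gateway API; the `policiesToSync` key is a guess at the new parameters field the RFC proposes, and all names and the controllerName are illustrative.

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: GatewayClass
metadata:
  name: mgc-gateway-class              # hypothetical; several classes can point at the same ConfigMap
spec:
  controllerName: kuadrant.io/mgc-gw-controller   # hypothetical controller name
  parametersRef:
    group: ""                          # core API group, for ConfigMap
    kind: ConfigMap
    name: gateway-params
    namespace: multi-cluster-gateways
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gateway-params
  namespace: multi-cluster-gateways
data:
  params: |
    policiesToSync:                    # sketch of the proposed field
      - group: kuadrant.io
        version: v1beta2
        kind: RateLimitPolicy
```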

rfcs/0004-policy-sync-v1.md (outdated; resolved)
@sergioifg94 (Contributor Author):

Pushed some changes with @pehala's suggestions.

@sergioifg94 sergioifg94 merged commit 61f8eda into Kuadrant:main Nov 30, 2023
1 check passed
@philbrookes philbrookes mentioned this pull request Mar 7, 2024
5 participants