
Policy sync RFC #36

Merged
merged 7 commits into Kuadrant:main on Nov 30, 2023
Conversation

@sergioifg94 (Contributor) commented Nov 2, 2023

@sergioifg94 sergioifg94 mentioned this pull request Nov 2, 2023
@Boomatang (Contributor) left a comment:

I have a few comments. The main takeaway can be framed as a question: is there a strong argument that I'm missing for wanting to do this work?

to be defined in the hub cluster, as well as replicated in the multiple spoke clusters.

As Kuadrant users:
* Gateway-admin has a set of homogeneous clusters and needs to apply per cluster rate limits across the entire set.
Contributor:

Is there evidence that Gateway-admins use sets of comparable clusters? Given how OCM allows centrally managing on-prem, AWS, GCP, and other clusters, it's hard to believe these clusters would be homogeneous.

Even looking only at AWS, it would be reasonable to have two clusters, one in us-west and the other in us-east. If most customers are in us-west, that region would get most of the traffic and would need a larger cluster. us-east might only be there for HA if us-west goes down. us-east could still handle traffic, but not at the same scale, meaning the limits would also be different.

I believe the more common use case at the gateway level would be to add limits A to cluster A and limits B to cluster B.

Contributor Author:

From the point of view of a user who wants to create and manage policies for their Gateway, the clusters are homogeneous. The underlying infrastructure is irrelevant for that purpose.

As for the case of applying specific policies to a subset of the clusters, that's what the hierarchy and overriding section is for. A user can create a RateLimitPolicy in the hub cluster to apply to all clusters, and create a RateLimitPolicy in only one of the clusters that overrides the synced policy.

Basically, the purpose of this RFC is to allow creating policies that target all spoke clusters, in a centralised way, with the possibility of overriding them per cluster.
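As a sketch of that override idea (all resource names are hypothetical; the fields follow Kuadrant's RateLimitPolicy v1beta2 shape, but the exact API may differ):

```yaml
# Hub cluster: policy targeting the multicluster Gateway, synced to all spokes.
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: fleet-wide-limits          # hypothetical name
  namespace: multi-cluster-gateways
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web                 # hypothetical Gateway
  limits:
    "global":
      rates:
        - limit: 100
          duration: 1
          unit: second
---
# Spoke cluster: a locally created policy that would take precedence
# over the synced one under the overriding behaviour discussed here.
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: local-limits               # hypothetical name
  namespace: multi-cluster-gateways
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web
  limits:
    "global":
      rates:
        - limit: 20                # stricter limit for this one cluster
          duration: 1
          unit: second
```

The point is that both policies target the same Gateway; the spoke-local one wins the conflict under the proposed rules.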

Contributor:

At some point the capacity of the underlying infrastructure becomes relevant, and I believe this point is at the level of the gateway or gateway class. What the actual underlying infrastructure is (AWS, GCP), that is irrelevant for this purpose.

As of today the RateLimitPolicy and AuthPolicy don't support defaults and overrides, which I am assuming would be required for this. There is talk of adding that support, so I am OK with this RFC landing before it exists.

Where I do see a problem with the defaults-and-overrides use case is that Kuadrant Policies can only target one resource, and I have not heard any talk of changing this. Take the example of creating a RateLimitPolicy on the hub that targets the Gateway and is copied to the spokes: if the Gateway owner wanted to override that policy on a per-spoke basis, the overrides would need to be added to the RateLimitPolicy attached to the HTTPRoute, which may be owned by a different org.

To create the overrides in a RateLimitPolicy on the Gateway, it would require a RateLimitPolicy targeting the GatewayClass, which is not supported today.

Contributor Author:

I think there's a mix-up about the concept of overrides. When I talk about overrides in this RFC, I mean adding support in the Kuadrant operator to ignore a policy that has been synced from the hub if it conflicts with a policy created in the spoke. That support would be part of the implementation of this RFC.

Maybe I could add an item to the nomenclature section of the document to avoid confusion.

Collaborator:

> Where I do see a problem with the use case of using defaults and overrides, the Kuadrant Policies can only target one resource. And I have not heard any talks about changing this. With the example of creating a RateLimitPolicy on the hub targeting the gateway and is copied to the spokes. If the Gateway owner wanted to override that policy on a per spoke bases, the overrides would need to be added to the RateLimitingPolicy attached to the HTTPRoute. Which maybe owned by different orgs.

If they wanted to do something per cluster, they would go back to creating the policy on a per-cluster basis, which is what they have to do now.
If they want something replicated across many clusters with the same definition, targeting the multicluster Gateway, they would define the policy in the hub targeting the multicluster Gateway.

Collaborator (@maleck13) commented Nov 10, 2023:

Some of this is really also to do with the fact that we don't let the user set a counter strategy in any way, i.e. counterStrategy for limit A is clusterUnique while for limit B it is shared. Being able to define the counter strategy would let you decide whether something should be shared or unique.
Also, it is possible in an RLP to set a limit that only gets triggered "when" a certain value is passed. So if you had a unique value being passed from each gateway, you could use that to distinguish limits (i.e. if gateway A, set limit to x every y minutes).
Finally, another option here is for the policy sync to add some form of label to identify the cluster; these labels could be injected into the config for the WASMPlugin and used in conditions. That said, I would probably prefer if this were something that could be done on a per-cluster basis, i.e. via the Kuadrant CRD, set a cluster identifier that is injected, or some such.

All of this is to say that the syncing of the policy is not necessarily where we want to, or need to, solve changing a set of limits for a particular cluster.
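The "when" idea above could look roughly like this. This is a sketch: the `x-cluster-id` header is an invented attribute standing in for whatever per-cluster identifier would be injected; real RLP `when` selectors operate on well-known request/auth attributes and the field names may differ.

```yaml
apiVersion: kuadrant.io/v1beta2
kind: RateLimitPolicy
metadata:
  name: per-cluster-limit          # hypothetical name
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-web                 # hypothetical Gateway
  limits:
    "gateway-a-only":
      rates:
        - limit: 50
          duration: 60
          unit: second
      when:
        # Assumes some mechanism injects a unique per-cluster value;
        # only traffic carrying that value triggers this limit.
        - selector: request.headers.x-cluster-id
          operator: eq
          value: "gateway-a"
```

This keeps per-cluster variation inside the limit definition itself, rather than in the sync mechanism.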

Collaborator:


In order for a Policy to be supported for syncing, the MGC must have permissions
to watch/list/get the resource, and the implementation of the downstream Gateway
controller must be aware of the `policy-synced` annotation.
Contributor:

Does this problem get far more complex if/when abstraction policies are implemented?

#34

Contributor Author:

I'm not familiar with that concept. I'll give it a look and see how it affects this RFC.
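For concreteness, the `policy-synced` marker from the quoted RFC text might look like this when the hub wraps a policy in an OCM ManifestWork. The annotation key and all names here are illustrative assumptions; the RFC only fixes the `policy-synced` concept, not its exact spelling.

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: sync-fleet-wide-limits       # hypothetical name
  namespace: spoke-cluster-1         # OCM convention: the managed cluster's namespace on the hub
spec:
  workload:
    manifests:
      - apiVersion: kuadrant.io/v1beta2
        kind: RateLimitPolicy
        metadata:
          name: fleet-wide-limits
          namespace: multi-cluster-gateways
          annotations:
            # Hypothetical key; the downstream Gateway controller must
            # recognise this to apply its override/ignore logic.
            kuadrant.io/policy-synced: "true"
        spec:
          targetRef:
            group: gateway.networking.k8s.io
            kind: Gateway
            name: prod-web
```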

to watch/list/get the resource, and the implementation of the downstream Gateway
controller must be aware of the `policy-synced` annotation.

# Rationale and alternatives
Contributor:

I am surprised there are no alternatives listed. There are two questions I am asking myself with this section: how are the Gateway-admins and Platform-admins syncing resources across their fleets today? And are these the only resources that the admins would need to sync across clusters?

To the latter, I think the answer is no; there will be other resources. If the admins currently don't have a way to sync resources across clusters, then they would want a solution that can sync all their resources. While this policy sync would be nice, I think the user will be trying to solve the bigger problem.

To the first question, there are a few methods that I can come up with. Note also that I don't know what OCM can offer to help with this problem.

  1. Using GitOps with something like Argo CD. With Argo you can target multiple clusters from a single instance. With Helm templates it is even possible to adjust resources depending on which cluster they are being deployed to. This can allow cluster A to have higher limits than cluster B while still using the same base configuration. This method would allow all resources to be synced to the clusters.
  2. It is possible to manage k8s resources with Ansible. I have zero experience using Ansible, but this is an interesting one. On some mail thread (can find the link if required) there was the notion that users don't choose to be multi-cluster but end up that way through acquisition, change of leadership, or regulatory constraints. By the same reasoning, it is possible that customers started out not on k8s but on servers and VMs. In that case you can easily see Ansible being used to manage a fleet, so it is possible they also use it today to manage k8s resources.
  3. The third option I will call a hand-rolled deployment system, and the example I will use is qontract-server. This may not be a widely known or used system, and the shared link is only one component of it. Once again it targets multiple clusters, using the concepts of GitOps, but it also has a rich validation and user-permission management system.

From the alternative methods listed, what is the likelihood of the proposed policy sync being used in the wild?
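Alternative 1 might be sketched with an Argo CD ApplicationSet using the cluster generator. All names, the repo URL, and the per-cluster values-file layout are assumptions for illustration, not part of the RFC or any existing setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: rate-limit-policies            # hypothetical name
  namespace: argocd
spec:
  generators:
    - clusters: {}                     # one Application per cluster registered with Argo CD
  template:
    metadata:
      name: '{{name}}-rate-limits'
    spec:
      project: default
      source:
        repoURL: https://example.com/fleet-config.git   # hypothetical repo
        targetRevision: main
        path: charts/rate-limits
        helm:
          valueFiles:
            - values-{{name}}.yaml     # per-cluster overrides, e.g. different limits
      destination:
        server: '{{server}}'
        namespace: kuadrant-system
```

This is the "same base configuration, per-cluster values" pattern described in point 1.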

Contributor Author:

There are many ways to sync resources across clusters, but the goal of this RFC is not only to sync policies, but to manage the synced resources from the hub cluster alongside the policies that target the Multicluster Gateway.

The alternatives you listed would either leave the syncing of the policies in the hands of the user, or add extra dependencies to the controller, whereas the solution proposed in this RFC relies on OCM, which is already the main framework used to place resources in the spoke clusters.

Contributor:

Where you say "to manage the synced resources", I am assuming this means that if a user modifies the resource on the spoke cluster, the hub would revert those changes. Of the three examples I gave above, only one does not do that out of the box.

If OCM can do this and is already being used to place resources in the spoke clusters, then that is what we should aim to use. But when no alternatives are stated, it comes across as if OCM and this RFC are the only option.

Contributor Author:

I'll add these options to the list of alternatives to make it clear why OCM is the best fit for our situation.

rfcs/0004-policy-sync-v1.md (outdated; resolved)
@sergioifg94 (Contributor Author):
@Boomatang @maleck13 Pushed a commit addressing the comments

.gitignore (outdated; resolved)
rfcs/0004-policy-sync-v1.md (outdated; resolved)
#### Dynamic Policy watches

The Multicluster Gateway Controller reconciles parameters referenced by the
GatewayClass of a Gateway. A new field is added to the parameters that allows

Will I have to add that I want to sync AP/RLP to every gateway class I create?

Contributor Author:

Not necessarily. Since that is configured in a ConfigMap, multiple GatewayClasses can reference a single ConfigMap, so the configuration doesn't have to be repeated.
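The shared-ConfigMap arrangement could look like the sketch below. The `parametersRef` mechanism is standard Gateway API; the `policiesToSync` key is a guess at the new parameters field the RFC proposes, and all names and the controllerName are illustrative.

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: GatewayClass
metadata:
  name: mgc-gateway-class              # hypothetical; several classes can point at the same ConfigMap
spec:
  controllerName: kuadrant.io/mgc-gw-controller   # hypothetical controller name
  parametersRef:
    group: ""                          # core API group, for ConfigMap
    kind: ConfigMap
    name: gateway-params
    namespace: multi-cluster-gateways
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gateway-params
  namespace: multi-cluster-gateways
data:
  params: |
    policiesToSync:                    # sketch of the proposed field
      - group: kuadrant.io
        version: v1beta2
        kind: RateLimitPolicy
```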

rfcs/0004-policy-sync-v1.md (outdated; resolved)
@sergioifg94 (Contributor Author):

Pushed some changes with @pehala's suggestions.

@sergioifg94 sergioifg94 merged commit 61f8eda into Kuadrant:main Nov 30, 2023
1 check passed
@philbrookes philbrookes mentioned this pull request Mar 7, 2024
5 participants