Multiple kuadrant instances #5 (Closed, 5 commits)

152 additions to `rfcs/0000-multiple-kuadrant-instances.md`
# RFC 0000

- Feature Name: `multiple kuadrant instances`
- Start Date: 2023-01-12
- RFC PR: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/pull/0000)
- Issue tracking: [Kuadrant/architecture#0000](https://github.com/Kuadrant/architecture/issues/0000)

# Summary
[summary]: #summary

This RFC proposes a new Kuadrant architecture design to enable **multiple Kuadrant instances** to run in a single cluster.

![](https://i.imgur.com/ZsPibfO.png)

# Motivation
[motivation]: #motivation

The main benefit of multiple Kuadrant instances in a single cluster is that it allows dedicated Kuadrant services per tenant.

A dedicated Kuadrant deployment brings many benefits. Just to name a few:
* Protection against external traffic load spikes. Other tenants' traffic spikes do not affect Authorino/Limitador throughput and latency as they would when shared.
* No need for the cluster administrator role to deploy a Kuadrant instance. A tenant administrator can manage gateways and Limitador and Authorino instances (including deployment modes).
* The cluster administrator gets control and visibility across all Kuadrant instances, while the tenant administrator only gets control over their specific gateway(s) and Limitador and Authorino instances.
* (looking for ideas for more benefits)...
> **Review comment (Collaborator @maleck13, Jan 25, 2023):** I think the key benefits are protection against noisy neighbours, isolation particularly in the auth context, and the ability to independently scale based on usage.


# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

### Kuadrant instance definition
![](https://i.imgur.com/BfOXfnB.png)

A Kuadrant instance is composed of:
* One Limitador deployment instance
* One Authorino deployment instance
* A list of dedicated gateways.

Some properties to highlight:

* The policies are not included as part of the kuadrant instances.
* The Kuadrant instance is not enclosed by k8s namespaces.
* Gateways are not shared between kuadrant instances. Each gateway is managed by a single kuadrant instance.
* The control plane has cluster scope and will be shared between instances. In other words, it is only in the data plane that each Kuadrant instance has dedicated services and resources.
* Each Kuadrant instance owns one instance of Limitador (possibly with multiple replicas) and one instance of Authorino. Those instances are shared among all gateways included in the Kuadrant instance.

In the following diagram, policies RLP 1 and KAP 1 are applied in instance *A*, and policies RLP 2 and KAP 2 are applied in instance *B*.

![](https://i.imgur.com/yChVsT6.png)

### All the gateways referenced by a single policy must belong to the same kuadrant instance

The Gateway API allows, in its latest version [v1beta1](https://gateway-api.sigs.k8s.io/v1alpha2/references/spec/#gateway.networking.k8s.io/v1beta1.CommonRouteSpec), an HTTPRoute to have multiple gateway parents. Thus, a kuadrant policy might technically target multiple gateways managed by multiple kuadrant instances. Kuadrant does **not** support this use case.

![](https://i.imgur.com/ZpsBf4i.png)

The main reason is related to the rate limiting capability. The limits specified in the RateLimitPolicy would be enforced on a per-Kuadrant-instance basis (by each instance's Limitador). Thus, traffic hitting one gateway would see different rate limiting counters than traffic hitting the other gateway. The user would expect X rps overall and would actually get X rps per gateway. For consistency, when this configuration happens, the control plane will reject the policy.
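For illustration, the rejection could be surfaced in the policy's status. The condition type, reason, and message below are purely hypothetical, since this RFC does not define the exact status contract:

```yaml
# Hypothetical status of a RateLimitPolicy targeting an HTTPRoute whose
# parentRefs span gateways managed by two different Kuadrant instances.
# Condition fields are illustrative only.
status:
  conditions:
    - type: Available
      status: "False"
      reason: GatewaysInMultipleKuadrantInstances
      message: "targeted gateways belong to more than one Kuadrant instance; policy rejected"
```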
> **Review comment (Member):** While I think this is a good first step and we indeed should reject such policies initially, there might be a way to support this as we architect for multi-cluster support, where the use case isn't much different. Each Limitador would know about the others; upon hitting a "shared" limit, they would start sharing the counters for it. Other than lower latency and easier cross-communication, there isn't much difference between the two scenarios. Sharing this thought as an FYI and a possible path forward; it looks perfectly acceptable to have this limitation initially.


### The Kuadrant CRD

Currently, the Kuadrant CRD has an empty spec.

```yaml
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant-sample
spec: {}
```

According to the definition of the Kuadrant instance above, the proposed new Kuadrant CRD adds a label __selector__ to specify which gateways the instance manages.

```yaml
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant-a
spec:
  gatewaysSelector:
    matchLabels:
      app: kuadrant-a
```

> **Review comment (Contributor, on lines +78 to +80):** I've been thinking about the label selector approach for this application lately. Although it is consistent with other usages (how Authorino selects AuthConfigs, how AuthConfigs select API key and X.509 cert issuer Secrets, how Istio selects workloads, among others), for this particular case of selecting (assigning) Gateways for a Kuadrant instance, an approach based on GatewayClasses could be a better fit. A few advantages I see:
> * More assertive, as the Kuadrant CR would list GatewayClasses by name explicitly;
> * Rational topology, with individual GatewayClasses sometimes split into two or more, making it clear which groups of Gateways are configured by Kuadrant and which are not;
> * Easier to validate that the Gateways assigned to a Kuadrant instance are implemented by a compatible gateway provider (Istio, OSSM, EG, etc.), with less (or zero) variation of providers within each group of gateways assigned to a Kuadrant instance. (See also Multiple gateway providers in the same cluster #7)
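For illustration, a Gateway managed by `kuadrant-a` would carry the matching label. The gateway name and gateway class below are assumptions for the example, not values defined by this RFC:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: gateway-a
  labels:
    app: kuadrant-a   # matched by kuadrant-a's gatewaysSelector
spec:
  gatewayClassName: istio   # assumed provider; any compatible class works
  listeners:
    - name: http
      port: 80
      protocol: HTTP
```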

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

### Wiring Kuadrant policies with Kuadrant instances
Technically, Kuadrant policies do not belong to any Kuadrant instance. At any point in time, a policy can switch the targeted network resource specified in its `spec` from one gateway to another, either directly or indirectly via an HTTPRoute. The target references are dynamic by nature, and so is the list of gateways to which Kuadrant policies apply.
Thus, Kuadrant's control plane needs a procedure to associate a policy with **one** Kuadrant instance at any given time. Once the control plane knows which Kuadrant instance is affected, the policy rules can be used to configure the Limitador and Authorino instances belonging to it. Since the associated Kuadrant instance of a policy is dynamic by nature, this procedure must be executed on every event related to the policy.

When the policy's `targetRef` targets a Gateway, there is a direct reference to the gateway.

When the policy's `targetRef` targets an HTTPRoute, Kuadrant will follow the [`parentRef`](https://gateway-api.sigs.k8s.io/v1alpha2/references/spec/#gateway.networking.k8s.io%2fv1beta1.CommonRouteSpec) attribute, which should be a direct reference to the gateway or gateways.
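As a sketch, the resolution follows the route's `parentRefs`. All names below are hypothetical:

```yaml
# A policy targeting this HTTPRoute resolves, via parentRefs, to gateway-a,
# and from there to the (single) Kuadrant instance managing gateway-a.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: toystore
spec:
  parentRefs:
    - name: gateway-a
  hostnames:
    - api.toystore.example.com
  rules:
    - backendRefs:
        - name: toystore
          port: 80
```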
> **Review comment (Member):** This means that, indirectly, a KAP or RLP could be applied to more than one Kuadrant instance. As mentioned before, the control plane, in charge of associating a policy with one Kuadrant instance, will need to decide which policy to apply in case there's another HTTPRoute sharing one or more of the parentRef gateways.

> **Review comment (Member):** OK, it's mentioned that this use case is not supported; still, we might need to provide a way to mitigate it, since one might not have access to modify the network topology.

Given a gateway, Kuadrant needs to find out which Kuadrant instance is managing that specific gateway. By design, Kuadrant knows there is only one. There are at least two options to implement that mapping:
* Read all Kuadrant CR objects and pick the first one whose label selector matches the gateway.
  * This approach works as long as the control plane ensures that each gateway is matched by only one Kuadrant gateway selector. The control plane must reject any new Kuadrant instance matching a gateway already "taken" by another Kuadrant instance.
* Add an annotation on the gateway whose value is the Name/Namespace of the owning Kuadrant CR.
  * This approach is commonly used, but requires annotation management.
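The annotation option could look like the following. The annotation keys are illustrative assumptions, not defined by this RFC:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: gateway-a
  annotations:
    # Hypothetical keys pointing back to the owning Kuadrant CR
    kuadrant.io/kuadrant-name: kuadrant-a
    kuadrant.io/kuadrant-namespace: kuadrant-system
spec:
  gatewayClassName: istio   # assumed provider for the example
```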
> **Review comment (Member, on lines +97 to +98):** Even though this would require the extra management, it might be the simpler and more flexible way. At the moment, the control plane assumes there's only one Kuadrant instance and annotates every single gateway. Both the Kuadrant CR and the targeted Gateways should be kept in sync.

> **Review comment (Member):** We could define the list of gateways that are meant to be managed in the Kuadrant CR.

> **Review comment (Contributor):** I'd suggest a list of GatewayClasses instead of Gateways. Users can always "split" one GatewayClass into two or more, handled by the same controller or not, to highlight the fact that different configurations apply to each class. In fact, this is encouraged by GW-API:
>
> > We expect that one or more GatewayClasses will be created by the infrastructure provider for the user. It allows decoupling of which mechanism (e.g. controller) implements the Gateways from the user. For instance, an infrastructure provider may create two GatewayClasses named internet and private to reflect Gateways that define Internet-facing vs private, internal applications.
>
> Some discussion about this also here: #7 (comment)

> **Review comment (Contributor, author):** Kuadrant manages gateway instances, not gateway classes. Kuadrant instance A may manage gateway X and instance B may manage gateway Y. Both X and Y gateways may share the same gateway class.

> **Review comment (Contributor):** I guess what I'm saying is that Kuadrant can use the GatewayClass to know which Gateways to manage. In the provided example ("Kuadrant instance A may manage gateway X and instance B may manage gateway Y"):
> 1. You can have both instances of Kuadrant listing the one and only GatewayClass; or
> 2. You can have two GatewayClasses, one for each gateway, one for each instance of Kuadrant.



### Just one external auth stage and one rate limiting stage
Kuadrant configures the gateway with a single external authorization stage (backed by Authorino) and a single external rate limiting stage (backed by Limitador).
Multiple rate limiting or authN/authZ stages involving multiple instances of Authorino and Limitador are technically possible.
Until there is a real use case that strictly requires it, this scenario is discarded. The main reason is complexity: it is already hard enough to reason about rate limiting and auth services with a single stage each. Adding multiple rate limiting stages, or hitting multiple Limitador instances in a single stage (doable with the WASM module), makes the observed behavior too complex to reason about. Currently there is no use case requiring that scenario.

# Drawbacks
[drawbacks]: #drawbacks

Multitenancy is not a capability requested by users. Usually, ingress gateways are shared resources managed by cluster administrators, and a cluster may have only a few of them. It is also a cluster admin task to route traffic to the ingress gateway. Cluster users usually do not control the life cycle of ingress gateways, which they would need in order to have their own Kuadrant instance.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

- Why is this design the best in the space of possible designs?

The gateway is the top-class entity in this design, not the policy. API protection happens at the gateway, and the configuration needs to be done at the gateway. This Kuadrant instance design protects gateways by isolating them from other instances' (mis)configurations and traffic spikes.

- What other designs have been considered and what is the rationale for not choosing them?

TODO

- What is the impact of not doing this?

This design is a step forward towards a consistent API for protecting other services' APIs. It makes it easier to protect any API, regardless of traffic nature, whether north-south or east-west. It also makes it easier to support the scenario where cluster users deploy their own (non-ingress) gateways and enable API protection declaratively.

# Prior art
[prior-art]: #prior-art

TODO

# Unresolved questions
[unresolved-questions]: #unresolved-questions

> **Review comment (Member):** It might already be work in progress, but a specification of how errors (and which error cases) would be surfaced to the user would be good. For instance, a policy is applied to an HTTPRoute alright; the route then gets an additional Gateway wired to it. What's the behavior? How does the user know?

> **Review comment (Contributor):** Currently, the addition of a new Gateway to an HTTPRoute targeted by a policy is not reflected in the policy status, just like any other event of modifying the HTTPRoute that is handled successfully by the controller. The controller basically only reports "HTTPRoute is protected" or reconciliation errors, and even the error messages are not always helpful to users. I think this is already important today and even more so in a context of multiple Kuadrant instances per cluster. Users could benefit from seeing, in the policy status, which Gateways have been configured for a policy and which were expected to be but have not (and why not). Currently, users can only partially work that out by reading, in the annotations of the Gateways, which policies affect each Gateway.

- What parts of the design do you expect to resolve through the RFC process before this gets merged?

Validate the main points of the design:
* Single auth/rate-limit stage in the processing pipeline of the gateway
* Gateways are not shared among Kuadrant instances

- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?

The wiring mechanism.

- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?

Supporting multiple gateway providers #7

# Future possibilities
[future-possibilities]: #future-possibilities

TODO