0 comments · 14 replies
-
@sn0wcat @stefan-ettl this might be of interest to you
-
@kbData how about creating a concept for this, maybe together with @sn0wcat and @stefan-ettl (since you already mentioned them)? This would definitely not be priority work, imho, as other issues are more urgent and need to be addressed by the dev team. But a (conceptual) contribution to this topic would be very welcome: it would be a good starting point for dealing with it in the future, or, even better, for finding more interested people to work on it.
-
@mspiekermann we will provide a proposal. cc: @kbData
-
### Motivation for Multitenancy

As @kbData stated, data space participants who are not ready to operate their own instance would use "Connector as a Service" offerings that operate the service for them. At the moment, a participant agent is represented by its own instance of the EDC Runtime; there is no differentiation between the two. This way of thinking works well for "self-hosting" companies that operate one or two instances, but it becomes significantly more complicated in all "Connector as a Service" scenarios (which also include integration into existing multitenant products). To provide this service at an affordable price (see also the calculation provided by @kbData), multitenancy support in the EDC Runtime would be appreciated. Besides price, operational concerns such as updates, upgrades, and security checks are significantly easier to handle than in multi-instance scenarios.

### Multitenant EDC as a Service

In a multitenant scenario, a customer would use a "managed EDC endpoint" (identified through the URL, self-description, catalog, etc.) which is isolated from the other tenants through authentication checks while using the same infrastructure. For this, the EDC would need to be "tenant-aware", i.e. it should be able to decode the current tenant information from the authentication token and manage the state of a request depending on it (e.g. present the catalog of the tenant identified by the authentication information). If we had, say, 1000 participants whose EDC endpoints are represented with tenant-specific URLs, we would only have to scale the number of running instances according to load.

### Multi-Instance EDC as a Service

Another alternative is to offer EDC as a service in a multi-instance scenario: every EDC participant gets its own managed instance of the EDC runtime with its own persistence. For the said 1000 participants purchasing the software, 1000 DB instances + 1000 runtime instances (plus gateways, certificates, etc.) would have to be spawned. Even though such a system can be automated, it produces significantly more cost, heat, entropy, climate-change-affecting gases ;) and overall more management overhead than the other scenario.

### Scheduling EDC Instances On Demand

@MoritzKeppler mentioned the idea that EDC instances could be started on demand and terminated as needed. However, it is not quite clear how this would work with the long-running workflows of EDC, especially since the EDC assumes that it is always running during workflow execution.

We are still ready to look into the multitenancy concept, as we are convinced that it is the best way to operate "EDC as a Service". However, I am getting the feeling that this is conceptually against the philosophy of the EDC core development team.

cc: @mspiekermann, @MoritzKeppler, @stefan-ettl, @paullatzelsperger, @alexandrudanciu and @kbData
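To make "tenant-aware" concrete, here is a minimal, purely illustrative Java sketch. None of the names (`TenantCatalogService`, `resolveTenant`, the `tenant=<id>` token format) come from the EDC codebase; a real implementation would verify a signed JWT and read a tenant claim instead of parsing a plain string.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of tenant-aware request handling: one logical
// catalog per tenant, all tenants served by the same shared instance.
public class TenantCatalogService {

    private final Map<String, List<String>> catalogsByTenant;

    public TenantCatalogService(Map<String, List<String>> catalogsByTenant) {
        this.catalogsByTenant = catalogsByTenant;
    }

    // Decode the tenant id from an authentication token. Here the "token"
    // is a trivial "tenant=<id>" string purely for illustration.
    public Optional<String> resolveTenant(String token) {
        if (token != null && token.startsWith("tenant=")) {
            var id = token.substring("tenant=".length());
            if (catalogsByTenant.containsKey(id)) {
                return Optional.of(id);
            }
        }
        return Optional.empty();
    }

    // Serve the catalog of the tenant identified by the token; unknown
    // tenants are rejected rather than falling through to another tenant.
    public List<String> catalogFor(String token) {
        return resolveTenant(token)
                .map(catalogsByTenant::get)
                .orElseThrow(() -> new IllegalArgumentException("unknown tenant"));
    }
}
```

The point of the sketch is the isolation property: every request is scoped to exactly one tenant derived from its credentials, while the infrastructure (process, database, gateway) is shared.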
-
Let me state upfront a few things:
Let's start with the design principles that underlie the EDC, as we must be clear on those at the outset.

### EDC Design Principles: Participant Agents and Identity

A participant agent is a software system that performs a specific operation or role in a dataspace. Currently, the following are participant agent types:
There will be more in the future. A fundamental design principle of EDC is that a participant agent is associated with one, and only one, identity. We often refer to an "EDC runtime" as an instantiation of a participant agent. Hence, a runtime is associated with one participant identity.

There are several key nuances that follow from the above design. First, process boundaries are purposely not specified by this architecture. They are a deployment design decision, and the EDC can support many diverse deployment topologies. For example, a participant agent could be deployed in its own process. However, it does not follow that a participant agent must be deployed in its own process. Multiple runtimes, each configured as a distinct participant agent, may be deployed within the same process "space." There are many ways to do that, which we can cover in detail.

The EDC design supports a myriad of operational requirements, and I have not seen any that would require us to rethink our approach. To state this slightly differently: you can likely already do what you want by adopting the "EDC as a platform" approach and its architecture to create a service offering that fits your requirements.

### EDC as a Service Deployment Possibilities

If you would like to deploy and manage EDC as a service for multiple organizations in a "dense" operational environment, I see at least two possibilities: a container-first approach that relies on the Kubernetes ecosystem, and a bespoke implementation. If I were designing for this type of operational environment, my personal preference would be to leverage the Kubernetes ecosystem, since the infrastructure it provides addresses many of the challenges of such a complex environment. However, your requirements or preferences may differ.

### The Container-First Approach

This approach is quite simple: run each EDC as a ReplicaSet and configure Kubernetes to perform routing, isolation, and other requirements.
The ReplicaSet can be scheduled on shared compute infrastructure. There is no need to run multiple gateways: map URLs to different Kubernetes ingress points. Similarly, you don't necessarily need separate "heavyweight database instances"; you could opt for lighter-weight alternatives instead.
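As a sketch of the container-first idea, the per-participant unit could look like the following Kubernetes manifest. All names (`edc-tenant-a`, the image, the host) are placeholders, the matching `Service` is omitted for brevity, and a real deployment would add secrets, resource limits, and network policies:

```yaml
# One Deployment (managing a ReplicaSet) per participant, scheduled
# onto shared nodes. Names and hosts are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: edc-tenant-a
  labels:
    app: edc-tenant-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: edc-tenant-a
  template:
    metadata:
      labels:
        app: edc-tenant-a
    spec:
      containers:
        - name: edc-runtime
          image: example.org/edc-runtime:latest   # placeholder image
          ports:
            - containerPort: 8181
---
# A single shared ingress controller maps each tenant URL to the right
# backing service, instead of running one gateway per participant.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: edc-tenant-a
spec:
  rules:
    - host: tenant-a.edc.example.org   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: edc-tenant-a
                port:
                  number: 8181
```

With this shape, onboarding a new participant is a matter of stamping out another Deployment/Service/Ingress triple from a template, while scheduling, routing, and isolation remain Kubernetes' job.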
One advantage of this approach is that it opens the possibility of leveraging Kubernetes' rich management and DevOps ecosystem instead of rolling your own. Another advantage concerns data sovereignty: data does not have to be "collocated" in the same process space, as the Kubernetes ecosystem can implement isolation at various levels (e.g. containers, network). Finally, it should be noted that Kubernetes-based systems can also be made to run in constrained and cost-efficient environments.

### Roll Your Own

If the previous approach is not an option, you can recreate some (but not all) of the capabilities the Kubernetes ecosystem already provides with EDC extensions. Returning to EDC design principles, I mentioned that it is possible to instantiate multiple EDC participant agents ("runtimes") in the same process space. The EDC JUnit launcher does exactly this. The EDC contains a lightweight core (no unnecessary dependencies, no application frameworks, etc.) that is memory-efficient and compact. Leverage that and build a multiplexer launcher by doing the following.

1. **Create a launcher.** The launcher is responsible for loading multiple runtimes and managing their lifecycle. A launcher could support a static configuration mechanism, or a dynamic one where runtimes are created and destroyed based on external events.
2. **Create a multiplexing-aware**
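The launcher step above can be sketched as follows. This is a purely hypothetical Java illustration of the lifecycle-management idea; `EmbeddedRuntime` is a stand-in class and does not reflect the actual EDC runtime API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an in-process EDC runtime; the real EDC
// bootstrapping API looks different.
class EmbeddedRuntime {
    private final String participantId;
    private boolean running;

    EmbeddedRuntime(String participantId) {
        this.participantId = participantId;
    }

    void boot()         { running = true; }
    void shutdown()     { running = false; }
    boolean isRunning() { return running; }
    String participantId() { return participantId; }
}

// The launcher owns one runtime per participant identity, so a single
// process can host many participant agents, matching the one-identity-
// per-runtime design principle described above.
public class MultiplexerLauncher {
    private final Map<String, EmbeddedRuntime> runtimes = new ConcurrentHashMap<>();

    // Could be driven statically from configuration, or dynamically from
    // external events (e.g. a new customer being onboarded).
    public EmbeddedRuntime start(String participantId) {
        return runtimes.computeIfAbsent(participantId, id -> {
            var runtime = new EmbeddedRuntime(id);
            runtime.boot();
            return runtime;
        });
    }

    public void stop(String participantId) {
        var runtime = runtimes.remove(participantId);
        if (runtime != null) {
            runtime.shutdown();
        }
    }

    public int runningCount() {
        return runtimes.size();
    }
}
```

Starting an already-running participant is idempotent here (`computeIfAbsent`), which keeps the launcher safe to drive from repeated external events.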
-
So, to summarize: "Connector as a Service" scenarios are not in the focus of the EDC platform, so there is no need for any contributions to EDC which would go into this direction, and the recommendation is to develop an external management system or a derived product of the EDC and direct the engineering efforts there?
-
First, thanks to @sn0wcat for clarifying and visualizing the requirements that make this discussion possible, and thanks to @jimmarino for the detailed explanation and recommendation. To add my perspective, I would like to highlight the part about "providing a platform that can enable those scenarios". It is important for me to clarify this, as it frames your statement "so there is no need for any contributions to EDC which would go into this direction". If initiatives that use the EDC (in the various operating models) encounter limitations that require changes to the architecture of the EDC and the platform approach described above, that is exactly the contribution we hope for, and we want to work together on a solution within the OSS project.
-
Hello!
As far as I know, there were plans to support a concept like "Connector as a Service" (let's call it CaaS) for those clients who can't or don't want to run their own instance of the Connector. This is especially true for SMEs (which may not even have an IT department). Companies operating CaaS for SMEs will handle multiple customers, potentially thousands of them (according, for example, to the CATENA-X KPI goals).
It is economically too expensive to host one Connector instance per customer. The resource costs alone would exceed an estimated 1k $/year, too expensive for an SME, and operational costs come on top.
We would like to have a multitenancy feature for the EDC connector, meaning a single EDC instance can serve multiple customers, with data kept separated between customers (tenant separation concept). Let us discuss this idea here.
With best regards
Kiryl