
Use a single JupyterHub instance to spawn pods in multiple clusters #34

Open
vvcb opened this issue May 6, 2024 · 4 comments

Comments

@vvcb
Contributor

vvcb commented May 6, 2024

Currently, JupyterHub spawns and manages pods in the same cluster that it is installed on.

However, it would be useful to be able to spawn pods on more than one cluster, all managed by the same JupyterHub instance.

The main use case for us right now is the ability to deploy pods to an on-prem cluster that may have bespoke compute, allowing us to make use of existing investments in our own or our partners' infrastructure. A good example is offloading GPU workloads to existing on-prem GPU compute to save costs.

It looks like @yuvipanda has already built https://github.com/yuvipanda/jupyterhub-multicluster-kubespawner, and it will be worth investigating this further.

This may also be something to consider for the remote access work.

@vvcb vvcb added the enhancement New feature or request label May 6, 2024
@qcaas-nhs-sjt
Collaborator

qcaas-nhs-sjt commented May 7, 2024

@vvcb I've reviewed this project and it looks like a good start; however, I have a number of concerns and potential improvements to raise before choosing this as an option.

The solution relies on an internal kubespawner to do the work, rather than following the standard Kubernetes-native design pattern in which controllers perform the bulk of the task. This is an omission in the JupyterHub kubespawner as well, so it is not surprising. It means that all of the work is carried out by a single service, and that service is the same one the user interacts with, creating a single vector for attack. If this service is breached through an exploit in a library that JupyterHub uses, it could be used to run other workloads on the cluster, potentially exposing the entire cluster and all of the data related to it. If we are going back to the drawing board on how the kubespawner works, I would expect us to want to close this gap.

Additionally, I am concerned about the use of a CLI application inside the application. In my experience, while this may work now, the CLI will likely change its syntax as it evolves, adding a layer of complexity that could ultimately make it harder to support. While it may seem easier to develop this way, it is ultimately a false economy. When it comes to exception handling, such models often obfuscate error messages and make debugging more difficult.

Kubernetes provides client libraries and design patterns for interacting with its APIs that are designed with an upgrade path in mind. In theory this should allow us to keep interacting with the API long into the future, regardless of any changes to the CLI, and should give us appropriate exception handling and feedback when something goes wrong. This may seem more difficult to learn, but ultimately it is just a standard, well-documented API, so it is not as hard to implement as it first appears.
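
For example, a minimal sketch (not code from the multicluster-kubespawner project) of talking to the API through the official kubernetes Python client, where failures come back as structured exceptions rather than CLI output that has to be parsed:

```python
# Minimal sketch: create a pod by talking to the Kubernetes API directly via
# the official Python client instead of shelling out to kubectl. Errors arrive
# as structured ApiException objects rather than text scraped from stderr.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-notebook"),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="notebook", image="quay.io/jupyter/base-notebook"),
    ]),
)

try:
    client.CoreV1Api().create_namespaced_pod(namespace="jupyter", body=pod)
except ApiException as exc:
    # exc.status, exc.reason and exc.body carry the API server's structured error.
    print(f"Failed to create pod: HTTP {exc.status} {exc.reason}")
```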

Security-wise I also have concerns: the model relies on the various clusters being able to talk directly to each other's control planes, which is not a model I'm comfortable with. Each cluster would then be capable of provisioning workloads on the linked clusters with very few real controls around what is being provisioned.

I would suggest instead that we work on a new version of the kubespawner that is designed to work the way Kubernetes is designed to work.

sequenceDiagram
    participant Hub
    participant API
    participant Operator
    
    Hub ->> Hub: User Logs into Jupyterhub and selects workspace
    Hub ->> API: Create Custom Resource
    Operator ->> API: Fetch Updated Custom Resources
    Operator ->> API: Create Pod and wait for readiness
    Operator ->> API: Update Status of Custom Resource to PodReady
    Operator ->> Operator: Update Proxy
    Operator ->> API: Update Status of Custom Resource to ProxyReady    
    Hub ->> API: Fetch Status
    Hub ->> Hub: Redirect User Session to Pod

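To make the "Hub ->> API: Create Custom Resource" step concrete, here is a rough hub-side sketch. The jupyter.example.org group, NotebookServer kind, and status.phase values are hypothetical placeholders, not a settled design:

```python
# Rough sketch of the hub side of the proposed flow: create a NotebookServer
# custom resource and wait for the operator to report ProxyReady in its status.
# Group, kind, and status fields are hypothetical.
import time

from kubernetes import client, config

config.load_incluster_config()  # the hub runs inside the cluster in z2jh
api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "jupyter.example.org", "v1", "notebookservers"


def spawn_notebook(username: str, namespace: str = "jupyter") -> None:
    body = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "NotebookServer",
        "metadata": {"name": f"nb-{username}"},
        "spec": {"user": username, "image": "quay.io/jupyter/base-notebook"},
    }
    api.create_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, body)

    # Poll the resource until the operator reports the proxy route is ready.
    while True:
        obj = api.get_namespaced_custom_object(
            GROUP, VERSION, namespace, PLURAL, f"nb-{username}")
        if obj.get("status", {}).get("phase") == "ProxyReady":
            return
        time.sleep(2)
```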

As part of the custom resource definition, you could have custom properties that allow developers to extend the notebook definition with their own metadata, which can then be used in their own implementation of the JupyterNotebooksOperator. We could also build the operator with event hooks to facilitate the development of extensions.

This means that if they wanted it to do other things, they would only need to extend the operator code.
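
As a rough illustration of what such an operator and its extension hooks could look like, here is a minimal sketch using the kopf framework; the jupyter.example.org group, NotebookServer kind, and spec.extensions field are hypothetical placeholders:

```python
# Minimal sketch of a "JupyterNotebooksOperator" built with the kopf framework.
# All resource names (group "jupyter.example.org", kind "NotebookServer") are
# hypothetical.
import kopf
import kubernetes

kubernetes.config.load_incluster_config()  # the operator runs inside the cluster


@kopf.on.create("jupyter.example.org", "v1", "notebookservers")
def create_notebook(spec, name, namespace, **kwargs):
    """Create the user pod when a NotebookServer custom resource appears."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"nb-{name}", "namespace": namespace},
        "spec": {"containers": [{
            "name": "notebook",
            "image": spec.get("image", "quay.io/jupyter/base-notebook"),
        }]},
    }
    kubernetes.client.CoreV1Api().create_namespaced_pod(namespace, pod)
    # kopf records the return value under status.create_notebook on the
    # resource, which the Hub (or an extension) can poll for readiness.
    return {"phase": "PodReady"}


# Hypothetical extension hook: downstream implementations react to their own
# metadata under spec.extensions without touching the core operator logic.
@kopf.on.field("jupyter.example.org", "v1", "notebookservers", field="spec.extensions")
def on_extensions_change(old, new, name, body, **kwargs):
    kopf.info(body, reason="ExtensionsChanged",
              message=f"{name}: extensions changed from {old!r} to {new!r}")
```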

We can then add a pub/sub model between clusters that allows another cluster to stay updated and provide feedback on just the one resource type we care about (our custom resource). The resource would then be picked up and implemented on the relevant cluster:

sequenceDiagram
    participant Hub-1
    participant API-1
    participant Operator-1
    participant Publisher-1
    participant Subscriber-1

    participant Subscriber-2
    participant Publisher-2
    participant API-2
    participant Operator-2

    Hub-1 ->> Hub-1: User Logs into Jupyterhub and selects workspace
    Hub-1 ->> API-1: Create Custom Resource
    par
        Operator-1 ->> API-1: Fetch Updated Custom Resources
        Operator-1 ->> Operator-1: No Action as managed off server
    and
        Publisher-1 ->> API-1: Fetch Updated Custom Resources
        Subscriber-2 ->> Publisher-1: Fetch Updated Custom Resources
        Subscriber-2 ->> API-2: Custom Resource Added 
        Operator-2 ->> API-2: Fetch Updated Custom Resources
        par
            Operator-2 ->> API-2: Create Pod and wait for readiness
        and
            Operator-2 ->> API-2: Create Service and wait for readiness
        end
        Operator-2 ->> API-2: Update Status of Custom Resource to PodReady
        Publisher-2 ->> API-2: Fetch Changes to Custom Resource
        Subscriber-1 ->> Publisher-2: Fetch Changes to Custom Resource
        Subscriber-1 ->> API-1: Update Status of Custom Resource to PodReady
        Operator-1 ->> API-1: Fetch Changes to Custom Resource
        Operator-1 ->> Operator-1: Update Proxy
        Operator-1 ->> API-1: Update Status of Custom Resource to ProxyReady    
    end
    Hub-1 ->> API-1: Fetch Status
    Hub-1 ->> Hub-1: Redirect User Session to Pod

This model would have instances of everything running on both clusters, with the publishers able to serve many subscribers at once. We would record ownership of each resource against a specific cluster so that we know which cluster created it, which implemented it, and so on. We could even add election logic that allows multiple clusters to bid on a resource based on capacity: the cluster that meets the required capabilities and has the greatest amount of free resource would be responsible for implementing it.
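
A minimal sketch of the subscriber half of that pub/sub flow, assuming the same hypothetical NotebookServer resource, a spec.targetCluster ownership field, and kubeconfig contexts named hub-cluster and compute-cluster (all placeholders):

```python
# Minimal sketch of a subscriber: watch NotebookServer resources on the hub
# cluster and mirror the ones owned by this cluster into the local API server,
# where the local operator acts on them. All names are illustrative.
from kubernetes import client, config, watch

GROUP, VERSION, PLURAL = "jupyter.example.org", "v1", "notebookservers"

# One API client per cluster, selected by kubeconfig context.
hub_api = client.CustomObjectsApi(
    config.new_client_from_config(context="hub-cluster"))
local_api = client.CustomObjectsApi(
    config.new_client_from_config(context="compute-cluster"))


def mirror_notebookservers(namespace: str = "jupyter") -> None:
    """Replicate NotebookServer resources targeted at this cluster."""
    for event in watch.Watch().stream(
            hub_api.list_namespaced_custom_object,
            GROUP, VERSION, namespace, PLURAL):
        obj = event["object"]
        # Only mirror resources whose ownership metadata points at this cluster.
        if obj["spec"].get("targetCluster") != "compute-cluster":
            continue
        name = obj["metadata"]["name"]
        if event["type"] == "ADDED":
            body = {
                "apiVersion": f"{GROUP}/{VERSION}",
                "kind": "NotebookServer",
                "metadata": {"name": name},
                "spec": obj["spec"],
            }
            local_api.create_namespaced_custom_object(
                GROUP, VERSION, namespace, PLURAL, body)
        elif event["type"] == "DELETED":
            local_api.delete_namespaced_custom_object(
                GROUP, VERSION, namespace, PLURAL, name)


if __name__ == "__main__":
    mirror_notebookservers()
```

The status updates flowing back (PodReady, ProxyReady) would travel the same route in the opposite direction, as in the diagram above.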

This would require the development of the following:

  • The new CRD for jupyter notebook instances
  • Python libraries based upon the CRD (potentially could be autogenerated)
  • A new version of kubespawner for kubernetes interoperability
  • Jupyter Notebooks Operator
  • Jupyter Notebooks Publisher Service
  • Jupyter Notebooks Subscriber Service
  • Update the z2jh helm charts

Ultimately, these should be relatively simple to implement, and I suspect this new framework would be extremely useful to the community, as it would make extension easier than it is at present and would increase the security of the solution.

@qcaas-nhs-sjt
Collaborator

qcaas-nhs-sjt commented May 7, 2024

Per my conversation with @vvcb, I have raised the primary design pattern as an issue on the kubespawner project:
jupyterhub/kubespawner#839

@yuvipanda

Excited to see ongoing conversations about this :)

Additionally, I am concerned about the use of a CLI application inside the application. In my experience, while this may work now, the CLI will likely change its syntax as it evolves, adding a layer of complexity that could ultimately make it harder to support. While it may seem easier to develop this way, it is ultimately a false economy. When it comes to exception handling, such models often obfuscate error messages and make debugging more difficult.

100% agreed! I used kubectl apply intentionally in the prototype because server-side apply (https://kubernetes.io/docs/reference/using-api/server-side-apply/) was still being worked on, and I knew that when it became available I could rip out the CLI and use that instead. kubectl apply has a --server-side flag now, and my migration path was to move to that first, and then to talk directly to the API.

Mostly just wanted to quickly respond here, as I wanted to explain away that particular code smell :) I'll try to respond to the other bits over the next day or so.
