
Use a single JupyterHub instance to spawn pods in multiple clusters #34

Open
vvcb opened this issue May 6, 2024 · 4 comments

Comments

@vvcb
Contributor

vvcb commented May 6, 2024

Currently, JupyterHub spawns and manages pods in the same cluster that it is installed on.

However, it would be useful to be able to spawn pods on more than one cluster, all managed by the same JupyterHub instance.

The main use case for us right now is the ability to deploy pods to an on-prem cluster that may have bespoke compute, allowing us to make use of existing investments in our own or our partners' infrastructure. A good example is offloading GPU workloads to existing on-prem GPU compute to save costs.

It looks like @yuvipanda has already built https://github.com/yuvipanda/jupyterhub-multicluster-kubespawner, and it will be worth investigating this further.

This may also be something to consider for the remote access work.

@vvcb vvcb added the enhancement New feature or request label May 6, 2024
@qcaas-nhs-sjt
Collaborator

qcaas-nhs-sjt commented May 7, 2024

@vvcb I've reviewed this project and it looks like a good start; however, I have a number of concerns and potential improvements to raise before choosing this as an option.

The solution relies on an internal kubespawner to do the work, rather than following the standard Kubernetes-native design pattern in which controllers perform the bulk of the task. This is an omission in the JupyterHub kubespawner as well, so it is not surprising. It means that all of the work is carried out by a single service, and that service is the same one the user interacts with, creating a single vector for attack. If this service is breached through an exploit in a library that JupyterHub uses, it could be used to run other workloads on the cluster, potentially exposing the entire cluster and all of the data related to it. If we are going back to the drawing board on how the kubespawner works, I would expect us to want to close this gap.

Additionally, I am concerned about the use of a CLI application inside the application. In my experience, while this may work now, the CLI will likely change its syntax as it evolves, adding a layer of complexity that could ultimately make it harder to support. While it may seem easier to develop this way, it is ultimately a false economy. When it comes to exception handling, such models often obfuscate error messages and make debugging more difficult.

Kubernetes provides client libraries and design patterns for interacting with its APIs that are designed with an upgrade path in mind. In theory this should allow us to keep interacting with the API long into the future, regardless of any changes to the CLI, and should give us appropriate exception handling and feedback when something goes wrong. This may seem more difficult to learn, but ultimately it is just a standard, well-documented API, so it is not as hard to implement as it first appears.
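
For example, a minimal sketch (not code from the multicluster-kubespawner project) of talking to the API through the official kubernetes Python client, where failures come back as structured exceptions rather than CLI output that has to be parsed:

```python
# Minimal sketch: create a pod by talking to the Kubernetes API directly via
# the official Python client instead of shelling out to kubectl. Errors arrive
# as structured ApiException objects rather than text scraped from stderr.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="example-notebook"),
    spec=client.V1PodSpec(containers=[
        client.V1Container(name="notebook", image="quay.io/jupyter/base-notebook"),
    ]),
)

try:
    client.CoreV1Api().create_namespaced_pod(namespace="jupyter", body=pod)
except ApiException as exc:
    # exc.status, exc.reason and exc.body carry the API server's structured error.
    print(f"Failed to create pod: HTTP {exc.status} {exc.reason}")
```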

Security-wise I also have concerns: the model relies on the various clusters being able to talk directly to each other's control planes, which is not a model I'm comfortable with. Each cluster would then be capable of provisioning workloads on the linked clusters with very few real controls around what is being provisioned.

I would suggest instead that we work on a new version of the kubespawner that is designed to work the way Kubernetes is designed to work.

sequenceDiagram
    participant Hub
    participant API
    participant Operator
    
    Hub ->> Hub: User Logs into Jupyterhub and selects workspace
    Hub ->> API: Create Custom Resource
    Operator ->> API: Fetch Updated Custom Resources
    Operator ->> API: Create Pod and wait for readiness
    Operator ->> API: Update Status of Custom Resource to PodReady
    Operator ->> Operator: Update Proxy
    Operator ->> API: Update Status of Custom Resource to ProxyReady    
    Hub ->> API: Fetch Status
    Hub ->> Hub: Redirect User Session to Pod

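To make the "Hub ->> API: Create Custom Resource" step concrete, here is a rough hub-side sketch. The jupyter.example.org group, NotebookServer kind, and status.phase values are hypothetical placeholders, not a settled design:

```python
# Rough sketch of the hub side of the proposed flow: create a NotebookServer
# custom resource and wait for the operator to report ProxyReady in its status.
# Group, kind, and status fields are hypothetical.
import time

from kubernetes import client, config

config.load_incluster_config()  # the hub runs inside the cluster in z2jh
api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "jupyter.example.org", "v1", "notebookservers"


def spawn_notebook(username: str, namespace: str = "jupyter") -> None:
    body = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "NotebookServer",
        "metadata": {"name": f"nb-{username}"},
        "spec": {"user": username, "image": "quay.io/jupyter/base-notebook"},
    }
    api.create_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, body)

    # Poll the resource until the operator reports the proxy route is ready.
    while True:
        obj = api.get_namespaced_custom_object(
            GROUP, VERSION, namespace, PLURAL, f"nb-{username}")
        if obj.get("status", {}).get("phase") == "ProxyReady":
            return
        time.sleep(2)
```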

As part of the custom resource definition, you could have custom properties that allow developers to extend the notebook definition with their own metadata, which can then be used in their own implementation of the JupyterNotebooksOperator. We could also build the operator with event hooks to facilitate the development of extensions.

This means that if they wanted it to do other things, they would only need to extend the operator code.
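
As a rough illustration of what such an operator and its extension hooks could look like, here is a minimal sketch using the kopf framework; the jupyter.example.org group, NotebookServer kind, and spec.extensions field are hypothetical placeholders:

```python
# Minimal sketch of a "JupyterNotebooksOperator" built with the kopf framework.
# All resource names (group "jupyter.example.org", kind "NotebookServer") are
# hypothetical.
import kopf
import kubernetes

kubernetes.config.load_incluster_config()  # the operator runs inside the cluster


@kopf.on.create("jupyter.example.org", "v1", "notebookservers")
def create_notebook(spec, name, namespace, **kwargs):
    """Create the user pod when a NotebookServer custom resource appears."""
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"nb-{name}", "namespace": namespace},
        "spec": {"containers": [{
            "name": "notebook",
            "image": spec.get("image", "quay.io/jupyter/base-notebook"),
        }]},
    }
    kubernetes.client.CoreV1Api().create_namespaced_pod(namespace, pod)
    # kopf records the return value under status.create_notebook on the
    # resource, which the Hub (or an extension) can poll for readiness.
    return {"phase": "PodReady"}


# Hypothetical extension hook: downstream implementations react to their own
# metadata under spec.extensions without touching the core operator logic.
@kopf.on.field("jupyter.example.org", "v1", "notebookservers", field="spec.extensions")
def on_extensions_change(old, new, name, body, **kwargs):
    kopf.info(body, reason="ExtensionsChanged",
              message=f"{name}: extensions changed from {old!r} to {new!r}")
```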

We can then add a pub/sub model between clusters that allows another cluster to stay updated and provide feedback on just the one resource type we care about (our custom resource). The resource would then be picked up and implemented on the relevant cluster:

sequenceDiagram
    participant Hub-1
    participant API-1
    participant Operator-1
    participant Publisher-1
    participant Subscriber-1

    participant Subscriber-2
    participant Publisher-2
    participant API-2
    participant Operator-2

    Hub-1 ->> Hub-1: User Logs into Jupyterhub and selects workspace
    Hub-1 ->> API-1: Create Custom Resource
    par
        Operator-1 ->> API-1: Fetch Updated Custom Resources
        Operator-1 ->> Operator-1: No Action as managed off server
    and
        Publisher-1 ->> API-1: Fetch Updated Custom Resources
        Subscriber-2 ->> Publisher-1: Fetch Updated Custom Resources
        Subscriber-2 ->> API-2: Custom Resource Added 
        Operator-2 ->> API-2: Fetch Updated Custom Resources
        par
            Operator-2 ->> API-2: Create Pod and wait for readiness
        and
            Operator-2 ->> API-2: Create Service and wait for readiness
        end
        Operator-2 ->> API-2: Update Status of Custom Resource to PodReady
        Publisher-2 ->> API-2: Fetch Changes to Custom Resource
        Subscriber-1 ->> Publisher-2: Fetch Changes to Custom Resource
        Subscriber-1 ->> API-1: Update Status of Custom Resource to PodReady
        Operator-1 ->> API-1: Fetch Changes to Custom Resource
        Operator-1 ->> Operator-1: Update Proxy
        Operator-1 ->> API-1: Update Status of Custom Resource to ProxyReady    
    end
    Hub-1 ->> API-1: Fetch Status
    Hub-1 ->> Hub-1: Redirect User Session to Pod

This model would have instances of everything running on both clusters, with the publishers able to serve many subscribers at once. We would record ownership of each resource against a specific cluster so that we know which cluster created it, which implemented it, and so on. We could even add election logic that allows multiple clusters to bid on a resource based on capacity: the cluster that meets the required capabilities and has the greatest amount of free resource would be responsible for implementing it.
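
A minimal sketch of the subscriber half of that pub/sub flow, assuming the same hypothetical NotebookServer resource, a spec.targetCluster ownership field, and kubeconfig contexts named hub-cluster and compute-cluster (all placeholders):

```python
# Minimal sketch of a subscriber: watch NotebookServer resources on the hub
# cluster and mirror the ones owned by this cluster into the local API server,
# where the local operator acts on them. All names are illustrative.
from kubernetes import client, config, watch

GROUP, VERSION, PLURAL = "jupyter.example.org", "v1", "notebookservers"

# One API client per cluster, selected by kubeconfig context.
hub_api = client.CustomObjectsApi(
    config.new_client_from_config(context="hub-cluster"))
local_api = client.CustomObjectsApi(
    config.new_client_from_config(context="compute-cluster"))


def mirror_notebookservers(namespace: str = "jupyter") -> None:
    """Replicate NotebookServer resources targeted at this cluster."""
    for event in watch.Watch().stream(
            hub_api.list_namespaced_custom_object,
            GROUP, VERSION, namespace, PLURAL):
        obj = event["object"]
        # Only mirror resources whose ownership metadata points at this cluster.
        if obj["spec"].get("targetCluster") != "compute-cluster":
            continue
        name = obj["metadata"]["name"]
        if event["type"] == "ADDED":
            body = {
                "apiVersion": f"{GROUP}/{VERSION}",
                "kind": "NotebookServer",
                "metadata": {"name": name},
                "spec": obj["spec"],
            }
            local_api.create_namespaced_custom_object(
                GROUP, VERSION, namespace, PLURAL, body)
        elif event["type"] == "DELETED":
            local_api.delete_namespaced_custom_object(
                GROUP, VERSION, namespace, PLURAL, name)


if __name__ == "__main__":
    mirror_notebookservers()
```

The status updates flowing back (PodReady, ProxyReady) would travel the same route in the opposite direction, as in the diagram above.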

This would require the development of the following:

  • The new CRD for jupyter notebook instances
  • Python libraries based upon the CRD (potentially could be autogenerated)
  • A new version of kubespawner for kubernetes interoperability
  • Jupyter Notebooks Operator
  • Jupyter Notebooks Publisher Service
  • Jupyter Notebooks Subscriber Service
  • Update the z2jh helm charts

Ultimately, these should be relatively simple to implement, and I suspect this new framework would be extremely useful to the community, as it would make extension easier than it is at present and would increase the security of the solution.

@qcaas-nhs-sjt
Collaborator

qcaas-nhs-sjt commented May 7, 2024

Per my conversation with @vvcb, I have raised the primary design pattern as an issue on the kubespawner project:
jupyterhub/kubespawner#839

@yuvipanda

Excited to see ongoing conversations about this :)

Additionally, I am concerned about the use of a CLI application inside the application. In my experience, while this may work now, the CLI will likely change its syntax as it evolves, adding a layer of complexity that could ultimately make it harder to support. While it may seem easier to develop this way, it is ultimately a false economy. When it comes to exception handling, such models often obfuscate error messages and make debugging more difficult.

100% agreed! I used kubectl apply intentionally in the prototype because server-side apply (https://kubernetes.io/docs/reference/using-api/server-side-apply/) was still being worked on, and I knew that when it became available I could rip out the CLI and use that instead. kubectl apply has a --server-side flag now, and my migration path was to move to that first, and then to talk directly to the API.

Mostly just wanted to quickly respond here, as I wanted to explain away that particular code smell :) I'll try to respond to the other bits over the next day or so.
