[Feature Request]: Move tolerationSettings from notebooks generally to data science projects #1306

Open
shalberd opened this issue May 29, 2023 · 6 comments
Labels
community feature/ds-projects Data Science Projects feature (formerly Data Science Groupings - DSG) kind/enhancement New functionality request (existing augments or new additions) priority/normal An issue with the product; fix when possible

Comments

@shalberd
Contributor

shalberd commented May 29, 2023

Feature description

Currently, the notebook toleration settings from the ODH Dashboard config (OdhDashboardConfig) apply to all notebooks in all namespaces.

Assume we have a cluster with different dedicated nodes per customer:

  • nodes A-B (worker nodes) tainted NoExecute, Equal, key: customer, value: customer1
  • nodes C-D (worker nodes) tainted NoExecute, Equal, key: customer, value: customer2
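To make the scenario concrete, here is a minimal sketch of those taints as data, together with the `kubectl taint nodes` invocations that would apply them. The node names are hypothetical; only the key, values, and effect come from the list above. Note that a taint carries just key, value, and effect; the Equal operator belongs to the toleration that has to match it.

```python
# Hypothetical node names; key, values, and effect are taken from the scenario.
NODE_TAINTS = {
    "node-a": {"key": "customer", "value": "customer1", "effect": "NoExecute"},
    "node-b": {"key": "customer", "value": "customer1", "effect": "NoExecute"},
    "node-c": {"key": "customer", "value": "customer2", "effect": "NoExecute"},
    "node-d": {"key": "customer", "value": "customer2", "effect": "NoExecute"},
}


def taint_command(node: str, taint: dict) -> str:
    """Render the kubectl command that would apply this taint to the node."""
    return f"kubectl taint nodes {node} {taint['key']}={taint['value']}:{taint['effect']}"


if __name__ == "__main__":
    for node, taint in NODE_TAINTS.items():
        print(taint_command(node, taint))
```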

The idea is to have namespaces per customer. It can be one namespace per user (I have grown used to that concept), but there needs to be a way to ensure that user / workbench namespaces can belong to different customers and get different scheduling placements, that is, different nodes on which their pods land.

So, my suggestion would be to

  • move notebookTolerationSettings in the ODH Dashboard config from being a global setting for all notebooks in all namespaces to tolerationSettings on specific Data Science Projects, that is, namespace / project-specific
  • change effect from NoSchedule to NoExecute so that pods already running on a tainted node without a matching toleration are evicted and rescheduled onto an untainted node
  • change operator from Exists to Equal. Exists is ok for evaluating node taint keys like nvidia.com/gpu, where the value does not matter, I presume. But it is not ok for tolerations where key AND value must match, as in the scenario described above: matching only key: customer would let a pod land on either customer's nodes.
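The difference between Exists and Equal can be sketched as follows. This is a small Python re-statement of the Kubernetes toleration matching rule as I understand it (not the dashboard's code): an empty effect or key on the toleration matches anything, Exists ignores the value, and Equal (the default operator) requires the value to match.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint (Kubernetes-style rule)."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # a non-empty effect must match the taint's effect
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False  # a non-empty key must match the taint's key
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True  # the value is ignored
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


customer1_taint = {"key": "customer", "value": "customer1", "effect": "NoExecute"}
customer2_taint = {"key": "customer", "value": "customer2", "effect": "NoExecute"}

exists_toleration = {"key": "customer", "operator": "Exists", "effect": "NoExecute"}
equal_toleration = {
    "key": "customer",
    "operator": "Equal",
    "value": "customer1",
    "effect": "NoExecute",
}

if __name__ == "__main__":
    for name, tol in [("Exists", exists_toleration), ("Equal", equal_toleration)]:
        print(name, tolerates(tol, customer1_taint), tolerates(tol, customer2_taint))
```

With Exists, a customer1 notebook tolerates both taints and could be scheduled onto customer2's nodes; with Equal and value customer1 it only tolerates the taint on nodes A-B.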

Describe alternatives you've considered

For now, we do not have multiple customers with data science project namespaces grouped per customer, so we schedule all notebooks on nodes with a given node taint key, e.g. key: opendatahub, using the existing mechanism in OdhDashboardConfig.

But going forward, moving from for-all configs to namespace-specific ones will become important, whether for tolerations or for things like linking all service accounts, including the dynamically created ones for notebooks in data science projects, to an image pull secret.
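A rough sketch of what namespace-specific resolution with a global fallback could look like. Everything here is an assumption for illustration: the `PROJECT_SETTINGS` mapping, its field names, and the namespace names are hypothetical, and where such settings would actually live (project annotation, custom resource, ...) is open. The global fallback mirrors today's behaviour as described above (key only, Exists, NoSchedule).

```python
# Assumed shape of the existing global setting: enabled flag plus a taint key.
GLOBAL_SETTINGS = {"enabled": True, "key": "opendatahub"}

# Hypothetical per-project settings, keyed by namespace.
PROJECT_SETTINGS = {
    "customer1-alice": {"enabled": True, "key": "customer", "value": "customer1"},
    "customer2-bob": {"enabled": True, "key": "customer", "value": "customer2"},
}


def resolve_tolerations(namespace: str, project_settings: dict, global_settings: dict) -> list:
    """Pick tolerations for a notebook pod: project setting first, then global."""
    project = project_settings.get(namespace)
    if project and project.get("enabled"):
        return [{
            "key": project["key"],
            "operator": "Equal",
            "value": project["value"],
            "effect": "NoExecute",
        }]
    if global_settings.get("enabled"):
        return [{"key": global_settings["key"], "operator": "Exists", "effect": "NoSchedule"}]
    return []


if __name__ == "__main__":
    print(resolve_tolerations("customer1-alice", PROJECT_SETTINGS, GLOBAL_SETTINGS))
    print(resolve_tolerations("some-other-project", PROJECT_SETTINGS, GLOBAL_SETTINGS))
```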

Anything else?

No response

@shalberd shalberd added kind/enhancement New functionality request (existing augments or new additions) untriaged Indicates the newly create issue has not been triaged yet labels May 29, 2023
@github-project-automation github-project-automation bot moved this to Needs prioritization in ODH Dashboard Planning May 29, 2023
@Gkrumbach07 Gkrumbach07 added feature/ds-projects Data Science Projects feature (formerly Data Science Groupings - DSG) priority/normal An issue with the product; fix when possible and removed untriaged Indicates the newly create issue has not been triaged yet labels May 31, 2023
@Gkrumbach07
Member

cc @andrewballantyne

@Gkrumbach07 Gkrumbach07 moved this from Needs prioritization to Backlog in ODH Dashboard Planning May 31, 2023
@Gkrumbach07 Gkrumbach07 added the needs-info Further information is requested from the reporter or from another source label May 31, 2023
@andrewballantyne andrewballantyne moved this from Backlog to To do in ODH Dashboard Planning Sep 15, 2023
@bdattoma

Could this be applied to models as well? Maybe we could have a set of tolerations to allow models to be served on GPU nodes which are dedicated to serving by means of taints.

@andrewballantyne
Member

> Could this be applied to models as well? Maybe we could have a set of tolerations to allow models to be served on GPU nodes which are dedicated to serving by means of taints.

This no longer applies now that we have AcceleratorProfiles; I think RHOAI 1.33 or 2.4 introduced them. Tolerations for GPU usage, which let you use taints effectively, are already covered @bdattoma

This request is for allowing more flexibility in general tolerations for Notebooks (and in general I imagine all of a set of DS Project resources -- unrelated to GPUs or Accelerators)

@andrewballantyne
Member

I think this predates the UX flow. Moving to UX.

UX Context

I think we need to design a way to bring the NotebookTolerations cluster settings to the project so the user can manage their resources against tolerations. This may be more feasible with the added state in the admin view of Habana part 2 and the toleration modal. #1255

@andrewballantyne andrewballantyne moved this from Dev To do to UX Backlog in ODH Dashboard Planning Nov 23, 2023
@andrewballantyne andrewballantyne removed the needs-info Further information is requested from the reporter or from another source label Nov 23, 2023
@bdattoma

> This no longer applies now that we have AcceleratorProfiles; I think RHOAI 1.33 or 2.4 introduced them. Tolerations for GPU usage, which let you use taints effectively, are already covered @bdattoma

Is it possible to set a custom toleration for the accelerator, for cases where I don't want to use the default nvidia.com/gpu, which I think is added automatically when attaching the GPU profile?

@andrewballantyne
Member

> > This no longer applies now that we have AcceleratorProfiles; I think RHOAI 1.33 or 2.4 introduced them. Tolerations for GPU usage, which let you use taints effectively, are already covered @bdattoma
>
> Is it possible to set a custom toleration for the accelerator, for cases where I don't want to use the default nvidia.com/gpu, which I think is added automatically when attaching the GPU profile?

@bdattoma Yes it is -- when you create the AcceleratorProfile (or modify the one we create on migration) you can pick whatever tolerations you want and as many as you want. Our old world was a single static toleration, so we migrate with that -- but it is modifiable.
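For illustration, a sketch of an AcceleratorProfile carrying more than one toleration, built as a plain Python dict and printed as JSON. The apiVersion and spec field names reflect my understanding of the resource and should be treated as assumptions to verify against the CRD installed on your cluster; the profile name and the second toleration are hypothetical.

```python
import json


def accelerator_profile(name: str, display_name: str, identifier: str, tolerations: list) -> dict:
    """Build an AcceleratorProfile-shaped manifest (field names are assumptions)."""
    return {
        "apiVersion": "dashboard.opendatahub.io/v1",  # assumed API group/version
        "kind": "AcceleratorProfile",
        "metadata": {"name": name},
        "spec": {
            "displayName": display_name,
            "enabled": True,
            "identifier": identifier,
            "tolerations": list(tolerations),
        },
    }


profile = accelerator_profile(
    name="serving-gpu",  # hypothetical profile name
    display_name="GPU (serving nodes)",
    identifier="nvidia.com/gpu",
    tolerations=[
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
        # Hypothetical extra toleration for nodes dedicated to serving.
        {"key": "dedicated", "operator": "Equal", "value": "serving", "effect": "NoSchedule"},
    ],
)

if __name__ == "__main__":
    print(json.dumps(profile, indent=2))
```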

The Admin UI is coming in 2.6 I believe, and is currently in incubation if you want to check it out. The tracker: #1255


5 participants