This repository has been archived by the owner on Aug 9, 2024. It is now read-only.
feat!: overhaul slurmctld charm API #26
Merged: NucciTheBoss merged 2 commits into charmed-hpc:main from jamesbeedy:slurm_config_editor_preparation on Jun 27, 2024 (+1,370 −2,267).
@@ -8,3 +8,4 @@ __pycache__/
 .idea
 .vscode/
 version
+.ruff_cache
@@ -22,28 +22,15 @@ links:
source:
- https://github.com/charmed-hpc/slurmctld-operator

peers:
  slurmctld-peer:
    interface: slurmctld-peer
requires:
  slurmd:
    interface: slurmd
  slurmdbd:
    interface: slurmdbd
  slurmrestd:
    interface: slurmrestd
  influxdb-api:
    interface: influxdb-api
  elasticsearch:
    interface: elasticsearch
  fluentbit:
    interface: fluentbit
provides:
  prolog-epilog:
    interface: prolog-epilog
  grafana-source:
    interface: grafana-source
    scope: global

assumes:
- juju

@@ -58,163 +45,102 @@ bases:
    channel: "22.04"
    architectures: [amd64]

parts:
  charm:
    build-packages: [git]
    charm-python-packages: [setuptools]

  # Create a version file and pack it into the charm. This is dynamically generated
  # as part of the build process for a charm to ensure that the git revision of the
  # charm is always recorded in this version file.
  version-file:
    plugin: nil
    build-packages:
    - git
    override-build: |
      VERSION=$(git -C $CRAFT_PART_SRC/../../charm/src describe --dirty --always)
      echo "Setting version to $VERSION"
      echo $VERSION > $CRAFT_PART_INSTALL/version
    stage:
    - version

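The `override-build` step of the `version-file` part can be sketched in isolation. The throwaway repository below stands in for the real charm checkout; paths like `$CRAFT_PART_SRC` only exist inside the charmcraft build environment, so they are replaced here with a temp directory:

```shell
# Minimal sketch of what the version-file part records, using a throwaway
# git repo instead of the real $CRAFT_PART_SRC/../../charm/src checkout.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m "init"
# With no tags present, `git describe --dirty --always` falls back to the
# abbreviated commit hash, appending `-dirty` if the tree has uncommitted changes.
VERSION=$(git describe --dirty --always)
echo "Setting version to $VERSION"
echo "$VERSION" > version
```

In the real part, the `version` file is then staged into the charm so the packed artifact always carries the git revision it was built from.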
config:
  options:
    custom-slurm-repo:
      type: string
      default: ""
      description: >
        Use a custom repository for Slurm installation.

        This can be set to the Organization's local mirror/cache of packages and
        supersedes the Omnivector repositories. Alternatively, it can be used to
        track a `testing` Slurm version, e.g. by setting to
        `ppa:omnivector/osd-testing`.

        Note: The configuration `custom-slurm-repo` must be set *before*
        deploying the units. Changing this value after deploying the units will
        not reinstall Slurm.
    cluster-name:
      type: string
      default: osd-cluster
      description: >
      description: |
        Name to be recorded in database for jobs from this cluster.

        This is important if a single database is used to record information from
        multiple Slurm-managed clusters.

    default-partition:
      type: string
      default: ""
      description: >
      description: |
        Default Slurm partition. This is only used if defined, and must match an
        existing partition.
    custom-config:

    slurm-conf-parameters:
      type: string
      default: ""
      description: >
        User supplied Slurm configuration.

        This value supplements the charm supplied `slurm.conf` that is used for
        Slurm Controller and Compute nodes.
      description: |
        User supplied Slurm configuration as a multiline string.

        Example usage:
        $ juju config slurmcltd custom-config="FirstJobId=1234"
    proctrack-type:
      type: string
      default: proctrack/cgroup
      description: >
        Identifies the plugin to be used for process tracking on a job step
        basis.
    cgroup-config:
        $ juju config slurmcltd slurm-conf-parameters="$(cat additional.conf)"

    cgroup-parameters:
      type: string
      default: |
        CgroupAutomount=yes
        ConstrainCores=yes
      description: >
        Configuration content for `cgroup.conf`.
      description: |
        User supplied configuration for `cgroup.conf`.
Review comment: Same thing as above. We're giving users a lot of power over their cluster here. We could possibly roll this into an action where the user has to acknowledge which file they're updating.
    health-check-params:
      default: ""
      type: string
      description: >
      description: |
        Extra parameters for NHC command.

        This option can be used to customize how NHC is called, e.g. to send an
        e-mail to an admin when NHC detects an error set this value to
        e-mail to an admin when NHC detects an error set this value to.
        `-M [email protected]`.

    health-check-interval:
      default: 600
      type: int
      description: Interval in seconds between executions of the Health Check.

    health-check-state:
      default: "ANY,CYCLE"
      type: string
      description: Only run the Health Check on nodes in this state.

    acct-gather-frequency:
      type: string
      default: "task=30"
      description: >
        Accounting and profiling sampling intervals for the acct_gather plugins.

        Note: A value of `0` disables the periodic sampling. In this case, the
        accounting information is collected when the job terminates.

        Example usage:
        $ juju config slurmcltd acct-gather-frequency="task=30,network=30"
    acct-gather-custom:
      type: string
      default: ""
      description: >
        User supplied `acct_gather.conf` configuration.

        This value supplements the charm supplied `acct_gather.conf` file that is
        used for configuring the acct_gather plugins.

actions:
  show-current-config:
    description: >
    description: |
      Display the currently used `slurm.conf`.

      Note: This file only exists in `slurmctld` charm and is automatically
      distributed to all compute nodes by Slurm.

      Example usage:
      $ juju run-action slurmctld/leader --format=json --wait | jq .[].results.slurm.conf | xargs -I % -0 python3 -c 'print(%)'

      ```bash
      juju run slurmctld/leader show-current-config \
        --quiet --format=json | jq .[].results.slurm.conf | xargs -I % -0 python3 -c 'print(%)'
      ```

  drain:
    description: >
    description: |
      Drain specified nodes.

      Example usage:
      $ juju run-action slurmctld/leader drain nodename=node-[1,2] reason="Updating kernel"
      $ juju run slurmctld/leader drain nodename="node-[1,2]" reason="Updating kernel"
    params:
      nodename:
        type: string
        description: The nodes to drain, using the Slurm format, e.g. `node-[1,2]`.
        description: The nodes to drain, using the Slurm format, e.g. `"node-[1,2]"`.
      reason:
        type: string
        description: Reason to drain the nodes.
    required:
    - nodename
    - reason
  resume:
    description: >
    description: |
      Resume specified nodes.

      Note: Newly added nodes will remain in the `down` state until configured,
      with the `node-configured` action.

      Example usage: $ juju run-action slurmctld/leader resume nodename=node-[1,2]
      Example usage: $ juju run slurmctld/leader resume nodename="node-[1,2]"
    params:
      nodename:
        type: string
        description: >
          The nodes to resume, using the Slurm format, e.g. `node-[1,2]`.
        description: |
          The nodes to resume, using the Slurm format, e.g. `"node-[1,2]"`.
    required:
    - nodename

  influxdb-info:
    description: >
      Get InfluxDB info.

      This action returns the host, port, username, password, database, and
      retention policy regarding InfluxDB.
@@ -1,5 +1,3 @@
-ops==2.*
-influxdb==5.3.1
-jinja2==3.1.3
-distro
-pycryptodome
+ops==2.14.0
+distro==1.9.0
+pycryptodome==3.20.0
Review comment: Hmm... so for this config option, I'd like your feedback on some things @jamesbeedy.

I've witnessed discussions in the past where it's considered a "charm anti-pattern" to let users supply raw configuration like this. The reasoning is that it goes against the idea that charms should be opinionated about how they configure applications. Obviously that model doesn't totally work with Slurm, since everyone has their own way to configure Slurm and no two clusters are the same, but what I'm worried about is that users will accidentally bork their cluster, e.g. overwrite how we have munge set up and cause their cluster to fail to authenticate correctly.

I see that we have two options here; one is something like the `force-unmount` action in the nfs-client charm. What would be really great, imho, is if we could make this an action and have the user acknowledge which file they are going to update within the slurmctld-operator. That way the user acknowledges which file they're updating, and they'll likely read the action documentation informing them that this action gives them A LOT of power over their cluster and that they should routinely back up their configuration. Let me know your thoughts on this.
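One hypothetical shape for such an action, sketched as a charm actions definition. The action name `update-config` and its parameters are illustrative assumptions for discussion, not something that appears in this PR:

```yaml
# Hypothetical sketch only -- none of these names are defined by the PR.
update-config:
  description: |
    Overwrite one of the charm-managed Slurm configuration files.
    WARNING: this gives you a lot of power over your cluster.
    Back up your current configuration before running this action.
  params:
    file:
      type: string
      enum: [slurm.conf, cgroup.conf, acct_gather.conf]
      description: |
        The file being updated. Naming it explicitly is the acknowledgement
        step discussed above.
    content:
      type: string
      description: Replacement file contents.
  required:
    - file
    - content
```

Requiring the `file` parameter forces the operator to state which file they intend to touch, which is the acknowledgement the comment asks for.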