This repository has been archived by the owner on Aug 9, 2024. It is now read-only.

feat!: overhaul slurmctld charm API #26

Merged · 2 commits · Jun 27, 2024
20 changes: 16 additions & 4 deletions .github/workflows/ci.yaml
@@ -23,7 +23,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Run tests
uses: get-woke/woke-action@v0
with:
@@ -34,7 +34,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install dependencies
run: python3 -m pip install tox
- name: Run linters
@@ -45,12 +45,23 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Install dependencies
run: python3 -m pip install tox
- name: Run tests
run: tox -e unit

type-check:
name: Static type checking
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Install dependencies
run: python3 -m pip install tox
- name: Run tests
run: tox -e type

integration-test:
strategy:
fail-fast: true
@@ -63,9 +74,10 @@
- inclusive-naming-check
- lint
- unit-test
- type-check
steps:
- name: Checkout
uses: actions/checkout@v3
uses: actions/checkout@v4
- name: Setup operator environment
uses: charmed-kubernetes/actions-operator@main
with:
1 change: 1 addition & 0 deletions .gitignore
@@ -8,3 +8,4 @@ __pycache__/
.idea
.vscode/
version
.ruff_cache
8 changes: 1 addition & 7 deletions README.md
@@ -24,13 +24,7 @@ This operator should be used with Juju 3.x or greater.
```shell
$ juju deploy slurmctld --channel edge
$ juju deploy slurmd --channel edge
$ juju deploy slurmdbd --channel edge
$ juju deploy mysql --channel 8.0/edge
$ juju deploy mysql-router slurmdbd-mysql-router --channel dpe/edge
$ juju integrate slurmctld:slurmd slurmd:slurmd
$ juju integrate slurmdbd-mysql-router:backend-database mysql:database
$ juju integrate slurmdbd:database slurmdbd-mysql-router:database
$ juju integrate slurmctld:slurmdbd slurmdbd:slurmdbd
$ juju integrate slurmctld:slurmd slurmd:slurmctld
```
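
Once the relation is in place, the deployment can be verified with `juju status` (a minimal check; the exact output depends on your model):

```bash
# Both applications should eventually report active/idle.
juju status slurmctld slurmd
```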

## Project & Community
134 changes: 30 additions & 104 deletions charmcraft.yaml
@@ -22,28 +22,15 @@ links:
source:
- https://github.com/charmed-hpc/slurmctld-operator

peers:
slurmctld-peer:
interface: slurmctld-peer
requires:
slurmd:
interface: slurmd
slurmdbd:
interface: slurmdbd
slurmrestd:
interface: slurmrestd
influxdb-api:
interface: influxdb-api
elasticsearch:
interface: elasticsearch
fluentbit:
interface: fluentbit
provides:
prolog-epilog:
interface: prolog-epilog
grafana-source:
interface: grafana-source
scope: global

assumes:
- juju
@@ -58,163 +45,102 @@ bases:
channel: "22.04"
architectures: [amd64]

parts:
charm:
build-packages: [git]
charm-python-packages: [setuptools]

# Create a version file and pack it into the charm. This is dynamically generated
# as part of the build process for a charm to ensure that the git revision of the
# charm is always recorded in this version file.
version-file:
plugin: nil
build-packages:
- git
override-build: |
VERSION=$(git -C $CRAFT_PART_SRC/../../charm/src describe --dirty --always)
echo "Setting version to $VERSION"
echo $VERSION > $CRAFT_PART_INSTALL/version
stage:
- version

config:
options:
custom-slurm-repo:
type: string
default: ""
description: >
Use a custom repository for Slurm installation.

This can be set to the Organization's local mirror/cache of packages and
supersedes the Omnivector repositories. Alternatively, it can be used to
track a `testing` Slurm version, e.g. by setting to
`ppa:omnivector/osd-testing`.

Note: The configuration `custom-slurm-repo` must be set *before*
deploying the units. Changing this value after deploying the units will
not reinstall Slurm.
cluster-name:
type: string
default: osd-cluster
description: >
description: |
Name to be recorded in database for jobs from this cluster.

This is important if a single database is used to record information from
multiple Slurm-managed clusters.

default-partition:
type: string
default: ""
description: >
description: |
Default Slurm partition. This is only used if defined, and must match an
existing partition.
custom-config:

slurm-conf-parameters:
Member commented:

Hmm... so for this config option, I'd like your feedback on some things @jamesbeedy

I've witnessed discussions in the past where it's considered a "charm anti-pattern" to enable users to supply configuration like this. The reasoning is that it goes against the idea that charms should be opinionated about what they look like/how they configure applications. Obviously that model doesn't totally work with Slurm, as everyone has their own way to configure Slurm and no two clusters are the same, but what I'm worried about is that users will accidentally bork their cluster. E.g. overwrite how we have munge set up and cause their cluster to fail to authenticate correctly.

I see that we have two options here:

  1. We specifically set what options users are allowed to configure, e.g. cluster name but not where the slurmctld pid file is located on the machine. We've talked in the past about how this doesn't scale, as Slurm has 250+ configuration parameters, so it's hard to know which parameters we should make configurable vs. which ones we shouldn't.
  2. We include a warning message with these configuration options that warns the user that they can break their cluster if they apply an option incorrectly and that they should first back up their current cluster configuration. We do something similar to this with the force-unmount action in the nfs-client charm.

What would be really great imho is if we could have this be an action, and make the user acknowledge which file they are going to update within the slurmctld-operator:

juju run --quiet slurmctld/leader set-config parameters="FirstJobId=1234" config-file="slurm.conf"

This way the user acknowledges which file they're updating, and they'll likely read the action documentation that informs them that this action gives them A LOT of power over their cluster and they should routinely backup their configuration. Let me know your thoughts on this.

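As a concrete sketch of this proposal (note that `set-config` is hypothetical here, not an action this PR implements), the workflow could pair the existing `show-current-config` action with a backup step:

```bash
# Back up the live configuration before changing it; set-config is the proposed action.
juju run --quiet slurmctld/leader show-current-config > slurm.conf.backup
juju run --quiet slurmctld/leader set-config parameters="FirstJobId=1234" config-file="slurm.conf"
```
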
type: string
default: ""
description: >
User supplied Slurm configuration.

This value supplements the charm supplied `slurm.conf` that is used for
Slurm Controller and Compute nodes.
description: |
User supplied Slurm configuration as a multiline string.

Example usage:
$ juju config slurmctld custom-config="FirstJobId=1234"
proctrack-type:
type: string
default: proctrack/cgroup
description: >
Identifies the plugin to be used for process tracking on a job step
basis.
cgroup-config:
$ juju config slurmctld slurm-conf-parameters="$(cat additional.conf)"

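For illustration, `additional.conf` would contain plain `slurm.conf` lines; the parameters below are hypothetical values, not defaults shipped by the charm:

```bash
# Write a couple of example slurm.conf parameters, then pass them to the charm.
cat > additional.conf <<'EOF'
FirstJobId=1234
MaxJobCount=20000
EOF
juju config slurmctld slurm-conf-parameters="$(cat additional.conf)"
```
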
cgroup-parameters:
type: string
default: |
CgroupAutomount=yes
ConstrainCores=yes
description: >
Configuration content for `cgroup.conf`.
description: |
User supplied configuration for `cgroup.conf`.
Member commented:

Same thing as above. We're giving users a lot of power over their cluster here. This could possibly roll into an action where the user has to acknowledge which file they're updating:

juju run --quiet slurmctld/leader set-config parameters="..." config-file="cgroup.conf"


health-check-params:
default: ""
type: string
description: >
description: |
Extra parameters for NHC command.

This option can be used to customize how NHC is called, e.g. to send an
e-mail to an admin when NHC detects an error set this value to
e-mail to an admin when NHC detects an error, set this value to
`-M [email protected]`.

health-check-interval:
default: 600
type: int
description: Interval in seconds between executions of the Health Check.

health-check-state:
default: "ANY,CYCLE"
type: string
description: Only run the Health Check on nodes in this state.

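Taken together, the health-check options compose like this (a sketch with illustrative values; the e-mail address is a placeholder):

```bash
# E-mail an admin on NHC failures and run the check every five minutes.
juju config slurmctld health-check-params="-M admin@example.com"
juju config slurmctld health-check-interval=300
```
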
acct-gather-frequency:
type: string
default: "task=30"
description: >
Accounting and profiling sampling intervals for the acct_gather plugins.

Note: A value of `0` disables the periodic sampling. In this case, the
accounting information is collected when the job terminates.

Example usage:
$ juju config slurmctld acct-gather-frequency="task=30,network=30"
acct-gather-custom:
type: string
default: ""
description: >
User supplied `acct_gather.conf` configuration.

This value supplements the charm supplied `acct_gather.conf` file that is
used for configuring the acct_gather plugins.

actions:
show-current-config:
description: >
description: |
Display the currently used `slurm.conf`.

Note: This file only exists in the `slurmctld` charm and is automatically
distributed to all compute nodes by Slurm.

Example usage:
$ juju run-action slurmctld/leader --format=json --wait | jq .[].results.slurm.conf | xargs -I % -0 python3 -c 'print(%)'

```bash
juju run slurmctld/leader show-current-config \
--quiet --format=json | jq .[].results.slurm.conf | xargs -I % -0 python3 -c 'print(%)'
```

drain:
description: >
description: |
Drain specified nodes.

Example usage:
$ juju run-action slurmctld/leader drain nodename=node-[1,2] reason="Updating kernel"
$ juju run slurmctld/leader drain nodename="node-[1,2]" reason="Updating kernel"
params:
nodename:
type: string
description: The nodes to drain, using the Slurm format, e.g. `node-[1,2]`.
description: The nodes to drain, using the Slurm format, e.g. `"node-[1,2]"`.
reason:
type: string
description: Reason to drain the nodes.
required:
- nodename
- reason
resume:
description: >
description: |
Resume specified nodes.

Note: Newly added nodes will remain in the `down` state until configured
with the `node-configured` action.

Example usage: $ juju run-action slurmctld/leader resume nodename=node-[1,2]
Example usage: $ juju run slurmctld/leader resume nodename="node-[1,2]"
params:
nodename:
type: string
description: >
The nodes to resume, using the Slurm format, e.g. `node-[1,2]`.
description: |
The nodes to resume, using the Slurm format, e.g. `"node-[1,2]"`.
required:
- nodename

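The `drain` and `resume` actions pair naturally around node maintenance; a sketch of that cycle (node names are examples):

```bash
# Take the nodes out of scheduling, do the maintenance, then return them to service.
juju run slurmctld/leader drain nodename="node-[1,2]" reason="Updating kernel"
# ... perform maintenance on node-1 and node-2 ...
juju run slurmctld/leader resume nodename="node-[1,2]"
```
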
influxdb-info:
description: >
Get InfluxDB info.

This action returns the host, port, username, password, database, and
retention policy for InfluxDB.
8 changes: 3 additions & 5 deletions requirements.txt
@@ -1,5 +1,3 @@
ops==2.*
influxdb==5.3.1
jinja2==3.1.3
distro
pycryptodome
ops==2.14.0
distro==1.9.0
pycryptodome==3.20.0