This repository contains playbooks and configuration to define a Slurm-based HPC environment including:
- A CentOS 8 and OpenHPC v2-based Slurm cluster.
- Shared filesystem(s) using NFS (with servers within or external to the cluster).
- Slurm accounting using a MySQL backend.
- A monitoring backend using Prometheus and ElasticSearch.
- Grafana with dashboards for both individual nodes and Slurm jobs.
- Production-ready Slurm defaults for access and memory.
- A Packer-based build pipeline for compute node images.
The repository is designed to be forked for a specific use-case/HPC site but can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us!
The appliance assumes:
- Working DNS, so that the Ansible inventory name can be used as the address for connecting to services.
- Bootable images based on CentOS 8 cloud images.
These instructions assume the deployment host is running CentOS 8:
git clone git@github.com:stackhpc/openhpc-demo.git
cd openhpc-demo
python3 -m venv venv
. venv/bin/activate
pip install -U pip
pip install -r requirements.txt
# Install ansible dependencies ...
ansible-galaxy role install -r requirements.yml -p ansible/roles
ansible-galaxy collection install -r requirements.yml -p ansible/collections # ignore the path warning here
environments/
: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates.

ansible/
: Contains the ansible playbooks to configure the infrastructure.

packer/
: Contains automation to use Packer to build compute node images for an environment - see the README in this directory for further information.

dev/
: Contains development tools.
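At the top level this gives a layout along the lines of:

    environments/   # common environment plus site-specific environment(s)
    ansible/        # playbooks, plus installed roles and collections
    packer/         # compute node image build pipeline
    dev/            # development tools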
NB: This section describes generic instructions - check for any environment-specific instructions in environments/<environment>/README.md
before starting.
- Activate the environment - this must be done before any other commands are run:

source environments/<environment>/activate
- Deploy instances - see environment-specific instructions.
- Generate passwords:

ansible-playbook ansible/adhoc/generate-passwords.yml

This will output a set of passwords in environments/<environment>/inventory/group_vars/all/secrets.yml. It is recommended that these are encrypted and then committed to git using:

ansible-vault encrypt inventory/group_vars/all/secrets.yml
See the Ansible vault documentation for more details.
- Deploy the appliance:

ansible-playbook ansible/site.yml

or if you have encrypted secrets use:

ansible-playbook ansible/site.yml --ask-vault-password

Tags defined in the various sub-playbooks under ansible/ may be used to run only part of the site tasks.
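For example (the tag name here is illustrative only - check the sub-playbooks under ansible/ for the tags they actually define):

ansible-playbook ansible/site.yml --tags openhpc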
"Utility" playbooks for managing a running appliance are contained in
ansible/adhoc - run these by activating the environment and using:

ansible-playbook ansible/adhoc/<playbook name>

Currently they include:
test.yml
: MPI-based post-deployment tests for latency, bandwidth and floating point performance. See ansible/collections/ansible_collections/stackhpc/slurm_openstack_tools/roles/test/README.md for full details. Note that you may wish to reconfigure the Slurm compute nodes into a single partition before running this. IMPORTANT: Do not use these tests on a cluster in production as the reconfiguration it performs will crash running jobs.

update-packages.yml
: Update all packages on the cluster.
An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in environments/, containing:
- Any deployment automation required - e.g. Terraform configuration or HEAT templates.
- An ansible inventory/ directory.
- An activate script which sets environment variables to point to this configuration.
- Optionally, additional playbooks in /hooks to run before or after the main tasks.
All environments load the inventory from the common
environment first, with the environment-specific inventory then overriding parts of this as required.
This repo contains a cookiecutter template which can be used to create a new environment from scratch. Follow the "installation on deployment host" instructions above, then in the repo root run:
. venv/bin/activate
cd environments
cookiecutter skeleton
and follow the prompts to complete the environment name and description.
Alternatively, you could copy an existing environment directory.
Now add deployment automation if required, and then complete the environment-specific inventory as described below.
The ansible inventory for the environment is in environments/<environment>/inventory/. It should generally contain:
- A hosts file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation, so it is also a convenient place to define variables which depend on the deployed hosts, such as connection variables, IP addresses, ssh proxy arguments etc.
- A groups file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a group of the same name, e.g. openhpc or grafana. The meaning and use of each group is described in comments in environments/common/inventory/groups. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's groups file. Two template examples are provided in environments/common/layouts/ demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality. A sketch of this pattern is shown after this list.
- Optionally, group variable files in group_vars/<group_name>/overrides.yml, where the group names match the functional groups described above. These can be used to override the default configuration for each piece of functionality, which is defined in environments/common/inventory/group_vars/all/<group_name>.yml (the use of all here is due to ansible's precedence rules).
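As a rough illustration of this convention (the group memberships below are hypothetical - use the templates in environments/common/layouts/ as the starting point for a real environment), an environment's groups file might enable monitoring on the control node with entries like:

    # environments/<environment>/inventory/groups - illustrative sketch only
    [grafana:children]
    control

    [prometheus:children]
    control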
Although most of the inventory uses the group convention described above, there are a few special cases:

- The control, login and compute groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out hosts file (a hypothetical sketch is given after this list).
- The cluster name must be set on all hosts using openhpc_cluster_name. Using an [all:vars] section in the hosts file is usually convenient.
- environments/common/inventory/group_vars/all/defaults.yml contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using an environments/<environment>/inventory/group_vars/all/overrides.yml file.
- The ansible/adhoc/generate-passwords.yml playbook sets secrets for all hosts in environments/<environment>/inventory/group_vars/all/secrets.yml.
- The Packer-based pipeline for building compute images creates a VM in the builder and compute groups, allowing build-specific properties to be set in environments/common/inventory/group_vars/builder/defaults.yml or the equivalent inventory-specific path.
- Each Slurm partition must have:
  - An inventory group <cluster_name>_<partition_name> defining the hosts it contains - these must be homogeneous w.r.t. CPU and memory.
  - An entry in the openhpc_slurm_partitions mapping in environments/<environment>/inventory/group_vars/openhpc/overrides.yml. See the openhpc role documentation for more options.
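As a purely illustrative sketch (the cluster name, hostnames and partition name below are invented, and in practice the hosts file would be templated out by the deployment automation), a hosts file and the matching partition definition might look like:

    # environments/<environment>/inventory/hosts - illustrative sketch only
    [all:vars]
    openhpc_cluster_name=mycluster

    [control]
    mycluster-control

    [login]
    mycluster-login-0

    [compute]
    mycluster-compute-0
    mycluster-compute-1

    # partition group, named <cluster_name>_<partition_name>
    [mycluster_general]
    mycluster-compute-0
    mycluster-compute-1

with a corresponding (again illustrative) entry in environments/<environment>/inventory/group_vars/openhpc/overrides.yml:

    openhpc_slurm_partitions:
      - name: general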
TODO: this is just rough notes:
- Add new plays into an existing playbook, or add a new playbook and update site.yml.
- Add a new empty group into environments/common/inventory/groups.
- Add new default group vars.
- Update the example groups file environments/common/layouts/everything.
- Update default Packer build variables in environments/common/inventory/group_vars/builder/defaults.yml.
- Update READMEs.
Please see the monitoring-and-logging.README.md for details.