A monitoring dashboard built upon several common open source projects:
- Grafana
- Victoria Metrics
- Slurm Exporter
- Node Exporter
- Node Exporter Textfile Collector Scripts
- Power Exporter
- Slurm Job Exporter
- Suitably sized host with sufficient storage in /var for Docker volumes.
- Docker / Podman installed.
- SSL certificate and keyfile.
- Git
- Optional: sssd configured on the host to provide basic nginx authentication.
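The prerequisites above can be checked before installing with a small pre-flight script. This is a minimal sketch, not part of the project: the function names `find_runtime` and `has_free_gib` are hypothetical, and the 20 GiB threshold is an illustrative assumption, since the README does not state how much storage is needed.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check. The 20 GiB threshold below is an assumption
# chosen for illustration; size /var for your own retention needs.

# Print the first container runtime found on PATH (docker preferred), or fail.
find_runtime() {
    local candidate
    for candidate in docker podman; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Succeed if the filesystem holding directory $1 has at least $2 GiB available.
has_free_gib() {
    local dir=$1 min_gib=$2 avail_kib
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available".
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $((min_gib * 1024 * 1024)) ]
}

if runtime=$(find_runtime); then
    echo "container runtime: $runtime"
else
    echo "container runtime: none found (install Docker or Podman)"
fi

if command -v git >/dev/null 2>&1; then
    echo "git: found"
else
    echo "git: missing"
fi

if has_free_gib /var 20; then
    echo "/var: at least 20 GiB available"
else
    echo "/var: less than 20 GiB available"
fi
```

The script only reports what it finds and does not change anything on the host.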
Clone this git repository and check out the desired branch / release.
git clone https://github.com/gsangwell/monitoring-dashboard
git -C monitoring-dashboard checkout <branch/release>
Run the install.sh script to install the default configuration files.
cd monitoring-dashboard
bash install.sh
Copy your SSL certificate and key.
cp /path/to/certificate /etc/alces-dashboard/certs/dashboard.crt
cp /path/to/key /etc/alces-dashboard/certs/dashboard.key
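A mismatched certificate and key is a common cause of the proxy failing to start, so it can be worth confirming that the two files belong together. The sketch below compares the public key embedded in the certificate with the one derived from the private key. The function name `cert_matches_key` is hypothetical, and the sketch assumes PEM-encoded files and an unencrypted key.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if certificate $1 and private key $2
# carry the same public key. Assumes PEM files and an unencrypted key.
cert_matches_key() {
    local cert_pub key_pub
    # Extract the public key from the certificate.
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 1
    # Derive the public key from the private key.
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 1
    [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

certs_dir=/etc/alces-dashboard/certs
if cert_matches_key "$certs_dir/dashboard.crt" "$certs_dir/dashboard.key"; then
    echo "certificate and key match"
else
    echo "certificate and key do not match, or could not be read"
fi
```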
Build and run the containers using the bash scripts. If you do not wish to use sssd configured on the host for authentication, update the build script to use docker/proxy-noauth.Dockerfile instead.
bash build.sh
bash run.sh
Update the dashboard admin password.
docker exec -it alces-dashboard-grafana /usr/share/grafana/bin/grafana-cli admin reset-admin-password "password"
Once deployed, you will need to configure the relevant services to collect metrics from your hosts.
Add your list of hosts to /etc/alces-dashboard/metrics/targets/node-exporter.yml. Ensure any compute nodes are added with the node_type: compute label and any core nodes are added with node_type: core.
- targets:
    - master01:9100
    - master02:9100
  labels:
    node_type: core
- targets:
    - node01:9100
    - node02:9100
  labels:
    node_type: compute
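Before relying on the dashboard, it can help to confirm that each listed host actually serves metrics. The following is a rough sketch: `list_targets` and `check_target` are hypothetical helper names, the parser only handles unquoted `host:port` entries written one per line as in the example above, and it assumes Node Exporter is serving plain HTTP on its default `/metrics` path.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sanity-checking a target file.

# Print every unquoted "- host:port" entry in file $1, one per line.
# Lines such as "- targets:" are skipped because they have no numeric port.
list_targets() {
    sed -n 's/^[[:space:]]*-[[:space:]]*\([A-Za-z0-9_.-]*:[0-9][0-9]*\)[[:space:]]*$/\1/p' "$1"
}

# Succeed if target $1 (host:port) answers an HTTP GET on /metrics within 5 s.
check_target() {
    curl --silent --fail --max-time 5 --output /dev/null "http://$1/metrics"
}

targets_file=/etc/alces-dashboard/metrics/targets/node-exporter.yml
if [ -r "$targets_file" ]; then
    list_targets "$targets_file" | while read -r target; do
        if check_target "$target"; then
            echo "OK          $target"
        else
            echo "UNREACHABLE $target"
        fi
    done
fi
```

Run it from the dashboard host so that name resolution and firewall rules match what the metrics collector will see.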
The slurm-gpu-allocation.sh and user-storage-quota.sh scripts need to be installed on a suitable host, typically the same host you install Slurm Exporter to. Follow the upstream documentation to install them.
Install Slurm Exporter as per the upstream documentation and then update the target file /etc/alces-dashboard/metrics/targets/slurm-exporter.yml to include the relevant host. The default configuration assumes it is installed on the same host as the dashboard.
- targets:
    - localhost:9103
Install Slurm Job Exporter as per the upstream documentation and then update the target file /etc/alces-dashboard/metrics/targets/slurm-job-exporter.yml to include the relevant host. The default configuration assumes it is installed on the same host as the dashboard.
- targets:
    - localhost:9103
Configure the Power Exporter's config.yaml and nodes.yaml as per the project README and restart the container.
vim /etc/alces-dashboard/power-exporter/config.yaml
vim /etc/alces-dashboard/power-exporter/nodes.yaml
docker restart alces-dashboard-power-exporter
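After the restart, a quick check that the container stayed up can catch configuration errors early. This is a minimal sketch assuming Docker is the runtime in use; `container_running` is a hypothetical helper name, while the container name is the one used in the restart command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if container $1 is currently running.
container_running() {
    [ "$(docker inspect --format '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}

name=alces-dashboard-power-exporter
if command -v docker >/dev/null 2>&1; then
    if container_running "$name"; then
        echo "$name is running"
    else
        echo "$name is not running; recent log output follows"
        # Show the last few log lines to help spot a configuration error.
        docker logs --tail 20 "$name" 2>&1 || true
    fi
fi
```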