Commit

Adding/refactoring unified config
fixes to the conf functionality

lint fix

add configuration data to readme

address pr comments
Ubuntu authored and Ubuntu committed Apr 25, 2024
1 parent be75f06 commit 58c42b4
Showing 18 changed files with 86 additions and 109 deletions.
16 changes: 15 additions & 1 deletion README.md
@@ -115,6 +115,20 @@ Get the code:

Note: If you are using an [Azure Ubuntu HPC-AI](https://github.com/Azure/azhpc-images) VM image, you can find Moneo at this path: /opt/azurehpc/tools/Moneo

### Configuration File ###

The [moneo_config.json](./moneo_config.json) file can be used to specify deployment settings before Moneo is deployed.

There are 4 groups of configurations:

1. exporter_config - This applies to all deployments. See the following settings:
- gpu_sample_interval - Samples per minute for the Nvidia GPU exporter. Choices are [1, 2, 30, 60, 120, 600], with 60 samples per minute as the default.
- gpu_profiling - Switches on additional profile metrics (Tensor, FP16, FP32, and FP64). Choices are true/false, with false as the default.
- Note: These settings may have an impact on performance. Default settings were chosen to minimize impact.
2. prom_config - This group of settings applies to the Headless deployment method. Refer to [Headless Deployment Guide](./docs/HeadlessDeployment.md) for usage.
3. geneva_config - Applies to Geneva deployment. Refer to [Geneva deployment](./docs/GenevaAgent.MD) for usage.
4. publisher_config - Applies to both the Geneva and Azure Monitor agent deployment methods. See [Geneva deployment](./docs/GenevaAgent.MD) or [Azure Monitor Agent deployment](./docs/AzureMonitorAgent.md) for usage.
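
As a rough illustration of the four-group layout described in this README section, the sketch below reports which top-level groups are absent from a config file. The helper name `missing_groups` and the temporary sample file are assumptions made for illustration and are not part of Moneo. The group names are taken from this commit: `exporter_config` and `prom_config` appear in `moneo_config.json`, while `geneva_config` and `publisher_config` are named only in the README text.

```python
import json
import os
import tempfile

# Group names as listed in this commit (README text and moneo_config.json).
EXPECTED_GROUPS = ("exporter_config", "prom_config", "geneva_config", "publisher_config")


def missing_groups(path):
    """Hypothetical helper: return the expected top-level groups absent from the file."""
    with open(path) as f:
        config = json.load(f)
    return [group for group in EXPECTED_GROUPS if group not in config]


# Illustrative config that holds only two of the four groups.
sample = {
    "exporter_config": {"gpu_sample_interval": "60", "gpu_profiling": "false"},
    "prom_config": {
        "IDENTITY_CLIENT_ID": "<identity client id>",
        "INGESTION_ENDPOINT": "<metrics ingestion endpoint>",
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "moneo_config.json")
    with open(sample_path, "w") as f:
        json.dump(sample, f)
    missing = missing_groups(sample_path)

print(missing)
```

A check like this could be run on a manager node before copying the file to workers, so that an incomplete config is caught early rather than on each node.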

### Preferred Moneo Deployment ###

The preferred way to deploy Moneo is the headless method using Azure Managed Grafana and Prometheus resources.
@@ -185,7 +199,7 @@ Note: For more options check the Moneo help menu

1. For Managed Grafana (headless) deployment
- Verify that the user managed identity is assigned to the VM resource.
- Verify the prerequisite configure file (`Moneo/src/worker/publisher/config/managed_prom_config.json`) is configured correctly on each worker node.
- Verify the prerequisite configure file (`Moneo/moneo_config.json`) is configured correctly on each worker node.
- On the worker nodes verify functionality of prometheus agent remote write:
- Check prometheus docker with `sudo docker logs prometheus | grep 'Done replaying WAL'`
It will have the result like this:
7 changes: 4 additions & 3 deletions dockerfile/moneo-exporter-nvidia.dockerfile
@@ -6,8 +6,6 @@ ARG BRANCH_OR_TAG=main

ENV DCGM_VERSION=3.1.1
ENV OFED_VERSION=23.07-0.5.1.2
ENV PROFILING false
ENV GPU_SAMPLE_RATE 2

# Install dependencies
RUN apt-get update -y \
@@ -43,11 +41,14 @@ RUN cd /tmp && \
RUN git config --global advice.detachedHead false
RUN git clone --branch ${BRANCH_OR_TAG} https://github.com/Azure/Moneo.git

# Set up tmp space for Moneo
RUN mkdir -p /tmp/moneo-worker

# Install DCGM
WORKDIR Moneo/src/worker
RUN sudo bash install/nvidia.sh

# Set EntryPoint
COPY dockerfile/moneo-exporter-nvidia_entrypoint.sh .
RUN chmod +x moneo-exporter-nvidia_entrypoint.sh
CMD /bin/bash moneo-exporter-nvidia_entrypoint.sh ${PROFILING} ${GPU_SAMPLE_RATE}
CMD /bin/bash moneo-exporter-nvidia_entrypoint.sh
13 changes: 4 additions & 9 deletions dockerfile/moneo-exporter-nvidia_entrypoint.sh
@@ -1,22 +1,17 @@
#!/bin/bash
set -e

enable_profiling=$1
gpu_sample_rate=$2
ethernet_dev_name=$3
# Ethernet device naming, if not present it will use default eth0 name
ethernet_dev_name=$1

# Start NVIDIA, Net and Node Exporter
echo "Starting NVIDIA, Net and Node Exporter"

if [ $enable_profiling = true ]; then
python3 exporters/nvidia_exporter.py -m -s $gpu_sample_rate &
else
python3 exporters/nvidia_exporter.py -s $gpu_sample_rate &
fi
python3 exporters/nvidia_exporter.py &

python3 exporters/net_exporter.py --inifiband_sysfs=/hostsys/class/infiniband &

if [-n $ethernet_dev_name]; then
if [ -n "$ethernet_dev_name" ]; then
python3 exporters/node_exporter.py -e $ethernet_dev_name &
else
python3 exporters/node_exporter.py &
2 changes: 1 addition & 1 deletion docs/AzureMonitorAgent.md
@@ -8,7 +8,7 @@ Prequisites:
1. An Azure Monitor Metrics (Application Insights) resource. Enable alerting on custom metric dimensions by following this [document](https://learn.microsoft.com/en-us/azure/azure-monitor/app/pre-aggregated-metrics-log-metrics#custom-metrics-dimensions-and-pre-aggregation) to restore the metric dimensions. (This leads to an extra cost.)
2. PSSH installed on manager nodes.
3. Ensure passwordless ssh is set up in your environment.
4. Config publisher config file in `Moneo/src/worker/publisher/config/publisher_config.json`.
4. Config publisher config file in `Moneo/moneo_config.json`.
Note: You can obtain your connection string from the Application Insights pages you created in the Azure portal.
```
{
4 changes: 2 additions & 2 deletions docs/HeadlessDeployment.md
@@ -36,13 +36,13 @@ Follow steps outlined in [Infrastructure deployment](../deploy_managed_infra/REA
3. Skip to step 5.
Note: This step can be performed in parallel using pssh. Reference step 4 for start and stop commands.

3. Modify the managed prometheus config file in `Moneo/src/worker/publisher/config/managed_prom_config.json`.
3. Modify the managed prometheus config file in `Moneo/moneo_config.json`.
- Reference the user managed identity created during infrastructure deployment to get the "identity client id"
- Reference the Managed Prometheus resource created during infrastructure deployment to get the "metrics ingestion endpoint"
- The config file modifications must be distributed to the Moneo directories on all workers.

```json
{
"prom_config": {
"IDENTITY_CLIENT_ID": "<identity client id>",
"INGESTION_ENDPOINT": "<metrics ingestion endpoint>"
}
4 changes: 2 additions & 2 deletions docs/ManagedPrometheusAgent.md
@@ -18,11 +18,11 @@ This guide will provide step-by-step instructions on how to to publish your exp
- Click add at the bottom of the open blade.
3. PSSH installed on manager nodes.
4. Ensure passwordless ssh is set up in your environment.
5. Config managed prometheus config file in `Moneo/src/worker/publisher/config/managed_prom_config.json`.
5. Config managed prometheus config file in `Moneo/moneo_config.json`.
Note: You can obtain your IDENTITY_CLIENT_ID from your identity resource page and your metrics ingestion endpoint from the AWM pages you created in the Azure portal.

``` json
{
"prom_config": {
"IDENTITY_CLIENT_ID": "<identity client id>",
"INGESTION_ENDPOINT": "<metrics ingestion endpoint>"
}
13 changes: 7 additions & 6 deletions linux_service/README.md
@@ -45,17 +45,18 @@ Configuration/Installation is only required once. After that is complete the Lin

1. Configuration and installation of the Linux service is done with the following command:
```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/configure_service.sh"```
- Note: If using Azure monitor or Geneva add an extra argument "./start_moneo_services.sh azure_monitor" or "./configure_service.sh geneva" respectively.
- Note: If using the Azure AI/HPC VM market place image, this step is already completed for managed prometheus deployment
- Note: If using Azure monitor or Geneva add an extra argument "./configure_service.sh azure_monitor" or "./configure_service.sh geneva" respectively.
- Note: Geneva authentication is user managed identity "umi" by default; you can switch to the "cert" method by modifying the "PUBLISHER_AUTH" variable in [the start script](./configure_service.sh).

2. For Azure Monitor or Managed Prometheus methods if you have not yet modified the configuration files reference the following:
- For Azure Managed Prometheus:
- modify [managed_prom_config.json](../src/worker/publisher/config) and copy the file to the compute nodes.
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/src/worker/publisher/config/managed_prom_config.json /opt/azurehpc/tools/Moneo/src/worker/publisher/config```
- modify [moneo_config.json](../moneo_config.json) and copy the file to the compute nodes.
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/moneo_config.json /opt/azurehpc/tools/Moneo```
- Lastly, check that the managed user identity used to set up Managed Prometheus (Azure role assignments) is assigned to your VMSS.
- For Azure Monitor:
- modify the connection string of "azure_monitor_agent_config" section and copy the file to the compute nodes.
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/src/worker/publisher/config/publisher_config.json /opt/azurehpc/tools/Moneo/src/worker/publisher/config```
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/moneo_config.json /opt/azurehpc/tools/Moneo```

### Launch Services ###

@@ -84,10 +85,10 @@ Stopping services is the same command for all methods.

Assuming configuration files have been updated and user managed ID applied if necessary (Managed Prometheus) reference these commands for the work flow:

- Configuration/Install:
- Configuration/Install (not needed for market place image, using managed Prometheus):
```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/configure_service.sh"```
- Extra Configure step for AZ Monitor and/or Managed Prometheus
```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/src/worker/publisher/config/<Respective config file> /opt/azurehpc/tools/Moneo/src/worker/publisher/config```
```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/moneo_config.json /opt/azurehpc/tools/Moneo```
- Start
```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/start_moneo_services.sh"```
Note:
3 changes: 3 additions & 0 deletions linux_service/start_moneo_services.sh
@@ -61,6 +61,9 @@ function proc_check(){
echo "All Services Running"
exit 0
}
# stop nvidia exporter in the event there was a config change
sudo systemctl stop moneo@nvidia_exporter.service 2> /dev/null
sleep 2 # wait a bit for the exporter to stop

$MONEO_PATH/linux_service/moneo_prestart.sh $MONEO_PATH 2> /dev/null

28 changes: 7 additions & 21 deletions moneo.py
@@ -123,8 +123,8 @@ def deploy_worker(self, hosts_file, max_threads=16): # noqa: C901
logging.info('Copying files to workers')
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
# Copy config file
copy_path = './moneo_config.json'
destination_dir = '/tmp/moneo-worker'
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
print('--------------------------')
@@ -150,12 +150,6 @@ def deploy_worker(self, hosts_file, max_threads=16): # noqa: C901
print('-Starting metric exporters on workers-')
logging.info('Starting metric exporters on workers')
cmd = '/tmp/moneo-worker/start.sh'
if self.args.profiler_metrics:
print('-Profiling enabled-')
logging.info('Profiling enabled')
cmd = cmd + ' true'
else:
cmd = cmd + ' false'
if self.args.launch_publisher:
agent = self.args.launch_publisher
if agent == 'geneva' and not self.args.publisher_auth:
@@ -183,8 +177,8 @@ def deploy_worker(self, hosts_file, max_threads=16): # noqa: C901
else:
cmd = cmd + ' false'
cmd = cmd + " \"\""
# gpu sample rate + ethernet device
cmd = cmd + " " + str(args.gpu_sample_rate) + " " + args.ethernet_device
# ethernet device
cmd = cmd + " " + args.ethernet_device
if self.args.custom_metrics_file_path:
print('-Custom exporter enabled-')
logging.info('Custom exporter enabled')
@@ -201,6 +195,10 @@ def deploy_work_docker(self, hosts_file, max_threads=16):
logging.info('Deploying docker container to workers')
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
# Copy config file
copy_path = './moneo_config.json'
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
out = pssh(cmd='/tmp/moneo-worker/deploy_docker.sh',
hosts_file=hosts_file, max_threads=max_threads, user=self.args.user)
logging.info(out)
@@ -356,13 +354,6 @@ def parallel_ssh_check():
default='full',
nargs="?",
help='Type of deployment/shutdown. Choices: {manager,workers,full}. Default: full.')
parser.add_argument(
'-p',
'--profiler_metrics',
action='store_true',
default=False,
help='Enable profile metrics (Tensor Core,FP16,FP32,FP64 activity).'
'Addition of profile metrics encurs additional overhead on computer nodes.')
parser.add_argument(
'-r',
'--container',
@@ -408,11 +399,6 @@ def parallel_ssh_check():
'--custom_metrics_file_path',
type=str,
help='The path of the custom metrics file.')
parser.add_argument(
'--gpu_sample_rate',
type=int,
choices=[1, 2, 30, 60, 120, 600],
help='Number of samples per minute for GPU monitoring. Valid options are 1,2,3,10', default=60)
parser.add_argument(
'--ethernet_device',
type=str,
5 changes: 4 additions & 1 deletion moneo_config.json
@@ -1,5 +1,8 @@

{
"exporter_config":{
"gpu_sample_interval": "60",
"gpu_profiling": "false"
},
"prom_config":{
"IDENTITY_CLIENT_ID": "<identity client id>",
"INGESTION_ENDPOINT": "<metrics ingestion endpoint>"
5 changes: 2 additions & 3 deletions src/worker/deploy_docker.sh
@@ -2,15 +2,14 @@

IMAGE=azmoneo/moneo-exporter:nvidia
CONT_NAME=moneo-exporter-nvidia
PROFILING=$1

if [ -e "/dev/nvidiactl" ]; then
docker pull $IMAGE

docker rm --force $CONT_NAME && \
docker run --name=$CONT_NAME --net=host --restart=unless-stopped \
-e PROFILING=$PROFILING --rm --runtime=nvidia \
--cap-add SYS_ADMIN -v /sys:/hostsys/ -itd $IMAGE
--rm --runtime=nvidia \
--cap-add SYS_ADMIN -v /sys:/hostsys/ -v /tmp/moneo-worker/moneo_config.json:/tmp/moneo-worker/moneo_config.json -itd $IMAGE
else

echo 'No Nvidia devices found Docker deployment canceled'
46 changes: 28 additions & 18 deletions src/worker/exporters/nvidia_exporter.py
@@ -3,7 +3,7 @@
import time
import signal
import logging

import json
import prometheus_client

sys.path.append('/usr/local/dcgm/bindings/python3')
@@ -276,15 +276,40 @@ def Loop(self):
pass


def get_custom_config():
    try:
        with open('/tmp/moneo-worker/moneo_config.json') as f:
            mon_config = json.load(f)

        sample_per_min = int(mon_config['exporter_config']['gpu_sample_interval'])
        sample_intervals = [1, 2, 30, 60, 120, 600]

        if sample_per_min not in sample_intervals:
            mon_config['exporter_config']['gpu_sample_interval'] = 60
        else:
            mon_config['exporter_config']['gpu_sample_interval'] = sample_per_min

        if (mon_config['exporter_config']['gpu_profiling']).lower() == "true":
            mon_config['exporter_config']['gpu_profiling'] = True
        else:
            mon_config['exporter_config']['gpu_profiling'] = False
        return mon_config
    except Exception:
        mon_config = {'exporter_config': {'gpu_sample_interval': 60, 'gpu_profiling': False}}
        return mon_config
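
One behavior of `get_custom_config` as written may be worth noting: if `gpu_profiling` is stored as a JSON boolean rather than the string `"false"`, the `.lower()` call raises `AttributeError`, the broad `except` catches it, and both settings fall back to their defaults, including a `gpu_sample_interval` that was otherwise valid. The sketch below shows one possible, more tolerant variant that falls back per key. `normalize_exporter_config`, `VALID_SAMPLE_RATES`, and `DEFAULTS` are hypothetical names for illustration and are not part of this commit.

```python
# Allowed samples-per-minute values, copied from the list in get_custom_config.
VALID_SAMPLE_RATES = (1, 2, 30, 60, 120, 600)
DEFAULTS = {"gpu_sample_interval": 60, "gpu_profiling": False}


def normalize_exporter_config(raw):
    """Hypothetical helper: coerce each setting on its own, falling back per key."""
    settings = dict(DEFAULTS)

    # Accept the rate as a string or an int; keep the default when it is
    # unparsable or outside the allowed list.
    try:
        rate = int(raw.get("gpu_sample_interval", DEFAULTS["gpu_sample_interval"]))
        if rate in VALID_SAMPLE_RATES:
            settings["gpu_sample_interval"] = rate
    except (TypeError, ValueError):
        pass

    # Accept either a JSON boolean or a string such as "true" / "False".
    profiling = raw.get("gpu_profiling", DEFAULTS["gpu_profiling"])
    if isinstance(profiling, bool):
        settings["gpu_profiling"] = profiling
    else:
        settings["gpu_profiling"] = str(profiling).strip().lower() == "true"

    return settings


print(normalize_exporter_config({"gpu_sample_interval": "120", "gpu_profiling": True}))
```

With this approach a malformed `gpu_profiling` value would not discard a valid sample rate, at the cost of slightly more code than the single `try`/`except` used in the commit.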


def init_config():
global dcgm_config
mon_config = get_custom_config()
dcgm_config = {
'exit': False,
'ignoreList': [],
'dcgmHostName': None,
'prometheusPort': None,
'prometheusPublishInterval': None,
'prometheusPublishInterval': mon_config['exporter_config']['gpu_sample_interval'],
'publishFieldIds': None,
'profilerMetrics': mon_config['exporter_config']['gpu_profiling'],
'last_value': {}
}

@@ -304,22 +329,9 @@ def parse_dcgm_cli():
publish_port=8000,
log_level='INFO',
)
parser.add_argument(
'-m',
'--profiler_metrics',
action='store_true',
help='Enable profile metrics (Tensor Core,FP16,FP32,FP64 activity).'
'Addition of profile metrics encurs additional overhead on computer nodes.')
parser.add_argument(
'-s',
'--sample_per_min',
type=int,
default=60,
choices=[1, 2, 30, 60, 120, 600],
help='Samples per minute. Default 60')
args = dcgm_client_cli_parser.run_parser(parser)
# add profiling metrics if flag enabled
if (args.profiler_metrics):
if (dcgm_config['profilerMetrics']):
args.field_ids.extend(DCGM_PROF_FIELDS)
field_ids = dcgm_client_cli_parser.get_field_ids(args)
numeric_log_level = dcgm_client_cli_parser.get_log_level(args)
@@ -334,11 +346,9 @@ def parse_dcgm_cli():
else:
dcgm_config['dcgmHostName'] = args.hostname
dcgm_config['prometheusPort'] = args.publish_port
dcgm_config['prometheusPublishInterval'] = int(args.sample_per_min)
dcgm_config['publishFieldIds'] = field_ids
dcgm_config['sendUuid'] = True
dcgm_config['jobId'] = None
dcgm_config['profilerMetrics'] = args.profiler_metrics
logging.basicConfig(
level=numeric_log_level,
filemode=filemode,
5 changes: 0 additions & 5 deletions src/worker/publisher/config/geneva_config.json

This file was deleted.

4 changes: 0 additions & 4 deletions src/worker/publisher/config/managed_prom_config.json

This file was deleted.

13 changes: 0 additions & 13 deletions src/worker/publisher/config/publisher_config.json

This file was deleted.

2 changes: 1 addition & 1 deletion src/worker/publisher/metrics_publisher.py
@@ -94,7 +94,7 @@ def get_publisher_metrics_config():
Returns:
config(dict): The geneva metrics configuration
"""
with open('/tmp/moneo-worker/publisher/config/moneor_config.json') as f:
with open('/tmp/moneo-worker/moneo_config.json') as f:
config = json.load(f)
return config
