Config changes #80

Merged
merged 2 commits into from
Apr 25, 2024
16 changes: 15 additions & 1 deletion README.md
@@ -115,6 +115,20 @@ Get the code:

Note: If you are using an [Azure Ubuntu HPC-AI](https://github.com/Azure/azhpc-images) VM image, you can find Moneo at this path: /opt/azurehpc/tools/Moneo

### Configuration File ###

The [moneo_config.json](./moneo_config.json) file can be used to specify certain deployment settings prior to Moneo deployment.

There are 4 groups of configurations:

1. exporter_config - Applies to all deployments. It has the following settings:
- gpu_sample_interval - Sample rate per minute for the Nvidia GPU exporter. Choices are [1, 2, 30, 60, 120, 600], with 60 samples per minute as the default.
- gpu_profiling - Switches on additional profile metrics (Tensor, FP16, FP32, and FP64). Choices are true/false, with false as the default.
- Note: These settings may have an impact on performance. Default settings were chosen to minimize impact.
2. prom_config - Applies to the Headless deployment method. Refer to [Headless Deployment Guide](./docs/HeadlessDeployment.md) for usage.
3. geneva_config - Applies to Geneva deployment. Refer to [Geneva deployment](./docs/GenevaAgent.MD) for usage.
4. publisher_config - Applies to both the Geneva and Azure Monitor agent deployment methods. See [Geneva deployment](./docs/GenevaAgent.MD) or [Azure Monitor Agent deployment](./docs/AzureMonitorAgent.md) for usage.
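
As a rough illustration of how the exporter settings in this file might be consumed, the sketch below parses a config of the shape shown in `moneo_config.json` and validates the two exporter values. The helper name `load_exporter_settings` is hypothetical and not part of Moneo, and the sketch assumes the values are stored as strings, as they are in the config file added by this PR.

```python
import json

# Sample rates (per minute) listed in the README for gpu_sample_interval.
VALID_SAMPLE_RATES = {1, 2, 30, 60, 120, 600}


def load_exporter_settings(config_text):
    """Parse moneo_config.json text and return (sample_rate, profiling).

    Hypothetical helper for illustration; Moneo's own parsing may differ.
    """
    exporter = json.loads(config_text).get("exporter_config", {})
    # Values are stored as strings in the file, so convert explicitly.
    sample_rate = int(exporter.get("gpu_sample_interval", "60"))
    if sample_rate not in VALID_SAMPLE_RATES:
        raise ValueError(f"unsupported gpu_sample_interval: {sample_rate}")
    profiling = str(exporter.get("gpu_profiling", "false")).lower() == "true"
    return sample_rate, profiling


example = '{"exporter_config": {"gpu_sample_interval": "60", "gpu_profiling": "false"}}'
print(load_exporter_settings(example))
```

Rejecting values outside the documented choices early is one way to keep the check that the removed `--gpu_sample_rate` argument used to get from `argparse`.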

### Preferred Moneo Deployment ###

The preferred way to deploy Moneo is the headless method using Azure Managed Grafana and Prometheus resources.
@@ -185,7 +199,7 @@ Note: For more options check the Moneo help menu

1. For Managed Grafana (headless) deployment
- Verify that the user managed identity is assigned to the VM resource.
- Verify the prerequisite configure file (`Moneo/src/worker/publisher/config/managed_prom_config.json`) is configured correctly on each worker node.
- Verify the prerequisite configure file (`Moneo/moneo_config.json`) is configured correctly on each worker node.
- On the worker nodes verify functionality of prometheus agent remote write:
- Check prometheus docker with `sudo docker logs prometheus | grep 'Done replaying WAL'`
It will have the result like this:
7 changes: 4 additions & 3 deletions dockerfile/moneo-exporter-nvidia.dockerfile
@@ -6,8 +6,6 @@ ARG BRANCH_OR_TAG=main

ENV DCGM_VERSION=3.1.1
ENV OFED_VERSION=23.07-0.5.1.2
ENV PROFILING false
ENV GPU_SAMPLE_RATE 2

# Install dependencies
RUN apt-get update -y \
@@ -43,11 +41,14 @@ RUN cd /tmp && \
RUN git config --global advice.detachedHead false
RUN git clone --branch ${BRANCH_OR_TAG} https://github.com/Azure/Moneo.git

# Set up tmp space for Moneo
RUN mkdir -p /tmp/moneo-worker

# Install DCGM
WORKDIR Moneo/src/worker
RUN sudo bash install/nvidia.sh

# Set EntryPoint
COPY dockerfile/moneo-exporter-nvidia_entrypoint.sh .
RUN chmod +x moneo-exporter-nvidia_entrypoint.sh
CMD /bin/bash moneo-exporter-nvidia_entrypoint.sh ${PROFILING} ${GPU_SAMPLE_RATE}
CMD /bin/bash moneo-exporter-nvidia_entrypoint.sh
13 changes: 4 additions & 9 deletions dockerfile/moneo-exporter-nvidia_entrypoint.sh
@@ -1,22 +1,17 @@
#!/bin/bash
set -e

enable_profiling=$1
gpu_sample_rate=$2
ethernet_dev_name=$3
# Ethernet device naming, if not present it will use default eth0 name
ethernet_dev_name=$1

# Start NVIDIA, Net and Node Exporter
echo "Starting NVIDIA, Net and Node Exporter"

if [ $enable_profiling = true ]; then
python3 exporters/nvidia_exporter.py -m -s $gpu_sample_rate &
else
python3 exporters/nvidia_exporter.py -s $gpu_sample_rate &
fi
python3 exporters/nvidia_exporter.py &

python3 exporters/net_exporter.py --inifiband_sysfs=/hostsys/class/infiniband &

if [-n $ethernet_dev_name]; then
if [ -n "$ethernet_dev_name" ]; then
python3 exporters/node_exporter.py -e $ethernet_dev_name &
else
python3 exporters/node_exporter.py &
2 changes: 1 addition & 1 deletion docs/AzureMonitorAgent.md
@@ -8,7 +8,7 @@ Prerequisites:
1. An Azure Monitor Metrics (Application Insights) resource. Enable alerting on custom metric dimensions by referring to this [document](https://learn.microsoft.com/en-us/azure/azure-monitor/app/pre-aggregated-metrics-log-metrics#custom-metrics-dimensions-and-pre-aggregation) to restore the metric dimensions. (This leads to an extra cost.)
2. PSSH installed on manager nodes.
3. Ensure passwordless SSH is set up in your environment.
4. Config publisher config file in `Moneo/src/worker/publisher/config/publisher_config.json`.
4. Config publisher config file in `Moneo/moneo_config.json`.
Note: You can obtain your connection string from the Application Insights pages you created in the Azure portal.
```
{
4 changes: 2 additions & 2 deletions docs/HeadlessDeployment.md
@@ -36,13 +36,13 @@ Follow steps outlined in [Infrastructure deployment](../deploy_managed_infra/REA
3. Skip to step 5.
Note: This step can be performed in parallel using pssh. Reference step 4 for start and stop commands.

3. Modify the managed prometheus config file in `Moneo/src/worker/publisher/config/managed_prom_config.json`.
3. Modify the managed prometheus config file in `Moneo/moneo_config.json`.
- Reference the user managed identity created during infrastructure deployment to get the "identity client id"
- Reference the Managed Prometheus resource created during infrastructure deployment to get the "metrics ingestion endpoint"
- The config file modifications must be distributed to the Moneo directories on all workers.

```json
{
"prom_config": {
"IDENTITY_CLIENT_ID": "<identity client id>",
"INGESTION_ENDPOINT": "<metrics ingestion endpoint>"
}
4 changes: 2 additions & 2 deletions docs/ManagedPrometheusAgent.md
@@ -18,11 +18,11 @@ This guide will provide step-by-step instructions on how to publish your exp
- Click add at the bottom of the open blade.
3. PSSH installed on manager nodes.
4. Ensure passwordless SSH is set up in your environment.
5. Config managed prometheus config file in `Moneo/src/worker/publisher/config/managed_prom_config.json`.
5. Config managed prometheus config file in `Moneo/moneo_config.json`.
Note: You can obtain your IDENTITY_CLIENT_ID from your identity resource page and your metrics ingestion endpoint from the AWM pages you created in the Azure portal.

``` json
{
"prom_config": {
"IDENTITY_CLIENT_ID": "<identity client id>",
"INGESTION_ENDPOINT": "<metrics ingestion endpoint>"
}
13 changes: 7 additions & 6 deletions linux_service/README.md
@@ -45,17 +45,18 @@ Configuration/Installation is only required once. After that is complete the Lin

1. Configuration and installation of the Linux service is done with the following command:
```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/configure_service.sh"```
- Note: If using Azure monitor or Geneva add an extra argument "./start_moneo_services.sh azure_monitor" or "./configure_service.sh geneva" respectively.
- Note: If using the Azure AI/HPC VM market place image, this step is already completed for managed prometheus deployment
- Note: If using Azure monitor or Geneva add an extra argument "./configure_service.sh azure_monitor" or "./configure_service.sh geneva" respectively.
- Note: Geneva authentication is user managed identity "umi" by default. You can switch to the "cert" method by modifying the "PUBLISHER_AUTH" variable in [the start script](./configure_service.sh).

2. For Azure Monitor or Managed Prometheus methods if you have not yet modified the configuration files reference the following:
- For Azure Managed Prometheus:
- modify [managed_prom_config.json](../src/worker/publisher/config) and copy the file to the compute nodes.
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/src/worker/publisher/config/managed_prom_config.json /opt/azurehpc/tools/Moneo/src/worker/publisher/config```
- modify [moneo_config.json](../moneo_config.json) and copy the file to the compute nodes.
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/moneo_config.json /opt/azurehpc/tools/Moneo```
- Lastly, check that the managed user identity used to set up Managed Prometheus (Azure role assignments) is assigned to your VMSS.
- For Azure Monitor:
- modify the connection string of "azure_monitor_agent_config" section and copy the file to the compute nodes.
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/src/worker/publisher/config/publisher_config.json /opt/azurehpc/tools/Moneo/src/worker/publisher/config```
- ```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/moneo_config.json /opt/azurehpc/tools/Moneo```

### Launch Services ###

@@ -84,10 +85,10 @@ Stopping services is the same command for all methods.

Assuming configuration files have been updated and user managed ID applied if necessary (Managed Prometheus) reference these commands for the work flow:

- Configuration/Install:
- Configuration/Install (not needed for market place image, using managed Prometheus):
```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/configure_service.sh"```
- Extra Configure step for AZ Monitor and/or Managed Prometheus
```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/src/worker/publisher/config/<Respective config file> /opt/azurehpc/tools/Moneo/src/worker/publisher/config```
```parallel-scp -h hostfile /opt/azurehpc/tools/Moneo/moneo_config.json /opt/azurehpc/tools/Moneo```
- Start
```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/start_moneo_services.sh"```
Note:
1 change: 1 addition & 0 deletions linux_service/moneo_prestart.sh
@@ -21,3 +21,4 @@ fi
mkdir -p /tmp/moneo-worker

cp -rf $MONEO_PATH/src/worker/* /tmp/moneo-worker/
cp -f $MONEO_PATH/moneo_config.json /tmp/moneo-worker/
13 changes: 6 additions & 7 deletions linux_service/moneo_service_deploy.sh
@@ -8,11 +8,10 @@

MONEO_VERSION=v0.3.4 # Release tag
MONITOR_DIR=/opt/azurehpc/tools # install directory
IDENTITY_CLIENT_ID="38b84eb5-8aec-4971-aaeb-ddd7e9bfef98" # This is the client ID of the Managed Identity for the Azure Prometheus Monitor Workspace
INGESTION_ENDPOINT="https://moneo-amw-q14z.southcentralus-1.metrics.ingest.monitor.azure.com/dataCollectionRules/dcr-c0192b4cd2c748f88ffd422e7a0d77ac/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24" # This is the ingestion endpoint for the Azure Prometheus Monitor Workspace
MONEO_PATH=$MONITOR_DIR/Moneo
IDENTITY_CLIENT_ID="" # This is the client ID of the Managed Identity for the Azure Prometheus Monitor Workspace
INGESTION_ENDPOINT=""
PublisherMethod="" # This is the publisher method for Moneo. Options are azure_monitor, geneva (Msft internal Use), or leave blank for Azure Managed Prometheus

MONEO_PATH=$MONITOR_DIR/Moneo
# clone source to specified directory
if [[ -d "$MONEO_PATH" ]]; then
pushd $MONEO_PATH
@@ -36,9 +35,9 @@ fi
sudo chmod -R 777 $MONEO_PATH

# Configure step
echo "{
\"IDENTITY_CLIENT_ID\": \"$IDENTITY_CLIENT_ID\",
\"INGESTION_ENDPOINT\": \"$INGESTION_ENDPOINT\" }" > $MONEO_PATH/src/worker/publisher/config/managed_prom_config.json
jq '(.prom_config.IDENTITY_CLIENT_ID |= "'"$IDENTITY_CLIENT_ID"'")' "$MONEO_PATH/moneo_config.json" > "$MONEO_PATH/temp.json" && mv "$MONEO_PATH/temp.json" "$MONEO_PATH/moneo_config.json"
jq '(.prom_config.INGESTION_ENDPOINT |= "'"$INGESTION_ENDPOINT"'")' "$MONEO_PATH/moneo_config.json" > "$MONEO_PATH/temp.json" && mv "$MONEO_PATH/temp.json" "$MONEO_PATH/moneo_config.json"
rm -f "$MONEO_PATH/temp.json"

pushd $MONEO_PATH/linux_service
sudo ./configure_service.sh >> moneoServiceInstall.log
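
The configure step above rewrites two keys under `prom_config` by piping the file through `jq` twice. For comparison, here is a minimal Python sketch of the same update. The function name `set_prom_config` is hypothetical and not part of Moneo, and the sketch assumes the file is valid JSON with the layout introduced in this PR.

```python
import json
import os
import tempfile


def set_prom_config(config_path, identity_client_id, ingestion_endpoint):
    """Update prom_config in place, leaving the other groups untouched."""
    with open(config_path) as f:
        config = json.load(f)
    prom = config.setdefault("prom_config", {})
    prom["IDENTITY_CLIENT_ID"] = identity_client_id
    prom["INGESTION_ENDPOINT"] = ingestion_endpoint
    # Write to a temporary file in the same directory, then rename it over
    # the original, so a failed write never truncates the existing config.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(config, f, indent=4)
    os.replace(tmp_path, config_path)
```

Because the values pass through `json.dump` rather than being spliced into a filter string, characters such as double quotes in the endpoint are escaped automatically. In the shell version, a similar effect could likely be had with jq's `--arg` option instead of embedding the shell variables in the filter.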
3 changes: 3 additions & 0 deletions linux_service/start_moneo_services.sh
@@ -61,6 +61,9 @@ function proc_check(){
echo "All Services Running"
exit 0
}
# stop nvidia exporter in the event there was a config change
sudo systemctl stop moneo@nvidia_exporter.service 2> /dev/null
sleep 2 # wait a bit for the exporter to stop

$MONEO_PATH/linux_service/moneo_prestart.sh $MONEO_PATH 2> /dev/null

30 changes: 10 additions & 20 deletions moneo.py
@@ -123,6 +123,10 @@ def deploy_worker(self, hosts_file, max_threads=16): # noqa: C901
logging.info('Copying files to workers')
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
# Copy config file
copy_path = './moneo_config.json'
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
print('--------------------------')
if self.args.skip_install:
pass
@@ -146,12 +150,6 @@ def deploy_worker(self, hosts_file, max_threads=16): # noqa: C901
print('-Starting metric exporters on workers-')
logging.info('Starting metric exporters on workers')
cmd = '/tmp/moneo-worker/start.sh'
if self.args.profiler_metrics:
print('-Profiling enabled-')
logging.info('Profiling enabled')
cmd = cmd + ' true'
else:
cmd = cmd + ' false'
if self.args.launch_publisher:
agent = self.args.launch_publisher
if agent == 'geneva' and not self.args.publisher_auth:
@@ -179,8 +177,8 @@ def deploy_worker(self, hosts_file, max_threads=16): # noqa: C901
else:
cmd = cmd + ' false'
cmd = cmd + " \"\""
# gpu sample rate + ethernet device
cmd = cmd + " " + str(args.gpu_sample_rate) + " " + args.ethernet_device
# ethernet device
cmd = cmd + " " + args.ethernet_device
if self.args.custom_metrics_file_path:
print('-Custom exporter enabled-')
logging.info('Custom exporter enabled')
@@ -197,6 +195,10 @@ def deploy_work_docker(self, hosts_file, max_threads=16):
logging.info('Deploying docker container to workers')
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
# Copy config file
copy_path = './moneo_config.json'
out = pscp(copy_path, destination_dir, hosts_file, user=self.args.user)
logging.info(out)
out = pssh(cmd='/tmp/moneo-worker/deploy_docker.sh',
hosts_file=hosts_file, max_threads=max_threads, user=self.args.user)
logging.info(out)
@@ -352,13 +354,6 @@ def parallel_ssh_check():
default='full',
nargs="?",
help='Type of deployment/shutdown. Choices: {manager,workers,full}. Default: full.')
parser.add_argument(
'-p',
'--profiler_metrics',
action='store_true',
default=False,
help='Enable profile metrics (Tensor Core,FP16,FP32,FP64 activity).'
'Addition of profile metrics encurs additional overhead on computer nodes.')
parser.add_argument(
'-r',
'--container',
@@ -404,11 +399,6 @@ def parallel_ssh_check():
'--custom_metrics_file_path',
type=str,
help='The path of the custom metrics file.')
parser.add_argument(
'--gpu_sample_rate',
type=int,
choices=[1, 2, 30, 60, 120, 600],
help='Number of samples per minute for GPU monitoring. Valid options are 1,2,3,10', default=60)
parser.add_argument(
'--ethernet_device',
type=str,
28 changes: 28 additions & 0 deletions moneo_config.json
@@ -0,0 +1,28 @@
{
"exporter_config":{
"gpu_sample_interval": "60",
"gpu_profiling": "false"
},
"prom_config":{
"IDENTITY_CLIENT_ID": "<identity client id>",
"INGESTION_ENDPOINT": "<metrics ingestion endpoint>"
},
"geneva_config":{
"AccountName": "<account name>",
"MDMEndPoint": "<endpoint>",
"UmiObjectId": "<object ID>"
},
"publisher_config":{
"common_config": {
"metrics_ports": "8000,8001,8002",
"metrics_namespace": "<metrics_namespace>",
"interval": "20"
},
"geneva_agent_config": {
"metrics_account": "<metrics_account>"
},
"azure_monitor_agent_config": {
"connection_string": "<connectionString>"
}
}
}
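
Since the file ships with `<...>` placeholders, a quick pre-deployment check for values that were never filled in may be useful. The following is a minimal sketch under that assumption. `find_placeholders` is a hypothetical helper, not something Moneo provides, and the endpoint in the sample is a made-up value.

```python
import json
import re

# A value counts as unfilled if it is entirely wrapped in angle brackets.
PLACEHOLDER = re.compile(r"^<.*>$")


def find_placeholders(node, path=""):
    """Return dotted paths of string values still set to a <placeholder>."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            found.extend(find_placeholders(value, child))
    elif isinstance(node, str) and PLACEHOLDER.match(node):
        found.append(path)
    return found


# Sample config fragment: one key filled in, one left as a placeholder.
config = json.loads("""
{
    "prom_config": {
        "IDENTITY_CLIENT_ID": "<identity client id>",
        "INGESTION_ENDPOINT": "https://example.invalid/api/v1/write"
    }
}
""")
print(find_placeholders(config))
```

Only the groups relevant to the chosen deployment method need real values, so a caller would probably filter the returned paths by prefix (for example `prom_config.` for the headless method).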
5 changes: 2 additions & 3 deletions src/worker/deploy_docker.sh
@@ -2,15 +2,14 @@

IMAGE=azmoneo/moneo-exporter:nvidia
CONT_NAME=moneo-exporter-nvidia
PROFILING=$1

if [ -e "/dev/nvidiactl" ]; then
docker pull $IMAGE

docker rm --force $CONT_NAME && \
docker run --name=$CONT_NAME --net=host --restart=unless-stopped \
-e PROFILING=$PROFILING --rm --runtime=nvidia \
--cap-add SYS_ADMIN -v /sys:/hostsys/ -itd $IMAGE
--rm --runtime=nvidia \
--cap-add SYS_ADMIN -v /sys:/hostsys/ -v /tmp/moneo-worker/moneo_config.json:/tmp/moneo-worker/moneo_config.json -itd $IMAGE
else

echo 'No Nvidia devices found Docker deployment canceled'