diff --git a/README.md b/README.md index f6997be..a8aeb1e 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,26 @@ -Moneo -===== -Description ------ -Moneo is a distributed GPU system monitor for AI workflows. +# Moneo # + +## Description ## + +Moneo is a distributed GPU system monitor for AI workflows. It orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU/node systems. This provides useful insights into workflow and system level characterization. + +Moneo offers flexibility with 3 deployment methods: + +1. The preferred method using Azure Managed Prometheus/Grafana and Moneo Linux services for collection (Headless deployment) +2. Using Azure Application Insights/Azure Monitor Workspace (AMW) (Headless deployment w/ App Insights). +3. Using Moneo CLI with a dedicated headnode to host local Prometheus/Grafana servers (Local Grafana Deployment) + +Moneo Headless Method: + +![image](./docs/assets/managedResourceDiagram.svg) -Moneo orchestrates metric collection (DCGMI + Prometheus DB) and visualization (Grafana) across multi-GPU/node systems. This provides useful insights into workflow and system level characterization.
Metrics There five categories of metrics that Moneo monitors: -1. GPU Counters + +1. GPU Counters + - Compute/Memory Utilization - SM and Memory Clock frequency - Temperature @@ -17,12 +28,12 @@ There five categories of metrics that Moneo monitors: - ECC Counts (Nvidia) - GPU Throttling (Nvidia) - XID code (Nvidia) -2. GPU Profiling Counters +2. GPU Profiling Counters - SM Activity - Memory Dram Activity - NVLink Activity - PCIE Rate -3. InfiniBand Network Counters +3. InfiniBand Network Counters - IB TX/RX rate - IB Port errors - IB Link FLap @@ -31,6 +42,7 @@ There five categories of metrics that Moneo monitors: - Clock frequency 5. Memory - Utilization +
@@ -59,125 +71,100 @@ There five categories of metrics that Moneo monitors:
-Minimum Requirements ------ +## Minimum Requirements ## + - python >=3.7 installed - OS Support: - - Ubuntu 18.04, 20.04, 22.04 - - AlmaLinux 8.6 -### Manager node requirements + - Ubuntu 18.04, 20.04, 22.04 + - AlmaLinux 8.6 + +### Manager Node Requirements ### + +Note: Not applicable if using Azure Managed Grafana/Prometheus + - docker 20.10.23 (May work with other versions but this has been tested.) - parallel-ssh 2.3.1-2 (May work with other versions but this has been tested.) +- Manager node must be able to ssh to itself + +### Worker node requirements ### -### Worker node requirements - Nvidia Architecture supported (only for Nvidai GPU monitoring): - - Volta - - Ampere - - Hopper - - docker 20.10.23 (Only if using geneva agent. May work with other versions but this has been tested.) - - Installed with install script at time of deployment (If not installed.): - - DCGM 3.1.6 - - pip3 - - prometheus_client - - psutil - - filelock - -Setup ------ - -Run following commands on dev box (could be one of the master/worker nodes or a local node): + - Volta + - Ampere + - Hopper +- Installed with install script at time of deployment (If not installed): + - DCGM 3.1.6 (For Nvidia deployments) + - Check install scripts for the various python packages installed. -```sh -# get the code -git clone https://github.com/Azure/Moneo.git -cd Moneo +## Usage ## -# install dependencies -sudo apt-get install pssh=2.3.1-2 -``` +### Deploying Moneo ### -Configuration -------------- +Get the code: -Prepare a hostfile that lists all worker node hostnames/ip +- Clone Moneo from Github. 
-```hostfile -192.168.0.100 -192.168.0.101 -192.168.0.110 -``` + ```sh + # get the code + git clone https://github.com/Azure/Moneo.git + cd Moneo + # install dependency + sudo apt-get install pssh + ``` + + Note: If you are using an [Azure Ubuntu HPC-AI](https://github.com/Azure/azhpc-images) VM image you can find Moneo at this path: /opt/azurehpc/tools/Moneo + +### Preferred Moneo Deployment ### -If the remote worker machines use a different username use the Moneo cli "--user" flag to indicate username to use. +The preferred way to deploy Moneo is the headless method using Azure Managed Grafana and Prometheus resources. -If the manager is not local host use the "--manager_host" flag to specify hostname/IP. +Complete the steps listed here: [Headless Deployment Guide](./docs/HeadlessDeployment.md) -i.e. ```python3 moneo.py -d manager -c hostfile --user --manager_host ``` +### Alternative deployment using Moneo CLI and head node ### -Usage ------ -### _Moneo CLI_ -To make deploying and shutting down easier we provide the Moneo CLI. +This method requires deploying a head node to host the local Prometheus database and Grafana server. -Which can be accessed as such: +- The headnode must have enough storage available to facilitate data collection +- Grafana and Prometheus are accessed via a web browser. Ensure proper access from the web browser to the headnode IP. -* ```sh +Complete the steps listed here: [Local Grafana Deployment Guide](./docs/LocalGrafanDeployment.md) + +### Moneo CLI ### + +Moneo CLI provides an alternative way to deploy and update Moneo manager and worker nodes. Although Linux services are preferred, this offers an alternative way to control Moneo. + +#### CLI Usage #### + +- ```python3 moneo.py [-d/--deploy] [-c hostfile] {manager,workers,full}``` +- ```python3 moneo.py [-s/--shutdown] [-c hostfile] {manager,workers,full}``` +- ```python3 moneo.py [-j JOB_ID ] [-c hostfile]``` +- i.e. 
```python3 moneo.py -d -c ./hostfile full``` + +Note: For more options check the Moneo help menu + +```sh python3 moneo.py --help - ``` -#### CLI Usage -* ```python3 moneo.py [-d/--deploy] [-c hostfile] {manager,workers,full}``` -* ```python3 moneo.py [-s/--shutdown] [-c hostfile] {manager,workers,full}``` -* ```python3 moneo.py [-j JOB_ID ] [-c hostfile]``` -* i.e. ```python3 moneo.py -d -c ./hostfile full``` - - -| Flag | Options/arguments |Description| -|--------------------------------|--------------------------|--------| -|-d, --deploy | None |Deploy option selection. Requires config file to be specified (i.e. -c host.ini) or file to be in Moneo directory.| -|-s, --shutdown| None |Shutdown option selection. Requires config file to be specified (i.e. -c host.ini) or file to be in Moneo directory.| -| | {manager,workers,full} | Type of deployment/shutdown. Choices: {manager,workers,full}. Default: full. | -|-c, --host_ini | path + file name |Provide filepath and name of ansible config file. The default is host.ini in the Moneo directory.| -|-j , --job_id | Job ID |Job ID for filtering metrics by job group. Host.ini file required. Cannot be specified during deployment and shutdown.| -|-p, --profiler_metrics | None|Enable profile metrics (Tensor Core,FP16,FP32,FP64 activity). Addition of profile metrics encurs additional overhead on computer nodes.| -|-f, --fork_processes | number of processes | The number of processes used to deploy/shutdown/update Moneo. Increasing process count can reduce the latency when deploying to large number of nodes. Default is 16.| -|-r, --container | None|Deploy Moneo-worker inside the container. Supported Platform: {nvidia} | --w, --skip_install | None | Skip worker software install| --u, --user | Username for remote machine | Provide username to use on remote VMs if not the same as current machine. Default is none.| --m, --manager_host | Manager Hostname/IP | Manager hostname or IP. 
Default is localhost.| ---g , --launch_publisher | {geneva, azure_monitor} | This launches the publisher which will share exporter data with Azure.| --a PUBLISHER_AUTH | {umi, cert}| Required if launching publisher with geneva. Authentication method for geneva. See help menu for cert configuration.| -### _Access the Portal_ - -The Prometheus and Grafana services will be started on master nodes after deployment. -You can access the Grafana portal to visualize collected metrics. - -There are several cases based on the networking configuration: - -* If the master node has a public IP address or domain, you can access the portal through `http://master-ip-or-domain:3000` directly. - - For example, if you are deploying for Azure VM or VMSS, you can [associate a public IP address](https://docs.microsoft.com/en-us/azure/virtual-network/ip-services/associate-public-ip-address-vm) to the master node, then create a [fully qualified domain name (FQDN)](https://docs.microsoft.com/en-us/azure/virtual-machines/create-fqdn) for it. - -* If the master node does not have a public IP address to access, e.g., the VMSS is created behind a load balancer, you will need to create a proxy to access. - - For example, you can create a socks5 proxy at `socks5://localhost:1080` through `ssh -D 1080 -p PORT USER@IP`, then install [Proxy SwitchyOmega](https://chrome.google.com/webstore/detail/proxy-switchyomega/padekgcemlokbadohgkifijomclgjgif?hl=en) in Edge/Chrome browser and configure the proxy to protocol `socks5`, server `localhost`, port `1080` for all schemes, you will be able to navigate portal using master node's hostname at `http://master-hostname:3000`. - -* Default Grafana access: - * username: azure - * password: azure - - This can be changed in the "src/master/grafana/grafana.env" file. +``` + +### Access the Grafana Portal ### + +- For Azure Managed Grafana the dashboards can be accessed via the endpoint provided on the resource overview. 
+- For Moneo CLI deployment with a dedicated head node the Grafana portal can be reached via browser: http://master-ip-or-domain:3000 +- If Azure Monitor is used, navigate to the Azure Monitor Workspace in the Azure portal. - ### _User Docs_ ### -- [Quick Start](./docs/QuickStartGuide.md) +## User Docs ## + +- [Headless Deployment Guide](./docs/HeadlessDeployment.md) +- [Local Grafana Deployment Guide](./docs/LocalGrafanDeployment.md) - To get started with job level filtering see: [Job Level Filtering](./docs/JobFiltering.md) - Slurm epilog/prolog integration: [Slurm example](./examples/slurm/README.md) - To deploy moneo-worker inside container: [Moneo-exporter](./docs/Moneo-exporter.md) -- To integrate Moneo with Azure Insights dashboard see: [Azure Monitor](./docs/AzureMonitorAgent.md) +- To integrate Moneo with Azure App Insights dashboard see: [Azure Monitor](./docs/AzureMonitorAgent.md) -Known Issues ------------- +## Known Issues ## -* NVIDIA exporter may conflict with DCGMI +- NVIDIA exporter may conflict with DCGMI There're [two modes for DCGM](https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/getting-started.html#content): embedded mode and standalone mode. @@ -187,24 +174,43 @@ Known Issues > Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility and lowest maintenance cost to users. -* Moneo will attempt to install a tested version of DCGM if it is not present on the worker nodes. However, this step is skipped if DCGM is already installed. In instances DCGM installed may be too old. +- Moneo will attempt to install a tested version of DCGM if it is not present on the worker nodes. However, this step is skipped if DCGM is already installed. In some instances the installed DCGM may be too old. This may cause the Nvidia exporter to fail. In this case it is recommended that DCGM be upgrade to atleast version 2.4.4. 
To view which exporters are running on a worker just run ```ps -eaf | grep python3``` -Troubleshooting ------------- -- Verifying Grafana and Prometheus containers are running: - - Check browser http://master-ip-or-domain:3000 (Grafana), http://master-ip-or-domain:9090 (Prometheus) - - On Manager node terminal run ```sudo docker container ls``` - ![image](https://user-images.githubusercontent.com/70273488/205715440-9f994c84-b115-4a98-9535-fdce8a4adf7d.png) -- Verifying exporters on worker node: - - ```ps -eaf | grep python3``` - - ![image](https://user-images.githubusercontent.com/70273488/205716391-d0144085-8948-4269-a25c-51bc68448e1e.png) +## Troubleshooting ## + +1. For Managed Grafana (headless) deployment + - Verify that the user managed identity is assigned to the VM resource. + - Verify that the prerequisite configuration file (`Moneo/src/worker/publisher/config/managed_prom_config.json`) is configured correctly on each worker node. + - On the worker nodes verify functionality of the Prometheus agent remote write: + - Check the Prometheus container logs with `sudo docker logs prometheus | grep 'Done replaying WAL'` + The output should look like this: + + ```Bash + ts=2023-08-07T07:25:49.636Z caller=dedupe.go:112 component=remote level=info remote_name=6ac237 url="" msg="Done replaying WAL" duration=8.339998173s + ``` + + - Check that Azure Grafana is linked to the Azure Prometheus workspace. + - This can be done by accessing settings in the Grafana dashboard and ensuring the ingestion link for the Managed Prometheus is being used for the datasource URL. + - You can also verify that the Managed Prometheus resource in the portal is linked with the Managed Grafana resource + ![image](./docs/assets/promAMWLinkGrafana.png) + +2. 
For deployments with a Headnode: + + - Verifying Grafana and Prometheus containers are running: + - Check browser http://master-ip-or-domain:3000 (Grafana), http://master-ip-or-domain:9090 (Prometheus) + - On Manager node terminal run ```sudo docker container ls``` + ![image](https://user-images.githubusercontent.com/70273488/205715440-9f994c84-b115-4a98-9535-fdce8a4adf7d.png) + +3. All deployments: + - Verifying exporters on worker node: + - ``` ps -eaf | grep python3 ``` + ![image](https://user-images.githubusercontent.com/70273488/205716391-d0144085-8948-4269-a25c-51bc68448e1e.png) -## Contributing +## Contributing ## This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us @@ -218,10 +224,10 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. -## Trademarks +## Trademarks ## -This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft -trademarks or logos is subject to and must follow +This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft +trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies. 
diff --git a/dashboard_templates/DeviceCountersTemplate.json b/dashboard_templates/DeviceCountersTemplate.json deleted file mode 100644 index 3c110a2..0000000 --- a/dashboard_templates/DeviceCountersTemplate.json +++ /dev/null @@ -1,1075 +0,0 @@ -{ - "properties": { - "lenses": { - "0": { - "order": 0, - "parts": { - "0": { - "position": { - "x": 0, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_gpu_utilization", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_gpu_utilization" - } - } - ], - "title": "Avg dcgm_gpu_utilization for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "1": { - "position": { - "x": 6, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_gpu_temp", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - 
"metricVisualization": { - "displayName": "dcgm_gpu_temp" - } - } - ], - "title": "Avg dcgm_gpu_temp for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "2": { - "position": { - "x": 12, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_sm_clock", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_sm_clock" - } - } - ], - "title": "Avg dcgm_sm_clock for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "3": { - "position": { - "x": 0, - "y": 4, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": 
"customMetrics/dcgm_mem_copy_utilization", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_mem_copy_utilization" - } - } - ], - "title": "Avg dcgm_mem_copy_utilization for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "4": { - "position": { - "x": 6, - "y": 4, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_memory_temp", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_memory_temp" - } - } - ], - "title": "Avg dcgm_memory_temp for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "5": { - "position": { - "x": 12, - "y": 4, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - 
"metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_memory_clock", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_memory_clock" - } - } - ], - "title": "Avg dcgm_memory_clock for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "6": { - "position": { - "x": 0, - "y": 8, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_tensor_active", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_tensor_active" - } - } - ], - "title": "Avg dcgm_tensor_active for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "7": { - "position": { - "x": 0, - "y": 12, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": 
"Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_fp16_active", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_fp16_active" - } - } - ], - "title": "Avg dcgm_fp16_active for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "8": { - "position": { - "x": 6, - "y": 12, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_fp32_active", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_fp32_active" - } - } - ], - "title": "Avg dcgm_fp32_active for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "9": { - "position": { - "x": 12, - "y": 12, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": 
"options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_fp64_active", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_fp64_active" - } - } - ], - "title": "Avg dcgm_fp64_active for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "10": { - "position": { - "x": 0, - "y": 16, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_power_usage", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_power_usage" - } - } - ], - "title": "Avg dcgm_power_usage for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "11": { - 
"position": { - "x": 6, - "y": 16, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_total_energy_consumption", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_total_energy_consumption" - } - } - ], - "title": "Avg dcgm_total_energy_consumption for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "12": { - "position": { - "x": 0, - "y": 20, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_ecc_dbe_aggregate_total", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_ecc_dbe_aggregate_total" - } - } - ], - "title": "Avg dcgm_ecc_dbe_aggregate_total for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - 
"axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "13": { - "position": { - "x": 6, - "y": 20, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_ecc_dbe_volatile_total", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_ecc_dbe_volatile_total" - } - } - ], - "title": "Avg dcgm_ecc_dbe_volatile_total for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "14": { - "position": { - "x": 0, - "y": 24, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_ecc_sbe_aggregate_total", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_ecc_sbe_aggregate_total" - } - } - ], - "title": "Avg 
dcgm_ecc_sbe_aggregate_total for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "15": { - "position": { - "x": 6, - "y": 24, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_ecc_sbe_volatile_total", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_ecc_sbe_volatile_total" - } - } - ], - "title": "Avg dcgm_ecc_sbe_volatile_total for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - } - } - } - }, - "metadata": { - "model": { - "timeRange": { - "value": { - "relative": { - "duration": 24, - "timeUnit": 1 - } - }, - "type": "MsPortalFx.Composition.Configuration.ValueTypes.TimeRange" - }, - "filterLocale": { - "value": "en-us" - }, - "filters": { - "value": { - "MsPortalFx_TimeRange": { - "model": { - "format": "utc", - "granularity": "auto", - "relative": "24h" - }, - "displayCache": { - "name": "UTC Time", - "value": "Past 24 hours" - }, - "filteredPartIds": [ - 
"StartboardPart-MonitorChartPart-8cb7053b-eb23-4016-9758-2555fffc3ebd", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b82480d2", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248235", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248015", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b82480de", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248241", - "StartboardPart-MonitorChartPart-be91bd06-3865-429f-92ec-db3c644bc053", - "StartboardPart-MonitorChartPart-be91bd06-3865-429f-92ec-db3c644bc05f", - "StartboardPart-MonitorChartPart-be91bd06-3865-429f-92ec-db3c644bc241", - "StartboardPart-MonitorChartPart-be91bd06-3865-429f-92ec-db3c644bc235", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b824841f", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b824842b", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248437", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248634", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248713", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b824871f" - ] - } - } - } - } - } - }, - "name": "Device Counters", - "type": "Microsoft.Portal/dashboards", - "location": "INSERT LOCATION", - "tags": { - "hidden-title": "Device Counters" - }, - "apiVersion": "2015-08-01-preview" -} diff --git a/dashboard_templates/InfiniBandNetworkCountersTemplate.json b/dashboard_templates/InfiniBandNetworkCountersTemplate.json deleted file mode 100644 index 6f97822..0000000 --- a/dashboard_templates/InfiniBandNetworkCountersTemplate.json +++ /dev/null @@ -1,179 +0,0 @@ -{ - "properties": { - "lenses": { - "0": { - "order": 0, - "parts": { - "0": { - "position": { - "x": 0, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": 
"Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/ib_port_rcv_data", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "ib_port_rcv_data" - } - } - ], - "title": "Avg ib_port_rcv_data for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "1": { - "position": { - "x": 6, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/ib_port_xmit_data", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "ib_port_xmit_data" - } - } - ], - "title": "Avg ib_port_xmit_data for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - } - } - } - }, - "metadata": { - "model": { - "timeRange": { - "value": { - "relative": { - "duration": 24, - "timeUnit": 1 - } - 
}, - "type": "MsPortalFx.Composition.Configuration.ValueTypes.TimeRange" - }, - "filterLocale": { - "value": "en-us" - }, - "filters": { - "value": { - "MsPortalFx_TimeRange": { - "model": { - "format": "utc", - "granularity": "auto", - "relative": "24h" - }, - "displayCache": { - "name": "UTC Time", - "value": "Past 24 hours" - }, - "filteredPartIds": [ - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248ee5", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248ef7" - ] - } - } - } - } - } - }, - "name": "InfiniBand Network Counters", - "type": "Microsoft.Portal/dashboards", - "location": "INSERT LOCATION", - "tags": { - "hidden-title": "InfiniBand Network Counters" - }, - "apiVersion": "2015-08-01-preview" -} diff --git a/dashboard_templates/ProfilingCountersTemplate.json b/dashboard_templates/ProfilingCountersTemplate.json deleted file mode 100644 index d24c9a7..0000000 --- a/dashboard_templates/ProfilingCountersTemplate.json +++ /dev/null @@ -1,499 +0,0 @@ -{ - "properties": { - "lenses": { - "0": { - "order": 0, - "parts": { - "0": { - "position": { - "x": 0, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_dram_active", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_dram_active" - } - } - ], - "title": "Avg dcgm_dram_active for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, 
- "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "1": { - "position": { - "x": 6, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_sm_active", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_sm_active" - } - } - ], - "title": "Avg dcgm_sm_active for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "2": { - "position": { - "x": 12, - "y": 0, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_sm_occupancy", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_sm_occupancy" - } - } - ], - "title": "Avg dcgm_sm_occupancy for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": 
true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "3": { - "position": { - "x": 0, - "y": 4, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_nvlink_rx_bytes", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_nvlink_rx_bytes" - } - } - ], - "title": "Avg dcgm_nvlink_rx_bytes for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "4": { - "position": { - "x": 6, - "y": 4, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_nvlink_tx_bytes", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_nvlink_tx_bytes" - } - } - ], - "title": 
"Avg dcgm_nvlink_tx_bytes for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "5": { - "position": { - "x": 0, - "y": 8, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_pcie_rx_bytes", - "aggregationType": 4, - "namespace": "microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_pcie_rx_bytes" - } - } - ], - "title": "Avg dcgm_pcie_rx_bytes for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - }, - "6": { - "position": { - "x": 6, - "y": 8, - "colSpan": 6, - "rowSpan": 4 - }, - "metadata": { - "inputs": [ - { - "name": "options", - "isOptional": true - }, - { - "name": "sharedTimeRange", - "isOptional": true - } - ], - "type": "Extension/HubsExtension/PartType/MonitorChartPart", - "settings": { - "content": { - "options": { - "chart": { - "metrics": [ - { - "resourceMetadata": { - "id": "/subscriptions//resourceGroups//providers/microsoft.insights/components/" - }, - "name": "customMetrics/dcgm_pcie_tx_bytes", - "aggregationType": 4, - "namespace": 
"microsoft.insights/components/kusto", - "metricVisualization": { - "displayName": "dcgm_pcie_tx_bytes" - } - } - ], - "title": "Avg dcgm_pcie_tx_bytes for ", - "titleKind": 1, - "visualization": { - "chartType": 2, - "legendVisualization": { - "isVisible": true, - "position": 2, - "hideSubtitle": false - }, - "axisVisualization": { - "x": { - "isVisible": true, - "axisType": 2 - }, - "y": { - "isVisible": true, - "axisType": 1 - } - }, - "disablePinning": true - } - } - } - } - } - } - } - } - } - }, - "metadata": { - "model": { - "timeRange": { - "value": { - "relative": { - "duration": 24, - "timeUnit": 1 - } - }, - "type": "MsPortalFx.Composition.Configuration.ValueTypes.TimeRange" - }, - "filterLocale": { - "value": "en-us" - }, - "filters": { - "value": { - "MsPortalFx_TimeRange": { - "model": { - "format": "utc", - "granularity": "auto", - "relative": "24h" - }, - "displayCache": { - "name": "UTC Time", - "value": "Past 24 hours" - }, - "filteredPartIds": [ - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248ab5", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248ace", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248bf5", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248c01", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248d10", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248d1c", - "StartboardPart-MonitorChartPart-3a416a37-d251-44ab-a365-96d4b8248dcb" - ] - } - } - } - } - } - }, - "name": "Profiling Counters", - "type": "Microsoft.Portal/dashboards", - "location": "INSERT LOCATION", - "tags": { - "hidden-title": "Profiling Counters" - }, - "apiVersion": "2015-08-01-preview" -} diff --git a/deploy_managed_infra/README.md b/deploy_managed_infra/README.md index 20453a5..62bf05d 100644 --- a/deploy_managed_infra/README.md +++ b/deploy_managed_infra/README.md @@ -1,7 +1,5 @@ # Azure Managed Moneo Resources # -===== - ## Overview ## Usining Moneo 
with Azure Managed resources is the prefered Method of deployment. These instructions will set up Azure managed Grafana and Azure Managed Prometheus to ingest and visualize data from Moneo exporters. This deployment only needs to be run ones. @@ -35,3 +33,12 @@ Specifically: ![Alt text](image.png) 4. Verify/Add Grafana admin,viewer,and/or editor roles to your grafana resource. 5. The deployment is complete. You can now design the Grafana dashboards to your own specifications. Also see: [ManagedPrometheusAgent.md](../docs/ManagedPrometheusAgent.md) for details on how to launch Moneo on compute nodes and start ingesting data. + +## Dashboard templates ## + +You are free to design your own Grafana dashboards. We also provide dashboards in the grafana_dashboard_templates directory: + +- [Cluster View](./grafana_dashboard_templates/Cluster_View.json) +- [GPU View](./grafana_dashboard_templates/GPU_View.json) +- [Network View](./grafana_dashboard_templates/Network_View.json) +- [Node View](./grafana_dashboard_templates/Node_View.json) diff --git a/deploy_managed_infra/grafana_dashboard_templates/Cluster_View.json b/deploy_managed_infra/grafana_dashboard_templates/Cluster_View.json new file mode 100755 index 0000000..41bf7f8 --- /dev/null +++ b/deploy_managed_infra/grafana_dashboard_templates/Cluster_View.json @@ -0,0 +1,1617 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": 45, + "links": [ + { + "asDropdown": true, + "icon": "external link", + "includeVars": true, + "keepTime": true, + "tags": [ + "Moneo" + ], + "targetBlank": true, + "title": "Moneo", + "tooltip": "", + "type": 
"dashboards", + "url": "" + } + ], + "liveNow": false, + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 29, + "panels": [], + "title": "IB Rate", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 1 + }, + "id": 30, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "(\r\n average_ib_port_xmit_data{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_ib_port_xmit_data{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_ib_port_xmit_data{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } 
+ ], + "title": "IB TX Rate", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 1 + }, + "id": 31, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "(\r\n average_ib_port_rcv_data{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_ib_port_rcv_data{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_ib_port_rcv_data{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "IB RX Rate", + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 9 + }, + "id": 11, + "panels": [], + "title": 
"GPU Utilization", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation on VM GPU device utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 10 + }, + "id": 2, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n average_dcgm_gpu_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_dcgm_gpu_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_gpu_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU 
Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation on VM GPU device utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 10 + }, + "id": 24, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true, + "sortBy": "Max", + "sortDesc": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n average_dcgm_mem_copy_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_dcgm_mem_copy_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_mem_copy_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": 
"{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Memory Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 18 + }, + "id": 17, + "panels": [], + "title": "Clock", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation on VM GPU device SM Clock", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "MHz" + }, + "overrides": [] + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 19 + }, + "id": 7, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n average_dcgm_sm_clock{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_dcgm_sm_clock{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_sm_clock{subscription=\"$Subscription\", 
cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "SM Clock", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation of VM GPU device Memory Clock", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "MHz" + }, + "overrides": [] + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 19 + }, + "id": 9, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n average_dcgm_memory_clock{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_dcgm_memory_clock{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_memory_clock{subscription=\"$Subscription\", 
cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Memory Clock", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 29 + }, + "id": 13, + "panels": [], + "title": "Temperature", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation of VM GPU device Temperature", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "celsius" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 30 + }, + "id": 3, + "options": { + "legend": { + "calcs": [ + "min", + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n average_dcgm_gpu_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n 
min_dcgm_gpu_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_gpu_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Temperature", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Memory temperature (in C)", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "celsius" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 30 + }, + "id": 4, + "options": { + "legend": { + "calcs": [ + "min", + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n average_dcgm_memory_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n 
min_dcgm_memory_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_memory_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Mem Temperature", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 38 + }, + "id": 15, + "panels": [], + "title": "Power", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation of VM GPU Power", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "watt" + }, + "overrides": [] + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 39 + }, + "id": 6, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "list", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\r\n 
average_dcgm_power_usage{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\n min_dcgm_power_usage{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\n max_dcgm_power_usage{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n)", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "Power Usage", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total energy consumption since boot (in mJ)", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "joule" + }, + "overrides": [] + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 39 + }, + "id": 8, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": 
"average_dcgm_total_energy_consumption{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 1) +\r\nmin_dcgm_total_energy_consumption{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 2) +\r\nmax_dcgm_total_energy_consumption{subscription=\"$Subscription\", cluster=\"$Cluster\",job_id=~\"$JobId\"} * ($Operation== bool 3)\r\n", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Total Energy Consumption", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 48 + }, + "id": 32, + "panels": [], + "title": "IB Port Down", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "inspect": false + }, + "mappings": [ + { + "options": { + "0": { + "index": 0, + "text": "Polling" + } + }, + "type": "value" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "ib_sys_guid" + }, + "properties": [ + { + "id": "custom.width", + "value": 208 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "ib_port" + }, + "properties": [ + { + "id": "custom.width", + "value": 142 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "instance" + }, + "properties": [ + { + "id": "custom.width", + "value": 196 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "physical_host" + }, + "properties": [ + { + "id": "custom.width", + "value": 201 + } + ] + } + ] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 49 + }, + "id": 33, + "options": { + "cellHeight": 
"sm", + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true, + "sortBy": [] + }, + "pluginVersion": "9.5.6", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "ib_port_physical_state{subscription=\"$Subscription\", cluster=\"$Cluster\", job_id=~\"$JobId\"} == 0\r\n", + "format": "table", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "IB Port Down", + "transformations": [ + { + "id": "filterFieldsByName", + "options": { + "include": { + "names": [ + "ib_port", + "physical_host", + "Value", + "ib_sys_guid", + "instance", + "Time" + ] + } + } + }, + { + "id": "organize", + "options": { + "excludeByName": {}, + "indexByName": { + "Time": 0, + "Value": 5, + "ib_port": 2, + "ib_sys_guid": 3, + "instance": 1, + "physical_host": 4 + }, + "renameByName": {} + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "Time": { + "aggregations": [ + "last" + ], + "operation": "aggregate" + }, + "Value": { + "aggregations": [], + "operation": "groupby" + }, + "ib_port": { + "aggregations": [], + "operation": "groupby" + }, + "ib_sys_guid": { + "aggregations": [], + "operation": "groupby" + }, + "instance": { + "aggregations": [], + "operation": "groupby" + }, + "physical_host": { + "aggregations": [], + "operation": "groupby" + } + } + } + } + ], + "type": "table" + }, + { + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 57 + }, + "id": 28, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": false, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + 
"overrides": [] + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 31 + }, + "id": 26, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "enablePagination": true, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 6, + "showHeader": true, + "sortBy": [ + { + "desc": false, + "displayName": "instance" + } + ] + }, + "pluginVersion": "9.5.6", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "dcgm_gpu_temp{subscription=\"$Subscription\", cluster=\"$Cluster\",gpu_id=\"0\"}", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "VM to Physical Host Map", + "transformations": [ + { + "id": "labelsToFields", + "options": { + "keepLabels": [ + "instance", + "physical_host" + ] + } + }, + { + "id": "filterFieldsByName", + "options": { + "include": { + "names": [ + "instance", + "physical_host" + ] + } + } + }, + { + "id": "merge", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": {}, + "indexByName": {}, + "renameByName": { + "instance": "Instance", + "physical_host": "Physical Host Name " + } + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "Instance": { + "aggregations": [], + "operation": "groupby" + }, + "Physical Host Name ": { + "aggregations": [], + "operation": "groupby" + } + } + } + } + ], + "type": "table" + } + ], + "title": "VM Instance to Host Mapping", + "type": "row" + } + ], + "refresh": "1m", + "revision": 1, + "schemaVersion": 38, + "style": "dark", + "tags": [], + "templating": { + "list": [ + { + "current": { + "selected": false, + "text": "Average", + "value": "1" + }, + "hide": 0, + "includeAll": false, + "multi": false, + "name": "Operation", + "options": [ + { + "selected": true, + "text": "Average", + "value": "1" + }, + { + "selected": false, + "text": "Minimum", + "value": "2" + }, + { + "selected": false, + "text": "Maximum", + "value": "3" + } 
+ ], + "query": "Average : 1, Minimum : 2, Maximum : 3", + "queryValue": "", + "skipUrlSync": false, + "type": "custom" + }, + { + "current": { + "selected": false, + "text": "d71c7216-6409-45f8-be15-35cf57b8527c", + "value": "d71c7216-6409-45f8-be15-35cf57b8527c" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization, subscription)", + "hide": 0, + "includeAll": false, + "label": "Subscription", + "multi": false, + "name": "Subscription", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization, subscription)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "ndv4-test-t", + "value": "ndv4-test-t" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization{subscription=\"$Subscription\"}, cluster)", + "hide": 0, + "includeAll": false, + "label": "Cluster", + "multi": false, + "name": "Cluster", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization{subscription=\"$Subscription\"}, cluster)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "none", + "value": "none" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization{cluster=~\"$Cluster\"}, job_id)", + "hide": 0, + "includeAll": false, + "label": "Job Id", + "multi": true, + "name": "JobId", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization{cluster=~\"$Cluster\"}, job_id)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + } + ] + }, + "time": { + "from": "now-30m", + "to": "now" + }, + "timepicker": { + 
"hidden": false, + "refresh_intervals": [ + "1m", + "5m", + "15m", + "30m", + "1h", + "2h", + "1d" + ] + }, + "timezone": "utc", + "title": "Cluster Unified View (Experimental)", + "uid": "e12394be-6c26-4c19-a089-f69930b17e7e", + "version": 62, + "weekStart": "" +} + diff --git a/deploy_managed_infra/grafana_dashboard_templates/GPU_View.json b/deploy_managed_infra/grafana_dashboard_templates/GPU_View.json new file mode 100755 index 0000000..b311f74 --- /dev/null +++ b/deploy_managed_infra/grafana_dashboard_templates/GPU_View.json @@ -0,0 +1,2152 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 39, + "links": [ + { + "asDropdown": true, + "icon": "external link", + "includeVars": true, + "keepTime": true, + "tags": [ + "Moneo" + ], + "targetBlank": true, + "title": "Moneo", + "tooltip": "", + "type": "dashboards", + "url": "" + } + ], + "liveNow": false, + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 11, + "panels": [], + "title": "Utilization", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation on VM GPU device utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 
5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 1 + }, + "id": 2, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "exemplar": false, + "expr": "dcgm_gpu_utilization{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation on VM GPU device utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": 
"absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 1 + }, + "id": 24, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "exemplar": false, + "expr": "dcgm_mem_copy_utilization{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Memory Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 9 + }, + "id": 17, + "panels": [], + "title": "Clock", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation on VM GPU device SM Clock", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + 
"color": "red", + "value": 80 + } + ] + }, + "unit": "MHz" + }, + "overrides": [] + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 10 + }, + "id": 7, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_sm_clock{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "SM Clock", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation of VM GPU device Memory Clock", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "MHz" + }, + "overrides": [] + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 12, + "y": 10 + }, + "id": 9, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "table", + 
"placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_memory_clock{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "Memory Clock", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 20 + }, + "id": 23, + "panels": [], + "title": "NV Link", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "The rate of data transmitted over NVLink.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 21 + }, + "id": 19, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } 
+ }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_nvlink_tx_bytes{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "NVLink TX Rate", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "The rate of data received over NVLink.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 21 + }, + "id": 21, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_nvlink_rx_bytes{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", 
gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "NVLink RX Rate", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 29 + }, + "id": 13, + "panels": [], + "title": "Temperature", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation of VM GPU device Temperature", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "celsius" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 30 + }, + "id": 3, + "options": { + "legend": { + "calcs": [ + "min", + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "exemplar": false, + "expr": "dcgm_gpu_temp{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": 
"", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Temperature", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Memory temperature (in C)", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "celsius" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 30 + }, + "id": 4, + "options": { + "legend": { + "calcs": [ + "min", + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_memory_temp{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "Mem Temperature", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 38 + }, + 
"id": 15, + "panels": [], + "title": "Power", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Selected Operation of VM GPU Power", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "watt" + }, + "overrides": [] + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 0, + "y": 39 + }, + "id": 6, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "list", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_power_usage{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "Power Usage", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total energy consumption since boot (in mJ)", + "fieldConfig": { + "defaults": { + "color": { + "mode": 
"palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "joule" + }, + "overrides": [] + }, + "gridPos": { + "h": 9, + "w": 12, + "x": 12, + "y": 39 + }, + "id": 8, + "options": { + "legend": { + "calcs": [ + "min", + "max" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "dcgm_total_energy_consumption{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "Total Energy Consumption", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 48 + }, + "id": 39, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Current throttle code ", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + 
"barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 7 + }, + "id": 37, + "options": { + "legend": { + "calcs": [ + "lastNotNull" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "dcgm_current_clock_throttle_reasons{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Throttle Code", + "type": "timeseries" + } + ], + "title": "GPU Throttling", + "type": "row" + }, + { + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 49 + }, + "id": 30, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of single-bit volatile ECC errors", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + 
"pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 6, + "x": 0, + "y": 17 + }, + "id": 32, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "dcgm_ecc_sbe_volatile_total{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "SBE Volatile ECC Errors", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of double-bit volatile ECC errors", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] 
+ }, + "gridPos": { + "h": 8, + "w": 6, + "x": 6, + "y": 17 + }, + "id": 33, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "dcgm_ecc_dbe_volatile_total{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "DBE Volatile ECC Errors", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of double-bit persistent ECC errors", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 6, + "x": 12, + "y": 17 + }, + "id": 35, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + 
"editorMode": "code", + "expr": "dcgm_ecc_dbe_aggregate_total{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "DBE Persistent ECC Errors", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of single-bit persistent ECC errors", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 6, + "x": 18, + "y": 17 + }, + "id": 34, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "dcgm_ecc_sbe_aggregate_total{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "SBE Persistent ECC Errors", + "type": "timeseries" + } + ], + "title": "ECC Errors", + "type": "row" + }, + { + 
"collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 50 + }, + "id": 28, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": false, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 10, + "w": 12, + "x": 0, + "y": 121 + }, + "id": 26, + "options": { + "footer": { + "countRows": false, + "enablePagination": true, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "frameIndex": 6, + "showHeader": true, + "sortBy": [ + { + "desc": false, + "displayName": "instance" + } + ] + }, + "pluginVersion": "9.4.12", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "dcgm_gpu_temp{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\",gpu_id=\"0\"}", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "VM to Physical Host Map", + "transformations": [ + { + "id": "labelsToFields", + "options": { + "keepLabels": [ + "instance", + "physical_host" + ] + } + }, + { + "id": "filterFieldsByName", + "options": { + "include": { + "names": [ + "instance", + "physical_host" + ] + } + } + }, + { + "id": "merge", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": {}, + "indexByName": {}, + "renameByName": { + "instance": "Instance", + "physical_host": "Physical Host Name " + } + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "Instance": { + "aggregations": [], + "operation": "groupby" + }, + "Physical Host Name ": { + "aggregations": [], + "operation": "groupby" + } + } + } + } + ], + "type": "table" + } + ], + "title": "VM Instance 
to Host Mapping", + "type": "row" + }, + { + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 51 + }, + "id": 43, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 52 + }, + "id": 44, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "expr": "node_gpu_burn_mon{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", gpu_id=~\"$GPU\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "GPU Burn Bandwidth", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + 
"fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 52 + }, + "id": 41, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "expr": "node_meta_seq_mon{subscription=~\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\"}", + "legendFormat": "{{instance}}:{{gpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "meta_seq wps per GPU", + "type": "timeseries" + } + ], + "title": "Custom Bandwidth", + "type": "row" + } + ], + "refresh": "1m", + "revision": 1, + "schemaVersion": 38, + "style": "dark", + "tags": [ + "Moneo" + ], + "templating": { + "list": [ + { + "current": { + "selected": false, + "text": "d71c7216-6409-45f8-be15-35cf57b8527c", + "value": "d71c7216-6409-45f8-be15-35cf57b8527c" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization, subscription)", + "hide": 0, + "includeAll": false, + "label": "Subscription", + "multi": false, + "name": "Subscription", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization, subscription)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": 
"query" + }, + { + "current": { + "selected": false, + "text": "yangwang1-integration-vmss", + "value": "yangwang1-integration-vmss" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization{subscription=\"$Subscription\"}, cluster)", + "hide": 0, + "includeAll": false, + "label": "Cluster", + "multi": false, + "name": "Cluster", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization{subscription=\"$Subscription\"}, cluster)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "none" + ], + "value": [ + "none" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization{cluster=~\"$Cluster\"}, job_id)", + "hide": 0, + "includeAll": false, + "label": "Job Id", + "multi": true, + "name": "JobId", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization{cluster=~\"$Cluster\"}, job_id)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "yangwa0ae0000cn", + "value": "yangwa0ae0000cn" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization{cluster=\"$Cluster\", job_id=~\"$JobId\"}, instance)", + "hide": 0, + "includeAll": false, + "label": "Instance", + "multi": true, + "name": "Instance", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization{cluster=\"$Cluster\", job_id=~\"$JobId\"}, instance)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": 
"prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(dcgm_gpu_utilization{instance=~\"$Instance\"},gpu_id)", + "hide": 0, + "includeAll": true, + "label": "GPU", + "multi": true, + "name": "GPU", + "options": [], + "query": { + "query": "label_values(dcgm_gpu_utilization{instance=~\"$Instance\"},gpu_id)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + } + ] + }, + "time": { + "from": "now-30m", + "to": "now" + }, + "timepicker": { + "refresh_intervals": [ + "1m", + "5m", + "15m", + "30m", + "1h", + "2h", + "1d" + ] + }, + "timezone": "utc", + "title": "GPU View", + "uid": "dHpbWBP4z", + "version": 41, + "weekStart": "" +} + diff --git a/deploy_managed_infra/grafana_dashboard_templates/Network_View.json b/deploy_managed_infra/grafana_dashboard_templates/Network_View.json new file mode 100755 index 0000000..b52ebfb --- /dev/null +++ b/deploy_managed_infra/grafana_dashboard_templates/Network_View.json @@ -0,0 +1,1237 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 40, + "links": [ + { + "asDropdown": true, + "icon": "external link", + "includeVars": false, + "keepTime": true, + "tags": [ + "Moneo" + ], + "targetBlank": true, + "title": "Moneo", + "tooltip": "", + "type": "dashboards", + "url": "" + } + ], + "liveNow": false, + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 19, + "panels": [], + "title": "InfiniBand", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": 
"Indication of IB Link Flap", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "mappings": [ + { + "options": { + "0": { + "color": "dark-red", + "index": 1, + "text": "Polling" + }, + "1": { + "color": "dark-green", + "index": 0, + "text": "Link Up" + } + }, + "type": "value" + } + ], + "max": 1, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 12, + "w": 24, + "x": 0, + "y": 1 + }, + "id": 2, + "options": { + "legend": { + "calcs": [ + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "exemplar": false, + "expr": "ib_port_physical_state{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "IB Link Status", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "The rate of data transmitted over InfiniBand.", 
+ "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 13 + }, + "id": 7, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "ib_port_xmit_data{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "IB TX Rate", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "The rate of data received over InfiniBand.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + 
"fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 13 + }, + "id": 11, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "ib_port_rcv_data{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "IB RX Rate", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 21 + }, + "id": 17, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of outbound packets discarded by the port because the port is down or congested.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": 
false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 22 + }, + "id": 10, + "options": { + "legend": { + "calcs": [ + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "ib_port_xmit_discards{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "Port Xmit Discards", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of packets not transmitted from the switch physical port.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + 
}, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 22 + }, + "id": 12, + "options": { + "legend": { + "calcs": [ + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "ib_port_xmit_constraint_errors{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "Port Xmit Constraint Errors", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of packets containing an error that were received on the port.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + 
}, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 30 + }, + "id": 13, + "options": { + "legend": { + "calcs": [ + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "ib_port_rcv_errors{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "Port Rcv Errors", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Total number of packets received on the switch physical port that are discarded.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 30 + }, + "id": 14, + "options": { + "legend": { + "calcs": [ + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + 
"tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "ib_port_rcv_constraint_errors{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "Port Rcv Constraint Errors", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Indication of IB Link Flap", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "inspect": false + }, + "decimals": 1, + "mappings": [], + "max": 1, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 24, + "x": 0, + "y": 38 + }, + "id": 15, + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "enablePagination": true, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "Time Stamp" + } + ] + }, + "pluginVersion": "9.5.6", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_link_flap{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", ib_port=~\"$IBPort\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{ib_port}}", + "range": true, + "refId": "A" + } + ], + "title": "Link Flap", + "transformations": [ + { + "id": "labelsToFields", + "options": { + "keepLabels": [ + "ib_port", 
+ "instance", + "time_stamp", + "cluster" + ] + } + }, + { + "id": "filterFieldsByName", + "options": { + "include": { + "names": [ + "cluster", + "ib_port", + "instance", + "time_stamp" + ] + } + } + }, + { + "id": "organize", + "options": { + "excludeByName": {}, + "indexByName": { + "cluster": 1, + "ib_port": 3, + "instance": 2, + "time_stamp": 0 + }, + "renameByName": { + "cluster": "Cluster", + "hbv2:hbv22ec6c000000:": "", + "ib_port": "IB Port", + "instance": "Instance", + "time_stamp": "Time Stamp" + } + } + } + ], + "transparent": true, + "type": "table" + } + ], + "title": "Port Errors", + "type": "row" + }, + { + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 22 + }, + "id": 21, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 0, + "y": 47 + }, + "id": 23, + "options": { + "footer": { + "countRows": false, + "enablePagination": true, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true + }, + "pluginVersion": "9.4.12", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "expr": "ib_port_rcv_errors{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=~\"$Instance\", ib_port=\"mlx5_ib0:1\"}", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "VM to Physical Host Map", + "transformations": [ + { + "id": "labelsToFields", + "options": { + "keepLabels": [ + "instance", + "physical_host" + ] + } + }, + { + "id": "filterFieldsByName", + "options": { + "include": { + "names": [ + 
"instance", + "physical_host" + ] + } + } + }, + { + "id": "merge", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": {}, + "indexByName": {}, + "renameByName": { + "instance": "VM Instance", + "physical_host": "Physical Hostname" + } + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "Physical Hostname": { + "aggregations": [], + "operation": "groupby" + }, + "VM Instance": { + "aggregations": [], + "operation": "groupby" + } + } + } + } + ], + "type": "table" + } + ], + "title": "VM Instance to Host Mapping", + "type": "row" + } + ], + "refresh": "1m", + "revision": 1, + "schemaVersion": 38, + "style": "dark", + "tags": [ + "Moneo" + ], + "templating": { + "list": [ + { + "current": { + "selected": false, + "text": "d71c7216-6409-45f8-be15-35cf57b8527c", + "value": "d71c7216-6409-45f8-be15-35cf57b8527c" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(ib_port_physical_state, subscription)", + "hide": 0, + "includeAll": false, + "label": "Subscription", + "multi": false, + "name": "Subscription", + "options": [], + "query": { + "query": "label_values(ib_port_physical_state, subscription)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "yangwang1-integration-vmss" + ], + "value": [ + "yangwang1-integration-vmss" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(ib_port_physical_state{subscription=~\"$Subscription\"}, cluster)", + "hide": 0, + "includeAll": false, + "label": "Cluster", + "multi": false, + "name": "Cluster", + "options": [], + "query": { + "query": "label_values(ib_port_physical_state{subscription=~\"$Subscription\"}, cluster)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + 
"current": { + "selected": false, + "text": "none", + "value": "none" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(ib_port_physical_state{cluster=~\"$Cluster\"}, job_id)", + "hide": 0, + "includeAll": false, + "label": "Job Id", + "multi": false, + "name": "JobId", + "options": [], + "query": { + "query": "label_values(ib_port_physical_state{cluster=~\"$Cluster\"}, job_id)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "yangwa0ae0000cn", + "value": "yangwa0ae0000cn" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(ib_port_physical_state{cluster=~\"$Cluster\", job_id=~\"$JobId\"}, instance)", + "hide": 0, + "includeAll": false, + "label": "Instance", + "multi": true, + "name": "Instance", + "options": [], + "query": { + "query": "label_values(ib_port_physical_state{cluster=~\"$Cluster\", job_id=~\"$JobId\"}, instance)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(ib_port_physical_state{instance=~\"$Instance\"}, ib_port)", + "hide": 0, + "includeAll": true, + "label": "IBPort", + "multi": true, + "name": "IBPort", + "options": [], + "query": { + "query": "label_values(ib_port_physical_state{instance=~\"$Instance\"}, ib_port)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + } + ] + }, + "time": { + "from": "now-30m", + "to": "now" + }, + "timepicker": { + "refresh_intervals": [ + "1m", + "5m", + "15m", + "30m", + "1h", + "2h", + "1d" + ] + }, + "timezone": "utc", + 
"title": "Network View", + "uid": "IziFPI8Vk", + "version": 11, + "weekStart": "" +} + diff --git a/deploy_managed_infra/grafana_dashboard_templates/Node_View.json b/deploy_managed_infra/grafana_dashboard_templates/Node_View.json new file mode 100755 index 0000000..cf07077 --- /dev/null +++ b/deploy_managed_infra/grafana_dashboard_templates/Node_View.json @@ -0,0 +1,1039 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "description": "", + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 41, + "links": [ + { + "asDropdown": true, + "icon": "external link", + "includeVars": true, + "keepTime": true, + "tags": [ + "Moneo" + ], + "targetBlank": true, + "title": "Moneo", + "tooltip": "", + "type": "dashboards", + "url": "" + } + ], + "liveNow": false, + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 2, + "panels": [], + "title": "CPU Utilization", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "CPU Utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": 
"off" + } + }, + "decimals": 2, + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 1 + }, + "id": 8, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "timezone": [ + "utc" + ], + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_cpu_util{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\",numa_domain=~\"$NUMA\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{cpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "CPU Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "CPU Utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 2, + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 
+ } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 1 + }, + "id": 10, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "timezone": [ + "utc" + ], + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_cpu_util{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\", numa_domain=~\"$NUMA\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}:{{cpu_id}}", + "range": true, + "refId": "A" + } + ], + "title": "CPU Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 9 + }, + "id": 4, + "panels": [], + "title": "Memory Counters", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Memory Utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 2, + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + 
"overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 10 + }, + "id": 9, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "timezone": [ + "utc" + ], + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_mem_util{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Memory Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "Memory Utilization", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 2, + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 10 + }, + "id": 12, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + 
"timezone": [ + "utc" + ], + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_mem_util{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Memory Utilization", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 18 + }, + "id": 6, + "panels": [], + "title": "Network Counters", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "TX Rate of VM's Ethernet Interface", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 19 + }, + "id": 11, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "timezone": [ + "utc" + ], + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": 
[ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_net_tx{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\"}", + "format": "time_series", + "instant": false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Ethernet TX Rate", + "transparent": true, + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "description": "RX Rate of VM's Ethernet Interface", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "Bps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 19 + }, + "id": 13, + "options": { + "legend": { + "calcs": [ + "max", + "last" + ], + "displayMode": "table", + "placement": "right", + "showLegend": true + }, + "timezone": [ + "utc" + ], + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "code", + "exemplar": false, + "expr": "node_net_rx{subscription=~\"$Subscription\", cluster=~\"$Cluster\", instance=~\"$Instance\"}", + "format": "time_series", + "instant": 
false, + "interval": "", + "legendFormat": "{{instance}}", + "range": true, + "refId": "A" + } + ], + "title": "Ethernet RX Rate", + "transparent": true, + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 27 + }, + "id": 15, + "panels": [], + "title": "VM Instance to Host Mapping", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green" + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 28 + }, + "id": 17, + "options": { + "footer": { + "countRows": false, + "fields": "", + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true + }, + "pluginVersion": "9.4.12", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "editorMode": "builder", + "expr": "node_mem_util{subscription=\"$Subscription\", cluster=\"$Cluster\", instance=\"$Instance\"}", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "VM to Physical Host Map", + "transformations": [ + { + "id": "labelsToFields", + "options": { + "keepLabels": [ + "instance", + "physical_host" + ] + } + }, + { + "id": "filterFieldsByName", + "options": { + "include": { + "names": [ + "instance", + "physical_host" + ] + } + } + }, + { + "id": "groupBy", + "options": { + "fields": { + "Physical Hostname": { + "aggregations": [ + "last" + ], + "operation": "aggregate" + }, + "VM Instance": { + "aggregations": [], + "operation": "groupby" + }, + "instance": { + "aggregations": [], + "operation": "groupby" + }, + "physical_host": { + "aggregations": [ + "last" + ], + "operation": "aggregate" + } + } + } + }, + { + 
"id": "organize", + "options": { + "excludeByName": {}, + "indexByName": {}, + "renameByName": { + "instance": "VM Instance", + "physical_host": "Physical Hostname", + "physical_host (last)": "Physical Hostname" + } + } + } + ], + "type": "table" + } + ], + "refresh": "1m", + "revision": 1, + "schemaVersion": 38, + "style": "dark", + "tags": [ + "Moneo" + ], + "templating": { + "list": [ + { + "current": { + "selected": false, + "text": "d71c7216-6409-45f8-be15-35cf57b8527c", + "value": "d71c7216-6409-45f8-be15-35cf57b8527c" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(node_mem_util, subscription)", + "hide": 0, + "includeAll": false, + "label": "Subscription", + "multi": false, + "name": "Subscription", + "options": [], + "query": { + "query": "label_values(node_mem_util, subscription)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "yangwang1-integration-vmss", + "value": "yangwang1-integration-vmss" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(node_mem_util{subscription=\"$Subscription\"}, cluster)", + "hide": 0, + "includeAll": false, + "label": "Cluster", + "multi": false, + "name": "Cluster", + "options": [], + "query": { + "query": "label_values(node_mem_util{subscription=\"$Subscription\"}, cluster)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "none", + "value": "none" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(node_mem_util{cluster=\"$Cluster\"}, job_id)", + "description": "", + "hide": 0, + "includeAll": false, + "label": "Job Id", + "multi": false, + "name": "JobId", + "options": [], + "query": { + "query": 
"label_values(node_mem_util{cluster=\"$Cluster\"}, job_id)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "yangwa0ae0000cn", + "value": "yangwa0ae0000cn" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(node_mem_util{cluster=\"$Cluster\", job_id=\"$JobId\"}, instance)", + "hide": 0, + "includeAll": false, + "label": "Instance", + "multi": false, + "name": "Instance", + "options": [], + "query": { + "query": "label_values(node_mem_util{cluster=\"$Cluster\", job_id=\"$JobId\"}, instance)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + }, + { + "current": { + "selected": false, + "text": "All", + "value": "$__all" + }, + "datasource": { + "type": "prometheus", + "uid": "moneo-amw" + }, + "definition": "label_values(node_cpu_util{instance=~\"$Instance\"},numa_domain)", + "hide": 0, + "includeAll": true, + "label": "NUMA", + "multi": true, + "name": "NUMA", + "options": [], + "query": { + "query": "label_values(node_cpu_util{instance=~\"$Instance\"},numa_domain)", + "refId": "StandardVariableQuery" + }, + "refresh": 1, + "regex": "", + "skipUrlSync": false, + "sort": 0, + "type": "query" + } + ] + }, + "time": { + "from": "now-30m", + "to": "now" + }, + "timepicker": { + "refresh_intervals": [ + "1m", + "5m", + "15m", + "30m", + "1h", + "2h", + "1d" + ] + }, + "timezone": "utc", + "title": "Node View", + "uid": "DBUc8IU4k", + "version": 16, + "weekStart": "" +} + diff --git a/docs/AzureMonitorAgent.md b/docs/AzureMonitorAgent.md index 81ccb6c..a84804a 100644 --- a/docs/AzureMonitorAgent.md +++ b/docs/AzureMonitorAgent.md @@ -4,8 +4,6 @@ Description ----- This guide will provide step-by-step instructions on how to share your exporter metrics with Azure by utilizing Azure Monitor Metrics. 
-If you are an internal Microsoft user, we recommend utilizing the Geneva agent instead. For detailed instructions, please refer to this document [Geneva Agent](GenevaAgent.MD). - Prerequisites: 1. An Azure Monitor Metrics (Application Insights) resource. Please enable alerting on custom metric dimensions by referring to this [document](https://learn.microsoft.com/en-us/azure/azure-monitor/app/pre-aggregated-metrics-log-metrics#custom-metrics-dimensions-and-pre-aggregation) to restore the metrics dimensions (this incurs an extra cost). 2. PSSH installed on manager nodes. diff --git a/docs/HeadlessDeployment.md b/docs/HeadlessDeployment.md new file mode 100644 index 0000000..43ff97a --- /dev/null +++ b/docs/HeadlessDeployment.md @@ -0,0 +1,53 @@ +# Managed Grafana Deployment # + +- The following steps assume the Moneo directory is located here: /opt/azurehpc/tools/Moneo +- The following steps deploy Moneo workers using Moneo Linux services; however, Az CLI can also be used to deploy the Moneo workers. See the [managed prometheus guide](./ManagedPrometheusAgent.md) for details. +- Moneo CLI can be used in place of Moneo Linux services to deploy Moneo workers. + +## Deploy Infrastructure ## + +To use this method you will need to deploy the managed infrastructure and managed user identity. + +Follow the steps outlined in [Infrastructure deployment](../deploy_managed_infra/README.md) to set up Azure Managed Grafana and Prometheus resources. + + Note: this only needs to be done once. + +## Deploy Moneo ## + +1. Modify the managed prometheus config file in `Moneo/src/worker/publisher/config/managed_prom_config.json`. + - Reference the user managed identity created during infrastructure deployment to get the "identity client id" + - Reference the Managed Prometheus resource created during infrastructure deployment to get the "metrics ingestion endpoint" + - The config file modifications must be distributed to the Moneo directories on all workers. 
+ ```json + { + "IDENTITY_CLIENT_ID": "", + "INGESTION_ENDPOINT": "" + } + ``` + +2. Assign the identity to your VMSS resource: + - This can be done either via the portal or with the Az CLI (below) + - During VMSS creation: + + ```sh + az vmss create --resource-group --name --image --admin-username --admin-password --assign-identity --role --scope + ``` + + - Already existing VMSS: + + ```sh + az vmss identity assign -g -n --identities + ``` + +3. Start Services (Assumes Azure marketplace AI/HPC Image): ``` parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/start_moneo_services.sh true" ``` + - To stop services: ```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/stop_moneo_services.sh"``` + + Note: If not using the Azure AI/HPC marketplace image, reference the ["Deploying Linux services guide"](../linux_service/README.md) for full instructions. + +4. At this point data collection should be ongoing and metrics should be streaming to the Azure Managed Grafana instance set up during infrastructure deployment. + + Note: In the infrastructure deployment step you have the option to use provided template dashboards or create your own. + +5. Check the Azure Grafana dashboards to verify that the metrics are being ingested. + + ![image](assets/azuregrafana-managed_prometheus.png) diff --git a/docs/LocalGrafanaDeployment.md b/docs/LocalGrafanaDeployment.md new file mode 100644 index 0000000..11db487 --- /dev/null +++ b/docs/LocalGrafanaDeployment.md @@ -0,0 +1,53 @@ + +# Deploy Moneo with Local Grafana and head node # + +1. Create a hostfile. + + ```hostfile + 192.168.0.100 + 192.168.0.101 + 192.168.0.110 + ``` + + Note: The manager node can also be a worker node. The manager node will have the Grafana and Prometheus docker containers deployed to it. + + Note: You must have passwordless ssh enabled on your nodes. + + Note: The manager node must be able to ssh into itself. + +2. 
Now deploy Moneo + - Using the Moneo CLI: + + ```sh + python3 moneo.py --deploy -c hostfile full + ``` + + - If using the Azure HPC/AI marketplace image or if installation has been performed on all worker nodes by a previous deployment, we can skip the install step: + + ```sh + python3 moneo.py --deploy -c hostfile full -w + ``` + + Note: See the usage section of the README doc for more advanced details on the Moneo CLI. + + Note: By default Moneo deploys to the manager using localhost. This can be changed using the "manager_host" flag. + +3. Log into the portal by navigating to `http://manager-ip-or-domain:3000` and inputting your credentials. + + ![image](https://user-images.githubusercontent.com/70273488/173685955-dc51f7fc-da55-450b-b214-20d875e7687f.png) + + Note: By default username/password are set to "azure". This can be changed in "src/master/grafana/grafana.env". + +4. Navigating Moneo Grafana Portal + - The current view is labeled in the top left corner: + + ![image](https://user-images.githubusercontent.com/70273488/173687229-d1d64693-58d6-4874-a61c-c32af67e3fea.png) + - VM instance and GPU can be selected from the drop-down menus in the top left corner: + + ![image](https://user-images.githubusercontent.com/70273488/173687914-ee684e71-02a7-429e-abfa-046244e9eea0.png) + - Various actions such as dashboard selection or data source configuration can be achieved using the left screen menu: + + ![image](https://user-images.githubusercontent.com/70273488/173689054-661bb442-4883-4f99-9147-b8307821a6b2.png) + - Metric groups are collapsible: + + ![image](https://user-images.githubusercontent.com/70273488/173689514-e7532cfb-0b56-41ed-b9b9-1d71beaab123.png) diff --git a/docs/ManagedPrometheusAgent.md b/docs/ManagedPrometheusAgent.md index 2bdae71..521801d 100644 --- a/docs/ManagedPrometheusAgent.md +++ b/docs/ManagedPrometheusAgent.md @@ -1,10 +1,7 @@ -# Managed Prometheus Agent User Guide (Preview) # - -===== +# Managed Prometheus Agent User Guide # ## Description ## ------ 
This guide will provide step-by-step instructions on how to publish your exporter metrics to Azure Managed Prometheus in a second-level granularity interval. ## Prerequisites ## @@ -33,13 +30,11 @@ This guide will provide step-by-step instructions on how to publish your exp ## Steps ## ------ - 1. Ensure that all prerequisites are met. 2. Deploy Moneo on worker nodes: - - Worker deployment + - Worker deployment using CLI ```bash python3 moneo.py -d -c hostlist workers -g managed_prometheus -a umi @@ -57,6 +52,6 @@ This guide will provide step-by-step instructions on how to publish your exp Which means, prometheus agent's remote write is enabled. 4. At this point the remote write functionality should be working. -5. Check with Azure grafana (linked with AMW)dashboards to verify that the metrics are being ingested. +5. Check with Azure Grafana (linked with AMW) dashboards to verify that the metrics are being ingested. ![image](assets/azuregrafana-managed_prometheus.png) -Note: You will have to design the dashboards (templated dashboards coming soon) +Note: You will have to design the dashboards or use the template dashboards in the "Moneo/deploy_managed_infra/grafana_dashboard_templates" folder. diff --git a/docs/QuickStartGuide.md b/docs/QuickStartGuide.md index c58af2f..7baa5c2 100644 --- a/docs/QuickStartGuide.md +++ b/docs/QuickStartGuide.md @@ -1,65 +1,61 @@ -Moneo Quick Start Guide -===== -Description ------ -This guide will walk you through the simple steps of setting up Moneo. -This guide assume that all dependencies and requirements have been meant. - -Steps ------ -1. Clone Moneo from Github and install ansible. - ```sh - # get the code - git clone https://github.com/Azure/Moneo.git - cd Moneo +# Moneo Quick Start Guide # + +1. Clone Moneo from GitHub. 
- # install dependencies - sudo apt-get install pssh + ```sh + # get the code + git clone https://github.com/Azure/Moneo.git + cd Moneo + # install dependencies + sudo apt-get install pssh ``` + Note: If you are using an [Azure Ubuntu HPC-AI](https://github.com/Azure/azhpc-images) VM image you can find Moneo at this path: /opt/azurehpc/tools/Moneo -2. Next create a hostfile file. - ```hostfile - 192.168.0.100 - 192.168.0.101 - 192.168.0.110 - ``` - Note: The manager node can also be a work node as well. The manager node will have the Grafana and Prometheus docker containers deployed to it. - - Note: You must have passwordless ssh enabled on your nodes - - Note: The manager node must be able to ssh into itself - -3. Now deploy Moneo - * using Moneo cli: - ```sh - python3 moneo.py --deploy -c hostfile full - ``` - * If using the Azure HPC/AI marketplace image or if installation has been performed on all worker nodes by a previous deployment we can skip the install step: - ```sh - python3 moneo.py --deploy -c hostfile full -w - ``` - Note: See usage section of the README doc for more advance details on Moneo CLI +## Preferred Moneo Deployment ## + +The preferred way to deploy Moneo is the headless method using Azure Managed Grafana and Prometheus resources. + +Complete the steps listed here: [Headless Deployment Guide](./HeadlessDeployment.md) + +## Alternative deployment using Moneo CLI and head node ## + +This method requires deploying a head node to host the local Prometheus database and Grafana server. + +- The head node must have enough storage available to facilitate data collection. +- Grafana and Prometheus are accessed via web browser. Ensure the browser can reach the head node IP. 
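
The browser-access requirement above can be spot-checked from the machine you plan to browse from. The following is a minimal sketch, not part of Moneo: the helper names `port_open` and `check_head_node` are hypothetical, and the ports (3000 for Grafana, 9090 for Prometheus) are the ones this guide uses in its portal and troubleshooting URLs. A successful TCP connection only shows that the port accepts connections; it does not confirm that the service behind it is healthy.

```python
import socket

# Ports as used elsewhere in this guide: Grafana on 3000, Prometheus on 9090.
HEAD_NODE_PORTS = {"Grafana": 3000, "Prometheus": 9090}


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_head_node(host: str) -> dict:
    """Map each service name to whether its port accepted a connection."""
    return {name: port_open(host, port) for name, port in HEAD_NODE_PORTS.items()}


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the head node IP or domain.
    for name, reachable in check_head_node("localhost").items():
        print(f"{name}: {'reachable' if reachable else 'unreachable'}")
```

If either service reports unreachable, firewall or network security rules between the browser machine and the head node are a reasonable first thing to inspect.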
+ +Complete the steps listed here: [Local Grafana Deployment Guide](./LocalGrafanaDeployment.md) + +## Known Issues ## + +- NVIDIA exporter may conflict with DCGMI + + There are [two modes for DCGM](https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/getting-started.html#content): embedded mode and standalone mode. + + If DCGM is started in embedded mode (e.g., `nv-hostengine -n`, using no daemon option `-n`), the exporter will use the DCGM agent while DCGMI may return an error. + + It's recommended to start DCGM in standalone mode in a daemon, so that multiple clients like the exporter and DCGMI can interact with DCGM at the same time, according to [NVIDIA](https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/getting-started.html#standalone-mode). + + > Generally, NVIDIA prefers this mode of operation, as it provides the most flexibility and lowest maintenance cost to users. - Note: By default Moneo deploys to the manager using localhost. This can be changed using the "manager_host" flag. +- Moneo will attempt to install a tested version of DCGM if it is not present on the worker nodes. However, this step is skipped if DCGM is already installed. In some instances the installed DCGM may be too old. -4. Log into the portal by navigating to `http://manager-ip-or-domain:3000` and inputting your credentials + This may cause the Nvidia exporter to fail. In this case it is recommended that DCGM be upgraded to at least version 2.4.4. + To view which exporters are running on a worker, run ```ps -eaf | grep python3``` - ![image](https://user-images.githubusercontent.com/70273488/173685955-dc51f7fc-da55-450b-b214-20d875e7687f.png) - - Note: By default username/password are set to "azure". This can be changed here "src/master/grafana/grafana.env" - -5. 
Navigating Moneo Grafana Portal - - The current view is labeled in the top left corner: - - ![image](https://user-images.githubusercontent.com/70273488/173687229-d1d64693-58d6-4874-a61c-c32af67e3fea.png) - - VM instance and GPU can be selected from the drop down menus in the top left corner: +## Troubleshooting ## - ![image](https://user-images.githubusercontent.com/70273488/173687914-ee684e71-02a7-429e-abfa-046244e9eea0.png) - - Various actions such as dashboard selection or data source configuration can be achieved using the left screen menu: +1. +2. For deployments with a Headnode: - ![image](https://user-images.githubusercontent.com/70273488/173689054-661bb442-4883-4f99-9147-b8307821a6b2.png) - - Metric groups are collapsable: + - Verifying Grafana and Prometheus containers are running: + - Check browser http://master-ip-or-domain:3000 (Grafana), http://master-ip-or-domain:9090 (Prometheus) + - On Manager node terminal run ```sudo docker container ls``` + ![image](https://user-images.githubusercontent.com/70273488/205715440-9f994c84-b115-4a98-9535-fdce8a4adf7d.png) - ![image](https://user-images.githubusercontent.com/70273488/173689514-e7532cfb-0b56-41ed-b9b9-1d71beaab123.png) +3. 
All deployments: + - Verifying exporters on worker node: + - ``` ps -eaf | grep python3 ``` + ![image](https://user-images.githubusercontent.com/70273488/205716391-d0144085-8948-4269-a25c-51bc68448e1e.png) diff --git a/docs/assets/managedResourceDiagram.svg b/docs/assets/managedResourceDiagram.svg new file mode 100644 index 0000000..85ca5d6 --- /dev/null +++ b/docs/assets/managedResourceDiagram.svg @@ -0,0 +1 @@ +[SVG diagram, markup not preserved. Text labels: Compute node; VMSS Cluster; DCGMI; GPU Exporter; Network Exporter; Node Exporter; Prometheus Agent; Metrics export; Remote Write; Auth via Managed Identity; Azure Managed Prometheus; Azure Managed Grafana; Metrics; Visualize; Query Raw Data using PromQL via REST API; GPU View (Cluster Name, Job ID, Instance, GPU); Network View (Cluster Name, Job ID, Instance, IB Port); Node View (Cluster Name, Job ID, Instance, NUMA); Cluster View (Cluster Name, Job ID, Operation: Min/Ave/Max)] \ No newline at end of file diff --git a/docs/assets/promAMWLinkGrafana.png b/docs/assets/promAMWLinkGrafana.png new file mode 100644 index 0000000..5810374 Binary files /dev/null and b/docs/assets/promAMWLinkGrafana.png differ diff --git a/linux_service/README.md b/linux_service/README.md index 18dd896..ab666db 100644 --- a/linux_service/README.md +++ b/linux_service/README.md @@ -7,16 +7,16 @@ Setting up Moneo exporters as Linux service will allow for easy management and d Three launch methods provided: -1. The basic launch method launches the exporters on the compute node. It is up to the user to either: - - Use Moneo CLI to launch the manager Grafana and Prometheus containers on a head node. - - Or use you own method to scrape from the exporter ports ("nvidia_exporter": 8000 "net_exporter": 8001 "node_exporter": 8002). -2. Launch exporters and an [Azure Monitor](../docs/AzureMonitorAgent.md) publisher. - - Before launch you must modify the "azure_monitor_agent_config" section of [publisher_config](../src/worker/publisher/config/publisher_config.json) file with the Azure Monitor workspace connection string. -3. 
Azure Managed Grafana/Prometheus. +1. Azure Managed Grafana/Prometheus. - This will require you to set up Managed Prometheus and Managed Grafana - See prereqs for [Managed Prometheus](../docs/ManagedPrometheusAgent.md) - Once Managed Prometheus is set up you can link it to a Grafana Dashboard. - See [Azure Managed Grafana overview](https://learn.microsoft.com/en-us/azure/managed-grafana/overview) for info on setting up Grafana. +2. The basic launch method launches the exporters on the compute node. It is up to the user to either: + - Use Moneo CLI to launch the manager Grafana and Prometheus containers on a head node. + - Or use your own method to scrape from the exporter ports ("nvidia_exporter": 8000, "net_exporter": 8001, "node_exporter": 8002). +3. Launch exporters and an [Azure Monitor](../docs/AzureMonitorAgent.md) publisher. + - Before launch you must modify the "azure_monitor_agent_config" section of the [publisher_config](../src/worker/publisher/config/publisher_config.json) file with the Azure Monitor workspace connection string. This guide will walk you through how to set up Linux services for Moneo exporters. @@ -44,6 +44,7 @@ Configuration/Installation is only required once. Afte that is complete the Linu 1. Configuration and installation of the Linux service is done with the following command: ```parallel-ssh -i -t 0 -h hostfile "sudo /opt/azurehpc/tools/Moneo/linux_service/configure_service.sh"``` - If you will only be launching the exporters without Azure Monitor or Managed Prometheus, continue to the Launch Services section; otherwise continue below. + 2. For the Azure Monitor or Managed Prometheus methods, if you have not yet modified the configuration files, reference the following: - For Azure Managed Prometheus: - Modify [managed_prom_config.json](../src/worker/publisher/config) and copy the file to the compute nodes.