GPU Cluster Status Monitor

Introduction

This project provides a web application that displays the GPU usage of multiple nodes.

The framework is shown in Figure 1. On each node, the process node_info.py serves a REST API that exposes node information, such as the number of GPUs and their usage. The status information of all nodes is merged by the process api.py running on one main node. This process also serves an API that provides the combined information, together with a web application.
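The merge step on the main node can be pictured with the following minimal sketch. It is not the code in api.py: the function name merge_node_status, the injected fetch callable, and the "ok"/"error" fields are assumptions made for illustration, and fake_fetch stands in for a real HTTP request to a node.

```python
from typing import Callable, Dict, List


def merge_node_status(hosts: List[str], fetch: Callable[[str], dict]) -> Dict[str, dict]:
    """Collect the status of every host into one dictionary keyed by host."""
    merged = {}
    for host in hosts:
        try:
            merged[host] = {"ok": True, "status": fetch(host)}
        except Exception as err:
            # A node that is down should not break the combined view.
            merged[host] = {"ok": False, "error": str(err)}
    return merged


def fake_fetch(host: str) -> dict:
    """Hypothetical stand-in for a request to a node's REST API."""
    if host == "192.168.0.2":
        raise ConnectionError("node unreachable")
    return {"hostname": "node_1", "gpus": []}


cluster = merge_node_status(["192.168.0.1", "192.168.0.2"], fake_fetch)
print(cluster)
```

Passing the fetch function as an argument keeps the merge logic independent of the HTTP client, so it can be tried without a running cluster.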


Figure 1. Framework

Quick Start

Install packages (tested with python=3.9).

pip install -r requirments.txt

Assume we have a cluster with two nodes (node_1: 192.168.0.1; node_2: 192.168.0.2) and node_1 is the main node to display cluster information.

First, run the script node_info.py on each node to start a Flask process. The node status is then available through the API on port 7080 (default):

# on node_1, node_2
python node_info.py

Before starting the cluster monitor, save all node IP addresses in a text file, one address per line:

> hosts.txt # clear hosts.txt
echo 192.168.0.1 >> hosts.txt
echo 192.168.0.2 >> hosts.txt
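For reference, a file in this format can be read with a few lines of Python. This is only a sketch under the assumption of one address per line; parse_hosts is a hypothetical helper, and the way api.py actually parses the file is not shown here.

```python
from typing import List


def parse_hosts(text: str) -> List[str]:
    """Return the non-empty lines of a hosts file, without duplicates, in order."""
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and line not in hosts:
            hosts.append(line)
    return hosts


# Same content as the hosts.txt created above, given inline to stay self-contained.
hosts = parse_hosts("192.168.0.1\n192.168.0.2\n\n")
print(hosts)
```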

Second, run the interface API on node_1.

python api.py -c hosts.txt --port 7070

Then visit http://192.168.0.1:7070 in Chrome.

Customize the port

# on each node
python node_info.py --port <node_api_port> --disable_log

# on main node
python api.py -c hosts.txt --port <main_api_port> --node_port <node_api_port>

# visit http://<main_node_ip>:<main_api_port>

Specifications

Password

A password is required to get the node status; the default is '8888'. To change the password, modify the global variable PASSWD in node_info.py and api.py.

Get node status (json)

Given node_info.py running on 192.168.0.3:7080, the node status can be retrieved from Python:

import json

import requests

res = requests.post('http://192.168.0.3:7080/get-status', json={'passwd': '8888'})

print(json.dumps(res.json(), indent=4))
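If the requests package is not available, the same POST request can be sketched with the standard library only. The helper names build_status_request and fetch_status and the 3-second timeout are assumptions for illustration; the endpoint, port, and password are the ones used above.

```python
import json
import urllib.request


def build_status_request(host: str, port: int = 7080, passwd: str = "8888") -> urllib.request.Request:
    """Build the POST request for a node's /get-status endpoint."""
    body = json.dumps({"passwd": passwd}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}:{port}/get-status",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_status(host: str, port: int = 7080, passwd: str = "8888", timeout: float = 3.0) -> dict:
    """Send the request and decode the JSON reply.

    Network failures (unreachable node, timeout) raise OSError subclasses.
    """
    req = build_status_request(host, port, passwd)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network; fetch_status() does.
req = build_status_request("192.168.0.3")
print(req.get_method(), req.full_url)
```

Calling fetch_status("192.168.0.3") then returns the node status as a dictionary, provided node_info.py is running on that host.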

The structure of node status:

{
    "hostname": (`str`)
    "last_update": (`str`) isoformat, e.g., "2023-04-29T21:17:41.419592"
    "ips": List[Tuple[interface, ip]], e.g., [["eno1", "192.168.0.3"]]
    "gpus": [
        {
            "index": (`int`)
            "name": (`str`) GPU brand
            "use_mem": (`int`) used memory in MiB
            "tot_mem": (`int`) total memory in MiB
            "utilize": (`int`) utilization percent
            "temp": (`int`) temperature
            "users": [{"pid": 123, "username": xx, "mem(MiB)": 1024, "command": xx}, ...]
        },
        ...
    ]
}
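As an illustration of how this structure can be consumed, the sketch below turns one node status into a line of text per GPU. The helper summarize_gpus and the sample values (host name, GPU name, user) are made up for the example; only the field names come from the structure above.

```python
from typing import List


def summarize_gpus(status: dict) -> List[str]:
    """Return one human-readable line per GPU of a node status dictionary."""
    lines = []
    for gpu in status.get("gpus", []):
        # Guard against a zero total to avoid a division error.
        mem_pct = 100.0 * gpu["use_mem"] / gpu["tot_mem"] if gpu["tot_mem"] else 0.0
        users = sorted({user["username"] for user in gpu.get("users", [])})
        user_text = ", ".join(users) if users else "-"
        lines.append(
            f"{status['hostname']} gpu{gpu['index']} {gpu['name']}: "
            f"{gpu['use_mem']}/{gpu['tot_mem']} MiB ({mem_pct:.1f}%), "
            f"util {gpu['utilize']}%, users: {user_text}"
        )
    return lines


# Hypothetical sample data following the structure above.
sample = {
    "hostname": "node_3",
    "last_update": "2023-04-29T21:17:41.419592",
    "ips": [["eno1", "192.168.0.3"]],
    "gpus": [
        {
            "index": 0,
            "name": "Example GPU",
            "use_mem": 1024,
            "tot_mem": 8192,
            "utilize": 35,
            "temp": 60,
            "users": [{"pid": 123, "username": "alice", "mem(MiB)": 1024, "command": "python train.py"}],
        }
    ],
}

summary = summarize_gpus(sample)
print("\n".join(summary))
```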

Alternatively, print the node status on the command line by running:

python node_info.py --debug