Skip to content

HTML interface to display GPU statuses of multiple nodes.

Notifications You must be signed in to change notification settings

srhthu/gpu-cluster-monitor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GPU Cluster Status Monitor

Introduction

This project aims to provide a Web application to display GPU usages of multiple nodes.

The framework is shown in Figure 1. One each node, the process node_info.py deploys a REST API for the access of node information, such as GPU numbers and usages. The status informatio of all nodes are merged by the process api.py running on one main node. This process also deploys the API to provide the combined information and a web application.


Figure 1. Framework

Quick Start

Install packages (tested with python=3.9).

pip install -r requirments.txt

Assume we have a cluster with two nodes (node_1: 192.168.0.1; node_2: 192.168.0.2) and node_1 is the main node to display cluster information.

First, run the script node_info.py on each node to start a Flask process. Then node status can be obtained through the api on port 7080 (default):

# on node_1, node_2
python node_info.py

Before start the cluster monitor, save all node IP addresses in a txt file:

> hosts.txt # clear hosts.txt
echo 192.168.0.1 >> hosts.txt
echo 192.168.0.2 >> hosts.txt

Second, run the interface API on node_1.

python api.py -c hosts.txt --port 7070

Then visit http://192.168.0.1:7070 in Chrome.

Customize the port

# on each node
python node_info.py --port <node_api_port> --disable_log

# on main node
python api.py -c hosts.txt --port <main_api_port> --node_port <node_api_port>

# visit http://<main_node_ip>:<main_api_port>

Specifications

Password

A password is required to get node status, defaulting to '8888'. To change the password, modify the global variable PASSWD in node_info.py and api.py.

Get node status (json)

Given node_info.py running on 192.168.0.3:7080, node status data can be acquired by python:

import requests

res = requests.post(f'http://192.168.0.3:7080/get-status', json = {'passwd': '8888'})

print(json.dumps(res.json(), indent = 4))

The structure of node status:

{
    "hostname": (`str`)
    "last_update": (`str`) isoformat, e.g., "2023-04-29T21:17:41.419592"
    "ips": List[Tuple[interface, ip]], e.g., [["eno1", "192.168.0.3"]]
    "gpus": [
        {
            "index": (`int`)
            "name": (`str`) gpu brand
            "use_mem": (`int`) used memory in MiB
            "tot_mem": (`int`) total memory in MiB
            "utilize": (`int`) utilization percent
            "temp": (`int`) temperature
            "index": (`int`)
            "users": [{"pid": 123, "username": xx, "mem(MiB)": 1024, "command": xx}, ...]
        },
        ...
    ]
    ""
}

Or print node status data in command line by running

python node_info.py --debug

About

HTML interface to display GPU statuses of multiple nodes.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published