MindSpore Observability

Experimental notice: This project is still experimental. It only shows how to run eBPF code in a container to trace the kernel, and provides a few simple examples of using BCC to develop eBPF tools and using ebpf_exporter to visualize system tracing metrics. For now it is a simple attempt to combine MindSpore with eBPF; practical examples are expected in the near future.

Introduction to ms_observability

MindSpore is a new open source deep learning training/inference framework that can be used in mobile, edge, and cloud scenarios. It is designed to give data scientists and algorithm engineers a friendly development experience with efficient execution, native support for the Ascend AI processor, and software-hardware co-optimization.

The problem with deep learning jobs today is that the training process is opaque. When running an AI job with MindSpore, we do not know how it is layered, which CPU core it runs on, or which kernel functions it calls and how control passes between them. When a task hits a bottleneck, developers tend to reach for common monitoring tools, but these are inflexible and have blind spots. For example, they capture information about long-lived processes but often miss short-lived ones, so that information is lost even though many of those processes do consume resources.

To close this gap, ms_observability combines MindSpore with eBPF to improve kernel-level observability of AI jobs throughout training and inference. eBPF makes the kernel programmable: it dynamically runs small programs on a wide variety of kernel events, which lets non-kernel developers write their own tracing code for the problems they actually face. It can therefore watch the kernel state of an AI job as a whole and provide more detailed context for analyzing the system and the application.

Getting Started

Prerequisites

Run eBPF code in a container to probe kernel metrics

Download ms_observability code

cd $HOME
git clone https://github.com/hellowaywewe/ms_observability.git

Build and run the ebpf_bcc_exporter container

cd $HOME/ms_observability/docker
docker build -f Dockerfile -t ebpf_bcc_exporter:latest .
cd $HOME/ms_observability
DOCKER_NAME=ebpf_bcc_exporter TAG=latest ./run_docker.sh   # mounts the host kernel into the container

Show the kernel queue IO latency metrics (a simple example of using BCC to develop eBPF code and probe the kernel)

docker exec -it ebpf_bcc_exporter /bin/bash     # open an interactive shell in the container
cd /mnt/ms_observability/ebpf_example && ./io-latency.py 1 2
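
Latency tools built on BCC commonly summarize samples as a power-of-two histogram. The contents of io-latency.py are not shown here, so the following is only a pure-Python sketch of that bucketing idea, under the assumption that latencies are integer microseconds; the function names and sample values are hypothetical.

```python
from collections import Counter


def log2_bucket(value_us):
    """Return the power-of-two bucket index for a latency in microseconds."""
    # Values of 0 and 1 share bucket 0; otherwise the index is floor(log2(value)).
    return max(int(value_us), 1).bit_length() - 1


def bucket_range(index):
    """Return the inclusive (low, high) range covered by a bucket index."""
    low = 0 if index == 0 else 1 << index
    high = (1 << (index + 1)) - 1
    return low, high


def histogram(latencies_us):
    """Count how many samples fall into each power-of-two bucket."""
    return Counter(log2_bucket(v) for v in latencies_us)


def render(hist):
    """Format the histogram as text rows, one per non-empty bucket."""
    rows = []
    for index in sorted(hist):
        low, high = bucket_range(index)
        rows.append(f"{low:>8} -> {high:<8} : {hist[index]}")
    return "\n".join(rows)


if __name__ == "__main__":
    samples = [3, 5, 6, 120, 130, 900]  # hypothetical latencies, in microseconds
    print(render(histogram(samples)))
```

Power-of-two buckets keep the table short even when latencies span several orders of magnitude, which is why this layout is popular for IO tracing.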

Visualize kernel metrics in the unified format of Prometheus

Show the kernel queue IO latency metrics (a simple example of using ebpf_exporter to configure and visualize metrics)

~/go/bin/ebpf_exporter --config.file=/mnt/ms_observability/exporter_example/io-latency.yaml

Use curl to verify that the metrics are exposed correctly

docker inspect ebpf_bcc_exporter | grep IPAddress  # look up the container's IP address
curl http://<yourContainerIP>:9435/metrics
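
The /metrics endpoint returns Prometheus text exposition format, which is easy to sanity-check in a script. The sketch below parses such text into (name, labels, value) tuples. The metric name and label values in SAMPLE are hypothetical, not taken from io-latency.yaml, and the parser is deliberately simple: it assumes no timestamps and no commas inside label values.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample value is the last space-separated field on the line.
        series, _, raw_value = line.rpartition(" ")
        name, brace, rest = series.partition("{")
        labels = {}
        if brace:
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name.strip(), labels, float(raw_value)))
    return samples


# Hypothetical output, shaped like a Prometheus histogram.
SAMPLE = """\
# HELP example_io_latency_seconds Hypothetical block IO latency histogram
# TYPE example_io_latency_seconds histogram
example_io_latency_seconds_bucket{device="sda",le="0.001"} 4
example_io_latency_seconds_bucket{device="sda",le="+Inf"} 7
example_io_latency_seconds_count{device="sda"} 7
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

To check a live exporter, the same function could be fed the body returned from the URL used in the curl command above.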

A simple attempt to combine MindSpore with eBPF

While a MindSpore LeNet job runs on the host, the tracer prints "Hello World" whenever the kernel function blk_account_io_done is called; otherwise it prints nothing.

Run lenet-io.py in the container

docker exec -it ebpf_bcc_exporter /bin/bash
cd /mnt/ms_observability/ebpf_example
./lenet-io.py
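
The behavior described above can be sketched with BCC's Python bindings. This is not the repository's lenet-io.py, only a minimal illustration that assumes BCC is installed, the script runs as root, and blk_account_io_done is available to kprobes on the host kernel (which may not hold on every kernel version). The names hello, PROBE_FN, BPF_TEXT, and run are hypothetical.

```python
# Kernel function to trace, as named in the description above.
PROBE_FN = "blk_account_io_done"

# Minimal eBPF program: write a line to the kernel trace pipe on every call.
BPF_TEXT = r"""
int hello(void *ctx) {
    bpf_trace_printk("Hello World\n");
    return 0;
}
"""


def run():
    """Attach the probe and stream trace output until interrupted."""
    # Imported lazily so this file can be loaded on machines without BCC.
    from bcc import BPF

    b = BPF(text=BPF_TEXT)
    b.attach_kprobe(event=PROBE_FN, fn_name="hello")
    print(f"Tracing {PROBE_FN}... press Ctrl-C to stop.")
    try:
        b.trace_print()  # blocks, printing each line from the trace pipe
    except KeyboardInterrupt:
        pass
```

Calling run() inside the container and then starting the training job on the host should print one line per completed block IO request, provided the probe attaches successfully.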

Run the MindSpore LeNet training job on the host (requires a MindSpore v0.2.0-alpha environment)

cd $HOME && git clone https://github.com/mindspore-ai/docs.git
conda activate mindspore && cd $HOME/docs/tutorials/tutorial_code/
python lenet.py --device_target="CPU"

Future Work

ms_observability is currently in the early stages of experimentation. The most important next step is to analyze what is worth observing in AI scenarios and which of the thousands of available kernel events can be used and traced. After that, we plan to collaborate with other open source communities:

  1. Work with the iovisor/bcc project to develop AI observability tools based on eBPF.
  2. Enable MindSpore to support eBPF AI observability tools.
  3. Work with the Prometheus and ebpf_exporter projects to visualize the AI kernel metrics.
