Experimental notice: This project is still experimental. It only shows how to run eBPF code in a container to trace the kernel, and provides a few simple examples of how to use BCC to develop eBPF tools and how to use ebpf_exporter to visualize system tracing metrics. For now it is a simple attempt to combine MindSpore with eBPF; practical examples are expected in the near future.
MindSpore is a new open source deep learning training/inference framework that can be used in mobile, edge and cloud scenarios. It is designed to give data scientists and algorithm engineers a friendly development experience and efficient execution, with native support for the Ascend AI processor and software-hardware co-optimization.
Currently, a common problem with deep learning jobs is that the AI training process is invisible. While running an AI job with MindSpore, we don't know how it is layered, which CPU core it runs on, or which kernel functions it calls and how execution jumps between them. When a task hits a bottleneck, developers tend to reach for common monitoring tools, but these are inflexible and usually have blind spots. For example, they can report on long-lived processes but often fail to capture short-lived ones, so information is lost even though many of those short-lived processes consume resources.
To close this gap, the ms_observability project combines MindSpore with eBPF to improve the observability of the kernel throughout AI training and inference. eBPF makes the kernel programmable: it can dynamically run small programs on a wide variety of kernel events. This lets non-kernel developers write their own tracing code to solve the real problems they meet, so the kernel state of an AI job can be watched as a whole, providing more detailed context for analyzing the system and the application.
cd $HOME
git clone https://github.com/hellowaywewe/ms_observability.git
cd $HOME/ms_observability/docker
docker build -f Dockerfile -t ebpf_bcc_exporter:latest .
cd $HOME/ms_observability
DOCKER_NAME=ebpf_bcc_exporter TAG=latest ./run_docker.sh  # mounts the host kernel into the container
Show the kernel queue I/O latency metrics (a simple example of using BCC to develop eBPF code and probe the kernel)
docker exec -it ebpf_bcc_exporter /bin/bash  # open an interactive shell in the container
cd /mnt/ms_observability/ebpf_example && ./io-latency.py 1 2
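Latency tools of this kind usually report a distribution rather than raw samples. As a rough illustration of the idea, and assuming io-latency.py follows the common BCC pattern of power-of-two buckets (its source is not shown here), the sketch below groups microsecond latencies into such buckets in plain Python. The function names and sample values are hypothetical, and the bucket layout is not BCC's exact output format.

```python
from collections import Counter


def log2_bucket(value_us):
    """Return the power-of-two bucket index for a latency in microseconds.

    Bucket k holds values in [2**k, 2**(k+1)); values of 0 or 1 land in bucket 0.
    """
    return max(int(value_us).bit_length() - 1, 0)


def histogram(latencies_us):
    """Count how many latencies fall into each power-of-two bucket."""
    return Counter(log2_bucket(v) for v in latencies_us)


def format_histogram(counts):
    """Render the bucket counts as 'low -> high : count' lines."""
    lines = []
    for k in sorted(counts):
        low, high = 2 ** k, 2 ** (k + 1) - 1
        lines.append(f"{low:>8} -> {high:<8} : {counts[k]}")
    return "\n".join(lines)


if __name__ == "__main__":
    samples = [3, 5, 6, 120, 130, 900]  # hypothetical latencies, not real data
    print(format_histogram(histogram(samples)))
```

Bucketing by powers of two keeps the table short even when latencies span several orders of magnitude, which is why it is a common choice for in-kernel histograms.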
Show the kernel queue I/O latency metrics (a simple example of using ebpf_exporter to configure and visualize metrics)
~/go/bin/ebpf_exporter --config.file=/mnt/ms_observability/exporter_example/io-latency.yaml
docker inspect ebpf_bcc_exporter | grep IPAddress  # query the IP address of the container
curl http://<yourContainerIP>:9435/metrics
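The endpoint returns metrics in the Prometheus text exposition format. As a minimal sketch of reading that output from a script instead of curl, the code below fetches the page and splits each sample line into a series name and a value. The helper names and the sample metric are hypothetical; only port 9435 and the /metrics path come from the command above.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition lines into (series, value) pairs.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. Optional
    timestamps after the value are not handled in this sketch.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


def fetch_metrics(container_ip, port=9435):
    """Fetch and parse /metrics from the exporter running in the container."""
    url = f"http://{container_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample; real metric names depend on the exporter configuration.
SAMPLE = """\
# HELP demo_io_latency_seconds_bucket hypothetical histogram
# TYPE demo_io_latency_seconds_bucket histogram
demo_io_latency_seconds_bucket{le="0.001"} 12
demo_io_latency_seconds_bucket{le="+Inf"} 30
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE):
        print(series, value)
```

With the exporter running, `fetch_metrics("<yourContainerIP>")` would return the same kind of list for the live metrics.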
When the MindSpore LeNet job runs on the host, the words "Hello World" are printed each time the kernel function blk_account_io_done is called; if it is never called, nothing is printed.
docker exec -it ebpf_bcc_exporter /bin/bash
cd /mnt/ms_observability/ebpf_example
./lenet-io.py
cd $HOME && git clone https://github.com/mindspore-ai/docs.git
conda activate mindspore && cd $HOME/docs/tutorials/tutorial_code/
python lenet.py --device_target="CPU"
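The behaviour described for lenet-io.py can be approximated with a very small BCC program. The following is a sketch under assumptions, not the script shipped in the repository: the names `BPF_PROGRAM`, `on_io_done` and `trace` are hypothetical, and it assumes BCC is installed and that the running kernel still exposes blk_account_io_done as a probeable symbol.

```python
# eBPF program in BCC's restricted C. It is kept as a raw string so that
# the "\n" reaches the C compiler unchanged.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

// Runs each time the probed kernel function is entered.
int on_io_done(struct pt_regs *ctx) {
    bpf_trace_printk("Hello World\n");
    return 0;
}
"""


def trace(kernel_func="blk_account_io_done"):
    """Attach a kprobe and stream trace output until interrupted (needs root)."""
    # Imported here so this module still loads on machines without BCC.
    from bcc import BPF

    b = BPF(text=BPF_PROGRAM)
    b.attach_kprobe(event=kernel_func, fn_name="on_io_done")
    print(f"Tracing {kernel_func}... press Ctrl-C to stop")
    try:
        b.trace_print()  # blocks, printing each bpf_trace_printk line
    except KeyboardInterrupt:
        pass
```

Calling `trace()` as root inside the container and then starting the LeNet job on the host should print one line per probe hit. If attaching fails, checking /proc/kallsyms for the function name is a reasonable first step, since kernel symbols differ between versions.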
ms_observability is currently at an early experimental stage. Most importantly, the next step is to analyze what is needed in AI scenarios and which of the thousands of available kernel events can be used and traced. After that, we plan to collaborate with other open source communities:
- Work with the iovisor/bcc project to develop AI observability tools based on eBPF.
- Enable MindSpore to support eBPF AI observability tools.
- Work with the Prometheus and ebpf_exporter projects to visualize the AI kernel metrics.