This repository includes the source code for the measurements and the analysis scripts used for "Network Traffic Characteristics of Machine Learning Frameworks Under the Microscope" by Johannes Zerwas, Kaan Aykurt, Stefan Schmid and Andreas Blenk (2021).
The collected traces are available here.
Folders:

- `analysis/`: parsing, aggregation and evaluation scripts
- `custombox/`: Vagrant VM files
- `frameworks/`: Python automation of the experiments. Scripts to run the DML trainings are in the sub-folders corresponding to each framework; the remaining scripts and modules are shared across all frameworks.
- Install the necessary Python packages on the orchestrator/controller machine: `pip install paramiko pandas numpy seaborn scikit-learn matplotlib statsmodels`
- Update `config.py`
- The orchestrator must be able to SSH as root into the worker nodes
- The worker nodes' root user must have an SSH keypair
- Prepare the folder `/root/dependencies` with the downloaded files in the following folder structure:
  - cuda-files (can be downloaded from the NVIDIA Developer Program)
    - libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb
    - libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb
    - libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb
  - golang (can be downloaded from https://golang.org/)
    - go1.16.3.linux-amd64.tar.gz
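Before launching experiments, the layout above can be sanity-checked from the orchestrator. This is a minimal sketch; the `missing_dependencies` helper is illustrative and not part of the repository:

```python
from pathlib import Path

# Expected layout of /root/dependencies (file names taken from the README)
EXPECTED = {
    "cuda-files": [
        "libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb",
        "libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb",
        "libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb",
    ],
    "golang": ["go1.16.3.linux-amd64.tar.gz"],
}

def missing_dependencies(root: str = "/root/dependencies") -> list:
    """Return the relative paths of expected files that are absent under `root`."""
    base = Path(root)
    return [f"{sub}/{name}"
            for sub, names in EXPECTED.items()
            for name in names
            if not (base / sub / name).is_file()]
```

An empty return value means every expected file is in place on the worker.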
- Update `custombox/custombox_ssh_install.sh` with the SSH public key of the orchestrator/controller node
The `run_experiment.py` script controls the whole VM creation and experiment-running process. The script is run with the following parameters:

- `framework`: name of the framework
- `backend`: name of the communication backend (only used for naming; the desired backend must be specified manually in the custom box creation scripts)
- `models`: models to be run, separated by commas
- `batchsizes`: batch sizes of interest, separated by commas
- `topologies`: topologies of interest, separated by commas (defaults to ring)
- `losses`: losses of interest, separated by commas (defaults to 0)
- `usebox`: flag to enable intermediary custom box usage (currently only TensorFlow supports this; defaults to False)
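The command-line shape described above can be sketched with `argparse`. The `build_parser` helper and the exact flag semantics are assumptions for illustration, not the repository's actual implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the run_experiment.py parameters."""
    p = argparse.ArgumentParser(description="Run a DML training experiment")
    p.add_argument("--framework", required=True)
    # backend is only used for naming; the real backend is set in the box scripts
    p.add_argument("--backend", required=True)
    # comma-separated values are split into lists at parse time
    p.add_argument("--models", type=lambda s: s.split(","), required=True)
    p.add_argument("--batchsizes", type=lambda s: [int(b) for b in s.split(",")],
                   required=True)
    p.add_argument("--topologies", type=lambda s: s.split(","), default=["ring"])
    p.add_argument("--losses", type=lambda s: [float(x) for x in s.split(",")],
                   default=[0.0])
    p.add_argument("--usebox", action="store_true")
    return p
```

Splitting at parse time matches the "separated by commas" convention of the parameter list.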
Example:
python3 run_experiment.py --framework kungfu --backend kungfu --models mobilenetv2,densenet201 --batchsizes 64
--topologies BINARY_TREE_STAR,TREE,CLIQUE --losses 0,0.05,0.1
Notes:
- The CUDA and Golang dependencies must be placed in a folder that is passed through to the VM.
- All VMs must have a folder mounted into which the results are written.
- If the nodes have different internet connection speeds, run a dummy experiment first so that each worker already has a copy of the training dataset and no worker hangs at the beginning.
| Framework | Backend | Models | Batch Sizes | Topologies | Losses |
|---|---|---|---|---|---|
| TensorFlow | grpc, grpc_nccl | mobilenetv2, densenet201, resnet50, resnet101 | 64, 128, 512 | ring | 0, 0.05, 0.1, 0.2, 0.5, 1, 2 |
| Horovod | mpi, mpi_nccl, gloo, gloo_nccl | mobilenetv2, densenet201, resnet50, resnet101 | 64, 128, 512 | ring | 0 |
| KungFu | kungfu | mobilenetv2, densenet201, resnet50, resnet101 | 64 | BINARY_TREE_STAR, CLIQUE, STAR, TREE | 0 |
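Each table row spans a full parameter grid. As a sketch, the TensorFlow configurations can be enumerated with `itertools.product` (the dictionary layout below is illustrative, not the repository's data structure):

```python
from itertools import product

# TensorFlow row of the experiment table
tf_grid = {
    "backend": ["grpc", "grpc_nccl"],
    "model": ["mobilenetv2", "densenet201", "resnet50", "resnet101"],
    "batchsize": [64, 128, 512],
    "topology": ["ring"],
    "loss": [0, 0.05, 0.1, 0.2, 0.5, 1, 2],
}

# Cartesian product over all parameter values: 2 * 4 * 3 * 1 * 7 = 168 runs
runs = [dict(zip(tf_grid, combo)) for combo in product(*tf_grid.values())]
```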
Raw traces are first parsed via `parsing_script.py` and then aggregated via `aggregate.py` under the `analysis/` folder.
A Jupyter notebook (`analysis.ipynb`) is used for the evaluation and visualization of the results.
Folder names designate the experiment configuration in the format `framework-model-optimizer-batchsize-backend-delay-(topology)`.
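A minimal sketch of recovering the configuration from such a folder name, assuming no field itself contains a hyphen; the helper name and the `sgd` optimizer value below are purely illustrative:

```python
def parse_experiment_folder(name: str) -> dict:
    """Split a result-folder name of the form
    framework-model-optimizer-batchsize-backend-delay-(topology)
    into its fields; the trailing topology is optional."""
    keys = ["framework", "model", "optimizer", "batchsize",
            "backend", "delay", "topology"]
    # zip() stops at the shorter sequence, so a missing topology
    # simply leaves that key out of the result
    return dict(zip(keys, name.split("-")))
```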
- Run parsing: `python parsing_script.py --path /path/to/datafolder`
- Run aggregation: `python aggregate.py --path /path/to/datafolder`