English | 简体中文
KaiTian is a PyTorch backend extension that enables distributed data parallel training across heterogeneous devices.
The name comes from the theme song of the 2018 LPL Spring Finals and also conveys the meaning of "creating the world". See https://www.youtube.com/watch?v=mGAjqYyVvzc.
Currently, only single-node multi-GPU DDP is supported. Multi-node multi-GPU DDP and model parallelism have not yet been implemented.
- Python >= 3.8
- Docker Engine (Recommended >= 26.0.2)
- NVIDIA CUDA (optional)
  - NVIDIA Driver (Recommended >= 520.61.05)
  - NVIDIA Container Toolkit (Recommended >= 1.15.0)
- Cambricon MLU (optional)
  - Cambricon MLU Driver (Recommended >= 5.10.22)
```shell
git clone --recurse-submodules https://github.com/jklincn/kaitian.git
cd kaitian
python -m pip install -r requirements.txt
python -m pip install .
```
The following takes distributed training with NVIDIA CUDA as an example.
- Import kaitian

  ```python
  import torch_kaitian
  from torch_kaitian.distributed import DistributedSampler, optimize_batch_size
  ```
- Modify world_size

  ```python
  # Before
  world_size = torch.cuda.device_count()
  # After
  world_size = torch_kaitian.local_device_count()
  ```
- Modify distributed setup

  ```python
  # Before
  dist.init_process_group("nccl", rank=rank, world_size=world_size)
  torch.cuda.set_device(rank)
  # After
  dist.init_process_group("kaitian", rank=rank, world_size=world_size)
  torch_kaitian.set_device(rank)
  ```
- Modify device

  ```python
  # Before
  device = "cuda"
  # After
  device = torch_kaitian.device()
  ```
- Modify DistributedSampler of train_set

  ```python
  # Before
  train_sampler = DistributedSampler(train_set)
  # After
  train_sampler = DistributedSampler(train_set, batch_size)
  ```
- Modify DataLoader of train_set

  ```python
  # Before
  train_loader = DataLoader(train_set, batch_size=batch_size, sampler=train_sampler, num_workers=2)
  # After
  train_loader = DataLoader(train_set, batch_size=optimize_batch_size(batch_size), sampler=train_sampler, num_workers=2)
  ```
- Modify random seed settings

  ```python
  # Before
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.backends.cudnn.deterministic = True
  # After
  torch_kaitian.manual_seed(seed)
  ```
Complete adaptation examples can be found in example/cuda.py (original code) and example/kaitian.py (modified code).
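To show how these changes fit together, here is a condensed sketch of an adapted single-node training script. The model, dataset, hyperparameters, and the `mp.spawn` launcher below are illustrative placeholders rather than anything KaiTian requires; example/kaitian.py remains the authoritative reference.

```python
# Condensed sketch only -- the model, dataset, and hyperparameters are placeholders.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

import torch_kaitian
from torch_kaitian.distributed import DistributedSampler, optimize_batch_size

BATCH_SIZE = 64
SEED = 42


def train(rank, world_size):
    # KaiTian replaces the NCCL backend and the CUDA-specific device handling.
    dist.init_process_group("kaitian", rank=rank, world_size=world_size)
    torch_kaitian.set_device(rank)
    torch_kaitian.manual_seed(SEED)
    device = torch_kaitian.device()

    # Toy dataset standing in for the real one.
    train_set = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
    train_sampler = DistributedSampler(train_set, BATCH_SIZE)
    train_loader = DataLoader(
        train_set,
        batch_size=optimize_batch_size(BATCH_SIZE),
        sampler=train_sampler,
        num_workers=2,
    )

    # The adaptation steps above do not change how the model is wrapped in DDP.
    model = DDP(nn.Linear(16, 2).to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch_kaitian.local_device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
```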
Initialize the KaiTian environment, which includes creating a configuration file, registering available devices, and pulling images.
```shell
kaitian init
```
If the machine environment changes (for example, the number of installed accelerators changes), reinitialize:
```shell
kaitian init -r
```
Then run your training script:

```shell
kaitian run -f your_code.py
```
By default, all available devices on the host will be used. You can specify which accelerators to use with `USE_XXX`. Currently supported options include `USE_CUDA` and `USE_MLU`.
Usage examples:
- `USE_CUDA=0`: use GPU0 to accelerate
- `USE_CUDA=0,1`: use GPU0 and GPU1 to accelerate
- `USE_CUDA=0 USE_MLU=1`: use GPU0 and MLU1 to accelerate
- `USE_CUDA=-1`: do not use GPU to accelerate
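For example, assuming the `USE_XXX` switches are ordinary environment variables that can be set inline for a single `kaitian run` invocation:

```shell
# Assumption: USE_CUDA / USE_MLU are plain environment variables, so they can
# be prefixed to one invocation (here: GPU0 plus MLU1, as in the list above).
USE_CUDA=0 USE_MLU=1 kaitian run -f your_code.py
```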
KaiTian currently provides the following images:

- `0.0.0-cuda`
  - Python 3.10 + PyTorch 1.13.1 + CUDA 11.6 + cuDNN 8
  - FROM `pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel`
- `0.0.0-mlu`
  - Python 3.10 + PyTorch 1.13.1 + Cambricon 1.17.0
  - FROM `yellow.hub.cambricon.com/pytorch/pytorch:v1.17.0-torch1.13.1-ubuntu20.04-py310`