- Bug fix for GPU Burn test: remove cp ptx file command (#567)
- Support INT8 in cublaslt function (#574)
- Support cpu-gpu and gpu-cpu in ib-validation (#581)
- Support graph mode in NCCL/RCCL benchmarks for latency metrics (#583)
- Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)
- Add distributed inference benchmark cpp implementation (#586)
- Add MSCCL support for Nvidia GPU (#584)
- Support in-place for NCCL/RCCL benchmark (#591)
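The new one-to-all, all-to-one, and all-to-all modes in gpu_copy_bw_performance can be pictured as pair enumerations over GPUs. The helper below is a hypothetical illustration (not SuperBench code) of which (src, dst) copies each mode implies, assuming device 0 as the root for the one-to-all and all-to-one cases.

```python
# Hypothetical sketch (not SuperBench code): enumerate the (src, dst)
# copy pairs implied by each gpu_copy_bw_performance mode.
def copy_pairs(mode, num_gpus, root=0):
    """Return the list of (src, dst) GPU copies a given mode performs."""
    if mode == "one-to-all":   # root pushes to every other GPU
        return [(root, d) for d in range(num_gpus) if d != root]
    if mode == "all-to-one":   # every other GPU pushes to root
        return [(s, root) for s in range(num_gpus) if s != root]
    if mode == "all-to-all":   # every ordered pair of distinct GPUs
        return [(s, d) for s in range(num_gpus)
                for d in range(num_gpus) if s != d]
    raise ValueError(f"unknown mode: {mode}")

print(copy_pairs("one-to-all", 4))  # [(0, 1), (0, 2), (0, 3)]
```

On a 4-GPU node, all-to-all thus measures 12 directed copies, versus 3 for each of the rooted modes.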
- Change torch.distributed.launch to torchrun (#556)
- Support Megatron-LM/Megatron-DeepSpeed GPT pretrain benchmark (#582)
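The launcher change replaces the deprecated `torch.distributed.launch` module with PyTorch's `torchrun` entry point. The sketch below shows the general shape of that migration for an illustrative script name (`train.py` and the flag values are placeholders, not taken from this release):

```shell
# Before: deprecated launcher module
python -m torch.distributed.launch --nproc_per_node=8 train.py

# After: torchrun entry point (distributed arguments are otherwise equivalent)
torchrun --nproc_per_node=8 train.py
```

One practical difference is that `torchrun` passes the local rank to the script via the `LOCAL_RANK` environment variable rather than a `--local_rank` argument.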
- Update Docker image for H100 support: upgrade to CUDA 12.2 (#577)
- Add HPL random generator to gemm-flops with ROCm (#578)
- Update MLC version to 3.10 for CUDA/ROCm dockerfiles (#562)
- Add hipBLASLt function benchmark (#576)
- Support cpu-gpu and gpu-cpu in ib-validation (#581)
- Support graph mode in NCCL/RCCL benchmarks for latency metrics (#583)
- Add one-to-all, all-to-one, all-to-all support to gpu_copy_bw_performance (#588)
- Add distributed inference benchmark cpp implementation (#586)
- Support in-place for NCCL/RCCL benchmark (#591)
- Support monitoring for AMD GPUs (#580)
- Support baseline generation from multiple nodes (#575)
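Baseline generation from multiple nodes amounts to reducing per-node metric values into a single reference value per metric. The snippet below is a simplified sketch of that idea; the mean-based reduction and the `{metric: value}` result layout are assumptions for illustration, not the Analyzer's actual algorithm.

```python
from statistics import mean

# Simplified sketch (assumed layout, not the real Analyzer code):
# each node reports {metric_name: value}; the baseline takes the mean
# across nodes for every metric reported by all nodes.
def generate_baseline(node_results):
    common_metrics = set.intersection(*(set(r) for r in node_results))
    return {m: mean(r[m] for r in node_results)
            for m in sorted(common_metrics)}

nodes = [
    {"gemm-flops/fp32": 19.1, "nccl/allreduce_bw": 230.0},
    {"gemm-flops/fp32": 18.9, "nccl/allreduce_bw": 228.0},
]
print(generate_baseline(nodes))
```

A real baseline rule would likely be more conservative than a plain mean (e.g. applying a tolerance band), but the multi-node reduction step has this shape.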
Test Cases
- Single-node test
- A100 and H100 related
- MI200 and MI300x
- Result analysis