Improve latency measurement #50
Comments
I'm going to add an nsys report analyzer this week, but may only focus on the kernel execution time summary and the number of kernels.
We can support different ways to measure kernel latency: by default we use Triton's do_bench/do_bench_cudagraph, but users can choose nsys to get a more accurate kernel run time at the cost of slower benchmarking.
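As a rough illustration of what such a user-selectable measurement switch could look like (all function and parameter names below are hypothetical, not tritonbench's actual API):

```python
from typing import Callable, Dict

# Hypothetical placeholder backends: in practice "do_bench" would call
# triton.testing.do_bench(fn), and "nsys" would run the workload under
# `nsys profile` and post-process the generated report.
def measure_with_do_bench(fn: Callable[[], None]) -> float:
    raise NotImplementedError("would call triton.testing.do_bench(fn)")

def measure_with_nsys(fn: Callable[[], None]) -> float:
    raise NotImplementedError("would run fn under `nsys profile` and parse the report")

LATENCY_BACKENDS: Dict[str, Callable[[Callable[[], None]], float]] = {
    "do_bench": measure_with_do_bench,  # fast; CUDA events, includes launch latency
    "nsys": measure_with_nsys,          # slow; reports pure GPU kernel time
}

def get_latency_backend(name: str = "do_bench") -> Callable[[Callable[[], None]], float]:
    """Pick a latency-measurement backend by name."""
    try:
        return LATENCY_BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown latency backend: {name!r}") from None
```

The point of the sketch is only the trade-off in the dispatch table: the fast default versus an opt-in, slower-but-more-accurate path.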
Summary: This PR adds an nsys report analyzer providing the following metrics:

```python
nsys_metrics_to_reports = {
    # the sum of kernel execution time
    "nsys_gpu_kernel_sum": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the overhead of kernel launch
    "nsys_launch_overhead": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the names of kernels
    "nsys_kernel_names": ["cuda_gpu_kern_sum"],
    # the durations of kernels
    "nsys_kernel_durations": ["cuda_gpu_kern_sum"],
    # the duration of nvtx range
    "nsys_nvtx_range_duration": ["nvtx_sum"],
    # the number of kernels
    "nsys_num_of_kernels": ["cuda_gpu_kern_sum"],
}
```

`nsys_gpu_kernel_sum` is the total GPU kernel execution time, `nsys_nvtx_range_duration` is the total execution time of the operator, and `nsys_launch_overhead` is their difference, which indicates the launch overhead. This is one of the ways to measure execution time mentioned in #50.

Fix #67

Pull Request resolved: #65

Test Plan:

```
% python run.py --op rope --num-inputs 1 --metrics nsys_gpu_kernel_sum,nsys_launch_overhead,nsys_kernel_names,nsys_kernel_durations,nsys_nvtx_range_duration,nsys_num_of_kernels --csv --dump-csv
  0%| | 0/1 [00:00<?, ?it/s]
`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
  0%| | 0/1 [00:00<?, ?it/s]
`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-531e.qdstrm'
Processing events... [1/1] [========================100%] nsys_output.nsys-rep
Generated: /tmp/tritonbench/rope/nsys_traces/apply_rotary_pos_emb_0/nsys_output.nsys-rep
  0%| | 0/1 [00:00<?, ?it/s]
`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-39ea.qdstrm'
Processing events... [1/1] [========================100%] nsys_output.nsys-rep
Generated: /tmp/tritonbench/rope/nsys_traces/liger_rotary_pos_emb_0/nsys_output.nsys-rep
  0%| | 0/1 [00:00<?, ?it/s]
`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-e8bf.qdstrm'
Processing events... [1/1] [========================100%] nsys_output.nsys-rep
Generated: /tmp/tritonbench/rope/nsys_traces/inductor_rotary_pos_emb_full_op_0/nsys_output.nsys-rep
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.40s/it]
(H, T);apply_rotary_pos_emb-nsys_kernel_names;apply_rotary_pos_emb-nsys_kernel_durations;apply_rotary_pos_emb-nsys_gpu_kernel_sum;apply_rotary_pos_emb-nsys_num_of_kernels;apply_rotary_pos_emb-nsys_launch_overhead;apply_rotary_pos_emb-nsys_nvtx_range_duration;liger_rotary_pos_emb-nsys_kernel_names;liger_rotary_pos_emb-nsys_kernel_durations;liger_rotary_pos_emb-nsys_gpu_kernel_sum;liger_rotary_pos_emb-nsys_num_of_kernels;liger_rotary_pos_emb-nsys_launch_overhead;liger_rotary_pos_emb-nsys_nvtx_range_duration;inductor_rotary_pos_emb_full_op-nsys_kernel_names;inductor_rotary_pos_emb_full_op-nsys_kernel_durations;inductor_rotary_pos_emb_full_op-nsys_gpu_kernel_sum;inductor_rotary_pos_emb_full_op-nsys_num_of_kernels;inductor_rotary_pos_emb_full_op-nsys_launch_overhead;inductor_rotary_pos_emb_full_op-nsys_nvtx_range_duration
(8192, 1024);['void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::MulFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>, unsigned int, (int)4, (int)64, (int)64>(T1 *, at::native::<unnamed>::CatArrInputTensorMetadata<T1, T2, T4, T5>, at::native::<unnamed>::TensorSizeStride<T2, (unsigned int)4>, int, T2)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::CUDAFunctor_add<float>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::neg_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)'];0.090065;0.351364;4;0.4534;0.804764;['_triton_rope'];0.049281;0.049281;1;0.176437;0.225718;['triton_poi_fused_add_cat_mul_0', 'triton_poi_fused_add_cat_mul_1'];0.0266885;0.053377;2;0.444969;0.498346
[TritonBench] Dumped csv to /tmp/tritonbench/op_rope__z_yqmrz.csv
```

Reviewed By: xuzhao9

Differential Revision: D66311127

Pulled By: FindHao

fbshipit-source-id: 085454e34a3e9aadb360309cc69885684a8a1758
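The per-kernel metrics above come from nsys's `cuda_gpu_kern_sum` report, which `nsys stats` can emit as CSV. As a rough illustration only, aggregating such a report into the metrics might look like the sketch below (the column names `Total Time (ns)` and `Name` are assumptions on my part; the actual schema can differ across nsys versions):

```python
import csv
import io

def kernel_summary(report_csv: str) -> dict:
    """Aggregate a cuda_gpu_kern_sum-style CSV into tritonbench-like metrics.

    Assumes columns named 'Total Time (ns)' and 'Name'; real nsys reports
    may use different column names depending on the version.
    """
    rows = list(csv.DictReader(io.StringIO(report_csv)))
    durations_ms = [float(r["Total Time (ns)"]) / 1e6 for r in rows]
    return {
        "nsys_kernel_names": [r["Name"] for r in rows],
        "nsys_kernel_durations": durations_ms,
        "nsys_gpu_kernel_sum": sum(durations_ms),  # total GPU kernel time (ms)
        "nsys_num_of_kernels": len(rows),
    }
```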
Let's take kl_div as an example. For the inductor implementation, the final kernel
We want to remove both the CPU and GPU launch latency from the GPU kernel runtime. Right now `do_bench` uses CUDA events, which means it will include the GPU launch latency. It would be more accurate to measure GPU kernel latency with Kineto/CUPTI, or nsys.
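The nsys-based metrics recover this split by subtraction: the NVTX range covers the whole operator (including launch overhead), while the kernel sum covers only GPU execution. The decomposition is just:

```python
def launch_overhead_ms(nvtx_range_ms: float, gpu_kernel_sum_ms: float) -> float:
    """nvtx_range_duration = gpu_kernel_sum + launch_overhead, so the
    launch overhead is the difference between the two measurements."""
    return nvtx_range_ms - gpu_kernel_sum_ms
```

For the `apply_rotary_pos_emb` numbers in the test plan above (0.804764 ms NVTX range, 0.351364 ms kernel sum), this yields the reported 0.4534 ms of launch overhead.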
We also want latency (+/- variance%) to be available out of the box, with support for both tabulate and CSV output formats.
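Interpreting "variance%" as the relative spread of the latency samples around the mean (an assumption on my part, not a tritonbench definition), a minimal formatter supporting both a table-style and a CSV-style rendering might look like:

```python
import statistics

def latency_with_spread(samples_ms):
    """Mean latency plus sample spread, expressed as a percentage of the mean."""
    mean = statistics.mean(samples_ms)
    spread = statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0
    return mean, 100.0 * spread / mean

def format_latency(samples_ms, fmt: str = "table") -> str:
    # "csv" matches the semicolon-separated dump style used elsewhere in
    # this thread; anything else falls back to a human-readable cell.
    mean, pct = latency_with_spread(samples_ms)
    if fmt == "csv":
        return f"{mean:.6f};{pct:.2f}"
    return f"{mean:.6f} ms (+/- {pct:.2f}%)"
```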