
Improve latency measurement #50

Open
xuzhao9 opened this issue Nov 14, 2024 · 3 comments

Comments


xuzhao9 (Contributor) commented Nov 14, 2024

We want to remove both the CPU and GPU launch latency from the GPU kernel runtime. Right now `do_bench` uses CUDA events, which means the measurement includes the GPU launch latency.

It would be more accurate to measure GPU kernel latency with Kineto/CUPTI or nsys.

We also want latency (+/- variance%) to be available out of the box, with support for both tabulate and CSV output formats.
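For the "latency (+/- variance%)" output, a minimal sketch of the formatting and CSV emission might look like the following. The `format_latency` helper and the definition of variance% (full spread relative to the median) are assumptions for illustration; tritonbench may define and name these differently.

```python
import csv
import io
import statistics


def format_latency(samples_ms, fmt="table"):
    """Render latency samples as 'median (+/- variance%)'.

    variance% here is (max - min) / (2 * median) * 100 -- an assumed
    definition, not necessarily tritonbench's.
    """
    med = statistics.median(samples_ms)
    var_pct = (max(samples_ms) - min(samples_ms)) / (2 * med) * 100
    cell = f"{med:.3f} (+/-{var_pct:.1f}%)"
    if fmt == "csv":
        # Semicolon-delimited, matching the CSV output shown later in this thread.
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter=";")
        writer.writerow(["latency"])
        writer.writerow([cell])
        return buf.getvalue()
    return cell
```

A tabulate-based table renderer could consume the same cell string, so both output formats share one formatting path.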


FindHao (Member) commented Nov 19, 2024

I'm going to add an nsys report analyzer this week, but it may only cover the kernel execution time summary and the number of kernels.


xuzhao9 (Contributor, Author) commented Nov 21, 2024

We can support different ways to measure kernel latency: by default we use Triton's `do_bench`/`do_bench_cudagraph`, but the user can choose nsys to get a more accurate kernel runtime at the cost of slower benchmarking.
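The backend selection described above could be sketched as a simple dispatch. The function name `measure_latency` and the string values are hypothetical; tritonbench's actual hooks may differ, and the nsys path would in practice shell out to `nsys profile` and parse the generated report rather than raise.

```python
def measure_latency(fn, backend="do_bench"):
    """Dispatch kernel-latency measurement to a backend.

    Hypothetical sketch: 'do_bench' and 'do_bench_cudagraph' use
    Triton's CUDA-event-based timing (fast, includes launch latency);
    'nsys' would trade benchmarking speed for more accurate
    kernel-only timing.
    """
    if backend == "do_bench":
        from triton.testing import do_bench
        return do_bench(fn)
    if backend == "do_bench_cudagraph":
        from triton.testing import do_bench_cudagraph
        return do_bench_cudagraph(fn)
    if backend == "nsys":
        # Placeholder: real code would run under `nsys profile` and
        # parse the resulting report for kernel durations.
        raise NotImplementedError("nsys backend: wrap the run in nsys and parse the report")
    raise ValueError(f"unknown backend: {backend}")
```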

facebook-github-bot pushed a commit that referenced this issue Nov 27, 2024
Summary:
This PR adds an nsys report analyzer providing the following metrics:
```python
nsys_metrics_to_reports = {
    # the sum of kernel execution time
    "nsys_gpu_kernel_sum": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the overhead of kernel launch
    "nsys_launch_overhead": ["cuda_gpu_kern_sum", "nvtx_sum"],
    # the names of kernels
    "nsys_kernel_names": ["cuda_gpu_kern_sum"],
    # the durations of kernels
    "nsys_kernel_durations": ["cuda_gpu_kern_sum"],
    # the duration of nvtx range
    "nsys_nvtx_range_duration": ["nvtx_sum"],
    # the number of kernels
    "nsys_num_of_kernels": ["cuda_gpu_kern_sum"],
}
```
`nsys_gpu_kernel_sum` is the total GPU kernel execution time, `nsys_nvtx_range_duration` is the total execution time of the operator, and `nsys_launch_overhead` is their difference, which indicates the launch overhead. This is one of the measurement approaches mentioned in #50.
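The relationship between the three metrics reduces to one subtraction. A minimal sketch (function name is illustrative), which reproduces the `liger_rotary_pos_emb` numbers in the test plan output below (nvtx range 0.225718, one kernel of 0.049281, launch overhead 0.176437):

```python
def launch_overhead(nvtx_range_duration, kernel_durations):
    """nsys_launch_overhead: the NVTX range duration of the operator
    minus the summed GPU kernel execution time (nsys_gpu_kernel_sum)."""
    gpu_kernel_sum = sum(kernel_durations)
    return nvtx_range_duration - gpu_kernel_sum
```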

Fix #67

Pull Request resolved: #65

Test Plan:
```
% python run.py --op rope  --num-inputs 1  --metrics nsys_gpu_kernel_sum,nsys_launch_overhead,nsys_kernel_names,nsys_kernel_durations,nsys_nvtx_range_duration,nsys_num_of_kernels --csv --dump-csv
  0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-531e.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/apply_rotary_pos_emb_0/nsys_output.nsys-rep
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-39ea.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/liger_rotary_pos_emb_0/nsys_output.nsys-rep
  0%|          | 0/1 [00:00<?, ?it/s]`LlamaRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Capture range started in the application.
Capture range ended in the application.
Generating '/tmp/nsys-report-e8bf.qdstrm'
[1/1] [0%                          ] nsys_output.nsys-repProcessing events...
[1/1] [========================100%] nsys_output.nsys-rep
Generated:
    /tmp/tritonbench/rope/nsys_traces/inductor_rotary_pos_emb_full_op_0/nsys_output.nsys-rep
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:52<00:00, 52.40s/it]
(H, T);apply_rotary_pos_emb-nsys_kernel_names;apply_rotary_pos_emb-nsys_kernel_durations;apply_rotary_pos_emb-nsys_gpu_kernel_sum;apply_rotary_pos_emb-nsys_num_of_kernels;apply_rotary_pos_emb-nsys_launch_overhead;apply_rotary_pos_emb-nsys_nvtx_range_duration;liger_rotary_pos_emb-nsys_kernel_names;liger_rotary_pos_emb-nsys_kernel_durations;liger_rotary_pos_emb-nsys_gpu_kernel_sum;liger_rotary_pos_emb-nsys_num_of_kernels;liger_rotary_pos_emb-nsys_launch_overhead;liger_rotary_pos_emb-nsys_nvtx_range_duration;inductor_rotary_pos_emb_full_op-nsys_kernel_names;inductor_rotary_pos_emb_full_op-nsys_kernel_durations;inductor_rotary_pos_emb_full_op-nsys_gpu_kernel_sum;inductor_rotary_pos_emb_full_op-nsys_num_of_kernels;inductor_rotary_pos_emb_full_op-nsys_launch_overhead;inductor_rotary_pos_emb_full_op-nsys_nvtx_range_duration
(8192, 1024);['void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::BinaryFunctor<float, float, float, at::native::binary_internal::MulFunctor<float>>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>, unsigned int, (int)4, (int)64, (int)64>(T1 *, at::native::<unnamed>::CatArrInputTensorMetadata<T1, T2, T4, T5>, at::native::<unnamed>::TensorSizeStride<T2, (unsigned int)4>, int, T2)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::CUDAFunctor_add<float>>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)', 'void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::native::neg_kernel_cuda(at::TensorIteratorBase &)::[lambda() (instance 2)]::operator ()() const::[lambda() (instance 7)]::operator ()() const::[lambda(float) (instance 1)]>(at::TensorIteratorBase &, const T1 &)::[lambda(int) (instance 1)]>(int, T3)'];0.090065;0.351364;4;0.4534;0.804764;['_triton_rope'];0.049281;0.049281;1;0.176437;0.225718;['triton_poi_fused_add_cat_mul_0', 'triton_poi_fused_add_cat_mul_1'];0.0266885;0.053377;2;0.444969;0.498346
[TritonBench] Dumped csv to /tmp/tritonbench/op_rope__z_yqmrz.csv
```

Reviewed By: xuzhao9

Differential Revision: D66311127

Pulled By: FindHao

fbshipit-source-id: 085454e34a3e9aadb360309cc69885684a8a1758

FindHao (Member) commented Dec 7, 2024

Let's take kl_div as an example.

```
python run.py --op kl_div --mode fwd --precision fp32 --num-inputs 1 --input-id 5 --metrics kineto_trace,latency,ncu_rep,nsys_rep,nsys_kernel_names,nsys_kernel_durations, --csv
```

For the inductor implementation, the statistics for the final kernel `triton_red_fused_mul_sub_sum_xlogy_0` are the following.

| profiler | measured latency |
| --- | --- |
| kineto | 1.916171 ms |
| ncu | 1.837184 ms |
| nsys | 1.903886 ms |
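The three profilers agree within a few percent here. A quick sketch of the relative spread across readings (taking the median as the reference point, which is an arbitrary but common choice):

```python
def spread_pct(measurements_ms):
    """Relative spread of profiler readings: (max - min) / median * 100."""
    vals = sorted(measurements_ms)
    median = vals[len(vals) // 2]
    return (vals[-1] - vals[0]) / median * 100


# Readings for triton_red_fused_mul_sub_sum_xlogy_0 from the table above.
kl_div_spread = spread_pct([1.916171, 1.837184, 1.903886])
```

For this kernel the spread is roughly 4%, which gives a sense of how much the choice of profiler alone can move a reported latency.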
