NCU trace doesn't take mode arguments #87

FindHao · 2024-12-02T19:13:22Z

ncu_rep only profiles the forward pass in fwd_bwd mode, and it fails for bwd tests.

To reproduce:

python run.py --op rms_norm --mode bwd --precision fp32 --metrics ncu_rep,kineto_trace --cudagraph


  0%|                                                                                                                                                                                                                                                              | 0/6 [00:00<?, ?it/s]I1202 15:21:27.471187 1905825 DynoCmdLine.cpp:1393] Target Host: localhost (port 1777)
Failed to configure DCGM profiling, it may be DCGM is not supported or failed==PROF== Connected to process 1906325 (/home/yhao/.conda/envs/ptd/bin/python3.11)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.40it/s]
      (M, H)    llama_rms-_ncu_trace_in_task
------------  ------------------------------
(2048, 1024)                         success
==PROF== Disconnected from process 1906325
==WARNING== No kernels were profiled.
==WARNING== Note that specified NVTX include expressions match only push/pop ranges.
==WARNING== Refer https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvtx-filtering for NVTX Filtering usage.
I1202 15:21:34.731528 1906690 DynoCmdLine.cpp:1393] Target Host: localhost (port 1777)
Failed to configure DCGM profiling, it may be DCGM is not supported or failed==PROF== Connected to process 1906796 (/home/yhao/.conda/envs/ptd/bin/python3.11)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.08s/it]
      (M, H)    liger_rms-_ncu_trace_in_task
------------  ------------------------------
(2048, 1024)                         success
==PROF== Disconnected from process 1906796
==WARNING== No kernels were profiled.
==WARNING== Note that specified NVTX include expressions match only push/pop ranges.
==WARNING== Refer https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvtx-filtering for NVTX Filtering usage.
I1202 15:21:42.512831 1907289 DynoCmdLine.cpp:1393] Target Host: localhost (port 1777)

xuzhao9 · 2024-12-09T18:53:31Z

I can reproduce:

ncu --nvtx --nvtx-include tritonbench_range/ --target-processes all --import-source yes --set full -f -o /tmp/tritonbench/rms_norm/ncu_traces/inductor_rms_0/ncu_output.ncu-rep /home/xzhao9/.conda/envs/py312/bin/python run.py --op rms_norm --mode bwd --precision fp32 --only inductor_rms --num-inputs 1 --input-id 0 --metrics _ncu_trace_in_task

No kernels are profiled when running in bwd mode. fwd mode does not have this problem.
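To compare the two modes side by side, the ncu invocation above can be assembled programmatically. The following is only a minimal sketch: `build_ncu_command` is a hypothetical helper, not part of tritonbench, and the output path in the usage line is a placeholder. The ncu flags and the `_ncu_trace_in_task` metric are copied from the command above. It does not address the underlying bug; it only makes it easy to rerun the same reproduction with `--mode fwd` and `--mode bwd`.

```python
import shlex


def build_ncu_command(op, mode, output_path, python_bin="python",
                      nvtx_range="tritonbench_range/", extra_args=()):
    """Assemble an ncu command line mirroring the reproduction above.

    Hypothetical helper for illustration only; tritonbench may build
    its ncu command differently.
    """
    # ncu options, copied from the reproduction command in this thread.
    ncu_args = [
        "ncu",
        "--nvtx",
        "--nvtx-include", nvtx_range,
        "--target-processes", "all",
        "--import-source", "yes",
        "--set", "full",
        "-f",
        "-o", output_path,
    ]
    # Benchmark arguments; the mode is forwarded explicitly so the
    # profiled subprocess runs the requested pass.
    bench_args = [
        python_bin, "run.py",
        "--op", op,
        "--mode", mode,
        "--metrics", "_ncu_trace_in_task",
        *extra_args,
    ]
    return ncu_args + bench_args


# Placeholder output path; substitute a real location before running.
cmd = build_ncu_command("rms_norm", "bwd", "/tmp/ncu_output.ncu-rep",
                        extra_args=["--precision", "fp32"])
print(shlex.join(cmd))
```

Calling the helper once with `"fwd"` and once with `"bwd"` and passing each result to `subprocess.run` would give two reports to compare, assuming ncu is installed and on the PATH.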
