NCU trace doesn't take mode arguments #87

Open
FindHao opened this issue Dec 2, 2024 · 1 comment

@FindHao
Member

FindHao commented Dec 2, 2024

The ncu_rep metric profiles only the forward pass in fwd_bwd mode, and it fails for bwd tests: no kernels are profiled.

To reproduce:

python run.py --op rms_norm --mode bwd --precision fp32 --metrics ncu_rep,kineto_trace --cudagraph


  0%|          | 0/6 [00:00<?, ?it/s]
I1202 15:21:27.471187 1905825 DynoCmdLine.cpp:1393] Target Host: localhost (port 1777)
Failed to configure DCGM profiling, it may be DCGM is not supported or failed
==PROF== Connected to process 1906325 (/home/yhao/.conda/envs/ptd/bin/python3.11)
100%|██████████| 1/1 [00:00<00:00,  1.40it/s]
      (M, H)    llama_rms-_ncu_trace_in_task
------------  ------------------------------
(2048, 1024)                         success
==PROF== Disconnected from process 1906325
==WARNING== No kernels were profiled.
==WARNING== Note that specified NVTX include expressions match only push/pop ranges.
==WARNING== Refer https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvtx-filtering for NVTX Filtering usage.
I1202 15:21:34.731528 1906690 DynoCmdLine.cpp:1393] Target Host: localhost (port 1777)
Failed to configure DCGM profiling, it may be DCGM is not supported or failed
==PROF== Connected to process 1906796 (/home/yhao/.conda/envs/ptd/bin/python3.11)
100%|██████████| 1/1 [00:01<00:00,  1.08s/it]
      (M, H)    liger_rms-_ncu_trace_in_task
------------  ------------------------------
(2048, 1024)                         success
==PROF== Disconnected from process 1906796
==WARNING== No kernels were profiled.
==WARNING== Note that specified NVTX include expressions match only push/pop ranges.
==WARNING== Refer https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvtx-filtering for NVTX Filtering usage.
I1202 15:21:42.512831 1907289 DynoCmdLine.cpp:1393] Target Host: localhost (port 1777)
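
Since the benchmark table still reports "success" even when ncu profiles nothing, it may help to check the captured output for the warning directly. The following is a minimal sketch, not part of the project: the function name `pids_without_kernels` and the idea of scanning a saved log are assumptions for illustration, and the sample text is shortened from the output quoted above.

```python
import re
from typing import List

# Marker strings copied from the ncu output quoted in this issue.
CONNECTED = re.compile(r"==PROF== Connected to process (\d+)")
NO_KERNELS = "==WARNING== No kernels were profiled."


def pids_without_kernels(log_text: str) -> List[int]:
    """Return the PIDs for which ncu reported that no kernels were profiled.

    Hypothetical helper: it attributes each warning to the most recently
    connected process, which matches the ordering seen in the log above.
    """
    empty = []
    current_pid = None
    for line in log_text.splitlines():
        # search() rather than match(): the marker can appear mid-line when
        # stderr output from another tool is fused onto the same line.
        found = CONNECTED.search(line)
        if found:
            current_pid = int(found.group(1))
        if NO_KERNELS in line and current_pid is not None:
            empty.append(current_pid)
            current_pid = None
    return empty


sample = """\
Failed to configure DCGM profiling, it may be DCGM is not supported or failed==PROF== Connected to process 1906325 (/home/yhao/.conda/envs/ptd/bin/python3.11)
==PROF== Disconnected from process 1906325
==WARNING== No kernels were profiled.
==PROF== Connected to process 1906796 (/home/yhao/.conda/envs/ptd/bin/python3.11)
==PROF== Disconnected from process 1906796
==WARNING== No kernels were profiled.
"""

print(pids_without_kernels(sample))
```

A check like this could run after each profiling task so that an empty report is flagged instead of being recorded as a success.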
@xuzhao9
Contributor

xuzhao9 commented Dec 9, 2024

I can reproduce:

ncu --nvtx --nvtx-include tritonbench_range/ --target-processes all --import-source yes --set full -f -o /tmp/tritonbench/rms_norm/ncu_traces/inductor_rms_0/ncu_output.ncu-rep /home/xzhao9/.conda/envs/py312/bin/python run.py --op rms_norm --mode bwd --precision fp32 --only inductor_rms --num-inputs 1 --input-id 0 --metrics _ncu_trace_in_task

No kernels are profiled when running the backward pass. The forward pass does not have this problem.
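
To compare the two modes side by side, the ncu invocation above can be generated for each mode with only the --mode value and the output path changing. This is a rough sketch under stated assumptions: `build_ncu_command` is a hypothetical helper, the `/tmp/ncu_<mode>.ncu-rep` paths and the bare `python` executable are placeholders, and `fwd` is assumed to be the mode name for the forward-only run. The ncu and run.py flags themselves are copied from the command above.

```python
import shlex
from typing import List


def build_ncu_command(python_exe: str, op: str, mode: str, output_path: str) -> List[str]:
    """Assemble the ncu command line from this issue for a given mode.

    Hypothetical helper; every flag below is taken from the command quoted
    in the thread, only the mode and output path are parameterized.
    """
    return [
        "ncu",
        "--nvtx", "--nvtx-include", "tritonbench_range/",
        "--target-processes", "all",
        "--import-source", "yes",
        "--set", "full",
        "-f", "-o", output_path,
        python_exe, "run.py",
        "--op", op,
        "--mode", mode,
        "--precision", "fp32",
        "--only", "inductor_rms",
        "--num-inputs", "1",
        "--input-id", "0",
        "--metrics", "_ncu_trace_in_task",
    ]


# "fwd" is an assumed mode name; "bwd" is the failing mode from the report.
for mode in ("fwd", "bwd"):
    # Placeholder output path, not the path used by the benchmark itself.
    cmd = build_ncu_command("python", "rms_norm", mode, f"/tmp/ncu_{mode}.ncu-rep")
    print(shlex.join(cmd))
```

Running the two printed commands and comparing their output should show whether the "No kernels were profiled" warning appears only for the backward run.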
