Resurrect Graph & Op Profiler #9647
Conversation
I cannot review this right at the moment, but part of the reason it was removed is that we already have a way to measure the performance of individual operators.

Additionally, we should avoid adding data specific to the CPU backend to the common ggml structures. In the long term, the goal should be to remove all the coupling between the core ggml code and the CPU backend.
Thanks for the feedback. I'm familiar with it.

Re: CPU specific. I was thinking we can update the API and backends to use this infra.
I agree that it would be useful to have tools to obtain a full view of where the time is being spent. I am inclined to think that this data could be obtained in a less intrusive way and without requiring changes to the backends, simply by running the graph node by node and measuring the time it takes to run each one. There would be some disadvantages to this approach since it would include the graph launch overhead, but this should be a small difference that should not prevent obtaining the overall picture.

Measuring the barrier/sync time could also be useful in the context of the CPU backend, but it would not be relevant to other backends. I think there is also a place for more detailed profiling specific to each backend, but in that case the implementation would need to be strictly confined to the backend itself.

Alternatively, it could possibly be done in a generic way by creating an interface to generate "performance counters" that would be associated with individual ops of a graph, and would be defined by the backends. So "barrier time" could be one counter for the CPU backend, but it could also be used to measure the different phases of each op: for example, in matrix multiplication the CPU backend could also have a counter for the "time to quantize src1", and the BLAS or CUDA (with cuBLAS) backends could have a counter for "time to dequantize src0".

Note that to implement this effectively, the timing would likely need to be done by the backend itself; otherwise, this would not be useful for GPU backends, which typically work with large command buffers that are submitted at once, and synchronizing with the CPU has a significant overhead. However, they may still provide other means to measure individual commands (e.g. CUDA has events that can measure the time between two operations), so the interface should be able to work with that.
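For reference, one way to get that node-by-node measurement today, without touching the backends, is the scheduler's eval callback: when the callback asks to observe every node, the scheduler computes and synchronizes node by node. This is only a rough sketch under that assumption (graph computed through `ggml_backend_sched`); the numbers include the launch/sync overhead mentioned above.

```c
#include "ggml.h"
#include "ggml-backend.h"
#include <stdio.h>
#include <stdint.h>

// Called by the scheduler for every node. With ask == true we request to
// observe each node, which forces node-by-node execution + synchronization;
// with ask == false the node has just been computed, so the time since the
// previous callback approximates the per-node time.
static bool profile_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    static int64_t t_prev = 0; // non-reentrant, fine for a sketch

    if (ask) {
        return true; // observe every node
    }

    const int64_t t_now = ggml_time_us();
    if (t_prev > 0) {
        fprintf(stderr, "%-16s %-40s %10.3f ms\n",
                ggml_op_desc(t), t->name, (t_now - t_prev) / 1000.0);
    }
    t_prev = t_now;

    (void) user_data;
    return true; // returning false would cancel the graph compute
}

// usage, before computing the graph through the scheduler:
//   ggml_time_init();
//   ggml_backend_sched_set_eval_callback(sched, profile_cb, NULL);
```

A more accurate per-backend breakdown (e.g. CUDA events around individual ops) would still have to live inside the backend itself, as noted above.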
Sounds good.
Moved into #9659
I needed to generate some per-Op profiling data (same thing that `LLAMA_PERF` used to generate) and realized that the feature is gone. My guess is it was removed due to the completely different parallel graph processing (and probably the backend interface introduction).
Here is a first attempt at reintroducing that feature (I'm calling it `GGML_GRAPH_PROFILER`) with some additional features. Definitely not ready for the merge, but I wanted to get some input before I invest more time :-)
Features:
- … This might be handy in other places.
- When `GGML_GRAPH_PROFILER` is not defined there is zero overhead (well, besides the extra pointer in the graph); see the sketch right after this list.
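This is not the actual code from the PR, just an illustration of how the "zero overhead when disabled" property is typically achieved: the probes compile down to nothing unless `GGML_GRAPH_PROFILER` is defined. The `ggml_profile_op_begin/end` hook names are made up for the example.

```c
// Illustrative pattern only; the hook names below are hypothetical.
#ifdef GGML_GRAPH_PROFILER
    #define GGML_PROFILE_OP_BEGIN(graph, node, ith) ggml_profile_op_begin((graph), (node), (ith))
    #define GGML_PROFILE_OP_END(graph, node, ith)   ggml_profile_op_end  ((graph), (node), (ith))
#else
    // compiles to nothing -> zero runtime cost when the profiler is disabled
    #define GGML_PROFILE_OP_BEGIN(graph, node, ith) ((void) 0)
    #define GGML_PROFILE_OP_END(graph, node, ith)   ((void) 0)
#endif
```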
Known issues:
- The profile output destination should probably be controlled via `ggml_init_param.graph_profile` (`filename:format` or something like that)
- The profile data should probably be allocated via `ggml_new_graph_custom`, i.e. provide the size of the data we need for profiling stuff and use the ctx buffer
- Other backends would need to fill in the `ggml_profile_timing` data (they'd have to collect it on the accelerator and then export into this common format). A rough sketch of what that could look like is shown below.

Let me know how this sounds.
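To make that last point concrete, here is a hypothetical sketch of a common per-node timing record; only the `ggml_profile_timing` name comes from this PR, the fields are made up for illustration. A GPU backend would measure these on the device (e.g. with events) and copy the results into this format once the graph has finished.

```c
#include <stdint.h>

// Hypothetical layout -- not the actual struct from this PR.
struct ggml_profile_timing {
    int64_t t_start_us; // when processing of the op started
    int64_t t_end_us;   // when processing of the op finished
    int64_t t_sync_us;  // time spent in barriers/sync (CPU backend); 0 for others
};

// e.g. one record per node per thread, stored alongside the graph:
//   struct ggml_profile_timing * t = &profile_data[node_idx * n_threads + thread_idx];
```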
Example of the terminal output
Same example in rendered Markdown