Resurrect Graph & Op Profiler #9647

Closed
wants to merge 3 commits

Conversation

max-krasnyansky
Collaborator

I needed to generate some per-op profiling data (the same thing that LLAMA_PERF used to generate) and realized that the feature is gone. My guess is that it was dropped due to the completely different parallel graph processing (and probably the backend interface introduction).

Here is a first attempt at reintroducing that feature (I'm calling it GGML_GRAPH_PROFILER) with some additional features.
Definitely not ready to merge, but I wanted to get some input before I invest more time :-)

Features:

  • Profiling data is collected per op / per thread and aggregated at report time
  • We collect three per-op events: START, SYNC, END (see the sketch right after this list)
    • This allows measuring the sync overhead per op
    • All timing data is in nanoseconds
  • The output is a Markdown table
    • See the example below (nice and readable output both in the terminal and rendered)
  • The output includes a compact dump of all tensor dims and data types of the op (see Flash Attention in the sample below).
    This might be handy in other places.
  • Changes to graph_compute are minimal / clean
  • If GGML_GRAPH_PROFILER is not defined there is zero overhead (well, besides the extra pointer in the graph)
    • If the profiler is compiled in but not enabled, the overhead is relatively small (3 extra if-checks per graph node)
    • We can probably optimize that out further (e.g. with a separate compute loop without profiling calls)
  • Tested on a couple of platforms (M2 Max and Snapdragon Gen 3) with and without TSAN
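
Roughly, the collected data is one timing record per graph node per thread; a simplified sketch is below (names other than ggml_profile_timing are illustrative and may not match the patch exactly):

```c
// Simplified sketch -- enum/field names are illustrative; only ggml_profile_timing
// is the actual name used in the patch.
enum ggml_profile_event {
    GGML_PROF_OP_START, // worker thread starts processing the op
    GGML_PROF_OP_SYNC,  // worker thread reaches the barrier for the op
    GGML_PROF_OP_END,   // barrier released, op fully complete
};

// all timestamps are in nanoseconds; one record per node per thread,
// aggregated across threads when the report is generated
struct ggml_profile_timing {
    uint64_t event_ns[3]; // indexed by ggml_profile_event
};
// the proc / sync / total columns in the report are derived from these timestamps
```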

Known issues:

  • Using an env var is just a hack for now (see the sketch after this list)
    • I'm thinking of adding ggml_init_param.graph_profile
    • If set, it'll enable profiling for all the graphs created from that context
    • The param will probably be a string (filename:format or something like that)
  • The profile_data allocation is leaky
    • If the overall approach looks / sounds good, I'll change it to do the same thing as ggml_new_graph_custom,
      i.e. provide the size of the data we need for profiling and use the ctx buffer
  • Currently it only works with the CPU backend
    • If there is interest, it should be easy to extend to other backends, which could fill in the same per-node / per-thread ggml_profile_timing data (they'd have to collect it on the accelerator and then export it into this common format)
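
To make the init-param idea above concrete, here is roughly what I have in mind (hypothetical sketch -- the .graph_profile field does not exist in the patch yet; today profiling is enabled through an env var):

```c
// Hypothetical sketch of the proposed ggml_init_params.graph_profile field.
// Not part of the patch -- shown only to illustrate the "filename:format" idea.
struct ggml_init_params params = {
    .mem_size      = 16*1024*1024,
    .mem_buffer    = NULL,
    .no_alloc      = false,
    .graph_profile = "profile.md:md", // e.g. "filename:format"; NULL disables profiling
};

struct ggml_context * ctx = ggml_init(params);
// every graph created from this context (ggml_new_graph / ggml_new_graph_custom)
// would then collect profiling data and write the report to profile.md
```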

Let me know how this sounds.

Example of the terminal output

| node idx | op name | proc (nsec) | sync (nsec) | total (nsec) | op dims | op types | tensor name |
| -------: | :------ | ----------: | ----------: | -----------: | ------: | -------: | ----------: |
| 0000 |   GET_ROWS |      22863 |   47566042 |   47588905 |      4096:128256:1:1 x 2:1:1:1 ->   4096:2:1:1 |      q4_0 x i32 -> f32 |             inp_embd |
| 0001 |   RMS_NORM |      96926 |   47859323 |   47956249 |                     4096:2:1:1 ->   4096:2:1:1 |             f32 -> f32 |               norm-0 |
| 0002 |        MUL |      11875 |   36215730 |   36227605 |        4096:2:1:1 x 4096:1:1:1 ->   4096:2:1:1 |       f32 x f32 -> f32 |          attn_norm-0 |
| 0003 |    MUL_MAT |   27633698 |   21271875 |   48905573 |     4096:4096:1:1 x 4096:2:1:1 ->   4096:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Qcur-0 |
| 0004 |    RESHAPE |       9167 |   22963435 |   22972602 |                     4096:2:1:1 ->   128:32:1:1 |             f32 -> f32 |    Qcur-0 (reshaped) |
| 0005 |       ROPE |    1787239 |   22215678 |   24002917 |           128:32:1:1 x 2:1:1:1 ->   128:32:1:1 |       f32 x i32 -> f32 |               Qcur-0 |
| 0006 |    MUL_MAT |   20612237 |     698388 |   21310625 |     4096:1024:1:1 x 4096:2:1:1 ->   1024:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Kcur-0 |
| 0007 |    RESHAPE |      10989 |       7033 |      18022 |                     1024:2:1:1 ->    128:8:1:1 |             f32 -> f32 |    Kcur-0 (reshaped) |
| 0008 |       ROPE |    2566925 |     332969 |    2899894 |            128:8:1:1 x 2:1:1:1 ->    128:8:1:1 |       f32 x i32 -> f32 |               Kcur-0 |
| 0009 |    MUL_MAT |     489845 |      98229 |     588074 |     4096:1024:1:1 x 4096:2:1:1 ->   1024:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Vcur-0 |
| 0010 |       VIEW |       7396 |       3074 |      10470 |                  4194304:1:1:1 ->   2048:1:1:1 |           q4_0 -> q4_0 |       k_cache_view-0 |
| 0011 |        CPY |      15782 |       7186 |      22968 |         128:8:1:1 x 2048:1:1:1 ->   2048:1:1:1 |     f32 x q4_0 -> q4_0 | k_cache_view-0 (copy of Kcur-0) |
| 0012 |       VIEW |       5836 |       2444 |       8280 |                  4194304:1:1:1 ->   2048:1:1:1 |           q4_0 -> q4_0 |       v_cache_view-0 |
| 0013 |        CPY |      14324 |      12865 |      27189 |        1024:2:1:1 x 2048:1:1:1 ->   2048:1:1:1 |     f32 x q4_0 -> q4_0 | v_cache_view-0 (copy of Vcur-0) |
| 0014 |    PERMUTE |       5572 |       2240 |       7812 |                     128:32:1:1 ->    128:2:1:1 |             f32 -> f32 |                  q-0 |
| 0015 |       VIEW |      10469 |       3543 |      14012 |                  4194304:1:1:1 ->  128:256:1:1 |           q4_0 -> q4_0 |                  k-0 |
| 0016 |       VIEW |       9375 |       3071 |      12446 |                  4194304:1:1:1 ->  128:256:1:1 |           q4_0 -> q4_0 |                  v-0 |
| 0017 |        CPY |      10053 |       4635 |      14688 |        256:32:1:1 x 256:32:1:1 ->   256:32:1:1 |       f32 x f16 -> f16 |       KQ_mask (copy) |
| 0018 | FLASH_ATTN_EXT |     113753 |      17809 |     131562 | 128:2:1:1 x 128:256:1:1 x 128:256:1:1 x 256:32:1:1 ->   128:32:1:1 | f32 x q4_0 x q4_0 x f16 -> f32 |              node_18 |


@slaren
Collaborator

slaren commented Sep 26, 2024

I cannot review this right at the moment, but part of the reason it was removed is because we already have a way to measure the performance of individual operators with test-backend-ops, in a way that works with every backend. I suggest taking a look at that and see if you could obtain the data that you need using test-backend-ops instead.

Additionally, we should avoid adding data specific to the CPU backend to the common ggml structures. In the long term, the goal should be to remove all the coupling between the core ggml code and the CPU backend.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 26, 2024
@max-krasnyansky
Collaborator Author

> I cannot review this right at the moment, but part of the reason it was removed is because we already have a way to measure the performance of individual operators with test-backend-ops, in a way that works with every backend. I suggest taking a look at that and see if you could obtain the data that you need using test-backend-ops instead.
>
> Additionally, we should avoid adding data specific to the CPU backend to the common ggml structures. In the long term, the goal should be to remove all the coupling between the core ggml code and the CPU backend.

Thanks for the feedback. I'm familiar with test-backend-ops (I've been using it to test an experimental backend I've been working on). It's a different use case, though. The graph profiler I'm proposing covers the whole model/graph, so we can see the full picture for a specific model / context / set of features (kv-quantization, flash attention, etc.), how ops slow down as the context grows, and so on. Also, as I mentioned, I've added profiling of the sync (i.e. barrier) overhead. We can add more events later (maybe quantization and other prep).
It's a good point though that we don't want multiple op profilers. Perhaps we could update test-backend-ops to use the generic graph profiler. I'll think some more about it.

Re: CPU specific. I was thinking we could update the API and backends to use this infra.
For example, we have GGML_VULKAN_PERF, which is basically doing the same thing (with much simpler output) and is Vulkan-backend specific. I bet other backends would benefit from exposing profile data like that.
If not, I can just move everything into the CPU backend I guess, but the current version seems like it can be fairly generic.

@slaren
Collaborator

slaren commented Sep 26, 2024

I agree that it would be useful to have tools to obtain a full view of where the time is being spent. test-backend-ops is useful for optimizing individual ops, but it does not tell you what ops need to be optimized in the first place.

I am inclined to think that this data could be obtained in a less intrusive way and without requiring changes to the backends, simply by running the graph node by node and measuring the time it takes to run each one. There would be some disadvantages to this approach, since it would include the graph launch overhead, but this should be a small difference that should not prevent obtaining the overall picture. Once the hot spots are found, test-backend-ops or similar tools can be used for more accurate measurements of individual ops.
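
For illustration, a bare-bones sketch of that node-by-node approach (assuming the existing ggml_graph_view / ggml_graph_node helpers; error handling omitted, and the per-node launch overhead mentioned above is included in the measurement):

```c
#include <stdio.h>

#include "ggml.h"
#include "ggml-backend.h"

// Rough sketch: time each node by computing single-node views of the graph.
static void profile_graph_nodes(ggml_backend_t backend, struct ggml_cgraph * graph) {
    for (int i = 0; i < ggml_graph_n_nodes(graph); i++) {
        struct ggml_cgraph view = ggml_graph_view(graph, i, i + 1);

        const int64_t t0 = ggml_time_us();
        ggml_backend_graph_compute(backend, &view);
        const int64_t t1 = ggml_time_us();

        printf("%4d %-16s %8lld us\n", i,
               ggml_op_name(ggml_graph_node(graph, i)->op), (long long) (t1 - t0));
    }
}
```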

Measuring the barrier/sync time could also be useful in the context of the CPU backend, but it would not be relevant to other backends. I think there is also a place for more detailed profiling specific to each backend, but in that case the implementation would need to be strictly confined to the backend itself. Alternatively, it could possibly be done in a generic way by creating an interface to generate "performance counters" that would be associated with individual ops of a graph and would be defined by the backends. So "barrier time" could be one counter for the CPU backend, but the same mechanism could be used to measure the different phases of each op; for example, in matrix multiplication the CPU backend could also have a counter for the "time to quantize src1", and the BLAS or CUDA (with cuBLAS) backends could have a counter for the "time to dequantize src0".

Note that to implement this effectively, the timing would likely need to be done by the backend itself; otherwise, this would not be useful for GPU backends that typically work with large command buffers that are submitted at once, where synchronizing with the CPU has a significant overhead. However, they may still provide other means to measure individual commands (e.g. CUDA has events that can measure the time between two operations), so the interface should be able to work with that.
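
For illustration, such a per-backend counter interface could be shaped roughly like this (purely hypothetical -- none of these types or functions exist in ggml today):

```c
// Hypothetical sketch of a per-backend performance-counter interface.
struct ggml_perf_counter_info {
    const char * name; // e.g. "barrier", "quantize src1", "dequantize src0"
    const char * unit; // e.g. "ns"
};

struct ggml_backend_perf_i {
    // counters exposed by this backend
    int  (*get_n_counters)   (ggml_backend_t backend);
    void (*get_counter_info) (ggml_backend_t backend, int counter,
                              struct ggml_perf_counter_info * info);

    // value of a counter for a given node of the last computed graph;
    // GPU backends could back this with device-side events (e.g. CUDA events)
    // instead of synchronizing with the CPU for every node
    int64_t (*get_counter_value)(ggml_backend_t backend, int node, int counter);
};
```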

@max-krasnyansky
Collaborator Author

Sounds good.
I pushed what I have so far into the graph-profiler branch and will start a new PR from there so that you guys can play with / push into it as well (if there is time/interest).
At the very least it could just evolve into a CPU-backend-specific profiler for now, or maybe we can figure out the perf-counter approach you suggested.

@max-krasnyansky
Collaborator Author

Moved into #9659
