Resurrect Graph & Op Profiler #9647

Closed
wants to merge 3 commits

Conversation

max-krasnyansky
Collaborator

I needed to generate some per-op profiling data (the same thing that LLAMA_PERF used to generate) and realized that the feature is gone. My guess is that it was dropped due to the completely different parallel graph processing (and probably the backend interface introduction).

Here is a first attempt at reintroducing that feature (I'm calling it GGML_GRAPH_PROFILER) with some additional features.
Definitely not ready to merge, but I wanted to get some input before I invest more time :-)

Features:

  • Profiling data is collected per op / per thread and aggregated at report time
  • We collect three per-op events: START, SYNC, END (see the sketch right after this list)
    • This allows measuring the sync overhead per op
    • All timing data is in nanoseconds
  • The output is a Markdown table
    • See the example below (nice and readable output both in the terminal and rendered)
  • The output includes a compact dump of all tensor dims and data types of the op (see Flash Attention in the sample below).
    This might be handy in other places.
  • Changes to graph_compute are minimal / clean
  • If GGML_GRAPH_PROFILER is not defined there is zero overhead (well, besides the extra pointer in the graph)
    • If the profiler is compiled in but not enabled, the overhead is relatively small (3 extra if-checks per graph node)
    • We can probably optimize that out further (e.g. with a separate compute loop without profiling calls)
  • Tested on a couple of platforms (M2 Max and Snapdragon Gen 3) with and without TSAN
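
Roughly, the collected data is one timing record per graph node per thread; a simplified sketch is below (names other than ggml_profile_timing are illustrative and may not match the patch exactly):

```c
// Simplified sketch -- enum/field names are illustrative; only ggml_profile_timing
// is the actual name used in the patch.
enum ggml_profile_event {
    GGML_PROF_OP_START, // worker thread starts processing the op
    GGML_PROF_OP_SYNC,  // worker thread reaches the barrier for the op
    GGML_PROF_OP_END,   // barrier released, op fully complete
};

// all timestamps are in nanoseconds; one record per node per thread,
// aggregated across threads when the report is generated
struct ggml_profile_timing {
    uint64_t event_ns[3]; // indexed by ggml_profile_event
};
// the proc / sync / total columns in the report are derived from these timestamps
```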

Known issues:

  • Using an env var is just a hack for now (see the sketch after this list)
    • I'm thinking of adding ggml_init_param.graph_profile
    • If set, it'll enable profiling for all the graphs created from that context
    • The param will probably be a string (filename:format or something like that)
  • The profile_data allocation is leaky
    • If the overall approach looks / sounds good, I'll change it to do the same thing as ggml_new_graph_custom,
      i.e. provide the size of the data we need for profiling and use the ctx buffer
  • Currently it only works with the CPU backend
    • If there is interest, it should be easy to extend to other backends, which could fill in the same per-node / per-thread ggml_profile_timing data (they'd have to collect it on the accelerator and then export it into this common format)
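
To make the init-param idea above concrete, here is roughly what I have in mind (hypothetical sketch -- the .graph_profile field does not exist in the patch yet; today profiling is enabled through an env var):

```c
// Hypothetical sketch of the proposed ggml_init_params.graph_profile field.
// Not part of the patch -- shown only to illustrate the "filename:format" idea.
struct ggml_init_params params = {
    .mem_size      = 16*1024*1024,
    .mem_buffer    = NULL,
    .no_alloc      = false,
    .graph_profile = "profile.md:md", // e.g. "filename:format"; NULL disables profiling
};

struct ggml_context * ctx = ggml_init(params);
// every graph created from this context (ggml_new_graph / ggml_new_graph_custom)
// would then collect profiling data and write the report to profile.md
```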

Let me know how this sounds.

Example of the terminal output

| node idx | op name | proc (nsec) | sync (nsec) | total (nsec) | op dims | op types | tensor name |
| -------: | :------ | ----------: | ----------: | -----------: | ------: | -------: | ----------: |
| 0000 |   GET_ROWS |      22863 |   47566042 |   47588905 |      4096:128256:1:1 x 2:1:1:1 ->   4096:2:1:1 |      q4_0 x i32 -> f32 |             inp_embd |
| 0001 |   RMS_NORM |      96926 |   47859323 |   47956249 |                     4096:2:1:1 ->   4096:2:1:1 |             f32 -> f32 |               norm-0 |
| 0002 |        MUL |      11875 |   36215730 |   36227605 |        4096:2:1:1 x 4096:1:1:1 ->   4096:2:1:1 |       f32 x f32 -> f32 |          attn_norm-0 |
| 0003 |    MUL_MAT |   27633698 |   21271875 |   48905573 |     4096:4096:1:1 x 4096:2:1:1 ->   4096:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Qcur-0 |
| 0004 |    RESHAPE |       9167 |   22963435 |   22972602 |                     4096:2:1:1 ->   128:32:1:1 |             f32 -> f32 |    Qcur-0 (reshaped) |
| 0005 |       ROPE |    1787239 |   22215678 |   24002917 |           128:32:1:1 x 2:1:1:1 ->   128:32:1:1 |       f32 x i32 -> f32 |               Qcur-0 |
| 0006 |    MUL_MAT |   20612237 |     698388 |   21310625 |     4096:1024:1:1 x 4096:2:1:1 ->   1024:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Kcur-0 |
| 0007 |    RESHAPE |      10989 |       7033 |      18022 |                     1024:2:1:1 ->    128:8:1:1 |             f32 -> f32 |    Kcur-0 (reshaped) |
| 0008 |       ROPE |    2566925 |     332969 |    2899894 |            128:8:1:1 x 2:1:1:1 ->    128:8:1:1 |       f32 x i32 -> f32 |               Kcur-0 |
| 0009 |    MUL_MAT |     489845 |      98229 |     588074 |     4096:1024:1:1 x 4096:2:1:1 ->   1024:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Vcur-0 |
| 0010 |       VIEW |       7396 |       3074 |      10470 |                  4194304:1:1:1 ->   2048:1:1:1 |           q4_0 -> q4_0 |       k_cache_view-0 |
| 0011 |        CPY |      15782 |       7186 |      22968 |         128:8:1:1 x 2048:1:1:1 ->   2048:1:1:1 |     f32 x q4_0 -> q4_0 | k_cache_view-0 (copy of Kcur-0) |
| 0012 |       VIEW |       5836 |       2444 |       8280 |                  4194304:1:1:1 ->   2048:1:1:1 |           q4_0 -> q4_0 |       v_cache_view-0 |
| 0013 |        CPY |      14324 |      12865 |      27189 |        1024:2:1:1 x 2048:1:1:1 ->   2048:1:1:1 |     f32 x q4_0 -> q4_0 | v_cache_view-0 (copy of Vcur-0) |
| 0014 |    PERMUTE |       5572 |       2240 |       7812 |                     128:32:1:1 ->    128:2:1:1 |             f32 -> f32 |                  q-0 |
| 0015 |       VIEW |      10469 |       3543 |      14012 |                  4194304:1:1:1 ->  128:256:1:1 |           q4_0 -> q4_0 |                  k-0 |
| 0016 |       VIEW |       9375 |       3071 |      12446 |                  4194304:1:1:1 ->  128:256:1:1 |           q4_0 -> q4_0 |                  v-0 |
| 0017 |        CPY |      10053 |       4635 |      14688 |        256:32:1:1 x 256:32:1:1 ->   256:32:1:1 |       f32 x f16 -> f16 |       KQ_mask (copy) |
| 0018 | FLASH_ATTN_EXT |     113753 |      17809 |     131562 | 128:2:1:1 x 128:256:1:1 x 128:256:1:1 x 256:32:1:1 ->   128:32:1:1 | f32 x q4_0 x q4_0 x f16 -> f32 |              node_18 |


@slaren
Collaborator

slaren commented Sep 26, 2024

I cannot review this right at the moment, but part of the reason it was removed is because we already have a way to measure the performance of individual operators with test-backend-ops, in a way that works with every backend. I suggest taking a look at that and see if you could obtain the data that you need using test-backend-ops instead.

Additionally, we should avoid adding data specific to the CPU backend to the common ggml structures. In the long term, the goal should be to remove all the coupling between the core ggml code and the CPU backend.

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 26, 2024
@max-krasnyansky
Collaborator Author

> I cannot review this right at the moment, but part of the reason it was removed is because we already have a way to measure the performance of individual operators with test-backend-ops, in a way that works with every backend. I suggest taking a look at that and see if you could obtain the data that you need using test-backend-ops instead.
>
> Additionally, we should avoid adding data specific to the CPU backend to the common ggml structures. In the long term, the goal should be to remove all the coupling between the core ggml code and the CPU backend.

Thanks for the feedback. I'm familiar with test-backend-ops (I've been using it to test an experimental backend I've been working on). It's a different use case, though. The graph profiler I'm proposing covers the whole model/graph, so we can see the full picture for a specific model / context / set of features (kv-quantization, flash attention, etc.), how ops slow down as the context grows, and so on. Also, as I mentioned, I've added profiling of the sync (i.e. barrier) overhead. We can add more events later (maybe quantization and other prep).
It's a good point though that we don't want multiple op profilers. Perhaps we could update test-backend-ops to use the generic graph profiler. I'll think some more about it.

Re: CPU specific. I was thinking we could update the API and backends to use this infra.
For example, we have GGML_VULKAN_PERF, which is basically doing the same thing (with much simpler output) and is Vulkan-backend specific. I bet other backends would benefit from exposing profile data like that.
If not, I can just move everything into the CPU backend I guess, but the current version seems like it can be fairly generic.

@slaren
Collaborator

slaren commented Sep 26, 2024

I agree that it would be useful to have tools to obtain a full view of where the time is being spent. test-backend-ops is useful for optimizing individual ops, but it does not tell you what ops need to be optimized in the first place.

I am inclined to think that this data could be obtained in a less intrusive way and without requiring changes to the backends, simply by running the graph node by node and measuring the time it takes to run each one. There would be some disadvantages to this approach, since it would include the graph launch overhead, but this should be a small difference that should not prevent obtaining the overall picture. Once the hot spots are found, test-backend-ops or similar tools can be used for more accurate measurements of individual ops.
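
For illustration, a bare-bones sketch of that node-by-node approach (assuming the existing ggml_graph_view / ggml_graph_node helpers; error handling omitted, and the per-node launch overhead mentioned above is included in the measurement):

```c
#include <stdio.h>

#include "ggml.h"
#include "ggml-backend.h"

// Rough sketch: time each node by computing single-node views of the graph.
static void profile_graph_nodes(ggml_backend_t backend, struct ggml_cgraph * graph) {
    for (int i = 0; i < ggml_graph_n_nodes(graph); i++) {
        struct ggml_cgraph view = ggml_graph_view(graph, i, i + 1);

        const int64_t t0 = ggml_time_us();
        ggml_backend_graph_compute(backend, &view);
        const int64_t t1 = ggml_time_us();

        printf("%4d %-16s %8lld us\n", i,
               ggml_op_name(ggml_graph_node(graph, i)->op), (long long) (t1 - t0));
    }
}
```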

Measuring the barrier/sync time could also be useful in the context of the CPU backend, but it would not be relevant to other backends. I think there is also a place for more detailed profiling specific to each backend, but in that case the implementation would need to be strictly confined to the backend itself. Alternatively, it could possibly be done in a generic way by creating an interface to generate "performance counters" that would be associated with individual ops of a graph and would be defined by the backends. So "barrier time" could be one counter for the CPU backend, but the same mechanism could be used to measure the different phases of each op; for example, in matrix multiplication the CPU backend could also have a counter for the "time to quantize src1", and the BLAS or CUDA (with cuBLAS) backends could have a counter for the "time to dequantize src0".

Note that to implement this effectively, the timing would likely need to be done by the backend itself; otherwise, this would not be useful for GPU backends that typically work with large command buffers that are submitted at once, where synchronizing with the CPU has a significant overhead. However, they may still provide other means to measure individual commands (e.g. CUDA has events that can measure the time between two operations), so the interface should be able to work with that.
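
For illustration, such a per-backend counter interface could be shaped roughly like this (purely hypothetical -- none of these types or functions exist in ggml today):

```c
// Hypothetical sketch of a per-backend performance-counter interface.
struct ggml_perf_counter_info {
    const char * name; // e.g. "barrier", "quantize src1", "dequantize src0"
    const char * unit; // e.g. "ns"
};

struct ggml_backend_perf_i {
    // counters exposed by this backend
    int  (*get_n_counters)   (ggml_backend_t backend);
    void (*get_counter_info) (ggml_backend_t backend, int counter,
                              struct ggml_perf_counter_info * info);

    // value of a counter for a given node of the last computed graph;
    // GPU backends could back this with device-side events (e.g. CUDA events)
    // instead of synchronizing with the CPU for every node
    int64_t (*get_counter_value)(ggml_backend_t backend, int node, int counter);
};
```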

@max-krasnyansky
Collaborator Author

Sounds good.
I pushed what I have so far into the graph-profiler branch and will start a new PR from there so that you guys can play with / push into it as well (if there is time/interest).
At the very least it could just evolve into a CPU-backend-specific profiler for now, or maybe we can figure out the perf-counter approach you suggested.

@max-krasnyansky
Collaborator Author

Moved into #9659
