
[Proposal] Add cpu alloc/free callbacks to support custom memory allocator APIs #1898

Open
xuhancn opened this issue May 7, 2024 · 4 comments


xuhancn commented May 7, 2024

Summary

During our PyTorch development, we found that the Windows system memory allocator performs poorly and slows down overall PyTorch performance. After adding a third-party memory allocator, PyTorch improved its tensor allocation performance. For details, please refer to pytorch/pytorch#102534.

As a PyTorch submodule, oneDNN still uses the system memory allocator to allocate buffers for reorder/reshape operations.
The related code is here:

oneDNN/src/common/utils.cpp

Lines 146 to 170 in 11f5558

```cpp
void *malloc(size_t size, int alignment) {
    void *ptr;
    if (memory_debug::is_mem_debug())
        return memory_debug::malloc(size, alignment);

#ifdef _WIN32
    ptr = _aligned_malloc(size, alignment);
    int rc = ptr ? 0 : -1;
#else
    int rc = ::posix_memalign(&ptr, alignment, size);
#endif

    return (rc == 0) ? ptr : nullptr;
}

void free(void *p) {
    if (memory_debug::is_mem_debug()) return memory_debug::free(p);

#ifdef _WIN32
    _aligned_free(p);
#else
    ::free(p);
#endif
}
```

I added some debug logging to confirm this as well.

(build_pytorch) D:\xuhan\build_pytorch\pytorch\third_party\ideep\mkl-dnn>git diff
diff --git a/src/common/utils.cpp b/src/common/utils.cpp
index 37659a5d3e..1d1db40337 100644
--- a/src/common/utils.cpp
+++ b/src/common/utils.cpp
@@ -46,6 +46,38 @@
 #include "cpu/platform.hpp"
 #endif

+#ifdef _WIN32
+#include <debugapi.h>
+#define MAX_MESSAGE_SIZE 4096
+void D4D(LPCSTR szFormat, ...)
+{
+       const CHAR * p_ModuleName = "[pytorch] ";
+       char szMsg[MAX_MESSAGE_SIZE] = { 0 };
+       LPSTR lpsz = szMsg;
+       size_t nLen = 0;
+       int nReturnValue = 0;
+       va_list va;
+       va_start(va, szFormat);
+
+       lstrcatA(lpsz, p_ModuleName);
+
+       nLen = lstrlenA(szMsg);
+       lpsz = szMsg;
+       lpsz += nLen;
+
+       nReturnValue = _vsnprintf_s(lpsz, MAX_MESSAGE_SIZE - nLen, MAX_MESSAGE_SIZE, szFormat, va);
+
+       lstrcatA(szMsg, "\n");
+
+       OutputDebugStringA(szMsg);
+}
+#else
+void D4D(LPCSTR szFormat, ...)
+{
+
+}
+#endif
+
 namespace dnnl {
 namespace impl {

@@ -151,6 +183,7 @@ void *malloc(size_t size, int alignment) {
 #ifdef _WIN32
     ptr = _aligned_malloc(size, alignment);
     int rc = ptr ? 0 : -1;
+    D4D("dnnl malloc: %p - %x", ptr, size);
 #else
     int rc = ::posix_memalign(&ptr, alignment, size);
 #endif
@@ -164,6 +197,7 @@ void free(void *p) {

 #ifdef _WIN32
     _aligned_free(p);
+    D4D("dnnl free: %p", p);
 #else
     ::free(p);
 #endif

(build_pytorch) D:\xuhan\build_pytorch\pytorch\third_party\ideep\mkl-dnn>

On Windows, I tested resnet18 and it performed more than 360k malloc/free calls through the system malloc/free, as shown below:
(screenshot: malloc/free call counts)

Problem statement

Regarding the slow memory allocation on Windows, I also wrote a malloc benchmark: https://github.com/xuhancn/bench_malloc
Third-party memory allocation libraries can improve the performance.
They also work well in PyTorch: pytorch/pytorch#102534 (comment)

So, we need a way to let oneDNN use a third-party memory allocator for better performance.

Option 1: Add a memory allocation library as a submodule.

Actually, this is not a good option:

  1. An additional library would bring in more license and security concerns.
  2. It is hard to select one memory allocation library that fits all use cases.

Option 2: Add cpu alloc/free callbacks to support custom memory allocator APIs.

This is a lightweight way to change the memory allocation implementation:

  1. Add an optional cpu alloc/free callback registration API.
  2. If callback functions are registered, oneDNN will use the custom memory allocator.
  3. If no callback functions are registered, oneDNN will use the default system memory allocator.

Preferred solution

For option 2 above:
First, we can define the callback function types:

```cpp
// void* alloc_cpu(size_t size, int alignment);
typedef void* (*t_dnnl_cpu_aligned_malloc)(size_t, int);

// void free_cpu(void* data);
typedef void (*t_dnnl_cpu_free)(void*);
```

The registration API would look like this:

```cpp
static t_dnnl_cpu_aligned_malloc g_dnnl_cpu_malloc;
static t_dnnl_cpu_free g_dnnl_cpu_free;

bool register_dnnl_cpu_memory_alloction_apis(t_dnnl_cpu_aligned_malloc p_malloc, t_dnnl_cpu_free p_free)
{
    if (!p_malloc || !p_free)
        return false;

    g_dnnl_cpu_malloc = p_malloc;
    g_dnnl_cpu_free = p_free;

    return true;
}
```

Reference implementation:

```cpp
void *malloc(size_t size, int alignment) {
    void *ptr;
    if (memory_debug::is_mem_debug())
        return memory_debug::malloc(size, alignment);

    // malloc callback
    if (g_dnnl_cpu_malloc)
        return g_dnnl_cpu_malloc(size, alignment);

#ifdef _WIN32
    ptr = _aligned_malloc(size, alignment);
    int rc = ptr ? 0 : -1;
#else
    int rc = ::posix_memalign(&ptr, alignment, size);
#endif

    return (rc == 0) ? ptr : nullptr;
}

void free(void *p) {
    if (memory_debug::is_mem_debug()) return memory_debug::free(p);

    // free callback
    if (g_dnnl_cpu_free)
        return g_dnnl_cpu_free(p);

#ifdef _WIN32
    _aligned_free(p);
#else
    ::free(p);
#endif
}
```
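
For example, a caller such as PyTorch could register mimalloc-backed callbacks through the proposed API. This is only a hypothetical usage sketch: mi_malloc_aligned/mi_free are existing mimalloc functions, while the wrapper and install function names are made up for illustration.

```cpp
#include <mimalloc.h>

// Hypothetical wrappers matching the proposed callback signatures.
static void *torch_dnnl_malloc(size_t size, int alignment) {
    // mi_malloc_aligned returns memory that must be released with mi_free.
    return mi_malloc_aligned(size, static_cast<size_t>(alignment));
}

static void torch_dnnl_free(void *data) {
    mi_free(data);
}

// Call once at startup, before oneDNN performs any CPU allocations.
void install_dnnl_allocator_callbacks() {
    register_dnnl_cpu_memory_alloction_apis(torch_dnnl_malloc, torch_dnnl_free);
}
```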

Additional question:
oneDNN has two malloc/free implementations:

  1. Common:

    oneDNN/src/common/utils.cpp

    Lines 146 to 170 in 11f5558

    void *malloc(size_t size, int alignment) {
        void *ptr;
        if (memory_debug::is_mem_debug())
            return memory_debug::malloc(size, alignment);
    #ifdef _WIN32
        ptr = _aligned_malloc(size, alignment);
        int rc = ptr ? 0 : -1;
    #else
        int rc = ::posix_memalign(&ptr, alignment, size);
    #endif
        return (rc == 0) ? ptr : nullptr;
    }

    void free(void *p) {
        if (memory_debug::is_mem_debug()) return memory_debug::free(p);
    #ifdef _WIN32
        _aligned_free(p);
    #else
        ::free(p);
    #endif
    }
  2. Graph:
    void *cpu_allocator_t::malloc(size_t size, size_t alignment) {
        void *ptr = nullptr;
        const size_t align = alignment == 0 ? DEFAULT_ALIGNMENT : alignment;
    #ifdef _WIN32
        ptr = _aligned_malloc(size, align);
        int rc = ((ptr) ? 0 : errno);
    #else
        int rc = ::posix_memalign(&ptr, align, size);
    #endif /* _WIN32 */
        return (rc == 0) ? ptr : nullptr;
    }

    void cpu_allocator_t::free(void *p) {
    #ifdef _WIN32
        _aligned_free((void *)p);
    #else
        ::free((void *)p);
    #endif /* _WIN32 */
    }

    Do we need to add callbacks for both of them?

CC: @jgong5, @chunyuan-w, @Guobing-Chen

xuhancn added the enhancement label on May 7, 2024

jgong5 commented May 7, 2024

cc @mgouicem


mgouicem commented May 7, 2024

Hi @xuhancn, and thanks for the proposal. Some time ago, we decided to rely on pointers pre-allocated by users instead of malloc/free callbacks. There were two main reasons for this:

  • to avoid introducing global state into the library;
  • to simplify usage (the scratchpad is just another memory object, which saves users from writing wrapper functions around their allocators to pass to oneDNN).

In general, the memory allocation in oneDNN happens in four places:

  • for memory object allocation. Here users can typically provide their own handle (see this constructor).
  • for scratchpad memory allocation. oneDNN already provides a way for users to pass their own memory handle as well (see the scratchpad_mode primitive attribute in this documentation).
  • for small temporary buffers not covered by the scratchpad. This mostly affects the gemm functionality, as it is not a primitive. We encourage users to rely on the matmul primitive instead, as it has more features; in particular, it is compatible with a user-managed scratchpad.
  • for JIT-compiled code allocation. We typically don't expose this to the user, since it involves non-conventional allocations and setting page properties (it does not use the malloc function you highlighted).

Could you clarify if you are already using the mechanisms above and still see allocation overheads?
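
To illustrate the first two mechanisms (user-provided memory handles and a user-managed scratchpad), here is a minimal sketch assuming the standard dnnl.hpp C++ API; the tensor shape and names are made up for the example:

```cpp
#include <dnnl.hpp>
#include <vector>

void user_managed_memory_example(dnnl::engine &eng) {
    // 1) Memory object wrapping a user-allocated buffer: oneDNN does not
    //    allocate here, it simply uses the provided handle.
    dnnl::memory::desc md({8, 3, 224, 224}, dnnl::memory::data_type::f32,
            dnnl::memory::format_tag::nchw);
    std::vector<float> user_buffer(md.get_size() / sizeof(float));
    dnnl::memory mem(md, eng, user_buffer.data());

    // 2) User-managed scratchpad: a primitive descriptor created with this
    //    attribute reports its scratchpad size; the user allocates that
    //    buffer with any allocator and passes it as DNNL_ARG_SCRATCHPAD at
    //    execute() time.
    dnnl::primitive_attr attr;
    attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);
}
```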


xuhancn commented May 22, 2024

Hi @mgouicem
Thanks for your comment, and sorry for the delayed reply.
I took some time to write a POC and collect some performance data.

My proposal indeed targets the item you mentioned: "for small temporary buffers not covered by the scratchpad. This mostly affects the gemm functionality, as it is not a primitive. We encourage users to rely on the matmul primitive instead, as it has more features; in particular, it is compatible with a user-managed scratchpad."

The POC PR is here: pytorch/pytorch#126049, which contains:

  1. Registering malloc/free functions with dnnl: xuhancn/oneDNN@f5ff0a6...c4d40c6#diff-f41d3e0deddfc14df260aa568dda8ffe7a22b8f6d5db94711fa2b7a64cc0855b
  2. PyTorch registering its mimalloc with dnnl for temporary buffer allocation.

The performance comparison is as follows:
(screenshot: performance comparison)

After registering mimalloc, the mkldnn_convolution performance improved by about 0.3 s. Could you please help design a memory allocation callback mechanism? It would help PyTorch on Windows achieve better performance; much appreciated.
CC: @jgong5

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Oct 24, 2024
We did a lot of optimization for PyTorch on Windows and made good progress, but some models still show a performance gap between PyTorch on Windows and PyTorch on Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case.

Let's focus on `ResNet50` and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
We found that the major kernel consuming CPU resources is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`.
We had already optimized memory allocation by integrating mimalloc into the PyTorch C10 module. It boosts PyTorch on Windows a lot, but it does not cover `MKL`'s and `MKLDNN`'s intermediate temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing `MKL`'s and `MKLDNN`'s intermediate temporary memory.

So, I discussed this with the Intel MKL team and got a method to register high-performance memory allocation APIs with MKL, which helps MKL boost memory performance. Please check the online documentation: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html
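
A minimal sketch of that mechanism, based on the linked documentation (MKL exposes i_malloc/i_calloc/i_realloc/i_free function pointers declared in i_malloc.h) and assuming mimalloc as the replacement allocator:

```cpp
#include <i_malloc.h>
#include <mimalloc.h>

void redirect_mkl_allocations_to_mimalloc() {
    // Must run before the first MKL call that allocates memory.
    i_malloc  = mi_malloc;
    i_calloc  = mi_calloc;
    i_realloc = mi_realloc;
    i_free    = mi_free;
}
```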

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add a cmake option, `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.

For `oneDNN`, this is still being tracked in this proposal: oneapi-src/oneDNN#1898

Pull Request resolved: #138419
Approved by: https://github.com/jgong5, https://github.com/ezyang

xuhancn commented Oct 31, 2024

Hi @mgouicem and @vpirogov,
I have been working on optimizing PyTorch on Windows for the past two years, with good progress. Here is the blog post from the PyTorch team: https://pytorch.org/blog/performance-boost-windows/#conclusion
As the conclusion notes, some modules still have performance gaps. Based on my analysis, this is caused by MKL's and oneDNN's intermediate memory allocations going through the Windows system allocator.

Actually, I submitted the cpu alloc/free callback proposal to MKL and oneDNN at the same time. The MKL team was aware of this issue and provided the API, so I have already optimized the memory allocator for MKL; PR pytorch/pytorch#138419 is merged.

For oneDNN, I also provided POC code: #1898 (comment), but I haven't received an official update. Please provide a memory allocator registration API like MKL's. It would be the last piece of the puzzle for my optimization work.

Thanks.
