[CPU] Avoid storing extra copies of constant inputs #26009
Conversation
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.
@maxnick Thank you for the review, the comments have been applied.
namespace ov {
namespace intel_cpu {

struct jit_has_subnormals_base;
Do we need this forward declaration here?
Removed
src/plugins/intel_cpu/src/graph.cpp
Outdated
const auto edge = node->getParentEdgeAt(i);
const auto parent = node->getParentEdgeAt(0)->getParent();
// keep track of inplace up by inplace output ports
inPlaceOutPort = inPlaceOutPort == parent->inPlaceOutPort(i) ? edge->parent_port : -1;
In the general case, the number of the parent node's output ports is not equal to the number of input ports (i.e. input edges) of the current node. Please double-check the algorithm.
Right, this logic looks just wrong. It will be corrected, and tests will be added.
}

InputPrepType requiresPreProcessing(const IMemory& blob, GraphContext::CPtr context, const dnnl::engine& engine) {
    const auto shape = blob.getShape();
It seems that this variable isn't used.
Removed
    return InputPrepType::SimpleClone;
}

const bool mustFlushDenormalsToZero = needFlushDenormalsToZero && std::make_shared<HasSubnormals>()->execute(blob);
Since HasSubnormals has zero size (it doesn't store any values) and has only a default constructor, I suggest creating a HasSubnormals object on the stack:
if (needFlushDenormalsToZero) {
    if (HasSubnormals{}.execute(blob)) {
        DEBUG_LOG("Clone is necessary for Constant containing subnormals");
        return InputPrepType::FTZ;
    }
}
Removed
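For readers outside the codebase, the subnormal scan the suggestion relies on can be illustrated standalone. This is a minimal sketch, not the plugin's `HasSubnormals` implementation (which uses a JIT kernel); `has_subnormals` is a hypothetical name:

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the PR's HasSubnormals check: scans a float
// buffer and reports whether any value is subnormal (denormal), i.e.
// nonzero but smaller in magnitude than FLT_MIN.
bool has_subnormals(const float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL) {
            return true;
        }
    }
    return false;
}
```

A stack-allocated functor, as the reviewer suggests, avoids a heap allocation per call compared to `std::make_shared`.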
// DAZ has been set, processor automatically converts all denormal source operands
// to a zero with the sign of the original operand before performing any
// computations on them, thus no need to flush them to zero manually
bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
const bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
It can also be moved closer to its usage.
Updated
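The comment block above describes what the software fallback has to do when DAZ is off: replace each subnormal with a zero carrying the sign of the original operand. A minimal sketch of such a flush pass (assumed names, not the plugin's actual kernel):

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>

// Hypothetical software flush-to-zero pass: when the DAZ flag is not set,
// subnormal values are replaced by a zero with the sign of the original
// operand, mirroring what the hardware would do with DAZ enabled.
void flush_denormals_to_zero(float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL) {
            data[i] = std::copysign(0.0f, data[i]);  // preserve the sign bit
        }
    }
}
```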
if (context->getWeightsCache() &&
    context->getNumNumaNodes() > 1 &&
    context->getCPUStreamExecutor()->get_streams_num() > 1) {
    DEBUG_LOG("Clone is necessary for multistream multisocket configuration");
    return InputPrepType::PutToNumaLocalCache;
}
Are we sure that this check is heavier than the HasSubnormals check, given that needFlushDenormalsToZero is the default config?
Maybe I misunderstood the question, but the idea is a bit different.
If we need to flush subnormals to zero based on the provided configuration, we always have to check whether subnormals really exist in the tensor; we must do it unconditionally. So there is no need to perform the check below if we are going to clone the tensor because of subnormals anyway.
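The ordering argument can be made concrete with a decision-function sketch. This is a simplified model with the checks lifted into boolean parameters (all names hypothetical), not the actual `requiresPreProcessing` signature:

```cpp
// Simplified model of the check ordering discussed above: when an FTZ clone
// is required, it wins outright, so the cheaper configuration checks are
// only reached when no subnormal-driven clone is needed anyway.
enum class InputPrepType { None, FTZ, PutToNumaLocalCache, SimpleClone };

InputPrepType requires_preprocessing(bool needFlushDenormalsToZero,
                                     bool hasSubnormals,
                                     bool multiStreamMultiSocket,
                                     bool blobAligned) {
    if (needFlushDenormalsToZero && hasSubnormals)
        return InputPrepType::FTZ;             // clone + flush, regardless of the rest
    if (multiStreamMultiSocket)
        return InputPrepType::PutToNumaLocalCache;
    if (!blobAligned)
        return InputPrepType::SimpleClone;
    return InputPrepType::None;                // original blob can be used as-is
}
```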
if (!isBlobAligned()) {
    DEBUG_LOG("Clone is necessary for not aligned blobs");
    return InputPrepType::SimpleClone;
}
Is this check slower than the HasSubnormals check, given that needFlushDenormalsToZero == true is the default config?
The same answer here
MemoryPtr cloneBlob(const IMemory& blob, const dnnl::engine& engine, bool needFlushDenormalsToZero) {
    const auto& memDesc = blob.getDesc();
    const auto prec = blob.getPrecision();
    const size_t size = blob.getShape().getElementsCount();
Shouldn't we call it only in the scope where it's used?
Updated
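The scoping fix the reviewer asks for is a general C++ hygiene pattern: compute a value inside the only branch that needs it. A minimal sketch under assumed names (`clone_elementwise` is illustrative, not the PR's `cloneBlob`):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of the scoping suggestion: the element count is
// computed inside the branch that actually iterates element-wise, instead of
// unconditionally at the top of the function.
std::size_t clone_elementwise(const std::vector<float>& blob,
                              bool needFlushDenormalsToZero) {
    std::size_t processed = 0;
    if (needFlushDenormalsToZero) {
        const std::size_t size = blob.size();  // moved into the scope where it's used
        // ... per-element copy with subnormal flushing would go here ...
        processed = size;
    }
    // the non-flushing path can use a bulk memcpy-style clone without `size`
    return processed;
}
```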
if (!parent->isConstant())
    continue;

bool oneShotCopyPossible = node->canPrepInput(i);
Better to rename oneShotCopyPossible -> nodeLevelCopyPossible. oneShot is not that descriptive, especially for people outside the context of this feature. Here and in the other places.
nodeLevelCopyPossible sounds even less descriptive to me. The idea is to convey that single-copy processing is possible.
bool canPrepInput(size_t idx) const override {
    return idx == 1;
}

void prepInput(size_t idx, InputPrepType type) override {
    OPENVINO_ASSERT(idx == 1, "Only weights input (1) can be preprocessed");
    attrs.weightsPrepType = type;
}
My main concern here is that the proposed design doesn't force the node implementation developer to set such properties explicitly. The implementation doesn't have to clearly define which inputs are preprocessed by the node itself, so the developer must somehow know about this feature to optimize constants processing. Such a design flaw isn't harmful in itself, as leaving the default behavior doesn't break program correctness. But it looks like a parameter of the operation semantics, i.e. a property, which may be requested to be set explicitly. That would force developers to apply such an optimization in future node implementations.
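One common way to realize what the reviewer describes is to make the query a pure virtual function, so every node class must state explicitly which inputs it preprocesses itself rather than silently inheriting a default. This is a sketch of that idea with assumed names, not the PR's actual class hierarchy:

```cpp
#include <cstddef>

// Sketch of the "explicit property" design: the base class offers no default
// implementation, so each node must declare its preprocessed inputs.
struct NodeConstantPrep {
    virtual ~NodeConstantPrep() = default;
    virtual bool canPrepInput(std::size_t idx) const = 0;  // must be overridden
};

// Example node: only the weights input (index 1) is preprocessed in-node.
struct FullyConnectedLike : NodeConstantPrep {
    bool canPrepInput(std::size_t idx) const override {
        return idx == 1;
    }
};
```

The trade-off is boilerplate: every node, including ones that preprocess nothing, must now provide an override, which is exactly what makes the property impossible to overlook.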
@maxnick Could you please take another look?
This PR will be closed in a week because of 2 weeks of no activity.
Details:
The main idea of the change is to postpone any manipulations with the original constant input data until more information / context is available. This allows avoiding unnecessary copies of the original constant blobs.
In the following situations it is possible to avoid the copy:
TODO:
Tickets: