
[CPU] Avoid storing extra copies of constant inputs #26009

Open
wants to merge 5 commits into master from avoid_extra_copies_of_constants
Conversation

@EgorDuplensky EgorDuplensky commented Aug 9, 2024

Details:

The main idea of this change is to postpone any manipulation of the original constant input data until more information / context is available. This makes it possible to avoid unnecessary copies of the original constant blobs.
A copy can be avoided in the following situations (see the sketch after this list):

  1. The node is going to perform some kind of preprocessing (e.g. weights repacking) anyway.
  2. Multisocket scenario: a NUMA-local copy is not required if the tensor is going to be processed / copied anyway.
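
For orientation, a minimal sketch of the decision type this deferral produces. Only SimpleClone, FTZ and PutToNumaLocalCache actually appear in the review snippets below; None is a hypothetical placeholder for "keep the original blob":

    enum class InputPrepType {
        None,                // node repacks the data anyway - keep the original blob
        SimpleClone,         // plain copy (e.g. the source blob is not aligned)
        FTZ,                 // copy and flush subnormal values to zero
        PutToNumaLocalCache, // NUMA-local copy for multisocket, multistream runs
    };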

TODO:

  • Cover with tests

Tickets:

  • 158433

@EgorDuplensky EgorDuplensky requested review from a team as code owners August 9, 2024 17:09
@github-actions github-actions bot added the category: Core (OpenVINO Core, aka ngraph), category: CPU (OpenVINO CPU plugin) and category: CPP API (OpenVINO CPP API bindings) labels Aug 9, 2024
@EgorDuplensky EgorDuplensky force-pushed the avoid_extra_copies_of_constants branch 5 times, most recently from be32902 to da692b2 on August 12, 2024 16:00
@maxnick maxnick self-assigned this Sep 5, 2024
@EgorDuplensky EgorDuplensky force-pushed the avoid_extra_copies_of_constants branch from da692b2 to ad121dd on September 5, 2024 17:15
@github-actions github-actions bot removed the category: Core (OpenVINO Core, aka ngraph) and category: CPP API (OpenVINO CPP API bindings) labels Sep 5, 2024
Resolved review threads:
  • src/plugins/intel_cpu/src/graph.h (outdated)
  • src/plugins/intel_cpu/src/node.h (outdated)
  • src/plugins/intel_cpu/src/nodes/common/has_subnormals.cpp (outdated)
  • src/plugins/intel_cpu/src/nodes/deconv.h
  • src/plugins/intel_cpu/src/utils/clone_original_blob.h (outdated)
  • src/plugins/intel_cpu/tests/unit/graph/dummy_node.hpp (outdated)
  • src/plugins/intel_cpu/src/graph.cpp
@EgorDuplensky EgorDuplensky force-pushed the avoid_extra_copies_of_constants branch from ad121dd to f6b383f on September 6, 2024 13:16
@maxnick maxnick added this to the 2024.5 milestone Sep 6, 2024

github-actions bot commented Oct 2, 2024

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Oct 2, 2024
github-actions bot:

This PR was closed because it has been stalled for 2 weeks with no activity.

@github-actions github-actions bot closed this Oct 10, 2024
@EgorDuplensky EgorDuplensky reopened this Nov 25, 2024
@github-actions github-actions bot removed the Stale label Nov 26, 2024
@EgorDuplensky EgorDuplensky force-pushed the avoid_extra_copies_of_constants branch from f6b383f to 740869b Compare November 29, 2024 13:07
@EgorDuplensky (Author):

@maxnick Thank you for the review, the comments have been addressed.
Could you please take another look?

namespace ov {
namespace intel_cpu {

struct jit_has_subnormals_base;
Reviewer (Contributor):

Do we need this forward declaration here?

@EgorDuplensky (Author):

Removed

const auto edge = node->getParentEdgeAt(i);
const auto parent = node->getParentEdgeAt(0)->getParent();
// keep track of inplace up by inplace output ports
inPlaceOutPort = inPlaceOutPort == parent->inPlaceOutPort(i) ? edge->parent_port : -1;
Reviewer (Contributor):

In the general case, the number of output ports of the parent node is not equal to the number of input ports (i.e. input edges) of the current node. Please double-check the algorithm.

@EgorDuplensky (Author) commented Dec 3, 2024:

Right, this logic is simply wrong.
It will be corrected, and tests will be added (see the sketch below).
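
One possible shape of the fix, sketched only from the identifiers visible in the snippet above (whether this matches the final correction is an assumption): take the parent from the current edge rather than from edge 0, and index the parent's output ports by this edge's parent-side port instead of the child's input index i.

    const auto edge = node->getParentEdgeAt(i);
    const auto parent = edge->getParent(); // parent of this edge, not of edge 0
    // compare against the parent-side port of this edge,
    // not against the child's input index i
    inPlaceOutPort = inPlaceOutPort == parent->inPlaceOutPort(edge->parent_port)
                         ? edge->parent_port
                         : -1;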

}

InputPrepType requiresPreProcessing(const IMemory& blob, GraphContext::CPtr context, const dnnl::engine& engine) {
const auto shape = blob.getShape();
Reviewer (Contributor):

It seems that this variable isn't used.

@EgorDuplensky (Author):

Removed

return InputPrepType::SimpleClone;
}

const bool mustFlushDenormalsToZero = needFlushDenormalsToZero && std::make_shared<HasSubnormals>()->execute(blob);
Reviewer (Contributor):

Since HasSubnormals has zero size (it doesn't store any values) and only a default constructor, I suggest creating a HasSubnormals object on the stack:

    if (needFlushDenormalsToZero) {
        if (HasSubnormals{}.execute(blob)) {
            DEBUG_LOG("Clone is necessary for Constant containing subnormals");
            return InputPrepType::FTZ;
        }
    }

@EgorDuplensky (Author):

Removed

// When DAZ is set, the processor automatically converts all denormal source operands
// to zero (with the sign of the original operand) before performing any
// computations on them, so there is no need to flush them to zero manually
bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
Reviewer (Contributor):

Suggested change
bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
const bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;

It can also be moved closer to its usage.

@EgorDuplensky (Author):

Updated

Comment on lines +107 to +104
if (context->getWeightsCache() &&
context->getNumNumaNodes() > 1 &&
context->getCPUStreamExecutor()->get_streams_num() > 1) {
DEBUG_LOG("Clone is necessary for multistream multisocket configuration");
return InputPrepType::PutToNumaLocalCache;
}
Reviewer (Contributor):

Are we sure this check is heavier than the HasSubnormals check, given that needFlushDenormalsToZero is the default config?

@EgorDuplensky (Author):

Maybe I misunderstood the question, but the idea is slightly different.
If the configuration requires flushing subnormals to zero, we always have to check whether subnormals actually exist in the tensor; that check is unconditional. So there is no need to perform the check below if we are going to clone the tensor because of subnormals anyway (see the sketch below).
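
For reference, a sketch of the resulting ordering, composed only of calls that appear in the snippets of this review; the surrounding function body is assumed:

    // The subnormal scan is mandatory whenever FTZ may be required, so it runs
    // first: a positive result makes the NUMA-cache check below irrelevant.
    if (needFlushDenormalsToZero && HasSubnormals{}.execute(blob)) {
        DEBUG_LOG("Clone is necessary for Constant containing subnormals");
        return InputPrepType::FTZ;
    }
    // Only reached when no clone is required for subnormals anyway.
    if (context->getWeightsCache() && context->getNumNumaNodes() > 1 &&
        context->getCPUStreamExecutor()->get_streams_num() > 1) {
        DEBUG_LOG("Clone is necessary for multistream multisocket configuration");
        return InputPrepType::PutToNumaLocalCache;
    }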

Comment on lines +102 to +97
if (!isBlobAligned()) {
DEBUG_LOG("Clone is necessary for not aligned blobs");
return InputPrepType::SimpleClone;
}
Reviewer (Contributor):

Is this check slower than the HasSubnormals check, given that needFlushDenormalsToZero == true is the default config?

@EgorDuplensky (Author):

The same answer applies here.

MemoryPtr cloneBlob(const IMemory& blob, const dnnl::engine& engine, bool needFlushDenormalsToZero) {
const auto& memDesc = blob.getDesc();
const auto prec = blob.getPrecision();
const size_t size = blob.getShape().getElementsCount();
Reviewer (Contributor):

Shouldn't we compute it only in the scope where it's used?

@EgorDuplensky (Author):

Updated
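
For illustration, a sketch of that change; only the signature and the getters come from the snippet above, while the clone and flush calls are hypothetical placeholders:

    MemoryPtr cloneBlob(const IMemory& blob, const dnnl::engine& engine, bool needFlushDenormalsToZero) {
        const auto& memDesc = blob.getDesc();
        const auto prec = blob.getPrecision();
        auto cloned = allocateLike(memDesc, engine); // hypothetical allocation helper
        copyData(*cloned, blob);                     // hypothetical copy helper
        if (needFlushDenormalsToZero) {
            // the element count is needed only by the flush-to-zero pass,
            // so it is computed inside this branch
            const size_t size = blob.getShape().getElementsCount();
            flushToZero(cloned->getData(), size, prec); // hypothetical FTZ helper
        }
        return cloned;
    }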

if (!parent->isConstant())
continue;

bool oneShotCopyPossible = node->canPrepInput(i);
Reviewer (Contributor):

Better to rename oneShotCopyPossible -> nodeLevelCopyPossible. "oneShot" is not very descriptive, especially for people outside the context of this feature. Here and in the other places.

@EgorDuplensky (Author) commented Dec 3, 2024:

nodeLevelCopyPossible sounds even less descriptive to me.
The idea is to indicate that the input can be processed with a single copy.

Comment on lines 75 to 81
bool canPrepInput(size_t idx) const override {
return idx == 1;
}

void prepInput(size_t idx, InputPrepType type) override {
OPENVINO_ASSERT(idx == 1, "Only weights input (1) can be preprocessed");
attrs.weightsPrepType = type;
}
Reviewer (Contributor):

My main concern here is that the proposed design doesn't force the node implementation developer to set such properties explicitly. An implementation doesn't have to clearly define which inputs are preprocessed by the node itself, so a developer must somehow already know about this feature to optimize constant processing. Such a design flaw isn't harmful in itself, as leaving the default behavior doesn't break program correctness. But this looks like a parameter of the operation semantics, i.e. a property, and it might be worth requiring it to be set explicitly. That would push developers to apply this optimization in future node implementations (see the sketch below).
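
One way to enforce this would be to make the query pure virtual, so every node implementation has to state explicitly which constant inputs it preprocesses itself. A minimal sketch, not the PR's actual API:

    #include <cstddef>

    class Node {
    public:
        virtual ~Node() = default;
        // Pure virtual: derived nodes cannot silently inherit a default and
        // must declare which of their inputs they preprocess themselves.
        virtual bool canPrepInput(size_t idx) const = 0;
    };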

@EgorDuplensky EgorDuplensky force-pushed the avoid_extra_copies_of_constants branch from 740869b to 4096459 on December 3, 2024 18:05
@EgorDuplensky (Author):

@maxnick Could you please take another look?
The tests will be added in the scope of the next commit (same PR).

@EgorDuplensky EgorDuplensky added the pr: needs tests PR needs tests updating label Dec 3, 2024
Resolved review threads:
  • src/plugins/intel_cpu/src/node.h (outdated, two threads)
@EgorDuplensky EgorDuplensky force-pushed the avoid_extra_copies_of_constants branch from 4096459 to 8d14149 on December 5, 2024 14:11
github-actions bot:

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Dec 20, 2024
@mg-intel mg-intel removed the Stale label Dec 23, 2024
@mg-intel mg-intel modified the milestones: 2024.5, 2025.0 Dec 23, 2024

github-actions bot commented Jan 7, 2025

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jan 7, 2025
@mg-intel mg-intel removed the Stale label Jan 7, 2025
Labels: category: CPU (OpenVINO CPU plugin), pr: needs tests (PR needs tests updating)
Projects: none yet
4 participants