[CPU] Avoid storing extra copies of constant inputs #26009
Conversation
This PR will be closed in a week because of 2 weeks of no activity.
This PR was closed because it has been stalled for 2 weeks with no activity.
@maxnick Thank you for the review, the comments have been applied.
namespace ov {
namespace intel_cpu {

struct jit_has_subnormals_base;
Do we need this forward declaration here?
Removed
src/plugins/intel_cpu/src/graph.cpp
Outdated
const auto edge = node->getParentEdgeAt(i);
const auto parent = node->getParentEdgeAt(0)->getParent();
// keep track of inplace up by inplace output ports
inPlaceOutPort = inPlaceOutPort == parent->inPlaceOutPort(i) ? edge->parent_port : -1;
In the general case, the number of the parent node's output ports is not equal to the number of input ports (i.e. input edges) of the current node. Please double-check the algorithm.
Right, this logic looks just wrong. It will be corrected, and tests will be added.
}

InputPrepType requiresPreProcessing(const IMemory& blob, GraphContext::CPtr context, const dnnl::engine& engine) {
    const auto shape = blob.getShape();
It seems that this variable isn't used.
Removed
    return InputPrepType::SimpleClone;
}

const bool mustFlushDenormalsToZero = needFlushDenormalsToZero && std::make_shared<HasSubnormals>()->execute(blob);
Since HasSubnormals has zero size (it doesn't store any values) and has only a default constructor, I suggest creating a HasSubnormals object on the stack:
if (needFlushDenormalsToZero) {
    if (HasSubnormals{}.execute(blob)) {
        DEBUG_LOG("Clone is necessary for Constant containing subnormals");
        return InputPrepType::FTZ;
    }
}
Removed
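For readers outside the codebase, the subnormal scan the suggestion relies on can be illustrated standalone. This is a minimal sketch, not the plugin's `HasSubnormals` implementation (which uses a JIT kernel); `has_subnormals` is a hypothetical name:

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the PR's HasSubnormals check: scans a float
// buffer and reports whether any value is subnormal (denormal), i.e.
// nonzero but smaller in magnitude than FLT_MIN.
bool has_subnormals(const float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL) {
            return true;
        }
    }
    return false;
}
```

A stack-allocated functor, as the reviewer suggests, avoids a heap allocation per call compared to `std::make_shared`.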
// DAZ has been set, processor automatically converts all denormal source operands
// to a zero with the sign of the original operand before performing any
// computations on them, thus no need to flush them to zero manually
bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
const bool needFlushDenormalsToZero = context->getConfig().DAZOn ? false : true;
It can also be moved closer to its usage.
Updated
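The comment block above describes what the software fallback has to do when DAZ is off: replace each subnormal with a zero carrying the sign of the original operand. A minimal sketch of such a flush pass (assumed names, not the plugin's actual kernel):

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>

// Hypothetical software flush-to-zero pass: when the DAZ flag is not set,
// subnormal values are replaced by a zero with the sign of the original
// operand, mirroring what the hardware would do with DAZ enabled.
void flush_denormals_to_zero(float* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL) {
            data[i] = std::copysign(0.0f, data[i]);  // preserve the sign bit
        }
    }
}
```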
if (context->getWeightsCache() &&
    context->getNumNumaNodes() > 1 &&
    context->getCPUStreamExecutor()->get_streams_num() > 1) {
    DEBUG_LOG("Clone is necessary for multistream multisocket configuration");
    return InputPrepType::PutToNumaLocalCache;
}
Are we sure that this check is heavier than the HasSubnormals check, given that needFlushDenormalsToZero is the default config?
Maybe I misunderstood the question, but the idea is a bit different.
If we need to flush subnormals to zero based on the provided configuration, we always have to check whether subnormals really exist in the tensor; we must do it unconditionally. So there is no need to perform the check below if we are going to clone the tensor because of subnormals anyway.
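The ordering argument can be made concrete with a decision-function sketch. This is a simplified model with the checks lifted into boolean parameters (all names hypothetical), not the actual `requiresPreProcessing` signature:

```cpp
// Simplified model of the check ordering discussed above: when an FTZ clone
// is required, it wins outright, so the cheaper configuration checks are
// only reached when no subnormal-driven clone is needed anyway.
enum class InputPrepType { None, FTZ, PutToNumaLocalCache, SimpleClone };

InputPrepType requires_preprocessing(bool needFlushDenormalsToZero,
                                     bool hasSubnormals,
                                     bool multiStreamMultiSocket,
                                     bool blobAligned) {
    if (needFlushDenormalsToZero && hasSubnormals)
        return InputPrepType::FTZ;             // clone + flush, regardless of the rest
    if (multiStreamMultiSocket)
        return InputPrepType::PutToNumaLocalCache;
    if (!blobAligned)
        return InputPrepType::SimpleClone;
    return InputPrepType::None;                // original blob can be used as-is
}
```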
if (!isBlobAligned()) {
    DEBUG_LOG("Clone is necessary for not aligned blobs");
    return InputPrepType::SimpleClone;
}
Is this check slower than the HasSubnormals check, given that needFlushDenormalsToZero == true is the default config?
The same answer here
MemoryPtr cloneBlob(const IMemory& blob, const dnnl::engine& engine, bool needFlushDenormalsToZero) {
    const auto& memDesc = blob.getDesc();
    const auto prec = blob.getPrecision();
    const size_t size = blob.getShape().getElementsCount();
Shouldn't we call it only in the scope where it's used?
Updated
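The scoping fix the reviewer asks for is a general C++ hygiene pattern: compute a value inside the only branch that needs it. A minimal sketch under assumed names (`clone_elementwise` is illustrative, not the PR's `cloneBlob`):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of the scoping suggestion: the element count is
// computed inside the branch that actually iterates element-wise, instead of
// unconditionally at the top of the function.
std::size_t clone_elementwise(const std::vector<float>& blob,
                              bool needFlushDenormalsToZero) {
    std::size_t processed = 0;
    if (needFlushDenormalsToZero) {
        const std::size_t size = blob.size();  // moved into the scope where it's used
        // ... per-element copy with subnormal flushing would go here ...
        processed = size;
    }
    // the non-flushing path can use a bulk memcpy-style clone without `size`
    return processed;
}
```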
if (!parent->isConstant())
    continue;

bool oneShotCopyPossible = node->canPrepInput(i);
Better to rename oneShotCopyPossible -> nodeLevelCopyPossible. oneShot is not that descriptive, especially for people outside the context of this feature. Here and in the other places.
nodeLevelCopyPossible sounds even less descriptive to me. The idea is to convey that single-copy processing is possible.
bool canPrepInput(size_t idx) const override {
    return idx == 1;
}

void prepInput(size_t idx, InputPrepType type) override {
    OPENVINO_ASSERT(idx == 1, "Only weights input (1) can be preprocessed");
    attrs.weightsPrepType = type;
}
My main concern here is that the proposed design doesn't force the node implementation developer to set such properties explicitly. The implementation doesn't have to clearly define which inputs are preprocessed by the node itself, so the developer must somehow know about this feature to optimize constants processing. Such a design flaw isn't harmful in itself, as leaving the default behavior doesn't break program correctness. But it looks like a parameter of the operation semantics, i.e. a property, which may be requested to be set explicitly. That would force developers to apply such an optimization in future node implementations.
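One common way to realize what the reviewer describes is to make the query a pure virtual function, so every node class must state explicitly which inputs it preprocesses itself rather than silently inheriting a default. This is a sketch of that idea with assumed names, not the PR's actual class hierarchy:

```cpp
#include <cstddef>

// Sketch of the "explicit property" design: the base class offers no default
// implementation, so each node must declare its preprocessed inputs.
struct NodeConstantPrep {
    virtual ~NodeConstantPrep() = default;
    virtual bool canPrepInput(std::size_t idx) const = 0;  // must be overridden
};

// Example node: only the weights input (index 1) is preprocessed in-node.
struct FullyConnectedLike : NodeConstantPrep {
    bool canPrepInput(std::size_t idx) const override {
        return idx == 1;
    }
};
```

The trade-off is boilerplate: every node, including ones that preprocess nothing, must now provide an override, which is exactly what makes the property impossible to overlook.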
@maxnick Could you please take another look?
This PR will be closed in a week because of 2 weeks of no activity.
Details:
The main idea of the change is to postpone any manipulations with the original constant input data until more information / context is available. This allows avoiding unnecessary copies of the original constant blobs.
In the following situations it is possible to avoid the copy:
TODO:
Tickets: