-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Audio: DRC: Some HiFi4 and HiFi3 optimizations #9646
Audio: DRC: Some HiFi4 and HiFi3 optimizations #9646
Conversation
src/audio/drc/drc_math_hifi3.c
Outdated
* p1p0.l holds p0(x) | ||
* in3p1.h holds p1(x) * x^3 | ||
*/ | ||
acc = AE_MOVAD32_L(AE_ADD32_HL_LH(in3p1, in3p1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1st working version, this should be possible optimize further. If it works, it should apply to other used polynomial evaluation based functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great - lets takes this as step 1. The future optimizations can merge as other PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's now a better version, but I'm out of ideas how to improve this further. Also, the polynomial is too low order for four parallel multipliers version of Horner, there would be setup overhead and redundant zero coefficients.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/audio/drc/drc_math_hifi3.c
Outdated
* p1p0.l holds p0(x) | ||
* in3p1.h holds p1(x) * x^3 | ||
*/ | ||
acc = AE_MOVAD32_L(AE_ADD32_HL_LH(in3p1, in3p1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great - lets takes this as step 1. The future optimizations can merge as other PRs.
8ccefd0
to
9a1a160
Compare
src/audio/drc/drc_math_hifi3.c
Outdated
int32_t precision_inv; | ||
int32_t sqrt2_extracted = 0; | ||
ae_f32 acc; | ||
ae_f32 coef[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__aligned(8) ae_f32 coef[2];
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu does this need aligned i.e. will it be used for SIMD ? I would assume that the SIMD intrinsics would force alignment in their definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I need to add that since I declare it as array of 32-bit while I use it in hot code with 64 bit load. The compiler may figure that out but I think there is no guarantee. I didn't get align faults when I ran this but better to be safe.
I added unit test patch for the changed function separately in #9649 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one open.
src/audio/drc/drc_math_hifi3.c
Outdated
int32_t precision_inv; | ||
int32_t sqrt2_extracted = 0; | ||
ae_f32 acc; | ||
ae_f32 coef[2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@singalsu does this need aligned i.e. will it be used for SIMD ? I would assume that the SIMD intrinsics would force alignment in their definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very helpful comments. A few minor things, but no showstoppers. I'd add the alignment attribute.
Use 64 bit SIMD for load/store and maximum absolute values search. This saves about 0.1 MCPS in MTL simulation in sof-testbench4. Signed-off-by: Seppo Ingalsuo <[email protected]>
9a1a160
to
64a8433
Compare
This change improves calculation speed. The implementation is bit exact with previous but implemented with 2-way SIMD multiply. The Horner polynomial evaluation is changed to parallel Horner version for two multipliers. The input and output shifting code is also simplified. The code changes save 1.0 MCPS in sof-testbench4 simulated MTL platform. Signed-off-by: Seppo Ingalsuo <[email protected]>
64a8433
to
c46e4e3
Compare
No description provided.