
enhance: BF functions support real fp16/bf16 calculate #980

Merged (1 commit into zilliztech:main on Dec 16, 2024)

Conversation

cqy123456 (Collaborator) commented Dec 10, 2024

issue: #909
The purpose of this PR is to remove the memory copy of input data and to prepare for NM index search.

Simple test with cohere-768d-cosine (8 CPUs, AVX512):

fp16/bf16: row count = 174762, total size = 128MB
QPS of float16: 170.49 (before: 75.5)
QPS of bfloat16: 158.87 (before: 73.95)

fp32: row count = 174762, total size = 256MB
QPS of float32: 132.029

VecTool test with gist1M-960d-L2, brute force vs. FLAT (data converted to fp32):

bf16:
I1216 02:53:15.839870 69100 metric.cpp:141] case = gist_query_FLAT_k_100_0.00 | repeat = 1 | nq = 1000 | k = 100 | time = 66.1252s | qps = 15.1228 | avg_recall = 1 | min_recall = 1 | avg_ncdg = -nan | min_ncdg = 3.40282e+38 | avg_disterr = 0 | max_disterr = 1.17549e-38
fp16:
I1216 02:59:14.644295 70049 metric.cpp:141] case = gist_query_FLAT_k_100_0.00 | repeat = 1 | nq = 1000 | k = 100 | time = 67.0043s | qps = 14.9244 | avg_recall = 1 | min_recall = 1 | avg_ncdg = -nan | min_ncdg = 3.40282e+38 | avg_disterr = 0 | max_disterr = 1.17549e-38

sre-ci-robot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cqy123456

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mergify bot commented Dec 10, 2024

@cqy123456 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.87%. Comparing base (3c46f4c) to head (e089ab8).
Report is 268 commits behind head on main.

Additional details and impacted files:

@@            Coverage Diff            @@
##           main     #980       +/-   ##
=========================================
+ Coverage      0   73.87%   +73.87%     
=========================================
  Files         0       82       +82     
  Lines         0     6916     +6916     
=========================================
+ Hits          0     5109     +5109     
- Misses        0     1807     +1807     

see 82 files with indirect coverage changes

@@ -159,19 +159,17 @@ FAISS_PRAGMA_IMPRECISE_FUNCTION_END
float
fp16_vec_inner_product_avx512(const knowhere::fp16* x, const knowhere::fp16* y, size_t d) {
__m512 m512_res = _mm512_setzero_ps();
__m512 m512_res_0 = _mm512_setzero_ps();
Collaborator:

res_0 is used for increasing the instruction level parallelism in the following loop.
Please confirm with godbolt.org, or let me know if I need to check this.
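A minimal fp32 sketch of that dual-accumulator pattern (the PR's real kernels operate on fp16/bf16; the fp32 analogue and the function name here are illustrative only). Two independent accumulators break the loop-carried dependency on a single register, so consecutive FMAs can overlap in the pipeline:

#include <immintrin.h>
#include <cstddef>

// Illustrative fp32 analogue; requires AVX512F.
float
vec_inner_product_avx512_ilp(const float* x, const float* y, size_t d) {
    __m512 res = _mm512_setzero_ps();
    __m512 res_0 = _mm512_setzero_ps();  // second accumulator -> second independent FMA chain
    size_t i = 0;
    for (; i + 32 <= d; i += 32) {
        res = _mm512_fmadd_ps(_mm512_loadu_ps(x + i), _mm512_loadu_ps(y + i), res);
        res_0 = _mm512_fmadd_ps(_mm512_loadu_ps(x + i + 16), _mm512_loadu_ps(y + i + 16), res_0);
    }
    float sum = _mm512_reduce_add_ps(_mm512_add_ps(res, res_0));
    for (; i < d; ++i) {  // scalar tail
        sum += x[i] * y[i];
    }
    return sum;
}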

Collaborator Author:

There is a slight difference between fp16_vec_inner_product_avx512 and fp16_vec_inner_product_avx512_batch_4:
fp16_vec_inner_product_avx512: sum of (round(a*b + m512_res)) + sum of (round(a*b + m512_res_0))
fp16_vec_inner_product_avx512_batch_4: sum of (round(a*b + m512_res))

Collaborator Author (@cqy123456, Dec 11, 2024):

Precision loss may be caused by fmadd: round(a*b + c)
vs. mul and add: round(a*b) + c
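A tiny self-contained demo of that rounding difference (the inputs are hand-picked so that a*b is not exactly representable in float; compile with -ffp-contract=off on GCC/Clang so the compiler does not fuse the mul+add itself):

#include <cmath>
#include <cstdio>

int main() {
    const float x = 1.0f / 8192.0f;  // 2^-13
    const float a = 1.0f + x, b = 1.0f - x, c = -1.0f;
    // a*b = 1 - 2^-26 exactly; float rounds that product to 1.0f
    const float fused = std::fma(a, b, c);  // round(a*b + c): one rounding, gives -2^-26
    const float split = a * b + c;          // round(round(a*b) + c): two roundings, gives 0
    std::printf("fused = %g, split = %g\n", fused, split);
    return 0;
}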

Collaborator:

I see. It sounds reasonable, but please add comments in the corresponding distance_XYZ.cc files (after the headers but before the namespace faiss { line) explaining why it is done. Otherwise, someone may wish to 'optimize' the code back.
Alternatively, it is possible (and it is a faster solution for the single-op version) to extend the batch-4 version to match the single-op version instead: let the batch-4 loop perform 8 FMA operations instead of 4 and let the single-op version perform 2 FMA operations, as it is now in the baseline.

I leave it up to you to decide whether you'd like to change it, because the hot spot is the batch-4 version anyway.

Collaborator Author:

I tested the batch-4 loop performing 8 FMA operations; a slight performance degradation showed up in BF float16 search.
It is possible that the number of AVX512 registers is not enough for that much parallelism.

res.val[1] = vmlaq_f32(res.val[1], a.val[1], a.val[1]);
res.val[2] = vmlaq_f32(res.val[2], a.val[2], a.val[2]);
res.val[3] = vmlaq_f32(res.val[3], a.val[3], a.val[3]);
res.val[0] = vaddq_f32(vmulq_f32(a.val[0], a.val[0]), res.val[0]);
Collaborator:

Why replace FMA with ADD+MUL?

Collaborator Author:

update.

// use build thread pool to compute norms
auto pool = ThreadPool::GetGlobalSearchThreadPool();
std::vector<folly::Future<folly::Unit>> futs;
constexpr int64_t chunk_size = 128;
Collaborator:

this is way too small, tbh
I'd make it something like 8192
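A portable sketch of the chunked-norm pattern with the suggested 8192 chunk size (std::async stands in for knowhere's ThreadPool and folly futures; compute_norms and its signature are made up for illustration):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <future>
#include <vector>

std::vector<float>
compute_norms(const float* data, int64_t nb, int64_t dim) {
    constexpr int64_t chunk_size = 8192;                           // reviewer-suggested size
    const int64_t chunk_num = (nb + chunk_size - 1) / chunk_size;  // integer ceiling division
    std::vector<float> norms(nb);
    std::vector<std::future<void>> futs;
    futs.reserve(chunk_num);
    for (int64_t c = 0; c < chunk_num; ++c) {
        futs.emplace_back(std::async(std::launch::async, [&, c] {
            const int64_t begin = c * chunk_size;
            const int64_t end = std::min(begin + chunk_size, nb);
            for (int64_t i = begin; i < end; ++i) {
                float sum = 0.0f;
                for (int64_t j = 0; j < dim; ++j) {
                    sum += data[i * dim + j] * data[i * dim + j];
                }
                norms[i] = std::sqrt(sum);
            }
        }));
    }
    for (auto& f : futs) {
        f.get();  // wait for all chunks and propagate any exception
    }
    return norms;
}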

Collaborator Author:

update.

auto pool = ThreadPool::GetGlobalSearchThreadPool();
std::vector<folly::Future<folly::Unit>> futs;
constexpr int64_t chunk_size = 128;
auto chunk_num = std::ceil(float(nb) / float(chunk_size));
Collaborator:

auto chunk_num = (nb + chunk_size - 1) / chunk_size;
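Both forms compute ceil(nb / chunk_size), but the integer form avoids the float round-trip, which can misbehave once nb exceeds 2^24 (float cannot represent every integer beyond that). A small comparison:

#include <cmath>
#include <cstdint>

const int64_t nb = 174762, chunk_size = 8192;
const auto by_float = static_cast<int64_t>(std::ceil(float(nb) / float(chunk_size)));  // 22
const auto by_int = (nb + chunk_size - 1) / chunk_size;                                // 22, no float involved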

Collaborator Author:

update.

faiss::float_maxheap_array_t buf{(size_t)1, (size_t)topk, cur_labels, cur_distances};
faiss::knn_L2sqr(cur_query, (const float*)xb, dim, 1, nb, &buf, nullptr, id_selector);
} else if constexpr (KnowhereHalfPrecisionFloatPointTypeCheck<DataType>::value) {
faiss::half_precision_floating_point_knn_L2sqr(cur_query, (const DataType*)xb, dim, 1, nb, topk,
Collaborator:

Why is it not possible to make faiss::knn_L2sqr templated instead? Basically, why does the half-precision data type require cloning the knn code?

Collaborator Author:

Will templating lots of the distances.h functions increase the complexity of future faiss upgrades? If it's okay, I can move the fp16/bf16 implementation to distances.h.
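A hypothetical scalar sketch of the templating the reviewer has in mind (not the real faiss API; assumes knowhere::fp16/bf16 are convertible to float, and shows k=1 for brevity):

#include <cstddef>
#include <cstdint>
#include <limits>

template <typename DataT>  // DataT in {float, knowhere::fp16, knowhere::bf16}
int64_t
nn_L2sqr_typed(const float* x, const DataT* y, size_t d, size_t ny) {
    float best = std::numeric_limits<float>::max();
    int64_t best_id = -1;
    for (size_t i = 0; i < ny; ++i) {
        float dist = 0.0f;
        for (size_t j = 0; j < d; ++j) {
            const float diff = x[j] - static_cast<float>(y[i * d + j]);
            dist += diff * diff;
        }
        if (dist < best) {
            best = dist;
            best_id = static_cast<int64_t>(i);
        }
    }
    return best_id;  // index of the nearest base vector under squared L2
}

Since templates have to live in headers, this is exactly the faiss-upgrade-complexity trade-off raised above.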

@@ -397,7 +465,7 @@ BruteForce::RangeSearch(const DataSetPtr base_dataset, const DataSetPtr query_da

faiss::MetricType faiss_metric_type;
sparse::DocValueComputer<float> sparse_computer;
if (!is_sparse) {
if constexpr (!std::is_same_v<DataType, knowhere::sparse::SparseRow<float>>) {
Collaborator:

This line changes the previous logic from "if sparse" to "if float sparse". Just want to double check that this is intended.

Collaborator Author:

bool is_sparse = std::is_same<DataType, knowhere::sparse::SparseRow<float>>::value;

i.e. is_sparse was already defined as this exact comparison, so the condition is unchanged.

} else if constexpr (KnowhereHalfPrecisionFloatPointTypeCheck<DataType>::value) {
// normalizing the query vector may cause precision loss, so divide by the query norms in the apply function

faiss::half_precision_floating_point_range_search_cosine(
Collaborator:

same here: why not templatize faiss::range_search_cosine?

faiss::range_search_inner_product(cur_query, (const float*)xb, dim, 1, nb, radius, &res,
id_selector);
} else {
// else not sparse:
Collaborator:

else not float sparse.
The logic for this branch got changed.


@@ -239,6 +239,14 @@ GetRangeSearchRecall(const knowhere::DataSet& gt, const knowhere::DataSet& resul
return (1 + precision) * recall / 2;
}

inline float
GetRelativeLoss(float gt_res, float res) {
if (gt_res == 0.0 || std::abs(gt_res) < 0.000001) {
Collaborator:

Recommend using float epsilon.
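A sketch of the suggested change (the body past the guard is assumed from the function name; the epsilon guard itself is what the review asks for):

#include <cmath>
#include <limits>

inline float
GetRelativeLoss(float gt_res, float res) {
    // float machine epsilon instead of a hard-coded 1e-6 literal;
    // this also subsumes the separate == 0.0 check
    if (std::abs(gt_res) < std::numeric_limits<float>::epsilon()) {
        return std::abs(gt_res - res);  // assumed fallback: absolute error near zero
    }
    return std::abs((gt_res - res) / gt_res);
}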

foxspy (Collaborator) commented Dec 16, 2024

/kind improvement

foxspy (Collaborator) commented Dec 16, 2024

/lgtm

sre-ci-robot merged commit a9d6992 into zilliztech:main on Dec 16, 2024
14 checks passed