From e642cbf2f07eb8c04cf118686dd81fb4b35502b7 Mon Sep 17 00:00:00 2001
From: seetimee <50852027+seetimee@users.noreply.github.com>
Date: Thu, 15 Aug 2024 14:31:16 +0800
Subject: [PATCH 1/2] Update performance_faq.md

suggestion: If the data volume of the entities is within the millions, you
might consider using brute-force search. In other words, set nprobe to nlist.
---
 site/en/faq/performance_faq.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/site/en/faq/performance_faq.md b/site/en/faq/performance_faq.md
index 0b7ec66a9..3d8384eae 100644
--- a/site/en/faq/performance_faq.md
+++ b/site/en/faq/performance_faq.md
@@ -18,6 +18,7 @@ Setting `nlist` is scenario-specific. As a rule of thumb, the recommended value
 
 The size of each segment is determined by the `datacoord.segment.maxSize` parameter, which is set to 512 MB by default. The total number of entities in a segment n can be estimated by dividing `datacoord.segment.maxSize` by the size of each entity.
 
 Setting `nprobe` is specific to the dataset and scenario, and involves a trade-off between accuracy and query performance. We recommend finding the ideal value through repeated experimentation.
+If the data volume of the entities is within the millions, you might consider using brute-force search. In other words, set `nprobe` to `nlist`.
 
 The following charts are results from a test running on the sift50m dataset and IVF_SQ8 index, which compares recall and query performance of different `nlist`/`nprobe` pairs.
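The guideline this patch adds can be sketched in plain Python. This is a hypothetical helper, not part of Milvus or pymilvus; the 10-million cutoff and the `nlist // 16` fallback are assumptions standing in for "within the millions" and "find the ideal value through experimentation":

```python
# Hypothetical helper sketching the patch's guideline: for collections up to
# a few million entities, probing every cluster (nprobe == nlist) amounts to
# brute-force search and maximizes recall.
BRUTE_FORCE_THRESHOLD = 10_000_000  # assumed cutoff for "within the millions"

def choose_nprobe(num_entities: int, nlist: int) -> int:
    """Return an nprobe value: scan all clusters for small collections,
    otherwise start from a smaller fraction of nlist and tune from there."""
    if num_entities <= BRUTE_FORCE_THRESHOLD:
        return nlist                # brute-force: probe all nlist clusters
    return max(1, nlist // 16)      # assumed starting point for larger data

print(choose_nprobe(2_000_000, 1024))   # -> 1024 (nprobe == nlist)
print(choose_nprobe(50_000_000, 1024))  # -> 64
```

The returned value would then go into the search parameters (e.g. `{"nprobe": ...}`) for an IVF index.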
From dfd0a9b1005200b96a816a63f8de2a6008322ca4 Mon Sep 17 00:00:00 2001
From: seetimee <50852027+seetimee@users.noreply.github.com>
Date: Thu, 15 Aug 2024 15:21:14 +0800
Subject: [PATCH 2/2] Update performance_faq.md

one more nlist example
---
 site/en/faq/performance_faq.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/site/en/faq/performance_faq.md b/site/en/faq/performance_faq.md
index 3d8384eae..4e92947c5 100644
--- a/site/en/faq/performance_faq.md
+++ b/site/en/faq/performance_faq.md
@@ -17,6 +17,9 @@ Setting `nlist` is scenario-specific. As a rule of thumb, the recommended value
 
 The size of each segment is determined by the `datacoord.segment.maxSize` parameter, which is set to 512 MB by default. The total number of entities in a segment n can be estimated by dividing `datacoord.segment.maxSize` by the size of each entity.
 
+**Example**: If each vector is 50 KB, then $n = \frac{512\, \text{MB} \times 1024\, \text{KB/MB}}{50\, \text{KB per entity}} \approx 10{,}485 \text{ entities}$.
+For the number of clusters, `nlist` $= 4 \times \sqrt{n} \approx 410$.
+
 Setting `nprobe` is specific to the dataset and scenario, and involves a trade-off between accuracy and query performance. We recommend finding the ideal value through repeated experimentation.
 If the data volume of the entities is within the millions, you might consider using brute-force search. In other words, set `nprobe` to `nlist`.
 
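The worked example this patch adds can be checked with a few lines of Python, using the defaults the FAQ cites (512 MB segment size, 50 KB per entity):

```python
import math

# Reproduce the patch's worked example: entities per segment from
# datacoord.segment.maxSize, then the rule of thumb nlist = 4 * sqrt(n).
max_segment_kb = 512 * 1024   # datacoord.segment.maxSize = 512 MB, in KB
entity_kb = 50                # example entity (vector) size in KB

n = max_segment_kb / entity_kb      # 10485.76 entities per segment
nlist = round(4 * math.sqrt(n))     # 4 * sqrt(10485.76) = 409.6 -> 410

print(int(n), nlist)  # -> 10485 410
```

This confirms the rounded figures in the diff: roughly 10,485 entities per segment and `nlist` of about 410.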