nlist computation example and nprobe suggestion #2771

Open · wants to merge 3 commits into `v2.4.x`
4 changes: 4 additions & 0 deletions site/en/faq/performance_faq.md
@@ -17,7 +17,11 @@

Setting `nlist` is scenario-specific. As a rule of thumb, the recommended value is $4 \times \sqrt{n}$, where $n$ is the total number of entities in a segment.

The size of each segment is determined by the `datacoord.segment.maxSize` parameter, which is set to 512 MB by default. The total number of entities in a segment, $n$, can be estimated by dividing `datacoord.segment.maxSize` by the size of each entity.

**Example**: If each entity is 50 KB, then $n = \frac{512\, \text{MB} \times 1024\, \text{KB/MB}}{50\, \text{KB per entity}} \approx 10{,}485$ entities.
For the number of clusters, `nlist` $= 4 \times \sqrt{n} \approx 410$.
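
As a quick sanity check, here is a minimal Python sketch of the same arithmetic, assuming the default 512 MB segment size and the 50 KB entity size from the example above:

```python
import math

segment_max_size_kb = 512 * 1024  # datacoord.segment.maxSize (512 MB) expressed in KB
entity_size_kb = 50               # assumed size of one entity, as in the example

n = segment_max_size_kb // entity_size_kb  # entities per segment
nlist = round(4 * math.sqrt(n))            # rule of thumb: 4 * sqrt(n)

print(n, nlist)  # 10485 410
```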

Setting `nprobe` is specific to the dataset and scenario, and involves a trade-off between accuracy and query performance. We recommend finding the ideal value through repeated experimentation.
If the number of entities is within the millions, you might consider using brute-force search; in other words, set `nprobe` equal to `nlist`.
**Member:** To my understanding, brute-force (BF) search shows better performance only at the thousands level.

**Author:** *(image attachment)*

**Author:** It seems that the millions level doesn't affect performance significantly.

**Member:** Some misunderstanding. He meant it doesn't hurt that much for small cases. Actually, for a 1M dataset, the performance gap between with and without an index can be 10~100x. Only when the row count is below the thousands can FLAT outperform, but Milvus will handle that for you.
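
Whichever threshold the discussion settles on, `nprobe` is supplied at search time. Below is a hedged pymilvus 2.x sketch of where the parameter goes; the collection name, field name, vector dimension, and `nprobe` value are placeholders for illustration, not recommendations:

```python
from pymilvus import Collection  # assumes pymilvus 2.x and a running Milvus instance

collection = Collection("demo_collection")  # hypothetical collection name

# nprobe is passed per search; raising it toward nlist trades speed for recall.
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 64},  # tune between 1 and nlist by experiment
}

results = collection.search(
    data=[[0.1] * 128],       # one 128-dimensional query vector (placeholder values)
    anns_field="embedding",   # hypothetical vector field name
    param=search_params,
    limit=10,
)
```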


The following charts show results from a test run on the sift50m dataset with the IVF_SQ8 index, comparing recall and query performance for different `nlist`/`nprobe` pairs.
