Skip to content

Searching with ANNOY index

jingkl edited this page Oct 18, 2021 · 5 revisions

ANNOY

ANNOY (Approximate Nearest Neighbors Oh Yeah) is an index that uses a hyperplane to divide a high-dimensional space into multiple subspaces, and then stores them in a tree structure.

When searching for vectors, ANNOY follows the tree structure to find subspaces closer to the target vector, and then compares all the vectors in these subspaces (The number of vectors being compared should not be less than search_k) to obtain the final result. Obviously, when the target vector is close to the edge of a certain subspace, sometimes it is necessary to greatly increase the number of searched subspaces to obtain a high recall rate. Therefore, ANNOY uses n_trees different methods to divide the whole space, and searches all the dividing methods simultaneously to reduce the probability that the target vector is always at the edge of the subspace.

  • Index building parameters

    Parameter Description Range
    n_trees The number of methods of space division. [1, 1024]
  • Search parameters

    Parameter Description Range
    search_k The number of nodes to search. -1 means 5% of the whole data. {-1} ∪ [top_k, n × n_trees]

For example:

In python

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
# create a collection
collection_name = "milvus_ANNOY"
default_fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=d)
]default_schema = CollectionSchema(fields=default_fields, description="test collection")print(f"\nCreate collection...")collection = Collection(name= collection_name, schema=default_schema)

# insert data
mr = collection.insert([xb])print(collection.num_entities)

# create index
collection.create_index(field_name=field_name,
                        index_params={'index_type': 'ANNOY',
                                      'metric_type': 'L2',
                                      'params': {
                                        "n_trees": 8      # int. 1~1024
                                      }})
collection.load()
                                     
# search 
top_k = 10
results = collection.search(data=xq, anns_field="vector",params={
                "search_k": -1    # int. {-1} U [top_k, n*n_trees], n represents vectors count.
              }, limit=top_k) # show results
              
for result in results:
  print(result.ids)
  print(result.distance)