faiss-flat.msmarco-v2.1-doc.arctic-embed-l.20240824.README.md

msmarco-v2.1-arctic-embed-l

Faiss FlatIP indexes of the MS MARCO v2.1 corpus, encoded with the Snowflake arctic-embed-l model. These indexes were generated on 2024/08/26 on orca.

The indexes were built from precomputed embeddings available on Hugging Face.

Preparation

Because the MS MARCO v2.1 dataset is large, the index had to be split into two parts, each built by its own command:

python scripts/arctic/convert_embeddings.py --embeddings_folder /store/scratch/sjupadhy/msmarco-v2.1-snowflake-arctic-embed-l/corpus \
--output /store/scratch/sjupadhy/indexes/msmarco-v2.1-dev-snowflake-arctic-embed-l-1 \
--indices 0_30

python scripts/arctic/convert_embeddings.py --embeddings_folder /store/scratch/sjupadhy/msmarco-v2.1-snowflake-arctic-embed-l/corpus \
--output /store/scratch/sjupadhy/indexes/msmarco-v2.1-dev-snowflake-arctic-embed-l-2 \
--indices 30_59
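Since the corpus ends up in two separate indexes, a query has to be run against both parts and the results merged. The following is a minimal sketch of that merge step, not the code used to build or search these indexes: plain Python stands in for a Faiss flat inner-product index, and the names `search_shard`, `search_all`, `part_1`, and `part_2` are hypothetical.

```python
import heapq


def inner_product(a, b):
    """Inner product of two equal-length vectors (the FlatIP similarity)."""
    return sum(x * y for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Brute-force top-k over one shard, given as a list of (docid, vector)."""
    scored = ((inner_product(vec, query), docid) for docid, vec in shard)
    # nlargest returns (score, docid) pairs sorted by descending score.
    return heapq.nlargest(k, scored)


def search_all(shards, query, k):
    """Search each shard independently, then keep the k best hits overall."""
    hits = []
    for shard in shards:
        hits.extend(search_shard(shard, query, k))
    return heapq.nlargest(k, hits)


# Toy stand-ins for the two index parts (assumed data, not real embeddings).
part_1 = [("doc_a", [1.0, 0.0]), ("doc_b", [0.5, 0.5])]
part_2 = [("doc_c", [0.9, 0.1]), ("doc_d", [0.0, 1.0])]
query = [1.0, 0.0]

top = search_all([part_1, part_2], query, k=2)
for score, docid in top:
    print(docid, score)
```

Because each part returns its own top k, merging the per-part lists gives the same top k as searching a single combined index would.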

Topic embeddings

python scripts/arctic/convert_queries.py --embedding_path /store/scratch/sjupadhy/msmarco-v2.1-snowflake-arctic-embed-l/topics/snowflake-arctic-embed-l-topics.msmarco-v2-doc.dev.parquet \
--output /store/scratch/sjupadhy/queries/msmarco-v2.1-dev-snowflake-arctic-embed-l