Official repository for the paper Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models.
This repository contains the code and resources for Promptriever, which demonstrates that retrieval models can be controlled with prompts on a per-instance basis, similar to language models.
Model or dataset | Description |
---|---|
samaya-ai/promptriever-llama2-7b-v1 | A Promptriever bi-encoder model based on LLaMA 2 (7B parameters). |
samaya-ai/promptriever-llama3.1-8b-instruct-v1 | A Promptriever bi-encoder model based on LLaMA 3.1 Instruct (8B parameters). |
samaya-ai/promptriever-llama3.1-8b-v1 | A Promptriever bi-encoder model based on LLaMA 3.1 (8B parameters). |
samaya-ai/promptriever-mistral-v0.1-7b-v1 | A Promptriever bi-encoder model based on Mistral v0.1 (7B parameters). |
samaya-ai/RepLLaMA-reproduced | A reproduction of the RepLLaMA model (no instructions). A bi-encoder based on LLaMA 2, trained on the tevatron/msmarco-passage-aug dataset. |
samaya-ai/msmarco-w-instructions | A dataset of MS MARCO with added instructions and instruction-negatives, used for training the above models. |
To initialize your research environment:
bash setup/install_conda.sh # if you don't have conda already
bash setup/install_req.sh
pip install git+https://github.com/orionw/tevatron
Run an MS MARCO experiment (DL19, DL20, Dev) with:
bash msmarco/encode_corpus.sh <output_path> <model_name>
bash msmarco/encode_queries.sh <output_path> <model_name>
bash msmarco/search.sh <output_path>
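The three steps above can be chained so that a failed encoding step stops the run before search. The following is a minimal sketch, assuming it is run from the repository root. The helper names `run_step` and `run_msmarco` and the `DRY_RUN` variable are illustrative and not part of the repository.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of the repository): print a command and,
# unless DRY_RUN=1 is set, execute it.
run_step() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Run corpus encoding, query encoding, and search in order.
# The && chain stops at the first step that fails.
run_msmarco() {
  output_path="$1"
  model_name="$2"
  run_step bash msmarco/encode_corpus.sh "$output_path" "$model_name" &&
    run_step bash msmarco/encode_queries.sh "$output_path" "$model_name" &&
    run_step bash msmarco/search.sh "$output_path"
}
```

Calling `DRY_RUN=1 run_msmarco <output_path> <model_name>` prints the three commands without running them, which is a cheap way to check the arguments before starting a long encoding job.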
To reproduce the BEIR experiments you can either use the batch method (running all models):
bash scripts/beir/matrix_of_corpus.sh
bash scripts/beir/matrix_of_prompts.sh
bash scripts/beir/search_all_prompts.sh <output_path>
Or run just one model with:
bash beir/run_all.sh <model_name> <output_nickname>
bash beir/run_all_prompts.sh <model_name> <output_nickname>
bash beir/search_all_prompts.sh <output_path>
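When running the single-model scripts for several checkpoints, it can help to derive the output nickname from the model name. The sketch below assumes the nickname is free-form and simply drops the organization prefix. `nickname_for`, `run_beir_model`, and `DRY_RUN` are hypothetical names, not part of the repository, and the search step is left out because how `<output_path>` relates to the nickname is not specified here.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: derive an output nickname from a model id by
# dropping everything up to and including the last "/".
nickname_for() {
  echo "${1##*/}"
}

# Run both single-model BEIR scripts for one model.
# With DRY_RUN=1 the commands are printed instead of executed.
run_beir_model() {
  model_name="$1"
  nickname="$(nickname_for "$model_name")"
  for script in beir/run_all.sh beir/run_all_prompts.sh; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "bash $script $model_name $nickname"
    else
      bash "$script" "$model_name" "$nickname" || return 1
    fi
  done
}
```

For example, `DRY_RUN=1 run_beir_model samaya-ai/promptriever-llama2-7b-v1` prints the two commands with `promptriever-llama2-7b-v1` as the nickname.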
The beir/bm25 subfolder contains scripts for BM25 baseline experiments, using BM25S.
To train a Promptriever model, you can use the scripts in scripts/training/*:
bash scripts/training/train.sh <output_name> <dataset_name> <gpu_ids> <port>
Available training scripts:
train_instruct.sh (Llama 2)
train_instruct_llama3_instruct.sh
train_instruct_llama3.sh
train_instruct_mistral_v1.sh
train_instruct_mistral.sh (v0.3)
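As an illustration of the four positional arguments in the training command, the sketch below fills them with example values and prints the resulting command. All values are assumptions: the run name is arbitrary, passing the Hugging Face dataset id as `<dataset_name>` is a guess, and the comma-separated GPU list and the port number are common conventions that the script may not follow. `train_cmd` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
set -eu

# Example values only; adjust for your machine.
OUTPUT_NAME="promptriever-repro"                 # any run name you choose
DATASET_NAME="samaya-ai/msmarco-w-instructions"  # assumed: dataset id from the table above
GPU_IDS="0,1,2,3"                                # assumed: comma-separated GPU indices
PORT="29500"                                     # assumed: any free port

# Hypothetical helper: build the training command as a single line of text.
train_cmd() {
  printf 'bash scripts/training/train.sh %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Print the command so it can be inspected before launching.
train_cmd "$OUTPUT_NAME" "$DATASET_NAME" "$GPU_IDS" "$PORT"
```

Printing the command first makes it easy to confirm the argument order before committing GPUs to a long run.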
Several utilities are included: scripts to symlink corpus files (avoiding duplicate storage during dev-set optimization), to upload models to Hugging Face, and to filter out bad instruction-negatives.
utils/symlink_dev.sh and utils/symlink_msmarco.sh: Optimize storage usage
utils/upload_to_hf_all.py and utils/upload_to_hf.py: Upload models to Hugging Face Hub
utils/validate_all_present.py: Validate dataset completeness
filtering/filter_query_doc_pairs_from_batch_gpt.py: Implement advanced filtering using GPT model outputs
If you found the code, data, or models useful, feel free to cite:
@article{weller2024promptriever,
title={Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models},
author={Orion Weller and Benjamin Van Durme and Dawn Lawrie and Ashwin Paranjape and Yuhao Zhang and Jack Hessel},
year={2024},
eprint={2409.11136},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2409.11136},
}