This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections. Details about our model can be found in the following paper:
Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.
For uniCOIL, we make the corpus (sparse vectors) as well as the pre-built indexes available to download.
This document also describes hybrid combinations with our TCT-ColBERTv2 dense retrieval model.
At present, these indexes are referenced as absolute paths on our Waterloo machine orca, so these results are not broadly reproducible. We are working on ways to distribute the indexes.
For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model or to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using the model trained on the MS MARCO (V1) passage corpus, described here.
Specifically, we applied inference over the MS MARCO V2 passage corpus and segmented document corpus to obtain the term weights.
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
We start from the corpus that has already been processed with uniCOIL, i.e., term reweighting has already been applied (there is no document expansion in this zero-shot setting). As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).
Download the sparse representation of the corpus generated by uniCOIL:
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar
tar -xvf collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -C collections/
To confirm, msmarco-passage-v2-unicoil-noexp-0shot-b8.tar is 24 GB and has an MD5 checksum of fcf21991103197a7e8823b0e2045aca1.
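To verify the download programmatically, here is a minimal sketch (any equivalent tool, e.g., md5sum, works just as well):

import hashlib

# Compute the MD5 of the downloaded tarball in 1 MB chunks and compare against the value above.
md5 = hashlib.md5()
with open("collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())  # expected: fcf21991103197a7e8823b0e2045aca1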
Index the sparse vectors:
python -m pyserini.index --collection JsonVectorCollection \
--input collections/msmarco-passage-v2-unicoil-noexp-0shot-b8 \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact \
--pretokenized
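For context on the --impact and --pretokenized flags: each line of the downloaded corpus should be a JSON document whose vector field maps already-tokenized terms to quantized impact weights, which is what JsonVectorCollection indexes. Below is a rough sketch for peeking at one record; the shard layout, file naming, compression, and exact keys are assumptions here, so adjust to what you actually find after extraction.

import glob, gzip, json, os

# Grab one shard file under the extracted corpus directory (layout and naming are assumptions).
files = [p for p in glob.glob("collections/msmarco-passage-v2-unicoil-noexp-0shot-b8/**/*", recursive=True)
         if os.path.isfile(p)]
path = sorted(files)[0]
opener = gzip.open if path.endswith(".gz") else open
with opener(path, "rt") as f:
    record = json.loads(f.readline())
# Expect keys like "id", "contents", and "vector" (term -> quantized impact weight).
print(record["id"], sorted(record["vector"].items(), key=lambda kv: -kv[1])[:5])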
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-passage-unicoil-noexp-0shot in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search --topics msmarco-v2-passage-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
--output runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt \
--impact \
--hits 1000 \
--batch 144 \
--threads 36 \
--min-idf 1
To evaluate, use trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt
Results:
map all 0.1306
recip_rank all 0.1314
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt
Results:
recall_100 all 0.4964
recall_1000 all 0.7013
Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
We start from the corpus that has already been processed with uniCOIL, i.e., term reweighting has already been applied (there is no document expansion in this zero-shot setting). As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).
Download the sparse representation of the corpus generated by uniCOIL:
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar
tar -xvf collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -C collections/
To confirm, msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar is 54 GB and has an MD5 checksum of af54061ab5c2ce6cf90a1e60fd92924c.
Index the sparse vectors:
python -m pyserini.index --collection JsonVectorCollection \
--input collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8 \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact \
--pretokenized
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-per-passage-unicoil-noexp-0shot in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search --topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
--output runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt \
--impact \
--hits 10000 \
--batch 144 \
--threads 36 \
--max-passage-hits 1000 \
--max-passage \
--min-idf 1
For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
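If it helps to see this spelled out, below is a minimal sketch of the MaxP aggregation that --max-passage performs, assuming segment ids of the form <docid>#<segment> (the id convention is an assumption); pyserini.search already does this for you, so the sketch is purely illustrative.

from collections import defaultdict

def maxp(segment_hits, k=1000):
    """Collapse (segment_id, score) pairs for one query to document level via max-pooling."""
    best = defaultdict(lambda: float("-inf"))
    for seg_id, score in segment_hits:
        docid = seg_id.split("#")[0]          # assumed "<docid>#<segment>" id convention
        best[docid] = max(best[docid], score)
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]   # mirrors --max-passage-hits

# Example: two segments of the same document collapse to one entry with the higher score.
print(maxp([("msmarco_doc_00_0#1", 12.0), ("msmarco_doc_00_0#3", 15.5), ("msmarco_doc_00_7#0", 9.2)]))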
To evaluate, use trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
Results:
map all 0.2012
recip_rank all 0.2032
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
Results:
recall_100 all 0.7190
recall_1000 all 0.8813
We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
Because there are duplicate passages in the MS MARCO V2 collections, score differences might be observed due to tie-breaking effects. For example, if we output in MS MARCO format (--output-format msmarco) and then convert to TREC format with pyserini.eval.convert_msmarco_run_to_trec_run, the scores will be slightly different.
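As a toy illustration (ids and scores below are made up): with two duplicate passages sharing the same score, the relevant one can land at rank 1 or rank 2 depending on how the tie is broken, which shifts rank-sensitive metrics such as MRR.

hits = [("dupA", 7.5), ("dupB", 7.5)]          # duplicate passages, identical scores
relevant = {"dupB"}                            # only one of them is judged relevant
for ordering in (hits, list(reversed(hits))):  # two equally valid tie orderings
    rank = next(i for i, (docid, _) in enumerate(ordering, start=1) if docid in relevant)
    print(f"reciprocal rank = {1 / rank:.2f}") # prints 0.50, then 1.00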
Dense-sparse hybrid retrieval on the passage corpus (uniCOIL zero-shot + TCT_ColBERT_v2 zero-shot):
python -m pyserini.hsearch dense --index /store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-v2-passage-augmented \
--encoder castorini/tct_colbert-v2-hnp-msmarco \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.46 --normalization \
run --topics collections/passv2_dev_queries.tsv \
--output runs/run.msmarco-v2-passage.tct_v2+unicoil-noexp.0shot.top1k.dev1.trec \
--batch-size 72 --threads 72 \
--output-format trec
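For intuition about the fusion --alpha ... --normalization step: each run is normalized per query and the two are combined with a weight controlled by alpha. The sketch below shows the general min-max-normalize-then-interpolate idea with made-up scores; it is not claimed to match pyserini.hsearch's exact scoring convention.

def min_max(scores):
    """Per-query min-max normalization of a {docid: score} dict."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo + 1e-9) for d, s in scores.items()}

def fuse(dense, sparse, alpha):
    """Weighted combination of normalized dense and sparse scores (illustrative convention)."""
    dense, sparse = min_max(dense), min_max(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
            for d in set(dense) | set(sparse)}

# Example with made-up scores for one query:
print(fuse({"p1": 101.2, "p2": 99.8}, {"p1": 14.0, "p3": 18.5}, alpha=0.46))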
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -m map -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-v2-passage.tct_v2+unicoil-noexp.0shot.top1k.dev1.trec
Results:
map all 0.1823
recip_rank all 0.1835
recall_10 all 0.3373
recall_100 all 0.6375
recall_1000 all 0.8620
Dense-sparse hybrid retrieval on the passage corpus (uniCOIL zero-shot + TCT_ColBERT_v2 trained):
python -m pyserini.hsearch dense --index /store/scratch/j587yang/project/trec_2021/indexes/dl2021/passage/title_headings_body/tct_colbert-v2-hnp-msmarco-hn-msmarcov2-full \
--encoder /store/scratch/j587yang/project/trec_2021/checkpoints/torch_ckpt/tct_colbert-v2-hnp-msmarco-hn-msmarcov2 \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.29 --normalization \
run --topics collections/passv2_dev_queries.tsv \
--output runs/run.msmarco-v2-passage.tct_v2-trained+unicoil-noexp-0shot.top1k.dev1.trec \
--batch-size 72 --threads 72 \
--output-format trec
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -m map -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-v2-passage.tct_v2-trained+unicoil-noexp-0shot.top1k.dev1.trec
Results:
map all 0.2265
recip_rank all 0.2283
recall_10 all 0.3964
recall_100 all 0.6701
recall_1000 all 0.8748
Dense-sparse hybrid retrieval on the segmented document corpus (uniCOIL zero-shot + TCT_ColBERT_v2 zero-shot):
python -m pyserini.hsearch dense --index /store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-v2-doc-segmented \
--encoder castorini/tct_colbert-v2-hnp-msmarco \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.56 --normalization \
run --topics collections/docv2_dev_queries.tsv \
--output runs/run.msmarco-document-v2-segmented.tct_v2+unicoil_noexp.0shot.maxp.top100.dev1.trec \
--batch-size 72 --threads 72 \
--max-passage \
--max-passage-hits 100 \
--output-format trec
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -m map -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_v2+unicoil_noexp.0shot.maxp.top100.dev1.trec
Results:
map all 0.2550
recip_rank all 0.2575
recall_10 all 0.5051
recall_100 all 0.8082
Dense-sparse hybrid retrieval on the segmented document corpus (uniCOIL zero-shot + TCT_ColBERT_v2 trained):
python -m pyserini.hsearch dense --index /store/scratch/j587yang/project/trec_2021/indexes/dl2021/document/title_headings_body/tct_colbert-v2-hnp-msmarco-hn-msmarcov2-full-maxp \
--encoder /store/scratch/j587yang/project/trec_2021/checkpoints/torch_ckpt/tct_colbert-v2-hnp-msmarco-hn-msmarcov2 \
sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
--encoder castorini/unicoil-noexp-msmarco-passage \
--impact \
--min-idf 1 \
fusion --alpha 0.54 --normalization \
run --topics collections/docv2_dev_queries.tsv \
--output runs/run.msmarco-document-v2-segmented.tct_v2-trained+unicoil-noexp-0shot.maxp.top100.dev1.trec \
--batch-size 72 --threads 72 \
--max-passage \
--max-passage-hits 100 \
--output-format trec
Evaluation:
$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -m map -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_v2-trained+unicoil-noexp-0shot.maxp.top100.dev1.trec
Results:
map all 0.2945
recip_rank all 0.2970
recall_10 all 0.5389
recall_100 all 0.8128