Pyserini: uniCOIL for the MS MARCO V2 Collections

This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections. Details about our model can be found in the following paper:

Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.

For uniCOIL, we make available for download both the processed corpus (sparse vectors) and pre-built indexes.

This document also describes hybrid combinations with our TCT-ColBERTv2 dense retrieval model. At present, the dense indexes are referenced via absolute paths on our Waterloo machine orca, so those results are not broadly reproducible; we are working on ways to distribute the indexes.

Zero-Shot uniCOIL

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model or to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using the model trained on the MS MARCO (V1) passage corpus, described here.

Specifically, we applied inference over the MS MARCO V2 passage corpus and segmented document corpus to obtain the term weights.

Passage V2

You can skip the data prep and indexing steps if you use our pre-built indexes; in that case, jump directly to the retrieval command below.

We start from the corpus that has already been processed with uniCOIL, i.e., it has gone through term reweighting (since this is the noexp variant, no document expansion was applied).
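For reference, each line in the extracted corpus files is a JSON document whose vector field holds the quantized uniCOIL term weights. Here is a minimal sketch of the general shape, following Pyserini's JsonVectorCollection convention (the docid, text, and weights below are made up for illustration):

import json

# Illustrative only: the docid, text, and term weights are made up.
# JsonVectorCollection expects one JSON object per line, with the impact
# weights stored as a term -> weight map under 'vector'.
doc = {
    'id': 'msmarco_passage_00_0',      # hypothetical docid
    'contents': 'passage text ...',    # original passage text
    'vector': {'presence': 38, 'communication': 51, 'amid': 27},
}
print(json.dumps(doc))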

Download the sparse representation of the corpus generated by uniCOIL:

# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/a29gEzyXrK5NG4o/download -O collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar

tar -xvf collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar -C collections/

To confirm, msmarco-passage-v2-unicoil-noexp-0shot-b8.tar is 24 GB and has an MD5 checksum of fcf21991103197a7e8823b0e2045aca1.
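If you'd like to verify the download programmatically, here is a minimal Python sketch; the expected digest is the one quoted above:

import hashlib

# Compute the MD5 of the downloaded tarball in 1 MB chunks and compare it
# against the published checksum.
def md5sum(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

assert md5sum('collections/msmarco-passage-v2-unicoil-noexp-0shot-b8.tar') == \
    'fcf21991103197a7e8823b0e2045aca1'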

Index the sparse vectors:

python -m pyserini.index --collection JsonVectorCollection \
                         --input collections/msmarco-passage-v2-unicoil-noexp-0shot-b8 \
                         --index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
                         --generator DefaultLuceneDocumentGenerator \
                         --threads 32 \
                         --impact \
                         --pretokenized

If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-passage-unicoil-noexp-0shot in the command below.

Sparse retrieval with uniCOIL:

python -m pyserini.search --topics msmarco-v2-passage-dev \
                          --encoder castorini/unicoil-noexp-msmarco-passage \
                          --index indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot \
                          --output runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt \
                          --impact \
                          --hits 1000 \
                          --batch 144 \
                          --threads 36 \
                          --min-idf 1
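The same retrieval can also be performed programmatically. Below is a minimal sketch using Pyserini's Python API; the class and parameter names follow recent Pyserini releases (in older versions the impact searcher lived under a different name), so treat this as a sketch rather than a pinned recipe:

from pyserini.search.lucene import LuceneImpactSearcher

# Query-side uniCOIL encoding plus impact-based retrieval over the index
# built above; min_idf mirrors --min-idf 1 in the CLI invocation.
searcher = LuceneImpactSearcher(
    'indexes/lucene-index.msmarco-v2-passage.unicoil-noexp-0shot',
    'castorini/unicoil-noexp-msmarco-passage',
    min_idf=1)

hits = searcher.search('how do planes fly', k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:25} {hit.score:.4f}')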

To evaluate, use trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt
Results:
map                   	all	0.1306
recip_rank            	all	0.1314

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev runs/run.msmarco-v2-passage.unicoil-noexp.0shot.txt
Results:
recall_100            	all	0.4964
recall_1000           	all	0.7013

Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
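To make the cutoff concrete, here is a toy illustration of MRR@100 (trec_eval remains the authoritative scorer): for each query, take the reciprocal rank of the first relevant document within the top 100 hits, or 0 if none appears, then average over queries.

# Toy MRR@k: run maps qid -> ranked list of docids; qrels maps qid -> set
# of relevant docids. Not a replacement for trec_eval.
def mrr_at_k(run, qrels, k=100):
    total = 0.0
    for qid, docids in run.items():
        rr = 0.0
        for rank, docid in enumerate(docids[:k], start=1):
            if docid in qrels.get(qid, set()):
                rr = 1.0 / rank
                break
        total += rr
    return total / len(run)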

Document V2

You can skip the data prep and indexing steps if you use our pre-built indexes; in that case, jump directly to the retrieval command below.

We start from the corpus that has already been processed with uniCOIL, i.e., it has gone through term reweighting (as above, no document expansion was applied for this noexp variant).

Download the sparse representation of the corpus generated by uniCOIL:

# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/x5cEaM3rXnTaE7j/download -O collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar

tar -xvf collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar -C collections/

To confirm, msmarco-doc-v2-seg-unicoil-noexp-0shot-b8.tar is 54 GB and has an MD5 checksum of af54061ab5c2ce6cf90a1e60fd92924c.

Index the sparse vectors:

python -m pyserini.index --collection JsonVectorCollection \
                         --input collections/msmarco-doc-v2-seg-unicoil-noexp-0shot-b8 \
                         --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
                         --generator DefaultLuceneDocumentGenerator \
                         --threads 32 \
                         --impact \
                         --pretokenized

If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-per-passage-unicoil-noexp-0shot in the command below.

Sparse retrieval with uniCOIL:

python -m pyserini.search --topics msmarco-v2-doc-dev \
                          --encoder castorini/unicoil-noexp-msmarco-passage \
                          --index indexes/lucene-index.msmarco-doc-v2-segmented.unicoil-noexp.0shot \
                          --output runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt \
                          --impact \
                          --hits 10000 \
                          --batch 144 \
                          --threads 36 \
                          --max-passage-hits 1000 \
                          --max-passage \
                          --min-idf 1

For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and then apply MaxP aggregation, scoring each document by its highest-scoring segment, to obtain the top 1000 documents.
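Here is a minimal sketch of that MaxP step, assuming segment ids of the form <docid>#<segment>, the convention used in the segmented MS MARCO V2 corpus:

from collections import defaultdict

# Collapse a ranked list of (segment_id, score) pairs into a document
# ranking by keeping each document's best segment score.
def maxp(segment_hits, k=1000):
    best = defaultdict(lambda: float('-inf'))
    for segid, score in segment_hits:
        docid = segid.split('#')[0]
        best[docid] = max(best[docid], score)
    return sorted(best.items(), key=lambda x: x[1], reverse=True)[:k]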

To evaluate, use trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
Results:
map                   	all	0.2012
recip_rank            	all	0.2032

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev runs/run.msmarco-doc-v2-segmented.unicoil-noexp.0shot.txt
Results:
recall_100            	all	0.7190
recall_1000           	all	0.8813

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

Zero-Shot uniCOIL + Dense Retrieval Hybrid

Because there are duplicate passages in the MS MARCO V2 collections, score differences may be observed due to tie-breaking effects. For example, if we output in MS MARCO format (--output-format msmarco) and then convert to TREC format with pyserini.eval.convert_msmarco_run_to_trec_run, the scores will differ.
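Concretely, the MS MARCO run format records only (qid, docid, rank), while the TREC format also carries the retrieval score; converting from the former back to the latter therefore has to synthesize scores, and ties among duplicate passages may be broken differently. The toy writers below illustrate the two formats (they are hypothetical helpers, not Pyserini APIs):

# Hypothetical helpers for illustration; not part of Pyserini.
def trec_line(qid, docid, rank, score, tag='run'):
    # TREC format keeps the retrieval score, so ties remain visible.
    return f'{qid} Q0 {docid} {rank} {score} {tag}'

def msmarco_line(qid, docid, rank):
    # MS MARCO format drops the score; converting back to TREC format must
    # invent one (e.g., from the rank), so tied duplicates can reorder.
    return f'{qid}\t{docid}\t{rank}'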

Passage V2

Dense-sparse hybrid retrieval (uniCOIL zero-shot + TCT_ColBERT_v2 zero-shot):

python -m pyserini.hsearch   dense  --index /store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-v2-passage-augmented \
                                    --encoder castorini/tct_colbert-v2-hnp-msmarco \
                             sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
                                    --encoder castorini/unicoil-noexp-msmarco-passage \
                                    --impact \
                                    --min-idf 1 \
                             fusion --alpha 0.46 --normalization \
                             run    --topics collections/passv2_dev_queries.tsv \
                                    --output runs/run.msmarco-v2-passage.tct_v2+unicoil-noexp.0shot.top1k.dev1.trec \
                                    --batch-size 72 --threads 72 \
                                    --output-format trec
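For intuition, here is a sketch of interpolation-based fusion with min-max normalization. The exact weighting and normalization conventions inside pyserini.hsearch may differ, so the formula below, with alpha weighting the dense side, is an assumption for illustration only:

# Fuse per-query dense and sparse results, each a docid -> score dict.
def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0
            for d, s in scores.items()}

def fuse(dense, sparse, alpha=0.46, k=1000):
    dense, sparse = min_max(dense), min_max(sparse)
    fused = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
             for d in set(dense) | set(sparse)}
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]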

Evaluation:

$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -mmap -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-v2-passage.tct_v2+unicoil-noexp.0shot.top1k.dev1.trec
Results:
map                   	all	0.1823
recip_rank            	all	0.1835
recall_10             	all	0.3373
recall_100            	all	0.6375
recall_1000           	all	0.8620

Dense-sparse hybrid retrieval (uniCOIL zero-shot + TCT_ColBERT_v2 trained):

python -m pyserini.hsearch   dense  --index /store/scratch/j587yang/project/trec_2021/indexes/dl2021/passage/title_headings_body/tct_colbert-v2-hnp-msmarco-hn-msmarcov2-full \
                                    --encoder /store/scratch/j587yang/project/trec_2021/checkpoints/torch_ckpt/tct_colbert-v2-hnp-msmarco-hn-msmarcov2 \
                             sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-passage \
                                    --encoder castorini/unicoil-noexp-msmarco-passage \
                                    --impact \
                                    --min-idf 1 \
                             fusion --alpha 0.29 --normalization \
                             run    --topics collections/passv2_dev_queries.tsv \
                                    --output runs/run.msmarco-v2-passage.tct_v2-trained+unicoil-noexp-0shot.top1k.dev1.trec \
                                    --batch-size 72 --threads 72 \
                                    --output-format trec

Evaluation:

$ python -m pyserini.eval.trec_eval -c -m recall.10,100,1000 -mmap -m recip_rank collections/passv2_dev_qrels.tsv runs/run.msmarco-v2-passage.tct_v2-trained+unicoil-noexp-0shot.top1k.dev1.trec
Results:
map                   	all	0.2265
recip_rank            	all	0.2283
recall_10             	all	0.3964
recall_100            	all	0.6701
recall_1000           	all	0.8748

Document V2

Dense-sparse hybrid retrieval (uniCOIL zero-shot + TCT_ColBERT_v2 zero-shot):

python -m pyserini.hsearch   dense  --index /store/scratch/indexes/trec2021/faiss-flat.tct_colbert-v2-hnp.0shot.msmarco-v2-doc-segmented \
                                    --encoder castorini/tct_colbert-v2-hnp-msmarco \
                             sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
                                    --encoder castorini/unicoil-noexp-msmarco-passage \
                                    --impact \
                                    --min-idf 1 \
                             fusion --alpha 0.56 --normalization \
                             run    --topics collections/docv2_dev_queries.tsv \
                                    --output runs/run.msmarco-document-v2-segmented.tct_v2+unicoil_noexp.0shot.maxp.top100.dev1.trec \
                                    --batch-size 72 --threads 72 \
                                    --max-passage \
                                    --max-passage-hits 100 \
                                    --output-format trec

Evaluation:

$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -mmap -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_v2+unicoil_noexp.0shot.maxp.top100.dev1.trec
Results:
map                   	all	0.2550
recip_rank            	all	0.2575
recall_10             	all	0.5051
recall_100            	all	0.8082

Dense-sparse hybrid retrieval (uniCOIL zero-shot + TCT_ColBERT_v2 trained):

python -m pyserini.hsearch   dense  --index /store/scratch/j587yang/project/trec_2021/indexes/dl2021/document/title_headings_body/tct_colbert-v2-hnp-msmarco-hn-msmarcov2-full-maxp \
                                    --encoder /store/scratch/j587yang/project/trec_2021/checkpoints/torch_ckpt/tct_colbert-v2-hnp-msmarco-hn-msmarcov2 \
                             sparse --index /store/scratch/indexes/trec2021/lucene.unicoil-noexp.0shot.msmarco-v2-doc-segmented \
                                    --encoder castorini/unicoil-noexp-msmarco-passage \
                                    --impact \
                                    --min-idf 1 \
                             fusion --alpha 0.54 --normalization \
                             run    --topics collections/docv2_dev_queries.tsv \
                                    --output runs/run.msmarco-document-v2-segmented.tct_v2-trained+unicoil-noexp-0shot.maxp.top100.dev1.trec \
                                    --batch-size 72 --threads 72 \
                                    --max-passage \
                                    --max-passage-hits 100 \
                                    --output-format trec

Evaluation:

$ python -m pyserini.eval.trec_eval -c -m recall.10,100 -mmap -m recip_rank collections/docv2_dev_qrels.tsv runs/run.msmarco-document-v2-segmented.tct_v2-trained+unicoil-noexp-0shot.maxp.top100.dev1.trec
Results:
map                   	all	0.2945
recip_rank            	all	0.2970
recall_10             	all	0.5389
recall_100            	all	0.8128

Reproduction Log*