
Question on some configs for reproducing #5

Open
cadurosar opened this issue Apr 1, 2023 · 9 comments
Labels
question Further information is requested

Comments

@cadurosar
Contributor

Hi,

Amazing work! I was trying to replicate the 4a and 4b experiments, but it seems that their configs are duplicated. Could you help me with this?

| Experiment | ID | Before | After |
|---|---|---|---|
| Query expansion | 4a | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml | lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |
| Query expansion | 4b | lsr/configs/experiment/unicoil_tilde_msmarco_distil.yaml | lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml |

Thanks!

@thongnt99
Owner

thongnt99 commented Apr 3, 2023

Hi @cadurosar I updated the configurations. Thanks for pointing out the duplication.

I also want to note that we performed an additional step of length matching against the full Splade model (splade_msmarco_distil_flops_0.1_0.08.yaml) to remove the discrepancy caused by differences in representation length.

Let's assume that this full Splade model generates NQ(q) query terms for a query q and ND(d) document terms for a document d.

After training 4a (Before: splade_asm_qmlp_msmarco_distil_flops_0.0_0.08.yaml), we pruned the documents so that they are no longer than the above ND(d).

Similarly, after training 4b (After: lsr/configs/experiment/splade_asm_dmlp_msmarco_distil.yaml), we pruned the output queries so that they are no longer than the above NQ(q).
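
Concretely, the per-item pruning is along these lines (just a sketch to illustrate the idea, not the exact code in the repo; a representation here is a vocabulary-sized weight vector, and `ref_len` is the ND(d) or NQ(q) produced by the full Splade model for the same document or query):

```python
import torch

def prune_to_length(term_weights: torch.Tensor, ref_len: int) -> torch.Tensor:
    """Keep only the ref_len highest-weight terms of a |V|-dimensional
    sparse representation; all other entries are zeroed out."""
    nnz = int((term_weights > 0).sum())
    if nnz <= ref_len:
        return term_weights  # already no longer than the reference
    topk = torch.topk(term_weights, k=ref_len)
    pruned = torch.zeros_like(term_weights)
    pruned[topk.indices] = topk.values
    return pruned
```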

The length difference arises because dropping the FLOPs regularizer on one side (e.g., the query) changes the resulting sparsity on the other side (e.g., the document). For example, the documents produced by [Splade, FLOPs(0.1, 0.08)] are unlikely to be as sparse as the documents produced by [Splade, FLOPs(0.0, 0.08)]. The same may happen with the queries of [Splade, FLOPs(0.1, 0.08)] and [Splade, FLOPs(0.1, 0.0)].
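
For reference, the two numbers in FLOPs(λ_q, λ_d) are the regularization weights on the query and document sides; the FLOPS regularizer itself follows the Splade formulation (the loss below is a sketch of the standard form, with the ranking/distillation loss written generically):

```latex
\mathcal{L} \;=\; \mathcal{L}_{\text{rank}}
\;+\; \lambda_q\, \ell_{\text{FLOPS}}^{(q)}
\;+\; \lambda_d\, \ell_{\text{FLOPS}}^{(d)},
\qquad
\ell_{\text{FLOPS}} \;=\; \sum_{j \in V} \Big( \tfrac{1}{N} \sum_{i=1}^{N} w_j^{(i)} \Big)^{2}
```

where w_j^(i) is the weight of vocabulary term j in the i-th representation of a batch of size N. Setting λ_q = 0 removes the pressure toward sparse queries, but because both sides are trained jointly it also shifts how sparse the documents end up, hence the extra length matching.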

I hope this helps, and I'm happy to have further discussion.

@cadurosar
Contributor Author

Thanks a lot for fixing the configs. I will look into this as soon as possible, but I believe they should now be sufficient for reproduction :)

And yes, the problem that sparsity depends not only on the FLOPS weight you set for a given modality (query/doc) but also on the relation between the two is something we saw as well. However, I am not sure I understood exactly what you did: do you do it case by case (i.e., each individual document may not be longer than it was with the previous method), or via the average (i.e., each individual document may not be longer than the previous average)? I lean toward the first, but I wanted to be sure. Also, would you have code for reproducing that part as well?
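
To make the two readings concrete (toy example, variable names are mine):

```python
# Non-zero term counts per document under the previous method (toy numbers)
ref_lengths = {"d1": 40, "d2": 120, "d3": 80}

# Reading 1 (case by case): each document is capped at its own previous length
caps_per_doc = dict(ref_lengths)                        # {'d1': 40, 'd2': 120, 'd3': 80}

# Reading 2 (via the average): every document gets the same cap, the previous mean
avg_cap = round(sum(ref_lengths.values()) / len(ref_lengths))
caps_avg = {doc_id: avg_cap for doc_id in ref_lengths}  # {'d1': 80, 'd2': 80, 'd3': 80}
```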

@thongnt99
Owner

Hi @cadurosar,

Do you have any further updates on the reproduction?

@thongnt99 thongnt99 added the question Further information is requested label Apr 24, 2023
@cadurosar
Contributor Author

cadurosar commented Apr 26, 2023

Sorry @thongnt99, the post-ECIR period has been crazier than I expected. I have been able to reproduce the results, and I'm quite surprised by what I got on BEIR, as it differs vastly from what we observed when removing query expansion. For us, removing query expansion on BEIR hurt the results, but that does not seem to be the case when using the MLP strategy.

The results are slightly worse than SPLADE++, but that is expected due to the difference in MLM backbone (DistilBERT vs. coCondenser); I still need to test them on equal footing. Compared to COIL CR, for example, the results are good.

| Retriever Type | PP | CoilCR | 4a Before | 4a After |
|---|---|---|---|---|
| Average (13) | 50.7 | 47.3 | 48.5 | 48.7 |
| arguana | 51.8 | 34.2 | 49.8 | 53.3 |
| climate-fever | 23.7 | 18.6 | 18.9 | 17.2 |
| dbpedia-entity | 43.6 | 37.8 | 43.3 | 43.3 |
| fever | 79.6 | 78.2 | 75.1 | 75.6 |
| fiqa | 34.9 | 31.0 | 33.2 | 32.7 |
| hotpotqa | 69.3 | 68.3 | 68.6 | 68.5 |
| nfcorpus | 34.5 | 33.8 | 33.7 | 33.8 |
| nq | 53.3 | 48.3 | 52.4 | 52.1 |
| quora | 84.9 | 77.3 | 76.4 | 77.3 |
| scidocs | 16.1 | 15.3 | 15.1 | 15.3 |
| scifact | 71.0 | 69.8 | 66.2 | 67.7 |
| trec-covid | 72.5 | 73.5 | 68.4 | 69.8 |
| webis-touche2020 | 24.2 | 28.7 | 29.3 | 26.1 |

@cadurosar
Contributor Author

If you want, I can send a PR with code for running BEIR from ir_datasets.
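
The skeleton would be roughly the following (a rough sketch of the plumbing only; `retrieve` is a placeholder for the repo's actual inference/indexing code, and the dataset name is just an example):

```python
import ir_datasets
import ir_measures
from ir_measures import nDCG

def retrieve(query_text: str) -> dict:
    """Placeholder for the repo's retriever; should return {doc_id: score}."""
    return {}

# Load one BEIR dataset through ir_datasets (e.g. SciFact test split)
dataset = ir_datasets.load("beir/scifact/test")

# Build the run: {query_id: {doc_id: score}}
run = {q.query_id: retrieve(q.text) for q in dataset.queries_iter()}

# Score it with ir_measures using the qrels shipped with the dataset
print(ir_measures.calc_aggregate([nDCG @ 10], dataset.qrels_iter(), run))
```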

@thongnt99
Owner

Great, thanks a lot for the update.
It would be awesome if you could send the pull request for evaluation on BEIR.
I can run the evaluation on other checkpoints when I have free machines.

@seanmacavaney
Contributor

Hey @cadurosar -- I'm curious about the "Reranker Type" label in the table above. Are you using these all as re-rankers? If so, is that due to pooling bias or something else?

Thanks!

@cadurosar
Contributor Author

No, it's just first-stage retrieval. This was because I used the same Excel sheet I was using for reranking but forgot to change that part... Sorry for that.

@seanmacavaney
Contributor

No worries, thanks for the clarification!
