Impact Analysis of Hard Negative Sample Rankings in ANCE, 646 Project, Fall 2024

Hojae Son*, Deepesh Suranjandass*

This repo is inspired by:

ANCE codebase [https://github.com/microsoft/ANCE/]
ANCE paper: Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

We investigate the impact of hard negative sampling strategies in the ANCE [11] (Approximate Nearest Neighbor Negative Contrastive Learning) framework by analyzing how the ranking positions of negative samples affect model convergence and generalization. We hypothesize that the hardness of the negative samples significantly influences training dynamics and model performance. Using a subset of MS MARCO, we run experiments that draw negatives from different segments of the ranking and compare their effect on training efficiency and model effectiveness.

Code Modifications

Our primary code changes are in drivers/run_ann_data_gen.py.
Below is a summary of the key logic modifications:

Select Top K Samples

The changes focus on how positive and negative samples are selected during data generation.
The following code block is part of our modifications:

if SelectTopK:
    # Default: candidates come from the top of the ANN ranking
    selected_ann_idx = list(top_ann_pid[:args.negative_sample + 1])
    count_negative_sample = args.negative_sample

    # Add bottom-ranked negative samples if enabled
    if args.bottom_neg:
        selected_ann_idx.extend(top_ann_pid[-args.negative_sample:])
        count_negative_sample = 2 * args.negative_sample

    # Use only bottom-ranked negative samples if both flags are set
    if args.bottom_neg and args.bottom_only:
        selected_ann_idx = list(top_ann_pid[-args.negative_sample:])
        # The list now holds args.negative_sample candidates; the count stays at 2 * args.negative_sample
        count_negative_sample = 2 * args.negative_sample
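To make the three selection modes easier to compare, here is a minimal standalone sketch of the same slicing logic. The function name select_candidate_ids and its keyword arguments are hypothetical and do not appear in the repo; it assumes top_ann_pid is the list of retrieved passage ids for one query, ordered from best to worst rank, and it mirrors the snippet above, including the count returned in the bottom-only case.

```python
def select_candidate_ids(top_ann_pid, negative_sample, bottom_neg=False, bottom_only=False):
    """Return (candidate passage ids, count) for one query.

    Hypothetical helper mirroring the README snippet; not part of the repo.
    """
    ranked = list(top_ann_pid)

    # Default: take the top-ranked slice (negative_sample + 1 entries).
    selected = ranked[:negative_sample + 1]
    count = negative_sample

    if bottom_neg:
        # Also append the lowest-ranked retrieved passages.
        selected.extend(ranked[-negative_sample:])
        count = 2 * negative_sample

    if bottom_neg and bottom_only:
        # Discard the top slice and keep only the lowest-ranked passages.
        # The count is left unchanged here, as in the snippet above.
        selected = ranked[-negative_sample:]

    return selected, count


if __name__ == "__main__":
    ranked = list(range(100, 120))  # toy ranking: 100 is rank 1, 119 is last
    print(select_candidate_ids(ranked, 3))
    print(select_candidate_ids(ranked, 3, bottom_neg=True))
    print(select_candidate_ids(ranked, 3, bottom_neg=True, bottom_only=True))
```

Note that bottom_only has no effect in this sketch unless bottom_neg is also set, which matches the condition in the snippet.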

Reproducing Results with 10% subset data

To download the raw dataset, see the commands/data_download.sh script.

To create the subset, see data/create_subset.py.
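The contents of data/create_subset.py are not reproduced in this README, so the following is only a sketch of one way a 10% subset could be drawn reproducibly. The function names, the seed value, and the (query id, passage id) pair format for relevance labels are all assumptions made for illustration, not the repo's actual implementation.

```python
import random


def sample_query_subset(query_ids, fraction=0.1, seed=42):
    """Return a reproducible, sorted random subset of query ids (hypothetical helper)."""
    unique_ids = sorted(set(query_ids))
    if not unique_ids:
        return []
    # Keep at least one query so a tiny input does not yield an empty subset.
    k = max(1, round(len(unique_ids) * fraction))
    rng = random.Random(seed)  # a fixed seed makes the subset repeatable
    return sorted(rng.sample(unique_ids, k))


def filter_qrels(qrels, kept_query_ids):
    """Keep only the (query id, passage id) pairs whose query is in the subset."""
    kept = set(kept_query_ids)
    return [(qid, pid) for qid, pid in qrels if qid in kept]


if __name__ == "__main__":
    all_queries = list(range(1000))
    subset = sample_query_subset(all_queries, fraction=0.1)
    toy_qrels = [(qid, qid + 5000) for qid in all_queries]
    print(len(subset), len(filter_qrels(toy_qrels, subset)))
```

Sampling at the query level and then filtering the relevance labels keeps every retained query paired with its positives.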

The logs are located in the results directory, in the following files:

warmup-26334398.out
subset_10_train-a40-26398175-ann_data_4.out
subset_10_train-a40-26398212-ann_data_4_bottom_neg.out
subset_10_train-a40-26449628-ann_data_4_random.out
subset_10_train-a40-26456768-ann_data_4_bottom_neg_only.out

Commands

Run the following scripts for the experiments:

commands/run_train_warmup.sh
commands/subset_10_train-a40.sh
commands/subset_10_train-a40_bottom_neg.sh
commands/subset_10_train-a40_bottom_neg_only.sh
commands/subset_10_train-a40_random.sh

Warm-Up MRR

NDCG Comparison
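For readers unfamiliar with the two metrics named in the section titles above, here is a minimal sketch of how reciprocal rank and binary-relevance NDCG are commonly computed for a single query. The function names and the cutoff k=10 are assumptions for illustration; the README does not state which cutoff or evaluation code produced the reported numbers.

```python
import math


def reciprocal_rank(ranked_pids, relevant_pids, k=10):
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_pids[:k], start=1):
        if pid in relevant_pids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_pids, relevant_pids, k=10):
    """NDCG with binary gains; assumes ranked_pids contains no duplicates."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_pids[:k], start=1)
        if pid in relevant_pids
    )
    # Ideal ranking places all relevant passages first.
    ideal_hits = min(len(relevant_pids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["p3", "p7", "p1"]
    relevant = {"p7"}
    print(reciprocal_rank(ranking, relevant), ndcg_at_k(ranking, relevant))
```

MRR is then the mean of the per-query reciprocal ranks, and the reported NDCG is the mean of the per-query NDCG values.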

SLURM Configurations

To reproduce the results in the results directory, execute the provided commands. The experiments were run on a SLURM cluster; for the hardware settings and configurations, see the following scripts:

drivers/10/sbatch_train_index-a40_subset_10.sh
drivers/10/sbatch_train_index-a40_subset_10_bottom_neg.sh
drivers/10/sbatch_train_index-a40_subset_10_bottom_neg_only.sh
drivers/10/sbatch_train_index-a40_subset_10_random.sh
drivers/sbatch_warmup.sh

Report

For details, see CS646_Final_Project.pdf.

Contact

For any questions or further information, please contact:

Hojae Son: [email protected]
Deepesh Suranjandass: [email protected]

goodluck-hojae/Analysis-for-Negative_Samples-on-ANCE
