Hojae Son*, Deepesh Suranjandass*
This repo is inspired from
- ANCE codebase [https://github.com/microsoft/ANCE/]
- ANCE paper Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
We investigate the impact of hard negative sampling strategies in the ANCE [11] (Approximate Nearest Neighbor Negative Contrastive Learning) framework by analyzing how the ranking positions of negative samples affect model convergence and generalization. Our hypothesis suggests that the degree of negative sample hardness significantly influences training dynamics and model performance. Using a subset of MS MARCO, we conduct experiments comparing different ranking segments for negative sampling to understand their impact on training efficiency and model effectiveness.
Our primary code changes are in drivers/run_ann_data_gen.py
.
Below is a summary of the key logic modifications:
The changes focus on how positive and negative samples are selected during data generation
The following code block is a part of our modifications:
if SelectTopK:
selected_ann_idx = list(top_ann_pid[:args.negative_sample + 1])
count_negative_sample = args.negative_sample
# Add bottom negative samples if enabled
if args.bottom_neg:
selected_ann_idx.extend(top_ann_pid[-args.negative_sample:])
count_negative_sample = 2 * args.negative_sample
# print('count_negative_sample = 2 * args.negative_sample')
# Only use bottom negative samples if both flags are set
if args.bottom_neg and args.bottom_only:
selected_ann_idx = list(top_ann_pid[-args.negative_sample:])
count_negative_sample = 2 * args.negative_sample
# print('count_negative_sample = args.negative_sample')
The logs are located in results
directory with the following files
warmup-26334398.out
subset_10_train-a40-26398175-ann_data_4.out
subset_10_train-a40-26398212-ann_data_4_bottom_neg.out
subset_10_train-a40-26449628-ann_data_4_random.out
subset_10_train-a40-26456768-ann_data_4_bottom_neg_only.out
Run the following scripts for the experiments:
commands/run_train_warmup.sh
commands/subset_10_train-a40.sh
commands/subset_10_train-a40_bottom_neg.sh
commands/subset_10_train-a40_bottom_neg_only.sh
commands/subset_10_train-a40_random.sh
To reproduce the results in the results
directory, execute the provided commands.
The experiments were conducted using a SLURM cluster, so please refer the hardware settings and configurations
drivers/10/sbatch_train_index-a40_subset_10.sh
drivers/10/sbatch_train_index-a40_subset_10_bottom_neg.sh
drivers/10/sbatch_train_index-a40_subset_10_bottom_neg_only.sh
drivers/10/sbatch_train_index-a40_subset_10_random.sh
drivers/sbatch_warmup.sh
Please refer the details, CS646_Final_Project.pdf
For any questions or further information, please contact:
- Hojae Son: [email protected]
- Deepesh Suranjandass: [email protected]