Language Models as Semantic Indexers

This repository contains the source code and datasets for Language Models as Semantic Indexers, ICML 2024.

Links

Requirements

The code is written in Python 3.8. Before running it, install the required packages with the following command (using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

LMIndexer is a self-supervised framework that learns to tokenize documents into semantic IDs.

LMIndexer can be applied to various downstream tasks, including recommendation and retrieval.

Data Preparation

Download processed data. To reproduce the results in our paper, first download the processed datasets. Then place the recommendation dataset folders under data/rec-data/{data_name} (data_name = Beauty, Sports, Toys) and the retrieval dataset folders under data/retrieval-data/{data_name} (data_name = NQ_aug, macro).
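As a quick sanity check on the layout described above, the following is a minimal sketch (not part of the repository) that reports which of the expected dataset folders are missing. The helper name `missing_dataset_dirs` and the `EXPECTED` mapping are hypothetical. The folder names are taken from this README, and the script assumes it is run from the repository root so that `data/` resolves correctly.

```python
from pathlib import Path

# Dataset folder names as listed in this README.
EXPECTED = {
    "rec-data": ["Beauty", "Sports", "Toys"],
    "retrieval-data": ["NQ_aug", "macro"],
}


def missing_dataset_dirs(root="data"):
    """Return the expected dataset folders that do not exist under `root`.

    Hypothetical helper for illustration; it only checks that the
    directories exist, not that their contents are complete.
    """
    root = Path(root)
    return [
        root / group / name
        for group, names in EXPECTED.items()
        for name in names
        if not (root / group / name).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset folders:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected dataset folders are present.")
```

Running this before training should make a misplaced or misnamed dataset folder easier to spot than a file-not-found error raised later in a script.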

Raw data & data processing. Raw data can be downloaded directly from Amazon-Recommendation, Amazon-Retrieval, NQ and MS MARCO. More details about the data processing for recommendation, product retrieval and document retrieval can be found here.

Learn Semantic IDs

The code is in SemanticID/. Please refer to the README.md here.

Downstream Tasks

The code is in downstream/. Please refer to the README.md here.

Citations

Please cite the following paper if you find the code helpful for your research.

@article{jin2023language,
  title={Language Models As Semantic Indexers},
  author={Jin, Bowen and Zeng, Hansi and Wang, Guoyin and Chen, Xiusi and Wei, Tianxin and Li, Ruirui and Wang, Zhengyang and Li, Zheng and Li, Yang and Lu, Hanqing and others},
  journal={arXiv preprint arXiv:2310.07815},
  year={2023}
}