Language Models as Semantic Indexers

This repository contains the source code and datasets for Language Models as Semantic Indexers, ICML 2024.

Links

Requirements
Overview
Data Preparation
Learn Semantic IDs
Downstream Tasks
Citations

Requirements

The code is written in Python 3.8. Before running it, install the required packages with the following command (using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

LMIndexer is a self-supervised framework that learns to tokenize documents into semantic IDs.

LMIndexer can be applied to various downstream tasks, including recommendation and retrieval.

Data Preparation

Download processed data. To reproduce the results in our paper, first download the processed datasets. Then place the recommendation dataset folders under data/rec-data/{data_name} (data_name=Beauty, Sports, Toys) and the retrieval dataset folders under data/retrieval-data/{data_name} (data_name=NQ_aug, macro).
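As a quick sanity check before training, the layout described above can be verified with a few lines of Python. This is only a sketch: the helper name `missing_dataset_dirs` and the `EXPECTED` mapping are illustrative and not part of the repository; the folder names are taken from this README, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# Expected layout, taken from the folder names listed in this README.
# The mapping and the helper below are illustrative, not part of the repo.
EXPECTED = {
    "rec-data": ["Beauty", "Sports", "Toys"],
    "retrieval-data": ["NQ_aug", "macro"],
}


def missing_dataset_dirs(root="data"):
    """Return the expected dataset folders that do not exist under `root`."""
    root = Path(root)
    return [
        root / group / name
        for group, names in EXPECTED.items()
        for name in names
        if not (root / group / name).is_dir()
    ]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset folders:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All expected dataset folders are present.")
```

The check only confirms that the folders exist; it does not validate the files inside them.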

Raw data & data processing. Raw data can be downloaded directly from Amazon-Recommendation, Amazon-Retrieval, NQ and MS MARCO. More details about the data processing for recommendation, product retrieval and document retrieval can be found here.

Learn Semantic IDs

The code is in SemanticID/. Please refer to the README.md in that directory.

Downstream Tasks

The code is in downstream/. Please refer to the README.md in that directory.

Citations

Please cite the following paper if you find the code helpful for your research:

@article{jin2023language,
  title={Language Models As Semantic Indexers},
  author={Jin, Bowen and Zeng, Hansi and Wang, Guoyin and Chen, Xiusi and Wei, Tianxin and Li, Ruirui and Wang, Zhengyang and Li, Zheng and Li, Yang and Lu, Hanqing and others},
  journal={arXiv preprint arXiv:2310.07815},
  year={2023}
}
