Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Ang Li, Dong Yu
This repository provides open-source Mixture of Depths code and the official implementation of the paper "Router-Tuning: A Simple and Effective Approach for Enabling Dynamic Depth in Transformers."
Traditional transformer models allocate a fixed amount of computation to every input token, wasting effort on tokens that need less. To address this inefficiency, Mixture of Depths (MoD) was introduced, which dynamically adjusts computational depth by skipping less important layers. While promising, current MoD approaches face two significant challenges:
- High Training Costs: Existing methods require training the entire model alongside routers, which determine which layers to skip, resulting in substantial computational overhead.
- Risk of Performance Degradation: Bypassing important layers can lead to a drop in model performance.
To overcome these challenges, we introduce Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing training cost. Additionally, we propose MindSkip (Attention with Dynamic Depths), which preserves model performance while significantly improving computational and memory efficiency.
Our approach delivers competitive results, achieving up to a 21% speedup with only a 0.2% performance drop, demonstrating its effectiveness in balancing efficiency and performance.
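To make the idea concrete, here is a minimal sketch of a router-gated attention block in the spirit of MindSkip. This is an illustrative reconstruction, not the repository's actual code: the module names, the sigmoid gate, and the skip threshold are all assumptions. A lightweight router scores each token, and tokens the router scores below the threshold effectively skip the attention computation at inference.

```python
import torch
import torch.nn as nn

class RouterGatedAttention(nn.Module):
    """Illustrative MindSkip-style block (hypothetical names/shapes):
    a per-token router gates the attention output, so low-scoring
    tokens effectively skip the layer."""

    def __init__(self, dim: int, num_heads: int = 4, threshold: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The router is the only module Router-Tuning would fine-tune.
        self.router = nn.Linear(dim, 1)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.router(x))  # (B, T, 1), soft gate keeps the router differentiable
        attn_out, _ = self.attn(x, x, x)
        if not self.training:
            # At inference, hard-skip tokens the router deems unimportant.
            gate = (gate > self.threshold).to(x.dtype)
        return x + gate * attn_out  # residual path is always kept
```

Under this sketch, Router-Tuning would freeze everything except `self.router` and fine-tune it on a small dataset, so the base model's weights, and hence its performance, are largely preserved.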
- Oct 2024: Published preprint on arXiv along with the related codebase.
```shell
conda create -n router-tuning python=3.10
conda activate router-tuning
git clone https://github.com/CASE-Lab-UMD/Router-Tuning
cd ./Router-Tuning
pip install -r requirements.txt
```
```shell
sh ./scripts/finetune_mindskip.sh
```
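The key cost saving of Router-Tuning is that only the router parameters receive gradients. A hedged sketch of what such a freeze might look like (the `"router"` naming convention and helper name are assumptions for illustration, not the repository's actual code):

```python
import torch.nn as nn

def freeze_all_but_router(model: nn.Module) -> int:
    """Freeze every parameter except those whose name contains 'router'
    (an assumed naming convention); return the trainable-parameter count."""
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = "router" in name
        if param.requires_grad:
            trainable += param.numel()
    return trainable
```

Because the routers are tiny (e.g. one linear projection per layer), the optimizer state and gradient memory shrink accordingly, which is what makes fine-tuning on a small dataset cheap.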
The evaluation code is based on EleutherAI/lm-evaluation-harness. To reproduce our results exactly, please use this version: it samples few-shot examples based on sample indices, so results do not vary with the number of processes during data-parallel inference.
```bibtex
@misc{he2024routertuningsimpleeffectiveapproach,
      title={Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers},
      author={Shwai He and Tao Ge and Guoheng Sun and Bowei Tian and Xiaoyang Wang and Ang Li and Dong Yu},
      year={2024},
      eprint={2410.13184},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.13184},
}
```
If you have any questions, please contact:
- Shwai He: [email protected]