Title | Key Words |
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [ICLR'17] | RNN-based 137B model |
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [ICLR'21] | First Transformer-MoE model |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | |
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [JMLR'22] | Top-1 gating |
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [ICML'22] | |
FastMoE: A Fast Mixture-of-Expert Training System | |
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models [PPoPP'22] | |
SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization [ATC'23] | |
Accelerating Distributed MoE Training and Inference with Lina [ATC'23] | |
Optimizing Dynamic Neural Networks with Brainstorm [OSDI'23] | |