# MoE Related

| Title | Key Words |
| --- | --- |
| Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [ICLR'17] | RNN-based 137B model |
| GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [ICLR'21] | First Transformer-MoE model |
| GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | |
| Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [JMLR'22] | Top-1 gating (see the sketch below) |
| DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [ICML'22] | |
| FastMoE: A Fast Mixture-of-Expert Training System | |
| FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-trained Models [PPoPP'22] | |
| SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization [ATC'23] | |
| Accelerating Distributed MoE Training and Inference with Lina [ATC'23] | |
| Optimizing Dynamic Neural Networks with Brainstorm [OSDI'23] | |
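
Several of the papers above center on sparse gating, most directly the top-1 routing popularized by Switch Transformers. Below is a minimal sketch of a top-1 gated MoE layer, assuming PyTorch; the class name `Top1MoE`, the dense-expert structure, and all dimensions are illustrative choices, not code from any of the listed papers, and real systems add capacity limits, load-balancing losses, and expert parallelism.

```python
# Minimal top-1 (switch-style) gating sketch. Assumes PyTorch is installed;
# names and shapes here are illustrative, not from any paper above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top1MoE(nn.Module):
    """Route each token to the single expert with the highest gate score."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_ff),
                    nn.ReLU(),
                    nn.Linear(d_ff, d_model),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)   # (num_tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)     # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                   # tokens routed to expert e
            if mask.any():
                # Scale the expert output by the gate probability so the
                # routing decision stays differentiable.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    moe = Top1MoE(d_model=16, d_ff=32, num_experts=4)
    tokens = torch.randn(8, 16)
    print(moe(tokens).shape)  # torch.Size([8, 16])
```

The loop over experts keeps the sketch readable; production systems such as those in the DeepSpeed-MoE, FastMoE, and FasterMoE papers instead batch tokens per expert and dispatch them with all-to-all communication across devices.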