Related Works


If you are interested in listing your papers here, please open an issue on FastMoE’s GitHub repository.

2023

  • SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization ATC’23
    • Adaptive hybrid parallelism that combines static and dynamic parallelization to further speed up distributed MoE model training.

2022

  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models PPoPP’22
    • Boosts the performance of FastMoE using multiple parallelization techniques.
  • BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores PPoPP’22
    • Training a 174-trillion-parameter MoE model based on FastMoE.

2021

  • FastMoE: A Fast Mixture-of-Expert Training System arXiv preprint
    • Introduction to the core FastMoE system; a conceptual sketch of the mixture-of-experts computation that it accelerates follows this entry.
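
The systems listed on this page all accelerate the same underlying computation: a gating network scores the experts for each token, only the top-k experts are run for that token, and their outputs are combined with the gate weights. Below is a minimal, self-contained PyTorch sketch of top-k gating for illustration only; it is not FastMoE’s API, and the class name NaiveMoE and its parameters are hypothetical. FastMoE and the other systems listed here replace the serial per-expert loop with fused CUDA kernels and distributed exchange of tokens between workers.

import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveMoE(nn.Module):
    """Illustrative top-k gated mixture-of-experts layer (hypothetical name, not FastMoE's API)."""

    def __init__(self, d_model, d_hidden, num_expert, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The gate (router) scores every expert for every token.
        self.gate = nn.Linear(d_model, num_expert)
        # Each expert is an ordinary two-layer feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(num_expert)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)          # (tokens, num_expert)
        weight, index = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        out = torch.zeros_like(x)
        # Reference loop: dispatch each token to its selected experts and
        # combine the expert outputs weighted by the gate scores.
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = index[:, k] == e
                if mask.any():
                    out[mask] += weight[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = NaiveMoE(d_model=16, d_hidden=32, num_expert=4, top_k=2)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])

Running the experts one by one like this is correct but slow; the contribution of the systems on this page is to batch, schedule, and distribute exactly this sparse computation efficiently across GPUs.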

Talks about Fast(er)MoE

Other Systems

  • A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training ICS’23
  • Accelerating Distributed MoE Training and Inference with Lina ATC’23
  • Lita: Accelerating Distributed Training of Sparsely Activated Models arXiv
  • SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System arXiv
  • Optimizing Mixture of Experts using Dynamic Recompilations arXiv
  • HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System arXiv
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv
  • Tutel: An efficient mixture-of-experts implementation for large DNN model training GitHub, blog, arXiv
  • BASE Layers: Simplifying Training of Large, Sparse Models ICML’21
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding ICLR’21

MoE Paper Collections