Related Works


If you are interested in listing your papers here, please open an issue on FastMoE’s GitHub repository.

2023

  • SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization ATC’23
    • Adaptive hybrid parallelism that combines static and dynamic parallelization to further speed up distributed MoE model training.

2022

  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models PPoPP’22
    • Boosts the performance of FastMoE using multiple parallelization techniques.
  • BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores PPoPP’22
    • Training a 174-trillion-parameter MoE model based on FastMoE.

2021

  • FastMoE: A Fast Mixture-of-Expert Training System arXiv preprint
    • Introduction to the core FastMoE system; a conceptual sketch of the mixture-of-experts computation that it accelerates follows this entry.
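
The systems listed on this page all accelerate the same underlying computation: a gating network scores the experts for each token, only the top-k experts are run for that token, and their outputs are combined with the gate weights. Below is a minimal, self-contained PyTorch sketch of top-k gating for illustration only; it is not FastMoE’s API, and the class name NaiveMoE and its parameters are hypothetical. FastMoE and the other systems listed here replace the serial per-expert loop with fused CUDA kernels and distributed exchange of tokens between workers.

import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveMoE(nn.Module):
    """Illustrative top-k gated mixture-of-experts layer (hypothetical name, not FastMoE's API)."""

    def __init__(self, d_model, d_hidden, num_expert, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The gate (router) scores every expert for every token.
        self.gate = nn.Linear(d_model, num_expert)
        # Each expert is an ordinary two-layer feed-forward block.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(num_expert)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)          # (tokens, num_expert)
        weight, index = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        out = torch.zeros_like(x)
        # Reference loop: dispatch each token to its selected experts and
        # combine the expert outputs weighted by the gate scores.
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = index[:, k] == e
                if mask.any():
                    out[mask] += weight[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = NaiveMoE(d_model=16, d_hidden=32, num_expert=4, top_k=2)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])

Running the experts one by one like this is correct but slow; the contribution of the systems on this page is to batch, schedule, and distribute exactly this sparse computation efficiently across GPUs.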

Talks about Fast(er)MoE

Other Systems

  • A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training ICS’23
  • Accelerating Distributed MoE Training and Inference with Lina ATC’23
  • Lita: Accelerating Distributed Training of Sparsely Activated Models arXiv
  • SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System arXiv
  • Optimizing Mixture of Experts using Dynamic Recompilations arXiv
  • HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System arXiv
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv
  • Tutel: An efficient mixture-of-experts implementation for large DNN model training GitHub, blog, arXiv
  • BASE Layers: Simplifying Training of Large, Sparse Models ICML’21
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding ICLR’21

MoE Paper Collections