BalanceMoE: An Efficient Dynamic Load Balance Framework to Accelerate Mixture-of-Expert Training
Journal article   Peer reviewed

Yunqi Gao, Bing Hu, Mahdi Boloursaz Mashhadi, Wei Wang, Pei Xiao, Rahim Tafazolli and Merouane Debbah
IEEE Transactions on Cloud Computing, Early Access
30/04/2026

Abstract

Keywords: Distributed Deep Learning; Mixture-of-Expert; Expert Parallelism; Load Balance; MoE-GPT

Sparsely-activated Mixture-of-Expert (MoE) techniques allow the parameter counts of pre-trained models to scale to the trillion level without increasing computational cost. However, in large-scale cloud computing environments, the dynamic load imbalance caused by samples' random expert selection poses a major challenge to distributed training efficiency. To address this challenge, we propose BalanceMoE, a lightweight dynamic load-balancing framework with low communication overhead that accelerates MoE model training. BalanceMoE is based on two novel ideas. First, we model a worker-pair-based expert transfer mechanism that weighs the cost of communicating expert parameters against the resulting reduction in iteration time. We perform a theoretical analysis and design a highly lightweight algorithm that obtains a near-optimal load-balancing solution for per-iteration time reduction. Second, we propose a scheme that parallelizes expert computation and transfer, overlapping the parameter communication of transferred experts with the computation of non-transferred experts to reduce per-iteration training time.
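The communication-versus-benefit tradeoff behind the worker-pair transfer mechanism can be illustrated with a toy greedy sketch. This is not the paper's algorithm: the function `rebalance`, the unit-load transfer granularity, and the scalar `comm_cost` are illustrative assumptions.

```python
def rebalance(loads, comm_cost):
    """Toy greedy worker-pair rebalancing: repeatedly move one unit of load
    from the most-loaded worker to the least-loaded worker, but only while
    the per-iteration time saved exceeds the expert-parameter
    communication cost of the transfer."""
    loads = list(loads)
    transfers = []
    while True:
        src = max(range(len(loads)), key=lambda i: loads[i])
        dst = min(range(len(loads)), key=lambda i: loads[i])
        if loads[src] - loads[dst] <= 1:
            break  # already balanced; no transfer can help
        # Iteration time is set by the most-loaded worker; estimate the new
        # maximum load after shifting one unit from src to dst.
        rest = max((l for i, l in enumerate(loads) if i not in (src, dst)),
                   default=0)
        new_max = max(loads[src] - 1, loads[dst] + 1, rest)
        saving = loads[src] - new_max
        if saving <= comm_cost:
            break  # the transfer no longer pays for its communication cost
        loads[src] -= 1
        loads[dst] += 1
        transfers.append((src, dst))
    return loads, transfers
```

With `rebalance([8, 2, 2], comm_cost=0.5)` the load evens out to `[4, 4, 4]` after four transfers, while a high `comm_cost` leaves the loads untouched, reflecting the case where moving an expert's parameters costs more time than the balancing saves.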

We implement BalanceMoE on top of the PyTorch framework. Extensive experiments on two clusters demonstrate that, in terms of training speed, BalanceMoE achieves up to 1.26x, 1.79x, and 2.62x speedup over the state-of-the-art SmartMoE, FasterMoE, and FastMoE, respectively. In terms of memory usage, BalanceMoE saves up to 71% and 36% of memory compared to FasterMoE and FastMoE, respectively. In terms of energy consumption, BalanceMoE saves up to 13% of the energy consumed per training iteration compared to SmartMoE. BalanceMoE's code is available at https://github.com/ZJU-CNLAB/BalanceMoE.

PDF: Author's Accepted Manuscript (9.83 MB). Access restricted; access may be granted on request. This file will be open access upon publication under CC BY 4.0.

