Abstract
The sparsely-activated Mixture-of-Experts (MoE) technique enables scaling pre-trained models to trillions of parameters without a proportional increase in computational cost. However, in large-scale cloud computing environments, the dynamic load imbalance caused by the random expert selection of samples poses a major challenge to distributed training efficiency. To address this challenge, we propose BalanceMoE, a lightweight dynamic load-balancing framework with low communication overhead that accelerates MoE model training. BalanceMoE is built on two novel ideas. First, we model a worker-pair-based expert transfer mechanism that balances the cost of communicating expert parameters against the resulting reduction in iteration time. Based on a theoretical analysis, we design a highly lightweight algorithm that obtains a near-optimal load-balancing solution for reducing per-iteration time. Second, we propose a scheme that parallelizes expert computation and transfer, overlapping the parameter communication of transferred experts with the computation of non-transferred experts to further reduce per-iteration training time.
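To illustrate the second idea, the following minimal PyTorch sketch (not the authors' released implementation) overlaps asynchronous broadcasts of transferred experts' parameters with the computation of non-transferred experts. The function and argument names (forward_with_overlap, local_experts, src_ranks, inputs_per_expert) are hypothetical, and the sketch assumes an already-initialized torch.distributed process group.

```python
# Illustrative sketch only: overlap expert-parameter communication with
# computation of non-transferred experts via non-blocking collectives.
import torch
import torch.distributed as dist

def forward_with_overlap(local_experts, transferred_experts, inputs_per_expert, src_ranks):
    # 1. Start non-blocking parameter broadcasts for experts being moved to this worker.
    handles = []
    for expert, src in zip(transferred_experts, src_ranks):
        for p in expert.parameters():
            handles.append(dist.broadcast(p.data, src=src, async_op=True))

    # 2. Compute the non-transferred (local) experts while communication is in flight.
    outputs = [expert(x) for expert, x in zip(local_experts, inputs_per_expert["local"])]

    # 3. Wait for the parameter transfers to finish, then compute the transferred experts.
    for h in handles:
        h.wait()
    outputs += [expert(x) for expert, x in
                zip(transferred_experts, inputs_per_expert["transferred"])]
    return outputs
```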
We implement BalanceMoE on top of the PyTorch framework. Extensive experiments on two clusters demonstrate that, in terms of training speed, BalanceMoE achieves up to 1.26x, 1.79x, and 2.62x speedups over the state-of-the-art SmartMoE, FasterMoE, and FastMoE, respectively. In terms of memory usage, BalanceMoE saves up to 71% and 36% of memory compared to FasterMoE and FastMoE, respectively. In terms of energy consumption, BalanceMoE reduces the energy consumed per training iteration by up to 13% compared to SmartMoE. BalanceMoE's code is available at https://github.com/ZJU-CNLAB/BalanceMoE.