Sigma-MoE-Tiny:
Towards Super-Sparse MoE Models

Microsoft Research

Sigma-MoE-Tiny has 20B total parameters, but only 0.5B are activated!

Figure 1: With a total-to-activated ratio of 40:1, Sigma-MoE-Tiny achieves the highest sparsity among open-source MoE models. Leveraging this super-high sparsity, Sigma-MoE-Tiny delivers top-tier performance at significantly lower cost.

Abstract

Sigma-MoE-Tiny is an ultra-sparse Mixture-of-Experts (MoE) language model designed for efficient scalability. It achieves the highest sparsity among open-source MoE models, featuring 20B total parameters with only 0.5B activated.

To enable such extreme sparsity, Sigma-MoE-Tiny has up to 96 experts per layer, activating just one expert per token. This design introduces severe load-balancing challenges: the standard load balancing loss becomes ineffective in the lower layers. We address this with a progressive sparsification schedule that stabilizes training while maintaining balanced expert utilization.

Sigma-MoE-Tiny is pre-trained on a high-quality corpus and further post-trained to enhance instruction-following and reasoning capabilities. Throughout the entire pipeline, training remains highly stable, with no irrecoverable loss spikes.

Despite activating only 0.5B parameters, Sigma-MoE-Tiny delivers top-tier performance compared to models of similar or even substantially larger scale. We also provide a deep dive into load balancing under extreme sparsity, offering insights for future sparse-MoE model design.

Challenges in Load Balancing

We observe that simply applying the standard load balancing loss (LBL) to an extremely sparse MoE architecture reveals a significant drawback: routing collapse in the lower layers. In Layer 0, the min-loaded expert gets almost no tokens, while the max-loaded expert receives nearly 3\(\times\) the tokens expected under uniform allocation (Figure 2a).

Looking more closely at how LBL behaves, we find that the token allocation fractions \(f\) and gating probabilities \(p\) exhibit different patterns in lower versus higher layers. In higher layers (Layer 52), \(f\) becomes roughly uniform while \(p\) remains non-uniform (Figure 2c). In lower layers (Layer 0), the opposite happens (Figure 2b): \(p\) moves toward uniformity but \(f\) stays uneven, which breaks the intended load balance. Despite this, the LBL still reaches its minimum.

Figure 2: (a) Relative deviation from uniform token allocation in Layer 0. (b) and (c) show the distribution of token allocation fraction \(f\) and gating probability \(p\) across all experts in Layer 0 and Layer 52, respectively.

We attribute this deficiency to an inherent characteristic of the LBL itself. The loss is intended to push the token allocation fractions \(f\) toward a uniform distribution, yet it can in fact reach its minimum by making either \(f\) or \(p\) uniform. Under high sparsity, routing tokens in the lower layers becomes more difficult, so the optimization takes a shortcut: it drives \(p\) toward uniformity, settling into an unintended minimum that fails to achieve true load balance.
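To make this shortcut concrete, consider the LBL formulation that is standard in the literature (e.g., Switch Transformer), \(\mathrm{LBL} = N_E \sum_{i=1}^{N_E} f_i \, p_i\), where \(f_i\) is the fraction of tokens dispatched to expert \(i\) and \(p_i\) is its mean gating probability; we assume the loss used here follows this common form. Since \(\sum_i f_i = \sum_i p_i = 1\), the loss already hits its minimum of 1 whenever either factor is uniform, regardless of the other. A minimal numerical sketch:

```python
import numpy as np

N_E = 4  # tiny expert count, purely for illustration

def lbl(f, p):
    """Conventional load balancing loss: N_E * sum_i f_i * p_i."""
    return N_E * np.sum(f * p)

uniform = np.full(N_E, 1.0 / N_E)
skewed = np.array([0.70, 0.20, 0.05, 0.05])

# Intended minimum: the token allocation f is uniform (true load balance).
print(lbl(uniform, skewed))  # 1.0
# Unintended minimum: the gating probabilities p are uniform while f stays
# skewed, which mirrors the lower-layer behavior observed in Figure 2b.
print(lbl(skewed, uniform))  # 1.0, same loss value, yet no load balance
```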

To fix this, we apply a progressive sparsification schedule for Sigma-MoE-Tiny. The core idea is to start with a modest sparsity in lower layers when training from scratch and then transition to our proposed high sparsity later in the training process. This simple schedule significantly improves load balance in the lower layers, as shown in Figure 2a.
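A minimal sketch of how such a schedule could be wired into training is shown below. The layer boundary, initial top-\(k\), and switch step are illustrative assumptions; the report only prescribes the principle of relaxing sparsity in the lower layers early on and tightening it to the target sparsity later.

```python
# Illustrative progressive sparsification schedule. All constants below are
# assumptions for this sketch, not the settings used to train Sigma-MoE-Tiny.

NUM_LOWER_LAYERS = 16   # layers that start with modest sparsity (assumed)
INITIAL_TOP_K = 4       # experts activated per token early in training (assumed)
TARGET_TOP_K = 1        # final super-sparse setting: one expert per token
SWITCH_STEP = 100_000   # step at which the lower layers tighten (assumed)

def experts_per_token(layer_idx: int, step: int) -> int:
    """How many experts a token activates in a given layer at a given step."""
    if layer_idx < NUM_LOWER_LAYERS and step < SWITCH_STEP:
        return INITIAL_TOP_K  # modest sparsity keeps lower-layer routing balanced
    return TARGET_TOP_K       # target sparsity once routing has stabilized
```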

Model Performance

Pre-training Evaluation: As shown in Table 1, despite having only 0.5B active parameters, Sigma-MoE-Tiny-Base achieves strong performance across benchmarks compared to counterparts of comparable or larger scale.

Table 1: Performance comparison among Sigma-MoE-Tiny-Base and other base models.

Post-training Evaluation: Table 2 shows that post-trained Sigma-MoE-Tiny with only 0.5B activated parameters can match or surpass much larger dense and MoE models, highlighting the effectiveness of super-high MoE sparsity in enhancing generalization and reasoning efficiency.

Table 2: Performance comparison among Sigma-MoE-Tiny and baseline models.

Analysis

Effect of Progressive Sparsification Scheduling

The proposed progressive sparsification schedule not only improves expert load balance during training but also leaves performance nearly unchanged. As shown in Table 3, we compare two settings starting from the same intermediate checkpoint: one continues with the initial sparsity, and the other switches to the target sparsity. Notably, although switching to the target sparsity removes 0.15B activated parameters (approximately 25%), the resulting performance is largely preserved.

Table 3: Effect of progressive sparsification scheduling on model performance.

Comparison of Different Load Balancing Strategies

We also experimented with the auxiliary-loss-free load balancing approach adopted in DeepSeek-V3. However, we observe that under high MoE sparsity this loss-free approach can cause significant load imbalance in the lower layers. As shown in Figure 3, with this strategy the min-loaded expert consistently receives zero tokens after 2K training steps, while the max-loaded expert is allocated nearly 40\(\times\) the tokens expected under uniform allocation. Further analysis reveals that the bias terms this strategy introduces in the lower layers continually increase as training progresses, eventually dominating the gating scores; as a result, the expert with the highest bias receives the overwhelming majority of tokens.
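For reference, a minimal sketch of the bias-based mechanism is given below, following the description in the DeepSeek-V3 report: each expert carries a bias that is added to the routing scores only for expert selection, and the bias is nudged up for under-loaded experts and down for over-loaded ones after each step. The update speed, routing granularity, and tensor layout here are our own simplifications.

```python
import torch

NUM_EXPERTS = 96
GAMMA = 1e-3  # bias update speed (assumed)

bias = torch.zeros(NUM_EXPERTS)

def route(scores: torch.Tensor) -> torch.Tensor:
    """Select one expert per token; the bias shifts selection only, not gating weights."""
    return torch.argmax(scores + bias, dim=-1)  # scores: [num_tokens, NUM_EXPERTS]

def update_bias(expert_ids: torch.Tensor) -> None:
    """Raise the bias of under-loaded experts, lower it for over-loaded ones."""
    load = torch.bincount(expert_ids, minlength=NUM_EXPERTS).float()
    bias.add_(GAMMA * torch.sign(load.mean() - load))
```

Under super-high sparsity, the lower-layer biases produced by this update keep growing until they dominate the raw gating scores, at which point `route` sends almost every token to the expert with the largest bias, matching the collapse shown in Figure 3.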

Figure 3: Comparison of different load balancing strategies.

Exploring Native Load Balancing under High Sparsity

To achieve native load balancing under high sparsity without modifying the model architecture, we introduce a new LBL variant, termed Top-1 LBL. The core idea of this variant is to directly optimize the L2 norm of the token allocation fraction across all experts, theoretically avoiding the optimization bias present in conventional LBL. However, since the token allocation fraction is non-differentiable, we use the differentiable gating probabilities as an effective approximation, obtained by applying a temperature-scaled softmax to the routing logits. Formally, the Top-1 LBL is defined as:

$$\mathrm{LBL}_{\mathrm{top\text{-}1}} \;=\; \frac{N_E \sum_{i=1}^{N_E} \hat{f}_i^2}{\bar{p}_{\mathrm{top\text{-}1}}},$$

where the token allocation fraction \( \hat{f}_i \) for expert \( i \) is computed as

$$\hat{f}_i \;=\; \frac{1}{N_B}\sum_{j=1}^{N_B} p_{i,j}, \qquad p_{i,j} \;=\; \frac{\exp(\mathrm{logits}_{i,j}/\tau)}{\sum_{k=1}^{N_E}\exp(\mathrm{logits}_{k,j}/\tau)}.$$

The average top-1 probability \( \bar{p}_{\mathrm{top\text{-}1}} \) in the denominator is defined as

$$\bar{p}_{\mathrm{top\text{-}1}} \;=\; \frac{1}{N_B}\sum_{j=1}^{N_B} \max_{1 \le i \le N_E} p_{i,j}.$$

Here, \(N_B\) is the number of tokens in a batch, \(\mathrm{logits}_{i,j}\) is the routing logit of expert \(i\) for token \(j\), and \(\tau\) is the softmax temperature.
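A direct PyTorch sketch of this loss, following the definitions above, is given below; the \([N_B, N_E]\) logit layout and the function signature are our own conventions rather than the report's.

```python
import torch
import torch.nn.functional as F

def top1_lbl(router_logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Top-1 LBL for one batch.

    router_logits: [N_B, N_E] routing logits (token j, expert i).
    tau: softmax temperature.
    """
    n_experts = router_logits.size(-1)
    # Differentiable gating probabilities p_{i,j} via temperature-scaled softmax.
    probs = F.softmax(router_logits / tau, dim=-1)   # [N_B, N_E]
    # Approximate allocation fraction f_hat_i: mean gating probability per expert.
    f_hat = probs.mean(dim=0)                        # [N_E]
    # Average top-1 probability across the batch (the denominator).
    p_top1 = probs.max(dim=-1).values.mean()
    # N_E times the squared L2 norm of f_hat, normalized by the average top-1 probability.
    return n_experts * (f_hat ** 2).sum() / p_top1
```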

As shown in Figure 4, applying Top-1 LBL significantly improves load balancing under super-high sparsity, steadily approaching a uniform token allocation. However, we also find that overly balanced expert utilization may sacrifice model performance. We attribute this issue to the inherent trade-off between load balance and performance, and leave improving this trade-off as an important direction for future work.

Figure 4: Comparison between Top-1 LBL and conventional LBL.

Cite Us

@misc{hu2025sigmamoetinytechnicalreport,
      title={Sigma-MoE-Tiny Technical Report},
      author={Qingguo Hu and Zhenghao Lin and Ziyue Yang and Yucheng Ding and Xiao Liu and Yuting Jiang and Ruizhe Wang and Tianyu Chen and Zhongxin Guo and Yifan Xiong and Rui Gao and Lei Qu and Jinsong Su and Peng Cheng and Yeyun Gong},
      year={2025},
      eprint={2512.16248},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.16248}, 
}