Improving MoE Post-Training with Precise Router Gradients

Abstract

We introduce DenseMixer, a novel and effective MoE post-training technique that makes MoE models easier to train and better performing. By trading one extra forward pass on the inactive experts for a precise router gradient, DenseMixer consistently outperforms the conventional method across different MoE scales (7B, 14B, 30B), architectures (with/without shared experts), pre-training methods (from scratch/up-cycling), and post-training data types (instruction/long-CoT data). We provide a plug-and-play implementation of DenseMixer, enabling MoE post-training simply via pip install densemixer. It is fully compatible with existing libraries (e.g., transformers, llama-factory, open-instruct, verl), can be combined with parameter-efficient methods (e.g., LoRA), and introduces no changes to inference.
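To make the core idea concrete, below is a minimal conceptual sketch (not the densemixer library's actual implementation) of the trade described above: all experts are run forward once so the router can receive a gradient from the full dense mixture, while the layer's output value stays identical to the usual sparse top-k MoE. All names here (SketchMoELayer, hidden_dim, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.router(x), dim=-1)              # (tokens, experts)
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)

        # One forward pass over *all* experts, including the inactive ones.
        # (For brevity, the inactive experts' backward graph is kept and simply
        # zeroed by the mask below; a real implementation would avoid that cost.)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, experts, dim)

        # Conventional sparse output: only the top-k experts contribute.
        # Router weights are detached here so the router's gradient comes
        # solely from the dense term below.
        sparse_mask = torch.zeros_like(probs).scatter(1, topk_idx, topk_vals)
        y_sparse = (sparse_mask.detach().unsqueeze(-1) * expert_outs).sum(dim=1)

        # Dense mixture used only to give the router a precise gradient:
        # expert outputs are detached, so expert parameters are not updated
        # through this term.
        y_dense = (probs.unsqueeze(-1) * expert_outs.detach()).sum(dim=1)

        # Straight-through combination: the forward value equals the usual
        # sparse MoE output, while the router receives the dense-mixture gradient.
        return y_sparse + y_dense - y_dense.detach()
```

This sketch only illustrates the gradient-routing idea; the released package hooks into existing MoE implementations (e.g., in transformers) rather than requiring a custom layer like the one above.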

Liyuan Liu
Principal Researcher @ MSR

Understand the underlying mechanism of pretraining heuristics.