FlashRL: 8Bit Rollouts, Full Power RL

Abstract

Rollout generation is a primary bottleneck in RL training, accounting for ~70% of total training time in DAPO-32B. FlashRL provides the first open-source, working RL recipe that applies quantized rollout generation while preserving downstream performance, via Truncated Importance Sampling (TIS). It can be installed with pip install flash-llm-rl and supports both INT8 and FP8 quantization, on both the latest GPUs (e.g., H100) and older ones (e.g., A100).
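To make the TIS correction concrete, below is a minimal sketch, not FlashRL's actual implementation, of truncated importance sampling applied to a token-level policy-gradient loss. The idea: the quantized rollout policy differs slightly from the full-precision training policy, so each token is reweighted by their (truncated) likelihood ratio. The function name, tensor names, and the threshold C are illustrative assumptions.

```python
# Minimal sketch of Truncated Importance Sampling (TIS) for quantized rollouts.
# NOT FlashRL's actual implementation; names and the threshold C are illustrative.
import torch

def tis_pg_loss(
    train_logprobs: torch.Tensor,    # log pi_train(a_t|s_t) from the full-precision training policy
    rollout_logprobs: torch.Tensor,  # log pi_rollout(a_t|s_t) from the quantized (INT8/FP8) rollout
    advantages: torch.Tensor,        # per-token advantage estimates
    C: float = 2.0,                  # truncation threshold for the importance ratio
) -> torch.Tensor:
    # Importance ratio between training and rollout policies; detached so it
    # acts as a fixed correction weight rather than part of the gradient path.
    ratio = torch.exp(train_logprobs - rollout_logprobs).detach()
    # Truncate (clip from above) to bound the variance of the estimator.
    weight = torch.clamp(ratio, max=C)
    # Standard policy-gradient surrogate, reweighted token by token.
    return -(weight * advantages * train_logprobs).mean()
```

Truncating the ratio from above trades a small bias for bounded variance, which is what lets training tolerate the rollout/training mismatch introduced by quantization.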

Liyuan Liu
Principal Researcher @ MSR

Understanding the underlying mechanisms of pretraining heuristics.