Your Efficient RL Framework Secretly Brings You Off-Policy RL Training

Abstract

In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and discuss a simple yet effective importance sampling technique for handling the discrepancy.
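To make the correction concrete, below is a minimal sketch (my own illustration, not the post's exact code) of a truncated importance-sampling weight applied to a per-token policy-gradient loss: the ratio between the training policy's log-probabilities (recomputed by the trainer, e.g., FSDP) and the rollout engine's log-probabilities (e.g., vLLM) reweights each token's loss. The function name `is_corrected_pg_loss`, the truncation cap `clip_c`, and the input layout are hypothetical.

```python
import torch

def is_corrected_pg_loss(
    train_logprobs: torch.Tensor,    # log pi_train(a_t | s_t), recomputed by the trainer (e.g., FSDP)
    rollout_logprobs: torch.Tensor,  # log pi_rollout(a_t | s_t), returned by the rollout engine (e.g., vLLM)
    advantages: torch.Tensor,        # per-token advantage estimates
    mask: torch.Tensor,              # 1.0 for response tokens, 0.0 for prompt/padding
    clip_c: float = 2.0,             # truncation cap on the importance ratio
) -> torch.Tensor:
    """REINFORCE-style loss with a truncated importance-sampling correction
    for the rollout/training probability mismatch (illustrative sketch)."""
    # Importance ratio between the training policy and the rollout policy.
    ratio = torch.exp(train_logprobs.detach() - rollout_logprobs.detach())
    # Truncate the ratio to bound the variance introduced by the correction.
    ratio = torch.clamp(ratio, max=clip_c)
    # Reweight the per-token policy-gradient surrogate by the (detached) ratio.
    per_token_loss = -ratio * advantages * train_logprobs
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

Under this sketch, all inputs share a (batch, sequence_length) shape; without the `ratio` factor, the loss reduces to the usual on-policy objective, which is exactly what becomes biased when the rollout and training probabilities disagree.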

Liyuan Liu
Principal Researcher @ MSR

Understanding the underlying mechanisms of pretraining heuristics.