Your Efficient RL Framework Secretly Brings You Off-Policy RL Training

Abstract

In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and discuss a simple yet effective importance sampling technique for handling the discrepancy.
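To make the correction concrete, below is a minimal sketch (my own illustration, not the post's exact code) of a truncated importance-sampling weight applied to a per-token policy-gradient loss: the ratio between the training policy's log-probabilities (recomputed by the trainer, e.g., FSDP) and the rollout engine's log-probabilities (e.g., vLLM) reweights each token's loss. The function name `is_corrected_pg_loss`, the truncation cap `clip_c`, and the input layout are hypothetical.

```python
import torch

def is_corrected_pg_loss(
    train_logprobs: torch.Tensor,    # log pi_train(a_t | s_t), recomputed by the trainer (e.g., FSDP)
    rollout_logprobs: torch.Tensor,  # log pi_rollout(a_t | s_t), returned by the rollout engine (e.g., vLLM)
    advantages: torch.Tensor,        # per-token advantage estimates
    mask: torch.Tensor,              # 1.0 for response tokens, 0.0 for prompt/padding
    clip_c: float = 2.0,             # truncation cap on the importance ratio
) -> torch.Tensor:
    """REINFORCE-style loss with a truncated importance-sampling correction
    for the rollout/training probability mismatch (illustrative sketch)."""
    # Importance ratio between the training policy and the rollout policy.
    ratio = torch.exp(train_logprobs.detach() - rollout_logprobs.detach())
    # Truncate the ratio to bound the variance introduced by the correction.
    ratio = torch.clamp(ratio, max=clip_c)
    # Reweight the per-token policy-gradient surrogate by the (detached) ratio.
    per_token_loss = -ratio * advantages * train_logprobs
    return (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

Under this sketch, all inputs share a (batch, sequence_length) shape; without the `ratio` factor, the loss reduces to the usual on-policy objective, which is exactly what becomes biased when the rollout and training probabilities disagree.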

Liyuan Liu
Principal Researcher @ MSR

Understanding the underlying mechanisms of pretraining heuristics.