Reinforcement Learning

FlashRL: 8Bit Rollouts, Full Power RL

Rollout generation is a primary bottleneck in RL training, taking up ~70% of total training time in DAPO-32B. FlashRL provides the first open-source, working RL recipe that applies quantized rollout generation while preserving downstream …
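
The teaser doesn't spell out the quantization scheme, but as a rough illustration of why 8-bit rollouts help, here is a minimal sketch of symmetric per-channel int8 weight quantization with dequantize-on-the-fly matmul. The helper names and the scheme itself are assumptions for illustration, not FlashRL's actual recipe.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per row
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """Dequantize-on-the-fly linear layer: y = x @ (q * scale)^T."""
    return x @ (q.float() * scale).t()

w = torch.randn(256, 128)    # fp32 policy weights kept for training
q, s = quantize_int8(w)      # int8 copy used only for rollout generation
x = torch.randn(4, 128)      # a batch of hidden states
err = (int8_linear(x, q, s) - x @ w.t()).abs().max().item()
print(f"max abs error vs fp32 forward: {err:.4f}")   # small but nonzero
```

Halving the weight footprint speeds up memory-bound decoding, but the nonzero error is exactly the rollout/trainer mismatch that the recipe has to account for.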

Improving MoE Post-Training with Precise Router Gradients

We introduce **DenseMixer**, a novel and effective MoE **post-training** technique that makes MoE models easier to train and better performing. By trading one **extra forward pass** on inactive experts for a **precise router gradient**, DenseMixer …
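
As a rough reconstruction of the trade described above (one extra forward pass on inactive experts buys a precise router gradient), here is a PyTorch sketch using a straight-through construction: the output value is the usual sparse top-k mixture, but the router's gradient flows through a dense mixture over all experts. The class, the straight-through trick, and all sizes are illustrative assumptions, not DenseMixer's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGradMoE(nn.Module):
    """Top-k MoE layer whose router receives a dense, all-expert gradient."""
    def __init__(self, d: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                                    # x: (B, d)
        probs = F.softmax(self.router(x), dim=-1)            # (B, E)
        topv, topi = probs.topk(self.k, dim=-1)              # (B, k)
        # The "extra forward pass": run every expert, not just the top-k.
        all_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, d)
        # Sparse top-k mixture with detached gates: only the active experts
        # receive gradients through this term.
        idx = topi.unsqueeze(-1).expand(-1, -1, x.size(-1))
        sparse = (topv.detach().unsqueeze(-1) * all_out.gather(1, idx)).sum(1)
        # Dense mixture over detached expert outputs: the router receives a
        # precise gradient through all experts; the experts receive none here.
        dense = (probs.unsqueeze(-1) * all_out.detach()).sum(1)
        # Straight-through: the value is the usual sparse output, while the
        # backward pass splits gradients as described above.
        return sparse + dense - dense.detach()

moe = DenseGradMoE(d=16, n_experts=4, k=2)
moe(torch.randn(8, 16)).sum().backward()
print(moe.router.weight.grad.norm())   # nonzero: a dense router gradient
```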

Your Efficient RL Framework Secretly Brings You Off-Policy RL Training

In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and …
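
The teaser cuts off before the remedy, but one standard correction for such a rollout/trainer probability mismatch is truncated importance sampling: reweight the policy-gradient loss by the ratio of trainer to rollout token probabilities, clipped from above. The function name, clip constant, and loss form below are assumptions for illustration, not necessarily what the post proposes.

```python
import torch

def tis_pg_loss(trainer_logp, rollout_logp, advantages, c: float = 2.0):
    """Token-level policy gradient with truncated importance weights.

    trainer_logp: log-probs of sampled tokens under the training engine (FSDP)
    rollout_logp: log-probs of the same tokens under the rollout engine (vLLM)
    """
    # w = pi_trainer / pi_rollout on the sampled tokens, clipped at c to
    # bound variance; detached so gradients flow only through trainer_logp
    ratio = torch.exp(trainer_logp.detach() - rollout_logp).clamp(max=c)
    # standard PG surrogate, reweighted so the engine mismatch is corrected
    # in expectation (up to the truncation)
    return -(ratio * advantages * trainer_logp).mean()

trainer_logp = (-torch.rand(4, 16)).requires_grad_()   # trainer-side log-probs
rollout_logp = trainer_logp.detach() + 0.05 * torch.randn(4, 16)  # engine gap
advantages = torch.randn(4, 16)
tis_pg_loss(trainer_logp, rollout_logp, advantages).backward()
print(trainer_logp.grad.shape)                         # torch.Size([4, 16])
```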

Training Language Models to Generate Quality Code with Program Analysis Feedback

Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing …
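
As a toy illustration of program-analysis feedback as an RL reward, the sketch below parses generated Python with the standard `ast` module and penalizes a crude injection-prone `execute()` pattern and missing docstrings. The specific checks and weights are assumptions; the paper's analyzers and scoring are surely richer.

```python
import ast

def analysis_reward(code: str) -> float:
    """Score generated code by static analysis: higher is cleaner."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return -1.0                                 # unparseable: worst reward
    reward = 1.0
    for node in ast.walk(tree):
        # Toy security check: non-constant first argument to execute(),
        # e.g. an f-string, suggests SQL built by string interpolation.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"
                and node.args
                and not isinstance(node.args[0], ast.Constant)):
            reward -= 0.5                           # possible SQL injection
        # Toy maintainability check: functions without a docstring.
        if isinstance(node, ast.FunctionDef) and ast.get_docstring(node) is None:
            reward -= 0.2
    return reward

unsafe = 'def q(c, uid):\n    c.execute(f"SELECT * FROM u WHERE id={uid}")'
safe = ('def q(c, uid):\n    """Fetch a user by id."""\n'
        '    c.execute("SELECT * FROM u WHERE id=?", (uid,))')
print(analysis_reward(unsafe), analysis_reward(safe))   # 0.3 vs 1.0
```

A reward like this is cheap and automatic, which is what makes program analysis attractive as a training signal compared to human review.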

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model …
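
For concreteness, here is a minimal sketch of what a verifiable reward looks like in the 1-shot setting: a single training problem with a known answer, and a binary reward that checks whether the model's final boxed answer matches it. The extraction regex and the example problem are illustrative assumptions, not the paper's setup.

```python
import re

# The single training example that 1-shot RLVR trains against (illustrative).
EXAMPLE = {"question": "What is 12 * 13?", "answer": "156"}

def extract_answer(completion: str):
    """Take the last \\boxed{...} span as the model's final answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str) -> float:
    """Binary reward: 1.0 iff the extracted answer matches the gold answer."""
    return 1.0 if extract_answer(completion) == EXAMPLE["answer"] else 0.0

print(verifiable_reward(r"12*13 = 156, so the answer is \boxed{156}"))  # 1.0
print(verifiable_reward(r"I think it is \boxed{144}"))                  # 0.0
```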