Rollout generation is a primary bottleneck in RL training, taking up ~70% of total training time in DAPO-32B. FlashRL provides the first open-sourced & working RL recipe that applies quantized rollout generation while preserving downstream …
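For a rough picture of what quantized rollout generation can look like on the inference side, here is a minimal sketch using vLLM's public quantization option (FP8 here); the checkpoint path and sampling settings are placeholders, and the actual FlashRL recipe may differ:

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch: serve rollouts from a quantized copy of the policy.
# "path/to/policy-checkpoint" is a placeholder; the released FlashRL code is
# the reference for the exact setup.
rollout_engine = LLM(
    model="path/to/policy-checkpoint",
    quantization="fp8",      # quantize weights to speed up rollout generation
    dtype="bfloat16",
)
sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024, logprobs=1)
outputs = rollout_engine.generate(["<prompt goes here>"], sampling)
```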
We introduce **DenseMixer**, a novel and effective MoE **post-training** technique that makes MoE models easier to train and better performing. By trading one **extra forward pass** on inactive experts for a **precise router gradient**, DenseMixer …
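A minimal sketch of the idea, assuming a straight-through-style combination (this is an illustration, not the released implementation): every expert is run once, the forward value stays the usual top-k mixture, and the router's gradient comes from a dense mixture over all experts.

```python
import torch
import torch.nn.functional as F

def densemixer_style_moe(x, router, experts, k=2):
    """Illustrative sketch: sparse top-k forward value, dense router gradient."""
    logits = router(x)                                  # [tokens, n_experts]
    gates = F.softmax(logits, dim=-1)
    topk_idx = gates.topk(k, dim=-1).indices
    mask = torch.zeros_like(gates).scatter_(-1, topk_idx, 1.0)

    # one forward pass through every expert; the pass over inactive experts is
    # the "extra" compute traded for a better router gradient
    # (renormalization of the top-k gates is omitted for brevity)
    all_out = torch.stack([e(x) for e in experts], dim=1)   # [tokens, n_experts, d]

    # dense path: expert outputs are detached, so this term only sends a
    # gradient to the router, and that gradient covers all experts
    dense = (gates.unsqueeze(-1) * all_out.detach()).sum(dim=1)

    # sparse path: gates are detached, so this term only sends gradients to the
    # top-k experts (inactive experts are multiplied by a zero gate)
    sparse = ((gates * mask).detach().unsqueeze(-1) * all_out).sum(dim=1)

    # straight-through combine: the forward value equals the usual sparse
    # mixture, while the backward pass gives the router the dense gradient
    return sparse + (dense - dense.detach())

# toy usage:
# router = torch.nn.Linear(16, 4); experts = [torch.nn.Linear(16, 16) for _ in range(4)]
# y = densemixer_style_moe(torch.randn(8, 16), router, experts, k=2)
```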
In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and …
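One way to see (and partially correct) the mismatch is to treat the rollout engine's policy as a behavior policy and reweight the policy gradient with truncated importance sampling. The sketch below assumes per-token log-probabilities are available from both the rollout engine and the trainer; the function and argument names are illustrative, not the framework's API.

```python
import torch

def corrected_pg_loss(train_logprobs, rollout_logprobs, advantages, clip_c=2.0):
    """Illustrative sketch: token-level importance weights correct for the gap
    between the policy that generated the rollout (e.g., vLLM) and the policy
    being trained (e.g., FSDP)."""
    # probability ratio between trainer and rollout engine for the same tokens;
    # detached so it only reweights and does not add an extra gradient path
    ratio = torch.exp(train_logprobs - rollout_logprobs).detach()
    ratio = ratio.clamp(max=clip_c)      # truncate to keep the variance bounded
    # REINFORCE-style objective, reweighted by the truncated ratio
    return -(ratio * advantages * train_logprobs).mean()
```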
Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing …
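To make the security failure mode concrete, here is a hypothetical example of the SQL-injection pattern that generated code can exhibit, next to the parameterized fix (table and column names are illustrative):

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # vulnerable: user input is spliced into the SQL string, so an input like
    # "x' OR '1'='1" turns the filter into a tautology and returns every row
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(conn: sqlite3.Connection, name: str):
    # parameterized query: the driver binds the value instead of interpreting it as SQL
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
```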
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model …
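A minimal sketch of the verifiable-reward side of this setup, assuming the entire training set is a single boxed-answer math problem; the example problem, the answer-extraction rule, and all names are illustrative:

```python
import re

# hypothetical 1-shot training set: one problem and its verifiable answer
ONE_SHOT_EXAMPLE = {
    "prompt": "Solve: what is 13 * 17? Put the final answer in \\boxed{}.",
    "answer": "221",
}

def verifiable_reward(completion: str, ground_truth: str) -> float:
    # extract the boxed answer from the sampled completion and compare it to
    # the ground truth; reward is binary, so it can be checked automatically
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth else 0.0

# during RL, every rollout is sampled from this single prompt and scored as
# reward = verifiable_reward(rollout_text, ONE_SHOT_EXAMPLE["answer"])
```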