Trainer

The Trainer closes the loop. It samples the task batch, consumes finished rollouts from the Rollouter's data buffer, computes advantages, updates the actor (and critic, for PPO), and pushes new weights back via weight sync.
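The loop above can be pictured with a toy iteration. This is a minimal sketch, not verl's implementation: `trainer_step`, `update_fn`, `sync_fn`, and the `reward` field are hypothetical names, the advantage is a simple reward-minus-mean placeholder, and task sampling is left out.

```python
from collections import deque


def trainer_step(buffer, batch_size, update_fn, sync_fn):
    """Run one toy Trainer iteration (all names here are hypothetical)."""
    if len(buffer) < batch_size:
        return None  # not enough finished rollouts in the buffer yet

    # Consume finished rollouts from the front of the buffer.
    batch = [buffer.popleft() for _ in range(batch_size)]

    # Placeholder advantage: reward minus the batch mean.
    rewards = [rollout["reward"] for rollout in batch]
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]

    # Update the actor, then push the new weights back to the rollout side.
    new_version = update_fn(batch, advantages)
    sync_fn(new_version)
    return new_version


# Usage: four finished rollouts, a stub update that returns version 1.
buffer = deque({"reward": float(i % 2)} for i in range(4))
synced_versions = []
version = trainer_step(buffer, 4, lambda batch, adv: 1, synced_versions.append)
```

Returning `None` when the buffer is short lets the caller decide whether to wait or do other work.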

Algorithms

Select the algorithm with the adv_estimator and policy_loss_mode settings in the launch script:

PPO — actor + critic, clipped policy-gradient objective.
GRPO — group-relative advantages, no critic (adv_estimator=grpo).
GSPO — sequence-level policy optimization (policy_loss_mode=gspo).

The default profile runs GRPO advantages with a GSPO loss. All three optimize over full multi-turn rollouts, not single completions. See Algorithms.
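To make the default pairing concrete, here is a rough sketch of group-relative advantages combined with a sequence-level clipped loss. It assumes the commonly published formulations (rewards normalized by the group mean and standard deviation, and a length-normalized sequence importance ratio); the function names and the `clip_eps` value are illustrative and are not taken from verl.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def gspo_sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized importance ratio for one whole sequence."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_loss(ratios, advantages, clip_eps=0.2):
    """Clipped surrogate loss with one ratio per sequence (not per token)."""
    terms = []
    for s, a in zip(ratios, advantages):
        clipped = min(max(s, 1 - clip_eps), 1 + clip_eps)
        terms.append(min(s * a, clipped * a))
    return -sum(terms) / len(terms)


# Usage: one group of four rollouts for the same task.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
ratio = gspo_sequence_ratio([-1.0, -2.0], [-1.0, -2.0])
loss = gspo_loss([ratio] * 4, advantages)
```

Because the baseline comes from the group itself, no critic is needed, which matches the GRPO entry in the list above.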

Backends

The actor can run on three backends:

Backend	Best for
FSDP	dense models; the validated default and fully-async path
VeOmni	large MoE policies (e.g. Qwen3-30B-A3B); enable with `USE_NEW_VERL=1`
Megatron	the largest models, via tensor/pipeline parallelism

Training modes & features

Sync — generate → execute → reward → update in lockstep on shared GPUs; the minimal-cost single-node default.
Async — a decoupled trainer pool consumes a streaming buffer from a separate rollout pool (see Rollouter and Scaling Up).
Partial Rollout — a rollout interrupted mid-generation by a weight sync is resumed rather than thrown away, recovering otherwise-wasted compute.
R3 (router replay) — for MoE policies, replays the rollout's expert-routing decisions during the training forward pass so the update matches what was actually generated; required to keep MoE rollout/training logprobs aligned.
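The idea behind router replay can be shown with a toy top-k router. This is an illustrative sketch only: `top_k_experts` and `route` are hypothetical helpers, and computing gate weights as a softmax over the selected experts is an assumption, not a description of how verl or any particular MoE model does it.

```python
import math


def top_k_experts(router_logits, k):
    """Indices of the k largest router logits, highest first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def route(router_logits, k, replay=None):
    """Pick experts for one token, reusing a recorded choice if one is given."""
    experts = replay if replay is not None else top_k_experts(router_logits, k)
    # Gate weights: softmax over the selected experts' logits (assumed scheme).
    peak = max(router_logits[i] for i in experts)
    exps = [math.exp(router_logits[i] - peak) for i in experts]
    total = sum(exps)
    return experts, [e / total for e in exps]


# Rollout and training logits differ slightly, e.g. from numerical drift.
rollout_logits = [2.0, 1.9, 0.1]
train_logits = [1.9, 2.0, 0.1]

recorded, _ = route(rollout_logits, k=1)                      # chosen at rollout
drifted, _ = route(train_logits, k=1)                         # no replay
replayed, gates = route(train_logits, k=1, replay=recorded)   # with replay
```

Without replay the training pass would send this token to a different expert than the rollout did, so its logprob would no longer describe the token that was actually generated.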

Outputs

verl writes the actor as FSDP shards every save_freq steps to checkpoints/<project>/<exp>/global_step_N/, and logs per-step metrics to logs/<exp>.log and wandb. See Metrics and Results & Artifacts.
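Given that directory layout, a small helper can locate the most recent checkpoint. This is a sketch under the assumption that step directories are named exactly `global_step_N`; `latest_checkpoint` is a hypothetical helper, not part of verl, and the project and experiment names below are made up for the demo.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(root, project, exp):
    """Return (step, path) for the highest global_step_N directory, or None."""
    base = Path(root) / project / exp
    if not base.is_dir():
        return None
    best = None
    for child in base.iterdir():
        match = re.fullmatch(r"global_step_(\d+)", child.name)
        if match and child.is_dir():
            step = int(match.group(1))  # compare numerically, not as strings
            if best is None or step > best[0]:
                best = (step, child)
    return best


# Usage against a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    for n in (2, 10, 9):
        (Path(root) / "demo_project" / "demo_exp" / f"global_step_{n}").mkdir(parents=True)
    latest_step, latest_path = latest_checkpoint(root, "demo_project", "demo_exp")
    latest_name = latest_path.name
    missing = latest_checkpoint(root, "demo_project", "no_such_exp")
```

Parsing the step as an integer matters because a plain string sort would rank `global_step_9` above `global_step_10`.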
