Rollouter
The vLLM inference-server pool, the rollout data buffer, and weight sync
The Rollouter is the serving-and-data half of the loop: it runs the vLLM inference servers that produce rollouts, buffers the finished rollouts for the trainer, and receives fresh weights from the trainer between updates.
Inference servers
The policy is served as a pool of vLLM replicas (verl-managed, each
gen_tp-way tensor-parallel). They register as Ray named actors
(vllm_server_*) and receive requests through the
Global Load Balancer. On a 30B MoE
(TP=4), CUDA-graph capture takes ~10–20 min before the first replica registers.
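Because the first replica can take many minutes to register, a launcher may want to block until the pool is up. The following is a minimal sketch of that wait, not the project's own startup logic: `registered_replicas` and `wait_for_replicas` are hypothetical helpers, and the caller supplies the function that lists actor names (on a live cluster this could be `ray.util.list_named_actors`, assuming the replicas live in the caller's namespace).

```python
import fnmatch
import time
from typing import Callable, Iterable, List


def registered_replicas(actor_names: Iterable[str],
                        pattern: str = "vllm_server_*") -> List[str]:
    """Return the sorted actor names that match the replica naming pattern."""
    return sorted(n for n in actor_names if fnmatch.fnmatchcase(n, pattern))


def wait_for_replicas(list_names: Callable[[], Iterable[str]],
                      expected: int,
                      timeout_s: float = 30 * 60,
                      poll_s: float = 15.0,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll until at least `expected` replicas are registered.

    `list_names` is injected so the sketch stays independent of Ray;
    the 30-minute default timeout is an assumption chosen to exceed the
    ~10-20 min CUDA-graph capture mentioned above.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        found = registered_replicas(list_names())
        if len(found) >= expected:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(found)}/{expected} replicas registered "
                f"after {timeout_s:.0f}s"
            )
        sleep(poll_s)
```

Injecting `list_names` and `sleep` keeps the helper testable without a running cluster.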
Data Buffer
Completed rollouts stream into a Data Buffer that the Trainer consumes:
- Synchronous (scripts/sync_1node_cc.sh): the buffer is one step's batch. Generate the whole batch, then update, in lockstep, on shared GPUs.
- Fully-async (scripts/fully_async_*.sh): the buffer decouples the two pools. Rollouts (dominated by sandbox execution, not GPU compute) stream into the queue while the trainer consumes them, so neither side blocks the other. staleness_threshold bounds how many weight-versions stale a buffered rollout may be before the trainer waits; partial_rollout resumes a generation that a weight sync interrupted instead of discarding it.
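The staleness bound described above can be illustrated with a toy buffer. This is a sketch of the idea as worded here, not the actual Data Buffer: `Rollout`, `StalenessBoundedBuffer`, and `take_batch` are hypothetical names, the threshold is treated as an integer count of weight versions, and what happens to rollouts that exceed the bound is left open (the toy simply keeps them queued).

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Rollout:
    sample_id: int
    weight_version: int  # version of the policy weights that generated it


class StalenessBoundedBuffer:
    """Toy FIFO buffer that only releases rollouts within the staleness bound."""

    def __init__(self, staleness_threshold: int) -> None:
        self.staleness_threshold = staleness_threshold
        self._queue: Deque[Rollout] = deque()

    def __len__(self) -> int:
        return len(self._queue)

    def put(self, rollout: Rollout) -> None:
        """Called by the rollout side as generations finish."""
        self._queue.append(rollout)

    def take_batch(self, batch_size: int,
                   trainer_version: int) -> Optional[List[Rollout]]:
        """Return `batch_size` admissible rollouts in arrival order.

        Returns None when too few rollouts are fresh enough, which is
        the case where the trainer has to wait.
        """
        admissible = [
            r for r in self._queue
            if trainer_version - r.weight_version <= self.staleness_threshold
        ]
        if len(admissible) < batch_size:
            return None
        batch = admissible[:batch_size]
        taken = {id(r) for r in batch}
        # Keep everything that was not handed to the trainer.
        self._queue = deque(r for r in self._queue if id(r) not in taken)
        return batch
```

With `staleness_threshold=1` and the trainer at version 2, rollouts generated at versions 1 and 2 are released, while one generated at version 0 is held back.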
Weight Sync
After each update the trainer pushes new actor weights into the inference
replicas so the next rollouts are on-policy. In the synchronous loop this happens
every step; in the async path it is governed by trigger_parameter_sync_step
(and bounded by staleness_threshold). rollout_corr/kl is the health metric for
sync fidelity: rollout-vs-training logprobs should track closely (≈3e-4), and a
spike means scaffold/routing corruption.
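To make the metric concrete, here is a minimal sketch of one way a rollout-vs-training divergence can be estimated from per-token logprobs. The exact estimator behind rollout_corr/kl is not specified here; this assumes the simple sample-mean estimator of KL(rollout ‖ training), i.e. the average of `rollout_logprob - training_logprob` over the tokens the rollout policy actually sampled. The function name and the optional mask are hypothetical.

```python
from typing import Optional, Sequence


def rollout_training_kl(rollout_logprobs: Sequence[float],
                        training_logprobs: Sequence[float],
                        mask: Optional[Sequence[int]] = None) -> float:
    """Mean of (rollout logprob - training logprob) over unmasked tokens.

    Both sequences hold the logprob of the same sampled tokens, once as
    reported by the inference server and once as recomputed by the
    trainer. A value near zero means the two sides agree.
    """
    if len(rollout_logprobs) != len(training_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if mask is None:
        mask = [1] * len(rollout_logprobs)
    if len(mask) != len(rollout_logprobs):
        raise ValueError("mask must have the same length as the logprobs")

    total = 0.0
    count = 0
    for r, t, m in zip(rollout_logprobs, training_logprobs, mask):
        if m:  # skip padding or prompt tokens
            total += r - t
            count += 1
    if count == 0:
        raise ValueError("mask selects no tokens")
    return total / count
```

Identical logprobs give exactly 0.0. A per-sample estimate like this can be negative on small batches even though the true KL is not, so it is best read as a trend over many tokens.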
Scaling the rollouter into a dedicated pool, plus the async knobs, is covered in Scaling Up.