Global Load Balancer

The Global Load Balancer sits between the In-Process Proxy and the Rollouter's pool of inference servers. Its job is twofold: spread requests evenly across the replicas, and keep each conversation pinned to a single replica.

Load balance

A run serves the policy as several vLLM replicas (each gen_tp-way tensor-parallel; total replicas = rollout GPUs ÷ gen_tp). Replicas register as Ray named actors (vllm_server_*), and verl's server manager distributes incoming requests across them so no replica is idle while another queues. With many AgentLoopWorkers issuing calls concurrently, balanced distribution is what turns added rollout GPUs into added throughput.
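
The replica arithmetic and the balancing goal above can be sketched in a few lines of Python. This is a minimal illustration, not verl's server manager: `replica_count` and `LeastInFlightBalancer` are hypothetical names, and "fewest in-flight requests" is one assumed balancing policy among several that would satisfy the description.

```python
from dataclasses import dataclass, field


def replica_count(rollout_gpus: int, gen_tp: int) -> int:
    """Total replicas = rollout GPUs / gen_tp; the division must be exact."""
    if rollout_gpus <= 0 or gen_tp <= 0 or rollout_gpus % gen_tp != 0:
        raise ValueError("rollout_gpus must be a positive multiple of gen_tp")
    return rollout_gpus // gen_tp


@dataclass
class LeastInFlightBalancer:
    """Hypothetical balancer: send each call to the replica with the fewest
    in-flight requests, so no replica idles while another queues."""

    in_flight: dict[str, int] = field(default_factory=dict)

    def register(self, name: str) -> None:
        self.in_flight.setdefault(name, 0)

    def acquire(self) -> str:
        # Ties are broken by name so the choice is deterministic.
        name = min(self.in_flight, key=lambda n: (self.in_flight[n], n))
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        self.in_flight[name] -= 1


# Example: 16 rollout GPUs at gen_tp = 4 gives 4 replicas.
balancer = LeastInFlightBalancer()
for i in range(replica_count(rollout_gpus=16, gen_tp=4)):
    balancer.register(f"vllm_server_{i}")

# Eight concurrent calls end up as two per replica.
picks = [balancer.acquire() for _ in range(8)]
```

A caller would invoke `release` when a request completes, so the counts track live load rather than cumulative traffic.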

Sticky routing

Routing is sticky per session: all calls from one trial go to the same replica for the life of the conversation. This matters because a coding-agent rollout is long and highly repetitive: each turn re-sends the growing history. Pinning the session lets the replica reuse its prefix cache across turns, so on a cache hit it prefills only the tokens added since the previous turn instead of the whole context. On multi-turn agent rollouts this is a large throughput win.
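
To see why re-prefilling is expensive, the following back-of-the-envelope sketch counts prefill tokens over one conversation. It assumes an idealized cache: a pinned session always hits on its full prior history, and an unpinned session always misses. Real prefix caches evict entries and unpinned calls can land on a warm replica by chance, so treat the numbers as bounds. The function name and the turn sizes are illustrative, not measurements.

```python
def prefill_tokens(new_tokens_per_turn: list[int], pinned: bool) -> int:
    """Count tokens that must be prefilled across a whole conversation.

    new_tokens_per_turn[i] is how many tokens turn i appends to the history.
    Pinned: only the newly appended tokens are prefilled each turn.
    Unpinned: the entire history so far is prefilled each turn.
    """
    total = 0
    history = 0
    for new in new_tokens_per_turn:
        history += new
        total += new if pinned else history
    return total


# Illustrative conversation: 10 turns, each adding 2,000 tokens.
turns = [2_000] * 10
with_pinning = prefill_tokens(turns, pinned=True)      # 20,000 tokens
without_pinning = prefill_tokens(turns, pinned=False)  # 110,000 tokens
```

Under these assumptions the unpinned cost grows quadratically with the number of turns while the pinned cost grows linearly, which is why the gap widens on long rollouts.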

Why both at once

Load balancing optimizes across sessions (fill every replica); sticky routing optimizes within a session (reuse the prefix cache). Together they keep the rollout GPUs busy without throwing away cached context.
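
One way the two policies could compose is sketched below: a new session is placed on the replica with the fewest active sessions, and every later call from that session returns to the same replica. `StickyLeastLoadedRouter`, its methods, and the choice of "active session count" as the load signal are all assumptions made for illustration; the source does not specify how verl's server manager implements this.

```python
class StickyLeastLoadedRouter:
    """Hypothetical router: balance across sessions, pin within a session."""

    def __init__(self, replicas: list[str]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self._sessions_per_replica = {name: 0 for name in replicas}
        self._pinned: dict[str, str] = {}  # session id -> replica name

    def route(self, session_id: str) -> str:
        replica = self._pinned.get(session_id)
        if replica is None:
            # First call of a session: pick the least-loaded replica
            # (ties broken by name), then pin the session to it.
            replica = min(
                self._sessions_per_replica,
                key=lambda n: (self._sessions_per_replica[n], n),
            )
            self._pinned[session_id] = replica
            self._sessions_per_replica[replica] += 1
        return replica

    def end_session(self, session_id: str) -> None:
        # Unpin a finished conversation so its replica counts as free again.
        replica = self._pinned.pop(session_id, None)
        if replica is not None:
            self._sessions_per_replica[replica] -= 1


router = StickyLeastLoadedRouter(["vllm_server_0", "vllm_server_1"])
first = router.route("trial-a")    # new session: least-loaded replica
second = router.route("trial-b")   # new session: the other replica
again = router.route("trial-a")    # same session: same replica as `first`
```

Balancing by session count rather than by individual request is a deliberate simplification here: it keeps the pinning decision stable for the life of the conversation, at the cost of ignoring how busy each session currently is.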
