Global Load Balancer
Spreading requests across inference replicas with sticky routing
The Global Load Balancer sits between the In-Process Proxy and the Rollouter's pool of inference servers. Its job is to keep every replica evenly fed while keeping each conversation pinned to one replica.
Load balance
A run serves the policy as several vLLM replicas (each gen_tp-way tensor-parallel; total replicas = rollout GPUs ÷ gen_tp). Replicas register as Ray named actors (vllm_server_*), and verl's server manager distributes incoming requests across them so that no replica sits idle while another queues. With many AgentLoopWorkers issuing calls concurrently, balanced distribution is what turns added rollout GPUs into added throughput.
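The replica arithmetic and the idea of balanced distribution can be sketched as below. This is a minimal illustration, not verl's server manager: the names `replica_count` and `LeastLoadedBalancer` are hypothetical, and picking the replica with the fewest in-flight requests is one assumed policy among several that would satisfy the description above.

```python
def replica_count(rollout_gpus: int, gen_tp: int) -> int:
    """Total replicas = rollout GPUs / gen_tp (must divide evenly)."""
    if rollout_gpus <= 0 or gen_tp <= 0 or rollout_gpus % gen_tp != 0:
        raise ValueError("rollout_gpus must be a positive multiple of gen_tp")
    return rollout_gpus // gen_tp


class LeastLoadedBalancer:
    """Hypothetical balancer: send each request to the least busy replica."""

    def __init__(self, replicas):
        # In-flight request count per replica name.
        self.in_flight = {name: 0 for name in replicas}

    def acquire(self) -> str:
        # Pick the replica with the fewest in-flight requests.
        # Ties go to the replica that was registered first.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        # Call when the request completes so the replica can be picked again.
        self.in_flight[name] -= 1


# Example: 16 rollout GPUs with gen_tp = 4 gives 4 replicas.
n = replica_count(16, 4)
balancer = LeastLoadedBalancer([f"vllm_server_{i}" for i in range(n)])
```

With four idle replicas, the first four calls to `acquire()` land on four different replicas, which is the "no replica idle while another queues" property in miniature.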
Sticky routing
Routing is sticky per session: all calls from one trial go to the same replica for the life of the conversation. This matters because a coding-agent rollout is long and highly repetitive: each turn re-sends the growing history. Pinning the session lets the replica reuse its prefix cache across turns, so each call prefills roughly only the tokens added since the previous turn instead of re-prefilling the whole context. On multi-turn agent rollouts this is a large throughput win.
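To see why re-prefilling matters, a toy cost model helps. The function below is a hypothetical illustration with made-up turn sizes. It assumes the two extremes: with sticky routing the whole earlier history is still in the replica's prefix cache (no eviction), and without it every call lands on a replica that has nothing cached.

```python
def prefill_tokens(turn_lengths, sticky: bool) -> int:
    """Total tokens prefilled over one conversation.

    turn_lengths[i] is the number of tokens appended to the history
    before call i (illustrative values, not measurements).
    """
    total = 0
    history = 0
    for new_tokens in turn_lengths:
        history += new_tokens
        # Sticky: only the new suffix is prefilled (prefix is cached).
        # Not sticky: the full history is prefilled again on a cold replica.
        total += new_tokens if sticky else history
    return total


turns = [1000, 500, 500, 500]  # assumed token counts per turn
print(prefill_tokens(turns, sticky=True))   # 2500
print(prefill_tokens(turns, sticky=False))  # 7000
```

In this model the sticky cost grows linearly with the conversation length while the non-sticky cost grows quadratically in the number of turns, so the gap widens as rollouts get longer.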
Why both at once
Load balancing optimizes across sessions (fill every replica); sticky routing optimizes within a session (reuse the prefix cache). Together they keep the rollout GPUs busy without throwing away cached context.
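One way to combine the two policies is sketched here. It is an assumption-laden illustration rather than the actual implementation: `StickyRouter`, `route`, and `end_session` are hypothetical names, and load is approximated by the number of sessions pinned to each replica.

```python
class StickyRouter:
    """Hypothetical router: balance new sessions, pin existing ones."""

    def __init__(self, replicas):
        # Number of live sessions pinned to each replica.
        self.sessions = {name: 0 for name in replicas}
        # session_id -> replica name, kept for the life of the conversation.
        self.pinned = {}

    def route(self, session_id: str) -> str:
        replica = self.pinned.get(session_id)
        if replica is None:
            # New session: place it on the replica with the fewest sessions.
            replica = min(self.sessions, key=self.sessions.get)
            self.pinned[session_id] = replica
            self.sessions[replica] += 1
        # Known session: always return the same replica (cache reuse).
        return replica

    def end_session(self, session_id: str) -> None:
        # Unpin when the trial finishes so the replica's load count drops.
        replica = self.pinned.pop(session_id, None)
        if replica is not None:
            self.sessions[replica] -= 1


router = StickyRouter(["vllm_server_0", "vllm_server_1"])
first = router.route("trial-a")
assert router.route("trial-a") == first  # same replica on every turn
```

The balancing decision is made once per session, at its first call; every later call is a dictionary lookup. That ordering is what lets the two goals coexist: placement spreads sessions across replicas, and pinning preserves each session's cached prefix.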