In-Process Proxy
One unified API in front of the policy that captures every trajectory
The In-Process Proxy is the single seam between every scaffold and the self-served policy. It presents one unified API surface, captures the exact token-level trajectory used for training, and forwards each call to verl's server_manager. It needs no standalone LiteLLM service and no fixed port.
Unified API
The proxy is started by the verl agent loop on the training host, binds the host's primary IPv4 on an ephemeral port, and hands each trial a unique per-session URL. It serves two protocol families from the same process:
in-process proxy (per-session URL, e.g. http://<host-ip>:<port>/sess/<id>/v1)
  • OpenAI     /v1/chat/completions   (OpenHands / OpenCode / Terminus)
  • Anthropic  /v1/messages           (Claude Code)
  └─▶ server_manager.generate(...) ─▶ vLLM replicas

Any scaffold therefore works unchanged: Claude Code's ANTHROPIC_BASE_URL and the
OpenAI scaffolds' base URL are simply overridden per trial with the session URL.
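The binding and per-session routing described above can be sketched as follows. The sketch assumes a path layout that mirrors the example URL. The helper names (primary_ipv4, bind_ephemeral, new_session_url, route) are hypothetical. The UDP-connect trick is only one common way to guess a host's primary IPv4, and the proxy's real HTTP server and request handling are not shown.

```python
from __future__ import annotations

import re
import socket
import uuid

# Hypothetical path layout, mirroring the example per-session URL:
#   /sess/<id>/v1/chat/completions  -> OpenAI family
#   /sess/<id>/v1/messages          -> Anthropic family
_ROUTE = re.compile(
    r"^/sess/(?P<sid>[^/]+)/v1/(?P<endpoint>chat/completions|messages)$"
)


def primary_ipv4() -> str:
    """Best-effort guess at the host's primary IPv4 address.

    Connecting a UDP socket sends no packets; it only asks the kernel which
    local address it would use to route outbound traffic.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("10.255.255.255", 1))
            return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no usable route: fall back to loopback


def bind_ephemeral(host: str) -> socket.socket:
    """Listen on port 0 so the OS assigns a free ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock


def new_session_url(host: str, port: int) -> tuple[str, str]:
    """Return (session_id, per-session base URL) for one trial."""
    sid = uuid.uuid4().hex
    return sid, f"http://{host}:{port}/sess/{sid}/v1"


def route(path: str) -> tuple[str, str] | None:
    """Map a request path to (session_id, protocol family), or None if unknown."""
    m = _ROUTE.match(path)
    if m is None:
        return None
    family = "openai" if m["endpoint"] == "chat/completions" else "anthropic"
    return m["sid"], family
```

Under the same assumptions, a trial would be set up roughly as follows. First call host = primary_ipv4() and srv = bind_ephemeral(host). Then read port = srv.getsockname()[1] and call new_session_url(host, port) once per trial. Carrying the session id in the URL path, rather than in a header, lets an unmodified scaffold supply it just by pointing its base URL at the session URL.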
Trajectory capture (token-in / token-out)
This is why the proxy exists. It tokenizes messages through the same
apply_chat_template path as verl, forwards to server_manager.generate(...),
and writes a per-trial proxy_trajectory.json carrying the token ids, masks, and
logprobs. Training therefore consumes exactly the tokens the policy produced,
with no re-tokenization gap between rollout and update.
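The bookkeeping can be pictured as an accumulator. For each model call, it appends the new context tokens (mask 0) and then the generated tokens with their sampled logprobs (mask 1). This is a sketch under assumptions. The class, its field names, the 0/1 mask convention, and the 0.0 placeholder logprobs for context tokens are all hypothetical, not the actual proxy_trajectory.json schema.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrajectoryRecorder:
    """Hypothetical per-trial accumulator for token ids, masks and logprobs.

    Field names are illustrative; the real proxy_trajectory.json schema may differ.
    """

    token_ids: list[int] = field(default_factory=list)
    mask: list[int] = field(default_factory=list)        # 1 = sampled by the policy, 0 = context
    logprobs: list[float] = field(default_factory=list)  # 0.0 placeholder for context tokens

    def add_turn(
        self,
        prompt_ids: list[int],
        response_ids: list[int],
        response_logprobs: list[float],
    ) -> None:
        """Record one model call.

        prompt_ids is assumed to be the full rendered history (e.g. from
        tokenizer.apply_chat_template(messages, tokenize=True,
        add_generation_prompt=True)), so it should extend what was recorded so far.
        """
        n = len(self.token_ids)
        if prompt_ids[:n] != self.token_ids:
            raise ValueError("prompt does not extend the recorded token sequence")
        if len(response_ids) != len(response_logprobs):
            raise ValueError("need exactly one logprob per generated token")
        context = prompt_ids[n:]  # new user/tool tokens since the previous call
        self.token_ids += context
        self.mask += [0] * len(context)
        self.logprobs += [0.0] * len(context)
        # Generated tokens are stored exactly as sampled; nothing is re-tokenized.
        self.token_ids += response_ids
        self.mask += [1] * len(response_ids)
        self.logprobs += response_logprobs

    def dump(self, path: str) -> None:
        """Write the accumulated trajectory as JSON."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f)
```

The prefix check in add_turn is where a mutated history surfaces. If the re-rendered prompt no longer starts with the recorded tokens, this sketch refuses to append rather than silently re-tokenizing.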
When a scaffold mutates earlier history mid-conversation (e.g. Claude Code
injecting a dynamic <system-reminder>), the newly tokenized sequence diverges
from the recorded one. Instead of training on zeros, the proxy then recomputes
the rebuilt prefix's logprobs via vLLM prompt_logprobs. Setting the kill-switch
HARBOR_RECOMPUTE_MISMATCH_LOGPROBS=0 disables this recomputation.
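The mismatch path amounts to a prefix comparison followed by a fallback. In this sketch the helper names are hypothetical. The recompute argument is a placeholder for a vLLM request with prompt_logprobs enabled. vLLM's real prompt_logprobs output is structured differently, so the callable is simply assumed to return one float per input token. Three further choices are also assumptions: recorded logprobs are kept before the divergence point, the kill-switch falls back to zeros after it, and any value of HARBOR_RECOMPUTE_MISMATCH_LOGPROBS other than 0 counts as "enabled".

```python
from __future__ import annotations

import os
from typing import Callable, Mapping, Sequence

KILL_SWITCH = "HARBOR_RECOMPUTE_MISMATCH_LOGPROBS"


def first_divergence(old: Sequence[int], new: Sequence[int]) -> int | None:
    """Index where `new` stops matching `old`, or None if `new` extends `old`."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return len(new) if len(new) < len(old) else None


def rebuilt_prefix_logprobs(
    recorded_ids: Sequence[int],
    recorded_logprobs: Sequence[float],
    new_ids: Sequence[int],
    recompute: Callable[[list[int]], list[float]],
    env: Mapping[str, str] | None = None,
) -> list[float]:
    """Logprobs to attach to new_ids, the re-tokenized conversation prefix.

    `recompute` is a hypothetical stand-in for a vLLM prompt_logprobs request,
    assumed to return one float per token of its input.
    """
    env = os.environ if env is None else env
    d = first_divergence(recorded_ids, new_ids)
    if d is None:
        # No mismatch: recorded logprobs still line up; new context gets placeholders.
        return list(recorded_logprobs) + [0.0] * (len(new_ids) - len(recorded_ids))
    keep = list(recorded_logprobs[:d])  # tokens before the divergence are unchanged
    if env.get(KILL_SWITCH, "1") == "0":
        return keep + [0.0] * (len(new_ids) - d)  # kill-switch: fall back to zeros
    return keep + list(recompute(list(new_ids)))[d:]
```

Passing env explicitly keeps the kill-switch check testable. Leaving it as None reads the real process environment.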
Partial-rollout aware
Because the proxy forwards to verl's server_manager rather than directly to
vLLM, it inherits server_manager's handling of vLLM aborts and retries and of
the fully-async partial-rollout path (resuming a generation interrupted by a
weight sync). These are handled transparently on the way through.
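From the caller's side, transparent resumption looks roughly like the loop below. The loop resubmits the prompt plus all tokens generated so far until a call completes. This is an illustrative sketch, not verl's server_manager implementation. Chunk, generate_with_resume, generate_once, and max_attempts are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    """Hypothetical result of one backend generation call."""

    token_ids: list[int]
    logprobs: list[float]
    aborted: bool  # True if the call was cut short, e.g. by a weight sync


def generate_with_resume(
    prompt_ids: list[int],
    max_tokens: int,
    generate_once: Callable[[list[int], int], Chunk],
    max_attempts: int = 8,
) -> tuple[list[int], list[float]]:
    """Keep resuming an interrupted generation until it finishes.

    Each retry re-submits the prompt plus everything generated so far, so no
    sampled token is discarded. Tokens produced before and after a weight sync
    were sampled by different policy versions, and each chunk's logprobs come
    from the weights that produced it.
    """
    out_ids: list[int] = []
    out_lps: list[float] = []
    for _ in range(max_attempts):
        chunk = generate_once(prompt_ids + out_ids, max_tokens - len(out_ids))
        out_ids += chunk.token_ids
        out_lps += chunk.logprobs
        if not chunk.aborted or len(out_ids) >= max_tokens:
            return out_ids, out_lps
    raise RuntimeError(f"generation still interrupted after {max_attempts} attempts")
```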
For readiness checks and failure symptoms, see Inference Stack.