LiveRL

In-Process Proxy

One unified API in front of the policy that captures every trajectory

The In-Process Proxy is the single seam between every scaffold and the self-served policy. It presents one unified API surface, captures the exact token-level trajectory used for training, and forwards each call onward. It needs no standalone LiteLLM service and no fixed port.

Unified API

The proxy is started by the verl agent loop on the training host, binds to the host's primary IPv4 address on an ephemeral port, and hands each trial a unique per-session URL. It serves two protocol families from the same process:

in-process proxy   (per-session URL, e.g. http://<host-ip>:<port>/sess/<id>/v1)
  • OpenAI    /v1/chat/completions   (OpenHands / OpenCode / Terminus)
  • Anthropic /v1/messages           (Claude Code)
  └─▶ server_manager.generate(...) ─▶ vLLM replicas

So any scaffold works unchanged: Claude Code's ANTHROPIC_BASE_URL and the OpenAI scaffolds' base URL are simply overridden per trial with the session URL.
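
As a rough sketch of the wiring described above, the snippet below binds a listener on an OS-assigned (ephemeral) port and derives a per-session URL in the shape shown in the diagram. The helper names (bind_ephemeral, session_url, scaffold_env), the scaffold labels, and the use of OPENAI_BASE_URL for the OpenAI-style scaffolds are illustrative assumptions, not the proxy's actual code; only ANTHROPIC_BASE_URL and the URL shape come from the text. The example binds 127.0.0.1 for portability, whereas the proxy binds the host's primary IPv4 address.

```python
import socket
import uuid


def bind_ephemeral(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind a TCP listener on an OS-chosen port (port 0 means 'pick a free one')."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    sock.listen()
    return sock, sock.getsockname()[1]


def session_url(host: str, port: int, session_id: str) -> str:
    """Build a per-session base URL: http://<host-ip>:<port>/sess/<id>/v1."""
    return f"http://{host}:{port}/sess/{session_id}/v1"


def scaffold_env(scaffold: str, url: str) -> dict[str, str]:
    """Return the environment override that points a scaffold at the session URL."""
    if scaffold == "claude-code":
        return {"ANTHROPIC_BASE_URL": url}
    # Assumption: OpenAI-style scaffolds read their base URL from this variable.
    return {"OPENAI_BASE_URL": url}


if __name__ == "__main__":
    sock, port = bind_ephemeral()
    url = session_url("127.0.0.1", port, uuid.uuid4().hex)
    print(scaffold_env("claude-code", url))
    sock.close()
```

Because the port is chosen at bind time, the URL can only be handed to a trial after the listener exists, which is consistent with the proxy having no fixed port.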

Trajectory capture (token-in / token-out)

This is why the proxy exists. It tokenizes messages through the same apply_chat_template path as verl, forwards to server_manager.generate(...), and writes a per-trial proxy_trajectory.json carrying the token ids, masks, and logprobs. Training therefore consumes the exact tokens the policy produced, so there is no re-tokenization gap between rollout and update.
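
To make the token-in / token-out idea concrete, here is a minimal sketch of what one captured turn might look like when serialized. The class name, field names, and the convention that the mask is 0 for prompt tokens and 1 for generated tokens are all assumptions made for illustration; the real proxy_trajectory.json schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class TurnRecord:
    # Hypothetical fields; not the actual proxy_trajectory.json schema.
    prompt_ids: list[int]           # tokens sent to the policy
    response_ids: list[int]         # tokens the policy sampled
    response_logprobs: list[float]  # one logprob per sampled token

    def to_payload(self) -> dict:
        """Flatten the turn into aligned token_ids / mask / logprobs lists."""
        n_prompt = len(self.prompt_ids)
        n_resp = len(self.response_ids)
        if len(self.response_logprobs) != n_resp:
            raise ValueError("need exactly one logprob per response token")
        return {
            "token_ids": self.prompt_ids + self.response_ids,
            # Assumed convention: only generated tokens contribute to the loss.
            "mask": [0] * n_prompt + [1] * n_resp,
            # Prompt positions get a placeholder of 0.0 in this sketch.
            "logprobs": [0.0] * n_prompt + self.response_logprobs,
        }


if __name__ == "__main__":
    turn = TurnRecord([101, 7, 8], [42, 43], [-0.1, -0.7])
    print(json.dumps(turn.to_payload()))
```

Storing the ids themselves, rather than the decoded text, is what lets the trainer skip tokenization entirely on the update side.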

When a scaffold mutates earlier history mid-conversation (e.g. Claude Code injecting a dynamic <system-reminder>) and the token sequence diverges from what was captured, the proxy recomputes the rebuilt prefix's logprobs via vLLM prompt_logprobs instead of training on zeros. Setting HARBOR_RECOMPUTE_MISMATCH_LOGPROBS=0 disables this recomputation.
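
One way to picture the divergence check is a longest-common-prefix comparison between the tokens already captured and the tokens of the newly rebuilt prompt. The sketch below is an assumption about how such a check could look, not the proxy's implementation; the function names are hypothetical, and treating the recompute as enabled unless the variable equals "0" is inferred from the kill-switch wording.

```python
import os
from collections.abc import Mapping


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading positions where both token sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def recompute_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: recomputation is on unless the variable is set to "0".
    return env.get("HARBOR_RECOMPUTE_MISMATCH_LOGPROBS", "1") != "0"


def needs_recompute(
    captured_ids: list[int],
    rebuilt_prompt_ids: list[int],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """True when the rebuilt prompt no longer extends the captured sequence."""
    shared = common_prefix_len(captured_ids, rebuilt_prompt_ids)
    diverged = shared < len(captured_ids)
    return diverged and recompute_enabled(env)
```

When the check fires, the logprobs stored for the old prefix no longer describe the tokens the model is now conditioned on, which is why scoring the rebuilt prefix again is preferable to filling those positions with zeros.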

Partial-rollout aware

Because the proxy forwards to verl's server_manager rather than to vLLM directly, vLLM aborts and retries are handled transparently on the way through, as is the fully-async partial rollout path (resuming a generation interrupted by a weight sync).
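
The general idea of resuming an interrupted generation can be sketched as a loop that feeds the partial output back in as part of the prompt until the backend reports completion. This is a toy illustration only: Chunk, generate_with_resume, and the generate callable are hypothetical names, and verl's server_manager may handle aborts quite differently.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Chunk:
    token_ids: list[int]
    finished: bool  # False means the generation was cut off before completing


def generate_with_resume(
    generate: Callable[[list[int]], Chunk],
    prompt_ids: list[int],
    max_attempts: int = 8,
) -> list[int]:
    """Call `generate` until it finishes, resuming from the partial output."""
    output: list[int] = []
    for _ in range(max_attempts):
        # Resume by treating everything produced so far as part of the prompt.
        chunk = generate(prompt_ids + output)
        output.extend(chunk.token_ids)
        if chunk.finished:
            return output
    raise RuntimeError("generation did not finish within max_attempts")
```

From the caller's side the interruption is invisible: it passes a prompt and receives one contiguous list of output tokens, which matches the "handled transparently" behavior described above.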

For readiness checks and failure symptoms, see Inference Stack.
