LiveRL

Core Concepts

The framework layers and terminology in LiveRL

LiveRL is organized into the layers described in the Architecture section: an Environment of sandboxed scaffolds, AgentLoopWorkers that run them, an In-Process Proxy and Global Load Balancer that front the policy, a Rollouter that serves the policy and buffers rollouts, and a Trainer that learns from those rollouts. The terms below name the pieces you will see in configs, logs, and the dashboard.

Policy, rollout, and trial

The policy is the model being trained. A rollout (Harbor calls it a trial) is one full attempt by the agent to solve a task: the agent reasons, edits files, and runs commands inside a sandbox until it submits or hits the turn/timeout limit. One training step samples a batch of rollouts across tasks (train_prompt_bsz × n_resp_per_prompt, e.g. 64 × 8 = 512 trials/step).
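
The batch arithmetic can be checked with a few lines of Python. This is only a sketch: the variable names mirror the launch-script settings mentioned above, and the values are the example figures from the text, not recommended defaults.

```python
def trials_per_step(train_prompt_bsz: int, n_resp_per_prompt: int) -> int:
    """Number of rollouts (trials) sampled in one training step."""
    if train_prompt_bsz <= 0 or n_resp_per_prompt <= 0:
        raise ValueError("both batch parameters must be positive")
    return train_prompt_bsz * n_resp_per_prompt


# Example figures from the text: 64 prompts, 8 responses per prompt.
total = trials_per_step(train_prompt_bsz=64, n_resp_per_prompt=8)
print(total)  # 512
```

Each trial occupies its own sandbox, so this number is also a rough upper bound on how many sandboxes a single step can request.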

Reward

Each task ships its own verifier (a test script). When a rollout finishes, the verifier runs inside the sandbox and produces a reward — typically 1.0 if the tests pass and 0.0 otherwise. This grounds the optimization in real task success rather than a learned reward model. On the dashboard this surfaces as critic/score/mean.
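
As an illustration of the binary reward, the sketch below maps a verifier result to 1.0 or 0.0 and averages a batch. It assumes the verifier signals success through a zero exit code, which is a common convention but is not confirmed here; the helper names are hypothetical and this is not LiveRL's own reward code.

```python
from typing import Iterable


def reward_from_exit_code(returncode: int) -> float:
    """Assumed convention: exit code 0 means the task's tests passed."""
    return 1.0 if returncode == 0 else 0.0


def mean_score(rewards: Iterable[float]) -> float:
    """Average reward over a batch of finished rollouts."""
    rewards = list(rewards)
    if not rewards:
        raise ValueError("no rewards to average")
    return sum(rewards) / len(rewards)


# Hypothetical exit codes from four finished trials.
exit_codes = [0, 1, 0, 2]
rewards = [reward_from_exit_code(code) for code in exit_codes]
print(mean_score(rewards))  # 0.5
```

With a strictly binary reward, the batch mean is simply the fraction of trials whose tests passed.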

Agent scaffold

The scaffold is the agent harness that drives the model through a task — claude-code by default. It determines the rollout's shape and which runtime image Harbor launches, and is selected per run via HARBOR_AGENT_IMPORT_PATH (e.g. harbor_patch.agents.image_mounted_claude_code:ClaudeCode). Other scaffolds ship under src/harbor_patch/agents/ — OpenHands (ai / sdk), OpenCode, Terminus 2.

The inference chain (in-process proxy)

During training the policy serves itself: verl manages the vLLM replicas, and the agent inside each sandbox reaches them through an in-process proxy — there is no standalone LiteLLM service. The proxy is started by the verl agent loop, binds the training host's primary IPv4 on an ephemeral port, and hands each trial a unique per-session URL.

coding agent (in a K8s pod / Docker container)
  └─▶ in-process proxy   (per-session URL on the training host)
        • OpenAI    /sess/<id>/v1/chat/completions   (OpenHands / OpenCode / Terminus)
        • Anthropic /sess/<id>/v1/messages           (Claude Code)
        • captures the per-token trajectory → proxy_trajectory.json
        └─▶ server_manager.generate(...) ─▶ vLLM replicas (verl-managed, DP × TP)
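
The ephemeral-port binding and per-session URLs shown above can be sketched as follows. This is a minimal illustration, not the proxy's implementation: it binds the loopback address instead of the host's primary IPv4, and the http scheme, the UUID session id, and the helper names are assumptions. Only the /sess/<id>/v1/... path shapes come from the diagram.

```python
import socket
import uuid


def bind_ephemeral(host: str = "127.0.0.1") -> socket.socket:
    """Bind a TCP listener on a port chosen by the operating system."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    sock.listen()
    return sock


def session_base_url(host: str, port: int, session_id: str) -> str:
    """Build the per-session prefix (assumed scheme: plain http)."""
    return f"http://{host}:{port}/sess/{session_id}"


sock = bind_ephemeral()
host, port = sock.getsockname()
session_id = uuid.uuid4().hex  # hypothetical id scheme
base = session_base_url(host, port, session_id)
openai_url = f"{base}/v1/chat/completions"
anthropic_url = f"{base}/v1/messages"
sock.close()

print(openai_url)
print(anthropic_url)
```

Giving every trial its own path prefix lets a single listener attribute each request to the trial that sent it.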

What the proxy does for every scaffold:

  • Unified API — presents an ordinary OpenAI and Anthropic surface, so any standard scaffold works unchanged. Claude Code's ANTHROPIC_BASE_URL and the OpenAI scaffolds' base URL are overridden per trial with the proxy's session URL.
  • Trajectory capture — tokenizes messages through the same apply_chat_template path as verl, forwards to server_manager.generate(...), and dumps a per-trial proxy_trajectory.json (token ids / masks / logprobs) for the training update and for inspection.
  • Partial-rollout aware — because it forwards to verl's server manager rather than vLLM directly, vLLM aborts/retries (and the fully-async partial-rollout path) are handled transparently.
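
The per-trial override described in the first bullet might look like the sketch below. ANTHROPIC_BASE_URL is the variable named above. OPENAI_BASE_URL is the variable read by the OpenAI Python SDK, and whether each OpenAI-style scaffold honors it is an assumption, as is the helper name. The /v1 suffix reflects how the two SDKs build request paths: the Anthropic client appends /v1/messages to its base URL, while the OpenAI client expects the base URL to already end in /v1.

```python
from typing import Dict


def trial_env(session_url: str, api_style: str) -> Dict[str, str]:
    """Environment overrides pointing one trial's scaffold at the proxy.

    session_url is the per-session prefix, e.g. http://host:port/sess/<id>.
    """
    session_url = session_url.rstrip("/")
    if api_style == "anthropic":
        # The Anthropic client appends /v1/messages itself.
        return {"ANTHROPIC_BASE_URL": session_url}
    if api_style == "openai":
        # The OpenAI client appends /chat/completions to a base ending in /v1.
        return {"OPENAI_BASE_URL": f"{session_url}/v1"}
    raise ValueError(f"unknown api_style: {api_style!r}")


# Hypothetical session URL for illustration only.
env = trial_env("http://10.0.0.5:40123/sess/abc123", "anthropic")
print(env)
```

Because the override is plain environment configuration, the scaffold itself needs no code changes to talk to the policy under training.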

The verl training loop

verl drives the RL loop: generate rollouts with vLLM → execute each trial in a Harbor sandbox → collect rewards → compute advantages → update the actor (and critic, for PPO) → sync weights.

In the synchronous single-node setup (scripts/sync_1node_cc.sh) the trainer and the rollout engine share the same GPUs on one node — the recommended, minimal-cost default. The fully-async path splits the cluster into a rollout pool and a trainer pool that run concurrently; see Scaling Up.

Algorithms

Selected via the adv_estimator / policy_loss_mode variables in the launch script:

  • PPO — actor + critic, clipped policy-gradient objective.
  • GRPO — group-relative advantages, no critic (adv_estimator=grpo).
  • GSPO — sequence-level policy optimization (policy_loss_mode=gspo).

The default profile runs GRPO advantages with a GSPO loss.
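
To make the default combination concrete, the sketch below shows one common formulation of each piece: group-relative advantages (reward minus the group mean, divided by the group standard deviation) and a length-normalized sequence-level importance ratio. It is illustrative only and is not verl's code; normalization details such as sample versus population standard deviation and the epsilon differ between implementations.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def gspo_sequence_ratio(logp_new: Sequence[float], logp_old: Sequence[float]) -> float:
    """Sequence-level ratio: exp of the mean per-token log-prob difference."""
    if not logp_new or len(logp_new) != len(logp_old):
        raise ValueError("log-prob sequences must be non-empty and equal length")
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)


# One prompt with 8 responses, two of which passed the verifier.
group = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
advantages = grpo_advantages(group)
print([round(a, 3) for a in advantages])
```

A group in which every response earns the same reward yields all-zero advantages, so prompts that are always solved or never solved contribute no gradient signal under this formulation.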

Trainer backends

verl can drive the actor on different backends, set per launch script:

  • FSDP — the default for dense models and the validated fully-async path.
  • VeOmni — for large MoE policies (e.g. Qwen3-30B-A3B); selected with USE_NEW_VERL=1 so import verl resolves to the verl-swe_agent_opd_dev checkout that carries the router-replay (R3) wiring.
  • Megatron — tensor/pipeline-parallel backend for the largest models.

Sandbox backend

Every trial runs in a fresh sandbox, selected via HARBOR_ENVIRONMENT_IMPORT_PATH:

  • Kubernetes — each trial is a pod; production default (harbor_patch.environments.kubernetes.kubernetes:KubernetesEnvironment, configured with K8S_KUBECONFIG / K8S_NAMESPACE).
  • Docker — each trial is a local or remote container; the minimal setup, no cluster required (harbor_patch.environments.remote_docker:RemoteDockerEnvironment).

See Backends for the trade-offs and config.
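
Both HARBOR_ENVIRONMENT_IMPORT_PATH and HARBOR_AGENT_IMPORT_PATH use a module:Class string. The sketch below shows how such a string is typically resolved with importlib; Harbor's actual loader is not shown on this page, so the function name is hypothetical and a standard-library class stands in for the real environment class.

```python
import importlib
from typing import Any


def resolve_import_path(path: str) -> Any:
    """Resolve a 'package.module:Attribute' string to the named object."""
    module_name, sep, attr = path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Class', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target so the example runs without Harbor installed.
cls = resolve_import_path("collections:OrderedDict")
print(cls.__name__)  # OrderedDict
```

A missing colon fails fast with a ValueError, whereas a misspelled module or class name surfaces as an ImportError or AttributeError at resolution time.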

Checkpoints

verl writes the actor as FSDP shards every save_freq steps to checkpoints/<project>/<exp>/global_step_N/. These are the primary training output and survive cleanup runs.
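
Given that directory layout, the most recent checkpoint can be located by parsing the step number from each directory name. The helper below is a hypothetical sketch, not part of verl; it compares steps numerically so that global_step_100 sorts after global_step_20, and it demonstrates itself on a temporary directory with made-up project and experiment names.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_STEP_RE = re.compile(r"^global_step_(\d+)$")


def latest_checkpoint(exp_dir: Path) -> Optional[Path]:
    """Return the global_step_N directory with the largest N, if any."""
    best_step, best_path = -1, None
    for child in exp_dir.iterdir():
        match = _STEP_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demonstration on a throwaway tree that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    exp = Path(tmp) / "checkpoints" / "my_project" / "my_exp"
    for step in (20, 100, 3):
        (exp / f"global_step_{step}").mkdir(parents=True)
    latest_name = latest_checkpoint(exp).name

print(latest_name)  # global_step_100
```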

The training venv

All training code runs from a single venv (default .venv, built by scripts/setup_env.sh) with editable installs of harbor, verl, and harbor-verl-train. The editable paths must point at the trees you intend to run — otherwise the process silently executes different code. Verify with:

.venv/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__, verl.__file__, verl_patch.__file__, sep='\n')"
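
The same check can be scripted so that a missing package is reported rather than raising. The sketch below uses importlib.util.find_spec, which locates a top-level module without importing it; the helper name is hypothetical, and the module names are the ones used in the command above. Run it with the venv's interpreter so it inspects the right environment.

```python
import importlib.util
from typing import Optional


def module_origin(name: str) -> Optional[str]:
    """Return where a top-level module would load from, or None if absent."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin; fall back to their first search path.
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


origins = {name: module_origin(name) for name in ("harbor", "verl", "verl_patch")}
for name, origin in origins.items():
    print(f"{name:12s} {origin or 'NOT FOUND'}")
```

Compare each printed path against the source tree you intend to run; a path under site-packages rather than your checkout suggests the install is not editable.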

Metrics

verl logs one line per step, in the form step:N - key:value - ..., to both the launch log and wandb. The metrics to watch:

Metric                 Meaning
critic/score/mean      Mean reward (task success signal)
actor/entropy          Policy entropy (exploration vs. collapse)
actor/ppo_kl           KL between old and new policy
actor/pg_clipfrac      Fraction of clipped policy-gradient updates
perf/mfu/actor         Model FLOPs utilization (throughput)
response_length/mean   Mean rollout length in tokens
rollout_corr/kl        Rollout-vs-training distribution shift

The dashboard renders all of these as live charts.
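
For quick offline inspection, the per-step log line can be parsed into a dictionary. The sketch below assumes the step:N - key:value - ... shape described above, with numeric values and an optional logger prefix before step:. The sample line and its values are invented for illustration.

```python
from typing import Dict


def parse_step_line(line: str) -> Dict[str, float]:
    """Parse 'step:N - key:value - ...' into {key: float}."""
    start = line.find("step:")
    if start == -1:
        return {}  # not a metrics line
    metrics: Dict[str, float] = {}
    for field in line[start:].strip().split(" - "):
        key, sep, value = field.partition(":")
        if not sep:
            continue
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            continue  # skip non-numeric values
    return metrics


# Invented sample line; real runs log many more keys.
sample = "step:12 - critic/score/mean:0.375 - actor/entropy:0.81 - response_length/mean:2048.0"
parsed = parse_step_line(sample)
print(parsed["critic/score/mean"])  # 0.375
```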
