Core Concepts
The framework layers and terminology in LiveRL
LiveRL is organized as the layers in the Architecture section: an Environment of sandboxed scaffolds, AgentLoopWorkers that run them, an In-Process Proxy + Global Load Balancer that front the policy, a Rollouter that serves it and buffers rollouts, and a Trainer that learns from them. The terms below name the pieces you will see in configs, logs, and the dashboard.
Policy, rollout, and trial
The policy is the model being trained. A rollout (Harbor calls it a
trial) is one full attempt by the agent to solve a task: the agent reasons,
edits files, and runs commands inside a sandbox until it submits or hits the
turn/timeout limit. One training step samples a batch of rollouts across tasks
(train_prompt_bsz × n_resp_per_prompt, e.g. 64 × 8 = 512 trials/step).
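The batch arithmetic above can be checked in a few lines. The variable names mirror the config keys; the values are the example ones from the text, not defaults taken from any particular launch script.

```python
# Trials per training step = prompts per batch x responses sampled per prompt.
train_prompt_bsz = 64    # tasks (prompts) sampled per step; example value
n_resp_per_prompt = 8    # rollouts sampled per task; example value

trials_per_step = train_prompt_bsz * n_resp_per_prompt
print(trials_per_step)  # 512
```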
Reward
Each task ships its own verifier (a test script). When a rollout finishes,
the verifier runs inside the sandbox and produces a reward — typically 1.0
if the tests pass and 0.0 otherwise. This grounds the optimization in real
task success rather than a learned reward model. On the dashboard this surfaces
as critic/score/mean.
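As a minimal sketch of the binary reward described above, the snippet below assumes the verifier reports success through its process exit status and that a timeout counts as failure. Both are assumptions, and reward_from_exit_code and run_verifier are hypothetical helpers, not Harbor APIs.

```python
import subprocess
import sys

def reward_from_exit_code(returncode: int) -> float:
    """Map a verifier's exit status to a binary reward (hypothetical helper)."""
    return 1.0 if returncode == 0 else 0.0

def run_verifier(cmd: list[str], timeout_s: float = 600.0) -> float:
    """Run a test command and score it; a timeout is scored 0.0 (assumption)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0
    return reward_from_exit_code(proc.returncode)

# Stand-in verifiers: one passing and one failing "test script".
rewards = [
    run_verifier([sys.executable, "-c", "raise SystemExit(0)"]),
    run_verifier([sys.executable, "-c", "raise SystemExit(1)"]),
]

# The mean over a batch of such rewards is the kind of quantity a
# mean-score metric summarizes.
mean_score = sum(rewards) / len(rewards)
```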
Agent scaffold
The scaffold is the agent harness that drives the model through a task —
claude-code by default. It determines the rollout's shape and which runtime
image Harbor launches, and is selected per run via HARBOR_AGENT_IMPORT_PATH
(e.g. harbor_patch.agents.image_mounted_claude_code:ClaudeCode). Other
scaffolds ship under src/harbor_patch/agents/ — OpenHands (ai / sdk), OpenCode,
Terminus 2.
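The value of HARBOR_AGENT_IMPORT_PATH has the shape module.path:ClassName. How Harbor resolves it internally is not shown here; the sketch below is one conventional way to turn such a string into a class using importlib, with resolve_import_path as a hypothetical helper name.

```python
import importlib

def resolve_import_path(spec: str):
    """Resolve 'package.module:ClassName' to the object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Demonstrated with a stdlib path so the snippet runs without Harbor installed.
cls = resolve_import_path("collections:OrderedDict")
print(cls.__name__)
```

A malformed value (for example, one without the colon) raises ValueError early instead of failing later inside a running trial.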
The inference chain (in-process proxy)
During training the policy serves itself: verl manages the vLLM replicas, and the agent inside each sandbox reaches them through an in-process proxy — there is no standalone LiteLLM service. The proxy is started by the verl agent loop, binds the training host's primary IPv4 on an ephemeral port, and hands each trial a unique per-session URL.
coding agent (in a K8s pod / Docker container)
  └─▶ in-process proxy (per-session URL on the training host)
        • OpenAI    /sess/<id>/v1/chat/completions  (OpenHands / OpenCode / Terminus)
        • Anthropic /sess/<id>/v1/messages          (Claude Code)
        • captures the per-token trajectory → proxy_trajectory.json
        └─▶ server_manager.generate(...) ─▶ vLLM replicas (verl-managed, DP × TP)

What the proxy does for every scaffold:
- Unified API: presents an ordinary OpenAI and Anthropic surface, so any standard scaffold works unchanged. Claude Code's ANTHROPIC_BASE_URL and the OpenAI scaffolds' base URL are overridden per trial with the proxy's session URL.
- Trajectory capture: tokenizes messages through the same apply_chat_template path as verl, forwards to server_manager.generate(...), and dumps a per-trial proxy_trajectory.json (token ids / masks / logprobs) for the training update and for inspection.
- Partial-rollout aware: because it forwards to verl's server manager rather than to vLLM directly, vLLM aborts/retries (and the fully-async partial-rollout path) are handled transparently.
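To make the per-trial base-URL override concrete, here is a small sketch of how session URLs could be derived from the /sess/<id>/... routes listed above. The host, port, and session-id format are made up for illustration, session_base_urls is a hypothetical helper, and the use of OPENAI_BASE_URL for the OpenAI-style scaffolds is an assumption; only ANTHROPIC_BASE_URL is named in the text.

```python
import uuid

def session_base_urls(host: str, port: int, session_id: str) -> dict[str, str]:
    """Build per-session base URLs for both API surfaces (hypothetical helper)."""
    root = f"http://{host}:{port}/sess/{session_id}"
    return {
        # Anthropic clients append /v1/messages to their base URL.
        "ANTHROPIC_BASE_URL": root,
        # OpenAI-style clients conventionally take a base URL ending in /v1
        # and append /chat/completions (env var name is an assumption).
        "OPENAI_BASE_URL": f"{root}/v1",
    }

# Illustrative values only: the real proxy picks an ephemeral port itself.
session_id = uuid.uuid4().hex
env = session_base_urls("10.0.0.5", 41873, session_id)

anthropic_endpoint = env["ANTHROPIC_BASE_URL"] + "/v1/messages"
openai_endpoint = env["OPENAI_BASE_URL"] + "/chat/completions"
```

Because the session id is part of the path, every request a trial makes can be attributed to that trial without any cooperation from the scaffold.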
The verl training loop
verl drives the RL loop: generate rollouts with vLLM → execute each trial in a Harbor sandbox → collect rewards → compute advantages → update the actor (and critic, for PPO) → sync weights.
In the synchronous single-node setup (scripts/sync_1node_cc.sh) the trainer
and the rollout engine share the same GPUs on one node — the recommended,
minimal-cost default. The fully-async path splits the cluster into a rollout
pool and a trainer pool that run concurrently; see
Scaling Up.
Algorithms
Selected via the adv_estimator / policy_loss_mode variables in the launch
script:
- PPO: actor + critic, clipped policy-gradient objective.
- GRPO: group-relative advantages, no critic (adv_estimator=grpo).
- GSPO: sequence-level policy optimization (policy_loss_mode=gspo).
The default profile runs GRPO advantages with a GSPO loss.
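The two ingredients of the default profile can be illustrated in plain Python. This is a simplified sketch of the published GRPO and GSPO formulations, not verl's implementation: the population standard deviation and the eps value are assumptions (implementations differ), and clipping and token masking are omitted.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other rollouts of the same task."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in group_rewards]

def gspo_sequence_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """Length-normalized sequence-level importance ratio:
    exp(mean over tokens of (log pi_new - log pi_old))."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Eight rollouts of one task with binary rewards (two successes).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
adv = grpo_advantages(rewards)

# A group where every rollout gets the same reward carries no learning signal.
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# Identical old and new log-probs give a ratio of exactly 1.
ratio = gspo_sequence_ratio([-1.0, -2.0], [-1.0, -2.0])
```

No critic is needed because the group mean plays the role of the baseline: successful rollouts get positive advantages, failed ones negative, and the advantages of a group sum to roughly zero.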
Trainer backends
verl can drive the actor on different backends, set per launch script:
- FSDP: the default for dense models and the validated fully-async path.
- VeOmni: for large MoE policies (e.g. Qwen3-30B-A3B); selected with USE_NEW_VERL=1 so that import verl resolves to the verl-swe_agent_opd_dev checkout, which carries the router-replay (R3) wiring.
- Megatron: tensor/pipeline-parallel backend for the largest models.
Sandbox backend
Every trial runs in a fresh sandbox, selected via HARBOR_ENVIRONMENT_IMPORT_PATH:
- Kubernetes: each trial is a pod; the production default (harbor_patch.environments.kubernetes.kubernetes:KubernetesEnvironment, configured with K8S_KUBECONFIG / K8S_NAMESPACE).
- Docker: each trial is a local or remote container; the minimal setup, no cluster required (harbor_patch.environments.remote_docker:RemoteDockerEnvironment).
See Backends for the trade-offs and config.
Checkpoints
verl writes the actor as FSDP shards every save_freq steps to
checkpoints/<project>/<exp>/global_step_N/. These are the primary training
output and survive cleanup runs.
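Given the global_step_N directory layout above, a resume or evaluation script usually needs the newest checkpoint. The sketch below uses a hypothetical helper, latest_checkpoint, and a temporary directory standing in for checkpoints/<project>/<exp>/. It compares step numbers numerically, since a plain string sort would rank global_step_20 above global_step_100.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_STEP_RE = re.compile(r"^global_step_(\d+)$")

def latest_checkpoint(exp_dir: Path) -> Optional[Path]:
    """Return the global_step_N directory with the largest N, or None."""
    best = None
    for child in exp_dir.iterdir():
        match = _STEP_RE.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None

# Demo on a throwaway tree; the project and experiment names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    exp = Path(tmp) / "checkpoints" / "demo_project" / "demo_exp"
    for n in (10, 20, 100):
        (exp / f"global_step_{n}").mkdir(parents=True)
    latest_name = latest_checkpoint(exp).name

print(latest_name)
```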
The training venv
All training code runs from a single venv (default .venv, built by
scripts/setup_env.sh) with editable installs of harbor, verl, and
harbor-verl-train. The editable paths must point at the trees you intend to run
— otherwise the process silently executes different code. Verify with:
.venv/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__)"
Metrics
verl logs one step:N - key:value - ... line per step (to the launch log and
wandb). The ones to watch:
| Metric | Meaning |
|---|---|
| critic/score/mean | Mean reward (task success signal) |
| actor/entropy | Policy entropy (exploration vs. collapse) |
| actor/ppo_kl | KL between old and new policy |
| actor/pg_clipfrac | Fraction of clipped policy-gradient updates |
| perf/mfu/actor | Model FLOPs utilization (throughput) |
| response_length/mean | Mean rollout length in tokens |
| rollout_corr/kl | Rollout-vs-training distribution shift |
The dashboard renders all of these as live charts.
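For ad-hoc analysis outside the dashboard, the step:N - key:value - ... line format described above is simple to parse. The sketch below assumes fields are separated by " - " and that every value of interest is numeric; parse_step_line is a hypothetical helper and the sample line contains made-up values.

```python
def parse_step_line(line: str) -> dict[str, float]:
    """Parse one 'step:N - key:value - ...' log line into a dict of floats."""
    metrics: dict[str, float] = {}
    for field in line.strip().split(" - "):
        # Split on the last colon so keys keep their slashes intact.
        key, sep, value = field.rpartition(":")
        if not sep:
            continue  # not a key:value field
        try:
            metrics[key] = float(value)
        except ValueError:
            continue  # skip non-numeric values
    return metrics

# Illustrative line; the numbers are invented for the example.
line = "step:12 - critic/score/mean:0.375 - actor/entropy:0.82 - actor/ppo_kl:-0.001"
m = parse_step_line(line)
print(m["critic/score/mean"])
```

Negative values such as the KL above parse correctly because the separator includes the surrounding spaces.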