Run Training

A LiveRL run boots vLLM to serve the policy, starts the in-process proxy in front of it, and lets verl drive the RL loop while a coding agent rolls out across tasks inside Harbor sandboxes. Everything is driven from a single launch script — no separate config file.

# single-node, synchronous (the minimal-cost default)
bash scripts/cleanup_before_run.sh   # optional: reset stale Ray/vLLM state
bash scripts/sync_1node_cc.sh        # boot vLLM + proxy, then run the PPO/GRPO/GSPO loop

You configure a run by editing the variables at the top of the script (or exporting them as env vars) — see Inputs & Configuration.
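One common way for a launch script to accept both in-file edits and exported values is the shell default-if-unset expansion. The sketch below assumes that pattern; `EXP_NAME`, `SAVE_FREQ`, and their defaults are illustrative names and are not confirmed variables of scripts/sync_1node_cc.sh.

```shell
# Sketch only: EXP_NAME, SAVE_FREQ, and their defaults are hypothetical.
# ${VAR:-default} keeps an exported value and otherwise falls back to the
# default written in the script, so both ways of configuring a run coexist.
EXP_NAME="${EXP_NAME:-demo_run}"
SAVE_FREQ="${SAVE_FREQ:-10}"
echo "exp=${EXP_NAME} save_freq=${SAVE_FREQ}"
```

Under that assumption, a variable could be overridden for a single run by prefixing the launch command with an assignment, without touching the file.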

Launch in the background

Training runs for hours. Launch it inside tmux, or under nohup setsid, so it survives a shell disconnect, then follow progress with tail -f logs/<exp>.log.
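A detached launch along these lines might look like the sketch below. `launch_detached` is a hypothetical helper, not part of the repo, and it assumes `setsid` (util-linux) is installed. Its output file only captures whatever the launch script prints to the terminal; if the script already writes logs/<exp>.log itself, point the helper at a different file.

```shell
# Hypothetical helper: run a command detached from the terminal.
# Usage: launch_detached <output-file> <command> [args...]
launch_detached() {
  out_file="$1"
  shift
  mkdir -p "$(dirname "$out_file")"
  # setsid starts the command in a new session; nohup makes it ignore the
  # SIGHUP sent when the shell disconnects. stdin is closed so the job
  # never blocks waiting on the terminal.
  nohup setsid "$@" > "$out_file" 2>&1 < /dev/null &
}
```

For example, `launch_detached logs/launcher.out bash scripts/sync_1node_cc.sh` (the output file name is illustrative) starts the run in the background. The tmux alternative is to open a session with `tmux new -s liverl`, run the script inside it, and detach.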

The boot sequence

venv: built once by scripts/setup_env.sh (or reused via VENV_PATH).
vLLM: verl launches the replicas in a DP × TP layout; on a 30B MoE with TP=4, CUDA-graph capture takes roughly 10–20 min before the first replica registers.
In-process proxy: started by the verl agent loop; it advertises a separate session URL for each trial (no standalone LiteLLM, no fixed port).
verl loop: generate rollouts → run trials in Harbor → reward → update → checkpoint every save_freq steps.
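The replica layout in the vLLM step can be sanity-checked with a little arithmetic. The sketch below reads "DP × TP" as DP replicas of TP GPUs each and assumes every GPU on the host serves rollouts, so DP = GPUs / TP; `replica_count` is a hypothetical helper, not a script shipped with the repo.

```shell
# Sketch: derive the data-parallel replica count from GPU count and TP size.
# Assumes all GPUs on the host are used for rollouts, so DP = GPUs / TP.
replica_count() {
  gpus="$1"
  tp="$2"
  # Reject a TP size that is non-positive or leaves GPUs unused.
  if [ "$tp" -le 0 ] || [ $((gpus % tp)) -ne 0 ]; then
    echo "TP=$tp does not evenly divide $gpus GPUs" >&2
    return 1
  fi
  echo $((gpus / tp))
}

replica_count 8 4   # prints 2: two TP=4 replicas on one 8-GPU host
```

A TP size that does not divide the GPU count makes the function return non-zero, which is a cheap check to run before committing to a multi-hour job.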

In this section

Preflight — what to validate before a multi-hour run, including KV-head divisibility
Inference Stack — the agent → in-process proxy → vLLM chain and its health checks
Backends — Kubernetes vs Docker sandboxes
Results & Artifacts — logs, checkpoints, trajectories
Scaling Up — advanced: multi-node, fully-async, VeOmni/MoE, partial rollout, R3

Start single-node

The single-node path above is the minimal-cost way to run LiveRL — one 8×GPU host, one launch script. Only move to multiple machines when you outgrow it; see Scaling Up.

Stop cleanly between runs

Between runs, bash scripts/cleanup_before_run.sh stops Ray, frees ports, and reaps stale vLLM/verl processes. Checkpoints and wandb runs are preserved; set KEEP_TRIALS=0 to also clear the trials dir. It never kills running GPU tasks.
