Getting Started
Set up LiveRL and launch your first training run
This page walks through preparing LiveRL and launching a training run end to end.
LiveRL is the harbor-verl-train repo; it bundles a patched verl
(trainer + rollout) and Harbor (sandboxed
task execution + verifier). A run is driven directly by a launch script — there is
no separate config file; you set values at the top of the script or via
environment variables.
Prerequisites
- 8× GPU (A100/H100-class) on the training node — vLLM serves the policy and verl runs the trainer on the same host for the single-node path.
- A sandbox backend — either a reachable Kubernetes cluster (production) or a Docker daemon (local or remote) for Harbor to run each trial in.
- uv: used by scripts/setup_env.sh to build the training venv (vllm, flash_attn, verl, harbor, harbor-verl-train, editable).
- A policy checkpoint: e.g. an SFT-trained Qwen3 model on local disk.
- Train/val task indexes: Harbor task parquet files (*.parquet) the trainer samples rollouts from.
Where to run
Training is hours-long — launch it inside a tmux session (or with
nohup setsid) so it survives shell disconnects.
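Both options can be wrapped in one small helper. This is a minimal sketch, not part of LiveRL: the function name `launch_detached`, the session name, and the log path are all hypothetical, and it assumes a Linux host where `setsid` is available.

```shell
# Hypothetical helper: start a command so it survives shell disconnects.
# Prefers tmux when it is installed, otherwise falls back to nohup + setsid.
# Arguments: <session-name> <log-file> <command> [args...]
launch_detached() {
    local session="$1" log="$2"
    shift 2
    mkdir -p "$(dirname "$log")"
    if command -v tmux >/dev/null 2>&1; then
        # Detached tmux session; reattach later with: tmux attach -t <session>
        # Note: "$*" flattens the arguments, so this suits simple commands only.
        tmux new-session -d -s "$session" "$* 2>&1 | tee -a '$log'"
    else
        # No tmux: start a new session and ignore SIGHUP, logging to the file.
        nohup setsid "$@" >>"$log" 2>&1 &
    fi
}

# Usage (from the repo root; session name and log path are illustrative):
#   launch_detached liverl logs/launch.log bash scripts/sync_1node_cc.sh
```

With tmux, the run stays attached to a terminal you can return to; with the fallback, the log file is the only view into the process.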
1. Build the environment
scripts/setup_env.sh builds the venv at .venv via uv and clones + editably
installs harbor and verl (plus pinned vllm / flash_attn / transformers /
cupy). It is idempotent:
bash scripts/setup_env.sh
Key overrides (env vars): VENV_PATH, HARBOR_DIR / VERL_DIR (clone
locations), HARBOR_REF / VERL_REF (branches), PYTHON_VERSION (default
3.12.3), and the SKIP_CLONE / SKIP_FLASH_ATTN / SKIP_CUPY skip flags.
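As an illustration, the overrides can be exported before the script runs. Only the variable names come from the list above; the values are placeholders, and treating `1` as the value that enables a skip flag is an assumption to check against the script itself.

```shell
# Illustrative override values; only the variable names are documented.
export VENV_PATH="$PWD/.venv"      # where uv builds the venv
export PYTHON_VERSION="3.12.3"     # the documented default
export SKIP_FLASH_ATTN=1           # assumption: "1" turns a skip flag on

# Run the setup only when the script is present (i.e. from the repo root).
if [ -f scripts/setup_env.sh ]; then
    bash scripts/setup_env.sh
else
    echo "scripts/setup_env.sh not found; run this from the repo root" >&2
fi
```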
Confirm the editable installs resolve to the trees you expect:
.venv/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__)"
2. Point at your model, data, and backend
For the single-node path, edit the variables near the top of
scripts/sync_1node_cc.sh (or export them before launching):
| Variable | What it sets |
|---|---|
| MODEL_PATH | policy checkpoint to train |
| SERVED_MODEL_NAME | vLLM model name (default vllm_model) |
| TRAIN_INDEX / VAL_INDEX | Harbor task parquet indexes |
| PROJECT_NAME / EXP_NAME | wandb project + experiment name |
| NGPUS_PER_NODE / gen_tp | GPU count + vLLM tensor-parallel degree |
| K8S_KUBECONFIG / K8S_NAMESPACE | Kubernetes backend (default) |
| HARBOR_AGENT_IMPORT_PATH | agent scaffold (default Claude Code) |
| HARBOR_ENVIRONMENT_IMPORT_PATH | sandbox backend (K8s / remote Docker) |
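A sketch of the export route, assuming the launch script picks up each of these from the environment as described above. Every path and name below is a placeholder, apart from the documented default for SERVED_MODEL_NAME; the two HARBOR_*_IMPORT_PATH variables are left at their defaults.

```shell
# Placeholder values; only the variable names come from the table.
export MODEL_PATH="/data/checkpoints/qwen3-sft"    # hypothetical local path
export SERVED_MODEL_NAME="vllm_model"              # the documented default
export TRAIN_INDEX="/data/tasks/train.parquet"     # hypothetical
export VAL_INDEX="/data/tasks/val.parquet"         # hypothetical
export PROJECT_NAME="liverl-demo"                  # hypothetical wandb project
export EXP_NAME="first-run"                        # hypothetical experiment name
export NGPUS_PER_NODE=8
export gen_tp=4                                    # must divide the model's KV-head count
export K8S_KUBECONFIG="$HOME/.kube/config"         # hypothetical kubeconfig location
export K8S_NAMESPACE="liverl"                      # hypothetical namespace

# Warn about missing inputs now rather than minutes into startup.
for path in "$MODEL_PATH" "$TRAIN_INDEX" "$VAL_INDEX"; do
    [ -e "$path" ] || echo "warning: $path does not exist" >&2
done
```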
Set credentials in your shell — never hardcode them in the script:
export WANDB_API_KEY=... # or set WANDB_MODE=disabled to skip wandb
See Inputs & Configuration for the full surface and Backends to switch to Docker.
KV-head divisibility (the common crash)
gen_tp must divide the model's num_key_value_heads; otherwise training crashes
with a CUDA illegal memory access at the first forward pass. For example, a
model with 8 KV heads works with a gen_tp of 1, 2, 4, or 8, but not 3 or 6. See Preflight.
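The divisibility rule can be checked before launching. The sketch below is not the repo's Preflight check; both function names are hypothetical, and it assumes the checkpoint directory holds a Hugging Face style config.json whose first num_key_value_heads entry is the one that applies to the policy.

```shell
# Hypothetical helper: read num_key_value_heads from a config.json.
# Assumption: the first occurrence of the key is the relevant one.
kv_heads_from_config() {
    grep -o '"num_key_value_heads"[[:space:]]*:[[:space:]]*[0-9][0-9]*' "$1" \
        | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Hypothetical helper: succeeds only when gen_tp divides the KV-head count.
gen_tp_ok() {
    local kv_heads="$1" gen_tp="$2"
    [ "$gen_tp" -gt 0 ] && [ $(( kv_heads % gen_tp )) -eq 0 ]
}

# Usage, with MODEL_PATH and gen_tp set as in step 2:
#   kv="$(kv_heads_from_config "$MODEL_PATH/config.json")"
#   gen_tp_ok "$kv" "$gen_tp" || echo "gen_tp=$gen_tp does not divide $kv KV heads" >&2
```

Running this costs nothing, while the crash it guards against only appears after model loading.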
3. Clean up stale state (optional)
If a previous run left Ray/vLLM processes or stale ports behind, reset first:
bash scripts/cleanup_before_run.sh
It stops Ray, frees the Ray ports, reaps stale vLLM/verl processes, and (with
KEEP_TRIALS=0) clears the trials dir. It leaves running GPU tasks alone; use
DRY_RUN=1 to preview.
4. Launch a run
bash scripts/sync_1node_cc.sh
This boots vLLM (verl-managed), starts the in-process proxy, and drives the synchronous PPO/GRPO/GSPO loop. vLLM CUDA-graph capture on a 30B MoE (TP=4) takes ~10–20 min before the first replica registers; this is expected. See Run Training.
5. Watch it train
Tail the launch log and bring up the dashboard:
tail -F logs/<exp>.log
bash webui/start_dashboard.sh # http://<host>:8090 (+ public tunnel)
Checkpoints land under checkpoints/<project>/<exp>/global_step_N/. See
Results & Artifacts.
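Because the step number is not zero-padded, a plain alphabetical listing can put global_step_100 before global_step_20. The following sketch picks the newest checkpoint by comparing N numerically; `latest_checkpoint` is a hypothetical helper, not a script shipped with the repo.

```shell
# Hypothetical helper: print the global_step_N directory with the largest N.
# Returns non-zero when no such directory exists under the given root.
latest_checkpoint() {
    local root="$1" best=-1 best_dir="" dir step
    for dir in "$root"/global_step_*/; do
        [ -d "$dir" ] || continue             # the glob matched nothing
        step="${dir%/}"                       # strip the trailing slash
        step="${step##*global_step_}"         # keep only N
        case "$step" in
            ''|*[!0-9]*) continue ;;          # skip names where N is not a number
        esac
        if [ "$step" -gt "$best" ]; then
            best="$step"
            best_dir="${dir%/}"
        fi
    done
    [ -n "$best_dir" ] && printf '%s\n' "$best_dir"
}

# Usage (project and experiment names are whatever you set in step 2):
#   latest_checkpoint "checkpoints/$PROJECT_NAME/$EXP_NAME"
```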
Scaling beyond one node
When a single host is no longer enough — large MoE policies, ≈131k contexts, or
rollout-bound throughput — move to the multi-node, fully-async scripts
(scripts/fully_async_*.sh). See Scaling Up.