LiveRL

Getting Started

Set up LiveRL and launch your first training run

This page walks through preparing LiveRL and launching a training run end to end. LiveRL is the harbor-verl-train repo; it bundles a patched verl (trainer + rollout) and Harbor (sandboxed task execution + verifier). A run is driven directly by a launch script — there is no separate config file; you set values at the top of the script or via environment variables.

Prerequisites

  • 8× GPU (A100/H100-class) on the training node — vLLM serves the policy and verl runs the trainer on the same host for the single-node path.
  • A sandbox backend — either a reachable Kubernetes cluster (production) or a Docker daemon (local or remote) for Harbor to run each trial in.
  • uv — used by scripts/setup_env.sh to build the training venv (vllm, flash_attn, verl, harbor, harbor-verl-train, editable).
  • A policy checkpoint — e.g. an SFT-trained Qwen3 model on local disk.
  • Train/val task indexes — Harbor task parquet files (*.parquet) the trainer samples rollouts from.
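
The prerequisites above can be checked before building anything. The following is a minimal sketch and not part of the repo: the function names and the BACKEND switch are hypothetical, and it assumes nvidia-smi, kubectl, and docker are the tools you use to reach the GPUs and the sandbox backend.

```shell
# Succeed if the named command is on PATH; otherwise report it on stderr.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

# Count GPUs visible to the driver (nvidia-smi -L prints one "GPU N: ..." line per device).
gpu_count() {
  nvidia-smi -L 2>/dev/null | grep -c '^GPU '
}

# Check the prerequisites listed above. BACKEND (k8s or docker) is a
# hypothetical switch used only by this sketch.
preflight() {
  require_cmd uv || return 1
  [ "$(gpu_count)" -ge 8 ] || { echo "fewer than 8 GPUs visible" >&2; return 1; }
  case "${BACKEND:-k8s}" in
    k8s)    require_cmd kubectl && kubectl cluster-info >/dev/null ;;
    docker) require_cmd docker && docker info >/dev/null ;;
    *)      echo "unknown BACKEND: ${BACKEND}" >&2; return 1 ;;
  esac
}
```

Run preflight (or BACKEND=docker preflight) on the training node; a non-zero exit status means one of the checks failed. The checkpoint and parquet indexes are not covered here because their locations are site-specific.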

Where to run

Training is hours-long — launch it inside a tmux session (or with nohup setsid) so it survives shell disconnects.
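
One way to do this is sketched below, assuming tmux (or nohup and setsid) is installed on the training node. The session name liverl, the output file name, and the two function names are hypothetical; the launch script path is the single-node script used later on this page.

```shell
# Hypothetical session name; override by exporting SESSION before sourcing.
SESSION="${SESSION:-liverl}"
LAUNCH="${LAUNCH:-scripts/sync_1node_cc.sh}"

# Start the launch script in a detached tmux session, unless one already exists.
start_detached() {
  if tmux has-session -t "$SESSION" 2>/dev/null; then
    echo "tmux session '$SESSION' already exists" >&2
    return 1
  fi
  tmux new-session -d -s "$SESSION" "bash $LAUNCH"
}

# Fallback without tmux: detach from the terminal with nohup + setsid.
# $1 is an optional file that receives the script's stdout and stderr.
start_nohup() {
  nohup setsid bash "$LAUNCH" > "${1:-launch.out}" 2>&1 &
  echo "started in background, pid $!"
}
```

After start_detached, reattach with tmux attach -t liverl and detach again with Ctrl-b d; the run keeps going either way.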

1. Build the environment

scripts/setup_env.sh builds the venv at .venv with uv, clones harbor and verl, and installs them in editable mode alongside pinned versions of vllm, flash_attn, transformers, and cupy. It is idempotent, so re-running it is safe:

bash scripts/setup_env.sh

Key overrides (env vars): VENV_PATH, HARBOR_DIR / VERL_DIR (clone locations), HARBOR_REF / VERL_REF (branches), PYTHON_VERSION (default 3.12.3), and the SKIP_CLONE / SKIP_FLASH_ATTN / SKIP_CUPY skip flags. Confirm the editable installs resolve to the trees you expect:

.venv/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__, verl.__file__, verl_patch.__file__, sep='\n')"

2. Point at your model, data, and backend

For the single-node path, edit the variables near the top of scripts/sync_1node_cc.sh (or export them before launching):

Variable                          What it sets
MODEL_PATH                        policy checkpoint to train
SERVED_MODEL_NAME                 vLLM model name (default vllm_model)
TRAIN_INDEX / VAL_INDEX           Harbor task parquet indexes
PROJECT_NAME / EXP_NAME           wandb project + experiment name
NGPUS_PER_NODE / gen_tp           GPU count + vLLM tensor-parallel degree
K8S_KUBECONFIG / K8S_NAMESPACE    Kubernetes backend (default)
HARBOR_AGENT_IMPORT_PATH          agent scaffold (default Claude Code)
HARBOR_ENVIRONMENT_IMPORT_PATH    sandbox backend (K8s / remote Docker)
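
If you prefer exporting over editing the script, the variables above might be set as in the sketch below. Every value is a hypothetical placeholder (paths, names, namespace); only the variable names and the vllm_model default come from this page.

```shell
# All values are hypothetical placeholders; replace them with your own.
export MODEL_PATH="/data/checkpoints/qwen3-sft"
export SERVED_MODEL_NAME="vllm_model"          # the documented default
export TRAIN_INDEX="/data/tasks/train.parquet"
export VAL_INDEX="/data/tasks/val.parquet"
export PROJECT_NAME="liverl-demo"
export EXP_NAME="first-run"
export NGPUS_PER_NODE=8
export gen_tp=4                                # must divide the model's num_key_value_heads
export K8S_KUBECONFIG="$HOME/.kube/config"
export K8S_NAMESPACE="liverl"
```

This assumes the launch script keeps values that are already set in the environment rather than overwriting them; if it does not, edit the variables at the top of the script instead.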

Set credentials in your shell — never hardcode them in the script:

export WANDB_API_KEY=...     # or set WANDB_MODE=disabled to skip wandb

See Inputs & Configuration for the full surface and Backends to switch to Docker.

KV-head divisibility (the common crash)

gen_tp must evenly divide the model's num_key_value_heads; otherwise training crashes with a CUDA illegal memory access at the first forward pass. For example, a model with 4 KV heads works with gen_tp set to 1, 2, or 4, but not 8. See Preflight.
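
The divisibility rule can be checked before launching. The sketch below is not part of the repo: both function names are hypothetical, and it assumes the checkpoint directory contains a Hugging Face style config.json with a num_key_value_heads field.

```shell
# Extract num_key_value_heads from a config.json (first match only).
kv_heads_from_config() {
  sed -n 's/.*"num_key_value_heads"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if the tensor-parallel degree evenly divides the KV-head count.
# $1 = KV-head count, $2 = gen_tp
check_gen_tp() {
  kv_heads="$1"
  tp="$2"
  if [ "$tp" -gt 0 ] && [ $((kv_heads % tp)) -eq 0 ]; then
    echo "ok: gen_tp=$tp divides num_key_value_heads=$kv_heads"
  else
    echo "error: gen_tp=$tp does not divide num_key_value_heads=$kv_heads" >&2
    return 1
  fi
}
```

A typical call would be check_gen_tp "$(kv_heads_from_config "$MODEL_PATH/config.json")" "$gen_tp", run after MODEL_PATH and gen_tp are set and before the launch script.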

3. Clean up stale state (optional)

If a previous run left Ray/vLLM processes or stale ports behind, reset first:

bash scripts/cleanup_before_run.sh

It stops Ray, frees the Ray ports, reaps stale vLLM/verl processes, and (with KEEP_TRIALS=0) clears the trials dir. It leaves running GPU tasks alone; use DRY_RUN=1 to preview.
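
Because the cleanup kills processes and frees ports, it can be worth previewing before running it for real. The wrapper below is a minimal sketch; the function name and the confirmation prompt are hypothetical, and only DRY_RUN=1 and the script path come from this page.

```shell
# Run the cleanup script in preview mode first, then for real if the user confirms.
# $1 optionally overrides the script path.
cleanup_with_preview() {
  script="${1:-scripts/cleanup_before_run.sh}"
  DRY_RUN=1 bash "$script" || return 1
  printf 'Proceed with cleanup? [y/N] '
  read -r answer
  if [ "$answer" = "y" ]; then
    # Real run; this assumes DRY_RUN is not already exported in your shell.
    bash "$script"
  else
    echo "skipped"
  fi
}
```

Call cleanup_with_preview from the repo root, read the preview output, and answer y only if the listed processes and ports are the ones you expect to lose.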

4. Launch a run

bash scripts/sync_1node_cc.sh

This boots vLLM (verl-managed), starts the in-process proxy, and drives the synchronous PPO/GRPO/GSPO loop. vLLM CUDA-graph capture on a 30B MoE (TP=4) takes ~10–20 min before the first replica registers — this is expected. See Run Training.

5. Watch it train

Tail the launch log and bring up the dashboard:

tail -F logs/<exp>.log
bash webui/start_dashboard.sh            # http://<host>:8090 (+ public tunnel)

Checkpoints land under checkpoints/<project>/<exp>/global_step_N/. See Results & Artifacts.
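
To pick up the newest checkpoint from that layout, the step directories have to be compared numerically, since a plain alphabetical sort puts global_step_100 before global_step_20. A minimal sketch follows; the function name latest_checkpoint is hypothetical.

```shell
# Print the global_step_N directory with the highest N under an experiment dir.
# $1 = experiment directory, e.g. checkpoints/<project>/<exp>
latest_checkpoint() {
  best=""
  for d in "$1"/global_step_*; do
    [ -d "$d" ] || continue
    n="${d##*global_step_}"
    # Skip anything whose suffix is not a plain integer.
    case "$n" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
      best="$n"
    fi
  done
  # Non-zero exit status if no checkpoint directory was found.
  [ -n "$best" ] && printf '%s\n' "$1/global_step_$best"
}
```

Pass it the experiment directory for your run, with your own project and experiment names in place of the placeholders.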

Scaling beyond one node

When a single host is no longer enough (large MoE policies, contexts of about 131k tokens, or rollout-bound throughput), move to the multi-node, fully-async scripts (scripts/fully_async_*.sh). See Scaling Up.
