LiveRL

Run Training

Preflight

What to check before a multi-hour run

A training run is a multi-hour, multi-GPU commitment, so confirm the run profile before launching. LiveRL has no separate dry-run command — walk this checklist against the variables at the top of your launch script.

Pre-launch checklist

  • venv — editable installs of harbor / verl / harbor-verl-train resolve to the trees you intend to run:
    .venv/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__, verl.__file__)"
  • Model & data — MODEL_PATH, TRAIN_INDEX, VAL_INDEX exist and are readable.
  • GPUs — NGPUS_PER_NODE matches what nvidia-smi shows.
  • Backend — for K8s, kubectl --kubeconfig $K8S_KUBECONFIG get nodes works; for Docker, the daemon is reachable (see Backends).
  • Ports — the Ray ports (:6379, :8265) are free, or run scripts/cleanup_before_run.sh first.
  • Disk — the checkpoints/ volume has room for FSDP shards.
  • wandb — WANDB_API_KEY is exported (or WANDB_MODE=disabled).
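
Several of these items can be checked mechanically. The following is a minimal sketch, not part of LiveRL: it assumes the launch-script variables are exported into the environment, that the Ray head runs on the local host, and that checkpoints/ is relative to the working directory. The function names and the 100 GiB free-space threshold are hypothetical; pick a threshold that fits your model's shard size.

```python
import os
import shutil
import socket
import subprocess


def missing_paths(env, names=("MODEL_PATH", "TRAIN_INDEX", "VAL_INDEX")):
    """Return the variable names that are unset or point at an unreadable path."""
    bad = []
    for name in names:
        path = env.get(name)
        if not path or not os.access(path, os.R_OK):
            bad.append(name)
    return bad


def port_in_use(port, host="127.0.0.1"):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0


def visible_gpus():
    """Count GPUs listed by `nvidia-smi -L`; None if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return sum(1 for line in out.stdout.splitlines() if line.startswith("GPU "))


def wandb_configured(env):
    """True if an API key is set or wandb is explicitly disabled."""
    return bool(env.get("WANDB_API_KEY")) or env.get("WANDB_MODE") == "disabled"


def preflight(env, ckpt_dir="checkpoints", min_free_gib=100.0):
    """Collect human-readable problems; an empty list means every check passed."""
    problems = [f"{name} is unset or unreadable" for name in missing_paths(env)]

    gpus = visible_gpus()
    want = env.get("NGPUS_PER_NODE")
    if gpus is not None and want is not None and int(want) != gpus:
        problems.append(f"NGPUS_PER_NODE={want} but nvidia-smi lists {gpus} GPUs")

    for port in (6379, 8265):  # Ray ports named in the checklist
        if port_in_use(port):
            problems.append(f"port {port} is already in use")

    if os.path.isdir(ckpt_dir):
        free_gib = shutil.disk_usage(ckpt_dir).free / 2**30
        if free_gib < min_free_gib:  # hypothetical threshold
            problems.append(f"{ckpt_dir} has only {free_gib:.0f} GiB free")

    if not wandb_configured(env):
        problems.append("WANDB_API_KEY is not set and WANDB_MODE is not 'disabled'")
    return problems
```

Calling preflight(os.environ) from the same shell that will launch training returns a list of problems to resolve. The sketch does not cover the venv and backend items, which are better checked with the commands shown in the list.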

KV-head divisibility (the common crash)

The check that most often saves a run: gen_tp must divide the model's num_key_value_heads. If it doesn't, training crashes with a CUDA illegal memory access at the first forward pass. Verify against the model's config.json:

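.venv/bin/python - <<'PY'
import json, os

# MODEL_PATH must be exported; config.json is read from the model directory.
with open(os.path.join(os.environ["MODEL_PATH"], "config.json")) as f:
    cfg = json.load(f)
# Configs that omit num_key_value_heads fall back to the attention-head count.
kv = cfg.get("num_key_value_heads") or cfg.get("num_attention_heads")
# gen_tp is read from the environment: export it first, or the default of 4 is checked instead.
gen_tp = int(os.environ.get("gen_tp", 4))
print(f"num_key_value_heads={kv}  gen_tp={gen_tp}  ->", "OK" if kv % gen_tp == 0 else "WILL CRASH")
PY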

Fix by choosing a gen_tp that divides the model's KV-head count (or a model whose KV heads are divisible by your tensor-parallel degree).
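
To see which values are available for a given model, the divisibility rule can be enumerated directly. This is a small sketch; valid_tp_degrees and its max_tp parameter are hypothetical names, with max_tp standing in for the number of GPUs you can dedicate to generation (for example NGPUS_PER_NODE). It applies only the KV-head rule described above; the inference engine may impose further constraints.

```python
def valid_tp_degrees(num_kv_heads, max_tp):
    """Tensor-parallel degrees in 1..max_tp that divide num_kv_heads evenly."""
    return [tp for tp in range(1, max_tp + 1) if num_kv_heads % tp == 0]


# A model with 8 KV heads on an 8-GPU node:
print(valid_tp_degrees(8, 8))  # [1, 2, 4, 8]
# A model with 2 KV heads on the same node: gen_tp=4 is not in the list.
print(valid_tp_degrees(2, 8))  # [1, 2]
```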

Confirm, then run

Heavy operations (multi-hour GPU training, multi-container rollouts) are expensive and hard to reverse. Resolve every item above before launching, and prefer a short smoke run (small batch) on a new model/cluster before the full profile.
