Preflight
What to check before a multi-hour run
A training run is a multi-hour, multi-GPU commitment, so confirm the run profile before launching. LiveRL has no separate dry-run command — walk this checklist against the variables at the top of your launch script.
Pre-launch checklist
- venv: editable installs of harbor / verl / harbor-verl-train resolve to the trees you intend to run:
    .venv/bin/python -c "import harbor, verl, verl_patch; print(harbor.__file__, verl.__file__)"
- Model & data: MODEL_PATH, TRAIN_INDEX, VAL_INDEX exist and are readable.
- GPUs: NGPUS_PER_NODE matches what nvidia-smi shows.
- Backend: for K8s, kubectl --kubeconfig $K8S_KUBECONFIG get nodes works; for Docker, the daemon is reachable (see Backends).
- Ports: the Ray ports (:6379, :8265) are free, or run scripts/cleanup_before_run.sh first.
- Disk: the checkpoints/ volume has room for FSDP shards.
- wandb: WANDB_API_KEY is exported (or WANDB_MODE=disabled).
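Several of these items can be checked mechanically. The following is a minimal sketch, not part of LiveRL: every function name is hypothetical, the 100 GiB free-space threshold is a placeholder to replace with your own estimate, and it assumes the launch-script variables are exported into the environment. The port check is IPv4 only and can report a port as busy while old connections are still closing. The venv and backend items are left to the commands in the checklist.

```python
import os
import shutil
import socket
import subprocess


def missing_paths(env, names=("MODEL_PATH", "TRAIN_INDEX", "VAL_INDEX")):
    """Return the variables that are unset or point at an unreadable path."""
    return [n for n in names if not env.get(n) or not os.access(env[n], os.R_OK)]


def count_gpus(listing):
    """Count devices in `nvidia-smi -L` output (one 'GPU <n>: ...' line each)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def port_is_free(port, host=""):
    """Return True if a TCP socket can bind to the port on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def free_gib(path):
    """Return free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2**30


def wandb_configured(env):
    """True if an API key is set or logging is explicitly disabled."""
    return bool(env.get("WANDB_API_KEY")) or env.get("WANDB_MODE") == "disabled"


def preflight_problems(env, checkpoint_dir="checkpoints", min_free_gib=100):
    """Collect readable problems; an empty list means these checks passed."""
    problems = [f"{name} is unset or unreadable" for name in missing_paths(env)]

    try:
        listing = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        ).stdout
        found = count_gpus(listing)
        if env.get("NGPUS_PER_NODE") != str(found):
            problems.append(
                f"NGPUS_PER_NODE={env.get('NGPUS_PER_NODE')} but nvidia-smi lists {found}"
            )
    except (OSError, subprocess.CalledProcessError):
        problems.append("could not run nvidia-smi")

    # Ray ports named in the checklist.
    problems += [f"port {p} is busy" for p in (6379, 8265) if not port_is_free(p)]

    if not os.path.isdir(checkpoint_dir):
        problems.append(f"{checkpoint_dir}/ does not exist")
    elif free_gib(checkpoint_dir) < min_free_gib:
        problems.append(f"less than {min_free_gib} GiB free under {checkpoint_dir}/")

    if not wandb_configured(env):
        problems.append("set WANDB_API_KEY or WANDB_MODE=disabled")
    return problems
```

Calling preflight_problems(os.environ) from the directory that contains checkpoints/ returns a list of strings; launch only when it is empty.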
KV-head divisibility (the common crash)
The check that most often saves a run: gen_tp must divide the model's
num_key_value_heads. If it doesn't, training crashes with a CUDA illegal memory
access at the first forward pass. Verify against the model's config.json:
.venv/bin/python - <<'PY'
import json, os
cfg = json.load(open(os.path.join(os.environ["MODEL_PATH"], "config.json")))
kv = cfg.get("num_key_value_heads") or cfg.get("num_attention_heads")
gen_tp = int(os.environ.get("gen_tp", 4))
print(f"num_key_value_heads={kv} gen_tp={gen_tp} ->", "OK" if kv % gen_tp == 0 else "WILL CRASH")
PY
Fix by choosing a gen_tp that divides the model's KV-head count (or a model whose KV heads are divisible by your tensor-parallel degree).
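To pick a replacement value, it can help to list every degree that passes the divisibility rule. This is an illustrative sketch: valid_gen_tp is a hypothetical helper, and capping the search at max_tp (for example, the number of GPUs per node) is an assumption. Your setup may impose further constraints on gen_tp that are not covered here.

```python
def valid_gen_tp(num_kv_heads, max_tp):
    """Return every tensor-parallel degree up to max_tp that divides num_kv_heads."""
    return [tp for tp in range(1, max_tp + 1) if num_kv_heads % tp == 0]


# Hypothetical KV-head counts, searched up to 8 GPUs per node.
print(valid_gen_tp(8, 8))  # [1, 2, 4, 8]
print(valid_gen_tp(4, 8))  # [1, 2, 4]
print(valid_gen_tp(2, 8))  # [1, 2]
```

For a model with 2 KV heads, gen_tp=4 fails the check because 2 % 4 is 2, not 0, so only 1 and 2 are safe choices.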
Confirm, then run
Heavy operations (multi-hour GPU training, multi-container rollouts) are expensive and hard to reverse. Resolve every item above before launching, and prefer a short smoke run (small batch) on a new model/cluster before the full profile.