Reference
Inputs & Configuration
The launch-script variables that define a run
LiveRL has no config.yaml. A run is configured by the variables at the top
of its launch script (e.g. scripts/sync_1node_cc.sh); each one can also be
overridden by an environment variable of the same name. The script exports
them, then execs verl.trainer.main_ppo with Hydra overrides built from those
values plus the YAML defaults under src/verl_patch/config/.
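The override behaviour described above is commonly written with shell default expansion. The following is a minimal sketch of that pattern, not the contents of the actual launch script: `set_defaults` is a hypothetical helper, and the default values are placeholders chosen for illustration.

```shell
#!/bin/sh
# Sketch only: let the caller's environment override script defaults.
# set_defaults and the default values below are illustrative assumptions.
set_defaults() {
    # Keep the caller's value if it is set and non-empty, else use the default.
    MODEL_PATH="${MODEL_PATH:-/path/to/checkpoint}"
    NUM_WORKERS="${NUM_WORKERS:-16}"
    # Export so the training process started by the script inherits them.
    export MODEL_PATH NUM_WORKERS
}

set_defaults
echo "MODEL_PATH=$MODEL_PATH NUM_WORKERS=$NUM_WORKERS"
```

With a script written this way, a one-off run would look like `NUM_WORKERS=32 bash scripts/sync_1node_cc.sh`, leaving the script itself unchanged.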
Configuration layers
| Layer | What it is | Edit it? |
|---|---|---|
| Launch-script variables | model, data, topology, batch, algorithm, backend | Yes — this is the user surface |
| Environment variables | same names, override the script defaults | Yes — for one-off runs |
| YAML defaults (src/verl_patch/config/*.yaml) | agent-loop + trainer defaults consumed by Hydra | Rarely |
Inputs
The values you normally set in scripts/sync_1node_cc.sh:
| Variable | Description |
|---|---|
| MODEL_PATH | Policy checkpoint to train |
| SERVED_MODEL_NAME | vLLM model name advertised (default vllm_model) |
| TRAIN_INDEX / VAL_INDEX | Train / val Harbor task parquet indexes |
| PROJECT_NAME / EXP_NAME | wandb project + experiment name |
| NNODES / NGPUS_PER_NODE | Cluster topology |
| gen_tp | vLLM tensor-parallel degree (must divide num_key_value_heads) |
| K8S_KUBECONFIG / K8S_NAMESPACE | Kubernetes backend |
| HARBOR_AGENT_IMPORT_PATH | Agent scaffold |
| HARBOR_ENVIRONMENT_IMPORT_PATH | Sandbox backend (K8s / remote Docker) |
| NUM_WORKERS | Parallel trials per step (16 cold-start; raise for steady state) |
| WANDB_API_KEY / WANDB_MODE | wandb credentials + mode (set in your shell) |
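Because these inputs are plain shell variables, an empty or misspelled one may only surface once the run has started. A small guard can catch that earlier. This is a sketch under assumptions: `require_vars` is a hypothetical helper that is not part of LiveRL, and which variables count as required is a choice for your setup.

```shell
#!/bin/sh
# Sketch only: fail fast when a launch variable is empty or unset.
# require_vars is a hypothetical helper, not part of LiveRL.
require_vars() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable named by $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing required variables:$missing" >&2
        return 1
    fi
    return 0
}
```

A launch script could call it before starting the trainer, for example `require_vars MODEL_PATH TRAIN_INDEX VAL_INDEX || exit 1`.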
Hyperparameters
Also set in the launch script:
adv_estimator=grpo # ppo | grpo
policy_loss_mode=gspo # gspo | ...
learning_rate=1.0e-06
train_prompt_bsz=64
n_resp_per_prompt=8 # 64 × 8 = 512 trials/step
max_prompt_length=40000
max_response_length=68000 # window = prompt + response = 108000
save_freq=5
test_freq=5
temperature=1.0
KV-head divisibility
gen_tp must divide the model's num_key_value_heads, or training crashes at the
first forward pass. See Preflight.
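The divisibility rule can be checked before launch. The sketch below is not the project's Preflight step: `check_gen_tp` and `kv_heads_from_config` are hypothetical helpers, and the second one assumes the checkpoint directory holds a Hugging Face style config.json with a top-level `num_key_value_heads` field on its own line.

```shell
#!/bin/sh
# Sketch only: verify that gen_tp divides num_key_value_heads before launching.
# Both helpers are hypothetical and not part of LiveRL.

# Print the first num_key_value_heads value found in <checkpoint>/config.json.
# Assumes a pretty-printed, Hugging Face style config (an assumption).
kv_heads_from_config() {
    sed -n 's/.*"num_key_value_heads"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        "$1/config.json" | head -n 1
}

# Return 0 if tp is positive and divides kv_heads, else print an error and return 1.
check_gen_tp() {
    tp="$1"
    kv_heads="$2"
    if [ "$tp" -le 0 ] || [ $((kv_heads % tp)) -ne 0 ]; then
        echo "gen_tp=$tp does not divide num_key_value_heads=$kv_heads" >&2
        return 1
    fi
    return 0
}
```

For example, `check_gen_tp "$gen_tp" "$(kv_heads_from_config "$MODEL_PATH")" || exit 1` would stop the script before any GPU work begins. With 8 KV heads, gen_tp values of 1, 2, 4, and 8 pass, while 3 fails.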
Outputs
A run writes, relative to the repo root:
| Output | Path | Notes |
|---|---|---|
| Checkpoints | checkpoints/<project>/<exp>/global_step_N/ | actor FSDP shards (per save_freq) |
| Training log | logs/<exp>.log | per-run training log |
| Throughput log | logs/<exp>_vllm.log | throughput-only log |
| Trajectories | harbor_trials/<project>/<exp>/ | per-trial proxy_trajectory.json |
| Curves / val resolve rate | wandb + dashboard | validation resolve rate and metric curves |
See Results & Artifacts for the full layout, and Config Variants for common edits.
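Since every output path is derived from the project and experiment names, locating a run's artifacts can be scripted. The following sketch finds the newest checkpoint directory by comparing the numeric suffix of global_step_N, so that step 10 sorts after step 5. `latest_checkpoint` is a hypothetical helper, and it assumes the command is run from the repo root.

```shell
#!/bin/sh
# Sketch only: find the newest checkpoint for a run, from the repo root.
# latest_checkpoint is a hypothetical helper; the path layout follows the table above.
latest_checkpoint() {
    ckpt_root="checkpoints/$1/$2"
    best=""
    best_step=-1
    for dir in "$ckpt_root"/global_step_*; do
        [ -d "$dir" ] || continue              # glob matched nothing
        step="${dir##*global_step_}"           # numeric suffix N
        case "$step" in
            ''|*[!0-9]*) continue ;;           # skip non-numeric suffixes
        esac
        if [ "$step" -gt "$best_step" ]; then
            best="$dir"
            best_step="$step"
        fi
    done
    [ -n "$best" ] || return 1                 # no checkpoint saved yet
    echo "$best"
}
```

Usage would be `latest_checkpoint "$PROJECT_NAME" "$EXP_NAME"`; the matching training log can be followed with `tail -f "logs/${EXP_NAME}.log"`.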