LiveRL

Scaling Up

When and how to move from the single-node default to a multi-node, fully-async run

The recommended way to run LiveRL is the single-node path documented throughout this section: one 8×GPU host where verl runs the trainer and vLLM serves the policy on the same GPUs, driven by scripts/sync_1node_cc.sh with the claude-code scaffold. It is the minimal-cost way to get a real RL run going — one machine, one launch script, no cross-node plumbing.

This page is the advanced path: what changes when one host is no longer enough. You almost certainly do not need it to start.

This costs more — reach for it only when you must

The multi-node / fully-async stack needs multiple machines, more setup, and more operational care (cross-node networking, a separate rollout pool, staleness tuning). Stay on the single-node path until you actually hit its ceiling: a large MoE policy that won't fit one host's training + serving budget, very long contexts (≈131k), or rollout throughput that bottlenecks a synchronous loop.

When to scale

Signal on the single-node run | Scaling lever
Policy is a large MoE (e.g. 30B-A3B) and trainer + vLLM won't co-fit | veomni engine + a dedicated rollout node
Trainer GPUs idle most of the step waiting on rollouts (idle_ratio high) | Fully-async trainer/rollout decoupling
Context window must reach ≈131k (40k prompt + 91k response) | Multi-node FSDP (more trainer GPUs + CPU offload)
You need a different agent harness than claude-code | A different scaffold via HARBOR_AGENT_IMPORT_PATH

Fully-async, multi-node

The synchronous loop runs generate → execute → reward → update in lockstep on shared GPUs. The fully-async path instead splits the cluster into a rollout pool and a trainer pool that run concurrently: rollouts (which are dominated by sandbox/env execution, not GPU compute) stream into a queue while the trainer consumes them, so neither side blocks the other.

The reference launch scripts are scripts/fully_async_2nodes_qwen35_cc_clean.sh (2 nodes) and scripts/fully_async_3nodes_qwen35_cc_clean.sh (3 nodes):

bash scripts/fully_async_3nodes_qwen35_cc_clean.sh

The 3-node layout is 2 trainer nodes (16-GPU FSDP, fsdp_size=16) + 1 rollout node (8-GPU vLLM), fitting the ≈131k window (40k prompt + 91k response) via more trainer GPUs plus CPU optimizer offload. A few knobs govern the async behaviour:

Knob | Role
staleness_threshold | How many param-versions stale a rollout may be before the trainer waits (0 = strict on-policy; 1 tolerates a throughput dip)
partial_rollout | Resume a rollout aborted mid-generation on a weight sync, instead of discarding it (True by default)
train_bsz × n_resp_per_prompt | Effective batch = prompts × rollouts/prompt. Official 64×8; smoke 8×4
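The batch arithmetic and the staleness rule in the table can be sketched in a few lines of shell. This is only an illustration: the helper names effective_batch and is_usable are hypothetical, and the staleness check is a simplified reading of staleness_threshold, not the trainer's actual implementation.

```shell
#!/usr/bin/env bash
# Effective batch = prompts per step x rollouts per prompt.
effective_batch() {
  echo $(( $1 * $2 ))
}

# Simplified model of staleness_threshold (an assumption, not verl's code):
# a rollout generated at param version $1 is usable at trainer version $2
# when it lags by at most $3 versions.
is_usable() {
  local rollout_version=$1 trainer_version=$2 threshold=$3
  [ $(( trainer_version - rollout_version )) -le "$threshold" ]
}

official=$(effective_batch 64 8)   # official run: 64 prompts x 8 rollouts
smoke=$(effective_batch 8 4)       # smoke run: 8 prompts x 4 rollouts
echo "official=${official} smoke=${smoke}"

# A rollout from version 4 consumed at trainer version 5 lags by one version.
if is_usable 4 5 0; then
  echo "threshold=0: usable"
else
  echo "threshold=0: too stale under strict on-policy"
fi
```

Under this reading, the official run updates on 512 trajectories per step and the smoke run on 32, and a one-version lag is rejected at threshold 0 but accepted at threshold 1.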

sp_size must be 1 for qwen35

For the Qwen3.5 policy, sp_size=1 is fixed — Ulysses sequence parallel corrupts the GatedDeltaNet layers. The ≈131k window is fit with fsdp_size=16 + CPU optimizer offload, not SP.
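A pre-flight check along these lines can catch a bad layout before any GPUs are allocated. It is a sketch under stated assumptions: the upper-case variable names are hypothetical stand-ins for however your launch script exposes these settings, and the rule that fsdp_size equals the trainer GPU count simply mirrors the 3-node layout described here.

```shell
#!/usr/bin/env bash
# Values from the 3-node layout described above (variable names are hypothetical).
TRAINER_NODES=2
GPUS_PER_NODE=8
FSDP_SIZE=16
SP_SIZE=1
PROMPT_K=40      # prompt budget, in thousands of tokens
RESPONSE_K=91    # response budget, in thousands of tokens

preflight() {
  local trainer_gpus=$(( TRAINER_NODES * GPUS_PER_NODE ))
  if [ "$SP_SIZE" -ne 1 ]; then
    echo "error: sp_size must be 1 for the Qwen3.5 policy" >&2
    return 1
  fi
  if [ "$FSDP_SIZE" -ne "$trainer_gpus" ]; then
    echo "error: fsdp_size=${FSDP_SIZE} but the trainer pool has ${trainer_gpus} GPUs" >&2
    return 1
  fi
  echo "ok: fsdp_size=${FSDP_SIZE}, sp_size=${SP_SIZE}, window=$(( PROMPT_K + RESPONSE_K ))k"
}

preflight
```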

Smoke before the official run

Always validate a new scaffold/cluster with the smoke batch (TRAIN_BSZ=8 N_RESP=4) before the official 64×8. The small batch is ~8× faster per step but gives noisy gradients and degenerate all-same-reward groups — it is for plumbing validation, not for reading learning quality.
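The smoke-then-official order can be written down as a dry run that only prints the two commands. This assumes the launch script reads TRAIN_BSZ and N_RESP from the environment, which the variable names above suggest but this page does not confirm; launch_cmd is a hypothetical helper.

```shell
#!/usr/bin/env bash
LAUNCH=scripts/fully_async_3nodes_qwen35_cc_clean.sh

# Build the launch command for a given batch shape.
# Assumption: the launch script reads TRAIN_BSZ and N_RESP from the environment.
launch_cmd() {
  echo "TRAIN_BSZ=$1 N_RESP=$2 bash ${LAUNCH}"
}

SMOKE_CMD=$(launch_cmd 8 4)
OFFICIAL_CMD=$(launch_cmd 64 8)

# Dry run: print the commands in the order they should be run.
echo "1) smoke:    ${SMOKE_CMD}"
echo "2) official: ${OFFICIAL_CMD}"
```

Run the second command only after the smoke run has completed a few steps without plumbing errors.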

VeOmni backend (MoE)

For MoE policies, set USE_NEW_VERL=1 so that import verl resolves to the verl-swe_agent_opd_dev checkout, which is prepended to PYTHONPATH (override its location with NEW_VERL_DIR). Only that tree has the VeOmni engine_workers router-replay (R3) wiring (actor.veomni.router_replay.mode) and the async-rollouter routed_experts concat fixes; the VeOmni engine in the older installed verl is unvalidated for this path. Dense models can stay on FSDP.

Switching to the new checkout also makes the trajectory_filter config available (TRAJ_FILTER_ENABLE=True, TRAJ_FILTER_DROP_REASONS=timeout,env_setup_failed), which the old verl lacks.
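The environment this implies might look like the sketch below. The default path for NEW_VERL_DIR is a hypothetical placeholder, and the final line is only a convenience check of which verl Python would import; it is not part of the launch scripts.

```shell
#!/usr/bin/env bash
USE_NEW_VERL=${USE_NEW_VERL:-1}
# Hypothetical default location; point NEW_VERL_DIR at your own checkout.
NEW_VERL_DIR=${NEW_VERL_DIR:-$HOME/verl-swe_agent_opd_dev}

if [ "$USE_NEW_VERL" = "1" ]; then
  # Prepend so this checkout wins over any installed verl.
  export PYTHONPATH="${NEW_VERL_DIR}${PYTHONPATH:+:${PYTHONPATH}}"
  # The trajectory filter exists only in the new tree.
  export TRAJ_FILTER_ENABLE=True
  export TRAJ_FILTER_DROP_REASONS=timeout,env_setup_failed
fi

# Report which verl Python would import, if one is importable here.
python3 -c 'import verl; print("verl from:", verl.__file__)' 2>/dev/null \
  || echo "verl is not importable in this shell"
```

If the printed path does not sit under NEW_VERL_DIR, the old installed verl is still taking precedence.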

Setup write-up: troubleshooting/training_env/veomni-engine-setup-and-run-20260610.md · routing coverage fix: troubleshooting/training_env/r3-routing-coverage-rootcause-fix-20260610.md

OH-SDK scaffold

claude-code is the single-node default. For the multi-node path, the validated alternative is the OpenHands-SDK (OH-SDK) scaffold. It uses an image-mounted runtime, with the SDK pre-baked into the agent image, instead of an in-pod venv install, because the venv install fails on no-egress task pods. Select it with HARBOR_AGENT_NAME=null so that harbor loads the mounted-runtime-aware import_path class rather than the registry default.
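A guard like the following can make the name-versus-import-path selection explicit before launch. Treat it as a sketch: the import path value is a made-up placeholder (including its module:Class form), and check_scaffold_env is a hypothetical helper, not something harbor provides.

```shell
#!/usr/bin/env bash
# "null" turns off lookup by registry name, per the text above.
export HARBOR_AGENT_NAME=null
# Placeholder only: substitute the real import path of the OH-SDK agent class.
export HARBOR_AGENT_IMPORT_PATH=${HARBOR_AGENT_IMPORT_PATH:-my_scaffolds.ohsdk:MountedRuntimeAgent}

check_scaffold_env() {
  if [ "$HARBOR_AGENT_NAME" != "null" ]; then
    echo "error: HARBOR_AGENT_NAME must be null to bypass the registry default" >&2
    return 1
  fi
  if [ -z "$HARBOR_AGENT_IMPORT_PATH" ]; then
    echo "error: HARBOR_AGENT_IMPORT_PATH is empty" >&2
    return 1
  fi
  echo "ok: scaffold class = ${HARBOR_AGENT_IMPORT_PATH}"
}

check_scaffold_env
```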

Scaffold pitfall (name vs import_path): troubleshooting/training_env/ohsdk-agent-name-vs-import-path-20260613.md

What to watch on a fully-async run

Beyond the single-node metrics, the async path adds a few signals worth a glance each step:

Metric | Read
fully_async/trainer/idle_ratio | Fraction of the step the trainer waits on rollouts; high = rollout-bound (raise concurrency or rollout capacity)
rollout_corr/kl | Rollout-vs-training logprob fidelity; should be ≈3e-4, and a spike means scaffold/routing corruption
fully_async/partial/partial_ratio | Share of partial (staleness-bounded) rollouts
trajectory_filter/invalid_ratio | Share of trajectories dropped before the update
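The first two rows lend themselves to a per-step check. In the sketch below the thresholds are illustrative assumptions, not project defaults, and check_step is a hypothetical helper that takes the two metric values as arguments.

```shell
#!/usr/bin/env bash
# Thresholds are illustrative assumptions, not project defaults.
IDLE_MAX=0.5          # flag when the trainer idles more than half the step
KL_BASELINE=0.0003    # expected rollout_corr/kl from the table above
KL_SPIKE_FACTOR=10    # flag when kl exceeds 10x the baseline

# Usage: check_step <idle_ratio> <rollout_corr_kl>
check_step() {
  awk -v idle="$1" -v kl="$2" \
      -v idle_max="$IDLE_MAX" -v base="$KL_BASELINE" -v factor="$KL_SPIKE_FACTOR" '
    BEGIN {
      status = "ok"
      if (idle + 0 > idle_max + 0) status = "rollout-bound"
      if (kl + 0 > base * factor)
        status = (status == "ok") ? "kl-spike" : (status ",kl-spike")
      print status
    }'
}

check_step 0.72 0.0003   # trainer mostly waiting, kl at baseline
check_step 0.10 0.02     # trainer busy, kl far above baseline
```

A rollout-bound result points at rollout concurrency or capacity; a kl-spike result points at the scaffold or routing path.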

Advanced-path failure modes

The async/MoE path has its own silent killers: a reward drop traced to trajectory filtering / sequence-distribution drift, routing-coverage loss under multi-turn replay, and importance-sampling ESS (effective sample size) collapse on very long responses. For the last of these, set rollout_is=null and use token-level IS, not sequence-level. Start from the v6 analysis: troubleshooting/training_env/reward_drop_analysis_v6_20260612.md.
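To see why sequence-level weights collapse on long responses, consider a toy case using the standard Kish effective sample size, ESS = (Σw)² / Σw². This is an illustration, not the trainer's code: the helper names are hypothetical, and the two-response setup with per-token log-ratios of +d and -d is a deliberately simple assumption.

```shell
#!/usr/bin/env bash
# ESS = (sum w)^2 / sum(w^2), reading one importance weight per line.
ess() {
  awk '{ s += $1; s2 += $1 * $1 } END { if (s2 > 0) printf "%.3f\n", (s * s) / s2 }'
}

# Toy sequence-level weights: two responses of length T whose per-token
# log-ratios are +d and -d, so the sequence weights are exp(+T*d) and exp(-T*d).
seq_level_weights() {
  awk -v T="$1" -v d="$2" 'BEGIN { printf "%.10g\n%.10g\n", exp(T * d), exp(-T * d) }'
}

# Same tiny per-token mismatch, increasing response length.
for T in 100 10000 90000; do
  echo "T=${T} ESS=$(seq_level_weights "$T" 0.0001 | ess) of 2"
done
```

Because the per-token log-ratios add up over the response, the same small mismatch that leaves both samples useful at short lengths lets one sample dominate at tens of thousands of tokens. Token-level weights do not accumulate this way.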
