LiveRL


Troubleshooting

Training Startup

Model-format and serving issues that surface the moment a run starts

These issues all bite right at launch — a crash on step 0, or every launch paying a slow-import tax.

Tool-call parser mismatch

Symptom. Training crashes on the very first step with:

response_mask must contain at least one valid token (1)

vLLM is healthy and the model clearly generates text, but the entire batch has zero tool calls (e.g. 0/1024 trials) and the trajectories are single-turn.

Root cause. The vLLM rollout is configured with the wrong tool-call parser for the model's output format. A Qwen3-Coder model emits XML tool calls:

<function=EnterWorktree>...</function>

but the rollout config sets tool-call-parser: hermes, which expects JSON. vLLM's hermes parser runs json.loads() on the XML and throws JSONDecodeError on every turn, so it recognizes zero tool calls. The agent therefore never takes an action, no assistant tokens are produced, and response_mask is all zeros.
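The failure mode can be reproduced in isolation. The following is a minimal sketch, not vLLM's actual parser code: it assumes only that a JSON-based parser ends up calling json.loads() on the tool-call payload, as described above. The helper name try_parse_json_tool_call and the JSON sample payload are hypothetical.

```python
import json


def try_parse_json_tool_call(payload: str):
    """Return the parsed tool call, or None if the payload is not valid JSON."""
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        # A JSON-only parser finds no tool call here, so the turn yields no action.
        return None


# Sample payloads for illustration only.
xml_call = "<function=EnterWorktree>...</function>"
json_call = '{"name": "EnterWorktree", "arguments": {}}'

print(try_parse_json_tool_call(xml_call))   # None: XML is rejected by json.loads()
print(try_parse_json_tool_call(json_call))  # parsed dict
```

If every turn of every trial takes the None branch, the batch ends up with zero recognized tool calls, which matches the 0/1024 symptom.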

Fix. Match the parser to the model. The runtime exposes one knob, HARBOR_TOOL_PARSER (default qwen3_coder), which drives every config point (tool-call-parser in harbor_verl_*.yaml, tool_parser in agent_loop_config_cc.yaml):

| Model output format              | HARBOR_TOOL_PARSER |
|----------------------------------|--------------------|
| Qwen3-Coder XML (<function=...>) | qwen3_coder        |
| Hermes/JSON tool calls           | hermes             |

Verify. The vLLM logs stop printing JSONDecodeError, and the first one or two steps' proxy_trajectory.json show non-zero tool_calls with finish_reason: tool_calls.
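The trajectory check can be scripted. This is a rough sketch that assumes nothing about the layout of proxy_trajectory.json beyond the finish_reason field mentioned above: it walks the whole JSON tree and counts objects whose finish_reason is tool_calls. The function names and the file location passed to summarize are hypothetical.

```python
import json
from pathlib import Path


def count_tool_call_turns(node) -> int:
    """Recursively count dicts whose finish_reason equals 'tool_calls'."""
    if isinstance(node, dict):
        hit = 1 if node.get("finish_reason") == "tool_calls" else 0
        return hit + sum(count_tool_call_turns(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_tool_call_turns(v) for v in node)
    return 0


def summarize(path: str) -> int:
    """Load a trajectory file and return the number of tool-call turns in it."""
    data = json.loads(Path(path).read_text())
    return count_tool_call_turns(data)
```

Calling summarize() on a trajectory from the first step or two should return a non-zero count once the parser matches the model; a count of 0 suggests the mismatch is still in place.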

Full write-up: troubleshooting/training_env/tool-call-parser-mismatch.md

Slow venv imports on a network filesystem

Symptom. Every launch pays a cold-import tax: import torch takes ~8.7s (vs ~1.8s from local disk) because the venv lives on a network filesystem (CPFS / fuse.aliyun-alinas-efc) and small-file metadata I/O goes over the network.
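To confirm the symptom on a given node, the import can be timed in a fresh interpreter. The sketch below is one way to do that, assuming the interpreter of the venv under test is the one running the script; time_cold_import is a hypothetical helper, and the measurement includes interpreter startup time.

```python
import subprocess
import sys
import time


def time_cold_import(module: str, python: str = sys.executable) -> float:
    """Return wall-clock seconds for a fresh interpreter to import `module`."""
    start = time.perf_counter()
    # A new process guarantees the module is not already in sys.modules.
    subprocess.run([python, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Swap "json" for "torch" when running inside the training venv.
    print(f"import json: {time_cold_import('json'):.2f}s")
```

Repeated runs are usually faster than the first because the OS page cache warms up, so the first measurement after a reboot or on a fresh node is the one closest to the launch-time cost.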

Fix. Copy the venv onto local NVMe once and point VENV_PATH at it:

cp -a .venv /root/venvs/harbor-verl-train-venv
VENV_PATH=/root/venvs/harbor-verl-train-venv bash scripts/sync_1node_cc.sh

/root is per-node local storage, so each node needs its own copy. Re-sync when pyproject.toml changes or packages are added: rsync -a --delete <src>/.venv/ /root/venvs/harbor-verl-train-venv/.
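Whether a venv path really sits on local storage can be checked against the node's mount table. This is a minimal sketch, assuming a Linux node where /proc/mounts is available and mount points contain no spaces; fstype_for is a hypothetical helper that returns the filesystem type of the longest mount point containing the path.

```python
import os


def fstype_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point containing `path`."""
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        # >= lets a later entry for the same mount point override an earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    venv = os.path.realpath("/root/venvs/harbor-verl-train-venv")
    with open("/proc/mounts") as f:
        print(venv, "->", fstype_for(venv, f.read()))
```

A result such as fuse.aliyun-alinas-efc or nfs4 means the path is still network-backed, while a local filesystem type such as ext4 or xfs indicates the copy landed on node-local storage.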

Editable installs still apply

A copied venv keeps its editable installs pointing at the original repos/ trees. Verify with python -c "import harbor, verl, verl_patch; print(harbor.__file__)" — see Core Concepts.

Full write-up: troubleshooting/training_env/local-venv-cache.md
