Troubleshooting llama.cpp Tool Calls
This page covers llama.cpp (llama-server) specifically. If you’re using Ollama or vLLM, these flags don’t apply — those servers handle chat templates automatically.
Self-hosted netclaw on llama.cpp producing garbage? Raw <tool_call> XML in chat messages, </think> tags in replies, empty tool arguments, or responses that trail off into repetition loops — these all trace back to how the inference server parses the model’s chat template.
Netclaw just assembles whatever streaming deltas it receives. If the server emits <tool_call> as literal text instead of structured tool-call deltas, that’s what shows up in chat.
Quick checklist
Section titled “Quick checklist”Try these first before diving into root causes:
- Add
--jinjato your llama-server launch arguments. If absent and you’re on Qwen3 / DeepSeek-R1, that’s almost certainly the problem. - Check the model card on Hugging Face for recommended flags. If it lists
--jinjaor a custom chat template file, use it. - Check your quantization. Switch to Q5_K_XL or Q6_K_XL if you’re on Q4 or lower.
- Disable speculative decoding if you’re seeing truncation or repetition loops. Remove
--spec-default. - Check the llama.cpp build for known regressions. Tool-call parser fixes land frequently.
- Verify the GGUF’s embedded template isn’t broken. Community-corrected templates exist for several Qwen3 quants.
If that didn’t fix it, read on for root cause details.
Symptoms
Section titled “Symptoms”| What you’re seeing | Likely cause |
|---|---|
<tool_call>, <function=…>, <parameter=…> tags as plain text | Chat template mismatch |
Stray </think> or <think> in replies | Chat template mismatch |
| Tool-call arguments containing another tool call concatenated onto the end | Chat template mismatch |
Empty or partial tool arguments (args={}, {"Path": ""}) | Chat template mismatch or quantization too aggressive |
Lists truncate mid-item with finishReason=stop | Speculative decoding |
Token repetition loops (they'd'd'd'd...) | Speculative decoding |
| Long SSE silence, then watchdog kills the session | Speculative decoding |
| Works in one session, breaks in another with longer history | Chat template mismatch — longer context makes misfires more likely |
Chat template mismatch
Section titled “Chat template mismatch”Different model families use different delimiters for tool calls and thinking blocks:
| Model family | Tool-call format | Reasoning format |
|---|---|---|
| Qwen3 / Qwen3.5 / Qwen3-Coder | <tool_call><function=…><parameter=…>…</parameter></tool_call> | <think>…</think> |
| DeepSeek-R1 | JSON-shaped | <think>…</think> |
| Hermes / Mistral | JSON-shaped | varies |
llama-server only parses these correctly with --jinja. That flag tells it to use the chat template embedded in the GGUF. Without it, the server falls back to a heuristic parser that doesn’t recognize the delimiters and passes them through as plain text.
Fix: add --jinja and --reasoning-format
Section titled “Fix: add --jinja and --reasoning-format”For Qwen3 (the most common netclaw deployment):
llama-server \ --model <path/to/qwen3-gguf> \ --jinja \ --reasoning-format deepseek \ --flash-attn on \ --ctx-size <N> \ --parallel <K> \ --port 8080--jinja— uses the GGUF’s embedded chat template. Knows the model’s tool-call delimiters.--reasoning-format deepseek— correct for Qwen3, which uses the same<think>/</think>delimiters as DeepSeek.
Restart llama-server with these flags, then send a message that triggers a tool call. You should see structured JSON tool calls in netclaw’s session log — not <tool_call> XML leaking into chat.
Rule of thumb
Section titled “Rule of thumb”Models with XML-style tool-call markup or a bundled chat_template.jinja need --jinja.
Models known to require --jinja
Section titled “Models known to require --jinja”- Qwen3 / Qwen3.5 / Qwen3.6 / Qwen3-Coder — confirmed.
<tool_call>XML and</think>leak without it. - Qwen2.5-Instruct (with tool calling) — covered explicitly by llama.cpp’s function-calling docs.
- DeepSeek-R1 distills — reasoning leakage if
--reasoning-formatis wrong.
Buggy GGUF templates
Section titled “Buggy GGUF templates”Some GGUF files (quantized model packages) ship with broken embedded templates. Check the model’s Hugging Face discussions for community-corrected templates — serve them with --chat-template-file <path> instead of relying on the embedded one.
The Unsloth Qwen3-Coder GGUF discussions document the exact failure modes and link to corrected templates.
Quantization
Section titled “Quantization”Tool-call structure is quantization-sensitive. Sub-4-bit quants (Q3, Q2, IQ) produce malformed tool calls even with the correct template.
If you’re on Q4 and seeing empty or garbled arguments, switch to Q5_K_XL or Q6_K_XL. You’ll burn a bit more VRAM, but structured output accuracy jumps significantly.
Speculative decoding
Section titled “Speculative decoding”Speculative decoding (specdec) with --spec-default (or a separate draft model) causes three distinct failures on reasoning-format models like Qwen3:
- Mid-list truncation. Model writes
1. <item>, starts2., then emits an end-of-sequence (EOS) token. Output is plain text, just cut short.finishReason=stop(notlength). - Token repetition loops. Model enters a state producing hundreds of repetitions of a 1-2 character suffix. The inference server eventually returns HTTP 500.
- Long SSE silence. Streaming starts normally, then deltas slow to a trickle and stop. Netclaw’s streaming watchdog kills the session on inactivity timeout.
All three reproduce only with speculative decoding enabled.
Why it happens
Section titled “Why it happens”Specdec accepts tokens that match between draft and main, rejects on first divergence. At list markers, sentence boundaries, and </think>, rejections spike the EOS probability and flip the next token into stop emission (truncation) or a single-token loop with no recovery. Template mismatch between draft and main models concentrates rejections at these positions. Q4 quantization makes it worse — the compressed logit distribution means a single high-EOS position cascades more easily than on Q5/Q6.
Fix: disable specdec
Section titled “Fix: disable specdec”Remove --spec-default from your llama-server launch flags. If symptoms disappear, specdec rejection at template boundaries is confirmed.
Long-term, Multi-Token Prediction (MTP) uses the model’s native multi-token heads instead of a draft model and avoids these rejection artifacts. For Qwen3.6, MTP support is tracked in ggml-org/llama.cpp#22673.
Diagnosing with netclaw logs
Section titled “Diagnosing with netclaw logs”Netclaw logs diagnostic counters at three layers for every LLM streaming call:
| Layer | What it reports |
|---|---|
| SSE | Raw deltas off the wire — delta counts, suppressed deltas, finish reason |
| Middleware | What the chat-client decorator saw before the actor consumed it |
| Actor | Assembled ChatResponse — text chars, thinking chars, tool calls, finish reason |
These show up in the per-session log at ~/.netclaw/logs/sessions/<channel>_<thread>/session.log. Enable debug logging to see them — see Debug Logging for the config.
If counts match across all three layers but a tool call’s arguments field is corrupted, the corruption originates upstream of netclaw — the inference server’s chat template.
Reading the counters
Section titled “Reading the counters”| Pattern | Diagnosis |
|---|---|
textChars=173 finishReason=stop on a truncated list | Specdec early termination |
output=2 finishReason=stop after long silence | Specdec stall — streaming watchdog timeout |
High thinkingChars with finishReason=length | Repetition loop hit max-token cap |
textChars inflated, then HTTP 500 from upstream | Repetition loop crashed the server’s parser |
See also
Section titled “See also”- Self-Hosted Providers — llama.cpp and Ollama setup
- Channel Troubleshooting — Slack and Discord connectivity issues
netclaw doctor— health check diagnosticsnetclaw status— live connector health
Resources
Section titled “Resources”- llama.cpp function-calling docs — definitive reference for
--jinjaand tool-call parsing - Qwen llama.cpp guide — official Qwen deployment guide
- Unsloth Qwen3-Coder template fixes — community-corrected GGUF templates
- llama.cpp server docs — full flag reference