Fixed Chat Templates for Qwen 3.5 & 3.6

Drop-in Jinja chat templates for Qwen 3.5/3.6 that fix rendering errors, token waste, and tool-calling failures across runtimes (LM Studio, llama.cpp, vLLM, MLX). They add a think-on/think-off toggle, automatic closing of broken thinking tags, robust tool-argument handling, and a graceful fallback for missing user queries.

Introduction

Most official Qwen chat templates assume Python/Jinja runtime behaviors that real-world LLM runtimes do not always provide. That mismatch causes template-render errors on C++ engines, wasted context from empty reasoning blocks, fragile tool calls, and occasional hallucinated or unclosed thinking tags. Any of these can break tool-calling agents or force ad hoc runtime workarounds. These templates are a small, pragmatic compatibility layer: they preserve the original chat structure while making the templates portable and resilient across common engines.

What Sets It Apart
  • Portable Jinja patterns: replaces Python-only constructs (e.g., |items, |safe) with patterns supported by LM Studio, llama.cpp, MLX, oMLX, and vLLM, so templates render instead of erroring and tool calls can proceed. (So what: you avoid runtime template crashes and can run Qwen-based agents on C++ stacks.)
  • Robust tool-call handling: serializes arguments safely, iterates arguments without Python-only filters, and auto-closes any open thinking blocks before a tool boundary. (So what: tool-calling chains and agentic loops stop failing mid-run.)
  • Thought-management controls: adds a hidden <|think_on|> / <|think_off|> toggle and omits empty reasoning blocks by default. (So what: reduces token waste and preserves prefix caching and consistent history visibility.)
  • Resilience for Qwen 3.6 quirks: detects </thinking> hallucinations, rescues interrupted thought streams, and falls back gracefully when no user query exists. (So what: fewer parse errors and more stable agent restarts.)
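
The thought-management behavior described in the list above can be sketched in plain Python. This is an illustration of the logic only, not the templates' actual Jinja code. The tag names `<think>` and `</think>`, the helper names, and the choice to scan every message for the toggle are assumptions made for this example; only the `<|think_on|>` and `<|think_off|>` markers come from the description.

```python
import re

# Assumed tag names for the reasoning block; check your template for the real ones.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"

EMPTY_THINK = re.compile(re.escape(THINK_OPEN) + r"\s*" + re.escape(THINK_CLOSE) + r"\s*")


def close_open_think(text: str) -> str:
    """Append a closing tag if the last reasoning block was left open."""
    last_open = text.rfind(THINK_OPEN)
    last_close = text.rfind(THINK_CLOSE)
    if last_open != -1 and last_open > last_close:
        return text.rstrip() + "\n" + THINK_CLOSE + "\n"
    return text


def drop_empty_think(text: str) -> str:
    """Remove reasoning blocks that contain only whitespace."""
    return EMPTY_THINK.sub("", text)


def thinking_enabled(messages, default: bool = True) -> bool:
    """Return the state set by the most recent toggle marker, if any."""
    state = default
    for message in messages:
        content = message.get("content")
        if not isinstance(content, str):
            continue  # skip empty or non-text content
        on = content.rfind("<|think_on|>")
        off = content.rfind("<|think_off|>")
        if on == -1 and off == -1:
            continue  # no toggle in this message
        state = on > off  # the later marker wins
    return state
```

Calling `close_open_think` on assistant text before a tool-call section mirrors the "auto-close before a tool boundary" idea: downstream parsers then never see an unterminated reasoning block.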

Who It's For

Great fit if you run Qwen 3.5/3.6 locally or in lightweight runtimes (LM Studio, llama.cpp, MLX, vLLM, oMLX) and need stable tool calling or want to reduce template-related token overhead. It suits developers who embed Qwen models in agent frameworks or custom tool chains, and anyone who maintains templates across heterogeneous runtimes.

Look elsewhere if you rely exclusively on an official cloud-hosted Qwen API that already provides updated, runtime-specific templates; this repo focuses on cross-runtime portability and fixes rather than added model capabilities.

Where It Fits

Use the 3.5 file for Qwen 3.5 variants and the 3.6 file for 3.6 variants (the 3.6 template is a superset). Installation depends on the runtime:
  • LM Studio: replace the Prompt Template.
  • llama.cpp/koboldcpp: pass --jinja --chat-template-file.
  • vLLM/TextGen: embed the template into tokenizer_config.
  • MLX/oMLX: overwrite the local chat_template.jinja and remove template kwargs that conflict.
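
For the tokenizer_config route, a minimal sketch of the embedding step might look like the following. It assumes the model directory holds a Hugging Face style tokenizer_config.json whose `chat_template` field stores the template as a string. The function name and the file paths in the usage comment are hypothetical, so substitute the actual template file from the repo.

```python
import json
from pathlib import Path


def embed_chat_template(template_path, config_path):
    """Copy a Jinja template file into the chat_template field of a tokenizer config."""
    template = Path(template_path).read_text(encoding="utf-8")
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))
    config["chat_template"] = template  # replaces any existing template
    config_file.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config


# Hypothetical usage; both paths are placeholders:
# embed_chat_template("fixed_template.jinja", "my-model/tokenizer_config.json")
```

Back up the original config first, since the function overwrites the existing `chat_template` value in place.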

Practical trade-offs and notes

The templates intentionally avoid Python-only filters and |safe wrappers to maximize cross-engine compatibility; as a result, some Python-specific conveniences are removed and string handling is more conservative. The goal is predictable runtime behavior rather than preserving every original templating shortcut. The Apache-2.0 license is declared (inherited from Qwen), so integration into downstream tooling is straightforward provided you follow the license terms.
