Full tool-calling support, inference abort fixes, XML parsing, OpenAI streaming compliance #413

devnen wants to merge 2 commits into theroyallab:main
Conversation
The biggest drawback of Tabby is its problems with tool calling.
I can't speak to the code quality, but I'm testing this PR now and it does seem to work quite well.
@dinerburger what models are you running? I'm very curious about which PR is best as it looks like there are two of them that implement robust tool-calling... |
Qwen 3.5 series, 27B specifically. @vlawhern |
mratsim left a comment
Given that this PR introduces a new Jinja template, I think we should discuss (separate RFC issue?) what we should do with the templates/ infrastructure in tabbyAPI. I don't think anyone actually uses it, given that the one in this repo only contains alpaca and chatml, and the other repo at https://github.com/theroyallab/llm-prompt-templates has not been updated for 8 months and so does not support any of the best agentic models compatible with EXL3 (GLM-4.x, MiniMax-M2.x, Qwen3.5, ...).
```jinja
{%- set tool_call_format = "xml" -%}
{%- set tool_start = "<tool_call>" -%}
{%- set tool_end = "</tool_call>" -%}
{%- set stop_strings = ["<|im_start|>", "<|im_end|>"] -%}
```
This probably needs an RFC, but I don't think Tabby can keep up with all the Jinja templates that AI labs are creating. It would be best if the raw template from HF could be parsed and loaded, tool calls included.
No one is maintaining https://github.com/theroyallab/llm-prompt-templates, and each entry would duplicate the original chat template except for an extra 2~4 lines for tool calls.
What This PR Does
TabbyAPI's tool-calling system worked for simple cases but had a collection of bugs that surfaced the moment you pushed it harder — a different model family, a stricter client, or just hitting Stop at the wrong time. This PR fixes all of them in one pass, tested end-to-end against Kilo Code, Roo Code, and OpenCode with Qwen3-Coder-Next on a dual-GPU setup.
The changes fall into four areas:
- XML tool-call parsing for models that emit XML-format tool calls (Qwen3-Coder).
- OpenAI streaming protocol compliance for strict clients.
- Reliable inference abort when a client disconnects mid-generation.
- `tool_choice` is now respected, and a Jinja filter fix makes HuggingFace-native chat templates work out of the box.

Built against commit `41511f5` with ExLlamaV3 v0.0.22.

Changes
1. XML Tool-Call Parsing
Problem: Qwen3-Coder models are trained to emit tool calls in XML format (`<function=name><parameter=key>value</parameter></function>`). TabbyAPI's tool-calling system only supported JSON via constrained generation, causing XML tool calls to be dumped as plain text with no `tool_calls` array in the response.

Solution:

- Added an XML parser (`from_xml()`) alongside the existing JSON path, with a `from_auto()` dispatcher that tries JSON → JSON-in-wrapper → XML in sequence.
- Added a `tool_call_format` metadata field to the Jinja template system, allowing templates to declare whether they expect `json`, `xml`, or `auto` format tool calls.
- Added a fallback scan of the `content` field for bare `<function=` patterns when the two-pass system doesn't trigger.
- Added type coercion (`json.loads()` on string arguments) to prevent crashes in multi-turn tool conversations where clients send `arguments` as a JSON string but the Jinja template expects a dict.
- Added a new template (`templates/tool_calls/qwen3_coder.jinja`) based on the official Qwen3-Coder-Next `chat_template.jinja` with TabbyAPI metadata.
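The JSON → JSON-in-wrapper → XML dispatch order can be sketched as follows. The name `from_auto` comes from the PR description, but the body is an illustrative guess, not the PR's actual code:

```python
import json
import re

def from_auto(text: str):
    """Hypothetical dispatcher: try bare JSON, then JSON inside a wrapper
    tag, then fall back to XML-style <function=...> markup."""
    stripped = text.strip()
    # 1. Bare JSON tool-call payload
    try:
        return ("json", json.loads(stripped))
    except json.JSONDecodeError:
        pass
    # 2. JSON wrapped in a tag such as <tool_call>...</tool_call>
    match = re.search(r"<tool_call>(.*?)</tool_call>", stripped, re.DOTALL)
    if match:
        try:
            return ("json", json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass
    # 3. XML-style markup: hand the raw text off to the XML parser
    if "<function=" in stripped:
        return ("xml", stripped)
    return (None, None)
```

Trying the cheapest format first means well-formed JSON never pays the cost of the regex paths.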
Design notes:

- The XML parser cannot use a real XML library, since the models emit `<function=name>` with `=` in the tag name, which is invalid XML. This matches the approach taken by vLLM, llama.cpp, and the official Qwen parser.
- Parameter values are coerced with `json.loads()` with a string fallback, explicitly avoiding `eval()` (ref: CVE-2025-9141 in vLLM's parser).
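A minimal illustration of regex-based extraction with `json.loads()` coercion; the patterns and helper name here are hypothetical, not taken from the PR:

```python
import json
import re

# <function=name> cannot be fed to a real XML parser because "=" is
# illegal inside a tag name, so the tags are matched with regexes.
FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def parse_xml_tool_calls(text: str) -> list[dict]:
    calls = []
    for name, body in FUNC_RE.findall(text):
        arguments = {}
        for key, raw in PARAM_RE.findall(body):
            raw = raw.strip()
            try:
                # Coerce numbers, booleans, and nested JSON values...
                arguments[key] = json.loads(raw)
            except json.JSONDecodeError:
                # ...and fall back to the raw string; never eval().
                arguments[key] = raw
        calls.append({"name": name, "arguments": arguments})
    return calls
```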
2. OpenAI Streaming Protocol Compliance

Problem: After adding XML parsing, non-streaming responses worked correctly, but strict clients (OpenCode / Vercel AI SDK) rejected streaming responses with `AI_TypeValidationError`. The SSE chunks were missing the required `index` field on tool-call deltas, emitting `role: "user"` instead of `"assistant"`, merging tool-call data with the finish signal, and leaking null fields.

Solution:

- Added `_build_tool_call_chunks()` implementing a two-chunk emission pattern: one chunk with complete tool-call data (`role: "assistant"`, a `tool_calls` array with `index` values, `finish_reason: null`), followed by a separate finish chunk (`delta: {}`, `finish_reason: "tool_calls"`).
- Added `_serialize_stream_chunk()` for consistent serialization across all chunk types, using `exclude_none=True` while restoring the semantically meaningful `finish_reason: null` on intermediate chunks.
- Updated `stream_generate_chat_completion()` to intercept tool-call generation results, parse them, and emit spec-compliant chunks before the normal chunk-building path.
- Removed tool-call handling from `_create_stream_chunk()` since it is now handled upstream.
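The two-chunk emission pattern might look like this on the wire. This is a sketch using plain dicts (the helper name mirrors the PR's `_build_tool_call_chunks()`, but the real implementation presumably builds Pydantic models):

```python
def build_tool_call_chunks(tool_calls: list[dict], chunk_id: str, model: str) -> list[dict]:
    """Sketch of the two-chunk pattern: data chunk, then bare finish chunk."""
    data_chunk = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            # Chunk 1: full tool-call data, assistant role, no finish signal
            "delta": {
                "role": "assistant",
                "tool_calls": [
                    {"index": i, **call} for i, call in enumerate(tool_calls)
                ],
            },
            "finish_reason": None,
        }],
    }
    finish_chunk = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        # Chunk 2: empty delta carrying only the finish signal
        "choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}],
    }
    return [data_chunk, finish_chunk]
```

Separating the data from the finish signal is what strict SDK validators expect: the final chunk of a tool-call turn carries `finish_reason: "tool_calls"` and nothing else.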
3. Pydantic v2 Union Coercion Fix

Problem: The `index` field added for streaming compliance was silently dropped during chunk construction. Pydantic v2's smart Union coercion converts dicts passed through `Union[ChatCompletionMessage, dict]` fields into `ChatCompletionMessage` instances, which in turn coerces nested tool-call dicts into `ToolCall` instances. With `extra='ignore'` (the default), any keys not declared on the model are silently discarded.

Solution:
- Added `index: Optional[int] = None` directly to the `ToolCall` model in `types/tools.py`. This ensures the field survives Pydantic's coercion rather than being treated as an extra field.
- Generated tool-call IDs now use the `call_` prefix convention (`call_<24-char hex>`), matching the format expected by some strict clients.
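The silent-drop behavior is easy to reproduce with two toy models (illustrative names, requires Pydantic v2; these are not the PR's actual classes):

```python
from typing import Optional
from pydantic import BaseModel

class ToolCallBefore(BaseModel):
    # `index` is not declared, so extra='ignore' (the v2 default)
    # silently discards it during coercion.
    id: str

class ToolCallAfter(BaseModel):
    # The fix: declare the field explicitly so it survives validation.
    id: str
    index: Optional[int] = None

raw = {"id": "call_abc123", "index": 0}
before = ToolCallBefore(**raw).model_dump()
after = ToolCallAfter(**raw).model_dump()
```

`before` has no `index` key at all, while `after` keeps `index == 0`, which is exactly the difference the streaming validators were tripping over.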
4. Inference Abort Fixes

Problem: When a user presses "Stop" during streaming inference, TabbyAPI did not reliably stop generating tokens. The model continued running on the GPU after the client disconnected.
Three bugs were identified and fixed:
Bug 1 — `gen_queue.get()` blocks disconnect detection:

The consumer loop's `await gen_queue.get()` blocks indefinitely when the queue is empty (during prefill, between tokens). While blocked, the `disconnect_task.done()` check never re-executes.

Fix: Replaced the blocking `get()` with an `asyncio.wait()` that races a queue-get task against the disconnect task. Applied identically in both `stream_generate_completion` and `stream_generate_chat_completion`.
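The race can be sketched with a toy consumer. The names (`consume`, the `None` sentinel) are illustrative, not the PR's code:

```python
import asyncio

async def consume(gen_queue: asyncio.Queue, disconnect_event: asyncio.Event) -> str:
    """Race the queue get against client disconnect instead of blocking."""
    disconnect_task = asyncio.create_task(disconnect_event.wait())
    try:
        while True:
            get_task = asyncio.create_task(gen_queue.get())
            done, _pending = await asyncio.wait(
                {get_task, disconnect_task},
                return_when=asyncio.FIRST_COMPLETED,
            )
            if disconnect_task in done:
                get_task.cancel()  # stop waiting for the next token
                return "aborted"
            item = get_task.result()
            if item is None:  # sentinel: generation finished normally
                return "finished"
            # ... a real consumer would yield `item` to the SSE stream here ...
    finally:
        disconnect_task.cancel()

async def main() -> str:
    gen_queue: asyncio.Queue = asyncio.Queue()
    disconnect_event = asyncio.Event()
    consumer = asyncio.create_task(consume(gen_queue, disconnect_event))
    await asyncio.sleep(0.01)   # queue stays empty, as during prefill
    disconnect_event.set()      # simulate the client pressing "Stop"
    return await consumer

result = asyncio.run(main())
```

With a plain `await gen_queue.get()`, the `disconnect_event.set()` above would never be observed until the next token arrived; the `asyncio.wait` race notices it immediately.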
Bug 2 — ExLlamaV3 job registered too late in `active_job_ids`:

The `AsyncJob` was only assigned to `self.active_job_ids[request_id]` after the generation loop finished. During generation, the entry held `None`, which `wait_for_jobs()` skips.

Fix: Moved the assignment to immediately after `AsyncJob()` construction, before the generation loop.

Bug 3 — `GeneratorExit` bypasses the abort event:

When `sse_starlette` crashes on a dropped TCP connection, it injects `GeneratorExit` into the async generator. The existing `except CancelledError` and `except Exception` handlers don't catch `GeneratorExit` (a `BaseException`), so `abort_event.set()` is never called and inference continues.

Fix: Moved `abort_event.set()` and `disconnect_task.cancel()` into a `finally` block, which executes on all exit paths including `GeneratorExit`.
Files Changed

- `backends/exllamav3/model.py`: `active_job_ids` assignment before generation loop
- `common/templating.py`: `tool_call_format` field on `TemplateMetadata`, validation
- `endpoints/OAI/types/tools.py`: `index: Optional[int] = None` on `ToolCall`, `call_` prefix ID format
- `endpoints/OAI/utils/chat_completion.py`: `asyncio.wait`-based disconnect detection, `finally` block for abort
- `endpoints/OAI/utils/completion.py`: `asyncio.wait`-based disconnect detection, `finally` block for abort
- `endpoints/OAI/utils/tools.py`: XML parsing (`from_xml`, `from_auto`, `parse`, `extract_content_and_tools`), think-block stripping, type coercion
- `templates/tool_calls/qwen3_coder.jinja`: new tool-call template for Qwen3-Coder

Testing
Validated against:
Test scenarios: single and multiple tool calls, multi-turn tool conversations, mixed text + tool calls, streaming and non-streaming modes, client disconnect during inference.
Environment
Edit: Additional improvements after initial submission:
5. Broader Model Compatibility
- Added a `tojson` Jinja filter override so the model's built-in HuggingFace template works out of the box (the default sandboxed filter crashes on `tojson(ensure_ascii=False)`, which Qwen3-Coder's template uses).
- The JSON parser now accepts `{"name": ..., "arguments": ...}` dicts without the `function` wrapper, single objects instead of arrays, markdown-fenced JSON, and string-typed arguments. Previously only perfectly-formed OAI-shaped arrays were accepted.
- Normalized `token_ids` to a plain Python list, with robust handling for tensors and tuples from different ExLlamaV3 kernels.
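An override along these lines is one way to make `tojson(ensure_ascii=False)` work. This sketch uses a plain Jinja2 `Environment` rather than the sandboxed one TabbyAPI presumably uses, and the filter body is an assumption, not the PR's code:

```python
import json
from jinja2 import Environment

def tojson_filter(value, **kwargs):
    """Replacement tojson filter that forwards keyword arguments
    (e.g. ensure_ascii=False) straight to json.dumps."""
    kwargs.setdefault("ensure_ascii", False)  # keep non-ASCII characters intact
    return json.dumps(value, **kwargs)

env = Environment()
env.filters["tojson"] = tojson_filter  # shadow the built-in filter

template = env.from_string("{{ data | tojson(ensure_ascii=False) }}")
rendered = template.render(data={"msg": "héllo"})
```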
6. tool_choice Support

Added support for the OpenAI `tool_choice` parameter: `"none"` skips tool generation entirely, `"required"` forces a tool-call pass even when the model doesn't emit the stop string, and a named function choice (`{"type": "function", "function": {"name": "..."}}`) filters results to the specified function.
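The filtering half of that logic can be sketched as follows (hypothetical helper; the forced-generation control flow for `"required"` is not shown):

```python
def resolve_tool_choice(tool_choice, parsed_calls: list[dict]) -> list[dict]:
    """Sketch of OpenAI tool_choice semantics applied to parsed calls."""
    if tool_choice == "none":
        return []  # skip tool calls entirely
    if isinstance(tool_choice, dict) and tool_choice.get("type") == "function":
        # Named function choice: keep only calls to the requested function
        wanted = tool_choice["function"]["name"]
        return [c for c in parsed_calls if c["function"]["name"] == wanted]
    # "auto" / "required": keep everything the parser produced
    return parsed_calls
```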
7. Bug Fixes and Cleanup

- Fixed a bug in `generate_tool_calls()` where the shared `prompt` variable was modified in place during the loop. With `n > 1`, each iteration would stack previous generations' text onto the prompt. Now uses a per-iteration local variable.
- Moved `tool_end` extraction to `TemplateMetadata` so template-provided metadata isn't silently discarded.

Additional files changed:
- `endpoints/OAI/types/chat_completion.py`: `tool_choice`, `parallel_tool_calls` fields
- `endpoints/OAI/types/tools.py`: `NamedToolChoice`, `NamedToolFunction` models