Full tool-calling support, inference abort fixes, XML parsing, OpenAI streaming compliance #413

devnen wants to merge 2 commits into theroyallab:main
Conversation
The biggest drawback of Tabby is its problems with tool calling.
I can't speak to the code quality, but I'm testing this PR now and it does seem to work quite well.
@dinerburger what models are you running? I'm very curious about which PR is best as it looks like there are two of them that implement robust tool-calling... |
Qwen 3.5 series, 27B specifically. @vlawhern |
mratsim left a comment
Given that this PR introduces a new Jinja template, I think we should discuss (separate RFC issue?) what we should do with the templates/ infrastructure in tabbyAPI. I don't think anyone actually uses it, given that the one in this repo only contains alpaca and chatml, and the other repo at https://github.com/theroyallab/llm-prompt-templates has not been updated for 8 months and so does not support any of the best agentic models compatible with EXL3 (GLM-4.x, MiniMax-M2.x, Qwen3.5, ...).
```jinja
{%- set tool_call_format = "xml" -%}
{%- set tool_start = "<tool_call>" -%}
{%- set tool_end = "</tool_call>" -%}
{%- set stop_strings = ["<|im_start|>", "<|im_end|>"] -%}
```
This probably needs an RFC, but I don't think Tabby can keep up with all the Jinja templates that AI labs are creating. It would be best if the raw template from HF could be parsed and loaded, tool calls included.
No one is maintaining https://github.com/theroyallab/llm-prompt-templates, and each entry would duplicate the original chat template except for an extra 2~4 lines for tool calls.
What This PR Does
TabbyAPI's tool-calling system worked for simple cases but had a collection of bugs that surfaced the moment you pushed it harder — a different model family, a stricter client, or just hitting Stop at the wrong time. This PR fixes all of them in one pass, tested end-to-end against Kilo Code, Roo Code, and OpenCode with Qwen3-Coder-Next on a dual-GPU setup.
The changes fall into four areas:
- XML tool-call parsing for models that emit XML-format tool calls (Qwen3-Coder).
- OpenAI streaming protocol compliance for strict clients.
- Reliable inference abort when a client disconnects mid-generation.
- `tool_choice` is now respected, and a Jinja filter fix makes HuggingFace-native chat templates work out of the box.

Built against commit `41511f5` with ExLlamaV3 v0.0.22.

Changes
1. XML Tool-Call Parsing
Problem: Qwen3-Coder models are trained to emit tool calls in XML format (`<function=name><parameter=key>value</parameter></function>`). TabbyAPI's tool-calling system only supported JSON via constrained generation, causing XML tool calls to be dumped as plain text with no `tool_calls` array in the response.

Solution:

- Added an XML parser (`from_xml()`) alongside the existing JSON path, with a `from_auto()` dispatcher that tries JSON → JSON-in-wrapper → XML in sequence.
- Added a `tool_call_format` metadata field to the Jinja template system, allowing templates to declare whether they expect `json`, `xml`, or `auto` format tool calls.
- Added a fallback scan of the `content` field for bare `<function=` patterns when the two-pass system doesn't trigger.
- Added type coercion (`json.loads()` on string arguments) to prevent crashes in multi-turn tool conversations where clients send `arguments` as a JSON string but the Jinja template expects a dict.
- Added a new template (`templates/tool_calls/qwen3_coder.jinja`) based on the official Qwen3-Coder-Next `chat_template.jinja` with TabbyAPI metadata.
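The JSON → JSON-in-wrapper → XML dispatch order can be sketched as follows. The name `from_auto` comes from the PR description, but the body is an illustrative guess, not the PR's actual code:

```python
import json
import re

def from_auto(text: str):
    """Hypothetical dispatcher: try bare JSON, then JSON inside a wrapper
    tag, then fall back to XML-style <function=...> markup."""
    stripped = text.strip()
    # 1. Bare JSON tool-call payload
    try:
        return ("json", json.loads(stripped))
    except json.JSONDecodeError:
        pass
    # 2. JSON wrapped in a tag such as <tool_call>...</tool_call>
    match = re.search(r"<tool_call>(.*?)</tool_call>", stripped, re.DOTALL)
    if match:
        try:
            return ("json", json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass
    # 3. XML-style markup: hand the raw text off to the XML parser
    if "<function=" in stripped:
        return ("xml", stripped)
    return (None, None)
```

Trying the cheapest format first means well-formed JSON never pays the cost of the regex paths.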
Design notes:

- The XML parser cannot use a real XML library, since the models emit `<function=name>` with `=` in the tag name, which is invalid XML. This matches the approach taken by vLLM, llama.cpp, and the official Qwen parser.
- Parameter values are coerced with `json.loads()` with a string fallback, explicitly avoiding `eval()` (ref: CVE-2025-9141 in vLLM's parser).
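A minimal illustration of regex-based extraction with `json.loads()` coercion; the patterns and helper name here are hypothetical, not taken from the PR:

```python
import json
import re

# <function=name> cannot be fed to a real XML parser because "=" is
# illegal inside a tag name, so the tags are matched with regexes.
FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def parse_xml_tool_calls(text: str) -> list[dict]:
    calls = []
    for name, body in FUNC_RE.findall(text):
        arguments = {}
        for key, raw in PARAM_RE.findall(body):
            raw = raw.strip()
            try:
                # Coerce numbers, booleans, and nested JSON values...
                arguments[key] = json.loads(raw)
            except json.JSONDecodeError:
                # ...and fall back to the raw string; never eval().
                arguments[key] = raw
        calls.append({"name": name, "arguments": arguments})
    return calls
```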
2. OpenAI Streaming Protocol Compliance

Problem: After adding XML parsing, non-streaming responses worked correctly, but strict clients (OpenCode / Vercel AI SDK) rejected streaming responses with `AI_TypeValidationError`. The SSE chunks were missing the required `index` field on tool-call deltas, emitting `role: "user"` instead of `"assistant"`, merging tool-call data with the finish signal, and leaking null fields.

Solution:

- Added `_build_tool_call_chunks()` implementing a two-chunk emission pattern: one chunk with complete tool-call data (`role: "assistant"`, a `tool_calls` array with `index` values, `finish_reason: null`), followed by a separate finish chunk (`delta: {}`, `finish_reason: "tool_calls"`).
- Added `_serialize_stream_chunk()` for consistent serialization across all chunk types, using `exclude_none=True` while restoring the semantically meaningful `finish_reason: null` on intermediate chunks.
- Updated `stream_generate_chat_completion()` to intercept tool-call generation results, parse them, and emit spec-compliant chunks before the normal chunk-building path.
- Removed tool-call handling from `_create_stream_chunk()` since it is now handled upstream.
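The two-chunk emission pattern might look like this on the wire. This is a sketch using plain dicts (the helper name mirrors the PR's `_build_tool_call_chunks()`, but the real implementation presumably builds Pydantic models):

```python
def build_tool_call_chunks(tool_calls: list[dict], chunk_id: str, model: str) -> list[dict]:
    """Sketch of the two-chunk pattern: data chunk, then bare finish chunk."""
    data_chunk = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{
            "index": 0,
            # Chunk 1: full tool-call data, assistant role, no finish signal
            "delta": {
                "role": "assistant",
                "tool_calls": [
                    {"index": i, **call} for i, call in enumerate(tool_calls)
                ],
            },
            "finish_reason": None,
        }],
    }
    finish_chunk = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "model": model,
        # Chunk 2: empty delta carrying only the finish signal
        "choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}],
    }
    return [data_chunk, finish_chunk]
```

Separating the data from the finish signal is what strict SDK validators expect: the final chunk of a tool-call turn carries `finish_reason: "tool_calls"` and nothing else.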
3. Pydantic v2 Union Coercion Fix

Problem: The `index` field added for streaming compliance was silently dropped during chunk construction. Pydantic v2's smart Union coercion converts dicts passed through `Union[ChatCompletionMessage, dict]` fields into `ChatCompletionMessage` instances, which in turn coerces nested tool-call dicts into `ToolCall` instances. With `extra='ignore'` (the default), any keys not declared on the model are silently discarded.

Solution:
- Added `index: Optional[int] = None` directly to the `ToolCall` model in `types/tools.py`. This ensures the field survives Pydantic's coercion rather than being treated as an extra field.
- Generated tool-call IDs now use the `call_` prefix convention (`call_<24-char hex>`), matching the format expected by some strict clients.
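The silent-drop behavior is easy to reproduce with two toy models (illustrative names, requires Pydantic v2; these are not the PR's actual classes):

```python
from typing import Optional
from pydantic import BaseModel

class ToolCallBefore(BaseModel):
    # `index` is not declared, so extra='ignore' (the v2 default)
    # silently discards it during coercion.
    id: str

class ToolCallAfter(BaseModel):
    # The fix: declare the field explicitly so it survives validation.
    id: str
    index: Optional[int] = None

raw = {"id": "call_abc123", "index": 0}
before = ToolCallBefore(**raw).model_dump()
after = ToolCallAfter(**raw).model_dump()
```

`before` has no `index` key at all, while `after` keeps `index == 0`, which is exactly the difference the streaming validators were tripping over.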
4. Inference Abort Fixes

Problem: When a user presses "Stop" during streaming inference, TabbyAPI did not reliably stop generating tokens. The model continued running on the GPU after the client disconnected.
Three bugs were identified and fixed:
Bug 1 — `gen_queue.get()` blocks disconnect detection:

The consumer loop's `await gen_queue.get()` blocks indefinitely when the queue is empty (during prefill, between tokens). While blocked, the `disconnect_task.done()` check never re-executes.

Fix: Replaced the blocking `get()` with an `asyncio.wait()` that races a queue-get task against the disconnect task. Applied identically in both `stream_generate_completion` and `stream_generate_chat_completion`.
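The race can be sketched with a toy consumer. The names (`consume`, the `None` sentinel) are illustrative, not the PR's code:

```python
import asyncio

async def consume(gen_queue: asyncio.Queue, disconnect_event: asyncio.Event) -> str:
    """Race the queue get against client disconnect instead of blocking."""
    disconnect_task = asyncio.create_task(disconnect_event.wait())
    try:
        while True:
            get_task = asyncio.create_task(gen_queue.get())
            done, _pending = await asyncio.wait(
                {get_task, disconnect_task},
                return_when=asyncio.FIRST_COMPLETED,
            )
            if disconnect_task in done:
                get_task.cancel()  # stop waiting for the next token
                return "aborted"
            item = get_task.result()
            if item is None:  # sentinel: generation finished normally
                return "finished"
            # ... a real consumer would yield `item` to the SSE stream here ...
    finally:
        disconnect_task.cancel()

async def main() -> str:
    gen_queue: asyncio.Queue = asyncio.Queue()
    disconnect_event = asyncio.Event()
    consumer = asyncio.create_task(consume(gen_queue, disconnect_event))
    await asyncio.sleep(0.01)   # queue stays empty, as during prefill
    disconnect_event.set()      # simulate the client pressing "Stop"
    return await consumer

result = asyncio.run(main())
```

With a plain `await gen_queue.get()`, the `disconnect_event.set()` above would never be observed until the next token arrived; the `asyncio.wait` race notices it immediately.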
Bug 2 — ExLlamaV3 job registered too late in `active_job_ids`:

The `AsyncJob` was only assigned to `self.active_job_ids[request_id]` after the generation loop finished. During generation, the entry held `None`, which `wait_for_jobs()` skips.

Fix: Moved the assignment to immediately after `AsyncJob()` construction, before the generation loop.

Bug 3 — `GeneratorExit` bypasses the abort event:

When `sse_starlette` crashes on a dropped TCP connection, it injects `GeneratorExit` into the async generator. The existing `except CancelledError` and `except Exception` handlers don't catch `GeneratorExit` (a `BaseException`), so `abort_event.set()` is never called and inference continues.

Fix: Moved `abort_event.set()` and `disconnect_task.cancel()` into a `finally` block, which executes on all exit paths including `GeneratorExit`.
Files Changed

- `backends/exllamav3/model.py`: `active_job_ids` assignment before generation loop
- `common/templating.py`: `tool_call_format` field on `TemplateMetadata`, validation
- `endpoints/OAI/types/tools.py`: `index: Optional[int] = None` on `ToolCall`, `call_` prefix ID format
- `endpoints/OAI/utils/chat_completion.py`: `asyncio.wait`-based disconnect detection, `finally` block for abort
- `endpoints/OAI/utils/completion.py`: `asyncio.wait`-based disconnect detection, `finally` block for abort
- `endpoints/OAI/utils/tools.py`: XML parsing (`from_xml`, `from_auto`, `parse`, `extract_content_and_tools`), think-block stripping, type coercion
- `templates/tool_calls/qwen3_coder.jinja`: new tool-call template for Qwen3-Coder

Testing
Validated against:
Test scenarios: single and multiple tool calls, multi-turn tool conversations, mixed text + tool calls, streaming and non-streaming modes, client disconnect during inference.
Environment
Edit: Additional improvements after initial submission:
5. Broader Model Compatibility
- Added a `tojson` Jinja filter override so the model's built-in HuggingFace template works out of the box (the default sandboxed filter crashes on `tojson(ensure_ascii=False)`, which Qwen3-Coder's template uses).
- The JSON parser now accepts `{"name": ..., "arguments": ...}` dicts without the `function` wrapper, single objects instead of arrays, markdown-fenced JSON, and string-typed arguments. Previously only perfectly-formed OAI-shaped arrays were accepted.
- Normalized `token_ids` to a plain Python list, with robust handling for tensors and tuples from different ExLlamaV3 kernels.
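An override along these lines is one way to make `tojson(ensure_ascii=False)` work. This sketch uses a plain Jinja2 `Environment` rather than the sandboxed one TabbyAPI presumably uses, and the filter body is an assumption, not the PR's code:

```python
import json
from jinja2 import Environment

def tojson_filter(value, **kwargs):
    """Replacement tojson filter that forwards keyword arguments
    (e.g. ensure_ascii=False) straight to json.dumps."""
    kwargs.setdefault("ensure_ascii", False)  # keep non-ASCII characters intact
    return json.dumps(value, **kwargs)

env = Environment()
env.filters["tojson"] = tojson_filter  # shadow the built-in filter

template = env.from_string("{{ data | tojson(ensure_ascii=False) }}")
rendered = template.render(data={"msg": "héllo"})
```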
6. tool_choice Support

Added support for the OpenAI `tool_choice` parameter: `"none"` skips tool generation entirely, `"required"` forces a tool-call pass even when the model doesn't emit the stop string, and a named function choice (`{"type": "function", "function": {"name": "..."}}`) filters results to the specified function.
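The filtering half of that logic can be sketched as follows (hypothetical helper; the forced-generation control flow for `"required"` is not shown):

```python
def resolve_tool_choice(tool_choice, parsed_calls: list[dict]) -> list[dict]:
    """Sketch of OpenAI tool_choice semantics applied to parsed calls."""
    if tool_choice == "none":
        return []  # skip tool calls entirely
    if isinstance(tool_choice, dict) and tool_choice.get("type") == "function":
        # Named function choice: keep only calls to the requested function
        wanted = tool_choice["function"]["name"]
        return [c for c in parsed_calls if c["function"]["name"] == wanted]
    # "auto" / "required": keep everything the parser produced
    return parsed_calls
```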
7. Bug Fixes and Cleanup

- Fixed a bug in `generate_tool_calls()` where the shared `prompt` variable was modified in place during the loop. With `n > 1`, each iteration would stack previous generations' text onto the prompt. Now uses a per-iteration local variable.
- Moved `tool_end` extraction to `TemplateMetadata` so template-provided metadata isn't silently discarded.

Additional files changed:
- `endpoints/OAI/types/chat_completion.py`: `tool_choice`, `parallel_tool_calls` fields
- `endpoints/OAI/types/tools.py`: `NamedToolChoice`, `NamedToolFunction` models