fix: retry transient LLM timeouts in A2A and surface error type (fixes #143) #150
Merged
yaojin3616 merged 1 commit into dataelement:main on Mar 20, 2026
…element#143)

Two independent fixes for `_send_message_to_agent`:

1. **Retry on ReadTimeout**: wrap the `llm_client.complete()` call in a retry loop (up to 3 attempts, exponential back-off: 1 s then 2 s). A single transient `httpx.ReadTimeout` no longer aborts a long-running A2A task mid-execution; it is retried silently instead. Only after all 3 attempts fail does the exception propagate to the outer handler.
2. **Meaningful error message**: `httpx.ReadTimeout.__str__()` returns an empty string, so the UI previously showed only `"❌ Message send error:"` with no cause. The outer `except` block now falls back to `type(e).__name__` when `str(e)` is empty, producing e.g. `"❌ Message send error: ReadTimeout"`.

Fixes dataelement#143.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Thanks for picking this up. I took a look, and this PR overlaps quite a bit with #146, since both address the same issue from #143. I opened #146 earlier for the same underlying problem, and it currently covers a slightly broader set of retry/error-handling cases as well. To avoid duplicate review effort, it may make sense to consolidate the discussion into one PR. If maintainers prefer the smaller-scoped approach here, I'm happy to align and adjust #146 accordingly.
Problem
`_send_message_to_agent` was fragile to transient LLM timeouts during long A2A tasks:

- An `httpx.ReadTimeout` anywhere in a multi-step flow (web search + tool calls) aborted the entire task immediately, discarding all progress.
- `httpx.ReadTimeout.__str__()` returns `""`, so the UI displayed only `"❌ Message send error:"`, with no cause and no actionable information.

Reported in #143 with full logs showing the exact failure path.
Fixes
1. Retry on `ReadTimeout` (exponential back-off)

Wrap `llm_client.complete()` in a retry loop: up to 3 attempts with 1 s → 2 s back-off before giving up.

A single transient timeout now retries silently. Only after all 3 attempts fail does the exception propagate to the outer handler.
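The retry behaviour described above can be sketched roughly as follows. This is not the code from the PR: the helper name `complete_with_retry`, its parameters, and the assumption that `llm_client.complete()` is awaitable are all hypothetical. The exception type is passed in as `retry_on` so the sketch runs without `httpx` installed; in the real function it would presumably be `httpx.ReadTimeout`.

```python
import asyncio


async def complete_with_retry(call, *, retry_on, max_attempts=3,
                              base_delay=1.0, sleep=asyncio.sleep):
    """Await call(); on `retry_on`, back off (1 s, then 2 s) and try again.

    Hypothetical helper: `call` is a zero-argument coroutine function, e.g.
    a lambda wrapping llm_client.complete(...).
    """
    for attempt in range(max_attempts):
        try:
            return await call()
        except retry_on:
            if attempt == max_attempts - 1:
                # All attempts used: let the outer handler see the exception.
                raise
            # Exponential back-off: base_delay * 2**attempt -> 1 s, 2 s.
            await sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` keeps the back-off testable without real delays. Catching only the given exception type means other failures still propagate on the first attempt.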
2. Meaningful error message

`ReadTimeout` (and any other exception with an empty `str()`) now surfaces its type name instead of a blank string.

Result

- Before: `"Message send error:"`
- After: `"Message send error: ReadTimeout"` (`type(e).__name__` as fallback)

Files changed

`backend/app/services/agent_tools.py`: `_send_message_to_agent` only

🤖 Generated with Claude Code
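For reference, the error-message fallback from fix 2 could look something like the sketch below. The helper name `describe_error` and the local `ReadTimeout` class are hypothetical stand-ins, used so the example runs without `httpx`. The stand-in is raised with no message to mimic the empty `str()` the PR reports for `httpx.ReadTimeout`.

```python
def describe_error(exc: BaseException) -> str:
    """Return str(exc), or the exception's class name when str(exc) is empty."""
    return str(exc) or type(exc).__name__


class ReadTimeout(Exception):
    """Stand-in for httpx.ReadTimeout (assumption: raised with an empty message)."""


try:
    raise ReadTimeout()
except Exception as e:
    # Without the fallback this would end in a bare "error:" with no cause.
    message = f"❌ Message send error: {describe_error(e)}"
```

Exceptions that already carry a message are unaffected, since `str(exc)` is used whenever it is non-empty.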