fix: retry transient LLM timeouts in A2A and surface error type (fixes #143)#150

Merged
yaojin3616 merged 1 commit into dataelement:main from Longado:fix/a2a-timeout-retry-and-error-message on Mar 20, 2026

@Longado (Contributor) commented Mar 20, 2026

Problem

_send_message_to_agent was fragile to transient LLM timeouts during long A2A tasks:

  1. A single httpx.ReadTimeout anywhere in a multi-step flow (web search + tool calls) aborted the entire task immediately, discarding all progress.
  2. httpx.ReadTimeout.__str__() returns "", so the UI displayed only "❌ Message send error:" — no cause, no actionable information.

Reported in #143 with full logs showing the exact failure path.

Fixes

1. Retry on ReadTimeout (exponential back-off)

Wrap llm_client.complete() in a retry loop — up to 3 attempts with 1 s → 2 s back-off before giving up:

for _attempt in range(3):
    try:
        response = await llm_client.complete(...)
        break
    except httpx.ReadTimeout:
        if _attempt == 2:
            raise
        await asyncio.sleep(2 ** _attempt)  # 1 s, then 2 s

A single transient timeout now retries silently. Only after all 3 attempts fail does the exception propagate to the outer handler.
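The retry behaviour described above can be tried without a network or a real LLM. The following is a minimal sketch, not the code from this PR: `FakeReadTimeout` is a stand-in for `httpx.ReadTimeout`, and `FlakyClient`, `complete_with_retry`, and `run_demo` are hypothetical names introduced only for illustration. The delay is scaled down so the demo finishes quickly.

```python
import asyncio


class FakeReadTimeout(Exception):
    """Hypothetical stand-in for httpx.ReadTimeout so the sketch needs no network."""


class FlakyClient:
    """Hypothetical LLM client that times out a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def complete(self, prompt: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise FakeReadTimeout()
        return f"reply to {prompt!r}"


async def complete_with_retry(client, prompt, attempts=3, base_delay=1.0):
    """Retry client.complete() on timeout, sleeping base_delay * 2**attempt between tries."""
    for attempt in range(attempts):
        try:
            return await client.complete(prompt)
        except FakeReadTimeout:
            if attempt == attempts - 1:
                raise  # out of attempts: let the outer handler report it
            await asyncio.sleep(base_delay * 2 ** attempt)


def run_demo(failures: int) -> tuple[str, int]:
    """Return the outcome and the number of calls made for a given failure count."""
    client = FlakyClient(failures)
    try:
        result = asyncio.run(complete_with_retry(client, "hi", base_delay=0.01))
    except FakeReadTimeout as exc:
        result = f"failed: {type(exc).__name__}"
    return result, client.calls
```

With one injected failure, `run_demo(1)` succeeds on the second call. With three, all attempts are used and the exception reaches the caller, which matches the "3 consecutive timeouts" case.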

2. Meaningful error message

# Before
return f"❌ Message send error: {str(e)[:200]}"

# After
err_detail = str(e) or type(e).__name__
return f"❌ Message send error: {err_detail[:200]}"

ReadTimeout (and any other exception with an empty str()) now surfaces its type name instead of a blank string.
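The fallback can be checked in isolation. This is a small sketch under stated assumptions: `format_send_error` is a hypothetical helper that mirrors the "After" snippet, and the built-in `TimeoutError()` stands in for an exception whose `str()` is empty, so no `httpx` install is needed.

```python
def format_send_error(exc: Exception, limit: int = 200) -> str:
    """Hypothetical helper: build the UI message, falling back to the type name."""
    # str(exc) is "" for exceptions raised without a message, so `or` picks the type name.
    detail = str(exc) or type(exc).__name__
    return f"❌ Message send error: {detail[:limit]}"


# An exception with no message: the type name is shown instead of a blank.
empty_case = format_send_error(TimeoutError())

# An exception with a message: the message is shown unchanged.
message_case = format_send_error(ValueError("bad input"))

# A very long message is still truncated to `limit` characters.
long_case = format_send_error(ValueError("x" * 500))
```

Because the fallback relies on `or`, it only applies when `str(e)` is empty. Exceptions that already carry a message are reported exactly as before.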

Result

| Scenario | Before | After |
|---|---|---|
| Single transient timeout | Task aborted | Silent retry, task continues |
| 3 consecutive timeouts | "Message send error:" | "Message send error: ReadTimeout" |
| Other exceptions | May show blank cause | Falls back to type(e).__name__ when str(e) is empty |

Files changed

  • backend/app/services/agent_tools.py: _send_message_to_agent only

🤖 Generated with Claude Code

…element#143)

Two independent fixes for `_send_message_to_agent`:

1. **Retry on ReadTimeout** — wrap the `llm_client.complete()` call in a
   retry loop (up to 3 attempts, exponential back-off: 1 s then 2 s).
   A single transient `httpx.ReadTimeout` no longer aborts a long-running
   A2A task mid-execution; it is retried silently instead.  Only after
   all 3 attempts fail does the exception propagate to the outer handler.

2. **Meaningful error message** — `httpx.ReadTimeout.__str__()` returns an
   empty string, so the UI previously showed only
   `"❌ Message send error:"` with no cause.  The outer `except` block now
   falls back to `type(e).__name__` when `str(e)` is empty, producing e.g.
   `"❌ Message send error: ReadTimeout"`.

Fixes dataelement#143.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@Congregalis (Contributor)

Thanks for picking this up. I took a look and it seems this PR overlaps quite a bit with #146, since both are addressing the same issue in #143 around send_message_to_agent timeout retries and error reporting.

I already opened #146 earlier for the same underlying problem, and it currently covers a slightly broader set of retry/error-handling cases as well. To avoid duplicate review effort, it may make sense to consolidate the discussion into one PR.

If maintainers prefer the smaller scoped approach here, I’m happy to align and adjust #146 accordingly.

@yaojin3616 yaojin3616 merged commit 9898651 into dataelement:main Mar 20, 2026