Arvin Xu fdb529d598 🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode (#15620)
* 🐛 fix(agent): deliver sub-agent resume bridge via QStash webhook in queue mode

The callSubAgent completion bridge was a handler-only hook, which lives in
process memory: in queue mode (AGENT_RUNTIME_MODE=queue) HookDispatcher only
delivers webhook-configured hooks, so the bridge never fired — the parent op
stayed parked in waiting_for_async_tool forever after all sub-agents finished.

- Give the bridge hook a webhook config (delivery: qstash) targeting the new
  /api/agent/webhooks/subagent-callback endpoint; local mode keeps the
  in-process handler. Both paths converge on
  AgentRuntimeService.completeSubAgentBridge (backfill + barrier/CAS resume).
- Park-time self-check: after the parked state and operation row are
  persisted, re-run the resume barrier once to recover children that
  completed before the parent finished parking.
- One-shot verify watchdog: when a completion finds the parent not yet
  resumable, schedule a delayed verifyAsyncToolBarrier re-check (no step
  lock, CAS-idempotent, never re-arms).
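
A minimal in-memory sketch of the barrier/CAS resume and the park-time self-check described above. All names here (`ParentOp`, `casStatus`, `tryResume`, `completeChild`, `park`) are hypothetical and only illustrate the idea; the actual runtime persists this state and its types are not shown in this message.

```typescript
// Hypothetical in-memory model, for illustration only.
// The real runtime keeps the parked state and operation row in persistent storage.
type OpStatus = "running" | "waiting_for_async_tool";

interface ParentOp {
  status: OpStatus;
  pendingChildren: Set<string>; // sub-agent ids the parent is waiting on
  completedChildren: Set<string>; // sub-agent ids that have reported completion
}

// Compare-and-swap on the status field: the transition succeeds only if the
// op is still in the expected state, so at most one caller can win it.
function casStatus(op: ParentOp, expected: OpStatus, next: OpStatus): boolean {
  if (op.status !== expected) return false;
  op.status = next;
  return true;
}

// Resume barrier: resume only when every awaited child has completed
// and this caller wins the CAS out of the parked state.
function tryResume(op: ParentOp): boolean {
  for (const id of op.pendingChildren) {
    if (!op.completedChildren.has(id)) return false;
  }
  return casStatus(op, "waiting_for_async_tool", "running");
}

// Completion path: record the child, then attempt the barrier.
function completeChild(op: ParentOp, childId: string): boolean {
  op.completedChildren.add(childId);
  return tryResume(op);
}

// Park path: persist the parked state first, then re-run the barrier once
// to recover children that completed before parking finished.
function park(op: ParentOp, children: string[]): boolean {
  op.pendingChildren = new Set(children);
  op.status = "waiting_for_async_tool";
  return tryResume(op);
}
```

Because the resume goes through a CAS, a delayed re-check can call `tryResume` again without a lock: once the parent has left the parked state, every later attempt is a no-op.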

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* 📝 docs(agent): correct verify-watchdog rationale comment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* 📝 docs(agent): clarify eventFields trimming rationale

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ♻️ refactor(agent): align subagent-callback with workspace-scoped step worker

Post-rebase adaptation to canary's runtime restructure (#15609):

- Route the webhook bridge through AiAgentService (like the /run step
  worker) so the runtime's models stay workspace-scoped — a bare
  AgentRuntimeService would be personal-scoped and the tool-message
  backfill / resume barrier could miss workspace-scoped rows.
- Extract SubAgentBridgeParams into agentRuntime/types and add the
  completeSubAgentBridge passthrough next to executeStep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* 🐛 fix(agent): fail sub-agent callback loudly on backfill or delivery failure

Address two review findings on the resume bridge:

- completeSubAgentBridge now checks updateToolMessage's { success } result
  (it swallows transaction errors instead of throwing) and propagates all
  infrastructure failures. The webhook endpoint then returns non-2xx so
  QStash redelivers the whole bridge — previously a failed backfill was
  acked with 200 and the parent stayed parked forever, since the verify
  recheck only re-reads the barrier and cannot retry the backfill.
- New AgentHookWebhook.fallback: 'none' opts a qstash-delivered hook out of
  the unsigned plain-fetch fallback, which can never authenticate against a
  QStash-signed endpoint and only masked publish failures as silently
  dropped 401s. The bridge hook uses it; dispatch escalates such delivery
  failures to console.error instead of the debug namespace.
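
A rough sketch of the first finding, assuming a backfill function that reports failure through a `{ success }` result instead of throwing. The names `UpdateToolMessage`, `BridgeBackfillError`, `completeBridge`, and `handleCallback` are hypothetical, and the calls are synchronous only to keep the example short; the real signatures are not shown in this message.

```typescript
// Hypothetical shapes, for illustration only.
interface UpdateResult {
  success: boolean;
}

// Stand-in for a backfill call that swallows errors and returns { success }.
type UpdateToolMessage = (messageId: string, content: string) => UpdateResult;

interface CallbackBody {
  messageId: string;
  content: string;
}

class BridgeBackfillError extends Error {}

// Turn a swallowed failure into a thrown error so the caller cannot
// mistake a failed backfill for a completed one.
function completeBridge(update: UpdateToolMessage, body: CallbackBody): void {
  const result = update(body.messageId, body.content);
  if (!result.success) {
    throw new BridgeBackfillError(`tool message backfill failed: ${body.messageId}`);
  }
  // The resume barrier would run here, after a successful backfill.
}

// Webhook handler sketch: any infrastructure failure maps to a non-2xx
// status so the queue redelivers the whole bridge instead of acking it.
function handleCallback(update: UpdateToolMessage, body: CallbackBody): number {
  try {
    completeBridge(update, body);
    return 200;
  } catch {
    return 500;
  }
}
```

Returning 200 only after the backfill succeeds matters because the delayed verify re-check only re-reads the barrier; it has no way to retry a backfill that was already acknowledged.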

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 16:00:17 +08:00