Arvin Xu dbf743cc12 feat(verify): Agent Run delivery checker system (#15489)
* 🗃️ feat(database): add verify system tables for agent run delivery checker

Implement the database layer for the Agent Run delivery checker (Verify System).

Reuse / definition layer:
- verify_criteria: a single reusable pass/fail standard (atomic unit), carrying
  its verifier config + onFail default and bound to a document for judging
  guidance (iteration history reuses document_history; no version columns)
- verify_rubrics: a named group that aggregates criteria — the reusable unit
- verify_rubric_criteria: junction, which criteria a rubric aggregates
  (criteria are reusable across rubrics)

Mounted onto an agent via the existing agency config jsonb:
- agencyConfig.verifyRubricId: a reusable rubric (criteria template)
- agencyConfig.verifyCriteriaIds: ad-hoc one-off criteria
A run's plan instantiates the union of both. No dedicated bindings table.

Snapshot + result layer:
- agent_operations.verify_plan (jsonb) + verify_plan_confirmed_at: the per-run
  immutable check-item snapshot lives ON the operation (1:1 — auto-repair spawns
  a new operation), instead of a separate plans table
- agent_operations.verify_status: denormalized rollup for list-page badges
- verify_check_results: per-criterion result with the Toulmin model
  (verdict/confidence as columns, narrative in a typed toulmin jsonb), N:1
  verifier_tracing_id for batch judging, FP/FN flags for the data flywheel;
  relates to the plan via operation_id + stable check_item_id

Ref: LOBE-10019

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

*  feat(verify): add Agent Run delivery checker backend + frontend module

Implements the verify system on top of the schema (PR #15480):
- models: verifyCriterion / verifyRubric (+junction) / verifyCheckResult;
  agentOperation verify plan/status methods
- services/verify: AI plan generation (auto-create criteria), executor with
  LLM Toulmin judge (per-criterion + batch), program placeholder, agent &
  auto-repair spawner seams, rollup chokepoint, feedback fp/fn, completion
  lifecycle bridge
- lambda verify router (criteria/rubric CRUD, plan, results, feedback)
- frontend feature module: service, SWR hooks, CheckerDock state machine,
  RunArtifact, verify i18n namespace
- tracing scenarios: VerifyPlanGen / VerifyJudge

Live UI mount (dock/artifact into chat) pending server operationId source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* 🐛 fix(verify): persist delivery-checker verdicts via async tracing backfill

The LLM judge produced valid verdicts but they were never persisted, leaving
every run stuck at `verifying`. Two root causes:

1. FK ordering: `writeVerdict` stamped `verifier_tracing_id` synchronously, but
   the `llm_generation_tracing` row is written asynchronously (best-effort,
   after the response) — so the hard FK was violated every time and the verdict
   write was rolled back. Now the verdict is written with a null link, and the
   tracing id is backfilled by an `onPersisted` callback that fires only after
   the tracing row commits (still non-blocking). If tracing is disabled the link
   simply stays null.

2. Verdict parse: the judge JSON schema is non-strict, so the provider returns
   optional Toulmin fields as explicit `null`. The Zod validator used
   `.optional()` (accepts undefined, not null), so any null failed the whole
   `safeParse` and discarded the batch. Switched to `.nullish()`.
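
The null-versus-undefined distinction behind the second root cause can be shown without Zod. The sketch below is a dependency-free stand-in: `optionalString`, `nullishString`, `batchIsValid` and the `RawVerdict` shape are hypothetical names that mimic the acceptance rules described above, not Zod's implementation or the real judge schema.

```typescript
// Hypothetical stand-ins for the two validator behaviours described above.
// optionalString: accepts a string or undefined, rejects null.
export function optionalString(value: unknown): boolean {
  return value === undefined || typeof value === 'string';
}

// nullishString: accepts a string, undefined, or null.
export function nullishString(value: unknown): boolean {
  return value === null || optionalString(value);
}

// Assumed shape of one judge verdict; the field names are illustrative only.
export interface RawVerdict {
  verdict: string;
  limitation?: string | null;
}

// A batch is kept only if every verdict passes the field check,
// which is why a single explicit null could discard the whole batch.
export function batchIsValid(
  batch: RawVerdict[],
  check: (value: unknown) => boolean,
): boolean {
  return batch.every((item) => check(item.limitation));
}

const sampleBatch: RawVerdict[] = [
  { verdict: 'passed', limitation: null }, // provider sends an explicit null
  { verdict: 'failed' }, // field omitted entirely
];
console.log(batchIsValid(sampleBatch, optionalString)); // false
console.log(batchIsValid(sampleBatch, nullishString)); // true
```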

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(cli): add `verify` command for the delivery checker

Adds `lh verify` covering the full delivery-checker chain — criteria & rubric
CRUD, per-run plan (generate/state/confirm/skip), execute (LLM judge), results,
and feedback — calling the `verify` lambda router. Enables end-to-end backend
testing of the verify system.

Also adds the missing `tool-runtime` / `prompts` / `const` workspace entries to
the CLI's `pnpm-workspace.yaml` so the standalone package installs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 💄 feat(verify): add verify message role + delivery-checker card UI

Make the delivery-checker renderable in chat:

- Fix the `features/Verify` components so they compile: flatten the `verify`
  locale to the repo's flat-dotted-key convention (keySeparator: false), import
  `Flexbox`/`TextArea` from `@lobehub/ui` (react-layout-kit is no longer a dep),
  and fix the token cast.
- Add a `verify` UI message role + a `VerifyMessage` card that renders the
  Run Artifact + checker dock from `metadata.verifyOperationId`, wired into the
  message renderer switch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): add lobe-agent `generateVerifyPlan` tool (server runtime)

Lets an agent set up the delivery checker for its run: the agent calls
`generateVerifyPlan` early (per the new `<delivery_checker>` system-role
guidance), which instantiates the rubric / ad-hoc criteria into a frozen plan on
the current `agent_operations` row. Executed server-side only — the executor is
dispatched via `runtime[apiName]` with `operationId` threaded through the tool
execution context; the client `BaseExecutor` gracefully no-ops it.

Also registers the metadata fields (`verifyOperationId`/`verifyRound`) on the
message metadata zod schema so the role='verify' card can carry its operation id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): surface role=verify card on run completion (LOBE-10051)

Connect the delivery checker to the conversation: when an Agent Run with a
verify plan completes, `CompletionLifecycle` inserts a persisted `role='verify'`
message (parented to the assistant, carrying `metadata.verifyOperationId`) that
renders the checker card. Self-guarded — no plan → no card, failures never
affect the run.

`role='verify'` behaves like a `user` leaf message everywhere it flows
(persistence + conversation-flow pass it through unchanged); only the
context-engine treats it specially: a new `VerifyMessageProcessor` drops it from
the model context (UI-only card, not a valid model role). Adds `verify` to
`CreateMessageRoleType`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 💄 feat(verify): merge run-artifact + checker into one card

The role=verify message rendered two stacked cards (Run Artifact summary +
Delivery Checker) that duplicated the check-item list. Merge into a single card:
the `Run Artifact · Round N` header, then the checker results + actions, then the
snapshot note. RunArtifact/CheckerDock gain an `embedded` prop (header-only /
body-only, no card chrome) and VerifyMessage composes them under one border.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): derive generateVerifyPlan rubric from agencyConfig

A real agent calls `generateVerifyPlan` with just a `goal` and doesn't know
rubric ids. When `rubricId`/`criteriaIds` params are absent, derive the mounted
rubric + ad-hoc criteria from the executing agent's
`agencyConfig.verifyRubricId / verifyCriteriaIds`. Params still win when given.
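
The fallback described above might look roughly like the sketch below. `resolvePlanSources` and both interfaces are hypothetical names, and resolving each field independently (rather than only when both params are absent) is an assumption, not something the commit states.

```typescript
// Hypothetical shape of the tool params; only the two ids matter here.
export interface PlanParams {
  rubricId?: string;
  criteriaIds?: string[];
}

// Subset of the agent's agencyConfig named in the message above.
export interface AgencyVerifyConfig {
  verifyRubricId?: string;
  verifyCriteriaIds?: string[];
}

// Explicit params win; absent params fall back to what is mounted on the agent.
// Assumption: each field is resolved on its own.
export function resolvePlanSources(
  params: PlanParams,
  agencyConfig: AgencyVerifyConfig = {},
): PlanParams {
  return {
    rubricId: params.rubricId ?? agencyConfig.verifyRubricId,
    criteriaIds: params.criteriaIds ?? agencyConfig.verifyCriteriaIds,
  };
}

const mounted: AgencyVerifyConfig = { verifyRubricId: 'rubric-a', verifyCriteriaIds: ['c1'] };
console.log(resolvePlanSources({}, mounted).rubricId); // rubric-a
console.log(resolvePlanSources({ rubricId: 'rubric-b' }, mounted).rubricId); // rubric-b
```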

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 🐛 fix(cli): surface agent gateway WebSocket close code + reason

The `onclose` handler logged `String(event)` → the useless "[object
CloseEvent]". Surface `event.code` (+ `event.reason` when present) so a gateway
disconnect before completion is actually diagnosable.
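
A minimal sketch of the idea, assuming only that the close event exposes `code` and an optional `reason` as the message says. `describeClose` and `CloseLike` are hypothetical names, and the output wording is illustrative rather than the CLI's actual log line.

```typescript
// Minimal structural type so the sketch also runs outside a browser.
export interface CloseLike {
  code: number;
  reason?: string;
}

// Build a diagnosable message from the close code and, when present, the reason.
export function describeClose(event: CloseLike): string {
  const reason = event.reason ? `, reason: ${event.reason}` : '';
  return `gateway closed (code ${event.code}${reason})`;
}

// Stringifying a plain object loses every detail, as in the bug above.
console.log(String({})); // [object Object]
console.log(describeClose({ code: 1006 })); // gateway closed (code 1006)
console.log(describeClose({ code: 1000, reason: 'done' })); // gateway closed (code 1000, reason: done)
```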

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 💄 fix(verify): rename "Run Artifact" → "Verification", drop failed red border

- The kicker said "Run Artifact" — it's automated verification, not an artifact.
  Renamed to "Verification · Round N".
- Removed the red error border on a failed check — a normal card reads better.
- Fixes a render crash (`useVerifyState is not defined`): the border removal left
  a dangling reference after the import was dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(cli): poll run status when the agent stream drops

When the live stream (gateway WebSocket / SSE) closes before the run finishes,
the run is still executing server-side — so instead of hard-exiting, fall back to
polling `aiAgent.getOperationStatus` every 10s until the run reaches a terminal
state (or is no longer tracked). Pairs with surfacing the WS close code/reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 💄 feat(verify): add Render for generateVerifyPlan tool call

The generateVerifyPlan tool call rendered as the default param/result dump. Add a
Render that lists the generated delivery checks (title + gate/auto-fill tag), and
surface the items on the tool state so the Render can read them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): auto-confirm generated plan so checks run on completion

The agent generated a plan but it stayed `planned`/unconfirmed, so the completion
hook (which gates on a confirmed plan) never ran the checks — the card was stuck
at "awaiting confirmation" with no pass/fail. In the headless agent flow there's
no one to click Confirm, so `generateVerifyPlan` now auto-confirms the plan it
generates; the checks then run automatically on completion. (An interactive
"review before run" gate is a future enhancement.)

Also: the verify card header disappeared in the draft/planned phase
(`phaseToArtifact.draft` was null). Give it a header so the card always shows its
"Verification · Round N" heading.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 🐛 fix(agent-tracing): only count opaque/presentational attrs as structural noise

The first structuralNoiseRatio charged ALL markup (every <...> tag) as noise,
which over-penalized legitimately structured results 3x. Grounding against real
web-search output (`<item title="…" url="…">snippet</item>`) showed the tags and
the title=/url= attributes ARE the signal the model reads.

Now only opaque/presentational attribute names (id, class, style, data-*, aria-*,
role, on*) count as noise; semantic element tags and content-bearing attributes
(title, url, href, name…) are kept. On a 57-op user-interrupted sample this drops
web-search noise 42%→0% and overall estimated waste 16%→5%, leaving large-payload
(readDocument) and high error-rate tools as the real signal.
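
The attribute rule above can be sketched as follows. This is not the real `structuralNoiseRatio`: the regex-based scan and the definition of the ratio (characters spent on opaque attributes divided by total characters) are assumptions made for illustration.

```typescript
// Attribute names treated as opaque/presentational, per the list above.
const OPAQUE_ATTR = /^(id|class|style|role|data-.*|aria-.*|on.*)$/i;

// Assumed definition: noise = characters of opaque attribute="value" pairs,
// ratio = noise / total length. Tags and content-bearing attributes count as signal.
export function structuralNoiseRatio(text: string): number {
  if (text.length === 0) return 0;
  const attrPattern = /\s([\w:-]+)="[^"]*"/g;
  let noise = 0;
  let match: RegExpExecArray | null;
  while ((match = attrPattern.exec(text)) !== null) {
    if (OPAQUE_ATTR.test(match[1])) noise += match[0].length;
  }
  return noise / text.length;
}

// Semantic tags with title/url attributes carry no noise under this rule.
console.log(structuralNoiseRatio('<item title="a" url="b">snippet</item>')); // 0
// A class attribute is presentational, so it is charged as noise.
console.log(structuralNoiseRatio('<div class="c">x</div>') > 0); // true
```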

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): model-authored criteria with name/description/instruction-in-document + agent verifier

Restructure the generateVerifyPlan tool to a createDocument-style full-create flow
and wire up the agent verifier path:

- criteria now = title + description (required one-liner) + instruction (required
  detailed rubric); instruction lives in a linked document (verify_criteria.documentId),
  description is a new verify_criteria column (migration 0111). verifierConfig no
  longer holds description/instruction.
- generateVerifyPlan creates verify_criteria + a rubric, snapshots the plan onto
  the operation and confirms it; judge resolves the instruction from the document.
- agent-type checks run as verifier sub-agents (execAgent + isolated thread) whose
  onComplete hook parses a VERDICT and writes it back to verify_check_results
  (renamed AgentVerifierSpawner → VerifierAgentRunner).
- UI: custom Inspector for the tool header; check list shows per-verifier-type icons
  (llm/agent/program) + description + required/optional tag; i18n en/zh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* perf(verify): run program/llm/agent checks concurrently on completion

The three verifier kinds are independent; previously the agent spawn waited for
the batched LLM judge to finish. Run them via Promise.all so agent sub-agents
start immediately alongside the LLM batch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): dedicated builtin verify-agent + writeback tool, role=verify message, portal check editor

- Add `@lobechat/builtin-tool-verify` (submitVerifyResult) + builtin `verify-agent`;
  agent-type checks now run as the dedicated verify agent (not the user's agent),
  which investigates and writes its verdict back via the tool during its run.
- Verifier inherits the parent run's model/provider (builtin default may be
  unconfigured locally).
- role=verify completion message no longer requires an assistantMessageId, so the
  delivery-checker card always surfaces when a plan exists.
- Portal editor for verify checks (title/description/instruction/verifier/onFail).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 🐛 fix(verify): restrict verify-agent to its writeback tool; fix running loader icon

Root cause of stuck `running` agent checks: the verify-agent ran in agent mode and
inherited all default tools (web-browsing, cloud-sandbox, skills, activator), so it
went off web-searching/crawling to "investigate" and never called submitVerifyResult.

- Run the verify-agent in chat mode (enableAgentMode: false, searchMode: off) — the
  strict whitelist — and whitelist `lobe-verify` for chat mode so the verifier gets
  ONLY its writeback tool.
- Sharpen the verify systemRole: judge from the provided deliverable/instruction
  (no external tools), always reach a verdict, and always call submitVerifyResult.
- CheckerDock: running check now uses the standard RingLoadingIcon (warning ring),
  matching the app's loader instead of a blue spinner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): auto-repair loop — re-run the agent with failure feedback on failed checks

When required checks fail with onFail=auto_repair, automatically run a second
iteration instead of ending at `failed`:

- createRepairRunner: re-runs the SAME agent in the same topic with the failure
  feedback as the prompt, re-snapshots the plan onto the repair operation and
  confirms it so it re-verifies on completion (the next round). Capped at
  MAX_REPAIR_ROUNDS via parent-chain depth to prevent runaway loops.
- maybeAutoRepair: fires only once every required check has a terminal result, so
  it works for inline LLM checks (triggered from lifecycle) and async agent checks
  (triggered from the verify tool's writeback path).
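
The gating described in the two bullets above could be sketched like this. All names here (`CheckState`, `repairDepth`, `shouldAutoRepair`) and the verdict vocabulary are hypothetical; only `auto_repair` and the parent-chain depth cap come from the commit message.

```typescript
export type Verdict = 'passed' | 'failed' | 'pending' | 'running'; // assumed vocabulary
export type OnFail = 'auto_repair' | 'other'; // 'other' is a placeholder

export interface CheckState {
  required: boolean;
  onFail: OnFail;
  verdict: Verdict;
}

// Depth = number of ancestors reached by following parent links.
// The visited set guards against a malformed, cyclic chain.
export function repairDepth(opId: string, parentOf: Map<string, string | undefined>): number {
  const seen = new Set<string>([opId]);
  let depth = 0;
  let current = parentOf.get(opId);
  while (current !== undefined && !seen.has(current)) {
    seen.add(current);
    depth += 1;
    current = parentOf.get(current);
  }
  return depth;
}

// Fire only once every required check is terminal, at least one required
// check failed with onFail=auto_repair, and the round cap is not reached.
export function shouldAutoRepair(checks: CheckState[], depth: number, maxRepairRounds: number): boolean {
  const required = checks.filter((c) => c.required);
  const allTerminal = required.every((c) => c.verdict === 'passed' || c.verdict === 'failed');
  if (!allTerminal) return false;
  const repairable = required.some((c) => c.verdict === 'failed' && c.onFail === 'auto_repair');
  return repairable && depth < maxRepairRounds;
}

const chain = new Map<string, string | undefined>([
  ['op3', 'op2'],
  ['op2', 'op1'],
  ['op1', undefined],
]);
console.log(repairDepth('op3', chain)); // 2
```

Keeping the decision a pure function of check states and depth means both trigger points (the lifecycle hook and the writeback path) can call the same check.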

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): open check result detail in portal & rename artifact→result

- add a VerifyResult portal view: clicking any check row opens that result's
  detail (verdict, confidence, Toulmin sections, suggestion) on the right; agent
  checks expose their execution trace from inside the panel
- CheckerDock rows are all clickable now (chevron affordance), status shown by
  icon only; verify card uses colorBgElevated
- rename the run-result surface from "artifact" to "result" everywhere: RunArtifact
  → RunResult, phaseToArtifact → phaseToResult, and all `artifact.*` i18n keys →
  `result.*`
- ship verify namespace zh-CN / en-US locales

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): enrich check result portal — criterion stepper, richer detail view

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): rubric run-policy config + repair feedback on the verify card

Auto-repair feedback now lives on the failed round's role=verify message
(content), and the VerifyMessageProcessor surfaces it into the repair run's
context as a tagged user turn — so the repair op runs off history via a new
execAgent `suppressUserMessage` path instead of injecting a synthetic user
message. createVerifyMessage is awaited before verification to avoid a race.

maxRepairRounds becomes a rubric-level config: new `verify_rubrics.config`
jsonb column, read live at repair time via the plan's sourceRubricId. Adds a
RubricConfig portal panel (reachable from the plan card's settings affordance)
to view/edit it, wired through the verify store + TRPC.

Verify domain types/vocab/config are extracted from the DB schema into
@lobechat/types as the single source of truth; schema and consumers import
from there.

Tests: VerifyMessageProcessor dual behavior; VerifyRubricModel config
round-trip; MessageModel.findVerifyMessageByOperationId.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* 🗃️ refactor(verify): squash the 3 verify migrations into one

Collapse 0110 (tables) + 0111 (criteria.description) + 0112 (rubrics.config)
into a single regenerated 0110_add_verify_tables so the PR ships one clean,
idempotent migration. No schema change vs the three combined.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(cli): verify rubric run-policy config commands + shrink judging-rule editor font

CLI: `verify rubric create --max-repair-rounds`, `verify rubric view`, and
`verify rubric update` exercise the rubric config endpoints end-to-end; adds a
mocked command test. UI: judging-rule editor font 16px → 14px.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(verify): editable rubric name in the config panel + default 3 repair rounds

Add a name (title) field to the RubricConfig portal, persisted via a new
updateRubricTitle store action + service (optimistic + debounced, alongside
the config write-back). Bump DEFAULT_MAX_REPAIR_ROUNDS 2 → 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* ♻️ refactor(verify): extract generateVerifyPlan into installable lobe-delivery-checker tool

Move the delivery-checker plan-creation flow out of the always-on lobe-agent
tool into a new standalone, installable builtin tool `lobe-delivery-checker`
(Skill Store, opt-in per agent — not loaded by default). lobe-agent no longer
ships generateVerifyPlan.

- new packages/builtin-tool-lobe-delivery-checker (manifest/types/systemRole +
  client Render/Inspector/Portal moved wholesale from lobe-agent)
- new serverRuntimes/lobeDeliveryChecker.ts (generateVerifyPlan moved out of
  lobeAgent.ts), registered alongside verifyResult
- registered installable in builtin-tools (no hidden/discoverable:false, not in
  defaultToolIds/alwaysOnToolIds/runtimeManagedToolIds); renders/inspectors/
  portals/identifiers wired; lobe-agent portal entries removed
- i18n keys moved builtins.lobe-agent.verifyPlan.* → builtins.lobe-delivery-checker.*

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

*  feat(agent): add `custom` tool mode; verify agent uses it instead of chat-mode

Chat mode's contract is to strip ALL user/agent plugins (strict KB/memory/web
allow-list) — so the verify sub-agent couldn't get its writeback tool without a
leaky blanket rule. Introduce a third tool mode `custom` where the toolset is
EXACTLY the agent's declared plugins (no always-on, no defaults, no activator),
for focused builtin sub-agents.

- chatConfig.toolMode: 'agent' | 'chat' | 'custom' (overrides enableAgentMode)
- AgentToolsEngine: custom branch (defaultToolIds = plugins, rules = plugins-on,
  allowExplicitActivation only in agent mode); chatModeRules restored to strict
- verify agent → toolMode: 'custom'; lobe-verify dropped from chatModeAllowedToolIds
- test: custom mode enables exactly the declared plugin, no always-on / defaults

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 09:16:35 +08:00

61 lines
3.0 KiB
JSON
{
"badge.failed": "Check failed",
"badge.passed": "Check passed",
"badge.pending": "Awaiting check",
"badge.repairing": "Repair triggered",
"behavior.auto_improve": "Auto-fill",
"behavior.auto_improveDesc": "Filled in automatically; does not block delivery",
"behavior.gate": "Delivery gate",
"behavior.gateDesc": "Blocks delivery on failure and triggers a repair round",
"detail.checkedAt": "Checked at",
"detail.confidence": "Confidence",
"detail.counterEvidence": "Counter-evidence",
"detail.duration": "Duration",
"detail.evidence": "Evidence",
"detail.instruction": "Judging rule",
"detail.limitation": "Limitation",
"detail.method": "Method",
"detail.methodAgent": "Agent",
"detail.methodLlm": "LLM",
"detail.methodProgram": "Program",
"detail.model": "Model",
"detail.openTrace": "View agent trace",
"detail.pending": "This check has not run yet.",
"detail.reasoning": "Reasoning",
"detail.suggestion": "Suggested fix",
"detail.summary": "Summary",
"detail.tokens": "Tokens",
"dock.confirm": "Confirm & run",
"dock.edit": "Adjust checks",
"dock.forceDeliver": "Ignore & deliver",
"dock.repairHint": "The next round is fixing the failed checks. A new result is produced and the checker re-runs when it finishes.",
"dock.saveAndRepair": "Save input & repair now",
"dock.skip": "Skip checks",
"dock.title": "Delivery Checker",
"editor.add": "+ Add check",
"editor.cancel": "Cancel",
"editor.placeholder": "Check title",
"editor.save": "Save",
"input.hint": "This goes to the next repair round as checker input — it will not appear as a chat message.",
"input.label": "Extra input for the next repair round",
"input.placeholder": "e.g. run type-check first; if it still fails, just add a risk note.",
"result.failed.sub": "This result is held back. The delivery checker found verification insufficient and triggered a repair.",
"result.failed.title": "Draft result",
"result.foot": "A snapshot of this run's result — not an assistant or user message.",
"result.kicker": "Verification · Round {{round}}",
"result.passed.sub": "The delivery checker passed {{passed}}/{{total}}. This result is ready to deliver.",
"result.passed.title": "Result",
"result.pending.sub": "The result is generated but not yet delivered — waiting for the delivery checker.",
"result.pending.title": "Draft result",
"result.repairing.sub": "Checks did not pass. A repair round has started.",
"result.repairing.title": "Draft result",
"result.title": "Verification #{{round}}",
"status.checking": "Delivery Checker: checking {{passed}}/{{total}}",
"status.draft": "Delivery Checker: awaiting confirmation · {{total}} checks",
"status.failed": "Delivery Checker: failed · repair triggered",
"status.idle": "Delivery Checker: not generated",
"status.passed": "Delivery Checker: passed {{passed}}/{{total}}",
"status.repairing": "Delivery Checker: repairing",
"status.verifying": "Delivery Checker: waiting for run to finish"
}