🐛 fix(chat): compact operation metrics on narrow inputs (#15735)

* 🐛 fix: compact operation metrics on narrow inputs * 📝 docs: improve agent testing report template
2026-06-14 03:30:19 +00:00 · 2026-06-13 02:28:38 +08:00
parent 5362be4078
commit ab958a0b98
7 changed files with 255 additions and 96 deletions
@@ -38,9 +38,9 @@ lists `packages/**`, `e2e`, `apps/server`, and only `apps/desktop/src/main` —
 refresh them, so install in every app the test will touch:

 ```bash
-pnpm install                      # root workspace
-cd apps/desktop && pnpm install   # Electron surface
-cd apps/cli && pnpm install       # CLI surface
+pnpm install                    # root workspace
+cd apps/desktop && pnpm install # Electron surface
+cd apps/cli && pnpm install     # CLI surface
 ```

 Symptom of a stale standalone install: the build/launch fails to resolve a
@@ -63,12 +63,12 @@ long-running scripts.
 ./.agents/skills/agent-testing/scripts/setup-auth.sh status
 ```

-| Surface  | Mechanism                                         | One-key path                   | Standard check                 |
-| -------- | ------------------------------------------------- | ------------------------------ | ------------------------------ |
-| CLI      | OIDC Device Code Flow (`apps/cli/.lobehub-dev`)   | `setup-auth.sh cli`            | `setup-auth.sh status`         |
-| Web      | better-auth cookie injection into `agent-browser` | `pbpaste \| setup-auth.sh web` | `setup-auth.sh web-verify`     |
-| Electron | App's own persistent login state                  | Log in once in the app         | `app-probe.sh auth`            |
-| Bot      | Native apps already logged in                     | —                              | per-platform screenshot        |
+| Surface  | Mechanism                                         | One-key path                   | Standard check             |
+| -------- | ------------------------------------------------- | ------------------------------ | -------------------------- |
+| CLI      | OIDC Device Code Flow (`apps/cli/.lobehub-dev`)   | `setup-auth.sh cli`            | `setup-auth.sh status`     |
+| Web      | better-auth cookie injection into `agent-browser` | `pbpaste \| setup-auth.sh web` | `setup-auth.sh web-verify` |
+| Electron | App's own persistent login state                  | Log in once in the app         | `app-probe.sh auth`        |
+| Bot      | Native apps already logged in                     | —                              | per-platform screenshot    |

 Login-state checks are standardized — do NOT hand-roll `window.__LOBE_STORES`
 eval snippets; use `scripts/app-probe.sh auth` (returns `{ isSignedIn, userId }`,
@@ -148,17 +148,17 @@ Surface guides above carry the detailed workflows. Shared infrastructure:

 All under `.agents/skills/agent-testing/scripts/`:

-| Script                    | Usage                                                                          |
-| ------------------------- | ------------------------------------------------------------------------------ |
-| `setup-auth.sh`           | One-stop auth setup & status check (`status` / `cli` / `web`)                  |
-| `app-probe.sh`            | LobeHub app probes: `auth` / `route` / `ops` / `goto <path>` / `errors`        |
-| `record-gif.sh`           | Frame-sequence → GIF for time-based behavior (streaming, timers, animations)   |
-| `report-init.sh`          | Scaffold a structured test report (Step 3)                                     |
-| `electron-dev.sh`         | Manage Electron dev env (start/stop/status/restart, CDP 9222)                  |
-| `capture-app-window.sh`   | Screenshot a specific app window (general; used by bot tests)                  |
-| `record-app-screen.sh`    | Record app screen (video + periodic screenshots)                               |
-| `record-electron-demo.sh` | Record Electron app demo with ffmpeg                                           |
-| `agent-gateway/`          | Gateway probe / dump / analyze tools                                           |
+| Script                    | Usage                                                                        |
+| ------------------------- | ---------------------------------------------------------------------------- |
+| `setup-auth.sh`           | One-stop auth setup & status check (`status` / `cli` / `web`)                |
+| `app-probe.sh`            | LobeHub app probes: `auth` / `route` / `ops` / `goto <path>` / `errors`      |
+| `record-gif.sh`           | Frame-sequence → GIF for time-based behavior (streaming, timers, animations) |
+| `report-init.sh`          | Scaffold a structured test report (Step 3)                                   |
+| `electron-dev.sh`         | Manage Electron dev env (start/stop/status/restart, CDP 9222)                |
+| `capture-app-window.sh`   | Screenshot a specific app window (general; used by bot tests)                |
+| `record-app-screen.sh`    | Record app screen (video + periodic screenshots)                             |
+| `record-electron-demo.sh` | Record Electron app demo with ffmpeg                                         |
+| `agent-gateway/`          | Gateway probe / dump / analyze tools                                         |

 `app-probe.sh` is the LobeHub-specific fast path into app state — auth check,
 current route, running operations, and `goto <path>` quick navigation
@@ -174,12 +174,13 @@ not a chat-only summary. Scaffold it up front and fill it as you test:
 ```bash
 DIR=$(./.agents/skills/agent-testing/scripts/report-init.sh my-feature "Verify my feature")
 # ... test, saving screenshots / CLI transcripts into $DIR/assets/ ...
-# fill $DIR/report.md (case table, embedded evidence, verdict) and $DIR/result.json
+# fill $DIR/report.md (scope, case table with inline evidence, verdict, score) and $DIR/result.json
 ```

 Reports live in `.records/reports/<timestamp>-<slug>/` (gitignored): `report.md`
-(human-readable, with embedded screenshots), `result.json` (machine-readable
-pass/fail + score), `assets/` (evidence). Format spec and evidence rules:
+(human-readable, with screenshots/GIFs embedded directly in the case table),
+`result.json` (machine-readable pass/fail + score), `assets/` (evidence).
+Format spec and evidence rules:
 [references/report.md](./references/report.md).

 Two hard rules worth front-loading:
@@ -187,6 +188,10 @@ Two hard rules worth front-loading:
 - **Report language = the user's conversation language.** Write the ENTIRE
  `report.md` (headings included) in the language the user is conversing in —
  no mixed English. `result.json` keys/status values stay English.
+- **The case table is the main reading surface.** Prefer the compact
+  `# | case | result | key observation | evidence` shape and embed the
+  screenshot/GIF in the evidence cell. Use separate evidence sections only for
+  long CLI transcripts, HAR summaries, or supplemental detail.
 - **Time-based behavior needs a GIF, not a screenshot.** If a case asserts
  change over time (streaming output, a ticking timer, loading states,
  animations), record it with `scripts/record-gif.sh` and embed the GIF —
@@ -11,7 +11,7 @@ output):

 ```
 .records/reports/<YYYYMMDD-HHMMSS>-<slug>/
-├── report.md      # human-readable report (embedded screenshots, case table, verdict)
+├── report.md      # human-readable report (case table with inline screenshots, verdict)
 ├── result.json    # machine-readable results (pass/fail counts, score)
 └── assets/        # evidence: screenshots, HAR files, CLI transcripts
 ```
@@ -25,7 +25,9 @@ output):
   ```

   The script creates the directory, pre-fills branch / commit / date in both
-   files, and prints the directory path.
+   files, and prints the directory path. The scaffold uses the compact report
+   shape below; translate its headings and table labels to the user's language
+   before delivery if needed.

 2. **Collect evidence as you test** — every asserted behavior gets one evidence
   item in `$DIR/assets/`:
@@ -48,10 +50,14 @@ output):

     Embed it like an image: `![case 2](assets/case2-streaming.gif)`. Verify
     at least the first/last frames visually (Read the GIF) before citing.
+
   - CLI: exact command + trimmed output (`$CLI task list | tee "$DIR/assets/task-list.txt"`).
   - Network: `agent-browser network requests` dumps or HAR files.

 3. **Fill `report.md` as you go** — don't reconstruct from memory at the end.
+   The primary evidence belongs in the case table itself: each row should pair
+   the assertion with the screenshot/GIF/link that proves it, so readers can
+   scan the result without jumping between sections.

 4. **Set the verdict** in both `report.md` and `result.json`, then link the
   report directory in your final answer to the user.
@@ -60,21 +66,47 @@ output):

 **`report.md` MUST be written in the language the user is conversing in** —
 the whole file, headings included. If the conversation is in Chinese, the
-report is in Chinese; do not mix English prose into it. The scaffold's English
-headings are placeholders — translate them when filling. Exceptions that stay
-as-is: code/commands, identifiers, log excerpts, and `result.json` (its keys
-and status values are machine-read and stay English; the `title` and case
-`name` fields follow the user's language).
+report is in Chinese; do not mix English prose into it. The scaffold headings
+are placeholders — translate them when filling if the user is not conversing in
+the scaffold language. Exceptions that stay as-is: code/commands, identifiers,
+log excerpts, and `result.json` (its keys and status values are machine-read
+and stay English; the `title` and case `name` fields follow the user's
+language).

 ## report.md sections

-| Section         | Content                                                                            |
-| --------------- | ---------------------------------------------------------------------------------- |
-| **Scope**       | What changed / what is being verified; branch + commit                             |
-| **Environment** | Server URL, surfaces used (cli / electron / web / bot), relevant versions          |
-| **Cases**       | Table: `# \| case \| surface \| steps \| expected \| actual \| status \| evidence` |
-| **Evidence**    | Embedded screenshots/GIFs (`![case 1](assets/case1.png)`), fenced CLI transcripts  |
-| **Verdict**     | Pass/fail/blocked counts, optional 0–100 score, open issues / follow-ups           |
+Default report shape:
+
+| Section          | Content                                                                                      |
+| ---------------- | -------------------------------------------------------------------------------------------- |
+| **Scope**        | What changed / what is being verified; branch, commit, date, surface, entry URL/page, focus  |
+| **Cases**        | Compact table: `# \| Case \| Result \| Key observation \| Evidence`                          |
+| **Verdict**      | Overall verdict first (`pass` / `partial` / `fail`), then the concise reasons and follow-ups |
+| **Verification** | Commands or automated checks run in this session, with trimmed results                       |
+| **Score**        | Pass/fail/blocked counts, optional 0–100 score                                               |
+
+The case table is the main reading surface. Prefer one clear row per user
+scenario or regression assertion, and put the screenshot/GIF directly in the
+`Evidence` cell:
+
+```markdown
+| #   | Case                     | Result | Key observation                                                   | Evidence                                         |
+| --- | ------------------------ | ------ | ----------------------------------------------------------------- | ------------------------------------------------ |
+| 1   | Create a new page        | pass   | Title and body persisted after refresh                            | ![created page](assets/new-page-created.png)     |
+| 2   | Respect requested length | fail   | Requested about 600 Chinese characters; final body was about 1286 | ![final article](assets/write-article-final.png) |
+```
+
+Avoid the old wide table with separate `steps`, `expected`, and `actual`
+columns unless the test is purely non-visual and truly needs that breakdown.
+For UI reports, those columns make screenshot-backed reading harder. Put
+procedural detail in the row's key observation only when it changes the
+interpretation of the result.
+
+Use an extra evidence/detail section only when the inline table cannot carry
+the material cleanly, such as long CLI transcripts, HAR summaries, or multiple
+screenshots for one case. In that situation, keep the table evidence cell as a
+short inline proof or link, then put the longer material under `Verification`
+or a brief `Additional Evidence` section.

 Status values: `pass` / `fail` / `blocked` (couldn't run — e.g. auth or env
 missing; a blocked case is not a pass).
@@ -24,39 +24,53 @@ DATE_HUMAN=$(date '+%Y-%m-%d %H:%M')
 DATE_ISO=$(date '+%Y-%m-%dT%H:%M:%S%z')

 cat > "$DIR/report.md" << EOF
-# Test Report: $TITLE
+# 测试报告：$TITLE

-## Scope
+## 范围

-<!-- What changed / what is being verified -->
+<!-- 测试目标 / 变更范围 / 重点风险 -->

- Branch: \`$BRANCH\`
- Commit: \`$COMMIT\`
- Date: $DATE_HUMAN
+- 分支：\`$BRANCH\`
+- 当前提交：\`$COMMIT\`
+- 日期：$DATE_HUMAN
+- 表面：<!-- CLI / Electron + CDP / Web / Bot:<platform> -->
+- 测试页 / 入口：<!-- e.g. /settings or http://localhost:3010 -->
+- 重点：<!-- 本轮最关心的体验、功能或回归点 -->

-## Environment
+## 用例

- Server: <!-- e.g. http://localhost:3010 -->
- Surfaces: <!-- cli / electron / web / bot:<platform> -->
+| # | 用例 | 结果 | 关键现象 | 证据 |
+| - | ---- | ---- | -------- | ---- |
+| 1 |      | 待测 |          | ![用例 1](assets/case1.png) |

-## Cases
+## 结论

-| # | Case | Surface | Steps | Expected | Actual | Status | Evidence |
-| - | ---- | ------- | ----- | -------- | ------ | ------ | -------- |
-| 1 |      |         |       |          |        |        |          |
+整体结论：\`pending\`。

-## Evidence
+<!-- 用 1-2 段概括用户最需要知道的结果；失败和阻塞必须明确说明影响。 -->

-<!-- Embed screenshots: ![case 1](assets/case1.png) -->
-<!-- CLI transcripts in fenced blocks, with the exact command -->
+仍需处理 / 跟进：

-## Verdict
+- <!-- TODO -->

- Passed: 0 / 0
- Failed: 0
- Blocked: 0
- Score (optional): —
- Open issues / follow-ups:
+## 本轮验证
+
+<!-- 如有自动化或命令行验证，保留精简命令与结果；没有则写“未运行额外自动化验证”。 -->
+
+\`\`\`bash
+# command
+\`\`\`
+
+结果：
+
+- <!-- TODO -->
+
+## 评分
+
+- 通过：0
+- 失败：0
+- 阻塞：0
+- 评分：— / 100
 EOF

 cat > "$DIR/result.json" << EOF