✨ feat(skills): agent-testing iteration after first real-world run (#15700)

* 📝 docs(skills): make agent-testing Step 0 an env-setup + auth checklist Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * ✨ feat(skills): agent-testing probes, GIF evidence, and report-language rule Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 19:20:04 +00:00 · 2026-06-11 23:52:25 +08:00
parent 6c8976b641
commit eca449e4e2
8 changed files with 333 additions and 27 deletions
@@ -19,25 +19,60 @@ also run as full cloud automation. Every test session follows the same
 four-step contract:

 ```
-Step 0: Auth ready?  →  Step 1: Pick surface  →  Step 2: Run  →  Step 3: Structured report
+Step 0: Env + Auth  →  Step 1: Pick surface  →  Step 2: Run  →  Step 3: Structured report
 ```

-## Step 0 — Auth first (mandatory)
+## Step 0 — Environment setup + auth check (mandatory)

-**Auth is the gate for all automated testing.** Prepare and verify it BEFORE
-writing a single test step — a half-finished test run that dies on a login wall
-wastes the whole session.
+Step 0 is about getting the environment ready: **dependencies are healthy**
+and **auth is green**. A test run that dies halfway on a missing dependency or
+a login wall wastes the whole session — clear both gates BEFORE writing a
+single test step.
+
+### 0.1 Dependencies are installed — root AND standalone apps
+
+The root pnpm workspace does **NOT** cover every app: `pnpm-workspace.yaml`
+lists `packages/**`, `e2e`, `apps/server`, and only `apps/desktop/src/main` —
+**`apps/desktop` and `apps/cli` are standalone**, each keeping its own
+`node_modules` with its own links into `packages/`. A root install does not
+refresh them, so install in every app the test will touch:
+
+```bash
+pnpm install                      # root workspace
+cd apps/desktop && pnpm install   # Electron surface
+cd apps/cli && pnpm install       # CLI surface
+```
+
+Symptom of a stale standalone install: the build/launch fails to resolve a
+recently added workspace package — `Rolldown failed to resolve import
+"@lobechat/<pkg>"` (Electron) or `Cannot find module '@lobechat/<pkg>'` (CLI).
+
+### 0.2 Run scripts from the repo root
+
+All paths in this skill (`./.agents/skills/agent-testing/...`) are
+repo-root-relative, and background commands inherit the current working
+directory — a script launched while `cwd` is `apps/desktop` fails with
+`No such file or directory`. Verify `pwd` is the repo root before launching
+long-running scripts.
+
+### 0.3 Auth is green
+
+**Auth is the gate for all automated testing.**

 ```bash
 ./.agents/skills/agent-testing/scripts/setup-auth.sh status
 ```

-| Surface  | Mechanism                                         | One-key path                   | Human needed?                 |
-| -------- | ------------------------------------------------- | ------------------------------ | ----------------------------- |
-| CLI      | OIDC Device Code Flow (`apps/cli/.lobehub-dev`)   | `setup-auth.sh cli`            | Yes — browser authorization   |
-| Web      | better-auth cookie injection into `agent-browser` | `pbpaste \| setup-auth.sh web` | Copy cookie once per rotation |
-| Electron | App's own persistent login state                  | Log in once in the app         | Once                          |
-| Bot      | Native apps already logged in                     | —                              | Once per app                  |
+| Surface  | Mechanism                                         | One-key path                   | Standard check                 |
+| -------- | ------------------------------------------------- | ------------------------------ | ------------------------------ |
+| CLI      | OIDC Device Code Flow (`apps/cli/.lobehub-dev`)   | `setup-auth.sh cli`            | `setup-auth.sh status`         |
+| Web      | better-auth cookie injection into `agent-browser` | `pbpaste \| setup-auth.sh web` | `setup-auth.sh web-verify`     |
+| Electron | App's own persistent login state                  | Log in once in the app         | `app-probe.sh auth`            |
+| Bot      | Native apps already logged in                     | —                              | per-platform screenshot        |
+
+Login-state checks are standardized — do NOT hand-roll `window.__LOBE_STORES`
+eval snippets; use `scripts/app-probe.sh auth` (returns `{ isSignedIn, userId }`,
+works for Electron CDP and web sessions via `AB_TARGET`).

 If `status` is not all green, fix auth first (the steps that need a human must be
 requested from the user explicitly). Full background and failure modes:
@@ -113,15 +148,23 @@ Surface guides above carry the detailed workflows. Shared infrastructure:

 All under `.agents/skills/agent-testing/scripts/`:

-| Script                    | Usage                                                         |
-| ------------------------- | ------------------------------------------------------------- |
-| `setup-auth.sh`           | One-stop auth setup & status check (`status` / `cli` / `web`) |
-| `report-init.sh`          | Scaffold a structured test report (Step 3)                    |
-| `electron-dev.sh`         | Manage Electron dev env (start/stop/status/restart, CDP 9222) |
-| `capture-app-window.sh`   | Screenshot a specific app window (general; used by bot tests) |
-| `record-app-screen.sh`    | Record app screen (video + periodic screenshots)              |
-| `record-electron-demo.sh` | Record Electron app demo with ffmpeg                          |
-| `agent-gateway/`          | Gateway probe / dump / analyze tools                          |
+| Script                    | Usage                                                                          |
+| ------------------------- | ------------------------------------------------------------------------------ |
+| `setup-auth.sh`           | One-stop auth setup & status check (`status` / `cli` / `web`)                  |
+| `app-probe.sh`            | LobeHub app probes: `auth` / `route` / `ops` / `goto <path>` / `errors`        |
+| `record-gif.sh`           | Frame-sequence → GIF for time-based behavior (streaming, timers, animations)   |
+| `report-init.sh`          | Scaffold a structured test report (Step 3)                                     |
+| `electron-dev.sh`         | Manage Electron dev env (start/stop/status/restart, CDP 9222)                  |
+| `capture-app-window.sh`   | Screenshot a specific app window (general; used by bot tests)                  |
+| `record-app-screen.sh`    | Record app screen (video + periodic screenshots)                               |
+| `record-electron-demo.sh` | Record Electron app demo with ffmpeg                                           |
+| `agent-gateway/`          | Gateway probe / dump / analyze tools                                           |
+
+`app-probe.sh` is the LobeHub-specific fast path into app state — auth check,
+current route, running operations, and `goto <path>` quick navigation
+(`/agent/<agentId>/<topicId>`, `/task/<taskId>`, `/settings`, …) so a test can
+jump straight to the state under test instead of clicking through the UI. See
+[ui/electron.md](./ui/electron.md#lobehub-probes--quick-navigation) for usage.

 ## Step 3 — Structured report (mandatory deliverable)

@@ -139,6 +182,16 @@ Reports live in `.records/reports/<timestamp>-<slug>/` (gitignored): `report.md`
 pass/fail + score), `assets/` (evidence). Format spec and evidence rules:
 [references/report.md](./references/report.md).

+Two hard rules worth front-loading:
+
+- **Report language = the user's conversation language.** Write the ENTIRE
+  `report.md` (headings included) in the language the user is conversing in —
+  no mixed English. `result.json` keys/status values stay English.
+- **Time-based behavior needs a GIF, not a screenshot.** If a case asserts
+  change over time (streaming output, a ticking timer, loading states,
+  animations), record it with `scripts/record-gif.sh` and embed the GIF —
+  a static screenshot cannot prove the behavior.
+
 ## Directory map

 ```
@@ -16,7 +16,7 @@ flakiness.
 | Requirement  | Details                                                                           |
 | ------------ | --------------------------------------------------------------------------------- |
 | Dev server   | `localhost:3010` — see [../references/dev-server.md](../references/dev-server.md) |
-| CLI source   | `apps/cli/` — runs from source, no rebuild needed                                 |
+| CLI source   | `apps/cli/` — runs from source, no rebuild; standalone `node_modules` — run `pnpm install` inside `apps/cli/` (root install does not cover it) |
 | CLI dev mode | `LOBEHUB_CLI_HOME=.lobehub-dev` for isolated credentials                          |
 | Auth         | Device Code Flow login — see [../references/auth.md](../references/auth.md)       |

@@ -101,8 +101,16 @@ agent-browser --session lobehub-dev snapshot -i | head -20

 The desktop app keeps its own persistent login state in its user-data
 directory — log in once manually inside the app and it survives restarts of
-`electron-dev.sh`. No injection needed; just confirm with a snapshot that the
-app is past the signin screen before running UI tests.
+`electron-dev.sh`. No injection needed. The standard check (do NOT hand-roll a
+store eval) once Electron is up with CDP:
+
+```bash
+./.agents/skills/agent-testing/scripts/app-probe.sh auth
+# → {"ok":true,"isSignedIn":true,"userId":"user_xxx"}
+```
+
+`setup-auth.sh status` runs this probe automatically when CDP 9222 is
+reachable.

 ## Scope

@@ -29,9 +29,25 @@ output):

 2. **Collect evidence as you test** — every asserted behavior gets one evidence
   item in `$DIR/assets/`:
-   - UI: `agent-browser screenshot` or `capture-app-window.sh`, then **verify
-     the screenshot with the Read tool before citing it** — never cite an
-     image you haven't looked at.
+   - UI (static state): `agent-browser screenshot` or `capture-app-window.sh`,
+     then **verify the screenshot with the Read tool before citing it** —
+     never cite an image you haven't looked at.
+   - UI (time-based behavior): **screenshot vs GIF is a judgment you must
+     make per case.** If the assertion is about change over time — streaming
+     output, a ticking timer, loading/progress states, animations,
+     appear/disappear transitions — a static screenshot cannot prove it.
+     Record a frame sequence and synthesize a GIF:
+
+     ```bash
+     # start recording (background), trigger the behavior, wait for it to finish
+     ../scripts/record-gif.sh "$DIR/assets/case2-streaming.gif" 12 2 &
+     GIF_PID=$!
+     # ... drive the scenario ...
+     wait $GIF_PID
+     ```
+
+     Embed it like an image: `![case 2](assets/case2-streaming.gif)`. Verify
+     at least the first/last frames visually (Read the GIF) before citing.
   - CLI: exact command + trimmed output (`$CLI task list | tee "$DIR/assets/task-list.txt"`).
   - Network: `agent-browser network requests` dumps or HAR files.

@@ -40,6 +56,16 @@ output):
 4. **Set the verdict** in both `report.md` and `result.json`, then link the
   report directory in your final answer to the user.

+## Report language (hard rule)
+
+**`report.md` MUST be written in the language the user is conversing in** —
+the whole file, headings included. If the conversation is in Chinese, the
+report is in Chinese; do not mix English prose into it. The scaffold's English
+headings are placeholders — translate them when filling. Exceptions that stay
+as-is: code/commands, identifiers, log excerpts, and `result.json` (its keys
+and status values are machine-read and stay English; the `title` and case
+`name` fields follow the user's language).
+
 ## report.md sections

 | Section         | Content                                                                            |
@@ -47,7 +73,7 @@ output):
 | **Scope**       | What changed / what is being verified; branch + commit                             |
 | **Environment** | Server URL, surfaces used (cli / electron / web / bot), relevant versions          |
 | **Cases**       | Table: `# \| case \| surface \| steps \| expected \| actual \| status \| evidence` |
-| **Evidence**    | Embedded screenshots (`![case 1](assets/case1.png)`), fenced CLI transcripts       |
+| **Evidence**    | Embedded screenshots/GIFs (`![case 1](assets/case1.png)`), fenced CLI transcripts  |
 | **Verdict**     | Pass/fail/blocked counts, optional 0–100 score, open issues / follow-ups           |

 Status values: `pass` / `fail` / `blocked` (couldn't run — e.g. auth or env
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+# app-probe.sh — standardized probes for a running LobeHub app (Electron via
+# CDP, or a web agent-browser session). Use these instead of hand-rolling
+# `window.__LOBE_STORES` eval snippets — especially the auth check.
+#
+# Usage:
+#   app-probe.sh auth              # { isSignedIn, userId } from the user store
+#   app-probe.sh route             # current SPA route
+#   app-probe.sh ops               # running chat operations (type / status / startTime)
+#   app-probe.sh goto <path>       # navigate the SPA to a route (full reload), e.g. goto /agent/agt_xxx
+#   app-probe.sh errors-install    # install a console.error interceptor
+#   app-probe.sh errors            # dump errors captured since errors-install
+#
+# Target selection (default: Electron over CDP 9222):
+#   AB_TARGET="--cdp 9222"             # Electron (default; CDP_PORT also honored)
+#   AB_TARGET="--session lobehub-dev"  # web agent-browser session
+#
+# Common routes (desktop SPA): /  /agent/<agentId>  /agent/<agentId>/<topicId>
+#   /task  /task/<taskId>  /page  /settings  /community
+
+set -euo pipefail
+
+AB_TARGET="${AB_TARGET:---cdp ${CDP_PORT:-9222}}"
+
+run_eval() {
+  # shellcheck disable=SC2086
+  agent-browser $AB_TARGET eval --stdin
+}
+
+case "${1:-}" in
+  auth)
+    run_eval << 'EVALEOF'
+(function () {
+  var stores = window.__LOBE_STORES;
+  if (!stores || !stores.user) return JSON.stringify({ ok: false, reason: 'no user store — app not loaded yet?' });
+  var u = stores.user();
+  return JSON.stringify({ ok: !!u.isSignedIn, isSignedIn: !!u.isSignedIn, userId: (u.user && u.user.id) || null });
+})()
+EVALEOF
+    ;;
+  route)
+    run_eval << 'EVALEOF'
+location.pathname + location.search + location.hash
+EVALEOF
+    ;;
+  ops)
+    run_eval << 'EVALEOF'
+(function () {
+  var stores = window.__LOBE_STORES;
+  if (!stores || !stores.chat) return JSON.stringify({ ok: false, reason: 'no chat store — open a conversation first' });
+  var ops = Object.values(stores.chat().operations || {});
+  var running = ops.filter(function (o) { return o.status === 'running'; });
+  return JSON.stringify({
+    ok: true,
+    running: running.map(function (o) { return { startTime: o.metadata && o.metadata.startTime, type: o.type }; }),
+    runningCount: running.length,
+    total: ops.length,
+  });
+})()
+EVALEOF
+    ;;
+  goto)
+    TARGET_PATH="${2:?Usage: app-probe.sh goto <path>}"
+    # shellcheck disable=SC2086
+    agent-browser $AB_TARGET eval "location.href = '$TARGET_PATH'" > /dev/null
+    sleep 2
+    bash "${BASH_SOURCE[0]}" route
+    ;;
+  errors-install)
+    run_eval << 'EVALEOF'
+(function () {
+  window.__CAPTURED_ERRORS = [];
+  var orig = console.error;
+  console.error = function () {
+    var msg = Array.from(arguments).map(function (a) {
+      if (a instanceof Error) return a.message;
+      return typeof a === 'object' ? JSON.stringify(a) : String(a);
+    }).join(' ');
+    window.__CAPTURED_ERRORS.push(msg);
+    orig.apply(console, arguments);
+  };
+  return 'installed';
+})()
+EVALEOF
+    ;;
+  errors)
+    run_eval << 'EVALEOF'
+JSON.stringify(window.__CAPTURED_ERRORS || 'interceptor not installed — run errors-install first')
+EVALEOF
+    ;;
+  *)
+    echo "Usage: $0 {auth|route|ops|goto <path>|errors-install|errors}" >&2
+    exit 2
+    ;;
+esac
@@ -0,0 +1,61 @@
+#!/usr/bin/env bash
+# record-gif.sh — capture a frame sequence via agent-browser (CDP) and
+# synthesize a GIF for embedding in a test report.
+#
+# Use this whenever the asserted behavior is about CHANGE OVER TIME —
+# streaming output, a ticking timer, loading states, animations. A static
+# screenshot cannot prove those; a GIF can. Cloud-portable: frames come from
+# CDP rendering, no OS-level screen capture.
+#
+# Usage:
+#   record-gif.sh <output.gif> <duration_seconds> [fps]
+#
+#   AB_TARGET="--cdp 9222"             # Electron (default; CDP_PORT honored)
+#   AB_TARGET="--session lobehub-dev"  # web agent-browser session
+#   GIF_WIDTH=960                      # output width (px), default 960
+#
+# Requires ffmpeg (`brew install ffmpeg`). Effective fps is capped by
+# screenshot latency (~0.3-0.5s per frame); 1-2 fps is the realistic range.
+#
+# Example — record a 12s run and embed it in the report:
+#   ./record-gif.sh "$DIR/assets/case2-tray-running.gif" 12 2 &
+#   GIF_PID=$!
+#   # ... trigger the streaming behavior ...
+#   wait $GIF_PID
+
+set -euo pipefail
+
+OUT="${1:?Usage: record-gif.sh <output.gif> <duration_seconds> [fps]}"
+DUR="${2:?Usage: record-gif.sh <output.gif> <duration_seconds> [fps]}"
+FPS="${3:-2}"
+AB_TARGET="${AB_TARGET:---cdp ${CDP_PORT:-9222}}"
+GIF_WIDTH="${GIF_WIDTH:-960}"
+
+command -v ffmpeg > /dev/null || {
+  echo "ffmpeg not found — install with: brew install ffmpeg" >&2
+  exit 1
+}
+
+TMP=$(mktemp -d)
+trap 'rm -rf "$TMP"' EXIT
+
+FRAMES=$((DUR * FPS))
+INTERVAL=$(python3 -c "print(1 / $FPS)")
+
+for i in $(seq -f '%04g' 1 "$FRAMES"); do
+  # shellcheck disable=SC2086
+  agent-browser $AB_TARGET screenshot "$TMP/frame-$i.png" > /dev/null 2>&1 || true
+  sleep "$INTERVAL"
+done
+
+CAPTURED=$(find "$TMP" -name 'frame-*.png' | wc -l | tr -d ' ')
+[ "$CAPTURED" -gt 0 ] || {
+  echo "no frames captured — is the app reachable via $AB_TARGET?" >&2
+  exit 1
+}
+
+ffmpeg -y -loglevel error -framerate "$FPS" -pattern_type glob -i "$TMP/frame-*.png" \
+  -vf "fps=$FPS,scale=$GIF_WIDTH:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" \
+  "$OUT"
+
+echo "$OUT ($CAPTURED frames @ ${FPS}fps)"
@@ -63,12 +63,33 @@ check_web() {
  fi
 }

+check_electron() {
+  local cdp_port="${CDP_PORT:-9222}"
+  if ! curl -s -o /dev/null --max-time 2 "http://localhost:$cdp_port/json/version" 2> /dev/null; then
+    note "electron: not running (CDP $cdp_port unreachable) — start with electron-dev.sh; check skipped"
+    return 0
+  fi
+  local probe result
+  probe="$(dirname "${BASH_SOURCE[0]}")/app-probe.sh"
+  result=$(bash "$probe" auth 2> /dev/null || true)
+  # agent-browser eval returns the JSON string with escaped quotes — normalize.
+  result="${result//\\/}"
+  if [[ "$result" == *'"isSignedIn":true'* ]]; then
+    ok "electron app signed in ($result)"
+  else
+    bad "electron app NOT signed in ($result)"
+    note "log in once manually inside the app (state persists across restarts)"
+    return 1
+  fi
+}
+
 cmd_status() {
  echo "agent-testing auth status (SERVER_URL=$SERVER_URL):"
  local rc=0
  check_server || rc=1
  check_cli || rc=1
  check_web || rc=1
+  check_electron || rc=1
  if [[ $rc -eq 0 ]]; then
    echo "all green — safe to start automated testing."
  else
@@ -39,6 +39,39 @@ After `start` succeeds, connect with: `agent-browser --cdp 9222 snapshot -i`
 | `ELECTRON_WAIT_S` | `60`                    | Max seconds to wait for Electron process |
 | `RENDERER_WAIT_S` | `60`                    | Max seconds to wait for SPA to load      |

+### LobeHub Probes & Quick Navigation
+
+`scripts/app-probe.sh` is the standard fast path into app state — **use it
+instead of hand-rolling `__LOBE_STORES` eval snippets** for these common needs:
+
+```bash
+PROBE=".agents/skills/agent-testing/scripts/app-probe.sh"
+
+$PROBE auth              # login check (Step 0.3) → { isSignedIn, userId }
+$PROBE route             # current SPA route
+$PROBE ops               # running chat operations (type / startTime)
+$PROBE goto /settings    # jump the SPA straight to a route (full reload)
+$PROBE errors-install    # install console.error interceptor
+$PROBE errors            # dump captured errors
+```
+
+`goto` lets a test enter the state under test directly instead of clicking
+through the UI. Common desktop routes:
+
+| Route                         | Where it lands                       |
+| ----------------------------- | ------------------------------------ |
+| `/`                           | Home (has a chat input)              |
+| `/agent/<agentId>`            | Agent conversation (latest topic)    |
+| `/agent/<agentId>/<topicId>`  | Specific topic in a conversation     |
+| `/task` · `/task/<taskId>`    | Task list / task detail              |
+| `/page`                       | Documents (文稿)                     |
+| `/settings`                   | Settings                             |
+| `/community`                  | Discover / community                 |
+
+Targets default to Electron (`--cdp 9222`); set `AB_TARGET="--session <name>"`
+for web sessions. For deeper or one-off state inspection, fall back to raw
+eval below.
+
 ### LobeHub-Specific Patterns

 #### Access Zustand Store State
@@ -108,5 +141,14 @@ agent-browser --cdp 9222 eval "JSON.stringify(window.__CAPTURED_ERRORS)"

 - **Always use `electron-dev.sh stop` to clean up** — `pkill -f "Electron"` only kills the main process; helper processes (GPU, renderer, network) survive. The script finds and kills all of them via PID matching against the project's electron binary path.
 - **`npx electron-vite dev` must run from `apps/desktop/`** — running from project root fails silently. The `electron-dev.sh` script handles this automatically.
+- **Dev build auto-opens DevTools, which hijacks the CDP target** — `agent-browser --cdp 9222` may attach to the DevTools page (`devtools://…`) instead of the app (`app://renderer/`). Symptom: `get url` returns a `devtools://` URL. Fix: close the DevTools target and reconnect:
+
+  ```bash
+  DT_ID=$(curl -s http://localhost:9222/json/list | python3 -c "import json,sys; ts=json.load(sys.stdin); print(next(t['id'] for t in ts if t['type']=='page' and t['url'].startswith('devtools://')))")
+  curl -s "http://localhost:9222/json/close/$DT_ID" > /dev/null
+  agent-browser close --all && agent-browser --cdp 9222 get url   # expect app://renderer/
+  ```
+
 - **Don't resize the Electron window after load** — resizing triggers full SPA reload
 - **Store is at `window.__LOBE_STORES`** not `window.__ZUSTAND_STORES__`
+- **Streaming / ticking UI needs GIF evidence** — see `scripts/record-gif.sh`; a static screenshot cannot prove time-based behavior.