From eca449e4e2256f1f128f857599850d1b93e917cc Mon Sep 17 00:00:00 2001 From: Arvin Xu Date: Thu, 11 Jun 2026 23:52:25 +0800 Subject: [PATCH] =?UTF-8?q?=E2=9C=A8=20feat(skills):=20agent-testing=20ite?= =?UTF-8?q?ration=20after=20first=20real-world=20run=20(#15700)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * ๐Ÿ“ docs(skills): make agent-testing Step 0 an env-setup + auth checklist Co-Authored-By: Claude Fable 5 * โœจ feat(skills): agent-testing probes, GIF evidence, and report-language rule Co-Authored-By: Claude Fable 5 --------- Co-authored-by: Claude Fable 5 --- .agents/skills/agent-testing/SKILL.md | 93 ++++++++++++++---- .agents/skills/agent-testing/cli/index.md | 2 +- .../skills/agent-testing/references/auth.md | 12 ++- .../skills/agent-testing/references/report.md | 34 ++++++- .../skills/agent-testing/scripts/app-probe.sh | 95 +++++++++++++++++++ .../agent-testing/scripts/record-gif.sh | 61 ++++++++++++ .../agent-testing/scripts/setup-auth.sh | 21 ++++ .agents/skills/agent-testing/ui/electron.md | 42 ++++++++ 8 files changed, 333 insertions(+), 27 deletions(-) create mode 100755 .agents/skills/agent-testing/scripts/app-probe.sh create mode 100755 .agents/skills/agent-testing/scripts/record-gif.sh diff --git a/.agents/skills/agent-testing/SKILL.md b/.agents/skills/agent-testing/SKILL.md index 11bd3581cd..e9d4522c64 100644 --- a/.agents/skills/agent-testing/SKILL.md +++ b/.agents/skills/agent-testing/SKILL.md @@ -19,25 +19,60 @@ also run as full cloud automation. Every test session follows the same four-step contract: ``` -Step 0: Auth ready? โ†’ Step 1: Pick surface โ†’ Step 2: Run โ†’ Step 3: Structured report +Step 0: Env + Auth โ†’ Step 1: Pick surface โ†’ Step 2: Run โ†’ Step 3: Structured report ``` -## Step 0 โ€” Auth first (mandatory) +## Step 0 โ€” Environment setup + auth check (mandatory) -**Auth is the gate for all automated testing.** Prepare and verify it BEFORE -writing a single test step โ€” a half-finished test run that dies on a login wall -wastes the whole session. +Step 0 is about getting the environment ready: **dependencies are healthy** +and **auth is green**. A test run that dies halfway on a missing dependency or +a login wall wastes the whole session โ€” clear both gates BEFORE writing a +single test step. + +### 0.1 Dependencies are installed โ€” root AND standalone apps + +The root pnpm workspace does **NOT** cover every app: `pnpm-workspace.yaml` +lists `packages/**`, `e2e`, `apps/server`, and only `apps/desktop/src/main` โ€” +**`apps/desktop` and `apps/cli` are standalone**, each keeping its own +`node_modules` with its own links into `packages/`. A root install does not +refresh them, so install in every app the test will touch: + +```bash +pnpm install # root workspace +cd apps/desktop && pnpm install # Electron surface +cd apps/cli && pnpm install # CLI surface +``` + +Symptom of a stale standalone install: the build/launch fails to resolve a +recently added workspace package โ€” `Rolldown failed to resolve import +"@lobechat/"` (Electron) or `Cannot find module '@lobechat/'` (CLI). + +### 0.2 Run scripts from the repo root + +All paths in this skill (`./.agents/skills/agent-testing/...`) are +repo-root-relative, and background commands inherit the current working +directory โ€” a script launched while `cwd` is `apps/desktop` fails with +`No such file or directory`. Verify `pwd` is the repo root before launching +long-running scripts. + +### 0.3 Auth is green + +**Auth is the gate for all automated testing.** ```bash ./.agents/skills/agent-testing/scripts/setup-auth.sh status ``` -| Surface | Mechanism | One-key path | Human needed? | -| -------- | ------------------------------------------------- | ------------------------------ | ----------------------------- | -| CLI | OIDC Device Code Flow (`apps/cli/.lobehub-dev`) | `setup-auth.sh cli` | Yes โ€” browser authorization | -| Web | better-auth cookie injection into `agent-browser` | `pbpaste \| setup-auth.sh web` | Copy cookie once per rotation | -| Electron | App's own persistent login state | Log in once in the app | Once | -| Bot | Native apps already logged in | โ€” | Once per app | +| Surface | Mechanism | One-key path | Standard check | +| -------- | ------------------------------------------------- | ------------------------------ | ------------------------------ | +| CLI | OIDC Device Code Flow (`apps/cli/.lobehub-dev`) | `setup-auth.sh cli` | `setup-auth.sh status` | +| Web | better-auth cookie injection into `agent-browser` | `pbpaste \| setup-auth.sh web` | `setup-auth.sh web-verify` | +| Electron | App's own persistent login state | Log in once in the app | `app-probe.sh auth` | +| Bot | Native apps already logged in | โ€” | per-platform screenshot | + +Login-state checks are standardized โ€” do NOT hand-roll `window.__LOBE_STORES` +eval snippets; use `scripts/app-probe.sh auth` (returns `{ isSignedIn, userId }`, +works for Electron CDP and web sessions via `AB_TARGET`). If `status` is not all green, fix auth first (the steps that need a human must be requested from the user explicitly). Full background and failure modes: @@ -113,15 +148,23 @@ Surface guides above carry the detailed workflows. Shared infrastructure: All under `.agents/skills/agent-testing/scripts/`: -| Script | Usage | -| ------------------------- | ------------------------------------------------------------- | -| `setup-auth.sh` | One-stop auth setup & status check (`status` / `cli` / `web`) | -| `report-init.sh` | Scaffold a structured test report (Step 3) | -| `electron-dev.sh` | Manage Electron dev env (start/stop/status/restart, CDP 9222) | -| `capture-app-window.sh` | Screenshot a specific app window (general; used by bot tests) | -| `record-app-screen.sh` | Record app screen (video + periodic screenshots) | -| `record-electron-demo.sh` | Record Electron app demo with ffmpeg | -| `agent-gateway/` | Gateway probe / dump / analyze tools | +| Script | Usage | +| ------------------------- | ------------------------------------------------------------------------------ | +| `setup-auth.sh` | One-stop auth setup & status check (`status` / `cli` / `web`) | +| `app-probe.sh` | LobeHub app probes: `auth` / `route` / `ops` / `goto ` / `errors` | +| `record-gif.sh` | Frame-sequence โ†’ GIF for time-based behavior (streaming, timers, animations) | +| `report-init.sh` | Scaffold a structured test report (Step 3) | +| `electron-dev.sh` | Manage Electron dev env (start/stop/status/restart, CDP 9222) | +| `capture-app-window.sh` | Screenshot a specific app window (general; used by bot tests) | +| `record-app-screen.sh` | Record app screen (video + periodic screenshots) | +| `record-electron-demo.sh` | Record Electron app demo with ffmpeg | +| `agent-gateway/` | Gateway probe / dump / analyze tools | + +`app-probe.sh` is the LobeHub-specific fast path into app state โ€” auth check, +current route, running operations, and `goto ` quick navigation +(`/agent//`, `/task/`, `/settings`, โ€ฆ) so a test can +jump straight to the state under test instead of clicking through the UI. See +[ui/electron.md](./ui/electron.md#lobehub-probes--quick-navigation) for usage. ## Step 3 โ€” Structured report (mandatory deliverable) @@ -139,6 +182,16 @@ Reports live in `.records/reports/-/` (gitignored): `report.md` pass/fail + score), `assets/` (evidence). Format spec and evidence rules: [references/report.md](./references/report.md). +Two hard rules worth front-loading: + +- **Report language = the user's conversation language.** Write the ENTIRE + `report.md` (headings included) in the language the user is conversing in โ€” + no mixed English. `result.json` keys/status values stay English. +- **Time-based behavior needs a GIF, not a screenshot.** If a case asserts + change over time (streaming output, a ticking timer, loading states, + animations), record it with `scripts/record-gif.sh` and embed the GIF โ€” + a static screenshot cannot prove the behavior. + ## Directory map ``` diff --git a/.agents/skills/agent-testing/cli/index.md b/.agents/skills/agent-testing/cli/index.md index e7088c43b2..d829452b55 100644 --- a/.agents/skills/agent-testing/cli/index.md +++ b/.agents/skills/agent-testing/cli/index.md @@ -16,7 +16,7 @@ flakiness. | Requirement | Details | | ------------ | --------------------------------------------------------------------------------- | | Dev server | `localhost:3010` โ€” see [../references/dev-server.md](../references/dev-server.md) | -| CLI source | `apps/cli/` โ€” runs from source, no rebuild needed | +| CLI source | `apps/cli/` โ€” runs from source, no rebuild; standalone `node_modules` โ€” run `pnpm install` inside `apps/cli/` (root install does not cover it) | | CLI dev mode | `LOBEHUB_CLI_HOME=.lobehub-dev` for isolated credentials | | Auth | Device Code Flow login โ€” see [../references/auth.md](../references/auth.md) | diff --git a/.agents/skills/agent-testing/references/auth.md b/.agents/skills/agent-testing/references/auth.md index 8e13c50717..90077aa62d 100644 --- a/.agents/skills/agent-testing/references/auth.md +++ b/.agents/skills/agent-testing/references/auth.md @@ -101,8 +101,16 @@ agent-browser --session lobehub-dev snapshot -i | head -20 The desktop app keeps its own persistent login state in its user-data directory โ€” log in once manually inside the app and it survives restarts of -`electron-dev.sh`. No injection needed; just confirm with a snapshot that the -app is past the signin screen before running UI tests. +`electron-dev.sh`. No injection needed. The standard check (do NOT hand-roll a +store eval) once Electron is up with CDP: + +```bash +./.agents/skills/agent-testing/scripts/app-probe.sh auth +# โ†’ {"ok":true,"isSignedIn":true,"userId":"user_xxx"} +``` + +`setup-auth.sh status` runs this probe automatically when CDP 9222 is +reachable. ## Scope diff --git a/.agents/skills/agent-testing/references/report.md b/.agents/skills/agent-testing/references/report.md index 24064d7fc8..e427536b8e 100644 --- a/.agents/skills/agent-testing/references/report.md +++ b/.agents/skills/agent-testing/references/report.md @@ -29,9 +29,25 @@ output): 2. **Collect evidence as you test** โ€” every asserted behavior gets one evidence item in `$DIR/assets/`: - - UI: `agent-browser screenshot` or `capture-app-window.sh`, then **verify - the screenshot with the Read tool before citing it** โ€” never cite an - image you haven't looked at. + - UI (static state): `agent-browser screenshot` or `capture-app-window.sh`, + then **verify the screenshot with the Read tool before citing it** โ€” + never cite an image you haven't looked at. + - UI (time-based behavior): **screenshot vs GIF is a judgment you must + make per case.** If the assertion is about change over time โ€” streaming + output, a ticking timer, loading/progress states, animations, + appear/disappear transitions โ€” a static screenshot cannot prove it. + Record a frame sequence and synthesize a GIF: + + ```bash + # start recording (background), trigger the behavior, wait for it to finish + ../scripts/record-gif.sh "$DIR/assets/case2-streaming.gif" 12 2 & + GIF_PID=$! + # ... drive the scenario ... + wait $GIF_PID + ``` + + Embed it like an image: `![case 2](assets/case2-streaming.gif)`. Verify + at least the first/last frames visually (Read the GIF) before citing. - CLI: exact command + trimmed output (`$CLI task list | tee "$DIR/assets/task-list.txt"`). - Network: `agent-browser network requests` dumps or HAR files. @@ -40,6 +56,16 @@ output): 4. **Set the verdict** in both `report.md` and `result.json`, then link the report directory in your final answer to the user. +## Report language (hard rule) + +**`report.md` MUST be written in the language the user is conversing in** โ€” +the whole file, headings included. If the conversation is in Chinese, the +report is in Chinese; do not mix English prose into it. The scaffold's English +headings are placeholders โ€” translate them when filling. Exceptions that stay +as-is: code/commands, identifiers, log excerpts, and `result.json` (its keys +and status values are machine-read and stay English; the `title` and case +`name` fields follow the user's language). + ## report.md sections | Section | Content | @@ -47,7 +73,7 @@ output): | **Scope** | What changed / what is being verified; branch + commit | | **Environment** | Server URL, surfaces used (cli / electron / web / bot), relevant versions | | **Cases** | Table: `# \| case \| surface \| steps \| expected \| actual \| status \| evidence` | -| **Evidence** | Embedded screenshots (`![case 1](assets/case1.png)`), fenced CLI transcripts | +| **Evidence** | Embedded screenshots/GIFs (`![case 1](assets/case1.png)`), fenced CLI transcripts | | **Verdict** | Pass/fail/blocked counts, optional 0โ€“100 score, open issues / follow-ups | Status values: `pass` / `fail` / `blocked` (couldn't run โ€” e.g. auth or env diff --git a/.agents/skills/agent-testing/scripts/app-probe.sh b/.agents/skills/agent-testing/scripts/app-probe.sh new file mode 100755 index 0000000000..e98d151bfe --- /dev/null +++ b/.agents/skills/agent-testing/scripts/app-probe.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash +# app-probe.sh โ€” standardized probes for a running LobeHub app (Electron via +# CDP, or a web agent-browser session). Use these instead of hand-rolling +# `window.__LOBE_STORES` eval snippets โ€” especially the auth check. +# +# Usage: +# app-probe.sh auth # { isSignedIn, userId } from the user store +# app-probe.sh route # current SPA route +# app-probe.sh ops # running chat operations (type / status / startTime) +# app-probe.sh goto # navigate the SPA to a route (full reload), e.g. goto /agent/agt_xxx +# app-probe.sh errors-install # install a console.error interceptor +# app-probe.sh errors # dump errors captured since errors-install +# +# Target selection (default: Electron over CDP 9222): +# AB_TARGET="--cdp 9222" # Electron (default; CDP_PORT also honored) +# AB_TARGET="--session lobehub-dev" # web agent-browser session +# +# Common routes (desktop SPA): / /agent/ /agent// +# /task /task/ /page /settings /community + +set -euo pipefail + +AB_TARGET="${AB_TARGET:---cdp ${CDP_PORT:-9222}}" + +run_eval() { + # shellcheck disable=SC2086 + agent-browser $AB_TARGET eval --stdin +} + +case "${1:-}" in + auth) + run_eval << 'EVALEOF' +(function () { + var stores = window.__LOBE_STORES; + if (!stores || !stores.user) return JSON.stringify({ ok: false, reason: 'no user store โ€” app not loaded yet?' }); + var u = stores.user(); + return JSON.stringify({ ok: !!u.isSignedIn, isSignedIn: !!u.isSignedIn, userId: (u.user && u.user.id) || null }); +})() +EVALEOF + ;; + route) + run_eval << 'EVALEOF' +location.pathname + location.search + location.hash +EVALEOF + ;; + ops) + run_eval << 'EVALEOF' +(function () { + var stores = window.__LOBE_STORES; + if (!stores || !stores.chat) return JSON.stringify({ ok: false, reason: 'no chat store โ€” open a conversation first' }); + var ops = Object.values(stores.chat().operations || {}); + var running = ops.filter(function (o) { return o.status === 'running'; }); + return JSON.stringify({ + ok: true, + running: running.map(function (o) { return { startTime: o.metadata && o.metadata.startTime, type: o.type }; }), + runningCount: running.length, + total: ops.length, + }); +})() +EVALEOF + ;; + goto) + TARGET_PATH="${2:?Usage: app-probe.sh goto }" + # shellcheck disable=SC2086 + agent-browser $AB_TARGET eval "location.href = '$TARGET_PATH'" > /dev/null + sleep 2 + bash "${BASH_SOURCE[0]}" route + ;; + errors-install) + run_eval << 'EVALEOF' +(function () { + window.__CAPTURED_ERRORS = []; + var orig = console.error; + console.error = function () { + var msg = Array.from(arguments).map(function (a) { + if (a instanceof Error) return a.message; + return typeof a === 'object' ? JSON.stringify(a) : String(a); + }).join(' '); + window.__CAPTURED_ERRORS.push(msg); + orig.apply(console, arguments); + }; + return 'installed'; +})() +EVALEOF + ;; + errors) + run_eval << 'EVALEOF' +JSON.stringify(window.__CAPTURED_ERRORS || 'interceptor not installed โ€” run errors-install first') +EVALEOF + ;; + *) + echo "Usage: $0 {auth|route|ops|goto |errors-install|errors}" >&2 + exit 2 + ;; +esac diff --git a/.agents/skills/agent-testing/scripts/record-gif.sh b/.agents/skills/agent-testing/scripts/record-gif.sh new file mode 100755 index 0000000000..07bcc26c54 --- /dev/null +++ b/.agents/skills/agent-testing/scripts/record-gif.sh @@ -0,0 +1,61 @@ +#!/usr/bin/env bash +# record-gif.sh โ€” capture a frame sequence via agent-browser (CDP) and +# synthesize a GIF for embedding in a test report. +# +# Use this whenever the asserted behavior is about CHANGE OVER TIME โ€” +# streaming output, a ticking timer, loading states, animations. A static +# screenshot cannot prove those; a GIF can. Cloud-portable: frames come from +# CDP rendering, no OS-level screen capture. +# +# Usage: +# record-gif.sh [fps] +# +# AB_TARGET="--cdp 9222" # Electron (default; CDP_PORT honored) +# AB_TARGET="--session lobehub-dev" # web agent-browser session +# GIF_WIDTH=960 # output width (px), default 960 +# +# Requires ffmpeg (`brew install ffmpeg`). Effective fps is capped by +# screenshot latency (~0.3-0.5s per frame); 1-2 fps is the realistic range. +# +# Example โ€” record a 12s run and embed it in the report: +# ./record-gif.sh "$DIR/assets/case2-tray-running.gif" 12 2 & +# GIF_PID=$! +# # ... trigger the streaming behavior ... +# wait $GIF_PID + +set -euo pipefail + +OUT="${1:?Usage: record-gif.sh [fps]}" +DUR="${2:?Usage: record-gif.sh [fps]}" +FPS="${3:-2}" +AB_TARGET="${AB_TARGET:---cdp ${CDP_PORT:-9222}}" +GIF_WIDTH="${GIF_WIDTH:-960}" + +command -v ffmpeg > /dev/null || { + echo "ffmpeg not found โ€” install with: brew install ffmpeg" >&2 + exit 1 +} + +TMP=$(mktemp -d) +trap 'rm -rf "$TMP"' EXIT + +FRAMES=$((DUR * FPS)) +INTERVAL=$(python3 -c "print(1 / $FPS)") + +for i in $(seq -f '%04g' 1 "$FRAMES"); do + # shellcheck disable=SC2086 + agent-browser $AB_TARGET screenshot "$TMP/frame-$i.png" > /dev/null 2>&1 || true + sleep "$INTERVAL" +done + +CAPTURED=$(find "$TMP" -name 'frame-*.png' | wc -l | tr -d ' ') +[ "$CAPTURED" -gt 0 ] || { + echo "no frames captured โ€” is the app reachable via $AB_TARGET?" >&2 + exit 1 +} + +ffmpeg -y -loglevel error -framerate "$FPS" -pattern_type glob -i "$TMP/frame-*.png" \ + -vf "fps=$FPS,scale=$GIF_WIDTH:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" \ + "$OUT" + +echo "$OUT ($CAPTURED frames @ ${FPS}fps)" diff --git a/.agents/skills/agent-testing/scripts/setup-auth.sh b/.agents/skills/agent-testing/scripts/setup-auth.sh index 29723e77dd..8ac6afab93 100755 --- a/.agents/skills/agent-testing/scripts/setup-auth.sh +++ b/.agents/skills/agent-testing/scripts/setup-auth.sh @@ -63,12 +63,33 @@ check_web() { fi } +check_electron() { + local cdp_port="${CDP_PORT:-9222}" + if ! curl -s -o /dev/null --max-time 2 "http://localhost:$cdp_port/json/version" 2> /dev/null; then + note "electron: not running (CDP $cdp_port unreachable) โ€” start with electron-dev.sh; check skipped" + return 0 + fi + local probe result + probe="$(dirname "${BASH_SOURCE[0]}")/app-probe.sh" + result=$(bash "$probe" auth 2> /dev/null || true) + # agent-browser eval returns the JSON string with escaped quotes โ€” normalize. + result="${result//\\/}" + if [[ "$result" == *'"isSignedIn":true'* ]]; then + ok "electron app signed in ($result)" + else + bad "electron app NOT signed in ($result)" + note "log in once manually inside the app (state persists across restarts)" + return 1 + fi +} + cmd_status() { echo "agent-testing auth status (SERVER_URL=$SERVER_URL):" local rc=0 check_server || rc=1 check_cli || rc=1 check_web || rc=1 + check_electron || rc=1 if [[ $rc -eq 0 ]]; then echo "all green โ€” safe to start automated testing." else diff --git a/.agents/skills/agent-testing/ui/electron.md b/.agents/skills/agent-testing/ui/electron.md index 3c1f864d6d..667dc1e952 100644 --- a/.agents/skills/agent-testing/ui/electron.md +++ b/.agents/skills/agent-testing/ui/electron.md @@ -39,6 +39,39 @@ After `start` succeeds, connect with: `agent-browser --cdp 9222 snapshot -i` | `ELECTRON_WAIT_S` | `60` | Max seconds to wait for Electron process | | `RENDERER_WAIT_S` | `60` | Max seconds to wait for SPA to load | +### LobeHub Probes & Quick Navigation + +`scripts/app-probe.sh` is the standard fast path into app state โ€” **use it +instead of hand-rolling `__LOBE_STORES` eval snippets** for these common needs: + +```bash +PROBE=".agents/skills/agent-testing/scripts/app-probe.sh" + +$PROBE auth # login check (Step 0.3) โ†’ { isSignedIn, userId } +$PROBE route # current SPA route +$PROBE ops # running chat operations (type / startTime) +$PROBE goto /settings # jump the SPA straight to a route (full reload) +$PROBE errors-install # install console.error interceptor +$PROBE errors # dump captured errors +``` + +`goto` lets a test enter the state under test directly instead of clicking +through the UI. Common desktop routes: + +| Route | Where it lands | +| ----------------------------- | ------------------------------------ | +| `/` | Home (has a chat input) | +| `/agent/` | Agent conversation (latest topic) | +| `/agent//` | Specific topic in a conversation | +| `/task` ยท `/task/` | Task list / task detail | +| `/page` | Documents (ๆ–‡็จฟ) | +| `/settings` | Settings | +| `/community` | Discover / community | + +Targets default to Electron (`--cdp 9222`); set `AB_TARGET="--session "` +for web sessions. For deeper or one-off state inspection, fall back to raw +eval below. + ### LobeHub-Specific Patterns #### Access Zustand Store State @@ -108,5 +141,14 @@ agent-browser --cdp 9222 eval "JSON.stringify(window.__CAPTURED_ERRORS)" - **Always use `electron-dev.sh stop` to clean up** โ€” `pkill -f "Electron"` only kills the main process; helper processes (GPU, renderer, network) survive. The script finds and kills all of them via PID matching against the project's electron binary path. - **`npx electron-vite dev` must run from `apps/desktop/`** โ€” running from project root fails silently. The `electron-dev.sh` script handles this automatically. +- **Dev build auto-opens DevTools, which hijacks the CDP target** โ€” `agent-browser --cdp 9222` may attach to the DevTools page (`devtools://โ€ฆ`) instead of the app (`app://renderer/`). Symptom: `get url` returns a `devtools://` URL. Fix: close the DevTools target and reconnect: + + ```bash + DT_ID=$(curl -s http://localhost:9222/json/list | python3 -c "import json,sys; ts=json.load(sys.stdin); print(next(t['id'] for t in ts if t['type']=='page' and t['url'].startswith('devtools://')))") + curl -s "http://localhost:9222/json/close/$DT_ID" > /dev/null + agent-browser close --all && agent-browser --cdp 9222 get url # expect app://renderer/ + ``` + - **Don't resize the Electron window after load** โ€” resizing triggers full SPA reload - **Store is at `window.__LOBE_STORES`** not `window.__ZUSTAND_STORES__` +- **Streaming / ticking UI needs GIF evidence** โ€” see `scripts/record-gif.sh`; a static screenshot cannot prove time-based behavior.