mirror of
https://github.com/lobehub/lobe-chat.git
synced 2026-06-14 03:30:19 +00:00
📝 docs(agent-testing): require inline visual evidence (#15750)
This commit is contained in:
@@ -192,6 +192,10 @@ Two hard rules worth front-loading:
|
|||||||
`# | case | result | key observation | evidence` shape and embed the
|
`# | case | result | key observation | evidence` shape and embed the
|
||||||
screenshot/GIF in the evidence cell. Use separate evidence sections only for
|
screenshot/GIF in the evidence cell. Use separate evidence sections only for
|
||||||
long CLI transcripts, HAR summaries, or supplemental detail.
|
long CLI transcripts, HAR summaries, or supplemental detail.
|
||||||
|
- **Visual evidence must render inline.** Screenshots and GIFs in `report.md`
|
||||||
|
must use Markdown image syntax like ``. Do not
|
||||||
|
use bare file paths, Markdown links, or local file links as the primary
|
||||||
|
visual evidence; those make the report unreadable without opening each asset.
|
||||||
- **Time-based behavior needs a GIF, not a screenshot.** If a case asserts
|
- **Time-based behavior needs a GIF, not a screenshot.** If a case asserts
|
||||||
change over time (streaming output, a ticking timer, loading states,
|
change over time (streaming output, a ticking timer, loading states,
|
||||||
animations), record it with `scripts/record-gif.sh` and embed the GIF —
|
animations), record it with `scripts/record-gif.sh` and embed the GIF —
|
||||||
|
|||||||
@@ -34,6 +34,7 @@ output):
|
|||||||
- UI (static state): `agent-browser screenshot` or `capture-app-window.sh`,
|
- UI (static state): `agent-browser screenshot` or `capture-app-window.sh`,
|
||||||
then **verify the screenshot with the Read tool before citing it** —
|
then **verify the screenshot with the Read tool before citing it** —
|
||||||
never cite an image you haven't looked at.
|
never cite an image you haven't looked at.
|
||||||
|
|
||||||
- UI (time-based behavior): **screenshot vs GIF is a judgment you must
|
- UI (time-based behavior): **screenshot vs GIF is a judgment you must
|
||||||
make per case.** If the assertion is about change over time — streaming
|
make per case.** If the assertion is about change over time — streaming
|
||||||
output, a ticking timer, loading/progress states, animations,
|
output, a ticking timer, loading/progress states, animations,
|
||||||
@@ -52,12 +53,15 @@ output):
|
|||||||
at least the first/last frames visually (Read the GIF) before citing.
|
at least the first/last frames visually (Read the GIF) before citing.
|
||||||
|
|
||||||
- CLI: exact command + trimmed output (`$CLI task list | tee "$DIR/assets/task-list.txt"`).
|
- CLI: exact command + trimmed output (`$CLI task list | tee "$DIR/assets/task-list.txt"`).
|
||||||
|
|
||||||
- Network: `agent-browser network requests` dumps or HAR files.
|
- Network: `agent-browser network requests` dumps or HAR files.
|
||||||
|
|
||||||
3. **Fill `report.md` as you go** — don't reconstruct from memory at the end.
|
3. **Fill `report.md` as you go** — don't reconstruct from memory at the end.
|
||||||
The primary evidence belongs in the case table itself: each row should pair
|
The primary evidence belongs in the case table itself: each row should pair
|
||||||
the assertion with the screenshot/GIF/link that proves it, so readers can
|
the assertion with the screenshot/GIF or non-visual artifact that proves it,
|
||||||
scan the result without jumping between sections.
|
so readers can scan the result without jumping between sections. UI evidence
|
||||||
|
must render inline with Markdown image syntax; a plain link or file path is
|
||||||
|
not acceptable as primary visual evidence.
|
||||||
|
|
||||||
4. **Set the verdict** in both `report.md` and `result.json`, then link the
|
4. **Set the verdict** in both `report.md` and `result.json`, then link the
|
||||||
report directory in your final answer to the user.
|
report directory in your final answer to the user.
|
||||||
@@ -96,6 +100,27 @@ scenario or regression assertion, and put the screenshot/GIF directly in the
|
|||||||
| 2 | Respect requested length | fail | Requested about 600 Chinese characters; final body was about 1286 |  |
|
| 2 | Respect requested length | fail | Requested about 600 Chinese characters; final body was about 1286 |  |
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Inline visual evidence
|
||||||
|
|
||||||
|
Screenshots and GIFs must be embedded so the report shows the image inline:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|

|
||||||
|

|
||||||
|
```
|
||||||
|
|
||||||
|
Do **not** use these as the primary evidence for UI cases:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
[case 1 result](assets/case1-result.png)
|
||||||
|
assets/case1-result.png
|
||||||
|
file:///tmp/case1-result.png
|
||||||
|
```
|
||||||
|
|
||||||
|
Links are acceptable for non-visual artifacts such as CLI transcripts, HAR
|
||||||
|
files, or long logs. For videos, embed a representative screenshot/GIF inline in
|
||||||
|
the case row and link the full video as supplemental evidence.
|
||||||
|
|
||||||
Avoid the old wide table with separate `steps`, `expected`, and `actual`
|
Avoid the old wide table with separate `steps`, `expected`, and `actual`
|
||||||
columns unless the test is purely non-visual and truly needs that breakdown.
|
columns unless the test is purely non-visual and truly needs that breakdown.
|
||||||
For UI reports, those columns make screenshot-backed reading harder. Put
|
For UI reports, those columns make screenshot-backed reading harder. Put
|
||||||
@@ -104,9 +129,10 @@ interpretation of the result.
|
|||||||
|
|
||||||
Use an extra evidence/detail section only when the inline table cannot carry
|
Use an extra evidence/detail section only when the inline table cannot carry
|
||||||
the material cleanly, such as long CLI transcripts, HAR summaries, or multiple
|
the material cleanly, such as long CLI transcripts, HAR summaries, or multiple
|
||||||
screenshots for one case. In that situation, keep the table evidence cell as a
|
screenshots for one case. In that situation, keep the table evidence cell as an
|
||||||
short inline proof or link, then put the longer material under `Verification`
|
inline visual proof for UI cases or a concise link for non-visual artifacts,
|
||||||
or a brief `Additional Evidence` section.
|
then put the longer material under `Verification` or a brief
|
||||||
|
`Additional Evidence` section.
|
||||||
|
|
||||||
Status values: `pass` / `fail` / `blocked` (couldn't run — e.g. auth or env
|
Status values: `pass` / `fail` / `blocked` (couldn't run — e.g. auth or env
|
||||||
missing; a blocked case is not a pass).
|
missing; a blocked case is not a pass).
|
||||||
@@ -147,7 +173,8 @@ word the user reads first: `pass`, `fail`, or `partial`.
|
|||||||
## Rules
|
## Rules
|
||||||
|
|
||||||
- **No evidence, no claim** — every `pass`/`fail` in the case table must link
|
- **No evidence, no claim** — every `pass`/`fail` in the case table must link
|
||||||
at least one asset.
|
at least one asset. UI cases must inline-embed their primary screenshot/GIF;
|
||||||
|
non-visual CLI/network cases may link transcripts, HAR files, or logs.
|
||||||
- **Screenshots must be visually verified** with the Read tool before being
|
- **Screenshots must be visually verified** with the Read tool before being
|
||||||
cited.
|
cited.
|
||||||
- **Report failures faithfully** — a failing case with clear evidence is a good
|
- **Report failures faithfully** — a failing case with clear evidence is a good
|
||||||
|
|||||||
Reference in New Issue
Block a user