Compare commits

...

1 Commits

Author SHA1 Message Date
Innei d7b87af8ad 🔨 chore: add agent-evals skill to .agents/skills
Made-with: Cursor
2026-04-07 19:42:45 +08:00
4 changed files with 483 additions and 0 deletions
+119
View File
@@ -0,0 +1,119 @@
---
name: agent-evals
description: "Use when running agent evals, iterating on prompts to improve pass rates, comparing models, or publishing eval results to Linear. Triggers on 'agent-evals', 'eval', 'run eval', 'compare models', 'optimize prompt', 'eval baseline', 'prompt iteration', 'why is eval failing', 'run boundary cases', 'bot test'."
---
# Agent Evals
## Overview
`devtools/agent-evals/` runs real agent executions against PGlite for end-to-end testing and model comparison. No vitest, no mocks — real DB, real LLM, real tool execution.
See [cli-reference.md](references/cli-reference.md) for full CLI commands, ScenarioConfig type, MCP/bot/matrix setup, and assertion API.
See [model-ranking.md](references/model-ranking.md) for current model tier rankings per scenario.
See [linear-workflow.md](references/linear-workflow.md) for how to write, publish, and follow up eval results on Linear.
## When to Use
- Running or creating eval scenarios for agent behavior
- Iterating on prompts (systemRole, toolSystemRole, context engine) to improve pass rates
- Comparing model performance across scenarios
- Debugging why an eval case fails
- Publishing eval results to Linear
**When NOT to use:**
- Unit testing individual functions → use Vitest
- Manual QA in browser → use dev server directly
- Testing non-agent features (UI, API routes)
## Quick Start
```bash
bun run agent-evals run web-onboarding-v3 --case-id fe-intj-crud-v1 --no-matrix --model gpt-5.4-mini
bun run agent-evals run web-onboarding-v3 --all-cases
bun run agent-evals list
```
Use `--no-matrix` for fast single-model iteration. Enable matrix only for final validation.
## Eval Iteration Workflow
```dot
digraph eval_iteration {
rankdir=TB;
"Run eval cases" -> "All pass?";
"All pass?" -> "Update model-ranking.md" [label="yes"];
"Update model-ranking.md" -> "Publish to Linear";
"All pass?" -> "Diagnose root cause" [label="no"];
"Diagnose root cause" -> "Fix prompt or injection";
"Fix prompt or injection" -> "Re-run SAME cases";
"Re-run SAME cases" -> "Run baseline (regression)";
"Run baseline (regression)" -> "All pass?";
}
```
### Diagnose Root Cause
Classify failures into three layers:
**Layer 1 — Prompt issue (systemRole / toolSystemRole)**
- Symptoms: Tool calls happen but wrong order, missing specific ones, ignores completion signals
- Fix: `lobehub/packages/builtin-agent-onboarding/src/systemRole.ts` or `toolSystemRole.ts`
**Layer 2 — System injection issue (context engine)**
- Symptoms: Model never calls a tool despite prompt telling it to, phase stuck, `<next_actions>` never instructs the right action
- Fix: `lobehub/packages/context-engine/src/providers/OnboardingActionHintInjector.ts`
- Key: `<next_actions>` is the ONLY thing that tells the model which tools to call each turn. If a tool is never mentioned for the current phase, the model will never call it.
**Layer 3 — Model capability issue**
- Symptoms: Zero tool calls, ignores `<next_actions>` entirely, pure text despite tool availability, prompt changes have no effect
- Fix: Switch model. No prompt fix compensates for model inability in long context.
### Fix and Re-run
1. Fix the root cause
2. Re-run the **exact same failing cases**
3. Run **baseline cases** for regression check:
```bash
bun run agent-evals run web-onboarding-v3 \
--case-id fe-intj-crud-v1,pm-enfp-collab-v1,be-istp-reliability-v1,da-intj-automation-en-v1,designer-infp-creative-ja-v1 \
--no-matrix --model gpt-5.4-mini
```
### Update Model Ranking
After all cases pass (or after a matrix run), update [model-ranking.md](references/model-ranking.md):
- Add/update the ranking table under the scenario section
- Update the date
- Add new baseline cases if introduced
### Publish to Linear
See [linear-workflow.md](references/linear-workflow.md) for the full workflow: issue structure, publishing commands, and follow-up steps.
## Key Files
| File | Role |
| ------------------------------------------------------- | ----------------------------------- |
| `devtools/agent-evals/scenarios/*.ts` | Scenario configs + assertions |
| `devtools/agent-evals/datasets/onboarding/golden-v1.ts` | Test cases (baseline + extreme) |
| `lobehub/.../systemRole.ts` | Conversation flow prompt |
| `lobehub/.../toolSystemRole.ts` | Tool usage rules prompt |
| `lobehub/.../OnboardingActionHintInjector.ts` | Per-turn `<next_actions>` injection |
| `lobehub/src/server/services/onboarding/index.ts` | Phase derivation (`derivePhase`) |
## Common Mistakes
| Mistake | Fix |
| -------------------------------------- | ---------------------------------------------------------------- |
| Only change prompt wording | Check if `<next_actions>` even mentions the tool for that phase |
| Skip baseline regression check | Edge case fixes can break happy path |
| Compare across different judge models | gpt-4o-mini scores ≠ gpt-5.4-mini scores — always use same judge |
| Run full matrix during iteration | `--no-matrix` + single model for speed; matrix for final only |
| Assume prompt fix works for all models | Test at least 2 models |
@@ -0,0 +1,152 @@
# Agent Evals CLI Reference
## Directory Structure
```
devtools/agent-evals/
├── cli.ts # CLI entry (#!/usr/bin/env bun)
├── package.json # @cloud/agent-evals
├── types.ts # ScenarioConfig, McpToolConfig, BotContext, ModelVariant
├── helpers/
│ ├── env.ts # PGlite init, user/agent creation, MCP registration
│ ├── runner.ts # runAgent(), runMatrix()
│ ├── snapshot.ts # SnapshotHandle assertion API
│ ├── tracing.ts # StepLifecycleCallbacks → ExecutionSnapshot collector
│ ├── mcp.ts # MCP server discovery + DB registration
│ ├── claude-credentials.ts # Read OAuth tokens from Claude Code keychain
│ └── compare.ts # Matrix comparison table renderer
└── scenarios/ # Scenario files (export default ScenarioConfig)
```
## CLI Commands
```bash
# Run a specific scenario
bun run agent-evals run basic-chat
# Override model/provider
bun run agent-evals run bot-discord --model deepseek-chat
bun run agent-evals run basic-chat --model claude-sonnet-4-20250514 --provider openai
# Inline prompt
bun run agent-evals run --prompt "What is 2+2?" --model gpt-4o-mini
# Model matrix (comma-separated, supports model@provider)
bun run agent-evals run --prompt "Hello" --matrix "gpt-4o-mini@openai,deepseek-chat@openai"
# Dataset cases
bun run agent-evals run web-onboarding-v3 --all-cases
bun run agent-evals run web-onboarding-v3 --case-id fe-intj-crud-v1
bun run agent-evals run web-onboarding-v3 --all-cases --sample-cases 2 --seed 7
# Disable scenario matrix
bun run agent-evals run web-onboarding-v3 --no-matrix --model gpt-5.4-mini
# Run all scenarios
bun run agent-evals run --all
# List scenarios
bun run agent-evals list
```
## ScenarioConfig Type
```typescript
interface ScenarioConfig {
name: string;
description?: string;
prompt: string;
agent: {
model?: string; // default: gpt-4o-mini
provider?: string; // default: openai
systemRole?: string;
plugins?: string[];
mcpServers?: McpToolConfig[];
};
bot?: BotContext; // Simulate bot trigger (discord, telegram, etc.)
matrix?: ModelVariant[]; // Run across multiple model/provider combos
cases?: ScenarioCase[]; // Reusable conversation cases
maxSteps?: number; // default: 10
timeout?: number; // default: 120_000
turns?: string[]; // Multi-turn follow-up messages
assertions?: (snapshot: ExecutionSnapshot, context: AssertionContext) => void;
}
```
## Creating a New Scenario
```typescript
import type { ScenarioConfig } from '../types';
export default {
name: 'My Scenario',
agent: { model: 'gpt-4o-mini' },
prompt: 'Your test prompt here',
assertions: (snapshot) => {
if (snapshot.completionReason !== 'done') {
throw new Error(`Expected "done", got "${snapshot.completionReason}"`);
}
},
} satisfies ScenarioConfig;
```
## MCP Tool Testing
```typescript
export default {
name: 'Linear MCP',
agent: {
model: 'gpt-4o-mini',
mcpServers: [
{
identifier: 'linear-server',
connection: { type: 'http', url: 'https://mcp.linear.app/mcp' },
auth: { type: 'bearer', token: 'auto' },
},
],
},
prompt: 'List recent issues in LOBE project',
} satisfies ScenarioConfig;
```
Auth token resolution: `'auto'` (macOS Keychain) → `'$ENV_VAR'` → literal string.
## Bot Trigger Testing
```typescript
export default {
name: 'Discord Bot',
agent: { model: 'gpt-4o-mini', systemRole: 'You are a Discord bot.' },
bot: {
platform: 'discord',
applicationId: 'my-bot-id',
platformThreadId: 'discord:guild:channel:thread',
discordContext: {
channel: { id: 'ch001', name: 'general' },
guild: { id: 'guild001' },
},
},
prompt: 'Hello bot!',
} satisfies ScenarioConfig;
```
## SnapshotHandle Assertion API
```typescript
handle
.assertCompleted() // completionReason === 'done'
.assertNoError()
.assertStepCount(2, 5) // min 2, max 5 steps
.assertHasToolCall('lobe-web-browsing', 'search')
.atStep(0, (step) => { ... })
.someStep((step) => step.content?.includes('keyword'), 'Expected keyword')
.print();
```
## Implementation Details
- **DB**: PGlite in-memory via `getTestDB()`
- **Agent**: `AiAgentService(db, userId)` — constructor injection
- **Provider**: Default `openai` (non-`lobehub`) — skips billing hooks
- **State**: `InMemoryAgentStateManager` — no Redis needed
- **MCP**: `MCPService.getStreamableMcpServerManifest()` discovers tools
@@ -0,0 +1,188 @@
# Linear Eval Results Workflow
How to write, publish, and follow up on eval result issues in Linear.
## Tool Priority
- **Preferred**: Linear MCP (`mcp__linear-server__*`) — if user has configured the MCP server, use it for all operations (create issue, add comment, update status, add relation)
- **Fallback**: [Linear CLI](https://github.com/schpet/linear-cli) (`linear` command) — third-party CLI, use when MCP is unavailable
Check MCP availability first. All examples below show both approaches.
## 1. Writing the Result Issue
Title format: `Eval: <scenario> — <what was tested or changed>`
Examples:
- `Eval: web-onboarding-v3 — baseline all models`
- `Eval: web-onboarding-v3 — fix phase3 tool hint regression`
Structure the issue body with these sections:
```markdown
## Context
- Scenario: `<scenario-name>` (e.g. `web-onboarding-v3`)
- Model: `<model>` (e.g. `gpt-5.4-mini`)
- Cases: baseline / all / specific case IDs
- Prompt changes: brief description of what changed (if iterating)
## Results
| Model | Status | Score | finishOnboarding | Fields | Tokens | Cost | Notes |
| ---------------- | ------- | ----- | ---------------- | ------ | ------ | ------- | ----- |
| gpt-5.4-mini | ✅ PASS | 7/10 | ✓ | ✓ | 24.3k | $0.0035 | ... |
| deepseek-v3.2 | ✅ PASS | — | ✓ | ✓ | 24.4k | — | ... |
| claude-haiku-4.5 | ❌ FAIL | — | — | ✗ | — | — | ... |
## Baseline Comparison
> Compare with previous version (link to prior eval issue)
| Model | Previous | Current | Change |
| ------------- | ------------ | ------------ | ------ |
| gpt-5.4-mini | STALL | ✅ PASS 7/10 | ⬆ |
| deepseek-v3.2 | ✅ PASS 7/10 | ✅ PASS | — |
## Findings
- Bullet list of observations, regressions, or improvements
- Link to specific prompt diff if applicable
## Recommendations
- Actionable next steps based on findings
```
## 2. Publishing to Linear
### Via MCP (preferred)
```
mcp__linear-server__create_issue:
title: "Eval: web-onboarding-v3 — baseline all models"
description: <issue body>
teamId: <LOBE team ID>
labelIds: ["claude code"]
mcp__linear-server__create_issue_relation:
issueId: <new issue ID>
relatedIssueId: <parent tracking issue ID>
type: "related"
```
### Via CLI (fallback)
```bash
cat > /tmp/eval-results.md << 'EOF'
<issue body>
EOF
linear issue create \
--title "Eval: web-onboarding-v3 — baseline all models" \
--description-file /tmp/eval-results.md \
--team LOBE
linear issue relation add LOBE-XXXX related LOBE-6672
```
Parent issue relationships per scenario are tracked in [model-ranking.md](model-ranking.md). Always `related` link new eval issues to the scenario's parent issues.
## 3. Follow-up
Follow-up is done as **comments on the scenario's parent tracking issues**, not as separate issues. This keeps the full eval history threaded in one place.
### Comment on parent issues
After publishing a new eval result issue, add a follow-up comment to each related parent tracking issue (e.g. LOBE-6672) summarizing what changed. The comment should include:
- Link to the new eval result issue
- Key ranking changes (which models moved tiers)
- Regressions or improvements vs previous run
- Actionable next steps
Example comment format (based on actual LOBE-6672 follow-ups):
```markdown
## V3 + Escape Hatch: 7-model Matrix (2026-04-07)
**Based on**: V3 prompt + `<next_actions>` escape hatch fix (LOBE-6810)
**Eval issue**: LOBE-XXXX
### Summary
- **4/7 PASS** (gpt-5.4-mini, deepseek-v3.2, minimax-m2.5, glm-5)
- glm-5 first pass ever (V2 FAIL 4/10 → V3+escape hatch PASS)
- claude-haiku-4.5 regression (V3 PASS → V3+escape hatch FAIL)
### Ranking Changes
| Model | Previous Tier | Current Tier |
| ---------------- | ------------- | ------------ |
| glm-5 | Unstable | Usable ⬆ |
| claude-haiku-4.5 | Usable | Unstable ⬇ |
### Next Steps
- Investigate haiku regression (may need conditional escape hatch injection)
- Consider removing groq models from onboarding support list
```
### Via MCP
```
mcp__linear-server__create_comment:
issueId: <parent issue ID, e.g. LOBE-6672>
body: <comment body>
```
### Via CLI
```bash
cat > /tmp/eval-comment.md << 'EOF'
<comment body>
EOF
linear issue comment add LOBE-6672 --body-file /tmp/eval-comment.md
```
### Link regressions
If a previously passing case now fails, create a separate bug issue and `blocks` link it:
```bash
# MCP
mcp__linear-server__create_issue: title "Regression: <case-id> fails after <change>"
mcp__linear-server__create_issue_relation: type "blocks"
# CLI
linear issue create --title "Regression: <case-id> fails after <change>" --team LOBE
linear issue relation add LOBE-YYYY blocks LOBE-XXXX
```
### Close resolved issues
If an eval run confirms a fix for a tracked issue, comment the result and update status:
```bash
# MCP
mcp__linear-server__create_comment: issueId <ID>, body "Confirmed fixed in eval run LOBE-ZZZZ"
mcp__linear-server__update_issue: id <ID>, stateId <Done state ID>
# CLI
linear issue comment add LOBE-XXXX --body "Confirmed fixed in eval run LOBE-ZZZZ"
linear issue update LOBE-XXXX --status "Done"
```
### Update model-ranking.md
After publishing and commenting, update [model-ranking.md](model-ranking.md) if ranking changed:
- New or updated ranking table under the scenario section
- Updated date
- New baseline cases if added
### Iterate
If cases still fail, return to the [eval iteration workflow](../SKILL.md) (diagnose → fix → re-run → baseline regression).
@@ -0,0 +1,24 @@
# Model Ranking
Per-scenario model ranking and eval history. Updated continuously as new eval runs complete.
---
## web-onboarding-v3
**Linear:** LOBE-6627, LOBE-6672, LOBE-6810, LOBE-6819
**Baseline cases:**
```
fe-intj-crud-v1, pm-enfp-collab-v1, be-istp-reliability-v1, da-intj-automation-en-v1, designer-infp-creative-ja-v1
```
**Ranking (2026-04-07):**
| Tier | Models | Notes |
| ------------ | -------------------------------------- | -------------------------------------- |
| Reliable | gpt-5.4-mini, deepseek-v3.2 | Consistent pass across all case types |
| Usable | minimax-m2.5, glm-5 | Pass baseline, may need escape hatches |
| Unstable | claude-haiku-4.5 | Passes sometimes, fails unpredictably |
| Incompatible | groq/llama-4-scout, groq/llama-3.3-70b | Cannot follow complex tool protocols |