🔨 chore: add agent-evals skill to .agents/skills

Made-with: Cursor
2026-06-14 03:30:19 +00:00 · 2026-04-07 19:42:45 +08:00
4 changed files with 483 additions and 0 deletions
@@ -0,0 +1,119 @@
+---
+name: agent-evals
+description: "Use when running agent evals, iterating on prompts to improve pass rates, comparing models, or publishing eval results to Linear. Triggers on 'agent-evals', 'eval', 'run eval', 'compare models', 'optimize prompt', 'eval baseline', 'prompt iteration', 'why is eval failing', 'run boundary cases', 'bot test'."
+---
+
+# Agent Evals
+
+## Overview
+
+`devtools/agent-evals/` runs real agent executions against PGlite for end-to-end testing and model comparison. No Vitest, no mocks: real DB, real LLM, real tool execution.
+
+- See [cli-reference.md](references/cli-reference.md) for full CLI commands, the `ScenarioConfig` type, MCP/bot/matrix setup, and the assertion API.
+- See [model-ranking.md](references/model-ranking.md) for current model tier rankings per scenario.
+- See [linear-workflow.md](references/linear-workflow.md) for how to write, publish, and follow up on eval results in Linear.
+
+## When to Use
+
+- Running or creating eval scenarios for agent behavior
+- Iterating on prompts (systemRole, toolSystemRole, context engine) to improve pass rates
+- Comparing model performance across scenarios
+- Debugging why an eval case fails
+- Publishing eval results to Linear
+
+**When NOT to use:**
+
+- Unit testing individual functions → use Vitest
+- Manual QA in browser → use dev server directly
+- Testing non-agent features (UI, API routes)
+
+## Quick Start
+
+```bash
+bun run agent-evals run web-onboarding-v3 --case-id fe-intj-crud-v1 --no-matrix --model gpt-5.4-mini
+bun run agent-evals run web-onboarding-v3 --all-cases
+bun run agent-evals list
+```
+
+Use `--no-matrix` for fast single-model iteration. Enable matrix only for final validation.
+
+## Eval Iteration Workflow
+
+```dot
+digraph eval_iteration {
+  rankdir=TB;
+  "Run eval cases" -> "All pass?";
+  "All pass?" -> "Update model-ranking.md" [label="yes"];
+  "Update model-ranking.md" -> "Publish to Linear";
+  "All pass?" -> "Diagnose root cause" [label="no"];
+  "Diagnose root cause" -> "Fix prompt or injection";
+  "Fix prompt or injection" -> "Re-run SAME cases";
+  "Re-run SAME cases" -> "Run baseline (regression)";
+  "Run baseline (regression)" -> "All pass?";
+}
+```
+
+### Diagnose Root Cause
+
+Classify failures into three layers:
+
+**Layer 1 — Prompt issue (systemRole / toolSystemRole)**
+
+- Symptoms: Tool calls happen but in the wrong order, specific calls are missing, or the model ignores completion signals
+- Fix: `lobehub/packages/builtin-agent-onboarding/src/systemRole.ts` or `toolSystemRole.ts`
+
+**Layer 2 — System injection issue (context engine)**
+
+- Symptoms: The model never calls a tool even though the prompt tells it to, the phase is stuck, or `<next_actions>` never instructs the right action
+- Fix: `lobehub/packages/context-engine/src/providers/OnboardingActionHintInjector.ts`
+- Key: `<next_actions>` is the ONLY thing that tells the model which tools to call each turn. If a tool is never mentioned for the current phase, the model will never call it.
+
+**Layer 3 — Model capability issue**
+
+- Symptoms: Zero tool calls, `<next_actions>` ignored entirely, text-only replies despite available tools, and prompt changes have no effect
+- Fix: Switch models. No prompt change compensates for a model that cannot follow tool instructions in a long context.
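+
+The three layers above can be read as a triage order. A minimal TypeScript sketch, assuming a hypothetical `FailureSignals` shape (the type, its field names, and `classifyFailure` are illustrative and not part of `devtools/agent-evals`). It checks injection first, because a tool that `<next_actions>` never mentions cannot be called regardless of model:
+
+```typescript
+// Hypothetical triage helper; all names here are illustrative assumptions.
+type Layer = 'prompt' | 'injection' | 'model';
+
+interface FailureSignals {
+  toolCallCount: number; // total tool calls observed in the failing run
+  promptChangesHadEffect: boolean; // did any earlier prompt edit change behavior?
+  expectedToolInNextActions: boolean; // does <next_actions> mention the tool for this phase?
+}
+
+function classifyFailure(s: FailureSignals): Layer {
+  // Layer 2: the tool is never hinted for the current phase.
+  if (!s.expectedToolInNextActions) return 'injection';
+  // Layer 3: the hint is present, yet there are no tool calls and prompt edits change nothing.
+  if (s.toolCallCount === 0 && !s.promptChangesHadEffect) return 'model';
+  // Layer 1: tools are hinted and called, but order or coverage is wrong.
+  return 'prompt';
+}
+```
+
+Treat the result as a starting hypothesis; a real failure may still need a look at the run's step trace.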
+
+### Fix and Re-run
+
+1. Fix the root cause
+2. Re-run the **exact same failing cases**
+3. Run **baseline cases** for regression check:
+
+```bash
+bun run agent-evals run web-onboarding-v3 \
+  --case-id fe-intj-crud-v1,pm-enfp-collab-v1,be-istp-reliability-v1,da-intj-automation-en-v1,designer-infp-creative-ja-v1 \
+  --no-matrix --model gpt-5.4-mini
+```
+
+### Update Model Ranking
+
+After all cases pass (or after a matrix run), update [model-ranking.md](references/model-ranking.md):
+
+- Add/update the ranking table under the scenario section
+- Update the date
+- Add new baseline cases if introduced
+
+### Publish to Linear
+
+See [linear-workflow.md](references/linear-workflow.md) for the full workflow: issue structure, publishing commands, and follow-up steps.
+
+## Key Files
+
+| File                                                    | Role                                |
+| ------------------------------------------------------- | ----------------------------------- |
+| `devtools/agent-evals/scenarios/*.ts`                   | Scenario configs + assertions       |
+| `devtools/agent-evals/datasets/onboarding/golden-v1.ts` | Test cases (baseline + extreme)     |
+| `lobehub/.../systemRole.ts`                             | Conversation flow prompt            |
+| `lobehub/.../toolSystemRole.ts`                         | Tool usage rules prompt             |
+| `lobehub/.../OnboardingActionHintInjector.ts`           | Per-turn `<next_actions>` injection |
+| `lobehub/src/server/services/onboarding/index.ts`       | Phase derivation (`derivePhase`)    |
+
+## Common Mistakes
+
+| Mistake                                | Fix                                                              |
+| -------------------------------------- | ---------------------------------------------------------------- |
+| Only change prompt wording             | Check if `<next_actions>` even mentions the tool for that phase  |
+| Skip baseline regression check         | Re-run baseline cases; edge-case fixes can break the happy path  |
+| Compare across different judge models  | Use the same judge; gpt-4o-mini scores ≠ gpt-5.4-mini scores     |
+| Run full matrix during iteration       | `--no-matrix` + single model for speed; matrix for final only    |
+| Assume prompt fix works for all models | Test at least 2 models                                           |
@@ -0,0 +1,152 @@
+# Agent Evals CLI Reference
+
+## Directory Structure
+
+```
+devtools/agent-evals/
+├── cli.ts                          # CLI entry (#!/usr/bin/env bun)
+├── package.json                    # @cloud/agent-evals
+├── types.ts                        # ScenarioConfig, McpToolConfig, BotContext, ModelVariant
+├── helpers/
+│   ├── env.ts                      # PGlite init, user/agent creation, MCP registration
+│   ├── runner.ts                   # runAgent(), runMatrix()
+│   ├── snapshot.ts                 # SnapshotHandle assertion API
+│   ├── tracing.ts                  # StepLifecycleCallbacks → ExecutionSnapshot collector
+│   ├── mcp.ts                      # MCP server discovery + DB registration
+│   ├── claude-credentials.ts       # Read OAuth tokens from Claude Code keychain
+│   └── compare.ts                  # Matrix comparison table renderer
+└── scenarios/                      # Scenario files (export default ScenarioConfig)
+```
+
+## CLI Commands
+
+```bash
+# Run a specific scenario
+bun run agent-evals run basic-chat
+
+# Override model/provider
+bun run agent-evals run bot-discord --model deepseek-chat
+bun run agent-evals run basic-chat --model claude-sonnet-4-20250514 --provider openai
+
+# Inline prompt
+bun run agent-evals run --prompt "What is 2+2?" --model gpt-4o-mini
+
+# Model matrix (comma-separated, supports model@provider)
+bun run agent-evals run --prompt "Hello" --matrix "gpt-4o-mini@openai,deepseek-chat@openai"
+
+# Dataset cases
+bun run agent-evals run web-onboarding-v3 --all-cases
+bun run agent-evals run web-onboarding-v3 --case-id fe-intj-crud-v1
+bun run agent-evals run web-onboarding-v3 --all-cases --sample-cases 2 --seed 7
+
+# Disable scenario matrix
+bun run agent-evals run web-onboarding-v3 --no-matrix --model gpt-5.4-mini
+
+# Run all scenarios
+bun run agent-evals run --all
+
+# List scenarios
+bun run agent-evals list
+```
+
+## ScenarioConfig Type
+
+```typescript
+interface ScenarioConfig {
+  name: string;
+  description?: string;
+  prompt: string;
+  agent: {
+    model?: string; // default: gpt-4o-mini
+    provider?: string; // default: openai
+    systemRole?: string;
+    plugins?: string[];
+    mcpServers?: McpToolConfig[];
+  };
+  bot?: BotContext; // Simulate bot trigger (discord, telegram, etc.)
+  matrix?: ModelVariant[]; // Run across multiple model/provider combos
+  cases?: ScenarioCase[]; // Reusable conversation cases
+  maxSteps?: number; // default: 10
+  timeout?: number; // default: 120_000
+  turns?: string[]; // Multi-turn follow-up messages
+  assertions?: (snapshot: ExecutionSnapshot, context: AssertionContext) => void;
+}
+```
+
+## Creating a New Scenario
+
+```typescript
+import type { ScenarioConfig } from '../types';
+
+export default {
+  name: 'My Scenario',
+  agent: { model: 'gpt-4o-mini' },
+  prompt: 'Your test prompt here',
+  assertions: (snapshot) => {
+    if (snapshot.completionReason !== 'done') {
+      throw new Error(`Expected "done", got "${snapshot.completionReason}"`);
+    }
+  },
+} satisfies ScenarioConfig;
+```
+
+## MCP Tool Testing
+
+```typescript
+export default {
+  name: 'Linear MCP',
+  agent: {
+    model: 'gpt-4o-mini',
+    mcpServers: [
+      {
+        identifier: 'linear-server',
+        connection: { type: 'http', url: 'https://mcp.linear.app/mcp' },
+        auth: { type: 'bearer', token: 'auto' },
+      },
+    ],
+  },
+  prompt: 'List recent issues in LOBE project',
+} satisfies ScenarioConfig;
+```
+
+Auth token resolution: `'auto'` (macOS Keychain) → `'$ENV_VAR'` → literal string.
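+
+A minimal sketch of that resolution order, reading the arrows as the order in which the `token` value is interpreted. `resolveAuthToken`, `readKeychainToken`, the explicit `env` parameter, and the error behavior are all assumptions for illustration; the actual helper may differ:
+
+```typescript
+// Sketch of the documented order: 'auto' -> '$ENV_VAR' -> literal string.
+// `readKeychainToken` is a hypothetical stand-in for the real keychain lookup.
+type KeychainReader = () => string | undefined;
+
+function resolveAuthToken(
+  token: string,
+  readKeychainToken: KeychainReader,
+  env: Record<string, string | undefined>,
+): string {
+  if (token === 'auto') {
+    const fromKeychain = readKeychainToken();
+    if (!fromKeychain) throw new Error('No token found in keychain');
+    return fromKeychain;
+  }
+  if (token.startsWith('$')) {
+    const name = token.slice(1); // strip the leading '$'
+    const fromEnv = env[name];
+    if (!fromEnv) throw new Error(`Environment variable ${name} is not set`);
+    return fromEnv;
+  }
+  return token; // anything else is used as-is
+}
+```
+
+Passing the keychain reader and environment in as arguments keeps the sketch testable without touching the real keychain.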
+
+## Bot Trigger Testing
+
+```typescript
+export default {
+  name: 'Discord Bot',
+  agent: { model: 'gpt-4o-mini', systemRole: 'You are a Discord bot.' },
+  bot: {
+    platform: 'discord',
+    applicationId: 'my-bot-id',
+    platformThreadId: 'discord:guild:channel:thread',
+    discordContext: {
+      channel: { id: 'ch001', name: 'general' },
+      guild: { id: 'guild001' },
+    },
+  },
+  prompt: 'Hello bot!',
+} satisfies ScenarioConfig;
+```
+
+## SnapshotHandle Assertion API
+
+```typescript
+handle
+  .assertCompleted()          // completionReason === 'done'
+  .assertNoError()
+  .assertStepCount(2, 5)     // min 2, max 5 steps
+  .assertHasToolCall('lobe-web-browsing', 'search')
+  .atStep(0, (step) => { ... })
+  .someStep((step) => step.content?.includes('keyword'), 'Expected keyword')
+  .print();
+```
+
+## Implementation Details
+
+- **DB**: PGlite in-memory via `getTestDB()`
+- **Agent**: `AiAgentService(db, userId)` — constructor injection
+- **Provider**: Default `openai` (non-`lobehub`) — skips billing hooks
+- **State**: `InMemoryAgentStateManager` — no Redis needed
+- **MCP**: `MCPService.getStreamableMcpServerManifest()` discovers tools
@@ -0,0 +1,188 @@
+# Linear Eval Results Workflow
+
+How to write, publish, and follow up on eval result issues in Linear.
+
+## Tool Priority
+
+- **Preferred**: Linear MCP (`mcp__linear-server__*`). If the user has configured the MCP server, use it for all operations (create issue, add comment, update status, add relation)
+- **Fallback**: [Linear CLI](https://github.com/schpet/linear-cli) (`linear` command), a third-party CLI. Use it when MCP is unavailable
+
+Check MCP availability first. All examples below show both approaches.
+
+## 1. Writing the Result Issue
+
+Title format: `Eval: <scenario> — <what was tested or changed>`
+
+Examples:
+
+- `Eval: web-onboarding-v3 — baseline all models`
+- `Eval: web-onboarding-v3 — fix phase3 tool hint regression`
+
+Structure the issue body with these sections:
+
+```markdown
+## Context
+
+- Scenario: `<scenario-name>` (e.g. `web-onboarding-v3`)
+- Model: `<model>` (e.g. `gpt-5.4-mini`)
+- Cases: baseline / all / specific case IDs
+- Prompt changes: brief description of what changed (if iterating)
+
+## Results
+
+| Model            | Status  | Score | finishOnboarding | Fields | Tokens | Cost    | Notes |
+| ---------------- | ------- | ----- | ---------------- | ------ | ------ | ------- | ----- |
+| gpt-5.4-mini     | ✅ PASS | 7/10  | ✓                | ✓      | 24.3k  | $0.0035 | ...   |
+| deepseek-v3.2    | ✅ PASS | —     | ✓                | ✓      | 24.4k  | —       | ...   |
+| claude-haiku-4.5 | ❌ FAIL | —     | —                | ✗      | —      | —       | ...   |
+
+## Baseline Comparison
+
+> Compare with previous version (link to prior eval issue)
+
+| Model         | Previous     | Current      | Change |
+| ------------- | ------------ | ------------ | ------ |
+| gpt-5.4-mini  | STALL        | ✅ PASS 7/10 | ⬆      |
+| deepseek-v3.2 | ✅ PASS 7/10 | ✅ PASS      | —      |
+
+## Findings
+
+- Bullet list of observations, regressions, or improvements
+- Link to specific prompt diff if applicable
+
+## Recommendations
+
+- Actionable next steps based on findings
+```
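+
+If results are collected programmatically, the Results table can be generated rather than written by hand. A minimal sketch, assuming a hypothetical `EvalResult` shape and covering only a subset of the columns above (agent-evals may expose different fields):
+
+```typescript
+// Hypothetical result shape; field names are illustrative assumptions.
+interface EvalResult {
+  model: string;
+  passed: boolean;
+  score?: string; // e.g. "7/10"
+  notes?: string;
+}
+
+// Render a Markdown table with one row per model.
+function renderResultsTable(results: EvalResult[]): string {
+  const header = ['| Model | Status | Score | Notes |', '| --- | --- | --- | --- |'];
+  const rows = results.map((r) => {
+    const status = r.passed ? '✅ PASS' : '❌ FAIL';
+    // Missing values fall back to a placeholder so the row keeps four cells.
+    return `| ${r.model} | ${status} | ${r.score ?? '-'} | ${r.notes ?? ''} |`;
+  });
+  return [...header, ...rows].join('\n');
+}
+```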
+
+## 2. Publishing to Linear
+
+### Via MCP (preferred)
+
+```
+mcp__linear-server__create_issue:
+  title: "Eval: web-onboarding-v3 — baseline all models"
+  description: <issue body>
+  teamId: <LOBE team ID>
+  labelIds: ["claude code"]
+
+mcp__linear-server__create_issue_relation:
+  issueId: <new issue ID>
+  relatedIssueId: <parent tracking issue ID>
+  type: "related"
+```
+
+### Via CLI (fallback)
+
+```bash
+cat > /tmp/eval-results.md << 'EOF'
+<issue body>
+EOF
+
+linear issue create \
+  --title "Eval: web-onboarding-v3 — baseline all models" \
+  --description-file /tmp/eval-results.md \
+  --team LOBE
+
+linear issue relation add LOBE-XXXX related LOBE-6672
+```
+
+Parent issue relationships per scenario are tracked in [model-ranking.md](model-ranking.md). Always link new eval issues to the scenario's parent issues with a `related` relation.
+
+## 3. Follow-up
+
+Follow-up is done as **comments on the scenario's parent tracking issues**, not as separate issues. This keeps the full eval history threaded in one place.
+
+### Comment on parent issues
+
+After publishing a new eval result issue, add a follow-up comment to each related parent tracking issue (e.g. LOBE-6672) summarizing what changed. The comment should include:
+
+- Link to the new eval result issue
+- Key ranking changes (which models moved tiers)
+- Regressions or improvements vs previous run
+- Actionable next steps
+
+Example comment format (based on actual LOBE-6672 follow-ups):
+
+```markdown
+## V3 + Escape Hatch: 7-model Matrix (2026-04-07)
+
+**Based on**: V3 prompt + `<next_actions>` escape hatch fix (LOBE-6810)
+**Eval issue**: LOBE-XXXX
+
+### Summary
+
+- **4/7 PASS** (gpt-5.4-mini, deepseek-v3.2, minimax-m2.5, glm-5)
+- glm-5 first pass ever (V2 FAIL 4/10 → V3+escape hatch PASS)
+- claude-haiku-4.5 regression (V3 PASS → V3+escape hatch FAIL)
+
+### Ranking Changes
+
+| Model            | Previous Tier | Current Tier |
+| ---------------- | ------------- | ------------ |
+| glm-5            | Unstable      | Usable ⬆     |
+| claude-haiku-4.5 | Usable        | Unstable ⬇   |
+
+### Next Steps
+
+- Investigate haiku regression (may need conditional escape hatch injection)
+- Consider removing groq models from onboarding support list
+```
+
+### Via MCP
+
+```
+mcp__linear-server__create_comment:
+  issueId: <parent issue ID, e.g. LOBE-6672>
+  body: <comment body>
+```
+
+### Via CLI
+
+```bash
+cat > /tmp/eval-comment.md << 'EOF'
+<comment body>
+EOF
+
+linear issue comment add LOBE-6672 --body-file /tmp/eval-comment.md
+```
+
+### Link regressions
+
+If a previously passing case now fails, create a separate bug issue and link it with a `blocks` relation:
+
+```bash
+# MCP
+mcp__linear-server__create_issue: title "Regression: <case-id> fails after <change>"
+mcp__linear-server__create_issue_relation: type "blocks"
+
+# CLI
+linear issue create --title "Regression: <case-id> fails after <change>" --team LOBE
+linear issue relation add LOBE-YYYY blocks LOBE-XXXX
+```
+
+### Close resolved issues
+
+If an eval run confirms a fix for a tracked issue, comment with the result and update the issue status:
+
+```bash
+# MCP
+mcp__linear-server__create_comment: issueId <ID>, body "Confirmed fixed in eval run LOBE-ZZZZ"
+mcp__linear-server__update_issue: id <ID>, stateId <Done state ID>
+
+# CLI
+linear issue comment add LOBE-XXXX --body "Confirmed fixed in eval run LOBE-ZZZZ"
+linear issue update LOBE-XXXX --status "Done"
+```
+
+### Update model-ranking.md
+
+After publishing and commenting, update [model-ranking.md](model-ranking.md) if the ranking changed:
+
+- New or updated ranking table under the scenario section
+- Updated date
+- New baseline cases if added
+
+### Iterate
+
+If cases still fail, return to the [eval iteration workflow](../SKILL.md) (diagnose → fix → re-run → baseline regression).
@@ -0,0 +1,24 @@
+# Model Ranking
+
+Per-scenario model ranking and eval history. Updated continuously as new eval runs complete.
+
+---
+
+## web-onboarding-v3
+
+**Linear:** LOBE-6627, LOBE-6672, LOBE-6810, LOBE-6819
+
+**Baseline cases:**
+
+```
+fe-intj-crud-v1, pm-enfp-collab-v1, be-istp-reliability-v1, da-intj-automation-en-v1, designer-infp-creative-ja-v1
+```
+
+**Ranking (2026-04-07):**
+
+| Tier         | Models                                 | Notes                                  |
+| ------------ | -------------------------------------- | -------------------------------------- |
+| Reliable     | gpt-5.4-mini, deepseek-v3.2            | Consistent pass across all case types  |
+| Usable       | minimax-m2.5, glm-5                    | Pass baseline, may need escape hatches |
+| Unstable     | claude-haiku-4.5                       | Passes sometimes, fails unpredictably  |
+| Incompatible | groq/llama-4-scout, groq/llama-3.3-70b | Cannot follow complex tool protocols   |