5 Commits

Author SHA1 Message Date
AmAzing- 54e1b59ce6 feat(agent-management): paginate searchAgent with real totals + wire 8 packages into CI (#15448)
*  feat(agent-management): paginate searchAgent with real totals and cap notice

The searchAgent tool silently clamped limit to 20 with no pagination and
reported totalCount as the returned page size, so models (and users) could
never discover agents beyond the 20 most recently updated ones.

- AgentModel: extract shared where builder, add countAgents (same
  conditions as queryAgents)
- lambda router + client agent service: expose countAgents
- server tool runtime & AgentManagerRuntime: pass offset through, report
  real totals (workspace + marketplace), emit explicit notes when the
  requested limit is capped and when more pages exist, explain
  out-of-range offsets instead of claiming no matches
- manifest: add offset param, document pagination
- agent-manager-runtime: add vitest config + test scripts (suite was
  previously unrunnable), repair stale store mocks

* 👷 build(ci): wire 8 tested packages into the package test workflow

An audit found 8 packages carrying test:coverage scripts that were never
added to the CI PACKAGES allowlist, so their suites never ran:

- agent-gateway-client, device-gateway-client, device-identity,
  eval-dataset-parser: already green, added as-is
- eval-rubric, fetch-sse: had no package-level vitest config, so vitest
  fell back to the root config whose setup/aliases break outside src/ —
  added minimal configs
- heterogeneous-agents: one assertion drifted (labels registry gained
  amp/hermes/openclaw/opencode) with nobody noticing — updated
- agent-manager-runtime: wired in the previous commit

All 8 verified locally with the exact CI command
(bun run --filter <pkg> test:coverage).

*  test(agent-management): cover searchAgent error path and market totalCount fallback

Codecov flagged 3 uncovered lines in the patch: the searchAgents catch
block (2 misses) and the totalCount ?? items.length fallback (1 partial).
Add the missing failure-path and fallback tests on both execution paths
(client AgentManagerRuntime + server tool runtime).
2026-06-04 10:52:25 +08:00
Innei 49d191d2a7 🐛 fix: unify TypeScript peer resolution on 6.x (#15263) 2026-05-27 19:22:35 +08:00
Rylan Cai ea329113be feat(eval): add external scoring mode (#12729)
* wip: add llm relevant & BrowseComp

* wip: add widesearch desc

* wip: dsqa, hle, widesearch

* wip: add dsqa

* wip: add awaiting eval status for runs

* wip: add awaiting status for run

* wip: adjust hle-verified

* 🐛 fix: browsecomp topics

* 📝 docs: add annotations

* wip: add awaiting status for pass@k

* wip: add complete status

* wip: update theard dots

* wip: update run status page

* wip: remove useless impl

* wip: update prompt

*  feat: add external eval routes

* wip: add eval cli

* 🐛 fix: support authoritize in no browser environment

* wip: pass tests

* ♻️ refactor: remove tests

* ♻️ refactor: mo camel case
2026-03-10 09:53:26 +08:00
YuTengjing c1521d2aeb 💄 style: batch fix eslint violations across packages (#12601) 2026-03-03 02:19:50 +08:00
Arvin Xu e7598fe90b feat: support agent benchmark (#12355)
* improve total

fix page size issue

fix error message handler

fix eval home page

try to fix batch run agent step issue

fix run list

fix dataset loading

fix abort issue

improve jump and table column

fix error streaming

try to fix error output in vercel

refactor qstash workflow client

improve passK

add evals to proxy

refactor metrics

try to fix build

refactor tests

improve detail page

fix passK issue

improve eval-rubric

fix types

support passK

fix type

update

fix db insert issue

improve dataset ui

improve run config

finish step limit now

add step limited

100% coverage to models

add failed tests todo

support interruptOperation

fix lint

improve report detail

improve pass rate

improve sort order issue

fix timeout issue

Update db schema

完整 case 跑通

update database

improve error handling

refactor to improve database

优化 test case 的处理流程

优化部分细节体验和实现

基本完成 Benchmark 全流程功能

优化 run case 展示

优化 run case 序号问题

优化 eval test case 页面

新增 eval test 模式

新增 dataset 页面

update schema

support

finish create test run

fix

update

improve import exp

refactor data flow

improve import workflow

rubric Benchmark detail 页面

improve import ux

update schema

finish eval home page

add eval workflow endpoint

implement benchmark run model

refactor RAG eval

implement backend

update db schema

update db migration

init benchmark

* support rerun error test case

* fix tests

* fix tests
2026-02-21 20:36:40 +08:00