Skip to content

Experiments

AgentV eval files are the only runnable authoring artifact. Use top-level experiment: inside eval.yaml for runtime choices: targets, workers, timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy.

name: support-regression
experiment:
targets: [codex-gpt5, claude-sonnet]
workers: 2
timeout_seconds: 720
repeat:
count: 4
strategy: pass_at_k
cost_limit_usd: 2.00
workspace:
hooks:
before_all:
command: ["bash", "-lc", "bun install && bun run build"]
tests:
- id: refund-eligibility
input: Can this customer get a refund?
criteria: Applies the refund policy correctly

execution: is accepted only as a legacy top-level alias for existing eval files. Do not use both experiment: and execution: in the same eval.

Use tests[] for composition, imports, and selection.

tests:
- include: evals/support/*.eval.yaml
type: suite
select:
test_ids:
- refund-*
- missing-order-date
tags: regression
metadata:
priority: high
run:
threshold: 1.0
repeat:
count: 2
strategy: pass_all
- include: cases/*.cases.yaml
type: tests
- include: cases/regression.jsonl
type: tests
- cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The parent eval still owns the single run bundle. Runtime defaults from an imported suite apply only where they can be scoped to that suite’s tests: threshold, repeat or runs, timeout_seconds, and budget_usd. If the parent eval supplies one of those defaults, the parent value wins for imported tests. Fields that cannot be scoped inside one parent run, such as target, targets, workers, workspace, agent, model, agent_options, and sandbox, must be supplied by the parent experiment when importing the suite.

type: tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

tests[].select.test_ids filters imported test IDs with glob patterns. tests[].select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. tests[].select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to include with type: tests and may point at raw case files, directories, or globs. Importing another eval suite must use object form with include: and type: suite.

Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks.

Imported suite rows keep their source suite metadata in index.jsonl. Use each row’s result_dir as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names.

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

test.run > tests[].run > parent experiment > imported suite experiment defaults
experiment:
target: agent
threshold: 0.8
repeat:
count: 3
strategy: pass_at_k
tests:
- include: ./evals/flaky-agentic/**/*.eval.yaml
type: suite
select:
tags: [agentic]
run:
repeat:
count: 3
strategy: pass_at_k
- include: ./evals/regression/**/*.eval.yaml
type: suite
select:
tags: [must-pass]
run:
threshold: 1.0
repeat:
count: 2
strategy: pass_all
- id: critical-case
input: "..."
criteria: Must pass exactly
run:
threshold: 1.0
repeat:
count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and budget_usd. Candidate-changing fields such as target and targets stay parent-level under experiment:. Workspace mutation belongs in workspace.hooks, and runner-specific setup belongs in targets[].hooks.

experiment: configures evaluation policy. It does not own commands that prepare files, dependencies, repos, or target-specific runner state.

NeedPut it in
Install dependencies, build the repo, seed filesworkspace.hooks.before_all
Reset or apply per-case stateworkspace.hooks.before_each / workspace.hooks.after_each
Configure an agent runner or provider varianttargets[].hooks
Choose targets, repeats, pass policy, budget, thresholdexperiment
workspace:
hooks:
before_all:
command: ["bash", "-lc", "bun install && bun run build"]
targets:
- name: agent-with-skills
provider: codex
hooks:
before_each:
command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]
experiment:
target: agent-with-skills
repeat:
count: 3
strategy: pass_at_k

repeat supports the same core strategies as repeated attempts:

experiment:
repeat:
count: 3
strategy: mean
cost_limit_usd: 1.50

Supported strategies:

StrategyBehavior
pass_at_kUses the best passing attempt; early-exits by default unless early_exit: false is set
pass_allUses the weakest attempt score, so every repeated attempt must meet the threshold
meanAggregates repeated attempt scores by mean
confidence_intervalUses the lower bound of a 95% confidence interval as the conservative score

AgentV also accepts runs and early_exit under experiment: as shorthand for repeat-run policy:

experiment:
runs: 4
early_exit: true

Do not set both repeat and runs in the same runtime block.

Default eval runs write to:

.agentv/results/<eval-name>/<timestamp>/

Imported source suite metadata appears in index.jsonl rows and manifests. AgentV does not add a redundant suite directory when the result group is already the eval name.