Experiments

AgentV eval files are the only runnable authoring artifact. Use top-level experiment: inside eval.yaml for runtime choices: targets, workers, timeout, sandbox/runtime knobs, budgets, thresholds, and repeat-run policy.

name: support-regression

experiment:
  targets: [codex-gpt5, claude-sonnet]
  workers: 2
  timeout_seconds: 720
  repeat:
    count: 4
    strategy: pass_at_k
    cost_limit_usd: 2.00

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

tests:
  - id: refund-eligibility
    input: Can this customer get a refund?
    criteria: Applies the refund policy correctly

execution: is accepted only as a legacy top-level alias for existing eval files. Do not use both experiment: and execution: in the same eval.

Tests Imports

Use tests[] for composition, imports, and selection.

tests:
  - include: evals/support/*.eval.yaml
    type: suite
    select:
      test_ids:
        - refund-*
        - missing-order-date
      tags: regression
      metadata:
        priority: high
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all
  - include: cases/*.cases.yaml
    type: tests
  - include: cases/regression.jsonl
    type: tests
  - cases/smoke/*.cases.yaml

type: suite preserves the imported suite’s task contract: metadata, workspace, shared input, shared assertions, and tests. The parent eval still owns the single run bundle. Runtime defaults from an imported suite apply only where they can be scoped to that suite’s tests: threshold, repeat or runs, timeout_seconds, and budget_usd. If the parent eval supplies one of those defaults, the parent value wins for imported tests. Fields that cannot be scoped inside one parent run, such as target, targets, workers, workspace, agent, model, agent_options, and sandbox, must be supplied by the parent experiment when importing the suite.

type: tests imports only raw test entries. It intentionally drops shared context from an imported eval suite, so parent suite fields apply to those raw cases.

tests[].select.test_ids filters imported test IDs with glob patterns. tests[].select.tags filters each imported case’s effective metadata.tags. Effective case tags are suite-first and deduped: suite.tags + suite.metadata.tags + test.metadata.tags. Top-level suite tags still remain suite identity metadata for discovery and reporting; selection reads the merged case metadata view. tests[].select.metadata filters case metadata by key/value, where selector values may be scalars or lists. Globbed include paths are resolved in deterministic path order, then test order.

String-valued tests and string entries inside tests[] are raw-case import shorthand. They are equivalent to include with type: tests and may point at raw case files, directories, or globs. Importing another eval suite must use object form with include: and type: suite.

Suite imports are resolved as a deterministic include graph. Circular type: suite imports fail validation with the import chain; raw-case shorthand does not recursively load suite runtime blocks.

Imported suite rows keep their source suite metadata in index.jsonl. Use each row’s result_dir as the authoritative path to generated artifacts inside the run directory; do not infer layout from suite names.

Scoped Run Overrides

Use scoped run: blocks for result interpretation and scheduling policies that vary by include group or test case. Precedence is:

test.run > tests[].run > parent experiment > imported suite experiment defaults

experiment:
  target: agent
  threshold: 0.8
  repeat:
    count: 3
    strategy: pass_at_k

tests:
  - include: ./evals/flaky-agentic/**/*.eval.yaml
    type: suite
    select:
      tags: [agentic]
    run:
      repeat:
        count: 3
        strategy: pass_at_k

  - include: ./evals/regression/**/*.eval.yaml
    type: suite
    select:
      tags: [must-pass]
    run:
      threshold: 1.0
      repeat:
        count: 2
        strategy: pass_all

  - id: critical-case
    input: "..."
    criteria: Must pass exactly
    run:
      threshold: 1.0
      repeat:
        count: 1

Scoped run: supports threshold, repeat, timeout_seconds, and budget_usd. Candidate-changing fields such as target and targets stay parent-level under experiment:. Workspace mutation belongs in workspace.hooks, and runner-specific setup belongs in targets[].hooks.

Lifecycle Ownership

experiment: configures evaluation policy. It does not own commands that prepare files, dependencies, repos, or target-specific runner state.

Need	Put it in
Install dependencies, build the repo, seed files	`workspace.hooks.before_all`
Reset or apply per-case state	`workspace.hooks.before_each` / `workspace.hooks.after_each`
Configure an agent runner or provider variant	`targets[].hooks`
Choose targets, repeats, pass policy, budget, threshold	`experiment`

workspace:
  hooks:
    before_all:
      command: ["bash", "-lc", "bun install && bun run build"]

targets:
  - name: agent-with-skills
    provider: codex
    hooks:
      before_each:
        command: ["sh", "-c", "cp -R skills \"{{workspace_path}}/.codex/skills\""]

experiment:
  target: agent-with-skills
  repeat:
    count: 3
    strategy: pass_at_k

Repeat Runs

repeat supports the same core strategies as repeated attempts:

experiment:
  repeat:
    count: 3
    strategy: mean
    cost_limit_usd: 1.50

Supported strategies:

Strategy	Behavior
`pass_at_k`	Uses the best passing attempt; early-exits by default unless `early_exit: false` is set
`pass_all`	Uses the weakest attempt score, so every repeated attempt must meet the threshold
`mean`	Aggregates repeated attempt scores by mean
`confidence_interval`	Uses the lower bound of a 95% confidence interval as the conservative score

AgentV also accepts runs and early_exit under experiment: as shorthand for repeat-run policy:

experiment:
  runs: 4
  early_exit: true

Do not set both repeat and runs in the same runtime block.

Result Layout

Default eval runs write to:

.agentv/results/<eval-name>/<timestamp>/

Imported source suite metadata appears in index.jsonl rows and manifests. AgentV does not add a redundant suite directory when the result group is already the eval name.