Benchmark
The task contract.
Definition
A benchmark is the versioned task contract for a class of runs. It defines what problem is being attempted, what the agent may read or modify, what artifacts it must produce, which budgets and constraints bind the attempt, and which evaluations are allowed to score the result. A benchmark is not the scoring act itself; scoring belongs to Evaluation.
How It Looks
A benchmark looks like a versioned task spec: visible inputs, forbidden information, writable surfaces, required artifact schema, budget limits, and the evaluation versions that are allowed to score submissions.
How To Use It
Use a benchmark to freeze the rules of the game before runs begin. If agents can change the rules, inspect hidden labels, rewrite the scorer, expand the budget after seeing results, or submit artifacts in incompatible shapes, the benchmark is not trustworthy.
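One way to make "frozen before runs begin" checkable is to fingerprint the contract and record that fingerprint with every run. The sketch below is only an illustration: the function name `contract_fingerprint` and the dict-shaped contract are assumptions, not part of any existing tool.

```python
import hashlib
import json


def contract_fingerprint(contract):
    """Return a stable SHA-256 hex digest for a JSON-serializable contract.

    Keys are sorted so that two dicts with the same content but a different
    key order produce the same fingerprint.
    """
    canonical = json.dumps(contract, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical contract fragments, used only for illustration.
frozen = {"benchmark_id": "w2s-math", "budget": {"wall_time_hours": 6}}
reordered = {"budget": {"wall_time_hours": 6}, "benchmark_id": "w2s-math"}
expanded = {"benchmark_id": "w2s-math", "budget": {"wall_time_hours": 12}}

same = contract_fingerprint(frozen) == contract_fingerprint(reordered)
changed = contract_fingerprint(frozen) != contract_fingerprint(expanded)
```

If a run's recorded fingerprint differs from the published one, the rules changed somewhere between publication and execution.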
Contract, Not Score
The benchmark says what counts as a valid attempt. It does not decide whether the attempt was good. That separation lets multiple evaluations score the same run artifacts, lets a benchmark evolve without silently changing historical scores, and keeps hidden labels or private tests out of the agent-visible task definition.
A benchmark should read like a contract: eligible inputs, forbidden information, writable surfaces, required outputs, resource limits, submission format, and evaluation compatibility. A run either satisfies that contract or it is invalid for the benchmark, even before any metric is computed.
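The idea that validity is decided before any metric is computed can be sketched as a check that returns reasons for rejection and never touches a score. The class and field names below are hypothetical and cover only two of the contract clauses named above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BenchmarkContract:
    """Hypothetical, minimal view of a benchmark contract."""
    benchmark_id: str
    version: str
    required_outputs: frozenset


@dataclass
class RunSubmission:
    """Hypothetical record of what a run declared and produced."""
    benchmark_id: str
    benchmark_version: str
    produced_outputs: set = field(default_factory=set)


def contract_violations(contract, run):
    """Return the reasons a run is invalid; an empty list means valid."""
    problems = []
    declared = (run.benchmark_id, run.benchmark_version)
    if declared != (contract.benchmark_id, contract.version):
        problems.append("run targets a different benchmark version")
    missing = sorted(contract.required_outputs - run.produced_outputs)
    if missing:
        problems.append("missing required outputs: " + ", ".join(missing))
    return problems
```

An evaluation would only be invoked when `contract_violations` returns an empty list, which keeps "invalid" and "scored poorly" as separate outcomes.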
Expected Outputs
Expected outputs are the artifacts every valid run must produce before evaluation can start. They should be concrete enough that an evaluator can reject malformed submissions without interpreting intent.
For prediction tasks, expected outputs usually include a manifest, prediction file, schema version, dataset version, and run id. For training tasks, they may include checkpoints, model cards, logs, generated datasets, config snapshots, and a loader command. For code tasks, they may include patches, test reports, build artifacts, and declared dependencies.
The benchmark should define which outputs are mandatory, optional, human-readable, machine-readable, retained for audit, or passed to evaluation. It should also define how each failure case is handled: a missing artifact, a partial result, a timeout artifact, a corrupt checkpoint, a non-deterministic loader, or output produced after budget expiration.
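A minimal sketch of rejecting malformed submissions "without interpreting intent" follows. It classifies a single JSONL artifact as missing, late, corrupt, or acceptable. The status names, the function name, and the choice to check lateness before corruption are assumptions made for illustration; a real contract would state its own precedence.

```python
import json
from enum import Enum


class ArtifactStatus(Enum):
    OK = "ok"
    MISSING = "missing"
    CORRUPT = "corrupt"
    LATE = "late"


def check_jsonl_artifact(text, produced_at_s, deadline_s):
    """Classify one JSONL artifact by structure alone.

    text is None when the artifact was never written. Both times are
    seconds since the run started.
    """
    if text is None:
        return ArtifactStatus.MISSING
    if produced_at_s > deadline_s:
        return ArtifactStatus.LATE
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are tolerated in this sketch
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return ArtifactStatus.CORRUPT
    return ArtifactStatus.OK
```

The check only asks whether each line parses as JSON; whether the predictions are any good remains a question for Evaluation.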
Budgets
Budgets are part of the task, not bookkeeping after the fact. A benchmark can limit wall-clock time, GPU-hours, CPU cores, memory, token spend, API dollars, dataset samples, self-play games, solver calls, search nodes, attempts, submissions, or human review minutes.
The contract should define where budgets are measured and enforced. For example, wall time may start when the run enters running state, GPU-hours may include failed child runs, and token spend may include planning prompts, tool calls, and judge calls unless explicitly excluded.
Budgets should be versioned with the benchmark because changing a budget changes the task. A one-hour learning task and a one-week learning task are different tasks even if they share the same evaluation.
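Budget enforcement along these lines could be sketched as a ledger that every charge passes through, including charges from failed child runs. The `BudgetLedger` class is hypothetical; the resource names and limits are borrowed from the AutoGo sketch in this document.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetLedger:
    """Hypothetical ledger that accumulates spend against contract limits."""
    limits: dict
    spent: dict = field(default_factory=dict)

    def charge(self, resource, amount):
        """Record spend; resources outside the contract are rejected."""
        if resource not in self.limits:
            raise KeyError(f"resource not covered by the contract: {resource}")
        self.spent[resource] = self.spent.get(resource, 0.0) + amount

    def exceeded(self):
        """Return the sorted names of resources whose spend passed the limit."""
        return sorted(
            name for name, limit in self.limits.items()
            if self.spent.get(name, 0.0) > limit
        )


ledger = BudgetLedger({"gpu_hours": 24, "self_play_games": 50000})
ledger.charge("gpu_hours", 20.0)  # main training run
ledger.charge("gpu_hours", 5.0)   # failed child run, still counted here
over = ledger.exceeded()
```

Whether failed child runs count is a contract decision; this sketch assumes they do, which is why the run ends up over its GPU budget.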
Allowed Files And Surfaces
A benchmark must tell the agent what it can read, write, execute, and modify. This includes repository paths, dataset mounts, scratch directories, artifact directories, environment variables, secrets, remote services, package registries, and hardware devices.
Allowed files are especially important when the scoring harness lives near the task harness. Agents may be allowed to edit model code, prompts, configs, and training scripts while evaluation code, hidden labels, private tests, arena opponents, and metric definitions remain read-protected or unavailable.
The contract should state whether generated files become artifacts, whether dependency lockfiles may change, whether network access is allowed, and whether external knowledge is permitted. These permissions define the research question as much as the dataset does.
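A write-permission check over declared surfaces might look like the sketch below. The glob patterns mirror the AutoGo sketch in this document; the function name `may_write` and the rule that read-only patterns win over writable ones are assumptions. Paths are normalized first so that `..` segments cannot slip past a writable pattern.

```python
import posixpath
from fnmatch import fnmatchcase

ALLOWED_TO_MODIFY = ["players/autogo/**", "training/**", "configs/experiments/**"]
READ_ONLY = ["arena/**", "baselines/**", "evaluations/**"]


def may_write(path, writable=ALLOWED_TO_MODIFY, read_only=READ_ONLY):
    """Return True only if a repository-relative path is writable."""
    normalized = posixpath.normpath(path)
    escapes_repo = normalized == ".." or normalized.startswith("../")
    if escapes_repo or posixpath.isabs(normalized):
        return False
    # Read-only patterns take precedence over writable ones (an assumption).
    if any(fnmatchcase(normalized, pattern) for pattern in read_only):
        return False
    return any(fnmatchcase(normalized, pattern) for pattern in writable)
```

Note that `fnmatchcase` lets `*` match across `/`, so `training/**` covers nested files; anything not matched by a writable pattern is denied by default.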
Versions
A benchmark version identifies one frozen contract. It should include the task text, allowed inputs, dataset versions, output schema, budget rules, allowed file surfaces, compatible evaluation ids, and any policy constraints that affect validity.
Non-breaking edits clarify wording, fix typos, or add examples without changing what a valid run may do. Breaking edits change inputs, hidden data, budgets, output schemas, allowed files, baseline opponents, or compatible evaluations. Breaking edits should create a new benchmark version rather than silently mutating old results.
Runs should record the exact benchmark version they targeted. Evaluations should record both the run artifacts and the evaluation version used, so the system can distinguish a changed task from a changed scorer.
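The breaking versus non-breaking distinction can be approximated by comparing two contract versions field by field. This is a rough sketch: the field names in `BREAKING_FIELDS` are hypothetical labels for the categories listed above, and a wording-only edit to the task text can still change meaning, so a human review step is assumed on top of it.

```python
# Hypothetical field names for the categories the text calls breaking.
BREAKING_FIELDS = frozenset({
    "visible_inputs", "hidden_data", "budget", "output_schema",
    "allowed_files", "baseline_opponents", "compatible_evaluations",
})


def changed_fields(old, new):
    """Return the sorted names of top-level fields that differ."""
    keys = set(old) | set(new)
    return sorted(key for key in keys if old.get(key) != new.get(key))


def is_breaking(old, new):
    """Return True if any changed field is one that alters validity."""
    return any(key in BREAKING_FIELDS for key in changed_fields(old, new))


v1 = {"task_text": "Solve.", "budget": {"wall_time_hours": 6}}
reworded = {"task_text": "Solve each problem.", "budget": {"wall_time_hours": 6}}
rebudgeted = {"task_text": "Solve.", "budget": {"wall_time_hours": 12}}
```

Here `reworded` would keep the benchmark version, while `rebudgeted` would require a new one.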
Benchmark Versus Evaluation
Benchmark answers: What is the task? Evaluation answers: How was a submitted artifact scored? The benchmark is visible to the agent and operator. The evaluation may contain hidden labels, private tests, trusted judges, arena orchestration, and metric aggregation that agents must not control.
A benchmark can list success criteria such as target metric names, pass/fail gates, leaderboard eligibility, or compatible evaluation families. It should not embed the private scoring implementation, hidden answers, or mutable judge prompts that determine the final score.
The same benchmark can support multiple evaluations. A prediction benchmark might be scored by exact-match accuracy, judge-assisted correctness, and calibration. A training benchmark might be scored by fast smoke evaluation, full held-out evaluation, and efficiency evaluation. Those evaluations can change or be rerun without redefining the task contract.
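Compatibility between a benchmark and its evaluations could be checked with a small parser for `name@version` references, the form used in the sketches in this document. The helper names are hypothetical, and treating the version as a plain integer is an assumption.

```python
def parse_evaluation_ref(ref):
    """Split 'name@version' into (name, int(version)) or raise ValueError."""
    name, sep, version = ref.partition("@")
    if not sep or not name or not version.isdigit():
        raise ValueError(f"expected 'name@version', got {ref!r}")
    return name, int(version)


def can_score(compatible_evaluations, evaluation_ref):
    """Return True if the exact evaluation version is listed as compatible."""
    wanted = parse_evaluation_ref(evaluation_ref)
    listed = {parse_evaluation_ref(ref) for ref in compatible_evaluations}
    return wanted in listed


compatible = ["w2s-pgr@3", "math-exact-match@2"]
```

Matching on the exact version means a rerun under `w2s-pgr@4` needs an explicit contract update, so a changed scorer never silently applies to an old task.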
Examples
W2S math benchmark sketch
This sketch defines the weak-to-strong task boundary and artifact contract. It exposes public prompts and weak labels, keeps strong labels and official scoring in evaluation-only storage, defines the prediction schema, and leaves PGR or exact-match computation to Evaluation.
benchmark_id: w2s-math
version: 1.2.0
visible_inputs:
  - dataset: math_public_prompts@2026-04-10
  - weak_labels: weak_teacher_answers@2026-04-10
hidden_from_agent:
  - strong_labels
  - official_pgr_scorer
required_outputs:
  - predictions.jsonl
  - run_manifest.json
  - optional_reasoning_traces.jsonl
budget:
  wall_time_hours: 6
  token_usd: 200
compatible_evaluations:
  - w2s-pgr@3
  - math-exact-match@2
AutoGo fast learning benchmark sketch
This sketch defines the game, baseline, mutable experiment surfaces, immutable arena surfaces, budgets, and required artifacts without baking win-rate, Elo, or loss calculation into the benchmark.
benchmark_id: autogo-fast-learning
version: 0.9.0
task:
  goal: improve player strength under fixed compute
  ruleset: chinese-7.5-komi
allowed_to_modify:
  - players/autogo/**
  - training/**
  - configs/experiments/**
read_only:
  - arena/**
  - baselines/**
  - evaluations/**
required_outputs:
  - final_checkpoint
  - training_log.jsonl
  - self_play_dataset_manifest.json
  - load_player_command.txt
budget:
  gpu_hours: 24
  self_play_games: 50000
compatible_evaluations:
  - autogo-fast-arena@1
  - autogo-full-arena@1
Invalid benchmark smell
If a benchmark includes hidden answers or scorer implementation details in the agent-visible contract, it is mixing Benchmark and Evaluation.
bad:
  benchmark_contains:
    - hidden_test_labels.csv
    - scorer_private_thresholds.py
    - judge_prompt_with_answer_key.txt
better:
  benchmark_contains:
    - output_schema
    - allowed_inputs
    - compatible_evaluation_ids
  evaluation_contains:
    - hidden labels
    - scorer implementation
    - metric aggregation
Owns / Defines
Rules, allowed inputs, expected outputs, budgets, target metric names, and applicable evaluations.
Questions Operators Should Answer
- What exact inputs does the agent receive, and what information remains hidden inside evaluation-only storage?
- What output schema, artifact layout, or API contract must every run produce?
- Which budgets are part of the task contract: wall time, tokens, dollars, GPUs, samples, or attempts?
- Which files, directories, services, tools, secrets, and network surfaces may the agent read, write, execute, or modify?
- Which evaluation ids and versions are compatible with this benchmark, and what metrics may they emit?
- Can one benchmark support multiple datasets, difficulty tiers, or evaluation versions without becoming an ambiguous task?
- What changes are breaking enough to require a new benchmark version?