Primitive

Evaluation

The scoring process or judge.

Definition

Evaluation is the trusted scoring or judging process applied to run outputs. It is not the benchmark: the benchmark defines the task contract, dataset, prompts, environment, and submission shape, while the evaluation decides how a submitted run is scored. An evaluation may execute unit or integration tests, compare predictions against hidden labels, run a head-to-head arena, call a model judge, coordinate human review, or aggregate operational signals into metrics. Because evaluation is the authority that turns artifacts into scores, it should be reproducible, versioned, auditable, rerunnable, and explicit about what private information it can access.

How It Looks

Run artifacts -> Evaluation / judge -> Signals / metrics / gates

An evaluation looks like a trusted scorer: it receives run artifacts, resolves hidden data or private tests, executes the scoring process, emits metrics and diagnostics, and records enough evidence to rerun the score.

How To Use It

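Use an evaluation whenever a score has to be trusted, for example when it feeds a leaderboard, a policy gate, or a regression decision. It should be the only component with access to hidden labels, private tests, official scoring rubrics, judge prompts, adjudication policy, or private arena opponents.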

Not the Benchmark

A benchmark describes what work should be attempted: the task, dataset, environment, input schema, output schema, allowed tools, time limits, and submission rules. An evaluation describes how completed work is judged. Keeping them separate lets the same benchmark support multiple scorers, allows scorer bugs to be fixed without rewriting the task contract, and makes it clear which component is trusted with hidden data.

What Evaluations Can Do

An evaluation can be a deterministic test suite, a hidden-label scorer, an arena runner, a human review workflow, a model-as-judge pipeline, or a hybrid process. The important property is that it consumes run artifacts and emits signals that downstream systems can compare, filter, audit, and rerun.
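The shared shape of these scorers (artifacts in, signals out) can be sketched as a small interface. This is a minimal illustration, not a prescribed API: `Evaluation`, `EvaluationResult`, `NonEmptyOutputCheck`, and the `output.txt` artifact name are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class EvaluationResult:
    """Signals emitted by a scorer (hypothetical container)."""
    metrics: Dict[str, float]
    diagnostics: List[str] = field(default_factory=list)
    scorer_version: str = "unversioned"


class Evaluation(Protocol):
    """Anything that turns run artifacts into an EvaluationResult."""

    def evaluate(self, artifacts: Dict[str, bytes]) -> EvaluationResult: ...


class NonEmptyOutputCheck:
    """Toy deterministic scorer: passes when the run produced a non-empty output file."""

    version = "nonempty-check-0.1"  # assumed version label

    def evaluate(self, artifacts: Dict[str, bytes]) -> EvaluationResult:
        ok = bool(artifacts.get("output.txt"))
        return EvaluationResult(
            metrics={"pass": 1.0 if ok else 0.0},
            diagnostics=[] if ok else ["output.txt missing or empty"],
            scorer_version=self.version,
        )
```

A test suite, an arena runner, or a judge pipeline could each sit behind the same `evaluate` method, which is what lets downstream systems compare and rerun them uniformly.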

Hidden Labels And Private Tests

For prediction tasks, the evaluation usually owns the answer key, private split, scoring rubric, or withheld tests. Runs submit predictions or artifacts, not labels. The scorer compares outputs against hidden labels, executes private tests, records failures, and emits only approved metrics or diagnostics. Keeping the scoring data evaluator-only limits overfitting to the answer key, label leakage, and accidental exposure of the official scoring data.
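A hidden-label scorer along these lines might look like the sketch below. The function name, the dictionary-based prediction format, and the choice to count missing predictions as wrong are assumptions made for illustration.

```python
from typing import Dict


def score_predictions(predictions: Dict[str, str], hidden_labels: Dict[str, str]) -> dict:
    """Compare submitted predictions with an evaluator-only answer key."""
    # Ids the run did not answer count as wrong (an assumed policy).
    missing = [ex_id for ex_id in hidden_labels if ex_id not in predictions]
    # Ids that are not in the answer key are reported but never scored.
    unexpected = [ex_id for ex_id in predictions if ex_id not in hidden_labels]
    correct = sum(
        1 for ex_id, label in hidden_labels.items() if predictions.get(ex_id) == label
    )
    # Only aggregates leave the scorer; the labels themselves are never returned.
    return {
        "accuracy": correct / len(hidden_labels) if hidden_labels else 0.0,
        "n_scored": len(hidden_labels),
        "n_missing": len(missing),
        "n_unexpected": len(unexpected),
    }
```

Returning counts rather than the offending ids is deliberate here: even a list of which examples were wrong can leak information about the answer key over repeated submissions.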

Arena Evaluation

Some systems are better judged by interaction than by static labels. An arena evaluation runs competitors in matched conditions, controls seeds and opponents, records traces, and computes outcomes such as win rate, Elo, task completion rate, or regret. The arena itself is part of the evaluator because small changes to opponent policy, pairing logic, or tie-breaking rules can change the score.
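As one example of an arena outcome, the standard Elo update can be sketched as follows. The K-factor of 32 and the 400-point scale are conventional defaults assumed here, not values taken from any particular arena, and scoring a draw as 0.5 is itself a tie-breaking policy that belongs to the evaluator.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected result for player A under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings; score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Because the result depends on `k`, the draw value, and the order in which matches are processed, those settings would need to be recorded alongside the scores.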

Human And Model Judges

Judged evaluations can route artifacts to human reviewers, model judges, or both. The evaluation should specify the rubric, judge instructions, sampling policy, adjudication rules, calibration set, aggregation method, and appeal or disagreement handling. Model judges should be treated like scorer code: name the model, prompt, temperature, tool access, rubric version, and any post-processing.

Scorer Versions And Reruns

A score is only meaningful when tied to the scorer version that produced it. Store the scorer implementation, container or commit, judge model, prompt version, dependency versions, hidden-data snapshot, arena opponent version, and configuration. When the scorer changes, old scores should be preserved with their original version and new scores should be created by an explicit rerun rather than silently overwritten.
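One way to make this concrete is to fingerprint the scorer configuration and key every stored score by that fingerprint. The sketch below is illustrative only: `ScorerVersion`, `ScoreStore`, and the four fields shown are hypothetical, and a real record would likely carry more of the items listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ScorerVersion:
    scorer_commit: str
    judge_model: str
    prompt_version: str
    hidden_data_snapshot: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so identical configurations hash identically.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


class ScoreStore:
    """Append-only store: a rerun adds a row and never replaces one."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], float] = {}

    def record(self, run_id: str, version: ScorerVersion, score: float) -> None:
        key = (run_id, version.fingerprint())
        if key in self._rows:
            raise ValueError("score already recorded for this run and scorer version")
        self._rows[key] = score

    def history(self, run_id: str) -> List[Tuple[str, float]]:
        """All (scorer fingerprint, score) pairs recorded for one run."""
        return [(fp, s) for (rid, fp), s in self._rows.items() if rid == run_id]
```

With this layout, a scorer change yields a new fingerprint, so the rerun score sits next to the old one instead of overwriting it.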

Metrics And Signals

Evaluations should emit both headline metrics and supporting signals. Headline metrics might include accuracy, F1, pass rate, performance gap recovered (PGR), win rate, cost-normalized score, latency, safety violation rate, or human preference rate. Supporting signals might include per-example outcomes, confidence intervals, judge rationales, test logs, trace IDs, flaky-test markers, invalid-submission errors, and policy gates.
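To show a headline metric travelling with its supporting signals, the sketch below reports a pass rate together with a 95% Wilson score interval and the per-example outcomes. The Wilson interval is one common choice for binomial proportions, assumed here for illustration; the function names and the output layout are hypothetical.

```python
import math
from typing import List, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def summarize(outcomes: List[bool]) -> dict:
    """Bundle the headline pass rate with its interval and per-example outcomes."""
    passed = sum(outcomes)
    return {
        "pass_rate": passed / len(outcomes) if outcomes else 0.0,
        "ci95": wilson_interval(passed, len(outcomes)),
        "per_example": list(outcomes),
    }
```

Reporting the interval next to the rate makes it easier to tell whether a small difference between two runs is meaningful at the sample size used.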

Auditability

The evaluator should leave enough evidence to explain why a run received its score without exposing private data. Useful records include input artifact hashes, scorer version, timestamps, random seeds, sandbox limits, judge assignments, arena pairings, metric calculations, and redacted traces. This makes leaderboard decisions, policy gates, and regressions defensible.
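A minimal sketch of such a record is shown below, assuming the submitted artifact is available as bytes and the trace is a flat dictionary. The field names, the `[REDACTED]` marker, and the default list of private keys are hypothetical choices for illustration.

```python
import hashlib
from typing import Any, Dict, Iterable


def redact(trace: Dict[str, Any], private_keys: Iterable[str]) -> Dict[str, Any]:
    """Replace evaluator-only values with a marker while keeping the key visible."""
    private = set(private_keys)
    return {k: ("[REDACTED]" if k in private else v) for k, v in trace.items()}


def build_audit_record(
    artifact: bytes,
    scorer_version: str,
    seed: int,
    recorded_at: str,
    metrics: Dict[str, float],
    trace: Dict[str, Any],
    private_keys: Iterable[str] = ("hidden_label", "judge_prompt"),
) -> Dict[str, Any]:
    """Collect the evidence needed to explain a score without exposing private data."""
    return {
        # The hash identifies exactly which submission was scored.
        "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
        "scorer_version": scorer_version,
        "seed": seed,
        "recorded_at": recorded_at,
        "metrics": dict(metrics),
        "trace": redact(trace, private_keys),
    }
```

Keeping redacted keys in the trace, rather than dropping them, lets an auditor see that a hidden value was used without seeing the value itself.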

Examples

W2S PGR

In a weak-to-strong (W2S) style task, the benchmark can define the prediction format and dataset split, while the evaluation owns the hidden labels and the PGR calculation. A run submits predictions. The scorer validates the file, joins predictions to the hidden answer key, computes PGR, emits aggregate and per-slice signals, and returns no raw labels to the submitter.
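The PGR step can be sketched as below, assuming PGR means performance gap recovered as it is commonly defined in weak-to-strong generalization work: the fraction of the gap between a weak baseline and a strong ceiling that the submitted run closes. The function and parameter names are hypothetical, and a given benchmark may define the baselines differently.

```python
def performance_gap_recovered(
    weak: float, strong_ceiling: float, weak_to_strong: float
) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    All three inputs are scores on the same metric (for example accuracy on the
    hidden split). 0.0 means no improvement over the weak baseline and 1.0 means
    the run matches the strong ceiling.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # Without a positive gap the ratio is undefined or misleading.
        raise ValueError("strong ceiling must exceed the weak baseline")
    return (weak_to_strong - weak) / gap
```

For instance, with a weak baseline of 0.60, a strong ceiling of 0.90, and a submitted run at 0.75, the run recovers about half of the gap. The same function could be applied per slice to produce the per-slice signals.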

AutoGo Win Rate

In an AutoGo-style task, the benchmark can define the environment and agent interface, while the evaluation runs a controlled arena. The scorer launches the submitted policy against a baseline or reference opponent, fixes seeds and match counts, records game traces, handles crashes or illegal moves, and reports win rate with uncertainty and failure classes.
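The control flow of such an arena can be sketched with a toy game. Rock-paper-scissors stands in for the real environment here, and `Policy`, `play_match`, and `run_arena` are hypothetical names. Counting crashes and illegal moves as non-wins in the denominator is an assumed policy, not something the task requires.

```python
import random
from collections import Counter
from typing import Callable, Dict

# A policy receives a seeded RNG and returns 0 (rock), 1 (paper), or 2 (scissors).
Policy = Callable[[random.Random], int]


def play_match(policy: Policy, opponent: Policy, seed: int) -> str:
    """Play one seeded match and return a single outcome class."""
    rng = random.Random(seed)
    try:
        move = policy(rng)
    except Exception:
        return "crash"  # submitted policy raised; record it instead of aborting the arena
    if move not in (0, 1, 2):
        return "illegal_move"
    reply = opponent(rng)  # the reference opponent is trusted not to fail
    if move == reply:
        return "draw"
    return "win" if (move - reply) % 3 == 1 else "loss"


def run_arena(
    policy: Policy, opponent: Policy, n_matches: int, base_seed: int = 0
) -> Dict[str, object]:
    """Run a fixed number of matches with fixed seeds and summarize the outcomes."""
    if n_matches <= 0:
        raise ValueError("n_matches must be positive")
    outcomes = Counter(
        play_match(policy, opponent, base_seed + i) for i in range(n_matches)
    )
    return {
        "win_rate": outcomes["win"] / n_matches,
        "n_matches": n_matches,
        "outcomes": dict(outcomes),  # failure classes stay visible next to the headline
    }
```

Because every match derives its RNG from `base_seed + i`, rerunning the arena with the same policy, opponent, and seeds reproduces the same outcome counts.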

Model-Judged Review

For open-ended outputs, the evaluation may call a model judge with a private rubric. The scorer packages the prompt, candidate answer, references, and rubric; requests a structured judgment; validates the response; and aggregates judge scores into preference, quality, or safety metrics. The judge prompt and model version are part of the scorer version.
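The validate-and-aggregate steps might look like the sketch below. It assumes the judge is asked to return JSON with an integer `score` on a 1 to 5 scale and a string `rationale`; that schema, the scale, and the function names are hypothetical, and the call to the judge model itself is left out.

```python
import json
from statistics import mean
from typing import List, Optional

ALLOWED_SCORES = range(1, 6)  # assumed 1-5 rubric scale


def parse_judgment(raw: str) -> Optional[dict]:
    """Return a validated judgment, or None when the response breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    rationale = data.get("rationale")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or score not in ALLOWED_SCORES:
        return None
    if not isinstance(rationale, str):
        return None
    return {"score": score, "rationale": rationale}


def aggregate(raw_judgments: List[str]) -> dict:
    """Average valid judge scores and count responses that failed validation."""
    parsed = [parse_judgment(raw) for raw in raw_judgments]
    valid = [p for p in parsed if p is not None]
    return {
        "mean_score": mean(p["score"] for p in valid) if valid else None,
        "n_valid": len(valid),
        "n_invalid": len(parsed) - len(valid),
    }
```

Invalid responses are counted rather than silently dropped, so a judge that frequently breaks the schema shows up as a signal instead of quietly shrinking the sample.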

Owns / Defines

Input artifacts, scorer command/API, hidden labels or rubrics, tests, arena configuration, judge policy, scorer versions, metric outputs, and pass/fail gates.

Questions Operators Should Answer

What is the benchmark contract, and what scoring responsibility belongs only to the evaluation?
What artifacts and metadata must a run provide before evaluation can start?
Which scorer implementation is trusted, versioned, sandboxed, and allowed to access hidden data?
What hidden labels, private tests, rubrics, judge prompts, or arena opponents are evaluator-only?
Which scorer version, judge model, prompt, dependency set, hidden-data snapshot, and arena configuration produced each score?
What outputs are emitted: scalar metrics, pass/fail gates, confidence intervals, traces, explanations, or error classes?
Can evaluations be rerun after scorer changes, and how are old scores preserved, superseded, invalidated, or compared?
How should flaky, partial, human, or model-judged scores be represented in leaderboards and policy decisions?
How are reruns audited so users can tell whether a score changed because the run changed, the scorer changed, or hidden data changed?