Dataset
The versioned input material.
Definition
A dataset is versioned input material consumed by runs, agents, training jobs, or evaluations. It can be mounted into a worker, synced into managed storage, streamed from a remote source, generated by an earlier run, or hidden behind an evaluation boundary. Backend design should make provenance, immutability, checksums, split visibility, and data movement explicit so operators can reproduce results and enforce access rules.
How It Looks
A dataset looks like a versioned source plus movement rules: where the bytes live, what checksum pins them, which splits exist, who can see each split, how workers receive it, and what run or import produced it.
How To Use It
Use datasets for anything that must be versioned, audited, shared across runs, or moved to machines. Generated data should become a dataset too, with a parent run, input dataset versions, code snapshot, policy, and checksum recorded as provenance.
What a dataset owns
A dataset record should identify where the bytes or records come from, what version was used, how integrity was checked, which splits exist, who can see each split, and how workers are allowed to access the material.
The primitive is not just a pointer to storage. It is the contract that lets a run say exactly which inputs it consumed and lets an evaluator prove that private or hidden material did not leak into the agent-visible context.
At minimum, model the source URI, version identifier, checksum or manifest digest, schema, split names, visibility rules, sync or mount mode, provenance links, license or retention constraints, and deletion status.
Mounted mode
Mounted datasets are exposed to workers as files or directories without copying the full payload into each job. This is useful for large read-only corpora, shared benchmark fixtures, model checkpoints, rule books, maps, media assets, or simulator resources.
The important contract is whether the mount is immutable and content-addressed. If a path like /data/current can change underneath a run, the dataset version must still resolve to a stable snapshot or manifest so the run remains reproducible.
Operators need mount path, read/write policy, cache behavior, expected locality, fallback behavior when the mount is unavailable, and whether files are visible to agents, tools, evaluators, or only infrastructure.
Synced mode
Synced datasets are copied from an upstream source into managed storage before or during execution. The platform can then pin, checksum, replicate, cache, and serve the copied version independently of the upstream system.
This mode is best when workers need fast local reads, upstream availability is unreliable, the source may mutate, or the platform must enforce retention and access rules. Syncing should produce a manifest that records copied objects, byte sizes, digests, timestamps, and skipped or failed entries.
A synced dataset should distinguish the upstream revision from the platform snapshot. The upstream branch or bucket prefix can keep changing, while the registered dataset version should remain stable unless an operator intentionally creates a new version.
Remote or streamed mode
Remote or streamed datasets are read from an external service, warehouse, queue, API, or object store at execution time. The dataset record describes how to read the source and how much of the stream is part of the versioned contract.
Streaming is useful for very large collections, online evaluation feeds, rate-limited partner data, or sources that cannot be copied for legal or operational reasons. It is also the riskiest mode for reproducibility unless the query, cursor, snapshot timestamp, pagination behavior, and response digests are captured.
Operators should decide whether streamed records are cached into a replayable manifest after use. If not, the run record must be explicit that exact byte-for-byte replay depends on the remote provider.
Generated mode
Generated datasets are produced by runs and then promoted into first-class dataset versions. Examples include synthetic prompts, filtered traces, self-play games, rollouts, labels, embeddings, distilled preference pairs, or transformed train and test splits.
Generated data must keep lineage to the parent run, producing agent or model, input dataset versions, code snapshot, random seeds, generation policy, filters, redaction rules, and checksum. Without those links, generated material becomes impossible to audit or regenerate.
Derived datasets should not overwrite their parents. They should form a lineage graph so operators can answer which results depended on which source data, transformations, and policies.
Hidden mode
Hidden datasets contain material that must not be visible to the agent, contestant, training process, or user-facing run context. Hidden labels, gold answers, private rubrics, grader prompts, holdout examples, and safety review notes all fit this mode.
Hidden does not mean unversioned. The evaluator still needs a stable dataset version, checksum, access log, and provenance trail. The platform should enforce boundary checks so hidden splits are only injected into evaluation code or trusted grader tools.
A useful implementation treats visibility as split-level metadata. A dataset can contain public prompts, private metadata, and hidden labels, with each consumer receiving only the fields and splits it is allowed to inspect.
Versioning and integrity
Dataset versions should be immutable once registered. If upstream material changes, register a new version rather than mutating the old one. This keeps run comparisons meaningful and prevents silent benchmark drift.
Checksums can be a single archive digest, a manifest digest, per-object digests, row-level hashes, or query result digests depending on storage mode. The key is to define what the checksum covers: bytes, records, normalized rows, schema, metadata, or all of them.
For large or streamed datasets, prefer manifests that can verify partial reads and record ordering. The run should record the exact dataset version and, when applicable, the subset, shard, cursor range, or sample seed it consumed.
Movement and visibility
Dataset movement is part of the security model. Copying a hidden split to an agent worker is a leak even if the code promises not to read it. Mounting a private dataset into a shared container can violate isolation even if file permissions look correct.
Declare where each split may travel: control plane only, evaluator worker, trainer worker, agent sandbox, object cache, local disk, external warehouse, or no-copy remote access. Also declare whether derived outputs may contain source records or only aggregates.
Visibility should cover people and systems. Human operators, annotators, agents, trainers, evaluators, graders, dashboards, logs, and exports may all need different access to the same dataset version.
Show Examples
W2S benchmark
A W2S setup can register public unlabeled examples as an agent-visible dataset and hidden labels as an evaluator-only split. Agents receive prompts and allowed metadata, while the grader resolves the hidden label split by version after the run finishes.
dataset:w2s-prompts@2026-04-12 mode: synced splits: public_prompts: agent_visible hidden_labels: evaluator_only checksum: manifest:sha256:... provenance: imported from benchmark release 2026-04-12
AutoGo self-play
AutoGo can mount static game rules and opening books, stream a remote opponent archive for analysis, and promote self-play games from each training run into generated datasets. Later evaluations can reference both the generated game records and the checkpoint dataset that produced them.
dataset:autogo-selfplay@run-8421 mode: generated parents: - dataset:autogo-openings@v3 - run:autogo-train-8421 artifacts: games: gs://aquamarine/autogo/run-8421/games/*.jsonl checkpoints: gs://aquamarine/autogo/run-8421/checkpoints/*.safetensors checksum: manifest:sha256:...
Remote labeling stream
A labeling job can read records from a remote warehouse without copying the full source. The dataset version should pin the query, snapshot timestamp, schema, and replay manifest for records actually consumed.
dataset:triage-cases@warehouse-snapshot-188 mode: remote_streamed source: warehouse://prod.cases query_digest: sha256:... snapshot_time: 2026-05-01T00:00:00Z replay_manifest: gs://aquamarine/manifests/triage-cases-188.jsonl
Owns / Defines
Source URI, checksum, splits, visibility, sync mode, and provenance.
Questions Operators Should Answer
- Is the dataset immutable once registered, or can it sync from a changing upstream source?
- What exact version identifier should runs cite: release tag, object manifest digest, warehouse snapshot, query digest, Git commit, or generated run ID?
- What does the checksum cover: raw bytes, normalized records, manifests, schema, metadata, split assignments, or all of them?
- Which splits and fields are visible to agents, trainers, evaluators, annotators, dashboards, exports, and human operators?
- How are generated datasets linked back to the run, agent, code snapshot, and policy that produced them?
- Can hidden labels or grader-only fields ever move to agent workers, logs, caches, traces, or derived datasets?
- Which storage mode is required: mounted filesystem, synced object snapshot, external warehouse, streaming API, generated artifact, inline metadata, or hybrid?
- When data moves, who owns the copy, where is it cached, how long is it retained, and how is deletion propagated?
- Are derived datasets allowed to include source records, redacted records, aggregates, embeddings, labels, or only references to parents?
- How are schema changes, split changes, deduplication, licensing, consent, and deletion requests tracked across derived datasets?
- If a remote provider changes or disappears, can the platform replay the exact records used by an earlier run?
- Should dataset registration fail when checksums, manifests, licenses, provenance, or visibility rules are missing?