Primitive

Workspace

The project boundary for research.

Definition

A workspace is the durable namespace where research is organized and governed. It isolates teams, secrets, budgets, compute access, shared memory, and benchmark visibility while giving product and backend systems a single place to attach billing, permissions, audit logs, and retention rules.

How It Looks

[Diagram: Workspace containing Benchmarks / Datasets / Agents / Runs / Memory / Policies]

A workspace looks like the lab boundary: shared compute credentials, budget limits, benchmark registry, dataset store, agents, run history, memory, policies, secrets, artifacts, and audit logs.

How To Use It

Use a workspace whenever permissions, billing, compute credentials, or memory should be shared. Do not make every benchmark its own workspace unless the team or security boundary is actually different.

What it is

A workspace is the top-level research boundary: the place where a group agrees that benchmarks, agents, datasets, runs, memory, policies, secrets, compute, and artifacts can see each other. It is not just a folder. It is the unit that product uses for navigation, collaboration, billing, and admin control, and the unit that backend uses for tenancy, authorization, quotas, audit, storage prefixes, and retention.

The practical rule is: if two resources should share trust, budget, credentials, and operational history, they belong in the same workspace. If they require different admins, deletion rules, cost centers, external integrations, or data visibility, they probably need separate workspaces even when the research topic is similar.

Product shape

In product, a workspace should feel like the research lab, not like a single experiment. The workspace home should answer: what is this group trying to improve, what is currently running, what changed recently, what resources are blocked, and what decisions need an operator. Useful surfaces include active runs, benchmark scoreboards, dataset growth, GPU utilization, budget burn, recent memory writes, failing policies, pending approvals, and artifacts worth promoting.

Navigation should make ownership obvious. A run page should show its workspace, benchmark, agent, dataset inputs, policy trigger, compute pool, artifact outputs, and audit trail. A dataset page should show whether it is workspace-private, benchmark-specific, derived from external data, generated by self-play, or safe to reuse across systems. A memory page should show the scope and trust level of each entry rather than presenting all notes as equally authoritative.

Backend shape

In backend, `workspace_id` should be present on every durable object that participates in tenancy: benchmark, dataset, agent, run, evaluation, memory item, policy, secret reference, compute pool, artifact, queue item, budget ledger entry, and audit event. Cross-workspace references should be explicit links with their own permissions, not accidental foreign keys.

Workspace boundaries should be enforced before resource-specific permissions. A user may be able to view a public benchmark definition without being allowed to read private runs, secrets, generated datasets, or memory in the workspace where that benchmark is used. Backend APIs should make the scope impossible to omit, for example: list runs in workspace X, launch against benchmark Y inside workspace X, write artifact Z under workspace X's storage policy.
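The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Caller`, `Run`, `AccessDenied`, the `memberships` map, and the `runs:read` permission string are all hypothetical names introduced here. The point is that the workspace argument is required and is checked before any resource-level rule.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when a caller fails the workspace check or the resource check."""


@dataclass(frozen=True)
class Run:
    run_id: str
    workspace_id: str


@dataclass
class Caller:
    user_id: str
    # Hypothetical membership map: workspace_id -> permissions held there.
    memberships: dict[str, set[str]] = field(default_factory=dict)


def list_runs(caller: Caller, workspace_id: str, runs: list[Run]) -> list[Run]:
    """List runs in one workspace; the scope is a required argument."""
    # Step 1: workspace boundary. Non-members are rejected before any
    # resource-level rule is consulted.
    if workspace_id not in caller.memberships:
        raise AccessDenied(f"{caller.user_id} is not a member of {workspace_id}")
    # Step 2: resource-specific permission inside that workspace.
    if "runs:read" not in caller.memberships[workspace_id]:
        raise AccessDenied(f"{caller.user_id} lacks runs:read in {workspace_id}")
    # Step 3: return only objects that carry the requested workspace_id.
    return [r for r in runs if r.workspace_id == workspace_id]
```

Because `workspace_id` is a positional parameter rather than an optional filter, there is no call shape that lists runs across workspaces by accident.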

Operationally, workspaces are where quotas and cleanup become tractable. Compute admission, concurrent run limits, external API spend, storage lifecycle, artifact TTLs, secret rotation, webhook targets, and audit retention all need a single governing object. Without that object, product ends up enforcing policy in UI state and backend ends up reconstructing ownership from loosely related records.
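As a rough sketch of what compute admission against a single governing object could look like, the Python below checks a run request against a workspace-level budget. The field names mirror budget keys such as `gpu_hours_per_week` and `max_concurrent_runs`; the `WorkspaceBudget` class, the `admit` function, and the up-front reservation strategy are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WorkspaceBudget:
    gpu_hours_per_week: float
    max_concurrent_runs: int
    gpu_hours_used: float = 0.0  # hours already reserved this week
    running: int = 0             # runs currently admitted


def admit(budget: WorkspaceBudget, requested_gpu_hours: float) -> tuple[bool, str]:
    """Decide whether a run may start, and reserve its budget if so."""
    if budget.running >= budget.max_concurrent_runs:
        return False, "concurrent run limit reached"
    if budget.gpu_hours_used + requested_gpu_hours > budget.gpu_hours_per_week:
        return False, "weekly GPU-hour budget would be exceeded"
    # Reserve the hours up front so later requests see the commitment.
    budget.gpu_hours_used += requested_gpu_hours
    budget.running += 1
    return True, "admitted"
```

A real system would also release reservations on completion or cancellation and write each decision to the budget ledger and audit log; those parts are omitted here.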

Lifecycle

A workspace usually moves through create, configure, operate, audit, archive, and export/delete. Creation should establish owners, billing, default visibility, storage region, compute access, and initial policies. Operation should record run launches, cancellations, manual overrides, policy decisions, dataset imports, memory writes, artifact promotions, and budget changes.

Archival should freeze launches while preserving reproducibility: benchmark definitions, run metadata, artifact manifests, evaluation summaries, and audit logs should remain readable according to retention policy. Deletion should be explicit about what is erased, what is anonymized, what is retained for compliance, and which derived public artifacts survive outside the workspace.
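One way to make the lifecycle enforceable is a small state machine. The sketch below is an assumption-heavy illustration: the stage names follow the sequence above, but the exact transition table (for example, that an audit can return to operation) and the rule that launches are allowed only while operating are choices made here, not requirements stated in this document.

```python
from __future__ import annotations

from enum import Enum


class Stage(Enum):
    CREATED = "created"
    CONFIGURED = "configured"
    OPERATING = "operating"
    AUDITING = "auditing"
    ARCHIVED = "archived"
    DELETED = "deleted"


# Hypothetical transition table; adjust to the real governance rules.
ALLOWED: dict[Stage, set[Stage]] = {
    Stage.CREATED: {Stage.CONFIGURED},
    Stage.CONFIGURED: {Stage.OPERATING},
    Stage.OPERATING: {Stage.AUDITING, Stage.ARCHIVED},
    Stage.AUDITING: {Stage.OPERATING, Stage.ARCHIVED},
    Stage.ARCHIVED: {Stage.DELETED},
    Stage.DELETED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def can_launch(stage: Stage) -> bool:
    """Archival freezes launches; only an operating workspace may start runs."""
    return stage is Stage.OPERATING
```

Read access after archival is deliberately not modeled: it depends on the retention policy rather than on the stage alone.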

Operator questions

Before launch: who owns the workspace, what budget can it burn, which compute pools may it use, what secrets can agents access, what data can leave the workspace, and who can approve policy changes?

During operation: are queues backed up, are agents repeatedly failing for the same reason, is a policy exploiting stale memory, are datasets growing faster than expected, are evaluations still measuring the benchmark contract, and which runs deserve human review?

After results: which artifacts are promotable, which memory entries should become durable guidance, which datasets are contaminated or reusable, which failed attempts should be hidden from future policies, and what needs to be exported for reproducibility or customer reporting?

Examples

AutoGo research workspace

For AutoGo-style systems, the workspace is the arena where automated research loops coordinate. It holds Go engines, self-play workers, MCTS/search configs, generated games, checkpoints, Elo or arena evaluations, failed experiment memory, and policies that decide the next sweep. The workspace should make the loop inspectable: why this run launched, which prior result it used, what budget it consumed, and whether its artifact is eligible for promotion. A team might create workspace `autogo-main` with a GPU worker pool, self-play dataset store, Go benchmark registry, shared experiment memory, and policies for selecting the next MCTS/training sweep.

{
  "workspace": "autogo-main",
  "compute_pools": ["ssh-a100-west", "spot-h100-batch"],
  "benchmarks": ["go-fast-learning", "arena-elo-regression"],
  "shared_memory": ["failed-search-configs", "promoted-openings"],
  "policies": ["launch-next-sweep", "promote-checkpoint"],
  "budgets": { "gpu_hours_per_week": 1200, "max_concurrent_runs": 48 }
}

Weak-to-strong workspace

For weak-to-strong systems, the workspace is the controlled environment where weak labels, strong model attempts, judge policies, disagreement sets, and promotion rules stay connected. A W2S workspace may contain teacher agents, student agents, generated training datasets, holdout evaluations, rubric memory, and safety policies that decide which examples can be trusted for the next training round. The important design constraint is provenance: when a strong model improves, the workspace should trace the improvement back to weak sources, filtering policies, training runs, evaluation gates, and human interventions.

workspace_id: w2s-medqa
teacher_agents: [weak-labeler-v3, rubric-judge-v2]
student_agents: [strong-candidate-a, strong-candidate-b]
datasets: [weak_labels_2026_05, disagreement_holdout, reviewed_promotions]
gates: [no_holdout_regression, calibrated_disagreement_gain, reviewer_approval]
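The provenance constraint can be pictured as a graph walk from a strong model back to its weak sources. The Python sketch below assumes provenance is stored as "produced from" edges. The edge table is made up for illustration: `train-run-17` and the `@round2` suffix are hypothetical, and the remaining names are borrowed from the config above.

```python
from __future__ import annotations

from collections import deque

# Hypothetical provenance edges: each record points at what produced it.
PARENTS: dict[str, list[str]] = {
    "strong-candidate-a@round2": ["train-run-17"],
    "train-run-17": ["reviewed_promotions", "gate:no_holdout_regression"],
    "reviewed_promotions": ["weak_labels_2026_05", "gate:reviewer_approval"],
    "weak_labels_2026_05": ["weak-labeler-v3"],
}


def trace(node: str, parents: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk from a result back to everything it depends on."""
    visited = {node}
    lineage: list[str] = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in visited:
                visited.add(parent)
                lineage.append(parent)
                queue.append(parent)
    return lineage
```

Calling `trace("strong-candidate-a@round2", PARENTS)` lists the training run first, then the datasets and gates it depended on, and ends at the weak labeler, which is the chain an operator would want to inspect when a strong model improves.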

Backend tenancy rule

Every durable object carries `workspace_id`, and APIs require it in the route or request context. A run cannot read a dataset, secret, memory item, or artifact from another workspace unless there is an explicit share grant. This makes authorization, cost accounting, audit lookup, and storage cleanup local to the workspace instead of inferred from object names.

POST /workspaces/ws_123/runs
{
  "benchmark_id": "bench_autogo_fast_learning",
  "agent_id": "agent_policy_runner",
  "dataset_ids": ["ds_self_play_window_42"],
  "compute_pool_id": "pool_ssh_a100",
  "budget": { "gpu_hours": 24 }
}
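A launch handler for a request like the one above might resolve dataset references as in the following sketch. It assumes share grants are stored as explicit records; `Dataset`, `ShareGrant`, `readable_datasets`, and the `ds_other` and `ws_999` identifiers in the usage note are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    dataset_id: str
    workspace_id: str


@dataclass(frozen=True)
class ShareGrant:
    resource_id: str
    owner_workspace_id: str
    grantee_workspace_id: str


def readable_datasets(
    route_workspace_id: str,
    requested: list[Dataset],
    grants: set[ShareGrant],
) -> list[Dataset]:
    """Return the requested datasets, or raise if any crosses the boundary."""
    allowed = []
    for ds in requested:
        same_workspace = ds.workspace_id == route_workspace_id
        # A cross-workspace read needs a grant naming this exact resource,
        # its owning workspace, and the workspace in the route.
        needed = ShareGrant(ds.dataset_id, ds.workspace_id, route_workspace_id)
        if not (same_workspace or needed in grants):
            raise PermissionError(
                f"{ds.dataset_id} belongs to {ds.workspace_id} "
                f"and is not shared with {route_workspace_id}"
            )
        allowed.append(ds)
    return allowed
```

With this shape, `ds_self_play_window_42` in `ws_123` resolves without any grant, while a dataset such as `ds_other` owned by `ws_999` resolves only after a matching `ShareGrant` exists.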

Workspace split decision

Do not split workspaces just because there are many benchmarks. Split when the operating boundary changes: different customer data, different deletion obligations, different admins, different billing, different secret access, or different compute contracts. Keep related benchmarks together when operators need shared memory, shared queues, and comparable budget accounting.
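The split rule can be written down as a comparison of operating boundaries. The following Python is a sketch under assumptions: the `OperatingBoundary` fields correspond to the six criteria listed above, but representing each as a simple comparable value is a simplification, and the number of benchmarks is intentionally not an input.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OperatingBoundary:
    customer_data: str
    deletion_obligations: str
    admins: frozenset
    billing_account: str
    secret_access: str
    compute_contract: str


def split_reasons(a: OperatingBoundary, b: OperatingBoundary) -> list[str]:
    """Name every boundary dimension on which two groups of work differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]
```

An empty result suggests the work can share one workspace; any non-empty result names the dimension that justifies a split and is worth recording alongside the decision.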

Owns / Defines

Benchmarks, datasets, agents, runs, memory, policies, secrets, compute, and artifacts.

Questions Operators Should Answer

Who can create, administer, archive, or transfer a workspace?
Should a workspace map to a customer, team, project, security boundary, or billing account?
Which resources are workspace-scoped versus benchmark-scoped: secrets, queues, storage, memory, agents, and policies?
How are quotas enforced across compute, storage, external API spend, and concurrent runs?
What data must be exportable, retained, deleted, or hidden when a workspace changes ownership?