Primitive

Policy

The rule for choosing what work happens next.

Definition

A policy decides what should happen next. It can encode a sweep, hill climber, Bayesian optimizer, active learner, agent planner, paper importer, scheduler, or human approval gate. Policies read state and memory, then propose or launch runs under explicit constraints.

How It Looks

Memory + signals → Policy → Next run proposal

A policy looks like a versioned decision rule: objective, input signals, memory filters, search strategy, budgets, approval gates, stop conditions, and the run proposals it is allowed to launch.
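
One way to picture that shape is as an immutable record. The sketch below is illustrative only: `PolicySpec`, its field names, and the `identity()` helper are hypothetical, chosen to mirror the list above rather than any existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicySpec:
    """Hypothetical record mirroring the fields of a versioned decision rule."""

    name: str
    version: str                 # bump on any change so decisions stay comparable
    objective: str               # e.g. "maximize win_rate"
    input_signals: tuple         # metric or event names the policy may read
    memory_filters: dict         # e.g. {"benchmark": "autogo-fastlearn"}
    search_strategy: str         # e.g. "bounded_sweep"
    budgets: dict                # e.g. {"max_gpu_hours": 24}
    approval_gates: tuple = ()   # conditions that require a human decision
    stop_conditions: tuple = ()  # conditions that end the loop
    allowed_proposals: tuple = ()  # kinds of runs the policy may launch

    def identity(self) -> str:
        # A stable label that can be attached to every run the policy launches.
        return f"{self.name}@{self.version}"


spec = PolicySpec(
    name="autogo-sweep",
    version="3",
    objective="maximize win_rate",
    input_signals=("win_rate",),
    memory_filters={"benchmark": "autogo-fastlearn"},
    search_strategy="bounded_sweep",
    budgets={"max_gpu_hours": 24, "max_concurrent_runs": 8},
)
print(spec.identity())
```

Freezing the record means a change in behavior has to arrive as a new version rather than as an in-place edit, which keeps earlier decisions auditable.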

How To Use It

Use a policy to drive the evolution loop. It should make explicit why each run was launched, what objective it optimizes, and what constraints limit it.

Policy Chooses Work

A policy is the decision rule for what should happen next. It can be simple, like a fixed sweep, or complex, like an agent planner that reads memory and proposes experiments. The important part is that the decision is explicit, versioned, constrained, and auditable.

Inputs

Policies read memory, recent signals, benchmark rules, dataset availability, resource state, budget limits, and operator constraints. A policy should never act on the latest metric alone, because the best next run often depends on failures, variance, data movement cost, and hidden evaluation gates.
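
As a small illustration of reading more than the latest metric, the sketch below condenses a run history into the latest score, the mean, the spread, and the failure rate. The function name and the `score` / `failed` keys are assumptions made for this example, not a defined memory format.

```python
from statistics import mean, pstdev


def summarize_history(runs):
    """Summarize past runs so a policy sees more than the newest score.

    `runs` is a list of dicts with hypothetical keys:
    'score' (float or None) and 'failed' (bool).
    """
    scores = [
        r["score"] for r in runs if not r["failed"] and r["score"] is not None
    ]
    return {
        "latest": scores[-1] if scores else None,
        "mean": mean(scores) if scores else None,
        # Population standard deviation as a rough measure of variance.
        "spread": pstdev(scores) if len(scores) > 1 else 0.0,
        "failure_rate": sum(r["failed"] for r in runs) / len(runs) if runs else 0.0,
    }


history = [
    {"score": 0.5, "failed": False},
    {"score": None, "failed": True},
    {"score": 0.7, "failed": False},
]
print(summarize_history(history))
```

A high latest score next to a high failure rate or a wide spread is a different situation from a stable improvement, and the policy can only tell them apart if it reads all of these.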

Outputs

A policy outputs proposals or runs. Some policies launch runs directly. Others create candidate plans for human approval. Outputs should explain why the work is worth doing, what memory it used, what objective it optimizes, and what resources it expects to consume.
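
A proposal that carries its own explanation might look like the following sketch. `RunProposal`, its fields, and `audit_line()` are hypothetical names used only to show the four things an output should state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunProposal:
    """Hypothetical policy output: the work plus the reasons for doing it."""

    policy_id: str             # which policy version produced this proposal
    params: dict               # run parameters to launch with
    rationale: str             # why the work is worth doing
    memory_refs: tuple         # identifiers of the memory entries consulted
    objective: str             # what the run is meant to optimize
    expected_gpu_hours: float  # resources the run expects to consume
    needs_approval: bool = False

    def audit_line(self) -> str:
        mode = "awaiting approval" if self.needs_approval else "auto-launch"
        return (
            f"[{self.policy_id}] {self.objective} | "
            f"{self.expected_gpu_hours:.1f} GPU-h | {mode} | {self.rationale}"
        )


proposal = RunProposal(
    policy_id="autogo-sweep@3",
    params={"temperature": 0.25, "c_puct": 1.5},
    rationale="low temperature helped in recent arena runs",
    memory_refs=("arena-run-a", "arena-run-b"),
    objective="maximize win_rate",
    expected_gpu_hours=3.0,
)
print(proposal.audit_line())
```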

Policy Types

Common policies include hill climbing, random sweeps, grid sweeps, Bayesian optimization, beam search, active learning, paper-inspired idea generation, exploit/explore schedulers, failure triage, and human approval gates. They should all fit the same primitive even if their implementations differ.
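
To show what fitting the same primitive could mean in code, here is a minimal sketch in which two very different strategies expose one `propose(memory)` method. The `Policy` protocol, both classes, and the one-dimensional parameter `x` are hypothetical simplifications.

```python
from typing import Protocol


class Policy(Protocol):
    """Hypothetical shared interface: read memory, return proposed run params."""

    def propose(self, memory: list) -> list: ...


class GridSweep:
    """Propose every grid value that memory has not already tried."""

    def __init__(self, values):
        self.values = list(values)

    def propose(self, memory):
        tried = {m["x"] for m in memory}
        return [{"x": v} for v in self.values if v not in tried]


class HillClimber:
    """Propose the two neighbors of the best-scoring point seen so far."""

    def __init__(self, start, step):
        self.start = start
        self.step = step

    def propose(self, memory):
        scored = [m for m in memory if m.get("score") is not None]
        if not scored:
            return [{"x": self.start}]
        best = max(scored, key=lambda m: m["score"])
        return [{"x": best["x"] - self.step}, {"x": best["x"] + self.step}]


def next_work(policy: Policy, memory: list) -> list:
    # The caller does not need to know which strategy it is running.
    return policy.propose(memory)


memory = [{"x": 2, "score": 0.5}, {"x": 5, "score": 0.9}]
print(next_work(GridSweep([1, 2, 3]), memory))
print(next_work(HillClimber(start=0, step=1), memory))
```

Because the scheduler only calls `propose`, strategies can be swapped or compared without changing the code that enforces budgets and gates.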

Safety And Budget Gates

Policies need guardrails because they allocate scarce resources. They should enforce GPU-hour budgets, cloud spend limits, max concurrency, benchmark rules, approval requirements, forbidden actions, dataset visibility, and stop conditions.
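
A guardrail check can be as plain as a function that returns every violated limit before anything launches. In this sketch, `Limits`, `check_launch`, and the action names are hypothetical, and only three of the guardrails named above are covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical subset of guardrails attached to a policy."""

    max_gpu_hours: float
    max_concurrent_runs: int
    forbidden_actions: frozenset = frozenset()


def check_launch(limits, gpu_hours_used, running, request_gpu_hours, action):
    """Return a list of violations; an empty list means the launch may proceed."""
    violations = []
    if action in limits.forbidden_actions:
        violations.append(f"forbidden action: {action}")
    if gpu_hours_used + request_gpu_hours > limits.max_gpu_hours:
        violations.append("GPU-hour budget exceeded")
    if running >= limits.max_concurrent_runs:
        violations.append("max concurrency reached")
    return violations


limits = Limits(
    max_gpu_hours=24,
    max_concurrent_runs=8,
    forbidden_actions=frozenset({"edit_benchmark"}),
)
print(check_launch(limits, gpu_hours_used=20, running=3,
                   request_gpu_hours=2, action="train"))
print(check_launch(limits, gpu_hours_used=23, running=8,
                   request_gpu_hours=2, action="edit_benchmark"))
```

Returning all violations, rather than stopping at the first, gives the audit trail a complete reason for each refusal.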

Examples

AutoGo bounded sweep

A policy reads recent arena memory, sees that low-temperature settings helped but that high simulation counts are expensive, and launches a bounded sweep over temperature and c_puct on available SSH GPU workers.

policy:
  type: bounded_sweep
  objective: maximize win_rate
  memory_filter: benchmark=autogo-fastlearn
  constraints:
    max_gpu_hours: 24
    max_concurrent_runs: 8
    avoid_memory_tags: [unstable, invalidated]
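
A rough sketch of how such a configuration might expand into runs follows. The grid values, the per-run cost estimate, and the `bounded_sweep` function are assumptions for illustration; only the two limits come from the configuration above.

```python
from itertools import product


def bounded_sweep(temperatures, c_pucts, gpu_hours_per_run,
                  max_gpu_hours, max_concurrent_runs):
    """Expand a grid, keep what the budget affords, and split it into waves.

    `gpu_hours_per_run` is an assumed per-run cost estimate.
    """
    if gpu_hours_per_run <= 0:
        raise ValueError("gpu_hours_per_run must be positive")
    grid = [
        {"temperature": t, "c_puct": c}
        for t, c in product(temperatures, c_pucts)
    ]
    # Truncation keeps the earliest grid points, so list preferred values first.
    affordable = int(max_gpu_hours // gpu_hours_per_run)
    selected = grid[:affordable]
    # Each wave holds at most max_concurrent_runs runs.
    return [
        selected[i:i + max_concurrent_runs]
        for i in range(0, len(selected), max_concurrent_runs)
    ]


waves = bounded_sweep(
    temperatures=[0.25, 0.5, 1.0],    # hypothetical values, low first
    c_pucts=[1.0, 1.5, 2.0, 2.5],     # hypothetical values
    gpu_hours_per_run=3.0,
    max_gpu_hours=24,
    max_concurrent_runs=8,
)
print(len(waves), [len(w) for w in waves])
```

With these assumed numbers the full grid has 12 points, but the 24 GPU-hour budget affords only 8 runs, so the sweep is cut before launch rather than stopped partway through.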

W2S human gate

A policy lets an agent propose weak-to-strong ideas, but requires a human to approve any run that touches hidden-evaluation-adjacent code or uses more than a fixed GPU budget.

policy:
  type: human_gate
  proposal_agent: codex
  auto_approve:
    max_gpu_hours: 2
    datasets: [public]
  require_approval:
    - hidden_eval_boundary
    - budget_over_2_gpu_hours
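
The routing this configuration implies can be sketched as a single function. The `route` name and the proposal keys are hypothetical. The example also assumes that listing `datasets: [public]` under `auto_approve` means any other dataset forces review, which the configuration does not state outright.

```python
def route(proposal, max_auto_gpu_hours=2, auto_datasets=("public",)):
    """Decide whether a proposal may auto-launch or needs a human.

    `proposal` is a dict with hypothetical keys: 'gpu_hours', 'datasets',
    and an optional 'touches_hidden_eval_boundary' flag.
    """
    reasons = []
    if proposal.get("touches_hidden_eval_boundary", False):
        reasons.append("hidden_eval_boundary")
    if proposal["gpu_hours"] > max_auto_gpu_hours:
        reasons.append(f"budget_over_{max_auto_gpu_hours}_gpu_hours")
    # Assumption: datasets outside the auto-approve list also require review.
    if any(d not in auto_datasets for d in proposal.get("datasets", ())):
        reasons.append("non_public_dataset")
    if reasons:
        return ("require_approval", reasons)
    return ("auto_approve", [])


print(route({"gpu_hours": 1, "datasets": ["public"]}))
print(route({
    "gpu_hours": 3,
    "datasets": ["public"],
    "touches_hidden_eval_boundary": True,
}))
```

Returning the reasons alongside the decision lets the approver see which gate fired instead of a bare request for sign-off.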

Owns / Defines

Objective, search strategy, constraints, memory filters, stopping conditions, and approval gates.

Questions Operators Should Answer

What objective is the policy optimizing, and which metrics or gates are allowed to influence it?
What constraints apply: budget, safety, approvals, resource availability, benchmark rules, and stop conditions?
Can the policy launch runs directly, or must it produce proposals for human approval?
Which memory filters, score histories, and failure patterns should the policy use before scheduling work?
How are policy versions compared, paused, audited, and rolled back if they allocate resources poorly?