Primitive

Environment

The executable world the agent enters.

Definition

An environment is the concrete world where a benchmark attempt happens. It turns a research contract into something executable: a repository, container image, simulator, task adapter, RL environment, shell command, mounted data, allowed files, forbidden files, and expected outputs. Benchmark says what game is being played. Environment says how the agent actually plays it.

How It Looks

[Diagram: Benchmark contract, Environment sandbox, Run evidence]

An environment looks like a sandbox recipe: build this image, mount these datasets, make these paths writable, keep evaluator files read-only or absent, run this command, parse these metrics, and collect these artifacts.

How To Use It

Use an environment whenever the same benchmark can run in different executable worlds: local Docker, Harbor task adapter, RL simulator, AutoGo worker image, W2S sandbox, SWE-bench container, or a custom lab repository.

Why It Exists

Without Environment, Benchmark becomes overloaded. A benchmark should define the task and allowed behavior; an environment should define the runnable substrate. This is what lets Picidae support tiny autoresearch repos, W2S sandboxes, Go self-play systems, Harbor adapters, RL environments, and future task worlds without making each one a new backend.

Agent Freedom Boundary

The environment is what makes YOLO-style access safe enough to grant. Inside its writable surfaces, the agent may edit code, spawn subagents, create plots, write scripts, and run experiments. The same environment keeps evaluator code, hidden labels, protected datasets, secrets, and promotion credentials outside the agent's sandbox.
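
The boundary can be pictured as a path classifier. The following is a minimal sketch, not Picidae's implementation: the function name `classify_path`, the rule that read-only patterns win over writable ones, and the rule that unlisted paths count as absent are all assumptions made for illustration. It uses Python's `fnmatch`, where `*` also matches `/`, so `agents/**` covers nested files.

```python
from fnmatch import fnmatchcase


def classify_path(path: str, writable: list[str], readonly: list[str]) -> str:
    """Return 'readonly', 'writable', or 'absent' for a sandbox-relative path.

    Hypothetical helper: the precedence rules here are assumptions,
    not documented Picidae behavior.
    """
    # Read-only patterns are checked first, so an overly broad writable
    # glob cannot expose evaluator files to edits.
    if any(fnmatchcase(path, pattern) for pattern in readonly):
        return "readonly"
    if any(fnmatchcase(path, pattern) for pattern in writable):
        return "writable"
    # Anything not listed is treated as outside the sandbox (assumption).
    return "absent"


writable = ["agents/**"]
readonly = ["eval_env.py"]
print(classify_path("agents/dqn/policy.py", writable, readonly))
print(classify_path("eval_env.py", writable, readonly))
print(classify_path("labels/hidden.csv", writable, readonly))
```

Checking read-only patterns first is a deliberately conservative choice: a mistake in the writable list then fails closed instead of opening the evaluator.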

Executable Contract

A useful environment records image, command, setup, working directory, environment variables, mounts, resource hints, timeout, network mode, metric parser, artifact globs, and cleanup behavior. These fields are not research logic; they are the envelope that makes arbitrary research code reproducible.
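
To make the field list concrete, here is one way such a record could be typed. This is a sketch under assumptions: the class name `EnvironmentSpec`, the attribute names, and every default value (for example `network="none"`) are hypothetical and do not describe Picidae's actual schema. The example values are taken from the autoresearch example on this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical envelope around a run; holds no research logic."""

    image: str                                   # container image to build or pull
    command: str                                 # command that starts the attempt
    setup: tuple[str, ...] = ()                  # commands that prepare the environment
    workdir: str = "."                           # working directory inside the sandbox
    env: dict[str, str] = field(default_factory=dict)        # environment variables
    mounts: tuple[str, ...] = ()                 # datasets or paths mounted into the run
    resources: dict[str, str] = field(default_factory=dict)  # resource hints
    timeout_seconds: int = 600                   # assumed default, not from the source
    network: str = "none"                        # assumed default network mode
    metrics: dict[str, str] = field(default_factory=dict)    # metric name -> log pattern
    artifacts: tuple[str, ...] = ()              # artifact globs to collect
    cleanup: str = "remove"                      # assumed cleanup behavior label


autoresearch = EnvironmentSpec(
    image="autoresearch:cu128",
    command="uv run train.py",
    timeout_seconds=10 * 60,
    metrics={"val_bpb": "^val_bpb:"},
    artifacts=("run.log", "results.tsv", "train.py"),
)
print(autoresearch.image, autoresearch.timeout_seconds)
```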

Adapters

Harbor-like systems, RL worlds, and benchmark harnesses are all environment adapters. They can expose very different internal APIs, but Picidae only needs the adapter to accept a run spec, execute a command or episode loop, emit metrics, and preserve artifacts.
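
The four adapter duties named above can be sketched as a small interface. Everything in this example is hypothetical: `EnvironmentAdapter`, `EpisodeLoopAdapter`, `execute`, and the shape of the run spec and result dictionaries are illustrative names, not Picidae's API. The toy adapter runs an in-process episode loop so the example stays self-contained.

```python
from typing import Callable, Protocol


class EnvironmentAdapter(Protocol):
    """Hypothetical interface: accept a run spec, return metrics and artifacts."""

    def run(self, run_spec: dict) -> dict:
        ...


class EpisodeLoopAdapter:
    """Toy adapter that runs a fixed number of episodes of a reward function."""

    def __init__(self, episode: Callable[[int], float]) -> None:
        self._episode = episode

    def run(self, run_spec: dict) -> dict:
        # Execute the episode loop described by the run spec.
        rewards = [self._episode(i) for i in range(run_spec["episodes"])]
        return {
            # Emit metrics computed by the adapter, not reported by the agent.
            "metrics": {"mean_reward": sum(rewards) / len(rewards)},
            # Preserve the raw rollouts so the run can be explained later.
            "artifacts": {"rollouts": rewards},
        }


def execute(adapter: EnvironmentAdapter, run_spec: dict) -> dict:
    """The caller depends only on the interface, not on adapter internals."""
    return adapter.run(run_spec)


result = execute(EpisodeLoopAdapter(lambda i: float(i)), {"episodes": 4})
print(result["metrics"])
```

A command-based adapter would satisfy the same interface by launching a process instead of looping over episodes; the caller would not change.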

Examples

Autoresearch sandbox

The agent can edit the training file and manage its own experiment loop, while the environment protects the evaluator and captures each run's score, diff, log, and memory.

environment: autoresearch-single-gpu
image: autoresearch:cu128
writable:
  - train.py
  - results.tsv
readonly:
  - prepare.py
command: uv run train.py
timeout: 10m
metrics:
  val_bpb: "^val_bpb:"
artifacts:
  - run.log
  - results.tsv
  - train.py
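
A metric entry such as `val_bpb: "^val_bpb:"` could be applied to `run.log` roughly as follows. This is a sketch under assumptions: the function `parse_metrics` is hypothetical, the pattern is assumed to be a line prefix followed by a number, and keeping the last reported value is an illustrative choice, not documented behavior.

```python
import re

# Matches an optionally signed decimal number, with optional exponent.
NUMBER = r"\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"


def parse_metrics(log_text: str, patterns: dict[str, str]) -> dict[str, float]:
    """Extract one float per metric from log text (hypothetical helper)."""
    metrics = {}
    for name, prefix in patterns.items():
        # MULTILINE lets '^' in the configured prefix anchor at each line start.
        values = re.findall(prefix + NUMBER, log_text, flags=re.MULTILINE)
        if values:
            # Assumption: the last value printed is the final score.
            metrics[name] = float(values[-1])
    return metrics


log = "step 10\nval_bpb: 1.234\nstep 20\nval_bpb: 0.987\n"
print(parse_metrics(log, {"val_bpb": "^val_bpb:"}))
```

Because the value comes from the log line rather than from anything the agent writes in prose, the score does not depend on trusting the agent's own summary.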

RL environment adapter

An RL environment can be represented as a simulator plus training command, episode limits, reward parser, and rollout artifacts. The agent can change policy code without changing the scoring wrapper.

environment: gym-cartpole-generalization
image: rl-lab:latest
command: python train.py --env CartPole-v1
readonly:
  - eval_env.py
writable:
  - agents/**
metrics:
  mean_reward: "^mean_reward:"
artifacts:
  - policy.pt
  - rollouts.jsonl
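
Collecting the listed artifacts after a run could look like the sketch below. The function `collect_artifacts` and its copy-into-a-destination-directory behavior are assumptions for illustration; the source does not say how artifacts are stored.

```python
import shutil
from pathlib import Path


def collect_artifacts(workdir: str, patterns: list[str], dest: str) -> list[str]:
    """Copy files matching the artifact globs from workdir into dest.

    Hypothetical helper. Returns the collected paths relative to workdir.
    """
    work, out = Path(workdir), Path(dest)
    collected = []
    for pattern in patterns:
        # Sort so the result order is stable across filesystems.
        for src in sorted(work.glob(pattern)):
            if not src.is_file():
                continue
            relative = src.relative_to(work)
            target = out / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)  # copy2 also preserves timestamps
            collected.append(relative.as_posix())
    return collected
```

For the configuration above, a call might be `collect_artifacts("/work", ["policy.pt", "rollouts.jsonl"], "/runs/attempt-1")`, where both directory names are placeholders. Files that match no glob, such as scratch output, are left behind.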

Owns / Defines

Docker image, repo snapshot, writable paths, read-only paths, setup commands, run commands, resource needs, network rules, and metric parsers.

Questions Operators Should Answer

What files can the agent edit, and what files are read-only or absent?
What command starts the attempt, and what command prepares the environment?
Which metrics can be parsed without trusting the agent's prose?
Which artifacts must be collected to reproduce or explain the run?
Can the same benchmark run under multiple environments without changing the task contract?