Agents can already execute
A strong Codex or Claude session with YOLO access can edit code, run experiments, maintain TSVs, spawn subagents, and build local dashboards. Picidae should not pretend execution is the scarce layer.
Picidae is not just a runner, because agents can already run experiments. The harder product problem is turning autonomous research state into trusted evidence, useful plots, lineages, and decisions a human can check from a phone.
The system should be generic enough for autoresearch, W2S, AutoGo, Harbor-style benchmark adapters, RL environments, and future labs. The common problem is not launching commands; it is deciding what the growing research tree means.
Autonomous research creates too much state: logs, diffs, scores, artifacts, dead ends, near misses, and suspicious wins. The product has to compress that state into decisions a human can make quickly.
Picidae should show which research row deserves compute, which result is fake progress, which plot explains the frontier jump, and whether to continue, branch, stop, rerun, or promote.
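The five steering actions named above can be written down as a closed set, so every decision is one of them and nothing else. The following is a minimal sketch, not Picidae's actual schema: the names `Action`, `Decision`, `lineage_id`, `reason`, and `parse_action` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The steering actions listed in the text (hypothetical encoding)."""
    CONTINUE = "continue"
    BRANCH = "branch"
    STOP = "stop"
    RERUN = "rerun"
    PROMOTE = "promote"


@dataclass(frozen=True)
class Decision:
    """One steering action applied to one research row (illustrative fields)."""
    lineage_id: str
    action: Action
    reason: str


def parse_action(text: str) -> Action:
    """Normalize free-form input; raises ValueError for anything outside the set."""
    return Action(text.strip().lower())
```

Keeping the set closed means a phone-sized interface only ever has to render five buttons, and an unknown action fails loudly instead of being stored silently.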
Agents should stay flexible inside the environment. Picidae owns the outer contract: benchmark, sandbox, trusted evaluation, evidence, memory, lineage, and decision history.
The execution layer keeps agents reproducible and measurable. The supervision layer is the differentiator: cards, lineages, insights, decisions, and memory make the research portfolio steerable.
Picidae should preserve the whole loop, not just the execution step. A run becomes useful only after it is scored, attached to a card, compressed into an insight, turned into a decision, saved as memory, and used by policy to choose the next run.
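One way to make the loop explicit is to treat it as an ordered sequence of stages that a run passes through. This is a rough sketch under the assumption that the stages happen strictly in the order the paragraph lists them; the names `Stage`, `next_stage`, and `remaining` are hypothetical.

```python
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    """Stages of the loop, in the order given in the text (assumed strict)."""
    EXECUTED = 0
    SCORED = 1
    ATTACHED_TO_CARD = 2
    INSIGHT = 3
    DECISION = 4
    MEMORY = 5
    USED_BY_POLICY = 6


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows, or None once the loop is closed."""
    if stage is Stage.USED_BY_POLICY:
        return None
    return Stage(stage + 1)


def remaining(stage: Stage) -> List[Stage]:
    """Stages a run still has to pass through before it is fully useful."""
    return [s for s in Stage if s > stage]
```

A run that stops at `EXECUTED` still has six stages ahead of it, which is the gap between a job runner and the full loop described here.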
This is the vocabulary the backend should preserve. Everything else should either be implementation detail or a view over these concepts.
These distinctions matter: without them, the product collapses into either a generic job runner or an unstructured agent log.
Benchmark defines the task contract. Environment defines the executable world. Evaluation defines the trusted scoring process.
A card is the research move or hypothesis. A run is one concrete execution that produces evidence for or against it.
An insight is an observation that changes strategy now. Memory is the durable store of validated, contradicted, or raw research knowledge.
A decision is one steering action. A policy is the reusable rule or agent that proposes, authorizes, or launches future actions.
A lineage is one research row inside the tree. A workspace is the lab boundary that owns many rows, people, agents, datasets, and budgets.
A finding is the user-facing label for validated compressed memory. It is not a separate backend primitive.
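The point that a finding is a view rather than a primitive can be illustrated with a small sketch. It assumes memory entries carry one of the three states named above (raw, validated, contradicted); `MemoryStatus`, `MemoryEntry`, and `findings` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MemoryStatus(Enum):
    """States of research knowledge named in the text (hypothetical encoding)."""
    RAW = "raw"
    VALIDATED = "validated"
    CONTRADICTED = "contradicted"


@dataclass(frozen=True)
class MemoryEntry:
    """One piece of stored research knowledge (illustrative fields)."""
    text: str
    status: MemoryStatus


def findings(memory: List[MemoryEntry]) -> List[MemoryEntry]:
    """A finding is derived, not stored: it is validated memory, filtered on read."""
    return [entry for entry in memory if entry.status is MemoryStatus.VALIDATED]
```

Because `findings` only filters, there is no second table to keep in sync: changing an entry's status is enough to make it appear in or disappear from the user-facing list.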
These are the distinctions that need to stay crisp as the system grows.
Codex can run experiments. Picidae should make the research frontier legible: what changed, what is suspicious, what deserves compute, and what decision should happen next.
Benchmark is the task contract. Environment is the executable world. Evaluation is the trusted scoring process. Keeping them separate lets agents stay flexible without grading themselves.
A run is what happened. A card is the research move being tested. One card can have many runs for seeds, ablations, phases, or reruns.
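The one-card-to-many-runs relationship might look like the sketch below. The field names (`run_id`, `kind`, `score`, `hypothesis`) and the choice of a plain mean as the summary are assumptions made for illustration; a real system would likely want a trusted evaluator to supply the scores and a more careful aggregate.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """One concrete execution: what happened (illustrative fields)."""
    run_id: str
    kind: str  # e.g. "seed", "ablation", "phase", "rerun"
    score: float


@dataclass
class Card:
    """The research move being tested; it owns every run that tests it."""
    hypothesis: str
    runs: List[Run] = field(default_factory=list)

    def add_run(self, run: Run) -> None:
        self.runs.append(run)

    def mean_score(self) -> Optional[float]:
        """Average score across runs, or None when there is no evidence yet."""
        if not self.runs:
            return None
        return mean(run.score for run in self.runs)
```

Returning `None` for an empty card keeps "no evidence yet" distinct from "evidence with a score of zero".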
Lineage is the active tree of research moves. Memory is what survives for future agents after the evidence is compressed, validated, contradicted, or promoted.