Primitives for steering autonomous research.

Picidae is not just a runner. Agents can already run experiments. The harder product is turning autonomous research state into trusted evidence, useful plots, lineages, and decisions a human can check from a phone.

Reasoning

The system should be generic enough for autoresearch, W2S, AutoGo, Harbor-style benchmark adapters, RL environments, and future labs. The common problem is not launching commands; it is deciding what the growing research tree means.

Agents can already execute

A strong Codex or Claude session with YOLO access can edit code, run experiments, maintain TSVs, spawn subagents, and build local dashboards. Picidae should not pretend execution is the scarce layer.

Attention is the bottleneck

Autonomous research creates too much state: logs, diffs, scores, artifacts, dead ends, near misses, and suspicious wins. The product has to compress that state into decisions a human can make quickly.

The cockpit is the wedge

Picidae should show which research row deserves compute, which result is fake progress, which plot explains the frontier jump, and whether to continue, branch, stop, rerun, or promote.

Freedom inside, structure outside

Agents should stay flexible inside the environment. Picidae owns the outer contract: benchmark, sandbox, trusted evaluation, evidence, memory, lineage, and decision history.

Two layers

The execution layer keeps agents reproducible and measurable. The supervision layer is the differentiator: cards, lineages, insights, decisions, and memory make the research portfolio steerable.

One research loop

Picidae should preserve the whole loop, not just the execution step. A run becomes useful only after it is scored, attached to a card, compressed into an insight, turned into a decision, saved as memory, and used by policy to choose the next run.
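
The loop above can be read as an ordered sequence of stages. The following is a minimal sketch, not the Picidae implementation: the names `LoopStage`, `advance`, and `remaining` are hypothetical, and the stage order is taken directly from the sentence above.

```python
from enum import IntEnum


class LoopStage(IntEnum):
    """Hypothetical stage names, in the order the loop describes them."""
    RUN = 0        # the run has executed
    SCORED = 1     # trusted evaluation has produced metrics
    CARDED = 2     # the run is attached to a card
    INSIGHT = 3    # evidence is compressed into an insight
    DECISION = 4   # a steering decision has been taken
    MEMORY = 5     # the outcome is saved as memory
    POLICY = 6     # policy has used it to choose the next run


def advance(stage: LoopStage) -> LoopStage:
    """Return the next stage; after POLICY the loop starts a new RUN."""
    return LoopStage((stage + 1) % len(LoopStage))


def remaining(stage: LoopStage) -> list:
    """List the stages a run still has to pass before the loop closes."""
    return [s for s in LoopStage if s > stage]


print(remaining(LoopStage.INSIGHT))
```

A run that stops at `RUN` or `SCORED` has executed but has not yet informed any decision, which is the gap the loop is meant to close.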

Overview table

This is the vocabulary the backend should preserve. Everything else should be either an implementation detail or a view over these concepts.

| Primitive | Meaning | Owns / defines |
| --- | --- | --- |
| Workspace | The project boundary for research. | Benchmarks, datasets, agents, runs, memory, policies, secrets, compute, and artifacts. |
| Environment | The executable world the agent enters. | Docker image, repo snapshot, writable paths, read-only paths, setup commands, run commands, resource needs, network rules, and metric parsers. |
| Benchmark | The task contract. | Rules, allowed inputs, expected outputs, budgets, metrics, and applicable evaluations. |
| Dataset | The versioned input material. | Source URI, checksum, splits, visibility, sync mode, and provenance. |
| Evaluation | The scoring process or judge. | Input artifacts, scorer command/API, hidden labels or rubrics, tests, arena configuration, judge policy, scorer versions, metric outputs, and pass/fail gates. |
| Agent | The actor allowed to do research work. | Identity, runtime, tools, permissions, prompt/configuration, memory access, approvals, and audit trail. |
| Card | One research move. | Hypothesis, proposed change, parent card, linked runs, status, evidence summary, and next action. |
| Lineage | A row of related research moves. | Cards, branches, frontier score, direction label, entropy category, status, and strategic recommendation. |
| Insight | A compressed observation that changes strategy. | Claim, plot, evidence set, confidence, counter-evidence, affected lineage, and suggested decision. |
| Decision | The steering action taken from evidence. | Actor, target card or lineage, action, reason, evidence links, approval state, and downstream policy update. |
| Run | The thing that happened. | Action, status, logs, artifacts, metrics, resources, config, and optional parent run. |
| Memory | What the system keeps for future decisions. | Observations, failures, findings, notes, papers, snapshots, decisions, and references. |
| Policy | The rule for choosing next work. | Objective, search strategy, constraints, memory filters, stopping conditions, and approval gates. |

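
To make the Card and Lineage rows concrete, here is a minimal sketch of cards linked by a parent pointer, with one branch of the tree recovered by walking those links. The field names (`card_id`, `parent_id`, `run_ids`), the status value, and the `ancestry` helper are assumptions for illustration; the source only states that a card has a parent card and linked runs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Card:
    """One research move (hypothetical field names)."""
    card_id: str
    hypothesis: str
    parent_id: Optional[str] = None          # None marks a root card
    run_ids: list = field(default_factory=list)
    status: str = "open"


def ancestry(card_id: str, cards: dict) -> list:
    """Walk parent links from a card back to its root; return root first."""
    path, seen = [], set()
    current = card_id
    while current is not None:
        if current in seen:                  # guard against corrupted links
            raise ValueError(f"cycle in parent links at {current}")
        seen.add(current)
        path.append(current)
        current = cards[current].parent_id
    return list(reversed(path))


cards = {
    "c1": Card("c1", "baseline"),
    "c2": Card("c2", "larger batch", parent_id="c1"),
    "c3": Card("c3", "larger batch plus warmup", parent_id="c2"),
    "c4": Card("c4", "different optimizer", parent_id="c1"),
}

print(ancestry("c3", cards))
```

Here `c3` and `c4` sit on different branches under the same root, so a lineage view can be derived from card links rather than stored separately. Whether Picidae stores lineages explicitly or derives them is not specified in the source.
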
Confusable pairs

These distinctions matter: without them, the product collapses into either a generic job runner or an unstructured agent log.

Benchmark / Environment / Evaluation

Benchmark defines the task contract. Environment defines the executable world. Evaluation defines the trusted scoring process.

Card / Run

A card is the research move or hypothesis. A run is one concrete execution that produces evidence for or against it.

Insight / Memory

An insight is an observation that changes strategy now. Memory is the durable store of validated, contradicted, or raw research knowledge.

Decision / Policy

A decision is one steering action. A policy is the reusable rule or agent that proposes, authorizes, or launches future actions.
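
A minimal sketch of the difference: a decision is a single record, while a policy is a reusable function that proposes such records. The `Action` values come from the steering options named earlier (continue, branch, stop, rerun, promote). The `plateau_policy` rule, its parameters, and the assumption that higher scores are better are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    BRANCH = "branch"
    STOP = "stop"
    RERUN = "rerun"
    PROMOTE = "promote"


@dataclass(frozen=True)
class Decision:
    """One steering action, recorded with its reason."""
    actor: str
    target: str
    action: Action
    reason: str


def plateau_policy(lineage_id, scores, min_gain=0.01, patience=3):
    """Hypothetical policy: propose STOP when the last `patience` scores
    fail to beat the earlier best by at least `min_gain`."""
    if len(scores) <= patience:
        return Decision("policy:plateau", lineage_id, Action.CONTINUE,
                        "not enough history to judge a plateau")
    gain = max(scores[-patience:]) - max(scores[:-patience])
    if gain < min_gain:
        return Decision("policy:plateau", lineage_id, Action.STOP,
                        f"gain {gain:.3f} below {min_gain} over last {patience} runs")
    return Decision("policy:plateau", lineage_id, Action.CONTINUE,
                    f"gain {gain:.3f} over last {patience} runs")


stalled = plateau_policy("lineage-a", [0.50, 0.60, 0.60, 0.60, 0.60])
print(stalled.action, stalled.reason)
```

The same policy can emit many decisions over time; each decision stays an immutable record that a human can approve or override.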

Lineage / Workspace

A lineage is one research row inside the tree. A workspace is the lab boundary that owns many rows, people, agents, datasets, and budgets.

Finding / Memory

A finding is the user-facing label for validated compressed memory. It is not a separate backend primitive.
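
Read this way, a finding is a filtered view over memory rather than its own stored type. A minimal sketch, assuming a hypothetical `MemoryEntry` record whose `status` takes the three values the text names for memory (raw, validated, contradicted):

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """Hypothetical memory record; status is 'raw', 'validated', or 'contradicted'."""
    entry_id: str
    text: str
    status: str


def findings(memory):
    """Return the user-facing findings: validated memory entries only."""
    return [entry for entry in memory if entry.status == "validated"]


memory = [
    MemoryEntry("m1", "warmup helps at small batch sizes", "validated"),
    MemoryEntry("m2", "score jump on split B looks like leakage", "raw"),
    MemoryEntry("m3", "dropout improves the frontier", "contradicted"),
]

print([entry.entry_id for entry in findings(memory)])
```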

Support articles

These are the distinctions that need to stay crisp as the system grows.

Why Not Just Codex In A Cluster?

Codex can run experiments. Picidae should make the research frontier legible: what changed, what is suspicious, what deserves compute, and what decision should happen next.

Benchmark vs Environment vs Evaluation

Benchmark is the task contract. Environment is the executable world. Evaluation is the trusted scoring process. Keeping them separate lets agents stay flexible without grading themselves.
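
One way to picture the separation is three distinct objects, where only the evaluation ever holds the hidden labels. This is a minimal sketch under stated assumptions: the class fields, the accuracy metric, and the pass threshold are illustrative and not taken from the Picidae backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """The task contract: what is measured and what counts as passing."""
    name: str
    metric: str
    pass_threshold: float


@dataclass(frozen=True)
class Environment:
    """The executable world the agent enters (hypothetical fields)."""
    image: str
    run_command: str
    writable_paths: tuple


class Evaluation:
    """The trusted scorer. Hidden labels live here, outside the environment."""

    def __init__(self, benchmark, hidden_labels):
        self._benchmark = benchmark
        self._hidden_labels = dict(hidden_labels)

    def score(self, predictions):
        """Compare the agent's output artifact against the hidden labels."""
        correct = sum(
            1 for key, label in self._hidden_labels.items()
            if predictions.get(key) == label
        )
        accuracy = correct / len(self._hidden_labels)
        return {
            self._benchmark.metric: accuracy,
            "passed": accuracy >= self._benchmark.pass_threshold,
        }


benchmark = Benchmark("toy-classification", "accuracy", 0.7)
evaluation = Evaluation(benchmark, {"a": 1, "b": 0, "c": 1, "d": 1})

# The agent only produces predictions; it never sees the labels.
result = evaluation.score({"a": 1, "b": 0, "c": 0, "d": 1})
print(result)
```

The agent is free to do anything inside the environment, but the number that reaches a card comes from `Evaluation.score`, not from the agent's own report.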

Card vs Run

A run is what happened. A card is the research move being tested. One card can have many runs for seeds, ablations, phases, or reruns.
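
A minimal sketch of the one-card, many-runs relationship, where several seeded runs are collapsed into a single evidence summary for the card. The `Run` fields and the `evidence_summary` helper are hypothetical, and the scores are made-up illustration values.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Run:
    """One concrete execution that belongs to a card."""
    run_id: str
    card_id: str
    seed: int
    score: float


def evidence_summary(card_id, runs):
    """Aggregate all runs of one card into a summary; None if no runs exist."""
    scores = [run.score for run in runs if run.card_id == card_id]
    if not scores:
        return None
    return {
        "n_runs": len(scores),
        "mean": mean(scores),
        # Spread needs at least two runs; a single run gives no variance estimate.
        "spread": stdev(scores) if len(scores) > 1 else 0.0,
    }


runs = [
    Run("r1", "c2", seed=0, score=0.70),
    Run("r2", "c2", seed=1, score=0.72),
    Run("r3", "c2", seed=2, score=0.74),
    Run("r4", "c3", seed=0, score=0.81),
]

print(evidence_summary("c2", runs))
```

A single high-scoring run, such as `r4` here, has no spread to compare against, which is one reason a card might warrant a rerun before it is promoted.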

Lineage vs Memory

Lineage is the active tree of research moves. Memory is what survives for future agents after the evidence is compressed, validated, contradicted, or promoted.