Run
Launch a research job with an agent, benchmark, budget, and compute target.
Agents can already edit code and run experiments. The missing piece is the system around them: queues, GPU leases, clean workspaces, private evaluation, artifacts, and memory.
Picidae is that layer. Point it at a benchmark, choose an agent, set a budget, and let the platform turn attempts into reviewable research evidence.
One contract for the full loop. The agent can be Codex, Claude, a script, or a human. The compute can be local Docker today and a GPU provider tomorrow.
Launch a research job with an agent, benchmark, budget, and compute target.
Call private evaluators, compare seeds, and reject results that do not hold up.
Save code snapshots, artifacts, failures, and findings for the next agent.
# Start a research program aq run \ --benchmark w2s \ --agent codex \ --compute lambda:a100 \ --budget 32-trials \ --publish validated-findings
The platform keeps the raw trail, then compresses it into memory. A finding is the validated form: what changed, what moved, which evaluator judged it, and which snapshot produced it.
Karpathy showed the overnight loop. Benchmark work shows why grading matters. Picidae is the missing layer between the worker and the evaluator.
Read the reasoning →