Picidae

Auto-research scheduler

Agents can already edit code and run experiments. The missing piece is the system around them: queues, GPU leases, clean workspaces, private evaluation, artifacts, and memory.

Picidae is that layer. Point it at a benchmark, choose an agent, set a budget, and let the platform turn attempts into reviewable research evidence.

For humans

Define the benchmark and budget
Inspect runs, diffs, logs, and artifacts
Promote only results that survive evaluation

For agents

Get an isolated workspace
Run trials on local or cloud compute
Read prior findings before wasting GPUs

Built for: Codex, Local Docker, Lambda Labs, RunPod, Custom clusters

Runtime

One contract for the full loop. The agent can be Codex, Claude, a script, or a human. The compute can be local Docker today and a GPU provider tomorrow.
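A minimal sketch of what such a contract could look like, assuming the agent and the compute target are each reduced to a single method. Every name here (Agent, Compute, ScriptAgent, FakeLocalCompute, run_job) is hypothetical and used only for illustration; none of it is Picidae's actual API.

```python
from dataclasses import dataclass
from typing import Protocol


class Agent(Protocol):
    """Anything that can propose a change: Codex, Claude, a script, or a human."""

    def propose(self, workspace: str) -> str:
        """Return a description (or diff) of the attempted change."""
        ...


class Compute(Protocol):
    """Anything that can execute a trial: local Docker today, a GPU provider later."""

    def run_trial(self, workspace: str, change: str) -> float:
        """Run one trial and return a raw score."""
        ...


@dataclass
class ScriptAgent:
    # Hypothetical agent: always proposes the same fixed change.
    change: str

    def propose(self, workspace: str) -> str:
        return self.change


@dataclass
class FakeLocalCompute:
    # Hypothetical stand-in for a real backend: scores a change by its length.
    def run_trial(self, workspace: str, change: str) -> float:
        return float(len(change))


def run_job(agent: Agent, compute: Compute, workspace: str, budget: int) -> list[float]:
    """Spend the trial budget and return one score per trial."""
    return [compute.run_trial(workspace, agent.propose(workspace)) for _ in range(budget)]
```

Because run_job depends only on the two protocols, swapping the agent or the compute target should not require changing the loop itself.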

Run

Launch a research job with an agent, benchmark, budget, and compute target.

Grade

Call private evaluators, compare seeds, and reject results that do not hold up.

Remember

Save code snapshots, artifacts, failures, and findings for the next agent.
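The Grade step above can be sketched as a seed comparison. This is one simple rule, chosen as an assumption for illustration: accept a candidate only when its mean gain over the baseline is larger than the seed-to-seed spread. The function name holds_up and the threshold logic are hypothetical, not the platform's actual criterion.

```python
from statistics import mean, stdev


def holds_up(baseline: list[float], candidate: list[float], min_gain: float = 0.0) -> bool:
    """Accept only if the candidate's mean gain exceeds the seed-to-seed noise."""
    if len(baseline) < 2 or len(candidate) < 2:
        return False  # one seed cannot tell an improvement from luck
    gain = mean(candidate) - mean(baseline)
    # Use the larger of the two sample standard deviations as the noise floor.
    noise = max(stdev(baseline), stdev(candidate))
    return gain > min_gain and gain > noise
```

A stricter evaluator might use a proper significance test; the point of the sketch is only that a single lucky seed should not count as a result.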


# Start a research program
aq run \
  --benchmark w2s \
  --agent codex \
  --compute lambda:a100 \
  --budget 32-trials \
  --publish validated-findings

What gets tracked

Runs are temporary. Memory is durable.

The platform keeps the raw trail, then compresses it into memory. A finding is the validated form: what changed, what moved, which evaluator judged it, and which snapshot produced it.
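As a rough illustration, the four parts of a finding map naturally onto a small record. The field names, the Finding class, and the JSON-lines storage format below are all assumptions made for this sketch, not the platform's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    what_changed: str  # the edit that was tried
    metric: str        # which number moved
    delta: float       # how far it moved relative to the baseline
    evaluator: str     # which evaluator judged it
    snapshot: str      # which code snapshot produced it


def to_memory_line(finding: Finding) -> str:
    """Serialize a finding as one JSON line for an append-only memory log."""
    return json.dumps(asdict(finding), sort_keys=True)


def from_memory_line(line: str) -> Finding:
    """Rebuild a finding from a line written by to_memory_line."""
    return Finding(**json.loads(line))
```

Freezing the record reflects the idea that runs are temporary but memory is durable: once a finding is written, later agents read it rather than rewrite it.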

Blog

May 2026

Why auto-research needs a scheduler

Karpathy showed the overnight loop. Benchmark work shows why grading matters. Picidae is the missing layer between the worker and the evaluator.
