Auto-research scheduler

Agents can already edit code and run experiments. The missing piece is the system around them: queues, GPU leases, clean workspaces, private evaluation, artifacts, and memory.

Picidae is that layer. Point it at a benchmark, choose an agent, set a budget, and let the platform turn attempts into reviewable research evidence.

For humans
  • Define the benchmark and budget
  • Inspect runs, diffs, logs, and artifacts
  • Promote only results that survive evaluation

For agents
  • Get an isolated workspace
  • Run trials on local or cloud compute
  • Read prior findings before wasting GPUs
Built for: Codex, Local Docker, Lambda Labs, RunPod, Custom clusters

One contract for the full loop. The agent can be Codex, Claude, a script, or a human. The compute can be local Docker today and a GPU provider tomorrow.

Run

Launch a research job with an agent, benchmark, budget, and compute target.

Grade

Call private evaluators, compare seeds, and reject results that do not hold up.
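One way to picture "compare seeds" is a gate that accepts a candidate only when its gain repeats across seeds instead of coming from a single lucky run. The following is a minimal sketch under stated assumptions, not Picidae's actual grading logic: the function name `holds_up`, the `min_gain` threshold, and the majority-of-seeds rule are all hypothetical, and scores are assumed to be "higher is better".

```python
from statistics import mean


def holds_up(baseline: dict[int, float], candidate: dict[int, float],
             min_gain: float = 0.0) -> bool:
    """Accept a candidate only if it beats the baseline across shared seeds.

    Both arguments map a seed to an evaluator score (assumed higher = better).
    """
    seeds = sorted(baseline.keys() & candidate.keys())
    if len(seeds) < 2:
        # A single seed cannot show that a gain is repeatable.
        return False
    deltas = [candidate[s] - baseline[s] for s in seeds]
    wins = sum(d > 0 for d in deltas)
    # Require a mean improvement above the threshold AND a win on most seeds.
    return mean(deltas) > min_gain and wins * 2 > len(seeds)


# Illustrative placeholder scores, not real results.
baseline = {0: 0.70, 1: 0.72, 2: 0.71}
lucky = {0: 0.80, 1: 0.69, 2: 0.70}     # one big win, two small losses
robust = {0: 0.73, 1: 0.74, 2: 0.735}   # small but consistent gains

print(holds_up(baseline, lucky))
print(holds_up(baseline, robust))
```

The `lucky` candidate has a higher mean than the baseline but wins on only one of three seeds, so this gate rejects it. A real evaluator would likely use more seeds and a proper statistical test.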

Remember

Save code snapshots, artifacts, failures, and findings for the next agent.

# Start a research program
aq run \
  --benchmark w2s \
  --agent codex \
  --compute lambda:a100 \
  --budget 32-trials \
  --publish validated-findings
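
The flag values in this command suggest a small job specification. Below is a rough sketch of how such a specification could be modeled, assuming (this is not confirmed by the page) that `--compute` takes a `provider:gpu` string and `--budget` a `<count>-trials` string. `JobSpec` and `parse_job` are hypothetical names, not part of the `aq` CLI.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobSpec:
    benchmark: str
    agent: str
    provider: str
    gpu: Optional[str]   # None when the compute target names no GPU type
    max_trials: int


def parse_job(benchmark: str, agent: str, compute: str, budget: str) -> JobSpec:
    # Assumed format: "provider:gpu" (e.g. "lambda:a100") or just "provider".
    provider, _, gpu = compute.partition(":")
    if not provider:
        raise ValueError(f"missing compute provider in {compute!r}")
    # Assumed format: "<count>-trials" (e.g. "32-trials").
    count, _, unit = budget.partition("-")
    if unit != "trials" or not count.isdigit():
        raise ValueError(f"expected '<count>-trials', got {budget!r}")
    return JobSpec(benchmark, agent, provider, gpu or None, int(count))


spec = parse_job("w2s", "codex", "lambda:a100", "32-trials")
print(spec)
```

Validating the budget up front means a malformed value fails before any GPU is leased.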

What gets tracked

Runs are temporary. Memory is durable.

The platform keeps the raw trail, then compresses it into memory. A finding is the validated form: what changed, what moved, which evaluator judged it, and which snapshot produced it.
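The four parts of a finding map naturally onto a small record. As an illustrative sketch only (the `Finding` class, its field names, and the JSON serialization are assumptions, not Picidae's actual schema), it could look like this:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    what_changed: str   # the change the agent made
    metric: str         # which measurement moved
    delta: float        # how far it moved relative to the baseline
    evaluator: str      # which evaluator judged the result
    snapshot: str       # identifier of the code snapshot that produced it


# Placeholder values for illustration; not real results.
finding = Finding(
    what_changed="example: swapped the learning-rate schedule",
    metric="example_accuracy",
    delta=0.02,
    evaluator="example-private-eval",
    snapshot="example-snapshot-id",
)

# Serialize to a stable, diff-friendly form that a later agent could read.
record = json.dumps(asdict(finding), sort_keys=True)
print(record)
```

Making the record immutable (`frozen=True`) reflects the idea that memory is durable: a validated finding is appended and read, not edited in place.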

Why auto-research needs a scheduler

Karpathy showed the overnight loop. Benchmark work shows why grading matters. Picidae is the missing layer between the worker and the evaluator.

Read the reasoning →