Blog / May 2026

Why auto-research needs a scheduler

Agents can now write code, launch experiments, and inspect results. That changes the bottleneck. The hard part is no longer just producing one run. It is coordinating thousands of attempts without losing the evidence.

Karpathy's autoresearch repo is important because it strips the idea down to the smallest useful loop. An agent edits one training file, trains for a fixed budget, checks a metric, and either keeps the change or throws it away. Nothing about that requires a giant lab. It is a simple research worker.
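To make the shape of that loop concrete, here is a minimal sketch in Python. It is not code from the autoresearch repo. The names `keep_or_revert`, `toy_propose`, and `toy_evaluate` are hypothetical, the "training run" is replaced by a toy function so the example runs instantly, and it assumes a lower metric is better.

```python
import random
from typing import Callable

Config = dict[str, float]

def keep_or_revert(
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    baseline: Config,
    attempts: int,
) -> tuple[Config, float, list[tuple[Config, float, bool]]]:
    """Try `attempts` changes; keep a change only if the metric improves.

    Assumption: lower metric is better (for example, validation loss).
    """
    best, best_score = baseline, evaluate(baseline)
    history = []
    for _ in range(attempts):
        candidate = propose(best)         # the agent edits the current best
        score = evaluate(candidate)       # stands in for a fixed-budget run
        kept = score < best_score
        history.append((candidate, score, kept))
        if kept:
            best, best_score = candidate, score
        # otherwise the change is thrown away and `best` stays as it was
    return best, best_score, history

# Toy stand-ins (hypothetical): the "agent" nudges a learning rate and the
# "training run" is a quadratic with its minimum at lr = 0.3.
rng = random.Random(0)

def toy_propose(config: Config) -> Config:
    return {"lr": config["lr"] + rng.uniform(-0.1, 0.1)}

def toy_evaluate(config: Config) -> float:
    return (config["lr"] - 0.3) ** 2

best, best_score, history = keep_or_revert(
    toy_propose, toy_evaluate, {"lr": 1.0}, attempts=50
)
```

Note that `history` records the rejected attempts too. In a real worker that list is the evidence the rest of this post is concerned with.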

But a single worker is not the product. The product starts when you ask what happens after there are ten, a hundred, or a thousand of these loops running across different machines, benchmarks, ideas, and agents. Who decides what runs next? Who owns the GPU lease? How do results get compared when the setup changes? Which failures are worth preserving? Which result is good enough to become shared knowledge?

OpenAI's benchmark work points at the other missing piece: grading. PaperBench is useful because it treats research progress as something that can be checked against a rubric. MLE-bench is useful because it turns ML engineering into a concrete loop: prepare data, train, submit, compare. The “gold label” lesson is even more basic: if an agent is optimizing a score, the score needs a source of truth behind it.

That is why Picidae should not be another experiment dashboard. Dashboards are where humans look after the run. Auto-research needs a system that is present before, during, and after the run: scheduling as the substrate, plus a cockpit that turns artifacts, metrics, plots, findings, and follow-up work into decisions.

Slurm schedules compute jobs. W&B tracks experiment telemetry. Picidae should help humans steer research programs.

The key object is not the run. A run is just an attempt. The durable object is memory: what was tried, why it mattered, what metric moved, how many seeds were used, what code produced it, and whether the result survived a real evaluator. A finding is the validated, user-facing form of that memory. Once memory becomes first class, agents can stop rediscovering the same dead ends and start building on prior evidence.
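One way to picture memory as a first-class object is a record with exactly the fields listed above, plus a store that agents can check before launching a run. This is a sketch under assumptions, not a Picidae schema: `MemoryRecord`, `Memory`, `lookup`, `findings`, and the `min_seeds` threshold are all hypothetical, and the example values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryRecord:
    idea: str               # what was tried
    rationale: str          # why it mattered
    metric_name: str        # which metric was tracked
    metric_delta: float     # how far the metric moved against the baseline
    seeds: int              # how many seeds were used
    code_ref: str           # what code produced it (for example, a commit id)
    passed_evaluator: bool  # whether the result survived a real evaluator

class Memory:
    def __init__(self) -> None:
        self._records: list[MemoryRecord] = []

    def add(self, record: MemoryRecord) -> None:
        self._records.append(record)

    def lookup(self, idea: str) -> list[MemoryRecord]:
        """Return prior attempts at the same idea (naive exact match)."""
        key = idea.strip().lower()
        return [r for r in self._records if r.idea.strip().lower() == key]

    def findings(self, min_seeds: int = 3) -> list[MemoryRecord]:
        """Validated memory: passed the evaluator with enough seeds.

        The min_seeds default is an arbitrary placeholder, not a recommendation.
        """
        return [
            r for r in self._records
            if r.passed_evaluator and r.seeds >= min_seeds
        ]

# Illustrative records only; the values are invented for the example.
memory = Memory()
memory.add(MemoryRecord("Cosine LR schedule", "loss plateaued early",
                        "val_loss", -0.02, 5, "abc123", True))
memory.add(MemoryRecord("Double batch size", "GPU was underused",
                        "val_loss", 0.01, 1, "def456", False))
```

A real memory layer would need fuzzier matching than string equality, but the split between "everything that was tried" and "what counts as a finding" is the part that matters.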

This is also how the system gets better than the first wave of auto-research demos: every component is swappable. Codex, Claude, or a custom agent; local Docker or rented GPUs; hidden test labels or unit tests; paper replication or model training. The platform owns the contract. The user swaps the parts.
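"The platform owns the contract" can be sketched with structural interfaces. The version below uses Python's `typing.Protocol`. The interface names (`Agent`, `Executor`, `Evaluator`), the `run_attempt` function, and the three stand-in classes are hypothetical, and none of them call a real agent, container, or GPU.

```python
from typing import Protocol

class Agent(Protocol):
    def propose(self, task: str) -> str: ...

class Executor(Protocol):
    def run(self, patch: str) -> dict: ...

class Evaluator(Protocol):
    def score(self, artifacts: dict) -> float: ...

def run_attempt(agent: Agent, executor: Executor,
                evaluator: Evaluator, task: str) -> float:
    """The platform-owned path: propose, execute, grade."""
    patch = agent.propose(task)
    artifacts = executor.run(patch)
    return evaluator.score(artifacts)

# Stand-in parts (hypothetical). Any object with the right method fits,
# so a different agent, executor, or evaluator can be swapped in without
# touching run_attempt.
class EchoAgent:
    def propose(self, task: str) -> str:
        return f"# patch for: {task}"

class InProcessExecutor:
    def run(self, patch: str) -> dict:
        # Pretend the patch was applied and a test suite ran.
        return {"patch": patch, "tests_passed": 3, "tests_total": 4}

class UnitTestEvaluator:
    def score(self, artifacts: dict) -> float:
        return artifacts["tests_passed"] / artifacts["tests_total"]

result = run_attempt(EchoAgent(), InProcessExecutor(),
                     UnitTestEvaluator(), "speed up the data loader")
```

The design choice here is that `run_attempt` never names a concrete class. Swapping local Docker for rented GPUs would mean writing another `Executor`, not changing the loop.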

Karpathy autoresearch: The loop is simple enough to run overnight.

One editable training file. One fixed time budget. One metric. The agent changes code, runs the experiment, keeps or reverts, and tries again.

OpenAI PaperBench: Research progress needs a grader.

Paper replication can be decomposed into many specific checks. That points toward structured evaluators instead of vague demos.

Gold labels: The evaluator has to be evaluated too.

If agents are optimizing a score, the scoring system needs calibration against trusted human or ground-truth judgments.
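A rough sketch of what that calibration check could look like: compare the grader's pass/fail verdicts with trusted gold labels on the same items. The function name `calibration_report` and the two statistics it returns are assumptions chosen for illustration, not a method taken from any of the benchmarks mentioned here.

```python
def calibration_report(grader: list[bool], gold: list[bool]) -> dict[str, float]:
    """Compare automated grader verdicts with gold labels, item by item.

    agreement:       fraction of items where grader and gold match
    false_pass_rate: among items the gold labels reject, the fraction
                     the grader passed anyway
    """
    if len(grader) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")

    matches = sum(g == t for g, t in zip(grader, gold))
    # Grader verdicts on the items that the gold labels say should fail.
    on_gold_fail = [g for g, t in zip(grader, gold) if not t]
    false_pass_rate = sum(on_gold_fail) / len(on_gold_fail) if on_gold_fail else 0.0

    return {
        "agreement": matches / len(gold),
        "false_pass_rate": false_pass_rate,
    }

# Four graded items; the grader wrongly passes the second one.
report = calibration_report(
    grader=[True, True, False, True],
    gold=[True, False, False, True],
)
# report == {"agreement": 0.75, "false_pass_rate": 0.5}
```

The false-pass rate is worth tracking separately from overall agreement, because an agent that optimizes against the grader will drift toward exactly the cases the grader wrongly accepts.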

MLE-bench: ML engineering is already becoming agent work.

Dataset prep, training, submission, comparison, and iteration are exactly the kind of workload a research scheduler should own.

What we need to build

  • A benchmark contract that defines setup, allowed files, metrics, and hidden evaluation
  • A scheduler that can launch the same research job locally, in Docker, or on cloud GPUs
  • A findings layer that promotes validated claims instead of raw experiment spam
  • A memory layer agents can query before wasting compute on repeated ideas
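As a starting point for the first item, a benchmark contract could be as small as a frozen record plus one check that the scheduler runs before accepting an attempt. Everything below is a hypothetical sketch: the class `BenchmarkContract`, its field names, and the example values are not an existing Picidae format.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass(frozen=True)
class BenchmarkContract:
    name: str
    setup_command: str              # how to prepare the environment
    allowed_files: tuple[str, ...]  # glob patterns the agent may edit
    metric: str                     # the single number being optimized
    higher_is_better: bool
    time_budget_s: int              # fixed budget per attempt, in seconds
    hidden_eval: bool               # True if scoring uses data agents cannot see

    def violations(self, changed_files: list[str]) -> list[str]:
        """Return the changed paths that fall outside the allowed files."""
        return [
            path for path in changed_files
            if not any(fnmatch(path, pattern) for pattern in self.allowed_files)
        ]

# Illustrative contract for a one-file training loop (values are made up).
contract = BenchmarkContract(
    name="tiny-lm",
    setup_command="pip install -r requirements.txt",
    allowed_files=("train.py",),
    metric="val_loss",
    higher_is_better=False,
    time_budget_s=300,
    hidden_eval=True,
)

bad_paths = contract.violations(["train.py", "eval/labels.csv"])
# bad_paths == ["eval/labels.csv"]
```

Because the contract is plain data, the same record could be handed to a local run, a Docker run, or a cloud GPU job, which is the property the scheduler item in the list depends on.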