Primitive

Memory

What the system keeps for future decisions.

Definition

Memory is retained research knowledge that can influence future work. It includes observations, failures, findings, notes, papers, decisions, summaries, and references to runs, artifacts, metrics, datasets, and code snapshots. A finding is a high-confidence memory record with evidence, not a separate product primitive.

How It Looks

Runs + evaluations -> Memory records -> Policy + agents read memory

A memory record looks like a cited research note: summary, confidence, scope, supporting runs, artifacts, metrics, code snapshots, contradictions, and rules for whether policies may act on it.
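
As a rough illustration of the fields listed above, a memory record could be modeled as a small data class. This is a minimal sketch, not the system's actual schema: the class name `MemoryRecord` and every field name are assumptions chosen to mirror the description, and the sample values are borrowed from the AutoGo failure example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryRecord:
    """Hypothetical shape of a memory record; all field names are assumptions."""
    summary: str
    kind: str                      # e.g. "observation", "failure", "finding"
    confidence: str                # "low", "medium", or "high"
    scope: str                     # e.g. "workspace", "benchmark", "private"
    runs: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)
    metrics: List[str] = field(default_factory=list)
    code_snapshots: List[str] = field(default_factory=list)
    contradicted_by: List[str] = field(default_factory=list)
    policy_may_act: bool = False   # whether policies may act on this record


record = MemoryRecord(
    summary="training-mask bug inflated fastlearn metrics",
    kind="failure",
    confidence="high",
    scope="workspace",
    runs=["run_102", "run_108"],
    artifacts=["mask_debug_report.md"],
    policy_may_act=True,
)
```

Defaulting `policy_may_act` to `False` means a record must be explicitly cleared before a policy acts on it, which is one possible reading of "rules for whether policies may act on it".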

How To Use It

Use memory to prevent repeated mistakes and preserve research knowledge. Store both raw observations and compressed findings, but make confidence and references explicit.

Memory Is Not Just Findings

"Finding" is the user-facing label for validated compressed memory, not a separate backend primitive. Memory also includes weak evidence, failures, observations, model notes, paper summaries, code snapshots, operator decisions, and contradicted claims. Policies should be able to read all of it, with each record's confidence and provenance attached.

Raw Memory And Compressed Memory

Raw memory is close to execution: logs, metric traces, error messages, artifacts, and comments. Compressed memory summarizes many raw records into a conclusion. A validated finding is compressed memory backed by evidence. Both forms are needed: raw memory for audit, compressed memory for planning.
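
The relationship between the two forms can be sketched as a function that folds several raw records into one compressed record while keeping links back to its sources. The record layout, the `compress` helper, and the run names and metric values are all hypothetical and exist only for illustration.

```python
from statistics import mean

# Hypothetical raw records: one per run, close to execution.
raw_records = [
    {"run": "run_a", "metric": "pgr", "value": 0.61},
    {"run": "run_b", "metric": "pgr", "value": 0.64},
    {"run": "run_c", "metric": "pgr", "value": 0.58},
]


def compress(records, summary):
    """Summarize raw records into one compressed record that links back to them."""
    values = [r["value"] for r in records]
    return {
        "summary": summary,
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "source_runs": [r["run"] for r in records],  # audit trail to raw memory
    }


compressed = compress(raw_records, "PGR across three runs")
```

Planning code would read `compressed`, while an audit would follow `source_runs` back to the raw records.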

References

Every memory record should reference the things that support it: runs, datasets, benchmark versions, evaluation versions, artifacts, code snapshots, papers, and human comments. Without references, memory becomes lore and agents will eventually optimize against ungrounded summaries.
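
One way to keep ungrounded summaries out of policy decisions is to check for references before a record is accepted. The sketch below assumes records are plain dictionaries with a `references` mapping; the key names and the `is_grounded` helper are illustrative, not a defined interface.

```python
# Hypothetical reference categories, loosely following the list above.
REFERENCE_KEYS = ("runs", "datasets", "artifacts", "code_snapshots", "papers", "comments")


def is_grounded(record):
    """Return True if the record cites at least one supporting reference."""
    refs = record.get("references", {})
    return any(refs.get(key) for key in REFERENCE_KEYS)


grounded = {"summary": "mask bug inflated metrics", "references": {"runs": ["run_102"]}}
lore = {"summary": "larger batches always help", "references": {}}
```

Here `is_grounded(grounded)` is true and `is_grounded(lore)` is false, so a write path could reject or down-rank the second record.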

Scope And Visibility

Memory can be scoped to a workspace, benchmark, dataset, agent, policy, or user. Auto-research needs shared memory so agents avoid repeated dead ends, but some memory should remain private when it contains customer data, hidden labels, security details, or unfinished operator notes.
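
A visibility check along these lines might look like the following. It is a sketch under simple assumptions: workspace-scoped records are shared with everyone, private records are visible only to their owner, and narrower scopes require an explicit grant. The `owner`, `target`, and `grants` fields are hypothetical names.

```python
def visible_to(record, reader):
    """Decide whether `reader` may see `record` under an assumed scoping rule."""
    scope, target = record["scope"], record.get("target")
    if scope == "workspace":
        return True                             # shared with every agent
    if scope == "private":
        return record["owner"] == reader["id"]  # owner only
    # Narrower scopes (benchmark, dataset, agent, policy) need a matching grant.
    return (scope, target) in reader["grants"]


records = [
    {"id": "m1", "scope": "workspace"},
    {"id": "m2", "scope": "benchmark", "target": "math"},
    {"id": "m3", "scope": "private", "owner": "operator_1"},
]
agent = {"id": "agent_7", "grants": {("benchmark", "math")}}
visible = [r["id"] for r in records if visible_to(r, agent)]
```

With these inputs the agent sees the shared and benchmark-scoped records but not the operator's private note.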

Staleness And Contradiction

Memory can become wrong. A benchmark changes, a scorer bug is fixed, a dataset is deduplicated, or later runs contradict an earlier conclusion. Memory needs confidence, timestamps, supersession, and links to counter-evidence so policies do not keep acting on stale knowledge.
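
These signals can be combined into a single gate that a policy consults before acting on a record. The sketch below is one possible rule, not a prescribed one: the 90-day age limit, the "high confidence only" requirement, and the field names `superseded_by`, `counter_evidence`, and `created_at` are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_actionable(record, now, max_age=timedelta(days=90)):
    """Return True if a policy may still act on the record.

    The age window and the confidence threshold are illustrative assumptions.
    """
    if record.get("superseded_by"):
        return False                      # a newer record replaces this one
    if record.get("counter_evidence"):
        return False                      # later evidence contradicts it
    if record["confidence"] != "high":
        return False
    return now - record["created_at"] <= max_age


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
fresh = {"confidence": "high", "created_at": now - timedelta(days=10)}
stale = {"confidence": "high", "created_at": now - timedelta(days=200)}
contradicted = {**fresh, "counter_evidence": ["run_108"]}
```

Records that fail the gate stay readable for audit; they are only excluded from driving decisions.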

Examples

AutoGo failure memory

An agent discovers that a training-mask bug created fake loss improvements. The memory record links to the bad runs, the patch that fixed the bug, and the later evaluation showing the original gains were invalid.

memory:
  type: failure
  confidence: high
  summary: training-mask bug inflated fastlearn metrics
  references:
    runs: [run_102, run_108]
    artifacts: [mask_debug_report.md]
    supersedes: [memory_early_fastlearn_gain]

W2S validated result

A W2S idea improves PGR across five seeds. The memory record stores the result summary, metric distribution, code snapshot, evaluator version, and notes about when to retry the idea.

memory:
  type: finding
  confidence: medium
  summary: confidence reweighting improves PGR on math
  evidence:
    seeds: 5
    evaluator: pgr-v3
    snapshot: commit_81af

Owns / Defines

Observations, failures, findings, notes, papers, snapshots, decisions, and references.

Questions Operators Should Answer

Which memory records are raw traces, summarized notes, validated findings, operator decisions, or imported external knowledge?
Who or what can write memory, and what evidence is required before memory affects policy decisions?
How are memory records scoped: workspace-wide, benchmark-specific, agent-specific, dataset-specific, or private?
How are stale, contradicted, low-confidence, or superseded memories detected and handled?
What retrieval interfaces do agents and policies need: search, filters, embeddings, citations, confidence, or recency windows?
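
To make the retrieval question concrete, here is a minimal sketch of a search interface that combines keyword matching, a type filter, a confidence floor, and recency ordering. The `search` function, its parameters, and the record fields are hypothetical; the third record in the sample store is invented for illustration, and a real system might add embeddings or citation lookups.

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def search(records, text="", kind=None, min_confidence="low", limit=5):
    """Keyword search with type and confidence filters, newest first."""
    hits = [
        r for r in records
        if text.lower() in r["summary"].lower()
        and (kind is None or r["type"] == kind)
        and CONFIDENCE_RANK[r["confidence"]] >= CONFIDENCE_RANK[min_confidence]
    ]
    hits.sort(key=lambda r: r["created"], reverse=True)  # larger value = newer
    return hits[:limit]


store = [
    {"id": "m1", "type": "failure", "confidence": "high",
     "summary": "training-mask bug inflated fastlearn metrics", "created": 3},
    {"id": "m2", "type": "finding", "confidence": "medium",
     "summary": "confidence reweighting improves PGR on math", "created": 5},
    {"id": "m3", "type": "observation", "confidence": "low",
     "summary": "loss spikes after warmup", "created": 4},
]
```

A policy that should only act on well-supported records could call `search(store, min_confidence="medium")`, while an agent debugging a run could call `search(store, text="mask")` to find related failures.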