Memory
What the system keeps for future decisions.
Definition
Memory is retained research knowledge that can influence future work. It includes observations, failures, findings, notes, papers, decisions, summaries, and references to runs, artifacts, metrics, datasets, and code snapshots. A finding is a high-confidence memory record with evidence, not a separate product primitive.
How It Looks
A memory record looks like a cited research note: summary, confidence, scope, supporting runs, artifacts, metrics, code snapshots, contradictions, and rules for whether policies may act on it.
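As a rough illustration of the record shape described above, the fields could be modeled as a small data class. This is a minimal sketch, not the system's schema: the class name `MemoryRecord`, every field name, and the `memory_mask_bug` identifier are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryRecord:
    """Hypothetical shape of one memory record; all field names are illustrative."""
    record_id: str
    type: str          # e.g. "observation", "failure", "finding"
    summary: str
    confidence: str    # "low", "medium", or "high"
    scope: str         # e.g. "workspace", "benchmark", "private"
    # Supporting evidence grouped by kind: runs, artifacts, metrics, snapshots.
    references: Dict[str, List[str]] = field(default_factory=dict)
    # Ids of records that contradict this one.
    contradicted_by: List[str] = field(default_factory=list)
    # Whether policies are allowed to act on this record.
    policy_actionable: bool = False


record = MemoryRecord(
    record_id="memory_mask_bug",  # hypothetical id
    type="failure",
    summary="training-mask bug inflated fastlearn metrics",
    confidence="high",
    scope="workspace",
    references={
        "runs": ["run_102", "run_108"],
        "artifacts": ["mask_debug_report.md"],
    },
    policy_actionable=True,
)
```

Keeping the policy flag on the record itself, rather than inferring it from confidence alone, is one possible design choice; it lets an operator block a high-confidence record from driving policy.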
How To Use It
Use memory to prevent repeated mistakes and preserve research knowledge. Store both raw observations and compressed findings, but make confidence and references explicit.
Memory Is Not Just Findings
Finding is the user-facing label for validated compressed memory, not a separate backend primitive. Memory also includes weak evidence, failures, observations, model notes, paper summaries, code snapshots, operator decisions, and contradicted claims. Policies should be able to read all of it with confidence and provenance attached.
Raw Memory And Compressed Memory
Raw memory is close to execution: logs, metric traces, error messages, artifacts, and comments. Compressed memory summarizes many raw records into a conclusion. A validated finding is compressed memory backed by evidence. Both forms are needed: raw memory for audit, compressed memory for planning.
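The step from raw to compressed memory can be sketched as a function that summarizes many raw observations while keeping the run ids that support the summary. The types `RawObservation` and `CompressedMemory`, the `compress` helper, and the sample values are all hypothetical; a real system would likely summarize far richer traces.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RawObservation:
    run_id: str
    metric: str
    value: float


@dataclass
class CompressedMemory:
    summary: str
    supporting_runs: List[str]  # kept so the summary stays auditable
    n_observations: int


def compress(observations: List[RawObservation], metric: str) -> CompressedMemory:
    """Summarize raw observations of one metric, keeping run references."""
    selected = [o for o in observations if o.metric == metric]
    if not selected:
        raise ValueError(f"no raw observations for metric {metric!r}")
    avg = mean(o.value for o in selected)
    runs = sorted({o.run_id for o in selected})
    return CompressedMemory(
        summary=f"mean {metric} = {avg:.3f} over {len(runs)} runs",
        supporting_runs=runs,
        n_observations=len(selected),
    )


# Illustrative values only, not real results.
raw = [
    RawObservation("run_a", "pgr", 0.25),
    RawObservation("run_b", "pgr", 0.75),
    RawObservation("run_b", "loss", 1.5),
]
compressed = compress(raw, "pgr")
```

The compressed record is what a planner would read; the `supporting_runs` list is the path back to the raw records for audit.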
References
Every memory record should reference the things that support it: runs, datasets, benchmark versions, evaluation versions, artifacts, code snapshots, papers, and human comments. Without references, memory becomes lore and agents will eventually optimize against ungrounded summaries.
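One way to enforce this at write time is a check that rejects records with no supporting references. The sketch below assumes references arrive as a mapping from kind to a list of ids; the kind names in `REFERENCE_KINDS` and the function `validate_references` are hypothetical, loosely following the list above.

```python
from typing import Dict, List

# Hypothetical reference kinds, loosely following the list in the text.
REFERENCE_KINDS = {
    "runs", "datasets", "benchmark_versions", "evaluation_versions",
    "artifacts", "code_snapshots", "papers", "comments",
}


def validate_references(references: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the record is grounded."""
    problems = []
    unknown = sorted(set(references) - REFERENCE_KINDS)
    if unknown:
        problems.append(f"unknown reference kinds: {unknown}")
    # Grounded means at least one known kind has a non-empty id list.
    if not any(references.get(kind) for kind in REFERENCE_KINDS):
        problems.append("record has no supporting references")
    return problems
```

A writer could refuse to store a record when the returned list is non-empty, or store it with low confidence and a flag for review.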
Scope And Visibility
Memory can be scoped to a workspace, benchmark, dataset, agent, policy, or user. Auto-research needs shared memory so agents avoid repeated dead ends, but some memory should remain private when it contains customer data, hidden labels, security details, or unfinished operator notes.
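A visibility check along these lines could decide whether a given reader may see a record. This is a sketch under assumed semantics: the scope names, the `ScopedMemory` and `Reader` types, and the deny-by-default rule are illustrative choices, not the system's access model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScopedMemory:
    record_id: str
    scope: str                 # "workspace", "benchmark", "agent", or "private"
    owner: str
    scope_key: Optional[str] = None  # benchmark name or agent id, when relevant


@dataclass
class Reader:
    name: str
    benchmarks: List[str] = field(default_factory=list)


def can_read(record: ScopedMemory, reader: Reader) -> bool:
    """Return True if the reader may see the record under the assumed rules."""
    if record.scope == "workspace":
        return True  # shared memory: every agent in the workspace sees it
    if record.scope == "private":
        return record.owner == reader.name
    if record.scope == "benchmark":
        return record.scope_key in reader.benchmarks
    if record.scope == "agent":
        return record.scope_key == reader.name
    return False  # unknown scopes are denied by default


shared = ScopedMemory("m1", "workspace", owner="agent_a")
private = ScopedMemory("m2", "private", owner="operator_1")
agent_b = Reader("agent_b", benchmarks=["math"])
```

Denying unknown scopes by default is the conservative choice for records that may contain customer data or hidden labels.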
Staleness And Contradiction
Memory can become wrong. A benchmark changes, a scorer bug is fixed, a dataset is deduplicated, or later runs contradict an earlier conclusion. Memory needs confidence, timestamps, supersession, and links to counter-evidence so policies do not keep acting on stale knowledge.
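The fields named above can be combined into a simple gate that decides whether a policy may still act on a record. The sketch is hedged throughout: `TimedMemory`, `actionable`, the rule that only high-confidence records qualify, and the 90-day default age limit are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class TimedMemory:
    record_id: str
    confidence: str                       # "low", "medium", or "high"
    created_at: datetime
    superseded_by: Optional[str] = None   # id of the record that replaced this one
    counter_evidence: List[str] = field(default_factory=list)


def actionable(record: TimedMemory, now: datetime,
               max_age: timedelta = timedelta(days=90)) -> bool:
    """Return True if a policy may act on the record (assumed rules)."""
    if record.superseded_by is not None:
        return False  # a newer record replaces this one
    if record.counter_evidence:
        return False  # contradicted: needs review before reuse
    if record.confidence != "high":
        return False
    return now - record.created_at <= max_age  # too old counts as stale


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
fresh = TimedMemory("m1", "high", now - timedelta(days=10))
stale = TimedMemory("m2", "high", now - timedelta(days=200))
replaced = TimedMemory("m3", "high", now - timedelta(days=10), superseded_by="m4")
```

A fixed age limit is the crudest staleness signal; tying staleness to benchmark or evaluator version changes would track the causes listed above more directly.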
Examples
AutoGo failure memory
An agent discovers that a training-mask bug created fake loss improvements. The memory record links to the bad runs, the patch that fixed the bug, and the later evaluation showing the original gains were invalid.
memory:
  type: failure
  confidence: high
  summary: training-mask bug inflated fastlearn metrics
  references:
    runs: [run_102, run_108]
    artifacts: [mask_debug_report.md]
  supersedes: [memory_early_fastlearn_gain]
W2S validated result
A W2S idea improves PGR across five seeds. The memory record stores the result summary, metric distribution, code snapshot, evaluator version, and notes about when to retry the idea.
memory:
  type: finding
  confidence: medium
  summary: confidence reweighting improves PGR on math
  evidence:
    seeds: 5
    evaluator: pgr-v3
    snapshot: commit_81af
Owns / Defines
Observations, failures, findings, notes, papers, snapshots, decisions, and references.
Questions Operators Should Answer
- Which memory records are raw traces, summarized notes, validated findings, operator decisions, or imported external knowledge?
- Who or what can write memory, and what evidence is required before memory affects policy decisions?
- How are memory records scoped: workspace-wide, benchmark-specific, agent-specific, dataset-specific, or private?
- How are stale, contradicted, low-confidence, or superseded memories detected and handled?
- What retrieval interfaces do agents and policies need: search, filters, embeddings, citations, confidence, or recency windows?
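To make the retrieval question concrete, here is a minimal sketch of a keyword search with a confidence floor and a recency window that returns records with their citations attached. The names `IndexedMemory`, `search`, and `CONFIDENCE_RANK`, along with the sample ages, are hypothetical; embedding-based retrieval is left out to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import List, Optional

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


@dataclass
class IndexedMemory:
    record_id: str
    summary: str
    confidence: str       # "low", "medium", or "high"
    age_days: int
    citations: List[str]  # run ids, artifacts, or snapshots backing the record


def search(records: List[IndexedMemory], query: str,
           min_confidence: str = "low",
           max_age_days: Optional[int] = None) -> List[IndexedMemory]:
    """Keyword match on the summary, then filter by confidence and recency."""
    terms = query.lower().split()
    floor = CONFIDENCE_RANK[min_confidence]
    hits = []
    for r in records:
        text = r.summary.lower()
        if not all(t in text for t in terms):
            continue
        if CONFIDENCE_RANK[r.confidence] < floor:
            continue
        if max_age_days is not None and r.age_days > max_age_days:
            continue
        hits.append(r)
    # Highest confidence first, then most recent.
    return sorted(hits, key=lambda r: (-CONFIDENCE_RANK[r.confidence], r.age_days))


# Illustrative index; ages are made up for the example.
index = [
    IndexedMemory("m1", "training-mask bug inflated fastlearn metrics",
                  "high", 30, ["run_102", "run_108"]),
    IndexedMemory("m2", "confidence reweighting improves PGR on math",
                  "medium", 5, ["commit_81af"]),
]
results = search(index, "fastlearn metrics", min_confidence="medium")
```

Returning whole records rather than bare summaries keeps confidence and citations next to every hit, so a policy never sees a claim without its provenance.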