Jangwook Kim

Posted on • Originally published at effloow.com

MemMachine: Ground-Truth Memory for AI Agents

Every time an agent summarizes a conversation to save memory, it loses information. That trade-off has been accepted as unavoidable — LLMs produce long outputs, context windows are finite, and token costs are real. MemMachine, presented in arXiv paper 2604.04853 (April 2026), rejects that premise. Instead of extracting facts at write time, it stores entire conversational episodes verbatim and does the heavy lifting at retrieval time. The result: 93.0% on LongMemEvalS (ICLR 2025) and approximately 80% fewer input tokens compared to Mem0 under matched conditions.

This article walks through the architecture, explains why ground-truth preservation changes the memory equation for agent developers, and shows how to integrate MemMachine into a Python-based agent using the open-source SDK.

Why Agent Memory Is Still Broken

Most production agents handle long-term memory with one of two approaches: stuff everything into the context window (expensive and bounded) or summarize with an LLM before storing (lossy and irreversible).

The summarization approach — used by Mem0 and many RAG-based systems — runs an extraction pass at write time. The LLM reads a conversation and outputs a set of facts or a condensed summary. Those facts go into a vector store. When the user comes back, retrieved facts are injected into the prompt.

The problem is structural: LLM extraction at write time introduces irreversible information loss. What looks like a minor paraphrase today becomes a missed fact next month when the user returns with a follow-up. Multi-hop reasoning across multiple sessions is especially fragile because each hop must rely on the lossy summaries produced at previous write points.

The LoCoMo benchmark makes this concrete — it tests whether an agent can recall facts from extended conversations, and Mem0's token-heavy pipeline still trails open-source alternatives on accuracy. MemMachine reaches 0.9169 on LoCoMo (with gpt-4.1-mini), above published scores for Mem0, Zep, Memobase, and LangMem.

The Core Idea: Ground-Truth Preservation

MemMachine's defining design choice is deferred extraction. Raw conversational episodes are stored verbatim in a graph database. No LLM extraction pass runs at write time. When the agent needs to recall something, a Retrieval Agent queries the episode store and surfaces the original conversation context.

This flips the cost curve. Write operations become cheap — store an episode, index it, done. Read operations become slightly richer — contextualized retrieval expands nucleus matches with neighboring episode context, so a query about "my dietary restrictions" pulls not just the turn where that phrase appeared, but the surrounding dialogue that gives it meaning.
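To make the idea concrete, here is a minimal, hypothetical sketch of nucleus expansion. It is not MemMachine's implementation, just the pattern: a semantic match is widened with the turns around it in the same session before being handed to the model.

from dataclasses import dataclass

@dataclass
class Episode:
    session_id: str
    turn: int
    role: str
    content: str

def contextualize(nucleus: Episode, episodes: list[Episode], window: int = 2) -> list[Episode]:
    """Expand a single matched episode (the nucleus) with its neighboring
    turns from the same session, so the match comes back with the
    surrounding dialogue that gives it meaning."""
    same_session = sorted(
        (e for e in episodes if e.session_id == nucleus.session_id),
        key=lambda e: e.turn,
    )
    lo, hi = nucleus.turn - window, nucleus.turn + window
    return [e for e in same_session if lo <= e.turn <= hi]

Because the verbatim episodes are still there, this expansion step can always recover detail that a write-time summary would have discarded.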

The architecture has four distinct memory tiers:

Short-term workspace — the current conversation buffer. Limited capacity, cleared between sessions (standard context window behavior).

Long-term episodic memory — the ground-truth store. Full conversational episodes in a graph database. This is the structural difference from Mem0-style systems.

Semantic/profile memory — high-level user facts (preferences, identity, stated goals) stored in a SQL database. This tier does run LLM extraction, but only for stable profile data, not for ephemeral conversation content.

Procedural memory — learned patterns, action sequences, and strategies the agent has acquired over interactions.

The episodic tier does the heavy lifting. Because episodes are stored verbatim, the system can always return to the source of truth rather than a derivative. That is what "ground-truth-preserving" means in practice.

The Retrieval Agent: Three Routing Strategies

Raw storage would be useless without smart retrieval. MemMachine introduces a companion Retrieval Agent with a ToolSelectAgent classifier that routes each incoming query to one of three strategies:

Direct retrieval — semantic similarity search over the episode store. Used for simple, single-fact queries ("What is the user's preferred language?"). Fast and low-latency.

Parallel decomposition — splits multi-part queries into sub-queries and executes them concurrently, then merges results. Used when a question has several independent dimensions ("What does the user prefer about both their IDE setup and their deployment workflow?").

Chain-of-query — iterative retrieval where each step informs the next. Used for multi-hop reasoning ("What framework was the user migrating to, and what deployment platform did they choose for it?"). Each query builds on what the previous step retrieved.
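The paper does not publish the classifier prompt, so the sketch below is only a rough illustration of the dispatch pattern (the classify, direct, parallel, and chain callables are all hypothetical stand-ins supplied by the caller), not the ToolSelectAgent itself:

from enum import Enum

class Route(str, Enum):
    DIRECT = "direct"
    PARALLEL = "parallel"
    CHAIN = "chain_of_query"

def route_query(question: str, classify) -> Route:
    """Ask a small LLM (the `classify` callable) which retrieval strategy
    fits the question; fall back to direct retrieval if the label is
    unrecognized."""
    label = classify(
        "Label this question as one of: direct, parallel, chain_of_query.\n"
        f"Question: {question}"
    ).strip().lower()
    return Route(label) if label in {r.value for r in Route} else Route.DIRECT

def answer(question: str, classify, direct, parallel, chain):
    # Dispatch to whichever strategy the classifier picked.
    strategy = {
        Route.DIRECT: direct,
        Route.PARALLEL: parallel,
        Route.CHAIN: chain,
    }[route_query(question, classify)]
    return strategy(question)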

This adaptive routing is what enables the multi-hop benchmark numbers. On HotpotQA-hard the Retrieval Agent reaches 93.2%. On WikiMultiHop (2WikiMultiHopQA with randomized noise) it reaches 92.6%, while reducing input tokens by 59% (from 103k to 42k) compared to context-window-stuffing approaches.

The crucial insight: the routing strategies are layered on top of the storage model without modifying it. Improving retrieval does not require changing how episodes are stored. That separation makes the system extensible — you can swap in a new routing strategy without touching the episode graph.

LongMemEvalS: What the 93% Actually Means

LongMemEval (ICLR 2025) benchmarks five long-term memory abilities:

  1. Information extraction — can the agent find a stated fact?
  2. Multi-session reasoning — can it connect facts across sessions?
  3. Temporal reasoning — can it handle time-sensitive updates ("I changed jobs last month")?
  4. Knowledge updates — does it override stale information correctly?
  5. Abstention — does it correctly say "I don't know" rather than hallucinate?

LongMemEvalS is the subset used in the MemMachine paper's ablation study. The 93.0% overall accuracy comes from stacking six optimization dimensions, with retrieval-stage improvements contributing more than ingestion-stage changes:

  • Retrieval depth tuning: +4.2%
  • Context formatting: +2.0%
  • Search prompt design: +1.8%
  • Query bias correction: +1.4%
  • Sentence chunking (ingestion): smaller contribution

The ablation is honest — it shows that the biggest gains come from how you retrieve, not from a magic storage algorithm. The ground-truth-preserving model matters because it gives retrieval something accurate to work with. But retrieval engineering is where the performance headroom lives.

Benchmarks at a Glance

| Benchmark | MemMachine | Mem0 | Zep / Graphiti | Notes |
| --- | --- | --- | --- | --- |
| LongMemEvalS (ICLR 2025) | 93.0% | [DATA NOT AVAILABLE] | [DATA NOT AVAILABLE] | 6-dimension ablation, gpt-4.1-mini |
| LoCoMo (F1/accuracy) | 0.9169 | Lower (paper claim) | Lower (paper claim) | gpt-4.1-mini; best published open-framework result |
| HotpotQA-hard | 93.2% | [DATA NOT AVAILABLE] | [DATA NOT AVAILABLE] | Retrieval Agent multi-hop |
| WikiMultiHop (noisy) | 92.6% | [DATA NOT AVAILABLE] | [DATA NOT AVAILABLE] | 103k → 42k tokens (59% reduction) |
| Input tokens vs Mem0 (LoCoMo) | ~80% fewer | Baseline | [DATA NOT AVAILABLE] | Write-time savings; no LLM extraction pass |

Independent benchmark comparisons beyond what the MemMachine paper reports are not available at the time of writing. The figures above come from arXiv:2604.04853 and the MemMachine official blog; treat the relative comparisons as claims to verify when production-scale testing is feasible.

Getting Started: Installation and Basic Usage

MemMachine is open-source at github.com/MemMachine/MemMachine and published as memmachine on PyPI. The quickstart requires Docker and Docker Compose because the system uses both a graph database and a SQL database for its memory tiers.

Prerequisites

  • Docker 24+ and Docker Compose
  • Python 3.10+
  • An OpenAI API key (or compatible LLM endpoint for the Retrieval Agent)

Installation

# Download the latest release tarball from the GitHub releases page,
# extract it, then run the setup script
./setup.sh  # walks through Docker config and API key setup

# Alternatively, install the Python SDK standalone:
pip install memmachine

Core Workflow

The SDK follows a four-step pattern: retrieve → enrich → generate → store.

from memmachine import MemMachineClient

client = MemMachineClient(base_url="http://localhost:8000", api_key="YOUR_API_KEY")

user_id = "user_42"
message = "I prefer TypeScript over Python for backend work."

# 1. Retrieve relevant memories before generating a response
memories = client.memory.search(query=message, producer=user_id, limit=5)

# 2. Enrich context with retrieved episodes
context = "\n".join([m.content for m in memories.results])
enriched_prompt = f"User context:\n{context}\n\nUser message:\n{message}"

# 3. Generate response using your LLM (not shown — use your preferred client)
response = your_llm.complete(enriched_prompt)

# 4. Store the full episode verbatim
client.memory.add(
    messages=[
        {"role": "user", "content": message},
        {"role": "assistant", "content": response}
    ],
    producer=user_id
)

The producer field ties episodes to a specific user. When that user returns, every search() call scoped to their producer ID retrieves from their episode history — across all past sessions.
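As a concrete usage example, a later session for the same user needs no extra bookkeeping. The snippet below reuses only the SDK calls already shown above:

from memmachine import MemMachineClient

client = MemMachineClient(base_url="http://localhost:8000", api_key="YOUR_API_KEY")
user_id = "user_42"

# A later session for the same user: nothing was summarized at write time,
# so the verbatim episode from the earlier session is still searchable.
followup = "Which language did I say I prefer for backend work?"

memories = client.memory.search(query=followup, producer=user_id, limit=5)
for m in memories.results:
    print(m.content)  # should surface the TypeScript-over-Python episode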

Accessing the Retrieval Agent

For multi-hop queries, use the Retrieval Agent directly:

# The Retrieval Agent automatically selects the routing strategy
result = client.retrieval_agent.query(
    question="What database was the user migrating to last month, and what hosting provider did they pick?",
    producer=user_id
)
print(result.answer)
print(result.sources)  # episodes that contributed to the answer

The agent classifies the query, selects direct, parallel, or chain-of-query routing, and returns both the answer and the source episodes. This traceability — knowing which stored episodes contributed to the answer — is a practical advantage over LLM-extraction-based systems where the provenance chain is broken at write time.

MemMachine vs Mem0: When to Choose Which

The ground-truth-preserving approach is not free. Storing raw episodes grows the database faster than storing extracted summaries. For applications where storage cost is a hard constraint and conversation length is short, Mem0's extraction approach may be a reasonable trade-off. For applications where accuracy and multi-session reasoning matter — personalized coding assistants, customer support agents with long histories, companion AI — the accuracy gains from MemMachine's architecture are likely to outweigh the storage overhead.

Key questions to guide the choice:

  • Are your sessions long and multi-turn? Ground-truth preservation gains more from longer episodes.
  • Do you need multi-hop reasoning across sessions? The Retrieval Agent's chain-of-query routing is designed for this.
  • Is write-time token cost a concern? MemMachine's no-extraction-at-write-time model cuts ingestion costs substantially.
  • Do you need audit trails? Storing raw episodes lets you trace every memory back to its source conversation.

Common Mistakes When Building Agent Memory

Over-indexing on single-session performance. Most agent memory benchmarks run in a single session. LongMemEval is one of the few that tests across sessions. Evaluate your memory system on multi-session workloads before deploying.

Assuming RAG extraction is lossless. Every LLM extraction pass introduces paraphrase drift. Test by storing a conversation, extracting from it, then asking questions the original conversation answers but the summary might not.

Ignoring retrieval depth. The MemMachine ablation shows retrieval depth tuning (+4.2%) has more impact than chunking strategy. Most teams optimize chunking obsessively while leaving retrieval depth at defaults.
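If you are using the SDK shown earlier, the simplest knob to experiment with is the search limit (this assumes limit maps to retrieval depth in your setup; verify against the docs). A rough sweep, reusing the client and user_id from the Core Workflow example:

question = "What deployment platform did the user choose?"

# Sweep retrieval depth instead of accepting the default; the right
# value is workload-dependent and worth measuring.
for depth in (3, 5, 10, 20):
    memories = client.memory.search(query=question, producer=user_id, limit=depth)
    print(depth, len(memories.results))
    # ...feed the enriched prompt to your eval harness and record accuracy...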

Skipping abstention testing. A memory system that hallucinates when it does not know something is worse than one with lower recall. LongMemEval's abstention dimension is worth including in your eval suite.

Not scoping memories to users. Mixing memories across users is a privacy and accuracy risk. Always use a user-scoped key (like MemMachine's producer field) from day one.

FAQ

Q: Does MemMachine work with models other than OpenAI?

The Retrieval Agent is LLM-agnostic — it uses the LLM for query classification and chain-of-query reasoning, so any model with tool-use capability should work. The documentation references OpenAI by default in the quickstart, but the SDK supports custom LLM endpoints.

Q: How does MemMachine handle knowledge updates?

Newer episodes naturally take precedence in retrieval ranking. For explicit corrections ("actually, I switched from TypeScript to Go last week"), the Retrieval Agent's temporal reasoning handles the update — it surfaces the more recent episode and the semantic/profile memory layer can be updated with the corrected fact.

Q: Is there a hosted / cloud version?

The GitHub repository and PyPI package are open-source. A managed cloud offering (memmachine.ai) appears to be available based on the official site, but pricing and tier details were not independently verified at time of writing.

Q: How does the graph database fit into the architecture?

Episodic memory is stored in a graph database rather than a flat vector store. This allows the system to represent relationships between episodes (same user, same topic, same session) and traverse those relationships during contextualized retrieval — expanding a nucleus match with connected episodes without running a second embedding search.

Q: Can I use MemMachine with LangChain or LlamaIndex?

Integration guides for major agent frameworks were not available in the documentation at the time of writing. The Python SDK and RESTful API are framework-agnostic, so wrapping them for LangChain or LlamaIndex is straightforward. An MCP server interface is also listed as available.
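As a rough sketch of what such a wrapper could look like (not an official integration; it assumes the SDK calls shown earlier in this article and uses LangChain's standard tool decorator):

from langchain_core.tools import tool
from memmachine import MemMachineClient

client = MemMachineClient(base_url="http://localhost:8000", api_key="YOUR_API_KEY")

@tool
def recall_user_memory(query: str, user_id: str) -> str:
    """Search the user's MemMachine episode history and return matching context."""
    memories = client.memory.search(query=query, producer=user_id, limit=5)
    return "\n".join(m.content for m in memories.results)

# The tool can then be passed to any LangChain agent via its tools list.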

Key Takeaways

Ground-truth-preserving memory is a principled response to a real problem: LLM extraction at write time is lossy, and that loss compounds across sessions. MemMachine's approach — store raw episodes, retrieve intelligently — trades slightly more storage for substantially better accuracy and traceability.

The benchmark results (93.0% LongMemEvalS, 0.9169 LoCoMo, 93.2% HotpotQA-hard) place it at the top of published open-framework results. The 80% token reduction on write operations is a meaningful cost argument for high-volume applications.

For developers building agents that need to remember users across sessions — and get the details right — MemMachine is the most technically grounded option in the open-source memory layer space as of April 2026.

Bottom Line

MemMachine's ground-truth-preserving architecture solves the write-time extraction problem that makes most agent memory systems degrade over long conversations. If you're building personalized agents that need accurate multi-session recall, it's worth evaluating against your current Mem0 or RAG-based setup — the token savings alone may offset the storage overhead.


Sources: arXiv:2604.04853 · MemMachine GitHub · PyPI: memmachine · LongMemEval benchmark · MemMachine blog: LoCoMo results · MemMachine blog: WikiMultiHop
