DEV Community

Guatu

Posted on • Edited on • Originally published at guatulabs.dev

Cognitive Memory for Agents: Vector Search vs Activation-Based Recall

I spent a few weeks trying to build an agent that could remember specific user preferences across sessions without bloating the context window to a point where latency became unbearable. The standard advice is always "just use a vector database." But as the memory store grew, I noticed a weird gap: the agent could find a document about "user prefers dark mode" via cosine similarity, but it couldn't "recall" the immediate emotional state or the nuance of the last three turns of conversation unless they were explicitly mirrored in the embedding.

The problem is that vector search is a retrieval mechanism, not a cognitive memory system. When you move from simple RAG to actual agentic memory, you have to choose between external vector search and internal activation-based recall.

The Decision Point

You face this choice when your agent's "short-term" memory (the context window) is full, and your "long-term" memory (the database) is returning results that are mathematically similar but contextually irrelevant.

If you need your agent to remember a 500-page technical manual, you need a vector store. If you need your agent to exhibit a consistent "personality" or recall a specific pattern of behavior that isn't easily summarized into a string of text for an embedding model, you need something closer to activation-based recall.

Option A: Vector Search (The External Archive)

Vector search is the industry standard for a reason: it's easy to scale and the tooling is mature. You turn a piece of text into a vector using an embedding model (like text-embedding-3-small), shove it into a store like FAISS or Milvus, and query it with another vector.

Strengths:

  • Scale: You can store billions of vectors.
  • Cold Storage: It doesn't eat VRAM. It lives on disk or in a dedicated database.
  • Interpretability: I can literally query the database and see exactly which chunk of text was retrieved.

Weaknesses:

  • The "Semantic Gap": Cosine similarity is a blunt instrument. If a user says "That's not what I meant," a vector search might retrieve a passage about "meaning" or "intent" rather than understanding the correction.
  • Latency: You have to embed the query, hit the DB, and then stuff the results into the prompt.

Here is a basic implementation using FAISS. I use this for the "knowledge base" layer of my agents:

```python
import faiss
import numpy as np

# Dimension depends on your embedding model (e.g., 1536 for
# text-embedding-3-small); 128 keeps this demo lightweight.
dimension = 128
nb = 1000  # number of memory chunks
index = faiss.IndexFlatL2(dimension)

# Mock embeddings of agent experiences
vectors = np.random.random((nb, dimension)).astype('float32')
index.add(vectors)

# Query for the top 4 most similar memories
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, 4)
print(f"Retrieved memory indices: {indices}")
```

Option B: Activation-Based Recall (The Internal Intuition)

Activation-based recall is more akin to how biological memory works. Instead of searching a database, the "memory" is stored in the weights or the hidden states of the model. In modern agent architectures, this often involves using activation hooks or specialized memory layers (like Memory Transformers) that allow the model to trigger a recall based on the current internal state of the network.

Strengths:

  • Speed: There is no external API call or DB lookup. The recall happens during the forward pass.
  • Nuance: It captures "how" something was said, not just "what" was said. It's an associative trigger rather than a keyword search.

Weaknesses:

  • The Black Box: Debugging this is a nightmare. You can't just "look" at the database to see why the agent recalled a specific memory.
  • VRAM Pressure: Storing these activations or maintaining a dynamic memory network consumes precious GPU memory.

I've experimented with simple activation hooks in PyTorch to track which "states" trigger certain behaviors. It's not a full-blown Memory Transformer, but it's a start:

```python
import torch
from torch import nn

class AgentModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(128, 128)
        self.memory_buffer = []
        # A forward hook captures this layer's activation on every pass
        self.encoder.register_forward_hook(self._capture)

    def _capture(self, module, inputs, output):
        # Store a detached copy for later recall/analysis.
        # In a real system you'd cap this buffer's size.
        self.memory_buffer.append(output.detach().cpu())

    def forward(self, x):
        # The hook fires as part of this call
        return torch.tanh(self.encoder(x))

model = AgentModel()
output = model(torch.rand(1, 128))
print(f"Stored state vector shape: {model.memory_buffer[-1].shape}")
```
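Storing states is only half the job; recall means matching the current state against the buffer. Here's a toy associative-recall sketch (my own minimal approach, nothing like a real Memory Transformer): cosine similarity between the current activation and everything in the buffer, returning the indices of the best matches.

```python
import torch

def recall(memory_buffer, current_state, top_k=1):
    """Associative recall: indices of buffered states most
    similar (cosine) to the current activation."""
    if not memory_buffer:
        return []
    bank = torch.cat(memory_buffer, dim=0)             # (N, D)
    sims = torch.nn.functional.cosine_similarity(
        bank, current_state.expand_as(bank), dim=1)    # (N,)
    return sims.topk(min(top_k, len(memory_buffer))).indices.tolist()

# Toy usage: three stored states, query closest to the second
buffer = [torch.tensor([[1.0, 0.0]]),
          torch.tensor([[0.0, 1.0]]),
          torch.tensor([[1.0, 1.0]])]
print(recall(buffer, torch.tensor([[0.1, 1.0]])))  # → [1]
```

Yes, this is just vector search over activations. The difference is what the vectors represent: internal network states from the live conversation rather than pre-chunked text embeddings.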

Decision Framework

| Criteria | Vector Search | Activation-Based Recall |
| --- | --- | --- |
| Data Volume | Massive (TB+) | Small (MB to GB) |
| Retrieval Speed | Milliseconds (network/disk) | Microseconds (GPU) |
| Precision | Semantic/keyword | Associative/pattern |
| Debugging | Easy (query the DB) | Hard (analyze tensors) |
| Resource Cost | CPU/disk/API | VRAM/compute |

My Pick and Why

I don't pick one. I use a hybrid.

If you're building a production agent, relying solely on vector search leads to that "robotic" feeling where the agent repeats the same retrieved snippet regardless of the conversation flow. Relying solely on activations is a recipe for a system you can't debug when it starts hallucinating.

I implement a tiered system. I use a vector store for the "Library" (hard facts, documentation) and a sliding window of activations for the "Working Memory" (current mood, immediate goals, recent corrections). This mirrors the 6-layer memory architecture I've used for my own tools.
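In skeleton form, the tiering looks like this (the `TieredMemory` class and its method names are my own illustrative naming, not a standard API; a plain numpy scan stands in for the vector store):

```python
from collections import deque
import numpy as np

class TieredMemory:
    """Toy two-tier memory: a 'Library' of embedded facts plus a
    sliding-window 'Working Memory' of recent state vectors."""

    def __init__(self, window=8):
        self.library = []                    # list of (embedding, text)
        self.working = deque(maxlen=window)  # recent states, auto-evicted

    def archive(self, embedding, text):
        """Write a hard fact into the Library tier."""
        self.library.append((np.asarray(embedding, dtype='float32'), text))

    def observe(self, state):
        """Push a recent activation into Working Memory."""
        self.working.append(np.asarray(state, dtype='float32'))

    def lookup(self, query):
        """Nearest Library fact by L2 distance (cf. the FAISS example)."""
        if not self.library:
            return None
        q = np.asarray(query, dtype='float32')
        dists = [np.linalg.norm(e - q) for e, _ in self.library]
        return self.library[int(np.argmin(dists))][1]

mem = TieredMemory(window=2)
mem.archive([1.0, 0.0], "user prefers dark mode")
mem.archive([0.0, 1.0], "user deploys on Fridays")
mem.observe([0.5, 0.5]); mem.observe([0.2, 0.8]); mem.observe([0.1, 0.9])
print(mem.lookup([0.9, 0.1]))  # nearest Library fact
print(len(mem.working))        # capped at the window size: 2
```

The `deque(maxlen=...)` gives you the sliding-window eviction for free: old working-memory states fall off the back as new ones arrive, while the Library only grows through explicit `archive` calls.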

For those building multi-agent systems, I recommend offloading the vector search to a shared service and keeping the activation-based recall local to the agent's specific instance. This prevents the "shared memory" from becoming a noisy mess of conflicting embeddings. You can see how this fits into larger patterns in my post on multi-agent architecture patterns.

If you're still struggling with agents that forget things every five minutes, you might be hitting a safety loop. I've written about three-layer safety for autonomous agents which often solves the "infinite loop" problem that people mistake for a memory issue.

If you need help designing a memory architecture that doesn't melt your GPU or your budget, check out my AI agent consulting services.

Lessons learned:
The docs for vector DBs make it sound like they replace the need for cognitive memory. They don't. They replace the need for a filing cabinet. If you want an agent that actually "feels" like it's learning from a conversation in real-time, you have to move closer to the activations.

Top comments (1)

Max Quimby

The Library vs Working Memory split matches what we ended up with after about a year of trying to make a single vector store carry both loads. The failure mode of vector-only memory is exactly what you describe — the agent "remembers" by reciting retrieved snippets verbatim, which reads as robotic and, worse, hides whether it actually internalized anything.

Two things worth adding from running this in production:

  1. The hardest part of the hybrid isn't either layer — it's the write policy. What gets promoted from working memory into the vector library, and when? We started with "everything," drowned in noise, then went to "only summaries the agent itself proposes" and it worked much better. Self-curated memory is underrated.

  2. Activation-based recall is genuinely opaque to debug, so we treat the working-memory layer as ephemeral by default and force any claim the agent makes about "remembering" something durable to come from the vector layer. It costs some of the conversational nuance but it means a user can actually ask "why did you do that" and get a grounded answer.

What's your eviction policy on the working-memory side? That's the part I haven't found a clean answer for.