Omnithium

Posted on • Originally published at omnithium.ai

Memory and Context Management in Long-Running AI Agents

Most AI agent tutorials focus on single-turn conversations or simple API calls. Production systems operate differently. They handle ongoing user sessions, multi-day workflows, and complex state management that requires sophisticated memory architecture. Without proper memory design, agents suffer from context window overflow, inconsistent behavior, and information degradation over time.

We've seen production agents fail because they forgot critical user preferences three turns into a conversation, or because they tried to cram 50,000 tokens of historical context into a 32k window. Proper memory management isn't optional; it's what separates prototypes from production-ready systems.

The Two-Layer Memory Architecture

Production agents need both short-term and long-term memory working together. Short-term memory handles the immediate conversation context and current task state. Long-term memory stores persistent knowledge about users, historical interactions, and organizational data.

Short-term memory typically lives in working variables, conversation buffers, and temporary session storage. It's fast, volatile, and optimized for the current interaction. Long-term memory uses persistent databases, vector stores, and file systems. It's slower, durable, and designed for retrieval across sessions.

This separation matters because you can't afford to search your entire vector database on every user message. The architecture should retrieve relevant long-term memories at conversation start or key transition points, then work primarily from short-term memory during the active session.

class AgentMemory:
    def __init__(self, session_id, user_id):
        self.short_term = SessionMemory(session_id)
        self.long_term = PersistentMemory(user_id)
        self.context_window = ContextWindow(max_tokens=32000)

    async def process_message(self, user_input):
        # Retrieve relevant long-term memories if new session
        if self.short_term.is_new_session():
            relevant_memories = await self.long_term.retrieve_relevant(user_input)
            self.short_term.load_memories(relevant_memories)

        # Add to conversation history
        self.short_term.add_message("user", user_input)

        # Maintain context window limits
        self.context_window.trim_history(self.short_term.get_conversation())

        return self.context_window.get_current_context()
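The `ContextWindow` helper above is assumed rather than defined. A minimal sketch might look like the following; it uses a rough four-characters-per-token estimate as a stand-in for a real tokenizer, which is an assumption you'd replace in production:

```python
class ContextWindow:
    """Keeps a running message list under a token budget (hypothetical helper)."""

    def __init__(self, max_tokens=32000):
        self.max_tokens = max_tokens
        self.messages = []

    def _estimate_tokens(self, text):
        # Rough heuristic: ~4 characters per token; swap in a real tokenizer in production
        return len(text) // 4 + 1

    def trim_history(self, messages):
        # Walk backwards from the newest message, keeping whatever fits in the budget
        kept, used = [], 0
        for message in reversed(messages):
            cost = self._estimate_tokens(message["content"])
            if used + cost > self.max_tokens:
                break
            kept.append(message)
            used += cost
        self.messages = list(reversed(kept))

    def get_current_context(self):
        return self.messages
```

Trimming from the oldest end is the simplest policy; the prioritization strategies below refine it.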

Context Window Management Strategies

Context windows represent the immediate working memory available to the LLM. Managing this limited resource requires deliberate strategies beyond simple truncation.

Intelligent Trimming

Straight truncation removes the oldest messages first, but this often loses critical context. Instead, prioritize keeping:

  • System prompts and instructions
  • Recent messages (last 5-10 turns)
  • Messages where the user set context or constraints
  • Messages containing key decisions or commitments
def prioritize_messages(messages):
    prioritized = []
    # Always keep system messages
    prioritized.extend([m for m in messages if m.role == "system"])

    # Keep recent user/assistant exchanges
    recent = messages[-10:]
    prioritized.extend([m for m in recent if m.role in ("user", "assistant")])

    # Keep messages with high importance scores
    important = [m for m in messages if m.importance_score > 0.8]
    prioritized.extend(important)

    return remove_duplicates(prioritized)

def remove_duplicates(messages):
    # Deduplicate by identity while preserving priority order;
    # re-sort chronologically before sending to the model if order matters
    seen = set()
    unique = []
    for m in messages:
        if id(m) not in seen:
            seen.add(id(m))
            unique.append(m)
    return unique

Context Summarization

When trimming isn't sufficient, summarize older conversations into condensed versions. This preserves the semantic meaning while reducing token count dramatically.

We've found hierarchical summarization works best: summarize individual conversations first, then create higher-level summaries of multiple conversations. This preserves both detail where needed and overview context.

async def summarize_conversation(messages, model="gpt-4o-mini"):
    prompt = f"""
    Summarize this conversation concisely while preserving:
    - Key decisions made
    - User preferences expressed
    - Action items committed
    - Problems solved

    Conversation:
    {format_messages(messages)}
    """

    response = await model_completion(prompt, model=model)
    return response.choices[0].message.content

# Usage in context management
if context_window.token_count() > max_tokens * 0.8:
    old_messages = get_old_messages()
    summary = await summarize_conversation(old_messages)
    context_window.replace_messages(old_messages, summary)
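The hierarchical step can be sketched as a roll-up: summarize each conversation, then condense batches of those summaries into higher-level digests. This sketch takes the summarizer as a parameter so the roll-up logic stands alone; the batch size of five is an assumption, not a recommendation:

```python
async def hierarchical_summarize(conversations, summarize_fn, batch_size=5):
    """Summarize each conversation, then roll batches of summaries into digests."""
    # Level 1: one summary per conversation
    summaries = [await summarize_fn(conv) for conv in conversations]

    # Level 2: condense batches of summaries into higher-level digests
    digests = []
    for i in range(0, len(summaries), batch_size):
        batch = summaries[i:i + batch_size]
        if len(batch) == 1:
            digests.append(batch[0])  # nothing to condense further
        else:
            digests.append(await summarize_fn(batch))
    return summaries, digests
```

In practice `summarize_fn` would be something like `summarize_conversation` above; keeping both levels lets you load digests by default and drill into individual summaries on demand.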

Vector Retrieval Patterns for Long-Term Memory

Vector databases enable semantic search across historical interactions, but naive implementation leads to poor performance and irrelevant results.

Query Transformation

Raw user queries often don't work well for memory retrieval. Transform them into search-optimized queries:

async def transform_query_for_retrieval(original_query, conversation_context):
    prompt = f"""
    Based on the current conversation and user query, create an optimal search query 
    for finding relevant historical information. Focus on key entities, intent, and context.

    Current conversation context: {conversation_context}
    User query: {original_query}

    Output only the search query, nothing else.
    """

    response = await model_completion(prompt, model="gpt-3.5-turbo")
    return response.choices[0].message.content.strip()

# Then use transformed query for vector search
transformed_query = await transform_query_for_retrieval(user_input, current_context)
relevant_memories = vector_db.similarity_search(transformed_query, k=5)

Time-Aware Retrieval

Not all memories are equally relevant. Recent interactions usually matter more than year-old conversations. Implement recency weighting in your retrieval:

def time_aware_retrieval(query, vector_db, max_results=5, recency_bias=0.3):
    # Over-fetch semantic matches, then re-rank with a recency term
    results = vector_db.similarity_search(query, k=max_results * 2)

    # Apply recency scoring
    for result in results:
        age_days = (datetime.now() - result.timestamp).days
        recency_score = 1.0 / (1.0 + age_days / 30)  # decays over roughly a month
        result.combined_score = (
            result.similarity_score * (1 - recency_bias) + recency_score * recency_bias
        )

    # Return top combined scores
    results.sort(key=lambda x: x.combined_score, reverse=True)
    return results[:max_results]

Multi-Column Retrieval

Store different types of memories in separate columns with appropriate metadata. This allows targeted retrieval instead of dumping everything into the context window.

# Define memory columns with different retrieval strategies.
# MemoryCollection is a thin wrapper around your vector store (hypothetical)
memory_columns = {
    "user_preferences": MemoryCollection(embedding_model="text-embedding-3-small"),
    "historical_decisions": MemoryCollection(embedding_model="text-embedding-3-large"),
    "technical_context": MemoryCollection(embedding_model="all-MiniLM-L6-v2"),
    "conversation_summaries": MemoryCollection(embedding_model="text-embedding-3-small")
}

async def retrieve_relevant_memories(user_input, context):
    relevant_memories = []

    # Retrieve from each column with appropriate queries
    preferences_query = await create_preferences_query(user_input, context)
    prefs = await memory_columns["user_preferences"].search(preferences_query, k=2)
    relevant_memories.extend(prefs)

    decisions_query = await create_decisions_query(user_input, context)
    decisions = await memory_columns["historical_decisions"].search(decisions_query, k=3)
    relevant_memories.extend(decisions)

    return relevant_memories

Preventing Context Poisoning and Memory Corruption

Malicious users can attempt to corrupt agent memory, and ordinary edge cases can do so by accident. This isn't just a security issue; it's a reliability concern.

Input Validation and Sanitization

Validate all inputs before allowing memory storage. This includes checking for prompt injection patterns, excessive length, and malformed data.

def validate_memory_content(content, max_length=1000):
    if len(content) > max_length:
        raise ValidationError(f"Memory content exceeds {max_length} characters")

    # Check for common prompt injection patterns
    injection_patterns = [
        r"ignore previous instructions",
        r"system prompt",
        r"role play",
        r"as a helpful assistant",
        # Add organization-specific patterns
    ]

    for pattern in injection_patterns:
        if re.search(pattern, content, re.IGNORECASE):
            raise SecurityError("Potential prompt injection detected")

    return True

Memory Storage Governance

Not every conversation should become long-term memory. Implement rules for what gets stored and what doesn't.

class MemoryGovernance:
    def __init__(self):
        self.rules = [
            {"pattern": r"password|api[_-]?key|secret", "action": "redact"},
            {"pattern": r"my favorite.*is", "action": "store_preference"},
            {"pattern": r"never mind|forget that", "action": "delete_previous"},
            {"pattern": r"always remember", "action": "store_priority"}
        ]

    async def apply_rules(self, content, context):
        for rule in self.rules:
            if re.search(rule["pattern"], content, re.IGNORECASE):
                await getattr(self, rule["action"])(content, context)
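The action handlers (`redact`, `store_preference`, and so on) are assumed above. A minimal `redact` might look like this; note that this version only masks the trigger words themselves, which is a simplification, since a production redactor would also mask the surrounding secret values:

```python
import re

# Illustrative pattern list; extend with organization-specific terms
SENSITIVE = re.compile(r"password|api[_-]?key|secret", re.IGNORECASE)

def redact(content, placeholder="[REDACTED]"):
    # Replace each sensitive token with a placeholder before storage
    return SENSITIVE.sub(placeholder, content)
```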

Memory Versioning and Rollback

Implement version control for critical memories. This allows recovery from corruption and provides audit trails.

class VersionedMemory:
    def __init__(self, vector_db):
        self.db = vector_db
        self.version_history = {}

    async def update_memory(self, memory_id, new_content, reason="update"):
        # Store current version
        current = await self.db.get(memory_id)
        self.version_history[memory_id] = self.version_history.get(memory_id, [])
        self.version_history[memory_id].append({
            "timestamp": datetime.now(),
            "content": current.content,
            "reason": reason
        })

        # Update to new content
        await self.db.update(memory_id, new_content)

    async def rollback_memory(self, memory_id, version_index=-1):
        # The most recent history entry holds the content from before the last update
        history = self.version_history.get(memory_id, [])
        if history:
            previous_version = history[version_index]
            await self.db.update(memory_id, previous_version["content"])

Cross-Session Memory Persistence

Agents that remember users across sessions provide dramatically better experiences. This requires careful design to avoid stale data and ensure consistency.

Session Linking and User Identity

Implement robust user identification that works across devices and sessions while respecting privacy regulations.

class UserIdentityManager:
    def __init__(self, auth_system, anonymization_salt):
        self.auth = auth_system
        self.salt = anonymization_salt

    async def get_user_id(self, session_data, request_headers):
        # Try authenticated user first
        if "authorization" in request_headers:
            user_id = await self.auth.verify_token(request_headers["authorization"])
            if user_id:
                return user_id

        # Fall back to anonymous session with persistent cookie
        if "session_cookie" in session_data:
            anonymous_id = self._hash_with_salt(session_data["session_cookie"])
            return f"anonymous_{anonymous_id}"

        # Create new anonymous session
        new_cookie = generate_secure_cookie()
        anonymous_id = self._hash_with_salt(new_cookie)
        return f"anonymous_{anonymous_id}"
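The `_hash_with_salt` helper is left undefined above. One way to implement it with the standard library is a keyed hash, so raw cookie values never reach the memory store (the exact salt-handling convention here is an assumption):

```python
import hashlib
import hmac

def hash_with_salt(value, salt):
    # Keyed SHA-256 so the anonymous ID can't be reversed without the salt
    return hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()
```

Using `hmac` rather than plain `sha256(salt + value)` avoids length-extension issues and makes the salt's role as a key explicit.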

Memory Freshness and Expiration

Not all memories should persist forever. Implement expiration policies based on memory type and importance.

class MemoryExpiration:
    def __init__(self):
        self.policies = {
            "conversation_history": {"ttl_days": 30, "auto_extend": False},
            "user_preferences": {"ttl_days": 365, "auto_extend": True},
            "technical_context": {"ttl_days": 90, "auto_extend": True},
            "temporary_data": {"ttl_days": 1, "auto_extend": False}
        }

    async def cleanup_expired_memories(self):
        for memory_type, policy in self.policies.items():
            expired = await self._find_expired_memories(memory_type, policy["ttl_days"])
            for memory in expired:
                if policy["auto_extend"] and await self._is_still_relevant(memory):
                    await self._extend_ttl(memory, policy["ttl_days"])
                else:
                    await self._delete_memory(memory)
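The expiry check inside `_find_expired_memories` reduces to comparing a stored timestamp against the policy TTL. A sketch of that predicate, where the `last_accessed` field name is an assumption about your memory schema:

```python
from datetime import datetime, timedelta

def is_expired(memory, ttl_days, now=None):
    # A memory expires when its last-touched timestamp is older than the TTL
    now = now or datetime.now()
    return now - memory["last_accessed"] > timedelta(days=ttl_days)
```

Keying expiry off last access rather than creation time means memories the agent keeps using stay fresh without explicit extension.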

Monitoring and Observability

Memory systems need the same observability as other production components. Track hit rates, latency, accuracy, and errors.

class MemoryMonitor:
    def __init__(self, metrics_client):
        self.metrics = metrics_client
        self.gauges = {
            "retrieval_latency": self.metrics.gauge("memory_retrieval_latency_ms"),
            "hit_rate": self.metrics.gauge("memory_cache_hit_rate"),
            "context_window_usage": self.metrics.gauge("context_window_token_usage")
        }

    async def track_retrieval(self, query, results, latency_ms):
        self.gauges["retrieval_latency"].set(latency_ms)

        # Track relevance of results
        if results:
            relevance_score = await self._calculate_relevance(query, results)
            self.metrics.histogram("memory_relevance_score").observe(relevance_score)

        # Track cache performance
        cache_hits = len([r for r in results if r.from_cache])
        hit_rate = cache_hits / len(results) if results else 0
        self.gauges["hit_rate"].set(hit_rate)

Key metrics to monitor:

  • Memory retrieval latency (p95, p99)
  • Cache hit rates for frequently accessed memories
  • Context window usage distribution
  • Memory relevance scores (how often retrieved memories are actually used)
  • Error rates for memory operations
  • Storage growth rates

Implementation Checklist

When implementing memory management for production agents:

  1. Separate short-term and long-term memory with clear boundaries
  2. Implement intelligent context window management beyond simple truncation
  3. Use query transformation for better vector retrieval results
  4. Apply time-aware retrieval to prioritize recent information
  5. Validate and sanitize all memory inputs to prevent corruption
  6. Implement memory governance rules for what gets stored
  7. Add version control for critical memories
  8. Handle cross-session persistence with proper user identity
  9. Set memory expiration policies to avoid stale data
  10. Implement comprehensive monitoring for memory systems

Conclusion

Memory management separates prototype agents from production systems. The strategies discussed here—intelligent context window management, sophisticated retrieval patterns, robust governance, and cross-session persistence—enable agents that remember what matters while avoiding context overload and corruption.

These patterns come from real deployment experience. We've seen teams waste months trying to scale simple chat examples to production, only to hit fundamental memory limitations. The architecture decisions you make about memory will determine whether your agents remain useful over time or degrade into frustrating amnesiac systems.

Start with a clear separation of short-term and long-term memory, implement basic retrieval and context management, then progressively add sophistication as your usage patterns emerge. Monitor everything, because you'll be surprised which memories matter and which don't. Good memory architecture makes agents feel intelligent; poor architecture makes them feel broken.
