Building Memory Systems for LLM Applications: Context Management Best Practices
Effective conversational AI requires managing limited context windows while maintaining coherent long-term interactions. This guide covers architectural patterns for implementing robust memory systems in LLM-based applications.
Context Window Management
The context window is a fixed token budget (typically 4K-128K tokens) that constrains what the model can "see" at inference time. Managing this constraint is critical for maintaining conversation continuity.
Sliding Window Technique
Maintain a fixed-size buffer of recent conversation turns using FIFO (First-In-First-Out) eviction. When new messages arrive, remove the oldest messages to stay within token limits.
Key parameters (a minimal sketch follows this list):
- Token budget per session
- Message retention count
- System prompt reserved space
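For illustration, here is that sketch: FIFO eviction under a token budget, assuming tiktoken for counting. The per-message overhead constant and the default budget are illustrative choices, not fixed values.

import tiktoken
from typing import Dict, List

encoding = tiktoken.encoding_for_model("gpt-4")

def count_tokens(messages: List[Dict]) -> int:
    # ~4 extra tokens per message roughly covers role and formatting overhead
    return sum(len(encoding.encode(m["content"])) + 4 for m in messages)

def trim_to_budget(messages: List[Dict], token_budget: int = 3000) -> List[Dict]:
    """Drop the oldest non-system messages until the buffer fits the budget."""
    trimmed = list(messages)
    while count_tokens(trimmed) > token_budget and len(trimmed) > 1:
        # Index 0 holds the system prompt; index 1 is the oldest conversational turn
        trimmed.pop(1)
    return trimmed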
Tokenization Impact
Different tokenizers produce varying token counts. Always measure tokens using the model's specific tokenizer (e.g., tiktoken for GPT models) rather than character count.
Optimization: Cache tokenized messages to avoid re-tokenization on each turn.
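One way to implement that cache is to memoize per-message token counts; a minimal sketch (the cache size is an arbitrary choice):

from functools import lru_cache
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

@lru_cache(maxsize=4096)
def cached_token_count(text: str) -> int:
    """Token count is computed once per unique message string."""
    return len(encoding.encode(text))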
Summarization-Based Compression
For long conversations, compress older messages into summaries before eviction. Use a secondary LLM call to generate condensed representations of conversation segments.
Trade-off: Summarization loses granular detail but preserves high-level context across extended sessions.
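A minimal sketch of this pattern, condensing the oldest turns with a secondary chat-completion call (the model choice and summarization prompt wording are illustrative):

from typing import Dict, List
from openai import OpenAI

client = OpenAI()

def summarize_segment(old_messages: List[Dict]) -> Dict:
    """Collapse a segment of older turns into a single summary message."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize this conversation segment in a few sentences, preserving names, decisions, and open questions."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=200,
    )
    summary = response.choices[0].message.content
    return {"role": "system", "content": f"Summary of earlier conversation: {summary}"}

The returned summary message can then be prepended to the buffer in place of the evicted turns.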
Memory Systems Architecture
Short-Term Memory (Session-Based)
Storage options:
- Redis with TTL (Time to Live) for session expiration
- In-memory dictionaries for single-instance deployments
- Key-value stores like Memcached for distributed systems
Data structure:
{
  "session_id": "uuid",
  "messages": [
    {"role": "user", "content": "...", "timestamp": 1234567890},
    {"role": "assistant", "content": "...", "timestamp": 1234567891}
  ],
  "metadata": {"user_id": "...", "context": "..."}
}
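For example, a minimal sketch of persisting that structure in Redis with a session TTL (assumes the redis-py client; the key prefix and TTL value are illustrative):

import json
from typing import Dict, Optional
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
SESSION_TTL_SECONDS = 3600  # expire idle sessions after one hour

def save_session(session: Dict) -> None:
    # SETEX writes the value and (re)sets the expiration atomically
    r.setex(f"session:{session['session_id']}", SESSION_TTL_SECONDS, json.dumps(session))

def load_session(session_id: str) -> Optional[Dict]:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None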
Long-Term Memory (Persistent)
Vector Database Storage
Store conversation embeddings for semantic retrieval. Use cosine similarity to find relevant past interactions.
Common implementations:
- Pinecone, Weaviate, Chroma, or pgvector
- Embedding models: text-embedding-3-small, all-MiniLM-L6-v2
Schema:
{
  "id": "vector_id",
  "embedding": [0.1, 0.2, ...],  # 1536-dimensional vector
  "content": "original message text",
  "metadata": {
    "session_id": "...",
    "timestamp": 1234567890,
    "user_id": "..."
  }
}
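As one concrete option, a sketch of writing and querying this schema with Chroma (the collection name and metadata fields are illustrative; other vector databases expose similar upsert and query calls):

from typing import Dict, List
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("conversations")

def store_message(vector_id: str, content: str, embedding: List[float], metadata: Dict) -> None:
    # Chroma stores the embedding, the raw text, and the metadata together
    collection.add(ids=[vector_id], embeddings=[embedding], documents=[content], metadatas=[metadata])

def find_similar(query_embedding: List[float], k: int = 5) -> Dict:
    # Returns ids, documents, distances, and metadatas for the k nearest vectors
    return collection.query(query_embeddings=[query_embedding], n_results=k)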
Memory Deduplication
Prevent storing redundant or near-duplicate content using semantic hashing:
from typing import List
from openai import OpenAI
import numpy as np

client = OpenAI()

def generate_embedding(text: str) -> np.ndarray:
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_duplicate(new_content: str, existing_embeddings: List[np.ndarray], threshold: float = 0.95) -> bool:
    """Return True if new_content is semantically near-identical to an existing embedding."""
    new_embedding = generate_embedding(new_content)
    for existing in existing_embeddings:
        similarity = cosine_similarity(new_embedding, existing)
        if similarity > threshold:
            return True
    return False
Strategy: Compute semantic similarity before insertion. Skip storage if similarity exceeds threshold (typically 0.95-0.98).
Retrieval-Augmented Generation (RAG)
Inject relevant historical context into the prompt based on semantic similarity to the current query.
Process:
- Embed current user query
- Query vector database for top-k similar messages (typically k=3-10)
- Apply hybrid search (combine keyword BM25 with vector similarity)
- Re-rank results using cross-encoder for precision
- Format retrieved context into the system prompt
- Generate response with augmented context
Hybrid Search Implementation:
from typing import List, Dict, Tuple

def vector_search(query: str, k: int = 5) -> List[Dict]:
    """Placeholder: Implement using your vector database (Pinecone, Weaviate, etc.)"""
    embedding = generate_embedding(query)
    # Return list of dicts with 'id' and 'content' keys
    return []

def bm25_search(query: str, k: int = 5) -> List[Dict]:
    """Placeholder: Implement BM25 using rank_bm25 or similar library"""
    # Return list of dicts with 'id' and 'content' keys
    return []

def hybrid_search(query: str, alpha: float = 0.5, k: int = 5) -> List[Tuple[str, float]]:
    vector_results = vector_search(query, k=k)
    keyword_results = bm25_search(query, k=k)
    # Reciprocal rank fusion, weighted by alpha
    scores = {}
    for rank, doc in enumerate(vector_results, 1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + alpha / rank
    for rank, doc in enumerate(keyword_results, 1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + (1 - alpha) / rank
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
Re-ranking:
from typing import List, Dict
from sentence_transformers import CrossEncoder

# Initialize cross-encoder (load once at application startup)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, candidates: List[Dict], top_n: int = 3) -> List[Dict]:
    pairs = [[query, doc["content"]] for doc in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
Entity Memory Pattern
Track specific entities (names, preferences, facts) across sessions using structured extraction.
Implementation with Tool Calling:
import json
from typing import List, Dict, Any
from openai import OpenAI

client = OpenAI()

def extract_entities(message: str) -> Dict[str, Any]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract entities from the user message."},
            {"role": "user", "content": message}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "extract_entities",
                "description": "Extract key entities from conversation",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "user_name": {"type": "string"},
                        "location": {"type": "string"},
                        "preferences": {"type": "array", "items": {"type": "string"}},
                        "facts": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": []
                }
            }
        }],
        tool_choice={"type": "function", "function": {"name": "extract_entities"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

# Example of extracted entities stored for a user
entities = {
    "user_name": "Alex",
    "location": "San Francisco",
    "preferences": ["Python", "machine learning"]
}
Inject this structured data into the system prompt on each turn for persistent personalization.
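A minimal sketch of that injection step (the prompt wording and helper name are illustrative):

from typing import Any, Dict

def build_personalized_system_prompt(base_prompt: str, entities: Dict[str, Any]) -> str:
    """Append known entity facts to the base system prompt."""
    facts = "\n".join(f"- {key}: {value}" for key, value in entities.items() if value)
    if not facts:
        return base_prompt
    return f"{base_prompt}\n\nKnown facts about the user:\n{facts}"

system_prompt = build_personalized_system_prompt("You are a helpful assistant.", entities)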
Memory Importance Scoring
Replace simple FIFO eviction with importance-based retention:
import time
from typing import Dict, List

def calculate_importance_score(message: Dict, current_time: float) -> float:
    age = current_time - message["timestamp"]
    recency = 1 / (1 + age / 3600)  # Decay over hours
    relevance = message.get("relevance_score", 0.5)
    explicit_importance = message.get("importance", 0.5)
    return 0.4 * recency + 0.3 * relevance + 0.3 * explicit_importance

def evict_by_importance(messages: List[Dict], target_count: int) -> List[Dict]:
    scored = [(msg, calculate_importance_score(msg, time.time())) for msg in messages]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [msg for msg, _ in scored[:target_count]]
Conversation State Management
Track conversation state beyond raw messages:
from typing import Any, Dict, List, Optional

class ConversationState:
    def __init__(self):
        self.active_goal: Optional[str] = None
        self.user_intent: str = "unknown"
        self.conversation_phase: str = "greeting"
        self.pending_tasks: List[str] = []
        self.context_variables: Dict[str, Any] = {}

    def update_state(self, message: str, entities: Dict):
        # Use LLM to classify intent and update state
        pass
Memory Consistency and Grounding
Prevent hallucinations by grounding responses in retrieved context:
Strategies:
- Require citations for factual claims
- Use "I don't have information about that" when context is insufficient
- Implement fact-checking against retrieved documents
- Track confidence scores for retrieved information
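One lightweight way to apply the first two strategies is in the system prompt itself; a sketch (the instruction wording is illustrative, not a guaranteed safeguard):

GROUNDING_INSTRUCTIONS = (
    "Answer only from the provided context. "
    "Cite the context snippet supporting each factual claim. "
    "If the context does not contain the answer, reply "
    '"I don\'t have information about that" instead of guessing.'
)

def build_grounded_system_prompt(base_prompt: str, retrieved_context: str) -> str:
    """Combine the base prompt, grounding rules, and retrieved context."""
    return f"{base_prompt}\n\n{GROUNDING_INSTRUCTIONS}\n\nContext:\n{retrieved_context}"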
Implementation Example
from typing import List, Dict, Optional, Any
from openai import OpenAI
import tiktoken
import numpy as np
import time
class VectorDatabase:
    """
    Abstracted vector database interface.
    NOTE: This implementation uses O(n) linear search for demonstration.
    For production use, replace with ANN-indexed databases (Pinecone, Weaviate, Chroma)
    that use HNSW or similar approximate nearest neighbor algorithms for O(log n) performance.
    """
    def __init__(self):
        self.documents: List[Dict] = []

    def insert(self, content: str, embedding: np.ndarray, metadata: Dict):
        self.documents.append({
            "content": content,
            "embedding": embedding,
            "metadata": metadata
        })

    def search(self, query_embedding: np.ndarray, top_k: int = 3) -> List[Dict]:
        if not self.documents:
            return []
        similarities = []
        for doc in self.documents:
            sim = np.dot(query_embedding, doc["embedding"]) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc["embedding"])
            )
            similarities.append((doc, sim))
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [{"content": doc["content"], "score": sim} for doc, sim in similarities[:top_k]]
class ConversationalMemory:
    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4"):
        self.messages: List[Dict] = []
        self.max_tokens = max_tokens
        self.model = model
        self.vector_db = VectorDatabase()
        self.encoding = tiktoken.encoding_for_model(model)
        self.client = OpenAI()
        self.system_prompt = {"role": "system", "content": "You are a helpful assistant."}
        self.messages.append(self.system_prompt.copy())

    def _count_tokens(self, messages: Optional[List[Dict]] = None) -> int:
        """
        Count tokens using tiktoken with message format overhead.
        Each message adds ~3-4 tokens for role/name fields in ChatML format.
        """
        msgs = messages if messages is not None else self.messages
        tokens_per_message = 3  # role, name, content delimiters
        tokens_per_name = 1
        total = 0
        for msg in msgs:
            total += tokens_per_message
            total += len(self.encoding.encode(msg["content"]))
            if "name" in msg:
                total += tokens_per_name + len(self.encoding.encode(msg["name"]))
        total += 3  # Reply priming tokens
        return total
    def add_message(self, role: str, content: str, importance: float = 0.5):
        try:
            timestamp = time.time()
            message = {"role": role, "content": content, "timestamp": timestamp, "importance": importance}
            self.messages.append(message)
            # Store in vector DB for long-term retrieval
            embedding_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=content
            )
            embedding = np.array(embedding_response.data[0].embedding)
            self.vector_db.insert(
                content=content,
                embedding=embedding,
                metadata={"role": role, "timestamp": timestamp, "importance": importance}
            )
            self._trim_context()
        except Exception as e:
            print(f"Error storing message: {e}")
    def _trim_context(self):
        """Evict lowest-importance messages if over the token limit, preserving the system prompt.
        Uses calculate_importance_score from the Memory Importance Scoring section above."""
        current_tokens = self._count_tokens()
        while current_tokens > self.max_tokens and len(self.messages) > 1:
            # Score every message except the system prompt at index 0
            scored = [(idx, msg, calculate_importance_score(msg, time.time()))
                      for idx, msg in enumerate(self.messages[1:], start=1)]
            scored.sort(key=lambda x: x[2])
            # Remove the lowest-importance message
            lowest_idx = scored[0][0]
            self.messages.pop(lowest_idx)
            current_tokens = self._count_tokens()
    def retrieve_relevant_context(self, query: str, k: int = 3) -> str:
        """RAG: Retrieve semantically similar past messages"""
        try:
            embedding_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=query
            )
            query_embedding = np.array(embedding_response.data[0].embedding)
            results = self.vector_db.search(
                query_embedding=query_embedding,
                top_k=k
            )
            return "\n".join(r["content"] for r in results)
        except Exception as e:
            print(f"Error retrieving context: {e}")
            return ""
    def build_prompt(self, user_query: str) -> List[Dict]:
        """Build complete prompt with system message, context, and conversation history"""
        relevant_context = self.retrieve_relevant_context(user_query)
        # Create new system prompt with injected context (preserves original)
        system_prompt_content = f"""{self.system_prompt['content']}
Relevant context from past conversations:
{relevant_context}"""
        prompt_messages = [
            {"role": "system", "content": system_prompt_content}
        ]
        # Add conversation history (excluding system prompt)
        for msg in self.messages[1:]:
            prompt_messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
        return prompt_messages
    def generate_response(self, user_query: str) -> str:
        """Generate response using OpenAI API"""
        try:
            # Record the user turn so it appears in the prompt and in long-term memory
            self.add_message("user", user_query)
            prompt_messages = self.build_prompt(user_query)
            response = self.client.chat.completions.create(
                model=self.model,
                messages=prompt_messages,
                temperature=0.7,
                max_tokens=500
            )
            reply = response.choices[0].message.content
            # Record the assistant turn for future context
            self.add_message("assistant", reply)
            return reply
        except Exception as e:
            print(f"Error generating response: {e}")
            return "I encountered an error generating a response."
Optimization & Guardrails
Latency Trade-offs
- Vector search: 50-200ms depending on database size and indexing
- Embedding generation: 10-50ms per query
- Memory retrieval: Adds 100-300ms total latency per turn
Mitigation: Cache frequently accessed embeddings and use approximate nearest neighbor (ANN) algorithms like HNSW.
Lost in the Middle Phenomenon
LLMs tend to overlook information in the middle of long contexts. Place critical information at the beginning or end of the prompt.
Strategy: Use the "bookend" pattern—system prompt first, recent messages last, retrieved context in middle.
Token Budget Allocation
Scale allocation based on model context window size:
| Context Window | System Prompt | Retrieved Context | Recent Conversation | Response Buffer |
|---|---|---|---|---|
| 4K tokens | 300-500 | 1000-1500 | 1500-2000 | 500-700 |
| 8K tokens | 500-1000 | 2000-3000 | 3000-4000 | 1000-1500 |
| 32K tokens | 1000-2000 | 10000-15000 | 12000-15000 | 3000-5000 |
| 128K tokens | 2000-4000 | 40000-60000 | 60000-80000 | 10000-15000 |
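A sketch of turning the table into code, using fixed fractions roughly in line with the ranges above (the exact fractions are illustrative assumptions):

from typing import Dict

def allocate_token_budget(context_window: int) -> Dict[str, int]:
    """Split a context window into budget buckets by fixed fractions."""
    fractions = {
        "system_prompt": 0.05,
        "retrieved_context": 0.35,
        "recent_conversation": 0.45,
        "response_buffer": 0.15,
    }
    return {bucket: int(context_window * share) for bucket, share in fractions.items()}

# allocate_token_budget(8192) -> roughly 409 / 2867 / 3686 / 1228 tokens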
Evaluation Metrics
Track memory system performance with these metrics:
Retrieval Quality (a computation sketch follows this list):
- Precision@k: Fraction of retrieved documents that are relevant
- Recall@k: Fraction of all relevant documents retrieved
- MRR (Mean Reciprocal Rank): Average inverse rank of first relevant result
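A minimal sketch of these three metrics over ranked ID lists (relevance labels would come from a held-out evaluation set):

from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(ranked_lists: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    reciprocal_ranks = []
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        # Rank of the first relevant result, or None if nothing relevant was retrieved
        rank = next((i for i, doc_id in enumerate(retrieved, 1) if doc_id in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0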
Memory Effectiveness:
- Context utilization rate: Average tokens used / max tokens
- Eviction rate: Messages removed per 100 turns
- Entity recall: Percentage of entities correctly retrieved
- Hallucination rate: Responses containing ungrounded information
Latency:
- P50, P95, P99 retrieval latency
- End-to-end response time with memory
Getting Started
1. Install dependencies:

pip install openai tiktoken numpy sentence-transformers rank-bm25

2. Initialize the OpenAI client:

from openai import OpenAI
client = OpenAI(api_key="your-api-key")

3. Initialize a vector database:

# For local development
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("conversations")

# Or use a managed service (Pinecone v5+)
from pinecone import Pinecone
pc = Pinecone(api_key="your-key")
index = pc.Index("your-index-name")

4. Implement token counting:

import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
token_count = len(encoding.encode("Your text here"))

5. Create a memory instance:

memory = ConversationalMemory(max_tokens=4000, model="gpt-4")

6. Add messages with importance scores:

memory.add_message("user", "My name is Alex and I live in SF", importance=0.9)
memory.add_message("assistant", "Hello Alex! How can I help you today?")

7. Generate responses:

response = memory.generate_response("What's my name?")

8. Configure monitoring:

- Log token usage per turn
- Track retrieval latency
- Measure entity extraction accuracy
- Monitor API error rates