Building Memory Systems for LLM Applications: Context Management Best Practices
Effective conversational AI requires managing limited context windows while maintaining coherent long-term interactions. This guide covers architectural patterns for implementing robust memory systems in LLM-based applications.
Context Window Management
The context window is a fixed token budget (typically 4K-128K tokens) that constrains what the model can "see" at inference time. Managing this constraint is critical for maintaining conversation continuity.
Sliding Window Technique
Maintain a fixed-size buffer of recent conversation turns using FIFO (First-In-First-Out) eviction. When new messages arrive, remove the oldest messages to stay within token limits.
Key parameters (a minimal sketch follows this list):
- Token budget per session
- Message retention count
- System prompt reserved space
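For illustration, here is that sketch: FIFO eviction under a token budget, assuming tiktoken for counting. The per-message overhead constant and the default budget are illustrative choices, not fixed values.

import tiktoken
from typing import Dict, List

encoding = tiktoken.encoding_for_model("gpt-4")

def count_tokens(messages: List[Dict]) -> int:
    # ~4 extra tokens per message roughly covers role and formatting overhead
    return sum(len(encoding.encode(m["content"])) + 4 for m in messages)

def trim_to_budget(messages: List[Dict], token_budget: int = 3000) -> List[Dict]:
    """Drop the oldest non-system messages until the buffer fits the budget."""
    trimmed = list(messages)
    while count_tokens(trimmed) > token_budget and len(trimmed) > 1:
        # Index 0 holds the system prompt; index 1 is the oldest conversational turn
        trimmed.pop(1)
    return trimmed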
Tokenization Impact
Different tokenizers produce varying token counts. Always measure tokens using the model's specific tokenizer (e.g., tiktoken for GPT models) rather than character count.
Optimization: Cache tokenized messages to avoid re-tokenization on each turn.
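One way to implement that cache is to memoize per-message token counts; a minimal sketch (the cache size is an arbitrary choice):

from functools import lru_cache
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

@lru_cache(maxsize=4096)
def cached_token_count(text: str) -> int:
    """Token count is computed once per unique message string."""
    return len(encoding.encode(text))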
Summarization-Based Compression
For long conversations, compress older messages into summaries before eviction. Use a secondary LLM call to generate condensed representations of conversation segments.
Trade-off: Summarization loses granular detail but preserves high-level context across extended sessions.
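A minimal sketch of this pattern, condensing the oldest turns with a secondary chat-completion call (the model choice and summarization prompt wording are illustrative):

from typing import Dict, List
from openai import OpenAI

client = OpenAI()

def summarize_segment(old_messages: List[Dict]) -> Dict:
    """Collapse a segment of older turns into a single summary message."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_messages)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Summarize this conversation segment in a few sentences, preserving names, decisions, and open questions."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=200,
    )
    summary = response.choices[0].message.content
    return {"role": "system", "content": f"Summary of earlier conversation: {summary}"}

The returned summary message can then be prepended to the buffer in place of the evicted turns.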
Memory Systems Architecture
Short-Term Memory (Session-Based)
Storage options:
- Redis with TTL (Time to Live) for session expiration
- In-memory dictionaries for single-instance deployments
- Key-value stores like Memcached for distributed systems
Data structure:
{
  "session_id": "uuid",
  "messages": [
    {"role": "user", "content": "...", "timestamp": 1234567890},
    {"role": "assistant", "content": "...", "timestamp": 1234567891}
  ],
  "metadata": {"user_id": "...", "context": "..."}
}
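For example, a minimal sketch of persisting that structure in Redis with a session TTL (assumes the redis-py client; the key prefix and TTL value are illustrative):

import json
from typing import Dict, Optional
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
SESSION_TTL_SECONDS = 3600  # expire idle sessions after one hour

def save_session(session: Dict) -> None:
    # SETEX writes the value and (re)sets the expiration atomically
    r.setex(f"session:{session['session_id']}", SESSION_TTL_SECONDS, json.dumps(session))

def load_session(session_id: str) -> Optional[Dict]:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None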
Long-Term Memory (Persistent)
Vector Database Storage
Store conversation embeddings for semantic retrieval. Use cosine similarity to find relevant past interactions.
Common implementations:
- Pinecone, Weaviate, Chroma, or pgvector
- Embedding models: text-embedding-3-small, all-MiniLM-L6-v2
Schema:
{
  "id": "vector_id",
  "embedding": [0.1, 0.2, ...],  # 1536-dimensional vector
  "content": "original message text",
  "metadata": {
    "session_id": "...",
    "timestamp": 1234567890,
    "user_id": "..."
  }
}
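As one concrete option, a sketch of writing and querying this schema with Chroma (the collection name and metadata fields are illustrative; other vector databases expose similar upsert and query calls):

from typing import Dict, List
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.create_collection("conversations")

def store_message(vector_id: str, content: str, embedding: List[float], metadata: Dict) -> None:
    # Chroma stores the embedding, the raw text, and the metadata together
    collection.add(ids=[vector_id], embeddings=[embedding], documents=[content], metadatas=[metadata])

def find_similar(query_embedding: List[float], k: int = 5) -> Dict:
    # Returns ids, documents, distances, and metadatas for the k nearest vectors
    return collection.query(query_embeddings=[query_embedding], n_results=k)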
Memory Deduplication
Prevent storing redundant or near-duplicate content using semantic hashing:
from typing import List
from openai import OpenAI
import numpy as np

client = OpenAI()

def generate_embedding(text: str) -> np.ndarray:
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_duplicate(new_content: str, existing_embeddings: List[np.ndarray], threshold: float = 0.95) -> bool:
    """Return True if new_content is semantically near-identical to an existing embedding."""
    new_embedding = generate_embedding(new_content)
    for existing in existing_embeddings:
        similarity = cosine_similarity(new_embedding, existing)
        if similarity > threshold:
            return True
    return False
Strategy: Compute semantic similarity before insertion. Skip storage if similarity exceeds threshold (typically 0.95-0.98).
Retrieval-Augmented Generation (RAG)
Inject relevant historical context into the prompt based on semantic similarity to the current query.
Process:
- Embed current user query
- Query vector database for top-k similar messages (typically k=3-10)
- Apply hybrid search (combine keyword BM25 with vector similarity)
- Re-rank results using cross-encoder for precision
- Format retrieved context into the system prompt
- Generate response with augmented context
Hybrid Search Implementation:
from typing import List, Dict, Tuple

def vector_search(query: str, k: int = 5) -> List[Dict]:
    """Placeholder: Implement using your vector database (Pinecone, Weaviate, etc.)"""
    embedding = generate_embedding(query)
    # Return list of dicts with 'id' and 'content' keys
    return []

def bm25_search(query: str, k: int = 5) -> List[Dict]:
    """Placeholder: Implement BM25 using rank_bm25 or similar library"""
    # Return list of dicts with 'id' and 'content' keys
    return []

def hybrid_search(query: str, alpha: float = 0.5, k: int = 5) -> List[Tuple[str, float]]:
    vector_results = vector_search(query, k=k)
    keyword_results = bm25_search(query, k=k)
    # Reciprocal rank fusion, weighted by alpha
    scores = {}
    for rank, doc in enumerate(vector_results, 1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + alpha / rank
    for rank, doc in enumerate(keyword_results, 1):
        scores[doc["id"]] = scores.get(doc["id"], 0) + (1 - alpha) / rank
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
Re-ranking:
from typing import List, Dict
from sentence_transformers import CrossEncoder

# Initialize cross-encoder (load once at application startup)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, candidates: List[Dict], top_n: int = 3) -> List[Dict]:
    pairs = [[query, doc["content"]] for doc in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
Entity Memory Pattern
Track specific entities (names, preferences, facts) across sessions using structured extraction.
Implementation with Tool Calling:
import json
from typing import List, Dict, Any
from openai import OpenAI

client = OpenAI()

def extract_entities(message: str) -> Dict[str, Any]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Extract entities from the user message."},
            {"role": "user", "content": message}
        ],
        tools=[{
            "type": "function",
            "function": {
                "name": "extract_entities",
                "description": "Extract key entities from conversation",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "user_name": {"type": "string"},
                        "location": {"type": "string"},
                        "preferences": {"type": "array", "items": {"type": "string"}},
                        "facts": {"type": "array", "items": {"type": "string"}}
                    },
                    "required": []
                }
            }
        }],
        tool_choice={"type": "function", "function": {"name": "extract_entities"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

# Example of extracted entities stored for a user
entities = {
    "user_name": "Alex",
    "location": "San Francisco",
    "preferences": ["Python", "machine learning"]
}
Inject this structured data into the system prompt on each turn for persistent personalization.
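A minimal sketch of that injection step (the prompt wording and helper name are illustrative):

from typing import Any, Dict

def build_personalized_system_prompt(base_prompt: str, entities: Dict[str, Any]) -> str:
    """Append known entity facts to the base system prompt."""
    facts = "\n".join(f"- {key}: {value}" for key, value in entities.items() if value)
    if not facts:
        return base_prompt
    return f"{base_prompt}\n\nKnown facts about the user:\n{facts}"

system_prompt = build_personalized_system_prompt("You are a helpful assistant.", entities)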
Memory Importance Scoring
Replace simple FIFO eviction with importance-based retention:
import time
from typing import Dict, List

def calculate_importance_score(message: Dict, current_time: float) -> float:
    age = current_time - message["timestamp"]
    recency = 1 / (1 + age / 3600)  # Decay over hours
    relevance = message.get("relevance_score", 0.5)
    explicit_importance = message.get("importance", 0.5)
    return 0.4 * recency + 0.3 * relevance + 0.3 * explicit_importance

def evict_by_importance(messages: List[Dict], target_count: int) -> List[Dict]:
    scored = [(msg, calculate_importance_score(msg, time.time())) for msg in messages]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [msg for msg, _ in scored[:target_count]]
Conversation State Management
Track conversation state beyond raw messages:
from typing import Any, Dict, List, Optional

class ConversationState:
    def __init__(self):
        self.active_goal: Optional[str] = None
        self.user_intent: str = "unknown"
        self.conversation_phase: str = "greeting"
        self.pending_tasks: List[str] = []
        self.context_variables: Dict[str, Any] = {}

    def update_state(self, message: str, entities: Dict):
        # Use LLM to classify intent and update state
        pass
Memory Consistency and Grounding
Prevent hallucinations by grounding responses in retrieved context:
Strategies:
- Require citations for factual claims
- Use "I don't have information about that" when context is insufficient
- Implement fact-checking against retrieved documents
- Track confidence scores for retrieved information
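One lightweight way to apply the first two strategies is in the system prompt itself; a sketch (the instruction wording is illustrative, not a guaranteed safeguard):

GROUNDING_INSTRUCTIONS = (
    "Answer only from the provided context. "
    "Cite the context snippet supporting each factual claim. "
    "If the context does not contain the answer, reply "
    '"I don\'t have information about that" instead of guessing.'
)

def build_grounded_system_prompt(base_prompt: str, retrieved_context: str) -> str:
    """Combine the base prompt, grounding rules, and retrieved context."""
    return f"{base_prompt}\n\n{GROUNDING_INSTRUCTIONS}\n\nContext:\n{retrieved_context}"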
Implementation Example
from typing import List, Dict, Optional, Any
from openai import OpenAI
import tiktoken
import numpy as np
import time
class VectorDatabase:
    """
    Abstracted vector database interface.
    NOTE: This implementation uses O(n) linear search for demonstration.
    For production use, replace with ANN-indexed databases (Pinecone, Weaviate, Chroma)
    that use HNSW or similar approximate nearest neighbor algorithms for O(log n) performance.
    """
    def __init__(self):
        self.documents: List[Dict] = []

    def insert(self, content: str, embedding: np.ndarray, metadata: Dict):
        self.documents.append({
            "content": content,
            "embedding": embedding,
            "metadata": metadata
        })

    def search(self, query_embedding: np.ndarray, top_k: int = 3) -> List[Dict]:
        if not self.documents:
            return []
        similarities = []
        for doc in self.documents:
            sim = np.dot(query_embedding, doc["embedding"]) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(doc["embedding"])
            )
            similarities.append((doc, sim))
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [{"content": doc["content"], "score": sim} for doc, sim in similarities[:top_k]]
class ConversationalMemory:
    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4"):
        self.messages: List[Dict] = []
        self.max_tokens = max_tokens
        self.model = model
        self.vector_db = VectorDatabase()
        self.encoding = tiktoken.encoding_for_model(model)
        self.client = OpenAI()
        self.system_prompt = {"role": "system", "content": "You are a helpful assistant."}
        self.messages.append(self.system_prompt.copy())

    def _count_tokens(self, messages: Optional[List[Dict]] = None) -> int:
        """
        Count tokens using tiktoken with message format overhead.
        Each message adds ~3-4 tokens for role/name fields in ChatML format.
        """
        msgs = messages if messages is not None else self.messages
        tokens_per_message = 3  # role, name, content delimiters
        tokens_per_name = 1
        total = 0
        for msg in msgs:
            total += tokens_per_message
            total += len(self.encoding.encode(msg["content"]))
            if "name" in msg:
                total += tokens_per_name + len(self.encoding.encode(msg["name"]))
        total += 3  # Reply priming tokens
        return total
    def add_message(self, role: str, content: str, importance: float = 0.5):
        try:
            timestamp = time.time()
            message = {"role": role, "content": content, "timestamp": timestamp, "importance": importance}
            self.messages.append(message)
            # Store in vector DB for long-term retrieval
            embedding_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=content
            )
            embedding = np.array(embedding_response.data[0].embedding)
            self.vector_db.insert(
                content=content,
                embedding=embedding,
                metadata={"role": role, "timestamp": timestamp, "importance": importance}
            )
            self._trim_context()
        except Exception as e:
            print(f"Error storing message: {e}")
    def _trim_context(self):
        """Evict lowest-importance messages if over the token limit, preserving the system prompt.
        Uses calculate_importance_score from the Memory Importance Scoring section above."""
        current_tokens = self._count_tokens()
        while current_tokens > self.max_tokens and len(self.messages) > 1:
            # Score every message except the system prompt at index 0
            scored = [(idx, msg, calculate_importance_score(msg, time.time()))
                      for idx, msg in enumerate(self.messages[1:], start=1)]
            scored.sort(key=lambda x: x[2])
            # Remove the lowest-importance message
            lowest_idx = scored[0][0]
            self.messages.pop(lowest_idx)
            current_tokens = self._count_tokens()
    def retrieve_relevant_context(self, query: str, k: int = 3) -> str:
        """RAG: Retrieve semantically similar past messages"""
        try:
            embedding_response = self.client.embeddings.create(
                model="text-embedding-3-small",
                input=query
            )
            query_embedding = np.array(embedding_response.data[0].embedding)
            results = self.vector_db.search(
                query_embedding=query_embedding,
                top_k=k
            )
            return "\n".join(r["content"] for r in results)
        except Exception as e:
            print(f"Error retrieving context: {e}")
            return ""
    def build_prompt(self, user_query: str) -> List[Dict]:
        """Build complete prompt with system message, context, and conversation history"""
        relevant_context = self.retrieve_relevant_context(user_query)
        # Create new system prompt with injected context (preserves original)
        system_prompt_content = f"""{self.system_prompt['content']}
Relevant context from past conversations:
{relevant_context}"""
        prompt_messages = [
            {"role": "system", "content": system_prompt_content}
        ]
        # Add conversation history (excluding system prompt)
        for msg in self.messages[1:]:
            prompt_messages.append({
                "role": msg["role"],
                "content": msg["content"]
            })
        return prompt_messages
    def generate_response(self, user_query: str) -> str:
        """Generate response using OpenAI API"""
        try:
            # Record the user turn so it appears in the prompt and in long-term memory
            self.add_message("user", user_query)
            prompt_messages = self.build_prompt(user_query)
            response = self.client.chat.completions.create(
                model=self.model,
                messages=prompt_messages,
                temperature=0.7,
                max_tokens=500
            )
            reply = response.choices[0].message.content
            # Record the assistant turn for future context
            self.add_message("assistant", reply)
            return reply
        except Exception as e:
            print(f"Error generating response: {e}")
            return "I encountered an error generating a response."
Optimization & Guardrails
Latency Trade-offs
- Vector search: 50-200ms depending on database size and indexing
- Embedding generation: 10-50ms per query
- Memory retrieval: Adds 100-300ms total latency per turn
Mitigation: Cache frequently accessed embeddings and use approximate nearest neighbor (ANN) algorithms like HNSW.
Lost in the Middle Phenomenon
LLMs tend to overlook information in the middle of long contexts. Place critical information at the beginning or end of the prompt.
Strategy: Use the "bookend" pattern—system prompt first, recent messages last, retrieved context in middle.
Token Budget Allocation
Scale allocation based on model context window size:
| Context Window | System Prompt | Retrieved Context | Recent Conversation | Response Buffer |
|---|---|---|---|---|
| 4K tokens | 300-500 | 1000-1500 | 1500-2000 | 500-700 |
| 8K tokens | 500-1000 | 2000-3000 | 3000-4000 | 1000-1500 |
| 32K tokens | 1000-2000 | 10000-15000 | 12000-15000 | 3000-5000 |
| 128K tokens | 2000-4000 | 40000-60000 | 60000-80000 | 10000-15000 |
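A sketch of turning the table into code, using fixed fractions roughly in line with the ranges above (the exact fractions are illustrative assumptions):

from typing import Dict

def allocate_token_budget(context_window: int) -> Dict[str, int]:
    """Split a context window into budget buckets by fixed fractions."""
    fractions = {
        "system_prompt": 0.05,
        "retrieved_context": 0.35,
        "recent_conversation": 0.45,
        "response_buffer": 0.15,
    }
    return {bucket: int(context_window * share) for bucket, share in fractions.items()}

# allocate_token_budget(8192) -> roughly 409 / 2867 / 3686 / 1228 tokens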
Evaluation Metrics
Track memory system performance with these metrics:
Retrieval Quality (a computation sketch follows this list):
- Precision@k: Fraction of retrieved documents that are relevant
- Recall@k: Fraction of all relevant documents retrieved
- MRR (Mean Reciprocal Rank): Average inverse rank of first relevant result
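A minimal sketch of these three metrics over ranked ID lists (relevance labels would come from a held-out evaluation set):

from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(ranked_lists: List[List[str]], relevant_sets: List[Set[str]]) -> float:
    reciprocal_ranks = []
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        # Rank of the first relevant result, or None if nothing relevant was retrieved
        rank = next((i for i, doc_id in enumerate(retrieved, 1) if doc_id in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0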
Memory Effectiveness:
- Context utilization rate: Average tokens used / max tokens
- Eviction rate: Messages removed per 100 turns
- Entity recall: Percentage of entities correctly retrieved
- Hallucination rate: Responses containing ungrounded information
Latency:
- P50, P95, P99 retrieval latency
- End-to-end response time with memory
Getting Started
1. Install dependencies:

pip install openai tiktoken numpy sentence-transformers rank-bm25

2. Initialize the OpenAI client:

from openai import OpenAI
client = OpenAI(api_key="your-api-key")

3. Initialize a vector database:

# For local development
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("conversations")

# Or use a managed service (Pinecone v5+)
from pinecone import Pinecone
pc = Pinecone(api_key="your-key")
index = pc.Index("your-index-name")

4. Implement token counting:

import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
token_count = len(encoding.encode("Your text here"))

5. Create a memory instance:

memory = ConversationalMemory(max_tokens=4000, model="gpt-4")

6. Add messages with importance scores:

memory.add_message("user", "My name is Alex and I live in SF", importance=0.9)
memory.add_message("assistant", "Hello Alex! How can I help you today?")

7. Generate responses:

response = memory.generate_response("What's my name?")

8. Configure monitoring:

- Log token usage per turn
- Track retrieval latency
- Measure entity extraction accuracy
- Monitor API error rates