How KV Caching Works in Large Language Models

Large language models (LLMs) are remarkable at generating coherent text. Behind the scenes, they generate text one token at a time, and at every step each transformer layer computes attention scores over all previous tokens. While this gives models long-term context, it comes at a cost: without any reuse, the work for tokens that have already been processed is repeated at every step. KV caching is the optimization that solves this problem, making LLMs faster and more efficient.
Quick visualisation of the KV cache in action (source: https://lmcache.ai):
- Without KV caching: all prompt data is recomputed every time.
- With KV caching: cached prompt data is reused.
The Transformer Attention Mechanism
Each transformer layer computes attention using queries (Q), keys (K), and values (V).
- Query (Q): Represents the token you are currently generating or attending from.
- Key (K): Encodes the information of each previous token in a way that can be compared with queries.
- Value (V): Stores the content information of each token that will be aggregated based on attention scores.
Mathematically, attention is computed as:
Attention(Q, K, V) = softmax(Q × K^T / sqrt(d)) × V
where d is the dimensionality of the key vectors (the per-head dimension).
- Each token produces one K vector and one V vector per attention head in every layer.
- Stacked over the sequence, K and V are tensors of shape [num_heads, seq_len, head_dim].
- These tensors are what allow the model to "remember" previous tokens.
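As a concrete reference, here is the same attention computation written out in plain PyTorch for a single head (a minimal sketch with made-up dimensions, ignoring masking and the multi-head projections):
import torch
import torch.nn.functional as F

seq_len, head_dim = 6, 64
Q = torch.randn(seq_len, head_dim)   # one query vector per token
K = torch.randn(seq_len, head_dim)   # one key vector per token
V = torch.randn(seq_len, head_dim)   # one value vector per token

scores = Q @ K.T / head_dim ** 0.5   # [seq_len, seq_len] similarity scores
weights = F.softmax(scores, dim=-1)  # each row sums to 1
output = weights @ V                 # [seq_len, head_dim] attended values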
What KV Caching Is
When generating text token by token, the model needs K and V for every earlier token at each layer. Recomputing those tensors at every decoding step, or again when the same prefix appears across multiple requests, is wasteful.
KV caching stores these K and V tensors so that they can be reused for future tokens, avoiding recomputation while producing the same outputs.
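A quick way to see the benefit is to time generation with and without the cache in Hugging Face transformers (a rough illustration; absolute numbers depend on hardware, model size, and sequence length):
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The capital of France is", return_tensors="pt")

for use_cache in (False, True):
    start = time.time()
    model.generate(**inputs, max_new_tokens=100, do_sample=False,
                   use_cache=use_cache, pad_token_id=tokenizer.eos_token_id)
    print(f"use_cache={use_cache}: {time.time() - start:.2f}s")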
Example: KV Caching in Action
Suppose you have a prompt:
System: You are a helpful assistant.
User: What is the capital of France?
- Tokenize and process the prompt. Each token generates K and V tensors in every layer:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
# Compute KV cache
outputs = model(**inputs, use_cache=True)
kv_cache = outputs.past_key_values # List of (K, V) tuples for each layer
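You can inspect the cache directly. Depending on your transformers version, past_key_values is either a tuple of (K, V) pairs or a Cache object, but both can be indexed per layer; for GPT-2 small that means 12 layers, 12 heads, and a head dimension of 64:
print(len(kv_cache))   # 12: one (K, V) pair per transformer layer
k, v = kv_cache[0]
print(k.shape)         # [batch, num_heads, prompt_len, head_dim], i.e. [1, 12, prompt_len, 64]
print(v.shape)         # same shape as K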
- Extend the prompt. Now process the new tokens while reusing the cache; note that the attention mask must cover both the cached and the new tokens:
new_prompt = " and Germany?"
new_inputs = tokenizer(new_prompt, return_tensors="pt")
# Reuse KV cache
new_outputs = model(**new_inputs, past_key_values=kv_cache, use_cache=True)
- The model reuses the cached K and V for the first part of the sequence.
- Only the new tokens are processed fully.
- Output is identical to recomputing the entire sequence.
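To convince yourself of that last point, you can compare the logits from the cached pass with a full recomputation of the concatenated sequence (a small check that reuses the variables from the snippets above):
import torch

# Recompute the whole sequence from scratch, without reusing the cache
full_ids = torch.cat([inputs["input_ids"], new_inputs["input_ids"]], dim=-1)
full_outputs = model(input_ids=full_ids)

# Logits for the new tokens match the cached run up to floating-point tolerance
num_new = new_inputs["input_ids"].shape[1]
print(torch.allclose(full_outputs.logits[:, -num_new:, :], new_outputs.logits, atol=1e-4))  # True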
How Transformers Use KV Cache Internally
- Layer-level storage
  - Each transformer layer stores a separate (K, V) tuple for each token.
  - The shape is [num_heads, seq_len, head_dim].
- Attention computation with cache
  - New queries are compared with cached K matrices via dot product (see the sketch below).
  - The resulting attention scores weight the cached V matrices to compute outputs.
- Multi-turn conversations
  - Cached K/V states can persist across turns in a chat.
  - Only new tokens trigger full computation, making chat sessions more efficient.
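To make this bookkeeping concrete, here is a toy single-head decode step that appends the newest token's K and V to a cache and attends over everything cached so far (hypothetical weights and names, no masking or multi-head logic):
import torch
import torch.nn.functional as F

head_dim = 64
W_q, W_k, W_v = (torch.randn(head_dim, head_dim) for _ in range(3))
cached_k = torch.empty(0, head_dim)   # grows by one row per generated token
cached_v = torch.empty(0, head_dim)

def decode_step(x, cached_k, cached_v):
    # x is the hidden state of the newest token, shape [1, head_dim]
    q = x @ W_q                                  # query for the new token only
    cached_k = torch.cat([cached_k, x @ W_k])    # append this token's key
    cached_v = torch.cat([cached_v, x @ W_v])    # append this token's value
    scores = q @ cached_k.T / head_dim ** 0.5    # attend over all cached keys
    weights = F.softmax(scores, dim=-1)
    return weights @ cached_v, cached_k, cached_v

# Simulate generating three tokens: earlier K/V are never recomputed
for _ in range(3):
    x = torch.randn(1, head_dim)
    out, cached_k, cached_v = decode_step(x, cached_k, cached_v)
    print(cached_k.shape)   # torch.Size([1, 64]), then [2, 64], then [3, 64]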
LMCache: Advanced KV Caching for vLLM
LMCache is a library designed to store and reuse KV caches efficiently across GPU, CPU, and disk.
Features:
- Stores K/V tensors for repeated prompts.
- Supports multi-turn conversations and retrieval-augmented generation (RAG).
- Reduces latency and GPU usage.
Example usage (an illustrative sketch: LMCache is wired into vLLM through a KV-connector configuration, and the exact class, field, and connector names vary across vLLM and LMCache versions):
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Enable LMCache as vLLM's KV-cache connector (names may differ by version)
ktc = KVTransferConfig(kv_connector="LMCacheConnectorV1", kv_role="kv_both")
llm = LLM(model="gpt2", kv_transfer_config=ktc)
params = SamplingParams(temperature=0, max_tokens=20)

# First request: LMCache stores the KV cache for this prompt
outputs = llm.generate(["The capital of France is"], params)

# Next turn: a prompt that repeats the earlier text as its prefix reuses the cached KV
new_prompt = "The capital of France is" + " and Germany?"
new_outputs = llm.generate([new_prompt], params)
LMCache handles storing and retrieving K/V tensors transparently.
Visualizing KV Caching
- Prompt A → KV cache generated.
- Prompt B → Reuse KV cache from Prompt A.
- Only new tokens go through full attention computation.
This can speed up multi-turn conversations by 5-10x, depending on sequence length and how much of it is reused.
Key Takeaways
- K and V tensors encode the memory of each token in a transformer layer.
- KV caching reuses these tensors to avoid redundant computation.
- True KV caching requires access to the model internals (open-source or local).
- LMCache shows how caching can be scaled across multi-turn sessions and storage tiers.
- Understanding KV caching is essential for engineers building high-performance LLM applications.