Transformers · KV Cache · LLMs

How KV Caching Works in Large Language Models

Vatsal Bajpai
10 min read

Large language models (LLMs) are remarkable at generating coherent text. Behind the scenes, they generate text one token at a time, and at each step the transformer layers compute attention over all previous tokens. While this gives models long-range context, it comes at a cost: repeated computation for tokens that have already been processed. KV caching is the optimization that solves this problem, making LLMs faster and more efficient.

Quick visualization of the KV cache in action

Without KV caching

  • All prompt data is recomputed every time
    (diagram: prompt processing without KV caching; src: https://lmcache.ai)

With KV caching

  • Cached prompt data is reused
    (diagram: prompt processing with KV caching; src: https://lmcache.ai)

The Transformer Attention Mechanism

Each transformer layer computes attention using queries (Q), keys (K), and values (V).

  • Query (Q): Represents the token you are currently generating or attending from.
  • Key (K): Encodes the information of each previous token in a way that can be compared with queries.
  • Value (V): Stores the content information of each token that will be aggregated based on attention scores.

Mathematically, attention is computed as:

Attention(Q, K, V) = softmax(Q × K^T / sqrt(d)) × V

Where d is the dimensionality of the key vectors (the per-head dimension).

  • Each token produces one K and one V per layer.
  • K and V are matrices shaped [num_heads, seq_len, head_dim].
  • These matrices are what allow the model to “remember” previous tokens.
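
To make this concrete, here is a minimal single-head sketch of the formula in PyTorch; the tensor sizes are toy values chosen only to check shapes, and the causal mask is omitted for brevity:

import torch
import torch.nn.functional as F

seq_len, d = 6, 64                     # toy sizes for illustration
Q = torch.randn(seq_len, d)            # one query vector per token
K = torch.randn(seq_len, d)            # one key vector per token
V = torch.randn(seq_len, d)            # one value vector per token

scores = Q @ K.T / d ** 0.5            # [seq_len, seq_len] similarity scores
weights = F.softmax(scores, dim=-1)    # normalize scores into attention weights
output = weights @ V                   # [seq_len, d] weighted sum of values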

What KV Caching Is

When generating text autoregressively, the model needs the K and V of every previous token at each layer. Recomputing them at every decoding step, or for a prompt prefix that is shared across requests, is wasteful.

KV caching stores these K and V tensors so that they can be reused for future tokens, avoiding recomputation while producing the same outputs.


Example: KV Caching in Action

Suppose you have a prompt:

System: You are a helpful assistant.
User: What is the capital of France?
  1. Tokenize and process the prompt. Each token generates K and V tensors in every layer.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Compute KV cache
outputs = model(**inputs, use_cache=True)
kv_cache = outputs.past_key_values  # List of (K, V) tuples for each layer
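
You can inspect what was cached. For GPT-2, each of the 12 layers holds one (K, V) pair shaped [batch, num_heads, seq_len, head_dim]; the exact container type depends on your transformers version, but per-layer indexing works either way:

print(len(kv_cache))           # number of layers (12 for GPT-2)
print(kv_cache[0][0].shape)    # K for layer 0, e.g. torch.Size([1, 12, 5, 64])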
  2. Extend the prompt. Now process the new tokens, reusing the cached K and V:
import torch

new_prompt = " and Germany?"
new_inputs = tokenizer(new_prompt, return_tensors="pt")

# Reuse KV cache; the attention mask must cover the cached tokens plus the new ones
attn_mask = torch.cat([inputs["attention_mask"], new_inputs["attention_mask"]], dim=-1)
new_outputs = model(input_ids=new_inputs["input_ids"], attention_mask=attn_mask,
                    past_key_values=kv_cache, use_cache=True)
  • The model reuses the cached K and V for the first part of the sequence.
  • Only the new tokens are processed fully.
  • Output is identical to recomputing the entire sequence (a quick check follows).
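
To check the last point, compare the logits from the cached path against a full recomputation of the concatenated prompt. This is a small sanity check that reuses the variables defined above; the tolerance is arbitrary:

import torch

# Recompute everything from scratch over the concatenated prompt
full_ids = torch.cat([inputs["input_ids"], new_inputs["input_ids"]], dim=-1)
full_outputs = model(input_ids=full_ids)

# Logits for the new tokens should match the cached path up to float tolerance
n_new = new_inputs["input_ids"].shape[-1]
print(torch.allclose(full_outputs.logits[:, -n_new:], new_outputs.logits, atol=1e-4))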

How Transformers Use KV Cache Internally

  1. Layer-level storage

    • Each transformer layer keeps its own K and V tensors, with one entry per token processed so far.
    • The shape is [num_heads, seq_len, head_dim].
  2. Attention computation with cache

    • New queries are compared with the cached K matrices via dot product.
    • The resulting attention scores weight the cached V matrices to compute outputs (a minimal sketch follows this list).
  3. Multi-turn conversations

    • Cached K/V states can persist across turns in a chat.
    • Only new tokens trigger full computation, making chat sessions more efficient.
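
Putting these steps together, here is a minimal single-head sketch of one decoding step against a cache. The helper attend_with_cache and the toy shapes are illustrative rather than framework code; real models do this per layer and per attention head, with a batch dimension:

import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cached_k, cached_v):
    # Append the new token's K/V to the cache
    k = torch.cat([cached_k, k_new], dim=0)        # [seq_len + 1, head_dim]
    v = torch.cat([cached_v, v_new], dim=0)        # [seq_len + 1, head_dim]

    # Only the new token's query attends; all past K/V come from the cache
    scores = q_new @ k.T / k.shape[-1] ** 0.5      # [1, seq_len + 1]
    weights = F.softmax(scores, dim=-1)
    output = weights @ v                           # [1, head_dim]
    return output, (k, v)                          # return the updated cache

head_dim = 64
cached_k, cached_v = torch.randn(5, head_dim), torch.randn(5, head_dim)
q, k, v = (torch.randn(1, head_dim) for _ in range(3))
out, (cached_k, cached_v) = attend_with_cache(q, k, v, cached_k, cached_v)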

LMCache: Advanced KV Caching for vLLM

LMCache is a library designed to store and reuse KV caches efficiently across GPU, CPU, and disk.

Features:

  • Stores K/V tensors for repeated prompts.
  • Supports multi-turn conversations and retrieval-augmented generation (RAG).
  • Reduces time-to-first-token latency and GPU compute by avoiding redundant prefill.

Example usage (simplified; see the LMCache documentation for the current integration API):

from vllm import LLM
from lmcache import LMCache

llm = LLM("gpt2")
cache = LMCache(storage="disk", path="/tmp/kv_cache")

prompt = "The capital of France is"
outputs = llm.generate(prompt, use_cache=True, cache=cache)

# Next turn
new_prompt = " and Germany?"
new_outputs = llm.generate(new_prompt, use_cache=True, cache=cache)

LMCache handles storing and retrieving K/V tensors transparently.


Visualizing KV Caching

  1. Prompt A → KV cache generated.
  2. Prompt B → Reuse KV cache from Prompt A.
  3. Only new tokens go through full attention computation.

This can speed up multi-turn conversations by 5-10x depending on sequence length.
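
A rough way to see the effect locally is to time one forward pass with and without a precomputed prefix cache, reusing the GPT-2 tokenizer and model from the example above. The prefix and question strings here are toy values, and exact numbers depend on your hardware and prompt length:

import time
import torch

prefix = "You are a helpful assistant. " * 50      # long shared prefix
question = " What is the capital of France?"
prefix_ids = tokenizer(prefix, return_tensors="pt")["input_ids"]
question_ids = tokenizer(question, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Build the prefix cache once, e.g. at the start of a chat session
    prefix_cache = model(input_ids=prefix_ids, use_cache=True).past_key_values

    # Without cache: process prefix + question from scratch
    t0 = time.perf_counter()
    model(input_ids=torch.cat([prefix_ids, question_ids], dim=-1))
    t_cold = time.perf_counter() - t0

    # With cache: only the question tokens go through full attention
    mask = torch.ones(1, prefix_ids.shape[-1] + question_ids.shape[-1], dtype=torch.long)
    t0 = time.perf_counter()
    model(input_ids=question_ids, attention_mask=mask,
          past_key_values=prefix_cache, use_cache=True)
    t_warm = time.perf_counter() - t0

print(f"without cache: {t_cold:.3f}s, with cache: {t_warm:.3f}s")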


Key Takeaways

  • K and V tensors encode the memory of each token in a transformer layer.
  • KV caching reuses these tensors to avoid redundant computation.
  • True KV caching requires access to the model internals (open-source or local).
  • LMCache shows how caching can be scaled across multi-turn sessions and storage tiers.
  • Understanding KV caching is essential for engineers building high-performance LLM applications.
