Production LLM Deployment Guide: Quantization, vLLM Serving & GPU Memory Optimization
Production LLM deployment requires systematic optimization across model quantization, serving infrastructure, and GPU memory management. This guide provides engineering-focused strategies for maximizing throughput while minimizing latency and costs.
Quantization Fundamentals
Quantization reduces model memory footprint by converting FP16/FP32 weights to lower precision formats. The choice between quantization methods directly impacts model quality and serving efficiency.
AWQ (Activation-Aware Weight Quantization)
- 4-bit quantization with minimal quality degradation
- Best for: Production deployments requiring quality preservation
- Memory reduction: 4x compared to FP16
- Quality impact: <2% perplexity increase on average
GPTQ (Generative Pre-trained Transformer Quantization)
- Post-training quantization for transformer models
- Best for: Models with stable architectures
- Supports: INT4 and INT8 precision modes
- Calibration: Requires representative dataset for optimal results
GGUF (GPT-Generated Unified Format)
- Optimized for local inference engines like llama.cpp
- Quantization levels: Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0
- Memory calculation: `weight_bytes ≈ num_params × effective_bits_per_weight / 8`
- Example: a 70B model at Q4_K_M (~4.5 effective bits per weight) ≈ 70 × 4.5 / 8 ≈ 39GB for weights alone, before KV cache and runtime overhead
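The estimate can be scripted. The effective bits-per-weight figures below are approximations (K-quants store some tensors at higher precision), so treat the results as planning estimates, not exact on-disk sizes:

```python
# Back-of-envelope weight memory for quantized models.
# Bits-per-weight values are approximate effective rates, not exact.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Weights-only footprint in GB; excludes KV cache and runtime overhead."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"{weight_memory_gb(70, 'Q4_K_M'):.1f} GB")  # 39.4 GB
```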
vLLM Serving Architecture
vLLM delivers 2-24x throughput improvements over conventional serving frameworks through PagedAttention, continuous batching, and Flash Attention 2.
PagedAttention Mechanism
- Eliminates 60-80% memory waste from KV cache fragmentation
- Virtual memory approach: Treats GPU memory like OS virtual memory
- Non-contiguous storage: Enables efficient memory utilization
- Dynamic allocation: Pages KV cache on-demand
Flash Attention 2
- Critical for v0.6.x+ performance: Requires compute capability 8.0+ (Ampere or newer GPUs: A100, A10, A30, A40, RTX 3090/4090)
- Memory bandwidth optimization: Reduces memory reads by 2-4x
- Automatic detection: Enabled by default on supported hardware
- Fallback: Gracefully degrades to standard attention on older GPUs
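A quick way to verify the compute-capability floor before deploying. The helper below takes the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns, so the check itself needs no GPU present:

```python
# Flash Attention 2 requires compute capability 8.0+ (Ampere or newer).
def supports_flash_attention_2(capability: tuple) -> bool:
    """capability: (major, minor) as from torch.cuda.get_device_capability()."""
    return tuple(capability) >= (8, 0)

# A100 reports (8, 0), RTX 4090 reports (8, 9), V100 reports (7, 0)
assert supports_flash_attention_2((8, 0))      # Ampere: supported
assert not supports_flash_attention_2((7, 5))  # Turing: falls back
```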
Core vLLM Configuration
# Production vLLM server launch (v0.6.x+)
# IMPORTANT: Pre-quantized AWQ models auto-detect quantization - omit --quantization flag
# Adding --quantization awq to a pre-quantized model may cause double-quantization errors
vllm serve TheBloke/Llama-2-70B-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--enable-prefix-caching \
--swap-space 16 \
--disable-log-requests
OpenAI-Compatible API Endpoints
vLLM provides drop-in OpenAI-compatible endpoints for seamless integration:
| Endpoint | Purpose |
|---|---|
| `/v1/chat/completions` | Chat-style interactions with conversation history |
| `/v1/completions` | Legacy text completion |
| `/v1/models` | List available models |
| `/v1/embeddings` | Text embeddings (when enabled) |
# OpenAI SDK compatibility - no code changes required
from openai import OpenAI
client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,  # Enable streaming for responsive UX
)
Streaming Response Support
Critical for production chat applications - streaming reduces perceived latency by returning tokens as generated:
# Streaming with OpenAI SDK
for chunk in client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Raw SSE streaming via curl
curl -N http://vllm-server:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "TheBloke/Llama-2-70B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
Production benefits: Time-to-first-token remains identical, but users see progressive output instead of waiting for complete response.
Continuous Batching
- Dynamic batch formation: Groups requests for efficient GPU utilization
- Latency optimization: Maintains <100ms TTFT SLA
- Throughput scaling: 23x improvement over non-batched serving
- Configuration: `--max-num-batched-tokens` controls batch size
Tensor Parallelism
- Model distribution across GPUs: `--tensor-parallel-size N` splits model weights
- Memory scaling: Enables larger models on multiple GPUs
- Communication overhead: Requires high-bandwidth interconnect (NVLink/NVSwitch)
- Configuration: Set based on GPU count and model size
LoRA Adapter Support
vLLM supports multi-tenant LoRA serving for cost-efficient model customization:
# Enable LoRA adapter serving
vllm serve TheBloke/Llama-2-70B-AWQ \
--enable-lora \
--lora-modules sentiment-analysis=/adapters/sentiment \
code-assist=/adapters/code \
--max-loras 4 \
--max-lora-rank 64
Multi-tenant scenario: Single base model serves multiple fine-tuned adapters, reducing GPU memory by 10-50x compared to deploying separate models. Adapters are loaded on-demand and cached based on --max-loras.
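Per-request adapter selection is straightforward: each adapter registered via `--lora-modules` is exposed under its own model name, so a client targets an adapter simply by setting the `model` field. A stdlib-only sketch (server URL and adapter names follow the example configuration above and are assumptions):

```python
# Select a LoRA adapter per request by using its registered name as the model.
import json
from urllib.request import Request, urlopen

def query_adapter(adapter: str, prompt: str,
                  base_url: str = "http://vllm-server:8000/v1") -> str:
    payload = json.dumps({
        "model": adapter,  # e.g. "sentiment-analysis" or "code-assist"
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = Request(f"{base_url}/chat/completions", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# query_adapter("sentiment-analysis", "Great product, fast shipping!")
```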
GPU Memory Optimization
Memory management determines serving capacity and stability. Calculate requirements using accurate formulas.
Accurate Memory Calculation Formula
Total VRAM = Model_Weights + KV_Cache + overhead
KV_Cache = 2 * num_layers * max_seq_len * num_kv_heads * head_dim * dtype_size_bytes
The factor of 2 accounts for the separate K (Key) and V (Value) tensors stored for each layer. For grouped-query attention (GQA) models, use the KV-head count, not the attention-head count.
For FP16 models: 2 * layers * seq_len * kv_heads * head_dim * 2 bytes
Example for Llama-2-70B:
- Layers: 80, KV heads: 8 (GQA), Head_dim: 128
- KV cache per token: 2 * 80 * 8 * 128 * 2 = 327,680 bytes ≈ 0.31MB
- 4K context: 4,096 tokens × 0.31MB ≈ 1.3GB VRAM per sequence (~43GB for 32 concurrent sequences)
Critical: Underestimating KV cache by half leads to production OOM failures. Always validate calculations against actual memory profiling.
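The formula is easy to wrap in a helper. One caveat worth flagging: Llama-2-70B uses grouped-query attention with 8 KV heads, so sizing with the full attention-head count (64) overstates the cache by 8x — always pull the KV-head count from the model config:

```python
# KV cache sizing helper. For GQA models pass the KV-head count,
# not the attention-head count.
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V

per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                  # 327680 bytes per token
print(per_token * 4096 / 2**30)   # 1.25 GiB for a full 4K-token sequence
```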
Key Memory Parameters
# Optimal memory configuration
--gpu-memory-utilization 0.85 # Reserve 15% for system overhead
--max-model-len 4096 # Maximum sequence length
--max-num-seqs 32 # Concurrent sequences
--swap-space 16 # CPU swap in GB
--enable-chunked-prefill # Reduces memory spikes
KV Cache Management
- Primary memory consumer: 60-80% of total VRAM usage
- Dynamic sizing: Automatically adjusts based on sequence length
- Memory fragmentation: PagedAttention eliminates traditional waste
- Monitoring: Track `kv_cache_usage_percentage` in production
Memory Optimization Strategies
1. Sequence Length Tuning
   - Shorter sequences: Enable more concurrent requests
   - Context window: Balance quality vs. capacity
   - Typical range: 2K-8K tokens for most applications
2. Batch Size Optimization
   - Larger batches: Improve throughput but increase TTFT
   - Target TTFT: <100ms for interactive applications
   - Monitor: `time_to_first_token` percentiles
3. Model Parallelism
   - Tensor parallelism: Split model across multiple GPUs using `--tensor-parallel-size`
   - Pipeline parallelism: Stage-based model distribution
   - Selection criteria: Based on model size and GPU availability
Production Deployment
Docker Deployment
FROM nvidia/cuda:12.1.0-devel-ubuntu20.04
# Create non-root user for security
RUN useradd -m -s /bin/bash vllm && \
mkdir -p /app /models && \
chown -R vllm:vllm /app /models
WORKDIR /app
# Install vLLM 0.6.x+
RUN pip install vllm==0.6.3
# Copy and set permissions
COPY start-vllm.sh /app/start-vllm.sh
RUN chmod +x /app/start-vllm.sh && chown vllm:vllm /app/start-vllm.sh
USER vllm
CMD ["/app/start-vllm.sh"]
#!/bin/bash
# start-vllm.sh - Production entrypoint with graceful shutdown
set -euo pipefail
# exec replaces the shell with vLLM, so the server becomes PID 1 in the
# container and receives SIGTERM directly for a clean exit - no orphaned
# GPU processes left behind by an intermediate shell swallowing signals.
exec vllm serve "${MODEL_NAME}" \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size "${TP_SIZE:-1}" \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --enable-prefix-caching
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
      containers:
      - name: vllm
        image: vllm:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        env:
        - name: MODEL_NAME
          value: "TheBloke/Llama-2-70B-AWQ"
        - name: TP_SIZE
          value: "1"
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # CPU utilization is ineffective for GPU-bound inference
  # Use custom metrics via Prometheus Adapter for GPU-aware scaling
  - type: Pods
    pods:
      metric:
        name: vllm_avg_prompt_tokens_per_request
      target:
        type: AverageValue
        averageValue: "500"
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_memory_utilization
      target:
        type: AverageValue
        averageValue: "80"
---
# Alternative: Vertical Pod Autoscaler for GPU workloads
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vllm-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  updatePolicy:
    updateMode: "Auto"
Note: HPA with CPU metrics is ineffective for GPU-bound inference. Deploy Prometheus Adapter with custom metrics (e.g., vllm_gpu_memory_utilization, vllm_avg_prompt_tokens_per_request) or use Vertical Pod Autoscaler for memory-constrained workloads.
Performance Monitoring
Key metrics for production inference:
1. Throughput Metrics
   - Tokens per second (TPS): Target >500 TPS for 70B models
   - Requests per second (RPS): Capacity planning metric
   - Batch efficiency: `actual_batch_size / max_batch_size`
2. Latency Metrics
   - Time to First Token (TTFT): p50 <100ms, p95 <500ms
   - Time between tokens: Target <50ms for generation
   - Total response time: End-to-end request completion
3. Resource Metrics
   - GPU memory utilization: Maintain <90% to prevent OOM
   - KV cache hit rate: >80% indicates effective caching
   - CPU utilization: Monitor for offloading overhead
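vLLM exposes these counters in Prometheus exposition format on its `/metrics` endpoint. A minimal scraping sketch — note that metric names vary across vLLM versions (e.g. `vllm:gpu_cache_usage_perc` in recent releases), so verify the exact name against your server's `/metrics` output:

```python
# Pull one metric value from a Prometheus-format /metrics endpoint.
from urllib.request import urlopen

def parse_metric(exposition: str, metric: str):
    """Return the first sample value for `metric`, or None if absent."""
    for line in exposition.splitlines():
        # "# HELP"/"# TYPE" comment lines never start with the metric name
        if line.startswith(metric):
            try:
                return float(line.rsplit(" ", 1)[-1])
            except ValueError:
                return None
    return None

def scrape_metric(base_url: str, metric: str):
    with urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_metric(resp.read().decode(), metric)

# scrape_metric("http://vllm-server:8000", "vllm:gpu_cache_usage_perc")
```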
Health Checks and Reliability
vLLM health endpoint varies by version:
- v0.6.x and earlier: `/health`
- v0.6.5+ (OpenAI-compatible): `/health` or `/v1/health`
# Version-aware health check
import requests

def check_health(base_url: str) -> bool:
    for path in ["/health", "/v1/health"]:
        try:
            response = requests.get(f"{base_url}{path}", timeout=5)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            continue
    return False

if check_health("http://vllm-server:8000"):
    print("Service healthy")
else:
    print("Service unhealthy")
Capacity Planning
Calculate required GPU count:
GPUs_needed = daily_tokens / (tokens_per_second_per_gpu * 86400)
Example for 70B model:
- Target: 10M tokens/day
- Per GPU: 500 TPS = 43.2M tokens/day
- Required GPUs: 10M / 43.2M = 0.23 → 1 GPU sufficient
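The worked example above can be scripted as a small planner; fractional results round up because GPUs are whole units, with a floor of one:

```python
# GPU count planner: daily token demand vs. per-GPU daily token capacity.
import math

def gpus_needed(daily_tokens: float, tps_per_gpu: float) -> int:
    tokens_per_gpu_per_day = tps_per_gpu * 86_400  # seconds in a day
    return max(1, math.ceil(daily_tokens / tokens_per_gpu_per_day))

print(gpus_needed(10_000_000, 500))   # 1  (demand is 0.23 of one GPU)
print(gpus_needed(100_000_000, 500))  # 3  (2.31 rounds up)
```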
Advanced Optimizations
Speculative Decoding
- 2-3x speedup for predictable outputs
- Configuration example:
vllm serve TheBloke/Llama-2-70B-AWQ \
  --speculative-model TheBloke/Llama-2-7B-AWQ \
  --num-speculative-tokens 5
- Draft model: Smaller model generates candidates
- Verification: Larger model validates in single pass
- Best for: Structured outputs, low temperature
Prefix Caching
- 400%+ utilization improvement for standardized prompts
- Cache hits: Reuse computed KV cache for repeated prefixes
- Configuration: `--enable-prefix-caching` (v0.6.x+)
- Memory overhead: Minimal additional VRAM usage
- Hash algorithms: `--prefix-caching-hash-algo sha256_cbor` for reproducible caching
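The benefit is easy to quantify with back-of-envelope arithmetic: after the first request computes the shared prefix's KV cache, every subsequent request skips that prefill work. The figures below are illustrative, not a benchmark:

```python
# Prefill tokens avoided when N requests share a common cached prefix.
def prefill_tokens_saved(shared_prefix_tokens: int, num_requests: int) -> int:
    # First request pays full prefill; the rest reuse the cached prefix KV.
    return shared_prefix_tokens * max(0, num_requests - 1)

# 100 requests sharing a 1,500-token system prompt:
print(prefill_tokens_saved(1_500, 100))  # 148500 prefill tokens skipped
```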
Cross-Instance KV Cache Sharing
- 3-10x latency reduction for repetitive workloads
- Distributed cache: Share KV cache across serving instances
- Network overhead: Requires high-bandwidth interconnect
- Best for: Multi-instance deployments with repetitive context
Implementation Checklist
1. Quantization Selection
   - Profile model quality vs. quantization level
   - Select AWQ for production quality requirements
   - Calculate memory requirements using the full formula (include the factor of 2 for KV cache)
2. vLLM Configuration
   - Use vLLM 0.6.x+ with the `vllm serve <model>` command
   - Omit the `--quantization` flag for pre-quantized models (auto-detected)
   - Set `--gpu-memory-utilization` to 0.85-0.9
   - Configure `--max-num-seqs` based on memory capacity
   - Enable `--enable-chunked-prefill` and `--enable-prefix-caching`
3. Production Deployment
   - Implement health checks using the `/health` or `/v1/health` endpoint
   - Configure monitoring for TPS, TTFT, and memory metrics
   - Set up alerting for latency degradation and OOM conditions
   - Use custom metrics or VPA for Kubernetes autoscaling (not CPU-based HPA)
   - Run containers as a non-root user with a proper WORKDIR
   - Implement graceful shutdown in entrypoint scripts (e.g. `exec` so the server receives signals directly)
4. Optimization
   - Enable speculative decoding for appropriate workloads
   - Configure prefix caching for standardized prompts
   - Implement cross-instance KV cache sharing if applicable
   - Verify Flash Attention 2 support on target GPUs (Ampere+)
   - Enable streaming for chat/completion endpoints
   - Configure LoRA adapters for multi-tenant scenarios
Cost optimization results vary by workload and infrastructure. Benchmark your specific deployment using throughput-per-dollar analysis: compare vLLM's continuous batching against traditional request-per-instance architectures. Key factors include request distribution, sequence length variance, and GPU memory bandwidth. Validate memory calculations against actual production metrics before capacity planning.