Production LLM Deployment Guide: Quantization, vLLM Serving & GPU Memory Optimization
Production LLM deployment requires systematic optimization across model quantization, serving infrastructure, and GPU memory management. This guide provides engineering-focused strategies for maximizing throughput while minimizing latency and costs.
Quantization Fundamentals
Quantization reduces model memory footprint by converting FP16/FP32 weights to lower precision formats. The choice between quantization methods directly impacts model quality and serving efficiency.
AWQ (Activation-Aware Weight Quantization)
- 4-bit quantization with minimal quality degradation
- Best for: Production deployments requiring quality preservation
- Memory reduction: 4x compared to FP16
- Quality impact: <2% perplexity increase on average
GPTQ (Generative Pre-trained Transformer Quantization)
- Post-training quantization for transformer models
- Best for: Models with stable architectures
- Supports: INT4 and INT8 precision modes
- Calibration: Requires representative dataset for optimal results
GGUF (GPT-Generated Unified Format)
- Optimized for local inference engines like llama.cpp
- Quantization levels: Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K, Q8_0
- Memory calculation: `weight_bytes ≈ num_params × effective_bits_per_weight / 8`
- Example: a 70B model at Q4_K_M (~4.5 effective bits per weight) ≈ 70 × 4.5 / 8 ≈ 39GB for weights alone, before KV cache and runtime overhead
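The estimate can be scripted. The effective bits-per-weight figures below are approximations (K-quants store some tensors at higher precision), so treat the results as planning estimates, not exact on-disk sizes:

```python
# Back-of-envelope weight memory for quantized models.
# Bits-per-weight values are approximate effective rates, not exact.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.5,
    "Q5_K_M": 5.5,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Weights-only footprint in GB; excludes KV cache and runtime overhead."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"{weight_memory_gb(70, 'Q4_K_M'):.1f} GB")  # 39.4 GB
```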
vLLM Serving Architecture
vLLM delivers 2-24x throughput improvements over conventional serving frameworks through PagedAttention, continuous batching, and Flash Attention 2.
PagedAttention Mechanism
- Eliminates 60-80% memory waste from KV cache fragmentation
- Virtual memory approach: Treats GPU memory like OS virtual memory
- Non-contiguous storage: Enables efficient memory utilization
- Dynamic allocation: Pages KV cache on-demand
Flash Attention 2
- Critical for v0.6.x+ performance: Requires compute capability 8.0+ (Ampere or newer GPUs: A100, A10, A30, A40, RTX 3090/4090)
- Memory bandwidth optimization: Reduces memory reads by 2-4x
- Automatic detection: Enabled by default on supported hardware
- Fallback: Gracefully degrades to standard attention on older GPUs
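A quick way to verify the compute-capability floor before deploying. The helper below takes the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns, so the check itself needs no GPU present:

```python
# Flash Attention 2 requires compute capability 8.0+ (Ampere or newer).
def supports_flash_attention_2(capability: tuple) -> bool:
    """capability: (major, minor) as from torch.cuda.get_device_capability()."""
    return tuple(capability) >= (8, 0)

# A100 reports (8, 0), RTX 4090 reports (8, 9), V100 reports (7, 0)
assert supports_flash_attention_2((8, 0))      # Ampere: supported
assert not supports_flash_attention_2((7, 5))  # Turing: falls back
```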
Core vLLM Configuration
# Production vLLM server launch (v0.6.x+)
# IMPORTANT: Pre-quantized AWQ models auto-detect quantization - omit --quantization flag
# Adding --quantization awq to a pre-quantized model may cause double-quantization errors
vllm serve TheBloke/Llama-2-70B-AWQ \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill \
--enable-prefix-caching \
--swap-space 16 \
--disable-log-requests
OpenAI-Compatible API Endpoints
vLLM provides drop-in OpenAI-compatible endpoints for seamless integration:
| Endpoint | Purpose |
|---|---|
| `/v1/chat/completions` | Chat-style interactions with conversation history |
| `/v1/completions` | Legacy text completion |
| `/v1/models` | List available models |
| `/v1/embeddings` | Text embeddings (when enabled) |
# OpenAI SDK compatibility - no code changes required
from openai import OpenAI
client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,  # Enable streaming for responsive UX
)
Streaming Response Support
Critical for production chat applications - streaming reduces perceived latency by returning tokens as generated:
# Streaming with OpenAI SDK
for chunk in client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
# Raw SSE streaming via curl
curl -N http://vllm-server:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "TheBloke/Llama-2-70B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
Production benefits: Time-to-first-token remains identical, but users see progressive output instead of waiting for complete response.
Continuous Batching
- Dynamic batch formation: Groups requests for efficient GPU utilization
- Latency optimization: Maintains <100ms TTFT SLA
- Throughput scaling: 23x improvement over non-batched serving
- Configuration: `--max-num-batched-tokens` controls batch size
Tensor Parallelism
- Model distribution across GPUs: `--tensor-parallel-size N` splits model weights
- Memory scaling: Enables larger models on multiple GPUs
- Communication overhead: Requires high-bandwidth interconnect (NVLink/NVSwitch)
- Configuration: Set based on GPU count and model size
LoRA Adapter Support
vLLM supports multi-tenant LoRA serving for cost-efficient model customization:
# Enable LoRA adapter serving
vllm serve TheBloke/Llama-2-70B-AWQ \
--enable-lora \
--lora-modules sentiment-analysis=/adapters/sentiment \
code-assist=/adapters/code \
--max-loras 4 \
--max-lora-rank 64
Multi-tenant scenario: Single base model serves multiple fine-tuned adapters, reducing GPU memory by 10-50x compared to deploying separate models. Adapters are loaded on-demand and cached based on --max-loras.
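Per-request adapter selection is straightforward: each adapter registered via `--lora-modules` is exposed under its own model name, so a client targets an adapter simply by setting the `model` field. A stdlib-only sketch (server URL and adapter names follow the example configuration above and are assumptions):

```python
# Select a LoRA adapter per request by using its registered name as the model.
import json
from urllib.request import Request, urlopen

def query_adapter(adapter: str, prompt: str,
                  base_url: str = "http://vllm-server:8000/v1") -> str:
    payload = json.dumps({
        "model": adapter,  # e.g. "sentiment-analysis" or "code-assist"
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = Request(f"{base_url}/chat/completions", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# query_adapter("sentiment-analysis", "Great product, fast shipping!")
```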
GPU Memory Optimization
Memory management determines serving capacity and stability. Calculate requirements using accurate formulas.
Accurate Memory Calculation Formula
Total VRAM = Model_Weights + KV_Cache + overhead
KV_Cache = 2 * num_layers * max_seq_len * num_kv_heads * head_dim * dtype_size_bytes
The factor of 2 accounts for the separate K (Key) and V (Value) tensors stored for each layer. For grouped-query attention (GQA) models, use the KV-head count, not the attention-head count.
For FP16 models: 2 * layers * seq_len * kv_heads * head_dim * 2 bytes
Example for Llama-2-70B:
- Layers: 80, KV heads: 8 (GQA), Head_dim: 128
- KV cache per token: 2 * 80 * 8 * 128 * 2 = 327,680 bytes ≈ 0.31MB
- 4K context: 4,096 tokens × 0.31MB ≈ 1.3GB VRAM per sequence (~43GB for 32 concurrent sequences)
Critical: Underestimating KV cache by half leads to production OOM failures. Always validate calculations against actual memory profiling.
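The formula is easy to wrap in a helper. One caveat worth flagging: Llama-2-70B uses grouped-query attention with 8 KV heads, so sizing with the full attention-head count (64) overstates the cache by 8x — always pull the KV-head count from the model config:

```python
# KV cache sizing helper. For GQA models pass the KV-head count,
# not the attention-head count.
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V

per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                  # 327680 bytes per token
print(per_token * 4096 / 2**30)   # 1.25 GiB for a full 4K-token sequence
```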
Key Memory Parameters
# Optimal memory configuration
--gpu-memory-utilization 0.85 # Reserve 15% for system overhead
--max-model-len 4096 # Maximum sequence length
--max-num-seqs 32 # Concurrent sequences
--swap-space 16 # CPU swap in GB
--enable-chunked-prefill # Reduces memory spikes
KV Cache Management
- Primary memory consumer: 60-80% of total VRAM usage
- Dynamic sizing: Automatically adjusts based on sequence length
- Memory fragmentation: PagedAttention eliminates traditional waste
- Monitoring: Track `kv_cache_usage_percentage` in production
Memory Optimization Strategies
1. Sequence Length Tuning
   - Shorter sequences: Enable more concurrent requests
   - Context window: Balance quality vs. capacity
   - Typical range: 2K-8K tokens for most applications
2. Batch Size Optimization
   - Larger batches: Improve throughput but increase TTFT
   - Target TTFT: <100ms for interactive applications
   - Monitor: `time_to_first_token` percentiles
3. Model Parallelism
   - Tensor parallelism: Split model across multiple GPUs using `--tensor-parallel-size`
   - Pipeline parallelism: Stage-based model distribution
   - Selection criteria: Based on model size and GPU availability
Production Deployment
Docker Deployment
FROM nvidia/cuda:12.1.0-devel-ubuntu20.04
# Create non-root user for security
RUN useradd -m -s /bin/bash vllm && \
mkdir -p /app /models && \
chown -R vllm:vllm /app /models
WORKDIR /app
# Install vLLM 0.6.x+
RUN pip install vllm==0.6.3
# Copy and set permissions
COPY start-vllm.sh /app/start-vllm.sh
RUN chmod +x /app/start-vllm.sh && chown vllm:vllm /app/start-vllm.sh
USER vllm
CMD ["/app/start-vllm.sh"]
#!/bin/bash
# start-vllm.sh - Production entrypoint with graceful shutdown
set -euo pipefail
# exec replaces the shell with vLLM, so the server becomes PID 1 in the
# container and receives SIGTERM directly for a clean exit - no orphaned
# GPU processes left behind by an intermediate shell swallowing signals.
exec vllm serve "${MODEL_NAME}" \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size "${TP_SIZE:-1}" \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --enable-prefix-caching
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
      containers:
      - name: vllm
        image: vllm:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        env:
        - name: MODEL_NAME
          value: "TheBloke/Llama-2-70B-AWQ"
        - name: TP_SIZE
          value: "1"
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # CPU utilization is ineffective for GPU-bound inference
  # Use custom metrics via Prometheus Adapter for GPU-aware scaling
  - type: Pods
    pods:
      metric:
        name: vllm_avg_prompt_tokens_per_request
      target:
        type: AverageValue
        averageValue: "500"
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_memory_utilization
      target:
        type: AverageValue
        averageValue: "80"
---
# Alternative: Vertical Pod Autoscaler for GPU workloads
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vllm-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  updatePolicy:
    updateMode: "Auto"
Note: HPA with CPU metrics is ineffective for GPU-bound inference. Deploy Prometheus Adapter with custom metrics (e.g., vllm_gpu_memory_utilization, vllm_avg_prompt_tokens_per_request) or use Vertical Pod Autoscaler for memory-constrained workloads.
Performance Monitoring
Key metrics for production inference:
1. Throughput Metrics
   - Tokens per second (TPS): Target >500 TPS for 70B models
   - Requests per second (RPS): Capacity planning metric
   - Batch efficiency: `actual_batch_size / max_batch_size`
2. Latency Metrics
   - Time to First Token (TTFT): p50 <100ms, p95 <500ms
   - Time between tokens: Target <50ms for generation
   - Total response time: End-to-end request completion
3. Resource Metrics
   - GPU memory utilization: Maintain <90% to prevent OOM
   - KV cache hit rate: >80% indicates effective caching
   - CPU utilization: Monitor for offloading overhead
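vLLM exposes these counters in Prometheus exposition format on its `/metrics` endpoint. A minimal scraping sketch — note that metric names vary across vLLM versions (e.g. `vllm:gpu_cache_usage_perc` in recent releases), so verify the exact name against your server's `/metrics` output:

```python
# Pull one metric value from a Prometheus-format /metrics endpoint.
from urllib.request import urlopen

def parse_metric(exposition: str, metric: str):
    """Return the first sample value for `metric`, or None if absent."""
    for line in exposition.splitlines():
        # "# HELP"/"# TYPE" comment lines never start with the metric name
        if line.startswith(metric):
            try:
                return float(line.rsplit(" ", 1)[-1])
            except ValueError:
                return None
    return None

def scrape_metric(base_url: str, metric: str):
    with urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_metric(resp.read().decode(), metric)

# scrape_metric("http://vllm-server:8000", "vllm:gpu_cache_usage_perc")
```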
Health Checks and Reliability
vLLM health endpoint varies by version:
- v0.6.x and earlier: `/health`
- v0.6.5+ (OpenAI-compatible): `/health` or `/v1/health`
# Version-aware health check
import requests

def check_health(base_url: str) -> bool:
    for path in ["/health", "/v1/health"]:
        try:
            response = requests.get(f"{base_url}{path}", timeout=5)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            continue
    return False

if check_health("http://vllm-server:8000"):
    print("Service healthy")
else:
    print("Service unhealthy")
Capacity Planning
Calculate required GPU count:
GPUs_needed = daily_tokens / (tokens_per_second_per_gpu * 86400)
Example for 70B model:
- Target: 10M tokens/day
- Per GPU: 500 TPS = 43.2M tokens/day
- Required GPUs: 10M / 43.2M = 0.23 → 1 GPU sufficient
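The worked example above can be scripted as a small planner; fractional results round up because GPUs are whole units, with a floor of one:

```python
# GPU count planner: daily token demand vs. per-GPU daily token capacity.
import math

def gpus_needed(daily_tokens: float, tps_per_gpu: float) -> int:
    tokens_per_gpu_per_day = tps_per_gpu * 86_400  # seconds in a day
    return max(1, math.ceil(daily_tokens / tokens_per_gpu_per_day))

print(gpus_needed(10_000_000, 500))   # 1  (demand is 0.23 of one GPU)
print(gpus_needed(100_000_000, 500))  # 3  (2.31 rounds up)
```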
Advanced Optimizations
Speculative Decoding
- 2-3x speedup for predictable outputs
- Configuration example:
vllm serve TheBloke/Llama-2-70B-AWQ \
  --speculative-model TheBloke/Llama-2-7B-AWQ \
  --num-speculative-tokens 5
- Draft model: Smaller model generates candidates
- Verification: Larger model validates in single pass
- Best for: Structured outputs, low temperature
Prefix Caching
- 400%+ utilization improvement for standardized prompts
- Cache hits: Reuse computed KV cache for repeated prefixes
- Configuration: `--enable-prefix-caching` (v0.6.x+)
- Memory overhead: Minimal additional VRAM usage
- Hash algorithms: `--prefix-caching-hash-algo sha256_cbor` for reproducible caching
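The benefit is easy to quantify with back-of-envelope arithmetic: after the first request computes the shared prefix's KV cache, every subsequent request skips that prefill work. The figures below are illustrative, not a benchmark:

```python
# Prefill tokens avoided when N requests share a common cached prefix.
def prefill_tokens_saved(shared_prefix_tokens: int, num_requests: int) -> int:
    # First request pays full prefill; the rest reuse the cached prefix KV.
    return shared_prefix_tokens * max(0, num_requests - 1)

# 100 requests sharing a 1,500-token system prompt:
print(prefill_tokens_saved(1_500, 100))  # 148500 prefill tokens skipped
```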
Cross-Instance KV Cache Sharing
- 3-10x latency reduction for repetitive workloads
- Distributed cache: Share KV cache across serving instances
- Network overhead: Requires high-bandwidth interconnect
- Best for: Multi-instance deployments with repetitive context
Implementation Checklist
1. Quantization Selection
   - Profile model quality vs. quantization level
   - Select AWQ for production quality requirements
   - Calculate memory requirements using the full formula (include the factor of 2 for KV cache)
2. vLLM Configuration
   - Use vLLM 0.6.x+ with the `vllm serve <model>` command
   - Omit the `--quantization` flag for pre-quantized models (auto-detected)
   - Set `--gpu-memory-utilization` to 0.85-0.9
   - Configure `--max-num-seqs` based on memory capacity
   - Enable `--enable-chunked-prefill` and `--enable-prefix-caching`
3. Production Deployment
   - Implement health checks using the `/health` or `/v1/health` endpoint
   - Configure monitoring for TPS, TTFT, and memory metrics
   - Set up alerting for latency degradation and OOM conditions
   - Use custom metrics or VPA for Kubernetes autoscaling (not CPU-based HPA)
   - Run containers as a non-root user with a proper WORKDIR
   - Implement graceful shutdown in entrypoint scripts (e.g. `exec` so the server receives signals directly)
4. Optimization
   - Enable speculative decoding for appropriate workloads
   - Configure prefix caching for standardized prompts
   - Implement cross-instance KV cache sharing if applicable
   - Verify Flash Attention 2 support on target GPUs (Ampere+)
   - Enable streaming for chat/completion endpoints
   - Configure LoRA adapters for multi-tenant scenarios
Cost optimization results vary by workload and infrastructure. Benchmark your specific deployment using throughput-per-dollar analysis: compare vLLM's continuous batching against traditional request-per-instance architectures. Key factors include request distribution, sequence length variance, and GPU memory bandwidth. Validate memory calculations against actual production metrics before capacity planning.