AI & Machine Learning Engineering

Production LLM Deployment Guide: Quantization, vLLM Serving & GPU Memory Optimization

MatterAI Agent

18 min read

Production LLM deployment requires systematic optimization across model quantization, serving infrastructure, and GPU memory management. This guide provides engineering-focused strategies for maximizing throughput while minimizing latency and costs.

Quantization Fundamentals

Quantization reduces model memory footprint by converting FP16/FP32 weights to lower precision formats. The choice between quantization methods directly impacts model quality and serving efficiency.

AWQ (Activation-Aware Weight Quantization)

  • 4-bit quantization with minimal quality degradation
  • Best for: Production deployments requiring quality preservation
  • Memory reduction: 4x compared to FP16
  • Quality impact: <2% perplexity increase on average

GPTQ (Generative Pre-trained Transformer Quantization)

  • Post-training quantization for transformer models
  • Best for: Models with stable architectures
  • Supports: INT4 and INT8 precision modes
  • Calibration: Requires representative dataset for optimal results

GGUF (GPT-Generated Unified Format)

  • Optimized for local inference engines like llama.cpp
  • Quantization levels: Q4_K_M, Q4_K_S, Q5_K_M, Q5_K_S, Q6_K, Q8_0
  • Memory estimate (weights only): params_in_billions * (bits_per_weight / 8) GB
  • Example: 70B model at Q4_K_M (~4.8 bits/weight) ≈ 70 * 4.8 / 8 ≈ 42GB, before KV cache
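
A quick estimator for these levels. The bits-per-weight values below are approximations (K-quant formats keep some tensors at higher precision, so real files run slightly larger):

```python
# Rough quantized-weight size estimator (weights only - KV cache is extra).
# Bits-per-weight values are approximate; K-quant GGUF formats mix precisions.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_S": 4.5,
    "Q4_K_M": 4.8,
    "Q5_K_S": 5.5,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Estimate weight memory in GB for a given quantization level."""
    return params_billions * APPROX_BITS_PER_WEIGHT[quant] / 8

for quant in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"70B @ {quant}: ~{model_size_gb(70, quant):.0f} GB")
```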

vLLM Serving Architecture

vLLM delivers 2-24x throughput improvements over conventional serving frameworks through PagedAttention, continuous batching, and Flash Attention 2.

PagedAttention Mechanism

  • Eliminates 60-80% memory waste from KV cache fragmentation
  • Virtual memory approach: Treats GPU memory like OS virtual memory
  • Non-contiguous storage: Enables efficient memory utilization
  • Dynamic allocation: Pages KV cache on-demand
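
The block-table idea can be sketched in a few lines (illustrative only, not vLLM's implementation; vLLM's default block size of 16 tokens is assumed):

```python
# Toy block-table sketch of PagedAttention (illustrative only - not vLLM's
# implementation). KV cache is carved into fixed-size physical blocks; each
# sequence maps logical token positions to non-contiguous physical blocks.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))          # physical block pool
        self.tables: dict[int, list[int]] = {}       # seq id -> block table

    def append_token(self, seq_id: int, pos: int) -> None:
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                    # current block is full
            table.append(self.free.pop())            # grab any free block

    def free_sequence(self, seq_id: int) -> None:
        self.free.extend(self.tables.pop(seq_id))    # whole table reusable

alloc = BlockAllocator(num_blocks=1024)
for pos in range(40):                                # a 40-token sequence
    alloc.append_token(seq_id=0, pos=pos)
print(len(alloc.tables[0]))                          # 3 blocks, not 40 slots
```

Because blocks are allocated on demand and need not be contiguous, no VRAM is reserved for sequence lengths that never materialize.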

Flash Attention 2

  • Critical for v0.6.x+ performance: Requires compute capability 8.0+ (Ampere or newer GPUs: A100, A10, A30, A40, RTX 3090/4090)
  • Memory bandwidth optimization: Reduces memory reads by 2-4x
  • Automatic detection: Enabled by default on supported hardware
  • Fallback: Gracefully degrades to standard attention on older GPUs
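
A minimal pre-flight eligibility check, using nvidia-smi's compute_cap query field (assumes a reasonably recent driver) rather than a deep-learning framework:

```python
# Pre-flight check for Flash Attention 2 eligibility (compute capability 8.0+),
# via nvidia-smi's compute_cap query so no ML framework is required.
import shutil
import subprocess

def compute_capabilities() -> list[float]:
    """Compute capability per visible GPU; [] if no GPU or no driver."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return []
    return [float(v) for v in out.stdout.split()]

def supports_flash_attention_2() -> bool:
    caps = compute_capabilities()
    return bool(caps) and min(caps) >= 8.0  # Ampere/Ada/Hopper and newer

print(supports_flash_attention_2())
```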

Core vLLM Configuration

# Production vLLM server launch (v0.6.x+)
# NOTE: the model is a positional argument to `vllm serve`, not --model
# Pre-quantized AWQ models are auto-detected from the model config,
# so the --quantization flag can be omitted
vllm serve TheBloke/Llama-2-70B-AWQ \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-seqs 32 \
    --max-num-batched-tokens 8192 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --swap-space 16 \
    --disable-log-requests

OpenAI-Compatible API Endpoints

vLLM provides drop-in OpenAI-compatible endpoints for seamless integration:

Endpoint                Purpose
/v1/chat/completions    Chat-style interactions with conversation history
/v1/completions         Legacy text completion
/v1/models              List available models
/v1/embeddings          Text embeddings (when enabled)

# OpenAI SDK compatibility - no code changes required
from openai import OpenAI

client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True  # Enable streaming for responsive UX
)

Streaming Response Support

Critical for production chat applications - streaming reduces perceived latency by returning tokens as generated:

# Streaming with OpenAI SDK
for chunk in client.chat.completions.create(
    model="TheBloke/Llama-2-70B-AWQ",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Raw SSE streaming via curl
curl -N http://vllm-server:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TheBloke/Llama-2-70B-AWQ", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'

Production benefits: Time-to-first-token remains identical, but users see progressive output instead of waiting for the complete response.

Continuous Batching

  • Dynamic batch formation: Groups requests for efficient GPU utilization
  • Latency optimization: Maintains <100ms TTFT SLA
  • Throughput scaling: 23x improvement over non-batched serving
  • Configuration: --max-num-batched-tokens controls batch size
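
The scheduling idea can be sketched with a toy simulation: sequences leave the batch the step they finish and queued requests are admitted immediately, so mixed-length workloads drain in roughly max-length steps instead of the sum of per-batch maxima:

```python
# Toy continuous-batching scheduler (illustrative; vLLM's real scheduler is
# token-budget based). Finished sequences leave the batch every step and
# queued requests are admitted immediately.
from collections import deque

def run_continuous_batching(request_lengths: list[int], max_batch: int = 4) -> int:
    """Return the number of decode steps to drain all requests."""
    waiting = deque(enumerate(request_lengths))
    running: dict[int, int] = {}  # request id -> tokens left to generate
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit queued work
            rid, length = waiting.popleft()
            running[rid] = length
        steps += 1  # one decode step generates a token for every running seq
        running = {rid: n - 1 for rid, n in running.items() if n > 1}
    return steps

# Static batching would need 8 + 2 = 10 steps for these two batches;
# continuous batching drains the same work in 8.
print(run_continuous_batching([2, 8, 3, 8, 2, 2], max_batch=4))
```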

Tensor Parallelism

  • Model distribution across GPUs: --tensor-parallel-size N splits model weights
  • Memory scaling: Enables larger models on multiple GPUs
  • Communication overhead: Requires high-bandwidth interconnect (NVLink/NVSwitch)
  • Configuration: Set based on GPU count and model size

LoRA Adapter Support

vLLM supports multi-tenant LoRA serving for cost-efficient model customization:

# Enable LoRA adapter serving
vllm serve TheBloke/Llama-2-70B-AWQ \
    --enable-lora \
    --lora-modules sentiment-analysis=/adapters/sentiment \
                   code-assist=/adapters/code \
    --max-loras 4 \
    --max-lora-rank 64

Multi-tenant scenario: Single base model serves multiple fine-tuned adapters, reducing GPU memory by 10-50x compared to deploying separate models. Adapters are loaded on-demand and cached based on --max-loras.

GPU Memory Optimization

Memory management determines serving capacity and stability. Calculate requirements using accurate formulas.

Accurate Memory Calculation Formula

Total VRAM = Model_Weights + KV_Cache + Overhead

KV_Cache = 2 * num_layers * max_seq_len * num_kv_heads * head_dim * dtype_size_bytes

The factor of 2 accounts for the separate K (Key) and V (Value) tensors stored for each layer. Use the number of KV heads, not attention heads: models with grouped-query attention (GQA) share KV heads across query heads, which shrinks the cache substantially.

For FP16 models: 2 * layers * seq_len * kv_heads * head_dim * 2 bytes

Example for Llama-2-70B (GQA with 8 KV heads):

  • Layers: 80, KV heads: 8, Head_dim: 128
  • KV cache per token: 2 * 80 * 8 * 128 * 2 = 327,680 bytes ≈ 0.31MB
  • 4K context: 4,096 tokens × 0.31MB ≈ 1.25GB per sequence; 32 concurrent sequences ≈ 40GB

Critical: Omitting the factor of 2, or using attention heads instead of KV heads, skews estimates by large factors and leads to production OOM failures (or badly underused GPUs). Always validate calculations against actual memory profiling.
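
The formula as a small calculator (model dimensions are Llama-2-70B's):

```python
# KV cache calculator matching the formula above. For GQA models, pass the
# number of KV heads (Llama-2-70B uses 8 KV heads, not its 64 query heads).
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 2 = K and V

def kv_cache_gb(seq_len: int, num_seqs: int, **model) -> float:
    return seq_len * num_seqs * kv_cache_bytes_per_token(**model) / 1024**3

llama2_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)
print(kv_cache_bytes_per_token(**llama2_70b))   # 327680 bytes per token
print(kv_cache_gb(4096, 32, **llama2_70b))      # 40.0 GB for 32 x 4K sequences
```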

Key Memory Parameters

# Optimal memory configuration
--gpu-memory-utilization 0.85          # Reserve 15% for system overhead
--max-model-len 4096                   # Maximum sequence length
--max-num-seqs 32                      # Concurrent sequences
--swap-space 16                        # CPU swap in GB
--enable-chunked-prefill              # Reduces memory spikes

KV Cache Management

  • Primary memory consumer: 60-80% of total VRAM usage
  • Dynamic sizing: Automatically adjusts based on sequence length
  • Memory fragmentation: PagedAttention eliminates traditional waste
  • Monitoring: Track kv_cache_usage_percentage in production

Memory Optimization Strategies

  1. Sequence Length Tuning

    • Shorter sequences: Enable more concurrent requests
    • Context window: Balance quality vs. capacity
    • Typical range: 2K-8K tokens for most applications
  2. Batch Size Optimization

    • Larger batches: Improve throughput but increase TTFT
    • Target TTFT: <100ms for interactive applications
    • Monitor: time_to_first_token_percentiles
  3. Model Parallelism

    • Tensor parallelism: Split model across multiple GPUs using --tensor-parallel-size
    • Pipeline parallelism: Stage-based model distribution
    • Selection criteria: Based on model size and GPU availability
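
The three strategies compete for one budget. A back-of-envelope planner, assuming the ~40GB AWQ weight and ~1.25GB-per-4K-sequence KV figures used in this guide:

```python
# Back-of-envelope concurrency planner. The weight and per-sequence KV
# figures below are assumptions (70B AWQ ~40GB; ~1.25GB KV per 4K sequence).
def max_concurrent_seqs(total_vram_gb: float, weights_gb: float,
                        kv_gb_per_seq: float, utilization: float = 0.85) -> int:
    budget = total_vram_gb * utilization - weights_gb  # VRAM left for KV cache
    return max(0, int(budget // kv_gb_per_seq))

print(max_concurrent_seqs(80, 40, 1.25))   # A100-80GB: 22 sequences
print(max_concurrent_seqs(40, 40, 1.25))   # A100-40GB: weights alone don't fit
```

Halving --max-model-len roughly halves kv_gb_per_seq, which is why sequence-length tuning is the highest-leverage knob on a fixed GPU.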

Production Deployment

Docker Deployment

FROM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Python toolchain (the CUDA base image ships without pip)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Create non-root user for security
RUN useradd -m -s /bin/bash vllm && \
    mkdir -p /app /models && \
    chown -R vllm:vllm /app /models

WORKDIR /app

# Install vLLM 0.6.x+
RUN pip install vllm==0.6.3

# Copy and set permissions
COPY start-vllm.sh /app/start-vllm.sh
RUN chmod +x /app/start-vllm.sh && chown vllm:vllm /app/start-vllm.sh

USER vllm

CMD ["/app/start-vllm.sh"]

#!/bin/bash
# start-vllm.sh - Production entrypoint with graceful shutdown
set -euo pipefail

# Launch vLLM in the background so this shell stays alive to forward signals
# (with `exec`, a trap would never fire: the shell is replaced by vLLM)
vllm serve "${MODEL_NAME}" \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size "${TP_SIZE:-1}" \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --enable-prefix-caching &
VLLM_PID=$!

# Graceful shutdown handler - prevents orphaned GPU processes
cleanup() {
    echo "Received shutdown signal, cleaning up..."
    kill -TERM "${VLLM_PID}" 2>/dev/null || true
    wait "${VLLM_PID}" 2>/dev/null || true
    exit 0
}

trap cleanup SIGTERM SIGINT

wait "${VLLM_PID}"

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-40GB
      containers:
      - name: vllm
        image: vllm:latest
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        env:
        - name: MODEL_NAME
          value: "TheBloke/Llama-2-70B-AWQ"
        - name: TP_SIZE
          value: "1"
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # CPU utilization is ineffective for GPU-bound inference; scale on
  # vLLM-derived custom metrics exposed through Prometheus Adapter
  # (mapped from vLLM's native series vllm:num_requests_waiting and
  # vllm:gpu_cache_usage_perc)
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting
      target:
        type: AverageValue
        averageValue: "5"
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: "800m"
---
# Alternative: Vertical Pod Autoscaler for GPU workloads
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vllm-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  updatePolicy:
    updateMode: "Auto"

Note: HPA with CPU metrics is ineffective for GPU-bound inference. Deploy Prometheus Adapter to expose custom metrics from vLLM's /metrics endpoint (e.g., vllm:gpu_cache_usage_perc, vllm:num_requests_waiting), or use Vertical Pod Autoscaler for memory-constrained workloads.

Performance Monitoring

Key metrics for production inference:

  1. Throughput Metrics

    • Tokens per second (TPS): Target >500 TPS for 70B models
    • Requests per second (RPS): Capacity planning metric
    • Batch efficiency: actual_batch_size / max_batch_size
  2. Latency Metrics

    • Time to First Token (TTFT): p50 <100ms, p95 <500ms
    • Time between tokens: Target <50ms for generation
    • Total response time: End-to-end request completion
  3. Resource Metrics

    • GPU memory utilization: Maintain <90% to prevent OOM
    • KV cache hit rate: >80% indicates effective caching
    • CPU utilization: Monitor for offloading overhead
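
A minimal scrape of vLLM's Prometheus endpoint for gauges like these. The parsing is a sketch, not a substitute for a real Prometheus pipeline:

```python
# Minimal scrape of vLLM's Prometheus /metrics endpoint. The metric names are
# vLLM's native series (vllm:num_requests_running, vllm:gpu_cache_usage_perc).
import urllib.request
from typing import Optional

def parse_metric(exposition: str, name: str) -> Optional[float]:
    """Pull the first sample value for `name` from Prometheus text format."""
    for line in exposition.splitlines():
        if line.startswith(name + "{") or line.startswith(name + " "):
            return float(line.rsplit(" ", 1)[-1])
    return None

def scrape_metric(base_url: str, name: str) -> Optional[float]:
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_metric(resp.read().decode(), name)

# usage: scrape_metric("http://vllm-server:8000", "vllm:gpu_cache_usage_perc")
```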

Health Checks and Reliability

vLLM's OpenAI-compatible server exposes a /health endpoint; probing both /health and /v1/health keeps the check portable across versions and reverse-proxy setups:

# Version-aware health check
import requests

def check_health(base_url: str) -> bool:
    for path in ["/health", "/v1/health"]:
        try:
            response = requests.get(f"{base_url}{path}", timeout=5)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            continue
    return False

if check_health("http://vllm-server:8000"):
    print("Service healthy")
else:
    print("Service unhealthy")

Capacity Planning

Calculate required GPU count:

GPUs_needed = daily_tokens / (tokens_per_second_per_gpu * 86400)

Example for 70B model:

  • Target: 10M tokens/day
  • Per GPU: 500 TPS = 43.2M tokens/day
  • Required GPUs: 10M / 43.2M = 0.23 → 1 GPU sufficient
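
As code, with a peak-traffic factor added, since daily averages hide load spikes:

```python
# Capacity formula as code: daily token volume divided by per-GPU daily
# throughput, with a peak factor so the busiest window still fits.
import math

def gpus_needed(daily_tokens: float, tps_per_gpu: float,
                peak_factor: float = 1.0) -> int:
    tokens_per_gpu_per_day = tps_per_gpu * 86_400  # seconds per day
    return math.ceil(daily_tokens * peak_factor / tokens_per_gpu_per_day)

print(gpus_needed(10_000_000, 500))     # 1 GPU covers 10M tokens/day
print(gpus_needed(200_000_000, 500))    # 5 GPUs for 200M tokens/day
```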

Advanced Optimizations

Speculative Decoding

  • 2-3x speedup for predictable outputs
  • Configuration example:
vllm serve TheBloke/Llama-2-70B-AWQ \
    --speculative-model TheBloke/Llama-2-7B-AWQ \
    --num-speculative-tokens 5
  • Draft model: Smaller model generates candidates
  • Verification: Larger model validates in single pass
  • Best for: Structured outputs, low temperature
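
The 2-3x figure follows from the standard speculative-decoding expectation: with draft length k and per-token acceptance rate a, each target-model pass accepts (1 - a^(k+1)) / (1 - a) tokens on average. A simplified model that ignores the draft model's own cost:

```python
# Simplified expected-throughput model for speculative decoding (ignores the
# draft model's cost): with per-token acceptance rate a and draft length k,
# each target-model pass accepts (1 - a**(k+1)) / (1 - a) tokens on average.
def expected_tokens_per_pass(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    print(f"acceptance {a}: ~{expected_tokens_per_pass(a, k=5):.1f} tokens/pass")
```

High acceptance rates (structured output, low temperature) push the expectation toward the top of the 2-3x range; adversarial or high-entropy text pushes it down.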

Prefix Caching

  • 400%+ utilization improvement for standardized prompts
  • Cache hits: Reuse computed KV cache for repeated prefixes
  • Configuration: --enable-prefix-caching (v0.6.x+)
  • Memory overhead: Minimal additional VRAM usage
  • Hash algorithm: newer releases expose --prefix-caching-hash-algo for reproducible, collision-resistant cache keys
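
The headline gain is simple prefill arithmetic. A back-of-envelope example for a fleet of requests sharing a long system prompt (the 1,500/100-token split is an illustrative assumption):

```python
# Prefill-token arithmetic for requests sharing a long system prompt.
# The 1,500-token prefix / 100-token suffix split is an illustrative assumption.
def total_prefill_tokens(num_requests: int, shared_prefix: int,
                         unique_suffix: int, caching: bool) -> int:
    if not caching:
        return num_requests * (shared_prefix + unique_suffix)
    # the first request computes the prefix once; later requests hit the cache
    return shared_prefix + num_requests * unique_suffix

without = total_prefill_tokens(100, 1500, 100, caching=False)  # 160,000 tokens
cached = total_prefill_tokens(100, 1500, 100, caching=True)    # 11,500 tokens
print(round(without / cached, 1))
```

The longer the shared prefix relative to the unique suffix, the larger the savings, which is why standardized prompts benefit most.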

Cross-Instance KV Cache Sharing

  • 3-10x latency reduction for repetitive workloads
  • Distributed cache: Share KV cache across serving instances via external KV-transfer or caching layers (not built into core vLLM)
  • Network overhead: Requires high-bandwidth interconnect
  • Best for: Multi-instance deployments with repetitive context

Implementation Checklist

  1. Quantization Selection

    • Profile model quality vs. quantization level
    • Select AWQ for production quality requirements
    • Calculate memory requirements using accurate quantization formula (include factor of 2 for KV cache)
  2. vLLM Configuration

    • Use vLLM 0.6.x+ with the vllm serve <model> command (the model is a positional argument)
    • Omit --quantization flag for pre-quantized models (auto-detected)
    • Set gpu-memory-utilization to 0.85-0.9
    • Configure max-num-seqs based on memory capacity
    • Enable enable-chunked-prefill and enable-prefix-caching
  3. Production Deployment

    • Implement health checks using /health or /v1/health endpoint
    • Configure monitoring for TPS, TTFT, and memory metrics
    • Set up alerting for latency degradation and OOM conditions
    • Use custom metrics or VPA for Kubernetes autoscaling (not CPU-based HPA)
    • Run containers as non-root user with proper WORKDIR
    • Implement signal handling in entrypoint scripts for graceful shutdown
  4. Optimization

    • Enable speculative decoding for appropriate workloads
    • Configure prefix caching for standardized prompts
    • Implement cross-instance KV cache sharing if applicable
    • Verify Flash Attention 2 support on target GPUs (Ampere+)
    • Enable streaming for chat/completion endpoints
    • Configure LoRA adapters for multi-tenant scenarios

Cost optimization results vary by workload and infrastructure. Benchmark your specific deployment using throughput-per-dollar analysis: compare vLLM's continuous batching against traditional request-per-instance architectures. Key factors include request distribution, sequence length variance, and GPU memory bandwidth. Validate memory calculations against actual production metrics before capacity planning.
