How to Implement Vector Similarity Search at Scale with FAISS and Milvus
Vector similarity search enables efficient retrieval of high-dimensional embeddings for applications like semantic search, RAG, and recommendation systems. This guide covers implementing search at scale using FAISS (library-level indexing) and Milvus (distributed vector database).
Architecture Overview
FAISS is a C++ library with Python bindings optimized for fast similarity search on CPUs and GPUs. It provides core indexing algorithms but lacks database features like persistence, replication, or distributed query coordination.
Milvus is a cloud-native vector database that wraps FAISS indexing with full database capabilities: storage management, distributed architecture, replication, and API access via gRPC/REST.
Use FAISS for embedded systems, GPU-accelerated pipelines, or custom applications where you manage storage and scaling. Use Milvus for production services requiring horizontal scaling, high availability, and built-in persistence.
FAISS Implementation
FAISS operates entirely in-memory. Index selection depends on your accuracy requirements and dataset size.
Index Types
- IndexFlatIP: Exact search with inner product metric. When vectors are L2-normalized, inner product equals cosine similarity. Use for small datasets (<100K vectors) or ground truth validation.
- IndexIVFFlat: Approximate search using inverted file indexing. Requires training step. Balanced speed/accuracy.
- IndexHNSWFlat: Graph-based approximate search (HNSW). Higher memory usage than IVF but better recall; no training step required. Construction of all three index types is sketched below.
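A minimal construction sketch for these three index types, assuming 128-dimensional embeddings and an inner-product metric (the HNSW connectivity of 32 is an illustrative default):

import faiss

d = 128

# Exact inner-product search (cosine similarity once vectors are L2-normalized)
flat = faiss.IndexFlatIP(d)

# IVF over a flat quantizer: approximate, requires a training pass
ivf = faiss.IndexIVFFlat(faiss.IndexFlatIP(d), d, 4096, faiss.METRIC_INNER_PRODUCT)

# HNSW graph with 32 links per node: no training, higher memory
hnsw = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)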
IVF Configuration
The nlist parameter controls partition count. For production scale: 1M vectors use nlist=4096, 10M use nlist=16384, 100M use nlist=65536. The nprobe parameter determines how many partitions to search during query—higher values improve recall at cost of latency.
Critical: IVF training requires minimum vectors for proper k-means clustering. Use at least 39 × nlist vectors. For nlist=4096, minimum 159,744 vectors required.
import faiss
import numpy as np

# Generate sample vectors (128D, 1M vectors)
vectors = np.random.random((1000000, 128)).astype('float32')

# Normalize for inner product (cosine similarity)
faiss.normalize_L2(vectors)

# Create IVF index with nlist=4096; pass METRIC_INNER_PRODUCT explicitly,
# since IndexIVFFlat defaults to L2
quantizer = faiss.IndexFlatIP(128)
index = faiss.IndexIVFFlat(quantizer, 128, 4096, faiss.METRIC_INNER_PRODUCT)

# Train on dataset (requires 39 x nlist vectors minimum)
if len(vectors) >= 39 * 4096:
    index.train(vectors)
    index.add(vectors)
else:
    raise ValueError(f"Insufficient vectors for training: need at least {39*4096}, got {len(vectors)}")

# Search with nprobe=16
index.nprobe = 16
query = np.random.random((1, 128)).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
GPU Acceleration
FAISS GPU indexes transfer data to GPU memory for faster search. Always handle GPU memory limits and provide CPU fallback:
res = faiss.StandardGpuResources()
try:
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
    distances, indices = gpu_index.search(query, k=10)
except RuntimeError as e:
    if "out of memory" in str(e).lower():
        print("GPU OOM, falling back to CPU")
        distances, indices = index.search(query, k=10)
    else:
        raise
finally:
    # Explicit cleanup to prevent memory leaks
    if 'gpu_index' in locals():
        del gpu_index
Error Handling
import time

max_retries = 3
for attempt in range(max_retries):
    try:
        index.train(vectors)
        break
    except RuntimeError as e:
        if attempt == max_retries - 1:
            raise
        print(f"Training attempt {attempt + 1} failed: {e}")
        time.sleep(2 ** attempt)  # Exponential backoff

# Fallback: reduce nlist or use exact search
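One way to realize that fallback, sketched under the assumption that exact search remains affordable at the current dataset size:

try:
    index.train(vectors)
    index.add(vectors)
except RuntimeError:
    # Fall back to exact search: no training step, predictable recall
    index = faiss.IndexFlatIP(128)
    index.add(vectors)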
Milvus Implementation
Milvus provides distributed vector storage with automatic sharding and replication. It supports multiple index types including IVF, HNSW, and DiskANN.
Collection Setup
Milvus organizes vectors into collections with defined schemas and index parameters:
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType
# Connect to Milvus server with timeout
connections.connect(host="localhost", port="19530", timeout=30)
# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields, "vector_collection")
# Create collection
collection = Collection(name="vectors", schema=schema)
# Create IVF index
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",
    "params": {"nlist": 4096}
}
collection.create_index(field_name="embedding", index_params=index_params)
Insert and Search
# Insert vectors: Milvus expects one list of values per (non-auto) field, in schema order
vectors = np.random.random((1000, 128)).astype('float32')
data = vectors.tolist()

# Wrap in an outer list for field grouping and capture auto-generated IDs
insert_result = collection.insert([data])
collection.flush()
# Retrieve auto-generated IDs if needed
auto_ids = insert_result.primary_keys
# Load collection into memory
collection.load()
# Search with nprobe=16
query_vector = np.random.random((1, 128)).astype('float32')
search_params = {"metric_type": "IP", "params": {"nprobe": 16}}
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param=search_params,
    limit=10,
    consistency_level="Bounded"  # "Strong", "Bounded", "Session", or "Eventually"
)

# Access results with auto-generated IDs
for hit in results[0]:
    print(f"ID: {hit.id}, Distance: {hit.distance}")
DiskANN for Large Datasets
DiskANN enables disk-based indexing for datasets exceeding RAM capacity. Requires NVMe SSD for optimal performance:
# docker-compose.yml - Add NVMe volume mount
volumes:
  - /mnt/nvme:/var/lib/milvus
# DiskANN index creation; build-time parameters are optional in Milvus 2.x,
# and most DiskANN tuning (PQ budget, beamwidth) lives in milvus.yaml
index_params = {
    "index_type": "DISKANN",
    "metric_type": "IP",
    "params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
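At query time, DiskANN is tuned with the search_list candidate-list size rather than nprobe. A short sketch reusing the collection and query vector from earlier (search_list=100 is an illustrative value and must be at least the requested limit):

search_params = {"metric_type": "IP", "params": {"search_list": 100}}
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param=search_params,
    limit=10
)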
Distributed Scaling
Milvus distributes incoming data across shards (insert channels) and balances the resulting segments across query nodes. Configure the shard count at collection creation:
collection = Collection(
    name="vectors",
    schema=schema,
    num_shards=4  # Distribute across 4 shards
)
Production Considerations
Memory Sizing
Calculate memory requirements: Dimension × 4 bytes × Vector Count × Index Overhead. Overhead varies significantly by index type and configuration:
- IVF_FLAT: Base memory plus roughly 5% for centroids and inverted-list bookkeeping. For 1M 128D vectors with nlist=4096: 128 × 4 × 1,000,000 × 1.05 ≈ 538 MB
- HNSW: Graph overhead adds 30-50% memory (depends on efConstruction/M parameters)
- IVF_PQ: Reduces memory by 4-8x via quantization
Add 50% buffer for query operations, system overhead, and temporary allocations during index building.
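As a rough illustration, the formula can be wrapped in a small helper (the overhead and buffer factors are the assumptions described above, not measured values):

def estimate_index_memory_gb(num_vectors, dim, index_overhead=1.05, op_buffer=1.5):
    # float32 vectors x index overhead x operational buffer, in decimal GB
    base_bytes = num_vectors * dim * 4
    return base_bytes * index_overhead * op_buffer / 1e9

# 1M 128D vectors in IVF_FLAT with a 50% operational buffer ≈ 0.81 GB
print(f"{estimate_index_memory_gb(1_000_000, 128):.2f} GB")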
Consistency Levels
Milvus supports multiple consistency levels for search operations:
- Strong: Guarantees reads reflect latest writes. Use for financial transactions or critical updates where data freshness is paramount. Highest latency.
- Bounded: Default balance. Reads may lag slightly behind writes but typically within milliseconds. Use for most RAG and search applications.
- Session: Reads reflect all writes in current session. Use for write-then-read workflows within a single connection.
- Eventually: Lowest latency, may return stale data. Use for high-throughput recommendation systems where slight staleness is acceptable.
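Consistency can be set as a collection default and overridden per request, as in the earlier search call. A brief sketch, reusing the schema, query vector, and search parameters defined above (the Bounded default here is an assumption about your workload):

# Collection-level default; individual calls can still override it
collection = Collection(name="vectors", schema=schema, consistency_level="Bounded")

# Freshness-critical read overrides the default with Strong consistency
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param=search_params,
    limit=10,
    consistency_level="Strong"
)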
Index Build Times
Index build time scales with dataset size, index type, and hardware specifications. Estimates assume modern hardware (16+ CPU cores, NVMe SSD, 64GB+ RAM):
- IVF_FLAT: 1M vectors ~ 1-3 minutes (CPU-bound), 100M vectors ~ 1-3 hours
- HNSW: 1M vectors ~ 3-10 minutes, 100M vectors ~ 6-18 hours (memory-intensive)
- DiskANN: Build time similar to HNSW but enables larger-than-RAM datasets (I/O-bound)
GPU acceleration can reduce IVF build times by 3-5x for compatible index types.
Index Lifecycle Management
For production systems requiring continuous updates:
- Index builds run asynchronously on the server; monitor progress with utility.index_building_progress() or block with utility.wait_for_index_building_complete().
- Consider hot-swapping collections behind an alias for zero-downtime index rebuilds (see the sketch below).
- Set up automated compaction to reclaim space from deleted vectors
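A hedged sketch of the alias-based hot swap (the collection names are hypothetical; the application always queries the alias):

from pymilvus import utility

# Application traffic targets the alias, not a concrete collection
utility.create_alias(collection_name="vectors_v1", alias="vectors_live")

# ... build, index, and load "vectors_v2" in the background ...

# Repoint the alias once the new collection is ready
utility.alter_alias(collection_name="vectors_v2", alias="vectors_live")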
Error Handling
from pymilvus.exceptions import MilvusException
import time
def milvus_insert_with_retry(collection, data, max_retries=3, timeout=30):
    for attempt in range(max_retries):
        try:
            result = collection.insert(data, timeout=timeout)
            collection.flush()
            return result
        except MilvusException as e:
            if attempt == max_retries - 1:
                raise
            if "connection" in str(e).lower():
                time.sleep(2 ** attempt)  # Exponential backoff
            elif "timeout" in str(e).lower():
                # Halve the batch and retry with a smaller payload
                data = [data[0][:len(data[0]) // 2]]
            else:
                raise
Optimization Strategies
Memory Management
- FAISS: Use IndexIVFPQ or IndexScalarQuantizer to reduce the memory footprint by 4-8x with minimal accuracy loss (see the sketch below).
- Milvus: Enable disk-based indexing (DiskANN) for datasets exceeding RAM capacity.
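A minimal FAISS product-quantization sketch, reusing the normalized vectors and query from the earlier example. The 16×8-bit PQ layout is an illustrative choice (16 bytes per stored vector instead of 512 bytes of raw float32); with L2-normalized vectors, L2 ranking matches cosine ranking:

quantizer = faiss.IndexFlatL2(128)
pq_index = faiss.IndexIVFPQ(quantizer, 128, 4096, 16, 8)  # 16 sub-quantizers x 8 bits

pq_index.train(vectors)  # same 39 x nlist training minimum applies
pq_index.add(vectors)

pq_index.nprobe = 16
distances, indices = pq_index.search(query, k=10)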
Batching
Batch insert operations for better throughput:
# Milvus batch insert (recommended batch size: 256-1024)
batch_size = 512
for i in range(0, len(vectors), batch_size):
    batch = vectors[i:i+batch_size]
    collection.insert([batch.tolist()])  # Single outer list for field grouping
collection.flush()
Index Selection Trade-offs
| Index | Memory | Build Time | Query Speed | Recall |
|---|---|---|---|---|
| IVF_FLAT | Medium | Fast | Fast | Medium |
| HNSW | High | Slow | Very Fast | High |
| IVF_PQ | Low | Medium | Fast | Low-Medium |
| DISKANN | Low (disk) | Medium | Fast | High |
Use HNSW for high-recall applications like semantic search. Use IVF_PQ for large-scale recommendation systems where slight accuracy degradation is acceptable. Use DISKANN when dataset exceeds RAM capacity.
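For example, an HNSW index in Milvus is built with graph parameters and tuned at query time via ef (M=16, efConstruction=200, and ef=64 are illustrative starting points; drop any existing index on the field first):

hnsw_params = {
    "index_type": "HNSW",
    "metric_type": "IP",
    "params": {"M": 16, "efConstruction": 200}
}
collection.create_index(field_name="embedding", index_params=hnsw_params)

# ef controls the search-time candidate pool and must be >= limit
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=10
)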
Getting Started
- Prototype with FAISS: Start with IndexFlatIP for exact search on small datasets to validate embedding quality.
- Benchmark index types: Test IVF vs HNSW on your own data, tuning the nprobe/ef parameters.
- Evaluate scaling needs: If the dataset exceeds single-machine RAM or requires high availability, migrate to Milvus.
- Configure Milvus cluster: Deploy with Docker Compose for development or Kubernetes for production:
# docker-compose.yml - Compatible versions
version: '3.5'
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.9
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    volumes:
      - /mnt/etcd:/etcd
  minio:
    image: minio/minio:RELEASE.2024-01-16T16-07-38Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - /mnt/minio:/minio_data
  standalone:
    image: milvusdb/milvus:v2.4.0
    ports:
      - "19530:19530"
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - /mnt/nvme:/var/lib/milvus
- Monitor performance: Track QPS, P99 latency, and recall metrics. Adjust nprobe based on latency requirements; a recall-measurement sketch follows this list.
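A hedged recall-measurement sketch against exact ground truth, reusing the FAISS index and vectors from the implementation section (1,000 held-out queries and recall@10 are illustrative choices):

import faiss
import numpy as np

queries = vectors[:1000]

# Exact ground truth from a flat index
ground_truth = faiss.IndexFlatIP(128)
ground_truth.add(vectors)
_, gt_ids = ground_truth.search(queries, 10)

# Approximate results at the current nprobe setting
index.nprobe = 16
_, approx_ids = index.search(queries, 10)

recall_at_10 = np.mean([
    len(set(gt_ids[i]) & set(approx_ids[i])) / 10 for i in range(len(queries))
])
print(f"recall@10 at nprobe=16: {recall_at_10:.3f}")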
For production deployments, use Milvus Operator on Kubernetes for automated scaling and failover. Configure resource limits per pod based on index type and expected query volume. Set replication factor >=2 for high availability.