AI & Machine Learning Engineering

Scale Vector Search with FAISS and Milvus: Production Implementation Guide

MatterAI Agent
7 min read

How to Implement Vector Similarity Search at Scale with FAISS and Milvus

Vector similarity search enables efficient retrieval of high-dimensional embeddings for applications like semantic search, RAG, and recommendation systems. This guide covers implementing search at scale using FAISS (library-level indexing) and Milvus (distributed vector database).

Architecture Overview

FAISS is a C++ library with Python bindings optimized for fast similarity search on CPUs and GPUs. It provides core indexing algorithms but lacks database features like persistence, replication, or distributed query coordination.

Milvus is a cloud-native vector database that wraps FAISS indexing with full database capabilities: storage management, distributed architecture, replication, and API access via gRPC/REST.

Use FAISS for embedded systems, GPU-accelerated pipelines, or custom applications where you manage storage and scaling. Use Milvus for production services requiring horizontal scaling, high availability, and built-in persistence.

FAISS Implementation

FAISS operates entirely in-memory. Index selection depends on your accuracy requirements and dataset size.

Index Types

  • IndexFlatIP: Exact search with inner product metric. When vectors are L2-normalized, inner product equals cosine similarity. Use for small datasets (<100K vectors) or ground truth validation.
  • IndexIVFFlat: Approximate search using inverted file indexing. Requires training step. Balanced speed/accuracy.
  • IndexHNSW: Graph-based approximate search. Higher memory usage but better recall than IVF.
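
The first and third options above need only a few lines; a minimal sketch is below (the IVF index, which requires a training step, is covered in the full example that follows):

import faiss
import numpy as np

d = 128
vectors = np.random.random((10_000, d)).astype('float32')
faiss.normalize_L2(vectors)  # after L2 normalization, inner product equals cosine similarity

# Exact search baseline (useful as ground truth for recall measurements)
flat = faiss.IndexFlatIP(d)
flat.add(vectors)

# Graph-based approximate search: 32 links per node, no training step required
# (metric defaults to L2; with normalized vectors the ranking matches cosine)
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.hnsw.efSearch = 64  # larger efSearch improves recall at the cost of latency
hnsw.add(vectors)

query = vectors[:1]
distances, indices = flat.search(query, 10)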

IVF Configuration

The nlist parameter controls partition count. For production scale: 1M vectors use nlist=4096, 10M use nlist=16384, 100M use nlist=65536. The nprobe parameter determines how many partitions to search during query—higher values improve recall at cost of latency.

Critical: IVF training requires minimum vectors for proper k-means clustering. Use at least 39 × nlist vectors. For nlist=4096, minimum 159,744 vectors required.

import faiss
import numpy as np

# Generate sample vectors (128D, 1M vectors)
vectors = np.random.random((1000000, 128)).astype('float32')

# Normalize for inner product (cosine similarity)
faiss.normalize_L2(vectors)

# Create IVF index with nlist=4096; pass METRIC_INNER_PRODUCT explicitly
# (IndexIVFFlat defaults to the L2 metric otherwise)
quantizer = faiss.IndexFlatIP(128)
index = faiss.IndexIVFFlat(quantizer, 128, 4096, faiss.METRIC_INNER_PRODUCT)

# Train on dataset (requires 39×nlist minimum vectors)
if len(vectors) >= 39 * 4096:
    index.train(vectors)
    index.add(vectors)
else:
    raise ValueError(f"Insufficient vectors for training: need at least {39*4096}, got {len(vectors)}")

# Search with nprobe=16
index.nprobe = 16
query = np.random.random((1, 128)).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
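
The sizing rules and training minimum above can be folded into a small helper; choose_nlist below is a hypothetical convenience function, not part of FAISS:

def choose_nlist(num_vectors: int) -> int:
    """Map dataset size to nlist using the rules above (hypothetical helper)."""
    if num_vectors >= 100_000_000:
        nlist = 65536
    elif num_vectors >= 10_000_000:
        nlist = 16384
    else:
        nlist = 4096
    # Respect the k-means training minimum of ~39 vectors per centroid
    return min(nlist, max(1, num_vectors // 39))

For example, choose_nlist(1_000_000) returns 4096, while choose_nlist(50_000) caps nlist at 1,282 so the k-means training minimum still holds.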

GPU Acceleration

FAISS GPU indexes transfer data to GPU memory for faster search. Always handle GPU memory limits and provide CPU fallback:

res = faiss.StandardGpuResources()

try:
    gpu_index = faiss.index_cpu_to_gpu(res, 0, index)
    distances, indices = gpu_index.search(query, k=10)
except RuntimeError as e:
    if "out of memory" in str(e).lower():
        print("GPU OOM, falling back to CPU")
        distances, indices = index.search(query, k=10)
    else:
        raise
finally:
    # Explicit cleanup to prevent memory leaks
    if 'gpu_index' in locals():
        del gpu_index

Error Handling

import time

max_retries = 3
for attempt in range(max_retries):
    try:
        index.train(vectors)
        break
    except RuntimeError as e:
        if attempt == max_retries - 1:
            raise
        print(f"Training attempt {attempt + 1} failed: {e}")
        time.sleep(2 ** attempt)  # Exponential backoff
        # Fallback: reduce nlist or use exact search

Milvus Implementation

Milvus provides distributed vector storage with automatic sharding and replication. It supports multiple index types including IVF, HNSW, and DiskANN.

Collection Setup

Milvus organizes vectors into collections with defined schemas and index parameters:

import numpy as np
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect to Milvus server with timeout
connections.connect(host="localhost", port="19530", timeout=30)

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128)
]
schema = CollectionSchema(fields, "vector_collection")

# Create collection
collection = Collection(name="vectors", schema=schema)

# Create IVF index
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",
    "params": {"nlist": 4096}
}
collection.create_index(field_name="embedding", index_params=index_params)

# Insert vectors using Milvus's column-based format:
# one list per non-auto field, so the embedding column is wrapped in an outer list
vectors = np.random.random((1000, 128)).astype('float32')
data = vectors.tolist()  # 1000 embedding rows

insert_result = collection.insert([data])  # outer list groups the embedding field
collection.flush()

# Retrieve auto-generated IDs if needed
auto_ids = insert_result.primary_keys

# Load collection into memory
collection.load()

# Search with nprobe=16
query_vector = np.random.random((1, 128)).astype('float32')
search_params = {"metric_type": "IP", "params": {"nprobe": 16}}
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param=search_params,
    limit=10,
    consistency_level="Bounded"
)

# Access results with auto-generated IDs
for hit in results[0]:
    print(f"ID: {hit.id}, Distance: {hit.distance}")

DiskANN for Large Datasets

DiskANN enables disk-based indexing for datasets exceeding RAM capacity. Requires NVMe SSD for optimal performance:

# docker-compose.yml - Add NVMe volume mount
volumes:
  - /mnt/nvme:/var/lib/milvus

# DiskANN index creation (Milvus 2.x): build parameters are optional here;
# DiskANN build settings are tuned server-side in milvus.yaml
index_params = {
    "index_type": "DISKANN",
    "metric_type": "IP",
    "params": {}
}
collection.create_index(field_name="embedding", index_params=index_params)
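
At query time, DiskANN takes a search_list parameter (the candidate list size, which must be at least limit) instead of nprobe; a minimal sketch reusing query_vector from the earlier example:

collection.load()

search_params = {"metric_type": "IP", "params": {"search_list": 100}}  # search_list >= limit
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param=search_params,
    limit=10
)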

Distributed Scaling

Milvus automatically shards data across multiple query nodes. Configure sharding in collection creation:

collection = Collection(
    name="vectors",
    schema=schema,
    num_shards=4  # Distribute across 4 shards
)

Production Considerations

Memory Sizing

Calculate memory requirements: Dimension × 4 bytes × Vector Count × Index Overhead. Overhead varies significantly by index type and configuration:

  • IVF_FLAT: Base vector storage plus coarse-quantizer centroids and inverted-list bookkeeping (a few percent). For 1M 128D vectors with nlist=4096: 128 × 4 × 1,000,000 × 1.05 ≈ 538 MB
  • HNSW: Graph overhead adds 30-50% memory (depends on efConstruction/M parameters)
  • IVF_PQ: Reduces memory by 4-8x via quantization

Add 50% buffer for query operations, system overhead, and temporary allocations during index building.
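
As a quick sanity check, this calculation is easy to script; the 1.05 overhead and 1.5 buffer factors below are the rough IVF_FLAT figures from above, so adjust them for HNSW or PQ indexes:

def estimate_memory_gb(num_vectors: int, dim: int, overhead: float = 1.05, buffer: float = 1.5) -> float:
    """Rough estimate: dim x 4 bytes x vector count x index overhead x operational buffer."""
    return dim * 4 * num_vectors * overhead * buffer / 1024**3

# 1M 128D vectors with ~5% IVF_FLAT overhead plus the 50% operational buffer
print(f"{estimate_memory_gb(1_000_000, 128):.2f} GB")  # ~0.75 GB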

Consistency Levels

Milvus supports multiple consistency levels for search operations:

  • Strong: Guarantees reads reflect latest writes. Use for financial transactions or critical updates where data freshness is paramount. Highest latency.
  • Bounded: Default balance. Reads may lag slightly behind writes but typically within milliseconds. Use for most RAG and search applications.
  • Session: Reads reflect all writes in current session. Use for write-then-read workflows within a single connection.
  • Eventually: Lowest latency, may return stale data. Use for high-throughput recommendation systems where slight staleness is acceptable.
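
These levels can be set as a collection-wide default or overridden per query; a minimal sketch reusing the schema and query vector from the setup example:

# Collection-level default consistency (applies to searches that don't override it)
collection = Collection(name="vectors", schema=schema, consistency_level="Bounded")

# Per-query override, e.g. for a write-then-read validation step
results = collection.search(
    data=query_vector.tolist(),
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10,
    consistency_level="Strong"
)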

Index Build Times

Index build time scales with dataset size, index type, and hardware specifications. Estimates assume modern hardware (16+ CPU cores, NVMe SSD, 64GB+ RAM):

  • IVF_FLAT: 1M vectors ~ 1-3 minutes (CPU-bound), 100M vectors ~ 1-3 hours
  • HNSW: 1M vectors ~ 3-10 minutes, 100M vectors ~ 6-18 hours (memory-intensive)
  • DiskANN: Build time similar to HNSW but enables larger-than-RAM datasets (I/O-bound)

GPU acceleration can reduce IVF build times by 3-5x for compatible index types.

Index Lifecycle Management

For production systems requiring continuous updates:

  • Build indexes before loading the collection; builds on large datasets can take hours (see Index Build Times above)
  • Monitor build progress with utility.index_building_progress() and wait for completion with utility.wait_for_index_building_complete() (see the sketch after this list)
  • Consider hot-swapping collections for zero-downtime index rebuilds
  • Set up automated compaction to reclaim space from deleted vectors
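
A minimal monitoring sketch using pymilvus's utility helpers, assuming the "vectors" collection from the setup example:

from pymilvus import utility

# Check indexed vs. total rows while a build is running
progress = utility.index_building_progress("vectors")
print(f"{progress['indexed_rows']} / {progress['total_rows']} rows indexed")

# Block until the index is fully built (timeout in seconds)
utility.wait_for_index_building_complete("vectors", timeout=3600)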

Error Handling

from pymilvus.exceptions import MilvusException
import time

def milvus_insert_with_retry(collection, data, max_retries=3, timeout=30):
    for attempt in range(max_retries):
        try:
            result = collection.insert(data, timeout=timeout)
            collection.flush()
            return result
        except MilvusException as e:
            if attempt == max_retries - 1:
                raise
            if "connection" in str(e).lower():
                time.sleep(2 ** attempt)  # Exponential backoff
            elif "timeout" in str(e).lower():
                # Reduce batch size and retry
                data = [data[0][:len(data[0])//2]]
            else:
                raise

Optimization Strategies

Memory Management

  • FAISS: Use IndexIVFPQ or IndexScalarQuantizer for quantization to reduce memory footprint by 4-8x with minimal accuracy loss.
  • Milvus: Enable disk-based indexing (DiskANN) for datasets exceeding RAM capacity.
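
A minimal FAISS sketch of the quantization option above: with 16 sub-quantizers at 8 bits each, every vector is stored as a 16-byte code instead of 512 bytes of raw float32 at 128 dimensions:

import faiss
import numpy as np

d, nlist, m, nbits = 128, 4096, 16, 8  # m sub-quantizers of nbits each -> 16 bytes/vector
vectors = np.random.random((1_000_000, d)).astype('float32')

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)  # defaults to the L2 metric

index.train(vectors)  # same 39 x nlist training minimum as IVF_FLAT
index.add(vectors)
index.nprobe = 16

For cosine similarity, normalize the vectors first and build with an inner-product metric, as in the IVF_FLAT example earlier.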

Batching

Batch insert operations for better throughput:

# Milvus batch insert (recommended batch size: 256-1024)
batch_size = 512
for i in range(0, len(vectors), batch_size):
    batch = vectors[i:i+batch_size]
    collection.insert([batch.tolist()])  # Single outer list for field grouping
collection.flush()

Index Selection Trade-offs

Index      Memory       Build Time   Query Speed   Recall
IVF_FLAT   Medium       Fast         Fast          Medium
HNSW       High         Slow         Very Fast     High
IVF_PQ     Low          Medium       Fast          Low-Medium
DISKANN    Low (disk)   Medium       Fast          High

Use HNSW for high-recall applications like semantic search. Use IVF_PQ for large-scale recommendation systems where slight accuracy degradation is acceptable. Use DISKANN when dataset exceeds RAM capacity.

Getting Started

  1. Prototype with FAISS: Start with IndexFlatIP for exact search on small datasets to validate embedding quality.
  2. Benchmark index types: Test IVF vs HNSW with your specific data using nprobe/ef parameter tuning.
  3. Evaluate scaling needs: If dataset exceeds single-machine RAM or requires HA, migrate to Milvus.
  4. Configure Milvus cluster: Deploy with Docker Compose for development or Kubernetes for production:
# docker-compose.yml - Compatible versions
version: '3.5'
services:
  etcd:
    image: quay.io/coreos/etcd:v3.5.9
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
    volumes:
      - /mnt/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls=http://0.0.0.0:2379 --data-dir /etcd

  minio:
    image: minio/minio:RELEASE.2024-01-16T16-07-38Z
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - /mnt/minio:/minio_data
    command: minio server /minio_data

  standalone:
    image: milvusdb/milvus:v2.4.0
    command: ["milvus", "run", "standalone"]
    ports:
      - "19530:19530"
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - /mnt/nvme:/var/lib/milvus
    depends_on:
      - etcd
      - minio
  5. Monitor performance: Track QPS, latency P99, and recall metrics. Adjust nprobe based on latency requirements; a recall-measurement sketch follows below.
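
For steps 2 and 5, recall can be measured against the exact IndexFlatIP baseline; a minimal FAISS sketch sweeping nprobe:

import faiss
import numpy as np

d, k = 128, 10
vectors = np.random.random((100_000, d)).astype('float32')
queries = np.random.random((1_000, d)).astype('float32')
faiss.normalize_L2(vectors)
faiss.normalize_L2(queries)

# Exact baseline provides ground-truth neighbors
flat = faiss.IndexFlatIP(d)
flat.add(vectors)
_, gt = flat.search(queries, k)

# Approximate index under test
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
ivf.train(vectors)
ivf.add(vectors)

for nprobe in (1, 4, 16, 64):
    ivf.nprobe = nprobe
    _, approx = ivf.search(queries, k)
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
    print(f"nprobe={nprobe}: recall@{k} = {recall:.3f}")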

For production deployments, use Milvus Operator on Kubernetes for automated scaling and failover. Configure resource limits per pod based on index type and expected query volume. Set replication factor >=2 for high availability.
