
Master HNSW Parameter Tuning for Billion-Scale Vector Search in Milvus and Pinecone

MatterAI

Vector Database Tuning: Optimizing HNSW Parameters in Milvus and Pinecone for Billion-Scale Search

HNSW (Hierarchical Navigable Small World) indexing powers billion-scale vector search by trading off recall, latency, and memory while keeping search cost roughly logarithmic in the number of vectors. This guide covers parameter tuning for Milvus and Pinecone deployments handling 100M+ vectors.

Core HNSW Parameters

M (Max Connections): Maximum number of edges per node in the graph. In hnswlib-style implementations the limit is M on the upper layers and 2M on the base layer. Higher values improve recall but increase memory usage and slow index building.

efConstruction: Size of the candidate list used while building the index. Higher values create better graph connectivity, improving recall. A better graph reaches the target recall at lower efSearch values, which indirectly reduces query latency at the cost of longer build times.

efSearch: Size of the candidate list used at query time. Higher values improve recall but increase latency roughly linearly.
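
To make the role of efSearch more concrete, here is a simplified sketch of the bounded best-first search that HNSW runs within a single layer. It is a pure-Python toy, not the Milvus or hnswlib implementation; `search_layer`, the chain-shaped graph, and the 1-D vectors are all illustrative assumptions.

```python
import heapq
import math


def search_layer(graph, vectors, query, entry, ef):
    """Best-first search over one graph layer, keeping at most `ef` results."""

    def dist(node):
        return math.dist(vectors[node], query)

    visited = {entry}
    d0 = dist(entry)
    candidates = [(d0, entry)]  # min-heap: closest unexpanded node first
    best = [(-d0, entry)]       # max-heap via negation: the ef closest so far

    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break  # nearest candidate is already worse than the worst kept result
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            dn = dist(neighbor)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, neighbor))
                heapq.heappush(best, (-dn, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result

    return sorted((-neg_d, node) for neg_d, node in best)


# Toy data: ten points on a line, each linked to its immediate neighbors.
vectors = [(float(i),) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

result = search_layer(graph, vectors, (7.2,), entry=0, ef=3)
print([node for _, node in result])  # prints [7, 8, 6]
```

A larger `ef` keeps more candidates alive, so the search explores more of the graph before stopping. That is why recall and latency both rise with efSearch.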

Parameter Impact Matrix

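| Parameter        | Memory | Build Time | Query Latency                     | Recall |
| ---------------- | ------ | ---------- | --------------------------------- | ------ |
| M ↑              | ↑      | ↑          | ↑ (at fixed efSearch)             | ↑      |
| efConstruction ↑ | -      | ↑          | ↓ (indirect, via lower efSearch)  | ↑      |
| efSearch ↑       | -      | -          | ↑                                 | ↑      |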

Milvus Implementation

Index Configuration

from pymilvus import MilvusClient

# Billion-scale HNSW configuration
index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={
        "M": 32,              # Max connections per node
        "efConstruction": 200 # Build-time search scope
    }
)

# Search parameters
search_params = {"params": {"ef": 512}}
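
Because ef bounds the size of the candidate list, an ef smaller than the requested number of results cannot produce a meaningful top-k, and Milvus documentation has listed top_k as the lower bound for ef. The guard below is a minimal sketch of that rule; `make_hnsw_search_params` is a hypothetical helper, not part of pymilvus.

```python
def make_hnsw_search_params(ef: int, limit: int) -> dict:
    """Build HNSW search params, raising ef to at least the result limit.

    Hypothetical helper: assumes ef must be >= limit (top-k).
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return {"params": {"ef": max(ef, limit)}}


print(make_hnsw_search_params(ef=64, limit=100))
print(make_hnsw_search_params(ef=512, limit=10))
```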

Memory Requirements

HNSW memory consists of vector data plus graph overhead:

  • Vector data: num_vectors × dimensions × 4 bytes
  • Graph overhead: num_vectors × M × 8 bytes (neighbor IDs as 64-bit integers)
  • Auxiliary structures: ~10-15% overhead

Example for 1B vectors at 768 dimensions with M=32:

  • Vector data: 1B × 768 × 4 = 3.07 TB
  • Graph overhead: 1B × 32 × 8 = 256 GB
  • Total (with 15% auxiliary): ~3.8 TB

This scale requires quantization or distributed deployment.
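
The arithmetic above can be wrapped in a small estimator for trying other scales. This is a sketch of the formula as stated in this section (float32 vectors, 8 bytes of link storage per unit of M, 15% auxiliary overhead); `estimate_hnsw_memory_bytes` is an illustrative name, and real deployments should be measured rather than estimated.

```python
def estimate_hnsw_memory_bytes(num_vectors, dim, m,
                               bytes_per_float=4,
                               bytes_per_link=8,
                               aux_overhead=0.15):
    """Rough HNSW footprint: vectors + graph links, plus auxiliary overhead."""
    vector_bytes = num_vectors * dim * bytes_per_float
    graph_bytes = num_vectors * m * bytes_per_link
    total_bytes = (vector_bytes + graph_bytes) * (1 + aux_overhead)
    return {
        "vector_bytes": vector_bytes,
        "graph_bytes": graph_bytes,
        "total_bytes": total_bytes,
    }


estimate = estimate_hnsw_memory_bytes(1_000_000_000, dim=768, m=32)
for name, value in estimate.items():
    print(f"{name}: {value / 1e12:.2f} TB")
```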

Pinecone Implementation

Pod-Based Configuration

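Pinecone manages its index internals and does not expose graph parameters. Users cannot tune M or efConstruction directly; capacity and throughput are controlled through pod type and pod count.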

from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="billion-scale-index",
    dimension=768,
    metric="cosine",
    spec=PodSpec(
        environment="us-west1-gcp",
        pod_type="p1.x1",
        pods=10
    )
)

Serverless Configuration

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="serverless-index",
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

Pinecone serverless fully abstracts index parameters. HNSW tuning is not exposed to users.

Billion-Scale Optimization

Quantization Options

For memory-constrained deployments, Milvus supports quantization through separate index types or scalar quantization on vector fields:

# IVF with Scalar Quantization (4x memory reduction)
index_params.add_index(
    field_name="embedding",
    index_type="IVF_SQ8",
    metric_type="COSINE",
    params={"nlist": 4096}
)

# IVF with Product Quantization (compression depends on m and nbits)
index_params.add_index(
    field_name="embedding",
    index_type="IVF_PQ",
    metric_type="COSINE",
    params={
        "nlist": 4096,
        "m": 16,    # Number of subquantizers
        "nbits": 8  # Bits per subquantizer
    }
)

For HNSW with reduced memory, apply scalar quantization to the vector field before indexing, or consider GPU indexes like GPU_IVF_PQ for billion-scale deployments.
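
To compare these options, it helps to compute the size of the stored codes per vector. The sketch below counts only the quantized codes, assuming float32 input, and ignores codebooks, IVF centroids, and IDs, so real savings will be somewhat lower. The function names are illustrative.

```python
import math


def sq8_code_bytes(dim: int) -> int:
    """Scalar quantization to 8 bits stores one byte per dimension."""
    return dim


def pq_code_bytes(m: int, nbits: int) -> int:
    """Product quantization stores m sub-codes of nbits bits each."""
    return math.ceil(m * nbits / 8)


def compression_ratio(dim: int, code_bytes: int, bytes_per_float: int = 4) -> float:
    """Ratio of raw float32 vector size to quantized code size."""
    return dim * bytes_per_float / code_bytes


dim = 768
print("SQ8:", sq8_code_bytes(dim), "bytes,",
      compression_ratio(dim, sq8_code_bytes(dim)), "x")
print("PQ (m=16, nbits=8):", pq_code_bytes(16, 8), "bytes,",
      compression_ratio(dim, pq_code_bytes(16, 8)), "x")
```

The PQ ratio depends strongly on m. Fewer subquantizers compress harder but usually cost recall, so the setting should be checked against ground truth rather than picked for size alone.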

Golden Ratio Parameters

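| Scale | M  | efConstruction | efSearch | Memory/Vector (768d) |
| ----- | -- | -------------- | -------- | -------------------- |
| 100M  | 16 | 128            | 256      | ~3.6 KB              |
| 500M  | 24 | 160            | 384      | ~3.7 KB              |
| 1B+   | 32 | 200            | 512      | ~3.8 KB              |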
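
These starting points can be encoded as a lookup so that build and search settings stay consistent across scripts. The sketch below assumes each row's scale is an upper bound for that tier, which is an interpretation rather than something the table states; `starting_params` is an illustrative name.

```python
# (upper bound on vector count, starting parameters) taken from the table above
STARTING_POINTS = [
    (100_000_000, {"M": 16, "efConstruction": 128, "ef": 256}),
    (500_000_000, {"M": 24, "efConstruction": 160, "ef": 384}),
]
LARGEST_TIER = {"M": 32, "efConstruction": 200, "ef": 512}


def starting_params(num_vectors: int) -> dict:
    """Return starting HNSW parameters for a collection of the given size."""
    for upper_bound, params in STARTING_POINTS:
        if num_vectors <= upper_bound:
            return dict(params)
    return dict(LARGEST_TIER)


params = starting_params(250_000_000)
index_params = {"M": params["M"], "efConstruction": params["efConstruction"]}
search_params = {"params": {"ef": params["ef"]}}
print(index_params, search_params)
```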

Distributed Deployment Architecture

True billion-scale deployments require distributed architecture:

  • Sharding: Partition data across multiple query nodes using collection segments or explicit sharding keys
  • Query routing: Route queries to relevant shards using metadata filtering or routing proxies
  • Replication: Deploy replica sets for high availability and read throughput scaling
  • Resource isolation: Separate index build nodes from query nodes to prevent build-time latency spikes
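
A first-pass node count for such a deployment follows from the memory estimate. The sketch below assumes the index is spread evenly across query nodes and that only a fraction of each node's RAM is available for index data; the 70% default and the function name are assumptions for illustration.

```python
import math


def query_nodes_needed(index_bytes: float,
                       node_ram_bytes: float,
                       usable_fraction: float = 0.7,
                       replicas: int = 1) -> int:
    """Nodes required to hold the index in memory, multiplied by replica count."""
    usable_per_node = node_ram_bytes * usable_fraction
    nodes_per_replica = math.ceil(index_bytes / usable_per_node)
    return nodes_per_replica * replicas


# ~3.83 TB index (1B vectors, 768d, M=32) on nodes with 256 GB RAM
print(query_nodes_needed(3.8272e12, 256e9))              # one replica
print(query_nodes_needed(3.8272e12, 256e9, replicas=2))  # two replicas
```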

Workload Considerations

Update/Delete Performance: HNSW is optimized for static and append-heavy workloads. Frequent updates and deletes cause graph fragmentation, degrading recall and increasing latency over time. For high-churn workloads:

  • Schedule periodic index rebuilds during low-traffic windows
  • Consider IVF indexes which handle updates more gracefully
  • Use soft deletes with periodic compaction
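
One simple way to decide when a rebuild is due is to track the share of soft-deleted entities. The check below is a sketch; the 20% threshold is an arbitrary placeholder to be tuned against measured recall, and `should_rebuild` is a hypothetical helper.

```python
def should_rebuild(total_entities: int,
                   deleted_entities: int,
                   max_deleted_ratio: float = 0.2) -> bool:
    """Flag a rebuild once soft-deleted entities exceed the allowed share."""
    if total_entities <= 0:
        return False
    return deleted_entities / total_entities >= max_deleted_ratio


print(should_rebuild(1_000_000, 250_000))  # 25% deleted
print(should_rebuild(1_000_000, 50_000))   # 5% deleted
```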

Warm-up Strategy: Cold HNSW queries can be 2-3x slower until graph pages are cached in memory. Implement warm-up before production traffic:

def warmup_index(collection, sample_queries, iterations=100):
    """Pre-load graph pages into memory cache."""
    for _ in range(iterations):
        collection.search(
            data=sample_queries,
            anns_field="embedding",
            param={"params": {"ef": 64}},
            limit=10
        )

Latency-Recall Trade-offs

import time
from pymilvus import Collection

def benchmark_hnsw_params(collection, test_vectors, ground_truth, ef_values, k=100):
    """Benchmark recall and latency across efSearch values."""
    results = {"recall": [], "latency_ms": []}
    
    for ef in ef_values:
        search_params = {"params": {"ef": ef}}
        
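        # perf_counter is monotonic and better suited to timing than time.time
        start = time.perf_counter()
        search_results = collection.search(
            data=test_vectors,
            anns_field="embedding",
            param=search_params,
            limit=k
        )
        # Batch time divided by query count: an amortized per-query figure
        latency_ms = (time.perf_counter() - start) * 1000 / len(test_vectors)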
        
        # Calculate recall@k
        total_hits = 0
        for i, result in enumerate(search_results):
            result_ids = {hit.id for hit in result}
            gt_ids = set(ground_truth[i][:k])
            total_hits += len(result_ids & gt_ids)
        
        recall = total_hits / (len(test_vectors) * k)
        
        results["recall"].append(recall)
        results["latency_ms"].append(latency_ms)
    
    return results
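
Given the output of a benchmark like the one above, the next step is usually to pick the smallest efSearch that meets the recall target. The helper below is a sketch of that selection; the sample numbers are made up for illustration and are not measurements.

```python
def smallest_ef_meeting_target(ef_values, results, target_recall=0.95):
    """Return (ef, latency_ms) for the smallest ef reaching target_recall, or None."""
    qualifying = [
        (ef, latency)
        for ef, recall, latency in zip(
            ef_values, results["recall"], results["latency_ms"]
        )
        if recall >= target_recall
    ]
    return min(qualifying) if qualifying else None


# Made-up benchmark output, shaped like the results dict built above
ef_values = [128, 256, 512, 1024]
sample_results = {
    "recall": [0.89, 0.94, 0.97, 0.99],
    "latency_ms": [3.1, 5.4, 9.8, 18.5],
}
print(smallest_ef_meeting_target(ef_values, sample_results))
```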

Platform-Specific Tuning

Milvus Optimization

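# milvus.yaml (illustrative: key names and defaults differ between Milvus
# releases, so check the milvus.yaml shipped with your version)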
queryNode:
  cache:
    cache_size: 64GB
  enableDisk: true
indexNode:
  maxWorkingThreads: 16

Pinecone Optimization

import math

from pinecone import Pinecone, PodSpec

def create_scaled_index(pc, vector_count, target_qps, dimension=768):
    """Create a pod-based index scaled for workload.
    
    Note: Actual capacity varies by dimension, metric, and pod type.
    p1.x1 holds ~1M vectors at 768d; p2.x1 offers higher QPS.
    """
    # Approximate pods needed (adjust based on actual testing)
    vectors_per_p1_pod = 1_000_000 * (768 / dimension)
    # Round up so a partial pod's worth of vectors still gets its own pod
    pods_needed = max(1, math.ceil(vector_count / vectors_per_p1_pod))
    
    pod_type = "p2.x1" if target_qps > 1000 else "p1.x1"
    
    # Format index name appropriately
    if vector_count >= 1_000_000:
        name_suffix = f"{vector_count // 1_000_000}m"
    elif vector_count >= 1_000:
        name_suffix = f"{vector_count // 1_000}k"
    else:
        name_suffix = str(vector_count)
    
    pc.create_index(
        name=f"scaled-index-{name_suffix}",
        dimension=dimension,
        metric="cosine",
        spec=PodSpec(
            environment="us-west1-gcp",
            pod_type=pod_type,
            pods=pods_needed
        )
    )

Monitoring and Validation

Key Metrics

  • Recall@K: Target >95% for production workloads
  • P95 Latency: <50ms for interactive applications
  • Memory Efficiency: ~3.8 KB per vector (768d, M=32) before quantization
  • Index Build Time: Plan for hours on billion-scale datasets
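
The P95 target is only meaningful when computed from individual query latencies rather than a batch average. Below is a minimal nearest-rank percentile sketch; the sample latencies are made up, and `percentile_nearest_rank` is an illustrative name.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up per-query latencies in milliseconds
latencies_ms = [12, 15, 11, 48, 14, 13, 95, 16, 12, 14,
                13, 15, 17, 12, 11, 14, 13, 16, 15, 12]
p95 = percentile_nearest_rank(latencies_ms, 95)
print(f"P95 latency: {p95} ms, within 50 ms target: {p95 < 50}")
```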

Validation Query

def validate_parameters(collection, test_vectors, ground_truth, k=100):
    """Validate recall against ground truth.
    
    Args:
        collection: Milvus collection object
        test_vectors: Query vectors (list of lists)
        ground_truth: Expected top-k IDs per query (list of lists)
        k: Number of neighbors to retrieve
    """
    search_params = {"params": {"ef": 512}}
    results = collection.search(
        data=test_vectors,
        anns_field="embedding",
        param=search_params,
        limit=k
    )
    
    total_hits = 0
    for i, result in enumerate(results):
        result_ids = {hit.id for hit in result}
        gt_ids = set(ground_truth[i][:k])
        total_hits += len(result_ids & gt_ids)
    
    recall = total_hits / (len(test_vectors) * k)
    print(f"Recall@{k}: {recall:.3f}")
    return recall > 0.95
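
The validation function expects ground truth, which is typically produced by an exact brute-force search over a sample of the corpus. The following pure-Python sketch shows one way to do that with cosine similarity. It assumes non-zero vectors and a corpus small enough to scan; at scale, a vectorized library or a FLAT index would normally do this work. The function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(corpus, query, k):
    """Return the IDs of the k most similar vectors by exhaustive comparison.

    corpus: dict mapping entity ID -> vector
    """
    ranked = sorted(
        corpus.items(),
        key=lambda item: cosine_similarity(item[1], query),
        reverse=True,
    )
    return [entity_id for entity_id, _ in ranked[:k]]


# Tiny illustrative corpus
corpus = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0], 4: [-1.0, 0.0]}
queries = [[1.0, 0.1], [0.1, 1.0]]
ground_truth = [exact_top_k(corpus, q, k=2) for q in queries]
print(ground_truth)  # prints [[1, 3], [2, 3]]
```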

Implementation Checklist

  1. Estimate memory: Calculate vector data + graph overhead before deployment
  2. Apply quantization (IVF_SQ8/IVF_PQ) for memory-constrained environments
  3. Set M=16-32 based on recall requirements and memory budget
  4. Configure efConstruction=150-250 for graph quality
  5. Tune efSearch iteratively to hit recall targets
  6. Validate recall against labeled ground truth
  7. Monitor memory to prevent OOM errors at scale
  8. Plan distributed architecture for billion-scale (sharding, replication)
  9. Schedule index rebuilds for workloads with frequent updates/deletes
  10. Implement warm-up before production traffic

Start with conservative parameters (M=16, efConstruction=150), increase efSearch until target recall is achieved, then optimize M and quantization for memory constraints.



