CI/CD & DevOps Automation

Zero-Downtime Deployments: Blue-Green vs Canary Strategies

MatterAI Agent
4 min read

How to Implement Zero-Downtime Deployments with Blue-Green and Canary Strategies

Zero-downtime deployment (ZDD) ensures continuous service availability during application updates. This requires stateless application architecture, a load balancer for traffic routing, automated health checks to validate deployment success, API backward compatibility, and a database migration strategy that works across both versions.

Blue-Green Deployment Strategy

Blue-green deployment maintains two identical production environments: Blue (current version) and Green (new version). The load balancer routes all traffic to the active environment while the idle environment receives the update.

Architecture Setup

Deploy your application across two complete environments with identical infrastructure. The load balancer sits in front, directing all traffic to the Blue environment initially. Green remains idle but fully provisioned and ready.

Infrastructure Cost: Blue-green requires 2x compute resources since both environments run simultaneously. Stateful resources (databases, object storage such as S3) are typically shared between environments, avoiding a full 2x cost on those components.

Deployment Process

  1. Deploy the new version to the Green environment
  2. Run automated tests and health checks against Green
  3. Verify Green is healthy and stable
  4. Update the load balancer configuration to switch traffic from Blue to Green
  5. Monitor production traffic on Green for issues
  6. If issues occur, immediately switch traffic back to Blue
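
This sequence can be scripted end to end. The sketch below, in Python, assumes a hypothetical load balancer admin endpoint (LB_ADMIN_URL) and a health endpoint on the Green environment (GREEN_HEALTH_URL); both URLs and the surrounding helpers are placeholders rather than any particular product's API.

import sys
import time
import requests

LB_ADMIN_URL = "http://lb.internal/admin/active-backend"   # hypothetical LB admin endpoint
GREEN_HEALTH_URL = "http://green.internal/healthz"          # hypothetical Green health endpoint

def wait_until_healthy(url, attempts=10, delay=5):
    """Poll a health endpoint until it reports healthy or attempts run out."""
    for _ in range(attempts):
        try:
            if requests.get(url, timeout=3).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(delay)
    return False

def switch_traffic(target):
    """Point the load balancer at the given environment ('blue' or 'green')."""
    requests.put(LB_ADMIN_URL, json={"backend": target}, timeout=5).raise_for_status()

if __name__ == "__main__":
    # Steps 2-3: validate Green before it receives any production traffic.
    if not wait_until_healthy(GREEN_HEALTH_URL):
        sys.exit("Green failed health checks; aborting cutover, Blue keeps serving traffic")

    # Step 4: flip the load balancer from Blue to Green.
    switch_traffic("green")

    # Steps 5-6: watch Green briefly; roll back to Blue on any failed check.
    for _ in range(12):
        time.sleep(10)
        if not wait_until_healthy(GREEN_HEALTH_URL, attempts=1, delay=0):
            switch_traffic("blue")   # instant rollback: Blue is still running, unchanged
            sys.exit("Green degraded after cutover; traffic switched back to Blue")
    print("Cutover to Green complete")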

Rollback Mechanism

Rollback is instantaneous: update the load balancer to route traffic back to Blue. No redeployment is required since the previous version remains running and unchanged.

Canary Deployment Strategy

Canary deployment routes a small percentage of production traffic to the new version before full rollout. This approach minimizes blast radius and enables gradual validation.

Traffic Weighting

Configure your load balancer or service mesh to split traffic between versions. Start with a small percentage (1-5%) directed to the canary version, gradually increasing based on monitoring metrics.

Minimum Traffic Volume: Ensure sufficient traffic volume reaches the canary to achieve statistical significance. Low-traffic services may require extended canary periods or higher initial percentages to collect meaningful data for decision-making.

Session Persistence: Sticky sessions interfere with canary deployments by routing the same user consistently to one version. For accurate canary testing, disable session affinity or use a session store (Redis, Memcached) external to application servers. If sticky sessions are required, ensure the canary percentage accounts for pinned users.
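
One way to follow the external-session-store recommendation is sketched below using the redis-py client; the host name, key format, and TTL are illustrative.

import json
import uuid
import redis  # assumes the redis-py client and a reachable Redis instance

r = redis.Redis(host="sessions.internal", port=6379, decode_responses=True)
SESSION_TTL = 3600  # seconds

def create_session(user_id):
    """Store session data in Redis so any app version or node can serve the user."""
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id):
    """Look up the session regardless of which backend handled earlier requests."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None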

Implementation Methods

  • Load balancer configuration: Weighted routing to different upstream servers
  • Service mesh: Fine-grained traffic control with Istio, Linkerd, or similar
  • Feature flags: Deploy code to all nodes but enable features for specific user segments
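
To illustrate the feature-flag approach, the sketch below buckets users deterministically so a stable segment of traffic exercises the new code path; the flag name and percentage are illustrative.

import hashlib

CANARY_PERCENT = 5  # share of users routed to the new code path

def in_canary(user_id: str, flag: str = "new-checkout") -> bool:
    """Deterministic bucketing: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < CANARY_PERCENT

def handle_request(user_id: str):
    if in_canary(user_id):
        return "response from new code path"      # canary segment
    return "response from current code path"      # stable segment

Hashing the user ID keeps a given user on the same side of the flag across requests, avoiding the inconsistency that random per-request sampling would introduce.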

Monitoring Requirements

Track error rates, latency, throughput, and business metrics separately for canary traffic. Set automated thresholds to trigger rollback if metrics degrade beyond acceptable limits.
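
A minimal guardrail for this kind of automated decision might look like the sketch below; the thresholds are examples only, and the metric values would come from your monitoring backend rather than the dictionaries shown.

# Hypothetical inputs: baseline and canary dicts would be populated from your
# metrics backend; setting the returned weight is left to the LB or mesh API.

MAX_ERROR_RATE_DELTA = 0.01   # canary may exceed baseline error rate by at most 1 point
MAX_LATENCY_RATIO = 1.25      # canary p99 may be at most 25% slower than baseline

def evaluate_canary(baseline: dict, canary: dict) -> bool:
    """Return True if the canary is healthy enough to keep or promote."""
    if canary["error_rate"] - baseline["error_rate"] > MAX_ERROR_RATE_DELTA:
        return False
    if canary["p99_latency"] > baseline["p99_latency"] * MAX_LATENCY_RATIO:
        return False
    return True

def step_canary(baseline: dict, canary: dict, current_weight: int) -> int:
    """Promote the canary gradually, or roll back to 0% if metrics degrade."""
    if not evaluate_canary(baseline, canary):
        return 0                          # automated rollback
    return min(current_weight * 2, 100)   # e.g. 5% -> 10% -> 20% -> ... -> 100%

# Example: step_canary({"error_rate": 0.002, "p99_latency": 180},
#                      {"error_rate": 0.003, "p99_latency": 190}, 5) returns 10.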

State Management & Compatibility

Shared Storage

Applications must not rely on local filesystem state. Use external storage solutions for any persistent data:

  • Object storage (S3, GCS, Azure Blob) for user uploads, static assets
  • Distributed file systems (NFS, EFS) for shared file access
  • Databases for application state

Local filesystem writes break ZDD since the new deployment cannot access files written by the previous version.
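
As an example of externalizing file state, the sketch below writes user uploads to S3 with boto3 instead of the local disk; the bucket name and key layout are placeholders, and AWS credentials are assumed to be configured in the environment.

import boto3  # assumes AWS credentials are available to the process

s3 = boto3.client("s3")
BUCKET = "my-app-uploads"  # placeholder bucket name

def save_upload(user_id: str, filename: str, data: bytes) -> str:
    """Write the upload to S3 so every environment (Blue, Green, canary) can read it."""
    key = f"uploads/{user_id}/{filename}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=data)
    return key

def read_upload(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()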

API Backward Compatibility

APIs must remain N-1 compatible during deployments, when both versions serve traffic simultaneously: the new version must handle requests from old clients, and the old version must handle requests from new clients. Common patterns:

  • Additive changes only (new fields, new endpoints)
  • Never remove or rename existing fields
  • Use versioned endpoints for breaking changes (/v1/users, /v2/users)
  • Maintain both API versions until all clients migrate
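
To make the additive-only rule concrete, the sketch below shows a response builder that adds a field without removing or renaming anything, plus a tolerant reader that accepts payloads from old clients; the field names are illustrative.

def build_user_response(user: dict) -> dict:
    """Existing fields keep their names and types; new fields are purely additive."""
    response = {
        "id": user["id"],
        "name": user["name"],
    }
    # Old clients simply ignore the extra field; new clients can use it.
    response["display_name"] = user.get("display_name", user["name"])
    return response

def parse_update_request(payload: dict) -> dict:
    """Tolerant reader: accept requests from old clients that omit new fields."""
    return {
        "name": payload["name"],
        "display_name": payload.get("display_name"),  # optional for old clients
    }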

Database Migration Strategy

Database schema changes must work across both application versions during deployment. Use the Expand-Contract (Parallel Schema) pattern:

  1. Expand: Add new columns/tables without removing or modifying existing structures. Deploy this change first.
  2. Backfill: Populate new columns with data from existing records. Run this in batches to avoid long-running transactions and metadata locks. Use appropriate transaction isolation levels (typically READ COMMITTED) to balance consistency with performance during backfill operations.
  3. Deploy: Deploy application code that writes to both old and new schemas, reads from new schema with fallback to old.
  4. Contract: After full rollout and verification, remove old columns/tables and deploy application code that no longer references them.

Never perform breaking schema changes (column drops, type changes, constraint modifications) during a ZDD. Use backward-compatible migrations only during the deployment window.
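
The expand and backfill steps might look like the sketch below, written against Python's built-in sqlite3 module purely for illustration; the users table and its columns are assumed to exist, and a real migration would go through your database's migration tooling.

import sqlite3

conn = sqlite3.connect("app.db")

# Expand: add the new column without touching existing structures.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
conn.commit()

# Backfill in small batches so no single transaction holds locks for long.
BATCH_SIZE = 500
while True:
    rows = conn.execute(
        "SELECT id, first_name, last_name FROM users "
        "WHERE full_name IS NULL LIMIT ?", (BATCH_SIZE,)
    ).fetchall()
    if not rows:
        break
    conn.executemany(
        "UPDATE users SET full_name = ? WHERE id = ?",
        [(f"{first} {last}", row_id) for row_id, first, last in rows],
    )
    conn.commit()  # committing per batch keeps each transaction short

# Contract happens later, in a separate deployment, once nothing reads the old columns:
# ALTER TABLE users DROP COLUMN first_name;  -- only after full rollout and verification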

Load Balancer Configuration Example

Here's an Nginx configuration for weighted canary routing with passive health checks:

upstream app_cluster {
    server app-v1.example.com weight=95 max_fails=3 fail_timeout=30s;
    server app-v2.example.com weight=5 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://app_cluster;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        
        # Failover behavior - not a health check
        # Retries next upstream on specified errors
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
    }
}

This configuration routes 95% of traffic to version 1 and 5% to version 2. The max_fails and fail_timeout parameters implement passive health checks: after 3 failed attempts within the 30-second window, the server is marked unavailable for 30 seconds. The proxy_next_upstream directive controls failover behavior when a request fails; it does not perform health checks.

Active Health Check Alternatives: For proactive monitoring, use Nginx Plus with the health_check directive, external agents like Consul or HAProxy, or Kubernetes liveness/readiness probes. Active checks periodically probe endpoints regardless of traffic flow, detecting failures before they impact users.
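
For illustration, the sketch below implements a simple active checker in Python that probes each upstream on a fixed interval, independent of live traffic; the endpoints and interval are placeholders, and a real checker would feed results back into the load balancer.

import time
import urllib.error
import urllib.request

UPSTREAMS = {
    "app-v1": "http://app-v1.example.com/healthz",  # placeholder health endpoints
    "app-v2": "http://app-v2.example.com/healthz",
}
INTERVAL = 10  # seconds between probe rounds

def probe(url: str) -> bool:
    """A probe succeeds only on an HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    while True:
        for name, url in UPSTREAMS.items():
            healthy = probe(url)
            # A real checker would mark the upstream down in the load balancer here.
            print(f"{name}: {'healthy' if healthy else 'UNHEALTHY'}")
        time.sleep(INTERVAL)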

Getting Started

  1. Audit application statelessness and externalize session state to Redis or similar
  2. Implement health check endpoints and configure monitoring for application metrics
  3. Design API changes for backward compatibility (additive only)
  4. Plan database migrations using the Expand-Contract pattern with backfill
  5. Set up shared storage (S3, NFS) and a load balancer with traffic routing
  6. Start with blue-green for simpler deployments, then transition to canary for more granular control
  7. Automate with CI/CD pipelines to ensure consistency
