
Service Mesh Monitoring: Complete Guide to Prometheus, Grafana & Alerting

MatterAI Agent

A practical guide to implementing observability for service meshes with Prometheus, Grafana, and Alertmanager, focused on Istio Telemetry v2 and its Envoy-generated metrics.

Core Metrics: The Golden Signals

Service mesh monitoring centers on four Golden Signals: latency, traffic, errors, and saturation. Istio Telemetry v2 exposes request-level metrics for these directly from each Envoy sidecar at port 15090 (path /stats/prometheus), replacing the deprecated Mixer-based telemetry.

Key Istio Metrics

Metric                                       Type       Description
istio_requests_total                         Counter    Total requests by source, destination, and response code
istio_request_duration_milliseconds_bucket   Histogram  Request latency distribution
istio_request_bytes_bucket                   Histogram  Request body size distribution
istio_response_bytes_bucket                  Histogram  Response body size distribution

Prometheus Configuration

Scrape Config for Istio

# prometheus-config.yaml
scrape_configs:
  # Envoy sidecar metrics from each pod
  - job_name: 'istio-proxy'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__

  # Istiod control plane metrics
  - job_name: 'istiod'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [istio-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: istiod
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: http-monitoring
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: "${1}:15014"
        target_label: __address__
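If you manage Prometheus with the Prometheus Operator instead of a raw config file, the sidecar scrape job can be expressed as a PodMonitor. A sketch, assuming the monitoring.coreos.com CRDs are installed; the resource name and pod selector are illustrative:

```yaml
# envoy-stats-podmonitor.yaml (sketch for Prometheus Operator users)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: istio-system
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      # Istio injection adds this label to sidecar-enabled pods
      - key: security.istio.io/tlsMode
        operator: Exists
  podMetricsEndpoints:
    - path: /stats/prometheus
      interval: 30s
      relabelings:
        - action: keep
          sourceLabels: [__meta_kubernetes_pod_container_name]
          regex: istio-proxy
        - action: replace
          sourceLabels: [__meta_kubernetes_pod_ip]
          targetLabel: __address__
          replacement: "$1:15090"
```

The relabelings mirror the raw scrape config above: keep only the istio-proxy container and point the scrape address at the merged metrics port.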

Essential PromQL Queries

Request Rate (Traffic)

sum(rate(istio_requests_total{reporter="destination"}[5m]))

Error Rate (Errors)

sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))

P99 Latency by Service (Latency)

histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service)
)

Success Rate by Service

sum(rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)
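The fourth golden signal, saturation, has no istio_* series of its own. A common proxy for it is sidecar CPU usage against its limit, which assumes cAdvisor/kubelet container metrics are also being scraped (they are not part of the scrape config above):

```promql
# Saturation: istio-proxy CPU usage as a fraction of its CPU limit
sum(rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m])) by (pod)
/
sum(container_spec_cpu_quota{container="istio-proxy"}
    / container_spec_cpu_period{container="istio-proxy"}) by (pod)
```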

Grafana Dashboard Design

Dashboard Hierarchy

Structure dashboards in three tiers: Mesh Overview (global health), Service Dashboard (per-service metrics), and Workload Dashboard (pod-level detail).

Mesh Overview Dashboard

{
  "dashboard": {
    "title": "Istio Mesh Overview",
    "refresh": "30s",
    "panels": [
      {
        "title": "Global Request Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "round(sum(rate(istio_requests_total{reporter=\"destination\"}[5m])), 0.01)"
        }]
      },
      {
        "title": "Global Error Rate",
        "type": "stat",
        "gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "sum(rate(istio_requests_total{reporter=\"destination\",response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m]))"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": null, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency by Service",
        "type": "gauge",
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service))"
        }]
      },
      {
        "title": "Service Traffic Flow",
        "type": "nodeGraph",
        "gridPos": { "x": 0, "y": 4, "w": 24, "h": 10 },
        "targets": [
          {
            "expr": "label_replace(label_replace(sum by (source_workload, destination_workload) (rate(istio_requests_total{reporter=\"destination\"}[5m])), \"source\", \"$1\", \"source_workload\", \"(.+)\"), \"target\", \"$1\", \"destination_workload\", \"(.+)\")",
            "format": "table",
            "instant": true
          }
        ],
        "options": {
          "nodes": {
            "mainStatUnit": "reqps"
          },
          "edges": {
            "mainStatUnit": "reqps"
          }
        }
      }
    ]
  }
}

Latency Heatmap Panel

{
  "title": "Request Latency Distribution",
  "type": "heatmap",
  "targets": [{
    "expr": "sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le)",
    "format": "heatmap"
  }],
  "options": {
    "calculate": false,
    "color": {
      "scheme": "Spectral"
    },
    "yAxis": {
      "decimals": 0,
      "unit": "ms"
    }
  },
  "dataFormat": "tsbuckets"
}

Distributed Tracing Integration

Jaeger Configuration

Enable tracing in Istio mesh:

# istio-tracing.yaml
# Note: since Istio 1.8 the bundled Jaeger addon is no longer installed by
# the operator. Deploy Jaeger separately (e.g. samples/addons/jaeger.yaml)
# and point the mesh at its Zipkin-compatible collector endpoint.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 1.0   # percentage of requests to trace
        zipkin:
          address: jaeger-collector.istio-system.svc:9411
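On Istio 1.12 and later, the sampling rate can also be adjusted at runtime with the Telemetry API rather than by reinstalling the control plane. A minimal sketch:

```yaml
# tracing-sampling.yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system   # applies mesh-wide when placed in the root namespace
spec:
  tracing:
    - randomSamplingPercentage: 1.0   # trace 1% of requests
```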

Tracing Metrics

# Jaeger query latency
histogram_quantile(0.99,
  sum(rate(jaeger_query_duration_seconds_bucket[5m])) by (le)
)

Note: Istio does not expose trace count metrics directly. For trace volume monitoring, instrument your applications with OpenTelemetry SDKs to emit custom metrics, or monitor the Jaeger collector metrics (e.g., jaeger_collector_spans_received_total).

Security Metrics

mTLS Status Monitoring

# mTLS connection ratio (TCP services; HTTP requests carry the same
# connection_security_policy label on istio_requests_total)
sum(rate(istio_tcp_connections_opened_total{connection_security_policy="mutual_tls"}[5m]))
/
sum(rate(istio_tcp_connections_opened_total[5m]))

# Certificate expiration: Envoy reports days until the soonest cert expires
min by (instance) (envoy_server_days_until_first_cert_expiring)

Security Alerts

# security-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-security-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-security
      rules:
        - alert: IstioMTLSFailure
          expr: |
            (
              sum(rate(istio_tcp_connections_opened_total{connection_security_policy!="mutual_tls"}[5m]))
              /
              sum(rate(istio_tcp_connections_opened_total[5m]))
            ) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High non-mTLS connection rate"
            description: "More than 10% of connections are not using mTLS"

        - alert: IstioCertificateExpiry
          expr: |
            envoy_server_days_until_first_cert_expiring < 1
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Workload certificate expires within 24 hours"

SLO/SLI Implementation

Define Service Level Objectives

Note: The ConfigMap format below is conceptual; note also that Istio metrics identify services with destination_service / destination_service_name labels rather than a generic service label. For production SLO management, use tools like Sloth, OpenSLO, or the Pyrra SLO Operator, which generate Prometheus rules from SLO definitions.

# slo-config.yaml (conceptual example)
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-slo-config
data:
  slo.yaml: |
    services:
      - name: payments-service
        slos:
          - name: availability
            objective: 99.9
            sli: |
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code!~"5.."}[5m]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[5m]))
          - name: latency
            objective: 99
            sli: |
              histogram_quantile(0.99,
                sum(rate(istio_request_duration_milliseconds_bucket{destination_service_name="payments-service",reporter="destination"}[5m])) by (le)
              )
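As a concrete alternative to the hand-rolled ConfigMap, the availability SLO could be declared with Sloth, which compiles it into recording rules and multiwindow burn-rate alerts. A sketch, assuming the Sloth operator and its CRDs are installed; the alert names are illustrative:

```yaml
# payments-slo.yaml (Sloth sketch)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-service
  namespace: istio-system
spec:
  service: payments-service
  slos:
    - name: availability
      objective: 99.9
      sli:
        events:
          # {{.window}} is templated by Sloth into its burn-rate windows
          errorQuery: sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[{{.window}}]))
      alerting:
        name: PaymentsServiceAvailability
        pageAlert:
          labels:
            severity: critical
```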

SLO Alerting

# slo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-slo-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-slo
      rules:
        - alert: IstioSLOViolation
          expr: |
            (
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code!~"5.."}[28d]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[28d]))
            ) < 0.999
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "SLO violation for payments-service availability"
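A single 28-day window is slow to fire and expensive to evaluate at alert time. The multiwindow burn-rate pattern from the Google SRE Workbook reacts faster; a sketch of a fast-burn rule for the same 99.9% objective (the 14.4 factor corresponds to exhausting a 30-day error budget in roughly two days):

```yaml
# burn-rate-alert.yaml (rule fragment; pair with a slower 6h/30m rule in practice)
- alert: IstioErrorBudgetFastBurn
  expr: |
    (
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code=~"5.."}[1h]))
      /
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code=~"5.."}[5m]))
      /
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "payments-service is burning error budget 14x faster than sustainable"
```

The short 5m window gates the alert so it clears quickly once the incident ends.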

Custom Metrics and Business KPIs

Custom Metrics via EnvoyFilter

Note that the Lua filter below only tags each request with a header; a separate mechanism, such as the Telemetry API's tag overrides, is needed to turn that tag into a metric dimension.
# custom-metrics.yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: custom-metrics-filter
  namespace: istio-system
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_request(request_handle)
                local headers = request_handle:headers()
                local user_id = headers:get("x-user-id")
                if user_id then
                  request_handle:headers():add("x-custom-metric", "user_" .. user_id)
                end
              end
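One way to surface the injected header as a label on istio_requests_total is a Telemetry API tag override. A sketch: the x-custom-metric header and the user_bucket dimension are this guide's own names, and because every distinct label value creates a new time series, the header should carry a coarse bucket rather than raw user IDs. Depending on Istio version, new dimensions may also need to be added to the proxy's stats inclusion list:

```yaml
# custom-request-tags.yaml (Telemetry API sketch)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-request-tags
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            # CEL expression evaluated against request attributes
            user_bucket:
              value: request.headers['x-custom-metric']
```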

Business KPI Queries

Note: The metrics referenced below (payment_amount_sum, checkout_completed_total, add_to_cart_total) are not Istio metrics. They require custom instrumentation in your applications using Prometheus client libraries or OpenTelemetry SDKs.

# Revenue per request (requires a custom payment_amount_sum metric)
sum(rate(payment_amount_sum[5m])) / sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[5m]))

# User conversion rate (requires custom checkout/add_to_cart metrics)
sum(rate(checkout_completed_total[5m])) / sum(rate(add_to_cart_total[5m]))

Cost Optimization Strategies

Metric Cardinality Management

Important: Do not drop *_bucket metrics. Histogram buckets are required for histogram_quantile() functions that calculate P50, P95, P99 latency percentiles. Dropping them would break all latency SLO queries.

Instead, reduce cardinality by dropping high-cardinality labels or non-essential metrics:

# cost-optimization.yaml
scrape_configs:
  - job_name: 'istio-proxy-optimized'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__
    metric_relabel_configs:
      # Drop noisy metric families; Envoy's per-cluster upstream histograms
      # can go, but the istio_* latency buckets are kept
      - source_labels: [__name__]
        action: drop
        regex: 'istio_request_headers_.+|istio_response_headers_.+|envoy_cluster_upstream_rq_.+_bucket'
      # Drop high-cardinality labels
      - action: labeldrop
        regex: 'source_principal|destination_principal|request_id'
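Cardinality can also be cut at the source, before Prometheus ever scrapes it, by disabling whole metric families with the Telemetry API. A sketch that stops the sidecars emitting the request/response size histograms mesh-wide:

```yaml
# drop-size-histograms.yaml (Telemetry API sketch)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-size-histograms
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_SIZE
          disabled: true
        - match:
            metric: RESPONSE_SIZE
          disabled: true
```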

Recording Rules for Efficiency

# recording-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-recording-rules
  namespace: istio-system
spec:
  groups:
    - name: istio-recording-rules
      interval: 30s
      rules:
        - record: istio:service:request_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )

        - record: istio:service:success_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )

        - record: istio:service:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            )

        - record: istio:service:error_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )

Alerting Rules

Critical Service Mesh Alerts

# istio-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-error-alerts
      rules:
        - alert: IstioHighErrorRate
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.destination_service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: IstioMeshHighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
            /
            sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Elevated mesh-wide error rate"

    - name: istio-latency-alerts
      rules:
        - alert: IstioHighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on {{ $labels.destination_service }}"
            description: "P99 latency is {{ $value }}ms"

        - alert: IstioLatencySpike
          expr: |
            (
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service))
              -
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m] offset 1h)) by (le, destination_service))
            ) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Sudden latency increase detected"

    - name: istio-traffic-alerts
      rules:
        - alert: IstioLowTraffic
          expr: |
            sum(rate(istio_requests_total{reporter="destination"}[5m])) < 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Unusually low traffic in mesh"

        - alert: IstioTrafficDrop
          expr: |
            (
              sum(rate(istio_requests_total{reporter="destination"}[5m]))
              /
              sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1h))
            ) < 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Traffic dropped by more than 50%"

Multi-Cluster Federation

For multi-cluster deployments, configure Prometheus federation to aggregate metrics.

# prometheus-federation.yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="istio-proxy"}'
        - 'istio:service:request_rate:5m'
        - 'istio:service:success_rate:5m'
        - 'istio:service:error_rate:5m'
    static_configs:
      - targets:
          - 'prometheus-cluster-1.example.com:9090'
          - 'prometheus-cluster-2.example.com:9090'
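At larger scale, federation's pull model becomes a bottleneck. An alternative is to have each cluster push a filtered stream to a central Prometheus-compatible backend via remote_write; the endpoint below is a placeholder:

```yaml
# Per-cluster Prometheus: push only the series the global view needs
remote_write:
  - url: https://central-metrics.example.com/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        action: keep
        regex: 'istio:.*|istio_requests_total'
```

Keeping only the recording-rule series and the raw request counter keeps central storage costs predictable.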

Getting Started

  1. Deploy Prometheus with the scrape configuration above in the istio-system namespace
  2. Apply recording rules to pre-compute metrics and reduce query load
  3. Import Grafana dashboard using the JSON configuration or use Istio's official dashboard (ID: 7639)
  4. Configure Alertmanager to route alerts to Slack, PagerDuty, or email
  5. Set up SLO dashboards using the recording rules for long-term trend analysis
  6. Enable distributed tracing with Jaeger or Zipkin for request flow visibility
  7. Monitor security metrics including mTLS status and certificate expiration
  8. Implement cost optimization by managing metric cardinality and using recording rules
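Step 4 above can be sketched as an Alertmanager route that pages on critical alerts and sends everything else to chat; the webhook URL and routing key are placeholders:

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-default
  group_by: [alertname, destination_service]
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#mesh-alerts'
        api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME
```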
