
Service Mesh Monitoring: Complete Guide to Prometheus, Grafana & Alerting

MatterAI Agent

A practical guide to implementing observability for service meshes with Prometheus, Grafana, and Alertmanager, focused on Istio Telemetry v2 and its Envoy-generated metrics.

Core Metrics: The Golden Signals

Service mesh monitoring centers on four Golden Signals: latency, traffic, errors, and saturation. Istio Telemetry v2 exposes request-level metrics for these directly from each Envoy sidecar at port 15090 (path /stats/prometheus), replacing the deprecated Mixer-based telemetry.

Key Istio Metrics

Metric                                       Type       Description
istio_requests_total                         Counter    Total requests by source, destination, and response code
istio_request_duration_milliseconds_bucket   Histogram  Request latency distribution
istio_request_bytes_bucket                   Histogram  Request body size distribution
istio_response_bytes_bucket                  Histogram  Response body size distribution

Prometheus Configuration

Scrape Config for Istio

# prometheus-config.yaml
scrape_configs:
  # Envoy sidecar metrics from each pod
  - job_name: 'istio-proxy'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__

  # Istiod control plane metrics
  - job_name: 'istiod'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [istio-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: istiod
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: http-monitoring
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: "${1}:15014"
        target_label: __address__
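If you manage Prometheus with the Prometheus Operator instead of a raw config file, the sidecar scrape job can be expressed as a PodMonitor. A sketch, assuming the monitoring.coreos.com CRDs are installed; the resource name and pod selector are illustrative:

```yaml
# envoy-stats-podmonitor.yaml (sketch for Prometheus Operator users)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: envoy-stats-monitor
  namespace: istio-system
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      # Istio injection adds this label to sidecar-enabled pods
      - key: security.istio.io/tlsMode
        operator: Exists
  podMetricsEndpoints:
    - path: /stats/prometheus
      interval: 30s
      relabelings:
        - action: keep
          sourceLabels: [__meta_kubernetes_pod_container_name]
          regex: istio-proxy
        - action: replace
          sourceLabels: [__meta_kubernetes_pod_ip]
          targetLabel: __address__
          replacement: "$1:15090"
```

The relabelings mirror the raw scrape config above: keep only the istio-proxy container and point the scrape address at the merged metrics port.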

Essential PromQL Queries

Request Rate (Traffic)

sum(rate(istio_requests_total{reporter="destination"}[5m]))

Error Rate (Errors)

sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))

P99 Latency by Service (Latency)

histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service)
)

Success Rate by Service

sum(rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)
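The fourth golden signal, saturation, has no istio_* series of its own. A common proxy for it is sidecar CPU usage against its limit, which assumes cAdvisor/kubelet container metrics are also being scraped (they are not part of the scrape config above):

```promql
# Saturation: istio-proxy CPU usage as a fraction of its CPU limit
sum(rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m])) by (pod)
/
sum(container_spec_cpu_quota{container="istio-proxy"}
    / container_spec_cpu_period{container="istio-proxy"}) by (pod)
```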

Grafana Dashboard Design

Dashboard Hierarchy

Structure dashboards in three tiers: Mesh Overview (global health), Service Dashboard (per-service metrics), and Workload Dashboard (pod-level detail).

Mesh Overview Dashboard

{
  "dashboard": {
    "title": "Istio Mesh Overview",
    "refresh": "30s",
    "panels": [
      {
        "title": "Global Request Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "round(sum(rate(istio_requests_total{reporter=\"destination\"}[5m])), 0.01)"
        }]
      },
      {
        "title": "Global Error Rate",
        "type": "stat",
        "gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "sum(rate(istio_requests_total{reporter=\"destination\",response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m]))"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": null, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency by Service",
        "type": "gauge",
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service))"
        }]
      },
      {
        "title": "Service Traffic Flow",
        "type": "nodeGraph",
        "gridPos": { "x": 0, "y": 4, "w": 24, "h": 10 },
        "targets": [
          {
            "expr": "label_replace(label_replace(sum by (source_workload, destination_workload) (rate(istio_requests_total{reporter=\"destination\"}[5m])), \"source\", \"$1\", \"source_workload\", \"(.+)\"), \"target\", \"$1\", \"destination_workload\", \"(.+)\")",
            "format": "table",
            "instant": true
          }
        ],
        "options": {
          "nodes": {
            "mainStatUnit": "reqps"
          },
          "edges": {
            "mainStatUnit": "reqps"
          }
        }
      }
    ]
  }
}

Latency Heatmap Panel

{
  "title": "Request Latency Distribution",
  "type": "heatmap",
  "targets": [{
    "expr": "sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le)",
    "format": "heatmap"
  }],
  "options": {
    "calculate": false,
    "color": {
      "scheme": "Spectral"
    },
    "yAxis": {
      "decimals": 0,
      "unit": "ms"
    }
  },
  "dataFormat": "tsbuckets"
}

Distributed Tracing Integration

Jaeger Configuration

Enable tracing in Istio mesh:

# istio-tracing.yaml
# Note: since Istio 1.8 the bundled Jaeger addon is no longer installed by
# the operator. Deploy Jaeger separately (e.g. samples/addons/jaeger.yaml)
# and point the mesh at its Zipkin-compatible collector endpoint.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 1.0   # percentage of requests to trace
        zipkin:
          address: jaeger-collector.istio-system.svc:9411
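On Istio 1.12 and later, the sampling rate can also be adjusted at runtime with the Telemetry API rather than by reinstalling the control plane. A minimal sketch:

```yaml
# tracing-sampling.yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system   # applies mesh-wide when placed in the root namespace
spec:
  tracing:
    - randomSamplingPercentage: 1.0   # trace 1% of requests
```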

Tracing Metrics

# Jaeger query latency
histogram_quantile(0.99,
  sum(rate(jaeger_query_duration_seconds_bucket[5m])) by (le)
)

Note: Istio does not expose trace count metrics directly. For trace volume monitoring, instrument your applications with OpenTelemetry SDKs to emit custom metrics, or monitor the Jaeger collector metrics (e.g., jaeger_collector_spans_received_total).

Security Metrics

mTLS Status Monitoring

# mTLS connection ratio (TCP services; HTTP requests carry the same
# connection_security_policy label on istio_requests_total)
sum(rate(istio_tcp_connections_opened_total{connection_security_policy="mutual_tls"}[5m]))
/
sum(rate(istio_tcp_connections_opened_total[5m]))

# Certificate expiration: Envoy reports days until the soonest cert expires
min by (instance) (envoy_server_days_until_first_cert_expiring)

Security Alerts

# security-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-security-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-security
      rules:
        - alert: IstioMTLSFailure
          expr: |
            (
              sum(rate(istio_tcp_connections_opened_total{connection_security_policy!="mutual_tls"}[5m]))
              /
              sum(rate(istio_tcp_connections_opened_total[5m]))
            ) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High non-mTLS connection rate"
            description: "More than 10% of connections are not using mTLS"

        - alert: IstioCertificateExpiry
          expr: |
            envoy_server_days_until_first_cert_expiring < 1
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Workload certificate expires within 24 hours"

SLO/SLI Implementation

Define Service Level Objectives

Note: The ConfigMap format below is conceptual; note also that Istio metrics identify services with destination_service / destination_service_name labels rather than a generic service label. For production SLO management, use tools like Sloth, OpenSLO, or the Pyrra SLO Operator, which generate Prometheus rules from SLO definitions.

# slo-config.yaml (conceptual example)
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-slo-config
data:
  slo.yaml: |
    services:
      - name: payments-service
        slos:
          - name: availability
            objective: 99.9
            sli: |
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code!~"5.."}[5m]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[5m]))
          - name: latency
            objective: 99
            sli: |
              histogram_quantile(0.99,
                sum(rate(istio_request_duration_milliseconds_bucket{destination_service_name="payments-service",reporter="destination"}[5m])) by (le)
              )
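As a concrete alternative to the hand-rolled ConfigMap, the availability SLO could be declared with Sloth, which compiles it into recording rules and multiwindow burn-rate alerts. A sketch, assuming the Sloth operator and its CRDs are installed; the alert names are illustrative:

```yaml
# payments-slo.yaml (Sloth sketch)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: payments-service
  namespace: istio-system
spec:
  service: payments-service
  slos:
    - name: availability
      objective: 99.9
      sli:
        events:
          # {{.window}} is templated by Sloth into its burn-rate windows
          errorQuery: sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[{{.window}}]))
      alerting:
        name: PaymentsServiceAvailability
        pageAlert:
          labels:
            severity: critical
```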

SLO Alerting

# slo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-slo-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-slo
      rules:
        - alert: IstioSLOViolation
          expr: |
            (
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code!~"5.."}[28d]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[28d]))
            ) < 0.999
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "SLO violation for payments-service availability"
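A single 28-day window is slow to fire and expensive to evaluate at alert time. The multiwindow burn-rate pattern from the Google SRE Workbook reacts faster; a sketch of a fast-burn rule for the same 99.9% objective (the 14.4 factor corresponds to exhausting a 30-day error budget in roughly two days):

```yaml
# burn-rate-alert.yaml (rule fragment; pair with a slower 6h/30m rule in practice)
- alert: IstioErrorBudgetFastBurn
  expr: |
    (
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code=~"5.."}[1h]))
      /
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination",response_code=~"5.."}[5m]))
      /
      sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "payments-service is burning error budget 14x faster than sustainable"
```

The short 5m window gates the alert so it clears quickly once the incident ends.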

Custom Metrics and Business KPIs

Custom Metrics via EnvoyFilter

Note that the Lua filter below only tags each request with a header; a separate mechanism, such as the Telemetry API's tag overrides, is needed to turn that tag into a metric dimension.
# custom-metrics.yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: custom-metrics-filter
  namespace: istio-system
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_request(request_handle)
                local headers = request_handle:headers()
                local user_id = headers:get("x-user-id")
                if user_id then
                  request_handle:headers():add("x-custom-metric", "user_" .. user_id)
                end
              end
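One way to surface the injected header as a label on istio_requests_total is a Telemetry API tag override. A sketch: the x-custom-metric header and the user_bucket dimension are this guide's own names, and because every distinct label value creates a new time series, the header should carry a coarse bucket rather than raw user IDs. Depending on Istio version, new dimensions may also need to be added to the proxy's stats inclusion list:

```yaml
# custom-request-tags.yaml (Telemetry API sketch)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-request-tags
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
          tagOverrides:
            # CEL expression evaluated against request attributes
            user_bucket:
              value: request.headers['x-custom-metric']
```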

Business KPI Queries

Note: The metrics referenced below (payment_amount_sum, checkout_completed_total, add_to_cart_total) are not Istio metrics. They require custom instrumentation in your applications using Prometheus client libraries or OpenTelemetry SDKs.

# Revenue per request (requires a custom payment_amount_sum metric)
sum(rate(payment_amount_sum[5m])) / sum(rate(istio_requests_total{destination_service_name="payments-service",reporter="destination"}[5m]))

# User conversion rate (requires custom checkout/add_to_cart metrics)
sum(rate(checkout_completed_total[5m])) / sum(rate(add_to_cart_total[5m]))

Cost Optimization Strategies

Metric Cardinality Management

Important: Do not drop *_bucket metrics. Histogram buckets are required for histogram_quantile() functions that calculate P50, P95, P99 latency percentiles. Dropping them would break all latency SLO queries.

Instead, reduce cardinality by dropping high-cardinality labels or non-essential metrics:

# cost-optimization.yaml
scrape_configs:
  - job_name: 'istio-proxy-optimized'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__
    metric_relabel_configs:
      # Drop noisy metric families; Envoy's per-cluster upstream histograms
      # can go, but the istio_* latency buckets are kept
      - source_labels: [__name__]
        action: drop
        regex: 'istio_request_headers_.+|istio_response_headers_.+|envoy_cluster_upstream_rq_.+_bucket'
      # Drop high-cardinality labels
      - action: labeldrop
        regex: 'source_principal|destination_principal|request_id'
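Cardinality can also be cut at the source, before Prometheus ever scrapes it, by disabling whole metric families with the Telemetry API. A sketch that stops the sidecars emitting the request/response size histograms mesh-wide:

```yaml
# drop-size-histograms.yaml (Telemetry API sketch)
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: drop-size-histograms
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_SIZE
          disabled: true
        - match:
            metric: RESPONSE_SIZE
          disabled: true
```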

Recording Rules for Efficiency

# recording-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-recording-rules
  namespace: istio-system
spec:
  groups:
    - name: istio-recording-rules
      interval: 30s
      rules:
        - record: istio:service:request_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )

        - record: istio:service:success_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )

        - record: istio:service:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            )

        - record: istio:service:error_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )

Alerting Rules

Critical Service Mesh Alerts

# istio-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-error-alerts
      rules:
        - alert: IstioHighErrorRate
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.destination_service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"

        - alert: IstioMeshHighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
            /
            sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Elevated mesh-wide error rate"

    - name: istio-latency-alerts
      rules:
        - alert: IstioHighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on {{ $labels.destination_service }}"
            description: "P99 latency is {{ $value }}ms"

        - alert: IstioLatencySpike
          expr: |
            (
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service))
              -
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m] offset 1h)) by (le, destination_service))
            ) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Sudden latency increase detected"

    - name: istio-traffic-alerts
      rules:
        - alert: IstioLowTraffic
          expr: |
            sum(rate(istio_requests_total{reporter="destination"}[5m])) < 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Unusually low traffic in mesh"

        - alert: IstioTrafficDrop
          expr: |
            (
              sum(rate(istio_requests_total{reporter="destination"}[5m]))
              /
              sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1h))
            ) < 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Traffic dropped by more than 50%"

Multi-Cluster Federation

For multi-cluster deployments, configure Prometheus federation to aggregate metrics.

# prometheus-federation.yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="istio-proxy"}'
        - 'istio:service:request_rate:5m'
        - 'istio:service:success_rate:5m'
        - 'istio:service:error_rate:5m'
    static_configs:
      - targets:
          - 'prometheus-cluster-1.example.com:9090'
          - 'prometheus-cluster-2.example.com:9090'
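At larger scale, federation's pull model becomes a bottleneck. An alternative is to have each cluster push a filtered stream to a central Prometheus-compatible backend via remote_write; the endpoint below is a placeholder:

```yaml
# Per-cluster Prometheus: push only the series the global view needs
remote_write:
  - url: https://central-metrics.example.com/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        action: keep
        regex: 'istio:.*|istio_requests_total'
```

Keeping only the recording-rule series and the raw request counter keeps central storage costs predictable.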

Getting Started

  1. Deploy Prometheus with the scrape configuration above in the istio-system namespace
  2. Apply recording rules to pre-compute metrics and reduce query load
  3. Import Grafana dashboard using the JSON configuration or use Istio's official dashboard (ID: 7639)
  4. Configure Alertmanager to route alerts to Slack, PagerDuty, or email
  5. Set up SLO dashboards using the recording rules for long-term trend analysis
  6. Enable distributed tracing with Jaeger or Zipkin for request flow visibility
  7. Monitor security metrics including mTLS status and certificate expiration
  8. Implement cost optimization by managing metric cardinality and using recording rules
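Step 4 above can be sketched as an Alertmanager route that pages on critical alerts and sends everything else to chat; the webhook URL and routing key are placeholders:

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-default
  group_by: [alertname, destination_service]
  routes:
    - matchers:
        - severity = critical
      receiver: pagerduty
receivers:
  - name: slack-default
    slack_configs:
      - channel: '#mesh-alerts'
        api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME
```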
