Service Mesh Monitoring: Prometheus Metrics, Grafana Dashboards, and Alerting
A practical guide to implementing observability for service meshes using Prometheus, Grafana, and Alertmanager. Focuses on Istio Telemetry v2 with Envoy-based metrics.
Core Metrics: The Golden Signals
Service mesh monitoring centers on four Golden Signals: latency, traffic, errors, and saturation. Istio Telemetry v2 exposes these directly through Envoy sidecars at port 15090, replacing the deprecated Mixer-based telemetry.
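Latency, traffic, and errors come straight from the mesh metrics covered below; saturation usually has to come from the node/container side. A sketch of a sidecar-saturation query, assuming cAdvisor and kube-state-metrics are scraped (as in a standard kube-prometheus setup):

```promql
# Sidecar CPU usage as a fraction of its CPU limit (saturation proxy)
sum by (pod) (rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m]))
/
sum by (pod) (kube_pod_container_resource_limits{container="istio-proxy", resource="cpu"})
```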
Key Istio Metrics
| Metric | Type | Description |
|---|---|---|
| istio_requests_total | Counter | Total requests by source, destination, and response code |
| istio_request_duration_milliseconds_bucket | Histogram | Request latency distribution |
| istio_request_bytes_bucket | Histogram | Request body size |
| istio_response_bytes_bucket | Histogram | Response body size |
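Each series carries full source/destination context, which is what makes per-edge queries possible. An illustrative sample (the label values here are made up; real series carry additional labels):

```
istio_requests_total{
  reporter="destination",
  source_workload="frontend",
  destination_service="payments.default.svc.cluster.local",
  destination_service_name="payments",
  response_code="200"
}  1247
```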
Prometheus Configuration
Scrape Config for Istio
```yaml
# prometheus-config.yaml
scrape_configs:
  # Envoy sidecar metrics from each pod
  - job_name: 'istio-proxy'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__
  # Istiod control plane metrics
  - job_name: 'istiod'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [istio-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: istiod
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: http-monitoring
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: "${1}:15014"
        target_label: __address__
```
Essential PromQL Queries
Request Rate (Traffic)
```promql
sum(rate(istio_requests_total{reporter="destination"}[5m]))
```
Error Rate (Errors)
```promql
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))
```
P99 Latency by Service (Latency)
```promql
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service)
)
```
Success Rate by Service
```promql
sum(rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)
```
Grafana Dashboard Design
Dashboard Hierarchy
Structure dashboards in three tiers: Mesh Overview (global health), Service Dashboard (per-service metrics), and Workload Dashboard (pod-level detail).
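Drill-down between tiers usually hinges on a dashboard template variable. A minimal sketch of one (the variable name `destination_service` and datasource name are our choices, not requirements):

```json
{
  "templating": {
    "list": [{
      "name": "destination_service",
      "type": "query",
      "datasource": "Prometheus",
      "query": "label_values(istio_requests_total{reporter=\"destination\"}, destination_service)",
      "refresh": 2
    }]
  }
}
```

Panels in the Service Dashboard can then filter on `destination_service=~"$destination_service"`.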
Mesh Overview Dashboard
```json
{
  "dashboard": {
    "title": "Istio Mesh Overview",
    "refresh": "30s",
    "panels": [
      {
        "title": "Global Request Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "round(sum(rate(istio_requests_total{reporter=\"destination\"}[5m])), 0.01)"
        }]
      },
      {
        "title": "Global Error Rate",
        "type": "stat",
        "gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "sum(rate(istio_requests_total{reporter=\"destination\",response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m]))"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": null, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency by Service",
        "type": "gauge",
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service))"
        }]
      },
      {
        "title": "Service Traffic Flow",
        "type": "nodeGraph",
        "gridPos": { "x": 0, "y": 4, "w": 24, "h": 10 },
        "targets": [
          {
            "expr": "label_replace(label_replace(sum by (source_workload, destination_workload) (rate(istio_requests_total{reporter=\"destination\"}[5m])), \"source\", \"$1\", \"source_workload\", \"(.+)\"), \"target\", \"$1\", \"destination_workload\", \"(.+)\")",
            "format": "table",
            "instant": true
          }
        ],
        "options": {
          "nodes": { "mainStatUnit": "reqps" },
          "edges": { "mainStatUnit": "reqps" }
        }
      }
    ]
  }
}
```
Latency Heatmap Panel
```json
{
  "title": "Request Latency Distribution",
  "type": "heatmap",
  "targets": [{
    "expr": "sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le)",
    "format": "heatmap"
  }],
  "options": {
    "calculate": false,
    "color": { "scheme": "Spectral" },
    "yAxis": { "decimals": 0, "unit": "ms" }
  },
  "dataFormat": "tsbuckets"
}
```
Distributed Tracing Integration
Jaeger Configuration
Enable tracing in Istio mesh:
```yaml
# istio-tracing.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
spec:
  values:
    pilot:
      env:
        ENABLE_EXTERNAL_JAEGER: "true"
    tracing:
      jaeger:
        enabled: true
        hub: docker.io/jaegertracing
        tag: "1.40"  # quoted so YAML does not parse it as the float 1.4
```
Tracing Metrics
```promql
# Jaeger query latency
histogram_quantile(0.99,
  sum(rate(jaeger_query_duration_seconds_bucket[5m])) by (le)
)
```
Note: Istio does not expose trace count metrics directly. For trace volume monitoring, instrument your applications with OpenTelemetry SDKs to emit custom metrics, or monitor the Jaeger collector metrics (e.g., jaeger_collector_spans_received_total).
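For example, collector intake can be tracked with a query like the following, assuming the Jaeger collector's own metrics endpoint is scraped (the `svc` label is emitted by the Jaeger collector, not by Istio):

```promql
# Spans received per second, broken down by reporting service
sum(rate(jaeger_collector_spans_received_total[5m])) by (svc)
```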
Security Metrics
mTLS Status Monitoring
```promql
# mTLS connection ratio
sum(rate(istio_tcp_connections_opened_total{connection_security_policy="mutual_tls"}[5m]))
/
sum(rate(istio_tcp_connections_opened_total[5m]))

# Certificate expiration
max(istio_certificate_expiry_seconds) by (cluster_id)
```
Security Alerts
```yaml
# security-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-security-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-security
      rules:
        - alert: IstioMTLSFailure
          expr: |
            (
              sum(rate(istio_tcp_connections_opened_total{connection_security_policy!="mutual_tls"}[5m]))
              /
              sum(rate(istio_tcp_connections_opened_total[5m]))
            ) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High non-mTLS connection rate"
            description: "More than 10% of connections are not using mTLS"
        - alert: IstioCertificateExpiry
          expr: |
            istio_certificate_expiry_seconds < 86400
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Istio certificate expires soon"
```
SLO/SLI Implementation
Define Service Level Objectives
Note: The ConfigMap format below is conceptual. For production SLO management, use tools like Sloth, OpenSLO, or the Pyrra SLO Operator which generate Prometheus rules from SLO definitions.
```yaml
# slo-config.yaml (conceptual example)
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-slo-config
data:
  slo.yaml: |
    services:
      - name: payments-service
        slos:
          - name: availability
            objective: 99.9
            sli: |
              sum(rate(istio_requests_total{destination_service_name="payments-service",response_code!~"5.."}[5m]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service"}[5m]))
          - name: latency
            objective: 99
            sli: |
              histogram_quantile(0.99,
                sum(rate(istio_request_duration_milliseconds_bucket{destination_service_name="payments-service"}[5m])) by (le)
              )
```

Note that Istio metrics carry `destination_service_name` (and `destination_service`) labels rather than a plain `service` label.
SLO Alerting
```yaml
# slo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-slo-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-slo
      rules:
        - alert: IstioSLOViolation
          expr: |
            (
              sum(rate(istio_requests_total{destination_service_name="payments-service",response_code!~"5.."}[28d]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service"}[28d]))
            ) < 0.999
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "SLO violation for payments-service availability"
```
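A single 28-day window reacts slowly and only fires once the budget is already gone. A common refinement is multi-window burn-rate alerting; the sketch below follows the Google SRE Workbook pattern for a 99.9% objective (burn rate 14.4 consumes ~2% of the monthly budget in one hour — the thresholds and service name are illustrative):

```yaml
- alert: IstioErrorBudgetFastBurn
  expr: |
    (
      sum(rate(istio_requests_total{destination_service_name="payments-service",response_code=~"5.."}[1h]))
      /
      sum(rate(istio_requests_total{destination_service_name="payments-service"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(istio_requests_total{destination_service_name="payments-service",response_code=~"5.."}[5m]))
      /
      sum(rate(istio_requests_total{destination_service_name="payments-service"}[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: critical
```

The short 5m window gates the alert so it stops firing as soon as the error rate recovers.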
Custom Metrics and Business KPIs
Custom Metrics via EnvoyFilter
```yaml
# custom-metrics.yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: custom-metrics-filter
  namespace: istio-system
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_request(request_handle)
                local headers = request_handle:headers()
                local user_id = headers:get("x-user-id")
                if user_id then
                  request_handle:headers():add("x-custom-metric", "user_" .. user_id)
                end
              end
```
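The Lua filter above only tags requests with a header; to add a dimension to the standard metrics themselves, the Telemetry API is the supported route. A sketch (the resource name and the `request_host` tag are our choices):

```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: custom-dimensions
  namespace: istio-system
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT   # istio_requests_total
          tagOverrides:
            request_host:
              value: request.host   # CEL expression over request attributes
```

Every added dimension multiplies series cardinality, so weigh new tags against the cost-optimization guidance below.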
Business KPI Queries
Note: The metrics referenced below (payment_amount_sum, checkout_completed_total, add_to_cart_total) are not Istio metrics. They require custom instrumentation in your applications using Prometheus client libraries or OpenTelemetry SDKs.
```promql
# Revenue per request (requires custom payment_amount_sum metric)
sum(rate(payment_amount_sum[5m])) / sum(rate(istio_requests_total{destination_service_name="payments-service"}[5m]))

# User conversion rate (requires custom checkout/add_to_cart metrics)
sum(rate(checkout_completed_total[5m])) / sum(rate(add_to_cart_total[5m]))
```
Cost Optimization Strategies
Metric Cardinality Management
Important: Do not drop *_bucket metrics. Histogram buckets are required for histogram_quantile() functions that calculate P50, P95, P99 latency percentiles. Dropping them would break all latency SLO queries.
Instead, reduce cardinality by dropping high-cardinality labels or non-essential metrics:
```yaml
# cost-optimization.yaml
scrape_configs:
  - job_name: 'istio-proxy-optimized'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__
    metric_relabel_configs:
      # Drop noisy header metrics and per-cluster Envoy latency buckets
      # (the istio_request_duration buckets needed for percentiles are kept)
      - source_labels: [__name__]
        action: drop
        regex: 'istio_request_headers_.+|istio_response_headers_.+|envoy_cluster_upstream_rq_.+_bucket'
      # Drop high-cardinality labels
      - action: labeldrop
        regex: 'source_principal|destination_principal|request_id'
```
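Before deciding what to drop, it helps to see which metric families dominate. A quick ad-hoc cardinality check (run in the Prometheus UI, not as a dashboard panel, since the unanchored matcher is expensive):

```promql
# Top 10 mesh metric names by series count
topk(10, count by (__name__)({__name__=~"istio_.+|envoy_.+"}))
```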
Recording Rules for Efficiency
```yaml
# recording-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-recording-rules
  namespace: istio-system
spec:
  groups:
    - name: istio-recording-rules
      interval: 30s
      rules:
        - record: istio:service:request_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )
        - record: istio:service:success_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )
        - record: istio:service:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            )
        - record: istio:service:error_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )
```
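Dashboards and alerts can then reference the pre-computed series instead of re-evaluating the raw histogram expressions on every refresh, for example:

```promql
# Ten slowest services by pre-computed P99
topk(10, istio:service:latency_p99:5m)
```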
Alerting Rules
Critical Service Mesh Alerts
```yaml
# istio-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-error-alerts
      rules:
        - alert: IstioHighErrorRate
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.destination_service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: IstioMeshHighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
            /
            sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Elevated mesh-wide error rate"
    - name: istio-latency-alerts
      rules:
        - alert: IstioHighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on {{ $labels.destination_service }}"
            description: "P99 latency is {{ $value }}ms"
        - alert: IstioLatencySpike
          expr: |
            (
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service))
              -
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m] offset 1h)) by (le, destination_service))
            ) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Sudden latency increase detected"
    - name: istio-traffic-alerts
      rules:
        - alert: IstioLowTraffic
          expr: |
            sum(rate(istio_requests_total{reporter="destination"}[5m])) < 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Unusually low traffic in mesh"
        - alert: IstioTrafficDrop
          expr: |
            (
              sum(rate(istio_requests_total{reporter="destination"}[5m]))
              /
              sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1h))
            ) < 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Traffic dropped by more than 50%"
```
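Routing these alerts by severity happens in Alertmanager. A minimal sketch of the routing side (receiver names, channel, webhook URL, and integration key are all placeholders):

```yaml
# alertmanager.yml (sketch)
route:
  receiver: default
  group_by: [alertname, destination_service]
  routes:
    # Page on critical alerts; everything else goes to chat
    - matchers: ['severity="critical"']
      receiver: pagerduty
receivers:
  - name: default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
        channel: '#mesh-alerts'
  - name: pagerduty
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'       # placeholder
```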
Multi-Cluster Federation
For multi-cluster deployments, configure Prometheus federation to aggregate metrics.
```yaml
# prometheus-federation.yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="istio-proxy"}'
        - 'istio:service:request_rate:5m'
        - 'istio:service:success_rate:5m'
        - 'istio:service:error_rate:5m'
    static_configs:
      - targets:
          - 'prometheus-cluster-1.example.com:9090'
          - 'prometheus-cluster-2.example.com:9090'
```
Getting Started
- Deploy Prometheus with the scrape configuration above in the istio-system namespace
- Apply recording rules to pre-compute metrics and reduce query load
- Import the Grafana dashboard using the JSON configuration, or use Istio's official dashboard (ID: 7639)
- Configure Alertmanager to route alerts to Slack, PagerDuty, or email
- Set up SLO dashboards using the recording rules for long-term trend analysis
- Enable distributed tracing with Jaeger or Zipkin for request-flow visibility
- Monitor security metrics, including mTLS status and certificate expiration
- Implement cost optimization by managing metric cardinality and using recording rules