# Service Mesh Monitoring: Prometheus Metrics, Grafana Dashboards, and Alerting
A practical guide to implementing observability for service meshes using Prometheus, Grafana, and Alertmanager. Focuses on Istio Telemetry v2 with Envoy-based metrics.
## Core Metrics: The Golden Signals
Service mesh monitoring centers on four Golden Signals: latency, traffic, errors, and saturation. Istio Telemetry v2 exposes these directly through Envoy sidecars at port 15090, replacing the deprecated Mixer-based telemetry.
### Key Istio Metrics

| Metric | Type | Description |
|---|---|---|
| `istio_requests_total` | Counter | Total requests by source, destination, and response code |
| `istio_request_duration_milliseconds_bucket` | Histogram | Request latency distribution |
| `istio_request_bytes_bucket` | Histogram | Request body size |
| `istio_response_bytes_bucket` | Histogram | Response body size |
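These series arrive in the Prometheus text-exposition format, one sample per line. As a rough illustration (not a substitute for a real Prometheus client library), the Python sketch below parses sample lines of the `name{labels} value` shape; the sample values are invented:

```python
def parse_exposition(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-exposition lines into {series: value}.

    Handles the 'name{labels} value' shape used by the Istio standard
    metrics; comment and blank lines are skipped. A sketch only -- real
    scrapes should use a Prometheus client, which also handles timestamps
    and escaping.
    """
    series: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        series[name_and_labels] = float(value)
    return series

# Sample output of the :15090/stats/prometheus endpoint (values made up):
sample = """\
# TYPE istio_requests_total counter
istio_requests_total{destination_service="payments",response_code="200"} 1042
istio_requests_total{destination_service="payments",response_code="503"} 7
"""
series = parse_exposition(sample)
```

With the two samples parsed, per-code request counts can be read off directly, which is exactly what the PromQL queries later in this guide aggregate server-side.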
## Prometheus Configuration

### Scrape Config for Istio

```yaml
# prometheus-config.yaml
scrape_configs:
  # Envoy sidecar metrics from each pod
  - job_name: 'istio-proxy'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__
  # Istiod control plane metrics
  - job_name: 'istiod'
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [istio-system]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: istiod
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: http-monitoring
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: "${1}:15014"
        target_label: __address__
```
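The address rewrites above rely on Prometheus relabeling semantics: the rule's `regex` (default `(.*)`) is matched against the joined source-label values, and `${1}`-style capture groups are expanded into `replacement`. A small Python sketch of that behavior — a simplification of what Prometheus actually does, with made-up pod and endpoint addresses:

```python
import re

def apply_replace(source_value: str, regex: str, replacement: str) -> str:
    """Mimic a Prometheus 'replace' relabel action: the regex is anchored
    (full match), and ${1}-style capture groups are expanded."""
    m = re.fullmatch(regex, source_value)
    if m is None:
        return source_value  # no match -> label is left unchanged
    # Prometheus uses ${1}/$1 syntax; translate to Python's \1 expansion.
    return m.expand(replacement.replace("${1}", r"\1"))

# istio-proxy job: the default regex (.*) captures the whole pod IP.
addr = apply_replace("10.42.0.17", r"(.*)", "${1}:15090")

# istiod job: strip any existing port before appending 15014.
istiod = apply_replace("10.42.0.3:8080", r"([^:]+)(?::\d+)?", "${1}:15014")
```

This is why the istiod rule needs the explicit `([^:]+)(?::\d+)?` pattern: the endpoint address already carries a port that must be stripped, while the pod-IP rule can rely on the default capture.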
## Essential PromQL Queries

### Request Rate (Traffic)

```promql
sum(rate(istio_requests_total{reporter="destination"}[5m]))
```

### Error Rate (Errors)

```promql
sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
/
sum(rate(istio_requests_total{reporter="destination"}[5m]))
```

### P99 Latency by Service (Latency)

```promql
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service)
)
```

### Success Rate by Service

```promql
sum(rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])) by (destination_service)
/
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service)
```
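These queries can also be run programmatically against the Prometheus HTTP API (`/api/v1/query`). A minimal Python sketch using only the standard library — the Prometheus URL is a placeholder for your deployment, and the sample payload mirrors the API's documented instant-query response shape with invented values:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.istio-system:9090"  # assumed endpoint

def parse_result(payload: dict) -> list[dict]:
    """Flatten a Prometheus instant-query response into
    [{label: value, ..., 'value': float}] rows."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return [
        {**r["metric"], "value": float(r["value"][1])}
        for r in payload["data"]["result"]
    ]

def query(promql: str) -> list[dict]:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return parse_result(json.load(resp))

# Shape of an instant-query response (sample payload, values made up):
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"destination_service": "payments"}, "value": [1700000000, "12.5"]},
    ]},
}
rows = parse_result(sample)
```

In a live cluster, `query('sum(rate(istio_requests_total{reporter="destination"}[5m]))')` would return the mesh-wide request rate as a single row.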
## Grafana Dashboard Design

### Dashboard Hierarchy

Structure dashboards in three tiers: Mesh Overview (global health), Service Dashboard (per-service metrics), and Workload Dashboard (pod-level detail).
### Mesh Overview Dashboard

```json
{
  "dashboard": {
    "title": "Istio Mesh Overview",
    "refresh": "30s",
    "panels": [
      {
        "title": "Global Request Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "round(sum(rate(istio_requests_total{reporter=\"destination\"}[5m])), 0.01)"
        }]
      },
      {
        "title": "Global Error Rate",
        "type": "stat",
        "gridPos": { "x": 6, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "sum(rate(istio_requests_total{reporter=\"destination\",response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total{reporter=\"destination\"}[5m]))"
        }],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": null, "color": "green" },
                { "value": 0.01, "color": "yellow" },
                { "value": 0.05, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency by Service",
        "type": "gauge",
        "gridPos": { "x": 12, "y": 0, "w": 6, "h": 4 },
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service))"
        }]
      },
      {
        "title": "Service Traffic Flow",
        "type": "nodeGraph",
        "gridPos": { "x": 0, "y": 4, "w": 24, "h": 10 },
        "targets": [
          {
            "expr": "label_replace(label_replace(sum by (source_workload, destination_workload) (rate(istio_requests_total{reporter=\"destination\"}[5m])), \"source\", \"$1\", \"source_workload\", \"(.+)\"), \"target\", \"$1\", \"destination_workload\", \"(.+)\")",
            "format": "table",
            "instant": true
          }
        ],
        "options": {
          "nodes": {
            "mainStatUnit": "reqps"
          },
          "edges": {
            "mainStatUnit": "reqps"
          }
        }
      }
    ]
  }
}
```
### Latency Heatmap Panel

```json
{
  "title": "Request Latency Distribution",
  "type": "heatmap",
  "targets": [{
    "expr": "sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le)",
    "format": "heatmap"
  }],
  "options": {
    "calculate": false,
    "color": {
      "scheme": "Spectral"
    },
    "yAxis": {
      "decimals": 0,
      "unit": "ms"
    }
  },
  "dataFormat": "tsbuckets"
}
```
## Distributed Tracing Integration

### Jaeger Configuration

Enable tracing in the Istio mesh:

```yaml
# istio-tracing.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
spec:
  values:
    pilot:
      env:
        ENABLE_EXTERNAL_JAEGER: "true"
    tracing:
      jaeger:
        enabled: true
        hub: docker.io/jaegertracing
        tag: "1.40"  # quoted: an unquoted 1.40 parses as the YAML float 1.4
```
### Tracing Metrics

```promql
# Jaeger query latency
histogram_quantile(0.99,
  sum(rate(jaeger_query_duration_seconds_bucket[5m])) by (le)
)
```
Note: Istio does not expose trace count metrics directly. For trace volume monitoring, instrument your applications with OpenTelemetry SDKs to emit custom metrics, or monitor the Jaeger collector metrics (e.g., jaeger_collector_spans_received_total).
## Security Metrics

### mTLS Status Monitoring

```promql
# mTLS connection ratio
sum(rate(istio_tcp_connections_opened_total{connection_security_policy="mutual_tls"}[5m]))
/
sum(rate(istio_tcp_connections_opened_total[5m]))

# Certificate expiration
max(istio_certificate_expiry_seconds) by (cluster_id)
```
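The expiry gauge is in seconds of remaining validity, so the threshold arithmetic is simple division. A small Python helper sketching it, with the one-day warning window chosen to match the 86400-second alert threshold used in this guide:

```python
SECONDS_PER_DAY = 86_400

def cert_expiry_status(expiry_seconds: float, warn_days: float = 1.0) -> tuple[float, bool]:
    """Convert a seconds-of-validity-remaining gauge value into days,
    and flag certificates inside the warning window.

    warn_days=1.0 mirrors the 86400-second threshold of the
    IstioCertificateExpiry alert below.
    """
    days_left = expiry_seconds / SECONDS_PER_DAY
    return days_left, days_left < warn_days

# 12 hours of validity left -> half a day, inside the warning window
days_left, needs_rotation = cert_expiry_status(43_200)
```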
### Security Alerts

```yaml
# security-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-security-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-security
      rules:
        - alert: IstioMTLSFailure
          expr: |
            (
              sum(rate(istio_tcp_connections_opened_total{connection_security_policy!="mutual_tls"}[5m]))
              /
              sum(rate(istio_tcp_connections_opened_total[5m]))
            ) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High non-mTLS connection rate"
            description: "More than 10% of connections are not using mTLS"
        - alert: IstioCertificateExpiry
          expr: |
            istio_certificate_expiry_seconds < 86400
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Istio certificate expires soon"
```
## SLO/SLI Implementation

### Define Service Level Objectives

Note: The ConfigMap format below is conceptual. For production SLO management, use tools like Sloth, OpenSLO, or the Pyrra SLO operator, which generate Prometheus rules from SLO definitions.
```yaml
# slo-config.yaml (conceptual example)
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-slo-config
data:
  slo.yaml: |
    services:
      - name: payments-service
        slos:
          - name: availability
            objective: 99.9
            sli: |
              sum(rate(istio_requests_total{destination_service_name="payments-service",response_code!~"5.."}[5m]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service"}[5m]))
          - name: latency
            objective: 99
            sli: |
              histogram_quantile(0.99,
                sum(rate(istio_request_duration_milliseconds_bucket{destination_service_name="payments-service"}[5m])) by (le)
              )
```

(The selectors use `destination_service_name`; the Istio standard metrics carry no plain `service` label.)
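Behind any SLO tooling sits the same error-budget arithmetic: a 99.9% objective allows a 0.1% error fraction over the window, and the budget consumed is the observed error fraction divided by that allowance. A Python sketch with invented request counts:

```python
def error_budget_remaining(objective_pct: float, good: float, total: float) -> float:
    """Fraction of the error budget left for an SLO window.

    objective_pct: e.g. 99.9 for a 99.9% availability SLO.
    good/total: good-event and total-event counts over the window
    (e.g. non-5xx vs. all istio_requests_total increases).
    """
    budget = 1.0 - objective_pct / 100.0  # allowed error fraction
    if total == 0:
        return 1.0  # no traffic, budget untouched
    actual_error = 1.0 - good / total
    return 1.0 - actual_error / budget

# 99.9% objective, 100,000 requests, 30 errors: 0.03% of a 0.1% allowance
# has been spent, so roughly 70% of the budget remains.
remaining = error_budget_remaining(99.9, 99_970, 100_000)
```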
### SLO Alerting

```yaml
# slo-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-slo-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-slo
      rules:
        - alert: IstioSLOViolation
          # Note: a 28d range over raw series is an expensive query; in
          # production, build this from recording rules instead.
          expr: |
            (
              sum(rate(istio_requests_total{destination_service_name="payments-service",response_code!~"5.."}[28d]))
              /
              sum(rate(istio_requests_total{destination_service_name="payments-service"}[28d]))
            ) < 0.999
          for: 1h
          labels:
            severity: critical
          annotations:
            summary: "SLO violation for payments-service availability"
```
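A useful companion to a hard SLO-violation alert is the burn rate: how fast the error budget is being consumed relative to the allowance, where 1.0 means the budget lasts exactly one SLO window. A sketch of the arithmetic — the 14.4 threshold is the commonly cited fast-burn paging value for a 30-day window, not something Istio defines:

```python
def burn_rate(error_ratio: float, objective_pct: float) -> float:
    """Burn rate = observed error ratio / allowed error ratio.

    1.0 -> the budget lasts exactly the SLO window;
    ~14.4 sustained over 1h burns about 2% of a 30-day budget,
    a common page-worthy threshold.
    """
    allowed = 1.0 - objective_pct / 100.0
    return error_ratio / allowed

# A 1.44% observed error ratio against a 99.9% objective burns the
# budget roughly 14.4 times faster than the window allows.
rate = burn_rate(0.0144, 99.9)
```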
## Custom Metrics and Business KPIs

### Custom Metrics via EnvoyFilter

The Lua filter below does not emit a metric by itself; it tags inbound requests with a header that downstream telemetry configuration can use as a custom dimension.

```yaml
# custom-metrics.yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: custom-metrics-filter
  namespace: istio-system
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_request(request_handle)
                local headers = request_handle:headers()
                local user_id = headers:get("x-user-id")
                if user_id then
                  request_handle:headers():add("x-custom-metric", "user_" .. user_id)
                end
              end
```
### Business KPI Queries

Note: The metrics referenced below (`payment_amount_sum`, `checkout_completed_total`, `add_to_cart_total`) are not Istio metrics. They require custom instrumentation in your applications using Prometheus client libraries or OpenTelemetry SDKs.

```promql
# Revenue per request (requires a custom payment_amount_sum metric)
sum(rate(payment_amount_sum[5m])) / sum(rate(istio_requests_total{destination_service_name="payments-service"}[5m]))

# User conversion rate (requires custom checkout/add_to_cart metrics)
sum(rate(checkout_completed_total[5m])) / sum(rate(add_to_cart_total[5m]))
```
## Cost Optimization Strategies

### Metric Cardinality Management

Important: Do not drop the Istio `*_bucket` metrics. Histogram buckets are required by `histogram_quantile()` to calculate P50, P95, and P99 latency percentiles; dropping them would break every latency SLO query in this guide.

Instead, reduce cardinality by dropping high-cardinality labels or non-essential metrics:
```yaml
# cost-optimization.yaml
scrape_configs:
  - job_name: 'istio-proxy-optimized'
    metrics_path: /stats/prometheus
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: istio-proxy
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        replacement: "${1}:15090"
        target_label: __address__
    metric_relabel_configs:
      # Drop noisy header metrics and Envoy's internal per-cluster latency
      # buckets; the Istio request-duration buckets are kept for quantiles
      - source_labels: [__name__]
        action: drop
        regex: 'istio_request_headers_.+|istio_response_headers_.+|envoy_cluster_upstream_rq_.+_bucket'
      # Drop high-cardinality labels
      - action: labeldrop
        regex: 'source_principal|destination_principal|request_id'
```
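Prometheus anchors `metric_relabel_configs` regexes, so the drop rule matches whole metric names only. This Python sketch mimics the rule to show which series survive; the metric list is illustrative:

```python
import re

# Same pattern as the 'drop' rule in the scrape config above.
DROP_RE = re.compile(
    r"istio_request_headers_.+|istio_response_headers_.+|envoy_cluster_upstream_rq_.+_bucket"
)

def kept(metric_names: list[str]) -> list[str]:
    """Return the metrics that survive the drop rule. Prometheus anchors
    relabel regexes, hence fullmatch rather than search."""
    return [m for m in metric_names if not DROP_RE.fullmatch(m)]

names = [
    "istio_requests_total",
    "istio_request_duration_milliseconds_bucket",  # kept: needed for quantiles
    "istio_request_headers_count",                 # dropped: header metric
    "envoy_cluster_upstream_rq_time_bucket",       # dropped: Envoy-internal bucket
]
survivors = kept(names)
```

Note how `istio_request_duration_milliseconds_bucket` survives: the bucket pattern is scoped to `envoy_cluster_upstream_rq_`, so the Istio latency histograms stay intact.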
### Recording Rules for Efficiency

```yaml
# recording-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-recording-rules
  namespace: istio-system
spec:
  groups:
    - name: istio-recording-rules
      interval: 30s
      rules:
        - record: istio:service:request_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )
        - record: istio:service:success_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code!~"[45].."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )
        - record: istio:service:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            )
        - record: istio:service:error_rate:5m
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            )
```
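The latency recording rule depends on `histogram_quantile()`, which finds the bucket where the requested rank falls and interpolates linearly inside it. A simplified Python reimplementation of that mechanic — the cumulative bucket counts are invented, and real PromQL has extra edge-case handling this sketch omits:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate PromQL histogram_quantile().

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    ending with the +Inf bucket, as in the 'le' label of
    istio_request_duration_milliseconds_bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return lower_bound  # rank falls in the open-ended bucket
            # Linear interpolation within the matching bucket.
            width = bound - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = bound, count
    return lower_bound

# 'le' bounds in ms with invented cumulative counts for 1000 requests:
buckets = [(100.0, 800.0), (250.0, 950.0), (500.0, 990.0), (float("inf"), 1000.0)]
p99 = histogram_quantile(0.99, buckets)
p50 = histogram_quantile(0.50, buckets)
```

This also illustrates why dropping bucket series is destructive: without the full `le` ladder there is nothing to interpolate over.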
## Alerting Rules

### Critical Service Mesh Alerts

```yaml
# istio-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-alerts
  namespace: istio-system
spec:
  groups:
    - name: istio-error-alerts
      rules:
        - alert: IstioHighErrorRate
          expr: |
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m])
            )
            /
            sum by (destination_service) (
              rate(istio_requests_total{reporter="destination"}[5m])
            ) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate on {{ $labels.destination_service }}"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: IstioMeshHighErrorRate
          expr: |
            sum(rate(istio_requests_total{reporter="destination",response_code=~"5.."}[5m]))
            /
            sum(rate(istio_requests_total{reporter="destination"}[5m])) > 0.01
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Elevated mesh-wide error rate"
    - name: istio-latency-alerts
      rules:
        - alert: IstioHighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (destination_service, le) (
                rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])
              )
            ) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency on {{ $labels.destination_service }}"
            description: "P99 latency is {{ $value }}ms"
        - alert: IstioLatencySpike
          expr: |
            (
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service))
              -
              histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m] offset 1h)) by (le, destination_service))
            ) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Sudden latency increase detected"
    - name: istio-traffic-alerts
      rules:
        - alert: IstioLowTraffic
          expr: |
            sum(rate(istio_requests_total{reporter="destination"}[5m])) < 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Unusually low traffic in mesh"
        - alert: IstioTrafficDrop
          expr: |
            (
              sum(rate(istio_requests_total{reporter="destination"}[5m]))
              /
              sum(rate(istio_requests_total{reporter="destination"}[5m] offset 1h))
            ) < 0.5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Traffic dropped by more than 50%"
```
## Multi-Cluster Federation

For multi-cluster deployments, configure Prometheus federation to aggregate metrics.

```yaml
# prometheus-federation.yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="istio-proxy"}'
        - 'istio:service:request_rate:5m'
        - 'istio:service:success_rate:5m'
        - 'istio:service:error_rate:5m'
    static_configs:
      - targets:
          - 'prometheus-cluster-1.example.com:9090'
          - 'prometheus-cluster-2.example.com:9090'
```
## Getting Started

- Deploy Prometheus with the scrape configuration above in the `istio-system` namespace
- Apply recording rules to pre-compute metrics and reduce query load
- Import the Grafana dashboards using the JSON configurations above, or use Istio's official dashboard (ID: 7639)
- Configure Alertmanager to route alerts to Slack, PagerDuty, or email
- Set up SLO dashboards using the recording rules for long-term trend analysis
- Enable distributed tracing with Jaeger or Zipkin for request flow visibility
- Monitor security metrics including mTLS status and certificate expiration
- Implement cost optimization by managing metric cardinality and using recording rules