Build a Production-Grade Observability Stack with Prometheus, Grafana, Loki, and Jaeger
This guide covers the implementation of a complete observability stack for distributed systems using Prometheus for metrics collection, Grafana for visualization, Loki for log aggregation, and Jaeger for distributed tracing.
Architecture Overview
The stack operates on the Three Pillars of Observability: Metrics (Prometheus), Logs (Loki), and Tracing (Jaeger). Prometheus uses a pull-based model to scrape metrics from instrumented applications and exporters. Grafana queries Prometheus for time-series data, Loki for logs, and Jaeger for trace data, providing unified dashboards. Jaeger collects distributed traces via OpenTelemetry instrumentation, enabling request flow analysis across microservices.
Prometheus Setup
Installation and Configuration
Deploy Prometheus via Docker or Kubernetes. Create a prometheus.yml configuration file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'docker-cluster'
    monitor: 'prometheus'

# The data directory is set via the --storage.tsdb.path flag
# (the official image defaults to /prometheus), not in this file.
storage:
  tsdb:
    out_of_order_time_window: 30m

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app-service:8080']
    metrics_path: '/metrics'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerting_rules.yml'
Create alerting_rules.yml:
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }} seconds"
Run Prometheus with exemplar storage enabled:
docker run -d \
-p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
-v $(pwd)/alerting_rules.yml:/etc/prometheus/alerting_rules.yml \
-v prometheus-data:/prometheus \
prom/prometheus:latest \
--config.file=/etc/prometheus/prometheus.yml \
--enable-feature=exemplar-storage
Metrics Collection
Applications and exporters expose metrics on a /metrics endpoint that Prometheus scrapes. Instrument applications using client libraries or exporters. Key metric types include:
- Counter: monotonically increasing values (e.g., http_requests_total)
- Gauge: values that can go up or down (e.g., memory_usage_bytes)
- Histogram: count and sum of observed values in configurable buckets
- Summary: count and sum plus quantiles over a sliding time window
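As a minimal sketch of these types (hypothetical metric names and workload, using the official prometheus_client Python library), the following exposes a counter, a gauge, and a histogram on port 8080, matching the 'application' scrape job configured above:
# Hypothetical instrumentation sketch using prometheus_client:
# exposes http_requests_total, memory_usage_bytes, and
# http_request_duration_seconds on :8080/metrics for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
MEMORY = Gauge("memory_usage_bytes", "Resident memory in bytes")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8080)  # serve /metrics on port 8080
    while True:
        with LATENCY.time():                       # observe request duration
            time.sleep(random.uniform(0.01, 0.2))  # simulate work
        REQUESTS.labels(status="200").inc()        # count the simulated request
        MEMORY.set(512 * 1024 * 1024)              # gauges are set to absolute values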
Loki Setup
Installation and Configuration
Deploy Loki for log aggregation:
docker run -d \
-p 3100:3100 \
-v $(pwd)/loki-config.yml:/etc/loki/local-config.yaml \
-v loki-data:/loki \
grafana/loki:latest \
-config.file=/etc/loki/local-config.yaml
Create loki-config.yml:
server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
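Once the container is running, Loki's readiness endpoint confirms the configuration loaded (it may take a short while after startup to report ready):
curl http://localhost:3100/ready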
Deploy Promtail to forward logs:
docker run -d \
-v $(pwd)/promtail-config.yml:/etc/promtail/config.yml \
-v $(pwd)/app-logs:/var/log:ro \
grafana/promtail:latest \
-config.file=/etc/promtail/config.yml
Create promtail-config.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: application
          __path__: /var/log/*.log
Note: Ensure log files exist in the mounted directory (./app-logs) before starting Promtail. Logs should be in plain text or JSON format.
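As a point of reference, a minimal (hypothetical) producer writing JSON lines to ./app-logs/app.log, which the Promtail container tails through the /var/log/*.log glob, could look like this; the file name and fields are illustrative:
# Hypothetical log producer: one JSON object per line in app-logs/app.log,
# picked up by the Promtail scrape config above and queryable in LogQL via "| json".
import json
import logging

handler = logging.FileHandler("app-logs/app.log")
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("application")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info(json.dumps({"level": "info", "msg": "request handled", "status": 200}))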
Grafana Integration
Data Source Configuration
Create /etc/grafana/provisioning/datasources/datasources.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: true
Access Grafana at http://localhost:3000 (default credentials: admin/admin). Datasources will be auto-provisioned on startup.
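To confirm provisioning worked, the Grafana HTTP API lists the configured data sources (using the default admin credentials):
curl -u admin:admin http://localhost:3000/api/datasources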
Dashboard Creation and Alerting
Use PromQL for queries. Common patterns:
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
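The same expressions can be run outside Grafana against the Prometheus HTTP API, which is useful for scripted checks:
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=rate(http_requests_total[5m])'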
Use LogQL for log queries:
{job="application"} |= "error"
{job="application"} | logfmt | trace_id != ""
Grafana's unified alerting can also evaluate its own alert rules, provisioned from YAML files under /etc/grafana/provisioning/alerting/. Be aware that these files use Grafana's alert-rule provisioning schema (rule groups with query data and a condition), which differs from both the dashboard-provider format and Prometheus rule syntax, so the alerting_rules.yml shown earlier cannot be dropped in unchanged. For this stack the simpler path is to keep alerting on the Prometheus side: Prometheus evaluates alerting_rules.yml and forwards firing alerts to Alertmanager, which is already configured in prometheus.yml.
Import pre-built dashboards from Grafana.com for quick setup (Node Exporter Full, Kubernetes Cluster Monitoring).
Jaeger Distributed Tracing
Deployment
Deploy Jaeger All-in-One with OTLP enabled:
docker run -d \
--name jaeger \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Access Jaeger UI at http://localhost:16686. OTLP endpoints are available at http://localhost:4317 (gRPC) and http://localhost:4318 (HTTP).
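Once instrumented services have reported spans, they also appear in the Jaeger query API:
curl http://localhost:16686/api/services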
OpenTelemetry Instrumentation
Instrument applications using OpenTelemetry SDKs with OTLP exporters. Example for Go:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracer registers a global tracer provider that batches spans
// and exports them to the Jaeger OTLP gRPC endpoint.
func initTracer(serviceName string) error {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(serviceName),
        )),
    )
    otel.SetTracerProvider(tp)
    return nil
}

// Usage in handlers
tracer := otel.Tracer("service-a")
ctx, span := tracer.Start(ctx, "process-request")
defer span.End()
For Python:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service so Jaeger groups its spans correctly
resource = Resource.create({
    "service.name": "my-service"
})

# Export spans to the Jaeger OTLP gRPC endpoint
otlp_exporter = OTLPSpanExporter(
    endpoint="localhost:4317",
    insecure=True,
)

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("operation"):
    # Your code here
    pass
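If the service happens to be a Flask app, the optional opentelemetry-instrumentation-flask package can generate server spans automatically on top of the tracer provider configured above; the app below is purely illustrative:
# Illustrative Flask app auto-instrumented with OpenTelemetry;
# each incoming request becomes a server span exported to Jaeger via OTLP.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route("/")
def index():
    return "ok"

if __name__ == "__main__":
    app.run(port=8080)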
Correlation and Integration
Trace to Metrics
Link traces to Prometheus metrics using exemplars. Exemplar storage is enabled by the --enable-feature=exemplar-storage flag already passed to Prometheus in the Docker command and Compose file above; no prometheus.yml changes are required, though the in-memory exemplar buffer can optionally be tuned via storage.exemplars.max_exemplars.
Attach exemplars when recording metrics so that individual observations carry a trace ID. With the Prometheus Go client (client_golang), for example, a histogram observation can carry an exemplar like this:
// client_golang: record the observation together with the active trace ID as an exemplar
histogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
    latencySeconds,
    prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
)
Trace to Logs
Configure Grafana to link traces to logs. In Jaeger data source settings:
- Navigate to the Trace to logs section
- Select the Loki data source
- Configure tag mapping for trace_id
- Enable Filter by trace ID
Instrument applications to include trace IDs in logs:
import "go.opentelemetry.io/otel/bridge/opentracing"
log.Printf("Processing request trace_id=%s", span.SpanContext().TraceID().String())
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
current_span = trace.get_current_span()
print(f"Processing request trace_id={current_span.context.trace_id}")
Getting Started
- Deploy the stack using Docker Compose:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--enable-feature=exemplar-storage'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "4317:4317"
      - "4318:4318"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command:
      - '-config.file=/etc/loki/local-config.yaml'

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - ./app-logs:/var/log:ro
    command:
      - '-config.file=/etc/promtail/config.yml'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

  app-service:
    image: nginx:alpine
    ports:
      - "8080:80"

volumes:
  grafana-storage:
  loki-data:
  prometheus-data:
Create alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
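Note that the 'web.hook' receiver above has no notification integration attached, so matched alerts are routed but not delivered anywhere. A hypothetical webhook target (placeholder URL) would be configured like this:
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://alert-receiver:5001/hooks'   # placeholder endpoint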
- Start services: docker-compose up -d
- Instrument applications with OpenTelemetry SDKs using OTLP exporters
- Configure Prometheus scrape targets using service names
- Set up Grafana dashboards and alerts
- Verify traces appear in Jaeger UI and logs in Loki
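As a quick smoke test (assuming the example Compose file above), generate some traffic against the sample app and check target health through the Prometheus API; note that the nginx placeholder does not expose /metrics, so its 'application' target will report as down until a real instrumented service replaces it:
# generate a little traffic against the sample app-service (nginx on :8080)
for i in $(seq 1 50); do curl -s http://localhost:8080/ > /dev/null; done

# list scrape target health as seen by Prometheus
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'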
Access points:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Jaeger: http://localhost:16686
- Loki: http://localhost:3100