Build a Production-Grade Observability Stack with Prometheus, Grafana, Loki, and Jaeger
This guide covers the implementation of a complete observability stack for distributed systems using Prometheus for metrics collection, Grafana for visualization, Loki for log aggregation, and Jaeger for distributed tracing.
Architecture Overview
The stack operates on the Three Pillars of Observability: Metrics (Prometheus), Logs (Loki), and Tracing (Jaeger). Prometheus uses a pull-based model to scrape metrics from instrumented applications and exporters. Grafana queries Prometheus for time-series data, Loki for logs, and Jaeger for trace data, providing unified dashboards. Jaeger collects distributed traces via OpenTelemetry instrumentation, enabling request flow analysis across microservices.
Prometheus Setup
Installation and Configuration
Deploy Prometheus via Docker or Kubernetes. Create a prometheus.yml configuration file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'docker-cluster'
    monitor: 'prometheus'

# The data directory is set via the --storage.tsdb.path flag
# (the official image defaults to /prometheus), not in this file.
storage:
  tsdb:
    out_of_order_time_window: 30m

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app-service:8080']
    metrics_path: '/metrics'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerting_rules.yml'
Create alerting_rules.yml:
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }} seconds"
Run Prometheus with exemplar storage enabled:
docker run -d \
-p 9090:9090 \
-v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
-v $(pwd)/alerting_rules.yml:/etc/prometheus/alerting_rules.yml \
-v prometheus-data:/prometheus \
prom/prometheus:latest \
--config.file=/etc/prometheus/prometheus.yml \
--enable-feature=exemplar-storage
Metrics Collection
Applications and exporters expose metrics on a /metrics endpoint that Prometheus scrapes. Instrument applications using client libraries or exporters. Key metric types include:
- Counter: monotonically increasing values (e.g., http_requests_total)
- Gauge: values that can go up or down (e.g., memory_usage_bytes)
- Histogram: count and sum of observed values in configurable buckets
- Summary: count and sum plus quantiles over a sliding time window
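As a minimal sketch of these types (hypothetical metric names and workload, using the official prometheus_client Python library), the following exposes a counter, a gauge, and a histogram on port 8080, matching the 'application' scrape job configured above:
# Hypothetical instrumentation sketch using prometheus_client:
# exposes http_requests_total, memory_usage_bytes, and
# http_request_duration_seconds on :8080/metrics for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
MEMORY = Gauge("memory_usage_bytes", "Resident memory in bytes")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

if __name__ == "__main__":
    start_http_server(8080)  # serve /metrics on port 8080
    while True:
        with LATENCY.time():                       # observe request duration
            time.sleep(random.uniform(0.01, 0.2))  # simulate work
        REQUESTS.labels(status="200").inc()        # count the simulated request
        MEMORY.set(512 * 1024 * 1024)              # gauges are set to absolute values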
Loki Setup
Installation and Configuration
Deploy Loki for log aggregation:
docker run -d \
-p 3100:3100 \
-v $(pwd)/loki-config.yml:/etc/loki/local-config.yaml \
-v loki-data:/loki \
grafana/loki:latest \
-config.file=/etc/loki/local-config.yaml
Create loki-config.yml:
server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
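Once the container is running, Loki's readiness endpoint confirms the configuration loaded (it may take a short while after startup to report ready):
curl http://localhost:3100/ready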
Deploy Promtail to forward logs:
docker run -d \
-v $(pwd)/promtail-config.yml:/etc/promtail/config.yml \
-v $(pwd)/app-logs:/var/log:ro \
grafana/promtail:latest \
-config.file=/etc/promtail/config.yml
Create promtail-config.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: application
          __path__: /var/log/*.log
Note: Ensure log files exist in the mounted directory (./app-logs) before starting Promtail. Logs should be in plain text or JSON format.
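As a point of reference, a minimal (hypothetical) producer writing JSON lines to ./app-logs/app.log, which the Promtail container tails through the /var/log/*.log glob, could look like this; the file name and fields are illustrative:
# Hypothetical log producer: one JSON object per line in app-logs/app.log,
# picked up by the Promtail scrape config above and queryable in LogQL via "| json".
import json
import logging

handler = logging.FileHandler("app-logs/app.log")
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("application")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info(json.dumps({"level": "info", "msg": "request handled", "status": 200}))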
Grafana Integration
Data Source Configuration
Create /etc/grafana/provisioning/datasources/datasources.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: true
Access Grafana at http://localhost:3000 (default credentials: admin/admin). Datasources will be auto-provisioned on startup.
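To confirm provisioning worked, the Grafana HTTP API lists the configured data sources (using the default admin credentials):
curl -u admin:admin http://localhost:3000/api/datasources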
Dashboard Creation and Alerting
Use PromQL for queries. Common patterns:
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
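The same expressions can be run outside Grafana against the Prometheus HTTP API, which is useful for scripted checks:
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=rate(http_requests_total[5m])'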
Use LogQL for log queries:
{job="application"} |= "error"
{job="application"} | logfmt | trace_id != ""
Grafana's unified alerting can also evaluate its own alert rules, provisioned from YAML files under /etc/grafana/provisioning/alerting/. Be aware that these files use Grafana's alert-rule provisioning schema (rule groups with query data and a condition), which differs from both the dashboard-provider format and Prometheus rule syntax, so the alerting_rules.yml shown earlier cannot be dropped in unchanged. For this stack the simpler path is to keep alerting on the Prometheus side: Prometheus evaluates alerting_rules.yml and forwards firing alerts to Alertmanager, which is already configured in prometheus.yml.
Import pre-built dashboards from Grafana.com for quick setup (Node Exporter Full, Kubernetes Cluster Monitoring).
Jaeger Distributed Tracing
Deployment
Deploy Jaeger All-in-One with OTLP enabled:
docker run -d \
--name jaeger \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Access Jaeger UI at http://localhost:16686. OTLP endpoints are available at http://localhost:4317 (gRPC) and http://localhost:4318 (HTTP).
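Once instrumented services have reported spans, they also appear in the Jaeger query API:
curl http://localhost:16686/api/services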
OpenTelemetry Instrumentation
Instrument applications using OpenTelemetry SDKs with OTLP exporters. Example for Go:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracer registers a global tracer provider that batches spans
// and exports them to the Jaeger OTLP gRPC endpoint.
func initTracer(serviceName string) error {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(serviceName),
        )),
    )
    otel.SetTracerProvider(tp)
    return nil
}

// Usage in handlers
tracer := otel.Tracer("service-a")
ctx, span := tracer.Start(ctx, "process-request")
defer span.End()
For Python:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service so Jaeger groups its spans correctly
resource = Resource.create({
    "service.name": "my-service"
})

# Export spans to the Jaeger OTLP gRPC endpoint
otlp_exporter = OTLPSpanExporter(
    endpoint="localhost:4317",
    insecure=True,
)

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("operation"):
    # Your code here
    pass
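If the service happens to be a Flask app, the optional opentelemetry-instrumentation-flask package can generate server spans automatically on top of the tracer provider configured above; the app below is purely illustrative:
# Illustrative Flask app auto-instrumented with OpenTelemetry;
# each incoming request becomes a server span exported to Jaeger via OTLP.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route("/")
def index():
    return "ok"

if __name__ == "__main__":
    app.run(port=8080)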
Correlation and Integration
Trace to Metrics
Link traces to Prometheus metrics using exemplars. Exemplar storage is enabled by the --enable-feature=exemplar-storage flag already passed to Prometheus in the Docker command and Compose file above; no prometheus.yml changes are required, though the in-memory exemplar buffer can optionally be tuned via storage.exemplars.max_exemplars.
Attach exemplars when recording metrics so that individual observations carry a trace ID. With the Prometheus Go client (client_golang), for example, a histogram observation can carry an exemplar like this:
// client_golang: record the observation together with the active trace ID as an exemplar
histogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
    latencySeconds,
    prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
)
Trace to Logs
Configure Grafana to link traces to logs. In Jaeger data source settings:
- Navigate to the Trace to logs section
- Select the Loki data source
- Configure tag mapping for trace_id
- Enable Filter by trace ID
Instrument applications to include trace IDs in logs:
import "go.opentelemetry.io/otel/bridge/opentracing"
log.Printf("Processing request trace_id=%s", span.SpanContext().TraceID().String())
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
current_span = trace.get_current_span()
print(f"Processing request trace_id={current_span.context.trace_id}")
Getting Started
- Deploy the stack using Docker Compose:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--enable-feature=exemplar-storage'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "4317:4317"
      - "4318:4318"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command:
      - '-config.file=/etc/loki/local-config.yaml'

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - ./app-logs:/var/log:ro
    command:
      - '-config.file=/etc/promtail/config.yml'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

  app-service:
    image: nginx:alpine
    ports:
      - "8080:80"

volumes:
  grafana-storage:
  loki-data:
  prometheus-data:
Create alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
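Note that the 'web.hook' receiver above has no notification integration attached, so matched alerts are routed but not delivered anywhere. A hypothetical webhook target (placeholder URL) would be configured like this:
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://alert-receiver:5001/hooks'   # placeholder endpoint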
- Start services: docker-compose up -d
- Instrument applications with OpenTelemetry SDKs using OTLP exporters
- Configure Prometheus scrape targets using service names
- Set up Grafana dashboards and alerts
- Verify traces appear in Jaeger UI and logs in Loki
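As a quick smoke test (assuming the example Compose file above), generate some traffic against the sample app and check target health through the Prometheus API; note that the nginx placeholder does not expose /metrics, so its 'application' target will report as down until a real instrumented service replaces it:
# generate a little traffic against the sample app-service (nginx on :8080)
for i in $(seq 1 50); do curl -s http://localhost:8080/ > /dev/null; done

# list scrape target health as seen by Prometheus
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'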
Access points:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Jaeger: http://localhost:16686
- Loki: http://localhost:3100