Distributed Monitoring

Live RED metrics across 5 microservices. Anomaly injected at ~8s to demonstrate alert escalation.

Scale: 100K

DAU: 60K

Peak QPS: 30K

WS Conns: 10K

Data/Year: 10 TB

Go Instances

Service Health (RED Metrics)

Service	Rate	Error %	p50	p95	p99	CPU	Memory

Distributed Trace (POST /api/pages)

Error Budget (99.9% SLO)

Monthly allowance: 43.2 minutes of downtime (0.1% of a 30-day month). When the budget runs out, freeze deployments and focus on reliability.
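
The 43.2-minute figure follows directly from the SLO. The snippet below is a minimal sketch of that arithmetic, assuming a 30-day month; the function name downtime_budget_minutes is illustrative and not part of any library.

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days.

    Assumption: the month is treated as a fixed 30-day window.
    """
    total_minutes = days * 24 * 60          # 43,200 minutes in 30 days
    return total_minutes * (1.0 - slo)      # the fraction the SLO lets you miss


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        budget = downtime_budget_minutes(slo)
        print(f"SLO {slo:.2%}: {budget:.1f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why tightening the SLO is a much bigger commitment than the number suggests.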

Active Alerts

No active alerts. System healthy.

Observability Stack

RED Method

Rate (requests/s), Errors (%), and Duration (latency) are the three signals that tell you whether a service is healthy. Every API endpoint exposes these via Prometheus.
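
To make the three signals concrete, here is a minimal sketch that derives them from raw request records using only the standard library. It is not how Prometheus computes them (Prometheus typically estimates latency quantiles from histogram buckets); the Request type, the red_summary function, and the choice to count HTTP 5xx responses as errors are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list; q is in (0, 100]."""
    if not sorted_values:
        return 0.0
    rank = math.ceil(q * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_s):
    """Rate, error percentage, and latency percentiles for one time window."""
    durations = sorted(r.duration_ms for r in requests)
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": total / window_s,
        "error_pct": 100.0 * errors / total if total else 0.0,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# Synthetic sample: 100 requests in a 10 s window, two of them failing.
sample = [
    Request(duration_ms=float(i), status=500 if i % 50 == 0 else 200)
    for i in range(1, 101)
]

if __name__ == "__main__":
    print(red_summary(sample, window_s=10.0))
```

The p50, p95, and p99 keys mirror the latency columns in the Service Health table, where p99 exposes tail latency that an average would hide.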

Distributed Tracing

OpenTelemetry propagates the trace_id across all services. One request produces one trace, with spans from every service it touches, so the slowest span shows you the bottleneck in seconds.
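
The propagation idea can be sketched without any tracing library. The example below assumes the W3C Trace Context traceparent header format (version, 32-hex trace id, 16-hex span id, flags), which is OpenTelemetry's default propagation format. The helper names and the three service names are hypothetical, and a real service would use the OpenTelemetry SDK rather than hand-rolled parsing.

```python
import re
import secrets

# traceparent layout: version "00", trace id, parent span id, trace flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace at the edge of the system (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Header for a downstream call: same trace id, fresh span id.

    Falls back to a new trace if the incoming header is malformed.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header):
    """Extract the trace id, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header)
    return match.group(1) if match else None


if __name__ == "__main__":
    header = new_traceparent()
    # Hypothetical call chain for one request.
    for service in ("gateway", "pages-service", "storage-service"):
        print(f"{service:16s} {header}")
        header = child_traceparent(header)
```

Every hop prints the same trace id with a different span id, which is what lets a tracing backend stitch the spans into a single trace.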

Error Budget

A 99.9% SLO leaves a downtime budget of about 43 minutes per month. When the budget is consumed, freeze feature work and fix reliability. This prevents the "move fast and break things" trap.
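
The freeze rule can be expressed as a small check. This is a sketch under stated assumptions: a 30-day window, downtime tracked as a list of incident durations in minutes, and a hypothetical budget_status helper. Real setups often measure the budget in failed requests rather than minutes.

```python
def budget_status(incident_minutes, slo=0.999, window_days=30):
    """Compare downtime so far against the error budget for the window.

    incident_minutes: durations (in minutes) of outages in the current window.
    """
    budget = window_days * 24 * 60 * (1.0 - slo)
    spent = sum(incident_minutes)
    remaining = budget - spent
    return {
        "budget_min": budget,
        "spent_min": spent,
        "remaining_min": remaining,
        # Policy from the text: once the budget is gone, freeze feature work.
        "freeze_features": remaining <= 0,
    }


if __name__ == "__main__":
    for incidents in ([10, 20], [30, 15]):
        status = budget_status(incidents)
        action = "freeze features" if status["freeze_features"] else "keep shipping"
        print(f"incidents={incidents} remaining={status['remaining_min']:.1f} min -> {action}")
```

Two incidents totalling 30 minutes leave room to keep shipping; 45 minutes overspends the budget and triggers the freeze.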