02
Distributed Monitoring
Live RED metrics across 5 microservices. Anomaly injected at ~8s to demonstrate alert escalation.
Scale: 100K
DAU: 60K
Peak QPS: 30K
WS Conns: 10K
Data/Year: 10 TB
Go Instances: 6
Observability Stack
RED Method
Rate (requests/s), Errors (%), Duration (latency). The three signals that tell you if a service is healthy. Every API endpoint exposes these via Prometheus.
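As a rough illustration of how the three signals fall out of raw request data, the sketch below summarizes one time window. The `Sample` record, the `red_metrics` helper, the choice of the 95th percentile for Duration, and the rule that only 5xx responses count as errors are all assumptions made for this example; in a real deployment Prometheus would typically derive these values from counters and histograms rather than from an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One finished request (hypothetical record shape)."""
    duration_ms: float
    status: int  # HTTP status code


def red_metrics(samples, window_s, pct=95):
    """Summarize a window of requests as Rate, Errors, Duration."""
    total = len(samples)
    if total == 0:
        return {"rate_per_s": 0.0, "error_pct": 0.0, "duration_ms": 0.0}

    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for s in samples if s.status >= 500)

    # Nearest-rank percentile, using integer math to avoid float rounding.
    durations = sorted(s.duration_ms for s in samples)
    rank = max(1, (pct * total + 99) // 100)

    return {
        "rate_per_s": total / window_s,          # Rate
        "error_pct": 100.0 * errors / total,     # Errors
        "duration_ms": durations[rank - 1],      # Duration (p95 by default)
    }
```

Calling `red_metrics(samples, window_s=10)` on the requests seen in the last ten seconds returns one dictionary per window, which is the shape a dashboard tile or alert rule would consume.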
Distributed Tracing
OpenTelemetry propagates trace_id across all services. One request = one trace with spans from every service it touches. Find the bottleneck in seconds.
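To make the propagation concrete, here is a minimal hand-rolled sketch of the W3C Trace Context `traceparent` header, which is the format OpenTelemetry's default propagator sends between services. The helper names (`new_traceparent`, `parse_traceparent`, `child_traceparent`) are hypothetical; a real service would let the OpenTelemetry SDK inject and extract this header instead of building it by hand.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace id, 16-hex span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge service."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, span_id, flags) or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        raise ValueError("all-zero trace or span id")
    return trace_id, span_id, flags


def child_traceparent(parent_header):
    """Build the header a service sends downstream.

    The trace id is carried over unchanged; only the span id is new,
    which is what ties every service's spans into a single trace.
    """
    trace_id, _parent_span_id, flags = parse_traceparent(parent_header)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service that receives a request would call `child_traceparent` on the incoming header before making its own outbound calls, so every hop shares one `trace_id` while reporting its own span.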
Error Budget
A 99.9% SLO leaves a downtime budget of about 43 minutes per 30-day month (0.1% of 43,200 minutes). When the budget is consumed, freeze feature work and fix reliability. This prevents the "move fast and break things" trap.
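The budget arithmetic can be sketched in a few lines. The function names and the 30-day window are assumptions for illustration; teams that measure the SLO over requests rather than wall-clock time would count failed requests instead of minutes.

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Minutes of allowed downtime for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_pct) / 100.0 * window_minutes


def budget_remaining(slo_pct, downtime_minutes, window_days=30):
    """Budget left after the downtime observed so far (negative if overspent)."""
    return error_budget_minutes(slo_pct, window_days) - downtime_minutes


def should_freeze_features(slo_pct, downtime_minutes, window_days=30):
    """Policy from the text: once the budget is consumed, stop shipping features."""
    return budget_remaining(slo_pct, downtime_minutes, window_days) <= 0
```

For a 99.9% SLO, `error_budget_minutes(99.9)` comes to roughly 43.2 minutes, so a single 50-minute outage in the month would make `should_freeze_features(99.9, 50)` return `True`.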