High-Concurrency IM Architecture
Enterprise IM platform spec: 200K to 400K concurrent WebSocket connections, sub-200 ms latency, zero message loss, and horizontal scaling on K8s.
Simulated reference values. Real numbers depend on message size, fanout ratio, and hardware.
Six-Layer Architecture
Every message flows top-to-bottom. Each layer has a single responsibility and can scale independently.
Kafka Partition Distribution
Partition key = chat_id. Messages for the same room always land on the same partition, which preserves their order within the room. Different rooms spread across partitions for parallelism.
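The key-to-partition mapping can be sketched as a hash of chat_id modulo the partition count. This is a minimal illustration, not Kafka's own partitioner: Kafka's default partitioner uses a murmur2 hash, while the sketch below substitutes zlib.crc32 as a stand-in. The function name partition_for and the partition count of 12 are assumptions made for the example.

```python
import zlib


def partition_for(chat_id: str, num_partitions: int) -> int:
    """Map a chat_id to a partition index.

    Stand-in hash (CRC32), NOT Kafka's murmur2; the point is only that the
    mapping is deterministic for a fixed partition count.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(chat_id.encode("utf-8")) % num_partitions


# Hypothetical rooms; "room-1" appears twice on purpose.
rooms = ["room-1", "room-2", "room-3", "room-1"]
assignments = [(room, partition_for(room, 12)) for room in rooms]

for room, partition in assignments:
    print(f"{room} -> partition {partition}")
```

Because the index depends on the partition count, the same room can map to a different partition after partitions are added, so the per-room ordering guarantee holds only while the count stays fixed.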
Message Lifecycle
One message, seven stops, end-to-end in under 200 ms under normal load.
Observability (4 signals)
Every layer emits metrics, logs, and traces, with dashboards on top. One request, one trace id, end-to-end.
Metrics: Prometheus scrapes RED metrics plus custom ones (connection count, Kafka consumer lag).
Dashboards: Grafana panels for business KPIs. One glance answers "is the platform healthy?"
Logs: EFK or Loki. Containers stream to stdout; the application writes no log files on the pod.
Traces: OpenTelemetry. The trace id propagates through WS, Kafka, consumer, and DB. Find the bottleneck in seconds.
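The "one request, one trace id" idea can be sketched without any tracing library: the first hop creates an id, and every later hop copies it from the message headers. This is a simplified illustration under stated assumptions. The header name trace_id and the helpers ensure_trace_id and hop are hypothetical; a real OpenTelemetry setup would normally carry the id in the W3C Trace Context traceparent header through its own propagators.

```python
import uuid

TRACE_HEADER = "trace_id"  # hypothetical header name, not an OpenTelemetry constant


def ensure_trace_id(headers: dict) -> dict:
    """Return a copy of headers that carries a trace id, creating one if absent."""
    out = dict(headers)
    out.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return out


def hop(stage: str, headers: dict, spans: list) -> dict:
    """Record one span for this stage and hand the headers to the next stage."""
    headers = ensure_trace_id(headers)
    spans.append({"stage": stage, "trace_id": headers[TRACE_HEADER]})
    return headers


spans = []
headers = {}
# The four stages named in the text: WS, Kafka, consumer, DB.
for stage in ["ws", "kafka", "consumer", "db"]:
    headers = hop(stage, headers, spans)

for span in spans:
    print(span["stage"], span["trace_id"])
```

Every span carries the same id, so filtering by that id returns the whole path of one message across all stages.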
System Risks & Countermeasures
Four failure scenarios, each with its mitigation.
Redis Pub/Sub drops and the fast path stops. Mitigation: clients resubscribe, and the consumer replays the last N minutes from Kafka.
Mobile network drops. Mitigation: the client reconnects with exponential backoff and replays missed messages using last_seen_id.
A traffic spike overruns the consumers. Mitigation: HPA scales the consumer service on the lag metric, and Kafka rebalances partitions across the consumer group.
The Postgres write queue grows. Mitigation: batch inserts every 50 ms, asynchronous writes off the hot path, and read-replica offload for queries.
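The mobile-reconnect mitigation above combines two pieces: a backoff schedule and a replay keyed on last_seen_id. The sketch below shows one common variant, exponential backoff with full jitter, followed by a filter for missed messages. The base delay, cap, and attempt count are illustrative values, not figures from this spec, and the sketch assumes message ids increase monotonically within a chat.

```python
import random


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 6, rng=None):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**n)] seconds."""
    rng = rng or random.Random()
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        yield rng.uniform(0.0, ceiling)


def missed_messages(log: list, last_seen_id: int) -> list:
    """Return messages with id greater than last_seen_id, in id order.

    Assumes ids increase monotonically within a chat (an assumption of this
    sketch, not something the spec states).
    """
    return sorted((m for m in log if m["id"] > last_seen_id), key=lambda m: m["id"])


# Seeded RNG only so the example is repeatable.
delays = list(backoff_delays(rng=random.Random(42)))

# Hypothetical server-side log for one chat; the client last saw id 4.
server_log = [{"id": i, "text": f"msg {i}"} for i in range(1, 8)]
replay = missed_messages(server_log, last_seen_id=4)

print([round(d, 2) for d in delays])
print([m["id"] for m in replay])
```

Jitter spreads reconnect attempts out so that many clients dropping at once do not all retry at the same instant.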
Hot / Cold Storage Split
Recent traffic lives in fast, low-latency stores; audit history moves to cheap object storage via a background pipeline.
Hot: Redis + Cassandra / DynamoDB. Indexed by chat_id + time range. Answers "load last page of chat" in milliseconds.
Cold: S3 / Data Lake. Columnar format (Parquet) for compliance, audit, and analytics queries. Seconds-to-minutes retrieval is acceptable.
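The chat_id + time range index of the hot tier can be illustrated with an in-memory stand-in: one sorted list per chat, searched by binary search. This is a sketch of the access pattern only, not of Redis, Cassandra, or DynamoDB. The class HotStore and its methods are hypothetical names, and timestamps are plain integers for simplicity.

```python
import bisect
from collections import defaultdict


class HotStore:
    """In-memory stand-in for a store keyed by chat_id and ordered by timestamp."""

    def __init__(self):
        self._by_chat = defaultdict(list)  # chat_id -> sorted list of (ts, text)

    def append(self, chat_id: str, ts: int, text: str) -> None:
        # insort keeps each chat's rows sorted by timestamp.
        bisect.insort(self._by_chat[chat_id], (ts, text))

    def range(self, chat_id: str, start_ts: int, end_ts: int) -> list:
        """Messages with start_ts <= ts < end_ts, oldest first."""
        rows = self._by_chat.get(chat_id, [])
        lo = bisect.bisect_left(rows, (start_ts,))
        hi = bisect.bisect_left(rows, (end_ts,))
        return rows[lo:hi]

    def last_page(self, chat_id: str, size: int) -> list:
        """The newest `size` messages of a chat, oldest first."""
        if size <= 0:
            return []
        return self._by_chat.get(chat_id, [])[-size:]


store = HotStore()
for ts in [100, 200, 300, 400]:
    store.append("room-1", ts, f"m{ts}")
store.append("room-2", 150, "other")

print(store.range("room-1", 200, 400))
print(store.last_page("room-1", 2))
```

Grouping by chat_id and sorting by time within the group is roughly the role a partition key plus a clustering or sort key plays in a wide-column or key-value store.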
Fast Path vs Slow Path
Two paths for one message. Fast path wins latency, slow path wins durability. Both are mandatory.
Fast path: Redis Pub/Sub broadcasts to all WS instances subscribed to the chat room. Sub-millisecond fan-out. No disk write on the critical path.
Slow path: the consumer asynchronously writes to Postgres. Batched inserts, accepts tens-of-ms latency. Guarantees history survives a Redis crash.
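The two paths can be sketched side by side with in-process stand-ins: a fan-out object that delivers immediately and stores nothing, and a batch writer that buffers rows and flushes once the interval has elapsed. This is a simplified model under stated assumptions. It collapses Kafka and the consumer into one direct call, the classes FanOut and BatchWriter and the function handle are hypothetical, and the clock is passed in as now_ms so the behavior is easy to follow.

```python
from collections import defaultdict


class FanOut:
    """Stand-in for Redis Pub/Sub: delivers to current subscribers, keeps nothing."""

    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, chat_id, callback):
        self._subs[chat_id].append(callback)

    def publish(self, chat_id, message) -> int:
        for callback in self._subs[chat_id]:
            callback(message)
        return len(self._subs[chat_id])


class BatchWriter:
    """Stand-in for the consumer: buffers rows, flushes once the interval elapses."""

    def __init__(self, flush_interval_ms: int = 50):
        self.flush_interval_ms = flush_interval_ms
        self.buffer = []
        self.table = []  # plays the role of the Postgres table
        self._last_flush_ms = 0

    def add(self, row, now_ms: int) -> None:
        self.buffer.append(row)
        self.maybe_flush(now_ms)

    def maybe_flush(self, now_ms: int) -> None:
        if self.buffer and now_ms - self._last_flush_ms >= self.flush_interval_ms:
            self.table.extend(self.buffer)  # one multi-row insert in a real system
            self.buffer.clear()
            self._last_flush_ms = now_ms


def handle(message, fanout, writer, now_ms: int) -> int:
    delivered = fanout.publish(message["chat_id"], message)  # fast path
    writer.add(message, now_ms)                              # slow path
    return delivered


fanout, writer, inbox = FanOut(), BatchWriter(flush_interval_ms=50), []
fanout.subscribe("room-1", inbox.append)
for msg_id, now_ms in [(1, 10), (2, 30), (3, 60)]:
    handle({"chat_id": "room-1", "id": msg_id}, fanout, writer, now_ms)

print(len(inbox), [row["id"] for row in writer.table], writer.buffer)
```

Subscribers see each message as soon as it is published, while rows reach the table only at the next flush. That gap is why the slow path, not the fast path, is the source of truth for history.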