Interactive Simulation (No Backend)

High-Concurrency IM Architecture

Enterprise IM platform spec: 200K-400K concurrent WebSocket connections, sub-200ms latency, zero message loss, horizontal scaling on K8s. Drag the scale slider to see every layer resize.

Simulated reference values. Real numbers depend on message size, fanout ratio, and hardware.

Scale: 100K

DAU: 60K
Peak QPS: 30K
WS Conns: 10K
Data/Year: 10 TB

Go Instances

IDLE · scale: 100K

60K

3 WS Instances

8 Consumers

5.5 GB

0% CPU

0 ms

Consumer Lag: 0 ms

Six-Layer Architecture

Every message flows top-to-bottom. Each layer has a single responsibility and can scale independently.

Client Layer

Web / Mobile

Long-lived WebSocket with exponential-backoff reconnection and offline-message replay on reconnect.

60K conns @ 100K

Gateway Layer

Nginx / Kong

TLS termination, rate limiting, consistent-hash routing to WS instances. Keep it thin.

nodes @ 100K

Connection Layer

Go + WebSocket (stateless)

Goroutine per connection. Session lives in Redis so any instance can serve any client.

pods @ 100K

Message Queue

Kafka (partition by chat_id)

Source of truth. Partition key = chat_id guarantees in-room ordering; retention = offline replay window.

partitions @ 100K

Processing Layer

Consumer Service

Reads Kafka, publishes to Redis Pub/Sub for delivery, writes Postgres for history. Scale consumers against Kafka lag.

consumers @ 100K

Storage Layer

Redis Hot / Postgres Cold

Hot cache for 7-day history, durable store for archive and audit queries. Cold tier migrates to S3 / Data Lake.

5.5 GB @ 100K
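The Gateway Layer's consistent-hash routing can be illustrated with a small hash ring. This is a minimal sketch, not the gateway's actual configuration: the `HashRing` class, the `ws-0`..`ws-2` instance names, the virtual-node count, and the use of MD5 as the hash are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash. MD5 is used for distribution only, not security."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def route(self, key):
        """Return the first node clockwise from the key's hash."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["ws-0", "ws-1", "ws-2"])
before = {f"user-{i}": ring.route(f"user-{i}") for i in range(1000)}

ring.remove("ws-2")  # simulate one WS instance going away
moved = sum(1 for key, node in before.items() if ring.route(key) != node)
stayed_ok = all(ring.route(k) == n for k, n in before.items() if n != "ws-2")
```

Only the keys that were routed to the removed instance move; every other client keeps its original instance, which is the property that makes consistent hashing attractive for long-lived connections.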

Kafka Partition Distribution

Partition key = chat_id. Same room always lands on the same partition, preserving message order within the room. Different rooms spread across partitions for parallelism.
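The keying rule above can be sketched in a few lines. This is an illustration under assumptions: `partition_for` is a hypothetical helper, and CRC32 stands in for the hash (Kafka's default Java partitioner uses murmur2 on the key bytes, so real partition numbers will differ).

```python
import zlib
from collections import defaultdict


def partition_for(chat_id: str, num_partitions: int) -> int:
    """Map a chat_id to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(chat_id.encode("utf-8")) % num_partitions


# Interleaved traffic from two rooms; the second field is a per-room sequence.
messages = [("room-a", 1), ("room-b", 1), ("room-a", 2), ("room-b", 2), ("room-a", 3)]

partitions = defaultdict(list)
for chat_id, seq in messages:
    # Appending to a per-partition list mimics the append-only partition log.
    partitions[partition_for(chat_id, 6)].append((chat_id, seq))

for p, log in sorted(partitions.items()):
    print(p, log)
```

Because the partition is a pure function of `chat_id`, every message for a room is appended to the same log, and the log's append order is the room's message order.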

Producers (24K msg/s) → Brokers (3 × replica) → Groups (6 delivery + 2 persist)

avg 0 msg · each P → 1 Cd (delivery) + 1 Cp (persistence) · total 0 msg

Delivery Group: 6 pods · Redis Pub/Sub → WS

Legend: owner · lag · spare

Cd0: P0
Cd1: P1
Cd2: P2
Cd3: P3
Cd4: P4
Cd5: P5

Persistence Group: 2 pods · batched writes → Postgres

Cp0: P0-2
Cp1: P3-5

Lag (messages behind)

Add consumers to drain lag, up to one consumer per partition within a group; Kafka rebalances partitions automatically.
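The partition ownership shown for the two groups matches a range-style assignment, which can be sketched as follows. `range_assign` is a hypothetical helper that mirrors the idea behind Kafka's range assignor for a single topic; it is not the broker's or client's actual code.

```python
def range_assign(num_partitions, consumers):
    """Split partitions 0..n-1 into contiguous ranges, one per consumer.

    Consumers are sorted by name; the first (n % len) consumers get one extra.
    """
    members = sorted(consumers)
    base, extra = divmod(num_partitions, len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        count = base + (1 if i < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment


persist = range_assign(6, ["Cp0", "Cp1"])
delivery = range_assign(6, [f"Cd{i}" for i in range(6)])
oversized = range_assign(6, [f"Cd{i}" for i in range(8)])
idle = [c for c, parts in oversized.items() if not parts]

print(persist)   # {'Cp0': [0, 1, 2], 'Cp1': [3, 4, 5]}
print(idle)      # ['Cd6', 'Cd7']
```

The last case shows why scaling a group past the partition count does not help: with 6 partitions, consumers 7 and 8 own nothing and sit idle until a rebalance hands them a partition.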

Message Lifecycle

One message, seven stops, end-to-end in under 200 ms under normal load.

Client A

Client A sends over WebSocket

WS Ingress

WS server issues UUIDv7 message id

Kafka Write

Producer writes to Kafka (partition = hash(chat_id))

Consumer

Consumer reads, publishes to Redis Pub/Sub

WS Deliver

Any WS instance subscribed to the room delivers to Client B

Client B

Client B ACKs; state flips to delivered

Postgres

Consumer asynchronously persists to Postgres for history

Σ in-flight = 0 msg (= |{f ∈ flights}| at each stage) · in-flight msg ≠ worker pod (the pipeline holds many msgs per pod) · pods: 8 = 6 delivery + 2 persist
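The UUIDv7 id issued at the WS Ingress stop is time-ordered, which is what lets clients and storage sort messages by id. A minimal sketch of the RFC 9562 bit layout follows; `new_message_id` is a hypothetical name, and a production service would more likely use a maintained library than hand-rolled bit packing.

```python
import os
import time
import uuid


def new_message_id(now_ms=None):
    """Build a UUIDv7: 48-bit Unix ms timestamp, version, variant, random bits."""
    ts = int(time.time() * 1000) if now_ms is None else now_ms
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                 # top 12 bits
    rand_b = rand & ((1 << 62) - 1)     # low 62 bits

    value = (ts & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                    # bits 79..76: version 7
    value |= rand_a << 64                 # bits 75..64: rand_a
    value |= 0b10 << 62                   # bits 63..62: RFC variant
    value |= rand_b                       # bits 61..0: rand_b
    return uuid.UUID(int=value)


earlier = new_message_id(now_ms=1_700_000_000_000)
later = new_message_id(now_ms=1_700_000_000_001)
print(earlier, later, earlier < later)
```

Ids created in different milliseconds compare in creation order. In this sketch, ids created within the same millisecond are ordered only by their random bits, so strict per-room ordering still comes from the Kafka partition, not from the id.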

Observability (4 signals)

Every layer emits metrics, logs, and traces. One request, one trace id, end-to-end.

Metrics

24000 msg/s

Prometheus scrapes RED metrics (rate, errors, duration) plus custom gauges (connection count, Kafka lag, consumer lag).

Dashboard

3/5 pods

Grafana panels for business KPIs. One glance answers "is the platform healthy?"

Logging

0 lag

EFK or Loki. Streamed to stdout from containers, no disk write on the pod.

Tracing

0 ms p50

OpenTelemetry. Trace id propagates through WS, Kafka, Consumer, DB. Find the bottleneck in seconds.
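Trace-id propagation across hops can be sketched with the W3C `traceparent` header format (`version-traceid-spanid-flags`). The helper names and the plain `dict` standing in for WebSocket metadata or Kafka record headers are assumptions for illustration; a real service would normally rely on the OpenTelemetry SDK's propagators instead of building the header by hand.

```python
import secrets


def new_traceparent():
    """Start a trace: 16-byte trace id, 8-byte span id, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent):
    """Keep the trace id, mint a new span id for the next hop."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header):
    return header.split("-")[1]


# One message travelling through the four stages named above.
headers = {"traceparent": new_traceparent()}
hops = []
for stage in ("ws", "kafka", "consumer", "db"):
    headers = {"traceparent": child_traceparent(headers["traceparent"])}
    hops.append((stage, headers["traceparent"]))

for stage, header in hops:
    print(f"{stage:9s} {header}")
```

Every hop carries the same trace id with its own span id, so a tracing backend can stitch the spans into one end-to-end timeline.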

System Risks & Countermeasures

Four scenarios the demo cycles through. Watch metrics shift when each failure is simulated.

Redis outage

Redis Pub/Sub drops. Fast path stops. Mitigation: clients resubscribe, consumer replays last N minutes from Kafka.

WebSocket disconnect

Mobile network drops. Mitigation: client exponential-backoff reconnect, replays missed messages using last_seen_id.
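A sketch of the client-side mitigation, assuming full-jitter backoff and message ids that sort in send order (plain integers here; time-ordered ids would also work). The function names, the 0.5 s base, and the 30 s cap are illustrative assumptions, not values taken from the platform spec.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Full-jitter exponential backoff: delay n is drawn from [0, min(cap, base * 2**n)]."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


def missed_messages(log, last_seen_id):
    """Return messages newer than the last id the client acknowledged."""
    return [m for m in log if m["id"] > last_seen_id]


# Upper bounds of each retry window (rng pinned to 1.0 to show the ceiling).
ceilings = backoff_delays(attempts=8, rng=lambda: 1.0)

server_log = [{"id": i, "text": f"msg {i}"} for i in range(1, 6)]
replay = missed_messages(server_log, last_seen_id=3)

print(ceilings)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
print(replay)    # messages 4 and 5
```

The jitter spreads reconnect attempts out so that a network blip affecting many clients does not turn into a synchronized reconnect storm against the gateway.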

Kafka consumer lag

Spike overruns consumers. Mitigation: HPA scales consumer service on lag metric, Kafka auto-rebalances partitions.
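The scaling decision can be approximated with the formula the Kubernetes HPA documents, `desired = ceil(current × currentMetric / targetMetric)`. The sketch below applies it to per-pod lag; the target of 1,000 messages per pod and the replica bounds are assumptions, and the ceiling of 6 reflects the six partitions shown on this page.

```python
import math


def desired_replicas(current, lag_per_pod, target_lag_per_pod, min_r=1, max_r=6):
    """HPA-style proportional scaling on average lag per pod, clamped to bounds."""
    raw = math.ceil(current * lag_per_pod / target_lag_per_pod)
    return max(min_r, min(max_r, raw))


spike = desired_replicas(current=3, lag_per_pod=2000, target_lag_per_pod=1000)
capped = desired_replicas(current=6, lag_per_pod=5000, target_lag_per_pod=1000)
quiet = desired_replicas(current=6, lag_per_pod=100, target_lag_per_pod=1000)

print(spike, capped, quiet)  # 6 6 1
```

Capping `max_r` at the partition count keeps the autoscaler from adding consumers that would receive no partitions after the rebalance.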

DB write pressure

Postgres write queue grows. Mitigation: batch inserts every 50 ms, async write off the hot path, read-replica offload for queries.
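The batching mitigation can be sketched as a buffer that flushes when either a time window or a row limit is reached. `BatchWriter` is a hypothetical helper: the flush callback stands in for a multi-row `INSERT`, and the clock is passed in explicitly so the example is deterministic.

```python
class BatchWriter:
    """Buffer rows and flush on size or age, whichever comes first."""

    def __init__(self, flush, max_rows=500, max_wait_s=0.05):
        self._flush = flush            # callable taking a list of rows
        self._max_rows = max_rows
        self._max_wait_s = max_wait_s  # 50 ms window
        self._buf = []
        self._first_at = 0.0

    def add(self, row, now):
        if not self._buf:
            self._first_at = now       # window starts at the first buffered row
        self._buf.append(row)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        due = now - self._first_at >= self._max_wait_s
        full = len(self._buf) >= self._max_rows
        if self._buf and (due or full):
            batch, self._buf = self._buf, []
            self._flush(batch)


batches = []
writer = BatchWriter(batches.append, max_rows=3)
writer.add("m1", now=0.00)
writer.add("m2", now=0.01)
writer.maybe_flush(now=0.06)  # timer tick: window elapsed, flush two rows
writer.add("m3", now=0.07)
writer.add("m4", now=0.08)
writer.add("m5", now=0.09)    # third row: size limit reached, flush

print(batches)  # [['m1', 'm2'], ['m3', 'm4', 'm5']]
```

One round trip per batch instead of per message is what relieves write pressure; the cost is that a message can wait up to one window before it is durable in Postgres.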

Hot / Cold Storage Split

Recent traffic lives in fast memory; audit history moves to cheap object storage on a background pipeline.

HOT: 0

Redis + Cassandra / DynamoDB. Indexed by chat_id + time range. Answers "load last page of chat" in milliseconds.

Fast Path vs Slow Path

Two paths for one message. Fast path wins latency, slow path wins durability. Both are mandatory.

SEND → RECV · fast path: ~0 ms · slow path: ~40 ms batch

Fast Path (hot)

Redis Pub/Sub broadcasts to all WS instances subscribing to the chat room. Sub-millisecond fan-out. No disk write on the critical path.

Slow Path (durable)

Consumer asynchronously writes to Postgres. Inserts are batched, accepting tens of milliseconds of latency. Guarantees history survives a Redis crash.

Trade-off: Fast path alone = fast but data loss on Redis failure. Slow path alone = reliable but too slow. Running both gives sub-200ms delivery plus durability.


Terminology in this page