GoTech Demo
Interactive Simulation — No Backend

High-Concurrency IM Architecture

Enterprise IM platform spec: 200K-400K concurrent WebSocket connections, sub-200ms latency, zero message loss, horizontal scaling on K8s. Drag the scale slider to see every layer resize.

Simulated reference values. Real numbers depend on message size, fanout ratio, and hardware.

Reference values at scale 100K (simulated):
DAU 60K · Peak QPS 30K · WS Conns 10K · Data/Year 10 TB · Go Instances 6 · WS Instances 3 · Consumers 8 · Hot storage 5.5 GB

Six-Layer Architecture

Every message flows top-to-bottom. Each layer has a single responsibility and can scale independently.

CL · Client Layer · Web / Mobile · 60K conns @ 100K
Long-lived WebSocket with exponential-backoff reconnection and offline-message replay on reconnect.
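A minimal sketch of that client-side reconnect loop, assuming a Go client built on gorilla/websocket; the "replay_from" frame shape and the 30 s backoff cap are illustrative choices, not part of the spec:

```go
// Client reconnect loop: exponential backoff with jitter, capped at 30s.
// On every successful reconnect the client asks the server to replay
// messages after the last id it saw (hypothetical "replay_from" frame).
package imclient

import (
	"encoding/json"
	"log"
	"math/rand"
	"time"

	"github.com/gorilla/websocket"
)

func connectWithBackoff(url string, lastSeenID string) *websocket.Conn {
	backoff := 500 * time.Millisecond
	for {
		conn, _, err := websocket.DefaultDialer.Dial(url, nil)
		if err == nil {
			// Request offline-message replay for everything after lastSeenID.
			req, _ := json.Marshal(map[string]string{
				"type":         "replay_from",
				"last_seen_id": lastSeenID,
			})
			_ = conn.WriteMessage(websocket.TextMessage, req)
			return conn
		}
		log.Printf("dial failed: %v, retrying in %v", err, backoff)
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2)))) // add jitter
		if backoff < 30*time.Second {
			backoff *= 2
		}
	}
}
```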
GW · Gateway Layer · Nginx / Kong · 4 nodes @ 100K
TLS termination, rate limiting, consistent-hash routing to WS instances. Keep it thin.
WS · Connection Layer · Go + WebSocket (stateless) · 3 pods @ 100K
Goroutine per connection. Session lives in Redis so any instance can serve any client.
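A sketch of the goroutine-per-connection handler with the session registered in Redis, assuming gorilla/websocket and go-redis; the sess:&lt;userID&gt; key scheme and the 5-minute TTL are assumptions:

```go
// Connection layer: one goroutine per WebSocket connection, session registered
// in Redis so any WS pod can find where a user is attached.
package ws

import (
	"context"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

var upgrader = websocket.Upgrader{CheckOrigin: func(*http.Request) bool { return true }}

type Server struct {
	rdb   *redis.Client
	podID string
}

func (s *Server) HandleWS(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		return
	}
	userID := r.URL.Query().Get("user_id")

	// Register the session with a TTL; a heartbeat would refresh it.
	s.rdb.Set(context.Background(), "sess:"+userID, s.podID, 5*time.Minute)

	go func() { // one goroutine per connection: the read loop
		defer conn.Close()
		defer s.rdb.Del(context.Background(), "sess:"+userID)
		for {
			_, msg, err := conn.ReadMessage()
			if err != nil {
				return // client dropped; reconnect handled client-side
			}
			_ = msg // hand off to the Kafka producer (MQ layer)
		}
	}()
}
```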
MQ · Message Queue · Kafka (partition by chat_id) · 6 partitions @ 100K
Source of truth. Partition key = chat_id guarantees in-room ordering; retention = offline replay window.
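A sketch of the producer side, assuming segmentio/kafka-go; the broker address and the im.messages topic name are placeholders:

```go
// Producer keyed by chat_id: the hash balancer maps a key to a fixed
// partition, so all messages in one room stay ordered.
package mq

import (
	"context"

	"github.com/segmentio/kafka-go"
)

var writer = &kafka.Writer{
	Addr:     kafka.TCP("kafka:9092"),
	Topic:    "im.messages",
	Balancer: &kafka.Hash{}, // partition = hash(key) % numPartitions
}

// PublishChatMessage appends one chat message to the room's partition.
func PublishChatMessage(ctx context.Context, chatID string, payload []byte) error {
	return writer.WriteMessages(ctx, kafka.Message{
		Key:   []byte(chatID), // same chat_id → same partition → in-room order
		Value: payload,
	})
}
```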
PR · Processing Layer · Consumer Service · 8 consumers @ 100K
Reads Kafka, publishes to Redis Pub/Sub for delivery, writes Postgres for history. Scale consumers against Kafka lag.
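A sketch of the delivery consumer loop under the same assumptions (kafka-go, go-redis); the im-delivery group id and room:&lt;chat_id&gt; channel naming are placeholders:

```go
// Delivery consumer: read from Kafka, fan out over Redis Pub/Sub so the WS
// pod holding the recipient's connection can deliver. Persistence runs in a
// separate consumer group (see the batching sketch under "DB write pressure").
package consumer

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

func RunDelivery(ctx context.Context, rdb *redis.Client) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka:9092"},
		GroupID: "im-delivery", // 6 pods share the 6 partitions
		Topic:   "im.messages",
	})
	defer r.Close()

	for {
		m, err := r.ReadMessage(ctx) // commits offsets within the group
		if err != nil {
			return err
		}
		channel := fmt.Sprintf("room:%s", m.Key) // key = chat_id
		if err := rdb.Publish(ctx, channel, m.Value).Err(); err != nil {
			return err // Redis down: fast path stops; Kafka retains for replay
		}
	}
}
```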
ST · Storage Layer · Redis Hot / Postgres Cold · 5.5 GB @ 100K
Hot cache for 7-day history, durable store for archive and audit queries. Cold tier migrates to S3 / Data Lake.
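One way the hot tier could be laid out: a Redis sorted set per chat, scored by send time and trimmed to the 7-day window. The hist:&lt;chatID&gt; key scheme is an assumption, not something the page specifies:

```go
// Hot tier layout: one sorted set per chat, scored by send time in ms,
// trimmed to the 7-day window.
package store

import (
	"context"
	"strconv"
	"time"

	"github.com/redis/go-redis/v9"
)

const hotWindow = 7 * 24 * time.Hour

// CacheMessage writes one message into the room's hot history.
func CacheMessage(ctx context.Context, rdb *redis.Client, chatID string, raw []byte, sentAt time.Time) error {
	key := "hist:" + chatID
	if err := rdb.ZAdd(ctx, key, redis.Z{Score: float64(sentAt.UnixMilli()), Member: raw}).Err(); err != nil {
		return err
	}
	// Evict entries that fell out of the hot window.
	cutoff := strconv.FormatInt(time.Now().Add(-hotWindow).UnixMilli(), 10)
	return rdb.ZRemRangeByScore(ctx, key, "0", cutoff).Err()
}

// LoadRecent answers "load the last page of chat" from memory.
func LoadRecent(ctx context.Context, rdb *redis.Client, chatID string, n int64) ([]string, error) {
	return rdb.ZRevRange(ctx, "hist:"+chatID, 0, n-1).Result()
}
```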

Kafka Partition Distribution

Partition key = chat_id. Same room always lands on the same partition, preserving message order within the room. Different rooms spread across partitions for parallelism.
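The mechanism in one function: a deterministic hash of the key picks the partition. Kafka's default partitioner uses murmur2 on the key; FNV-1a below is only to illustrate the idea:

```go
// Deterministic key → partition mapping is what pins a room to one partition.
package mq

import "hash/fnv"

func partitionFor(chatID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(chatID))
	return int(h.Sum32() % uint32(numPartitions))
}

// partitionFor("room-42", 6) always yields the same partition, so room-42's
// messages stay ordered; different rooms spread across all six partitions.
```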

Producers: 24K msg/s · Brokers: 3 × replica · Consumer groups: 6 delivery + 2 persist

Six partitions, P0-P5. Each partition feeds one delivery consumer (Cd) and one persistence consumer (Cp).

Delivery Group (6 pods · Redis Pub/Sub → WS): Cd0 owns P0, Cd1 owns P1, Cd2 owns P2, Cd3 owns P3, Cd4 owns P4, Cd5 owns P5.
Persistence Group (2 pods · batched writes → Postgres): Cp0 owns P0-P2, Cp1 owns P3-P5.
Add consumers to drain lag; Kafka rebalances partitions automatically.

Message Lifecycle

One message, seven stops, end-to-end in under 200 ms under normal load.

1 · Client A · Client A sends over WebSocket
2 · WS Ingress · WS server issues UUIDv7 message id
3 · Kafka Write · Producer writes to Kafka (partition = hash(chat_id))
4 · Consumer · Consumer reads, publishes to Redis Pub/Sub
5 · WS Deliver · Any WS instance subscribed to the room delivers to Client B
6 · Client B · Client B ACKs; state flips to delivered
7 · Postgres · Consumer asynchronously persists to Postgres for history
Pipeline pods: 8 = 6 delivery + 2 persist.
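Step 2 relies on UUIDv7 being time-ordered, which also makes last_seen_id replay a simple range comparison. A sketch using github.com/google/uuid (the library choice and envelope shape are assumptions):

```go
// The WS server stamps each inbound message with a UUIDv7 before producing to
// Kafka. UUIDv7 ids sort by send time, so "replay after last_seen_id" is a
// range comparison.
package ws

import (
	"time"

	"github.com/google/uuid"
)

type Envelope struct {
	ID     string    `json:"id"`
	ChatID string    `json:"chat_id"`
	SentAt time.Time `json:"sent_at"`
	Body   []byte    `json:"body"`
}

func stamp(chatID string, body []byte) (Envelope, error) {
	id, err := uuid.NewV7() // 48-bit millisecond timestamp prefix + random tail
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{ID: id.String(), ChatID: chatID, SentAt: time.Now(), Body: body}, nil
}
```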

Observability (4 signals)

Every layer emits metrics, logs, and traces. One request, one trace id, end-to-end.

Metrics

Prometheus scrapes RED metrics + custom (connection count, Kafka lag, consumer lag).

Dashboard

Grafana panels for business KPIs. One glance answers "is the platform healthy?"

Logging

EFK or Loki. Containers stream logs to stdout and the log agent ships them; no disk writes on the pod.

Tracing

OpenTelemetry. Trace id propagates through WS, Kafka, Consumer, DB. Find the bottleneck in seconds.
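One way to carry the trace id across the Kafka hop, assuming kafka-go message headers and a W3C TraceContext propagator registered at service startup (otel.SetTextMapPropagator); helper names are illustrative:

```go
// Propagate the trace context through Kafka so WS → Kafka → Consumer → DB
// shows up as one trace.
package tracing

import (
	"context"

	"github.com/segmentio/kafka-go"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// InjectTrace copies the current trace context into Kafka message headers.
func InjectTrace(ctx context.Context, msg *kafka.Message) {
	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)
	for k, v := range carrier {
		msg.Headers = append(msg.Headers, kafka.Header{Key: k, Value: []byte(v)})
	}
}

// ExtractTrace rebuilds the context on the consumer side before processing.
func ExtractTrace(ctx context.Context, msg kafka.Message) context.Context {
	carrier := propagation.MapCarrier{}
	for _, h := range msg.Headers {
		carrier[h.Key] = string(h.Value)
	}
	return otel.GetTextMapPropagator().Extract(ctx, carrier)
}
```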

System Risks & Countermeasures

Four scenarios the demo cycles through. Watch metrics shift when each failure is simulated.

Redis outage

Redis Pub/Sub drops. Fast path stops. Mitigation: clients resubscribe, consumer replays last N minutes from Kafka.

WebSocket disconnect

Mobile network drops. Mitigation: client exponential-backoff reconnect, replays missed messages using last_seen_id.
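A sketch of the server-side query that backs last_seen_id replay, assuming pgx and a messages(id, chat_id, body) table; because ids are UUIDv7, "everything I missed" is a single indexed range scan:

```go
// Replay on reconnect: fetch every message in the room newer than the last id
// the client acknowledged.
package history

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

type Missed struct {
	ID   string
	Body []byte
}

func ReplaySince(ctx context.Context, pool *pgxpool.Pool, chatID, lastSeenID string) ([]Missed, error) {
	rows, err := pool.Query(ctx,
		`SELECT id, body FROM messages WHERE chat_id = $1 AND id > $2 ORDER BY id`,
		chatID, lastSeenID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []Missed
	for rows.Next() {
		var m Missed
		if err := rows.Scan(&m.ID, &m.Body); err != nil {
			return nil, err
		}
		out = append(out, m)
	}
	return out, rows.Err()
}
```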

Kafka consumer lag

Spike overruns consumers. Mitigation: HPA scales consumer service on lag metric, Kafka auto-rebalances partitions.

DB write pressure

Postgres write queue grows. Mitigation: batch inserts every 50 ms, async write off the hot path, read-replica offload for queries.
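A sketch of the 50 ms batching window, assuming pgx CopyFrom into a messages table; batch size, table name, and columns are placeholders:

```go
// Rows queue on a channel and are flushed to Postgres with COPY when the
// ticker fires or the batch is full, keeping writes off the hot path.
package persist

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

type Row struct {
	ID, ChatID string
	Body       []byte
}

func RunBatcher(ctx context.Context, pool *pgxpool.Pool, in <-chan Row) error {
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	batch := make([]Row, 0, 512)

	flush := func() error {
		if len(batch) == 0 {
			return nil
		}
		rows := make([][]any, len(batch))
		for i, r := range batch {
			rows[i] = []any{r.ID, r.ChatID, r.Body}
		}
		_, err := pool.CopyFrom(ctx, pgx.Identifier{"messages"},
			[]string{"id", "chat_id", "body"}, pgx.CopyFromRows(rows))
		batch = batch[:0]
		return err
	}

	for {
		select {
		case r := <-in:
			batch = append(batch, r)
			if len(batch) >= 512 {
				if err := flush(); err != nil {
					return err
				}
			}
		case <-ticker.C:
			if err := flush(); err != nil {
				return err
			}
		case <-ctx.Done():
			return flush() // drain what's left before shutting down
		}
	}
}
```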

Hot / Cold Storage Split

Recent traffic lives in fast memory; audit history moves to cheap object storage on a background pipeline.

HOT
Redis + Cassandra / DynamoDB. Indexed by chat_id + time range. Answers "load last page of chat" in milliseconds.

COLD
S3 / Data Lake. Columnar format (Parquet) for compliance, audit, and analytics queries. Seconds-to-minutes retrieval is acceptable.

Fast Path vs Slow Path

Two paths for one message. Fast path wins latency, slow path wins durability. Both are mandatory.

SEND → RECV: ~0 ms (fast path) · SEND → DB: ~40 ms batch (slow path)
Fast Path (hot)

Redis Pub/Sub broadcasts to all WS instances subscribing to the chat room. Sub-millisecond fan-out. No disk write on the critical path.
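A sketch of the receive side of the fast path: each WS pod subscribes to the room channel and pushes straight to its local connections. Channel naming matches the delivery-consumer sketch above and is an assumption:

```go
// Fast-path fan-out on a WS pod: messages published for a room are delivered
// to the local WebSocket connections that joined it; nothing touches disk.
package ws

import (
	"context"

	"github.com/gorilla/websocket"
	"github.com/redis/go-redis/v9"
)

func fanOutRoom(ctx context.Context, rdb *redis.Client, chatID string, local []*websocket.Conn) {
	sub := rdb.Subscribe(ctx, "room:"+chatID)
	defer sub.Close()

	for msg := range sub.Channel() {
		for _, c := range local {
			_ = c.WriteMessage(websocket.TextMessage, []byte(msg.Payload))
		}
	}
}
```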

Slow Path (durable)

Consumer asynchronously writes to Postgres. Batched inserts, accepts tens-of-ms latency. Guarantees history survives Redis crash.

Trade-off: Fast path alone = fast but data loss on Redis failure. Slow path alone = reliable but too slow. Running both gives sub-200ms delivery plus durability.

Event Log

Press Start to watch the simulation. Events from 4 failure scenarios and archive cycles will appear here.