IM Architecture — Capacity Plan

What changes between 10K and 1M users: infrastructure, breakpoints, and operations. Use the tier tabs to compare any two scales side by side.

01 · Scale Matrix

Pick a tier to highlight the column. Infra numbers match the /im simulator formulas.

Traffic

| Dimension | 10K | 50K | 100K | 500K | 1M |
| --- | --- | --- | --- | --- | --- |
| Registered users | 10K | 50K | 100K | 500K | 1M |
| DAU | 6K | 30K | 60K | 300K | 600K |
| Peak QPS | 3K | 15K | 30K | 150K | 300K |
| WS concurrent | 1K | 5K | 10K | 50K | 100K |
| Messages / sec | 400 | 2K | 4K | 20K | 40K |
| Annual volume | 1 TB | 5 TB | 10 TB | 50 TB | 100 TB |

Infrastructure

| Dimension | 10K | 50K | 100K | 500K | 1M |
| --- | --- | --- | --- | --- | --- |
| Kafka partitions | 3 | 6 | 6 | 12 | 24 |
| Delivery consumers | 3 | 6 | 6 | 12 | 24 |
| Persistence consumers | 2 | 2 | 2 | 4 | 8 |
| WS pods | 2 | 2 | 3 | 10 | 25 |
| Gateway nodes | 1 | 2 | 3 | 8 | 20 |
| Redis memory | 1 GB | 3 GB | 5 GB | 25 GB | 50 GB |
| DB storage | 0.5 TB | 2 TB | 5 TB | 25 TB | 100 TB |
| Kafka broker replicas | 1 | 3 | 3 | 3 | 3 |

Operations & Risks

| Tier | Deployment strategy | First to break | Key tradeoff |
| --- | --- | --- | --- |
| 10K | Single AZ · Docker Compose or small K8s | DB connection pool exhaustion under bursty writes | Redpanda (Kafka-compatible) single node to keep ops simple |
| 50K | Single region, 2 AZs · small K8s | Single-broker Kafka becomes the unavailability risk | Move to 3-broker Kafka; Postgres primary + 1 read replica |
| 100K | Single region, 3 AZs · managed K8s (EKS/GKE) | Redis memory pressure during reconnect storms | Redis Cluster over single node; dedicated on-call rotation |
| 500K | Single region, 3 AZs · managed K8s + tiered Kafka | Postgres write IO becomes the bottleneck; index bloat in hot tables | Shard Postgres by chat_id OR migrate history tier to ScyllaDB; enable Kafka tiered storage to S3 |
| 1M | Multi-region active-active · region-local Kafka + Postgres · S3/ScyllaDB cold tier | Kafka consumer rebalance causes seconds-scale delivery pauses | Static partition assignment over dynamic rebalancing; cross-region message dedup via UUIDv7 ordering |

02 · Topology Evolution

Components at 100K · Single region · 3 AZs · dedicated on-call. Switch tier tabs above to see the topology grow.

Clients · Nginx LB ×3 · WS Pod ×3 · Kafka ×3 · Redis Cluster ×3 · Delivery consumers (Cd) ×6 · Persistence consumers (Cp) ×2 · Postgres + 2 replicas ×3

Deployment: Single region, 3 AZs · managed K8s (EKS/GKE)
First thing to break: Redis memory pressure during reconnect storms
Key tradeoff here: Redis Cluster over single node; dedicated on-call rotation

03 · Scaling Breakpoints

Thresholds where a new capability must come online. Skip one and the technical debt compounds until it surfaces as an incident.

@ 10K
Worker pool + rate limit
Without a bounded worker pool, a single burst spawns unbounded goroutines and exhausts the DB connection pool.
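A minimal Go sketch of that bounded pool plus a token-bucket limiter (golang.org/x/time/rate); the worker count, queue size, and Message/persistMessage names are illustrative, not production values:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// Message and persistMessage stand in for the real write path.
type Message struct{ ChatID, Body string }

func persistMessage(m Message) { /* one pooled DB connection per call */ }

func main() {
	// Bounded pool: 32 workers cap concurrent DB writes; the buffered channel
	// absorbs bursts and applies backpressure instead of letting goroutine and
	// connection counts grow without limit.
	jobs := make(chan Message, 1024)
	for i := 0; i < 32; i++ {
		go func() {
			for m := range jobs {
				persistMessage(m)
			}
		}()
	}

	// Token bucket sized for the 10K tier: ~400 msg/s sustained, bursts to 800.
	limiter := rate.NewLimiter(rate.Limit(400), 800)

	accept := func(m Message) bool {
		if !limiter.Allow() { // over the limit: reject and let the client retry
			return false
		}
		jobs <- m
		return true
	}

	fmt.Println(accept(Message{ChatID: "chat-42", Body: "hi"}))
}
```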
@ 50K
Kafka (not SQS)
Need fan-out to multiple consumer groups (delivery + persistence + future search/analytics). SQS cannot replay per-group.
@ 100K
Multi-AZ Redis + Kafka
Single-node Redis/Kafka becomes the dominant source of downtime. Move to clustered Redis + 3-broker Kafka.
@ 250K
Consistent-hash WS routing
Random WS pod assignment causes cross-pod Pub/Sub storms. Gateway must route user_id to a stable pod for session affinity.
@ 500K
Shard history tier
Postgres write IO saturates. Shard by chat_id OR migrate hot history to ScyllaDB. Enable Kafka tiered storage to S3.
@ 1M
Multi-region active-active
Single-region latency to distant users breaches 200ms target. Region-local Kafka + Postgres with async cross-region audit replication.

04 · Decision FAQ

The tradeoffs reviewers and interviewers most often challenge.

Q1 · Why Kafka at 50K+ instead of SQS or RabbitMQ?
IM needs fan-out: delivery group pushes via WS, persistence group writes Postgres, future groups add search/analytics — each with its own offset cursor. SQS consumes each message exactly once per queue, so fan-out requires N queues + deduplication. Kafka consumer groups do this natively and replay history from retention.
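A sketch of the fan-out with segmentio/kafka-go, assuming a topic named im.messages and a local broker address; both groups receive every message and commit offsets independently:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// newGroupReader returns a reader with its own consumer-group offset cursor.
func newGroupReader(groupID string) *kafka.Reader {
	return kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-1:9092"},
		GroupID: groupID,
		Topic:   "im.messages",
	})
}

func main() {
	ctx := context.Background()
	// Both groups see every message; adding a search/analytics group later
	// needs no producer change, only a new GroupID.
	for _, group := range []string{"delivery", "persistence"} {
		go func(group string) {
			r := newGroupReader(group)
			defer r.Close()
			for {
				m, err := r.ReadMessage(ctx) // fetch + commit for this group only
				if err != nil {
					log.Fatal(err)
				}
				log.Printf("[%s] chat=%s offset=%d", group, m.Key, m.Offset)
			}
		}(group)
	}
	select {} // block; real consumers run in their own pods
}
```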
Q2 · Why partition by chat_id, not user_id?
Ordering guarantee is per-chatroom, not per-user. A user can be in many rooms and message arrival order across rooms does not matter. Partitioning by chat_id keeps all messages of one conversation in one partition (ordered) and spreads different rooms across partitions (parallel).
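A producer-side sketch (kafka-go again, assumed topic/broker names): keying each message by chat_id and using the hash balancer is what pins a conversation to one partition:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka-1:9092"),
		Topic:    "im.messages",
		Balancer: &kafka.Hash{}, // same key -> same partition -> per-chat ordering
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte("chat-42"), Value: []byte(`{"seq":1,"body":"hi"}`)},
		kafka.Message{Key: []byte("chat-42"), Value: []byte(`{"seq":2,"body":"hello"}`)},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```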
Q3 · When to switch Postgres history to ScyllaDB/Cassandra?
At ~500K users the annual message volume crosses 50 TB and write IOPS saturates Postgres on a single primary. Options: shard Postgres by chat_id (more operational work, keeps SQL joins), OR migrate history to ScyllaDB (wide-column, partition-friendly, gives up ad-hoc SQL). Pick ScyllaDB when the read pattern is "latest N messages in room X" and not "join across messages and users".
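A read-path sketch with gocql against an assumed wide-column schema (a messages_by_chat table partitioned by chat_id and clustered newest-first); it shows the single-partition "latest N in room X" query that favors ScyllaDB:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("scylla-1:9042")
	cluster.Keyspace = "im"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// "Latest N messages in room X" hits a single partition, no joins.
	iter := session.Query(
		`SELECT message_id, sender_id, body FROM messages_by_chat WHERE chat_id = ? LIMIT ?`,
		"chat-42", 50).Iter()

	var id gocql.UUID
	var sender, body string
	for iter.Scan(&id, &sender, &body) {
		fmt.Println(id, sender, body)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```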
Q4 · Why UUIDv7 over Snowflake?
UUIDv7 embeds a 48-bit timestamp prefix so IDs sort by creation time (B-tree friendly) AND are globally unique across regions without a coordinator. Snowflake needs a worker-ID service and collides if two workers share an ID. UUIDv7 also fits standard UUID columns in Postgres/any client library — no custom type.
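A hand-rolled sketch of that layout, only to show why byte order matches creation order; in practice a library generator (e.g. uuid.NewV7 from github.com/google/uuid) would be used:

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// newUUIDv7 fills bytes 0-5 with the unix-millisecond timestamp, sets the
// version/variant bits, and leaves the rest random, so byte order follows
// creation order (B-tree friendly) with no coordinator.
func newUUIDv7() [16]byte {
	var id [16]byte
	if _, err := rand.Read(id[6:]); err != nil {
		panic(err)
	}
	ms := uint64(time.Now().UnixMilli())
	id[0] = byte(ms >> 40)
	id[1] = byte(ms >> 32)
	id[2] = byte(ms >> 24)
	id[3] = byte(ms >> 16)
	id[4] = byte(ms >> 8)
	id[5] = byte(ms)
	id[6] = 0x70 | (id[6] & 0x0F) // version 7 in the high nibble
	id[8] = 0x80 | (id[8] & 0x3F) // RFC 4122 variant bits
	return id
}

func main() {
	a, b := newUUIDv7(), newUUIDv7()
	fmt.Println(hex.EncodeToString(a[:]), hex.EncodeToString(b[:]))
	fmt.Println("a <= b:", bytes.Compare(a[:], b[:]) <= 0) // true: sorted by creation time
}
```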
Q5 · Why Redis Pub/Sub for fan-out, not Kafka directly?
Latency. Kafka round-trip (consumer poll + commit) is 20–100ms under load. Redis Pub/Sub is sub-millisecond. The delivery group reads Kafka once, publishes to Redis, and every WS pod subscribed to the room gets the event instantly. Kafka remains the durable source of truth; Redis is just the fast fan-out bus.
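A sketch of the two sides with go-redis/v9; the room:<chat_id> channel naming and addresses are assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})

	// WS-pod side: one subscription per room that has local connections.
	sub := rdb.Subscribe(ctx, "room:chat-42")
	defer sub.Close()
	if _, err := sub.Receive(ctx); err != nil { // wait for SUBSCRIBE confirmation
		panic(err)
	}

	// Delivery-consumer side: after reading the event from Kafka (the durable
	// source of truth), fan it out over Pub/Sub in sub-millisecond time.
	if err := rdb.Publish(ctx, "room:chat-42", `{"seq":1,"body":"hi"}`).Err(); err != nil {
		panic(err)
	}

	msg := <-sub.Channel() // push msg.Payload down each local WebSocket for the room
	fmt.Println(msg.Channel, msg.Payload)
}
```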
Q6 · When to split into multiple regions?
When P95 round-trip latency to any 20%+ of users crosses 200ms from the single region. Usually that is the cross-continent threshold — e.g. US-only deployment handling APAC users. Multi-region trades single-leader simplicity for two active regions with async cross-region replication for audit; in-region reads/writes stay local.
Q7 · Why consistent hashing in the gateway?
WS connections are stateful (goroutine per conn, Redis session). Random routing means a reconnect lands on a new pod, which must re-subscribe. Consistent hashing on user_id keeps a user pinned to one pod during steady state; pod additions/removals only reshuffle 1/N of users, not all of them.
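A minimal ring sketch (CRC32 hash points plus virtual nodes); pod names and the vnode count are placeholders, and a production gateway would likely use a maintained library:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> pod
}

// newRing places vnodes points per pod on the ring so load spreads evenly and
// adding/removing a pod only remaps ~1/N of users.
func newRing(pods []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, pod := range pods {
		for v := 0; v < vnodes; v++ {
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", pod, v)))
			r.points = append(r.points, h)
			r.owner[h] = pod
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the first point clockwise from hash(userID): same user, same pod.
func (r *ring) pick(userID string) string {
	h := crc32.ChecksumIEEE([]byte(userID))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"ws-pod-0", "ws-pod-1", "ws-pod-2"}, 128)
	fmt.Println(r.pick("user-1234")) // stable across calls and reconnects
}
```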
Q8 · Why not serverless WebSocket (API Gateway, Cloudflare Durable Objects)?
Cost and control. At 100K concurrent connections, API Gateway WS pricing becomes 5–10× self-hosted Go pods. Durable Objects have per-object throughput limits that surface under 1000+ msg/s per room. Self-hosted Go + WS is 100 lines of code and gives full control over backpressure, reconnect, and session eviction policy.

05 · Incident Runbook

First-ten-minutes response playbook for each failure simulated on /im. Detect → Mitigate → Restore → Escalate.

Redis cluster failure / Pub-Sub delivery drops
Detect
Fast-path delivery success rate drops (< 99%); redis_pubsub_publish_errors spike; WS pods log subscribe reconnect loops.
Mitigate
Fall back fast path to Kafka-direct: delivery consumers publish to a fan-out topic, WS pods consume directly. Accept +50ms latency. Page the on-call SRE.
Restore
Restart the Redis primary / promote a replica. Replay the missed Pub/Sub window from Kafka (retention covers it). Revert the fast path to Redis after 15 minutes of stability.
Escalate
If Redis down > 30 min, rotate all affected users to the backup region (if multi-region) and begin incident postmortem.
WebSocket disconnect storm / client reconnect pile-up
Detect
ws_disconnect_rate spikes; ws_reconnect_attempts_per_sec is 10× baseline; gateway CPU at 100%.
Mitigate
Gateway enters drain mode: reject new connections with 503 + Retry-After: 30 while existing connections stay up (a sketch of the guard follows this entry). The client library reconnects with exponential backoff starting at 2s. Scale gateway nodes horizontally (HPA).
Restore
Offline message replay on reconnect: client sends last_seen_message_id, server replays from Kafka from that offset. Verify no duplicate deliveries via UUIDv7 ordering.
Escalate
If reconnect success rate < 80% after 10 minutes, investigate upstream (ISP / CDN) or gateway config; consider rolling back last deploy.
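A sketch of the drain-mode guard referenced in the mitigation step above; the /ws route, flag toggle, and handler wiring are illustrative:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"sync/atomic"
)

var draining atomic.Bool // flipped by an admin endpoint or a SIGTERM handler

// drainGuard rejects new WebSocket upgrades while draining; connections that
// are already established were hijacked by the upgrader and are unaffected.
func drainGuard(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() && strings.EqualFold(r.Header.Get("Upgrade"), "websocket") {
			w.Header().Set("Retry-After", "30") // clients back off with exponential delay
			http.Error(w, "gateway draining", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
		// WebSocket upgrade + session registration would happen here.
	})
	log.Fatal(http.ListenAndServe(":8080", drainGuard(mux)))
}
```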
Kafka consumer lag spike / delivery group falling behind
Detect
kafka_consumer_lag_msgs > 10K or consumer_lag_ms > 500ms; delivery p50 end-to-end exceeds 200ms SLO.
Mitigate
HPA scales delivery consumers (up to partition count). If lag keeps growing, reduce message fan-out: disable optional consumers (analytics/search) temporarily; pause persistence batch writes if storage pressure is also flagged.
Restore
Drain the lag over 10–30 min. Once lag < 1K, scale consumers back down gradually (not all at once — avoid rebalance storms).
Escalate
If lag > 100K and growing, add partitions (a one-way change that forces a full group rebalance). Post-incident: review message size, compression, and downstream idempotency.
Postgres write queue saturation / DB pressure
Detect
db_write_queue_depth > 500; pg_stat_activity shows >90% connections active-writing; p99 INSERT latency > 1s.
Mitigate
Switch persistence consumers to batch mode: 1000 msgs per batch, commit every 1s instead of per message (a sketch follows this entry). Route all reads to replicas. Throttle new connections at the gateway (backpressure).
Restore
Queue drains when batch throughput catches up. Monitor replica lag — if > 10s, pause writes briefly to let replicas sync.
Escalate
If write queue grows despite batch mode, begin partition-level read-only degradation (serve history from S3 cold tier) while operations team initiates DB shard or ScyllaDB migration plan.
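A sketch of the batch-mode persistence path referenced in the mitigation step above, using pgx CopyFrom; the DSN, table, and column names are illustrative, and Kafka offsets would only be committed after a successful flush:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

type Msg struct{ ID, ChatID, Body string }

// runBatcher accumulates messages and flushes at 1000 rows or once per second,
// writing each batch in a single round-trip via COPY.
func runBatcher(ctx context.Context, conn *pgx.Conn, in <-chan Msg) {
	const maxBatch = 1000
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	batch := make([]Msg, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		rows := make([][]any, len(batch))
		for i, m := range batch {
			rows[i] = []any{m.ID, m.ChatID, m.Body}
		}
		_, err := conn.CopyFrom(ctx, pgx.Identifier{"messages"},
			[]string{"id", "chat_id", "body"}, pgx.CopyFromRows(rows))
		if err != nil {
			log.Printf("flush failed, keeping batch for retry: %v", err)
			return // offsets stay uncommitted until the batch lands
		}
		batch = batch[:0]
	}

	for {
		select {
		case m := <-in:
			batch = append(batch, m)
			if len(batch) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush()
		case <-ctx.Done():
			flush()
			return
		}
	}
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://im:secret@postgres:5432/im")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	in := make(chan Msg, 4096)
	go runBatcher(ctx, conn, in) // fed by the persistence consumer's Kafka loop
	select {}
}
```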
Reading this document

This plan is the engineering companion to the /im interactive simulator. The simulator answers "does the system behave at scale X?"; this page answers "what do I build, and how do I know when to change it?". Follow the breakpoints in order; skipping one pushes technical debt into an incident.