IM Architecture — Capacity Plan

What changes between 10K and 1M users: infrastructure, breakpoints, and operations. Use the tier tabs to compare any two scales side by side.

01 · Scale Matrix

Pick a tier to highlight the column. Infra numbers match the /im simulator formulas.

Traffic

| Dimension | 10K | 50K | 100K | 500K | 1M |
| --- | --- | --- | --- | --- | --- |
| Registered users | 10K | 50K | 100K | 500K | 1M |
| DAU | 6K | 30K | 60K | 300K | 600K |
| Peak QPS | 3K | 15K | 30K | 150K | 300K |
| WS concurrent | 1K | 5K | 10K | 50K | 100K |
| Messages / sec | 400 | 2K | 4K | 20K | 40K |
| Annual volume | 1 TB | 5 TB | 10 TB | 50 TB | 100 TB |

Infrastructure

| Dimension | 10K | 50K | 100K | 500K | 1M |
| --- | --- | --- | --- | --- | --- |
| Kafka partitions | 3 | 6 | 6 | 12 | 24 |
| Delivery consumers | 3 | 6 | 6 | 12 | 24 |
| Persistence consumers | 2 | 2 | 2 | 4 | 8 |
| WS pods | 2 | 2 | 3 | 10 | 25 |
| Gateway nodes | 1 | 2 | 3 | 8 | 20 |
| Redis memory | 1 GB | 3 GB | 5 GB | 25 GB | 50 GB |
| DB storage | 0.5 TB | 2 TB | 5 TB | 25 TB | 100 TB |
| Kafka broker replicas | 1 | 3 | 3 | 3 | 3 |

Operations & Risks

| Tier | Deployment strategy | First to break | Key tradeoff |
| --- | --- | --- | --- |
| 10K | Single AZ · Docker Compose or small K8s | DB connection pool exhaustion under bursty writes | Redpanda (Kafka-compatible) single node to keep ops simple |
| 50K | Single region, 2 AZs · small K8s | Single-broker Kafka becomes the unavailability risk | Move to 3-broker Kafka; Postgres primary + 1 read replica |
| 100K | Single region, 3 AZs · managed K8s (EKS/GKE) | Redis memory pressure during reconnect storms | Redis Cluster over single node; dedicated on-call rotation |
| 500K | Single region, 3 AZs · managed K8s + tiered Kafka | Postgres write IO becomes the bottleneck; index bloat in hot tables | Shard Postgres by chat_id OR migrate history tier to ScyllaDB; enable Kafka tiered storage to S3 |
| 1M | Multi-region active-active · region-local Kafka + Postgres · S3/ScyllaDB cold tier | Kafka consumer rebalance causes seconds-scale delivery pauses | Static partition assignment over dynamic rebalancing; cross-region message dedup via UUIDv7 ordering |

02 · Topology Evolution

Components at 100K · Single region · 3 AZs · dedicated on-call. Switch tier tabs above to see the topology grow.

Clients · Nginx LB ×3 · WS Pod ×3 · Kafka ×3 · Redis Cluster ×3 · Delivery consumers (Cd) ×6 · Persistence consumers (Cp) ×2 · Postgres + 2 replicas ×3

Deployment: Single region, 3 AZs · managed K8s (EKS/GKE)
First thing to break: Redis memory pressure during reconnect storms
Key tradeoff here: Redis Cluster over single node; dedicated on-call rotation

03 · Scaling Breakpoints

Thresholds where a new capability must come online. Skip one and the technical debt compounds until it surfaces as an incident.

@ 10K
Worker pool + rate limit
Without a bounded worker pool, a single burst spawns unbounded goroutines and exhausts the DB connection pool.
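A minimal Go sketch of that bounded pool plus a token-bucket limiter (golang.org/x/time/rate); the worker count, queue size, and Message/persistMessage names are illustrative, not production values:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// Message and persistMessage stand in for the real write path.
type Message struct{ ChatID, Body string }

func persistMessage(m Message) { /* one pooled DB connection per call */ }

func main() {
	// Bounded pool: 32 workers cap concurrent DB writes; the buffered channel
	// absorbs bursts and applies backpressure instead of letting goroutine and
	// connection counts grow without limit.
	jobs := make(chan Message, 1024)
	for i := 0; i < 32; i++ {
		go func() {
			for m := range jobs {
				persistMessage(m)
			}
		}()
	}

	// Token bucket sized for the 10K tier: ~400 msg/s sustained, bursts to 800.
	limiter := rate.NewLimiter(rate.Limit(400), 800)

	accept := func(m Message) bool {
		if !limiter.Allow() { // over the limit: reject and let the client retry
			return false
		}
		jobs <- m
		return true
	}

	fmt.Println(accept(Message{ChatID: "chat-42", Body: "hi"}))
}
```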
@ 50K
Kafka (not SQS)
Need fan-out to multiple consumer groups (delivery + persistence + future search/analytics). SQS cannot replay per-group.
@ 100K
Multi-AZ Redis + Kafka
Single-node Redis/Kafka becomes the dominant source of downtime. Move to clustered Redis + 3-broker Kafka.
@ 250K
Consistent-hash WS routing
Random WS pod assignment causes cross-pod Pub/Sub storms. Gateway must route user_id to a stable pod for session affinity.
@ 500K
Shard history tier
Postgres write IO saturates. Shard by chat_id OR migrate hot history to ScyllaDB. Enable Kafka tiered storage to S3.
@ 1M
Multi-region active-active
Single-region latency to distant users breaches 200ms target. Region-local Kafka + Postgres with async cross-region audit replication.

04 · Decision FAQ

The tradeoffs reviewers and interviewers most often challenge.

Q1 · Why Kafka at 50K+ instead of SQS or RabbitMQ?
IM needs fan-out: delivery group pushes via WS, persistence group writes Postgres, future groups add search/analytics — each with its own offset cursor. SQS consumes each message exactly once per queue, so fan-out requires N queues + deduplication. Kafka consumer groups do this natively and replay history from retention.
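A sketch of the fan-out with segmentio/kafka-go, assuming a topic named im.messages and a local broker address; both groups receive every message and commit offsets independently:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// newGroupReader returns a reader with its own consumer-group offset cursor.
func newGroupReader(groupID string) *kafka.Reader {
	return kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-1:9092"},
		GroupID: groupID,
		Topic:   "im.messages",
	})
}

func main() {
	ctx := context.Background()
	// Both groups see every message; adding a search/analytics group later
	// needs no producer change, only a new GroupID.
	for _, group := range []string{"delivery", "persistence"} {
		go func(group string) {
			r := newGroupReader(group)
			defer r.Close()
			for {
				m, err := r.ReadMessage(ctx) // fetch + commit for this group only
				if err != nil {
					log.Fatal(err)
				}
				log.Printf("[%s] chat=%s offset=%d", group, m.Key, m.Offset)
			}
		}(group)
	}
	select {} // block; real consumers run in their own pods
}
```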
Q2 · Why partition by chat_id, not user_id?
Ordering guarantee is per-chatroom, not per-user. A user can be in many rooms and message arrival order across rooms does not matter. Partitioning by chat_id keeps all messages of one conversation in one partition (ordered) and spreads different rooms across partitions (parallel).
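A producer-side sketch (kafka-go again, assumed topic/broker names): keying each message by chat_id and using the hash balancer is what pins a conversation to one partition:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("kafka-1:9092"),
		Topic:    "im.messages",
		Balancer: &kafka.Hash{}, // same key -> same partition -> per-chat ordering
	}
	defer w.Close()

	err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte("chat-42"), Value: []byte(`{"seq":1,"body":"hi"}`)},
		kafka.Message{Key: []byte("chat-42"), Value: []byte(`{"seq":2,"body":"hello"}`)},
	)
	if err != nil {
		log.Fatal(err)
	}
}
```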
Q3 · When to switch Postgres history to ScyllaDB/Cassandra?
At ~500K users the annual message volume crosses 50 TB and write IOPS saturates Postgres on a single primary. Options: shard Postgres by chat_id (more operational work, keeps SQL joins), OR migrate history to ScyllaDB (wide-column, partition-friendly, gives up ad-hoc SQL). Pick ScyllaDB when the read pattern is "latest N messages in room X" and not "join across messages and users".
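A read-path sketch with gocql against an assumed wide-column schema (a messages_by_chat table partitioned by chat_id and clustered newest-first); it shows the single-partition "latest N in room X" query that favors ScyllaDB:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("scylla-1:9042")
	cluster.Keyspace = "im"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// "Latest N messages in room X" hits a single partition, no joins.
	iter := session.Query(
		`SELECT message_id, sender_id, body FROM messages_by_chat WHERE chat_id = ? LIMIT ?`,
		"chat-42", 50).Iter()

	var id gocql.UUID
	var sender, body string
	for iter.Scan(&id, &sender, &body) {
		fmt.Println(id, sender, body)
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```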
Q4 · Why UUIDv7 over Snowflake?
UUIDv7 embeds a 48-bit timestamp prefix so IDs sort by creation time (B-tree friendly) AND are globally unique across regions without a coordinator. Snowflake needs a worker-ID service and collides if two workers share an ID. UUIDv7 also fits standard UUID columns in Postgres/any client library — no custom type.
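A hand-rolled sketch of that layout, only to show why byte order matches creation order; in practice a library generator (e.g. uuid.NewV7 from github.com/google/uuid) would be used:

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// newUUIDv7 fills bytes 0-5 with the unix-millisecond timestamp, sets the
// version/variant bits, and leaves the rest random, so byte order follows
// creation order (B-tree friendly) with no coordinator.
func newUUIDv7() [16]byte {
	var id [16]byte
	if _, err := rand.Read(id[6:]); err != nil {
		panic(err)
	}
	ms := uint64(time.Now().UnixMilli())
	id[0] = byte(ms >> 40)
	id[1] = byte(ms >> 32)
	id[2] = byte(ms >> 24)
	id[3] = byte(ms >> 16)
	id[4] = byte(ms >> 8)
	id[5] = byte(ms)
	id[6] = 0x70 | (id[6] & 0x0F) // version 7 in the high nibble
	id[8] = 0x80 | (id[8] & 0x3F) // RFC 4122 variant bits
	return id
}

func main() {
	a, b := newUUIDv7(), newUUIDv7()
	fmt.Println(hex.EncodeToString(a[:]), hex.EncodeToString(b[:]))
	fmt.Println("a <= b:", bytes.Compare(a[:], b[:]) <= 0) // true: sorted by creation time
}
```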
Q5 · Why Redis Pub/Sub for fan-out, not Kafka directly?
Latency. Kafka round-trip (consumer poll + commit) is 20–100ms under load. Redis Pub/Sub is sub-millisecond. The delivery group reads Kafka once, publishes to Redis, and every WS pod subscribed to the room gets the event instantly. Kafka remains the durable source of truth; Redis is just the fast fan-out bus.
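A sketch of the two sides with go-redis/v9; the room:<chat_id> channel naming and addresses are assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis:6379"})

	// WS-pod side: one subscription per room that has local connections.
	sub := rdb.Subscribe(ctx, "room:chat-42")
	defer sub.Close()
	if _, err := sub.Receive(ctx); err != nil { // wait for SUBSCRIBE confirmation
		panic(err)
	}

	// Delivery-consumer side: after reading the event from Kafka (the durable
	// source of truth), fan it out over Pub/Sub in sub-millisecond time.
	if err := rdb.Publish(ctx, "room:chat-42", `{"seq":1,"body":"hi"}`).Err(); err != nil {
		panic(err)
	}

	msg := <-sub.Channel() // push msg.Payload down each local WebSocket for the room
	fmt.Println(msg.Channel, msg.Payload)
}
```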
Q6 · When to split into multiple regions?
When P95 round-trip latency to any 20%+ of users crosses 200ms from the single region. Usually that is the cross-continent threshold — e.g. US-only deployment handling APAC users. Multi-region trades single-leader simplicity for two active regions with async cross-region replication for audit; in-region reads/writes stay local.
Q7 · Why consistent hashing in the gateway?
WS connections are stateful (goroutine per conn, Redis session). Random routing means a reconnect lands on a new pod, which must re-subscribe. Consistent hashing on user_id keeps a user pinned to one pod during steady state; pod additions/removals only reshuffle 1/N of users, not all of them.
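A minimal ring sketch (CRC32 hash points plus virtual nodes); pod names and the vnode count are placeholders, and a production gateway would likely use a maintained library:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> pod
}

// newRing places vnodes points per pod on the ring so load spreads evenly and
// adding/removing a pod only remaps ~1/N of users.
func newRing(pods []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, pod := range pods {
		for v := 0; v < vnodes; v++ {
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", pod, v)))
			r.points = append(r.points, h)
			r.owner[h] = pod
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the first point clockwise from hash(userID): same user, same pod.
func (r *ring) pick(userID string) string {
	h := crc32.ChecksumIEEE([]byte(userID))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"ws-pod-0", "ws-pod-1", "ws-pod-2"}, 128)
	fmt.Println(r.pick("user-1234")) // stable across calls and reconnects
}
```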
Q8 · Why not serverless WebSocket (API Gateway, Cloudflare Durable Objects)?
Cost and control. At 100K concurrent connections, API Gateway WS pricing becomes 5–10× self-hosted Go pods. Durable Objects have per-object throughput limits that surface under 1000+ msg/s per room. Self-hosted Go + WS is 100 lines of code and gives full control over backpressure, reconnect, and session eviction policy.

05 · Incident Runbook

First-ten-minutes response playbook for each failure simulated on /im. Detect → Mitigate → Restore → Escalate.

Redis cluster failure / Pub-Sub delivery drops
Detect
Fast-path delivery success rate drops (< 99%); redis_pubsub_publish_errors spike; WS pods log subscribe reconnect loops.
Mitigate
Fall back fast path to Kafka-direct: delivery consumers publish to a fan-out topic, WS pods consume directly. Accept +50ms latency. Page the on-call SRE.
Restore
Restart the Redis primary / promote a replica. Replay the missed Pub/Sub window from Kafka (retention covers it). Revert the fast path to Redis after 15 minutes of stability.
Escalate
If Redis down > 30 min, rotate all affected users to the backup region (if multi-region) and begin incident postmortem.
WebSocket disconnect storm / client reconnect pile-up
Detect
ws_disconnect_rate spikes; ws_reconnect_attempts_per_sec is 10× baseline; gateway CPU at 100%.
Mitigate
Gateway enters drain mode: reject new connections with 503 + Retry-After: 30 while existing connections stay up (a sketch of the guard follows this entry). The client library reconnects with exponential backoff starting at 2s. Scale gateway nodes horizontally (HPA).
Restore
Offline message replay on reconnect: client sends last_seen_message_id, server replays from Kafka from that offset. Verify no duplicate deliveries via UUIDv7 ordering.
Escalate
If reconnect success rate < 80% after 10 minutes, investigate upstream (ISP / CDN) or gateway config; consider rolling back last deploy.
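A sketch of the drain-mode guard referenced in the mitigation step above; the /ws route, flag toggle, and handler wiring are illustrative:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"sync/atomic"
)

var draining atomic.Bool // flipped by an admin endpoint or a SIGTERM handler

// drainGuard rejects new WebSocket upgrades while draining; connections that
// are already established were hijacked by the upgrader and are unaffected.
func drainGuard(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() && strings.EqualFold(r.Header.Get("Upgrade"), "websocket") {
			w.Header().Set("Retry-After", "30") // clients back off with exponential delay
			http.Error(w, "gateway draining", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ws", func(w http.ResponseWriter, r *http.Request) {
		// WebSocket upgrade + session registration would happen here.
	})
	log.Fatal(http.ListenAndServe(":8080", drainGuard(mux)))
}
```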
Kafka consumer lag spike / delivery group falling behind
Detect
kafka_consumer_lag_msgs > 10K or consumer_lag_ms > 500ms; delivery p50 end-to-end exceeds 200ms SLO.
Mitigate
HPA scales delivery consumers (up to partition count). If lag keeps growing, reduce message fan-out: disable optional consumers (analytics/search) temporarily; pause persistence batch writes if storage pressure is also flagged.
Restore
Drain the lag over 10–30 min. Once lag < 1K, scale consumers back down gradually (not all at once — avoid rebalance storms).
Escalate
If lag > 100K and growing, add partitions (a one-way change that forces a full group rebalance). Post-incident: review message size, compression, and downstream idempotency.
Postgres write queue saturation / DB pressure
Detect
db_write_queue_depth > 500; pg_stat_activity shows >90% connections active-writing; p99 INSERT latency > 1s.
Mitigate
Switch persistence consumers to batch mode: 1000 msgs per batch, commit every 1s instead of per message (a sketch follows this entry). Route all reads to replicas. Throttle new connections at the gateway (backpressure).
Restore
Queue drains when batch throughput catches up. Monitor replica lag — if > 10s, pause writes briefly to let replicas sync.
Escalate
If write queue grows despite batch mode, begin partition-level read-only degradation (serve history from S3 cold tier) while operations team initiates DB shard or ScyllaDB migration plan.
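A sketch of the batch-mode persistence path referenced in the mitigation step above, using pgx CopyFrom; the DSN, table, and column names are illustrative, and Kafka offsets would only be committed after a successful flush:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

type Msg struct{ ID, ChatID, Body string }

// runBatcher accumulates messages and flushes at 1000 rows or once per second,
// writing each batch in a single round-trip via COPY.
func runBatcher(ctx context.Context, conn *pgx.Conn, in <-chan Msg) {
	const maxBatch = 1000
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	batch := make([]Msg, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		rows := make([][]any, len(batch))
		for i, m := range batch {
			rows[i] = []any{m.ID, m.ChatID, m.Body}
		}
		_, err := conn.CopyFrom(ctx, pgx.Identifier{"messages"},
			[]string{"id", "chat_id", "body"}, pgx.CopyFromRows(rows))
		if err != nil {
			log.Printf("flush failed, keeping batch for retry: %v", err)
			return // offsets stay uncommitted until the batch lands
		}
		batch = batch[:0]
	}

	for {
		select {
		case m := <-in:
			batch = append(batch, m)
			if len(batch) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush()
		case <-ctx.Done():
			flush()
			return
		}
	}
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://im:secret@postgres:5432/im")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	in := make(chan Msg, 4096)
	go runBatcher(ctx, conn, in) // fed by the persistence consumer's Kafka loop
	select {}
}
```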
Reading this document

This plan is the engineering companion to the /im interactive simulator. The simulator answers "does the system behave at scale X?"; this page answers "what do I build, and how do I know when to change it?". Follow the breakpoints in order; skipping one pushes technical debt into an incident.