IM Architecture — Capacity Plan
What changes between 10K and 1M users: infrastructure, breakpoints, and operations, compared tier by tier.
01 · Scale Matrix
Infra numbers match the /im simulator formulas.
| Dimension | 10K | 50K | 100K | 500K | 1M |
|---|---|---|---|---|---|
| **Traffic** | | | | | |
| Registered users | 10K | 50K | 100K | 500K | 1M |
| DAU | 6K | 30K | 60K | 300K | 600K |
| Peak QPS | 3K | 15K | 30K | 150K | 300K |
| WS concurrent | 1K | 5K | 10K | 50K | 100K |
| Messages / sec | 400 | 2K | 4K | 20K | 40K |
| Annual volume | 1 TB | 5 TB | 10 TB | 50 TB | 100 TB |
| **Infrastructure** | | | | | |
| Kafka partitions | 3 | 6 | 6 | 12 | 24 |
| Delivery consumers | 3 | 6 | 6 | 12 | 24 |
| Persistence consumers | 2 | 2 | 2 | 4 | 8 |
| WS pods | 2 | 2 | 3 | 10 | 25 |
| Gateway nodes | 1 | 2 | 3 | 8 | 20 |
| Redis memory | 1 GB | 3 GB | 5 GB | 25 GB | 50 GB |
| DB storage | 0.5 TB | 2 TB | 5 TB | 25 TB | 100 TB |
| Kafka broker replicas | 1 | 3 | 3 | 3 | 3 |
| **Operations** | | | | | |
| Deployment strategy | Single AZ · Docker Compose or small K8s | Single region, 2 AZs · small K8s | Single region, 3 AZs · managed K8s (EKS/GKE) | Single region, 3 AZs · managed K8s + tiered Kafka | Multi-region active-active · region-local Kafka + Postgres · S3/ScyllaDB cold tier |
| **Risks** | | | | | |
| First to break | DB connection pool exhaustion under bursty writes | Single-broker Kafka becomes the unavailability risk | Redis memory pressure during reconnect storms | Postgres write IO becomes the bottleneck; index bloat in hot tables | Kafka consumer rebalance causes seconds-scale delivery pauses |
| Key tradeoff | Use Redpanda (Kafka-compatible) single node to keep ops simple | Move to 3-broker Kafka; Postgres primary + 1 read replica | Redis Cluster over single node; dedicated on-call rotation | Shard Postgres by chat_id OR migrate history tier to ScyllaDB; enable Kafka tiered storage to S3 | Static partition assignment over dynamic rebalancing; cross-region message dedup via UUIDv7 ordering |
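The traffic rows reduce to fixed ratios of registered users (DAU ≈ 60%, peak QPS ≈ 30%, peak concurrent WS ≈ 10%, messages/sec ≈ 4%). Below is a minimal Go sketch of a tier calculator built on those ratios; the per-unit capacities and floor values are illustrative assumptions, not the /im simulator's actual constants, so its infra outputs approximate rather than reproduce the matrix.

```go
// Tier calculator sketch. Traffic ratios are read directly off the matrix;
// per-unit capacities (msgsPerPartition, socketsPerPod) and minimum counts
// are ASSUMPTIONS for illustration, not the /im simulator's constants.
package main

import (
	"fmt"
	"math"
)

const (
	dauRatio         = 0.60 // DAU / registered users
	qpsRatio         = 0.30 // peak QPS / registered users
	wsRatio          = 0.10 // peak concurrent WebSockets / registered users
	msgRatio         = 0.04 // messages per second / registered users
	msgsPerPartition = 2000 // assumed sustained msg/s one Kafka partition absorbs
	socketsPerPod    = 4000 // assumed concurrent sockets one WS pod holds
)

// ceilWithFloor rounds capacity up and enforces a minimum replica count.
func ceilWithFloor(x float64, floor int) int {
	if n := int(math.Ceil(x)); n > floor {
		return n
	}
	return floor
}

func main() {
	for _, users := range []float64{10_000, 50_000, 100_000, 500_000, 1_000_000} {
		msgs := users * msgRatio
		ws := users * wsRatio
		fmt.Printf("users=%.0f dau=%.0f qps=%.0f ws=%.0f msg/s=%.0f partitions=%d wsPods=%d\n",
			users, users*dauRatio, users*qpsRatio, ws, msgs,
			ceilWithFloor(msgs/msgsPerPartition, 3), // floor of 3: minimum viable consumer group
			ceilWithFloor(ws/socketsPerPod, 2))      // floor of 2: a single pod is a SPOF
	}
}
```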
02 · Topology Evolution
Components at 100K · single region · 3 AZs · dedicated on-call. The component set grows tier by tier along the matrix above.
03 · Scaling Breakpoints
Thresholds where a new capability must come online. Skip one and the technical debt compounds until it surfaces as an incident.
04 · Decision FAQ
The tradeoffs reviewers and interviewers most often challenge.
- Q1 · Why Kafka at 50K+ instead of SQS or RabbitMQ?
- Q2 · Why partition by chat_id, not user_id? (See the sketch after this list.)
- Q3 · When to switch Postgres history to ScyllaDB/Cassandra?
- Q4 · Why UUIDv7 over Snowflake? (See the sketch after this list.)
- Q5 · Why Redis Pub/Sub for fan-out, not Kafka directly?
- Q6 · When to split into multiple regions?
- Q7 · Why consistent hashing in the gateway?
- Q8 · Why not serverless WebSockets (API Gateway, Cloudflare Durable Objects)?
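Q2 and Q4 usually come up together, so here is a minimal sketch of both, assuming segmentio/kafka-go and google/uuid: keying the producer by chat_id makes every message of one chat hash to the same partition, which gives per-chat ordering for free, and UUIDv7 message IDs sort lexicographically by creation time without running a central Snowflake ID service. The broker address and topic name are placeholders.

```go
// Sketch for Q2/Q4: chat_id as the Kafka message key (per-chat ordering)
// plus UUIDv7 message IDs (time-sortable, no ID service). Broker and topic
// names are placeholders, not the production config.
package main

import (
	"context"

	"github.com/google/uuid"
	"github.com/segmentio/kafka-go"
)

func publish(ctx context.Context, w *kafka.Writer, chatID string, body []byte) error {
	msgID, err := uuid.NewV7() // time-ordered: newer IDs compare greater than older ones
	if err != nil {
		return err
	}
	return w.WriteMessages(ctx, kafka.Message{
		Key:     []byte(chatID), // same chat -> same partition -> ordered delivery
		Value:   body,
		Headers: []kafka.Header{{Key: "msg_id", Value: []byte(msgID.String())}},
	})
}

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "im.messages",
		Balancer: &kafka.Hash{}, // hash the key instead of round-robin
	}
	defer w.Close()
	_ = publish(context.Background(), w, "chat-42", []byte(`{"text":"hi"}`))
}
```

Keying by user_id would split one chat's messages across partitions and force re-ordering at read time; keying by chat_id pushes the ordering problem onto Kafka, where it is already solved.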
05 · Incident Runbook
First-ten-minutes response playbook for each failure simulated on /im. Detect → Mitigate → Restore → Escalate.
Incident 1 · Redis Pub/Sub outage
- Detect: fast-path delivery success rate drops below 99%; redis_pubsub_publish_errors spikes; WS pods log subscribe/reconnect loops.
- Mitigate: fall back the fast path to Kafka-direct: delivery consumers publish to a fan-out topic and WS pods consume it directly (sketched below). Accept roughly +50 ms of latency. Page the on-call SRE.
- Restore: restart the Redis primary or promote a replica. Replay the missed Pub/Sub window from Kafka (retention covers it). Revert the fast path to Redis after 15 minutes of stability.
- Escalate: if Redis is down for more than 30 minutes, rotate all affected users to the backup region (if multi-region) and begin the incident postmortem.
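A minimal sketch of that fallback path, assuming go-redis and kafka-go; the wiring and the fan-out topic are placeholders, not the production config:

```go
// Package fanout: try the Redis Pub/Sub hot path first and, on publish
// error, divert to a Kafka fan-out topic that WS pods also consume.
package fanout

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
	"github.com/segmentio/kafka-go"
)

type fanout struct {
	rdb      *redis.Client
	fallback *kafka.Writer // writer for the fallback fan-out topic
}

// deliver prefers Redis Pub/Sub (low single-digit ms) and accepts the ~+50ms
// Kafka path only when Redis fails, so an outage degrades latency, not delivery.
func (f *fanout) deliver(ctx context.Context, channel string, payload []byte) error {
	if err := f.rdb.Publish(ctx, channel, payload).Err(); err != nil {
		log.Printf("redis publish failed (%v); falling back to kafka", err)
		return f.fallback.WriteMessages(ctx, kafka.Message{
			Key:   []byte(channel),
			Value: payload,
		})
	}
	return nil
}
```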
Incident 2 · Mass disconnect / reconnect storm
- Detect: ws_disconnect_rate spikes; ws_reconnect_attempts_per_sec runs at 10× baseline; gateway CPU sits at 100%.
- Mitigate: the gateway enters drain mode, rejecting new connections with 503 Retry-After: 30 while existing connections stay up. The client library backs off exponentially starting at 2 s (sketched below). Scale gateway nodes horizontally via HPA.
- Restore: replay offline messages on reconnect: the client sends last_seen_message_id and the server replays from the corresponding Kafka offset. Verify there are no duplicate deliveries via UUIDv7 ordering.
- Escalate: if the reconnect success rate stays below 80% after 10 minutes, investigate upstream (ISP/CDN) or gateway config; consider rolling back the last deploy.
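A minimal sketch of that client backoff: exponential from the 2 s base with full jitter, so a fleet of clients spreads out instead of reconnecting in lockstep. The dial function stands in for the real WebSocket client.

```go
// Package wsclient: reconnect policy per the runbook (2s base, exponential,
// capped, full jitter). The dial func is a placeholder for the real client.
package wsclient

import (
	"context"
	"math/rand"
	"time"
)

func reconnect(ctx context.Context, dial func(context.Context) error) error {
	backoff := 2 * time.Second // base delay per the runbook
	const maxBackoff = 60 * time.Second
	for {
		if err := dial(ctx); err == nil {
			return nil
		}
		// Full jitter: sleep a uniform random duration in [0, backoff),
		// which prevents the fleet from re-synchronizing into a new storm.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(sleep):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

A production client would also honor the gateway's Retry-After header as a lower bound on the first sleep.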
Incident 3 · Kafka consumer lag
- Detect: kafka_consumer_lag_msgs > 10K or consumer_lag_ms > 500 ms; delivery p50 end-to-end exceeds the 200 ms SLO.
- Mitigate: HPA scales delivery consumers up to the partition count (decision sketched below). If lag keeps growing, reduce message fan-out: disable optional consumers (analytics/search) temporarily, and pause persistence batch writes if storage pressure is also flagged.
- Restore: drain the lag over 10–30 minutes. Once lag is below 1K, scale consumers back down gradually, not all at once, to avoid rebalance storms.
- Escalate: if lag exceeds 100K and keeps growing, add partitions (a one-way migration that forces a full group rebalance). Post-incident: review message size, compression, and idempotency of downstream consumers.
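The scale-up/scale-down discipline fits in a few lines. A sketch of the decision function, with the 10K/1K thresholds taken from this runbook; the lower floor of half the partition count is an illustrative assumption:

```go
// Package delivery: scaling decision for the delivery consumer group.
// Thresholds (10K scale-up, 1K drained) mirror the runbook; the floor of
// half the partition count is an ASSUMPTION for illustration.
package delivery

func desiredConsumers(current, partitions int, lagMsgs int64) int {
	switch {
	case lagMsgs > 10_000 && current < partitions:
		return current + 1 // one step at a time; consumers beyond the partition count sit idle
	case lagMsgs < 1_000 && current > partitions/2:
		return current - 1 // step down gradually to avoid rebalance storms
	default:
		return current
	}
}
```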
Incident 4 · Postgres write overload
- Detect: db_write_queue_depth > 500; pg_stat_activity shows over 90% of connections actively writing; p99 INSERT latency exceeds 1 s.
- Mitigate: switch persistence consumers to batch mode (1000 messages per batch, commit every 1 s instead of per message; sketched below). Route all reads to replicas. Throttle new connections at the gateway (backpressure).
- Restore: the queue drains once batch throughput catches up. Monitor replica lag; if it exceeds 10 s, pause writes briefly to let replicas sync.
- Escalate: if the write queue grows despite batch mode, begin partition-level read-only degradation (serve history from the S3 cold tier) while the operations team initiates the DB shard or ScyllaDB migration plan.
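A minimal sketch of that batch-mode consumer, assuming pgx; the table and column names are placeholders for the real schema. It flushes at 1000 buffered messages or once per second, whichever comes first, replacing per-message commits with bulk COPYs:

```go
// Package persistence: batch writer per the runbook (1000 msgs/batch,
// flush at least every 1s). Table/column names are placeholders.
package persistence

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
)

type msg struct {
	ID     string
	ChatID string
	Body   string
}

func persistLoop(ctx context.Context, conn *pgx.Conn, in <-chan msg) {
	const batchSize = 1000
	buf := make([][]any, 0, batchSize)
	ticker := time.NewTicker(time.Second) // commit at least once per second
	defer ticker.Stop()

	flush := func() {
		if len(buf) == 0 {
			return
		}
		// One COPY is far cheaper than 1000 individual INSERT round-trips.
		_, err := conn.CopyFrom(ctx,
			pgx.Identifier{"messages"},
			[]string{"id", "chat_id", "body"},
			pgx.CopyFromRows(buf))
		if err != nil {
			log.Printf("batch flush failed: %v", err) // real code would retry or dead-letter
		}
		buf = buf[:0]
	}

	for {
		select {
		case m, ok := <-in:
			if !ok {
				flush()
				return
			}
			buf = append(buf, []any{m.ID, m.ChatID, m.Body})
			if len(buf) >= batchSize {
				flush() // size-triggered flush
			}
		case <-ticker.C:
			flush() // time-triggered flush bounds data loss to ~1s
		case <-ctx.Done():
			flush()
			return
		}
	}
}
```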
This plan is the engineering companion to the /im interactive simulator. The simulator answers "does the system behave at scale X?"; this page answers "what do I build, and how do I know when to change it?". Follow the breakpoints in order; skipping one pushes technical debt into an incident.