System design interviews can feel overwhelming — there's a mountain of concepts, and you never know which ones will come up. I put together a visual cheatsheet that covers the most essential topics, organized so you can see the big picture at a glance. 👇
Here's a topic-by-topic breakdown of everything on it. 🚀
1️⃣ Non-Functional Characteristics
Before designing anything, clarify the -ilities: availability, scalability, reliability, maintainability, latency, throughput, and consistency. These drive every architectural decision you'll make. 🎯
💡 Interview tip: Always ask about expected scale (QPS, data size, latency SLAs) before diving into a design.
2️⃣ CAP Theorem
A distributed system can guarantee at most two of these three at the same time:
- 🔄 Consistency — every read gets the latest write
- ✅ Availability — every request gets a response
- 🌐 Partition Tolerance — the system works despite network splits
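In distributed systems, network partitions will happen, so P is non-negotiable. The real choice is how the system behaves during a partition: CP (banking, inventory) refuses some requests to stay consistent, while AP (social feeds, DNS) keeps answering and may serve stale data.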
3️⃣ Horizontal vs. Vertical Scaling ⚖️
| | 📈 Vertical | 📊 Horizontal |
|---|---|---|
| How | Bigger machine | More machines |
| Limit | Hardware ceiling | Theoretically unlimited |
| Cost | Rises steeply at the high end | Roughly linear |
| Complexity | Low | High (needs load balancing, data partitioning) |
Most large production systems scale horizontally, because a single machine eventually hits a hardware ceiling that massive traffic will exceed. 🏗️
4️⃣ DNS (Domain Name System) 🌍
DNS translates human-readable domains to IP addresses. Key concepts:
- 🔍 Recursive resolvers do the heavy lifting: they walk the hierarchy (root → TLD → authoritative name servers) on the client's behalf
- ⏱️ TTL controls caching duration
- 🗺️ Geographic DNS routes users to the nearest data center
For system design, think about DNS as your first layer of traffic routing. 🛣️
5️⃣ Load Balancing ⚖️
Distributes traffic across multiple servers. Common algorithms:
- 🔄 Round Robin — simple rotation
- 📉 Least Connections — route to the least busy server
- 🔗 IP Hash — sticky sessions by client IP
- ⚖️ Weighted — more traffic to beefier servers
Works at Layer 4 (TCP) or Layer 7 (HTTP). Use health checks to automatically remove dead backends. 🏥
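To make the first two algorithms concrete, here's a minimal in-memory sketch. The class names (`RoundRobinBalancer`, `LeastConnectionsBalancer`) and backend labels are made up for illustration; a real load balancer would also track health checks and handle concurrency.

```python
import itertools


class RoundRobinBalancer:
    """Hands out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Routes each request to the backend with the fewest open connections."""

    def __init__(self, backends):
        self.active = {backend: 0 for backend in backends}

    def acquire(self):
        # Ties go to whichever backend was registered first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request finishes so the count stays accurate.
        self.active[backend] -= 1
```

Round robin needs no state about the backends, which is why it's the usual default. Least connections only pays off when request durations vary a lot.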
6️⃣ API Gateway 🚪
A single entry point for all client requests. Handles:
- 🔐 Authentication & authorization
- 🚦 Rate limiting
- 🛤️ Request routing & transformation
- 🔒 SSL termination
- 📝 Logging & analytics
Think of it as the front door to your microservices architecture. 🏠
7️⃣ Content Delivery Network (CDN) 🌐
Caches static assets (images, CSS, JS, video) at edge locations close to users.
- ⬆️ Push CDN — you upload content proactively
- ⬇️ Pull CDN — fetches from origin on first request
Reduces latency dramatically. Pair with proper cache-control headers for best results. ⚡
8️⃣ Caching 💾
The fastest database query is the one you never make. 🎯
- 🌐 Layers: Browser cache → CDN cache → ⚡ Application cache → 💽 Database cache
- 🛠️ Tools: Redis, Memcached
- 📋 Strategies: Cache-aside, Write-through, Write-behind, Read-through
⚠️ Watch out for: cache invalidation (hard), thundering herd (many clients missing the same expired key at once and stampeding the database), and stale data.
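Cache-aside is the strategy you'll probably sketch most often, so here's a rough version with a TTL. It assumes a hypothetical `loader` function that stands in for the database read, and an injectable `clock` so the expiry logic is easy to test. In production the dictionary would typically be Redis or Memcached.

```python
import time


class CacheAside:
    """Cache-aside: check the cache first, fall back to the loader on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # hypothetical function that reads the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]        # cache hit, still fresh
        value = self._loader(key)  # miss or expired: go to the source of truth
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)
```

Note that this sketch does nothing about the thundering herd: if many callers miss the same key at once, each of them calls the loader.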
9️⃣ Polling vs. WebSockets 📡
| | 🔄 Polling | 🔌 WebSockets |
|---|---|---|
| Direction | Client → Server | Bidirectional |
| Latency | Depends on interval | Real-time |
| Overhead | New HTTP request (headers and all) each time | Single persistent connection |
| Use case | Email checks, dashboards | Chat, live feeds, gaming |
Long polling is a middle ground — the server holds the connection open until data is available. 🔗
🔟 Forward & Reverse Proxy 🛡️
- ➡️ Forward proxy — sits in front of clients (VPN, ad blockers, corporate firewalls)
- ⬅️ Reverse proxy — sits in front of servers (load balancer, API gateway, Nginx)
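Each hides one side from the other: a forward proxy hides the client from the server, and a reverse proxy hides the servers from the client. Reverse proxies are a fundamental building block of scalable systems. 🧱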
1️⃣1️⃣ Consistent Hashing 🔄
Solves the "what happens when we add/remove servers" problem.
- 🗺️ Maps both servers and keys to a hash ring
- 🔄 When a server is added/removed, only about K/N keys need to be remapped on average (K = total keys, N = number of servers), not all of them
- 🛠️ Used in distributed caches, database sharding, CDNs
Virtual nodes improve even distribution across the ring. 💫
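The ring is easier to reason about once you've seen it in code. Below is a small sketch, not a production implementation: `HashRing`, the `vnodes` parameter and the use of MD5 for ring positions are my own choices for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only for spreading keys)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each server appears at many positions (virtual nodes) for even spread.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0
        return self._ring[idx][1]
```

When a server is added, the only keys that move are the ones that now land on the new server's virtual nodes. Everything else stays put, which is the whole point compared to `hash(key) % N`.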
1️⃣2️⃣ Database Types 🗄️
A quick taxonomy:
- 📊 Relational (SQL): MySQL, PostgreSQL — structured data, ACID transactions
- 📄 Document: MongoDB — flexible schemas, JSON-like storage
- 🔑 Key-Value: Redis, DynamoDB — blazing fast lookups
- 📈 Column-Family: Cassandra, HBase — wide-column, high write throughput
- 🔗 Graph: Neo4j — relationships are first-class citizens
- ⏱️ Time-Series: InfluxDB — metrics, IoT data
💡 Pick the right tool for the job. There's no "best" database.
1️⃣3️⃣ SQL vs. NoSQL ⚔️
| | 📊 SQL | 🍃 NoSQL |
|---|---|---|
| Schema | Fixed | Flexible |
| Scaling | Vertical (mostly) | Horizontal |
| Transactions | Strong ACID | Eventual consistency (usually) |
| Joins | Native | Application-level |
| Best for | Complex queries, relationships | Scale, flexibility, speed |
Modern apps often use both — SQL for transactional data, NoSQL for caching/analytics. 🤝
1️⃣4️⃣ Database Scaling 📈
Two main strategies:
📖 Read Replicas
- 📋 Copy data to multiple follower nodes
- 🔄 Reads spread across replicas
- ✍️ Writes go to the leader only
🔪 Sharding
- ✂️ Split data across multiple databases
- 📦 Each shard holds a subset of the data
- 🧩 Hard problems: cross-shard queries, rebalancing
1️⃣5️⃣ Indexes 📇
Usually a B-tree, which turns a full table scan into an O(log n) lookup. Hash indexes give O(1) average-case lookups for equality checks but can't serve range queries. ⚡
- 📄 Single-column vs. 📑 composite indexes
- 🎯 Covering index — query answered entirely from the index
- ⚖️ Trade-off: faster reads, slower writes (index maintenance overhead)
💡 Rule of thumb: index columns used in `WHERE`, `JOIN`, and `ORDER BY`.
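You can watch an index change the query plan using SQLite from Python's standard library. This is only a sketch: the `users` table and `idx_users_email` index are invented for the example, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


lookup = "SELECT name FROM users WHERE email = 'user42@example.com'"

plan_before = query_plan(lookup)  # no index on email yet, so a table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup)   # the planner can now use idx_users_email
```

Printing `plan_before` and `plan_after` side by side is a quick way to check whether a query is actually using the index you think it is.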
1️⃣6️⃣ Leader Election 👑
In distributed systems, you often need a single coordinator:
- 🚀 Raft — understandable consensus (etcd, Consul)
- 📚 Paxos — the classic (harder to implement)
- 🏗️ ZooKeeper — battle-tested coordination service
Used in database replication, distributed locks, and task schedulers. 🔐
1️⃣7️⃣ Message Queues 📬
Decouple producers from consumers:
- 🚀 Kafka — high throughput, durable, great for event streaming
- 🐰 RabbitMQ — traditional broker, flexible routing
- ☁️ SQS — managed, serverless-friendly
Benefits: buffering, async processing, retry logic, fan-out. 🎯
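The decoupling idea fits in a few lines using Python's in-process `queue.Queue` as a stand-in for a real broker. It's a toy under obvious assumptions: one consumer, no durability, and `run_pipeline` is a name I made up.

```python
import queue
import threading


def run_pipeline(jobs):
    """Producer enqueues jobs; a worker thread consumes them at its own pace."""
    q = queue.Queue(maxsize=100)  # bounded buffer: a full queue blocks the producer
    results = []

    def consumer():
        while True:
            job = q.get()
            if job is None:          # sentinel: no more work
                q.task_done()
                break
            results.append(job * 2)  # stand-in for real processing
            q.task_done()

    worker = threading.Thread(target=consumer)
    worker.start()
    for job in jobs:                 # the producer never calls the consumer directly
        q.put(job)
    q.put(None)
    q.join()                         # wait until every item has been processed
    worker.join()
    return results
```

The producer only knows about the queue, so you can add consumers or slow them down without touching the producer code.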
1️⃣8️⃣ Event-Driven Architecture ⚡
Systems communicate through events rather than direct calls:
- 📤 Event producer → 🚌 Event bus → 📥 Event consumer
- 🔗 Enables loose coupling and independent scaling
- 🧩 Patterns: Event sourcing, CQRS, Saga
Think: "When X happens, trigger Y" at scale. 💭
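The "when X happens, trigger Y" idea boils down to a tiny publish/subscribe bus. The `EventBus` class and event names below are illustrative only. A real system would put Kafka, SNS or similar in the middle and deliver asynchronously.

```python
from collections import defaultdict


class EventBus:
    """In-process pub/sub: producers publish, subscribed handlers react."""

    def __init__(self):
        self._handlers = defaultdict(list)  # event type -> list of handlers

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer does not know who is listening, or whether anyone is.
        for handler in list(self._handlers.get(event_type, ())):
            handler(payload)
```

Adding a new reaction to an event means adding a subscriber. The code that publishes the event doesn't change.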
1️⃣9️⃣ Microservices 🧱
Break a monolith into small, independently deployable services:
- 📦 Each service owns its data and logic
- 📡 Communicate via APIs or message queues
- ⚖️ Trade simplicity for scalability and team autonomy
✅ When to use: large teams, independent scaling needs, polyglot tech stacks.
❌ When not to: small teams, early-stage products.
2️⃣0️⃣ Communication Patterns 📡
- 🔄 Synchronous: REST, gRPC, GraphQL — request/response
- ⚡ Asynchronous: Message queues, event streams — fire and forget
- 🚀 gRPC — binary, fast, great for inter-service communication
- 🎯 GraphQL — client specifies exactly what data it needs
2️⃣1️⃣ Rate Limiting 🚦
Protect your system from abuse and overload:
- 🪣 Token bucket — tokens refill at a fixed rate
- 📊 Sliding window — counts requests in a rolling time window
- 💧 Leaky bucket — processes at a constant rate
Typically implemented at the API gateway level. Return 429 Too Many Requests with a Retry-After header. 🛑
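Token bucket is the one interviewers most often ask you to write out, so here's a minimal single-process sketch. The `clock` parameter and the `retry_after` helper are my additions for testability and for filling in a Retry-After value. A distributed limiter would keep this state in something shared, such as Redis.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill based on elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller would respond with 429

    def retry_after(self, cost=1.0):
        """Seconds until `cost` tokens exist, measured from the last allow() call."""
        missing = max(0.0, cost - self._tokens)
        return missing / self.rate
```

Capacity controls how big a burst you tolerate, and rate controls the sustained throughput. Tuning those two numbers separately is the main advantage over a fixed window counter.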
2️⃣2️⃣ Idempotency 🔁
The same request applied multiple times has the same effect as once.
Why it matters: network retries, message queue redelivery, double-clicks. 🖱️
How: use idempotency keys — client sends a unique key, server deduplicates. 🔑
💰 Critical for payment systems and any write operation that might be retried.
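Here's roughly what server-side deduplication with an idempotency key looks like. `PaymentService` and its fields are hypothetical. In a real service the "check then store" step has to be atomic (for example, a unique constraint on the key), which this in-memory sketch skips.

```python
class PaymentService:
    """Replays the stored response when the same idempotency key is seen again."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response returned the first time
        self.charges = []     # stand-in for the real side effect (charging a card)

    def charge(self, idempotency_key, account, amount_cents):
        if idempotency_key in self._responses:
            # Retry or duplicate delivery: return the original result, do nothing else.
            return self._responses[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"status": "charged", "charge_id": len(self.charges)}
        self._responses[idempotency_key] = response
        return response
```

The client generates the key once per logical operation (a UUID is common) and reuses it on every retry of that operation.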
2️⃣3️⃣ Bloom & Cuckoo Filters 🌸
Probabilistic data structures for "is this element in the set?" 🤔
- 🌸 Bloom filter — space-efficient, no false negatives, possible false positives
- 🐦 Cuckoo filter — supports deletion, and is often more space-efficient than a Bloom filter when you need a low false positive rate
Use cases: cache hit prediction, spam filtering, preventing duplicate writes. 🎯
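A Bloom filter is small enough to sketch in full. The sizes below (`size_bits=1024`, `num_hashes=3`) are arbitrary defaults for illustration, and deriving the hash functions from seeded SHA-256 is just one simple way to do it.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false positive rate."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not present"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

Bits are only ever set, never cleared, which is why a plain Bloom filter can't support deletion.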
2️⃣4️⃣ Single Point of Failure (SPOF) 💀
Any component whose failure brings down the entire system.
Eliminate SPOFs with:
- 🔄 Redundancy (multiple instances)
- 🔀 Failover mechanisms
- 🏥 Health checks + automatic recovery
- 🌍 Geographic distribution
🗣️ Interview mantra: "What happens when this component dies?" ☠️
2️⃣5️⃣ Heartbeat 💓
Periodic "I'm alive" signals between components.
- 💓 Server sends heartbeat to a monitor at regular intervals
- ⏰ If heartbeat is missed → mark as unhealthy → trigger failover
- 🛠️ Used in: leader election, cluster management, load balancer health checks
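The monitor side of a heartbeat can be sketched as a table of "last seen" timestamps. `HeartbeatMonitor` and its `timeout_seconds` parameter are names I picked for the example. Triggering the actual failover is left out.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that went quiet."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}  # node -> time of most recent heartbeat

    def beat(self, node):
        """Called whenever a node's "I'm alive" signal arrives."""
        self._last_seen[node] = self._clock()

    def unhealthy(self):
        """Nodes whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

Picking the timeout is the real design decision: too short and a slow network causes false failovers, too long and real failures go unnoticed.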
2️⃣6️⃣ Checksum ✅
Detects data corruption during transfer or storage.
- 🔓 MD5 — fast but not cryptographically secure
- 🔐 SHA-256 — secure, widely used
- ⚡ CRC32 — fast, good for error detection
Applied at: file transfers, network packets, distributed storage verification. 📁
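Both SHA-256 and CRC32 ship with Python's standard library, so verifying a payload takes only a few lines. The helper names (`sha256_hex`, `crc32`, `verify`) are mine.

```python
import hashlib
import zlib


def sha256_hex(data: bytes) -> str:
    """Cryptographic digest: use when tampering is a concern."""
    return hashlib.sha256(data).hexdigest()


def crc32(data: bytes) -> int:
    """Cheap 32-bit checksum: catches accidental corruption, not attackers."""
    return zlib.crc32(data)


def verify(data: bytes, expected_sha256: str) -> bool:
    """Recompute the digest on the receiving side and compare to the sender's."""
    return sha256_hex(data) == expected_sha256
```

The sender computes the checksum before transfer and ships it alongside the data. The receiver recomputes it and rejects the payload on a mismatch.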
2️⃣7️⃣ Database Replication 🔁
Copy data across multiple nodes:
- 🔄 Synchronous — writes confirmed after all replicas update (strong consistency, higher latency)
- ⚡ Asynchronous — writes confirmed immediately, replicas catch up (eventual consistency, lower latency)
Leader-follower is the most common pattern. Multi-leader and leaderless setups exist for more advanced use cases. 🏗️
2️⃣8️⃣ Database Sharding & Partitioning 🔪
- 🔪 Sharding — horizontal split across databases/servers
- 📊 Partitioning — split within a single database
Sharding strategies:
- 📏 Range-based — by date, ID range
- 🔢 Hash-based — hash the shard key
- 📖 Directory-based — lookup table
🧩 Hard parts: rebalancing, cross-shard joins, hotspot avoidance.
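The hash-based and range-based strategies each come down to a small routing function. This is a sketch with made-up names (`hash_shard`, `range_shard`), assuming string shard keys for hashing and integer IDs for ranges.

```python
import bisect
import hashlib


def hash_shard(shard_key: str, num_shards: int) -> int:
    """Hash-based: spreads keys evenly, but range scans touch every shard."""
    digest = hashlib.sha256(shard_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(record_id: int, upper_bounds: list) -> int:
    """Range-based: upper_bounds[i] is the exclusive upper bound of shard i.

    IDs at or above the last bound fall into one extra overflow shard.
    """
    return bisect.bisect_right(upper_bounds, record_id)
```

The sketch uses a stable hash on purpose: Python's built-in `hash()` is randomized per process for strings, so two servers would disagree about where a key lives. Also note that plain modulo hashing reshuffles most keys when `num_shards` changes, which is the rebalancing pain that consistent hashing is meant to reduce.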
🏁 Final Thoughts
This cheatsheet covers the 28 core concepts that come up again and again in system design interviews. You don't need to memorize everything — focus on understanding when and why to use each one. 🎯
The real skill in system design isn't knowing the tools. It's knowing which tools to reach for, and being able to explain your tradeoffs clearly. 💪
Good luck on your next interview. 🚀🔥
