tem chelsy

My First System Design Class Taught Me More Than a Year of Coding

I walked into my first system design class expecting to write code. I walked out drawing boxes on a whiteboard and asking questions I had never thought to ask before. No syntax. No frameworks. Just shapes, arrows, and a deceptively simple prompt:

Design a service where users paste in a long URL and get back a short one.

That was it. The whole brief. Something every developer has used a hundred times — bit.ly, TinyURL, the link shortener baked into Twitter. I had never once thought about what was behind it. By the end of that session, I couldn't stop thinking about it.

This is what I learned.


The system design interview has a shape

Before we drew anything, our instructor made one thing clear: every system design problem follows the same five steps.

Clarify → Estimate → Sketch → Deep dive → Failure modes

That structure sounds obvious on paper. In practice, it is the hardest discipline to maintain. Every instinct in a developer's brain says start building. System design punishes that instinct. The first thing you do is ask questions — not answer them.

For the URL shortener, those questions were:

  • Is this read-heavy or write-heavy?
  • How short is "short"? What is the key format?
  • Do URLs expire?
  • Do users have accounts, or is it anonymous?
  • Do we need analytics?
  • Is this a global service?

Each answer reshapes the entire design. If URLs never expire, you don't need TTL logic anywhere in the system. If it's anonymous only, you don't need a users table. If you don't need analytics, you can use a 301 redirect instead of a 302 — a small technical detail with surprisingly large consequences.

The act of clarifying is not stalling. It is the design.


One number that changes everything

Once we agreed on scope, we did math. Quick, rough, back-of-envelope math — the kind where you round aggressively and the goal is order of magnitude, not precision.

Assume 100 million URLs created per day. That is roughly 1,200 writes per second. Now assume each short URL gets clicked ten times on average. That is 12,000 reads per second. Multiply by ten again to allow for peak traffic and you are looking at something closer to 120,000 reads per second.
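Written out as a quick sanity check, the same arithmetic looks like this (a rough sketch; the only inputs are the assumptions above, rounded aggressively):

URLS_PER_DAY = 100_000_000          # assumed write volume
READS_PER_URL = 10                  # assumed average clicks per short URL
PEAK_FACTOR = 10                    # assumed headroom for traffic spikes
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

writes_per_sec = URLS_PER_DAY / SECONDS_PER_DAY        # ~1,157, call it 1,200
avg_reads_per_sec = writes_per_sec * READS_PER_URL     # ~12,000
peak_reads_per_sec = avg_reads_per_sec * PEAK_FACTOR   # ~120,000

print(f"writes/sec     ~{writes_per_sec:,.0f}")
print(f"avg reads/sec  ~{avg_reads_per_sec:,.0f}")
print(f"peak reads/sec ~{peak_reads_per_sec:,.0f}")
print(f"read:write     ~{peak_reads_per_sec / writes_per_sec:,.0f}:1")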

The ratio of reads to writes is about 100 to 1.

💡 Key idea: That single number — 100:1 — is the most important fact about a URL shortener. It tells you where to spend your engineering effort, what will break first, and what your architecture needs to optimise for.

This system is not a writing system that also does reads. It is a reading system that occasionally accepts writes. That distinction changes every decision that follows.


The two endpoints

The API for a URL shortener is almost insultingly simple. Two endpoints:

POST /shorten    →  takes a long URL, returns a short code
GET  /:code      →  redirects to the original URL

That second endpoint is where all the complexity lives. It runs 100 times more often than the first. It needs to respond in under 10 milliseconds. It needs to handle viral traffic spikes — a single link being clicked by millions of people in minutes.

There is also a small but meaningful technical decision hiding in that GET endpoint: 301 or 302?

                        301 (Permanent)      302 (Temporary)
Browser caches it?      ✅ Yes, forever       ❌ No, re-checks every time
Analytics work?         ❌ No                 ✅ Yes
Can expire the URL?     ❌ No                 ✅ Yes
Saves bandwidth?        ✅ Yes                ❌ No

We use 302. The analytics and expiry control are worth the extra traffic.
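To make the shape concrete, here is a minimal sketch of those two endpoints. Flask is my assumption here, sho.rt is a made-up domain, and the storage helpers are stubs; the class drew boxes, not code.

from flask import Flask, request, redirect, jsonify, abort

app = Flask(__name__)

# Stubbed helpers (hypothetical names): real versions would talk to the
# key generation service, Postgres, and Redis.
def claim_short_code() -> str: ...
def save_mapping(code: str, long_url: str) -> None: ...
def lookup_long_url(code: str): ...

@app.post("/shorten")
def shorten():
    long_url = request.get_json()["url"]
    code = claim_short_code()
    save_mapping(code, long_url)
    return jsonify({"short_url": f"https://sho.rt/{code}"})

@app.get("/<code>")
def follow(code):
    long_url = lookup_long_url(code)
    if long_url is None:
        abort(404)
    return redirect(long_url, code=302)  # 302 keeps analytics and expiry possible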


How do you generate the short code?

This is where it gets interesting. There are three approaches, each with real trade-offs.

Option 1 — Hash the URL

Run the long URL through MD5. Take the first 42 or so bits (about as much as seven base62 characters can hold). Encode in base62 and you get a 7-character code.

Pro: Same URL always gives the same short code. Natural deduplication.

Con: Collisions happen at scale. You need a DB check and retry loop on every write.
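A rough sketch of this approach, including the collision retry the con refers to (exists is a stand-in for the real database check):

import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = ALPHABET[r] + out
    return out or "0"

def hash_code(long_url: str, salt: str = "") -> str:
    digest = hashlib.md5((long_url + salt).encode()).digest()
    n = int.from_bytes(digest[:8], "big")   # leading bits of the hash
    return base62(n)[:7]                    # keep a 7-character code

def shorten_key(long_url: str, exists) -> str:
    # Retry with a salt until the code is not already taken (the DB check).
    code, attempt = hash_code(long_url), 0
    while exists(code):
        attempt += 1
        code = hash_code(long_url, salt=str(attempt))
    return code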

Option 2 — Auto-incrementing counter

Keep a global counter. Every new URL gets the next number, encoded in base62.

Pro: Simple, no collisions.

Con: A global counter is hard to share across many servers. At 1,200 writes/sec it becomes a bottleneck. Also leaks how many URLs have been created.

Option 3 — Key Generation Service ✅ (preferred)

A separate service pre-generates a pool of random 7-character base62 keys offline. The Write API claims one atomically. Each server holds a local batch of 1,000 keys in memory.

Pro: Zero collisions. No retry loops. Sub-millisecond key assignment.

Con: Extra service to maintain. Needs at least two replicas, since on its own it is a single point of failure.
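A rough sketch of the Write API side of that arrangement (kgs_claim_batch is a hypothetical call to the key service; the point is the local batch of 1,000 keys):

from collections import deque

class KeyPool:
    """Holds a local batch of pre-generated keys; refills from the KGS."""

    def __init__(self, kgs_claim_batch, batch_size: int = 1000):
        self._claim = kgs_claim_batch      # hypothetical RPC to the key service
        self._batch_size = batch_size
        self._keys = deque()

    def next_key(self) -> str:
        if not self._keys:
            # One network round trip per 1,000 writes instead of one per write.
            self._keys.extend(self._claim(self._batch_size))
        return self._keys.popleft()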

💡 Remember: In a design interview, none of these is wrong. What matters is naming the trade-offs and defending your choice.


Where does the data live?

The storage layer has two parts working together.

PostgreSQL holds the source of truth — a urls table mapping each short key to its long URL, owner, creation time, and expiry. We use read replicas (three or four of them) because the redirect path is read-only and those 120,000 reads per second can be distributed across multiple servers.

Redis sits in front of Postgres as a cache. Every time a URL is created, it is written to both Postgres and Redis simultaneously (write-through caching). When someone follows a short link, the Read API checks Redis first.

Cache hit  → return long URL instantly (no DB touched)
Cache miss → fetch from Postgres, repopulate Redis, return URL

At an 80% cache hit rate, only 24,000 of those 120,000 reads per second reach the database. Without caching, you would need five times more database capacity.

Caching is the cheat code for read-heavy systems.
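Here is the read path from that description, sketched with redis-py and a stubbed Postgres lookup (fetch_from_postgres and the TTL are placeholders, not part of the class material):

import redis

r = redis.Redis()       # assumes a local Redis; in production this is a cluster
CACHE_TTL = 24 * 3600   # how long a mapping stays hot in Redis

def resolve(code: str, fetch_from_postgres):
    cached = r.get(code)
    if cached is not None:                    # cache hit: no DB touched
        return cached.decode()
    long_url = fetch_from_postgres(code)      # cache miss: go to Postgres
    if long_url is not None:
        r.set(code, long_url, ex=CACHE_TTL)   # repopulate Redis for next time
    return long_url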


The full architecture

When you put all of this together, the system looks like this:

         Client / Browser
               ↓
         Load Balancer
         ↙            ↘
    Write API        Read API
    ↙      ↘              ↓
Key Gen  PostgreSQL ← Redis
              ↓         (cache miss)
         Analytics
          (async)

The write path and read path are intentionally separated. Writes are rare and can be slightly slower — a user waiting 200ms to get a short link is acceptable. Reads are constant and must be instant.

How analytics work without slowing redirects

This is one of the trickiest parts. A redirect must be fast. Recording a click is slow. You cannot do both synchronously.

When someone follows a short link, the Read API returns the 302 redirect immediately. After the response is sent — while the user is already being redirected — the server records the click to an in-memory queue. A background process drains that queue every 2 seconds with a single bulk INSERT.

User clicks → 302 returned instantly
            ↓ (after response, async)
        click pushed to queue
            ↓ (every 2 seconds)
        bulk INSERT into clicks table

The user never waits. The database is not hammered. This is the core principle: never make the user wait for work they don't need to see the result of.
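A minimal sketch of the pattern with an in-process queue and a background thread (bulk_insert_clicks is a stand-in for the real batched INSERT):

import queue
import threading
import time

click_queue = queue.Queue()

def record_click(code: str) -> None:
    # Called after the 302 has already been sent; never blocks the redirect.
    click_queue.put((code, time.time()))

def drain_clicks(bulk_insert_clicks, interval: float = 2.0) -> None:
    # Background worker: wake every 2 seconds and flush everything in one INSERT.
    while True:
        time.sleep(interval)
        batch = []
        while not click_queue.empty():
            batch.append(click_queue.get_nowait())
        if batch:
            bulk_insert_clicks(batch)

# threading.Thread(target=drain_clicks, args=(bulk_insert_clicks,), daemon=True).start()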


What could go wrong

This is the senior-engineer move. After the happy path works, ask: what fails?

🔴 Cache stampede

A viral link goes cold, its cache entry expires, then suddenly 100,000 users click it at the same moment. All miss Redis and flood Postgres simultaneously.

Fix: Redis mutex lock on miss — one thread fetches from DB, the rest wait briefly for the cache to repopulate.
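Roughly, with redis-py (the lock key and timings are illustrative):

import time
import redis

r = redis.Redis()

def resolve_with_lock(code: str, fetch_from_postgres, ttl: int = 3600):
    cached = r.get(code)
    if cached is not None:
        return cached.decode()
    # Only one worker wins the lock and hits Postgres; the rest retry the cache.
    if r.set(f"lock:{code}", "1", nx=True, ex=5):
        long_url = fetch_from_postgres(code)
        if long_url is not None:
            r.set(code, long_url, ex=ttl)
        return long_url
    time.sleep(0.05)   # brief wait for the winner to repopulate the cache
    cached = r.get(code)
    return cached.decode() if cached is not None else fetch_from_postgres(code)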

🔴 Key Generation Service goes down

No new short URLs can be created. The entire write path is blocked.

Fix: Run 2+ KGS replicas. Each Write API server holds a local batch of 1,000 pre-fetched keys and continues working for minutes during a KGS outage.

🔴 Single link gets too hot

5 million clicks per minute overwhelms even the Redis shard serving that key.

Fix: Local in-memory cache on each Read API server (5s TTL). CDN caches the redirect at the edge — requests never reach your servers at all for the hottest links.
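The per-server cache can be as small as a dict with timestamps. A sketch, using the 5-second TTL from the fix above; redirect lookups for a hot key hit this before ever touching Redis:

import time

_local = {}        # code -> (long_url, expires_at)
LOCAL_TTL = 5.0

def local_get(code: str):
    entry = _local.get(code)
    if entry and entry[1] > time.time():
        return entry[0]
    return None

def local_put(code: str, long_url: str) -> None:
    _local[code] = (long_url, time.time() + LOCAL_TTL)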

🟡 Database primary fails

Writes are blocked. Read replicas continue but may be slightly stale.

Fix: Automated leader election (Patroni). Queue writes in Redis during the ~30-second failover window.

🟡 Expired URL still in Redis

The URL expired in Postgres but Redis still returns the old entry.

Fix: Set Redis TTL 30 seconds shorter than the actual expiry. Check expires_at on every DB read as a safety net.


What I actually learned

The URL shortener is the "hello world" of system design for a reason. It is small enough to finish in one session and rich enough to introduce almost every major concept: caching, replication, key generation, rate limiting, async processing, failure modes.

But the deeper lesson was not about any of those things specifically.

1. The shape of a system matters more than the code inside it.
The same code that works for 1,000 users collapses under 1,000,000 — not because the code changed, but because the shape around it was never designed to scale.

2. The first question is never "how do I build this?"
It is "what does this actually need to do?" The 100:1 ratio was not a number I calculated and forgot. It was a constraint that filtered every subsequent decision. Good system design is a chain of consequences from the question you asked at the start.

3. Naming what fails is not pessimism — it is engineering.
Every system fails eventually. The difference between a resilient system and a fragile one is not whether failures happen, but whether someone thought through what to do when they do.

I came in expecting to write code. I left knowing how to think about systems.

That is a harder skill. And a more important one.


This is my write-up from Session 1 of a system design course. If you are following along, your homework is to sketch this same architecture on your own — without looking at notes — and find where you get stuck. That is exactly where the learning is.

Drop your questions or your own diagrams in the comments. Let's figure it out together. 👇
