What I Learned Building an HTTP Server From Scratch
Most of us never think about what happens when a web server actually receives a request. Frameworks handle it. Infrastructure hides it. And that's fine — until you want to really understand what's going on underneath.
So I built one myself. An HTTP server in C++, starting from raw POSIX sockets. No frameworks, no libraries for the hard parts. Just system calls, byte buffers, and a lot of edge cases.
What started as a learning exercise turned into something more specific: watching performance bottlenecks shift layers as the architecture improved. That turned out to be the most interesting part.
Why Build This at All?
A few questions kept nagging me that I couldn't answer confidently:
- How does a server know when a full HTTP request has arrived?
- What actually happens when headers come in as fragments?
- Why does a server handle 10 users fine, then struggle at 500?
- Where do production servers like NGINX actually spend their time?
The only way to stop guessing was to build it and find out.
The goal wasn't to beat NGINX. It was to make the costs visible.
The Architecture (Deliberately Simple)
I kept the design modular so failures were easy to trace:
Client → Accept → HTTP Read/Parse → Route → Response → Write
Each layer had one job:
- Socket layer — socket, bind, listen, accept
- HTTP I/O — buffered reads, parsing, response writing
- Router — static and dynamic path matching
- Runtime — thread pool and/or epoll-based execution
That separation paid off. When something broke, it was obvious where to look.
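To make that concrete, here's a minimal sketch of the routing boundary (the Request/Response types and method names are illustrative; only Router::dispatch is something the perf output later actually shows):

#include <functional>
#include <string>
#include <unordered_map>

// Illustrative layer boundary: the runtime hands a parsed Request to
// the Router, which returns a Response for the I/O layer to write.
struct Request  { std::string method, path, body; };
struct Response { int status; std::string body; };

using Handler = std::function<Response(const Request&)>;

class Router {
public:
    void add(const std::string& path, Handler h) { routes_[path] = std::move(h); }
    Response dispatch(const Request& req) const {
        auto it = routes_.find(req.path);
        return it != routes_.end() ? it->second(req) : Response{404, "Not Found"};
    }
private:
    std::unordered_map<std::string, Handler> routes_;
};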
Networking Is Messier Than the Textbook
The textbook version of a server looks clean:
- Accept connection
- Read request
- Process it
- Write response
- Close
The real version? Reads are partial. Clients disconnect mid-write. Malformed requests arrive constantly. Keep-alive connections blur the line between "done" and "waiting."
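A taste of that messiness: write() can accept fewer bytes than you asked it to send, and a peer that vanished mid-response surfaces as an error on the next call. A minimal send-all loop, roughly what any real server ends up with (the helper name is mine, not the project's):

#include <cerrno>
#include <sys/socket.h>

// Keep writing until everything is sent or the client is gone.
bool send_all(int fd, const char* data, size_t len) {
    while (len > 0) {
        // MSG_NOSIGNAL: report a dead peer via EPIPE instead of SIGPIPE
        ssize_t n = send(fd, data, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            return false;                  // client disconnected mid-write
        }
        data += n;  // partial write: advance past what was sent
        len -= n;
    }
    return true;
}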
Even tiny decisions matter. Here's what the actual socket setup looks like:
#include <netinet/in.h>   // sockaddr_in, htons, INADDR_ANY
#include <sys/socket.h>   // socket, setsockopt, bind, listen, accept

int fd = socket(AF_INET, SOCK_STREAM, 0);
// Avoid "address already in use" on restart
int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
sockaddr_in addr{};
addr.sin_family = AF_INET;
addr.sin_port = htons(8080);
addr.sin_addr.s_addr = INADDR_ANY;
bind(fd, (sockaddr*)&addr, sizeof(addr));
listen(fd, SOMAXCONN);
int client_fd = accept(fd, nullptr, nullptr);
That SO_REUSEADDR line — just one option — prevents restart failures caused by sockets stuck in TIME_WAIT. The details add up fast.
HTTP Parsing: The First Humbling Moment
My first assumption was wrong immediately:
One read() call = one complete HTTP request.
Almost never true.
What actually works:
- Accumulate incoming bytes in a buffer
- Scan for \r\n\r\n (the end of headers)
- Only then parse the headers
- Use Content-Length to know how much body to expect
And you need guardrails:
- Cap header size (16KB is common)
- Cap body size
- Reject malformed requests early
These defensive checks improved stability more than any performance optimization I made. Correctness has to come before speed.
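Put together, the read loop looks something like this (a minimal sketch; the 16KB cap matches the guideline above, everything else is illustrative):

#include <string>
#include <unistd.h>

constexpr size_t kMaxHeaderBytes = 16 * 1024;  // cap header size

// Accumulate bytes until the header terminator shows up.
// Returns false on disconnect, error, or oversized headers.
bool read_headers(int fd, std::string& buf, size_t& header_end) {
    char chunk[4096];
    while (true) {
        size_t pos = buf.find("\r\n\r\n");
        if (pos != std::string::npos) {
            header_end = pos + 4;
            return true;  // headers complete; parse them, then use
                          // Content-Length to read the body
        }
        if (buf.size() > kMaxHeaderBytes) return false;  // reject early
        ssize_t n = read(fd, chunk, sizeof(chunk));
        if (n <= 0) return false;  // client gone or error
        buf.append(chunk, n);      // one read() rarely equals one request
    }
}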
The Concurrency Problem (Where It Gets Interesting)
My first concurrency model was simple: a thread pool with blocking I/O. Each thread picks up a connection and handles it start to finish.
This works great — until it doesn't.
The breaking point: threads block while waiting for slow or idle clients. With enough connections, every thread is just waiting. New requests queue up. Latency climbs. Throughput flatlines.
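For concreteness, here's roughly what that first model looked like (a sketch; the queueing details and names are illustrative, not the project's exact code):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <sys/socket.h>
#include <thread>

void handle_connection(int fd);  // app logic: read, parse, respond, close

// Shared queue of accepted sockets.
std::queue<int> pending;
std::mutex mu;
std::condition_variable cv;

// Each worker blocks on one client at a time, start to finish.
void worker() {
    while (true) {
        int fd;
        {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, [] { return !pending.empty(); });
            fd = pending.front();
            pending.pop();
        }
        handle_connection(fd);  // a slow client stalls this whole thread
    }
}

// Acceptor thread: hand each new connection to the pool.
void accept_loop(int server_fd) {
    while (true) {
        int client_fd = accept(server_fd, nullptr, nullptr);
        {
            std::lock_guard<std::mutex> lock(mu);
            pending.push(client_fd);
        }
        cv.notify_one();
    }
}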
That's when I started benchmarking seriously.
Benchmarking: Watching Bottlenecks Move
I used wrk (4 threads, 20s test runs) to measure throughput and latency across four configurations. The question I asked at every stage:
What's the bottleneck now, and why?
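For reference, each run had roughly this shape (the port and path are my assumptions; the post only specifies the thread count and duration, with the connection count varied per test):

wrk -t4 -c800 -d20s http://localhost:8080/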
Stage 1 — Baseline (Backlog Maxed): ~5,000 req/s
Connections    Req/s    Avg Latency    p99 Latency
──────────────────────────────────────────────────
         50    5,009        9.87ms           33ms
        100    5,087       19.68ms           49ms
        200    4,973       40.40ms           97ms
        400    5,137       77.40ms          158ms
        800    5,135      154.08ms          283ms
The numbers tell a clear story: throughput is pinned at ~5K req/s no matter how many connections are thrown at it, while latency doubles right along with the connection count.
This is textbook queueing saturation — like a single checkout lane with a growing line. The system is fully occupied. More load just means more waiting, not more work done.
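Little's law makes this quantitative: average latency ≈ open connections ÷ throughput. At 800 connections, 800 / 5,135 req/s ≈ 156 ms, almost exactly the 154 ms measured above. The queue, not the work, sets the latency.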
The lesson: the architecture itself was the ceiling, not the code.
Stage 2 — Thread Pool Only: ~21,000 req/s
4 threads, 800 connections
Req/s: 21,489
Avg Latency: 23.50ms
p99 Latency: 48.85ms
A 4× jump just from parallel request handling. Threads stop blocking on each other, CPU stops idling on slow clients.
But perf told the real story. Heavy time in:
82.45% entry_SYSCALL_64_after_hwframe
82.13% do_syscall_64
40.89% __tcp_push_pending_frames
40.85% tcp_write_xmit
38.94% write (libc)
34.79% tcp_sendmsg
27.37% ip_output
Almost all time is in syscalls and the TCP stack — not in application code. That's actually a good sign. It means the bottleneck has moved out of user space and into the kernel's networking layer.
Stage 3 — Epoll Only: ~43,000 req/s
4 threads, 800 connections
Req/s: 43,308
Avg Latency: 18.40ms
p99 Latency: 21.92ms
Another 2× jump — and notice how tight the latency distribution got. The old model scanned all connections to find active ones — O(N) work even for idle sockets.
Epoll flips this. Instead of polling, you register sockets with the kernel and it tells you which ones are ready. Here's what that looks like:
#include <sys/epoll.h>    // epoll_create1, epoll_ctl, epoll_wait
#include <sys/socket.h>   // accept

int epfd = epoll_create1(0);

// Register the listening socket
epoll_event ev{};
ev.events = EPOLLIN;
ev.data.fd = server_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);

// Event loop
epoll_event events[64];
while (true) {
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        if (fd == server_fd) {
            // New connection — accept and register it
            int client_fd = accept(server_fd, nullptr, nullptr);
            ev.events = EPOLLIN;
            ev.data.fd = client_fd;
            epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
        } else {
            // Existing connection is ready to read
            handle_client(fd);
        }
    }
}
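One caveat about the sketch above: it leaves sockets in blocking mode, which level-triggered epoll tolerates but which lets one stalled client hold up the whole loop. Real event loops mark every fd non-blocking when registering it, roughly like this:

#include <fcntl.h>

// Make reads and writes return immediately instead of stalling the loop.
void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}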
perf confirmed the shift — time is now dominated by writev and the TCP send path, not connection scanning:
57.10% entry_SYSCALL_64_after_hwframe
47.79% writev
40.56% tcp_sendmsg
35.46% tcp_write_xmit
23.21% ip_output
11.50% HTTP::Router::dispatch ← actual app logic, barely visible
Epoll isn't an optimization. It's a different cost model entirely. Without it, high connection counts just waste CPU on sockets that aren't doing anything.
Stage 4 — Epoll + Threads: ~57,000 req/s
4 threads, 1200 connections
Req/s: 57,650
Avg Latency: 20.50ms
p99 Latency: 30.39ms
Combining event-driven I/O with parallel execution got very close to NGINX territory. Workers stayed fully utilized, latency held steady even at 1200 connections, and perf showed the bottleneck firmly in the kernel:
52.96% entry_SYSCALL_64_after_hwframe
39.03% writev
32.86% tcp_sendmsg
28.10% tcp_write_xmit
18.55% ip_output
At this point, there's almost nothing left to optimize in user space.
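The post doesn't spell out the exact wiring for this stage, but one common shape, and a plausible reading, is one epoll loop per worker thread, each with its own listening socket bound via SO_REUSEPORT so the kernel spreads incoming connections across the workers:

#include <thread>
#include <vector>

int  make_listener(int port);     // the setup code from earlier, plus
                                  // setsockopt(..., SO_REUSEPORT, ...)
void event_loop(int server_fd);   // the epoll loop from earlier

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; i++)   // four workers, matching the benchmark
        workers.emplace_back([] { event_loop(make_listener(8080)); });
    for (auto& t : workers) t.join();
}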
How Does This Compare to NGINX?
4 threads, 800 connections
Req/s: 60,438
Avg Latency: 13.09ms
p99 Latency: 32.54ms
NGINX edges ahead — but the perf breakdown explains exactly why. Its top kernel functions look almost identical to ours:
tcp_sendmsg_locked 0.77%
__tcp_transmit_skb 0.76%
tcp_write_xmit 0.63%
And its user space is razor thin:
ngx_vslprintf 0.92%
ngx_http_write_filter 0.87%
ngx_http_parse_header_line 0.76%
NGINX spends almost no time in its own code. Everything is kernel work. The gap between our server and NGINX isn't a conceptual one — it's maturity. The architecture is the same. NGINX just has years of micro-optimizations, tighter buffering, and fewer syscalls per request layered on top.
The Pattern That Surprised Me
Looking back at all four stages, the same thing kept happening: as throughput improved, the bottleneck moved downward.
- ~5K req/s — architectural ceiling (queueing saturation)
- ~21K req/s — concurrency fixed, hit kernel I/O costs
- ~43K req/s — epoll eliminated idle scanning, TCP stack dominates
- ~57K req/s — parallel epoll, kernel networking is the last frontier
That progression — bottlenecks migrating from your code toward the kernel — is exactly what you want to see. It means you've eliminated most of what's in your control.
What I'd Do Next
If I continued this project:
- Fully event-driven model (no blocking anywhere)
- Better HTTP compliance (chunked encoding, more header handling)
- Keep-alive connection tuning
- Response and file caching
- Built-in metrics and tracing
The Real Takeaway
This started as "build a web server."
It ended as: learn to read where performance goes by watching it move.
Frameworks are great. But rebuilding the abstractions they hide is one of the best ways to understand what they're actually doing — and what it costs.