What I Learned Building an HTTP Server From Scratch
Most of us never think about what happens when a web server actually receives a request. Frameworks handle it. Infrastructure hides it. And that's fine — until you want to really understand what's going on underneath.
So I built one myself. An HTTP server in C++, starting from raw POSIX sockets. No frameworks, no libraries for the hard parts. Just system calls, byte buffers, and a lot of edge cases.
What started as a learning exercise turned into something more specific: watching performance bottlenecks shift layers as the architecture improved. That turned out to be the most interesting part.
Why Build This at All?
A few questions kept nagging me that I couldn't answer confidently:
- How does a server know when a full HTTP request has arrived?
- What actually happens when headers come in as fragments?
- Why does a server handle 10 users fine, then struggle at 500?
- Where do production servers like NGINX actually spend their time?
The only way to stop guessing was to build it and find out.
The goal wasn't to beat NGINX. It was to make the costs visible.
The Architecture (Deliberately Simple)
I kept the design modular so failures were easy to trace:
Client → Accept → HTTP Read/Parse → Route → Response → Write
Each layer had one job:
- Socket layer — socket, bind, listen, accept
- HTTP I/O — buffered reads, parsing, response writing
- Router — static and dynamic path matching
- Runtime — thread pool and/or epoll-based execution
That separation paid off. When something broke, it was obvious where to look.
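To make that concrete, here's a minimal sketch of the routing boundary (the Request/Response types and method names are illustrative; only Router::dispatch is something the perf output later actually shows):

#include <functional>
#include <string>
#include <unordered_map>

// Illustrative layer boundary: the runtime hands a parsed Request to
// the Router, which returns a Response for the I/O layer to write.
struct Request  { std::string method, path, body; };
struct Response { int status; std::string body; };

using Handler = std::function<Response(const Request&)>;

class Router {
public:
    void add(const std::string& path, Handler h) { routes_[path] = std::move(h); }
    Response dispatch(const Request& req) const {
        auto it = routes_.find(req.path);
        return it != routes_.end() ? it->second(req) : Response{404, "Not Found"};
    }
private:
    std::unordered_map<std::string, Handler> routes_;
};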
Networking Is Messier Than the Textbook
The textbook version of a server looks clean:
- Accept connection
- Read request
- Process it
- Write response
- Close
The real version? Reads are partial. Clients disconnect mid-write. Malformed requests arrive constantly. Keep-alive connections blur the line between "done" and "waiting."
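A taste of that messiness: write() can accept fewer bytes than you asked it to send, and a peer that vanished mid-response surfaces as an error on the next call. A minimal send-all loop, roughly what any real server ends up with (the helper name is mine, not the project's):

#include <cerrno>
#include <sys/socket.h>

// Keep writing until everything is sent or the client is gone.
bool send_all(int fd, const char* data, size_t len) {
    while (len > 0) {
        // MSG_NOSIGNAL: report a dead peer via EPIPE instead of SIGPIPE
        ssize_t n = send(fd, data, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            return false;                  // client disconnected mid-write
        }
        data += n;  // partial write: advance past what was sent
        len -= n;
    }
    return true;
}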
Even tiny decisions matter. Here's what the actual socket setup looks like:
#include <netinet/in.h>   // sockaddr_in, htons, INADDR_ANY
#include <sys/socket.h>   // socket, setsockopt, bind, listen, accept

int fd = socket(AF_INET, SOCK_STREAM, 0);
// Avoid "address already in use" on restart
int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
sockaddr_in addr{};
addr.sin_family = AF_INET;
addr.sin_port = htons(8080);
addr.sin_addr.s_addr = INADDR_ANY;
bind(fd, (sockaddr*)&addr, sizeof(addr));
listen(fd, SOMAXCONN);
int client_fd = accept(fd, nullptr, nullptr);
That SO_REUSEADDR line — just one option — prevents restart failures caused by sockets stuck in TIME_WAIT. The details add up fast.
HTTP Parsing: The First Humbling Moment
My first assumption was wrong immediately:
One read() call = one complete HTTP request.
Almost never true.
What actually works:
- Accumulate incoming bytes in a buffer
- Scan for \r\n\r\n (the end of headers)
- Only then parse the headers
- Use Content-Length to know how much body to expect
And you need guardrails:
- Cap header size (16KB is common)
- Cap body size
- Reject malformed requests early
These defensive checks improved stability more than any performance optimization I made. Correctness has to come before speed.
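Put together, the read loop looks something like this (a minimal sketch; the 16KB cap matches the guideline above, everything else is illustrative):

#include <string>
#include <unistd.h>

constexpr size_t kMaxHeaderBytes = 16 * 1024;  // cap header size

// Accumulate bytes until the header terminator shows up.
// Returns false on disconnect, error, or oversized headers.
bool read_headers(int fd, std::string& buf, size_t& header_end) {
    char chunk[4096];
    while (true) {
        size_t pos = buf.find("\r\n\r\n");
        if (pos != std::string::npos) {
            header_end = pos + 4;
            return true;  // headers complete; parse them, then use
                          // Content-Length to read the body
        }
        if (buf.size() > kMaxHeaderBytes) return false;  // reject early
        ssize_t n = read(fd, chunk, sizeof(chunk));
        if (n <= 0) return false;  // client gone or error
        buf.append(chunk, n);      // one read() rarely equals one request
    }
}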
The Concurrency Problem (Where It Gets Interesting)
My first concurrency model was simple: a thread pool with blocking I/O. Each thread picks up a connection and handles it start to finish.
This works great — until it doesn't.
The breaking point: threads block while waiting for slow or idle clients. With enough connections, every thread is just waiting. New requests queue up. Latency climbs. Throughput flatlines.
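For concreteness, here's roughly what that first model looked like (a sketch; the queueing details and names are illustrative, not the project's exact code):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <sys/socket.h>
#include <thread>

void handle_connection(int fd);  // app logic: read, parse, respond, close

// Shared queue of accepted sockets.
std::queue<int> pending;
std::mutex mu;
std::condition_variable cv;

// Each worker blocks on one client at a time, start to finish.
void worker() {
    while (true) {
        int fd;
        {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, [] { return !pending.empty(); });
            fd = pending.front();
            pending.pop();
        }
        handle_connection(fd);  // a slow client stalls this whole thread
    }
}

// Acceptor thread: hand each new connection to the pool.
void accept_loop(int server_fd) {
    while (true) {
        int client_fd = accept(server_fd, nullptr, nullptr);
        {
            std::lock_guard<std::mutex> lock(mu);
            pending.push(client_fd);
        }
        cv.notify_one();
    }
}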
That's when I started benchmarking seriously.
Benchmarking: Watching Bottlenecks Move
I used wrk (4 threads, 20s test runs) to measure throughput and latency across four configurations. The question I asked at every stage:
What's the bottleneck now, and why?
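For reference, each run had roughly this shape (the port and path are my assumptions; the post only specifies the thread count and duration, with the connection count varied per test):

wrk -t4 -c800 -d20s http://localhost:8080/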
Stage 1 — Baseline (Backlog Maxed): ~5,000 req/s
Connections    Req/s    Avg Latency    p99 Latency
──────────────────────────────────────────────────
         50    5,009        9.87ms           33ms
        100    5,087       19.68ms           49ms
        200    4,973       40.40ms           97ms
        400    5,137       77.40ms          158ms
        800    5,135      154.08ms          283ms
The numbers tell a clear story: throughput is pinned at ~5K req/s no matter how many connections are thrown at it, while latency doubles right along with the connection count.
This is textbook queueing saturation — like a single checkout lane with a growing line. The system is fully occupied. More load just means more waiting, not more work done.
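Little's law makes this quantitative: average latency ≈ open connections ÷ throughput. At 800 connections, 800 / 5,135 req/s ≈ 156 ms, almost exactly the 154 ms measured above. The queue, not the work, sets the latency.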
The lesson: the architecture itself was the ceiling, not the code.
Stage 2 — Thread Pool Only: ~21,000 req/s
4 threads, 800 connections
Req/s: 21,489
Avg Latency: 23.50ms
p99 Latency: 48.85ms
A 4× jump just from parallel request handling. Threads stop blocking on each other, CPU stops idling on slow clients.
But perf told the real story. Heavy time in:
82.45% entry_SYSCALL_64_after_hwframe
82.13% do_syscall_64
40.89% __tcp_push_pending_frames
40.85% tcp_write_xmit
38.94% write (libc)
34.79% tcp_sendmsg
27.37% ip_output
Almost all time is in syscalls and the TCP stack — not in application code. That's actually a good sign. It means the bottleneck has moved out of user space and into the kernel's networking layer.
Stage 3 — Epoll Only: ~43,000 req/s
4 threads, 800 connections
Req/s: 43,308
Avg Latency: 18.40ms
p99 Latency: 21.92ms
Another 2× jump — and notice how tight the latency distribution got. The old model scanned all connections to find active ones — O(N) work even for idle sockets.
Epoll flips this. Instead of polling, you register sockets with the kernel and it tells you which ones are ready. Here's what that looks like:
#include <sys/epoll.h>    // epoll_create1, epoll_ctl, epoll_wait
#include <sys/socket.h>   // accept

int epfd = epoll_create1(0);

// Register the listening socket
epoll_event ev{};
ev.events = EPOLLIN;
ev.data.fd = server_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);

// Event loop
epoll_event events[64];
while (true) {
    int n = epoll_wait(epfd, events, 64, -1);
    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        if (fd == server_fd) {
            // New connection — accept and register it
            int client_fd = accept(server_fd, nullptr, nullptr);
            ev.events = EPOLLIN;
            ev.data.fd = client_fd;
            epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
        } else {
            // Existing connection is ready to read
            handle_client(fd);
        }
    }
}
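One caveat about the sketch above: it leaves sockets in blocking mode, which level-triggered epoll tolerates but which lets one stalled client hold up the whole loop. Real event loops mark every fd non-blocking when registering it, roughly like this:

#include <fcntl.h>

// Make reads and writes return immediately instead of stalling the loop.
void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}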
perf confirmed the shift — time is now dominated by writev and the TCP send path, not connection scanning:
57.10% entry_SYSCALL_64_after_hwframe
47.79% writev
40.56% tcp_sendmsg
35.46% tcp_write_xmit
23.21% ip_output
11.50% HTTP::Router::dispatch ← actual app logic, barely visible
Epoll isn't an optimization. It's a different cost model entirely. Without it, high connection counts just waste CPU on sockets that aren't doing anything.
Stage 4 — Epoll + Threads: ~57,000 req/s
4 threads, 1200 connections
Req/s: 57,650
Avg Latency: 20.50ms
p99 Latency: 30.39ms
Combining event-driven I/O with parallel execution got very close to NGINX territory. Workers stayed fully utilized, latency held steady even at 1200 connections, and perf showed the bottleneck firmly in the kernel:
52.96% entry_SYSCALL_64_after_hwframe
39.03% writev
32.86% tcp_sendmsg
28.10% tcp_write_xmit
18.55% ip_output
At this point, there's almost nothing left to optimize in user space.
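The post doesn't spell out the exact wiring for this stage, but one common shape, and a plausible reading, is one epoll loop per worker thread, each with its own listening socket bound via SO_REUSEPORT so the kernel spreads incoming connections across the workers:

#include <thread>
#include <vector>

int  make_listener(int port);     // the setup code from earlier, plus
                                  // setsockopt(..., SO_REUSEPORT, ...)
void event_loop(int server_fd);   // the epoll loop from earlier

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; i++)   // four workers, matching the benchmark
        workers.emplace_back([] { event_loop(make_listener(8080)); });
    for (auto& t : workers) t.join();
}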
How Does This Compare to NGINX?
4 threads, 800 connections
Req/s: 60,438
Avg Latency: 13.09ms
p99 Latency: 32.54ms
NGINX edges ahead — but the perf breakdown explains exactly why. Its top kernel functions look almost identical to ours:
tcp_sendmsg_locked 0.77%
__tcp_transmit_skb 0.76%
tcp_write_xmit 0.63%
And its user space is razor thin:
ngx_vslprintf 0.92%
ngx_http_write_filter 0.87%
ngx_http_parse_header_line 0.76%
NGINX spends almost no time in its own code. Everything is kernel work. The gap between our server and NGINX isn't a conceptual one — it's maturity. The architecture is the same. NGINX just has years of micro-optimizations, tighter buffering, and fewer syscalls per request layered on top.
The Pattern That Surprised Me
Looking back at all four stages, the same thing kept happening: as throughput improved, the bottleneck moved downward.
- ~5K req/s — architectural ceiling (queueing saturation)
- ~21K req/s — concurrency fixed, hit kernel I/O costs
- ~43K req/s — epoll eliminated idle scanning, TCP stack dominates
- ~57K req/s — parallel epoll, kernel networking is the last frontier
That progression — bottlenecks migrating from your code toward the kernel — is exactly what you want to see. It means you've eliminated most of what's in your control.
What I'd Do Next
If I continued this project:
- Fully event-driven model (no blocking anywhere)
- Better HTTP compliance (chunked encoding, more header handling)
- Keep-alive connection tuning
- Response and file caching
- Built-in metrics and tracing
The Real Takeaway
This started as "build a web server."
It ended as: learn to read where performance goes by watching it move.
Frameworks are great. But rebuilding the abstractions they hide is one of the best ways to understand what they're actually doing — and what it costs.