---
title: "Scaling LLM Token Streaming to 10K SSE Clients"
published: true
description: "A practical walkthrough of scaling server-sent event streams for LLM token delivery: coroutine channels, backpressure, connection draining, and the memory math for 4GB containers."
tags: kotlin, architecture, cloud, api
canonical_url: https://blog.mvpfactory.co/scaling-llm-token-streaming-to-10k-sse-clients
---

## What We're Building

Let me show you the architecture that keeps 10,000 concurrent SSE connections alive while streaming LLM tokens — without melting your server. We'll walk through coroutine-per-connection fan-out, bounded channel buffers for backpressure, connection draining for zero-downtime deploys, and the per-connection memory math that determines your real ceiling on a 4GB container.

## Prerequisites

- Kotlin coroutines and `Channel` basics
- Familiarity with Server-Sent Events (SSE)
- A Ktor or Netty-based HTTP server
- Understanding of Kubernetes pod lifecycle (helpful, not required)

## Step 1: Understand the Problem

LLM APIs emit tokens every 20–80ms. When you proxy those tokens to thousands of users via SSE, every connection becomes a long-lived coroutine holding an open HTTP response. One slow client that can't consume fast enough bloats your buffers, and without backpressure, you're one GC pause away from an OOM kill.

The naive approach — unbounded lists, no draining strategy, fire-and-forget writes — collapses around 2,000 connections. Here is the minimal setup to get this working at scale.

## Step 2: Wire Up Bounded Channels for Fan-Out

The core pattern is a bounded `Channel<String>` per SSE connection, fed by a shared upstream coroutine consuming the LLM stream:

```kotlin
import kotlinx.coroutines.channels.Channel

val upstream = Channel<String>(capacity = 64) // shared LLM token source

fun fanOut(clients: List<Channel<String>>, token: String) {
    for (client in clients) {
        client.trySend(token).onFailure {
            // Client buffer full — apply backpressure policy
            client.close() // or drop oldest, depending on SLA
        }
    }
}
```


Each client gets its own bounded channel (I recommend 32–128 slots). When a slow client fills its buffer, `trySend` fails immediately. No blocking the upstream, no cascading stalls.

| Approach | Memory Under Load | Slow Client Impact | Failure Mode |
|---|---|---|---|
| Unbounded list per client | Grows without limit | Heap exhaustion | OOM kill, all clients die |
| Single shared channel | Bounded | Slowest client blocks all | Head-of-line blocking |
| Bounded channel per client | Predictable ceiling | Only that client affected | Graceful disconnect |
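
Here is a minimal sketch of the other half of that pattern, assuming the `fanOut` function above: one shared upstream reader plus per-connection registration. The `clients` registry and `registerClient` helper are illustrative names, not code lifted from a real project.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch
import java.util.concurrent.CopyOnWriteArrayList

// Illustrative registry of per-connection channels (assumed, not prescribed)
val clients = CopyOnWriteArrayList<Channel<String>>()

// One bounded channel per SSE connection, 32–128 slots as suggested above
fun registerClient(capacity: Int = 64): Channel<String> =
    Channel<String>(capacity).also { clients += it }

// Single upstream coroutine: read LLM tokens and fan them out without blocking
fun CoroutineScope.startFanOut(upstream: Channel<String>) = launch {
    for (token in upstream) {        // suspends until the next token arrives
        fanOut(clients, token)
    }
    clients.forEach { it.close() }   // upstream finished: complete every client stream
}
```

Because `trySend` never suspends, the upstream loop runs at the LLM's pace regardless of how slow any individual client is.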

## Step 3: Run the Memory Math

Here is the arithmetic that will save you hours, because it determines your actual concurrency ceiling:

| Component | Per-Connection Cost | At 10K Connections |
|---|---|---|
| Coroutine stack | ~1–2 KB | 10–20 MB |
| Bounded channel (64 slots × 40B) | ~2.5 KB | 25 MB |
| Ktor/Netty response buffer | ~8 KB | 80 MB |
| Connection metadata + headers | ~1 KB | 10 MB |
| **Total per connection** | **~13 KB** | **~130 MB** |

On a 4GB container with ~2.5GB available heap (after JVM overhead, metaspace, GC headroom), you land at roughly 12,000 connections before pressure mounts. Note that the ~13 KB figure covers only application-level buffers: kernel socket buffers, Netty's pooled direct memory, and the garbage churned out by per-token string handling all sit on top of it, so the real ceiling is far lower than a naive heap-divided-by-13 KB calculation. In practice, target 8,000–10,000 to leave room for burst traffic and GC breathing room. If you need more, scale horizontally. Don't increase buffer sizes.
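
If you want to sanity-check that budget in code, here is a back-of-envelope sketch using only the approximate figures from the table and the ~2.5GB usable-heap assumption; the constants are those estimates, not measured values:

```kotlin
// Approximate per-connection costs from the table above, in bytes
const val COROUTINE_STACK = 2 * 1024     // ~2 KB
const val CHANNEL_BUFFER = 64 * 40       // 64 slots x ~40 B per token ≈ 2.5 KB
const val RESPONSE_BUFFER = 8 * 1024     // ~8 KB Ktor/Netty response buffer
const val METADATA = 1 * 1024            // ~1 KB connection metadata + headers
const val PER_CONNECTION = COROUTINE_STACK + CHANNEL_BUFFER + RESPONSE_BUFFER + METADATA

fun main() {
    val usableHeap = 2_500L * 1024 * 1024          // ~2.5 GB after JVM overhead
    val target = 10_000
    val appLevelTotal = PER_CONNECTION.toLong() * target
    println("App-level buffers at $target connections: ${appLevelTotal / (1024 * 1024)} MB")
    println("Heap left for GC, Netty pools, and bursts: ${(usableHeap - appLevelTotal) / (1024 * 1024)} MB")
}
```

The point of the exercise is that the application-level buffers are the smallest part of the budget; the headroom is what absorbs everything the table can't itemize.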

## Step 4: Implement Connection Draining

During rolling deployments, you can't just kill 10,000 open SSE connections. Let me show you a pattern I use in every project:

1. Stop accepting new connections. Remove the pod from the load balancer.
2. Send a custom SSE event (`event: reconnect`) telling clients to reconnect to a healthy pod.
3. Set a drain deadline (30 seconds) and forcibly close remaining connections after it expires.
4. Use structured concurrency so `coroutineScope` ensures all child coroutines complete or cancel cleanly.

```kotlin
import kotlinx.coroutines.withTimeoutOrNull
import kotlin.time.Duration

// SseClient is the app's per-connection handle (sendEvent / awaitDisconnect / close)
suspend fun drainConnections(clients: List<SseClient>, deadline: Duration) {
    withTimeoutOrNull(deadline) {
        clients.forEach { it.sendEvent("reconnect", """{"reason":"deploy"}""") }
        clients.forEach { it.awaitDisconnect() }
    }
    // Force-close stragglers after the deadline
    clients.forEach { it.close() }
}
```


Without this, Kubernetes will SIGTERM your pod, TCP connections reset, and users see a broken stream with no retry hint.
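
One minimal way to wire the drain into the pod lifecycle, assuming the `drainConnections` helper above: a JVM shutdown hook fires when Kubernetes sends SIGTERM, and a readiness flag takes the pod out of rotation. The `ready` flag and `activeClients` supplier are illustrative names for whatever your health endpoint and connection registry actually expose.

```kotlin
import kotlinx.coroutines.runBlocking
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.time.Duration.Companion.seconds

// Illustrative: the readiness probe should report this flag
val ready = AtomicBoolean(true)

fun installDrainHook(activeClients: () -> List<SseClient>) {
    Runtime.getRuntime().addShutdownHook(Thread {
        ready.set(false)                      // readiness fails, LB stops routing new clients
        runBlocking {
            drainConnections(activeClients(), deadline = 30.seconds)
        }
    })
}
```

Whatever mechanism you use, keep the pod's `terminationGracePeriodSeconds` above the drain deadline, or Kubernetes will SIGKILL the process before draining completes.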

## Gotchas

- **Unbounded queues are silent killers.** A single stalled client accumulating 50,000 tokens at ~40 bytes each eats 2MB. Multiply by a few hundred slow mobile clients and you've consumed your entire heap.
- **Disconnecting slow clients feels aggressive** — but the alternative is an OOM that disconnects *everyone*. Drop one to save thousands.
- **Structured concurrency is non-negotiable.** Every SSE connection must run inside a `coroutineScope` tied to the request lifecycle. When a client disconnects, the coroutine cancels. When the server drains, all children cancel cooperatively. No leaked coroutines, no zombie connections. A minimal handler sketch follows this list.
- **Retrofit draining after an incident is miserable.** Implement it from day one. You'll thank yourself the first time you push a hotfix under load.
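
To make the structured-concurrency point concrete, here is a minimal request-scoped handler sketch, assuming Ktor 2.x and the illustrative `registerClient`/`clients` registry from the fan-out sketch above; it is not the post's production code.

```kotlin
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.response.*
import io.ktor.server.routing.*

fun Route.tokenStream() {
    get("/stream") {
        val channel = registerClient()          // bounded per-connection channel
        try {
            call.respondTextWriter(contentType = ContentType.Text.EventStream) {
                for (token in channel) {        // handler coroutine lives only as long as the call
                    write("data: $token\n\n")
                    flush()
                }
            }
        } finally {
            clients -= channel                  // disconnect or drain: always unregister
            channel.close()
        }
    }
}
```

When the client goes away, the write fails and the handler unwinds through the `finally` block, so nothing leaks even if the drain never reaches that connection.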

## Wrapping Up

Budget ~13–15 KB per SSE connection. Use bounded channels (32–128 slots) per client with `trySend` for non-blocking fan-out. Implement connection draining from day one with a reconnect event and a hard deadline. On 4GB, plan for 8K–10K connections max, then scale horizontally.

The architecture isn't complex — it's disciplined. Bounded buffers, predictable memory, cooperative cancellation. That's what keeps your server running at 10K concurrent streams.
