When we started scaling our real-time Speech-to-Speech (S2S) translation service beyond a single instance, we hit a wall that HTTP services don't have to worry about: WebSockets are stateful, and load balancers aren't.
Adding a second server didn't double our capacity. It broke everything. Here's why, and how we fixed it.
## The Problem: Connections Dropping on Scale-Out
Our single-instance FastAPI service handled WebSocket connections beautifully. The moment we put a load balancer in front of two instances, clients started experiencing random disconnections, lost audio mid-stream, and sessions that just... vanished.
The errors on the client side were cryptic. On the server side, we'd see a connection established on instance A, and then suddenly a message arriving at instance B — which had no idea who this client was.
## What Wasn't the Cause
- Network issues: TCP connections were healthy. The problem reproduced even on localhost with two instances.
- Timeouts: Our WebSocket keepalive pings were working. Connections weren't idle-dropping.
- Application bugs: The single-instance version was solid. Same code, different behavior at scale.
## Root Cause: The Stateful Nature of WebSockets
HTTP is stateless by design. A request comes in, a response goes out, done. The load balancer can send request #1 to server A and request #2 to server B — they're independent.
WebSockets are the opposite. Once the handshake completes, a persistent, bidirectional connection is established between that specific client and that specific server instance. All subsequent messages travel over that same TCP connection.
The problem breaks into two layers:
### 1. The Handshake Problem
The WebSocket protocol starts as an HTTP Upgrade request. Most load balancers handle this correctly, routing the initial handshake to one server. The connection is then established with that server.
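For context, the handshake is an ordinary HTTP request with `Upgrade` headers, which is why balancers can forward it like any other request. The key/accept pair below is the example from RFC 6455; the host and path are illustrative:

```http
GET /ws/session-abc123 HTTP/1.1
Host: speech.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

After the `101` response, the TCP connection stops being HTTP at all, so the balancer's routing decision is locked in for the life of the connection.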
### 2. The Reconnection Problem
This is where things silently break. When a client reconnects (after a brief network hiccup, a tab going to sleep, a mobile radio switching towers), the load balancer has no memory of which server handled the previous session. It routes the new connection to whichever server has the lowest load — which might not be the one holding the session state.
In our case, each server instance held in-memory state: the active ASR context, the translation buffer, the TTS queue, and the session audio history. When a client reconnected to the wrong instance, all of that was gone.
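A toy sketch (not the production service) makes the failure mode concrete: each instance keeps sessions in a process-local dict, so a reconnect that lands on the other instance starts from zero.

```python
class Instance:
    """Stand-in for one server process with in-memory session state."""

    def __init__(self, name: str):
        self.name = name
        self.sessions: dict[str, dict] = {}  # lives and dies with this process

    def connect(self, session_id: str) -> dict:
        # Resume the session if *this* instance has it; otherwise start fresh.
        return self.sessions.setdefault(session_id, {"segment_count": 0})


a, b = Instance("A"), Instance("B")

state = a.connect("sess-1")    # handshake routed to instance A
state["segment_count"] = 42    # audio streams, state accumulates on A only

resumed = b.connect("sess-1")    # reconnect routed to instance B
print(resumed["segment_count"])  # 0: instance B never saw this session
```

Swap the dicts for a shared store and the failure disappears, which is exactly where this post ends up.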
## The Solution: Three Approaches, One Right Answer for Us

### Approach 1: Sticky Sessions (IP Hash)
The simplest fix is telling the load balancer to always route the same client to the same server, based on their IP address.
In Nginx:
```nginx
upstream speech_service {
    ip_hash;  # Route same IP to same upstream
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
}

server {
    location /ws/ {
        proxy_pass http://speech_service;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 3600s;  # Don't timeout long-lived connections
    }
}
```
The catch: IP hashing works until it doesn't. Mobile clients switch IPs constantly. Corporate networks hide thousands of users behind one NAT IP, hammering a single server. And if one server goes down, all its sticky sessions are lost anyway.
It's a band-aid, not a solution. We used it to stop the bleeding while building something better.
### Approach 2: Cookie-Based Sticky Sessions
More reliable than IP hashing. The load balancer assigns a cookie on the first request, and subsequent requests carry that cookie, allowing the balancer to route them consistently.
HAProxy configuration:
```
backend speech_backend
    balance roundrobin
    cookie SERVERID insert indirect nocache
    server s1 10.0.0.1:8000 check cookie s1
    server s2 10.0.0.2:8000 check cookie s2
```
Better, but still fundamentally tied to a single server. Any server failure means session loss, and non-browser WebSocket clients don't always send cookies with the Upgrade request.
### Approach 3: Externalized Session State (What We Actually Did)
The real fix is to stop treating session state as something that lives on a server instance. We moved all session state into Redis, turning our servers into stateless workers.
```python
import json

import redis.asyncio as redis
from fastapi import WebSocket, WebSocketDisconnect


class SessionStore:
    def __init__(self, redis_url: str):
        self.redis = redis.from_url(redis_url)
        self.TTL = 3600  # 1 hour session expiry

    async def save_session(self, session_id: str, data: dict):
        await self.redis.setex(
            f"session:{session_id}",
            self.TTL,
            json.dumps(data),
        )

    async def get_session(self, session_id: str) -> dict | None:
        raw = await self.redis.get(f"session:{session_id}")
        return json.loads(raw) if raw else None

    async def delete_session(self, session_id: str):
        await self.redis.delete(f"session:{session_id}")
```
Each WebSocket handler now looks up and persists its state through Redis, not in-memory:
```python
store = SessionStore("redis://localhost:6379")  # shared by every instance


@app.websocket("/ws/{session_id}")
async def websocket_endpoint(websocket: WebSocket, session_id: str):
    await websocket.accept()

    # Restore session from Redis — works on ANY server instance
    session = await store.get_session(session_id) or {
        "language_pair": "en-es",
        "audio_buffer": [],
        "segment_count": 0,
    }

    try:
        while True:
            data = await websocket.receive_bytes()
            # Process audio into translated_audio...
            session["segment_count"] += 1

            # Persist updated state back to Redis
            await store.save_session(session_id, session)
            await websocket.send_bytes(translated_audio)
    except WebSocketDisconnect:
        # State preserved in Redis — client can reconnect to any instance
        await store.save_session(session_id, session)
```
With this in place, the load balancer can route reconnections anywhere. The new server instance picks up the session exactly where it left off.
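The client's only job is to present the same `session_id` on every connect. A minimal reconnect sketch, assuming the third-party `websockets` client library; the URL, frame size, and backoff schedule are illustrative, not from our production client:

```python
import asyncio
import uuid


def backoff(attempt: int, cap: float = 30.0) -> float:
    """Exponential backoff in seconds: 1, 2, 4, ... capped at `cap`."""
    return min(2.0 ** attempt, cap)


async def run_client(base_url: str = "ws://lb.example.com/ws"):
    # Deferred import so the sketch stays importable without the package.
    import websockets  # third-party library (an assumption, not from the post)

    # A session_id that is stable across reconnects is what lets ANY
    # instance resume the session from Redis.
    session_id = uuid.uuid4().hex
    attempt = 0
    while True:
        try:
            async with websockets.connect(f"{base_url}/{session_id}") as ws:
                attempt = 0  # healthy connection: reset the backoff
                await ws.send(b"\x00" * 320)  # one dummy audio frame
                translated = await ws.recv()  # translated audio comes back
        except (OSError, websockets.ConnectionClosed):
            attempt += 1
            await asyncio.sleep(backoff(attempt))
```

The backoff keeps a flapping mobile radio from hammering the balancer, and resetting it after a successful connect keeps recovery fast.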
## The Missing Piece: Redis Pub/Sub for Audio Streaming
Externalizing session metadata was straightforward. But our audio pipeline had a trickier problem: what if a client is actively streaming audio on instance A, and a second browser tab opens a connection that lands on instance B?
Both connections need to receive the same translated audio stream. We solved this with Redis Pub/Sub:
```python
async def stream_audio_to_client(
    websocket: WebSocket,
    session_id: str,
    redis_client,
):
    pubsub = redis_client.pubsub()
    await pubsub.subscribe(f"audio:{session_id}")
    async for message in pubsub.listen():
        if message["type"] == "message":
            audio_chunk = message["data"]
            await websocket.send_bytes(audio_chunk)


async def publish_translated_audio(
    session_id: str,
    audio_chunk: bytes,
    redis_client,
):
    # Any server instance can publish — all subscribers receive it
    await redis_client.publish(f"audio:{session_id}", audio_chunk)
```
Now any server instance that processes a translation chunk publishes to Redis, and all WebSocket connections subscribed to that session — regardless of which server they landed on — receive the audio.
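Wiring this into a handler means running the subscriber as a background task beside the receive loop. A sketch assuming the two functions above are in scope; `handle_session` and `audio_channel` are hypothetical names, and the translation pipeline is stubbed out as an echo:

```python
import asyncio


def audio_channel(session_id: str) -> str:
    """Single place that defines the Pub/Sub channel-naming scheme."""
    return f"audio:{session_id}"


async def handle_session(websocket, session_id: str, redis_client):
    # Fan translated audio out to this socket while also consuming uploads.
    sender = asyncio.create_task(
        stream_audio_to_client(websocket, session_id, redis_client)
    )
    try:
        while True:
            chunk = await websocket.receive_bytes()
            translated = chunk  # placeholder: echo until the real pipeline is wired in
            await publish_translated_audio(session_id, translated, redis_client)
    finally:
        sender.cancel()  # stop the fan-out task when this socket closes
```

Keeping the channel name in one helper matters more than it looks: a publisher and subscriber that disagree on the channel string fail silently, with no errors and no audio.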
## Results
| Scenario | Before | After |
|---|---|---|
| Server failover | Session lost, client must restart | Seamless reconnect, state preserved |
| Client reconnect on mobile | ~40% session loss rate | <1% (only sessions idle past the Redis TTL) |
| Horizontal scale-out | Broke sessions | Linear capacity increase |
| Multi-tab same session | Not supported | Works transparently |
## Key Takeaway
WebSocket load balancing is a distributed state problem, not a networking problem. Sticky sessions are a reasonable stopgap, but they trade one failure mode for another — they just hide the statefulness instead of solving it.
Once we treated our servers as stateless workers and Redis as the source of truth for session state, we stopped thinking about which server a client was connected to entirely. That mental shift — from "connection-bound state" to "session-bound state" — is what actually makes WebSocket services horizontally scalable.
If you're running a real-time service and haven't hit this yet: you will, the moment you add that second server.