Peter Mbanugo

Posted on • Originally published at pmbanugo.me

The Tokio/Rayon Trap and Why Async/Await Fails Concurrency

Over the last decade, async/await won the concurrency wars because it is exceptionally easy. It allows developers to write asynchronous code that looks virtually identical to synchronous code.

But beneath that familiar syntax lies massive structural complexity. It hides control flow, obscures hardware realities, and ultimately pushes the burden of scheduling back onto the developer.

Rich Hickey articulated this perfectly in his talk Simple Made Easy: "Easy" is what is familiar and close at hand, while "Simple" is what is structurally untangled [1]. async/await is easy to write, but it is fiercely complex to operate.

Rob Pike talked about this architectural shift during his 2023 GopherConAU address:

Compared to goroutines, channels and select, async/await is easier and smaller for language implementers to build... But it pushes some of the complexity back on the programmer, often resulting in what Bob Nystrom has called 'colored functions'. [...] It’s important, though, whatever concurrency model you do provide, you do it exactly once, because an environment providing multiple concurrency implementations can be problematic. [2]

Pike's remark about "multiple concurrency implementations" and async/await is exactly what is failing in production today.

The Production Trap: Confusing Asynchrony with Concurrency

The fundamental trap of async/await is that it conflates asynchrony (yielding while waiting for I/O) with concurrency (doing multiple things at once).

The syntax is a trap because it disguises interleaved state machines as isolated, sequential threads. Lulled by this illusion, a developer writes an async function exactly as they would blocking code — fetching a database record over the network, then immediately crunching the data. But what happens when that data crunching involves parsing a 10MB JSON payload, traversing a massive collection, or executing a compute-heavy cryptographic proof?

The cooperative executor halts.

In a cooperative runtime like Rust's Tokio or Node.js, the thread does not yield until it hits an await point. A 50-millisecond CPU-bound task in a function stalls the entire execution thread. Suddenly, thousands of unrelated network requests spike in latency and the system becomes unresponsive. Meanwhile the hardware is barely utilised.
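The stall is easy to reproduce with nothing but the standard library. The sketch below is a deliberately simplified stand-in for a cooperative executor — a FIFO queue of run-to-completion tasks, not Tokio's actual scheduler — but it shows the mechanism: a 50ms busy-loop queued ahead of a cheap task delays that task by the full 50ms, because nothing can preempt it.

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::rc::Rc;
use std::time::{Duration, Instant};

// A toy single-threaded "executor": a FIFO queue of tasks, each run to
// completion, like a cooperative runtime between await points.
// Returns how long the cheap task waited behind the CPU-bound one.
fn run_toy_executor() -> Duration {
    let start = Instant::now();
    let waited = Rc::new(Cell::new(Duration::ZERO));

    let mut ready: VecDeque<Box<dyn FnOnce()>> = VecDeque::new();

    // A "CPU-bound" task: busy-works for ~50ms without ever yielding.
    ready.push_back(Box::new(|| {
        let t = Instant::now();
        while t.elapsed() < Duration::from_millis(50) {
            std::hint::spin_loop();
        }
    }));

    // A cheap "I/O completion" queued behind it; it needs microseconds of
    // CPU but cannot start until the executor thread is free again.
    let w = Rc::clone(&waited);
    ready.push_back(Box::new(move || w.set(start.elapsed())));

    while let Some(task) = ready.pop_front() {
        task();
    }
    waited.get()
}

fn main() {
    println!("cheap task waited {:?}", run_toy_executor());
}
```

A real runtime multiplexes thousands of such tasks onto a handful of threads, so one unyielding task penalises everything sharing that thread.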

The Broken Promise: The Human-in-the-Loop Scheduler

When these latency spikes occur, the answer is always the same: separate your runtimes. Use Tokio for I/O, and send CPU-bound work to a dedicated thread pool like Rayon.

Recent postmortems highlight the resulting disaster. Engineering teams at PostHog [3] and Meilisearch [4] have documented the painful reality of untangling these complexities in production. Developers must carefully analyse every function to decide if it belongs in the "I/O pool" or the "Compute pool," and then manually orchestrate the message-passing boundary between them.

If a developer must manually partition I/O and compute, strictly police the boundaries to prevent deadlocks, and ferry data between two different runtimes with two different mental models, the async abstraction has failed. The language feature promised to hide the complexity of concurrency. Instead, it turned the application developer into a human-in-the-loop scheduler.
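Here is a std-only sketch of that handoff, with a plain thread and channels standing in for a Rayon pool and tokio::sync::oneshot (the names `Job` and `count_xs_via_pool` are mine, not from any library). Note how much plumbing a single round-trip across the boundary requires:

```rust
use std::sync::mpsc;
use std::thread;

// A job crossing the runtime boundary: the payload plus a reply channel.
// With Tokio + Rayon this would be rayon::spawn plus a tokio oneshot;
// the manual ferrying pattern is the same.
struct Job {
    payload: Vec<u8>,
    reply: mpsc::Sender<usize>,
}

// The "I/O side" of the boundary: package the work, send it across,
// then wait on the answer. Every call site must get this handoff right.
fn count_xs_via_pool(payload: Vec<u8>) -> usize {
    let (compute_tx, compute_rx) = mpsc::channel::<Job>();

    // A dedicated compute thread, standing in for the Rayon pool.
    let pool = thread::spawn(move || {
        for job in compute_rx {
            // The CPU-heavy part (here just a byte count).
            let result = job.payload.iter().filter(|&&b| b == b'x').count();
            let _ = job.reply.send(result);
        }
    });

    let (reply_tx, reply_rx) = mpsc::channel();
    compute_tx.send(Job { payload, reply: reply_tx }).unwrap();
    let result = reply_rx.recv().unwrap();

    drop(compute_tx); // close the channel so the pool thread exits
    pool.join().unwrap();
    result
}

fn main() {
    println!("counted {}", count_xs_via_pool(b"xxoxx".to_vec()));
}
```

Multiply this by every function in the codebase that mixes I/O and compute, and add the failure modes (a dropped reply channel, a full queue, a deadlock between the two pools), and the scale of the orchestration burden becomes clear.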

Unbounded by Default is OOM by Default

The second failure mode of async/await runtimes is that they make unbounded capacity frictionless.

Calling tokio::spawn(...) is cheap. When a downstream database slows down during a traffic spike, the ingress network loop happily continues accepting connections and spawning tasks. Because async tasks and memory allocations are typically unbounded by default in these ecosystems, the system does not push back.

In-flight tasks queue indefinitely. The application consumes RAM until the OS out-of-memory (OOM) killer violently terminates the process. Postmortems from major platforms consistently reveal the same root cause: queues do not fix overload, they simply delay the crash while making it catastrophic. Infinite capacity is a lie, and defaults that pretend otherwise are dangerous.
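The opposite default — a bounded mailbox that rejects work immediately — is available in Rust's standard library via `sync_channel`. The helper below (`try_enqueue` is an illustrative name of mine, not a real API) shows the backpressure signal the caller gets the moment capacity is exhausted:

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// A bounded "mailbox": capacity is fixed, and a full mailbox rejects the
// send immediately instead of queueing until the OOM killer intervenes.
// Returns (accepted, shed) messages.
fn try_enqueue(capacity: usize, messages: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = Vec::new();
    let mut shed = Vec::new();
    for &m in messages {
        match tx.try_send(m) {
            Ok(()) => accepted.push(m),
            // Backpressure: the caller learns *now* that the system is full.
            Err(TrySendError::Full(m)) => shed.push(m),
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    (accepted, shed)
}

fn main() {
    let (accepted, shed) = try_enqueue(2, &[1, 2, 3, 4]);
    println!("accepted {accepted:?}, shed {shed:?}");
}
```

With a capacity of 2, messages 3 and 4 are rejected on the spot — the ingress loop can return an error or retry hint to the client instead of silently accumulating RAM.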

The Work-Stealing Myth

When systems hit these bottlenecks, developers often demand smarter, preemptive, work-stealing schedulers to distribute the load. The assumption is that if a core is idle, it should steal tasks from a busy core to guarantee fairness. But at massive scale, fairness is the enemy of throughput. Work-stealing destroys CPU cache locality.

When WhatsApp pushed the Erlang BEAM virtual machine to its limits on 100+ core machines, the system choked. As detailed by Robin Morisset, idle threads trying to steal work spent all their CPU cycles fighting over the global runq_lock [5] — a lock used to synchronise access to a scheduler's run queue.

Even with optimised locks, moving a state machine to a different CPU core means abandoning the L1 and L2 cache. Fairness does not matter if every stolen task incurs a 100+ nanosecond main memory fetch penalty. If you are already forced to manually partition threads for I/O versus CPU tasks to survive production, the generic work-stealing algorithm has already failed you. You understand your workload's topology better than the runtime does.

The Alternative

I got tired of the complexities and traps of async/await. I wanted the rock-solid fault-tolerance of the BEAM, but without the opaque manoeuvres of garbage collection and global work-stealing. As Leslie Lamport has long argued, state machines are the mathematically sound foundation of concurrent programming. async/await is merely compiler magic that tries to hide the state machine from you, poorly.

Instead of hiding the state machine, why not expose it and give the user better control primitives? The result is Project Tina: an opinionated, shared-nothing, thread-per-core concurrency framework.

Tina embraces strict constraints to guarantee massive throughput and reliability:

  1. One Primitive. One Mental Model. There is no async or await, no Promises, and no Futures. You write an Isolate — a unit of concurrent work. The handler is a standard, synchronous function that reacts to a message and returns an Effect.
  2. Thread-Per-Core (Shared Nothing). Tina shards the workload across OS threads. There is no work stealing. Isolates never migrate. All cross-core communication occurs via the messaging subsystem.
  3. Strictly Bounded. Memory is pre-allocated at process boot. Mailboxes are strictly bounded. If a traffic spike hits and a mailbox is full, the caller is notified immediately. The system sheds load predictably rather than OOM-crashing the process.
  4. Architectural Determinism. In modern async runtimes, task polling order and thread-pool scheduling are opaque, non-deterministic sources of chaos. You rarely know exactly when or where your task will wake up. Tina strips this away. The scheduler is a strict, visible, single-threaded loop per core. Because the framework explicitly controls execution order, I/O, and the clock, the system’s behaviour is radically predictable. This unlocks Tina's ultimate superpower: Deterministic Simulation Testing (DST). You can simulate network partitions or dropped messages on a single thread, and the same seed will yield the exact same execution order, every single time.
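As a rough illustration — this is my own hypothetical sketch, not Tina's actual API — an Isolate with a synchronous handler and a visible per-core scheduler loop might look like this:

```rust
// Hypothetical sketch of the Isolate model described above; Tina's real
// API will differ. The handler is a plain synchronous function: message
// in, effect out. No async, no await, no Futures.
#[allow(dead_code)]
enum Effect {
    Reply(u64),
    Send { to: usize, msg: u64 },
    None,
}

trait Isolate {
    fn handle(&mut self, msg: u64) -> Effect;
}

// An isolate owns its state exclusively; nothing is shared across cores.
struct Counter {
    total: u64,
}

impl Isolate for Counter {
    fn handle(&mut self, msg: u64) -> Effect {
        self.total += msg;
        Effect::Reply(self.total)
    }
}

// The per-core scheduler is a visible, single-threaded loop: pop a
// message, call the handler, interpret the effect. Feed it the same
// inbox (same seed) and it produces the same replies every time, which
// is what makes deterministic simulation testing possible.
fn run(isolate: &mut dyn Isolate, inbox: &[u64]) -> Vec<u64> {
    let mut replies = Vec::new();
    for &msg in inbox {
        if let Effect::Reply(v) = isolate.handle(msg) {
            replies.push(v);
        }
    }
    replies
}

fn main() {
    let mut c = Counter { total: 0 };
    println!("{:?}", run(&mut c, &[3, 4, 5]));
}
```

Because the handler is a pure state transition and the loop controls ordering, replaying an inbox reproduces the exact same sequence of states — the property a work-stealing, multi-threaded runtime cannot offer.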

Wrap Up

async/await makes concurrency easy to write, but it makes systems complex to operate. By requiring developers to manage state transitions, strict memory bounds, and deliberate architectural topologies explicitly and upfront, Tina replaces runtime magic with structural guarantees. Because:

Predictability beats brevity.

Tina is open source. You can view the architecture, read the design documents, and critique the code on GitHub.


  1. Rich Hickey, "Simple Made Easy", Strange Loop 2011

  2. Rob Pike, "What We Got Right, What We Got Wrong", GopherConAU 2023

  3. PostHog Engineering, "Untangling Rayon and Tokio", posthog.com/blog

  4. Louis Dureuill, "Don't mix Rayon and Tokio", blog.dureuill.net

  5. Robin Morisset, "Optimizing the BEAM's Scheduler for Many-Core Machines", Code BEAM Europe 

Top comments (3)

mote

The "human in the loop scheduler" line hit hard. I run a dual Tokio+Rayon setup in an embedded storage engine (moteDB) and the boundary policing is exactly as painful as you describe.

One thing I don't see discussed enough: on ARM Cortex-A devices, the cost of getting the partition wrong is worse than on x86. A 30ms blocking parse on a Pi 4 doesn't just stall one Tokio thread — the kernel scheduler migrates tasks across the big.LITTLE cluster and you lose cache warmth on both the I/O and compute sides. I've measured 2-3x latency spikes on the I/O path from this alone.

Your Tina approach (strict thread-per-core, no work stealing) is basically what we converged on for the embedded case too. Pre-allocated mailboxes, bounded channels, no surprises. The DST angle is interesting — does Tina handle simulated network partitions across isolates, or just message ordering determinism?


PEACEBINFLOW

The observation that "queues do not fix overload, they simply delay the crash while making it catastrophic" is the kind of thing that sounds obvious once stated but is ignored by almost every default configuration in modern async runtimes. Unbounded task spawning is the path of least resistance. It works fine until a downstream service slows by 200ms, and suddenly the in-flight task count doubles, then quadruples, then the OOM killer wakes up. The system didn't fail because of a bug. It failed because the defaults assumed infinite memory.

What I find myself thinking about is how the "human in the loop scheduler" problem maps onto the broader trajectory of abstractions in our field. We keep building layers that promise to hide complexity, and those layers work beautifully until the workload doesn't fit the assumptions baked into the abstraction. Then the developer has to understand both the abstraction and what it was hiding—often under time pressure, during an incident. The Tokio/Rayon split is a perfect example. The async syntax hides the state machine. Then you hit a CPU-bound function. Now you need to understand the state machine, the work-stealing scheduler, the thread pool topology, and the message-passing boundary between two runtimes. The abstraction didn't remove complexity. It deferred it to the worst possible moment.

The bounded-by-default design in Tina—pre-allocated memory, strict mailbox limits, immediate backpressure on the caller—feels like a return to the engineering philosophy that dominated before cloud scaling made us lazy about resources. If you can't accept more work, you say so immediately. The caller gets a clear signal instead of a hung request. That's how TCP works. That's how Erlang's process model works at the messaging layer. It's not a new idea. It just got lost somewhere in the async/await era, replaced by the assumption that memory is infinite and queues should be too. The deterministic simulation testing capability is the part that's genuinely novel as a framework primitive—being able to replay a failure with the exact same execution order is something most production systems can only dream of. How far along is the DST support in practice—is it something you can use to reproduce a production failure today, or is it still primarily a development-time tool?