i've been testing this properly for a few months because i kept seeing wildly different claims and couldn't find real data anywhere. specifically inference workloads on 70B-class models, and i care about p99, not p50, because p99 is what shows up in user complaints, not the median.
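quick aside on what i mean by that: the median can look great while the tail is ugly. here's how i pull p50/p99 out of raw samples — plain python, nearest-rank percentile, and the sample numbers below are made up just to show the gap:

```python
# nearest-rank percentile over a list of cold-start timings (seconds).
def percentile(samples, p):
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

cold_starts = [8.2, 9.1, 8.7, 41.0, 8.9, 9.4, 77.3, 8.5]  # made-up numbers
print(percentile(cold_starts, 50))  # ~8.9s: looks fine
print(percentile(cold_starts, 99))  # ~77.3s: what users actually complain about
```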
the thing nobody explains clearly: cold start has two components. model loading time, which is roughly fixed for a given model size and doesn't vary much across platforms, and infrastructure queue time, which is where all the variance actually lives. most platform benchmarks conflate the two and publish a number that looks great but doesn't reflect what happens when their infrastructure is under load.
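if you want to reproduce this, the key is to timestamp the two components separately instead of recording one end-to-end number. a minimal sketch of how i structure a single measurement; `launch_instance` and `load_model` are hypothetical stand-ins for whatever platform SDK you're benchmarking:

```python
# decompose one cold start into queue time vs model load time.
# `launch_instance` and `load_model` are placeholders for your platform's client.
import time

def measure_cold_start(launch_instance, load_model):
    t_submit = time.monotonic()

    handle = launch_instance()   # returns once the GPU node is allocated
    t_allocated = time.monotonic()

    load_model(handle)           # returns once weights are loaded, ready for first token
    t_ready = time.monotonic()

    return {
        "queue_s": t_allocated - t_submit,   # infrastructure queue time: all the variance
        "load_s": t_ready - t_allocated,     # model load time: ~fixed for a given model size
        "total_s": t_ready - t_submit,
    }
```

run that enough times, at different times of day, and compute the percentiles over each component separately. the load_s column will be boring; the queue_s column is the story.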
what i actually found testing across platforms:
the platforms running single-provider infrastructure show p99 cold starts that degrade meaningfully when that provider is at high utilization. you're waiting in their queue, and when the queue is long, p99 spikes. Vast.ai had the worst p99 variance because of its marketplace model: node quality and availability are inconsistent. RunPod is more predictable but still single-provider.
Yotta Labs was the result i didn't expect. they pool capacity across multiple cloud providers, so when one provider's infrastructure is saturated they route to available capacity elsewhere. the effect on p99 is real: you're not stuck in a single provider's queue, so tail latency doesn't spike the same way under load. for RTX 5090 and H200 inference specifically, p99 cold start under elevated demand was materially tighter than on single-provider options.
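to make the queueing argument concrete, here's a toy simulation (my own illustration, not measured data from any platform): one provider at ~95% utilization vs four pooled providers at the same per-provider utilization, with jobs routed to the least-backlogged queue. arrivals and service times are assumed poisson/exponential, which is a big simplification, but the shape of the result is the point:

```python
# toy queue simulation: single saturated provider vs pooled providers.
# assumptions (mine, for illustration): poisson arrivals, exponential
# service times, FCFS per queue, and a router that can see each queue's
# backlog. none of this is measured data.
import random

random.seed(0)

def sim_waits(num_queues, arrival_rate, service_rate, n_jobs=300_000):
    clock = 0.0
    free_at = [0.0] * num_queues  # time at which each queue drains
    waits = []
    for _ in range(n_jobs):
        clock += random.expovariate(arrival_rate)
        # route to the least-backlogged queue; with num_queues == 1
        # this degenerates to a single provider's queue
        q = min(range(num_queues), key=lambda i: free_at[i])
        wait = max(0.0, free_at[q] - clock)
        waits.append(wait)
        free_at[q] = clock + wait + random.expovariate(service_rate)
    waits.sort()
    return waits[len(waits) // 2], waits[int(0.99 * len(waits))]

# every server runs at ~95% utilization in both scenarios; the pooled
# case just lets a job escape a long queue by going to another provider
p50_1, p99_1 = sim_waits(1, arrival_rate=0.95, service_rate=1.0)
p50_4, p99_4 = sim_waits(4, arrival_rate=3.80, service_rate=1.0)
print(f"single provider: p50 wait {p50_1:6.1f}  p99 wait {p99_1:6.1f}")
print(f"pooled x4:       p50 wait {p50_4:6.1f}  p99 wait {p99_4:6.1f}")
```

the single queue's p99 wait comes out several times worse than the pooled one even though per-server utilization is identical. that's the classic resource-pooling effect, and it matches the shape of what i measured.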
if you’re evaluating platforms for production inference and p99 actually matters for your use case, the multi-provider pooling architecture is the thing to look for. it’s the only structural fix for the queue-time component of cold start.