
Ashwin Hariharan

Why does paying more make your LLM reply faster?

Why does Claude respond faster when you pay more? And why does a longer conversation cost disproportionately more than a short one?

For the longest time I simply wrote these off as "just how it works". Like most engineers, I burn through Claude and GPT tokens all day and assumed "longer prompts cost more" was just a billing convention.

As it turns out, memory is one of the factors that influence LLM pricing.

Now memory in AI systems lives in a lot of places. For example:

  • Vector stores for RAG
  • Redis for semantic caches and session state
  • In-process caches for short-lived data

Each layer has its own latency budget and its own access pattern.

One layer that doesn't get talked about much, but determines almost every LLM pricing decision from Claude, GPT, and Gemini, is HBM - the high-bandwidth memory inside the GPU itself.

At every token-generation step, the GPU does two reads from this high-bandwidth memory:

  • reading the model's weights
  • reading the KV cache

Let's unpack each briefly:

Reading the model's weights

Every time the model generates a token, your input flows through the model's layers one by one - from the first layer all the way to the output. This is called a forward pass.

Each forward pass reads the model weights just once, regardless of how many users are calling the API at the same moment. The weights are constant and don't change between users.

This means the cost of that one weight read can be amortized. If the GPU packs 100 user requests into the same forward pass as a batch, those 100 users share a single weight read - each effectively pays 1/100th of it.
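To make the amortization concrete, here's a back-of-the-envelope sketch. The 70B-parameter, fp16 model is an assumption I picked for round numbers, not any provider's actual figures:

```python
# Reading the weights of a 70B-parameter model stored in fp16 means
# moving ~140 GB out of HBM per forward pass - a fixed cost that doesn't
# depend on how many requests are packed into the batch.
WEIGHT_BYTES = 70e9 * 2  # 70B params x 2 bytes (fp16) ~= 140 GB

for batch_size in (1, 10, 100):
    per_user = WEIGHT_BYTES / batch_size
    print(f"batch={batch_size:>3}: ~{per_user / 1e9:5.1f} GB of weight reads per user")

# batch=  1: ~140.0 GB of weight reads per user
# batch= 10: ~ 14.0 GB of weight reads per user
# batch=100: ~  1.4 GB of weight reads per user
```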

💡 This is essentially what the "fast tier" modes in tools like Cursor are: smaller batches, so fewer people splitting the bill - and you pay more per token.

Reading the KV cache

The KV cache works differently. It is a variable cost that grows with your conversation.

To understand how this influences the cost, we first need a conceptual picture of what the Attention Mechanism does:

A quick detour into the Attention Mechanism

When the transformer generates a new token, it doesn't treat every earlier token equally. It uses attention to decide which earlier tokens matter most.

The easiest way to picture it: imagine every token in your conversation is a sticky note with two parts.

  • The key is a short tag describing what kind of information this token carries.
  • The value is what's written inside the note - the actual information the model can pull in.

Take the sentence: "The cat sat on the mat. It was fluffy."

When the model gets to "It was fluffy" and tries to predict the next word, it needs to know what "It" refers to. So it scans the tags (keys) of every earlier token:

  • cat: key indicates "I'm a noun, an animal, the subject". Value carries "small, furry, four legs, often a pet."
  • mat: key indicates "I'm a noun, an object, a location". Value carries "flat thing on the floor."

Both are nouns, but the cat key matches the question "what could 'It' refer to that could be fluffy?" better. So the model pulls in cat's value more strongly than mat's, and uses that to shape the next token.

Note: In reality keys and values aren't English sentences - they're vectors of numbers the model learned during training. But functionally that's the job they do: the key is how this token gets found, the value is what gets pulled in once it's found.
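Here's that lookup as a tiny numpy sketch. The four-dimensional vectors are made-up stand-ins (real models learn thousands of dimensions), but the mechanics are the real ones: score every key against a query, softmax the scores, and blend the values by those weights.

```python
import numpy as np

d = 4                                    # tiny embedding size for the sketch
query = np.array([1.0, 0.0, 1.0, 0.0])  # roughly: "what could 'It' refer to?"

keys = np.array([
    [1.0, 0.1, 0.9, 0.0],   # key for "cat" - a close match to the query
    [0.1, 1.0, 0.0, 0.9],   # key for "mat" - a weak match
])
values = np.array([
    [0.8, 0.2, 0.1, 0.0],   # value for "cat" - the content that gets pulled in
    [0.0, 0.1, 0.9, 0.3],   # value for "mat"
])

scores = keys @ query / np.sqrt(d)               # how well each key matches
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
output = weights @ values                        # a blend of values, mostly cat's

print(weights.round(2))  # [0.71 0.29] - "cat" wins the attention
```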


For every token in your conversation, the model saves a key (a searchable label) and a value (the content). Without the cache, the attention mechanism would recompute these from scratch on every step. With it, it just reads them back.

But that read grows linearly with the length of your conversation:

  • 1,000 tokens of context -> 1,000 key-value pairs read per generated token
  • 100,000 tokens of context -> 100,000 key-value pairs read per generated token

And unlike weights, this cache is unique to your session - the GPU can't read user A's KV cache and reuse it for user B, because the data is different. Every user pays the full cost of reading their own KV cache, with no sharing.
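Here's a sketch of that scaling, using loosely Llama-70B-shaped numbers (80 layers, 8 KV heads via grouped-query attention, 128-dim heads, fp16). The exact shape is an assumption; the linear growth is the point:

```python
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2   # fp16

def kv_bytes_read_per_token(context_len: int) -> int:
    # Each new token reads the K and V vectors of every prior token,
    # in every layer: 2 (K and V) x layers x heads x head_dim x bytes.
    return context_len * 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

for ctx in (1_000, 100_000):
    gb = kv_bytes_read_per_token(ctx) / 1e9
    print(f"{ctx:>7,} tokens of context -> ~{gb:.1f} GB read per generated token")

#   1,000 tokens of context -> ~0.3 GB read per generated token
# 100,000 tokens of context -> ~32.8 GB read per generated token
```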

💡 So under the hood, it's about how fast a chip can read memory.

The weight bill gets split across the batch, whereas the KV bill is just yours.
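Putting the two reads together gives a rough per-user speed ceiling. Assume decoding is purely bandwidth-bound, every user in the batch happens to have the same context length, and the hardware is H100-ish at ~3.35 TB/s of HBM bandwidth - all illustrative assumptions:

```python
BW      = 3.35e12   # HBM bandwidth, ~3.35 TB/s (H100-class, an assumption)
WEIGHTS = 140e9     # one weight read per step, shared by the whole batch
KV_1K   = 0.33e9    # KV read per token at 1k context (from the sketch above)
KV_100K = 32.8e9    # ... and at 100k context

for batch in (1, 100):
    for kv, label in ((KV_1K, "1k ctx"), (KV_100K, "100k ctx")):
        # One decode step reads the weights once plus every user's KV cache,
        # and emits one token per user.
        step_bytes = WEIGHTS + batch * kv
        print(f"batch={batch:>3}, {label:<8}: <= {BW / step_bytes:.1f} tokens/sec per user")
```

Short contexts batch almost for free, but at 100k tokens the KV reads dwarf the shared weight read - which is exactly why long conversations cost disproportionately more.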

Learn more here

Transformer Explainer: LLM Transformer Model Visually Explained

An interactive visualization tool showing you how transformer models work in large language models (LLM) like GPT.

poloclub.github.io

Top comments (1)

Vikrant Shukla

Good framing. In our internal benchmarks the latency delta between standard and "priority/premium" tiers correlates much more with queue admission policy and dedicated capacity than with raw inference speed — same model weights, different scheduler. If you log TTFT (time-to-first-token) and inter-token latency separately, you'll typically see premium tiers compressing TTFT (admission) while inter-token latency stays roughly constant. Worth pairing this analysis with p50/p95/p99, not just averages; tail latency is what blows up agent workflows where one slow turn cascades through a planner-executor loop.