Update (May 4, 2026): A reader (Gary Stupak in the comments) pointed out that Cloudflare AI Gateway supports custom metadata headers (cf-aig-metadata) that let you tag requests with your own tenant/feature/conversation IDs, which solves the product-level attribution problem I originally claimed gateway tools couldn't handle. Full exchange in the thread below.
It’s wild how much we 'fly blind' with standard billing dashboards until we see a breakdown like this. Identifying a 100x cost gap is a huge win for a lean team. I love the 'boring UI' philosophy—it’s all about the insights, not the fluff. Keep the updates coming, Ali; your 'journey is the content' approach is incredibly high-value for the rest of us.
Thanks Syed, this means a lot.
The "lean team" point is exactly right — when you're solo or small, you can't afford to optimize what you can't see. Most of the AI-cost articles I've read assume you have an ops team to build observability later. For us, observability has to be the first thing, not the last, because we're making architectural decisions in real-time based on what the dashboard shows.
The "boring UI" thing is honestly because I don't have time for fluff — I'm building Provia, writing articles, and trying to ship features all from the same kitchen table. The dashboard had to be ugly, fast, and useful. Turns out that's also what people actually want.
Saw you're building Commerza — full-stack from scratch in PHP is the real work. Following back.
Nice build!!!! Usage visibility is such an underrated problem. Knowing you spent money is one thing, but understanding where and why is what actually helps you optimize and control costs.
Thanks Varsha! "Underrated" is the perfect word for it — most teams instrument after the bill scares them, not before. Mobile observability is the same story I think; you write a lot about scale and architecture, so you've probably seen this exact pattern play out with Firebase or Sentry costs too.
Yeah, exactly. It usually shows up only after costs spike, and by then it's already reactive. I've seen the same with Firebase and Sentry, where teams realize too late what's actually driving usage. Feels like observability should be designed in from day one, not added after the bill hits.
That phrase — "designed in from day one" — is exactly the framing I've been trying to land on. Observability is treated like a luxury you earn after shipping, but it's actually the cheapest insurance policy in software. Five extra columns in your logs table at the start cost nothing. Five extra columns retrofitted across a year of production data is a migration nightmare.
The teams that learn this the easy way are the ones who watched it happen at their last job.
That “cheapest insurance in software” line is spot on.
Most teams only realize the value of observability when they’re already debugging blind. And by then, even a small missing field becomes expensive because nobody wants to touch old logging once production is messy.
I’ve seen the same pattern with AI usage too. If you don’t track prompts, endpoints, users, token patterns, and cost drivers early, every optimization later becomes guesswork.
Honestly, observability feels boring until it saves a release, a budget, or a week of engineering time.
Thanks Varsha — the "boring until it saves you" framing is right. I've come around to thinking observability is closer to seatbelts than to feature work: the value compounds quietly until you need it, and then it's all you have. Appreciate the read.
100x between features you thought were similar is the kind of variance that breaks sprint estimates. seen teams absorb these as 'AI overhead' because there's no per-call drill-down.
"AI overhead" is the right name for it. The moment it becomes a line item in the budget, the questioning stops.
The sprint angle is the one I underplayed in the article — costs are bad, but unpredictable costs that look similar to each other on paper are worse. Estimation collapses. You can't say "this feature takes 3 days" when "3 days" might mean $5 or $500.
Curious how the teams you've seen handle it once they spot the variance — per-feature budgets, quotas, or just absorbing the bill?
yeah once it's budgeted it stops being a question. the variance is still there, it's just invisible until estimation falls apart mid-sprint. that's the harder problem to sell upward.
Right — once it's absorbed, the only way to make it visible again is failure. Which is the worst possible time to make the case, because now you're explaining a missed deadline AND asking for tooling budget.
The pitch that works upward, in my limited experience, isn't about cost. It's about predictability. "We can't estimate AI features within an order of magnitude" lands differently than "AI is expensive." The first is a delivery risk. The second is a line item leadership has already accepted.
Are you seeing this play out somewhere specific, or is it pattern recognition across teams?
predictability framing works because it shifts the ask from 'trust us' to 'here's our signal'. harder to reject a prediction than a budget line.
Hey Mykola — appreciated the back-and-forth on my Dev.to article. Your "trust us vs. here's our signal" line is going into something I'm writing. Connecting here too.
Great point. I built something similar for my own app a while back. I usually rely on Cloudflare AI Gateway for this because it offers features like request caching for significant savings, per-request costs, rate limiting, request retries, model fallbacks, and detailed logs. However, having a custom dashboard definitely provides more flexibility for specific needs.
Gary — Cloudflare AI Gateway is exactly the kind of comparison I should have addressed in the article. The tradeoff I see: gateway-level tools are great for infra concerns (caching, retries, fallbacks) but they treat all calls as equal. They can't tell you "this 100-token call costs $0.0001 and this 1,820-token call costs $0.02" for the same user-facing feature. That tenant/feature/conversation breakdown has to live closer to your application code, where you have the labels.
So I think the right architecture is probably both: gateway for infra-level wins, custom dashboard for product-level cost attribution. Did you find yourself running both in parallel, or did the Gateway end up being enough for your use case?
Thanks for the detailed response, Ali! You've raised an interesting point about cost attribution.
From my experience with Cloudflare AI Gateway, it actually handles per-request logging quite well. It shows the exact token count and estimated cost for every single call in the logs. Regarding product-level attribution, I found that using the custom metadata headers (cf-aig-metadata) allows you to tag requests with IDs from your app, which bridges the gap between infra-level logs and product-level analytics.
To answer your question, I actually built my own calculator mainly out of curiosity to compare my internal logs with the Cloudflare AI Gateway data. I ran them both in parallel for a while and, to my satisfaction, they matched up perfectly. While the Gateway ended up being sufficient for my production needs, building a custom tool was a great way to verify the accuracy of the data.
Gary — that's genuinely useful, and I have to update my mental model. The cf-aig-metadata header is exactly the missing piece I assumed didn't exist; if it lets you propagate tenant/feature/conversation IDs from your app into the gateway logs, then Cloudflare does solve the attribution problem cleanly.
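From a skim of the docs, the wiring looks roughly like this (a sketch, not tested: the OpenAI SDK pointed at a Gateway endpoint, with illustrative IDs):

```ts
import OpenAI from "openai";

// Route calls through Cloudflare AI Gateway; tag each request with
// app-level IDs via the cf-aig-metadata header (a JSON-encoded object).
const client = new OpenAI({
  baseURL: "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID/openai",
  apiKey: process.env.OPENAI_API_KEY,
  defaultHeaders: {
    "cf-aig-metadata": JSON.stringify({
      tenant: "store_123",      // illustrative labels, not a required schema
      feature: "chat",
      conversation: "conv_456",
    }),
  },
});
```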
The honest revised take: my dashboard isn't an alternative to AI Gateway, it's what you build when you don't know AI Gateway has metadata headers. The "I built it because I had to" framing in the article only holds if you're not on Cloudflare's stack. For anyone already on it, your approach (Gateway as source of truth, custom tool for verification) is the better starting point.
Adding a follow-up note to the article. Appreciate the correction.
I appreciate the follow-up, Ali. Just wanted to share some insights from my workflow. Glad you found it helpful!
Genuinely. Threads like this are why I keep writing here.
I praise your effort and encourage your continued application. Additionally, I would like to use this evidence for a deeper reflection: wasn't this "reality" already known a priori, before using any kind of chat-based AI tool?
I strongly believe it was, from the moment I first learned the notion of "tokens" and saw that no platform was willing to disclose them openly and upfront. That "evidence" left me very critical of the AI era and prompted a deep reflection that led me to refuse to jump on the bandwagon without speaking of the limitations and moral corruption it fosters.
I compare it with the network traffic billing of 15 years ago, when VPS cost was determined by TB of traffic: the service providers disclosed (and accounted for) every bit of data they billed for.
I invite the young generation to see beyond the "offering" and not accept any solution provided as "the only one available". We had better services and options when we owned software rather than rented it.
Emanuele — token counts and tokenizer libraries are publicly documented by every major provider; what's missing isn't transparency at the API level, it's product-level cost attribution inside your own application. That's what the article addresses. Appreciate the read.
Spot on. The 'fire-and-forget' logging pattern is absolutely non-negotiable here. I see this exact 'blind spend' problem in the SDET and QA automation space all the time. When teams integrate LLMs into their CI/CD pipelines to validate complex API responses or generate dynamic payloads, the costs can spiral instantly without anyone knowing where to look.
When you use AI to generate semantic test data at scale—which is exactly the problem I tackle with my Python library, FixtureForge—you're making hundreds of API calls per test run. Without a granular observability wrapper like the one you built, a single unoptimized prompt or a loop in the pipeline can drain the budget overnight, and you'd have no idea which specific test suite caused it. Catching that 100x variance on day one proves this architecture is a must-have. Brilliant, actionable write-up!
Yaniv — the CI/CD angle is one I hadn't connected. Test pipelines with LLMs in the loop are exactly the kind of place where cost can explode silently because nobody's watching the per-test spend, just the per-deployment one. The dashboard pattern would surface a runaway test suite within hours instead of at end-of-month billing.
The fire-and-forget logging pattern stuck with me — not because it's technically clever, but because it quietly solves a problem that usually gets overengineered to death.
I've seen teams spend weeks wiring up OpenTelemetry, setting up collectors, configuring exporters, only to end up with dashboards nobody looks at because the setup was so heavy it became someone's full-time maintenance burden. Three files and a silent .catch(() => {}) is almost uncomfortably simple by comparison.

What I find myself wondering though: at what scale does fire-and-forget stop being "good enough" and start becoming a blind spot? You mentioned losing maybe 2–3 entries out of thousands. That's nothing when you're tracing cost anomalies. But if someone's monitoring for security signals or abuse patterns, 0.3% data loss might be the exact 0.3% that matters.
Not a criticism of the approach — I think it's the right call for this use case. More just thinking out loud about how the same pattern can be perfectly appropriate for one goal and subtly risky for another, and how easy it is to confuse the two.
This is the question that should have been in the article. The "good enough" calculus changes entirely when the data IS the product, not just the instrumentation around it.
The way I think about it now: fire-and-forget is right when (1) you're optimizing for the aggregate, not the individual record, and (2) the cost of slowing down the user-facing path exceeds the cost of any single missed log. Cost tracking matches both — I care about p95 spend per tenant per day, not whether I have every call. Lose 3 logs out of 1,000, the picture barely shifts.
Security and abuse signals invert both conditions. You're often hunting for the one anomalous record, not the distribution. And in many cases, slowing the request path is acceptable (or even desirable — you might want the auth check to block) compared to missing the signal. So the same pattern that's perfectly fine for billing observability would be a serious bug for fraud detection.
The dangerous version is when teams adopt fire-and-forget as a default pattern across all logging because it "worked for cost tracking," and quietly accept silent gaps in security telemetry. That's the failure mode worth naming.
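For concreteness, here are the two shapes side by side in a minimal sketch, with logUsage as a hypothetical async insert into the logs table:

```ts
interface UsageEntry { tenantId: string; feature: string; costUsd: number }

// Hypothetical helper: async insert of one row into the api_logs table.
declare function logUsage(entry: UsageEntry): Promise<void>;

async function handleRequest(entry: UsageEntry) {
  // Fire-and-forget: right for aggregate cost telemetry. The user-facing
  // path never waits, and one lost insert barely moves the aggregate.
  void logUsage(entry).catch(() => {});

  // Awaited: right when the individual record IS the signal (fraud, abuse,
  // auth). The path blocks, and a failed write surfaces loudly.
  // await logUsage(entry);
}
```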
The \"100x cost gap\" is a familiar pain when fine-tuning or inferencing. We see this acutely with multilingual models. Processing \"paracetamol\" versus its equivalent brand name, say \"क्रोसिन\" (Crocin), across 22 Indian languages often hits different tokenization costs and model pathing.
Without a dashboard like yours, pinpointing these subtle cost variations per language, or even per region-specific drug name, becomes impossible. It's not just feature A vs B, but 'lang A' vs 'lang B' for the same feature.
Crucial for managing API spend, especially when mapping complex data like drug interaction graphs across diverse linguistic inputs. I'm building GoDavaii.
Pururva — the multilingual angle is one I hadn't thought through. The tokenization variance between scripts is exactly the kind of thing the dashboard would surface as a "cost mystery" (why is this query 4× more expensive than that one?) but you'd never spot the pattern without language as a column.
For Provia I'm dealing with Arabic vs English chat in the same store — already seeing token counts run higher for Arabic responses, but I haven't broken it down by script yet. Going to add a language column this week.
The drug interaction graph use case sounds genuinely hard. Are you finding that certain languages need different model routing entirely, or is it more about predicting per-language cost variance for budgeting?
Great idea! Now could you make it for Claude?
Thanks Matteo! The wrapper itself is model-agnostic — only thing that changes is the pricing table and the field names from the response.
For Anthropic's SDK, the pricing table looks something like this (rates are per million tokens and drift over time, so check the current price list):
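```ts
// Illustrative rates in USD per million tokens. Verify against
// Anthropic's current price list before trusting any number here.
const ANTHROPIC_PRICING: Record<string, { input: number; output: number }> = {
  "claude-3-5-sonnet-latest": { input: 3.0, output: 15.0 },
  "claude-3-5-haiku-latest": { input: 0.8, output: 4.0 },
  "claude-3-opus-latest": { input: 15.0, output: 75.0 },
};
```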
And the response uses usage.input_tokens / output_tokens instead of OpenAI's prompt_tokens / completion_tokens — straightforward swap.
One thing worth adding to the table for Anthropic specifically: cache read/write tokens. Prompt caching gives ~90% discount on cache hits, so if you're not tracking those columns separately you'll undercount savings. Probably worth its own follow-up post.
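Sketch of the per-call math with the cache columns included. The field names match Anthropic's usage object; the 1.25x write / 0.1x read multipliers are the documented defaults, but re-check them before wiring this up:

```ts
// Shape of the usage block on an Anthropic Messages API response.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

// Per-call cost in USD, including prompt-cache writes and reads.
// Uses the ANTHROPIC_PRICING table from above.
function anthropicCost(model: string, usage: AnthropicUsage): number {
  const p = ANTHROPIC_PRICING[model];
  if (!p) throw new Error(`no pricing entry for ${model}`);
  const M = 1_000_000;
  return (
    (usage.input_tokens * p.input) / M +
    (usage.output_tokens * p.output) / M +
    ((usage.cache_creation_input_tokens ?? 0) * p.input * 1.25) / M +
    ((usage.cache_read_input_tokens ?? 0) * p.input * 0.1) / M
  );
}
```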
Are you mixing both providers in one app? That's actually the most interesting case — comparing cost-per-feature across providers from the same dashboard.
Yes, it would be quite interesting if you made a mode where you could compare token usage across multiple providers (e.g. Anthropic, DeepSeek, ChatGPT). You could even create bar charts and pie charts to show your spending across all models and providers.
That's a really good angle, Matteo — a unified multi-provider dashboard would basically turn the provider column into the most important dimension in the whole system.
The architecture isn't hard. One api_logs table with provider, model, endpoint, and a normalized cost column. Each SDK gets its own thin wrapper that maps to the same shape. The dashboard groups by whatever you want — provider, model, feature, tenant.
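Sketched as a row type (column names are just my guess at a sensible shape, not a prescription):

```ts
// One row per API call, normalized across providers so the dashboard
// can group by any dimension without provider-specific logic.
interface ApiLogRow {
  provider: "openai" | "anthropic" | "deepseek"; // extend as needed
  model: string;        // provider-specific model ID
  endpoint: string;     // e.g. "chat", "embeddings"
  feature: string;      // app-level label the wrapper attaches
  tenant_id: string;
  input_tokens: number;
  output_tokens: number;
  cost_usd: number;     // computed from each provider's pricing table
  created_at: Date;
}
```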
Where it gets interesting is the comparison views you're describing:
Pie chart by provider — am I actually diversified, or 95% locked into one vendor?
Bar chart by feature × provider — chat on Claude vs GPT vs DeepSeek for the same workload
Cost-per-task — same prompt across providers, normalized by output quality
Honestly you've just outlined my next post. I'll build a multi-provider version of this and write it up — would you want me to tag you when it goes live?
Yeah that sounds good! Excited to see the post!
Great work!!
Thank you 🔥
One more thing I thought of was to reduce redundant calls by caching the responses for similar prompts. This works for me as I ask the same things again and again 😂
Smart move — that's actually a different layer than the per-call optimization. Response caching skips the API entirely on repeats, prompt caching cuts the cost on the context but you still pay for fresh generation. Both win, depending on how deterministic the task is. For dev questions you ask repeatedly, response caching is probably the bigger lever.
What are you using as the cache key — the raw prompt, an embedding similarity check, or something else?
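If it's the simple exact-match version, a minimal sketch looks like this (in-memory Map standing in for whatever store you actually use; embedding-similarity keying is a different beast entirely):

```ts
import { createHash } from "node:crypto";

// Exact-match response cache keyed on a hash of model + prompt.
// In-memory Map for the sketch; use Redis or similar for real work.
const responseCache = new Map<string, string>();

function cacheKey(model: string, prompt: string): string {
  return createHash("sha256").update(`${model}\n${prompt}`).digest("hex");
}

async function cachedCompletion(
  model: string,
  prompt: string,
  callModel: (p: string) => Promise<string>, // your actual API call
): Promise<string> {
  const key = cacheKey(model, prompt);
  const hit = responseCache.get(key);
  if (hit !== undefined) return hit; // repeat prompt: skip the API entirely
  const result = await callModel(prompt);
  responseCache.set(key, result);
  return result;
}
```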
A useful idea—turning raw spending data into a clear dashboard helps users actually understand and control their AI costs.
Thanks Laura — the dashboard came out of needing exactly that for myself. Glad it resonated.
Nice!
👀👀