Most teams trying to build their own AI receptionist think the hard part is the AI.
It's not. The AI is the easy part now.
The hard part is everything around the AI. The part that doesn't show up in demos or tutorials. The part that takes six to eight months to figure out and breaks every time it goes near production.
I've watched a few teams try to build this themselves. They all hit the same wall.
- 1000ms — Total latency budget per response
- 6-8 mo — Orchestration buildout before production
- 8 layers — Hidden under the 30-second demo
What they think they're building
They watch a Vapi or Retell demo. The agent answers a call, takes a booking, sends a confirmation. Looks simple.
So they think the build is:
- Pick an LLM
- Write some prompts
- Pick a voice
- Connect a phone number
- Ship it
A weekend project.
What they're actually building
Here's what's underneath that 30-second demo.
Telephony layer. SIP trunking. Carrier integration. STIR/SHAKEN attestation so calls don't get marked as spam. Inbound number provisioning. Outbound caller ID verification. DTMF detection. Call recording compliance per state.
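To make just one of those concrete, here's a sketch of the recording-consent piece. The state set is a placeholder, not legal guidance, and the real list has to come from counsel and stay up to date.

```python
# Illustrative sketch of one telephony concern: recording consent by state.
# The state set below is a placeholder, not legal guidance.
ALL_PARTY_CONSENT_STATES = {"CA", "FL", "WA"}  # placeholder, incomplete

def recording_policy(caller_state: str) -> dict:
    """Decide how to handle recording for a single inbound call."""
    if caller_state in ALL_PARTY_CONSENT_STATES:
        # Must announce recording and capture consent before recording starts.
        return {"record": True, "announce": True, "require_consent": True}
    return {"record": True, "announce": True, "require_consent": False}
```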
Audio infrastructure. Voice activity detection that doesn't false-trigger on background noise. Barge-in handling so the agent stops talking when the caller interrupts. Echo cancellation. Silence detection. Dropped audio recovery.
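A minimal sketch of barge-in, assuming hypothetical `tts_stream`, `vad`, and `playback` interfaces. The idea is simple: require sustained speech before cutting the agent off, so a cough or a door slam doesn't trigger it.

```python
import asyncio

async def speak_with_barge_in(tts_stream, vad, playback, min_speech_ms=250):
    """Play TTS audio but stop the moment the caller actually starts talking.

    tts_stream, vad, and playback are hypothetical interfaces: an async
    iterator of audio chunks, a voice-activity detector, and an audio sink.
    """
    speech_ms = 0
    async for chunk in tts_stream:
        # Check the inbound leg for speech before playing each outbound chunk.
        if vad.is_speech(await playback.latest_inbound_frame()):
            speech_ms += playback.frame_ms
        else:
            speech_ms = 0  # require sustained speech, not background noise
        if speech_ms >= min_speech_ms:
            await playback.stop()   # cut the agent off mid-sentence
            return "interrupted"    # caller's turn; go back to listening
        await playback.play(chunk)
    return "finished"
```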
Latency budget. The whole call has a 1000ms response window before it sounds robotic. That 1000ms gets split across speech-to-text, LLM inference, tool calls, text-to-speech, telephony round trip. Each one has to be optimized. Miss the budget and customers hang up.
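Here's roughly what that budget looks like as code. The per-stage numbers are assumptions; the point is that every stage gets a slice, the slices have to add up, and a blown slice should be visible per call instead of showing up weeks later as a higher hang-up rate.

```python
import time
from contextlib import contextmanager

# Rough split of the 1000ms budget. The exact numbers are assumptions.
BUDGET_MS = {
    "stt_finalize": 200,      # end of speech to final transcript
    "llm_first_token": 350,
    "tool_calls": 200,
    "tts_first_audio": 150,
    "telephony_rtt": 100,
}
assert sum(BUDGET_MS.values()) <= 1000

@contextmanager
def stage(name: str, metrics: dict):
    """Time one pipeline stage and record whether it blew its slice."""
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics[name] = elapsed_ms
    if elapsed_ms > BUDGET_MS[name]:
        print(f"latency budget blown: {name} took {elapsed_ms:.0f}ms")
```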
Tool reliability. The agent calls your CRM to book an appointment. The API times out at 8 seconds. Agent already said "perfect, you're booked for Thursday." Customer gets no confirmation. Shows up. No record. Trust gone.
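The fix is boring and non-negotiable: the agent never says "booked" until the write has succeeded. One way to enforce that, sketched with hypothetical `crm` and `speak` interfaces and an assumed 3-second timeout:

```python
import asyncio

async def book_appointment(crm, slot, speak):
    """Never confirm to the caller before the CRM write has succeeded.

    crm.create_booking and speak are hypothetical interfaces; the timeout
    and fallback wording are assumptions, not a fixed recipe.
    """
    try:
        result = await asyncio.wait_for(crm.create_booking(slot), timeout=3.0)
    except asyncio.TimeoutError:
        # Don't guess. Tell the caller the truth and confirm out-of-band.
        await speak("I'm finalizing that now. You'll get a text confirmation "
                    "in the next minute.")
        return {"status": "pending", "slot": slot}
    if result.get("confirmed"):
        await speak(f"Perfect, you're booked for {slot.human_label}.")
        return {"status": "confirmed", "booking_id": result["id"]}
    await speak("I couldn't lock that time in. Let me try another slot.")
    return {"status": "failed"}
```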
State management. Call drops mid-conversation. Customer calls back. How does the agent know they were already 80% through booking? Handoff between inbound and outbound. Retry logic. Idempotency so the same booking doesn't get created twice.
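A sketch of the idempotency and resume pieces. The session store and field names are stand-ins for whatever persistence layer you actually use.

```python
import hashlib

def idempotency_key(caller_number: str, service: str, slot_iso: str) -> str:
    """Same caller + same service + same slot = same key, so a retry or a
    call-back can't create a duplicate booking."""
    raw = f"{caller_number}|{service}|{slot_iso}"
    return hashlib.sha256(raw.encode()).hexdigest()

def resume_or_start(session_store: dict, caller_number: str) -> dict:
    """If the caller dropped mid-booking and calls back, pick up where they
    left off instead of starting the script over.

    session_store is a stand-in for a persistent store (Redis, Postgres);
    the field names are assumptions.
    """
    state = session_store.get(caller_number)
    if state and state.get("step") not in (None, "done"):
        return state               # e.g. {"step": "confirm_slot", ...}
    return {"step": "greeting", "collected": {}}
```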
Escalation logic. When does the agent transfer to a human? When does it just take a message? How does it handle threats, lawsuits, contract disputes, refund demands? These aren't AI problems. They're product problems with hard rules.
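Hard rules look roughly like this. The trigger words and thresholds are illustrative, not a complete policy; the real list comes from the operator.

```python
# Hard rules, not model judgment. Triggers and thresholds are illustrative.
TRANSFER_TRIGGERS = ("lawyer", "lawsuit", "sue", "attorney")
MESSAGE_TRIGGERS = ("complaint", "refund", "cancel my contract")

def escalation_action(transcript: str, low_confidence_turns: int) -> str:
    text = transcript.lower()
    if any(word in text for word in TRANSFER_TRIGGERS):
        return "transfer_to_human"   # legal threat: human, immediately
    if low_confidence_turns >= 2:
        return "transfer_to_human"   # agent is lost; stop guessing
    if any(word in text for word in MESSAGE_TRIGGERS):
        return "take_message"        # capture details, route to the owner
    return "continue"
```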
Monitoring. How do you know the agent is failing? You can't watch every call. You need three layers — system health (uptime, error rates), leading indicators (transfer rate, low-confidence responses), business outcomes (bookings, conversion, revenue).
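A sketch of those three layers as one metrics check. The thresholds are assumptions you'd tune against your own baseline, but the shape is the point: system health catches outages, leading indicators catch a struggling agent before revenue shows it, business outcomes catch everything else.

```python
from dataclasses import dataclass

@dataclass
class CallMetrics:
    # Layer 1: system health
    error_rate: float          # failed calls / total calls
    p95_latency_ms: float
    # Layer 2: leading indicators
    transfer_rate: float       # calls escalated to a human
    low_confidence_rate: float
    # Layer 3: business outcomes
    booking_rate: float        # bookings / answered calls

def alerts(m: CallMetrics) -> list[str]:
    """Thresholds are illustrative; tune them against your own baseline."""
    out = []
    if m.error_rate > 0.02 or m.p95_latency_ms > 1500:
        out.append("system health degraded")
    if m.transfer_rate > 0.25 or m.low_confidence_rate > 0.15:
        out.append("agent struggling before revenue shows it")
    if m.booking_rate < 0.30:
        out.append("business outcome slipping")
    return out
```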
Model and data drift. The LLM provider updates their model. Agent behavior shifts subtly. Nobody notices for two weeks. You find out when bookings drop 15%.
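One way to catch that kind of silent shift is to compare recent conversion against a trailing baseline. The window sizes and the 10% threshold here are assumptions; the mechanism is what matters.

```python
from statistics import mean

def drift_alert(daily_booking_rates: list[float],
                baseline_days: int = 14,
                recent_days: int = 3,
                max_drop: float = 0.10) -> bool:
    """Flag a silent behavior shift by comparing recent conversion to a
    trailing baseline. Window sizes and threshold are assumptions."""
    if len(daily_booking_rates) < baseline_days + recent_days:
        return False
    baseline = mean(daily_booking_rates[-(baseline_days + recent_days):-recent_days])
    recent = mean(daily_booking_rates[-recent_days:])
    return baseline > 0 and (baseline - recent) / baseline > max_drop
```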
The build vs buy moment
This is the conversation I have with operators who think they want to build it themselves.
They're not wrong about the AI. Anyone can prompt an LLM to sound friendly on the phone.
They're wrong about the rest.
I talked to a guy who'd been building his own setup for 8 months. He had the agent working great in test calls. The moment he tried to ship it into production, everything broke.
His telephony provider's webhook signatures wouldn't validate. His CRM API was throwing 500s on bookings during peak hours. His agent was confirming bookings before the API actually wrote them, so customers got told they had appointments that didn't exist. His latency was 2.4 seconds because he was running STT → LLM → TTS sequentially instead of streaming.
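The sequential-versus-streaming part is the most mechanical fix of the four: start synthesizing audio as soon as the first sentence of the LLM response lands instead of waiting for the whole reply. A sketch, with hypothetical `llm_tokens`, `tts`, and `playback` interfaces and a deliberately naive sentence split:

```python
async def stream_response(llm_tokens, tts, playback):
    """Flush each completed sentence to TTS instead of waiting for the
    full LLM response. llm_tokens, tts, and playback are hypothetical."""
    buffer = ""
    async for token in llm_tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            # First audio starts after the first sentence, not after the
            # whole answer: that's where most of the sequential latency goes.
            await playback.play(await tts.synthesize(buffer))
            buffer = ""
    if buffer.strip():
        await playback.play(await tts.synthesize(buffer))
```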
He asked me how long it took us to solve those problems.
About a year of running it in production with real shops.
He stopped trying to build his own.
The difference between a $300/month AI receptionist and one that actually works is everything underneath the conversation.
Why this matters if you're shopping
If you're an operator looking at AI receptionist providers, the question isn't "do you have an AI that sounds good." Every provider sounds good in the demo.
The question is "what happens when something goes wrong."
Ask them:
- What's your average end-to-end response latency under load
- How do you handle webhook timeouts on CRM bookings
- What happens if a call drops mid-conversation
- Show me how you detect false success — when the agent says "booked" but the booking didn't actually happen
- What's your transfer rate to humans and what triggers it
Most cheap providers can't answer these. They shipped the demo. They didn't ship the production system.
The takeaway
Building the AI is no longer the hard part. The infrastructure around it is.
If you're an operator, ask the harder questions before you buy. The conversation quality is table stakes. The orchestration is what determines whether the agent actually books the job.
If you're a builder thinking about competing in this space, plan for six to eight months on the orchestration before you ship. Or pick a different problem. This one is solved by people who have already taken the lumps.
If you want to see what running the orchestration looks like from the operator side, my last long-form was on how I replaced hours of manual work with a self-hosted AI agent — same NeverMiss, different stack, full build log including the security layer most tutorials skip.
Want this kind of automation built into your business?
If you'd rather not spend six to eight months on the orchestration yourself, that's what NeverMiss does. nevermisshq.com
Top comments (3)
The 1000ms latency budget is the detail that reframes everything else. It's not just a technical constraint — it's a product constraint that dictates your entire architecture before you've written a single prompt. Once you accept that the total window from caller finishing their sentence to agent beginning its response is one second, you realize you can't afford sequential processing, you can't afford retries on timeouts, and you definitely can't afford an LLM that thinks for two seconds before responding. The demo doesn't care about this because the demo isn't running under load. But in production, the latency budget is the thing that separates a conversation from an automated phone tree that happens to use natural language.
What makes this harder than it looks is that the latency budget isn't evenly distributed. Speech-to-text takes what it takes. Text-to-speech takes what it takes. The network round trips are physics. By the time you subtract those, the LLM and any tool calls are fighting for whatever milliseconds are left. And the tool calls are the unpredictable part — one slow CRM API response and you've blown the budget, and the caller has already decided this thing is broken and hung up.
The false success problem — where the agent says "booked" but the API didn't actually commit — feels like the truly scary one. A slow response is frustrating. A confidently wrong confirmation is trust-destroying. And the worst part is you might not know it's happening unless you're specifically monitoring for mismatches between what the agent said and what the CRM actually recorded. Most teams wouldn't think to build that reconciliation check until after the first customer complaint. I'm curious how much of that monitoring is standard across providers now, or if it's still mostly custom infrastructure that teams have to build themselves after getting burned.
yeah the false success problem is the scary one. silent failure is way worse than visible failure because by the time you find it the customer has already lost trust
honest answer on monitoring across providers — it's still mostly custom. the bigger players have some reconciliation built in but it's usually around their own tool calls not your crm. anyone building cross-system has to handle it themselves
the pattern that works for us is treating every booking confirmation as a two-stage commit. agent never says booked until the api returns success. if the api times out the agent says we will confirm by sms in the next minute then handles the actual confirmation async. customer never gets told something happened that didn't
most teams skip this because it adds latency to the happy path. but trust is binary. one false confirmation and you lose the relationship
The 1000ms latency budget is the detail that reframes the entire architecture. I hit a version of this building an AI support agent for DeFi protocols. My constraint isn't voice latency but trust latency: when someone asks "am I about to get liquidated?" during a market crash, a 3-second answer might be the difference between saving a position and losing it.
The fix was the same pattern you describe: don't let the AI do the heavy lifting in real time. I pre-compute everything before the AI touches it. On-chain data, health factors, liquidation prices, risk simulations all run through dedicated service layers that return structured results. The AI's job is to narrate the answer in plain English, not to calculate it. The moment you let the AI do the math, you get confident wrong answers that sound perfect.
Your "false success" problem (agent says "booked" but the booking didn't write) maps exactly to what I call the never-lie principle. My system tags every field with a data availability flag: full, partial, or unavailable. If an RPC call failed, the AI sees "couldn't determine" instead of a default value. The default is where the lie lives. A health factor of "unknown" is honest. A health factor of 999 (which was in my codebase before I caught it) is a confident wrong answer that could cost someone real money.
The build vs buy framing is honest. The 8-month orchestration timeline matches my experience. Six months in, 95K lines, and the orchestration layer is still where most of the complexity lives.