Last month, a patient called one of our dental clinic clients at 11pm on a Saturday and had a full conversation about rescheduling their root canal — and didn't realize they were talking to an AI until the receptionist mentioned it at their next visit. That wasn't an accident. It was the result of 4 months of prompt iteration, 10,000+ call recordings analyzed, and about 140 prompt versions before we landed on something that actually works.
We build Loquent, a production voice AI platform that handles thousands of automated calls per month for healthcare and dental clinics across Canada. The system runs on Anthropic's Claude for conversation logic, Deepgram for speech-to-text, ElevenLabs for text-to-speech, and Twilio for telephony. When we started, our AI sounded like a chatbot reading a script. Now it sounds like a receptionist who's been working at the clinic for three years. The difference was almost entirely in the prompts.
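For context, here's the shape of a single turn through that stack. This is a minimal sketch with hypothetical helper names standing in for the vendor SDKs, not our actual service code:

```typescript
// One conversational turn: caller audio in via Twilio, synthesized audio out.
// The helper signatures below are hypothetical stand-ins for the vendor SDKs.
declare function transcribe(audio: Buffer): Promise<string>;                   // Deepgram STT
declare function generateReply(callId: string, text: string): Promise<string>; // Claude
declare function synthesize(text: string): Promise<Buffer>;                    // ElevenLabs TTS

async function handleTurn(callId: string, callerAudio: Buffer): Promise<Buffer> {
  const transcript = await transcribe(callerAudio);
  const reply = await generateReply(callId, transcript);
  return synthesize(reply);
}
```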
I'm going to share the actual prompt architecture we use, the specific techniques that moved the needle, and the failures that taught us the most. If you're building voice AI for any domain, most of this transfers directly.
Why Voice Prompts Are a Completely Different Problem
The first mistake we made was treating voice AI prompts like chatbot prompts. We took our best-performing text prompts and plugged them into the voice pipeline. The result was technically correct and completely unusable.
Here's why: in a text chat, a user reads a 3-sentence response in about 4 seconds. In a voice call, that same response takes 12-15 seconds to speak aloud. By sentence two, the caller has already mentally checked out or tried to interrupt. We learned this the hard way — our first production deployment had a 34% hang-up rate within the first 30 seconds.
Voice has three constraints that text doesn't: latency sensitivity (callers expect sub-second responses), interruption handling (people talk over AI constantly), and conversational pacing (long responses feel robotic regardless of how natural the voice sounds).
The Prompt Architecture That Actually Works
After 140+ iterations, we settled on a three-layer prompt architecture. Here's the actual structure:
Layer 1: The System Identity Prompt
This is the foundation. We keep it under 400 tokens because longer system prompts measurably increase response latency with Claude. Here's a representative version (clinic details changed):
You are Sarah, a receptionist at Bright Dental Care in Toronto.
You answer phone calls. You are friendly, efficient, and
knowledgeable about the clinic.
CRITICAL RULES:
- Respond in 1-2 short sentences maximum. Never more.
- Use natural filler words occasionally: "sure", "of course",
"absolutely", "let me check on that"
- If you don't know something, say "Let me check with the
team and get back to you" — never guess
- Always confirm spelled-out details back to the caller
- You cannot provide medical advice. Ever. Redirect to
the dentist.
CLINIC HOURS: Mon-Fri 8am-6pm, Sat 9am-2pm, Closed Sunday
EMERGENCY LINE: 416-555-0199
Notice what's NOT in there: no verbose personality descriptions, no "you are a helpful assistant," no lengthy backstory. Every token in the system prompt costs you latency, and in voice, latency kills the experience.
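If you want to hold that 400-token budget mechanically, a rough guard is enough. Here's a sketch using the common rule of thumb of about 4 characters per token for English text:

```typescript
// Rough token estimate (~4 chars/token for English text). Good enough to
// catch a system prompt that has drifted past the latency budget.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function assertSystemPromptBudget(systemPrompt: string, maxTokens = 400): void {
  const estimate = estimateTokens(systemPrompt);
  if (estimate > maxTokens) {
    throw new Error(
      `System prompt is ~${estimate} tokens (budget: ${maxTokens}). Trim it.`,
    );
  }
}
```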
Layer 2: The Conversation State Manager
This is where most voice AI projects fail. They treat each turn independently. We inject a dynamic context block that updates every turn:
CURRENT CALL STATE:
- Caller intent: {appointment_reschedule}
- Caller name: {collected: "Michael"}
- Caller verified: {yes — DOB matched}
- Current appointment: {May 15, 2pm, Dr. Patel, cleaning}
- Turn count: {4}
- Sentiment: {neutral, slightly impatient}
NEXT LIKELY ACTIONS: confirm_new_time, check_availability
This state block is assembled programmatically from our NestJS backend. The sentiment field comes from Deepgram's tone analysis of the caller's voice. The "turn count" field matters because we found that after six or seven turns, callers get noticeably more impatient, so we prompt Claude to be more concise and direct.
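Rendering the block is simple string templating. Here's a simplified sketch; the field names are illustrative, and the real version hydrates from the booking system, caller verification, and Deepgram's analysis:

```typescript
// Illustrative call state. The real version is hydrated each turn from the
// booking system, caller verification, and Deepgram's audio analysis.
interface CallState {
  intent: string;
  callerName?: string;
  verified: boolean;
  currentAppointment?: string;
  turnCount: number;
  sentiment: string;
  nextLikelyActions: string[];
}

function renderStateBlock(s: CallState): string {
  const lines = [
    'CURRENT CALL STATE:',
    `- Caller intent: {${s.intent}}`,
    `- Caller name: {${s.callerName ? `collected: "${s.callerName}"` : 'not yet collected'}}`,
    `- Caller verified: {${s.verified ? 'yes' : 'no'}}`,
    `- Current appointment: {${s.currentAppointment ?? 'none on file'}}`,
    `- Turn count: {${s.turnCount}}`,
    `- Sentiment: {${s.sentiment}}`,
    `NEXT LIKELY ACTIONS: ${s.nextLikelyActions.join(', ')}`,
  ];
  // Past six or seven turns, callers get impatient. Nudge Claude to tighten up.
  if (s.turnCount >= 7) {
    lines.push('NOTE: Long call. Be more concise and direct than usual.');
  }
  return lines.join('\n');
}
```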
Layer 3: The Response Shaping Instructions
This is the layer we iterated on the most. Here's the version that cut our hang-up rate from 34% to 11%:
RESPONSE FORMAT RULES:
- Maximum 25 words per response unless reading back
specific information
- Lead with the answer, then context. Never context first.
- End with a clear next step or question
- Use contractions always (it's, we've, that's)
- No lists. No bullet points. No "firstly/secondly"
- If the caller seems confused, ask ONE clarifying question.
Not two.
PACING:
- After giving appointment details, pause with "Does that
work for you?" before continuing
- Never stack multiple pieces of information in one response
- If you need to relay 3+ data points, break across turns
The "lead with the answer" rule alone improved our caller satisfaction scores by 22%. When someone asks "Do you have anything available Thursday?" — the old prompt would say "Let me check our availability for Thursday. We have several options..." The new prompt produces: "Thursday works. We have 10am or 2:30pm with Dr. Patel. Which do you prefer?"
The Three Techniques That Made the Biggest Difference
1. The "Overheard Conversation" Training Method
We stopped writing prompts from scratch and started transcribing our best human receptionists. We recorded 40 hours of real receptionist calls (with consent), transcribed them, and identified the specific phrases and patterns that made callers respond positively. Then we encoded those exact patterns into the prompt.
For example, we noticed that the best receptionists always said "perfect" or "great" after a caller confirmed information, before moving on. Small thing. But when we added the instruction "After caller confirms any information, acknowledge with a brief affirmation ('perfect', 'great', 'got it') before your next question" to the prompt, our post-call satisfaction ratings went up 8%.
2. The 25-Word Ceiling
We tested response lengths systematically across 2,000 calls. The data was clear:
- Under 15 words: callers felt the AI was too curt, asked "are you still there?"
- 15-25 words: optimal range, lowest hang-up rate, highest task completion
- 25-40 words: hang-up rate increased 18%
- Over 40 words: hang-up rate increased 41%, callers started interrupting mid-response
We hard-coded the 25-word ceiling into the prompt and added a programmatic check that flags any Claude response over 30 words for review. In production, Claude stays under 25 words about 89% of the time with this prompt instruction alone.
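The flagging logic itself is tiny. A sketch (in production the flag goes to a review queue, not a console):

```typescript
// Flag over-long responses for human review. The prompt enforces the ceiling;
// this is the backstop that tells us when the prompt stops working.
function wordCount(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function checkResponseLength(response: string, callId: string): void {
  const words = wordCount(response);
  if (words > 30) {
    // In production this goes to a review queue, not the console.
    console.warn(`[${callId}] Response is ${words} words, over the 30-word flag threshold`);
  }
}
```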
3. Sentiment-Adaptive Prompting
We inject real-time sentiment into the conversation state (from Deepgram's audio analysis) and include conditional instructions:
IF sentiment = frustrated or impatient:
- Be extra concise. Under 15 words if possible.
- Skip pleasantries. Get to the point.
- Offer to transfer to a human: "Would you like me to
connect you with someone from our team?"
IF sentiment = confused:
- Slow down. One piece of information at a time.
- Repeat back what you understood.
- Ask a yes/no question to re-anchor.
IF sentiment = positive/chatty:
- Match energy briefly but stay on task.
- One friendly comment maximum, then redirect.
This single addition reduced our human transfer rate from 23% to 18%. The frustrated callers who previously would have demanded a human were getting handled faster, which resolved their frustration before it escalated.
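Mechanically, this is just a lookup on the sentiment label, appended to the state block before the prompt is assembled. A sketch with illustrative labels:

```typescript
// Map sentiment labels from the audio analysis to extra prompt instructions.
// The labels here are illustrative; use whatever your pipeline actually emits.
const SENTIMENT_INSTRUCTIONS: Record<string, string> = {
  frustrated:
    'Be extra concise, under 15 words if possible. Skip pleasantries. ' +
    'Offer to connect the caller with someone from our team.',
  confused:
    'Slow down. One piece of information at a time. Repeat back what you ' +
    'understood, then ask a yes/no question to re-anchor.',
  positive:
    "Match the caller's energy briefly but stay on task. One friendly " +
    'comment maximum, then redirect.',
};

function sentimentInstructions(sentiment: string): string {
  return SENTIMENT_INSTRUCTIONS[sentiment] ?? '';
}
```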
The Failures Worth Mentioning
The persona trap. We spent two weeks crafting elaborate backstories for our AI personas — hobbies, favorite coffee orders, years of "experience." None of it mattered. Callers don't ask receptionists about their hobbies. Every token spent on backstory was latency we couldn't afford. We stripped it all out.
The politeness overcorrection. After getting feedback that the AI sounded "too robotic," we over-indexed on politeness. The AI started saying things like "I'd be absolutely delighted to help you with that!" on every turn. Three callers in one day asked if they were talking to a bot. Ironically, being too polite was the tell. Real receptionists are friendly but efficient, not performatively enthusiastic.
The temperature disaster. We ran Claude at temperature 0.9 for two days thinking it would sound more "natural." It did — until it started confidently inventing appointment slots that didn't exist and telling one caller that Dr. Patel "usually runs about 10 minutes behind on Tuesdays." Temperature 0.3 is where we live now. Boring is better than wrong.
Key Takeaways
Voice prompts must be ruthlessly short. Every word costs latency and attention. The 25-word ceiling isn't arbitrary — it's data-driven from 10,000+ calls. If your voice AI responses regularly exceed 30 words, you have a prompt problem.
Lead with the answer, always. Context-first responses are a text pattern that fails completely in voice. Callers want the answer in the first 3 seconds, then they'll listen to details.
Inject real-time state into every turn. Treating each turn independently produces conversations that feel like talking to someone with amnesia. The state manager layer is the difference between a demo and a product.
Copy real humans, not chatbots. Transcribe your best human operators. The specific words and micro-patterns they use (affirmations, pacing, question framing) are worth more than any prompt engineering framework.
Measure everything, trust nothing. We thought longer, more detailed prompts would perform better. The data said the opposite. We thought higher temperature would sound more natural. It created hallucinations. Test with real callers, not vibes.
We're planning to open-source our prompt benchmarking tool in the next few weeks (that's the Week 9 article). If you're building voice AI — healthcare or otherwise — and want to compare notes on prompt architecture, we'd love to hear about it. Reach out at hello@autor.ca or visit autor.ca.