Voice AI for Jobsite Estimating: A Developer Perspective

When I started working on voice-driven estimating for construction SMBs, I thought the hard part would be the AI. It wasn't.

The hard part was understanding that a mason on a muddy jobsite doesn't want to pull out a tablet and type. He wants to speak into his phone and have his estimate ready—without losing accuracy or forgetting critical details. That insight shaped everything about how we built voice-to-estimate at Anodos.

This post walks through the technical and UX decisions we made, so that if you're building voice interfaces for blue-collar workflows, you don't repeat our mistakes.

The Deceptive Simplicity of "Just Use STT"

The obvious move: pipe jobsite audio through a speech-to-text API (Google Cloud Speech, Azure, Whisper) and pray it understands construction vocabulary.
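
For reference, the naive version really is only a few lines with the open-source Whisper package (a sketch; the model size and file name are placeholders):

```python
import whisper  # pip install openai-whisper

# Load a general-purpose model and transcribe a jobsite recording.
model = whisper.load_model("base")
result = model.transcribe("jobsite_recording.wav")
print(result["text"])  # hope it got "2 by 4 studs" right
```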

Reality check: A mason saying "2 by 4 studs" might get transcribed as "2 by for studs" or "to by 4 studs." Drywall joint compound becomes "drywall jownt konpownd." And when he's yelling over a saw, accuracy tanks.

Our first prototype used raw Google Cloud Speech. On clean office audio, it hit 95% accuracy (a 5% word error rate, or WER). On jobsites with ambient noise ≥70dB, accuracy dropped to 68%. That's not usable: a ~30% error rate means the estimator is babysitting every transcription instead of moving faster.

What we did instead:

  1. Domain-specific language model training: We collected ~500 hours of actual jobsite audio (with permission) and fine-tuned a smaller Whisper model on construction vocabulary and accent patterns. Accuracy improved to 87% in noisy conditions: still imperfect, but human-correctable.

  2. Contextual slots, not free-form speech: Instead of "tell me everything about this wall," we guide the user: "How many linear meters? ... What material? ... What finish?" Each question gets its own audio segment, so the STT system has tighter bounds and lower error surface.

  3. Confidence scoring with feedback loops: If confidence < 0.80, we ask for confirmation. "Did you say 'oak trim,' or 'awl trim'?" Humans are fast at saying "yes" vs. "no." (A minimal sketch of this gate follows the list.)
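
Here's that confidence gate as a minimal sketch. Only the 0.80 threshold and the yes/no fallback come from the approach above; `stt_transcribe`, `confirm_yes_no`, and `record_segment` are hypothetical stand-ins for the real STT and voice-prompt calls.

```python
CONFIDENCE_THRESHOLD = 0.80

def stt_transcribe(audio: bytes) -> tuple[str, float]:
    """Placeholder: return (text, confidence) from the STT backend."""
    raise NotImplementedError

def confirm_yes_no(question: str) -> bool:
    """Placeholder: speak the question, listen for a yes/no answer."""
    raise NotImplementedError

def record_segment(prompt: str) -> bytes:
    """Placeholder: ask one guided question, record one short answer."""
    raise NotImplementedError

def fill_slot(prompt: str) -> str:
    """Ask one guided question until we have a transcription we trust."""
    while True:
        text, confidence = stt_transcribe(record_segment(prompt))
        if confidence >= CONFIDENCE_THRESHOLD:
            return text
        # Low confidence: a yes/no check is far easier to recognize
        # than re-parsing free-form speech.
        if confirm_yes_no(f"Did you say '{text}'?"):
            return text
```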

The Real Problem: Mapping Speech to Structured Data

You've got clean transcription. Now what?

The pipeline looks like:

```
Audio → STT → Text → NLU → Slots → Estimate Calculation
```

Most teams focus on the STT→Text part. We learned the bottleneck is NLU (natural language understanding) to Slots (structured data your DB needs).

Example: User says "Two 4-meter sections, wall finish E-101."

Your NLU needs to parse:

  • Quantity: "Two" = 2
  • Unit: "4-meter sections" = 4m per section
  • Material code: "E-101" (might be misheard as "E one-oh-one", "E one zero one", or the French "E cent un")

We built a hybrid approach (a sketch follows the list):

  • Regex + gazetteer: Catch known material codes and unit patterns upfront. If the user says a valid SKU (even partially), grab it and score high.
  • LLM fallback (Claude / GPT-4o mini): For ambiguous cases, ask the LLM to fill missing slots or reconcile conflicts. "User said 2 studs at 4m, confirm that's 8m total?" Then return a JSON confirmation payload.
  • Fast path for known patterns: Cache the 20 most common phrase patterns (e.g., "X linear meters of [material]") as templates. Regex them first before calling any model.
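
A rough sketch of that fast-path/fallback split. The gazetteer values, the single cached template, and the `llm_fill_slots` helper are illustrative; the production pattern set is larger.

```python
import re

# Example gazetteer of known material codes (values illustrative).
MATERIAL_CODES = {"E-101", "E-102", "M-204"}

# One cached template: "X linear meters of [material]".
LINEAR_METERS = re.compile(
    r"(?P<qty>\d+(?:\.\d+)?)\s+linear\s+meters?\s+of\s+(?P<material>\S+)",
    re.IGNORECASE,
)

def llm_fill_slots(text: str) -> dict:
    """Placeholder: LLM fallback, ~500ms, hit on ~20% of utterances."""
    raise NotImplementedError

def extract_slots(text: str) -> dict:
    # Fast path: regex + gazetteer, <300ms for the ~80% common case.
    m = LINEAR_METERS.search(text)
    if m:
        material = m.group("material").upper()
        return {
            "quantity": float(m.group("qty")),
            "unit": "linear_meter",
            "material": material,
            # Known SKUs score high; anything else gets flagged.
            "confidence": 0.95 if material in MATERIAL_CODES else 0.60,
        }
    # Ambiguous case: let the LLM fill missing slots or reconcile.
    return llm_fill_slots(text)
```

With this split, `extract_slots("50 linear meters of E-101")` returns a high-confidence payload without touching a model; anything the templates miss pays the LLM round trip.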

On a jobsite, latency matters. Our target was <2 seconds from audio-end to "confirm these details" prompt. The LLM fallback adds ~500ms in the worst case, but we only hit it ~20% of the time. The 80% common-case runs in <300ms via regex + templates.

Handling Accents and Jargon

Construction has regional dialects and insider language.

A "poteau" (post) in Quebec is a "pilier" in France. "Dalle" means concrete slab in French but is colloquial in some regions. A Toronto framer says "sill plate"; a London carpenter says "wall plate."

Our approach:

  1. Localization at init: When the user sets their region (Canada East, France, UK, etc.), we inject a region-specific vocabulary boost into the STT model and the NLU gazetteer.

  2. User vocabulary personalization: After 10 estimates, we log which terms the user repeats and their manual corrections. "User corrects 'poteau' STT error 3 times → boost 'poteau' in their personal model." Simple, effective (see the sketch after this list).

  3. Community vocabulary: When we see 10+ users in the same region all correcting the same phrase the same way, we add it to the public regional model (with anonymization, obviously).
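
A minimal sketch of the correction bookkeeping behind points 2 and 3. The thresholds come from the text; `boost_user_term` and `promote_to_regional_model` are hypothetical hooks into the STT bias lists.

```python
from collections import Counter, defaultdict

PERSONAL_THRESHOLD = 3    # same fix 3 times -> personal vocabulary boost
COMMUNITY_THRESHOLD = 10  # 10+ users, same fix -> regional model update

personal = defaultdict(Counter)  # user_id -> Counter[(heard, fixed)]
community = defaultdict(set)     # (region, heard, fixed) -> {user_id}

def boost_user_term(user_id: str, term: str) -> None: ...  # placeholder hook
def promote_to_regional_model(region: str, heard: str, fixed: str) -> None: ...  # placeholder hook

def record_correction(user_id: str, region: str, heard: str, fixed: str) -> None:
    personal[user_id][(heard, fixed)] += 1
    if personal[user_id][(heard, fixed)] == PERSONAL_THRESHOLD:
        boost_user_term(user_id, fixed)

    community[(region, heard, fixed)].add(user_id)
    if len(community[(region, heard, fixed)]) == COMMUNITY_THRESHOLD:
        promote_to_regional_model(region, heard, fixed)  # anonymized upstream
```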

This is lightweight ML, but it compounds. After 6 months, a user's voice profile is 12% more accurate than when they started.

The Gotchas: Offline, Liability, and User Friction

Offline Capability

Jobsites often have patchy connectivity. You can't block estimating on a cell signal.

We built a lightweight fallback STT using Vosk (open-source offline speech recognition, ~50MB model). Offline accuracy is around 60%, but paired with our contextual slot-filling, it's good enough for "confirm or correct." Users see:

  • Online (4G/5G): full accuracy, 2–3 second end-to-end
  • Offline: degraded accuracy, user corrects manually, then sync when signal returns

The offline cache also includes the user's personal vocabulary and material codebook, so even in a dead zone, they can estimate using their own jargon.
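
If you haven't used Vosk, the core loop is small. A minimal sketch, assuming a 16kHz mono WAV and a downloaded small-model directory (both paths are placeholders):

```python
import json
import wave

from vosk import Model, KaldiRecognizer  # pip install vosk

def transcribe_offline(wav_path: str, model_dir: str = "vosk-model-small") -> str:
    model = Model(model_dir)  # the ~50MB small model
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)  # feed raw PCM frames
    wf.close()
    return json.loads(rec.FinalResult()).get("text", "")
```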

Liability and Audit Trail

Estimates are legally binding. If an estimate is wrong, who's liable?

We store:

  • Raw audio (30-day retention, encrypted)
  • Transcription with confidence scores per segment
  • All user corrections (timestamp + edit)
  • Final estimate (signed by user with their digital signature)

If a dispute arises, we can replay the audio, show which parts had low confidence, and which corrections the user made. This protects both us and the user.
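
Concretely, each estimate carries a record shaped roughly like this (field names are illustrative, not our actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    estimate_id: str
    audio_uri: str  # encrypted blob, 30-day retention
    segments: list[dict] = field(default_factory=list)
    # each segment: {"text": ..., "confidence": ..., "start_ms": ...}
    corrections: list[dict] = field(default_factory=list)
    # each correction: {"timestamp": ..., "before": ..., "after": ...}
    final_estimate: dict = field(default_factory=dict)
    signature: str = ""  # user's digital signature over the final estimate
```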

Combating User Abandonment

Early pilots showed users loved voice for quick checks but abandoned it for "real" estimates because they didn't trust something produced that fast.

Fix: We added a verification mode where the system reads back the estimate in plain language. "Estimate: 50 linear meters of maple trim, finish E-101, $2400. Confirm?" If they say yes, it locks the estimate. If they say "no," it goes back to manual entry. Confidence in voice jumped from 40% to 85% when users heard their own data back to them.
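
The read-back itself is just string formatting plus text-to-speech. A tiny sketch; `speak`, `listen_yes_no`, and `lock_estimate` are hypothetical helpers:

```python
def speak(text: str) -> None: ...               # placeholder TTS call
def listen_yes_no() -> bool: ...                # placeholder yes/no recognizer
def lock_estimate(estimate: dict) -> None: ...  # placeholder: freeze and sign

def read_back(estimate: dict) -> bool:
    """Speak the estimate back; lock it only on an explicit 'yes'."""
    speak(
        f"Estimate: {estimate['quantity']} linear meters of "
        f"{estimate['material']}, finish {estimate['finish']}, "
        f"${estimate['total']}. Confirm?"
    )
    if listen_yes_no():
        lock_estimate(estimate)
        return True
    return False  # user said no -> back to manual entry
```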

Practical Lessons

  1. Domain-specific beats general. A fine-tuned smaller model beats a raw general large model. Costs less too.

  2. Confidence scores are your friend. Don't hide uncertainty—surface it and ask. Humans are happy to confirm.

  3. Contextual slots reduce error surface. Free-form speech is hard. Guided questions with small answer sets are much easier to STT and NLU.

  4. Personalization matters fast. After 20 estimates, a user's accuracy improves 8–12% because you've learned their accent and jargon. Communicate this.

  5. Test with actual users on actual jobsites. Office testing is misleading. Noise, interruptions, and cognitive load are real on a jobsite. Simulate them early.

  6. Offline fallback is not optional. Connectivity on jobsites is fragile. Have a degraded-but-functional path.

What's Next

We're exploring visual confirmation (after voice estimate, auto-snap a photo of the wall/element being estimated and overlay the estimate details). It's a UX win: users feel more confident because the system has "seen" what they're describing.

Also experimenting with post-estimate learning: if an estimator's voice estimate consistently differs from a human auditor's final measurement, we flag it for personalized coaching. "Your voice estimates run 5% high on drywall—want tips on how to phrase measurements?"

Voice AI in construction isn't about replacing estimators. It's about letting them move faster on the jobsite and spend less time hunched over a laptop afterward. If you're building tools for the trades, start there.


Olivier Ebrahim, founder of Anodos

Anodos helps SMB construction teams digitize their workflow: real-time jobsite management, voice-driven estimating with AI, Factur-X 2026 invoicing, and GPS-verified crew planning. Built for the trades, by people who've spent time on jobsites.
