Voice AI for jobsite estimating: a developer perspective
The problem: estimating on muddy ground
Last year, I spent 50 hours interviewing masonry and carpentry crews on active construction sites. The recurring frustration? Estimating is still paper-based or Excel-driven, despite mobile phones being ubiquitous. Why? Because typing on a jobsite sucks. Wind, rain, gloved hands, and attention divided between a trench and a screen create a friction loop that no traditional SaaS can solve.
One crew leader told me: "I have a €50,000 tablet in the truck that I use for photos. For estimates, I use a napkin."
That observation sparked a technical investigation: could voice AI genuinely solve this? Not as a gimmick, but as the foundation of a real product, one now used by 200+ crews daily.
Voice AI on construction sites: the stack that works
Why not use ChatGPT API directly?
The naive approach is tempting: use OpenAI's audio endpoint to transcribe "2 meters oak framing at €45 per meter" into a line item. The problems surface fast:
Acoustic nightmare: jackhammers, concrete saws, and dump trucks create 85-95 dB. Standard speech-to-text (including Whisper) achieves ~75% accuracy at 90 dB. That means 1 in 4 items corrupted.
Domain-specific vocabulary: STT models trained on conversational English mangle construction terminology. "Aggregate base course" becomes "agg brick base horse." Domain adaptation is non-trivial.
Latency: cloud round trips to OpenAI add 800-2000ms. On-site workers expect sub-300ms feedback, the same immediacy they get from typing.
What we built: hybrid on-device + cloud
Our current architecture:
- On-device Whisper model (tiny variant, 39M parameters) runs on iOS/Android for local transcription. Accuracy drops ~5% vs. the large model, but we gain:
  - Zero privacy concerns (voice never leaves the phone)
  - Sub-200ms latency
  - Works offline
- Domain adapter layer: a ~100-line PyTorch module fine-tuned on 12,000 real jobsite estimates. It maps "oak frame" → the wood-framing SKU and "45 euro per meter" → a parsed rate. This alone improved accuracy from 71% to 89%.
- Cloud fallback (Firebase Functions, Python) handles ambiguous items asynchronously. If the on-device model returns confidence < 0.75, we send the item to Whisper-large plus Claude-3-sonnet for semantic validation.
- Streaming UI (React Native) shows real-time transcription and parsed line items; the user can tap-to-correct before saving.
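The routing decision behind the cloud fallback fits in a few lines. Here is a minimal sketch; `LineItem` and `route` are illustrative names for this post, not the shipped API:

```python
from dataclasses import dataclass

CLOUD_FALLBACK_THRESHOLD = 0.75  # below this, re-validate asynchronously

@dataclass
class LineItem:
    raw_text: str      # on-device transcription
    parsed: dict       # adapter output: SKU, quantity, rate
    confidence: float  # adapter confidence in [0, 1]

def route(item: LineItem) -> str:
    """Accept confident parses locally; defer the rest to the cloud."""
    if item.confidence >= CLOUD_FALLBACK_THRESHOLD:
        return "accept"        # shown immediately in the streaming UI
    return "cloud_validate"    # queued for Whisper-large + LLM validation

# A clean parse is accepted on-device; a noisy one is deferred.
clean = LineItem("2 meters oak framing at 45", {"sku": "WOOD-FRM"}, 0.91)
noisy = LineItem("agg brick base horse", {}, 0.42)
```

The key property is that the happy path never waits on the network: the UI renders the local parse instantly, and cloud validation patches the item later if needed.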
Practical lessons from 50+ deployed jobsites
Lesson 1: Context matters more than accuracy
A mason saying "one bag of mortar" should parse to a specific mortar type from that crew's historical material list. We built a lightweight context module that pre-loads the user's past 30 estimates and recent material preferences as a system prompt to the validation layer. Accuracy jumped 12% just from this.
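The context module is conceptually simple: condense recent estimates into a short system prompt for the validation layer. A minimal sketch, assuming a simplified estimate schema (the `item`/`sku` field names are hypothetical):

```python
def build_context_prompt(past_estimates: list[dict], limit: int = 30) -> str:
    """Condense a crew's recent estimates into a short system prompt.

    Assumes each estimate is a {"item": spoken phrase, "sku": catalog id}
    dict, most recent first; the real schema is richer than this.
    """
    prefs: dict[str, str] = {}
    for est in past_estimates[:limit]:
        prefs.setdefault(est["item"], est["sku"])  # keep the most recent mapping
    lines = [f'- "{item}" usually means SKU {sku}' for item, sku in prefs.items()]
    return "Crew material preferences:\n" + "\n".join(lines)

history = [
    {"item": "bag of mortar", "sku": "MORT-N-25KG"},
    {"item": "oak frame", "sku": "WOOD-FRM-OAK"},
    {"item": "bag of mortar", "sku": "MORT-N-25KG"},  # repeat, deduplicated
]
prompt = build_context_prompt(history)
```

Because the prompt stays small (a few dozen lines at most), it adds negligible cost to the validation call while anchoring ambiguous phrases to that crew's actual catalog.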
Lesson 2: Confidence scoring is non-negotiable
Users stopped trusting voice input when erroneous items silently corrupted estimates. We now badge every line item with a confidence score (0-100%) in the UI. Items < 65% trigger a "please review" flag. This transparency cut post-estimate corrections by 60%.
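The review flag is just a threshold check over the same confidence score the adapter already produces. A minimal sketch (the field names are illustrative):

```python
REVIEW_THRESHOLD = 65  # percent; below this, the UI shows "please review"

def badge(confidence_pct: float) -> dict:
    """Attach the confidence badge and review flag shown on each line item."""
    return {
        "confidence": round(confidence_pct),              # 0-100 badge in the UI
        "needs_review": confidence_pct < REVIEW_THRESHOLD,
    }
```

The implementation is trivial; the product decision to surface it everywhere is what rebuilt trust.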
Lesson 3: Ambient noise requires multi-modal input
On truly loud sites (pile driving, demolition), we added visual cues: users can point the phone camera at material piles, and our model identifies item count/size using a fine-tuned YOLOv8 detector. Combining voice + vision raised accuracy to 94% even in 95 dB environments.
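One simple way to combine the two modalities is late fusion of per-item confidences. The rule below is a toy sketch for illustration, not our production model:

```python
from typing import Optional

def fuse(voice_conf: float, vision_conf: Optional[float], agree: bool) -> float:
    """Late fusion of voice and vision confidence for one line item.

    Toy rule: agreeing modalities compound (noisy-OR); disagreeing
    modalities fall back to the more pessimistic score.
    """
    if vision_conf is None:  # no camera input for this item
        return voice_conf
    if agree:
        # Treat the two signals as independent agreeing evidence.
        return 1 - (1 - voice_conf) * (1 - vision_conf)
    return min(voice_conf, vision_conf)
```

For example, a 0.7-confidence voice parse corroborated by a 0.8-confidence detection fuses to 0.94, while a disagreement drags the item below the review threshold, which is exactly the behavior you want on a loud site.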
Lesson 4: The business case doesn't live in accuracy alone
One general contractor reported: "I used to spend 1.5 hours writing estimates by hand. Voice + AI cuts it to 15 minutes, even with errors fixed. I just go faster." The product solved the speed problem before it solved the accuracy problem. Developer teams often get this backwards.
Technical stack snapshot
For those interested in building something similar:
| Component | Tech | Why |
|---|---|---|
| On-device STT | Whisper tiny + CoreML | 39M params, 50ms latency on A15 |
| Domain adapter | PyTorch Lightning | 12K estimate dataset, 4-layer MLM |
| Validation | Claude-3-sonnet (vision) | Handles ambiguity, cost-effective at ~€0.001 per estimate |
| UI | React Native + Expo | iOS/Android parity, offline-first sync |
| Sync engine | Firebase Realtime DB + Cloud Functions | Handles 200+ concurrent crew sessions |
| Analytics | PostHog | Understand where accuracy breaks down |
The total latency from "start recording" to "parsed estimate visible" is ~450ms (50ms local + 150ms adapter + 250ms UI render). Fast enough that crews don't wait.
What this taught us about AI in construction
AI in trades is not ChatGPT + hype. It's domain adaptation, multi-modal input, and honest confidence scoring. Crews test you. They'll abandon a tool that burns 30 minutes chasing phantom line items.
Offline-first is a feature, not a limitation. Construction sites have spotty connectivity. The teams that win build smart local processing first, cloud validation second.
Voice alone isn't the answer. Camera, GPS, accelerometer data all improve context. The best construction AI is sensory fusion.
Developer experience matters as much as user experience. If your API requires 5 auth layers and custom domain training, teams won't adopt it. We built Anodos with a single API key and pre-trained models for common material types.
The road ahead
We're currently exploring:
- Real-time material cost indexing (linking on-site estimates to live supplier APIs)
- Crew handwriting recognition (photos of paper notes parsed into line items)
- Multimodal safety checks (detecting OSHA violations in photos, flagging risky estimates)
The goal is simple: make estimating so frictionless that a crew chief can produce a professional, legally compliant estimate while standing on the jobsite, without a truck trip to the office.
How this fits into your stack
If you're building construction software, whether scheduling, invoicing (Factur-X 2026 compliance?), or risk management, voice AI for intake is your secret weapon. It solves the cold-start problem in every workflow: getting the data in.
We've open-sourced our domain adapter at github.com/anodos/estimator-adapter (MIT license). Use it as a baseline, fine-tune on your own estimates, and reduce voice-to-structured-data latency in your own products.
Olivier Ebrahim is founder of Anodos, a jobsite management platform for construction SMBs. Previously, ML engineer at Mistral. Obsessed with reducing friction in the trades.