Voice AI for Jobsite Estimating: A Developer Perspective
The Problem: Why Builders Still Use Pencil and Paper
Walk onto a French construction site and you'll see something that hasn't changed in 40 years: the foreman scribbling measurements into a notebook, then manually typing them into a spreadsheet back at the office. Between the noise of machinery, the gloves that prevent typing, and the cognitive load of context-switching, voice-based input seems like the obvious solution. Yet voice AI for construction remains stuck in proof-of-concept hell.
Why? Because building AI that understands construction vocabulary, handles acoustic chaos, and integrates into real workflows is genuinely harder than chatbots or virtual assistants. This article walks through the technical challenges we've encountered building voice estimating features for jobsites across France—and what actually works.
Challenge 1: Acoustic Environment & Domain Vocabularies
The Noise Problem
A typical jobsite is 85-95 dB. Pneumatic drills, circular saws, concrete vibrators, and mixing trucks create an acoustic environment that standard speech-to-text models struggle with. Commercial APIs like Google Cloud Speech-to-Text and Azure Speech Services do okay, but "prise de courant" becomes "prise du courant," and critical measurements get mangled.
What we tried first (wrong):
- Deployed off-the-shelf English-trained models → failed on French construction terminology
- Assumed Whisper would handle noise better → it did, but latency was 3-5s, making it unusable on the jobsite
- Used local noise cancellation → removed too much useful audio, broke word boundary detection
What actually works:
- Pre-filtering: Use FFT-based noise gate to suppress frequencies below 500Hz (most construction machinery) before sending to speech-to-text
- Custom vocabulary: Train a small domain-specific model layer on top of the main model. Our custom layer adds ~200 construction terms (dormant, tableau électrique, linteau, IPN, etc.)
- Confidence thresholding + human loop: If confidence < 0.72, replay audio to the user with highlighted uncertain phrases. Let them confirm or re-record.
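As a rough sketch of the pre-filtering step, here is a hard FFT gate over a mono audio chunk. The function name `highpass_noise_gate` and the hard-zeroing of bins are illustrative, not our production filter — a real implementation would attenuate rather than zero bins to avoid ringing artifacts:

```python
import numpy as np

def highpass_noise_gate(samples: np.ndarray, sample_rate: int,
                        cutoff_hz: float = 500.0) -> np.ndarray:
    """Suppress frequency content below cutoff_hz (low-rumble machinery)
    in one audio chunk before it is sent to speech-to-text."""
    spectrum = np.fft.rfft(samples)
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0.0   # hard gate: zero everything below cutoff
    return np.fft.irfft(spectrum, n=len(samples))
```

Run on each buffered chunk (e.g. 16 kHz mono) before the network call; the gate costs one forward and one inverse FFT per chunk.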
The Vocabulary Gap
Standard speech-to-text models are trained on news, podcasts, and conversations. They have never heard "dormant" (the fixed frame of a door or window) or "IPN" (a steel I-beam profile), so they either ignore these terms or substitute homophones. Fine-tuning a production model is expensive, so we built a lightweight correction layer:
```text
Input (raw ASR):    "We need three IPNs and a dormant"
                     ↓ per-token recognition confidence scores
Confidence < 0.68 on "IPNs" → triggers correction
Lookup in construction dictionary → high-confidence match
Output (corrected): "We need three IPNs and a dormant" ✓
```
This layer runs locally and adds <50ms latency. It's a 4MB model trained on 50k annotated construction transcripts.
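A minimal sketch of the confidence-gated dictionary lookup, using stdlib fuzzy matching in place of our trained correction model (the function name `correct_token` and the four-term dictionary excerpt are illustrative):

```python
import difflib

# Illustrative excerpt of the ~200-term construction dictionary
CONSTRUCTION_TERMS = ["IPN", "dormant", "linteau", "tableau électrique"]
_LOWER = [t.lower() for t in CONSTRUCTION_TERMS]

def correct_token(token: str, confidence: float,
                  threshold: float = 0.68) -> str:
    """Snap a low-confidence token to the closest dictionary entry;
    high-confidence tokens pass through untouched."""
    if confidence >= threshold:
        return token
    match = difflib.get_close_matches(token.lower(), _LOWER, n=1, cutoff=0.6)
    if match:
        return CONSTRUCTION_TERMS[_LOWER.index(match[0])]  # canonical casing
    return token
```

The trained 4MB model replaces `difflib` with learned edit costs, but the control flow — gate on confidence, then substitute from the domain dictionary — is the same.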
Challenge 2: Real-Time Constraints on Mobile
Latency Budget
A foreman speaking naturally pauses ~0.5 seconds between thoughts. If your speech-to-text + processing + UI feedback takes >1.2s, the user perceives lag and stops using the feature. This is hard.
The latency breakdown (ms):
- Audio capture & buffering: 100ms (audio chunk size)
- Network request: 150-300ms (depending on cellular signal—jobsites are often in rural areas)
- Speech-to-text processing: 400-800ms
- Domain correction layer: 50ms
- JSON parsing + UI update: 30ms
- Total: 730-1280ms
At the high end, this is already over budget.
Our solution: Progressive transcription
We use the Google Cloud Speech-to-Text streaming API (not batch), which returns partial results as words arrive:
```json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "We need three",
          "confidence": 0.87
        }
      ],
      "isFinal": false
    }
  ]
}
```
Display partial results to the user immediately (even if incomplete). When isFinal: true, apply the domain correction layer and lock the transcript. Result: perceived latency drops to ~400ms because the user sees something happening in real-time, even if the final text arrives a bit later.
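A minimal consumer for these streaming payloads might look like this. The callback names `show`, `lock`, and `correct` are placeholders, not part of the Google SDK — the sketch operates on plain dicts shaped like the response above:

```python
def handle_streaming_responses(responses, show, lock, correct):
    """Display partial transcripts immediately; apply the domain
    correction layer only once a result is final.

    responses: iterable of dicts shaped like the streaming payload above
    show:      UI callback for partial (unstable) transcripts
    lock:      callback that commits the corrected final transcript
    correct:   the domain correction layer
    """
    for response in responses:
        for result in response.get("results", []):
            best = result["alternatives"][0]
            if result.get("isFinal"):
                lock(correct(best["transcript"]))
            else:
                show(best["transcript"])  # user sees words as they arrive
```

The key design choice: the expensive correction layer runs only on final results, so partials reach the screen with zero added latency.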
Offline Fallback
Rural jobsites often lose signal. We cache the domain dictionary locally (4MB) and fall back to the device's native speech-to-text (iOS speech framework, Android's SpeechRecognizer API) when offline. Recognition quality drops to ~70%, but users can still record voice notes that sync back to the server when connectivity returns.
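The fallback-and-sync logic can be sketched as follows. Class and callback names are illustrative; in the real app the native recognizers sit behind platform bridges rather than plain callables:

```python
import queue
import time

class VoiceNoteRecorder:
    """Try the cloud recognizer first; when offline, fall back to
    on-device speech-to-text and queue the note for later sync."""

    def __init__(self, cloud_stt, device_stt, is_online):
        self.cloud_stt = cloud_stt    # callable: audio -> transcript
        self.device_stt = device_stt  # native fallback (iOS Speech / SpeechRecognizer)
        self.is_online = is_online    # callable: () -> bool
        self.pending = queue.Queue()  # notes awaiting server sync

    def record(self, audio_chunk):
        if self.is_online():
            return self.cloud_stt(audio_chunk)
        note = {"audio": audio_chunk,
                "transcript": self.device_stt(audio_chunk),
                "ts": time.time(),
                "source": "device"}
        self.pending.put(note)        # sync when connectivity returns
        return note["transcript"]

    def flush(self, upload):
        """Drain the queue once connectivity is back."""
        while not self.pending.empty():
            upload(self.pending.get())
```

Queuing the raw audio alongside the lower-quality device transcript means the server can re-run the better recognizer once the note syncs.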
Challenge 3: Integration Into Estimating Workflows
From Voice to Structured Data
A voice transcript is narrative. An estimate is structured. Converting "We need a 4.5-meter wooden beam in oak, pine, or walnut, but oak is preferred because it's more durable on this north-facing wall" into a structured line item requires NLU (natural language understanding) and domain reasoning.
The pipeline:
1. Entity extraction (custom BERT-small model, ~40M params):
   - Material: oak, pine, walnut
   - Quantity: 4.5 meters
   - Purpose: wooden beam
   - Context: north-facing wall
   - Preference: oak preferred
2. Lookup pricing (vector search into product catalog):
   - Query embeddings for "oak beam 4.5m" → return matching SKUs + price ranges
3. Generate JSON (template + LLM):
```json
{
  "item": "Wooden Beam",
  "material": "oak",
  "length_m": 4.5,
  "unit_price_eur": 85.50,
  "quantity": 1,
  "total_eur": 384.75,
  "notes": "North-facing wall, alternative materials: pine (€62), walnut (€120)"
}
```
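The assembly of that line item from the extracted entities can be sketched like this. `build_line_item` and `price_lookup` are hypothetical names, and pricing is assumed to be per metre (consistent with 4.5 m × €85.50 = €384.75 above):

```python
def build_line_item(entities, price_lookup):
    """Turn extracted entities into an estimate line item.
    price_lookup(material, purpose) is assumed to return EUR per metre."""
    unit_price = price_lookup(entities["material"], entities["purpose"])
    length = entities["length_m"]
    qty = entities.get("quantity", 1)
    return {
        "item": entities["purpose"].title(),
        "material": entities["material"],
        "length_m": length,
        "unit_price_eur": unit_price,
        "quantity": qty,
        "total_eur": round(unit_price * length * qty, 2),
        "notes": entities.get("notes", ""),
    }
```

This is the template half of "template + LLM": the numbers come from deterministic code, and the LLM only fills the free-text notes, so a hallucinated price can never reach the total.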
The human (usually the foreman or office admin) confirms or edits the extracted line item before it locks into the estimate. This human-in-the-loop approach prevents hallucinations and keeps accuracy >95%.
Why not end-to-end LLM? Because LLMs are slow (1-3s for inference on device) and expensive (€0.002-0.01 per call on cloud). We reserve LLMs for the final step (generating notes and alternatives) where quality matters more than speed.
The Broader Picture: What Builders Actually Want
Beyond the technical details, the biggest lesson: builders don't want magic, they want reliability and speed. A voice feature that works 85% of the time and requires 15% manual correction is infinitely better than a perfect feature that takes 5 seconds per input.
Here's what drives adoption:
- Hands-free input (the glove problem)
- Faster than typing (2-3x speedup, measured)
- Audit trail (original audio saved, regulatory requirement in France)
- Offline robustness (fallback must work)
The voice AI is just the tool—the real win is rethinking the workflow to fit how builders actually work on jobsites.
Next Steps for Builders & Developers
If you're building for construction, here are the takeaways:
- Test on-site, not in labs. Noise simulation != real jobsite.
- Invest in domain vocabulary. Generic models will fail you.
- Latency beats accuracy. A 70% correct, 0.5s response beats 95% correct, 2s response.
- Human-in-the-loop, not fully automated. People catch errors faster than you iterate.
For developers exploring voice + AI in regulated industries (construction, healthcare, finance), the patterns here (domain correction, confidence thresholding, offline fallback, human verification) transfer well.
At Anodos, we're using this architecture in production for French construction SMBs. If you're curious about how voice estimating integrates into the broader jobsite management stack—GPS-based time tracking, photo-based punch-list (réserves) tracking, Factur-X invoicing—reach out or visit our site.
Olivier Ebrahim
Founder of Anodos — jobsite management for French construction SMBs.
Questions or corrections? Drop them in the comments. I'll be monitoring.