
Zain Ul Abideen Rizvi

Originally published at zainulabideenrizvi.hashnode.dev

I Built AI Smart Glasses That Respond in Under 2 Seconds — Here's How

Real-time voice + vision pipeline using Groq, Whisper, and gTTS on a budget


I got tired of watching expensive AI glasses demos that cost $500+ and still have a 5-second lag before they respond. So I built my own — and got the full voice + vision pipeline under 2 seconds end-to-end.

This post covers the exact architecture, the bottlenecks I hit, and what actually made the difference in latency.


What It Does

You put on the glasses, ask a question out loud, and within 2 seconds you get a spoken response — based on both what you said and what the camera sees.

Example: "What's written on this sign?" → glasses see the sign → AI reads it → speaks the answer in your ear.

Or: "Is this a good deal?" → glasses see a price tag → LLM compares context → responds.


The Stack

Component         Tool                                    Why
Speech-to-Text    faster-whisper / Groq Whisper           Speed
Vision LLM        Groq llama-4-scout                      Free tier, fast inference
Text-to-Speech    gTTS                                    Lightweight, no API cost
Deployment        Oracle Cloud Free Tier                  Always-free compute
Hardware          Raspberry Pi + USB camera + earpiece    ~$60 total

The key insight: most of the latency in an AI pipeline comes from the LLM call, and Groq's inference API is the fastest I've tested. Groq runs on LPUs (Language Processing Units) instead of GPUs, which cuts inference time dramatically compared to OpenAI or Gemini.


Architecture Overview

[Microphone]
     ↓
[VAD — Voice Activity Detection]
     ↓
[faster-whisper STT — local]  ← or Groq Whisper API
     ↓
[Frame capture from camera]
     ↓
[Groq llama-4-scout — vision + text input]
     ↓
[gTTS — text to speech]
     ↓
[Earpiece output]

Everything runs on Oracle Cloud Free Tier (ARM instance, 4 cores, 24GB RAM — genuinely free).


Step 1: Speech Detection Without Constant Listening

The first mistake I made was running Whisper on a continuous stream. It's slow and wasteful.

The fix: use Voice Activity Detection (VAD) to only run STT when someone is actually speaking.

import webrtcvad
import pyaudio

vad = webrtcvad.Vad(2)  # aggressiveness 0-3: higher filters out more non-speech

def is_speech(audio_chunk, sample_rate=16000):
    # audio_chunk must be 16-bit mono PCM in a 10, 20, or 30 ms frame;
    # webrtcvad rejects other frame sizes and sample rates besides 8/16/32/48 kHz
    return vad.is_speech(audio_chunk, sample_rate)

This alone saved ~400ms per request by eliminating unnecessary Whisper calls on silence.
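
Later, the pipeline loop calls a record_until_silence() helper built on top of this check. Here's a rough sketch of how it could look; the frame length, silence threshold, and temp file path are assumptions you'd tune for your own mic:

import wave

import pyaudio
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                   # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000
SILENCE_FRAMES = 25                             # ~750ms of quiet ends the utterance (tune this)

vad = webrtcvad.Vad(2)

def record_until_silence(path="/tmp/question.wav"):
    """Record from the default mic until the speaker goes quiet; return a WAV path."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                     input=True, frames_per_buffer=FRAME_SAMPLES)
    frames, silent, started = [], 0, False
    while True:
        chunk = stream.read(FRAME_SAMPLES, exception_on_overflow=False)
        if vad.is_speech(chunk, SAMPLE_RATE):
            started, silent = True, 0
            frames.append(chunk)
        elif started:
            frames.append(chunk)
            silent += 1
            if silent > SILENCE_FRAMES:         # enough trailing silence, stop recording
                break
    stream.stop_stream()
    stream.close()
    pa.terminate()

    with wave.open(path, "wb") as wf:           # 16-bit mono PCM, readable by faster-whisper
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(b"".join(frames))
    return path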


Step 2: Fast Transcription with faster-whisper

faster-whisper is a reimplementation of OpenAI Whisper using CTranslate2. On CPU it's up to 4x faster than the original.

from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(audio_path):
    segments, _ = model.transcribe(audio_path, beam_size=1)
    return " ".join([s.text for s in segments])

Use beam_size=1 for speed. You lose a tiny bit of accuracy, but for conversational input it doesn't matter.

Alternatively, use the Groq Whisper API if you want zero local processing — it's fast and has a generous free tier.
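
If you go that route, the request is a sketch like this (assumptions: Groq's OpenAI-compatible /audio/transcriptions endpoint and the whisper-large-v3-turbo model id; check the Groq docs for the current model list):

import requests

GROQ_API_KEY = "your_groq_api_key"

def transcribe_groq(audio_path: str) -> str:
    # Sketch only: assumes Groq's OpenAI-compatible transcription endpoint
    # and the "whisper-large-v3-turbo" model id; verify both against the docs.
    with open(audio_path, "rb") as f:
        response = requests.post(
            "https://api.groq.com/openai/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
            files={"file": f},
            data={"model": "whisper-large-v3-turbo"},
        )
    return response.json()["text"]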


Step 3: Capturing a Frame at the Right Moment

Don't capture video continuously. Capture one frame at the moment the user finishes speaking.

import cv2

def capture_frame():
    cap = cv2.VideoCapture(0)
    ret, frame = cap.read()
    cap.release()
    if ret:
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
        return buffer.tobytes()
    return None

JPEG quality 70 is the sweet spot — small enough to send fast, clear enough for the LLM to read text and recognize objects.


Step 4: The Vision LLM Call (Groq llama-4-scout)

This is where the magic happens. You send both the transcribed text and the image to the model.

import base64
import requests

GROQ_API_KEY = "your_groq_api_key"

def ask_vision_llm(question: str, image_bytes: bytes) -> str:
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")

    payload = {
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        }
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ],
        "max_tokens": 150  # keep responses short for speed
    }

    response = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
        json=payload
    )

    return response.json()["choices"][0]["message"]["content"]

Critical: Set max_tokens to 150 or less. Longer responses mean longer TTS output. For glasses, short answers are better anyway.
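
Pairing the token cap with an explicit instruction helps even more. A minimal tweak to the payload above, assuming the model accepts a system turn alongside the image:

SYSTEM_PROMPT = (
    "You are a voice assistant running on smart glasses. "
    "Answer in one short spoken sentence."
)

# Prepend a system message to the payload built in ask_vision_llm()
payload["messages"] = [{"role": "system", "content": SYSTEM_PROMPT}] + payload["messages"]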


Step 5: Text to Speech with gTTS

from gtts import gTTS
import pygame

def speak(text: str):
    tts = gTTS(text=text, lang='en', slow=False)
    tts.save("/tmp/response.mp3")

    pygame.mixer.init()
    pygame.mixer.music.load("/tmp/response.mp3")
    pygame.mixer.music.play()

    # Block until playback finishes; the short wait keeps the loop from pegging a CPU core
    while pygame.mixer.music.get_busy():
        pygame.time.wait(50)

gTTS makes an API call to Google's TTS — it's free and sounds natural. The downside is it requires internet. If you want fully offline, use pyttsx3 instead (sounds worse but zero latency from network).
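
For reference, the offline variant is only a few lines. A minimal sketch, assuming the default voice pyttsx3 finds on your platform (espeak on Linux, SAPI5 on Windows):

import pyttsx3

engine = pyttsx3.init()            # picks the platform's built-in speech engine
engine.setProperty("rate", 175)    # words per minute; tune to taste

def speak_offline(text: str):
    engine.say(text)
    engine.runAndWait()            # blocks until playback finishes, no network round trip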


Putting It All Together

import time

def pipeline_loop():
    print("Listening...")

    while True:
        # 1. Detect speech
        audio = record_until_silence()  # implement with VAD above

        # 2. Transcribe
        t1 = time.time()
        question = transcribe(audio)
        print(f"STT: {time.time() - t1:.2f}s — '{question}'")

        # 3. Capture frame
        frame = capture_frame()

        # 4. Ask LLM
        t2 = time.time()
        answer = ask_vision_llm(question, frame)
        print(f"LLM: {time.time() - t2:.2f}s — '{answer}'")

        # 5. Speak
        t3 = time.time()
        speak(answer)
        print(f"TTS: {time.time() - t3:.2f}s")

        print(f"Total: {time.time() - t1:.2f}s")

pipeline_loop()

Latency Breakdown (Real Numbers)

Step                          Time
VAD detection                 ~50ms
faster-whisper (base, CPU)    ~300-500ms
Frame capture                 ~80ms
Groq LLM inference            ~400-700ms
gTTS generation               ~200-300ms
Total                         ~1.0–1.6s

On most requests I hit under 1.5 seconds. The variance mostly comes from Groq API response time under load.


What I Learned

1. The LLM is not your bottleneck — your audio pipeline is.
Most of the latency people struggle with is in how they handle audio. VAD + chunked processing matters more than which LLM you pick.

2. Groq is genuinely fast.
I tested OpenAI GPT-4o, Gemini Flash, and Groq. Groq was consistently 2-3x faster on inference alone.

3. Short answers are better answers.
For a wearable, nobody wants 3 paragraphs read in their ear. Prompt the LLM explicitly: "Answer in one sentence."

4. Oracle Cloud Free Tier is underrated.
4 ARM cores, 24GB RAM, always free. It handles this pipeline with headroom to spare.


What's Next

I'm working on:

  • Replacing gTTS with a faster local TTS model (Kokoro or Coqui)
  • Adding a wake word so the pipeline doesn't run on every sound
  • Streaming the LLM response directly to TTS instead of waiting for the full answer

If you're building something similar or want to collaborate, connect with me:
→ Portfolio: zainulabideen.com
→ GitHub: github.com/zainulabideen041
→ LinkedIn: linkedin.com/in/zainulabideen041


Built with: Python, faster-whisper, Groq API, gTTS, OpenCV, Oracle Cloud

Tags: ai python machinelearning opensource tutorial
