# Real-time voice + vision pipeline using Groq, Whisper, and gTTS on a budget
I got tired of watching expensive AI glasses demos that cost $500+ and still have a 5-second lag before they respond. So I built my own — and got the full voice + vision pipeline under 2 seconds end-to-end.
This post covers the exact architecture, the bottlenecks I hit, and what actually made the difference in latency.
## What It Does
You put on the glasses, ask a question out loud, and within 2 seconds you get a spoken response — based on both what you said and what the camera sees.
Example: "What's written on this sign?" → glasses see the sign → AI reads it → speaks the answer in your ear.
Or: "Is this a good deal?" → glasses see a price tag → LLM compares context → responds.
## The Stack
| Component | Tool | Why |
|---|---|---|
| Speech-to-Text | faster-whisper / Groq Whisper | Speed |
| Vision LLM | Groq llama-4-scout | Free tier, fast inference |
| Text-to-Speech | gTTS | Lightweight, no API cost |
| Deployment | Oracle Cloud Free Tier | Always-free compute |
| Hardware | Raspberry Pi + USB camera + earpiece | ~$60 total |
The key insight: Groq's inference API is among the fastest available right now. Most latency problems in AI pipelines come from the LLM call, and Groq runs on LPUs (Language Processing Units) instead of GPUs, which cut inference time dramatically compared to OpenAI or Gemini in my testing.
## Architecture Overview
```
[Microphone]
      ↓
[VAD — Voice Activity Detection]
      ↓
[faster-whisper STT — local] ← or Groq Whisper API
      ↓
[Frame capture from camera]
      ↓
[Groq llama-4-scout — vision + text input]
      ↓
[gTTS — text to speech]
      ↓
[Earpiece output]
```
Everything runs on Oracle Cloud Free Tier (ARM instance, 4 cores, 24GB RAM — genuinely free).
## Step 1: Speech Detection Without Constant Listening
The first mistake I made was running Whisper on a continuous stream. It's slow and wasteful.
The fix: use Voice Activity Detection (VAD) to only run STT when someone is actually speaking.
```python
import webrtcvad
import pyaudio

vad = webrtcvad.Vad(2)  # aggressiveness 0-3

def is_speech(audio_chunk, sample_rate=16000):
    return vad.is_speech(audio_chunk, sample_rate)
```
This alone saved ~400ms per request by eliminating unnecessary Whisper calls on silence.
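The `record_until_silence` helper used later in the pipeline loop isn't spelled out in the post, so here's a minimal sketch of its core logic under my own assumptions (30 ms frames, a ~0.75 s silence cutoff). The VAD is passed in as a callable so the buffering logic stays independent of the microphone:

```python
def record_until_silence(frames, is_speech, silence_frames=25):
    """Buffer PCM frames from `frames` (any iterator of byte chunks).
    Recording starts at the first frame the VAD flags as speech and stops
    after `silence_frames` consecutive non-speech frames (~0.75 s at 30 ms).
    """
    voiced = bytearray()
    triggered = False
    silent = 0
    for frame in frames:
        if is_speech(frame):
            triggered = True
            silent = 0
            voiced.extend(frame)
        elif triggered:
            silent += 1
            voiced.extend(frame)  # keep trailing audio so words aren't clipped
            if silent >= silence_frames:
                break
    return bytes(voiced)
```

Wiring it to the VAD above would look like `record_until_silence(mic_frames(), lambda f: vad.is_speech(f, 16000))`, where `mic_frames()` is a hypothetical generator yielding 30 ms chunks from PyAudio.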
## Step 2: Fast Transcription with faster-whisper
faster-whisper is a reimplementation of OpenAI's Whisper using CTranslate2. On CPU it's up to 4x faster than the original for the same accuracy.
```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(audio_path):
    segments, _ = model.transcribe(audio_path, beam_size=1)
    return " ".join([s.text for s in segments])
```
Use beam_size=1 for speed. You lose a tiny bit of accuracy, but for conversational input it doesn't matter.
Alternatively, use the Groq Whisper API if you want zero local processing — it's fast and has a generous free tier.
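If you go the API route, the call looks roughly like this. This is a sketch against Groq's OpenAI-compatible transcription endpoint; `whisper-large-v3` is the model name at the time of writing, so check Groq's docs before relying on it:

```python
import requests

GROQ_API_KEY = "your_groq_api_key"

def transcribe_groq(audio_path: str) -> str:
    """Send an audio file to Groq's hosted Whisper and return the text."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.groq.com/openai/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
            files={"file": f},
            data={"model": "whisper-large-v3"},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["text"]
```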
## Step 3: Capturing a Frame at the Right Moment
Don't capture video continuously. Capture one frame at the moment the user finishes speaking.
```python
import cv2

def capture_frame():
    cap = cv2.VideoCapture(0)
    ret, frame = cap.read()
    cap.release()
    if ret:
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
        return buffer.tobytes()
    return None
```
JPEG quality 70 is the sweet spot — small enough to send fast, clear enough for the LLM to read text and recognize objects.
## Step 4: The Vision LLM Call (Groq llama-4-scout)
This is where the magic happens. You send both the transcribed text and the image to the model.
```python
import base64
import requests

GROQ_API_KEY = "your_groq_api_key"

def ask_vision_llm(question: str, image_bytes: bytes) -> str:
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    payload = {
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        }
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ],
        "max_tokens": 150  # keep responses short for speed
    }
    response = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
        json=payload
    )
    return response.json()["choices"][0]["message"]["content"]
```
Critical: Set max_tokens to 150 or less. Longer responses mean longer TTS output. For glasses, short answers are better anyway.
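It also helps to enforce brevity in a system message rather than relying on `max_tokens` alone, since a hard token cut can truncate mid-sentence. A sketch (the prompt wording and the `build_messages` helper are illustrative, not from the project):

```python
# Keeps answers earpiece-sized; tune the wording to taste.
SYSTEM_PROMPT = (
    "You are an assistant speaking into the user's ear through smart "
    "glasses. Answer in one short sentence. Never use lists or markdown."
)

def build_messages(question: str, image_b64: str) -> list:
    """Build the messages array for the chat completions payload,
    with a system message in front of the user's image + question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": question},
        ]},
    ]
```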
## Step 5: Text to Speech with gTTS
```python
from gtts import gTTS
import pygame

def speak(text: str):
    tts = gTTS(text=text, lang='en', slow=False)
    tts.save("/tmp/response.mp3")
    pygame.mixer.init()  # in production, init once at startup
    pygame.mixer.music.load("/tmp/response.mp3")
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.wait(50)  # sleep instead of busy-spinning the CPU
```
gTTS makes an API call to Google's TTS service. It's free and sounds natural, but the downside is it requires internet. If you want a fully offline setup, use pyttsx3 instead: it sounds worse, but adds no network latency.
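One way to get the best of both is a small fallback wrapper (the names here are my own, not from the project) that tries the network TTS first and drops to the offline engine when it fails:

```python
def speak_with_fallback(text: str, primary, fallback) -> str:
    """Call primary(text) (e.g. the gTTS-based `speak`); on any failure,
    such as a network error, call fallback(text) instead.
    Returns which backend handled the text, for logging."""
    try:
        primary(text)
        return "primary"
    except Exception:
        fallback(text)
        return "fallback"

# An offline fallback using pyttsx3 (pip install pyttsx3) would look like:
# def speak_offline(text):
#     import pyttsx3
#     engine = pyttsx3.init()
#     engine.say(text)
#     engine.runAndWait()
```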
## Putting It All Together
```python
import time

def pipeline_loop():
    print("Listening...")
    while True:
        # 1. Detect speech
        audio = record_until_silence()  # implement with VAD above
        # 2. Transcribe
        t1 = time.time()
        question = transcribe(audio)
        print(f"STT: {time.time() - t1:.2f}s — '{question}'")
        # 3. Capture frame
        frame = capture_frame()
        # 4. Ask LLM
        t2 = time.time()
        answer = ask_vision_llm(question, frame)
        print(f"LLM: {time.time() - t2:.2f}s — '{answer}'")
        # 5. Speak
        t3 = time.time()
        speak(answer)
        print(f"TTS: {time.time() - t3:.2f}s")
        print(f"Total: {time.time() - t1:.2f}s")

pipeline_loop()
```
## Latency Breakdown (Real Numbers)
| Step | Time |
|---|---|
| VAD detection | ~50ms |
| faster-whisper (base, CPU) | ~300-500ms |
| Frame capture | ~80ms |
| Groq LLM inference | ~400-700ms |
| gTTS generation | ~200-300ms |
| Total | ~1.0–1.6s |
On most requests I hit under 1.5 seconds. The variance mostly comes from Groq API response time under load.
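If that variance bites, a timeout plus a cheap retry helps. This is a generic sketch, not code from the project; you'd wire it around the `requests.post` call from Step 4, e.g. `post_with_retry(lambda t: requests.post(url, headers=headers, json=payload, timeout=t))`:

```python
import time

def post_with_retry(do_post, retries=2, timeout_s=5.0, backoff_s=0.2):
    """Call do_post(timeout_s); on exception, retry up to `retries` times
    with a linearly growing backoff. Re-raises the last error."""
    for attempt in range(retries + 1):
        try:
            return do_post(timeout_s)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))
```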
## What I Learned

**1. The LLM is not your bottleneck; your audio pipeline is.**
Most of the latency people struggle with is in how they handle audio. VAD + chunked processing matters more than which LLM you pick.

**2. Groq is genuinely fast.**
I tested OpenAI GPT-4o, Gemini Flash, and Groq. Groq was consistently 2-3x faster on inference alone.

**3. Short answers are better answers.**
For a wearable, nobody wants 3 paragraphs read in their ear. Prompt the LLM explicitly: "Answer in one sentence."

**4. Oracle Cloud Free Tier is underrated.**
4 ARM cores, 24GB RAM, always free. It handles this pipeline with headroom to spare.
## What's Next
I'm working on:
- Replacing gTTS with a faster local TTS model (Kokoro or Coqui)
- Adding a wake word so the pipeline doesn't run on every sound
- Streaming the LLM response directly to TTS instead of waiting for the full answer
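That last idea can be prototyped without touching any API: split the token stream into sentences and hand each one to TTS as soon as it completes, so playback of sentence one overlaps generation of sentence two. A sketch of the chunking half (the regex boundary rule is a deliberate simplification):

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences as soon as they arrive from a token
    stream, so TTS can start speaking before generation finishes."""
    buf = ""
    for token in token_stream:
        buf += token
        # Flush on sentence-ending punctuation followed by whitespace.
        while True:
            m = re.search(r"[.!?]\s", buf)
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever trails the last boundary
```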
If you're building something similar or want to collaborate, connect with me:
→ Portfolio: zainulabideen.com
→ GitHub: github.com/zainulabideen041
→ LinkedIn: linkedin.com/in/zainulabideen041
Built with: Python, faster-whisper, Groq API, gTTS, OpenCV, Oracle Cloud