# Real-time voice + vision pipeline using Groq, Whisper, and gTTS on a budget
I got tired of watching expensive AI glasses demos that cost $500+ and still have a 5-second lag before they respond. So I built my own — and got the full voice + vision pipeline under 2 seconds end-to-end.
This post covers the exact architecture, the bottlenecks I hit, and what actually made the difference in latency.
## What It Does
You put on the glasses, ask a question out loud, and within 2 seconds you get a spoken response — based on both what you said and what the camera sees.
Example: "What's written on this sign?" → glasses see the sign → AI reads it → speaks the answer in your ear.
Or: "Is this a good deal?" → glasses see a price tag → LLM compares context → responds.
## The Stack
| Component | Tool | Why |
|---|---|---|
| Speech-to-Text | faster-whisper / Groq Whisper | Speed |
| Vision LLM | Groq llama-4-scout | Free tier, fast inference |
| Text-to-Speech | gTTS | Lightweight, no API cost |
| Deployment | Oracle Cloud Free Tier | Always-free compute |
| Hardware | Raspberry Pi + USB camera + earpiece | ~$60 total |
The key insight: Groq's inference API is among the fastest available right now. Most latency problems in AI pipelines come from the LLM call, and Groq runs on LPUs (Language Processing Units) instead of GPUs, which cut inference time dramatically compared to OpenAI or Gemini in my testing.
## Architecture Overview
```
[Microphone]
      ↓
[VAD — Voice Activity Detection]
      ↓
[faster-whisper STT — local] ← or Groq Whisper API
      ↓
[Frame capture from camera]
      ↓
[Groq llama-4-scout — vision + text input]
      ↓
[gTTS — text to speech]
      ↓
[Earpiece output]
```
Everything runs on Oracle Cloud Free Tier (ARM instance, 4 cores, 24GB RAM — genuinely free).
## Step 1: Speech Detection Without Constant Listening
The first mistake I made was running Whisper on a continuous stream. It's slow and wasteful.
The fix: use Voice Activity Detection (VAD) to only run STT when someone is actually speaking.
```python
import webrtcvad
import pyaudio

vad = webrtcvad.Vad(2)  # aggressiveness 0-3

def is_speech(audio_chunk, sample_rate=16000):
    return vad.is_speech(audio_chunk, sample_rate)
```
This alone saved ~400ms per request by eliminating unnecessary Whisper calls on silence.
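The `record_until_silence` helper used later in the pipeline loop isn't spelled out in the post, so here's a minimal sketch of its core logic under my own assumptions (30 ms frames, a ~0.75 s silence cutoff). The VAD is passed in as a callable so the buffering logic stays independent of the microphone:

```python
def record_until_silence(frames, is_speech, silence_frames=25):
    """Buffer PCM frames from `frames` (any iterator of byte chunks).
    Recording starts at the first frame the VAD flags as speech and stops
    after `silence_frames` consecutive non-speech frames (~0.75 s at 30 ms).
    """
    voiced = bytearray()
    triggered = False
    silent = 0
    for frame in frames:
        if is_speech(frame):
            triggered = True
            silent = 0
            voiced.extend(frame)
        elif triggered:
            silent += 1
            voiced.extend(frame)  # keep trailing audio so words aren't clipped
            if silent >= silence_frames:
                break
    return bytes(voiced)
```

Wiring it to the VAD above would look like `record_until_silence(mic_frames(), lambda f: vad.is_speech(f, 16000))`, where `mic_frames()` is a hypothetical generator yielding 30 ms chunks from PyAudio.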
## Step 2: Fast Transcription with faster-whisper
faster-whisper is a reimplementation of OpenAI's Whisper using CTranslate2. On CPU it's up to 4x faster than the original for the same accuracy.
```python
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(audio_path):
    segments, _ = model.transcribe(audio_path, beam_size=1)
    return " ".join([s.text for s in segments])
```
Use beam_size=1 for speed. You lose a tiny bit of accuracy, but for conversational input it doesn't matter.
Alternatively, use the Groq Whisper API if you want zero local processing — it's fast and has a generous free tier.
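If you go the API route, the call looks roughly like this. This is a sketch against Groq's OpenAI-compatible transcription endpoint; `whisper-large-v3` is the model name at the time of writing, so check Groq's docs before relying on it:

```python
import requests

GROQ_API_KEY = "your_groq_api_key"

def transcribe_groq(audio_path: str) -> str:
    """Send an audio file to Groq's hosted Whisper and return the text."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.groq.com/openai/v1/audio/transcriptions",
            headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
            files={"file": f},
            data={"model": "whisper-large-v3"},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["text"]
```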
## Step 3: Capturing a Frame at the Right Moment
Don't capture video continuously. Capture one frame at the moment the user finishes speaking.
```python
import cv2

def capture_frame():
    cap = cv2.VideoCapture(0)
    ret, frame = cap.read()
    cap.release()
    if ret:
        _, buffer = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
        return buffer.tobytes()
    return None
```
JPEG quality 70 is the sweet spot — small enough to send fast, clear enough for the LLM to read text and recognize objects.
## Step 4: The Vision LLM Call (Groq llama-4-scout)
This is where the magic happens. You send both the transcribed text and the image to the model.
```python
import base64
import requests

GROQ_API_KEY = "your_groq_api_key"

def ask_vision_llm(question: str, image_bytes: bytes) -> str:
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    payload = {
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_b64}"
                        }
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ]
            }
        ],
        "max_tokens": 150  # keep responses short for speed
    }
    response = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {GROQ_API_KEY}"},
        json=payload
    )
    return response.json()["choices"][0]["message"]["content"]
```
Critical: Set max_tokens to 150 or less. Longer responses mean longer TTS output. For glasses, short answers are better anyway.
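It also helps to enforce brevity in a system message rather than relying on `max_tokens` alone, since a hard token cut can truncate mid-sentence. A sketch (the prompt wording and the `build_messages` helper are illustrative, not from the project):

```python
# Keeps answers earpiece-sized; tune the wording to taste.
SYSTEM_PROMPT = (
    "You are an assistant speaking into the user's ear through smart "
    "glasses. Answer in one short sentence. Never use lists or markdown."
)

def build_messages(question: str, image_b64: str) -> list:
    """Build the messages array for the chat completions payload,
    with a system message in front of the user's image + question."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": question},
        ]},
    ]
```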
## Step 5: Text to Speech with gTTS
```python
from gtts import gTTS
import pygame

def speak(text: str):
    tts = gTTS(text=text, lang='en', slow=False)
    tts.save("/tmp/response.mp3")
    pygame.mixer.init()  # in production, init once at startup
    pygame.mixer.music.load("/tmp/response.mp3")
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.wait(50)  # sleep instead of busy-spinning the CPU
```
gTTS makes an API call to Google's TTS service. It's free and sounds natural, but the downside is it requires internet. If you want a fully offline setup, use pyttsx3 instead: it sounds worse, but adds no network latency.
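One way to get the best of both is a small fallback wrapper (the names here are my own, not from the project) that tries the network TTS first and drops to the offline engine when it fails:

```python
def speak_with_fallback(text: str, primary, fallback) -> str:
    """Call primary(text) (e.g. the gTTS-based `speak`); on any failure,
    such as a network error, call fallback(text) instead.
    Returns which backend handled the text, for logging."""
    try:
        primary(text)
        return "primary"
    except Exception:
        fallback(text)
        return "fallback"

# An offline fallback using pyttsx3 (pip install pyttsx3) would look like:
# def speak_offline(text):
#     import pyttsx3
#     engine = pyttsx3.init()
#     engine.say(text)
#     engine.runAndWait()
```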
## Putting It All Together
```python
import time

def pipeline_loop():
    print("Listening...")
    while True:
        # 1. Detect speech
        audio = record_until_silence()  # implement with VAD above
        # 2. Transcribe
        t1 = time.time()
        question = transcribe(audio)
        print(f"STT: {time.time() - t1:.2f}s — '{question}'")
        # 3. Capture frame
        frame = capture_frame()
        # 4. Ask LLM
        t2 = time.time()
        answer = ask_vision_llm(question, frame)
        print(f"LLM: {time.time() - t2:.2f}s — '{answer}'")
        # 5. Speak
        t3 = time.time()
        speak(answer)
        print(f"TTS: {time.time() - t3:.2f}s")
        print(f"Total: {time.time() - t1:.2f}s")

pipeline_loop()
```
## Latency Breakdown (Real Numbers)
| Step | Time |
|---|---|
| VAD detection | ~50ms |
| faster-whisper (base, CPU) | ~300-500ms |
| Frame capture | ~80ms |
| Groq LLM inference | ~400-700ms |
| gTTS generation | ~200-300ms |
| Total | ~1.0–1.6s |
On most requests I hit under 1.5 seconds. The variance mostly comes from Groq API response time under load.
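If that variance bites, a timeout plus a cheap retry helps. This is a generic sketch, not code from the project; you'd wire it around the `requests.post` call from Step 4, e.g. `post_with_retry(lambda t: requests.post(url, headers=headers, json=payload, timeout=t))`:

```python
import time

def post_with_retry(do_post, retries=2, timeout_s=5.0, backoff_s=0.2):
    """Call do_post(timeout_s); on exception, retry up to `retries` times
    with a linearly growing backoff. Re-raises the last error."""
    for attempt in range(retries + 1):
        try:
            return do_post(timeout_s)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s * (attempt + 1))
```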
## What I Learned

**1. The LLM is not your bottleneck; your audio pipeline is.**
Most of the latency people struggle with is in how they handle audio. VAD + chunked processing matters more than which LLM you pick.

**2. Groq is genuinely fast.**
I tested OpenAI GPT-4o, Gemini Flash, and Groq. Groq was consistently 2-3x faster on inference alone.

**3. Short answers are better answers.**
For a wearable, nobody wants 3 paragraphs read in their ear. Prompt the LLM explicitly: "Answer in one sentence."

**4. Oracle Cloud Free Tier is underrated.**
4 ARM cores, 24GB RAM, always free. It handles this pipeline with headroom to spare.
## What's Next
I'm working on:
- Replacing gTTS with a faster local TTS model (Kokoro or Coqui)
- Adding a wake word so the pipeline doesn't run on every sound
- Streaming the LLM response directly to TTS instead of waiting for the full answer
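That last idea can be prototyped without touching any API: split the token stream into sentences and hand each one to TTS as soon as it completes, so playback of sentence one overlaps generation of sentence two. A sketch of the chunking half (the regex boundary rule is a deliberate simplification):

```python
import re

def sentence_chunks(token_stream):
    """Yield complete sentences as soon as they arrive from a token
    stream, so TTS can start speaking before generation finishes."""
    buf = ""
    for token in token_stream:
        buf += token
        # Flush on sentence-ending punctuation followed by whitespace.
        while True:
            m = re.search(r"[.!?]\s", buf)
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever trails the last boundary
```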
If you're building something similar or want to collaborate, connect with me:
→ Portfolio: zainulabideen.com
→ GitHub: github.com/zainulabideen041
→ LinkedIn: linkedin.com/in/zainulabideen041
Built with: Python, faster-whisper, Groq API, gTTS, OpenCV, Oracle Cloud