Deploy a Real‑Time Voice‑Controlled AI Assistant on a Raspberry Pi

Why Run a Voice Assistant on the Edge?

Running speech‑to‑text and intent detection locally gives you:

  • Low latency – no network round‑trip to the cloud.
  • Privacy – audio never leaves the device.
  • Offline reliability – your assistant works even when the internet is down.

In this tutorial we’ll stitch together OpenAI’s Whisper (small model) for transcription, a tiny TensorFlow Lite intent classifier, and a real‑time audio pipeline that lives entirely on a Raspberry Pi 4 (2 GB or more). By the end you’ll have a Python script that listens for commands like “turn on the lamp” and executes a local function instantly.


What You’ll Need

  • Raspberry Pi 4 (2 GB+) running Raspberry Pi OS (64‑bit) – provides enough RAM for Whisper‑small
  • USB microphone – captures audio
  • Python 3.10+ – modern language features
  • ffmpeg – required by Whisper
  • git, pip, virtualenv – standard development tools
  • Optional: GPIO‑controlled relay – to demonstrate a real command

Tip: If you’re using a Pi Zero, swap Whisper for a lighter model (e.g., tiny.en) or run only the intent recognizer.


1. Set Up the Development Environment

# Update OS and install system deps
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3-pip python3-venv ffmpeg libportaudio2

# Create a clean virtual environment
python3 -m venv venv
source venv/bin/activate

# Upgrade pip and install core libraries
pip install --upgrade pip
pip install numpy sounddevice tqdm

Install Whisper

Whisper ships as a Python package that downloads the model on first use.

pip install git+https://github.com/openai/whisper.git

Install TensorFlow Lite Runtime

The full TensorFlow package is heavyweight for a Pi, so install the lightweight runtime for on‑device inference; you can train the intent model from section 4 on a laptop with full TensorFlow and copy the converted .tflite file to the Pi:

pip install tflite-runtime
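If you do split training and inference across two machines, a small import fallback (a convenience sketch, not part of the original script) lets the same loading code run on both the Pi and a dev box:

try:
    # Lightweight interpreter, ideal on the Pi
    from tflite_runtime.interpreter import Interpreter
except ImportError:
    # Fall back to the interpreter bundled with full TensorFlow on a dev machine
    import tensorflow as tf
    Interpreter = tf.lite.Interpreter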

2. Capture Audio in Real Time

We’ll use sounddevice to stream 16 kHz mono audio directly into a NumPy buffer. Whisper expects 16 kHz, so we set the samplerate accordingly.

import sounddevice as sd
import numpy as np
from collections import deque

SAMPLE_RATE = 16000
CHUNK_DURATION = 0.5   # seconds
CHUNK_SIZE = int(SAMPLE_RATE * CHUNK_DURATION)

# A thread‑safe circular buffer
audio_buffer = deque(maxlen=int(5 * SAMPLE_RATE))  # keep last 5 seconds

def audio_callback(indata, frames, time, status):
    """Called by sounddevice for each audio chunk."""
    if status:
        print(f"Audio status: {status}")
    audio_buffer.extend(indata[:, 0])  # mono channel

stream = sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype='float32',
    callback=audio_callback,
)

stream.start()
print("🔊 Listening…")

The buffer continuously holds the most recent audio. We’ll pull a 2‑second slice every loop iteration and feed it to Whisper.
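If you'd rather not repeat the slicing logic, a small optional helper like the one below can pull the last N seconds out of the deque for both models (the main loop later in this post slices the deque inline instead):

def last_seconds(seconds):
    """Return the most recent `seconds` of audio as a float32 NumPy array."""
    n = int(seconds * SAMPLE_RATE)
    return np.array(list(audio_buffer)[-n:], dtype=np.float32)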


3. Run Whisper on the Edge

Whisper‑small (~244 M parameters) fits into the Pi 4’s RAM, but CPU‑only transcription will generally lag behind real time; if latency matters more than accuracy, drop down to base or tiny.en. To keep decoding as fast as possible we’ll use greedy (non‑beam) decoding.

import whisper
import torch

# Load Whisper‑small on CPU
model = whisper.load_model("small", device="cpu")

def transcribe_chunk(chunk):
    """Accepts a float32 NumPy array of shape (samples,) in [-1, 1] and returns text."""
    # Whisper accepts a float32 array directly; fp16=False silences the
    # "FP16 is not supported on CPU" warning. Omitting beam_size keeps
    # decoding greedy, which is the fastest option on the Pi.
    result = model.transcribe(chunk.astype(np.float32), language="en", fp16=False)
    return result["text"].strip()

Reducing Compute with Dynamic Quantization (optional)

Scripting the whole Whisper model with TorchScript doesn’t work reliably; a safer CPU speed boost is dynamic int8 quantization of the linear layers, which keeps the model’s normal interface:

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Replace `model` with `quantized` in `transcribe_chunk`

4. Build a Tiny Intent Classifier

Instead of parsing full sentences, we’ll map short utterances to intents using a keyword‑spotting model. The architecture is two 1‑D convolutions, global average pooling, and a dense layer – only a few thousand parameters.

import tensorflow as tf

def build_intent_model(num_classes=4):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16000, 1)),          # 1‑second raw waveform
        tf.keras.layers.Rescaling(1.0 / 32768.0),         # Normalize int16 to [-1, 1]
        tf.keras.layers.Conv1D(8, 13, strides=2, activation='relu'),
        tf.keras.layers.Conv1D(16, 13, strides=2, activation='relu'),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
    return model

Training Data (quick example)

Create a tiny dataset with four commands: ["turn on the lamp", "turn off the lamp", "what time is it", "stop listening"]. Record several 1‑second clips of each command (or slice longer recordings into 1‑second windows), label them, and train for a handful of epochs.
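Here’s a minimal sketch of how X_train and y_train could be assembled, assuming you’ve saved 1‑second, 16 kHz mono WAV clips into one folder per command (data/turn_on/*.wav and so on – the folder layout is an arbitrary choice) and that scipy is installed (pip install scipy):

import glob
import numpy as np
import tensorflow as tf
from scipy.io import wavfile

COMMANDS = ["turn_on", "turn_off", "get_time", "stop"]

X, y = [], []
for label, name in enumerate(COMMANDS):
    for path in glob.glob(f"data/{name}/*.wav"):
        _, samples = wavfile.read(path)            # int16 samples at 16 kHz
        samples = samples[:16000]                  # keep exactly 1 second
        if len(samples) < 16000:                   # zero-pad shorter clips
            samples = np.pad(samples, (0, 16000 - len(samples)))
        X.append(samples.astype(np.float32))       # int16 range, matches the Rescaling layer
        y.append(label)

X_train = np.array(X)[..., np.newaxis]             # shape (samples, 16000, 1)
y_train = tf.keras.utils.to_categorical(y, num_classes=len(COMMANDS))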

# Assume `X_train` shape = (samples, 16000, 1), y_train one‑hot encoded
model = build_intent_model(num_classes=4)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=15, batch_size=8)

Convert to TensorFlow Lite and Quantize

Quantization shrinks the model to ~30 KB and runs at >100 inferences/sec on the Pi.

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post‑training quantization
tflite_model = converter.convert()

with open("intent_classifier.tflite", "wb") as f:
    f.write(tflite_model)
print("✅ Saved quantized TFLite model")

Load the TFLite Model

import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="intent_classifier.tflite")
interpreter.allocate_tensors()
input_idx = interpreter.get_input_details()[0]["index"]
output_idx = interpreter.get_output_details()[0]["index"]

def predict_intent(waveform):
    """waveform: np.ndarray of float32 in [-1, 1], shape (16000,)"""
    # The model's Rescaling layer expects int16-range values, and a
    # dynamic-range-quantized TFLite model still takes float32 input,
    # so scale up rather than casting to int16 (which would zero the signal).
    input_data = (waveform * 32768.0).astype(np.float32).reshape(1, -1, 1)
    interpreter.set_tensor(input_idx, input_data)
    interpreter.invoke()
    probs = interpreter.get_tensor(output_idx)[0]
    intent_id = int(np.argmax(probs))
    return intent_id, probs[intent_id]

Map IDs to human‑readable intents:

INTENT_MAP = {
    0: "TURN_ON",
    1: "TURN_OFF",
    2: "GET_TIME",
    3: "STOP"
}

5. Glue It All Together

Now we combine the audio stream, Whisper transcription, and intent classifier into a single loop. We’ll use a 1‑second sliding window for intent detection (fast) and a 2‑second window for Whisper (more accurate).

import time
import datetime

def execute_intent(intent):
    if intent == "TURN_ON":
        print("💡 Turning lamp ON")
        # Example GPIO call:
        # import RPi.GPIO as GPIO
        # GPIO.output(LAMP_PIN, GPIO.HIGH)
    elif intent == "TURN_OFF":
        print("💡 Turning lamp OFF")
    elif intent == "GET_TIME":
        now = datetime.datetime.now().strftime("%H:%M")
        print(f"🕒 The time is {now}")
    elif intent == "STOP":
        print("👋 Stopping assistant")
        raise KeyboardInterrupt

try:
    while True:
        # ---- Intent detection (fast) ----
        if len(audio_buffer) >= SAMPLE_RATE:  # need at least 1 sec
            recent = np.array(list(audio_buffer)[-SAMPLE_RATE:])  # last 1 sec
            intent_id, confidence = predict_intent(recent)
            if confidence > 0.85:  # ignore low‑confidence guesses
                intent = INTENT_MAP[intent_id]
                print(f"[Intent] {intent} ({confidence:.2f})")
                execute_intent(intent)

        # ---- Whisper transcription (every 2 sec) ----
        if len(audio_buffer) >= 2 * SAMPLE_RATE:
            chunk = np.array(list(audio_buffer)[-2 * SAMPLE_RATE:])
            text = transcribe_chunk(chunk)
            if text:
                print(f"[Transcription] {text}")

        time.sleep(0.2)  # tiny pause to keep CPU happy

except KeyboardInterrupt:
    print("\n🛑 Assistant stopped")
finally:
    stream.stop()
    stream.close()

What’s happening?

  1. Audio callback continuously fills audio_buffer.
  2. Every loop we grab a 1‑second slice, run the quantized intent model, and instantly act on high‑confidence predictions.
  3. Every 2 seconds we feed a larger slice to Whisper for a full transcription – useful for debugging or for commands that need more context.
  4. The script exits gracefully on “stop listening”.
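One practical refinement: because an utterance stays in the 5‑second buffer for several loop iterations, a high‑confidence command will fire again and again. A simple cooldown timer is enough to suppress the repeats – a sketch you could fold into the main loop:

last_fired = 0.0
COOLDOWN = 3.0  # seconds to ignore repeat detections of the same utterance

# Inside the while-loop, wrap the execute_intent(intent) call:
if time.time() - last_fired > COOLDOWN:
    execute_intent(intent)
    last_fired = time.time()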

6. Optimizing for Real‑World Use

  • CPU usage – call torch.set_num_threads(2) to cap Whisper’s thread count.
  • Power – switch the CPU governor to powersave when idle: echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor.
  • Audio quality – add a simple high‑pass filter to remove low‑frequency rumble (see the sketch after this list).
  • Model size – swap Whisper‑small for Whisper‑tiny if RAM or latency is a bottleneck.
  • Hotword detection – keep the tiny intent model always on; only invoke Whisper after a hotword is detected (e.g., “hey pi”).
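For the high‑pass filter mentioned above, a minimal sketch with scipy (an extra dependency: pip install scipy) could look like this – the 100 Hz cutoff is an arbitrary starting point:

import numpy as np
from scipy.signal import butter, lfilter

def highpass(waveform, cutoff_hz=100, sample_rate=16000, order=4):
    """Attenuate low-frequency rumble before feeding audio to the models."""
    b, a = butter(order, cutoff_hz / (sample_rate / 2), btype="highpass")
    return lfilter(b, a, waveform).astype(np.float32)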

7. Deploying as a System Service

Running the script manually is fine for testing, but for a production‑grade assistant you’ll want it to start on boot.

sudo nano /etc/systemd/system/voice-assistant.service

Paste:

[Unit]
Description=Edge Voice Assistant
After=network.target

[Service]
WorkingDirectory=/home/pi/voice-assistant
ExecStart=/home/pi/voice-assistant/venv/bin/python3 assistant.py
Restart=on-failure
User=pi

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable voice-assistant.service
sudo systemctl start voice-assistant.service

Check logs with journalctl -u voice-assistant -f.


Key takeaways

  • Whisper can run on a Raspberry Pi 4 with CPU‑only inference; stick to greedy decoding and drop to tiny.en or base if you need near‑real‑time transcription.
  • A tiny 1‑D ConvNet, post‑training quantized to TensorFlow Lite, classifies intents in a few milliseconds.
  • Using a circular buffer with sounddevice lets you stream audio without dropping frames.
  • Combining a fast intent classifier with occasional Whisper transcriptions yields both low latency and high accuracy.
  • Packaging the script as a systemd service makes the assistant start automatically and stay resilient.
