DEV Community

Akash kumar
Voice-Controlled Local AI Agent (Works Even on 8GB RAM)

What if you could control your computer using just your voice, without needing a powerful GPU or heavy local models?

I built a Voice-Controlled AI Agent that:

  • Understands speech 🎤
  • Detects user intent 🧠
  • Executes real actions like file creation, code generation, and summarization ⚡

And the best part?
👉 It works smoothly even on low-end systems (8GB RAM).


🎬 Demo

📽️ Watch the full demo here:
👉 https://drive.google.com/file/d/17Uvp72dDi82pAqEqbJ6pl3LaLphxwaGm/view?usp=sharing


https://youtu.be/Pl3lwBoYruM

LIVE LINK

https://localaiagent-twxulfwrigcagqtbecnomh.streamlit.app/

✨ Features

  • 🎤 Audio Input

    • Record directly from microphone
    • Upload audio files
  • 🧠 Intent Classification

    • Converts speech → structured JSON
    • Accurately detects user commands
  • ⚡ Core Actions

    • create_file → Creates files safely
    • write_code → Generates and saves code
    • summarize_text → Summarizes content
    • general_chat → Handles normal queries
  • 🔒 Safe Execution

    • All outputs are restricted to the /output directory
    • Prevents accidental system modification
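
The "safe execution" guarantee above boils down to path sandboxing: every requested filename is resolved and checked against the output directory before anything is written. A minimal sketch of that idea (the helper name is mine, not necessarily the project's):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve filename inside OUTPUT_DIR, rejecting anything that escapes it."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # "../" components or absolute paths resolve outside the sandbox
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"path escapes output directory: {filename}")
    return candidate
```

Resolving before checking is the important part: a naive string prefix check on the raw input would miss `../` tricks.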

๐Ÿ—๏ธ System Architecture

Building AI systems locally with limited RAM is challenging. Here's how I solved it:

1. 🎙️ Speech-to-Text (STT)

  • Local Mode:
    Uses openai-whisper (tiny model) → runs on CPU

  • Fast Mode (Recommended):
    Uses the Groq API (Whisper-large-v3) → extremely fast ⚡
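
The two modes can sit behind a single switch keyed on whether a Groq API key is configured. A sketch of just the selection logic (the helper name and return labels are mine; the actual calls to openai-whisper and the Groq API are elided):

```python
import os

def stt_backend(env=None):
    """Pick the transcription backend based on available credentials."""
    env = os.environ if env is None else env
    if env.get("GROQ_API_KEY"):
        # Fast mode: send audio to Groq-hosted Whisper-large-v3
        return "groq/whisper-large-v3"
    # Local mode: run the openai-whisper "tiny" model on CPU
    return "local/whisper-tiny"
```

Keeping the decision in one function means the rest of the app never needs to know which backend produced the transcript.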


2. 🧠 LLM + Intent Engine

Running large models locally was not feasible:

  • 8B models consume ~5GB RAM ❌
  • Causes system slowdown

👉 Solution:

  • Used the Groq API (Llama 3 - 8B / 70B)
  • Provides:

    • Fast inference ⚡
    • Structured JSON output
    • Reliable intent classification
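
Getting "structured JSON output" reliably also means validating the model's reply before acting on it. A minimal guard, assuming the four actions listed under Features (the function name is illustrative):

```python
import json

VALID_ACTIONS = {"create_file", "write_code", "summarize_text", "general_chat"}

def parse_intent(llm_output: str) -> dict:
    """Parse the LLM's JSON reply and reject anything outside the known actions."""
    intent = json.loads(llm_output)
    action = intent.get("action")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return intent
```

Raising on unknown actions (rather than guessing) is what keeps a hallucinated command from ever reaching the execution layer.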

3. 🖥️ Frontend

  • Built using Streamlit
  • Uses st.audio_input for seamless recording
  • Simple and clean UI

🔄 How It Works

  1. User speaks or uploads audio 🎤
  2. Whisper converts speech → text
  3. LLM processes text → structured JSON
  4. System executes the action locally

Example:

```json
{
  "action": "create_file",
  "filename": "hello.py"
}
```
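
Once JSON like this arrives, a small dispatch table can map each action to its handler. A sketch of that pattern (handler bodies are stand-ins, not the project's actual code):

```python
def create_file(intent):
    # The real handler would write inside /output; here we just report it
    return f"created {intent['filename']}"

def general_chat(intent):
    return "chat response"

HANDLERS = {
    "create_file": create_file,
    "general_chat": general_chat,
    # "write_code" and "summarize_text" register the same way
}

def execute(intent):
    """Look up the handler for the requested action and run it."""
    handler = HANDLERS.get(intent["action"])
    if handler is None:
        raise ValueError(f"no handler for {intent['action']!r}")
    return handler(intent)
```

A dict of handlers keeps adding a new action down to one function plus one registration line.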

💻 Example Use Case

🗣️ User says:

"Create a Python file called hello.py"

⚙️ System:

  • Transcribes the audio
  • Detects the create_file intent
  • Creates the file in the /output folder
  • Shows a success message

⚡ Setup Instructions

Prerequisites

  • Python and pip
  • A Groq API key (for Fast Mode)

Installation

```bash
git clone <your-repo-link>
cd local_ai_agent
pip install -r requirements.txt
```

Environment Setup

Set your Groq API key as an environment variable:

```bash
GROQ_API_KEY=your_api_key_here
```

Run the App

```bash
streamlit run app.py
```

โš ๏ธ Challenges Faced

  • Running LLMs on 8GB RAM
  • Slow transcription using CPU Whisper
  • Ensuring consistent JSON output from LLM
  • Managing safe file execution

💡 Key Learnings

  • Hybrid approach (local + API) is powerful
  • Structured prompts = better automation
  • UI simplicity improves usability massively

🔮 Future Improvements

  • Add more actions (email automation, system control)
  • Improve offline performance
  • Add memory (conversation history)
  • Multi-command execution

🙌 Final Thoughts

This project shows that you don't need expensive hardware to build powerful AI systems.

With the right architecture and smart trade-offs, even a mid-range laptop can run intelligent AI agents efficiently.

If you found this useful, feel free to ⭐ the repo or share your thoughts!


๐Ÿท๏ธ Tags

python #ai #machinelearning #streamlit #opensource #productivity
