I spent last weekend building a local vector cache for my AI coding assistant. It took exactly 46 hours from idea to deployment. The result saves me about $120 a month in API costs and cuts response latency by 60%.
Yet, when I mentioned this on Twitter, the reaction was lukewarm at best. Most developers are still obsessed with building new agents or fine-tuning massive models. Nobody seems interested in the boring infrastructure that makes those tools actually usable in production.
This is a story about why caching is the most underrated optimization in the AI stack right now. It is also a confession of how I wasted three days over-engineering a solution that should have been simple.
The Problem With "Smart" Context
By early 2026, every developer uses some form of AI context injection. We dump entire codebases into prompts. We attach documentation PDFs. We paste error logs.
The problem is redundancy.
I checked my usage logs from March 2026. I found that 40% of my tokens were spent re-sending the same file contents. If I asked a question about authMiddleware.ts on Tuesday, and then asked a related question on Wednesday, the system sent the entire file both times.
Large Language Models do not have memory between sessions unless you build it. Most hosted solutions charge you for every token sent, regardless of whether the model has seen it before.
I wanted a system that recognized repeated content. It needed to store embeddings locally. It had to check if a file chunk already existed in the vector store before sending it to the API.
This sounds trivial. In practice, it is messy.
My First Attempt Was A Disaster
I started by trying to use Redis as a vector store. This was a mistake.
Redis is great for key-value pairs. It is terrible for semantic search unless you install specific modules that bloat your memory usage. I spun up a Redis instance with the RediSearch module. It consumed 4GB of RAM just to index 500 files from my monorepo.
My laptop fan sounded like a jet engine. The indexing process took 20 minutes. Every time I saved a file, the re-indexing lag spiked my CPU to 100%.
I scrapped it after 12 hours.
I switched to SQLite with the sqlite-vss extension. This runs entirely on disk. It uses negligible RAM. The setup was harder because I had to compile custom binaries for my M3 chip, but the performance difference was night and day.
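For anyone wiring this up themselves, loading the extension from Python looks roughly like this. This is a minimal sketch that assumes the prebuilt sqlite-vss Python package; if you compile your own binaries like I did, you point `load_extension` at your own build instead. The `cache.db` filename is just a placeholder.

```python
import sqlite3
import sqlite_vss  # pip install sqlite-vss; swap for your custom build if needed

db = sqlite3.connect("cache.db")   # placeholder path
db.enable_load_extension(True)
sqlite_vss.load(db)                # registers the vss0 module and helper functions
db.enable_load_extension(False)

# Sanity check that the extension actually loaded
print(db.execute("SELECT vss_version()").fetchone()[0])
```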
Here is the basic schema I ended up using:
```sql
CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    file_path TEXT NOT NULL,
    chunk_hash TEXT NOT NULL UNIQUE,
    content TEXT NOT NULL,
    embedding BLOB NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX IF NOT EXISTS idx_chunk_hash ON documents(chunk_hash);
```
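One detail the schema doesn't show: sqlite-vss runs its similarity search through a separate vss0 virtual table, not the raw BLOB column. Here is a minimal sketch of that companion table — the `vss_documents` name is a placeholder, and 768 matches the output dimension of the embedding model I settled on (more on that below), so adjust both to your setup.

```python
# Continuing with the `db` connection from the loading sketch above.
# rowid is kept equal to documents.id so search results can be joined back to content.
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS vss_documents USING vss0(
        embedding(768)
    )
""")

# Each newly embedded chunk also gets mirrored in here, roughly:
#   db.execute("INSERT INTO vss_documents (rowid, embedding) VALUES (?, ?)",
#              (doc_id, embedding_bytes))
```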
The chunk_hash is critical. I use SHA-256 to fingerprint each code chunk. Before generating an embedding, I check if that hash exists. If it does, I skip the embedding generation and the API call entirely.
The Implementation Details
The workflow looks like this:
1. Watcher detects a file change.
2. Split the file into 500-token chunks (a simplified chunker is sketched below).
3. Generate a SHA-256 hash for each chunk.
4. Query SQLite for existing hashes.
5. Only embed and store new chunks.
6. On query, retrieve relevant chunks from SQLite.
7. Send only the unique, relevant context to the LLM.
Step 5 is where the money is saved.
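I won't paste my real chunker, but step 2 boils down to something like this. Note that this simplified version approximates a token as roughly four characters instead of calling a real tokenizer, so the boundaries are cruder than what I actually run.

```python
def chunk_file(text: str, max_tokens: int = 500, chars_per_token: int = 4) -> list[str]:
    """Greedy line-based chunker. Tokens are approximated as ~4 characters each,
    which is a rough heuristic rather than the assistant's real tokenizer."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Close the current chunk before it overflows (very long single lines stay whole)
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```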
I used nomic-embed-text for local embeddings. It runs fast on CPU. The model is small enough to load in under two seconds.
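The deduplication snippet below calls a `generate_embedding` helper without defining it. Here is one way to implement it if you serve nomic-embed-text through a local Ollama instance — the endpoint and model name below are that assumption, so swap in whatever runtime you actually use. The float list gets packed into float32 bytes so it drops straight into the embedding BLOB column from the schema.

```python
import json
import struct
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # assumes a local Ollama server

def generate_embedding(text: str) -> bytes:
    """Embed `text` with nomic-embed-text via Ollama and pack it for a BLOB column."""
    payload = json.dumps({"model": "nomic-embed-text", "prompt": text}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        vector = json.loads(resp.read())["embedding"]  # list of floats
    # Store as little-endian float32 bytes so it fits the embedding BLOB column
    return struct.pack(f"<{len(vector)}f", *vector)
```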
Here is a simplified Python snippet showing the deduplication logic:
```python
import hashlib
import sqlite3


def get_existing_hashes(conn, hashes):
    placeholders = ','.join('?' for _ in hashes)
    query = f"SELECT chunk_hash FROM documents WHERE chunk_hash IN ({placeholders})"
    cursor = conn.execute(query, hashes)
    return set(row[0] for row in cursor.fetchall())


def process_chunk(conn, file_path, content):
    chunk_hash = hashlib.sha256(content.encode()).hexdigest()

    # Check if we already have this exact content
    existing = get_existing_hashes(conn, [chunk_hash])
    if chunk_hash in existing:
        return None  # Skip processing

    # Generate the embedding (one possible implementation is sketched above)
    embedding = generate_embedding(content)

    conn.execute(
        "INSERT INTO documents (file_path, chunk_hash, content, embedding) VALUES (?, ?, ?, ?)",
        (file_path, chunk_hash, content, embedding)
    )
    conn.commit()
    return chunk_hash
```
This code is not revolutionary. It is basic database hygiene. But most AI wrappers ignore it. They assume context is ephemeral. They treat every request as a blank slate.
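That snippet only covers ingestion. The retrieval side (steps 6 and 7) is just as short. Here is a sketch of it, assuming the embeddings were mirrored into the `vss_documents` virtual table described earlier with rowids matching documents.id; the `vss_search_params` form comes from the sqlite-vss docs, so double-check it against your build.

```python
def retrieve_context(db, question, k=5):
    """Return the content of the k chunks nearest to the question embedding.
    Assumes the vss_documents virtual table from earlier, rowid == documents.id."""
    query_vec = generate_embedding(question)  # same float32 bytes as at index time
    rows = db.execute(
        """
        SELECT d.content
        FROM (
            SELECT rowid, distance
            FROM vss_documents
            WHERE vss_search(embedding, vss_search_params(?, ?))
        ) AS hits
        JOIN documents d ON d.id = hits.rowid
        ORDER BY hits.distance
        """,
        (query_vec, k),
    ).fetchall()
    return [row[0] for row in rows]
```

Whatever comes back from this query is the only context that gets sent upstream with the prompt.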
The Results After One Week
I deployed this tool on Monday, March 9, 2026. I tracked my usage until Sunday, March 15.
| Metric | Before Cache | After Cache | Change |
|---|---|---|---|
| Avg Tokens/Request | 12,500 | 4,800 | -61.6% |
| Avg Latency |
💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.