Everyone's talking about RAG (Retrieval-Augmented Generation). Most tutorials make it look like you need a full cloud stack, a vector database subscription, and three Dockerfiles just to get started.
You don't.
Here's the bare minimum that actually works — and what tripped me up along the way.
What even is RAG?
Quick version: instead of relying purely on what an LLM already knows, you feed it your own documents at query time. The model doesn't memorize your data. It reads a relevant chunk of it fresh for every question.
That's why it's called "retrieval-augmented." You retrieve, then you generate.
The Setup
We need three things:
- A way to split documents into chunks
- A way to find the most relevant chunks for a given query (embeddings + similarity search)
- An LLM to generate an answer using those chunks
I'm using sentence-transformers for embeddings, numpy for similarity math, and openai for the final answer. No vector database. Just Python.
```bash
pip install openai sentence-transformers numpy
```
The Code
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Your documents — could be loaded from files, a DB, wherever
documents = [
    "Python was created by Guido van Rossum and released in 1991.",
    "RAG stands for Retrieval-Augmented Generation.",
    "The Eiffel Tower is located in Paris, France.",
    "Transformers are the backbone of modern large language models.",
    "Python is widely used in data science and machine learning.",
]

# Embed all documents once at startup
doc_embeddings = embedder.encode(documents)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    query_embedding = embedder.encode([query])
    scores = np.dot(doc_embeddings, query_embedding.T).flatten()
    top_indices = scores.argsort()[-top_k:][::-1]
    return [documents[i] for i in top_indices]

def ask(question: str) -> str:
    context_chunks = retrieve(question)
    context = "\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the user's question using only the context below.\n\n"
                           f"Context:\n{context}",
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Try it
print(ask("Who created Python?"))
print(ask("Where is the Eiffel Tower?"))
```
That's it. Set OPENAI_API_KEY in your environment, run it, and it works.
What's Actually Happening
When you call retrieve(), it converts the query into a vector (a list of numbers that represents its meaning), then compares it against all the pre-computed document vectors using dot product similarity. Higher score = more relevant.
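If the scoring step feels abstract, here's a toy version with made-up 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings have 384 dimensions), just to show how the dot product ranks documents:

```python
import numpy as np

# Made-up 3-dimensional "embeddings", purely for illustration
doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # pretend: "Python was created by Guido van Rossum..."
    [0.1, 0.8, 0.3],   # pretend: "RAG stands for Retrieval-Augmented Generation."
])
query_vec = np.array([0.8, 0.2, 0.1])  # pretend: "Who created Python?"

scores = doc_vecs @ query_vec        # one dot-product score per document
ranked = scores.argsort()[::-1]      # indices from most to least similar
print(scores, ranked)                # [0.74 0.27] [0 1], so document 0 wins
```

In practice you'd often normalize the embeddings first, which turns the dot product into cosine similarity, but the ranking idea is identical.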
The all-MiniLM-L6-v2 model is small, fast, and good enough for most use cases. If you need better accuracy, swap it for all-mpnet-base-v2. If you need multilingual support, use paraphrase-multilingual-MiniLM-L12-v2.
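Swapping is a one-line change, since sentence-transformers exposes them all through the same API:

```python
embedder = SentenceTransformer("all-mpnet-base-v2")  # slower, more accurate
# embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual
```

Just remember that different models produce different vector spaces, so re-embed your documents after switching.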
Where It Breaks Down
A few things I ran into:
Chunking matters a lot. In this example I used full sentences, but in real life you'll have paragraphs, PDFs, and web pages. If your chunks are too big, the model gets noisy context. Too small, and you lose meaning. I usually aim for 200-400 tokens per chunk with some overlap.
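Here's the kind of chunker I mean, a rough sketch that counts words as a cheap stand-in for tokens (chunk_text is just an illustrative name, not from any library):

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows, using word count as a rough token proxy."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]

# Then embed the chunks instead of whole documents:
# doc_embeddings = embedder.encode(chunk_text(long_document))
```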
The retrieval isn't magic. If none of your documents are relevant, the model will either confabulate or say it doesn't know — depending on how well your system prompt is written. Always tell it explicitly to stick to the context.
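Something along these lines makes that instruction explicit (the exact wording is yours to tune):

```python
SYSTEM_PROMPT = (
    "Answer the user's question using only the context below. "
    "If the context does not contain the answer, say \"I don't know\" "
    "rather than guessing.\n\n"
    "Context:\n{context}"
)
# In ask(), use SYSTEM_PROMPT.format(context=context) as the system message.
```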
In-memory embeddings don't scale. This works fine for hundreds of documents. For thousands or millions, you'll want a real vector store like chromadb, qdrant, or pinecone. But start here. Don't over-engineer upfront.
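When you do outgrow it, the switch is small. A minimal chromadb sketch, using Chroma's built-in default embedding model instead of the sentence-transformers one above:

```python
import chromadb

chroma = chromadb.PersistentClient(path="./chroma_db")   # persists to disk
collection = chroma.get_or_create_collection("docs")

# Index the same documents list from earlier; ids just need to be unique strings
collection.add(documents=documents, ids=[str(i) for i in range(len(documents))])

results = collection.query(query_texts=["Who created Python?"], n_results=2)
print(results["documents"][0])   # the two most relevant chunks
```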
What I'd Do Next
If I were taking this further:
- Add a chunking function — split long documents by sentence or token count with overlap
- Persist embeddings — save them to disk with numpy.save() so you're not re-embedding on every startup (see the sketch after this list)
- Add a reranker — after retrieval, score chunks again with a cross-encoder for better precision
- Swap the vector store — chromadb is dead simple and runs locally
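The persistence one is nearly free, so here's a sketch of it, assuming the documents list doesn't change between runs (delete the file or rebuild it when it does):

```python
import os
import numpy as np

EMBEDDINGS_PATH = "doc_embeddings.npy"   # illustrative filename

if os.path.exists(EMBEDDINGS_PATH):
    doc_embeddings = np.load(EMBEDDINGS_PATH)
else:
    doc_embeddings = embedder.encode(documents)   # returns a numpy array by default
    np.save(EMBEDDINGS_PATH, doc_embeddings)
```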
RAG doesn't need to be complicated. The core loop is always the same: embed, retrieve, generate. Get that working first, then optimize.
If this helped or you ran into something weird trying it, drop a comment. I'm curious what use cases people are actually building this for.