RAG Series (13): Query Optimization — Asking Better Questions

The Same Question, Completely Different Results

Vector retrieval has a fragility that's easy to overlook: rephrase the same question, and the results can change dramatically.

"How does the BGE model perform on Chinese text?" and "Which embedding is recommended for Chinese?" are semantically near-identical — but their embedding vectors sit at different positions in high-dimensional space, often returning different document sets entirely.

This is a structural property of Bi-Encoders: query and document are each encoded without knowing the other exists, making the result sensitive to subtle phrasing differences.
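
You can observe the effect directly by embedding two paraphrases and comparing them. A quick sketch, assuming the langchain_huggingface package and a BGE checkpoint; any embedding backend shows the same behavior:

import numpy as np
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-zh-v1.5")

a = np.array(embeddings.embed_query(
    "How does the BGE model perform on Chinese text?"))
b = np.array(embeddings.embed_query(
    "Which embedding is recommended for Chinese?"))

# Similar, but not identical: against a large index this gap is
# enough to swap out several of the top-k neighbors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {cosine:.3f}")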

Previous articles optimized the document side — better chunking strategies help documents get found. This article works on the query side: transform the question itself before it touches the vector index, so retrieval is more stable and more complete.

Three strategies:

  1. Multi-Query: Generate multiple phrasings, retrieve from each, merge results
  2. HyDE: Generate a hypothetical answer first, then retrieve using that answer
  3. Query Decomposition: Break a complex question into sub-questions and retrieve each independently

Multi-Query: Multiple Angles, Wider Recall

Core Idea

A single query maps to a single point in the vector space. That point might happen to be far from some relevant documents. Search from multiple directions and you cover a larger area.

original query → LLM rewrite → [phrasing 1, phrasing 2, phrasing 3]
                                          ↓
                              retrieve each, merge, deduplicate
                                          ↓
                                  return top-K results

Implementation

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_classic.retrievers import MultiQueryRetriever

MULTI_QUERY_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a question-rewriting assistant."),
    ("human",
     "Rewrite the following question into 3 different phrasings "
     "from different angles, to retrieve more relevant documents "
     "from a vector database.\n"
     "Output one question per line. No numbering, no explanation.\n\n"
     "Original question: {question}"),
])

# Option A: LangChain's built-in wrapper (a custom rewrite prompt
# such as MULTI_QUERY_PROMPT can be supplied via the `prompt` argument)
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)

# Option B: Manual — full control over prompt and merge logic
multi_query_chain = MULTI_QUERY_PROMPT | llm | StrOutputParser()
variants_text = multi_query_chain.invoke({"question": question})
variants = [q.strip() for q in variants_text.strip().split("\n") if q.strip()]

all_docs = base_retriever.invoke(question)   # original query first
for variant in variants:
    all_docs.extend(base_retriever.invoke(variant))

final_docs = dedup_docs(all_docs)[:TOP_K]    # dedup helper sketched below
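
The code above calls dedup_docs, which isn't defined in the snippet. A minimal sketch, assuming LangChain Document objects where identical page_content means a duplicate:

def dedup_docs(docs):
    """Drop duplicate documents, keeping first-seen order.

    Uses raw page_content as the identity key; a metadata key
    (e.g. source + chunk index) is safer if chunks can repeat text.
    """
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique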

Best for: High-vocabulary-variance scenarios, where users describe the same concept in many ways. Cost: 3 extra retrieval calls per query (plus one LLM rewrite call).


HyDE: Search with a Fake Answer

Core Idea

HyDE (Hypothetical Document Embeddings), proposed in 2022, is built on a key observation:

A question's embedding and its answer's embedding live in different semantic spaces.

The vector index stores documents (answer space). But retrieval uses a query (question space). These two distributions don't perfectly overlap, even for semantically related content.

HyDE's fix: have the LLM generate a hypothetical answer first, then embed that instead of the question. The hypothetical answer lives in the same semantic space as real documents — it's closer to what you're looking for.

query → LLM generates hypothetical answer (~100 words) → embed the answer
                                                               ↓
                                               vector search (find nearest docs)
                                                               ↓
                                               return real documents to LLM

The hypothetical answer doesn't need to be correct — it just needs to be semantically in the right neighborhood so the embedding lands near the relevant documents.

Implementation

HYDE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a technical knowledge assistant."),
    ("human",
     "Write a hypothetical answer to the following question in about 100 words. "
     "This answer will be used for vector retrieval, not shown to the user. "
     "It doesn't need to be perfectly accurate — just semantically close to "
     "what a real answer would look like.\n\n"
     "Question: {question}"),
])

hyde_chain = HYDE_PROMPT | llm | StrOutputParser()
hypothetical_answer = hyde_chain.invoke({"question": question})

# Embed the hypothetical answer, not the original question
hyp_embedding = embeddings.embed_query(hypothetical_answer)
results = vectorstore.similarity_search_by_vector(hyp_embedding, k=TOP_K)
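
One variation worth knowing: the original HyDE paper samples several hypothetical documents and averages their embeddings together with the query's own embedding, which smooths over any single bad generation. A sketch of that variant, reusing hyde_chain, embeddings, and vectorstore from above:

import numpy as np

# Sample a few hypothetical answers instead of relying on one
hypotheticals = [hyde_chain.invoke({"question": question}) for _ in range(3)]

# Average their embeddings with the raw question embedding
vectors = [embeddings.embed_query(text) for text in hypotheticals]
vectors.append(embeddings.embed_query(question))
mean_vector = np.mean(vectors, axis=0).tolist()

results = vectorstore.similarity_search_by_vector(mean_vector, k=TOP_K)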

Best for: Scenarios where question phrasing and document language diverge — user asks conversationally, document is written in technical language. Cost: one extra LLM call plus one extra embedding computation.


Query Decomposition: Break It Down

Core Idea

Some questions are inherently multi-hop:

"For a RAG system targeting Chinese, what embedding model and vector database should I choose?"

This contains two independent sub-questions:

  1. Which embedding model is recommended for Chinese text?
  2. Which vector database is best for enterprise use?

One retrieval pass trying to answer both at once will usually do justice to neither.

Query Decomposition: decompose first, retrieve each part separately, then give the LLM all the results together.

complex question → LLM decomposes → [sub-question 1, sub-question 2, ...]
                                               ↓
                                  retrieve each independently, merge
                                               ↓
                               all docs passed to LLM for unified answer

Implementation

DECOMPOSE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You are a question analysis assistant."),
    ("human",
     "Break down the following complex question into 2-3 simple sub-questions, "
     "each of which can be answered by an independent retrieval.\n"
     "Output one sub-question per line. No numbering, no explanation.\n\n"
     "Original question: {question}"),
])

decompose_chain = DECOMPOSE_PROMPT | llm | StrOutputParser()
sub_questions_text = decompose_chain.invoke({"question": question})
sub_questions = [q.strip() for q in sub_questions_text.strip().split("\n") if q.strip()]

all_docs = []
for sub_q in sub_questions:
    all_docs.extend(base_retriever.invoke(sub_q))

final_docs = dedup_docs(all_docs)[:TOP_K]    # same dedup helper as above
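
The diagram's final step, handing everything to the LLM for a unified answer, isn't shown above. A minimal sketch; the ANSWER_PROMPT wording here is an assumption, not taken from the repo:

ANSWER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

# Flatten the merged sub-question results into one context string
context = "\n\n".join(doc.page_content for doc in final_docs)
answer = (ANSWER_PROMPT | llm | StrOutputParser()).invoke(
    {"context": context, "question": question}
)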

Best for: Questions spanning multiple concepts or requiring synthesis across topics. Cost: one LLM decomposition call plus 2–3 retrieval calls.


Experimental Results

==================================================================================
  RAGAS Metrics Comparison (Four Query Optimization Strategies)
==================================================================================

  Metric               Naive   Multi-Query    HyDE   Decomposed
  ────────────────────────────────────────────────────────────────
  context_recall       0.625       0.625      0.750     0.875 ◀
  context_precision    0.583       0.583      0.726 ◀   0.590
  faithfulness         0.833       0.883      0.946 ◀   0.911
  answer_relevancy     0.406       0.412      0.377     0.474 ◀
==================================================================================

Reading the numbers:

Multi-Query (context_recall = 0.625, same as Naive)

Surprisingly, rewriting queries didn't improve recall at all on this knowledge base. The reason: with only 8 documents, the base vector search already achieves near-perfect recall, so the three rewritten variants retrieve the same documents as the original query, and merging plus deduplication adds nothing new. Multi-Query's value shows up at scale, when different phrasings genuinely reach different regions of the vector space.

HyDE (context_recall = 0.750, +0.125; context_precision = 0.726, +0.143)

Both metrics improve, and faithfulness reaches 0.946 — the highest of all four strategies. The hypothetical answer's embedding genuinely lands closer to the document space, finding more relevant documents and ranking them better. A clean win on retrieval quality.

Query Decomposition (context_recall = 0.875, +0.250)

The largest recall improvement. Breaking questions into sub-questions and retrieving each separately surfaces documents that a single query misses. The resulting document pool is more comprehensive, and faithfulness rises to 0.911 as a knock-on effect.


The Core Difference Between the Three

  ──────────────────────────────────────────────────────────────────────────────────────────
                        Multi-Query             HyDE                    Query Decomposition
  ──────────────────────────────────────────────────────────────────────────────────────────
  Problem solved        Phrasing sensitivity,   Query-answer semantic   Multi-hop or multi-
                        single perspective      space mismatch          concept questions
  What changes          How the question is     What is used to search  What is asked
                        asked (multiple         (answer instead of      (whole → parts)
                        phrasings)              question)
  Extra LLM calls       1 (rewrite)             1 (hypothetical answer) 1 (decompose)
  Extra retrieval calls 3                       0                       2–3
  Top metric in this    none                    context_precision,      context_recall
  experiment                                    faithfulness
  Best scenario         Large knowledge bases,  Technical docs, large   Multi-concept
                        conversational queries  question-answer style   questions,
                                                gap                     synthesis tasks
  ──────────────────────────────────────────────────────────────────────────────────────────

The key distinction is the axis of transformation:

  • Multi-Query varies how to ask — same intent, different words
  • HyDE varies what to search with — question becomes answer
  • Query Decomposition varies what to ask — one question becomes many

Strategies Can Stack

These three are not mutually exclusive. You can combine them based on the scenario:

# Example: HyDE + Multi-Query stacked
# 1. Embed the hypothetical answer (from the HyDE chain above)
hyp_embedding = embeddings.embed_query(hypothetical_answer)

# 2. Also retrieve with the 3 rewritten variants (from the
#    multi-query chain above)
all_docs = vectorstore.similarity_search_by_vector(hyp_embedding, k=4)
for variant in variants:
    all_docs.extend(base_retriever.invoke(variant))

final_docs = dedup_docs(all_docs)[:TOP_K]

Stacking widens the recall net further, but API call count grows proportionally. In production, adaptive selection based on query type tends to outperform blanket stacking — reserve the heavier strategies for queries that actually need them.
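
To make "adaptive selection" concrete, here is one way it could look: a cheap LLM classification step that routes each query to a strategy. The labels and the decompose_and_retrieve / hyde_retrieve wrappers are hypothetical names for the snippets above, not code from the repo:

ROUTE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", "You classify retrieval queries."),
    ("human",
     "Classify the question as exactly one of: simple, multi_concept, "
     "jargon_gap. Output the label only.\n\nQuestion: {question}"),
])

label = (ROUTE_PROMPT | llm | StrOutputParser()).invoke(
    {"question": question}
).strip()

if label == "multi_concept":
    docs = decompose_and_retrieve(question)   # Query Decomposition path
elif label == "jargon_gap":
    docs = hyde_retrieve(question)            # HyDE path
else:
    docs = base_retriever.invoke(question)    # plain vector search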


Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/13-query-optimization

Key file:

  • query_optimization.py — Full four-strategy comparison experiment

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/13-query-optimization
cp .env.example .env
pip install -r requirements.txt
python query_optimization.py

Summary

This article benchmarked three query optimization strategies against a naive baseline:

  1. Multi-Query — Helps when vocabulary is varied and the knowledge base is large; on small knowledge bases the redundancy means no gain
  2. HyDE — Bridges the question-answer semantic gap by searching with a generated answer; best improvement in ranking quality (context_precision +0.143, faithfulness +0.113)
  3. Query Decomposition — Handles multi-hop questions by splitting and retrieving independently; strongest improvement in recall (context_recall +0.250)

All three optimizations happen before the query touches the vector index — no changes to chunking, no changes to embeddings, no reindexing required. They are among the cheapest wins available in a RAG system.

