Advanced RAG: Beyond Basic Retrieval

Author: Jared Chung

Introduction

Basic RAG is straightforward: embed documents, store them in a vector database, retrieve similar chunks, and generate a response. But production RAG systems face challenges that simple implementations can't handle: dense technical content, ambiguous queries, multi-hop reasoning, and the need for high precision at scale.

This guide covers advanced techniques that separate prototype RAG from production-grade systems.

[Figure: Advanced RAG pipeline]

The Limitations of Basic RAG

Before diving into solutions, let's understand what goes wrong:

Problem               | Symptom                               | Root Cause
Poor recall           | Misses relevant documents             | Simple similarity isn't enough
Low precision         | Returns irrelevant chunks             | No reranking or filtering
Context fragmentation | Loses important context               | Naive chunking strategies
Query mismatch        | User query doesn't match doc language | No query transformation
Hallucination         | Makes up information                  | Retrieved context too sparse

Query Transformation Techniques

The user's query is rarely optimal for retrieval. Transform it first.

Query Expansion

Generate multiple query variants to improve recall. A single query might miss relevant documents that use different terminology. By searching with multiple phrasings—including technical synonyms and related concepts—you cast a wider net and reduce the chance of missing important content.

import ast

def expand_query(query: str, llm) -> list[str]:
    prompt = f"""Generate 3 alternative phrasings of this query for search.
    Include technical synonyms and related concepts.

    Query: {query}

    Return as a Python list of strings."""

    response = llm.invoke(prompt)
    # Parse the returned list safely instead of eval-ing raw LLM output
    variants = ast.literal_eval(response.content)
    return [query] + variants

# Example
query = "How do I fine-tune LLaMA?"
expanded = expand_query(query, llm)
# ["How do I fine-tune LLaMA?",
#  "LLaMA model training customization",
#  "Adapting LLaMA weights for specific tasks",
#  "Fine-tuning open source large language models"]

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer, then search for documents similar to that answer. The insight is that a question's embedding and its answer's embedding often live in different parts of the vector space. By generating what an ideal answer might look like, you can search using an embedding that's closer to your actual documents.

def hyde_transform(query: str, llm) -> str:
    prompt = f"""Write a detailed paragraph that would answer this question.
    Write as if you're an expert, but don't make up specific facts.

    Question: {query}"""

    hypothetical_doc = llm.invoke(prompt).content
    return hypothetical_doc

# Search using the hypothetical document embedding
# instead of the query embedding
query = "What causes transformer attention to be slow?"
hyde_doc = hyde_transform(query, llm)
results = vectorstore.similarity_search(hyde_doc, k=5)

When to use HyDE: Technical queries where user language differs significantly from document language.

Step-Back Prompting

For specific questions, first ask a more general question:

def stepback_query(query: str, llm) -> str:
    prompt = f"""Given this specific question, generate a more general
    question that would help answer it.

    Specific: {query}
    General:"""

    return llm.invoke(prompt).content

# Example
specific = "What's the learning rate for fine-tuning BERT on NER?"
general = stepback_query(specific, llm)
# "What are best practices for fine-tuning BERT models?"

# Search for both and combine results
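
A minimal sketch of that last step, assuming a generic vectorstore with a similarity_search method and deduplicating by page content (both details are assumptions, not part of the example above):

def stepback_search(query: str, llm, vectorstore, k: int = 5):
    # Retrieve with both the specific query and its step-back generalization
    general = stepback_query(query, llm)
    docs = vectorstore.similarity_search(query, k=k)
    docs += vectorstore.similarity_search(general, k=k)

    # Drop duplicates while preserving retrieval order
    seen, combined = set(), []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            combined.append(doc)
    return combined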

Advanced Chunking Strategies

How you split documents dramatically affects retrieval quality.

Semantic Chunking

Split based on semantic shifts, not arbitrary lengths. Fixed-size chunking often cuts sentences in the middle or separates related concepts. Semantic chunking uses embeddings to detect topic boundaries—it groups sentences that discuss the same concept and splits when the topic changes significantly.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)

chunks = splitter.split_text(document)

Parent-Child Chunking

Store small chunks for retrieval, but return larger context. Small chunks embed more precisely because each embedding represents a focused concept. But small chunks lack the surrounding context needed for good answers. Parent-child chunking gives you the best of both: precise matching with rich context.

The pattern works by:

  1. Splitting documents into large "parent" chunks (e.g., 1000 tokens)
  2. Further splitting each parent into small "child" chunks (e.g., 200 tokens)
  3. Embedding and searching the children
  4. Returning the parents of matched children

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for precise matching
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Larger chunks for context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index the documents: splits them into parents and children, embeds the children
retriever.add_documents(documents)

# Searches small chunks, returns parent chunks
docs = retriever.get_relevant_documents(query)

Proposition-Based Chunking

Extract atomic facts from documents:

def extract_propositions(text: str, llm) -> list[str]:
    prompt = f"""Extract atomic facts from this text.
    Each fact should be self-contained and understandable without context.

    Text: {text}

    Facts:"""

    response = llm.invoke(prompt)
    # Treat each non-empty line of the response as one proposition
    return [line.lstrip("-• ").strip() for line in response.content.splitlines() if line.strip()]

# Each proposition becomes a chunk
# "BERT uses 12 transformer layers"
# "BERT was trained on BookCorpus and Wikipedia"

Hybrid Search

Combine dense (vector) and sparse (keyword) search for better results.

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword-based retrieval
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retriever.k = 5

# Vector-based retrieval
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # Tune based on your data
)

When Hybrid Beats Pure Vector

Query Type                           | Best Approach
Exact terms (API names, error codes) | BM25 heavy
Conceptual questions                 | Vector heavy
Mixed (concept + specific term)      | Balanced hybrid
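
If you can classify queries, even with a crude heuristic, you can choose the ensemble weights accordingly. A rough sketch, reusing the retrievers defined above; classify_query is a hypothetical helper returning "exact", "conceptual", or "mixed":

def build_hybrid_retriever(query_type: str) -> EnsembleRetriever:
    # Weight BM25 more heavily for exact-term queries, vectors for conceptual ones
    weights = {
        "exact": [0.7, 0.3],
        "conceptual": [0.3, 0.7],
        "mixed": [0.5, 0.5],
    }[query_type]
    return EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=weights,
    )

query_type = classify_query(query)  # hypothetical classifier
docs = build_hybrid_retriever(query_type).get_relevant_documents(query)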

Reranking

Initial retrieval prioritizes recall. Reranking improves precision.

Cross-Encoder Reranking

Cross-encoders are more accurate than bi-encoders but slower. The key difference: bi-encoders (used for initial retrieval) embed the query and documents independently, while cross-encoders process query-document pairs together. This joint processing allows the model to capture subtle relevance signals that independent embeddings miss.

The trade-off is speed—cross-encoders can't be pre-computed, so you can only apply them to a small candidate set. The pattern is to retrieve broadly first (20-50 documents), then rerank to get your final top results.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query: str, documents: list, top_k: int = 3):
    # Score each document against the query
    pairs = [[query, doc.page_content] for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return [doc for doc, score in scored_docs[:top_k]]

# Retrieve many, rerank to few
initial_docs = vectorstore.similarity_search(query, k=20)
final_docs = rerank_results(query, initial_docs, top_k=5)

Cohere Rerank API

Production-ready reranking as a service:

import cohere

co = cohere.Client(api_key)

def cohere_rerank(query: str, documents: list, top_k: int = 5):
    results = co.rerank(
        query=query,
        documents=[doc.page_content for doc in documents],
        top_n=top_k,
        model="rerank-english-v2.0"
    )

    return [documents[r.index] for r in results]

LLM-Based Reranking

Use the LLM itself to judge relevance:

import json

def format_documents(documents: list) -> str:
    # Number each document so scores can be mapped back by index
    return "\n\n".join(
        f"doc_{i}: {doc.page_content}" for i, doc in enumerate(documents)
    )

def llm_rerank(query: str, documents: list, llm, top_k: int = 3):
    prompt = f"""Rate the relevance of each document to the query.
    Score 1-10 where 10 is highly relevant.

    Query: {query}

    Documents:
    {format_documents(documents)}

    Return scores as JSON: {{"doc_0": score, "doc_1": score, ...}}"""

    scores = json.loads(llm.invoke(prompt).content)
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    return [documents[int(k.split("_")[1])] for k, v in ranked[:top_k]]

Multi-Vector Retrieval

Represent documents with multiple embeddings for richer matching.

Summary + Content Embeddings

def create_multi_vector_doc(doc, llm):
    # embed() below stands in for your embedding function
    # Generate summary
    summary = llm.invoke(f"Summarize: {doc.page_content}").content

    # Generate questions this doc answers
    questions = llm.invoke(
        f"What questions does this answer? {doc.page_content}"
    ).content

    return {
        "content": doc.page_content,
        "content_embedding": embed(doc.page_content),
        "summary_embedding": embed(summary),
        "questions_embedding": embed(questions),
    }

# Search across all embedding types
# Return the original content
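
One way to wire this up is LangChain's MultiVectorRetriever, which searches the derived embeddings but returns the original documents from a separate docstore. A rough sketch, assuming docs is your list of original documents and summaries holds the generated summaries from above:

import uuid

from langchain.retrievers import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document

id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,   # holds summary/question embeddings
    docstore=InMemoryStore(),  # holds the original documents
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

# Index the summaries, each linked back to its parent document id
summary_docs = [
    Document(page_content=summary, metadata={id_key: doc_ids[i]})
    for i, summary in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)

# Store the originals; these are what the retriever returns
retriever.docstore.mset(list(zip(doc_ids, docs)))

results = retriever.get_relevant_documents(query)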

Contextual Compression

Reduce noise by extracting only relevant portions:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)

# Returns only the relevant sentences from each document
docs = compression_retriever.get_relevant_documents(query)

Self-Query Retrieval

Let the LLM write the query filters:

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

metadata_field_info = [
    AttributeInfo(name="category", type="string", description="Document category"),
    AttributeInfo(name="date", type="date", description="Publication date"),
    AttributeInfo(name="author", type="string", description="Author name"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_content_description="Technical blog posts about ML",
    metadata_field_info=metadata_field_info,
)

# User: "Articles about transformers from 2024"
# Auto-generates: filter={"date": {"$gte": "2024-01-01"}}

Evaluation and Iteration

You can't improve what you don't measure.

Key Metrics

def evaluate_rag(test_set, retriever, generator):
    results = {
        "retrieval_precision": [],
        "retrieval_recall": [],
        "answer_relevance": [],
        "faithfulness": [],
    }

    for item in test_set:
        query = item["query"]
        ground_truth_docs = item["relevant_docs"]
        ground_truth_answer = item["answer"]

        # Retrieval metrics
        retrieved = retriever.get_relevant_documents(query)
        precision = calculate_precision(retrieved, ground_truth_docs)
        recall = calculate_recall(retrieved, ground_truth_docs)

        # Generation metrics
        answer = generator.generate(query, retrieved)
        relevance = judge_relevance(answer, query)
        faithfulness = check_faithfulness(answer, retrieved)

        results["retrieval_precision"].append(precision)
        # ... etc

    return {k: sum(v)/len(v) for k, v in results.items()}
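
The metric helpers aren't defined in this post. A minimal sketch of the two retrieval metrics, assuming relevance is judged by matching page content (swap in document IDs if you have them):

def calculate_precision(retrieved: list, relevant: list) -> float:
    # Fraction of retrieved documents that are actually relevant
    relevant_set = {doc.page_content for doc in relevant}
    hits = sum(1 for doc in retrieved if doc.page_content in relevant_set)
    return hits / len(retrieved) if retrieved else 0.0

def calculate_recall(retrieved: list, relevant: list) -> float:
    # Fraction of relevant documents that were retrieved
    retrieved_set = {doc.page_content for doc in retrieved}
    hits = sum(1 for doc in relevant if doc.page_content in retrieved_set)
    return hits / len(relevant) if relevant else 0.0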

Building Test Sets

Create diverse test cases:

test_set = [
    {
        "query": "What is the attention mechanism?",
        "type": "factual",
        "difficulty": "easy"
    },
    {
        "query": "Compare BERT and GPT architectures",
        "type": "comparison",
        "difficulty": "medium"
    },
    {
        "query": "How would you fine-tune for low-resource NER?",
        "type": "reasoning",
        "difficulty": "hard"
    },
]

Putting It All Together

A production RAG pipeline might look like:

class AdvancedRAGPipeline:
    def __init__(self, vectorstore, llm, reranker):
        self.vectorstore = vectorstore
        self.llm = llm
        self.reranker = reranker

    def query(self, user_query: str) -> str:
        # 1. Query expansion
        queries = self.expand_query(user_query)

        # 2. Hybrid retrieval
        all_docs = []
        for q in queries:
            vector_docs = self.vectorstore.similarity_search(q, k=10)
            bm25_docs = self.bm25_search(q, k=10)
            all_docs.extend(vector_docs + bm25_docs)

        # 3. Deduplicate
        unique_docs = self.deduplicate(all_docs)

        # 4. Rerank
        top_docs = self.reranker.rerank(user_query, unique_docs, k=5)

        # 5. Compress context
        compressed = self.compress_context(user_query, top_docs)

        # 6. Generate
        response = self.generate(user_query, compressed)

        return response
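
Usage is then a single call; the helper methods (expand_query, bm25_search, deduplicate, compress_context, generate) are assumed to be implemented along the lines shown earlier in this post:

pipeline = AdvancedRAGPipeline(vectorstore, llm, reranker)
answer = pipeline.query("How do I fine-tune LLaMA?")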

Conclusion

Advanced RAG is about combining multiple techniques thoughtfully. Start with the basics, measure your results, identify failure modes, and add complexity only where it helps.

The best RAG system isn't the most sophisticated one; it's the one that reliably answers your users' questions with accurate, grounded responses.