Building RAG Systems: Retrieval Augmented Generation from Scratch

Author: Jared Chung

Introduction

Large Language Models are powerful but have limitations: they can hallucinate, have knowledge cutoffs, and don't know about your private data. Retrieval Augmented Generation (RAG) solves these problems by retrieving relevant context from a knowledge base before generating responses.

In this post, we'll build a production-ready RAG system from scratch, covering document processing, chunking strategies, retrieval techniques, and evaluation.

Prerequisites

# Core packages
pip install openai chromadb sentence-transformers
pip install PyPDF2 python-docx nltk tiktoken
pip install rank-bm25 ragas

# Download NLTK data
python -c "import nltk; nltk.download('punkt')"

Why RAG?

Problem             | Without RAG                       | With RAG
Knowledge cutoff    | Can't answer about recent events  | Retrieves up-to-date information
Hallucinations      | Makes up facts confidently        | Grounds answers in retrieved context
Private data        | No access to your documents       | Searches your knowledge base
Source attribution  | No way to verify claims           | Can cite specific sources
Cost                | Long prompts with all context     | Only includes relevant context

How RAG Works

User Query → Embed Query → Vector Search → Retrieve Documents → LLM + Context → Response

The RAG Pipeline

  1. Embed: Convert the query to a vector using the same embedding model used for indexing
  2. Retrieve: Find similar documents in the vector database (typically top-k)
  3. Augment: Add retrieved documents to the LLM prompt as context
  4. Generate: LLM produces an answer grounded in the context

┌─────────────────────────────────────────────────────────────────┐
│                         INDEXING PHASE                          │
│   Documents → Chunk → Embed → Store in Vector DB                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                         QUERY PHASE                             │
│   Query → Embed → Search Vector DB → Retrieve Top-K → LLM       │
└─────────────────────────────────────────────────────────────────┘
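
To make the pipeline concrete before we build each stage properly, here's a minimal sketch of the query phase that maps the four steps above onto code. It's a toy example: the sample chunks, prompt wording, and top-k value are placeholders, and the in-memory dot-product search stands in for the vector database we set up below.

from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

# Toy corpus standing in for real, pre-chunked documents
chunks = [
    "RAG retrieves relevant context from a knowledge base before the LLM answers.",
    "Sentence-transformer models map text to dense vectors for similarity search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # indexing phase

def answer(query: str, k: int = 2) -> str:
    # 1. Embed the query with the same model used for indexing
    q_vec = model.encode(query, normalize_embeddings=True)
    # 2. Retrieve: on normalized vectors, cosine similarity is a dot product
    top_k = np.argsort(chunk_vecs @ q_vec)[::-1][:k]
    # 3. Augment: stitch the retrieved chunks into the prompt
    context = "\n\n".join(chunks[i] for i in top_k)
    # 4. Generate: ask the LLM to answer from that context only
    response = OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return response.choices[0].message.content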

Building RAG from Scratch

Step 1: Document Loading

from pathlib import Path
import PyPDF2
from docx import Document

def load_pdf(file_path: str) -> str:
    """Extract text from PDF."""
    text = ""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() + "\n"
    return text

def load_docx(file_path: str) -> str:
    """Extract text from Word document."""
    doc = Document(file_path)
    return "\n".join([para.text for para in doc.paragraphs])

def load_documents(directory: str) -> list[dict]:
    """Load all documents from a directory."""
    documents = []
    path = Path(directory)

    for file_path in path.rglob("*"):
        if file_path.suffix == ".pdf":
            content = load_pdf(str(file_path))
        elif file_path.suffix == ".docx":
            content = load_docx(str(file_path))
        elif file_path.suffix in [".txt", ".md"]:
            content = file_path.read_text()
        else:
            continue

        documents.append({
            "content": content,
            "source": str(file_path),
            "filename": file_path.name
        })

    return documents

Step 2: Chunking Strategies

Chunking is crucial for RAG performance: chunks that are too large lose specificity, while chunks that are too small lose context.

from typing import List

def chunk_by_sentences(text: str, chunk_size: int = 3) -> List[str]:
    """Chunk by number of sentences."""
    import nltk
    nltk.download('punkt', quiet=True)

    sentences = nltk.sent_tokenize(text)
    chunks = []

    for i in range(0, len(sentences), chunk_size):
        chunk = " ".join(sentences[i:i + chunk_size])
        chunks.append(chunk)

    return chunks

def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Chunk with character overlap for context preservation."""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size

        # Find natural break point (end of sentence)
        if end < len(text):
            # Look for period, question mark, or exclamation
            for i in range(end, max(start, end - 100), -1):
                if text[i] in '.!?\n':
                    end = i + 1
                    break

        chunks.append(text[start:end].strip())
        start = end - overlap

    return chunks

def recursive_chunk(text: str, chunk_size: int = 500) -> List[str]:
    """Recursively split by different separators."""
    separators = ["\n\n", "\n", ". ", " "]

    def split_text(text: str, separators: list) -> List[str]:
        if not separators or len(text) <= chunk_size:
            return [text] if text.strip() else []

        sep = separators[0]
        parts = text.split(sep)
        chunks = []
        current = ""

        for part in parts:
            if len(current) + len(part) <= chunk_size:
                current += (sep if current else "") + part
            else:
                if current:
                    chunks.append(current)
                    current = ""  # reset so the flushed chunk isn't reused
                if len(part) > chunk_size:
                    # Part is still too large: split it with the next separator
                    chunks.extend(split_text(part, separators[1:]))
                else:
                    current = part

        if current:
            chunks.append(current)

        return chunks

    return split_text(text, separators)

Semantic Chunking

Advanced approach that chunks based on semantic similarity:

from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text: str, threshold: float = 0.5) -> List[str]:
    """Chunk based on semantic similarity between sentences."""
    import nltk
    sentences = nltk.sent_tokenize(text)

    if len(sentences) <= 1:
        return [text]

    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Compare with previous sentence
        similarity = np.dot(embeddings[i], embeddings[i-1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1])
        )

        if similarity >= threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

Step 3: Indexing

import chromadb
from sentence_transformers import SentenceTransformer

class RAGIndexer:
    def __init__(self, collection_name: str = "rag_docs"):
        self.client = chromadb.PersistentClient(path="./rag_db")
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def index_documents(self, documents: list[dict]):
        """Index documents with their chunks."""
        all_chunks = []
        all_metadatas = []
        all_ids = []

        for doc_idx, doc in enumerate(documents):
            chunks = recursive_chunk(doc["content"])

            for chunk_idx, chunk in enumerate(chunks):
                all_chunks.append(chunk)
                all_metadatas.append({
                    "source": doc["source"],
                    "filename": doc["filename"],
                    "chunk_index": chunk_idx
                })
                all_ids.append(f"doc_{doc_idx}_chunk_{chunk_idx}")

        # Generate embeddings
        embeddings = self.model.encode(all_chunks).tolist()

        # Add to collection
        self.collection.add(
            documents=all_chunks,
            embeddings=embeddings,
            metadatas=all_metadatas,
            ids=all_ids
        )

        print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")

    def search(self, query: str, n_results: int = 5) -> list[dict]:
        """Search for relevant chunks."""
        query_embedding = self.model.encode(query).tolist()

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results
        )

        return [
            {
                "content": doc,
                "metadata": meta,
                "distance": dist
            }
            for doc, meta, dist in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0]
            )
        ]

Step 4: Generation with LLM

from openai import OpenAI

class RAGGenerator:
    def __init__(self, indexer: RAGIndexer):
        self.indexer = indexer
        self.client = OpenAI()

    def generate(self, query: str, n_context: int = 5) -> str:
        """Generate answer using retrieved context."""
        # Retrieve relevant chunks
        results = self.indexer.search(query, n_results=n_context)

        # Build context
        context = "\n\n---\n\n".join([r["content"] for r in results])

        # Create prompt
        prompt = f"""Answer the question based on the following context. If the answer is not in the context, say "I don't have enough information to answer this question."

Context:
{context}

Question: {query}

Answer:"""

        # Generate response
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7
        )

        return response.choices[0].message.content

    def generate_with_sources(self, query: str, n_context: int = 5) -> dict:
        """Generate answer with source attribution."""
        results = self.indexer.search(query, n_results=n_context)

        context_with_refs = ""
        for i, r in enumerate(results, 1):
            context_with_refs += f"[{i}] {r['content']}\n\n"

        prompt = f"""Answer the question based on the following numbered sources. Cite sources using [1], [2], etc.

Sources:
{context_with_refs}

Question: {query}

Answer (with citations):"""

        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )

        return {
            "answer": response.choices[0].message.content,
            "sources": [{"content": r["content"], "metadata": r["metadata"]} for r in results]
        }

Advanced RAG Techniques

Hybrid Search (BM25 + Vector)

Combine keyword matching with semantic search:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

class HybridRetriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

        # Prepare BM25
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Prepare embeddings
        self.embeddings = self.model.encode(documents)

    def search(self, query: str, k: int = 5, alpha: float = 0.5) -> list[dict]:
        """
        Hybrid search with alpha weighting.
        alpha=1.0 is pure vector, alpha=0.0 is pure BM25
        """
        # BM25 scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_scores = bm25_scores / (bm25_scores.max() + 1e-6)  # Normalize

        # Vector scores
        query_embedding = self.model.encode(query)
        vector_scores = np.dot(self.embeddings, query_embedding) / (
            np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_embedding)
        )

        # Combine scores
        combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores

        # Get top-k
        top_indices = np.argsort(combined_scores)[::-1][:k]

        return [
            {"content": self.documents[i], "score": combined_scores[i]}
            for i in top_indices
        ]

Query Expansion

Improve retrieval by expanding the query:

def expand_query(query: str, client: OpenAI) -> list[str]:
    """Generate multiple query variations for better retrieval."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Generate 3 alternative phrasings of the query."},
            {"role": "user", "content": f"Query: {query}\n\nAlternative phrasings:"}
        ]
    )

    alternatives = response.choices[0].message.content.strip().split("\n")
    return [query] + [alt.strip("1234567890.-) ") for alt in alternatives if alt.strip()]
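
One way to use the expanded queries is to retrieve with every variation and merge the hits, keeping each chunk's best score. This is a hedged sketch built on the RAGIndexer from Step 3; the deduplication key (the chunk text itself) is just for illustration.

def multi_query_search(indexer: RAGIndexer, client: OpenAI, query: str, k: int = 5) -> list[dict]:
    """Retrieve with every query variation and merge results, deduplicating by content."""
    merged: dict[str, dict] = {}
    for variant in expand_query(query, client):
        for hit in indexer.search(variant, n_results=k):
            key = hit["content"]
            # Keep the lowest distance seen for each chunk (lower = more similar)
            if key not in merged or hit["distance"] < merged[key]["distance"]:
                merged[key] = hit
    return sorted(merged.values(), key=lambda h: h["distance"])[:k]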

Re-ranking

Use a cross-encoder for more accurate relevance scoring:

from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self):
        self.model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    def rerank(self, query: str, documents: list[str], top_k: int = 5) -> list[dict]:
        """Rerank documents using cross-encoder."""
        pairs = [[query, doc] for doc in documents]
        scores = self.model.predict(pairs)

        ranked = sorted(
            zip(documents, scores),
            key=lambda x: x[1],
            reverse=True
        )

        return [{"content": doc, "score": score} for doc, score in ranked[:top_k]]
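
In practice, reranking is paired with over-retrieval: fetch a generous candidate set with fast vector search, then let the cross-encoder trim it. A short sketch reusing the RAGIndexer and Reranker defined above (the 20-to-5 ratio is just a common starting point):

def retrieve_and_rerank(indexer: RAGIndexer, reranker: Reranker, query: str) -> list[dict]:
    # Cast a wide net cheaply, then apply the more expensive cross-encoder
    candidates = indexer.search(query, n_results=20)
    return reranker.rerank(query, [c["content"] for c in candidates], top_k=5)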

Evaluation

Metrics for RAG

  1. Retrieval Quality: Are we finding the right documents?

    • Recall@K
    • Mean Reciprocal Rank (MRR)
    • Normalized Discounted Cumulative Gain (NDCG)
  2. Generation Quality: Is the answer correct and well-formed?

    • Faithfulness: Is the answer supported by context?
    • Relevance: Does it answer the question?
    • Coherence: Is it well-written?

Here's a simple Recall@K implementation:

def evaluate_retrieval(queries: list[str], ground_truth: list[list[str]], retriever, k: int = 5):
    """Evaluate retrieval using Recall@K."""
    recalls = []

    for query, relevant_docs in zip(queries, ground_truth):
        results = retriever.search(query, k=k)
        retrieved_docs = [r["content"] for r in results]

        # Calculate recall
        hits = sum(1 for doc in relevant_docs if doc in retrieved_docs)
        recall = hits / len(relevant_docs) if relevant_docs else 0
        recalls.append(recall)

    return {
        "recall@k": np.mean(recalls),
        "k": k
    }
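
Mean Reciprocal Rank can be computed under the same assumptions (exact string match against the ground-truth documents); a minimal sketch:

def evaluate_mrr(queries: list[str], ground_truth: list[list[str]], retriever, k: int = 5) -> float:
    """MRR: average of 1/rank of the first relevant document (0 if none retrieved)."""
    reciprocal_ranks = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = [r["content"] for r in retriever.search(query, k=k)]
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant_docs:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return float(np.mean(reciprocal_ranks))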

Using RAGAS for Evaluation

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Prepare evaluation data (column names follow the RAGAS schema)
eval_data = {
    "question": ["What is machine learning?"],
    "answer": ["Machine learning is a subset of AI..."],
    "contexts": [["ML is a type of AI that learns from data..."]],
    "ground_truth": ["Machine learning is a field of AI..."]
}

# Evaluate (RAGAS expects a Hugging Face Dataset, not a plain dict)
result = evaluate(
    dataset=Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision]
)

print(result)

Production Considerations

  1. Caching: Cache embeddings and frequent queries
  2. Streaming: Stream LLM responses for better UX
  3. Monitoring: Track retrieval quality and latency
  4. Updates: Handle document updates efficiently
  5. Cost: Balance chunk size with API costs

For example, caching query embeddings:

import functools
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # shared model so the cache pays off

@functools.lru_cache(maxsize=1000)
def cached_embed(text: str) -> tuple:
    """Cache embeddings for repeated queries (tuple keeps the result hashable)."""
    return tuple(model.encode(text).tolist())
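
Streaming (item 2 above) is straightforward with the OpenAI client; a minimal sketch that yields tokens as they arrive:

from openai import OpenAI

def stream_answer(prompt: str, model: str = "gpt-4o-mini"):
    """Yield response tokens as they arrive instead of waiting for the full answer."""
    client = OpenAI()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

# Usage: print tokens as they stream in
# for token in stream_answer("Summarize the retrieved context ..."):
#     print(token, end="", flush=True)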

Complete Production RAG System

Here's a complete, production-ready RAG system you can adapt:

from openai import OpenAI
from sentence_transformers import SentenceTransformer, CrossEncoder
import chromadb
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Dict, Optional
from dataclasses import dataclass
import hashlib
import tiktoken

@dataclass
class Chunk:
    """A document chunk with metadata."""
    content: str
    metadata: Dict
    id: str

class ProductionRAG:
    """Production-ready RAG system with hybrid search and reranking."""

    def __init__(
        self,
        embedding_model: str = "all-MiniLM-L6-v2",
        llm_model: str = "gpt-4o-mini",
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        collection_name: str = "rag_docs",
        persist_path: str = "./rag_db"
    ):
        # Models
        self.embedder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder(reranker_model)
        self.llm_model = llm_model
        self.client = OpenAI()

        # Vector store
        self.chroma_client = chromadb.PersistentClient(path=persist_path)
        self.collection = self.chroma_client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

        # BM25 index (built on first search)
        self.bm25 = None
        self.bm25_corpus = []
        self.bm25_ids = []

        # Token counter for context management (the gpt-4 encoding is a close
        # enough approximation for budgeting gpt-4o-mini context)
        self.tokenizer = tiktoken.encoding_for_model("gpt-4")

    def chunk_document(
        self,
        text: str,
        chunk_size: int = 500,
        overlap: int = 50,
        source: str = "unknown"
    ) -> List[Chunk]:
        """Chunk a document with overlap."""
        chunks = []
        start = 0
        chunk_index = 0

        while start < len(text):
            end = min(start + chunk_size, len(text))

            # Find sentence boundary
            if end < len(text):
                for i in range(end, max(start, end - 100), -1):
                    if text[i] in '.!?\n':
                        end = i + 1
                        break

            chunk_text = text[start:end].strip()
            if chunk_text:
                chunk_id = hashlib.md5(f"{source}:{chunk_index}".encode()).hexdigest()[:16]
                chunks.append(Chunk(
                    content=chunk_text,
                    metadata={"source": source, "chunk_index": chunk_index},
                    id=chunk_id
                ))
                chunk_index += 1

            # Stop at the end of the text; otherwise the overlap would keep
            # re-emitting the final chunk forever
            if end >= len(text):
                break
            start = end - overlap

        return chunks

    def index_documents(self, documents: List[Dict[str, str]], chunk_size: int = 500):
        """Index documents into the vector store."""
        all_chunks = []

        for doc in documents:
            chunks = self.chunk_document(
                doc["content"],
                chunk_size=chunk_size,
                source=doc.get("source", "unknown")
            )
            all_chunks.extend(chunks)

        # Batch embed and store
        contents = [c.content for c in all_chunks]
        embeddings = self.embedder.encode(contents, show_progress_bar=True).tolist()

        self.collection.upsert(
            ids=[c.id for c in all_chunks],
            documents=contents,
            embeddings=embeddings,
            metadatas=[c.metadata for c in all_chunks]
        )

        # Rebuild BM25 index
        self._build_bm25_index()

        print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")

    def _build_bm25_index(self):
        """Build BM25 index from all documents."""
        data = self.collection.get(include=["documents"])
        self.bm25_corpus = [doc.lower().split() for doc in data["documents"]]
        self.bm25_ids = data["ids"]
        self.bm25 = BM25Okapi(self.bm25_corpus)

    def hybrid_search(
        self,
        query: str,
        top_k: int = 20,
        alpha: float = 0.5
    ) -> List[Dict]:
        """Hybrid search combining vector and BM25."""
        # Vector search
        query_embedding = self.embedder.encode(query).tolist()
        vector_results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        # BM25 search
        if self.bm25 is None:
            self._build_bm25_index()

        bm25_scores = self.bm25.get_scores(query.lower().split())
        top_bm25_indices = np.argsort(bm25_scores)[::-1][:top_k]

        # Combine results
        combined = {}

        # Add vector results
        for i, doc_id in enumerate(vector_results["ids"][0]):
            distance = vector_results["distances"][0][i]
            vector_score = 1 - distance  # Convert distance to similarity
            combined[doc_id] = {
                "content": vector_results["documents"][0][i],
                "metadata": vector_results["metadatas"][0][i],
                "vector_score": vector_score,
                "bm25_score": 0.0
            }

        # Add BM25 results
        max_bm25 = max(bm25_scores) + 1e-6
        for idx in top_bm25_indices:
            doc_id = self.bm25_ids[idx]
            bm25_score = bm25_scores[idx] / max_bm25  # Normalize

            if doc_id in combined:
                combined[doc_id]["bm25_score"] = bm25_score
            else:
                # Fetch from collection
                result = self.collection.get(ids=[doc_id], include=["documents", "metadatas"])
                combined[doc_id] = {
                    "content": result["documents"][0],
                    "metadata": result["metadatas"][0],
                    "vector_score": 0.0,
                    "bm25_score": bm25_score
                }

        # Calculate combined score
        results = []
        for doc_id, data in combined.items():
            combined_score = alpha * data["vector_score"] + (1 - alpha) * data["bm25_score"]
            results.append({
                "id": doc_id,
                "content": data["content"],
                "metadata": data["metadata"],
                "score": combined_score
            })

        return sorted(results, key=lambda x: x["score"], reverse=True)[:top_k]

    def rerank(self, query: str, results: List[Dict], top_k: int = 5) -> List[Dict]:
        """Rerank results using cross-encoder."""
        if not results:
            return []

        pairs = [[query, r["content"]] for r in results]
        scores = self.reranker.predict(pairs)

        for i, result in enumerate(results):
            result["rerank_score"] = float(scores[i])

        return sorted(results, key=lambda x: x["rerank_score"], reverse=True)[:top_k]

    def count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))

    def build_context(
        self,
        results: List[Dict],
        max_tokens: int = 3000
    ) -> tuple[str, List[Dict]]:
        """Build context string from results, respecting token limit."""
        context_parts = []
        used_results = []
        total_tokens = 0

        for i, result in enumerate(results):
            source = result["metadata"].get("source", "Unknown")
            chunk_text = f"[Source {i+1}: {source}]\n{result['content']}\n"
            chunk_tokens = self.count_tokens(chunk_text)

            if total_tokens + chunk_tokens > max_tokens:
                break

            context_parts.append(chunk_text)
            used_results.append(result)
            total_tokens += chunk_tokens

        return "\n".join(context_parts), used_results

    def generate(
        self,
        query: str,
        context: str,
        temperature: float = 0.7
    ) -> str:
        """Generate answer using LLM."""
        system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Rules:
1. Only use information from the context to answer
2. If the context doesn't contain the answer, say "I don't have enough information"
3. Cite sources using [Source N] when referencing information
4. Be concise but thorough"""

        user_prompt = f"""Context:
{context}

Question: {query}

Answer:"""

        response = self.client.chat.completions.create(
            model=self.llm_model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=temperature
        )

        return response.choices[0].message.content

    def query(
        self,
        question: str,
        top_k_retrieve: int = 20,
        top_k_rerank: int = 5,
        alpha: float = 0.5,
        max_context_tokens: int = 3000
    ) -> Dict:
        """Full RAG pipeline: retrieve, rerank, generate."""
        # Step 1: Hybrid search
        search_results = self.hybrid_search(question, top_k=top_k_retrieve, alpha=alpha)

        # Step 2: Rerank
        reranked_results = self.rerank(question, search_results, top_k=top_k_rerank)

        # Step 3: Build context
        context, used_sources = self.build_context(reranked_results, max_tokens=max_context_tokens)

        # Step 4: Generate
        answer = self.generate(question, context)

        return {
            "question": question,
            "answer": answer,
            "sources": [
                {
                    "content": r["content"][:200] + "...",
                    "source": r["metadata"].get("source", "Unknown"),
                    "score": r.get("rerank_score", r.get("score", 0))
                }
                for r in used_sources
            ]
        }


# Usage example
if __name__ == "__main__":
    rag = ProductionRAG()

    # Sample documents to index
    documents = [
        {
            "content": """
            Machine learning is a subset of artificial intelligence that enables systems
            to learn and improve from experience without being explicitly programmed.
            It focuses on developing algorithms that can access data and use it to learn
            for themselves. The process begins with observations or data, such as examples,
            direct experience, or instruction, to look for patterns in data and make better
            decisions in the future.
            """,
            "source": "ml_intro.txt"
        },
        {
            "content": """
            Deep learning is a type of machine learning based on artificial neural networks
            with multiple layers (hence 'deep'). These layers progressively extract higher-level
            features from raw input. In image processing, lower layers may identify edges,
            while higher layers may identify human faces. Deep learning has revolutionized
            fields like computer vision, natural language processing, and speech recognition.
            """,
            "source": "deep_learning.txt"
        },
        {
            "content": """
            Natural Language Processing (NLP) is a branch of AI that helps computers
            understand, interpret, and manipulate human language. NLP combines computational
            linguistics with statistical, machine learning, and deep learning models.
            Applications include machine translation, sentiment analysis, chatbots, and
            text summarization.
            """,
            "source": "nlp_overview.txt"
        }
    ]

    # Index documents
    rag.index_documents(documents)

    # Query
    result = rag.query("What is the relationship between deep learning and machine learning?")

    print(f"Question: {result['question']}\n")
    print(f"Answer: {result['answer']}\n")
    print("Sources:")
    for i, source in enumerate(result['sources'], 1):
        print(f"  [{i}] {source['source']} (score: {source['score']:.3f})")

Conversational RAG with Memory

For follow-up questions, maintain conversation history:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ConversationRAG:
    """RAG with conversation memory for follow-up questions."""
    rag: ProductionRAG
    history: List[Tuple[str, str]] = field(default_factory=list)
    max_history: int = 5

    def reformulate_query(self, question: str) -> str:
        """Use LLM to reformulate question with context from history."""
        if not self.history:
            return question

        history_text = "\n".join([
            f"User: {q}\nAssistant: {a[:200]}..."
            for q, a in self.history[-self.max_history:]
        ])

        prompt = f"""Given this conversation history and a follow-up question,
reformulate the question to be standalone (include all necessary context).

Conversation:
{history_text}

Follow-up question: {question}

Reformulated question:"""

        response = self.rag.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        return response.choices[0].message.content.strip()

    def query(self, question: str) -> Dict:
        """Query with conversation context."""
        # Reformulate if there's history
        standalone_question = self.reformulate_query(question)

        # Run RAG
        result = self.rag.query(standalone_question)

        # Update history
        self.history.append((question, result["answer"]))

        return {
            **result,
            "original_question": question,
            "reformulated_question": standalone_question
        }

    def clear_history(self):
        """Clear conversation history."""
        self.history = []


# Usage
rag = ProductionRAG()
conversation = ConversationRAG(rag)

# First question
result1 = conversation.query("What is machine learning?")
print(result1["answer"])

# Follow-up (will be reformulated to include context)
result2 = conversation.query("How does deep learning relate to it?")
print(f"Reformulated: {result2['reformulated_question']}")
print(result2["answer"])

Debugging Common RAG Issues

Issue 1: Poor Retrieval

def debug_retrieval(rag: ProductionRAG, query: str, expected_source: str):
    """Debug why expected content isn't being retrieved."""
    results = rag.hybrid_search(query, top_k=20)

    print(f"Query: {query}")
    print(f"Looking for source: {expected_source}\n")

    found = False
    for i, r in enumerate(results):
        source = r["metadata"].get("source", "")
        if expected_source in source:
            found = True
            print(f"Found at position {i+1}")
            print(f"  Score: {r['score']:.4f}")
            print(f"  Content preview: {r['content'][:100]}...")
            break

    if not found:
        print("Not found in top 20 results!")
        print("\nTop 5 retrieved sources:")
        for r in results[:5]:
            print(f"  - {r['metadata'].get('source')}: {r['score']:.4f}")

Issue 2: Hallucination Check

def check_grounding(answer: str, sources: List[str]) -> Dict:
    """Check if answer is grounded in sources."""
    client = OpenAI()

    prompt = f"""Analyze whether this answer is fully supported by the sources.

Answer: {answer}

Sources:
{chr(10).join(sources)}

Return JSON with:
- "grounded": true/false
- "unsupported_claims": list of claims not in sources
- "confidence": 0-1 score"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )

    import json
    return json.loads(response.choices[0].message.content)

Conclusion

RAG is the bridge between LLMs and your private knowledge. Key takeaways:

  • Chunking matters: Test different strategies for your domain (500-1000 chars with overlap is a good start)
  • Hybrid search: Combine BM25 and vectors for better recall (alpha=0.5 is a reasonable default)
  • Re-ranking: Cross-encoders improve precision significantly (use for top-20 to top-5)
  • Token management: Track context size to stay within LLM limits
  • Evaluate: Use metrics like RAGAS to measure quality
  • Conversation: Reformulate follow-up questions for better retrieval

Start simple (vector search only), then add hybrid search and reranking as needed.

References

  • Lewis et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
  • Gao et al. "Retrieval-Augmented Generation for Large Language Models: A Survey" (2024)
  • RAGAS: https://docs.ragas.io
  • Robertson & Zaragoza "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)