Building RAG Systems: Retrieval Augmented Generation from Scratch
Author: Jared Chung
Introduction
Large Language Models are powerful but have limitations: they can hallucinate, have knowledge cutoffs, and don't know about your private data. Retrieval Augmented Generation (RAG) solves these problems by retrieving relevant context from a knowledge base before generating responses.
In this post, we'll build a production-ready RAG system from scratch, covering document processing, chunking strategies, retrieval techniques, and evaluation.
Prerequisites
# Core packages
pip install openai chromadb sentence-transformers
pip install PyPDF2 python-docx nltk tiktoken
pip install rank-bm25 ragas
# Download NLTK data
python -c "import nltk; nltk.download('punkt')"
Why RAG?
| Problem | Without RAG | With RAG |
|---|---|---|
| Knowledge cutoff | Can't answer about recent events | Retrieves up-to-date information |
| Hallucinations | Makes up facts confidently | Grounds answers in retrieved context |
| Private data | No access to your documents | Searches your knowledge base |
| Source attribution | No way to verify claims | Can cite specific sources |
| Cost | Long prompts with all context | Only includes relevant context |
How RAG Works
User Query → Embed Query → Vector Search → Retrieve Documents → LLM + Context → Response
The RAG Pipeline
- Embed: Convert the query to a vector using the same embedding model used for indexing
- Retrieve: Find similar documents in the vector database (typically top-k)
- Augment: Add retrieved documents to the LLM prompt as context
- Generate: LLM produces an answer grounded in the context
┌─────────────────────────────────────────────────────────────────┐
│ INDEXING PHASE │
│ Documents → Chunk → Embed → Store in Vector DB │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ QUERY PHASE │
│ Query → Embed → Search Vector DB → Retrieve Top-K → LLM │
└─────────────────────────────────────────────────────────────────┘
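Condensed into code, the query phase is only a handful of lines. Here is a minimal sketch; `embed`, `vector_db`, and `llm` are placeholders for the components we build in the rest of this post.
def answer(query: str, embed, vector_db, llm, k: int = 5) -> str:
    """Minimal RAG query loop: embed, retrieve, augment, generate."""
    query_vector = embed(query)                           # 1. Embed the query
    chunks = vector_db.search(query_vector, top_k=k)      # 2. Retrieve top-k chunks
    context = "\n\n".join(c["content"] for c in chunks)   # 3. Augment the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)                                     # 4. Generate a grounded answer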
Building RAG from Scratch
Step 1: Document Loading
from pathlib import Path
import PyPDF2
from docx import Document
def load_pdf(file_path: str) -> str:
"""Extract text from PDF."""
text = ""
with open(file_path, 'rb') as file:
reader = PyPDF2.PdfReader(file)
for page in reader.pages:
text += page.extract_text() + "\n"
return text
def load_docx(file_path: str) -> str:
"""Extract text from Word document."""
doc = Document(file_path)
return "\n".join([para.text for para in doc.paragraphs])
def load_documents(directory: str) -> list[dict]:
"""Load all documents from a directory."""
documents = []
path = Path(directory)
for file_path in path.rglob("*"):
if file_path.suffix == ".pdf":
content = load_pdf(str(file_path))
elif file_path.suffix == ".docx":
content = load_docx(str(file_path))
elif file_path.suffix in [".txt", ".md"]:
content = file_path.read_text()
else:
continue
documents.append({
"content": content,
"source": str(file_path),
"filename": file_path.name
})
return documents
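A quick usage check before moving on, assuming your files live in a local docs/ folder (the path here is just illustrative):
documents = load_documents("./docs")  # illustrative path
print(f"Loaded {len(documents)} documents")
print(documents[0]["filename"], "->", len(documents[0]["content"]), "characters")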
Step 2: Chunking Strategies
Chunking is crucial for RAG performance: chunks that are too large dilute the relevant passage and lose specificity, while chunks that are too small strip away the surrounding context the model needs.
from typing import List
def chunk_by_sentences(text: str, chunk_size: int = 3) -> List[str]:
"""Chunk by number of sentences."""
import nltk
nltk.download('punkt', quiet=True)
sentences = nltk.sent_tokenize(text)
chunks = []
for i in range(0, len(sentences), chunk_size):
chunk = " ".join(sentences[i:i + chunk_size])
chunks.append(chunk)
return chunks
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
"""Chunk with character overlap for context preservation."""
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
# Find natural break point (end of sentence)
if end < len(text):
# Look for period, question mark, or exclamation
for i in range(end, max(start, end - 100), -1):
if text[i] in '.!?\n':
end = i + 1
break
chunks.append(text[start:end].strip())
start = end - overlap
return chunks
def recursive_chunk(text: str, chunk_size: int = 500) -> List[str]:
"""Recursively split by different separators."""
separators = ["\n\n", "\n", ". ", " "]
def split_text(text: str, separators: list) -> List[str]:
if not separators or len(text) <= chunk_size:
return [text] if text.strip() else []
sep = separators[0]
parts = text.split(sep)
chunks = []
current = ""
for part in parts:
if len(current) + len(part) <= chunk_size:
current += (sep if current else "") + part
else:
                if current:
                    chunks.append(current)
                    current = ""  # reset so the flushed chunk isn't appended again
                if len(part) > chunk_size:
                    chunks.extend(split_text(part, separators[1:]))
else:
current = part
if current:
chunks.append(current)
return chunks
return split_text(text, separators)
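Which strategy works best depends on your documents, so it helps to compare chunk counts and sizes side by side. A small sketch (the sample file is a placeholder):
sample_text = load_pdf("./docs/sample.pdf")  # placeholder document from Step 1
for name, chunker in [
    ("sentences", chunk_by_sentences),
    ("overlap", chunk_with_overlap),
    ("recursive", recursive_chunk),
]:
    chunks = chunker(sample_text)
    sizes = [len(c) for c in chunks]
    print(f"{name:10s} -> {len(chunks)} chunks, "
          f"avg {sum(sizes) / len(sizes):.0f} chars, max {max(sizes)} chars")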
Semantic Chunking
A more advanced approach starts a new chunk wherever the semantic similarity between adjacent sentences drops below a threshold:
from sentence_transformers import SentenceTransformer
import numpy as np
def semantic_chunk(text: str, threshold: float = 0.5) -> List[str]:
"""Chunk based on semantic similarity between sentences."""
import nltk
sentences = nltk.sent_tokenize(text)
if len(sentences) <= 1:
return [text]
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
chunks = []
current_chunk = [sentences[0]]
for i in range(1, len(sentences)):
# Compare with previous sentence
similarity = np.dot(embeddings[i], embeddings[i-1]) / (
np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i-1])
)
if similarity >= threshold:
current_chunk.append(sentences[i])
else:
chunks.append(" ".join(current_chunk))
current_chunk = [sentences[i]]
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
Step 3: Indexing
import chromadb
from sentence_transformers import SentenceTransformer
class RAGIndexer:
def __init__(self, collection_name: str = "rag_docs"):
self.client = chromadb.PersistentClient(path="./rag_db")
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
self.model = SentenceTransformer('all-MiniLM-L6-v2')
def index_documents(self, documents: list[dict]):
"""Index documents with their chunks."""
all_chunks = []
all_metadatas = []
all_ids = []
for doc_idx, doc in enumerate(documents):
chunks = recursive_chunk(doc["content"])
for chunk_idx, chunk in enumerate(chunks):
all_chunks.append(chunk)
all_metadatas.append({
"source": doc["source"],
"filename": doc["filename"],
"chunk_index": chunk_idx
})
all_ids.append(f"doc_{doc_idx}_chunk_{chunk_idx}")
# Generate embeddings
embeddings = self.model.encode(all_chunks).tolist()
# Add to collection
self.collection.add(
documents=all_chunks,
embeddings=embeddings,
metadatas=all_metadatas,
ids=all_ids
)
print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")
def search(self, query: str, n_results: int = 5) -> list[dict]:
"""Search for relevant chunks."""
query_embedding = self.model.encode(query).tolist()
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=n_results
)
return [
{
"content": doc,
"metadata": meta,
"distance": dist
}
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)
]
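Indexing and searching end to end, using the loader from Step 1 (the directory path is illustrative):
indexer = RAGIndexer()
indexer.index_documents(load_documents("./docs"))  # illustrative path
for hit in indexer.search("What is machine learning?", n_results=3):
    print(f"{hit['distance']:.3f}  {hit['metadata']['filename']}: {hit['content'][:80]}...")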
Step 4: Generation with LLM
from openai import OpenAI
class RAGGenerator:
def __init__(self, indexer: RAGIndexer):
self.indexer = indexer
self.client = OpenAI()
def generate(self, query: str, n_context: int = 5) -> str:
"""Generate answer using retrieved context."""
# Retrieve relevant chunks
results = self.indexer.search(query, n_results=n_context)
# Build context
context = "\n\n---\n\n".join([r["content"] for r in results])
# Create prompt
prompt = f"""Answer the question based on the following context. If the answer is not in the context, say "I don't have enough information to answer this question."
Context:
{context}
Question: {query}
Answer:"""
# Generate response
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
{"role": "user", "content": prompt}
],
temperature=0.7
)
return response.choices[0].message.content
def generate_with_sources(self, query: str, n_context: int = 5) -> dict:
"""Generate answer with source attribution."""
results = self.indexer.search(query, n_results=n_context)
context_with_refs = ""
for i, r in enumerate(results, 1):
context_with_refs += f"[{i}] {r['content']}\n\n"
prompt = f"""Answer the question based on the following numbered sources. Cite sources using [1], [2], etc.
Sources:
{context_with_refs}
Question: {query}
Answer (with citations):"""
response = self.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.7
)
return {
"answer": response.choices[0].message.content,
"sources": [{"content": r["content"], "metadata": r["metadata"]} for r in results]
}
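Reusing the indexer from Step 3, a complete question-answering call looks like this (an OPENAI_API_KEY environment variable is assumed):
generator = RAGGenerator(indexer)
result = generator.generate_with_sources("What is machine learning?")
print(result["answer"])
for src in result["sources"]:
    print(" -", src["metadata"]["filename"])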
Advanced RAG Techniques
Hybrid Search (BM25 + Vector)
Combine keyword matching with semantic search:
from rank_bm25 import BM25Okapi
class HybridRetriever:
def __init__(self, documents: list[str]):
self.documents = documents
self.model = SentenceTransformer('all-MiniLM-L6-v2')
# Prepare BM25
tokenized = [doc.lower().split() for doc in documents]
self.bm25 = BM25Okapi(tokenized)
# Prepare embeddings
self.embeddings = self.model.encode(documents)
def search(self, query: str, k: int = 5, alpha: float = 0.5) -> list[dict]:
"""
Hybrid search with alpha weighting.
alpha=1.0 is pure vector, alpha=0.0 is pure BM25
"""
# BM25 scores
bm25_scores = self.bm25.get_scores(query.lower().split())
bm25_scores = bm25_scores / (bm25_scores.max() + 1e-6) # Normalize
# Vector scores
query_embedding = self.model.encode(query)
vector_scores = np.dot(self.embeddings, query_embedding) / (
np.linalg.norm(self.embeddings, axis=1) * np.linalg.norm(query_embedding)
)
# Combine scores
combined_scores = alpha * vector_scores + (1 - alpha) * bm25_scores
# Get top-k
top_indices = np.argsort(combined_scores)[::-1][:k]
return [
{"content": self.documents[i], "score": combined_scores[i]}
for i in top_indices
]
Query Expansion
Improve retrieval by expanding the query:
def expand_query(query: str, client: OpenAI) -> list[str]:
"""Generate multiple query variations for better retrieval."""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{"role": "system", "content": "Generate 3 alternative phrasings of the query."},
{"role": "user", "content": f"Query: {query}\n\nAlternative phrasings:"}
]
)
alternatives = response.choices[0].message.content.strip().split("\n")
return [query] + [alt.strip("1234567890.-) ") for alt in alternatives if alt.strip()]
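The variants still have to be merged back into a single ranked list. One simple option is reciprocal rank fusion over the results of each variant; this is a sketch that assumes the RAGIndexer from Step 3:
def search_with_expansion(indexer: RAGIndexer, query: str, client: OpenAI, k: int = 5) -> list[dict]:
    """Retrieve with every query variant and merge via reciprocal rank fusion (RRF)."""
    fused: dict[str, float] = {}
    hits: dict[str, dict] = {}
    for variant in expand_query(query, client):
        for rank, hit in enumerate(indexer.search(variant, n_results=k)):
            key = hit["content"]  # keyed on text for simplicity; chunk ids would be cleaner
            fused[key] = fused.get(key, 0.0) + 1.0 / (60 + rank)  # standard RRF constant
            hits[key] = hit
    best = sorted(fused, key=fused.get, reverse=True)[:k]
    return [hits[key] for key in best]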
Re-ranking
Use a cross-encoder for more accurate relevance scoring:
from sentence_transformers import CrossEncoder
class Reranker:
def __init__(self):
self.model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank(self, query: str, documents: list[str], top_k: int = 5) -> list[dict]:
"""Rerank documents using cross-encoder."""
pairs = [[query, doc] for doc in documents]
scores = self.model.predict(pairs)
ranked = sorted(
zip(documents, scores),
key=lambda x: x[1],
reverse=True
)
return [{"content": doc, "score": score} for doc, score in ranked[:top_k]]
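In practice you over-retrieve with a cheap first stage (say top-20 from the hybrid retriever) and let the cross-encoder pick the final top-5. A wiring sketch, reusing the earlier chunkers to build a placeholder corpus:
corpus = chunk_with_overlap(load_pdf("./docs/sample.pdf"))  # placeholder corpus
retriever = HybridRetriever(corpus)
reranker = Reranker()
query = "How does deep learning differ from classical machine learning?"
candidates = retriever.search(query, k=20, alpha=0.5)  # high-recall stage
final = reranker.rerank(query, [c["content"] for c in candidates], top_k=5)  # high-precision stage
for r in final:
    print(f"{r['score']:.3f}  {r['content'][:80]}...")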
Evaluation
Metrics for RAG
Retrieval Quality: Are we finding the right documents?
- Recall@K
- Mean Reciprocal Rank (MRR)
- Normalized Discounted Cumulative Gain (NDCG)
Generation Quality: Is the answer correct and well-formed?
- Faithfulness: Is the answer supported by context?
- Relevance: Does it answer the question?
- Coherence: Is it well-written?
def evaluate_retrieval(queries: list[str], ground_truth: list[list[str]], retriever, k: int = 5):
"""Evaluate retrieval using Recall@K."""
recalls = []
for query, relevant_docs in zip(queries, ground_truth):
results = retriever.search(query, k=k)
retrieved_docs = [r["content"] for r in results]
# Calculate recall
hits = sum(1 for doc in relevant_docs if doc in retrieved_docs)
recall = hits / len(relevant_docs) if relevant_docs else 0
recalls.append(recall)
return {
"recall@k": np.mean(recalls),
"k": k
}
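Mean Reciprocal Rank rewards putting the first relevant chunk near the top of the list. A sketch in the same style as the recall helper above:
def evaluate_mrr(queries: list[str], ground_truth: list[list[str]], retriever, k: int = 5) -> float:
    """Mean Reciprocal Rank: 1/rank of the first relevant result, averaged over queries."""
    reciprocal_ranks = []
    for query, relevant_docs in zip(queries, ground_truth):
        retrieved = [r["content"] for r in retriever.search(query, k=k)]
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant_docs:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return float(np.mean(reciprocal_ranks))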
Using RAGAS for Evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
# Prepare evaluation data (ragas expects a HuggingFace Dataset, not a plain dict)
eval_data = Dataset.from_dict({
    "question": ["What is machine learning?"],
    "answer": ["Machine learning is a subset of AI..."],
    "contexts": [["ML is a type of AI that learns from data..."]],
    "ground_truth": ["Machine learning is a field of AI..."]
})
# Evaluate
result = evaluate(
    dataset=eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision]
)
print(result)
Production Considerations
- Caching: Cache embeddings and frequent queries
- Streaming: Stream LLM responses for better UX
- Monitoring: Track retrieval quality and latency
- Updates: Handle document updates efficiently
- Cost: Balance chunk size with API costs
import functools
from sentence_transformers import SentenceTransformer

_embed_model = SentenceTransformer('all-MiniLM-L6-v2')

@functools.lru_cache(maxsize=1000)
def cached_embed(text: str) -> tuple:
    """Cache embeddings for repeated queries (returned as a tuple so cached values are immutable)."""
    return tuple(_embed_model.encode(text).tolist())
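Streaming is a one-flag change with the OpenAI client: pass stream=True and iterate over the chunks as they arrive. A minimal sketch:
from openai import OpenAI

def stream_answer(client: OpenAI, prompt: str, model: str = "gpt-4o-mini"):
    """Yield the answer incrementally so the UI can render tokens as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

for token in stream_answer(OpenAI(), "Summarize the retrieved context in two sentences."):
    print(token, end="", flush=True)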
Complete Production RAG System
Here's a complete, production-ready RAG system you can adapt:
from openai import OpenAI
from sentence_transformers import SentenceTransformer, CrossEncoder
import chromadb
from rank_bm25 import BM25Okapi
import numpy as np
from typing import List, Dict, Optional
from dataclasses import dataclass
import hashlib
import tiktoken
@dataclass
class Chunk:
"""A document chunk with metadata."""
content: str
metadata: Dict
id: str
class ProductionRAG:
"""Production-ready RAG system with hybrid search and reranking."""
def __init__(
self,
embedding_model: str = "all-MiniLM-L6-v2",
llm_model: str = "gpt-4o-mini",
reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
collection_name: str = "rag_docs",
persist_path: str = "./rag_db"
):
# Models
self.embedder = SentenceTransformer(embedding_model)
self.reranker = CrossEncoder(reranker_model)
self.llm_model = llm_model
self.client = OpenAI()
# Vector store
self.chroma_client = chromadb.PersistentClient(path=persist_path)
self.collection = self.chroma_client.get_or_create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
# BM25 index (built on first search)
self.bm25 = None
self.bm25_corpus = []
self.bm25_ids = []
# Token counter for context management
self.tokenizer = tiktoken.encoding_for_model("gpt-4")
def chunk_document(
self,
text: str,
chunk_size: int = 500,
overlap: int = 50,
source: str = "unknown"
) -> List[Chunk]:
"""Chunk a document with overlap."""
chunks = []
start = 0
chunk_index = 0
while start < len(text):
end = min(start + chunk_size, len(text))
# Find sentence boundary
if end < len(text):
for i in range(end, max(start, end - 100), -1):
if text[i] in '.!?\n':
end = i + 1
break
chunk_text = text[start:end].strip()
if chunk_text:
chunk_id = hashlib.md5(f"{source}:{chunk_index}".encode()).hexdigest()[:16]
chunks.append(Chunk(
content=chunk_text,
metadata={"source": source, "chunk_index": chunk_index},
id=chunk_id
))
chunk_index += 1
start = end - overlap
return chunks
def index_documents(self, documents: List[Dict[str, str]], chunk_size: int = 500):
"""Index documents into the vector store."""
all_chunks = []
for doc in documents:
chunks = self.chunk_document(
doc["content"],
chunk_size=chunk_size,
source=doc.get("source", "unknown")
)
all_chunks.extend(chunks)
# Batch embed and store
contents = [c.content for c in all_chunks]
embeddings = self.embedder.encode(contents, show_progress_bar=True).tolist()
self.collection.upsert(
ids=[c.id for c in all_chunks],
documents=contents,
embeddings=embeddings,
metadatas=[c.metadata for c in all_chunks]
)
# Rebuild BM25 index
self._build_bm25_index()
print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")
def _build_bm25_index(self):
"""Build BM25 index from all documents."""
data = self.collection.get(include=["documents"])
self.bm25_corpus = [doc.lower().split() for doc in data["documents"]]
self.bm25_ids = data["ids"]
self.bm25 = BM25Okapi(self.bm25_corpus)
def hybrid_search(
self,
query: str,
top_k: int = 20,
alpha: float = 0.5
) -> List[Dict]:
"""Hybrid search combining vector and BM25."""
# Vector search
query_embedding = self.embedder.encode(query).tolist()
vector_results = self.collection.query(
query_embeddings=[query_embedding],
n_results=top_k
)
# BM25 search
if self.bm25 is None:
self._build_bm25_index()
bm25_scores = self.bm25.get_scores(query.lower().split())
top_bm25_indices = np.argsort(bm25_scores)[::-1][:top_k]
# Combine results
combined = {}
# Add vector results
for i, doc_id in enumerate(vector_results["ids"][0]):
distance = vector_results["distances"][0][i]
vector_score = 1 - distance # Convert distance to similarity
combined[doc_id] = {
"content": vector_results["documents"][0][i],
"metadata": vector_results["metadatas"][0][i],
"vector_score": vector_score,
"bm25_score": 0.0
}
# Add BM25 results
max_bm25 = max(bm25_scores) + 1e-6
for idx in top_bm25_indices:
doc_id = self.bm25_ids[idx]
bm25_score = bm25_scores[idx] / max_bm25 # Normalize
if doc_id in combined:
combined[doc_id]["bm25_score"] = bm25_score
else:
# Fetch from collection
result = self.collection.get(ids=[doc_id], include=["documents", "metadatas"])
combined[doc_id] = {
"content": result["documents"][0],
"metadata": result["metadatas"][0],
"vector_score": 0.0,
"bm25_score": bm25_score
}
# Calculate combined score
results = []
for doc_id, data in combined.items():
combined_score = alpha * data["vector_score"] + (1 - alpha) * data["bm25_score"]
results.append({
"id": doc_id,
"content": data["content"],
"metadata": data["metadata"],
"score": combined_score
})
return sorted(results, key=lambda x: x["score"], reverse=True)[:top_k]
def rerank(self, query: str, results: List[Dict], top_k: int = 5) -> List[Dict]:
"""Rerank results using cross-encoder."""
if not results:
return []
pairs = [[query, r["content"]] for r in results]
scores = self.reranker.predict(pairs)
for i, result in enumerate(results):
result["rerank_score"] = float(scores[i])
return sorted(results, key=lambda x: x["rerank_score"], reverse=True)[:top_k]
def count_tokens(self, text: str) -> int:
"""Count tokens in text."""
return len(self.tokenizer.encode(text))
def build_context(
self,
results: List[Dict],
max_tokens: int = 3000
) -> tuple[str, List[Dict]]:
"""Build context string from results, respecting token limit."""
context_parts = []
used_results = []
total_tokens = 0
for i, result in enumerate(results):
source = result["metadata"].get("source", "Unknown")
chunk_text = f"[Source {i+1}: {source}]\n{result['content']}\n"
chunk_tokens = self.count_tokens(chunk_text)
if total_tokens + chunk_tokens > max_tokens:
break
context_parts.append(chunk_text)
used_results.append(result)
total_tokens += chunk_tokens
return "\n".join(context_parts), used_results
def generate(
self,
query: str,
context: str,
temperature: float = 0.7
) -> str:
"""Generate answer using LLM."""
system_prompt = """You are a helpful assistant that answers questions based on the provided context.
Rules:
1. Only use information from the context to answer
2. If the context doesn't contain the answer, say "I don't have enough information"
3. Cite sources using [Source N] when referencing information
4. Be concise but thorough"""
user_prompt = f"""Context:
{context}
Question: {query}
Answer:"""
response = self.client.chat.completions.create(
model=self.llm_model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
],
temperature=temperature
)
return response.choices[0].message.content
def query(
self,
question: str,
top_k_retrieve: int = 20,
top_k_rerank: int = 5,
alpha: float = 0.5,
max_context_tokens: int = 3000
) -> Dict:
"""Full RAG pipeline: retrieve, rerank, generate."""
# Step 1: Hybrid search
search_results = self.hybrid_search(question, top_k=top_k_retrieve, alpha=alpha)
# Step 2: Rerank
reranked_results = self.rerank(question, search_results, top_k=top_k_rerank)
# Step 3: Build context
context, used_sources = self.build_context(reranked_results, max_tokens=max_context_tokens)
# Step 4: Generate
answer = self.generate(question, context)
return {
"question": question,
"answer": answer,
"sources": [
{
"content": r["content"][:200] + "...",
"source": r["metadata"].get("source", "Unknown"),
"score": r.get("rerank_score", r.get("score", 0))
}
for r in used_sources
]
}
# Usage example
if __name__ == "__main__":
rag = ProductionRAG()
# Sample documents to index
documents = [
{
"content": """
Machine learning is a subset of artificial intelligence that enables systems
to learn and improve from experience without being explicitly programmed.
It focuses on developing algorithms that can access data and use it to learn
for themselves. The process begins with observations or data, such as examples,
direct experience, or instruction, to look for patterns in data and make better
decisions in the future.
""",
"source": "ml_intro.txt"
},
{
"content": """
Deep learning is a type of machine learning based on artificial neural networks
with multiple layers (hence 'deep'). These layers progressively extract higher-level
features from raw input. In image processing, lower layers may identify edges,
while higher layers may identify human faces. Deep learning has revolutionized
fields like computer vision, natural language processing, and speech recognition.
""",
"source": "deep_learning.txt"
},
{
"content": """
Natural Language Processing (NLP) is a branch of AI that helps computers
understand, interpret, and manipulate human language. NLP combines computational
linguistics with statistical, machine learning, and deep learning models.
Applications include machine translation, sentiment analysis, chatbots, and
text summarization.
""",
"source": "nlp_overview.txt"
}
]
# Index documents
rag.index_documents(documents)
# Query
result = rag.query("What is the relationship between deep learning and machine learning?")
print(f"Question: {result['question']}\n")
print(f"Answer: {result['answer']}\n")
print("Sources:")
for i, source in enumerate(result['sources'], 1):
print(f" [{i}] {source['source']} (score: {source['score']:.3f})")
Conversational RAG with Memory
For follow-up questions, maintain conversation history:
from dataclasses import dataclass, field
from typing import List, Tuple
@dataclass
class ConversationRAG:
"""RAG with conversation memory for follow-up questions."""
rag: ProductionRAG
history: List[Tuple[str, str]] = field(default_factory=list)
max_history: int = 5
def reformulate_query(self, question: str) -> str:
"""Use LLM to reformulate question with context from history."""
if not self.history:
return question
history_text = "\n".join([
f"User: {q}\nAssistant: {a[:200]}..."
for q, a in self.history[-self.max_history:]
])
prompt = f"""Given this conversation history and a follow-up question,
reformulate the question to be standalone (include all necessary context).
Conversation:
{history_text}
Follow-up question: {question}
Reformulated question:"""
response = self.rag.client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return response.choices[0].message.content.strip()
def query(self, question: str) -> Dict:
"""Query with conversation context."""
# Reformulate if there's history
standalone_question = self.reformulate_query(question)
# Run RAG
result = self.rag.query(standalone_question)
# Update history
self.history.append((question, result["answer"]))
return {
**result,
"original_question": question,
"reformulated_question": standalone_question
}
def clear_history(self):
"""Clear conversation history."""
self.history = []
# Usage
rag = ProductionRAG()
conversation = ConversationRAG(rag)
# First question
result1 = conversation.query("What is machine learning?")
print(result1["answer"])
# Follow-up (will be reformulated to include context)
result2 = conversation.query("How does deep learning relate to it?")
print(f"Reformulated: {result2['reformulated_question']}")
print(result2["answer"])
Debugging Common RAG Issues
Issue 1: Poor Retrieval
def debug_retrieval(rag: ProductionRAG, query: str, expected_source: str):
"""Debug why expected content isn't being retrieved."""
results = rag.hybrid_search(query, top_k=20)
print(f"Query: {query}")
print(f"Looking for source: {expected_source}\n")
found = False
for i, r in enumerate(results):
source = r["metadata"].get("source", "")
if expected_source in source:
found = True
print(f"Found at position {i+1}")
print(f" Score: {r['score']:.4f}")
print(f" Content preview: {r['content'][:100]}...")
break
if not found:
print("Not found in top 20 results!")
print("\nTop 5 retrieved sources:")
for r in results[:5]:
print(f" - {r['metadata'].get('source')}: {r['score']:.4f}")
Issue 2: Hallucination Check
def check_grounding(answer: str, sources: List[str]) -> Dict:
"""Check if answer is grounded in sources."""
client = OpenAI()
prompt = f"""Analyze whether this answer is fully supported by the sources.
Answer: {answer}
Sources:
{chr(10).join(sources)}
Return JSON with:
- "grounded": true/false
- "unsupported_claims": list of claims not in sources
- "confidence": 0-1 score"""
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
response_format={"type": "json_object"}
)
import json
return json.loads(response.choices[0].message.content)
Conclusion
RAG is the bridge between LLMs and your private knowledge. Key takeaways:
- Chunking matters: Test different strategies for your domain (500-1000 chars with overlap is a good start)
- Hybrid search: Combine BM25 and vectors for better recall (alpha=0.5 is a reasonable default)
- Re-ranking: Cross-encoders improve precision significantly (use for top-20 to top-5)
- Token management: Track context size to stay within LLM limits
- Evaluate: Use metrics like RAGAS to measure quality
- Conversation: Reformulate follow-up questions for better retrieval
Start simple (vector search only), then add hybrid search and reranking as needed.
References
- Lewis et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)
- Gao et al. "Retrieval-Augmented Generation for Large Language Models: A Survey" (2024)
- RAGAS: https://docs.ragas.io
- Robertson & Zaragoza "The Probabilistic Relevance Framework: BM25 and Beyond" (2009)