Vector Databases Explained: Architecture and Selection Guide

Authors

  • Jared Chung

Introduction

Vector databases have become essential infrastructure for AI applications. They power semantic search, recommendation systems, and Retrieval Augmented Generation (RAG). But what exactly makes them different from traditional databases, and how do you choose the right one?

This guide explains the architecture behind vector databases, the algorithms that make them fast, and provides practical guidance for selecting the right option for your use case.

Why Vector Databases?

The Semantic Gap

Traditional databases excel at exact matching:

SELECT * FROM products WHERE name = 'iPhone 15';

But they can't handle queries like "smartphones with good cameras" because they don't understand that "good cameras" relates to megapixels, aperture, and low-light performance.

Semantic search solves this by:

  1. Converting text to vectors that capture meaning
  2. Finding similar vectors to match concepts, not just keywords
  3. Scaling efficiently to millions of documents

What is a Vector Embedding?

An embedding is a numerical representation of data where similar items have similar numbers. The core insight is that meaning can be captured in geometry:

Text                             | Vector (simplified)
"I love Python programming"      | [0.8, 0.3, 0.9, 0.1]
"Python is my favorite language" | [0.75, 0.28, 0.85, 0.15]
"The weather is nice today"      | [0.1, 0.9, 0.2, 0.7]

The first two texts have similar vectors because they express similar meanings. The third is geometrically distant because it's semantically unrelated.

Modern embedding models like all-MiniLM-L6-v2 produce 384-dimensional vectors. Larger models like text-embedding-3-large produce 3072 dimensions. More dimensions capture finer semantic distinctions but require more storage and compute.
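
A quick way to see these dimensions is to encode a sentence directly; a minimal sketch using the sentence-transformers library (assuming it is installed and the model can be downloaded):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("I love Python programming")
print(vector.shape)  # (384,) -- one 384-dimensional vector for the input text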

How Vector Databases Work

[Figure: vector database architecture diagram]

The Architecture

A vector database has two main flows:

Ingestion Flow

  1. Documents arrive (text, images, or any data)
  2. Embedding model converts them to vectors
  3. Vector index organizes vectors for efficient search
  4. Storage layer persists vectors and original documents

Query Flow

  1. Query arrives ("What is machine learning?")
  2. Same embedding model converts query to vector
  3. ANN search finds k nearest vectors in the index
  4. Results returned with similarity scores and original documents

The key insight: the same embedding model must be used for both documents and queries. Different models produce incompatible vector spaces.
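
To make the two flows concrete, here is a minimal in-memory sketch (not a real database) using sentence-transformers and numpy; note that the same model object embeds both the documents and the query:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Ingestion flow: embed documents and keep the vectors in a simple matrix "index"
docs = ["Machine learning is a subset of AI", "The weather is nice today"]
doc_vectors = model.encode(docs, normalize_embeddings=True)

# Query flow: embed the query with the SAME model, then rank by cosine similarity
query_vector = model.encode("What is machine learning?", normalize_embeddings=True)
scores = doc_vectors @ query_vector            # cosine similarity (vectors are normalized)
best = np.argsort(scores)[::-1][:1]            # index of the top match
print([(docs[i], float(scores[i])) for i in best])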

The Nearest Neighbor Problem

Given a query vector, find the k most similar vectors in the database.

Brute Force Approach:

  • Compare query with every vector in database
  • Time complexity: O(n) where n = number of vectors
  • For 1 million 384-dimensional vectors: roughly 384 million floating-point operations per query
  • Result: Too slow for production

The Solution: Approximate Nearest Neighbor (ANN)

ANN algorithms trade perfect accuracy for speed. Instead of guaranteeing exact matches, they find vectors that are, with high probability, among the true nearest neighbors.
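
To get a feel for the brute-force cost, here is a rough numpy sketch with random data (timings vary widely by hardware; the point is the linear scan):

import time
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 384                            # try 1_000_000 to see the linear growth
db = rng.random((n, d), dtype=np.float32)      # stand-in for the stored vectors
query = rng.random(d, dtype=np.float32)

start = time.perf_counter()
scores = db @ query                            # compares the query against every vector: O(n * d)
top10 = np.argpartition(-scores, 10)[:10]      # indices of the 10 highest scores
print(f"brute-force scan over {n:,} vectors: {time.perf_counter() - start:.4f}s")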

ANN Index Algorithms

The choice of index algorithm determines your speed/accuracy/memory tradeoffs.

HNSW (Hierarchical Navigable Small World)

HNSW is the most popular algorithm. It builds a multi-layer graph where each layer contains fewer nodes with longer-range connections.

How it works:

  1. Start search at the top layer (few nodes, long connections)
  2. Greedily navigate toward the query vector
  3. Drop to the next layer and continue
  4. Repeat until reaching the bottom layer with all nodes

Think of it like navigating a map: start with highways (top layer), then main roads, then local streets.

Characteristics:

Aspect            | Rating
Query Speed       | Excellent
Recall (accuracy) | Very Good
Memory Usage      | High
Update Handling   | Good
Best For          | General purpose, real-time apps

Key Parameters:

  • M: Connections per node (higher = better recall, more memory)
  • ef_construction: Build-time search width (higher = better index quality)
  • ef_search: Query-time search width (higher = better recall, slower)
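
To show where these parameters appear in practice, here is a minimal sketch using the hnswlib library (assumed installed; the parameter values are illustrative, not tuned):

import hnswlib
import numpy as np

dim, num_elements = 384, 10_000
data = np.random.rand(num_elements, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, M=16, ef_construction=200)  # build-time knobs
index.add_items(data, np.arange(num_elements))

index.set_ef(50)                                     # ef_search: query-time recall/speed knob
labels, distances = index.knn_query(data[:1], k=5)   # approximate 5 nearest neighbors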

IVF (Inverted File Index)

IVF clusters vectors and only searches relevant clusters.

How it works:

  1. During indexing, group vectors into clusters (using k-means)
  2. For each query, identify the most promising clusters
  3. Search only within those clusters

Characteristics:

Aspect          | Rating
Query Speed     | Good
Recall          | Good (depends on nprobe)
Memory Usage    | Lower than HNSW
Update Handling | Poor (requires retraining)
Best For        | Static datasets, memory-constrained

Key Parameters:

  • nlist: Number of clusters
  • nprobe: Clusters to search (higher = better recall, slower)
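
A minimal IVF sketch with these two knobs, using the FAISS library (assumed installed; values are illustrative):

import faiss
import numpy as np

d = 384
xb = np.random.rand(100_000, d).astype("float32")   # vectors to index

nlist = 256                                         # number of k-means clusters
quantizer = faiss.IndexFlatL2(d)                    # assigns vectors to their nearest cluster
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                     # learns the cluster centroids
index.add(xb)

index.nprobe = 16                                   # clusters to visit per query
distances, ids = index.search(xb[:1], 5)            # approximate 5 nearest neighbors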

Product Quantization (PQ)

PQ compresses vectors by splitting them into subvectors and quantizing each to a codebook ID.

How it works:

  1. Split 384-dimensional vector into 8 subvectors of 48 dimensions each
  2. Learn a codebook of 256 centroids for each subvector
  3. Replace each subvector with its nearest centroid ID (1 byte)
  4. Result: 384 floats (1536 bytes) → 8 bytes (192x compression)

Characteristics:

Aspect       | Rating
Query Speed  | Good
Recall       | Lower (lossy compression)
Memory Usage | Excellent (32x+ reduction)
Best For     | Billions of vectors, cost-sensitive

Often combined with IVF as IVF-PQ for memory-efficient large-scale search.
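
Continuing the FAISS sketch above (same d and xb), IVF and PQ can be combined into a single IVF-PQ index; the parameters below mirror the description: 8 sub-vectors, each coded with 8 bits:

# nlist=256 clusters, m=8 sub-vectors, 8 bits (1 byte) per sub-vector code
coarse = faiss.IndexFlatL2(d)
index_pq = faiss.IndexIVFPQ(coarse, d, 256, 8, 8)
index_pq.train(xb)                                  # learns cluster centroids and PQ codebooks
index_pq.add(xb)
index_pq.nprobe = 16
distances, ids = index_pq.search(xb[:1], 5)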

Distance Metrics

How similarity is calculated:

Metric         | Use Case
Cosine         | Normalized embeddings (most common)
Euclidean (L2) | When vector magnitude matters
Dot Product    | Recommendation systems with magnitude as relevance

For text embeddings, cosine similarity is almost always the right choice because it measures directional similarity regardless of vector length.
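
The three metrics are easy to compare directly on the toy vectors from earlier; a small numpy sketch:

import numpy as np

a = np.array([0.8, 0.3, 0.9, 0.1])
b = np.array([0.75, 0.28, 0.85, 0.15])

dot = a @ b                                              # dot product
l2 = np.linalg.norm(a - b)                               # Euclidean (L2) distance
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # cosine similarity

print(f"dot={dot:.3f}  L2={l2:.3f}  cosine={cosine:.3f}")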

Vector Database Comparison

The Options

Database | Type              | Best For
ChromaDB | Embedded/Local    | Prototyping, small datasets
Pinecone | Managed Cloud     | Production SaaS, minimal ops
Weaviate | Self-hosted/Cloud | Hybrid search, GraphQL
Qdrant   | Self-hosted/Cloud | Performance, complex filtering
Milvus   | Self-hosted       | Enterprise scale, GPU support

Decision Framework

Choose ChromaDB when:

  • Prototyping or learning vector databases
  • Dataset under 1 million vectors
  • Want simplest possible setup
  • Embedded in application is acceptable

ChromaDB runs in-process with your Python code—no separate server needed.

Choose Pinecone when:

  • Production SaaS application
  • Don't want to manage infrastructure
  • Need enterprise features (SSO, audit logs)
  • Willing to pay for managed service
  • Serverless scaling is important

Pinecone handles sharding, replication, and backups automatically.

Choose Weaviate when:

  • Need hybrid search (vector + keyword BM25)
  • GraphQL API preferred
  • Multi-modal data (text + images)
  • Want self-hosted with cloud option
  • Need schema-based data modeling

Weaviate's hybrid search combines the precision of keyword matching with the semantic understanding of vectors.

Choose Qdrant when:

  • Performance is critical
  • Need advanced filtering on metadata
  • Prefer Rust-based reliability
  • Self-hosted infrastructure is acceptable
  • Large-scale filtering workloads

Qdrant's payload indexing enables efficient filtered search at scale.

Practical Implementation

ChromaDB Quick Start

ChromaDB is the fastest way to get started:

import chromadb

# Create client (persistent storage)
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Add documents (embedding handled automatically)
collection.add(
    documents=[
        "Machine learning is a subset of AI",
        "Deep learning uses neural networks",
        "NLP processes human language",
    ],
    metadatas=[{"topic": "ml"}, {"topic": "dl"}, {"topic": "nlp"}],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["What is deep learning?"],
    n_results=2
)

Pinecone Production Setup

For production workloads:

from pinecone import Pinecone, ServerlessSpec
from sentence_transformers import SentenceTransformer

# Initialize
pc = Pinecone(api_key="your-api-key")
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create serverless index
pc.create_index(
    name="semantic-search",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)

index = pc.Index("semantic-search")

# Upsert with embeddings
documents = ["Python for data science", "JavaScript for web dev"]
vectors = [
    {"id": str(i), "values": model.encode(doc).tolist(), "metadata": {"text": doc}}
    for i, doc in enumerate(documents)
]
index.upsert(vectors=vectors)

# Query
query_vector = model.encode("machine learning language").tolist()
results = index.query(vector=query_vector, top_k=3, include_metadata=True)

Metadata Filtering

All vector databases support filtering results by metadata:

# ChromaDB
results = collection.query(
    query_texts=["neural networks"],
    where={"topic": "dl"},  # Only deep learning docs
    n_results=5
)

# With complex filters
results = collection.query(
    query_texts=["machine learning"],
    where={
        "$and": [
            {"topic": {"$in": ["ml", "dl"]}},
            {"year": {"$gte": 2020}}
        ]
    }
)

Metadata filtering is typically applied before or alongside the vector search (pre-filtering), reducing the search space and keeping results relevant.

Hybrid Search

Pure vector search sometimes misses exact keyword matches. Hybrid search combines:

  • Vector search: Semantic similarity
  • Keyword search: BM25 or similar

When to use hybrid:

  • Technical documentation (acronyms, product names)
  • Legal/medical text (precise terminology matters)
  • E-commerce (exact brand/model matching)

How it works:

  1. Run both vector and keyword search
  2. Normalize scores from each
  3. Combine with weighted average (alpha parameter)
  4. Re-rank merged results

# Weaviate hybrid search
results = collection.query.hybrid(
    query="BERT transformer architecture",
    alpha=0.5,  # 0 = pure keyword, 1 = pure vector
    limit=10
)
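
Under the hood, the alpha-weighted fusion from step 3 is straightforward to sketch by hand (hypothetical scores; min-max normalization is one common choice):

def fuse(vector_scores: dict, keyword_scores: dict, alpha: float = 0.5) -> dict:
    """Combine per-document scores from vector and keyword search."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {doc: (s - lo) / (hi - lo or 1.0) for doc, s in scores.items()}

    v, k = normalize(vector_scores), normalize(keyword_scores)
    # alpha = 1.0 -> pure vector, alpha = 0.0 -> pure keyword
    return {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }

fused = fuse({"doc1": 0.92, "doc2": 0.55}, {"doc2": 11.3, "doc3": 7.1}, alpha=0.5)
print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))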

Performance Optimization

Batch Operations

Always upsert in batches:

BATCH_SIZE = 100
for i in range(0, len(vectors), BATCH_SIZE):
    batch = vectors[i:i + BATCH_SIZE]
    index.upsert(vectors=batch)

Individual inserts incur network overhead. Batching can be 10-100x faster.

Index Tuning

For HNSW (most common):

Parameter       | Effect of low value        | Effect of high value
M (connections) | Faster build, less memory  | Better recall
ef_construction | Faster build               | Higher quality index
ef_search       | Faster queries             | Better recall

Start with defaults, then tune based on your recall/latency requirements.

Embedding Optimization

Choose embedding model based on your needs:

Model                  | Dimensions | Speed    | Quality
all-MiniLM-L6-v2       | 384        | Fast     | Good
all-mpnet-base-v2      | 768        | Medium   | Better
text-embedding-3-small | 1536       | API call | Very Good
text-embedding-3-large | 3072       | API call | Best

Smaller dimensions = less storage, faster search. Test whether the quality difference matters for your use case.
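
A quick way to compare candidates on your own data is to measure dimensions and encoding speed side by side (a rough sketch; assumes both sentence-transformers models can be downloaded):

import time
from sentence_transformers import SentenceTransformer

texts = ["Vector databases enable semantic search."] * 100

for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    start = time.perf_counter()
    embeddings = model.encode(texts)
    print(f"{name}: {embeddings.shape[1]} dims, "
          f"{time.perf_counter() - start:.2f}s for {len(texts)} texts")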

Common Patterns

Multi-Tenancy with Namespaces

Isolate data per user or organization:

# Pinecone: use namespaces
index.upsert(vectors=user_data, namespace="user_123")
index.query(vector=query, top_k=5, namespace="user_123")

# ChromaDB: use separate collections
user_collection = client.get_or_create_collection(f"user_{user_id}")

Chunking Strategy

For long documents:

Strategy  | Chunk Size     | Overlap
Sentence  | 1-3 sentences  | 0
Paragraph | 200-500 tokens | 50-100 tokens
Fixed     | 512 tokens     | 128 tokens
Semantic  | Variable       | Based on topic shifts

Overlap ensures context isn't lost at chunk boundaries. Test different strategies on your data.
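
A fixed-size chunker with overlap takes only a few lines; the sketch below splits on whitespace words for simplicity (real pipelines usually count model tokens instead):

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into word-based chunks whose boundaries overlap."""
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("your long document text " * 400)
print(len(chunks), len(chunks[0].split()))   # number of chunks, words in the first chunk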

Caching Query Embeddings

If the same queries repeat, cache embeddings:

from functools import lru_cache

@lru_cache(maxsize=1000)
def get_query_embedding(query: str):
    # Cached by query string; returned as a tuple so the value is immutable
    return tuple(model.encode(query).tolist())

# Repeated queries hit cache
embedding = get_query_embedding("what is machine learning")

Conclusion

Vector databases enable semantic search at scale. Key takeaways:

Architecture:

  • Same embedding model for documents and queries
  • ANN algorithms trade accuracy for speed
  • HNSW is the default choice for most use cases

Selection:

  • ChromaDB for prototyping
  • Pinecone for managed production
  • Weaviate for hybrid search
  • Qdrant for performance-critical filtering

Optimization:

  • Batch upserts (100-1000 at a time)
  • Tune HNSW parameters for recall/latency tradeoff
  • Choose embedding dimensions based on quality needs
  • Use hybrid search when exact matches matter

Start with ChromaDB for development, then evaluate production options based on your scaling and operational requirements.
