Vector Databases Explained: Pinecone, Weaviate, ChromaDB and More
Author: Jared Chung
Introduction
Vector databases are purpose-built to store and query high-dimensional embedding vectors efficiently. They've become essential infrastructure for modern AI applications like semantic search, recommendation systems, and Retrieval Augmented Generation (RAG). In this post, we'll explore what vector databases are, when to use them, and compare the leading options with complete, runnable code examples.
Prerequisites
# Core packages
pip install chromadb pinecone-client weaviate-client qdrant-client
pip install sentence-transformers scikit-learn numpy
# Optional: for advanced examples
pip install rank-bm25 fastembed
Why Vector Databases?
Traditional Databases vs. Vector Databases
Traditional databases excel at exact matching:
SELECT * FROM products WHERE name = 'iPhone 15';
SELECT * FROM products WHERE category = 'electronics' AND price < 500;
But what about semantic queries like "smartphones with good cameras"? Traditional databases can't understand that "good cameras" relates to megapixels, aperture, and image quality. This requires:
- Converting text to embeddings (dense vectors that capture meaning)
- Finding similar vectors efficiently
- Scaling to millions or billions of vectors
Vector databases solve all three challenges.
What is a Vector Embedding?
An embedding is a numerical representation of data (text, images, audio) where similar items have similar numbers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "I love programming in Python",
    "Python is my favorite coding language",
    "The weather is nice today"
]

embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings[0].shape}")  # (384,)

# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(embeddings)
print("\nSimilarity matrix:")
for i, text in enumerate(texts):
    print(f"{i}: {text[:40]}")
print(similarities)

# Texts 0 and 1 will have high similarity (~0.8+)
# Text 2 will have low similarity with 0 and 1 (~0.1-0.3)
The Nearest Neighbor Problem
Given a query vector, find the k most similar vectors in the database.
Brute Force: Compare the query with every vector - O(n) time complexity. For 1 million vectors of 384 dimensions, that's roughly 384 million floating-point multiply-adds per query. Too slow for production at scale.
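For reference, exact brute-force search is only a few lines of NumPy (a minimal sketch; the data here is random and just for illustration):

import numpy as np

def brute_force_search(query, vectors, k=5):
    """Exact k-NN by cosine similarity: O(n * d) work per query."""
    # Normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                    # one similarity per stored vector
    top_k = np.argsort(-scores)[:k]   # indices of the k best matches
    return top_k, scores[top_k]

vectors = np.random.rand(10_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)
indices, scores = brute_force_search(query, vectors, k=3)
print(indices, scores)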
Solution: Approximate Nearest Neighbor (ANN) algorithms trade perfect accuracy for speed:
HNSW (Hierarchical Navigable Small World)
The most popular algorithm. Builds a multi-layer graph where each layer is a "navigable small world" network.
Layer 2:  A -------- B -------- C      (few nodes, long connections)
           \        /
Layer 1:  A -- D -- B -- E -- C        (more nodes, medium connections)
           \  / \  / \  / \  /
Layer 0:  A-D-F-B-G-E-H-C-I-J          (all nodes, short connections)
- Pros: Fast queries, good recall, handles updates well
- Cons: High memory usage (stores graph structure)
- Parameters (tuned in the sketch below):
  - M: Number of connections per node (higher = better recall, more memory)
  - ef_construction: Build-time search width (higher = better quality index)
  - ef_search: Query-time search width (higher = better recall, slower)
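A minimal sketch of these parameters using the hnswlib library (not in the prerequisites; install with pip install hnswlib; the random data is purely illustrative):

import hnswlib
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype(np.float32)

# Build the graph: M and ef_construction control index quality and memory
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), M=16, ef_construction=200)
index.add_items(vectors, np.arange(len(vectors)))

# ef (ef_search) controls the recall/latency tradeoff at query time
index.set_ef(64)
labels, distances = index.knn_query(vectors[0], k=5)
print(labels, distances)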
IVF (Inverted File Index)
Clusters vectors and only searches relevant clusters.
Cluster 1: [v1, v5, v9, v12]
Cluster 2: [v2, v3, v7, v15]
Cluster 3: [v4, v6, v8, v10, v11, v13, v14]
- Pros: Lower memory than HNSW, fast with many clusters
- Cons: Requires training, poor with updates
- Parameters (see the sketch below):
  - nlist: Number of clusters
  - nprobe: Number of clusters to search (higher = better recall)
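As a rough sketch, IVF is easy to try with the faiss library (not in the prerequisites; install with pip install faiss-cpu; random data for illustration):

import faiss
import numpy as np

dim = 384
vectors = np.random.rand(10_000, dim).astype(np.float32)

# IVF needs a training step to learn the cluster centroids
quantizer = faiss.IndexFlatL2(dim)               # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim, 100)  # nlist = 100 clusters
index.train(vectors)
index.add(vectors)

# nprobe = how many clusters to scan per query (recall vs. speed)
index.nprobe = 8
distances, ids = index.search(vectors[:1], 5)
print(ids, distances)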
PQ (Product Quantization)
Compresses vectors by splitting into subvectors and quantizing each.
Original: [0.1, 0.5, 0.3, 0.8, 0.2, 0.9, 0.4, 0.7]
Split: [0.1, 0.5] [0.3, 0.8] [0.2, 0.9] [0.4, 0.7]
Quantize: [2] [5] [1] [7] (code IDs)
- Pros: Massive memory reduction (32x or more)
- Cons: Lower accuracy, training required
- Often combined with IVF as IVF-PQ, as sketched below
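Continuing the faiss sketch from the IVF section (it reuses quantizer, dim, and vectors from there; faiss-cpu installed separately), IVF-PQ only changes how the index is constructed:

# m = number of subvectors (must divide dim; 384 / 48 = 8 dims per subvector)
m = 48
nbits = 8     # bits per subvector code -> 256 centroids per subquantizer
index_pq = faiss.IndexIVFPQ(quantizer, dim, 100, m, nbits)
index_pq.train(vectors)
index_pq.add(vectors)
index_pq.nprobe = 8

# Each vector is stored in m bytes (48) instead of dim * 4 bytes (1536): ~32x smaller
distances, ids = index_pq.search(vectors[:1], 5)
print(ids, distances)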
Distance Metrics
| Metric | Formula | Use Case |
|---|---|---|
| Cosine | 1 - cos(a,b) | Normalized embeddings (most common) |
| Euclidean (L2) | sqrt(sum((a-b)^2)) | When magnitude matters |
| Dot Product | sum(a*b) | Recommendation systems |
| Manhattan (L1) | sum(abs(a-b)) | Sparse data |
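These metrics are all one-liners in NumPy, which is handy for sanity-checking what your database returns (a quick sketch with made-up vectors):

import numpy as np

a = np.array([0.1, 0.5, 0.3, 0.8])
b = np.array([0.2, 0.4, 0.3, 0.9])

cosine_distance = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot_product = np.dot(a, b)
manhattan = np.sum(np.abs(a - b))
print(cosine_distance, euclidean, dot_product, manhattan)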
Vector Database Comparison
| Feature | ChromaDB | Pinecone | Weaviate | Qdrant | Milvus |
|---|---|---|---|---|---|
| Type | Embedded/Server | Managed Cloud | Self-hosted/Cloud | Self-hosted/Cloud | Self-hosted |
| Pricing | Free | Pay per use | Free/Enterprise | Free/Cloud | Free |
| Setup | Easiest | Easy | Medium | Medium | Complex |
| Scalability | Small-Medium | Large | Large | Large | Very Large |
| Filtering | Basic | Advanced | GraphQL | Advanced | Advanced |
| Best For | Prototyping | Production SaaS | Hybrid search | Performance | Enterprise |
ChromaDB: Getting Started Quickly
ChromaDB is the easiest way to get started with vector databases. It runs embedded in your Python process.
Installation and Basic Usage
import chromadb
from chromadb.utils import embedding_functions

# Create client (in-memory)
client = chromadb.Client()

# Or persistent storage
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with default embedding function
collection = client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}  # Distance metric
)

# Add documents (ChromaDB handles embedding automatically)
collection.add(
    documents=[
        "Machine learning is a subset of artificial intelligence",
        "Deep learning uses neural networks with many layers",
        "Natural language processing deals with text data",
        "Computer vision processes image and video data",
    ],
    metadatas=[
        {"category": "ml"},
        {"category": "dl"},
        {"category": "nlp"},
        {"category": "cv"},
    ],
    ids=["doc1", "doc2", "doc3", "doc4"]
)

# Query
results = collection.query(
    query_texts=["What is deep learning?"],
    n_results=2
)
print(results)
Using Custom Embeddings
# Embedding function backed by sentence-transformers
# (uses embedding_functions imported in the previous example)
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.create_collection(
    name="custom_embeddings",
    embedding_function=sentence_transformer_ef
)
Filtering with Metadata
# Query with metadata filter
results = collection.query(
    query_texts=["neural networks"],
    n_results=5,
    where={"category": "dl"}  # Only deep learning docs
)

# Complex filters (assumes documents also carry a "year" metadata field)
results = collection.query(
    query_texts=["machine learning"],
    where={
        "$and": [
            {"category": {"$in": ["ml", "dl"]}},
            {"year": {"$gte": 2020}}
        ]
    }
)
Pinecone: Production-Ready Cloud
Pinecone is a fully managed vector database optimized for production workloads.
Setup
from pinecone import Pinecone, ServerlessSpec

# Initialize
pc = Pinecone(api_key="your-api-key")

# Create index
pc.create_index(
    name="semantic-search",
    dimension=384,  # Match your embedding dimension
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Connect to index
index = pc.Index("semantic-search")
Upserting and Querying
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare data
documents = [
    {"id": "1", "text": "Python is great for data science"},
    {"id": "2", "text": "JavaScript powers the modern web"},
    {"id": "3", "text": "Rust offers memory safety without garbage collection"},
]

# Generate embeddings and upsert
vectors = []
for doc in documents:
    embedding = model.encode(doc["text"]).tolist()
    vectors.append({
        "id": doc["id"],
        "values": embedding,
        "metadata": {"text": doc["text"]}
    })

index.upsert(vectors=vectors)

# Query
query = "Which language is best for machine learning?"
query_embedding = model.encode(query).tolist()

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)

for match in results["matches"]:
    print(f"{match['score']:.3f}: {match['metadata']['text']}")
Namespaces for Multi-tenancy
# Upsert to specific namespace
index.upsert(vectors=vectors, namespace="user_123")

# Query specific namespace
results = index.query(
    vector=query_embedding,
    top_k=5,
    namespace="user_123"
)
Weaviate: Hybrid Search
Weaviate combines vector search with keyword search (BM25) for best-of-both-worlds retrieval.
Setup with Docker
docker run -d \
-p 8080:8080 \
-p 50051:50051 \
cr.weaviate.io/semitechnologies/weaviate:latest
Python Client
import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect
client = weaviate.connect_to_local()

# Create collection (class)
collection = client.collections.create(
    name="Document",
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add objects
collection.data.insert_many([
    {"content": "Machine learning automates analytical model building", "category": "ml"},
    {"content": "Neural networks are inspired by biological neurons", "category": "dl"},
])
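Before hybrid search, a plain vector query looks like this (a minimal sketch; it assumes the text2vec-transformers inference module is actually enabled in your Weaviate instance, which the bare Docker command above does not set up by itself):

# Pure vector (semantic) search over the collection
results = collection.query.near_text(
    query="how do neural networks learn?",
    limit=2
)
for obj in results.objects:
    print(obj.properties)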
Hybrid Search
# Combine vector and keyword search
results = collection.query.hybrid(
    query="neural network training",
    alpha=0.5,  # 0 = pure keyword (BM25), 1 = pure vector
    limit=5
)

for obj in results.objects:
    print(obj.properties)
Qdrant: High Performance
Qdrant is optimized for performance and offers advanced filtering capabilities.
Setup
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Local or cloud
client = QdrantClient(":memory:")  # In-memory for testing
# client = QdrantClient(url="http://localhost:6333")  # Docker
# client = QdrantClient(url="https://xxx.qdrant.io", api_key="...")  # Cloud

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)
Indexing and Search
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare points
documents = [
    "Transformers revolutionized NLP",
    "CNNs are great for image processing",
    "RNNs handle sequential data",
]

points = [
    PointStruct(
        id=i,
        vector=model.encode(doc).tolist(),
        payload={"text": doc, "index": i}
    )
    for i, doc in enumerate(documents)
]

# Upsert
client.upsert(collection_name="documents", points=points)

# Search
query_vector = model.encode("What handles sequences?").tolist()

results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    limit=3
)

for result in results:
    print(f"{result.score:.3f}: {result.payload['text']}")
Advanced Filtering
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Search with filters (assumes points carry "category" and "year" payload fields)
results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="nlp")
            ),
            FieldCondition(
                key="year",
                range=Range(gte=2020)
            )
        ]
    ),
    limit=10
)
Choosing the Right Database
Use ChromaDB when:
- Prototyping and experimenting
- Small to medium datasets (less than 1M vectors)
- Embedded use cases
- Quick setup is priority
Use Pinecone when:
- Production SaaS applications
- Don't want to manage infrastructure
- Need enterprise features (SSO, audit logs)
- Serverless scaling is important
Use Weaviate when:
- Need hybrid search (vector + keyword)
- GraphQL API is preferred
- Multi-modal data (text, images)
- Self-hosted with cloud option
Use Qdrant when:
- Performance is critical
- Complex filtering requirements
- Rust-based reliability
- Self-hosted preferred
Performance Tips
- Batch operations: Upsert in batches of 100-1000
- Choose right metric: Cosine for normalized, Euclidean for absolute distance
- Index parameters: Tune HNSW ef_construction and M for speed/accuracy tradeoff
- Quantization: Reduce memory with scalar/product quantization
- Async operations: Use async clients for high-throughput apps (see the sketch after the batch example below)
# Batch upsert example (Pinecone-style; the same pattern applies to the other clients)
BATCH_SIZE = 100
for i in range(0, len(vectors), BATCH_SIZE):
    batch = vectors[i:i + BATCH_SIZE]
    index.upsert(vectors=batch)
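For the async tip, most clients ship an async variant. A minimal sketch with Qdrant's AsyncQdrantClient (it assumes a running Qdrant server and the "documents" collection from the Qdrant section):

import asyncio
from qdrant_client import AsyncQdrantClient

async def search_many(queries):
    client = AsyncQdrantClient(url="http://localhost:6333")
    # Fire all searches concurrently instead of one at a time
    tasks = [
        client.search(collection_name="documents", query_vector=q, limit=3)
        for q in queries
    ]
    return await asyncio.gather(*tasks)

# results = asyncio.run(search_many([query_vector_1, query_vector_2]))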
Complete Working Example
Here's a production-ready semantic search system you can adapt:
from sentence_transformers import SentenceTransformer
import chromadb
from typing import List, Dict, Optional
from dataclasses import dataclass
import hashlib
@dataclass
class Document:
    """Represents a document to be indexed."""
    content: str
    metadata: Dict
    id: Optional[str] = None

    def __post_init__(self):
        if self.id is None:
            self.id = hashlib.md5(self.content.encode()).hexdigest()[:16]


class SemanticSearchEngine:
    """Production-ready semantic search with ChromaDB."""

    def __init__(
        self,
        collection_name: str = "documents",
        model_name: str = "all-MiniLM-L6-v2",
        persist_directory: str = "./vector_db"
    ):
        # Initialize embedding model
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = self.model.get_sentence_embedding_dimension()

        # Initialize ChromaDB
        self.client = chromadb.PersistentClient(path=persist_directory)
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={
                "hnsw:space": "cosine",
                "hnsw:M": 16,
                "hnsw:construction_ef": 100
            }
        )
    def add_documents(self, documents: List[Document], batch_size: int = 100):
        """Add documents to the index."""
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]

            ids = [doc.id for doc in batch]
            contents = [doc.content for doc in batch]
            metadatas = [doc.metadata for doc in batch]

            # Generate embeddings
            embeddings = self.model.encode(contents).tolist()

            # Upsert to collection
            self.collection.upsert(
                ids=ids,
                documents=contents,
                embeddings=embeddings,
                metadatas=metadatas
            )

        print(f"Indexed {len(documents)} documents")

    def search(
        self,
        query: str,
        n_results: int = 10,
        filter_dict: Optional[Dict] = None,
        min_score: float = 0.0
    ) -> List[Dict]:
        """Search for similar documents."""
        query_embedding = self.model.encode(query).tolist()

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where=filter_dict
        )

        # Format results with similarity scores
        formatted_results = []
        for i in range(len(results["ids"][0])):
            # ChromaDB returns distances, convert to similarity
            distance = results["distances"][0][i]
            similarity = 1 - distance  # For cosine distance

            if similarity >= min_score:
                formatted_results.append({
                    "id": results["ids"][0][i],
                    "content": results["documents"][0][i],
                    "metadata": results["metadatas"][0][i],
                    "similarity": round(similarity, 4)
                })

        return formatted_results

    def delete(self, ids: List[str]):
        """Delete documents by ID."""
        self.collection.delete(ids=ids)

    def count(self) -> int:
        """Get total document count."""
        return self.collection.count()
# Usage example
if __name__ == "__main__":
    # Initialize search engine
    engine = SemanticSearchEngine()

    # Sample documents
    docs = [
        Document(
            content="Python is a high-level programming language known for its readability",
            metadata={"category": "programming", "language": "python"}
        ),
        Document(
            content="JavaScript is essential for web development and runs in browsers",
            metadata={"category": "programming", "language": "javascript"}
        ),
        Document(
            content="Machine learning is a subset of artificial intelligence",
            metadata={"category": "ml", "topic": "fundamentals"}
        ),
        Document(
            content="Deep learning uses neural networks with many layers",
            metadata={"category": "ml", "topic": "deep-learning"}
        ),
        Document(
            content="Docker containers package applications with their dependencies",
            metadata={"category": "devops", "tool": "docker"}
        ),
    ]

    # Index documents
    engine.add_documents(docs)
    print(f"Total documents: {engine.count()}")

    # Search
    print("\n--- Search: 'How do I learn AI?' ---")
    results = engine.search("How do I learn AI?", n_results=3)
    for r in results:
        print(f"[{r['similarity']:.3f}] {r['content'][:60]}...")

    # Search with filter
    print("\n--- Search: 'coding' (filtered to programming) ---")
    results = engine.search(
        "coding",
        n_results=3,
        filter_dict={"category": "programming"}
    )
    for r in results:
        print(f"[{r['similarity']:.3f}] {r['content'][:60]}...")
Monitoring and Observability
Key Metrics to Track
import time
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SearchMetrics:
    """Track search performance."""
    query: str
    latency_ms: float
    num_results: int
    top_score: float


class MonitoredSearchEngine(SemanticSearchEngine):
    """Search engine with built-in monitoring."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.metrics: List[SearchMetrics] = []

    def search(self, query: str, **kwargs) -> List[Dict]:
        start = time.perf_counter()
        results = super().search(query, **kwargs)
        latency = (time.perf_counter() - start) * 1000

        # Record metrics
        self.metrics.append(SearchMetrics(
            query=query,
            latency_ms=latency,
            num_results=len(results),
            top_score=results[0]["similarity"] if results else 0.0
        ))

        return results

    def get_stats(self) -> Dict:
        """Get aggregated statistics."""
        if not self.metrics:
            return {}

        latencies = [m.latency_ms for m in self.metrics]
        scores = [m.top_score for m in self.metrics]

        return {
            "total_searches": len(self.metrics),
            "avg_latency_ms": sum(latencies) / len(latencies),
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)],
            "avg_top_score": sum(scores) / len(scores),
            "zero_result_rate": sum(1 for m in self.metrics if m.num_results == 0) / len(self.metrics)
        }
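Usage mirrors the base engine, for example:

engine = MonitoredSearchEngine()
engine.search("What is deep learning?", n_results=3)
engine.search("vector databases", n_results=3)
print(engine.get_stats())  # latency, top-score, and zero-result stats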
Migration Between Vector Databases
If you need to switch databases:
def migrate_chroma_to_qdrant(
    chroma_client: chromadb.Client,
    qdrant_client,
    collection_name: str,
    batch_size: int = 100
):
    """Migrate from ChromaDB to Qdrant."""
    from qdrant_client.models import PointStruct, VectorParams, Distance

    # Get all data from ChromaDB
    chroma_collection = chroma_client.get_collection(collection_name)
    data = chroma_collection.get(include=["documents", "embeddings", "metadatas"])

    # Create Qdrant collection
    vector_size = len(data["embeddings"][0])
    qdrant_client.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
    )

    # Migrate in batches
    points = []
    for i, (id_, emb, doc, meta) in enumerate(zip(
        data["ids"], data["embeddings"], data["documents"], data["metadatas"]
    )):
        points.append(PointStruct(
            id=i,
            vector=list(emb),  # works whether Chroma returns lists or numpy arrays
            payload={"document": doc, **(meta or {})}
        ))

        if len(points) >= batch_size:
            qdrant_client.upsert(collection_name=collection_name, points=points)
            points = []

    if points:
        qdrant_client.upsert(collection_name=collection_name, points=points)

    print(f"Migrated {len(data['ids'])} vectors")
Conclusion
Vector databases are essential infrastructure for modern AI applications. Key takeaways:
- Understand ANN algorithms: HNSW for most cases, IVF-PQ for memory constraints
- ChromaDB: Best for getting started and prototyping
- Pinecone: Best managed service for production
- Weaviate: Best for hybrid search requirements
- Qdrant: Best for performance-critical applications
- Monitor performance: Track latency, recall, and result quality
Start with ChromaDB for development, then evaluate managed options (Pinecone) or self-hosted (Qdrant, Weaviate) based on your scaling needs.
References
- ChromaDB Documentation: https://docs.trychroma.com
- Pinecone Documentation: https://docs.pinecone.io
- Weaviate Documentation: https://weaviate.io/developers/weaviate
- Qdrant Documentation: https://qdrant.tech/documentation
- Malkov & Yashunin "Efficient and robust approximate nearest neighbor search using HNSW" (2018)
- Jegou et al. "Product Quantization for Nearest Neighbor Search" (2011)