Understanding Text Embeddings: From Word2Vec to Sentence Transformers
Jared Chung
Introduction
Text embeddings are the foundation of modern NLP. They transform words, sentences, or documents into dense vector representations that capture semantic meaning. Understanding embeddings is crucial because nearly every NLP task - from search to classification to chatbots - relies on them.
In this post, we'll trace the evolution from early word embeddings to state-of-the-art sentence transformers, with complete code you can run immediately.
Prerequisites
pip install gensim sentence-transformers numpy scikit-learn matplotlib
Why Embeddings Matter
The Problem with Traditional Representations
Before embeddings, text was represented using sparse methods:
One-Hot Encoding
Each word gets a vector with a single 1 and all other positions 0:
# Vocabulary: ["cat", "dog", "bird"]
# cat = [1, 0, 0]
# dog = [0, 1, 0]
# bird = [0, 0, 1]
Problems:
- Vocabulary of 100,000 words = 100,000-dimensional vectors
- No semantic similarity: cos("cat", "dog") = 0, exactly the same as cos("cat", "quantum") (see the sketch below)
- Cannot handle out-of-vocabulary (OOV) words
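To make the similarity problem concrete, here is a minimal sketch (plain NumPy, not tied to any particular library) showing that every pair of distinct one-hot vectors is orthogonal:
import numpy as np
# Toy vocabulary and its one-hot vectors
vocab = ["cat", "dog", "bird", "quantum"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}
def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine(one_hot["cat"], one_hot["dog"]))      # 0.0
print(cosine(one_hot["cat"], one_hot["quantum"]))  # 0.0 - "cat" is no closer to "dog" than to "quantum"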
TF-IDF (Term Frequency-Inverse Document Frequency)
Weights words by importance in a document relative to a corpus:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
"Machine learning is fascinating",
"Deep learning uses neural networks",
"Neural networks learn from data"
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
print(f"Shape: {tfidf_matrix.shape}") # (3, 10) - 3 docs, 10 unique words
print(f"Feature names: {vectorizer.get_feature_names_out()}")
Problems:
- Still high-dimensional and sparse
- No semantic understanding: "happy" and "joyful" are unrelated
- Order of words is lost (bag-of-words)
The Embedding Solution
Embeddings learn dense, low-dimensional vectors where semantically similar words are close together:
# Dense embedding (e.g., 300 dimensions)
# cat = [0.2, -0.4, 0.1, ..., 0.8] # 300 numbers
# dog = [0.3, -0.3, 0.2, ..., 0.7] # Similar to cat!
# bird = [0.1, -0.5, 0.3, ..., 0.6] # Also similar (animals)
# car = [-0.5, 0.2, -0.1, ..., -0.3] # Very different
Word2Vec: The Pioneer
Word2Vec, introduced by Mikolov et al. at Google in 2013, was revolutionary. It showed that neural networks could learn meaningful word representations from raw text.
The Core Insight: Distributional Hypothesis
"You shall know a word by the company it keeps" - J.R. Firth (1957)
Words that appear in similar contexts have similar meanings. Word2Vec operationalizes this by predicting context from words (or vice versa).
Two Architectures
Skip-gram: Predict Context from Word
Given a target word, predict surrounding context words.
Sentence: "The cat sat on the mat"
Target: "sat"
Context (window=2): ["The", "cat", "on", "the"]
Training pairs:
("sat", "The"), ("sat", "cat"), ("sat", "on"), ("sat", "the")
Best for: Rare words, smaller datasets
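A minimal sketch (illustrative only, not gensim's internal implementation) of how these training pairs are generated:
# Generate (target, context) skip-gram pairs with a context window
sentence = "The cat sat on the mat".split()
window = 2
pairs = []
for i, target in enumerate(sentence):
    # Context words are within `window` positions on either side of the target
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))
print([(t, c) for t, c in pairs if t == "sat"])
# [('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]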
CBOW (Continuous Bag of Words): Predict Word from Context
Given context words, predict the target word.
Context: ["The", "cat", "on", "the"]
Target: "sat"
Best for: Frequent words, larger datasets (faster training)
Using Pre-trained Word2Vec
Google released Word2Vec trained on 100 billion words from Google News:
import gensim.downloader as api
import numpy as np
# Download pre-trained model (1.5GB, ~3 million words, 300 dimensions)
print("Loading Word2Vec (this may take a few minutes)...")
word2vec = api.load('word2vec-google-news-300')
# Basic operations
print(f"\nVocabulary size: {len(word2vec)}")
print(f"Vector dimensions: {word2vec.vector_size}")
# Get vector for a word
king_vector = word2vec['king']
print(f"\n'king' vector shape: {king_vector.shape}")
print(f"'king' vector (first 10 dims): {king_vector[:10]}")
# Find similar words
print("\nWords similar to 'python':")
for word, score in word2vec.most_similar('python', topn=5):
    print(f" {word}: {score:.4f}")
# Word analogies: king - man + woman = queen
print("\nWord analogy: king - man + woman = ?")
result = word2vec.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=3
)
for word, score in result:
    print(f" {word}: {score:.4f}")
# Calculate similarity between words
similarity = word2vec.similarity('cat', 'dog')
print(f"\nSimilarity('cat', 'dog'): {similarity:.4f}")
print(f"Similarity('cat', 'car'): {word2vec.similarity('cat', 'car'):.4f}")
Training Your Own Word2Vec
For domain-specific applications, train on your own corpus:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import logging
# Enable logging to see training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Sample corpus (in practice, use thousands of documents)
corpus = [
"Machine learning algorithms learn patterns from data",
"Deep learning is a subset of machine learning",
"Neural networks are inspired by biological neurons",
"Supervised learning requires labeled training data",
"Unsupervised learning finds hidden patterns",
"Reinforcement learning learns through trial and error",
"Natural language processing deals with text data",
"Computer vision processes images and videos",
"Transfer learning reuses pre-trained models",
"Feature engineering creates informative features",
]
# Tokenize sentences
tokenized_corpus = [simple_preprocess(doc) for doc in corpus]
print(f"Tokenized example: {tokenized_corpus[0]}")
# Train Word2Vec model
model = Word2Vec(
sentences=tokenized_corpus,
vector_size=100, # Embedding dimensions
window=5, # Context window size
min_count=1, # Minimum word frequency
workers=4, # Parallel threads
epochs=100, # Training epochs (more for small corpus)
sg=1, # 1 for Skip-gram, 0 for CBOW
)
# Explore the model
print(f"\nVocabulary size: {len(model.wv)}")
print(f"Words in vocabulary: {list(model.wv.key_to_index.keys())[:10]}")
# Find similar words (limited by small corpus)
print("\nWords similar to 'learning':")
for word, score in model.wv.most_similar('learning', topn=3):
    print(f" {word}: {score:.4f}")
# Save and load model
model.save("word2vec_custom.model")
loaded_model = Word2Vec.load("word2vec_custom.model")
How Word2Vec Actually Works (The Math)
Skip-gram uses a shallow neural network:
- Input layer: One-hot encoded word (V dimensions)
- Hidden layer: Word embedding (N dimensions, e.g., 300)
- Output layer: Softmax over vocabulary (V dimensions)
The training objective is to maximize the average log probability of the context words given each target word:
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)
where p(w_O \mid w_I) = \frac{\exp(v'_{w_O}^T v_{w_I})}{\sum_{w=1}^{V} \exp(v'_w^T v_{w_I})}
Negative Sampling makes this tractable by only updating a few "negative" examples instead of the entire vocabulary.
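Concretely (following Mikolov et al., 2013), with k negative samples the per-pair objective becomes:
\log \sigma(v'_{w_O}^T v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v'_{w_i}^T v_{w_I}) \right]
where \sigma is the sigmoid function and P_n(w) is a noise distribution (unigram frequencies raised to the 3/4 power). The paper suggests k of 5-20 for small datasets and 2-5 for large ones.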
GloVe: Global Vectors
GloVe (Stanford, 2014) takes a different approach: instead of a sliding-window prediction task, it fits word vectors directly to global co-occurrence statistics, effectively factorizing the (log) word co-occurrence matrix.
Key Insight
The ratio of co-occurrence probabilities encodes meaning:
| Word k | P(k|ice) | P(k|steam) | Ratio |
|---|---|---|---|
| solid | 0.00019 | 0.000022 | 8.9 |
| gas | 0.000066 | 0.00078 | 0.085 |
| water | 0.003 | 0.0022 | 1.36 |
Words related to "ice" but not "steam" (like "solid") have a high ratio, words related to "steam" but not "ice" (like "gas") have a low ratio, and words related to both or neither (like "water") have a ratio near 1.
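Concretely, GloVe learns vectors by minimizing a weighted least-squares objective over the non-zero entries of the co-occurrence matrix X:
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
where f is a weighting function that caps the influence of very frequent co-occurrences (the paper uses f(x) = (x / x_{max})^{3/4} for x < x_{max} and 1 otherwise).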
Using Pre-trained GloVe
import numpy as np
from typing import Dict
def load_glove(path: str) -> Dict[str, np.ndarray]:
    """
    Load GloVe embeddings from a text file.
    Download from: https://nlp.stanford.edu/projects/glove/
    Common files:
    - glove.6B.50d.txt (50 dimensions, trained on Wikipedia + Gigaword, 6B tokens)
    - glove.6B.300d.txt (300 dimensions)
    - glove.42B.300d.txt (trained on Common Crawl, 42B tokens)
    """
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings
# Load embeddings (download glove.6B.zip first)
# glove = load_glove('glove.6B.100d.txt')
# print(f"Loaded {len(glove)} word vectors")
# Example with synthetic data (replace with real GloVe)
glove = {
'king': np.random.randn(100).astype('float32'),
'queen': np.random.randn(100).astype('float32'),
'man': np.random.randn(100).astype('float32'),
'woman': np.random.randn(100).astype('float32'),
}
def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Calculate cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def find_similar(word: str, embeddings: dict, topn: int = 5):
    """Find most similar words using cosine similarity."""
    if word not in embeddings:
        return []
    word_vec = embeddings[word]
    similarities = []
    for other_word, other_vec in embeddings.items():
        if other_word != word:
            sim = cosine_similarity(word_vec, other_vec)
            similarities.append((other_word, sim))
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:topn]
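A quick usage sketch with the helpers above. With real GloVe vectors loaded (instead of the random placeholders), the classic analogy king - man + woman should land near queen:
# Usage sketch - results are only meaningful with real GloVe vectors loaded
print(find_similar('king', glove, topn=3))
# Word analogy via vector arithmetic: king - man + woman ≈ queen
analogy_vec = glove['king'] - glove['man'] + glove['woman']
best_match = max(
    (w for w in glove if w not in {'king', 'man', 'woman'}),
    key=lambda w: cosine_similarity(analogy_vec, glove[w]),
)
print(f"king - man + woman ≈ {best_match}")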
FastText: Subword Embeddings
FastText (Facebook, 2016) improves on Word2Vec by representing words as bags of character n-grams.
Key Innovation
The word "where" with n=3 becomes:
<wh, whe, her, ere, re>, <where>
The word embedding is the sum of all n-gram embeddings.
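A small sketch of the decomposition (illustrative only; the real library hashes n-grams into a fixed number of buckets rather than storing each one explicitly):
def char_ngrams(word: str, min_n: int = 3, max_n: int = 3) -> list[str]:
    """Decompose a word into character n-grams, FastText-style."""
    wrapped = f"<{word}>"  # Boundary markers distinguish prefixes and suffixes
    ngrams = [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]
    return ngrams + [wrapped]  # The full word is included as its own "n-gram"

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']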
Benefits
- OOV handling: Can generate vectors for unseen words
- Morphology: Captures prefixes, suffixes, roots
- Rare words: Better representations through subword sharing
from gensim.models import FastText
from gensim.utils import simple_preprocess
# Training corpus
corpus = [
"Machine learning algorithms learn patterns from data",
"Deep learning is a subset of machine learning",
"Neural networks process information in layers",
"Convolutional neural networks excel at image recognition",
"Recurrent neural networks handle sequential data",
"Transformers revolutionized natural language processing",
"Pre-training and fine-tuning are common strategies",
"Embeddings represent words as dense vectors",
]
tokenized = [simple_preprocess(doc) for doc in corpus]
# Train FastText model
model = FastText(
sentences=tokenized,
vector_size=100,
window=5,
min_count=1,
min_n=2, # Minimum n-gram length
max_n=5, # Maximum n-gram length
epochs=50,
)
print("FastText trained successfully!")
print(f"Vocabulary: {list(model.wv.key_to_index.keys())}")
# Key advantage: Get vectors for OOV words!
oov_word = "deeplearning" # Not in vocabulary
try:
    oov_vector = model.wv[oov_word]
    print(f"\nVector for OOV word '{oov_word}':")
    print(f" Shape: {oov_vector.shape}")
    print(f" First 5 dims: {oov_vector[:5]}")
except KeyError as e:
    print(f"Word not found: {e}")
# Compare: Word2Vec would fail on OOV words
# word2vec_model.wv["deeplearning"] # KeyError!
The Context Problem
All word embedding methods share a fundamental limitation: one embedding per word, regardless of context.
# "bank" has the same vector in both sentences:
sentence1 = "I deposited money in the bank" # Financial institution
sentence2 = "I sat by the river bank" # Edge of river
# This is a problem for polysemous words!
This motivated contextualized embeddings like ELMo, BERT, and ultimately Sentence Transformers.
Sentence Transformers (SBERT)
Sentence-BERT (Reimers & Gurevych, 2019) produces sentence-level embeddings that can be compared with cosine similarity. It's built on BERT but optimized for generating fixed-size sentence representations.
Why Not Just Use BERT Directly?
BERT wasn't designed for sentence similarity. Using BERT naively requires:
- Passing both sentences through BERT together
- O(n²) comparisons for n sentences (10,000 sentences means roughly 50 million pairwise forward passes; see the arithmetic below)
SBERT generates independent embeddings: O(n) forward passes, then fast cosine similarity.
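The arithmetic behind that claim, for 10,000 sentences:
# Cross-encoder: every pair of sentences goes through BERT together
# Bi-encoder (SBERT): one forward pass per sentence, then cheap cosine similarity
n = 10_000
cross_encoder_passes = n * (n - 1) // 2
sbert_passes = n
print(f"{cross_encoder_passes:,} pairwise forward passes")  # 49,995,000
print(f"{sbert_passes:,} forward passes")                   # 10,000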
Complete Working Example
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Load pre-trained model
print("Loading Sentence Transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
# Model info
print(f"Model: all-MiniLM-L6-v2")
print(f"Max sequence length: {model.max_seq_length}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")
# Encode sentences
sentences = [
"Machine learning is a subset of artificial intelligence",
"Deep learning uses neural networks with many layers",
"AI systems can learn from experience",
"The weather is nice today",
"I enjoy hiking in the mountains",
"Neural networks are inspired by the human brain",
]
print("\nEncoding sentences...")
embeddings = model.encode(sentences, show_progress_bar=True)
print(f"Embeddings shape: {embeddings.shape}") # (6, 384)
# Calculate pairwise similarities
print("\n=== Pairwise Similarity Matrix ===")
similarity_matrix = cosine_similarity(embeddings)
for i, sent1 in enumerate(sentences):
    print(f"\n'{sent1[:50]}...'")
    for j, sent2 in enumerate(sentences):
        if i != j:
            print(f" vs '{sent2[:40]}...': {similarity_matrix[i][j]:.3f}")
Semantic Search Implementation
from sentence_transformers import SentenceTransformer, util
import torch
class SemanticSearch:
    """Production-ready semantic search with sentence transformers."""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.corpus = []
        self.corpus_embeddings = None

    def index(self, documents: list[str]):
        """Index a corpus of documents."""
        self.corpus = documents
        print(f"Encoding {len(documents)} documents...")
        self.corpus_embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents")

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Search the corpus for relevant documents."""
        if self.corpus_embeddings is None:
            raise ValueError("No documents indexed. Call index() first.")
        # Encode query
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        # Calculate similarities
        scores = util.cos_sim(query_embedding, self.corpus_embeddings)[0]
        # Get top results
        top_results = torch.topk(scores, k=min(top_k, len(self.corpus)))
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                'document': self.corpus[idx],
                'score': score.item(),
                'index': idx.item()
            })
        return results
# Usage
search_engine = SemanticSearch()
# Index documents
documents = [
"Python is a high-level programming language known for its simplicity",
"JavaScript is essential for web development and runs in browsers",
"Machine learning models can predict outcomes from data",
"Docker containers package applications with their dependencies",
"REST APIs enable communication between web services",
"Neural networks are the foundation of deep learning",
"Git is a version control system for tracking code changes",
"Kubernetes orchestrates containerized applications at scale",
"SQL databases store data in structured tables",
"NoSQL databases offer flexible schema designs",
]
search_engine.index(documents)
# Search
queries = [
"How do I learn AI?",
"What's the best language for beginners?",
"How to deploy applications?",
]
for query in queries:
    print(f"\n🔍 Query: '{query}'")
    results = search_engine.search(query, top_k=3)
    for i, result in enumerate(results, 1):
        print(f" {i}. [{result['score']:.3f}] {result['document'][:60]}...")
Choosing the Right Model
| Model | Dimensions | Speed | Quality | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⚡ Fast | Good | General purpose, production |
| all-mpnet-base-v2 | 768 | Medium | Best | When quality matters most |
| paraphrase-multilingual-MiniLM-L12-v2 | 384 | Fast | Good | 50+ languages |
| all-MiniLM-L12-v2 | 384 | Fast | Better | Balance of speed/quality |
# Compare models
models_to_compare = [
'all-MiniLM-L6-v2',
'all-mpnet-base-v2',
]
test_sentences = [
"The quick brown fox jumps over the lazy dog",
"A fast auburn fox leaps above a sleepy canine",
]
for model_name in models_to_compare:
    model = SentenceTransformer(model_name)
    embeddings = model.encode(test_sentences)
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    print(f"{model_name}: similarity = {similarity:.4f}")
Fine-tuning for Your Domain
For domain-specific applications, fine-tuning dramatically improves performance:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Prepare training data
# Option 1: Pairs with similarity scores (0-1)
train_examples = [
InputExample(texts=["Neural networks learn patterns", "Deep learning finds patterns in data"], label=0.9),
InputExample(texts=["Python is a programming language", "The weather is sunny"], label=0.1),
InputExample(texts=["Machine learning predicts outcomes", "ML models make predictions"], label=0.95),
InputExample(texts=["Transformers use attention", "Attention mechanisms in transformers"], label=0.85),
# Add more examples... (typically need 1000+)
]
# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Define loss function
train_loss = losses.CosineSimilarityLoss(model)
# Fine-tune
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=4,
warmup_steps=100,
output_path='fine-tuned-model',
show_progress_bar=True,
)
# Load fine-tuned model
fine_tuned_model = SentenceTransformer('fine-tuned-model')
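As a quick sanity check (a sketch; actual numbers depend entirely on your training data), compare how the base and fine-tuned models score an in-domain pair:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["Neural networks learn patterns", "Deep learning finds patterns in data"]
base_model = SentenceTransformer('all-MiniLM-L6-v2')
for name, m in [("base", base_model), ("fine-tuned", fine_tuned_model)]:
    emb = m.encode(pair)
    score = cosine_similarity([emb[0]], [emb[1]])[0][0]
    print(f"{name}: {score:.4f}")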
Visualizing Embeddings
Use dimensionality reduction to visualize embeddings in 2D:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
# Prepare sentences from different categories
sentences = {
'programming': [
"Python is a programming language",
"JavaScript runs in the browser",
"Java is used for enterprise applications",
"C++ offers low-level control",
],
'animals': [
"Dogs are loyal pets",
"Cats are independent animals",
"Birds can fly through the sky",
"Fish live underwater",
],
'food': [
"Pizza is a popular Italian dish",
"Sushi is Japanese cuisine",
"Tacos are Mexican food",
"Pasta comes in many shapes",
],
}
# Flatten and track categories
all_sentences = []
categories = []
for category, sents in sentences.items():
    all_sentences.extend(sents)
    categories.extend([category] * len(sents))
# Get embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(all_sentences)
# Reduce to 2D with t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)
# Plot
plt.figure(figsize=(10, 8))
colors = {'programming': 'blue', 'animals': 'green', 'food': 'red'}
for i, (x, y) in enumerate(embeddings_2d):
    category = categories[i]
    plt.scatter(x, y, c=colors[category], s=100, alpha=0.7)
    plt.annotate(all_sentences[i][:20] + '...', (x, y), fontsize=8)
plt.legend(handles=[
plt.scatter([], [], c=color, label=cat)
for cat, color in colors.items()
])
plt.title('Sentence Embeddings Visualization (t-SNE)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.tight_layout()
plt.savefig('embeddings_visualization.png', dpi=150)
plt.show()
Practical Tips and Best Practices
1. Always Normalize for Cosine Similarity
from sklearn.preprocessing import normalize
embeddings = model.encode(sentences)
normalized_embeddings = normalize(embeddings) # L2 normalization
# Now dot product = cosine similarity
similarity = np.dot(normalized_embeddings[0], normalized_embeddings[1])
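Recent versions of sentence-transformers can also normalize at encode time via a convenience flag, which is equivalent to the sklearn call above:
# Let the library L2-normalize the embeddings during encoding
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = np.dot(embeddings[0], embeddings[1])  # dot product == cosine similarity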
2. Batch Processing for Efficiency
# Bad: One at a time
embeddings = [model.encode(s) for s in sentences] # Slow!
# Good: Batch encode
embeddings = model.encode(sentences, batch_size=32) # Fast!
3. Use GPU When Available
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print(f"Using device: {device}")
4. Truncate Long Texts
# Models have maximum sequence lengths
max_length = model.max_seq_length # Usually 256 or 512 tokens
# For longer documents, consider:
# 1. Truncation (loses information)
# 2. Chunking and averaging
# 3. Using models designed for long text (e.g., INSTRUCTOR)
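A minimal sketch of option 2 (chunking and averaging), chunking naively by words; a production version would chunk by tokens and might weight or overlap chunks:
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_long_text(text: str, model: SentenceTransformer, chunk_words: int = 150) -> np.ndarray:
    """Split a long document into word chunks, encode each, and average the embeddings."""
    words = text.split()
    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ] or [""]
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)  # One fixed-size vector for the whole document

model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embedding = embed_long_text("some very long document " * 500, model)
print(doc_embedding.shape)  # (384,)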
Summary Comparison
| Method | Year | Strengths | Weaknesses |
|---|---|---|---|
| Word2Vec | 2013 | Fast, interpretable | No context, OOV issues |
| GloVe | 2014 | Global statistics | No context, OOV issues |
| FastText | 2016 | Handles OOV, morphology | No context |
| SBERT | 2019 | Context-aware, high quality | Slower, needs GPU |
Conclusion
Text embeddings have evolved dramatically:
- Word2Vec/GloVe: Pioneered dense word representations
- FastText: Added subword information for OOV handling
- Sentence Transformers: State-of-the-art semantic similarity
For most modern NLP tasks, start with Sentence Transformers (specifically all-MiniLM-L6-v2 for speed or all-mpnet-base-v2 for quality). Only use word embeddings for specific use cases like word analogy tasks or when computational resources are extremely limited.
References
- Mikolov et al. "Efficient Estimation of Word Representations in Vector Space" (2013)
- Pennington et al. "GloVe: Global Vectors for Word Representation" (2014)
- Bojanowski et al. "Enriching Word Vectors with Subword Information" (2017)
- Reimers and Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (2019)
- Sentence Transformers Documentation: https://www.sbert.net/