Understanding Text Embeddings: From Word2Vec to Sentence Transformers

Jared Chung

Introduction

Text embeddings are the foundation of modern NLP. They transform words, sentences, or documents into dense vector representations that capture semantic meaning. Understanding embeddings is crucial because nearly every NLP task - from search to classification to chatbots - relies on them.

In this post, we'll trace the evolution from early word embeddings to state-of-the-art sentence transformers, with complete code you can run immediately.

Prerequisites

pip install gensim sentence-transformers numpy scikit-learn matplotlib

Why Embeddings Matter

The Problem with Traditional Representations

Before embeddings, text was represented using sparse methods:

One-Hot Encoding

Each word gets a vector with a single 1 and all other positions 0:

# Vocabulary: ["cat", "dog", "bird"]
# cat  = [1, 0, 0]
# dog  = [0, 1, 0]
# bird = [0, 0, 1]

Problems:

  • Vocabulary of 100,000 words = 100,000-dimensional vectors
  • No semantic similarity: cos("cat", "dog") = 0, same as cos("cat", "quantum") (see the quick check below)
  • Cannot handle out-of-vocabulary (OOV) words
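
A quick numpy check makes the similarity problem concrete. This is a minimal sketch with a made-up three-word vocabulary:

import numpy as np

# Toy vocabulary and its one-hot vectors
vocab = ["cat", "dog", "quantum"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine(v1, v2):
    """Cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine(one_hot["cat"], one_hot["dog"]))      # 0.0
print(cosine(one_hot["cat"], one_hot["quantum"]))  # 0.0 -- "dog" is no closer to "cat" than "quantum" is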

TF-IDF (Term Frequency-Inverse Document Frequency)

Weights words by importance in a document relative to a corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Machine learning is fascinating",
    "Deep learning uses neural networks",
    "Neural networks learn from data"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print(f"Shape: {tfidf_matrix.shape}")  # (3, 10) - 3 docs, 10 unique words
print(f"Feature names: {vectorizer.get_feature_names_out()}")

Problems:

  • Still high-dimensional and sparse
  • No semantic understanding: "happy" and "joyful" are unrelated
  • Order of words is lost (bag-of-words), as the sketch below shows
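
A short sketch of the word-order problem, using two invented sentences that contain the same words in a different order; TF-IDF assigns them identical vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog bit the man", "the man bit the dog"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Same words, same counts -> identical bag-of-words rows despite opposite meanings
print((X[0].toarray() == X[1].toarray()).all())  # True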

The Embedding Solution

Embeddings learn dense, low-dimensional vectors where semantically similar words are close together:

# Dense embedding (e.g., 300 dimensions)
# cat  = [0.2, -0.4, 0.1, ..., 0.8]  # 300 numbers
# dog  = [0.3, -0.3, 0.2, ..., 0.7]  # Similar to cat!
# bird = [0.1, -0.5, 0.3, ..., 0.6]  # Also similar (animals)
# car  = [-0.5, 0.2, -0.1, ..., -0.3]  # Very different
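
With dense vectors, cosine similarity becomes a graded signal instead of all-or-nothing. Here is a tiny sketch using invented four-dimensional vectors (real embeddings have hundreds of dimensions):

import numpy as np

cat = np.array([0.2, -0.4, 0.1, 0.8])
dog = np.array([0.3, -0.3, 0.2, 0.7])
car = np.array([-0.5, 0.2, -0.1, -0.3])

def cosine(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(f"cat vs dog: {cosine(cat, dog):.2f}")  # close to 1 (similar)
print(f"cat vs car: {cosine(cat, car):.2f}")  # negative (dissimilar)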

Word2Vec: The Pioneer

Word2Vec, introduced by Mikolov et al. at Google in 2013, was revolutionary. It showed that neural networks could learn meaningful word representations from raw text.

The Core Insight: Distributional Hypothesis

"You shall know a word by the company it keeps" - J.R. Firth (1957)

Words that appear in similar contexts have similar meanings. Word2Vec operationalizes this by predicting context from words (or vice versa).

Two Architectures

Skip-gram: Predict Context from Word

Given a target word, predict surrounding context words.

Sentence: "The cat sat on the mat"
Target: "sat"
Context (window=2): ["The", "cat", "on", "the"]

Training pairs:
("sat", "The"), ("sat", "cat"), ("sat", "on"), ("sat", "the")

Best for: Rare words, smaller datasets

CBOW (Continuous Bag of Words): Predict Word from Context

Given context words, predict the target word.

Context: ["The", "cat", "on", "the"]
Target: "sat"

Best for: Frequent words, larger datasets (faster training)
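
To make the two training setups concrete, here is a small sketch that builds both kinds of training pairs from a sentence. It is a simplified illustration, not gensim's actual preprocessing (which also applies subsampling and dynamic window sizes):

def training_pairs(tokens, window=2):
    """Build (target, context) pairs for Skip-gram and (context, target) pairs for CBOW."""
    skipgram, cbow = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        skipgram.extend((target, c) for c in context)  # Skip-gram: predict each context word from the target
        cbow.append((context, target))                 # CBOW: predict the target from its context
    return skipgram, cbow

sg_pairs, cbow_pairs = training_pairs("the cat sat on the mat".split())
print(sg_pairs[:4])   # [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
print(cbow_pairs[2])  # (['the', 'cat', 'on', 'the'], 'sat')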

Using Pre-trained Word2Vec

Google released Word2Vec trained on 100 billion words from Google News:

import gensim.downloader as api
import numpy as np

# Download pre-trained model (1.5GB, ~3 million words, 300 dimensions)
print("Loading Word2Vec (this may take a few minutes)...")
word2vec = api.load('word2vec-google-news-300')

# Basic operations
print(f"\nVocabulary size: {len(word2vec)}")
print(f"Vector dimensions: {word2vec.vector_size}")

# Get vector for a word
king_vector = word2vec['king']
print(f"\n'king' vector shape: {king_vector.shape}")
print(f"'king' vector (first 10 dims): {king_vector[:10]}")

# Find similar words
print("\nWords similar to 'python':")
for word, score in word2vec.most_similar('python', topn=5):
    print(f"  {word}: {score:.4f}")

# Word analogies: king - man + woman = queen
print("\nWord analogy: king - man + woman = ?")
result = word2vec.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=3
)
for word, score in result:
    print(f"  {word}: {score:.4f}")

# Calculate similarity between words
similarity = word2vec.similarity('cat', 'dog')
print(f"\nSimilarity('cat', 'dog'): {similarity:.4f}")
print(f"Similarity('cat', 'car'): {word2vec.similarity('cat', 'car'):.4f}")

Training Your Own Word2Vec

For domain-specific applications, train on your own corpus:

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import logging

# Enable logging to see training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample corpus (in practice, use thousands of documents)
corpus = [
    "Machine learning algorithms learn patterns from data",
    "Deep learning is a subset of machine learning",
    "Neural networks are inspired by biological neurons",
    "Supervised learning requires labeled training data",
    "Unsupervised learning finds hidden patterns",
    "Reinforcement learning learns through trial and error",
    "Natural language processing deals with text data",
    "Computer vision processes images and videos",
    "Transfer learning reuses pre-trained models",
    "Feature engineering creates informative features",
]

# Tokenize sentences
tokenized_corpus = [simple_preprocess(doc) for doc in corpus]
print(f"Tokenized example: {tokenized_corpus[0]}")

# Train Word2Vec model
model = Word2Vec(
    sentences=tokenized_corpus,
    vector_size=100,      # Embedding dimensions
    window=5,             # Context window size
    min_count=1,          # Minimum word frequency
    workers=4,            # Parallel threads
    epochs=100,           # Training epochs (more for small corpus)
    sg=1,                 # 1 for Skip-gram, 0 for CBOW
)

# Explore the model
print(f"\nVocabulary size: {len(model.wv)}")
print(f"Words in vocabulary: {list(model.wv.key_to_index.keys())[:10]}")

# Find similar words (limited by small corpus)
print("\nWords similar to 'learning':")
for word, score in model.wv.most_similar('learning', topn=3):
    print(f"  {word}: {score:.4f}")

# Save and load model
model.save("word2vec_custom.model")
loaded_model = Word2Vec.load("word2vec_custom.model")

How Word2Vec Actually Works (The Math)

Skip-gram uses a shallow neural network:

  1. Input layer: One-hot encoded word (V dimensions)
  2. Hidden layer: Word embedding (N dimensions, e.g., 300)
  3. Output layer: Softmax over vocabulary (V dimensions)

The training objective is to maximize:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$$

where T is the number of training words, c is the context window size, and the probability is a softmax over the vocabulary:

$$p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left({v'_w}^{\top} v_{w_I}\right)}$$

Negative Sampling makes this tractable by only updating a few "negative" examples instead of the entire vocabulary.
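
As a rough sketch of what negative sampling computes for a single (target, context) pair, here is the per-pair loss with k random negatives. The vectors and the sampling here are simplified stand-ins (the original implementation samples negatives from a smoothed unigram distribution and updates both embedding tables):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, dim, k = 1000, 50, 5

W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # target ("input") embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # context ("output") embeddings

target, context = 10, 42
negatives = rng.integers(0, vocab_size, size=k)  # simplification: uniform sampling

# loss = -log sigmoid(v_ctx . v_tgt) - sum over negatives of log sigmoid(-v_neg . v_tgt)
pos_term = np.log(sigmoid(W_out[context] @ W_in[target]))
neg_terms = np.log(sigmoid(-(W_out[negatives] @ W_in[target])))
loss = -(pos_term + neg_terms.sum())
print(f"Negative-sampling loss for one pair: {loss:.4f}")

Only k + 1 rows of the output matrix (and one row of the input matrix) are touched per pair, instead of a full softmax over all V words.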

GloVe: Global Vectors

GloVe (Stanford, 2014) takes a different approach: it directly factorizes the word co-occurrence matrix.

Key Insight

The ratio of co-occurrence probabilities encodes meaning:

Word k     P(k|ice)     P(k|steam)    Ratio P(k|ice)/P(k|steam)
solid      0.00019      0.000022      8.9
gas        0.000066     0.00078       0.085
water      0.003        0.0022        1.36

Words related to "ice" but not "steam" have high ratios; vice versa for low ratios; neutral words have ratios near 1.
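
A toy sketch of the quantity GloVe builds on, the co-occurrence probabilities P(k|w) and their ratios, computed from a tiny invented corpus (real GloVe is estimated from billions of tokens, so the exact numbers are only illustrative):

from collections import Counter, defaultdict

corpus = [
    "ice is solid water",
    "steam is hot water",
    "steam is a gas",
    "ice is very cold",
]
window = 3

# Count how often each word appears within the context window of every other word
cooc = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[word][tokens[j]] += 1

def p(k, w):
    """P(k | w): probability that k occurs in the context of w."""
    total = sum(cooc[w].values())
    return cooc[w][k] / total if total else 0.0

for k in ["solid", "gas", "water"]:
    p_ice, p_steam = p(k, "ice"), p(k, "steam")
    ratio = p_ice / p_steam if p_steam else float("inf")
    print(f"{k:>6}: P(k|ice)={p_ice:.3f}  P(k|steam)={p_steam:.3f}  ratio={ratio:.3f}")
    # solid -> large ratio, gas -> near 0, water -> near 1 (qualitatively matching the table above)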

Using Pre-trained GloVe

import numpy as np
from typing import Dict

def load_glove(path: str) -> Dict[str, np.ndarray]:
    """
    Load GloVe embeddings from a text file.

    Download from: https://nlp.stanford.edu/projects/glove/
    Common files:
    - glove.6B.50d.txt (50 dimensions, trained on Wikipedia)
    - glove.6B.300d.txt (300 dimensions)
    - glove.42B.300d.txt (trained on Common Crawl, 42B tokens)
    """
    embeddings = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.strip().split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Load embeddings (download glove.6B.zip first)
# glove = load_glove('glove.6B.100d.txt')
# print(f"Loaded {len(glove)} word vectors")

# Example with synthetic data (replace with real GloVe)
glove = {
    'king': np.random.randn(100).astype('float32'),
    'queen': np.random.randn(100).astype('float32'),
    'man': np.random.randn(100).astype('float32'),
    'woman': np.random.randn(100).astype('float32'),
}

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Calculate cosine similarity between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def find_similar(word: str, embeddings: dict, topn: int = 5):
    """Find most similar words using cosine similarity."""
    if word not in embeddings:
        return []

    word_vec = embeddings[word]
    similarities = []

    for other_word, other_vec in embeddings.items():
        if other_word != word:
            sim = cosine_similarity(word_vec, other_vec)
            similarities.append((other_word, sim))

    return sorted(similarities, key=lambda x: x[1], reverse=True)[:topn]
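
With real GloVe vectors loaded via load_glove above, the same dictionary supports the classic analogy arithmetic. This is a small sketch; the synthetic random vectors defined earlier will not give meaningful results, so the call is left commented out.

def analogy(a: str, b: str, c: str, embeddings: dict, topn: int = 3):
    """Return the words closest to vec(a) - vec(b) + vec(c)."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    scores = [
        (word, cosine_similarity(target, vec))
        for word, vec in embeddings.items()
        if word not in (a, b, c)
    ]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

# With real GloVe vectors, 'queen' should appear near the top:
# print(analogy('king', 'man', 'woman', glove))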

FastText: Subword Embeddings

FastText (Facebook, 2016) improves on Word2Vec by representing words as bags of character n-grams.

Key Innovation

The word "where" with n=3 becomes:

<wh, whe, her, ere, re>, <where>

The word embedding is the sum of all n-gram embeddings.
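
A quick sketch of how those n-grams are produced (illustrative only; gensim's FastText computes and hashes these internally):

def char_ngrams(word: str, min_n: int = 3, max_n: int = 3):
    """Character n-grams with word boundary markers, plus the whole word itself."""
    padded = f"<{word}>"
    grams = [padded[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(padded) - n + 1)]
    return grams + [padded]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']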

Benefits

  1. OOV handling: Can generate vectors for unseen words
  2. Morphology: Captures prefixes, suffixes, roots
  3. Rare words: Better representations through subword sharing

The example below trains a small FastText model with gensim, then requests a vector for a word that never appeared in the training corpus:

from gensim.models import FastText
from gensim.utils import simple_preprocess

# Training corpus
corpus = [
    "Machine learning algorithms learn patterns from data",
    "Deep learning is a subset of machine learning",
    "Neural networks process information in layers",
    "Convolutional neural networks excel at image recognition",
    "Recurrent neural networks handle sequential data",
    "Transformers revolutionized natural language processing",
    "Pre-training and fine-tuning are common strategies",
    "Embeddings represent words as dense vectors",
]

tokenized = [simple_preprocess(doc) for doc in corpus]

# Train FastText model
model = FastText(
    sentences=tokenized,
    vector_size=100,
    window=5,
    min_count=1,
    min_n=2,          # Minimum n-gram length
    max_n=5,          # Maximum n-gram length
    epochs=50,
)

print("FastText trained successfully!")
print(f"Vocabulary: {list(model.wv.key_to_index.keys())}")

# Key advantage: Get vectors for OOV words!
oov_word = "deeplearning"  # Not in vocabulary
try:
    oov_vector = model.wv[oov_word]
    print(f"\nVector for OOV word '{oov_word}':")
    print(f"  Shape: {oov_vector.shape}")
    print(f"  First 5 dims: {oov_vector[:5]}")
except KeyError as e:
    print(f"Word not found: {e}")

# Compare: Word2Vec would fail on OOV words
# word2vec_model.wv["deeplearning"]  # KeyError!

The Context Problem

All word embedding methods share a fundamental limitation: one embedding per word, regardless of context.

# "bank" has the same vector in both sentences:
sentence1 = "I deposited money in the bank"  # Financial institution
sentence2 = "I sat by the river bank"         # Edge of river

# This is a problem for polysemous words!

This motivated contextualized embeddings like ELMo, BERT, and ultimately Sentence Transformers.

Sentence Transformers (SBERT)

Sentence-BERT (Reimers & Gurevych, 2019) produces sentence-level embeddings that can be compared with cosine similarity. It's built on BERT but optimized for generating fixed-size sentence representations.

Why Not Just Use BERT Directly?

BERT wasn't designed for sentence similarity. Using BERT naively requires:

  1. Passing both sentences through BERT together
  2. O(n²) pairwise comparisons for n sentences (10,000 sentences ≈ 50 million forward passes!)

SBERT generates independent embeddings: O(n) forward passes, then fast cosine similarity.

Complete Working Example

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load pre-trained model
print("Loading Sentence Transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')

# Model info
print(f"Model: all-MiniLM-L6-v2")
print(f"Max sequence length: {model.max_seq_length}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

# Encode sentences
sentences = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "AI systems can learn from experience",
    "The weather is nice today",
    "I enjoy hiking in the mountains",
    "Neural networks are inspired by the human brain",
]

print("\nEncoding sentences...")
embeddings = model.encode(sentences, show_progress_bar=True)
print(f"Embeddings shape: {embeddings.shape}")  # (6, 384)

# Calculate pairwise similarities
print("\n=== Pairwise Similarity Matrix ===")
similarity_matrix = cosine_similarity(embeddings)

for i, sent1 in enumerate(sentences):
    print(f"\n'{sent1[:50]}...'")
    for j, sent2 in enumerate(sentences):
        if i != j:
            print(f"  vs '{sent2[:40]}...': {similarity_matrix[i][j]:.3f}")

Semantic Search Implementation

from sentence_transformers import SentenceTransformer, util
import torch

class SemanticSearch:
    """Production-ready semantic search with sentence transformers."""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.corpus = []
        self.corpus_embeddings = None

    def index(self, documents: list[str]):
        """Index a corpus of documents."""
        self.corpus = documents
        print(f"Encoding {len(documents)} documents...")
        self.corpus_embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents")

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        """Search the corpus for relevant documents."""
        if self.corpus_embeddings is None:
            raise ValueError("No documents indexed. Call index() first.")

        # Encode query
        query_embedding = self.model.encode(query, convert_to_tensor=True)

        # Calculate similarities
        scores = util.cos_sim(query_embedding, self.corpus_embeddings)[0]

        # Get top results
        top_results = torch.topk(scores, k=min(top_k, len(self.corpus)))

        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                'document': self.corpus[idx],
                'score': score.item(),
                'index': idx.item()
            })

        return results

# Usage
search_engine = SemanticSearch()

# Index documents
documents = [
    "Python is a high-level programming language known for its simplicity",
    "JavaScript is essential for web development and runs in browsers",
    "Machine learning models can predict outcomes from data",
    "Docker containers package applications with their dependencies",
    "REST APIs enable communication between web services",
    "Neural networks are the foundation of deep learning",
    "Git is a version control system for tracking code changes",
    "Kubernetes orchestrates containerized applications at scale",
    "SQL databases store data in structured tables",
    "NoSQL databases offer flexible schema designs",
]

search_engine.index(documents)

# Search
queries = [
    "How do I learn AI?",
    "What's the best language for beginners?",
    "How to deploy applications?",
]

for query in queries:
    print(f"\n🔍 Query: '{query}'")
    results = search_engine.search(query, top_k=3)
    for i, result in enumerate(results, 1):
        print(f"  {i}. [{result['score']:.3f}] {result['document'][:60]}...")

Choosing the Right Model

Model                                    Dimensions   Speed     Quality   Use Case
all-MiniLM-L6-v2                         384          ⚡ Fast    Good      General purpose, production
all-mpnet-base-v2                        768          Medium    Best      When quality matters most
paraphrase-multilingual-MiniLM-L12-v2    384          Fast      Good      50+ languages
all-MiniLM-L12-v2                        384          Fast      Better    Balance of speed/quality

# Compare models
models_to_compare = [
    'all-MiniLM-L6-v2',
    'all-mpnet-base-v2',
]

test_sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
]

for model_name in models_to_compare:
    model = SentenceTransformer(model_name)
    embeddings = model.encode(test_sentences)
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    print(f"{model_name}: similarity = {similarity:.4f}")

Fine-tuning for Your Domain

For domain-specific applications, fine-tuning dramatically improves performance:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Prepare training data
# Option 1: Pairs with similarity scores (0-1)
train_examples = [
    InputExample(texts=["Neural networks learn patterns", "Deep learning finds patterns in data"], label=0.9),
    InputExample(texts=["Python is a programming language", "The weather is sunny"], label=0.1),
    InputExample(texts=["Machine learning predicts outcomes", "ML models make predictions"], label=0.95),
    InputExample(texts=["Transformers use attention", "Attention mechanisms in transformers"], label=0.85),
    # Add more examples... (typically need 1000+)
]

# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Define loss function
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    warmup_steps=100,
    output_path='fine-tuned-model',
    show_progress_bar=True,
)

# Load fine-tuned model
fine_tuned_model = SentenceTransformer('fine-tuned-model')
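
As a quick sanity check after fine-tuning, you can compare the base and fine-tuned models on a held-out pair. The pair below is made up; use sentences from your own domain and a proper evaluation set for real decisions.

from sentence_transformers import util

test_pair = ["Neural networks learn representations", "Deep models learn features from data"]

for name, m in [("base", SentenceTransformer('all-MiniLM-L6-v2')), ("fine-tuned", fine_tuned_model)]:
    embeddings = m.encode(test_pair, convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"{name}: similarity = {score:.4f}")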

Visualizing Embeddings

Use dimensionality reduction to visualize embeddings in 2D:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer

# Prepare sentences from different categories
sentences = {
    'programming': [
        "Python is a programming language",
        "JavaScript runs in the browser",
        "Java is used for enterprise applications",
        "C++ offers low-level control",
    ],
    'animals': [
        "Dogs are loyal pets",
        "Cats are independent animals",
        "Birds can fly through the sky",
        "Fish live underwater",
    ],
    'food': [
        "Pizza is a popular Italian dish",
        "Sushi is Japanese cuisine",
        "Tacos are Mexican food",
        "Pasta comes in many shapes",
    ],
}

# Flatten and track categories
all_sentences = []
categories = []
for category, sents in sentences.items():
    all_sentences.extend(sents)
    categories.extend([category] * len(sents))

# Get embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(all_sentences)

# Reduce to 2D with t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
embeddings_2d = tsne.fit_transform(embeddings)

# Plot
plt.figure(figsize=(10, 8))
colors = {'programming': 'blue', 'animals': 'green', 'food': 'red'}

for i, (x, y) in enumerate(embeddings_2d):
    category = categories[i]
    plt.scatter(x, y, c=colors[category], s=100, alpha=0.7)
    plt.annotate(all_sentences[i][:20] + '...', (x, y), fontsize=8)

plt.legend(handles=[
    plt.scatter([], [], c=color, label=cat)
    for cat, color in colors.items()
])
plt.title('Sentence Embeddings Visualization (t-SNE)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.tight_layout()
plt.savefig('embeddings_visualization.png', dpi=150)
plt.show()

Practical Tips and Best Practices

1. Always Normalize for Cosine Similarity

from sklearn.preprocessing import normalize

embeddings = model.encode(sentences)
normalized_embeddings = normalize(embeddings)  # L2 normalization

# Now dot product = cosine similarity
similarity = np.dot(normalized_embeddings[0], normalized_embeddings[1])

2. Batch Processing for Efficiency

# Bad: One at a time
embeddings = [model.encode(s) for s in sentences]  # Slow!

# Good: Batch encode
embeddings = model.encode(sentences, batch_size=32)  # Fast!

3. Use GPU When Available

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
print(f"Using device: {device}")

4. Truncate Long Texts

# Models have maximum sequence lengths
max_length = model.max_seq_length  # Usually 256 or 512 tokens

# For longer documents, consider:
# 1. Truncation (loses information)
# 2. Chunking and averaging (see the sketch below)
# 3. Using models designed for long text (e.g., INSTRUCTOR)
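
Here is a rough sketch of option 2, chunking and mean-pooling, assuming simple whitespace chunking with a small overlap (a simplification; token-aware chunking is more precise):

import numpy as np

def embed_long_text(model, text: str, chunk_words: int = 150, overlap: int = 30) -> np.ndarray:
    """Encode overlapping word chunks and average them into a single document vector."""
    words = text.split()
    step = chunk_words - overlap
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, max(len(words) - overlap, 1), step)]
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)

# doc_vector = embed_long_text(model, long_document)
# print(doc_vector.shape)  # e.g. (384,) for all-MiniLM-L6-v2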

Summary Comparison

Method      Year   Strengths                     Weaknesses
Word2Vec    2013   Fast, interpretable           No context, OOV issues
GloVe       2014   Global statistics             No context, OOV issues
FastText    2016   Handles OOV, morphology       No context
SBERT       2019   Context-aware, high quality   Slower, needs GPU

Conclusion

Text embeddings have evolved dramatically:

  • Word2Vec/GloVe: Pioneered dense word representations
  • FastText: Added subword information for OOV handling
  • Sentence Transformers: State-of-the-art semantic similarity

For most modern NLP tasks, start with Sentence Transformers (specifically all-MiniLM-L6-v2 for speed or all-mpnet-base-v2 for quality). Only use word embeddings for specific use cases like word analogy tasks or when computational resources are extremely limited.

References

  • Mikolov et al. "Efficient Estimation of Word Representations in Vector Space" (2013)
  • Pennington et al. "GloVe: Global Vectors for Word Representation" (2014)
  • Bojanowski et al. "Enriching Word Vectors with Subword Information" (2017)
  • Reimers and Gurevych "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (2019)
  • Sentence Transformers Documentation: https://www.sbert.net/